11-06, 10:50–12:20 (US/Eastern), Winter Garden
Large Language Models (LLMs) generate contextually informative responses, but they also pose risks related to harmful outputs such as violent speech, threats, explicit content, and adversarial attacks. In this tutorial, we will focus on building a robust content moderation pipeline for LLM-generated text, designed to detect and mitigate harmful outputs in real time. We will work through a hands-on project in which participants implement a content moderation system from scratch in two different ways: first, by using open-source LLMs via Ollama together with various prompt engineering techniques; second, by fine-tuning small open-source LLMs on content-moderation-specific datasets. The tutorial will also cover identifying adversarial attacks, including jailbreaks, and applying both rule-based and machine learning approaches to filter inappropriate content.
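As a taste of the rule-based side of such a pipeline, here is a minimal keyword/regex filter sketch; the category names and patterns are illustrative placeholders, not the rule set used in the tutorial.

```python
# A minimal sketch of a rule-based filter: regex blocklists per harm category.
# Categories and patterns below are illustrative placeholders only.
import re

BLOCKLIST = {
    "violence": [r"\bkill\b", r"\bhurt you\b", r"\bshoot\b"],
    "self-harm": [r"\bhurt myself\b", r"\bend my life\b"],
}

def rule_based_flags(text: str) -> dict[str, bool]:
    """Return which harm categories a piece of text triggers."""
    lowered = text.lower()
    return {
        category: any(re.search(pattern, lowered) for pattern in patterns)
        for category, patterns in BLOCKLIST.items()
    }

print(rule_based_flags("I will hurt you if you come closer."))
# {'violence': True, 'self-harm': False}
```

Rule-based filters like this are fast and transparent but brittle, which is why the tutorial pairs them with LLM-based moderation.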
This tutorial is aimed at AI engineers, researchers, and practitioners who deploy LLMs and want to implement moderation systems that prevent harmful content. A basic understanding of LLMs and NLP techniques, along with comfort in Python and PyTorch, will be helpful. A GitHub repository containing the code and datasets will be shared prior to the tutorial.
As LLMs become more powerful, ensuring that they generate responsible and safe content is a key challenge. This technical tutorial will focus on building a real-time content moderation system for LLM outputs, capable of detecting and preventing harmful content, including violent, explicit, and adversarial prompts. Throughout the session, we will develop a scalable moderation pipeline that can be applied to real-world LLM deployments. Two approaches will be used to build the content moderation pipeline: prompt engineering and fine-tuning LLMs to behave as content moderators.
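To illustrate the first approach, here is a minimal sketch of prompting a local open-source model served by Ollama to act as a moderator over the four harm categories covered in the outline below; the model name (llama3.1), prompt wording, and output format are assumptions for the example, not the tutorial's exact setup.

```python
# A minimal sketch: prompting a local open-source model (served by Ollama)
# to act as a content moderator. Assumes the `ollama` Python package is
# installed and `ollama pull llama3.1` has been run; model name and prompt
# wording are illustrative choices.
import ollama

MODERATION_PROMPT = """You are a content moderation assistant.
Classify the following message into exactly one of these categories:
hate & fairness, violence, sexual, self-harm, or safe.
Respond with the category name only.

Message: {message}"""

def moderate(message: str, model: str = "llama3.1") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": MODERATION_PROMPT.format(message=message)}],
    )
    return response["message"]["content"].strip().lower()

print(moderate("I am going to hurt you."))  # expected to print something like "violence"
```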
Outline and Time Breakdown:
- Minutes 0-15: Responsible AI and content moderation
  - Overview of AI safety and content moderation challenges in GenAI
  - Framework of content moderation pipelines across different modalities (text, image, audio)
- Minutes 15-30: Understanding LLM vulnerabilities and adversarial attacks
  - Detailed discussion of adversarial prompts, jailbreaking, and harmful content generation for evaluation
  - Case studies of how LLMs can be jailbroken and adversarially attacked in each modality (text, image, audio)
- Minutes 30-60: Designing the moderation pipeline using LLMs (hands-on)
  - Prompt engineering LLMs to act as content moderators for human-AI conversations, applying various techniques across four main harm categories: hate & fairness, violence, sexual, and self-harm (see the prompt sketch above)
- Minutes 60-65: Break
- Minutes 65-90: Fine-tuning and evaluating moderation LLMs
  - Fine-tuning open-source LLMs for harmful content, jailbreak, and adversarial attack scenarios in specific domains and use cases (a minimal fine-tuning sketch follows this outline)
  - Extensive evaluation of content moderation LLM models
  - Brief discussion of practices for deploying content moderation systems in production environments
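To make the second approach concrete, below is a minimal LoRA fine-tuning sketch using Hugging Face transformers, datasets, and peft. The base model (Qwen/Qwen2.5-0.5B-Instruct), the prompt format, the target modules, and the two toy examples are illustrative assumptions rather than the tutorial's actual configuration; a real run would use a moderation-specific dataset with a proper train/validation split.

```python
# A minimal LoRA fine-tuning sketch for a small open-source moderation model.
# Model name, prompt format, and the toy examples are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed small open model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Attach LoRA adapters so only a small fraction of parameters is trained.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Toy moderation examples in a simple instruction format.
examples = [
    {"text": "Message: I will hurt you.\nCategory: violence"},
    {"text": "Message: What a lovely day!\nCategory: safe"},
]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=128)

train_dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="moderation-lora", per_device_train_batch_size=1,
                           num_train_epochs=1, logging_steps=1, report_to="none"),
    train_dataset=train_dataset,
    # Causal LM collator pads the batch and builds labels from the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("moderation-lora")  # saves the LoRA adapter weights
```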
Takeaways for the audience:
Participants will leave with a content moderation pipeline that can be used to filter harmful or adversarial content generated by LLMs. They will also gain insight into advanced techniques for adversarial and jailbreak detection, learn how to evaluate LLM-generated responses, and learn how to deploy moderation models effectively.
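As a small illustration of that evaluation step, the sketch below scores predicted moderation labels against gold labels with standard classification metrics; the labels shown are toy placeholders rather than the tutorial's evaluation set.

```python
# A minimal sketch of evaluating a moderation model against labelled data.
from sklearn.metrics import classification_report

gold        = ["violence", "safe", "self-harm", "safe", "sexual"]
predictions = ["violence", "safe", "safe",      "safe", "sexual"]

# Per-category precision, recall, and F1 for the moderation labels.
print(classification_report(gold, predictions, zero_division=0))
```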
Requirements:
- Basic understanding of machine learning, NLP and LLMs
- Familiarity with Python and the PyTorch framework
No other previous knowledge is expected.
Aziza is an Applied Scientist at Oracle working on Generative and Responsible AI, with more than three years of experience in ML/NLP technologies. Previously, she worked on LLM evaluation and content moderation for AI safety on Microsoft’s Responsible & OpenAI research team. She holds a master’s degree in Artificial Intelligence from Northwestern University. During her time at Northwestern, she worked as an ML Research Associate at the Technological Innovations for Inclusive Learning and Teaching lab (tiilt), building a multimodal conversation analysis application called Blinc. She was a Data Science for Social Good Fellow at the University of Washington’s eScience Institute during the summer of 2022. Aziza is interested in developing machine learning and Generative AI tools and systems to solve complex, social-impact-driven problems. Once she is done coding, she is either training for her next marathon or hiking somewhere around the PNW.