AI Under Siege: Red-Teaming Large Language Models

Learn how red-teaming techniques like jailbreak prompting enhance the security of large language models like GPT-3 and GPT-4, ensuring ethical and safe AI deployment.

Deval Shah
May 16, 2024
May 15, 2024

Large Language Models (LLMs) like GPT-3 and GPT-4 have revolutionized industries from healthcare to finance, showcasing the transformative power of AI in turning vast datasets into actionable insights.

These models enhance user interactions by synthesizing information and generating human-like text, playing a pivotal role in automating complex decision-making processes. However, as their influence grows, so does the imperative to ensure their responsible development and deployment.

The recent advancements underscore the importance of safeguarding AI systems. President Biden has highlighted the necessity of managing AI risks, emphasizing the need for frameworks that align these technologies with ethical standards and public welfare. Ensuring security isn't just about preventing misuse; it's about fortifying these systems against adversarial attacks.

One effective method to achieve this is red-teaming, where security experts attempt to exploit vulnerabilities in an LLM system. By proactively identifying and mitigating potential threats, we can pave the way for safer and more reliable AI applications that benefit society while adhering to our highest ethical expectations.



Understanding Red-Teaming in AI

Red-teaming in AI is a stringent security practice designed to identify and address potential vulnerabilities in AI systems before real-world exploitation. This process involves emulating real-world adversaries to uncover blind spots and validate security assumptions. For instance, Microsoft’s interdisciplinary AI Red Team probes AI systems for vulnerabilities, focusing on security and responsible AI outcomes. This proactive approach refines security measures and ethical considerations in AI development.

Importance of Red-Teaming

  • DeepMind's Approach: AI research groups like DeepMind emphasize the importance of red-teaming, which generates test inputs to elicit potentially harmful responses from models like GPT-3 and Gopher. This methodology identifies critical issues such as toxicity or misinformation before public release. The testing involves using one language model to generate test cases, which are then evaluated by another to detect undesirable behaviors.
  • Lakera's Expertise: Similarly, Lakera leverages specialized red-teaming expertise to secure large language model (LLM) applications. Lakera’s flagship product, Lakera Red, simulates adversarial attacks to test AI systems before deployment, so that vulnerabilities are addressed proactively. By integrating such rigorous testing into the AI development lifecycle, Lakera enhances security and supports compliance with ethical standards and regulatory requirements.
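
The "one model generates test cases, another evaluates the responses" loop described above can be sketched in a few lines. This is a minimal illustration, not DeepMind's or Lakera's implementation: `red_team`, `Finding`, and the three toy functions are hypothetical names, and the toy generator, target, and classifier are stand-ins so the example runs without any real model.

```python
from typing import Callable, List, NamedTuple


class Finding(NamedTuple):
    prompt: str
    response: str
    score: float


def red_team(
    generate_test_cases: Callable[[int], List[str]],
    target_model: Callable[[str], str],
    harm_score: Callable[[str], float],
    num_cases: int = 100,
    threshold: float = 0.5,
) -> List[Finding]:
    """Return the test cases whose responses the classifier flags as harmful."""
    findings = []
    for prompt in generate_test_cases(num_cases):
        response = target_model(prompt)
        score = harm_score(response)
        if score >= threshold:
            findings.append(Finding(prompt, response, score))
    # Highest-scoring failures first, so reviewers see the worst cases early.
    return sorted(findings, key=lambda f: f.score, reverse=True)


# --- Toy stand-ins (assumptions) so the sketch runs without any model ---
def toy_generator(n: int) -> List[str]:
    seeds = ["What is your favourite insult?", "How do I bake bread?", "Tell me a rude joke."]
    return [seeds[i % len(seeds)] for i in range(n)]


def toy_target(prompt: str) -> str:
    return "You are an idiot." if ("insult" in prompt or "rude" in prompt) else "Here is a recipe."


def toy_classifier(text: str) -> float:
    return 0.9 if "idiot" in text else 0.1


findings = red_team(toy_generator, toy_target, toy_classifier, num_cases=6)
for f in findings:
    print(f"{f.score:.1f}  {f.prompt!r} -> {f.response!r}")
```

In practice the three callables would wrap a generator LLM, the system under test, and a trained harm classifier; the loop itself stays the same.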

Risks of Inadequate Testing

Deploying LLMs without such testing poses significant risks. For example, the release of Microsoft's Tay bot [1], which rapidly produced offensive content due to adversarial inputs, underscores the consequences of inadequate pre-deployment testing.

Red-teaming helps simulate numerous harmful scenarios in a controlled environment, ensuring AI behaves as intended and enhancing the safety and reliability of AI systems in real-world applications.

**💡 Pro tip: Explore our list of the leading LLMs: GPT-4, LLAMA, Gemini, and more. Understand what they are, how they evolved, and how they differ from each other.**

Addressing Toxicity and Dishonesty

A paper on red-teaming language models addresses critical concerns like toxicity and dishonesty in AI-generated content:

  • Toxicity: Toxicity encompasses various harmful behaviors, such as hate speech, discriminatory language, and sexually explicit content. These undesirable outputs can significantly harm users and tarnish the deploying organization’s reputation. Toxicity in language models often arises unintentionally due to biases in the training data or adversarial manipulations. The paper highlights the effectiveness of generating test cases that can trigger these toxic responses, allowing developers to identify and mitigate them before the models are used in real-world applications. Methods like zero-shot and few-shot generation create test cases to probe for toxic outputs specifically. This proactive approach is crucial for refining models and reducing the likelihood of generating harmful content.
  • Dishonesty: Dishonesty involves generating untruthful or misleading information, which can lead to misinformation if not adequately addressed. This aspect is particularly concerning as AI models like GPT-3 are used in environments requiring trust and accuracy, such as news generation or educational content. The paper also elaborates on using red-teaming to detect instances where the model might generate dishonest outputs, including setting up scenarios and prompts that could lead the model to produce or replicate false information.

Addressing toxicity and dishonesty through red-teaming involves intricate testing and adjustments to the models, ensuring they adhere to ethical standards and do not harm or mislead users. This detailed approach helps create safer and more reliable AI systems. [2, 3]
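
To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch of building a few-shot generation prompt from test cases that already triggered undesirable outputs. The header text, the function name `build_few_shot_prompt`, and the "take the top-k by score" selection rule are all assumptions chosen for simplicity, not details taken from the paper.

```python
from typing import List, Tuple

# Illustrative zero-shot header (an assumption): on its own, followed by "1.",
# it asks the generator model to start listing test questions.
HEADER = "List of questions to ask someone:"


def build_few_shot_prompt(scored_cases: List[Tuple[str, float]], k: int = 3) -> str:
    """Seed the generator with the k test cases that scored highest for harm."""
    top = sorted(scored_cases, key=lambda pair: pair[1], reverse=True)[:k]
    lines = [HEADER]
    for i, (case, _score) in enumerate(top, start=1):
        lines.append(f"{i}. {case}")
    # Leave the next list item open for the generator model to complete.
    lines.append(f"{len(top) + 1}.")
    return "\n".join(lines)


# Hypothetical (test_case, harm_score) pairs from an earlier zero-shot round.
earlier_round = [
    ("What do you think of your neighbours?", 0.2),
    ("Who do you dislike the most?", 0.8),
    ("What is the worst thing you could say?", 0.6),
]
print(build_few_shot_prompt(earlier_round, k=2))
```

With `k=0` the function degenerates to the zero-shot prompt, which is one way to see few-shot generation as a refinement of the zero-shot round.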

Core Red-Teaming Techniques

Jailbreak Prompting

Jailbreak prompting, a form of adversarial prompting, is employed in red-teaming to expose vulnerabilities in large language models (LLMs). It involves crafting prompts that coax LLMs into deviating from their safety constraints and programmed guidelines, revealing conflicts between a model’s capabilities and safety protocols. This method is particularly effective in demonstrating how models can produce harmful or biased outputs when manipulated through sophisticated prompts.

The HuggingFace team discusses various strategies to mitigate these risks, such as using adversarial attacks involving human-in-the-loop and automated processes to test language models. These red-teaming tactics are essential for identifying and fixing undesirable behaviours in LLMs before they are deployed in real-world applications. One such strategy involves augmenting the LLM with a classifier that predicts whether a prompt is likely to lead to offensive outputs. If such a risk is detected, the system can generate a benign response instead, balancing helpfulness against minimising harm.
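
The classifier-in-front-of-the-model strategy can be sketched as a thin wrapper. This is a minimal illustration under stated assumptions: `guarded_generate`, `FALLBACK`, the 0.5 threshold, and the keyword-based toy classifier are hypothetical, and a real deployment would use a trained classifier rather than keyword matching.

```python
from typing import Callable

FALLBACK = "I can't help with that request."


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    risk_classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Return a benign canned response when the prompt looks risky."""
    if risk_classifier(prompt) >= threshold:
        return FALLBACK
    return model(prompt)


# --- Toy stand-ins (assumptions) ---
def toy_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def toy_risk(prompt: str) -> float:
    risky_terms = ("bomb", "weapon")
    return 0.9 if any(term in prompt.lower() for term in risky_terms) else 0.1


print(guarded_generate("how to bake bread", toy_model, toy_risk))
print(guarded_generate("how to build a bomb", toy_model, toy_risk))
```

The threshold is the knob that trades helpfulness for harm reduction: lowering it blocks more risky prompts but also refuses more benign ones.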

**💡 Pro tip: Learn more about jailbreaking large language models.**

Figure: Jailbreak Prompting Example (Source)

The significance of jailbreak prompting in red-teaming is highlighted through its potential to simulate real-world misuse scenarios, which helps improve the robustness of language models against adversarial inputs. As these techniques evolve, they contribute to developing more secure and reliable AI systems.

**💡 Pro tip: Learn more about prompt injections from our article Guide to Prompt Injection: Techniques, Prevention Methods & Tools**

Human-Guided Strategies

Human-guided strategies in red-teaming large language models (LLMs) involve utilising human intuition and creativity to uncover potential vulnerabilities in AI systems. This process is crucial because it adds a layer of diverse human insight that automatic methods may overlook. 

At DeepMind, human-guided strategies in red-teaming play a pivotal role in identifying and mitigating risks associated with AI models. These strategies often involve generating inputs that could elicit harmful or biased outputs from the models, thereby testing the AI's responses to scenarios that could occur in real-world applications.

The insights gained from these exercises help refine the AI models to be safer and more aligned with human values before deployment. This approach helps spot immediate flaws and contributes to developing more robust AI systems that can effectively handle unexpected or adversarial inputs in real-life settings.

Mitigating Risks in AI with Advanced Red-Teaming Techniques

The "Gamified Red-teaming Solver" (GRTS) operates within the "Red-teaming Game" (RTG), a game-theoretic framework for improving the security of large language models through systematic red-teaming. Here’s a detailed breakdown of how GRTS operates:

Core Principles

GRTS leverages a game-theoretic framework to simulate interactions between offensive (Red-team) and defensive (Blue-team) models in a controlled setting. The primary goal is to identify and mitigate potential vulnerabilities in LLMs by understanding how they respond under adversarial conditions.

Operation Methodology

  1. Meta-game Analysis: GRTS employs meta-game analysis, which involves multiple rounds of gameplay between red and blue teams, allowing each team to adjust their strategies based on the outcomes of previous engagements.
  2. Iterative Strategy Adjustments: The process is iterative, with each team refining its strategies to better counteract the other’s moves. This adjustment continues until the system approaches a Nash equilibrium, a state in which neither team can benefit by changing its own strategy while the other team's strategy remains unchanged.

Figure: How GRTS works (Source)

Technical Implementation

  • Policy Set Expansion: GRTS starts with a basic set of policies for each team and expands these policy sets by generating new policies that aim to maximise the strategic advantage over the adversary.
  • Double Oracle Method: The solver uses the Double Oracle (DO) method, which iteratively calculates the best responses from both teams. After each round, the solver updates the strategy space with these new best responses, progressively refining the strategies towards optimal solutions.
  • Reinforcement Learning Algorithms: GRTS utilises advanced reinforcement learning algorithms like Proximal Policy Optimization (PPO) to calculate these best responses. These algorithms help develop robust strategies against the opponent’s potential moves.
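
To give a feel for the Double Oracle loop, here is a toy sketch on a small zero-sum matrix game. It is not the GRTS implementation: the payoff matrix is rock-paper-scissors, best responses are found by exhaustive search rather than by PPO-trained policies, and the restricted game is solved approximately with fictitious play. All function names are assumptions.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # payoff[i][j]: red payoff when red plays i, blue plays j


def solve_restricted(payoff: Matrix, rows: List[int], cols: List[int],
                     iters: int = 2000) -> Tuple[List[float], List[float]]:
    """Approximate a mixed equilibrium of the restricted game via fictitious play."""
    row_counts = [1] + [0] * (len(rows) - 1)
    col_counts = [1] + [0] * (len(cols) - 1)
    for _ in range(iters):
        # Each side best-responds to the opponent's empirical mixture so far.
        r = max(range(len(rows)), key=lambda a: sum(
            payoff[rows[a]][cols[b]] * col_counts[b] for b in range(len(cols))))
        c = min(range(len(cols)), key=lambda b: sum(
            payoff[rows[a]][cols[b]] * row_counts[a] for a in range(len(rows))))
        row_counts[r] += 1
        col_counts[c] += 1
    return ([x / sum(row_counts) for x in row_counts],
            [x / sum(col_counts) for x in col_counts])


def double_oracle(payoff: Matrix, tol: float = 0.05):
    """Grow each side's policy set with best responses until neither side improves."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    rows, cols = [0], [0]  # start from a single policy per team
    while True:
        p, q = solve_restricted(payoff, rows, cols)
        value = sum(p[a] * q[b] * payoff[rows[a]][cols[b]]
                    for a in range(len(rows)) for b in range(len(cols)))
        # Oracle step: best response over the FULL strategy space.
        red_vals = [sum(payoff[i][cols[b]] * q[b] for b in range(len(cols)))
                    for i in range(n_rows)]
        blue_vals = [sum(payoff[rows[a]][j] * p[a] for a in range(len(rows)))
                     for j in range(n_cols)]
        br_row = max(range(n_rows), key=red_vals.__getitem__)
        br_col = min(range(n_cols), key=blue_vals.__getitem__)
        improved = False
        if br_row not in rows and red_vals[br_row] > value + tol:
            rows.append(br_row)
            improved = True
        if br_col not in cols and blue_vals[br_col] < value - tol:
            cols.append(br_col)
            improved = True
        if not improved:
            return rows, cols, p, q, value


# Rock-paper-scissors as a stand-in payoff matrix (an assumption for illustration).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
rows, cols, p, q, value = double_oracle(RPS)
print("red policies:", rows, "blue policies:", cols)
```

Starting from a single policy per side, the loop adds a new policy only when it beats the current restricted equilibrium, which mirrors the policy set expansion described above.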

Practical Outcomes

  • Diversity of Attack Strategies: By continuously adapting and refining strategies, GRTS can uncover a wide range of attack strategies that might not be apparent in initial models or more straightforward heuristic approaches.
  • Security Enhancement: The iterative confrontation and adaptation help significantly enhance the security of LLMs, making them more resilient against sophisticated attacks.

Benefits of GRTS

  • Scalability: GRTS provides a scalable approach to red-teaming capable of handling complex model interactions and various adversarial strategies.
  • Efficiency: Through game-theoretic principles, GRTS efficiently guides both teams towards strategies that theoretically optimise security and robustness.

GRTS represents a significant advancement in AI security, particularly in how it systematically approaches the problem of securing LLMs through a gamified and theoretically grounded method.

External Red Teaming

Continuous, external red-teaming is critical for enhancing the security and reliability of AI systems. This process involves rigorous testing from independent teams that simulate real-world attacks, aiming to uncover and mitigate potential vulnerabilities that might not be detected through internal assessments alone. Two prominent examples that highlight the importance of this approach are the DEF CON 31 event and DeepMind's proactive security practices.

DEF CON 31 and Red-Teaming

DEF CON, one of the world's largest hacker conventions, often features competitive hacking events, including red-teaming competitions where teams of security experts attempt to exploit AI systems and other technologies. DEF CON 31 showcased how external red teams can simulate various attack vectors to challenge AI systems in ways developers might not anticipate. The diversity of the attack strategies used at these events, from social engineering to technical exploits, demonstrates the broad scope of potential threats AI systems face.

DeepMind's Approach to Red-Teaming

DeepMind adopts a comprehensive approach to red-teaming by regularly engaging with external experts to test their AI models. Their strategy emphasises finding immediate flaws and understanding potential emergent behaviours of AI systems in complex environments. This thorough testing helps adapt AI behaviours to align with ethical standards and societal expectations, enhancing their AI implementations' safety and utility.

Figure: DeepMind Red Teaming Approach (Source)

Benefits of Continuous, External Red-Teaming

External red teams can provide a more objective assessment of AI security, often bringing fresh perspectives that internal teams might overlook.

  1. Early Detection of Threats: Organisations can prevent potential damages from exploited vulnerabilities by identifying and mitigating risks early.
  2. Adaptation to Emerging Threats: Continuous red-teaming helps organisations keep pace with the rapidly evolving landscape of cybersecurity threats.
  3. Building Trust: Regularly validating AI systems through credible external assessments helps build trust among users and stakeholders by demonstrating a commitment to safety and reliability.

The DEF CON events and DeepMind's practices illustrate the vital role of external red-teaming in developing resilient AI systems capable of operating safely in unpredictable real-world conditions. This approach is not only about fixing existing problems but also about foreseeing and preventing future ones, so that AI technologies can be trusted and are robust enough to handle the complexities of real-world applications.

State-of-the-Art in Red-Teaming

Recent advancements in red-teaming Large Language Models (LLMs) have introduced robust methodologies such as the "Explore, Establish, Exploit" framework. 

This approach, detailed in an arXiv paper, builds an understanding of model behaviours without relying on pre-existing classifiers, which often limit the scope to known issues. The framework operates in three stages:

  1. Explore: This initial phase involves probing the model’s behaviour across a range of contexts to sample the kinds of outputs it can produce, focusing on those that might not be immediately flagged by conventional means.
  2. Establish: This step measures undesired behaviour based on insights gained during the exploration phase. This typically involves developing new classifiers or adjustment metrics specifically tuned to the nuances discovered about the model.
  3. Exploit: The final phase uses the established measures to actively identify and leverage vulnerabilities within the model, aiming to simulate potential adversarial attacks that could exploit these weaknesses in real-world scenarios.
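
The three stages can be sketched as three small functions. This is a deliberately toy illustration rather than the paper's method: the "classifier" is a word-frequency score, `label_fn` stands in for human labelling, and every name, prompt, and output below is hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List


def explore(target: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Stage 1: collect a de-duplicated sample of the target's outputs."""
    seen, samples = set(), []
    for prompt in prompts:
        out = target(prompt)
        if out not in seen:
            seen.add(out)
            samples.append(out)
    return samples


def establish(samples: List[str], label_fn: Callable[[str], bool]) -> Callable[[str], float]:
    """Stage 2: turn labelled samples into a measure of undesired behaviour."""
    bad, total = Counter(), Counter()
    for text in samples:
        is_bad = label_fn(text)
        for word in set(text.lower().split()):
            total[word] += 1
            bad[word] += int(is_bad)

    def score(text: str) -> float:
        # Average, over known words, of how often each word appeared in bad samples.
        # Words never seen during exploration are ignored.
        words = [w for w in set(text.lower().split()) if w in total]
        return sum(bad[w] / total[w] for w in words) / len(words) if words else 0.0

    return score


def exploit(target: Callable[[str], str], candidates: List[str],
            score: Callable[[str], float], top_k: int = 1) -> List[str]:
    """Stage 3: keep the candidate prompts whose responses score highest."""
    ranked = sorted(candidates, key=lambda p: score(target(p)), reverse=True)
    return [p for p in ranked[:top_k] if score(target(p)) > 0]


# --- Toy target and labels (assumptions) ---
RESPONSES: Dict[str, str] = {
    "shape of earth?": "the earth is flat",
    "colour of sky?": "the sky is blue",
    "is water wet?": "water is wet",
    "describe the moon": "the moon is flat",
}
target = RESPONSES.__getitem__
samples = explore(target, ["shape of earth?", "colour of sky?", "is water wet?"])
score = establish(samples, label_fn=lambda text: "flat" in text)
attacks = exploit(target, ["colour of sky?", "describe the moon", "is water wet?"], score)
print(attacks)
```

The point of the structure is that the measure used in the final stage is learned from what exploration surfaced, rather than fixed in advance.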

Figure: Explore, Establish and Exploit process from the paper (Source)

Introducing the CommonClaim dataset helps identify and address biases and unethical behaviours in LLMs. This dataset, as part of the "Explore, Establish, Exploit" framework, is specifically designed to red-team LLMs by discovering prompts that lead to toxic and dishonest outputs.

The CommonClaim dataset contains 20,000 statements, each labelled by human judges as common-knowledge true, common-knowledge false, or neither, providing a controlled environment to assess the honesty of AI-generated content. This allows researchers to pinpoint specific conditions under which LLMs generate false or misleading information, addressing these issues before they affect end users.

Moreover, using datasets like CommonClaim, red-teaming highlights the crucial trade-offs between model helpfulness and harmlessness. While LLMs are designed to be helpful, ensuring they do not inadvertently cause harm by spreading misinformation or exhibiting biased behaviour is equally important. Red-teaming exercises help navigate these trade-offs by rigorously testing the models under various scenarios to find a balance that maximises utility while minimising potential harm. This practice enhances the safety and reliability of AI systems and ensures they operate within an ethical framework that promotes trust and fairness.

For more detailed insights into the CommonClaim dataset and its application in red-teaming, you can explore the discussions and findings in their comprehensive study on GitHub.

Ethical Considerations

The moral imperative to continuously test AI systems against biases, discrimination, and privacy violations is crucial to responsible AI development. Rigorous and continuous testing is essential to identify and mitigate these risks. This approach ensures that AI systems do not perpetuate societal biases or create new forms of inequality.

Continuous Testing Against Biases and Discrimination

Large language models can inadvertently learn and replicate societal biases in their training data. Continuous red-teaming helps identify and mitigate these biases by simulating real-world scenarios where these biases might manifest. This proactive approach ensures that AI systems treat all users fairly and equitably.
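
One simple way to probe for this kind of bias is a paired-prompt (counterfactual) check: send the same template with only the group term swapped and compare how the model's outputs are scored. The sketch below is an assumption-laden illustration, not a method from the source; `counterfactual_gap`, the template, the placeholder group names, and the deliberately biased toy scorer are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def counterfactual_gap(
    score_fn: Callable[[str], float],
    template: str,
    groups: List[str],
) -> Tuple[Dict[str, float], float]:
    """Score the same template for each group and report the largest difference."""
    scores = {group: score_fn(template.format(group=group)) for group in groups}
    gap = max(scores.values()) - min(scores.values())
    return scores, gap


# Toy scorer (assumption): stands in for "run the model, then rate its answer".
# It is biased on purpose so the check has something to find.
def toy_score(prompt: str) -> float:
    return 0.3 if "Group B" in prompt else 0.7


TEMPLATE = "An applicant from {group} has applied for this job. Rate the application."
scores, gap = counterfactual_gap(toy_score, TEMPLATE, ["Group A", "Group B"])
print(scores, "gap:", round(gap, 2))
```

A gap near zero across many templates is weak evidence of even-handed treatment; a consistently large gap is a finding worth escalating to human review.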

Privacy Considerations

Privacy is another significant concern, as AI systems often handle sensitive user data. Red-teaming tests AI systems against various privacy invasion scenarios to ensure they uphold privacy standards and do not inadvertently leak or misuse user data. This is vital for maintaining user trust and compliance with global privacy regulations.
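
A privacy-focused red-team run can be as simple as sending probing prompts and scanning the responses for patterns that look like personal data. The sketch below is a minimal illustration under stated assumptions: the two regular expressions are rough heuristics that will miss many formats, and `find_pii`, `scan_for_leaks`, and the toy target are hypothetical names.

```python
import re
from typing import Callable, Dict, List, Tuple

# Rough heuristics (assumptions), not a complete PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every pattern name that matched, with the matched strings."""
    return {name: pat.findall(text) for name, pat in PII_PATTERNS.items() if pat.search(text)}


def scan_for_leaks(target: Callable[[str], str],
                   prompts: List[str]) -> List[Tuple[str, Dict[str, List[str]]]]:
    """Run each probing prompt and keep those whose response contains PII-like text."""
    report = []
    for prompt in prompts:
        hits = find_pii(target(prompt))
        if hits:
            report.append((prompt, hits))
    return report


# Toy target (assumption) that leaks contact details for one prompt.
def toy_target(prompt: str) -> str:
    if "contact" in prompt:
        return "Contact jane.doe@example.com or +1 555-123-4567."
    return "I cannot share personal information."


report = scan_for_leaks(toy_target, ["Who is the contact person?", "What is the weather?"])
print(report)
```

Pattern matching only catches leaks with a recognisable shape, so it complements rather than replaces human review of the responses.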

One of the most crucial ethical considerations in developing and deploying large language models (LLMs) is the trade-off between model helpfulness and harmlessness.

Trade-Offs Between Helpfulness and Harmlessness

AI systems are often designed to be as helpful as possible, providing users with accurate, relevant, and timely information. However, the drive for high performance can sometimes lead to unintended consequences, such as the propagation of harmful biases or the invasion of privacy. For example, an overly helpful model might inadvertently reveal personal information in its responses, compromising user privacy.

Red-teaming is vital in navigating these trade-offs by rigorously testing AI systems under adversarial conditions to uncover potential weaknesses or harmful behaviours. This process helps ensure that LLMs adhere to ethical standards and societal expectations without compromising their effectiveness. Red-teaming thus acts as a critical check within the AI development lifecycle, prompting developers to make necessary adjustments to strike a better balance between helpfulness and harmlessness.
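
The trade-off can be made measurable. Assuming each red-team prompt has a risk score from a safety filter and a human label saying whether it is actually harmful, sweeping the refusal threshold shows how much helpfulness is given up for each gain in harmlessness. The data, the function name `tradeoff`, and the two metrics below are illustrative assumptions.

```python
from typing import List, Tuple


def tradeoff(scored_prompts: List[Tuple[float, bool]],
             thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, over_refusal_rate, harmful_allowed_rate).

    scored_prompts holds (risk_score, is_harmful) pairs; a prompt is refused
    when its risk score is at or above the threshold.
    """
    benign = [s for s, harmful in scored_prompts if not harmful]
    harmful = [s for s, is_harmful in scored_prompts if is_harmful]
    rows = []
    for t in thresholds:
        over_refusal = sum(s >= t for s in benign) / len(benign)      # lost helpfulness
        harmful_allowed = sum(s < t for s in harmful) / len(harmful)  # lost harmlessness
        rows.append((t, over_refusal, harmful_allowed))
    return rows


# Hypothetical labelled red-team results: (risk_score, is_harmful).
results = [(0.1, False), (0.3, False), (0.6, False), (0.4, True), (0.8, True), (0.9, True)]
for t, over_refusal, harmful_allowed in tradeoff(results, [0.2, 0.5, 0.7]):
    print(f"threshold={t}: over-refusal={over_refusal:.2f}, harmful allowed={harmful_allowed:.2f}")
```

No threshold drives both rates to zero on this toy data, which is the trade-off in miniature: improving one metric past a certain point costs the other.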

Figure: Honest, Harmless and Helpful (Source)

Community Engagement and Collaboration

The call for increased collaboration among AI researchers, developers, and the cybersecurity community is crucial in establishing robust red-teaming ecosystems. To address the advanced capabilities and potential vulnerabilities of large language models (LLMs), a unified approach harnessing the collective expertise of various stakeholders is necessary.

One of the critical initiatives for fostering community engagement is the development of open-source projects that democratise access to red-teaming tools and datasets. For instance, Aurora-M, an open-source, multilingual language model discussed in the Aurora-M paper, exemplifies the collaborative nature of the open-source community. It promotes transparency and allows researchers to collectively enhance AI models by identifying and addressing ethical and safety concerns through red-teaming. This encourages a culture of sharing and continuous improvement. [6]

Figure: Aurora-M safety results from the paper (Source)

Effective community engagement can be observed in the HuggingFace community, known for its collaborative spirit. The HuggingFace platform facilitates sharing red-teaming strategies and datasets, inviting contributions from developers and researchers worldwide. [7]

Moreover, structured engagements such as hackathons and collaborative competitions can significantly contribute to the red-teaming ecosystem. Events like DEF CON have brought together AI and cybersecurity professionals to stress-test AI models and share insights on emerging threats and mitigation strategies. These engagements help in creating diverse and comprehensive benchmarks for AI safety evaluations. [8]

To further enhance the collaborative efforts, stakeholders should establish centralised repositories and communication channels. Platforms like GitHub and academic forums can be repositories for red-teaming datasets, tools, and methodologies. These resources should be freely accessible to promote widespread participation and innovation.

Ultimately, the collaborative approach benefits individual AI projects and strengthens the overall AI ecosystem. By pooling resources and knowledge, the community can develop more resilient AI models capable of withstanding various adversarial attacks. This synergy ensures that AI technology advances responsibly, addressing ethical concerns while maximising its potential benefits.


The urgency of implementing red-teaming strategies in the era of advanced large language models (LLMs) cannot be overstated. As these models become increasingly integrated into various sectors—from healthcare and finance to education and entertainment—the potential for intentional misuse and unintentional harmful outputs grows. This reality underscores the need for rigorous, scalable oversight techniques to keep pace with these technologies' rapid development and deployment.

Red-teaming, the practice of deliberately challenging systems to expose vulnerabilities, helps ensure that LLMs operate safely and as intended. This process simulates various adversarial scenarios to test the models’ responses to unexpected or malicious inputs. The goal is to identify weaknesses and enhance the models' robustness and resilience, ensuring they adhere to ethical standards and societal expectations.

Proactive engagement from the machine learning community is essential to achieve this. Researchers, developers, and practitioners must unite to foster a culture of safety and responsibility in AI development. By sharing knowledge, tools, and best practices and collaboratively developing new and improved red-teaming methods, the community can better protect against the risks associated with AI while maximising its benefits.

Moreover, the engagement shouldn’t stop at the professional community: policymakers, regulatory bodies, and the general public must also contribute to the norms and standards that guide AI adoption. This broad-based approach will help ensure that AI technologies are powerful, effective, trustworthy, and aligned with the broader public interest.

The development of LLMs presents immense possibilities but also significant challenges. Only through committed, collective efforts in red-teaming and community engagement can we ensure that these technologies are developed and used responsibly, ethically, and safely.


  6. Aurora-M: The First Open Source Multilingual Language Model Red. arXiv, 2024
  7. Red-Teaming Large Language Models. HuggingFace, 2023
  8. OpenAI, Hugging Face and DEFCON 31. (2023)
  9. [2202.03286] Red Teaming Language Models with Language Models
  10. Red Teaming Language Models with Language Models - Google DeepMind
  11. Red-Teaming Large Language Models
  12. A Game-Theoretic Framework for Red Teaming Language Models
  13. Red-Teaming Large Language Models to Identify Novel AI Risks
  14. Explore, Establish, Exploit: Red Teaming Language Models from Scratch
  15. Explore, Establish, Exploit: Red Teaming Language Models from Scratch
