Day Zero: Building a Superhuman AI Red Teamer From Scratch

Research

8

min read

August 28, 2025

Mateo Rojas-Carulla

To secure AI, we must first learn how to break it–red teaming is the key to understanding how AI systems based on Large Language Models (LLMs) fail when faced with real-world threats. This blog series sets the foundation for building a superhuman AI red teaming agent.

Red teaming LLMs is difficult, and human crowdsourcing remains the gold standard. Gandalf, which will make an appearance later in the series, has been a fantastic way to see the power of millions of AI red teamers in action.

But humans are limited by cost, speed, and scalability. Our goal isn’t just automation. It’s to surpass human ingenuity, crafting a system that can probe, break, and stress-test AI applications better than any human ever could–to benefit security and trust in AI for all of us.

As part of the series, we will:

Explain why LLMs introduce entirely new security challenges.
Explore the difficulties of building an effective automated red teamer, drawing insights from millions of attacks on Gandalf.
Develop a red teaming benchmark to rigorously assess red teamer capabilities.
Release the benchmark publicly.

To start, we need to define the challenge. This first post will set the stage, explaining what red teaming means in the context of LLMs and this series, and why attacking LLM applications is fundamentally different from attacking traditional software.

The novel vulnerabilities of AI applications

AI applications that leverage LLMs are not exempt from classic security issues. A misconfigured vector database can leak sensitive data. Networking vulnerabilities can expose internal APIs to unintended access. The traditional cybersecurity playbook still applies.

However, LLMs introduce an entirely new attack surface because they operate under a fundamentally different paradigm, one where data itself can function as an executable. Let’s illustrate that with two examples exploiting real-world LLM systems.

Example 1: Adversarial Search Engine Optimization (SEO) attacks

The exploit from this paper illustrates the LLM vulnerability and is summarized in the diagram above. The team created a website containing information about a fake camera.

Within the site, they embedded hidden prompt injections designed to manipulate an LLM into believing that the fake camera was superior to well-known alternatives. They then showed that, when asking Bing’s search LLM to choose the best camera among some alternatives, the LLM often surfaced the fake camera, and even ranked it above real, more prominent cameras!

How does this vulnerability work?

The LLM retrieves the website in question as candidate data to answer the user query. Note that this data is owned by a third party and the model owner has no control over it!
The LLM processes the retrieved data and, while doing that, it encounters malicious instructions left for the LLM. This makes the data become executable: the LLM just executed a piece of malware.
As a result, the LLM surfaces false information to an unsuspecting user. This is just an example – the attacker could have targeted some different behavior instead.

‍

Here is an example response from Bing LLM (Invis OptiPix is a fake camera on the author’s website):

‍

Example 2: Attacking the Gemini assistant

We red-teamed Gemini’s workspace assistant. In the example above, the attacker embeds an attack on the LLM in plain white letters in an email, invisible to the human.

The user asks the assistant to summarize the email.
The LLM reads the white text and interprets the instructions to add a phishing link to the message. These instructions are now “code” executed by the assistant. Once again, notice that the assistant has no control over third-party data such as emails.
The assistant executes the attacker’s commands and serves the malicious link.

So, how do we define this vulnerability?

Let’s have a stab at a definition:

LLMs are vulnerable because they don't strictly separate developer instructions from external inputs, allowing attackers to manipulate their behavior via data.

This creates a profound security challenge: attackers don’t need direct access to a system to exploit it. Instead, they manipulate the data fed into the LLM, turning the model itself into an attack vector.

The internet is already an adversarial environment, fueling attacks like adversarial SEO (as seen in the fake camera example) and broader data poisoning techniques. This will only get worse. For a deep dive on prompt attacks, Simon Willson’s blog has great content.

Later in the series, we’ll thoroughly define the categories of possible attacks, which will be crucial for evaluating a red teaming agent’s effectiveness across the board. We believe it is essential to be exhaustive on the categories of attacks that are possible, and the outcomes the attacker may be trying to achieve by attacking the LLM.

In this series, we will focus exclusively on the LLM vulnerability defined above. We don’t consider classical vulnerabilities unimportant. However, since the pentesting literature has covered them extensively, we focus on novel vulnerabilities where effective red teaming is still a major challenge.

Why this challenge is different from traditional cybersecurity

Traditional cybersecurity has long focused on protecting structured systems, where vulnerabilities typically stem from flawed logic, misconfigurations, or code execution bugs. Security research has developed well-established techniques for identifying and mitigating these issues.

LLM vulnerabilities are unique because they arise from the very capabilities that make these models powerful. LLMs excel at interpreting and acting on diverse inputs—multilingual text, code, JSON, and more—enabling transformative applications with a large impact in productivity. However, this flexibility makes it difficult to patch vulnerabilities without degrading model performance, often requiring significant compute to retrain. This security-utility trade-off is explored in depth in our paper, Gandalf the Red.

To make things worse, attackers don’t need direct access to the model or its components; they can exploit weaknesses simply by manipulating third-party data.

Addressing these LLM vulnerabilities is fundamentally an AI challenge because AI is the only method we have to both identify these issues in data and generate the high-quality attacks needed for effective red teaming. No other approach matches AI’s ability to detect these subtle patterns and create sophisticated adversarial content.

To better convey the difficulty of solving these challenges, it helps to look at analogies from other AI-driven fields. Identifying whether an attack is present in text or images is similar to building autonomous cars—but with the added complexity of adversarial behavior. It’s akin to building a self-driving car that must not only operate reliably across extreme conditions and rare edge cases but also withstand deliberate attempts by attackers to make it fail.

Where do we start?

Now that we have a shared understanding of the vulnerabilities in AI systems, we can begin tackling the challenge head-on. Our goal is to build an automated red teaming agent capable of outpacing human experts in identifying and exploiting weaknesses in AI applications.

In our next post, we’ll break down the core obstacles: why human ingenuity has been the gold standard, what makes automating red teaming uniquely hard, and what it will take to build an AI agent that can challenge—and ultimately exceed—human capabilities.

Ready to see what it takes to outmatch human red teamers?

Stay with us.

The Lakera team has accelerated Dropbox’s GenAI journey.

Not sure how to secure your GenAI application?
Skip the guesswork with expert-recommended policies built by Lakera’s AI security team. Apply them in seconds, fine-tune when you’re ready, and get started with real protection from day one.

Download the Guide

On this page

Text Link

Hide table of contents

Show table of contents