AI agents are moving faster than our ability to measure their security. Most benchmarks tell us how smart or safe a model is, not how it behaves when someone tries to break it.
The Backbone Breaker Benchmark (b3), built by Lakera with the UK AI Security Institute, changes that.
What’s a Backbone LLM?
In this context, the backbone is the core large language model (LLM) that powers an AI agent—it is called sequentially to generate output, perform reasoning steps, invoke tools, or decide to end the execution flow. Everything else—the tools, memory, interface, and all processing of the outputs—is built around it.
When we talk about the security of backbone LLMs, we mean testing how resilient that core model is on its own, regardless of any other vulnerabilities in the traditional software surrounding it.
AI agents operate as a chain of stateless LLM calls—each step performing reasoning, producing output, or invoking tools.
Instead of evaluating these full agent workflows end-to-end, b3 zooms in on the individual steps where the backbone LLM actually fails: the specific moments when a prompt, file, or web input triggers a malicious output. These are the pressure points attackers exploit—not the agent architecture itself, but the vulnerable LLM calls within it.
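To make that concrete, here is a minimal, illustrative sketch of a single agent step in Python. None of this is code from b3 itself; the function and field names are hypothetical. The point is simply that the backbone is a stateless call whose context mixes trusted instructions with untrusted content, such as a web page the agent just fetched.

```python
from dataclasses import dataclass, field

@dataclass
class StepContext:
    """What one backbone call sees (illustrative, not b3's actual schema)."""
    system_prompt: str            # developer-defined instructions and tool descriptions
    history: list[str] = field(default_factory=list)  # prior turns and tool results
    untrusted_input: str = ""     # e.g. a web page, email, or file the agent just read

def backbone_step(llm, ctx: StepContext) -> str:
    """One stateless call to the backbone LLM.

    This is the pressure point b3 targets: trusted instructions and
    untrusted data share the same context window, so a malicious
    document can steer the model's next output or tool call.
    """
    prompt = "\n\n".join([ctx.system_prompt, *ctx.history, ctx.untrusted_input])
    return llm(prompt)  # may be plain text, a tool call, or a decision to stop
```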
The b3 benchmark uses a new method called threat snapshots: micro-tests that capture how an LLM reacts under targeted attack. Each snapshot is paired with real adversarial data: nearly 200,000 human red-team attempts from Gandalf: Agent Breaker. Together, they form a benchmark that makes LLM security measurable, reproducible, and comparable across models and application categories. It can show, for example, how well a model resists indirect prompt injection, malicious tool calls, or data exfiltration attempts.
b3 lets us finally see which “backbones” are most resilient in a given application, and what separates strong models from those that fail under pressure. Along the way, the results revealed two striking patterns: models that reason step by step tend to be more secure, and open-weight models are closing the gap with closed systems faster than expected.
Together, these insights show why traditional safety evaluations fall short: they don’t tell you how secure an AI agent really is.
Rethinking AI Agent Security
Agentic AI has redrawn the map of risk.
Today’s agents browse the web, execute code, and call APIs. Every one of those steps turns text into logic, and every piece of text can become an attack vector.
The problem is structural: LLMs can’t tell data from instructions. Crucially, this inability stems from the same feature that makes them general and useful in the first place—their capacity to follow instructions expressed in natural language. This dual nature makes them both powerful and inherently exploitable. And when those LLM vulnerabilities mix with traditional software weaknesses, such as permission mismanagement, insecure APIs, or flawed tool integrations, new hybrid attack paths emerge.
Traditional testing frameworks miss this distinction. Safety evaluations tend to measure whether a model avoids producing harmful content, not whether it can be exploited to perform unintended actions. Full-agent simulations, meanwhile, are too complex to reproduce or scale.
The b3 benchmark focuses on the backbone directly: the point where vulnerabilities first emerge.
This “backbone-first” view disentangles the LLM’s security behavior from traditional software vulnerabilities and reframes the core question from “Can my agent survive an attack?” to “Can my backbone LLM resist one?”
That’s where real security starts.
Threat Snapshots: Making Security Measurable
Existing benchmarks evaluate AI agents as a whole: testing every component, interaction, and API call. But security doesn’t fail across the whole system at once. It fails in a single moment: the instant an LLM makes a vulnerable decision.
That’s what threat snapshots capture. Each snapshot isolates one of those moments, a freeze-frame of an agent under attack, and tests how the underlying model behaves.
Every snapshot defines three key elements:
- The agent’s state and context (what it “knows” and what tools it has access to)
- The attack vector and objective (how the threat is delivered and what it tries to achieve)
- The scoring function (how success or failure is measured)
This approach is both simpler and more powerful than full-agent simulations. By focusing only on the model’s own decisions, it removes the noise of surrounding software logic. That makes it easier to characterize, faster to run, and broad enough to cover the full range of LLM-specific vulnerabilities: from prompt injection and context extraction to tool misuse and data exfiltration.
Importantly, threat snapshots concentrate solely on LLM-level security, not traditional software flaws. This distinction allows for reproducible, comparable evaluations of model security, independent of any specific agent framework or toolchain.
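As a rough sketch of how those three elements fit together, the illustrative Python below models a threat snapshot as a small, self-contained record plus a scoring function. The structure and names are assumptions made for explanation only, not the schema used in the paper or its forthcoming codebase.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ThreatSnapshot:
    """A freeze-frame of an agent under attack (illustrative structure)."""
    # 1. The agent's state and context: what it "knows" and which tools it can use
    system_prompt: str
    context: list[str]
    available_tools: list[str]
    # 2. The attack vector and objective: how the threat is delivered and what it tries to achieve
    attack_payload: str       # e.g. an injected web page or a poisoned memory entry
    attack_objective: str     # e.g. "insert a phishing link into the reply"
    # 3. The scoring function: how success or failure is measured
    is_exploited: Callable[[str], bool]

def run_snapshot(llm, snapshot: ThreatSnapshot) -> bool:
    """Replay the frozen moment against one backbone LLM and score the outcome."""
    prompt = "\n\n".join([
        snapshot.system_prompt,
        "Available tools: " + ", ".join(snapshot.available_tools),
        *snapshot.context,
        snapshot.attack_payload,
    ])
    output = llm(prompt)
    return snapshot.is_exploited(output)  # True means the attack succeeded
```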
As part of the b3 framework, Lakera’s research team also developed an exhaustive attack categorization: a structured map of how LLM vulnerabilities appear across different agent types. This taxonomy can serve as a foundation for threat modeling in AI security research.
From Game to Benchmark: Gandalf’s Evolution
The story of the b3 benchmark starts with Gandalf, the world’s largest red teaming community for generative AI. Built to let anyone “hack” language models using natural language, Gandalf has generated millions of human-crafted attacks that reveal how real people try to exploit AI systems.
From that foundation, Lakera created Gandalf: Agent Breaker, a sandbox of ten realistic GenAI applications designed to capture how different agent behaviors fail under pressure. What began as a game evolved into a massive dataset: over 194,000 human attack attempts, of which 10,900 were successful, spanning 10 representative agentic threat scenarios.
Scale made the difference. Only by analyzing attacks at this magnitude could the research team extract sufficiently powerful adversarial examples capable of exposing deep security weaknesses in models.
From this immense dataset, the team distilled just 0.1% of all collected attacks (the most representative, high-signal examples) to create the final b3 benchmark. Each chosen attack reflects a real-world vulnerability, from data exfiltration and malicious code execution to phishing link insertion, memory poisoning, and unauthorized tool calls, categories that correspond to just some of the threat snapshots in the benchmark.
Why does this matter? Because human creativity exposes vulnerabilities synthetic attacks still miss, though as adversarial AI continues to evolve, that gap may eventually narrow. Real players find unexpected angles, chain prompts in ways models don’t anticipate, and adapt faster than automated adversarial scripts.
That’s what makes b3 different: it doesn’t test models in sanitized lab conditions, it tests them against human ingenuity. It’s the first human-grounded, threat-realistic benchmark for AI agents, built on data from how people actually try to break them.
Inside the b3 Benchmark
The b3 benchmark is built around ten representative threat snapshots, each modeled on a real-world agent scenario and designed to test one specific type of vulnerability.
Examples include:
- Trippy Planner → phishing link insertion
- Curs-ed CodeReview → malicious code injection
- Clause AI → data exfiltration from legal documents
- MindfulChat → memory poisoning leading to denial-of-service
Each threat snapshot focuses on a single attack objective, and all ten cover distinct security failure modes observed in agentic AI systems: from data exfiltration and content injection to system compromise or policy bypass. This structure ensures broad coverage of the known LLM attack surface while keeping every test sharply focused and interpretable.
Every scenario is tested under three defense levels (baseline, hardened, and self-judging), representing common protection strategies used in AI agents today.
Each evaluation produces a vulnerability score, calculated by replaying attacks multiple times across different models to measure how consistently an exploit succeeds. This scoring system makes the benchmark reproducible, comparable, and resistant to random variance in model outputs.
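The paper defines the exact scoring procedure; as a simplified, hypothetical illustration, you can think of a vulnerability score as the rate at which replayed attacks succeed against a given model under a given defense level. The sketch below reuses the run_snapshot helper from the earlier example; all names are assumptions.

```python
def vulnerability_score(llm, snapshots, n_replays: int = 5) -> float:
    """Simplified illustration: the fraction of replayed attacks that succeed.

    Each threat snapshot is replayed several times to smooth out randomness
    in the model's outputs; the score is the overall attack success rate.
    """
    successes, trials = 0, 0
    for snapshot in snapshots:
        for _ in range(n_replays):
            successes += run_snapshot(llm, snapshot)  # True counts as 1
            trials += 1
    return successes / trials  # 0.0 = never exploited, 1.0 = always exploited
```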
In simple terms, a threat snapshot is like a freeze-frame of an AI agent under attack: a focused test that captures how the model behaves in the precise moment where things can go wrong.
What the Results Reveal
When we put 31 popular large language models to the test, the results confirmed what many suspected, and challenged a few assumptions along the way.
1. Reasoning improves security.
Models that “think out loud” before responding, breaking problems into steps, were significantly harder to exploit. In the benchmark, LLMs with explicit reasoning chains were, on average, about 15% less vulnerable to injection-based attacks. Step-by-step reasoning gives models a chance to re-evaluate malicious context instead of acting on it immediately, a form of self-checking that improves resilience.
2. Bigger doesn’t mean safer.
While model size often correlates with capability, it doesn’t guarantee stronger defenses. Some mid-sized models outperformed larger ones, showing that security isn’t a by-product of scale, but of design choices such as training data quality, alignment strategy, and reasoning depth.
3. Closed-weight models still lead. For now.
Commercial models with proprietary training pipelines tended to perform better across most threat snapshots. But the best open-weight models are closing the gap fast. This convergence suggests that open-source ecosystems are rapidly learning from real-world adversarial data.
4. Security and safety are not the same.
A model that refuses to output unsafe content isn’t necessarily resistant to manipulation. The benchmark revealed several cases where “safe” models still executed harmful or unintended actions when attacked indirectly. Safety alignment limits what a model can say; security alignment governs what it can be made to do.
Together, these findings reshape how we think about AI security. Beyond ranking performance, b3 reveals why models resist or fail under pressure.
Why This Matters
Security has long been the missing metric in how we evaluate large language models. The b3 benchmark changes that by making security measurable, comparable, and reproducible across the ecosystem, rather than providing another leaderboard.
For developers:
b3 helps you choose secure backbones, not just smarter ones. It exposes how different models handle malicious prompts, poisoned memory, or compromised tool use, the kinds of issues that surface only in production.
For model providers:
It offers a reproducible, adversarial metric to track real security progress. Instead of relying on abstract safety scores, providers can now see where their models fail, why they fail, and how defenses actually improve over time.
For researchers:
b3 creates a shared foundation for studying LLM vulnerabilities systematically: across reasoning styles, architectures, and defense strategies. It provides two practical tools for the research community: the threat snapshot framework, which enables targeted and reproducible testing, and an attack categorization that helps analyze and compare vulnerabilities across AI agents. Together, they transform red teaming data into measurable science.
For CISOs and policymakers:
It marks a step toward measurable AI trustworthiness: the ability to compare systems not by promise or perception, but by empirical resistance to attack.
Taken together, these shifts point to a broader goal that drives everything we build at Lakera: turning AI security from intuition into instrumentation. Because trust in AI shouldn’t be assumed, it should be measured.
The Road Ahead
The Backbone Breaker Benchmark (b3) is only the beginning.
Our next step is to make threat snapshots a standard for evaluating AI agent security, a common language for measuring how well large language models resist real adversarial behavior. By focusing on the exact moments where models fail, threat snapshots give researchers, developers, and security teams a shared framework for comparing progress over time.
The benchmark itself will continue to evolve. As models grow more capable, so must their tests. That means introducing new, more powerful attacks, expanding the catalog of threat snapshots, and adapting the scoring system to capture emerging failure modes in agentic and tool-using systems.
Ultimately, b3 is meant to stay alive: a living benchmark that grows alongside the frontier of AI security research.
Because as AI agents become the new interface to the digital world, benchmarks like b3 will define how we measure trust.
Learn more and get involved:
→ Read the paper on arXiv and learn more about b3. The benchmark code will be made public on GitHub once responsible disclosure with model providers is complete.
→ Test your agent breaking skills in Gandalf: Agent Breaker, the game that started it all.




