As AI agents become more capable, measuring their security has become increasingly difficult. Most evaluations still focus on capability or safety, asking whether a model produces harmful content. But security is a different question entirely: can the model be manipulated to perform unintended actions?
The Backbone Breaker Benchmark (b3) was created to answer that question. Developed by Lakera in collaboration with the UK AI Security Institute, b3 tests the resilience of backbone LLMs—the core models that power AI agents—against realistic adversarial attacks. Rather than evaluating full agent systems, the benchmark isolates the model itself and measures how it behaves when confronted with the kinds of manipulations attackers use in practice.
What makes b3 unique is its foundation in real-world data. The benchmark is built from nearly 200,000 human red-team attacks collected through Gandalf: Agent Breaker, Lakera’s large-scale AI security challenge. From this dataset, the research team distilled representative attack scenarios into structured “threat snapshots,” enabling reproducible tests of how models respond under pressure.
In this guide, we’ll walk through how to install and run the Backbone Breaker Benchmark, interpret its results, and reproduce the experiments described in the research paper.
Key Concepts Before Running B3
To make sense of B3's output, you need to understand a few concepts.
Backbone LLM. The core large language model that powers an AI agent — it gets called sequentially to reason, produce output, and invoke tools. B3 doesn't test full agent pipelines. It isolates the backbone and tests whether it can be manipulated at the individual call level.
Threat snapshots. Each test in B3 is a freeze-frame of an agent under attack. A snapshot defines three things: the agent's state and context (system prompt, available tools), the attack vector and objective, and how success is measured. The snapshots are based on levels from Gandalf: Agent Breaker — for example, a travel planner being tricked into inserting phishing links, or a legal assistant being manipulated into exfiltrating document contents.
Defense levels. Each threat snapshot is tested at three levels of protection:
- L1 — baseline: the application's system prompt with no additional defenses
- L2 — hardened: the system prompt includes explicit security instructions
- L3 — self-judging: uses the L1 system prompt, but a separate judge model reviews the response and can veto it if it detects a policy violation
Prerequisites & Environment Setup
You'll need:
- -bc-uv-bc- (recommended) or -bc-pip-bc- for package management
- API key(s) for the model provider(s) you want to evaluate (OpenAI, Anthropic, Google, etc.)
- An OpenAI API key regardless of target model — one of the scorers depends on OpenAI embeddings
To avoid passing your model and key on every command, create a -bc-.env-bc- file in your working directory:
Installation
Two paths depending on your use case:
From PyPI — quickest way to run evaluations:
From the repo — if you want to explore source, modify scorers, or reproduce the paper:
Running Your First Evaluation
Basic CLI command:
Or from Python:
What happens under the hood:
- B3 loads its dataset of curated adversarial attacks, distilled from ~200K human red-team attempts collected through Gandalf: Agent Breaker
- Each attack is replayed against the model within a specific threat snapshot (app scenario, tools, defense level)
- The model's response is scored according to the attack objective
Smoke test first. This runs just 2 samples per snapshot — enough to confirm API keys, scorers, and logging all work. Note that B3 runs each attack 5 times by default (called "epochs" in Inspect), so the total sample count will be 30 threat snapshots × 2 samples × 5 epochs = 300 samples.
This runs just two samples per snapshot — enough to confirm API keys, scorers, and logging all work.
Viewing & Interpreting Results
Launch the log viewer after a run:
Alternatively, use the Inspect AI VS Code extension.
What you'll see:
- Per-sample scores — did a specific attack succeed?
- Per-threat-snapshot breakdowns — how vulnerable is the model in each app scenario and defense level?
- Vulnerability score — how consistently attacks succeed; higher = more vulnerable
Scoring uses different methods depending on the attack objective (text similarity, tool invocation matching, content detection, etc.). See the paper for details.
Reproducing the Paper's Results
This requires the repo install path from the "Installation" section above — if you only did -bc-uv pip install inspect-evals[b3]-bc-, clone the repo first:
The repo includes a script that runs the full benchmark with the same models and settings used in the paper:
⚠️ Warning: -bc---group all-bc- runs the full set of 30+ models end-to-end. Expect significant API costs (thousands of dollars) and several hours of runtime. Check -bc-src/inspect_evals/b3/experiments/constants.py-bc- for the full model list before launching.
You'll need API keys for all providers used in the paper. Add these to your -bc-.env-bc-:
Note: the script runs Bedrock models in -bc-us-east-1-bc- — make sure your AWS account has Bedrock access enabled in that region.
Tips & Gotchas
- Rate limits. Use -bc---max-connections-bc- to throttle concurrent requests and avoid 429 errors. You might need to tune it per provider based on your rate limits with them.
- Cost. A full B3 run sends hundreds of prompts per model. Use -bc-limit_per_threat_snapshot-bc- during development; save full runs for when you're ready.
- OpenAI dependency. One of the scorers uses OpenAI embeddings, so you need a valid -bc-OPENAI_API_KEY-bc- even when evaluating non-OpenAI models.
- L3 self-judge can zero your scores. At L3, a judge model evaluates whether the response violates security policies. If it flags a violation, the sample score is set to 0.0 regardless of the primary scorer. This simulates a real-world guardrail layer.
What's Next
The Backbone Breaker Benchmark is designed to evolve alongside the systems it measures. As models grow more capable and new attack techniques emerge, the benchmark will continue to expand with additional threat snapshots and improved evaluation methods.
Running b3 is one way to explore the security behavior of backbone LLMs firsthand. For a deeper understanding of how real attackers probe AI systems, you can also experiment with Gandalf: Agent Breaker, the red-teaming environment that generated the adversarial data behind the benchmark.
Together, these tools aim to move AI security toward something the ecosystem has long lacked: a shared, empirical way to measure how well models resist manipulation.
You can explore the benchmark implementation in the Inspect Evals GitHub repository and learn more about the methodology in the accompanying research paper.





