How to Run the Backbone Breaker Benchmark (B3)

As AI agents become more capable, measuring their security has become increasingly difficult. Most evaluations still focus on capability or safety, asking whether a model produces harmful content. But security is a different question entirely: can the model be manipulated to perform unintended actions?

The Backbone Breaker Benchmark (b3) was created to answer that question. Developed by Lakera in collaboration with the UK AI Security Institute, b3 tests the resilience of backbone LLMs—the core models that power AI agents—against realistic adversarial attacks. Rather than evaluating full agent systems, the benchmark isolates the model itself and measures how it behaves when confronted with the kinds of manipulations attackers use in practice.

What makes b3 unique is its foundation in real-world data. The benchmark is built from nearly 200,000 human red-team attacks collected through Gandalf: Agent Breaker, Lakera’s large-scale AI security challenge. From this dataset, the research team distilled representative attack scenarios into structured “threat snapshots,” enabling reproducible tests of how models respond under pressure.

In this guide, we’ll walk through how to install and run the Backbone Breaker Benchmark, interpret its results, and reproduce the experiments described in the research paper.

The Enterprise Playbook for Agentic AI Security

Lakera Agentic AI Security Playbook Cover

AI systems now retrieve data, invoke tools, and act across enterprise workflows. Get the playbook to learn how to secure AI across employees, applications, and agents.

Inside the Playbook

Why traditional security models fall short
The three new AI exposure surfaces
How to secure the execution layer
What a unified AI Defense Plane looks like in practice

Get instant access

Fill out the form below and the playbook will be sent directly to your inbox.

Learn more about the Playbook →

Key Concepts Before Running B3

To make sense of B3's output, you need to understand a few concepts.

Backbone LLM. The core large language model that powers an AI agent — it gets called sequentially to reason, produce output, and invoke tools. B3 doesn't test full agent pipelines. It isolates the backbone and tests whether it can be manipulated at the individual call level.

Threat snapshots. Each test in B3 is a freeze-frame of an agent under attack. A snapshot defines three things: the agent's state and context (system prompt, available tools), the attack vector and objective, and how success is measured. The snapshots are based on levels from Gandalf: Agent Breaker — for example, a travel planner being tricked into inserting phishing links, or a legal assistant being manipulated into exfiltrating document contents.

Defense levels. Each threat snapshot is tested at three levels of protection:

L1 — baseline: the application's system prompt with no additional defenses
L2 — hardened: the system prompt includes explicit security instructions
L3 — self-judging: uses the L1 system prompt, but a separate judge model reviews the response and can veto it if it detects a policy violation

Prerequisites & Environment Setup

You'll need:

-bc-uv-bc- (recommended) or -bc-pip-bc- for package management
API key(s) for the model provider(s) you want to evaluate (OpenAI, Anthropic, Google, etc.)
An OpenAI API key regardless of target model — one of the scorers depends on OpenAI embeddings

To avoid passing your model and key on every command, create a -bc-.env-bc- file in your working directory:

INSPECT_EVAL_MODEL=openai/gpt-4.1-nano
OPENAI_API_KEY=<your-key>
# Put here any other API keys you need, e.g. the following:
ANTHROPIC_API_KEY=<your-key>
GOOGLE_API_KEY=<your-key>
OPENROUTER_API_KEY=<your-key>

Installation

Two paths depending on your use case:

From PyPI — quickest way to run evaluations:

uv pip install inspect-evals[b3]

From the repo — if you want to explore source, modify scorers, or reproduce the paper:

git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
uv sync --extra b3

Running Your First Evaluation

Basic CLI command:

uv run inspect eval inspect_evals/b3 --model openai/gpt-5-nano

Or from Python:

from inspect_ai import eval
from inspect_evals.b3 import b3

eval(b3)

What happens under the hood:

B3 loads its dataset of curated adversarial attacks, distilled from ~200K human red-team attempts collected through Gandalf: Agent Breaker
Each attack is replayed against the model within a specific threat snapshot (app scenario, tools, defense level)
The model's response is scored according to the attack objective

Smoke test first. This runs just 2 samples per snapshot — enough to confirm API keys, scorers, and logging all work. Note that B3 runs each attack 5 times by default (called "epochs" in Inspect), so the total sample count will be 30 threat snapshots × 2 samples × 5 epochs = 300 samples.

uv run inspect eval inspect_evals/b3 --model openai/gpt-5-nano \
  -T limit_per_threat_snapshot=2

This runs just two samples per snapshot — enough to confirm API keys, scorers, and logging all work.

Viewing & Interpreting Results

Launch the log viewer after a run:

uv run inspect view

Alternatively, use the Inspect AI VS Code extension.

What you'll see:

Per-sample scores — did a specific attack succeed?
Per-threat-snapshot breakdowns — how vulnerable is the model in each app scenario and defense level?
Vulnerability score — how consistently attacks succeed; higher = more vulnerable

Scoring uses different methods depending on the attack objective (text similarity, tool invocation matching, content detection, etc.). See the paper for details.

Inspection results in the Inspect AI viewer

Single sample view

Scoring example

Sample list

Reproducing the Paper's Results

This requires the repo install path from the "Installation" section above — if you only did -bc-uv pip install inspect-evals[b3]-bc-, clone the repo first:

git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
uv sync --extra b3

The repo includes a script that runs the full benchmark with the same models and settings used in the paper:

uv run python src/inspect_evals/b3/experiments/run.py --group all

⚠️ Warning: -bc---group all-bc- runs the full set of 30+ models end-to-end. Expect significant API costs (thousands of dollars) and several hours of runtime. Check -bc-src/inspect_evals/b3/experiments/constants.py-bc- for the full model list before launching.

You'll need API keys for all providers used in the paper. Add these to your -bc-.env-bc-:

| Provider    | Env var                                                                     |
|-------------|-----------------------------------------------------------------------------|
| OpenAI      | `OPENAI_API_KEY`                                                            |
| Anthropic   | `ANTHROPIC_API_KEY`                                                         |
| Google      | `GOOGLE_API_KEY`                                                            |
| OpenRouter  | `OPENROUTER_API_KEY`                                                        |
| AWS Bedrock | No env var needed — uses your active AWS session (e.g. via `aws sso login`) |

Note: the script runs Bedrock models in -bc-us-east-1-bc- — make sure your AWS account has Bedrock access enabled in that region.

Tips & Gotchas

Rate limits. Use -bc---max-connections-bc- to throttle concurrent requests and avoid 429 errors. You might need to tune it per provider based on your rate limits with them.
Cost. A full B3 run sends hundreds of prompts per model. Use -bc-limit_per_threat_snapshot-bc- during development; save full runs for when you're ready.
OpenAI dependency. One of the scorers uses OpenAI embeddings, so you need a valid -bc-OPENAI_API_KEY-bc- even when evaluating non-OpenAI models.
L3 self-judge can zero your scores. At L3, a judge model evaluates whether the response violates security policies. If it flags a violation, the sample score is set to 0.0 regardless of the primary scorer. This simulates a real-world guardrail layer.

What's Next

The Backbone Breaker Benchmark is designed to evolve alongside the systems it measures. As models grow more capable and new attack techniques emerge, the benchmark will continue to expand with additional threat snapshots and improved evaluation methods.

Running b3 is one way to explore the security behavior of backbone LLMs firsthand. For a deeper understanding of how real attackers probe AI systems, you can also experiment with Gandalf: Agent Breaker, the red-teaming environment that generated the adversarial data behind the benchmark.

Together, these tools aim to move AI security toward something the ecosystem has long lacked: a shared, empirical way to measure how well models resist manipulation.

You can explore the benchmark implementation in the Inspect Evals GitHub repository and learn more about the methodology in the accompanying research paper.

The Lakera team has accelerated Dropbox’s GenAI journey.

Not sure how to secure your GenAI application?
Skip the guesswork with expert-recommended policies built by Lakera’s AI security team. Apply them in seconds, fine-tune when you’re ready, and get started with real protection from day one.

Download the Guide

On this page

Text Link

Hide table of contents

Show table of contents