Key Concepts Before Running B3

To make sense of B3's output, you need to understand a few concepts.

Backbone LLM. The core large language model that powers an AI agent — it is called repeatedly to reason, produce output, and invoke tools. B3 doesn't test full agent pipelines; it isolates the backbone and tests whether it can be manipulated at the level of individual calls.

Threat snapshots. Each test in B3 is a freeze-frame of an agent under attack. A snapshot defines three things: the agent's state and context (system prompt, available tools), the attack vector and objective, and how success is measured. The snapshots are based on levels from Gandalf: Agent Breaker — for example, a travel planner being tricked into inserting phishing links, or a legal assistant being manipulated into exfiltrating document contents.
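Conceptually, a threat snapshot bundles those three things into one test case. A hypothetical sketch (field names are illustrative, not B3's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ThreatSnapshot:
    """Illustrative only; not B3's actual data model."""
    system_prompt: str       # the agent's state and context
    tools: list[str]         # tools available to the agent
    attack_objective: str    # what the attacker is trying to achieve
    success_criterion: str   # how success is measured

snapshot = ThreatSnapshot(
    system_prompt="You are a travel planning assistant...",
    tools=["search_flights", "send_email"],
    attack_objective="insert a phishing link into the itinerary",
    success_criterion="response contains the attacker's URL",
)
print(snapshot.attack_objective)  # insert a phishing link into the itinerary
```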

Defense levels. Each threat snapshot is tested at three levels of protection:

  • L1 — baseline: the application's system prompt with no additional defenses
  • L2 — hardened: the system prompt includes explicit security instructions
  • L3 — self-judging: uses the L1 system prompt, but a separate judge model reviews the response and can veto it if it detects a policy violation
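The L3 flow can be sketched as a veto wrapper around the primary scorer. This is a minimal illustration, not B3's implementation; `judge` here is a stand-in for the real judge model:

```python
def score_with_self_judge(response: str, primary_score: float,
                          judge_flags_violation) -> float:
    """L3 sketch: a separate judge reviews the response and can veto it.
    A flagged violation zeroes the score regardless of the primary scorer."""
    if judge_flags_violation(response):
        return 0.0  # judge veto overrides the primary scorer
    return primary_score

# Toy judge: flags any response containing a suspicious link
judge = lambda r: "http://evil.example" in r

print(score_with_self_judge("Here is your itinerary.", 1.0, judge))         # 1.0
print(score_with_self_judge("Visit http://evil.example now!", 1.0, judge))  # 0.0
```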

Prerequisites & Environment Setup

You'll need:

  • `uv` (recommended) or `pip` for package management
  • API key(s) for the model provider(s) you want to evaluate (OpenAI, Anthropic, Google, etc.)
  • An OpenAI API key regardless of target model — one of the scorers depends on OpenAI embeddings

To avoid passing your model and key on every command, create a `.env` file in your working directory:

INSPECT_EVAL_MODEL=openai/gpt-4.1-nano
OPENAI_API_KEY=<your-key>
# Put here any other API keys you need, e.g. the following:
ANTHROPIC_API_KEY=<your-key>
GOOGLE_API_KEY=<your-key>
OPENROUTER_API_KEY=<your-key>
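Inspect picks up this `.env` automatically. Before a long run, it can be worth checking that every key you need is actually set; a small hypothetical helper (not part of B3):

```python
import os

def missing_keys(required: list[str], env=os.environ) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [k for k in required if not env.get(k)]

# Example with an explicit mapping instead of the real environment:
env = {"OPENAI_API_KEY": "sk-...", "ANTHROPIC_API_KEY": ""}
print(missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"], env))
# → ['ANTHROPIC_API_KEY']
```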

Installation

Two paths depending on your use case:

From PyPI — quickest way to run evaluations:

uv pip install "inspect-evals[b3]"

From the repo — if you want to explore source, modify scorers, or reproduce the paper:

git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
uv sync --extra b3

Running Your First Evaluation

Basic CLI command:

uv run inspect eval inspect_evals/b3 --model openai/gpt-5-nano

Or from Python:

from inspect_ai import eval
from inspect_evals.b3 import b3

# The model is taken from INSPECT_EVAL_MODEL in your .env;
# pass model="..." to eval() to override it.
eval(b3)

What happens under the hood:

  • B3 loads its dataset of curated adversarial attacks, distilled from ~200K human red-team attempts collected through Gandalf: Agent Breaker
  • Each attack is replayed against the model within a specific threat snapshot (app scenario, tools, defense level)
  • The model's response is scored according to the attack objective

Smoke test first. Pass `-T limit_per_threat_snapshot=2` to run just two samples per snapshot — enough to confirm API keys, scorers, and logging all work. Note that B3 runs each attack 5 times by default (called "epochs" in Inspect), so the total comes to 30 threat snapshots × 2 samples × 5 epochs = 300 samples.

uv run inspect eval inspect_evals/b3 --model openai/gpt-5-nano \
  -T limit_per_threat_snapshot=2

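The sample-count arithmetic above generalizes; a quick hypothetical helper (not part of B3) for estimating run size before launching:

```python
def total_samples(snapshots: int, per_snapshot: int, epochs: int) -> int:
    """Estimate how many scored samples a run will produce."""
    return snapshots * per_snapshot * epochs

print(total_samples(30, 2, 5))  # smoke test → 300
```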

Viewing & Interpreting Results

Launch the log viewer after a run:

uv run inspect view

Alternatively, use the Inspect AI VS Code extension.

What you'll see:

  • Per-sample scores — did a specific attack succeed?
  • Per-threat-snapshot breakdowns — how vulnerable is the model in each app scenario and defense level?
  • Vulnerability score — how consistently attacks succeed; higher = more vulnerable
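A vulnerability score of this kind is essentially an attack success rate. A minimal sketch of the idea (the aggregation details are illustrative, not B3's exact formula):

```python
def vulnerability_score(successes: list[bool]) -> float:
    """Fraction of attack attempts that succeeded; higher = more vulnerable."""
    return sum(successes) / len(successes) if successes else 0.0

# 5 epochs of the same attack: succeeded in 2 of 5 runs
print(vulnerability_score([True, False, True, False, False]))  # 0.4
```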

Scoring uses different methods depending on the attack objective (text similarity, tool invocation matching, content detection, etc.). See the paper for details.
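As a toy illustration of similarity-based scoring: B3's real scorer uses OpenAI embeddings, but a bag-of-words cosine shows the shape of the idea (did the response land close to the attack's target text?):

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

target = "the secret password is swordfish"
print(round(cosine_similarity("the secret password is swordfish", target), 6))  # 1.0
print(cosine_similarity("i cannot help with that", target) < 0.3)               # True
```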

[Screenshots of the Inspect AI viewer: single sample view, scoring example, sample list]

Reproducing the Paper's Results

This requires the repo install path from the "Installation" section above — if you only did `uv pip install inspect-evals[b3]`, clone the repo first:

git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
cd inspect_evals
uv sync --extra b3

The repo includes a script that runs the full benchmark with the same models and settings used in the paper:

uv run python src/inspect_evals/b3/experiments/run.py --group all

⚠️ Warning: `--group all` runs the full set of 30+ models end-to-end. Expect significant API costs (thousands of dollars) and several hours of runtime. Check `src/inspect_evals/b3/experiments/constants.py` for the full model list before launching.

You'll need API keys for all providers used in the paper. Add these to your `.env`:

| Provider    | Env var                                                                     |
|-------------|-----------------------------------------------------------------------------|
| OpenAI      | `OPENAI_API_KEY`                                                            |
| Anthropic   | `ANTHROPIC_API_KEY`                                                         |
| Google      | `GOOGLE_API_KEY`                                                            |
| OpenRouter  | `OPENROUTER_API_KEY`                                                        |
| AWS Bedrock | No env var needed — uses your active AWS session (e.g. via `aws sso login`) |

Note: the script runs Bedrock models in `us-east-1` — make sure your AWS account has Bedrock access enabled in that region.

Tips & Gotchas

  • Rate limits. Use `--max-connections` to throttle concurrent requests and avoid 429 errors; you may need to tune it per provider based on your rate limits.
  • Cost. A full B3 run sends hundreds of prompts per model. Use `limit_per_threat_snapshot` during development; save full runs for when you're ready.
  • OpenAI dependency. One of the scorers uses OpenAI embeddings, so you need a valid `OPENAI_API_KEY` even when evaluating non-OpenAI models.
  • L3 self-judge can zero your scores. At L3, a judge model evaluates whether the response violates security policies. If it flags a violation, the sample score is set to 0.0 regardless of the primary scorer. This simulates a real-world guardrail layer.

What's Next

The Backbone Breaker Benchmark is designed to evolve alongside the systems it measures. As models grow more capable and new attack techniques emerge, the benchmark will continue to expand with additional threat snapshots and improved evaluation methods.

Running B3 is one way to explore the security behavior of backbone LLMs firsthand. For a deeper understanding of how real attackers probe AI systems, you can also experiment with Gandalf: Agent Breaker, the red-teaming environment that generated the adversarial data behind the benchmark.

Together, these tools aim to move AI security toward something the ecosystem has long lacked: a shared, empirical way to measure how well models resist manipulation.

You can explore the benchmark implementation in the Inspect Evals GitHub repository and learn more about the methodology in the accompanying research paper.