TL;DR

-db1-

  • Why we built it: Most LLM security benchmarks miss what happens under real-world adversarial pressure. The Lakera AI Index fills that gap with realistic, attack-based evaluations.
  • How it works: Models are tested across direct and indirect adversarial attack categories, then scored using a standardized 0–100 AI risk assessment framework.
  • What you get: Actionable, side-by-side model comparisons that support safer model selection, secure deployment, and governance for GenAI systems.

-db1-

What Is the Lakera AI Model Risk Index? A New LLM Security Benchmark

The Lakera AI Model Risk Index is a runtime-focused security benchmark that evaluates how well large language models (LLMs) uphold their intended behavior when faced with adversarial pressure.

It reflects how models actually perform in applied settings—real enterprise applications where business logic, system prompts, and external data all interact and attackers don’t follow rules.

Most risk assessment approaches focus on surface-level issues: testing prompt responses in isolation and with context independent static prompt attacks that focus on quantity and not on context or quality. By contrast, the Index asks a more practical question for enterprises: how easily can this model be manipulated to break mission-specific rules and objectives and in which type of deployments?

The difference is critical.

To mirror production reality, each test scenario centres on a concrete use case (e.g., customer support, RAG search, code generation) with task-specific prompts and guardrails. This lets us measure resilience far beyond traditional alignment checks such as toxic-language filters.

Meaningful security means ensuring a model consistently does what it’s meant to do, even when prompted with adversarial input. The Lakera Index shows whether models can uphold that alignment under pressure, without drifting into unintended behavior.

Lakera's AI Model Risk Index

A Haiku That Shouldn’t Exist

Consider a financial customer support chatbot. Its system prompt is clear: only respond to queries about accounts, transactions, or financial advice. But what happens if an attacker convinces it to write a haiku about chemistry?

That might sound harmless, but it reveals a deeper flaw: the model can be coerced into ignoring its core instructions.

If it can write a haiku today, what could it be convinced to do tomorrow?

This is the kind of failure the Index detects: not just harmful outputs, but failures in enforcing behavioral boundaries.

What Makes This AI Security Benchmark Different?

The Lakera AI Model Risk Index goes beyond traditional testing in three fundamental ways:

  • It’s realistic. Built on real-world attack techniques, including prompt injections, jailbreaks, and indirect attacks through RAG systems. It’s tested in applied settings that reflect how enterprises actually use LLMs. It focuses on whether models enforce behavioral boundaries under adversarial pressure, not just whether they avoid generating toxic or harmful text.
  • It’s comprehensive. Captures both direct manipulations (through user input) and indirect ones (through content the model processes and systems and tools it has access to).
  • It’s measurable. Assigns a clear, 0–100 risk score for each model, enabling consistent comparisons and tracking over time.

This gives security and risk teams the clarity they need to evaluate model selection, guide deployment decisions, and support governance efforts.

How the Lakera AI Model Risk Index Works

Every model in the Lakera Index is tested using real-world attack techniques, under realistic conditions. We simulate what it’s like when someone tries to exploit the model, whether by directly prompting it to break the rules, or by embedding hidden instructions in content the model processes, like a document or a web page.

Rather than relying on a fixed list of bad inputs and outputs, we look at whether the model can maintain its role and follow its instructions when something tries to push it off course.

Mapping Real Attacks to Measurable Categories

To make the results useful, we group adversarial behaviors into clear, recognizable categories based on attacker goals. These include:

  • Direct Attacks: where an attacker interacts with the model directly, trying to force it to ignore its instructions or reveal hidden information.
  • Indirect Attacks: where the attacker hides malicious instructions in the input data the model processes (like a support ticket or retrieved document), trying to influence the model without directly engaging it.

Each model is tested across these categories, revealing its strengths and vulnerabilities.

How Attack Simulations Translate to Risk Scores

After testing, we assign a risk score to each model in every category. The scoring is simple:

0 means the model successfully resisted every attempt. 100 means it failed every time.

These scores give a clear, quantifiable view of how each model holds up across different threat types—making it easy to compare models side-by-side or track how one model evolves over time.

-db1-

A Tale of Two Models

In one case, we tested a model that showed moderate-to-high vulnerability across multiple categories—including 84.0 in Direct Constraint Evasion (DCE) and 78.7 in Indirect Output Addition (ADD). Even in more subtle categories like Denial of AI Service (DAIS), it struggled, scoring 50.0—indicating that it sometimes refused legitimate requests when adversarial input was present.

Another model, tested under the same conditions, showed inverse behavior in several of those categories. It scored 67.0 in DCE and 100.0 in ADD, but its DAIS score was also 100.0, meaning its intended functionality was always degraded or disrupted by an indirect attack.

The contrast shows how different models can fail in fundamentally different ways—even in the same attack category. Without structured, adversarial testing across diverse vectors, these blind spots would go unnoticed.

-db1-

Understanding Risk Patterns

These patterns matter because they reflect how the model is likely to behave once deployed.

By exposing these failure modes early, the Index gives security teams a practical advantage: the ability to choose, configure, and monitor models with full visibility into how they handle pressure—not just in theory, but in real-world scenarios.

Using the Risk Index for Safer GenAI Deployment

Security teams are under pressure to move fast, but also to stay in control. With new models launching constantly and GenAI use cases expanding across every industry, the risks aren’t theoretical anymore.

The Lakera AI Model Risk Index gives you a grounded, up to date and practical foundation for decision-making. Whether you’re choosing which model to deploy, hardening your system prompts, or preparing for a compliance audit, the Index helps you move from gut instinct to measurable assurance.

-db1-

Here are some of the ways teams can apply the Index to their GenAI strategy:

  • Model Selection: Choosing the right LLM isn’t just about performance benchmarks on reasoning or speed, it’s also about how well it holds its ground under adversarial pressure. The Index helps you compare models based on how they actually behave when attacked balancing utility and security.
  • Deployment Strategy: Security postures vary across direct and indirect threats. The Index reveals where a model is vulnerable, informing how you structure system prompts, content filters, fallback mechanisms, and human-in-the-loop interventions.
  • Governance and Compliance: The ability to document and quantify model behavior under attack is increasingly important for internal risk reviews and external compliance. The Index provides a standardized reference point you can build on.

-db1-

In a space moving this fast, visibility is leverage. And the Lakera AI Model Risk Index offers exactly that: a way to see what’s working, what’s failing, and where you need to act next.

Ready to see how your models measure up?

Explore the full Lakera AI Model Risk Index and try the interactive benchmark.