Gemini 2.0 Flash Risk Report

Date Assessed:

May 22, 2025

This report summarizes the results of an adversarial evaluation of Gemini 2.0 Flash by Google DeepMind. The goal of the assessment is to measure the model’s real-world risk exposure to advanced red-teaming techniques. The evaluation tests both common and hard-to-detect attack patterns across a wide range of application types, under various system prompt defenses and configurations.

Key Findings:

Overall Ranking:

18th

Overall Risk Score:

90.84 risk score

Highest Risk Category:

DAIS, IIO with 100.0 risk score

Strongest Defense Category:

DAIS with 81.5 risk score

Lakera AI Model Risk Index Methodology

The Lakera AI Model Risk Index evaluates model security by measuring how effectively models can maintain their intended behavior under adversarial conditions. Instead of relying on predefined harmful output categories, which can be limited and context-dependent, our methodology assesses models' fundamental resilience against attempts to subvert their alignment and override their intended behavior in applied settings.

For example, consider an LLM-based customer service chatbot use case. A representative system prompt might be:

“You are an AI assistant providing customer support for financial services. Your goal is to help customers with account inquiries, transaction details, and basic financial advice. Never deviate from financial-related topics or disclose internal system information.”

As attackers, if we successfully execute a Direct Instruction Override (DIO), we could force this financial chatbot to deviate significantly from its intended behavior. This could include relatively benign-seeming outcomes, such as writing a haiku about chemistry, or more clearly harmful outputs, like providing step-by-step instructions for synthesizing a chemical explosive.

The critical insight here is not the specific content of the forced output but rather the fact that an attacker can override intended operational constraints. Even seemingly harmless deviations, like the chemistry haiku, represent a significant security concern because they indicate fundamental vulnerability in enforcing model alignment and intended use.

By evaluating whether attackers can achieve their broader objectives rather than cataloging specific harmful outputs, our methodology provides a more comprehensive, realistic, and contextually relevant measure of model security.

Current Scope:

This assessment currently focuses on single LLM-based systems. It does not yet cover agentic workflows or multi-step AI systems, areas that will be included in future evaluations.

Attack Categories Explained

We categorize attacks based on the methods attackers use to break an LLM’s intended behavior. These categories comprehensively capture common attacker objectives as nearly all misalignment intentions fall within this framework.

Direct Attacks (manipulating the model directly through user input):

Direct Context Extraction (DCE)
Attempting to directly extract hidden system instructions, internal configurations, or sensitive data from the model.

‍Direct Instruction Override (DIO)
Directly instructing the model to bypass its intended operational boundaries or ignore safety measures.

Indirect Attacks (manipulating the model indirectly through content it processes):

Indirect Output Addition (ADD)
Embedding malicious or unintended content into legitimate responses without explicitly instructing the model.

‍Indirect Instruction Override (IIO)
Embedding hidden instructions into processed content to subtly corrupt the model’s core task execution.

‍Indirect Denial of AI Service (DAIS)
Injecting hidden attack vectors into inputs to disrupt normal AI behavior, causing refusal of legitimate service.

‍

Assessment Framework

Our evaluation spans multiple dimensions:

Application Types: Conversational AI, text and code generation, text classification, RAG systems, and workflow automation
Domain Coverage: Software, finance, e-commerce, education, healthcare, legal, and general applications
Difficulty Levels: The target defense level is a factor of the task type, the expected output and the system prompt.

Each test scenario reflects realistic deployment conditions, ensuring practical relevance for risk assessment and mitigation planning.

Results Overview

*How to interpret the results Risk scores range from 0 to 100, with 0 indicating the model successfully resisted every adversarial attack, and 100 meaning it failed every time and always complied with the attacker's objective. Lower scores signal stronger defenses, while higher scores highlight areas where attackers can more easily compromise the model's intended behavior.

The overall risk score represents a weighted average across all attack categories.

Model Analysis

Gemini 2.0 Flash exhibits critical vulnerabilities across all attack categories, achieving an overall risk score of 90.84/100, placing it among the least secure models tested. The model demonstrates complete failure (100/100) against both Indirect Denial of AI Service (DAIS) and Indirect Instruction Override (IIO) attacks, with attackers consistently achieving their objectives in these categories. Direct attacks also pose severe risks, with Direct Instruction Override scoring 87.7 and Direct Context Extraction at 85.0. Even in its relatively strongest area, Indirect Output Addition (ADD), the model still shows a concerning risk score of 81.5, indicating the lack of adequate protection across all attack vectors.

Risk Score Breakdown

Sample Findings Table

This table presents sample attack attempts and model outputs that highlight observed risk patterns.

×

On this page

This is some text inside of a div block.

Stay Ahead of AI Threats

Get notified of new results when they drop.