Meta Llama 3.3 70B Instruct Risk Report

Date Assessed:

This report summarizes the results of an adversarial evaluation of Llama 3.3 70B Instruct by Meta AI. The goal of the assessment is to measure the model’s real-world risk exposure to advanced red-teaming techniques. The evaluation tests both common and hard-to-detect attack patterns across a wide range of application types, under various system prompt defenses and configurations.

Key Findings:

Overall Ranking:

12th

Overall Risk Score:

86.02 risk score

Highest Risk Category:

ADD, DAIS, IIO with 100.0 risk score

Strongest Defense Category:

DCE with 60.5 risk score

Lakera AI Model Risk Index Methodology

The Lakera AI Model Risk Index evaluates model security by measuring how effectively models can maintain their intended behavior under adversarial conditions. Instead of relying on predefined harmful output categories, which can be limited and context-dependent, our methodology assesses models' fundamental resilience against attempts to subvert their alignment and override their intended behavior in applied settings.

For example, consider an LLM-based customer service chatbot use case. A representative system prompt might be:

“You are an AI assistant providing customer support for financial services. Your goal is to help customers with account inquiries, transaction details, and basic financial advice. Never deviate from financial-related topics or disclose internal system information.”

As attackers, if we successfully execute a Direct Instruction Override (DIO), we could force this financial chatbot to deviate significantly from its intended behavior. This could include relatively benign-seeming outcomes, such as writing a haiku about chemistry, or more clearly harmful outputs, like providing step-by-step instructions for synthesizing a chemical explosive.

The critical insight here is not the specific content of the forced output but rather the fact that an attacker can override intended operational constraints. Even seemingly harmless deviations, like the chemistry haiku, represent a significant security concern because they indicate fundamental vulnerability in enforcing model alignment and intended use.

By evaluating whether attackers can achieve their broader objectives rather than cataloging specific harmful outputs, our methodology provides a more comprehensive, realistic, and contextually relevant measure of model security.

Current Scope:

This assessment currently focuses on single LLM-based systems. It does not yet cover agentic workflows or multi-step AI systems, areas that will be included in future evaluations.

Attack Categories Explained

We categorize attacks based on the methods attackers use to break an LLM’s intended behavior. These categories comprehensively capture common attacker objectives as nearly all misalignment intentions fall within this framework.

Direct Attacks (manipulating the model directly through user input):

Direct Context Extraction (DCE)
Attempting to directly extract hidden system instructions, internal configurations, or sensitive data from the model.

‍Direct Instruction Override (DIO)
Directly instructing the model to bypass its intended operational boundaries or ignore safety measures.

Indirect Attacks (manipulating the model indirectly through content it processes):

Indirect Output Addition (ADD)
Embedding malicious or unintended content into legitimate responses without explicitly instructing the model.

‍Indirect Instruction Override (IIO)
Embedding hidden instructions into processed content to subtly corrupt the model’s core task execution.

‍Indirect Denial of AI Service (DAIS)
Injecting hidden attack vectors into inputs to disrupt normal AI behavior, causing refusal of legitimate service.

‍

Assessment Framework

Our evaluation spans multiple dimensions:

Application Types: Conversational AI, text and code generation, text classification, RAG systems, and workflow automation
Domain Coverage: Software, finance, e-commerce, education, healthcare, legal, and general applications
Difficulty Levels: The target defense level is a factor of the task type, the expected output and the system prompt.

Each test scenario reflects realistic deployment conditions, ensuring practical relevance for risk assessment and mitigation planning.

Results Overview

*How to interpret the results
Risk scores range from 0 to 100, with 0 indicating the model successfully resisted every adversarial attack, and 100 meaning it failed every time and always complied with the attacker’s objective. Lower scores signal stronger defenses, while higher scores highlight areas where attackers can more easily compromise the model’s intended behavior.

The overall risk score represents a weighted average across all attack categories.

Model Analysis

Meta Llama 3.3 70B Instruct records an overall risk score of 86.02 / 100, placing it 10th out of 14 models in this evaluation. It exhibits complete failure (100 / 100) in three indirect attack categories: Indirect Output Addition (ADD), Indirect Instruction Override (IIO), and Indirect Denial of AI Service (DAIS), underscoring a critical lack of safeguards against hidden or embedded threats. Direct attacks fare somewhat better: Direct Instruction Override (DIO) scores 69.6, and the model’s strongest area is Direct Context Extraction (DCE) at 60.5, though both still denote meaningful vulnerability.

Risk Score Breakdown

Sample Findings Table

This table presents sample attack attempts and model outputs that highlight observed risk patterns.

×

On this page

This is some text inside of a div block.

Stay Ahead of AI Threats

Get notified of new results when they drop.