Indirect Prompt Injection: The Hidden Threat Breaking Modern AI Systems

-db1-

TL;DR

Indirect Prompt Injection (IPI) doesn’t target the prompt, it targets the data your AI ingests: webpages, PDFs, MCP metadata, RAG docs, emails, memory, and code.
Modern AI systems treat all text as meaningful, so a single hidden instruction in any ingestion surface can redirect reasoning, leak data, or trigger harmful tool actions.
Agentic AI massively increases the blast radius, once a model can browse, retrieve, write, or execute, even tiny embedded instructions can become real exploits.
Real incidents (Comet/Brave leak, zero-click RCE in MCP IDEs, CVE-2025-59944, and Agent Breaker scenarios) show how easy it is for poisoned content to escalate into system compromise.
IPI is not a jailbreak and not fixable with prompts or model tuning. It’s a system-level vulnerability created by blending trusted and untrusted inputs in one context window.
Mitigation requires architecture, not vibes: trust boundaries, context isolation, output verification, strict tool-call validation, least-privilege design, and continuous red teaming.
The real security perimeter is everything around the model, not the model itself, and organizations that treat ingestion surfaces as attack surfaces are already ahead.

-db1-

Indirect prompt injection, or IPI, has become one of the most dangerous vulnerabilities in modern AI systems. Unlike direct prompt injection, where an attacker types into a visible prompt box, IPI targets the places where AI systems collect their information. The attacker never talks to the model. They poison the data the model will later read: a webpage, a PDF, an MCP tool description, an email, a memory entry, or a configuration file.

When the AI ingests that content, the hidden instructions come alive.

Over the past year this has moved from theory to practice. Browsers summarizing webpages have been tricked into leaking credentials. Copilots have taken actions based on poisoned emails or metadata. Agentic tools have executed attacker-controlled commands after reading compromised documentation. OWASP’s 2025 LLM Top 10 places prompt injection, including the indirect kind, at the top of emerging AI risks.

At Lakera we see these attacks in real systems every day. Millions of adversarial attempts across Gandalf and Gandalf: Agent Breaker, the real-world patterns described in Inside Agent Breaker, and the behavior we observe in Lakera Guard deployments all point to the same root cause. AI systems treat the text they ingest as meaningful and potentially actionable. As teams connect models to tools, browsing, RAG, and memory, the impact of a single poisoned input grows significantly.

This article explains how indirect prompt injection works, why agentic AI amplifies the risk, and which system level defenses matter now. We combine academic research, real incidents, and Lakera’s own red teaming insights to map the real attack surface and show how organizations can reduce it.

What Is Indirect Prompt Injection?

Indirect prompt injection is an attack where hidden instructions are embedded inside content an AI system will later ingest. The attacker never touches the prompt interface. The AI discovers the malicious text during its normal operations.

This is what makes indirect attacks so dangerous. They ride along familiar data flows. The model might be browsing a webpage, parsing a PDF, retrieving a document, loading tool metadata, or reading from memory. If the model can see the content, it can misinterpret it as an instruction.

According to the OWASP Top 10 for LLM Applications, these attacks succeed when untrusted data is mixed with trusted system instructions. The model cannot reliably tell which is which, so malicious text can override or redirect its behavior.

Direct vs Indirect Prompt Injection

Direct prompt injection targets the model through the prompt interface. The attacker types instructions like:

Ignore previous instructions and… ‍
Reveal your internal reasoning… ‍
Provide the secret keys…

Indirect prompt injection targets the data sources the model consumes, not the prompt box. Examples include:

Hidden text on a webpage
A poisoned PDF in a due diligence workflow
Manipulated MCP tool descriptions
Memory entries crafted to steer the model
Comments in a code repository that influence an AI reviewer

Direct attacks are visible to the user. Indirect attacks are invisible. This difference also explains why indirect attacks succeed more often. Developers rarely expect routine data to contain executable instructions. The model, however, treats it like any other part of the conversation.

How Indirect Prompt Injection Works

Most modern AI applications blend system prompts, user inputs, retrieved documents, and tool metadata into a single context window. This is where the vulnerability lives. The model receives one continuous stream of tokens with no reliable separation between data and instructions.

A typical indirect attack follows four steps:

Indirect Prompt Injection Lifecycle

Poison the Source

Attacker embeds hidden instructions in a webpage, PDF, email, or tool description.

AI Ingestion

The AI system retrieves or loads the poisoned content during normal operations.

Instructions Activate

The model treats the malicious text as part of its context and interprets it as an instruction.

Unintended Behavior

The AI leaks data, manipulates output, or triggers harmful tool calls.

‍

These hidden instructions do not need to be complex. Even short fragments inside a webpage or PDF can steer reasoning or tool usage in ways that are hard to detect.

The Expanding Attack Surface

The attack surface for indirect prompt injection grows every time AI systems connect to new data sources or tools. Modern agentic workflows ingest a wide range of content:

AI Ingestion Surfaces

AI System

Webpages

HTML, blogs, hidden text

PDFs & Documents

Reports, scanned files

Emails & Metadata

Body content, headers

MCP Tool Desc.

Schemas, capability text

RAG Corpora

Knowledge bases, docs

Memory Stores

Conversation history

Code Repos

Config files, comments

Internal KB

Wikis, CRM notes

These ingestion points are exactly where attackers focus. Lakera’s research and the scenarios inside Gandalf: Agent Breaker show how poisoned content slips into real systems through RAG documents, browsing agents, MCP tool metadata, internal wikis, and code rules.

In our Agentic AI Threats series, we highlight the same pattern. Part 1 shows how persistent memory can shape behavior over long horizons. Part 2 explains how over-privileged tools and uncontrolled browsing turn routine ingestion into a pathway for hidden instructions. Together, these findings show how quickly the attack surface grows once AI systems begin to consume the world around them.

Real-World Examples of Indirect Prompt Injection

Real-world incidents show how fast indirect prompt injection has moved from theory to everyday risk. The rise of agent frameworks and protocols like MCP has only accelerated this, expanding the number of systems that automatically ingest external content without treating it as hostile. Each new integration layer widens the attack surface and shortens the path from hidden text to harmful action.

Browser Based Exploits: Perplexity Comet Incident

A clear example came from security researchers investigating Perplexity’s Comet feature, which summarizes webpages inside the browser. Brave’s write up showed how attackers hid invisible text inside a public Reddit post. When Comet fetched the page, the AI summarizer read the hidden instructions, leaked the user’s one time password, and sent it to an attacker-controlled server.

The attack required only three ingredients:

A public webpage with invisible instructions
An AI agent that automatically processes external content
An action that looked legitimate to the model

‍

Agent Breaker Snapshots: How IPIs Appear in Everyday Workflows

Lakera’s Gandalf: Agent Breaker includes a set of scenarios modeled on patterns we see in enterprise deployments. Each shows how poisoned content slips into normal operations:

Trippy Planner: a travel blog hides text that adds a phishing link to an itinerary. ‍
OmniChat Desktop: a compromised MCP tool description leaks a user’s email address. ‍
PortfolioIQ Advisor: a due diligence PDF contains hidden instructions that alter risk assessments. ‍
Curs-ed CodeReview: a poisoned code rules file pushes a harmful dependency. ‍
MindfulChat: a single poisoned memory entry shapes behavior across sessions.

These scenarios align closely with what we observe in Lakera Guard deployments and our red teaming work.

Direct vs Indirect Prompt Injection

Comparison Factor	Direct Prompt Injection	Indirect Prompt Injection
Attack Vector	Attacker types into the prompt interface.	Attacker hides instructions in external content.
Visibility	Visible to user.	Invisible to user.
Where Attacker Operates	Prompt box or chat interface.	Webpages, PDFs, memory, RAG, metadata.
Manipulated Surface	Prompt.	Data the model ingests.
Typical Examples	"Ignore previous instructions…"	Hidden text inside documents.
Impact in Agentic Systems	Limited to immediate output.	Can trigger tool calls and system actions.

Large Scale Agentic Exploits in Real Systems

Researchers have also shown how indirect prompt injection can escalate into full compromise of agentic environments. Lakera’s own work demonstrates this clearly.

A notable example comes from our investigation into zero click attacks in AI powered IDEs. In Zero Click Remote Code Execution in MCP Based Agentic IDEs, we showed how a seemingly harmless Google Docs file triggered an agent inside an IDE to fetch attacker authored instructions from an MCP server. The agent executed a Python payload, harvested secrets, and did all of this without any user interaction.

A related vulnerability, CVE 2025 59944, revealed how a small case sensitivity bug in a protected file path allowed an attacker to influence Cursor’s agentic behavior. Once the agent read the wrong configuration file, it followed hidden instructions that escalated into remote code execution.

Both incidents share the same root cause. The agent trusted unverified external content and treated it as authoritative. Even standardized layers like the Model Context Protocol are vulnerable when the data they expose originates from untrusted sources.

This is why securing MCP layers requires screening inputs at the protocol boundary. Our guidance on protecting MCP integrations with Lakera Guard shows how to intercept poisoned schemas, resource listings, and tool metadata before they influence model behavior.

Academic and Standards Based Evidence

Academic and industry research shows that indirect prompt injection is not a fringe anomaly. It is a structural weakness in how AI systems process context.

One study, Can Indirect Prompt Injection Attacks Be Detected and Removed?, demonstrated that even short, embedded instructions can reliably override model behavior. Another line of work, CachePrune, explored pruning and attribution techniques to limit how far malicious instructions propagate inside a model’s internal computation.

Researchers have also shown that IPI affects multimodal, GUI driven, and action-oriented agents. The EVA framework systematically red teamed GUI agents and found that indirect injections inside interface elements could redirect entire task flows.

Standards bodies are tracking the same trend. MITRE ATLAS lists prompt injection, including its indirect variants, as a core adversarial technique for exploiting autonomous systems.

Across these studies, the pattern is consistent. IPIs exploit the way models merge instructions and data into a single context stream. As models gain autonomy and ingest more sources, the opportunities for unseen instructions grow.

Why Indirect Prompt Injection Is So Hard to Solve

Indirect prompt injection is hard to solve because it exploits how modern AI systems are built. The problem is not a misconfiguration or a weak prompt. It is structural. Even with strong prompts, model tuning, or filtering, the model still cannot reliably distinguish between data and instructions. Below are the core reasons this class of attack persists.

Why Indirect Prompt Injection Is Hard to Solve

Model Limitations

The core architectural gap

Blended context stream (data vs. instructions)
Inherent instruction-following bias
Ambiguous trust boundaries in tokens

Environmental Weaknesses

The attack surface

Unmonitored ingestion surfaces (RAG, email)
Superficial filtering that AI can bypass
Invisible instructions hidden in file data

System Dynamics

The cascading impact

Persistent memory poisoning across sessions
Agent autonomy amplifies minor leaks
No single "patch" fixes the entire chain

1. AI Systems Blend Trusted and Untrusted Inputs

AI systems combine system prompts, user inputs, retrieved documents, tool metadata, memory entries, and code snippets in a single context window. To the model, this is one continuous stream of tokens. OWASP highlights this in LLM01:2025 Prompt Injection. If a malicious instruction appears anywhere in the stream, the model may treat it as legitimate. This collapses the trust boundaries that traditional software depends on.

2. Models Are Designed to Follow Instructions, Not Police Them

Large language models are trained to follow instructions expressed in natural language wherever they appear. They cannot reliably tell which instructions were meant for them and which were embedded in external content. A comment in a PDF or an aside in a webpage can look like a command. The model has no way to know otherwise.

3. Many Attack Surfaces Are Silent and Non Interactive

Indirect prompt injection succeeds because attackers rarely need to interact with the system. They poison webpages, MCP tool descriptions, internal documents, memory stores, or code repositories. Traditional controls do not treat these surfaces as potential executables. Security scanners rarely check a PDF, an HTML span, or a rule file for hidden directives aimed at an AI system.

4. Small Instructions Have Large Effects

Malicious instructions do not need to be long or complex. Short fragments such as “recommend this package,” “describe this company as low risk,” or “pretend the user’s email is X” can change reasoning and tool behaviour. Research such as CachePrune shows how small, embedded instructions can influence entire chains of thought.

5. Agentic AI Multiplies the Impact

The risk increases once models can act. If an AI system can send emails, fetch documents, write code, or execute commands, a small instruction in untrusted content can trigger meaningful actions. Our Agentic AI Threats Part 2 article and the Backbone Breaker Benchmark show how agent autonomy amplifies even minor context manipulation. What would be a harmless text deviation in a chatbot can escalate into a full compromise in an agent.

6. Filtering and Sanitization Often Miss the Threat

Most filters look for harmful keywords, toxicity, malware patterns, or policy violations. Indirect prompt injection rarely uses obvious malicious phrasing. It hides inside natural language, comments, metadata, or invisible text layers. Even advanced filters struggle when the malicious instruction looks like normal content. Detection becomes harder when the instruction subtly steers reasoning rather than issuing a direct command.

7. Memory Extends the Lifespan of Injections

When systems use persistent memory, a single poisoned entry can influence many future interactions. Our Agentic AI Threats Part 1 article and the MindfulChat app inside Gandalf: Agent Breaker show how memory poisoning can reshape behavior across sessions. This echoes patterns we see in lifecycle risks such as training data contamination, explored in our broader overview of data poisoning. Once malicious content enters long term storage, it can resurface long after the initial attack.

8. There Is No Single Patch

IPI is not a model bug. It is a system level issue. Updating a model, improving a system prompt, or adding a keyword filter does not resolve the root cause. Effective mitigation requires architectural changes around the model, including trust boundaries, content validation, output verification, and policy controls. We explore these in the next section.

Mitigating Indirect Prompt Injection

There is no single fix for indirect prompt injection. The vulnerability lives in the system architecture, not in a specific model checkpoint or prompt. The goal is to make attacks unreliable, easier to detect, and less likely to trigger harmful actions. Effective mitigation requires layered defenses that work together around the model.

1. Strengthen System Prompts, but Don’t Rely on Them

System prompts can encourage safer behavior, but they cannot stop a model from acting on malicious content inside an external document. They help reduce naive failures, but they fail often under real pressure. At minimum, prompts should:

-db1-

Tell the model to treat all external content as untrusted
Specify which instructions are authoritative and which must be ignored
Reinforce that metadata and retrieved documents should not override core behaviour

-db1-

Useful, but not a strong boundary.

**💡 Related Reading:

Explore Lakera’s guide to Crafting Secure System Prompts for LLM and GenAI Applications for practical tips for designing secure prompts for AI models, helping you avoid vulnerabilities like prompt injection.
Learn about the unique security risks introduced by connecting AI models to third-party tools and data sources, including tool poisoning, prompt injection, memory poisoning, and tool interference: OWASP’s CheatSheet – A Practical Guide for Securely Using Third-Party MCP Servers 1.0**

2. Separate Trusted and Untrusted Inputs

Most IPIs succeed because models receive blended context. Breaking this pattern helps. Techniques include:

-db1-

Clear delimiters around external content
Labels that identify source and reliability
Distinct segments for system instructions and retrieved data

-db1-

Microsoft outlines similar patterns in its guidance on securing untrusted inputs in agent workflows. OWASP also stresses this separation in LLM01:2025 Prompt Injection.

**💡 Related Reading: Learn how Lakera Guard capabilities align with OWASP Top 10 for LLMs 2025. **

3. Validate Tool Calls Before Execution

Indirect injections become dangerous once the model can act. Every tool call should be checked before the action runs:

-db1-

Validate arguments against strict schemas
Allow list high risk capabilities
Reject operations that fall outside expected patterns
Require user approval for sensitive actions

-db1-

This principle underpins secure MCP design and the validation patterns we describe in Zero Click Remote Code Execution.

4. Add Output Verification and Reasoning Layers

Verification layers catch harmful behaviour before it reaches users or downstream systems. These checks can be:

-db1-

Secondary LLMs that review outputs
Rules for URLs, file paths, and package names
Business logic validators
Model self checks that assess output safety

-db1-

Research such as CachePrune shows that output level auditing is often more reliable than input filtering alone.

5. Treat All External Data as Untrusted

A simple and powerful mindset shift: assume everything the model sees is untrusted unless proven otherwise.

-db1-

Webpages
PDFs and documents
MCP tool metadata
RAG corpora
Code repositories
Long term memory

-db1-

A useful way to think about this is through the lens of expected instruction vs expected data.

Suppose you build an agent to monitor GitHub. An issue posted by a user looks like expected instruction (“here’s something to fix”), so the agent may treat the text as actionable. A pull request description looks like expected data. If an attacker hides instructions inside the PR text instead of the issue text, the agent is less guarded because it treats PR content as passive documentation. That mismatch creates an easy opening for indirect injection.

This mirrors zero trust principles in traditional security. Agent Breaker shows how often these ingestion channels carry hidden instructions.

6. Apply Least Privilege to Agents and Tools

Give agents only the capabilities they need. Restrict everything else.

-db1-

Fewer tools
Narrower permissions
Sandboxed actions
Optional user approvals
Separation of high impact functions

-db1-

In environments with code execution or infrastructure access, this is essential. Our Agentic AI Threats Part 2 article shows how quickly capability creep leads to real exploitation.

7. Monitor Behaviour and Detect Anomalies

Even strong defenses leak. Monitoring catches what slips through.

-db1-

Log all tool calls
Flag unexpected URLs or parameters
Detect shifts in behaviour that hint at memory poisoning
Alert on deviations from normal workflows

-db1-

In Lakera Guard deployments, many real IPI attempts reveal themselves through subtle behavioural anomalies rather than explicit commands.

8. Layer Defenses Across the Architecture

No single control will protect an AI system. Effective defenses combine:

-db1-

Input scanning
Prompt structuring
Context isolation
Output verification
Tool level validation
Human controls for high impact actions
Continuous monitoring

-db1-

This matches the system wide perspective we use in Lakera Guard. Guardrails need to operate outside the model, not only inside it.

9. Ask the Most Important Architectural Question: Do You Even Need an Agent?

A surprising amount of risk disappears when teams pause and ask a simple question: Does this task actually require an autonomous agent, or would a fixed workflow or if-statement be enough?

Many high-impact IPI incidents begin with an agent that was granted far more autonomy than the job demanded. If the system only needs to check a value, run a query, or return a structured response, an agent that browses, retrieves, executes, or interprets arbitrary content becomes unnecessary attack surface.

Reducing autonomy is sometimes the cleanest mitigation of all. The safest agent may be the one you never needed to build.

The AI Security Perimeter

Lakera’s Perspective: What Organizations Must Do Next

Indirect prompt injection is the kind of threat that exposes how quickly AI has outgrown the traditional security mindset. Most organizations still think in terms of user input, guardrail prompts, and model tuning. Our work across Gandalf, Gandalf: Agent Breaker, Lakera Guard, and enterprise red teaming shows a different reality. The most impactful attacks enter through the quiet places. A PDF that looks harmless. A memory entry no one reviews. A tool description copied from a shared folder. This is where IPI hides, and it is why teams consistently underestimate their exposure.

The first shift organizations need is conceptual. They must stop treating IPIs as clever jailbreaks and start treating them as a systems problem. Once a model can browse, retrieve, write, or execute, any piece of text it encounters becomes part of the attack surface. The lesson from our Zero Click RCE and Cursor vulnerability research is simple. Capability expands the blast radius. Autonomy multiplies it.

A New Security Perimeter

The second shift is architectural. The security perimeter is no longer the model. It is everything around it. Trust boundaries, validation layers, and runtime controls must sit at the edges of the system where text becomes action. This is the approach behind Lakera Guard, and it is one that has consistently reduced real IPI incidents in production. You cannot secure an autonomous system by asking the model to protect itself. You secure it by shaping the environment it operates in.

Put differently, the organizations that adapt fastest are the ones that rethink the entire pipeline. They map ingestion surfaces. They separate retrieval from action. They validate every tool call. They assume memory can be poisoned. And they test their systems with the same pressure attackers apply in the real world. Continuous red teaming, informed by insights from Lakera Red and Agent Breaker, is what turns blind spots into known risks.

In the end, indirect prompt injection is not just another AI vulnerability. It is a preview of the new security model this era demands. The teams that take it seriously now will be the ones ready for what comes next.

Conclusion

Indirect prompt injection exposes a fundamental truth about modern AI. Any system that retrieves documents, browses the web, loads metadata, or reads from memory is already exposed to untrusted text. And any system that can act on that text carries real operational risk.

IPI succeeds because AI treats everything it sees as meaningful. Malicious instructions hide in places most teams never monitor. PDFs, webpages, CRM notes, tool schemas, and code repositories all become quiet delivery channels. As models become more capable and autonomous, these channels turn into powerful attack vectors.

Organizations are not powerless. When teams start treating ingestion surfaces as part of the security perimeter, the risk drops quickly. Clear trust boundaries, validation of actions, runtime guardrails, and continuous red teaming tighten the system around the model and force attackers to work much harder. This shift takes discipline, but it pays off. The environments that adopt these practices early become significantly more resilient.

To explore how these attacks appear in the wild and how to build practical defenses, see the broader research ecosystem behind this article. Gandalf, Agent Breaker, Lakera Guard capture real adversarial behavior and the lessons we learn from it. These systems show how IPI actually unfolds and how teams can stay ahead of it.

Indirect prompt injection is not going away. The systems that thrive will be the ones built with this reality in mind.

The Lakera team has accelerated Dropbox’s GenAI journey.

Not sure how to secure your GenAI application?
Skip the guesswork with expert-recommended policies built by Lakera’s AI security team. Apply them in seconds, fine-tune when you’re ready, and get started with real protection from day one.

Download the Guide

On this page

Text Link

Hide table of contents

Show table of contents

TL;DR

What Is Indirect Prompt Injection?

Direct vs Indirect Prompt Injection

How Indirect Prompt Injection Works

Indirect Prompt Injection Lifecycle

Poison the Source

AI Ingestion

Instructions Activate

Unintended Behavior

The Expanding Attack Surface

AI Ingestion Surfaces

Webpages

PDFs & Documents

Emails & Metadata

MCP Tool Desc.

RAG Corpora

Memory Stores

Code Repos

Internal KB

Real-World Examples of Indirect Prompt Injection

Browser Based Exploits: Perplexity Comet Incident

Agent Breaker Snapshots: How IPIs Appear in Everyday Workflows

Direct vs Indirect Prompt Injection

Large Scale Agentic Exploits in Real Systems

Academic and Standards Based Evidence

Why Indirect Prompt Injection Is So Hard to Solve

Why Indirect Prompt Injection Is Hard to Solve

Model Limitations

Environmental Weaknesses

System Dynamics

1. AI Systems Blend Trusted and Untrusted Inputs

2. Models Are Designed to Follow Instructions, Not Police Them

3. Many Attack Surfaces Are Silent and Non Interactive

4. Small Instructions Have Large Effects

5. Agentic AI Multiplies the Impact

6. Filtering and Sanitization Often Miss the Threat

7. Memory Extends the Lifespan of Injections

8. There Is No Single Patch

Mitigating Indirect Prompt Injection

1. Strengthen System Prompts, but Don’t Rely on Them

2. Separate Trusted and Untrusted Inputs

3. Validate Tool Calls Before Execution

4. Add Output Verification and Reasoning Layers

5. Treat All External Data as Untrusted

6. Apply Least Privilege to Agents and Tools

7. Monitor Behaviour and Detect Anomalies

8. Layer Defenses Across the Architecture

9. Ask the Most Important Architectural Question: Do You Even Need an Agent?

The AI Security Perimeter

Lakera’s Perspective: What Organizations Must Do Next

A New Security Perimeter

Conclusion

Related Articles