What Is Indirect Prompt Injection? 

Indirect prompt injection is an attack where hidden instructions are embedded inside content an AI system will later ingest. The attacker never touches the prompt interface. The AI discovers the malicious text during its normal operations. 

This is what makes indirect attacks so dangerous. They ride along familiar data flows. The model might be browsing a webpage, parsing a PDF, retrieving a document, loading tool metadata, or reading from memory. If the model can see the content, it can misinterpret it as an instruction. 

According to the OWASP Top 10 for LLM Applications, these attacks succeed when untrusted data is mixed with trusted system instructions. The model cannot reliably tell which is which, so malicious text can override or redirect its behavior. 

Direct vs Indirect Prompt Injection 

Direct prompt injection targets the model through the prompt interface. The attacker types instructions like: 

  • Ignore previous instructions and… 
  • Reveal your internal reasoning… 
  • Provide the secret keys… 

Indirect prompt injection targets the data sources the model consumes, not the prompt box. Examples include: 

  • Hidden text on a webpage 
  • A poisoned PDF in a due diligence workflow 
  • Manipulated MCP tool descriptions 
  • Memory entries crafted to steer the model 
  • Comments in a code repository that influence an AI reviewer 

Direct attacks are visible to the user. Indirect attacks are invisible. This difference also explains why indirect attacks succeed more often. Developers rarely expect routine data to contain executable instructions. The model, however, treats it like any other part of the conversation. 

How Indirect Prompt Injection Works 

Most modern AI applications blend system prompts, user inputs, retrieved documents, and tool metadata into a single context window. This is where the vulnerability lives. The model receives one continuous stream of tokens with no reliable separation between data and instructions. 

A typical indirect attack follows four steps:

Indirect Prompt Injection Lifecycle

01

Poison the Source

Attacker embeds hidden instructions in a webpage, PDF, email, or tool description.

02

AI Ingestion

The AI system retrieves or loads the poisoned content during normal operations.

03

Instructions Activate

The model treats the malicious text as part of its context and interprets it as an instruction.

04

Unintended Behavior

The AI leaks data, manipulates output, or triggers harmful tool calls.

These hidden instructions do not need to be complex. Even short fragments inside a webpage or PDF can steer reasoning or tool usage in ways that are hard to detect. 

The Expanding Attack Surface 

The attack surface for indirect prompt injection grows every time AI systems connect to new data sources or tools. Modern agentic workflows ingest a wide range of content: 

AI Ingestion Surfaces

AI System

Webpages

HTML, blogs, hidden text

PDFs & Documents

Reports, scanned files

Emails & Metadata

Body content, headers

MCP Tool Desc.

Schemas, capability text

RAG Corpora

Knowledge bases, docs

Memory Stores

Conversation history

Code Repos

Config files, comments

Internal KB

Wikis, CRM notes

These ingestion points are exactly where attackers focus. Lakera’s research and the scenarios inside Gandalf: Agent Breaker show how poisoned content slips into real systems through RAG documents, browsing agents, MCP tool metadata, internal wikis, and code rules. 

In our Agentic AI Threats series, we highlight the same pattern. Part 1 shows how persistent memory can shape behavior over long horizons. Part 2 explains how over-privileged tools and uncontrolled browsing turn routine ingestion into a pathway for hidden instructions. Together, these findings show how quickly the attack surface grows once AI systems begin to consume the world around them. 

Real-World Examples of Indirect Prompt Injection 

Real-world incidents show how fast indirect prompt injection has moved from theory to everyday risk. The rise of agent frameworks and protocols like MCP has only accelerated this, expanding the number of systems that automatically ingest external content without treating it as hostile. Each new integration layer widens the attack surface and shortens the path from hidden text to harmful action.

Browser Based Exploits: Perplexity Comet Incident 

A clear example came from security researchers investigating Perplexity’s Comet feature, which summarizes webpages inside the browser. Brave’s write up showed how attackers hid invisible text inside a public Reddit post. When Comet fetched the page, the AI summarizer read the hidden instructions, leaked the user’s one time password, and sent it to an attacker-controlled server. 

The attack required only three ingredients: 

  1. A public webpage with invisible instructions 
  2. An AI agent that automatically processes external content 
  3. An action that looked legitimate to the model 

Agent Breaker Snapshots: How IPIs Appear in Everyday Workflows 

Lakera’s Gandalf: Agent Breaker includes a set of scenarios modeled on patterns we see in enterprise deployments. Each shows how poisoned content slips into normal operations: 

  • Trippy Planner: a travel blog hides text that adds a phishing link to an itinerary. 
  • OmniChat Desktop: a compromised MCP tool description leaks a user’s email address. 
  • PortfolioIQ Advisor: a due diligence PDF contains hidden instructions that alter risk assessments. 
  • Curs-ed CodeReview: a poisoned code rules file pushes a harmful dependency. 
  • MindfulChat: a single poisoned memory entry shapes behavior across sessions. 

These scenarios align closely with what we observe in Lakera Guard deployments and our red teaming work.

Direct vs Indirect Prompt Injection

Comparison Factor Direct Prompt Injection Indirect Prompt Injection
Attack Vector Attacker types into the prompt interface. Attacker hides instructions in external content.
Visibility Visible to user. Invisible to user.
Where Attacker Operates Prompt box or chat interface. Webpages, PDFs, memory, RAG, metadata.
Manipulated Surface Prompt. Data the model ingests.
Typical Examples "Ignore previous instructions…" Hidden text inside documents.
Impact in Agentic Systems Limited to immediate output. Can trigger tool calls and system actions.

Large Scale Agentic Exploits in Real Systems 

Researchers have also shown how indirect prompt injection can escalate into full compromise of agentic environments. Lakera’s own work demonstrates this clearly. 

A notable example comes from our investigation into zero click attacks in AI powered IDEs. In Zero Click Remote Code Execution in MCP Based Agentic IDEs, we showed how a seemingly harmless Google Docs file triggered an agent inside an IDE to fetch attacker authored instructions from an MCP server. The agent executed a Python payload, harvested secrets, and did all of this without any user interaction. 

A related vulnerability, CVE 2025 59944, revealed how a small case sensitivity bug in a protected file path allowed an attacker to influence Cursor’s agentic behavior. Once the agent read the wrong configuration file, it followed hidden instructions that escalated into remote code execution. 

Both incidents share the same root cause. The agent trusted unverified external content and treated it as authoritative. Even standardized layers like the Model Context Protocol are vulnerable when the data they expose originates from untrusted sources. 

This is why securing MCP layers requires screening inputs at the protocol boundary. Our guidance on protecting MCP integrations with Lakera Guard shows how to intercept poisoned schemas, resource listings, and tool metadata before they influence model behavior. 

Academic and Standards Based Evidence 

Academic and industry research shows that indirect prompt injection is not a fringe anomaly. It is a structural weakness in how AI systems process context. 

One study, Can Indirect Prompt Injection Attacks Be Detected and Removed?, demonstrated that even short, embedded instructions can reliably override model behavior. Another line of work, CachePrune, explored pruning and attribution techniques to limit how far malicious instructions propagate inside a model’s internal computation. 

Researchers have also shown that IPI affects multimodal, GUI driven, and action-oriented agents. The EVA framework systematically red teamed GUI agents and found that indirect injections inside interface elements could redirect entire task flows. 

Standards bodies are tracking the same trend. MITRE ATLAS lists prompt injection, including its indirect variants, as a core adversarial technique for exploiting autonomous systems. 

Across these studies, the pattern is consistent. IPIs exploit the way models merge instructions and data into a single context stream. As models gain autonomy and ingest more sources, the opportunities for unseen instructions grow. 

Why Indirect Prompt Injection Is So Hard to Solve 

Indirect prompt injection is hard to solve because it exploits how modern AI systems are built. The problem is not a misconfiguration or a weak prompt. It is structural. Even with strong prompts, model tuning, or filtering, the model still cannot reliably distinguish between data and instructions. Below are the core reasons this class of attack persists. 

Why Indirect Prompt Injection Is Hard to Solve

Model Limitations

The core architectural gap
  • Blended context stream (data vs. instructions)
  • Inherent instruction-following bias
  • Ambiguous trust boundaries in tokens

Environmental Weaknesses

The attack surface
  • Unmonitored ingestion surfaces (RAG, email)
  • Superficial filtering that AI can bypass
  • Invisible instructions hidden in file data

System Dynamics

The cascading impact
  • Persistent memory poisoning across sessions
  • Agent autonomy amplifies minor leaks
  • No single "patch" fixes the entire chain

1. AI Systems Blend Trusted and Untrusted Inputs 

AI systems combine system prompts, user inputs, retrieved documents, tool metadata, memory entries, and code snippets in a single context window. To the model, this is one continuous stream of tokens. OWASP highlights this in LLM01:2025 Prompt Injection. If a malicious instruction appears anywhere in the stream, the model may treat it as legitimate. This collapses the trust boundaries that traditional software depends on. 

2. Models Are Designed to Follow Instructions, Not Police Them 

Large language models are trained to follow instructions expressed in natural language wherever they appear. They cannot reliably tell which instructions were meant for them and which were embedded in external content. A comment in a PDF or an aside in a webpage can look like a command. The model has no way to know otherwise. 

3. Many Attack Surfaces Are Silent and Non Interactive 

Indirect prompt injection succeeds because attackers rarely need to interact with the system. They poison webpages, MCP tool descriptions, internal documents, memory stores, or code repositories. Traditional controls do not treat these surfaces as potential executables. Security scanners rarely check a PDF, an HTML span, or a rule file for hidden directives aimed at an AI system. 

4. Small Instructions Have Large Effects 

Malicious instructions do not need to be long or complex. Short fragments such as “recommend this package,” “describe this company as low risk,” or “pretend the user’s email is X” can change reasoning and tool behaviour. Research such as CachePrune shows how small, embedded instructions can influence entire chains of thought. 

5. Agentic AI Multiplies the Impact 

The risk increases once models can act. If an AI system can send emails, fetch documents, write code, or execute commands, a small instruction in untrusted content can trigger meaningful actions. Our Agentic AI Threats Part 2 article and the Backbone Breaker Benchmark show how agent autonomy amplifies even minor context manipulation. What would be a harmless text deviation in a chatbot can escalate into a full compromise in an agent. 

6. Filtering and Sanitization Often Miss the Threat 

Most filters look for harmful keywords, toxicity, malware patterns, or policy violations. Indirect prompt injection rarely uses obvious malicious phrasing. It hides inside natural language, comments, metadata, or invisible text layers. Even advanced filters struggle when the malicious instruction looks like normal content. Detection becomes harder when the instruction subtly steers reasoning rather than issuing a direct command. 

7. Memory Extends the Lifespan of Injections 

When systems use persistent memory, a single poisoned entry can influence many future interactions. Our Agentic AI Threats Part 1 article and the MindfulChat app inside Gandalf: Agent Breaker show how memory poisoning can reshape behavior across sessions. This echoes patterns we see in lifecycle risks such as training data contamination, explored in our broader overview of data poisoning. Once malicious content enters long term storage, it can resurface long after the initial attack. 

8. There Is No Single Patch 

IPI is not a model bug. It is a system level issue. Updating a model, improving a system prompt, or adding a keyword filter does not resolve the root cause. Effective mitigation requires architectural changes around the model, including trust boundaries, content validation, output verification, and policy controls. We explore these in the next section. 

Mitigating Indirect Prompt Injection 

There is no single fix for indirect prompt injection. The vulnerability lives in the system architecture, not in a specific model checkpoint or prompt. The goal is to make attacks unreliable, easier to detect, and less likely to trigger harmful actions. Effective mitigation requires layered defenses that work together around the model. 

1. Strengthen System Prompts, but Don’t Rely on Them 

System prompts can encourage safer behavior, but they cannot stop a model from acting on malicious content inside an external document. They help reduce naive failures, but they fail often under real pressure. At minimum, prompts should: 

-db1-

  • Tell the model to treat all external content as untrusted 
  • Specify which instructions are authoritative and which must be ignored 
  • Reinforce that metadata and retrieved documents should not override core behaviour 

-db1-

Useful, but not a strong boundary. 

**💡 Related Reading:

2. Separate Trusted and Untrusted Inputs 

Most IPIs succeed because models receive blended context. Breaking this pattern helps. Techniques include: 

-db1-

  • Clear delimiters around external content 
  • Labels that identify source and reliability 
  • Distinct segments for system instructions and retrieved data 

-db1-

Microsoft outlines similar patterns in its guidance on securing untrusted inputs in agent workflows. OWASP also stresses this separation in LLM01:2025 Prompt Injection

**💡 Related Reading: Learn how Lakera Guard capabilities align with OWASP Top 10 for LLMs 2025. **

3. Validate Tool Calls Before Execution 

Indirect injections become dangerous once the model can act. Every tool call should be checked before the action runs: 

-db1-

  • Validate arguments against strict schemas 
  • Allow list high risk capabilities 
  • Reject operations that fall outside expected patterns 
  • Require user approval for sensitive actions 

-db1-

This principle underpins secure MCP design and the validation patterns we describe in Zero Click Remote Code Execution

4. Add Output Verification and Reasoning Layers 

Verification layers catch harmful behaviour before it reaches users or downstream systems. These checks can be: 

-db1-

  • Secondary LLMs that review outputs 
  • Rules for URLs, file paths, and package names 
  • Business logic validators 
  • Model self checks that assess output safety 

-db1-

Research such as CachePrune shows that output level auditing is often more reliable than input filtering alone. 

5. Treat All External Data as Untrusted 

A simple and powerful mindset shift: assume everything the model sees is untrusted unless proven otherwise. 

-db1-

  • Webpages 
  • PDFs and documents 
  • MCP tool metadata 
  • RAG corpora 
  • Code repositories 
  • Long term memory 

-db1-

A useful way to think about this is through the lens of expected instruction vs expected data.

Suppose you build an agent to monitor GitHub. An issue posted by a user looks like expected instruction (“here’s something to fix”), so the agent may treat the text as actionable. A pull request description looks like expected data. If an attacker hides instructions inside the PR text instead of the issue text, the agent is less guarded because it treats PR content as passive documentation. That mismatch creates an easy opening for indirect injection.

This mirrors zero trust principles in traditional security. Agent Breaker shows how often these ingestion channels carry hidden instructions. 

6. Apply Least Privilege to Agents and Tools 

Give agents only the capabilities they need. Restrict everything else. 

-db1-

  • Fewer tools 
  • Narrower permissions 
  • Sandboxed actions 
  • Optional user approvals 
  • Separation of high impact functions 

-db1-

In environments with code execution or infrastructure access, this is essential. Our Agentic AI Threats Part 2 article shows how quickly capability creep leads to real exploitation. 

7. Monitor Behaviour and Detect Anomalies 

Even strong defenses leak. Monitoring catches what slips through. 

-db1-

  • Log all tool calls 
  • Flag unexpected URLs or parameters 
  • Detect shifts in behaviour that hint at memory poisoning 
  • Alert on deviations from normal workflows 

-db1-

In Lakera Guard deployments, many real IPI attempts reveal themselves through subtle behavioural anomalies rather than explicit commands. 

8. Layer Defenses Across the Architecture 

No single control will protect an AI system. Effective defenses combine: 

-db1-

  • Input scanning 
  • Prompt structuring 
  • Context isolation 
  • Output verification 
  • Tool level validation 
  • Human controls for high impact actions 
  • Continuous monitoring 

-db1-

This matches the system wide perspective we use in Lakera Guard. Guardrails need to operate outside the model, not only inside it. 

9. Ask the Most Important Architectural Question: Do You Even Need an Agent?

A surprising amount of risk disappears when teams pause and ask a simple question: Does this task actually require an autonomous agent, or would a fixed workflow or if-statement be enough?

Many high-impact IPI incidents begin with an agent that was granted far more autonomy than the job demanded. If the system only needs to check a value, run a query, or return a structured response, an agent that browses, retrieves, executes, or interprets arbitrary content becomes unnecessary attack surface.

Reducing autonomy is sometimes the cleanest mitigation of all. The safest agent may be the one you never needed to build.

The AI Security Perimeter

Human Approval Runtime Scanning Behavioural Monitoring Content Sanitization Retrieval Separation Context Isolation Trust Boundaries Tool Validation LLM Model

Lakera’s Perspective: What Organizations Must Do Next 

Indirect prompt injection is the kind of threat that exposes how quickly AI has outgrown the traditional security mindset. Most organizations still think in terms of user input, guardrail prompts, and model tuning. Our work across Gandalf, Gandalf: Agent Breaker, Lakera Guard, and enterprise red teaming shows a different reality. The most impactful attacks enter through the quiet places. A PDF that looks harmless. A memory entry no one reviews. A tool description copied from a shared folder. This is where IPI hides, and it is why teams consistently underestimate their exposure. 

The first shift organizations need is conceptual. They must stop treating IPIs as clever jailbreaks and start treating them as a systems problem. Once a model can browse, retrieve, write, or execute, any piece of text it encounters becomes part of the attack surface. The lesson from our Zero Click RCE and Cursor vulnerability research is simple. Capability expands the blast radius. Autonomy multiplies it. 

A New Security Perimeter 

The second shift is architectural. The security perimeter is no longer the model. It is everything around it. Trust boundaries, validation layers, and runtime controls must sit at the edges of the system where text becomes action. This is the approach behind Lakera Guard, and it is one that has consistently reduced real IPI incidents in production. You cannot secure an autonomous system by asking the model to protect itself. You secure it by shaping the environment it operates in. 

Put differently, the organizations that adapt fastest are the ones that rethink the entire pipeline. They map ingestion surfaces. They separate retrieval from action. They validate every tool call. They assume memory can be poisoned. And they test their systems with the same pressure attackers apply in the real world. Continuous red teaming, informed by insights from Lakera Red and Agent Breaker, is what turns blind spots into known risks. 

In the end, indirect prompt injection is not just another AI vulnerability. It is a preview of the new security model this era demands. The teams that take it seriously now will be the ones ready for what comes next. 

Conclusion 

Indirect prompt injection exposes a fundamental truth about modern AI. Any system that retrieves documents, browses the web, loads metadata, or reads from memory is already exposed to untrusted text. And any system that can act on that text carries real operational risk. 

IPI succeeds because AI treats everything it sees as meaningful. Malicious instructions hide in places most teams never monitor. PDFs, webpages, CRM notes, tool schemas, and code repositories all become quiet delivery channels. As models become more capable and autonomous, these channels turn into powerful attack vectors. 

Organizations are not powerless. When teams start treating ingestion surfaces as part of the security perimeter, the risk drops quickly. Clear trust boundaries, validation of actions, runtime guardrails, and continuous red teaming tighten the system around the model and force attackers to work much harder. This shift takes discipline, but it pays off. The environments that adopt these practices early become significantly more resilient. 

To explore how these attacks appear in the wild and how to build practical defenses, see the broader research ecosystem behind this article. Gandalf, Agent Breaker, Lakera Guard  capture real adversarial behavior and the lessons we learn from it. These systems show how IPI actually unfolds and how teams can stay ahead of it. 

Indirect prompt injection is not going away. The systems that thrive will be the ones built with this reality in mind.