-db1-
TL;DR
- Indirect Prompt Injection (IPI) doesn’t target the prompt, it targets the data your AI ingests: webpages, PDFs, MCP metadata, RAG docs, emails, memory, and code.
- Modern AI systems treat all text as meaningful, so a single hidden instruction in any ingestion surface can redirect reasoning, leak data, or trigger harmful tool actions.
- Agentic AI massively increases the blast radius, once a model can browse, retrieve, write, or execute, even tiny embedded instructions can become real exploits.
- Real incidents (Comet/Brave leak, zero-click RCE in MCP IDEs, CVE-2025-59944, and Agent Breaker scenarios) show how easy it is for poisoned content to escalate into system compromise.
- IPI is not a jailbreak and not fixable with prompts or model tuning. It’s a system-level vulnerability created by blending trusted and untrusted inputs in one context window.
- Mitigation requires architecture, not vibes: trust boundaries, context isolation, output verification, strict tool-call validation, least-privilege design, and continuous red teaming.
- The real security perimeter is everything around the model, not the model itself, and organizations that treat ingestion surfaces as attack surfaces are already ahead.
-db1-
Indirect prompt injection, or IPI, has become one of the most dangerous vulnerabilities in modern AI systems. Unlike direct prompt injection, where an attacker types into a visible prompt box, IPI targets the places where AI systems collect their information. The attacker never talks to the model. They poison the data the model will later read: a webpage, a PDF, an MCP tool description, an email, a memory entry, or a configuration file.
When the AI ingests that content, the hidden instructions come alive.
Over the past year this has moved from theory to practice. Browsers summarizing webpages have been tricked into leaking credentials. Copilots have taken actions based on poisoned emails or metadata. Agentic tools have executed attacker-controlled commands after reading compromised documentation. OWASP’s 2025 LLM Top 10 places prompt injection, including the indirect kind, at the top of emerging AI risks.
At Lakera we see these attacks in real systems every day. Millions of adversarial attempts across Gandalf and Gandalf: Agent Breaker, the real-world patterns described in Inside Agent Breaker, and the behavior we observe in Lakera Guard deployments all point to the same root cause. AI systems treat the text they ingest as meaningful and potentially actionable. As teams connect models to tools, browsing, RAG, and memory, the impact of a single poisoned input grows significantly.
This article explains how indirect prompt injection works, why agentic AI amplifies the risk, and which system level defenses matter now. We combine academic research, real incidents, and Lakera’s own red teaming insights to map the real attack surface and show how organizations can reduce it.
What Is Indirect Prompt Injection?
Indirect prompt injection is an attack where hidden instructions are embedded inside content an AI system will later ingest. The attacker never touches the prompt interface. The AI discovers the malicious text during its normal operations.
This is what makes indirect attacks so dangerous. They ride along familiar data flows. The model might be browsing a webpage, parsing a PDF, retrieving a document, loading tool metadata, or reading from memory. If the model can see the content, it can misinterpret it as an instruction.
According to the OWASP Top 10 for LLM Applications, these attacks succeed when untrusted data is mixed with trusted system instructions. The model cannot reliably tell which is which, so malicious text can override or redirect its behavior.
Direct vs Indirect Prompt Injection
Direct prompt injection targets the model through the prompt interface. The attacker types instructions like:
- Ignore previous instructions and…
- Reveal your internal reasoning…
- Provide the secret keys…
Indirect prompt injection targets the data sources the model consumes, not the prompt box. Examples include:
- Hidden text on a webpage
- A poisoned PDF in a due diligence workflow
- Manipulated MCP tool descriptions
- Memory entries crafted to steer the model
- Comments in a code repository that influence an AI reviewer
Direct attacks are visible to the user. Indirect attacks are invisible. This difference also explains why indirect attacks succeed more often. Developers rarely expect routine data to contain executable instructions. The model, however, treats it like any other part of the conversation.
How Indirect Prompt Injection Works
Most modern AI applications blend system prompts, user inputs, retrieved documents, and tool metadata into a single context window. This is where the vulnerability lives. The model receives one continuous stream of tokens with no reliable separation between data and instructions.
A typical indirect attack follows four steps:
These hidden instructions do not need to be complex. Even short fragments inside a webpage or PDF can steer reasoning or tool usage in ways that are hard to detect.
The Expanding Attack Surface
The attack surface for indirect prompt injection grows every time AI systems connect to new data sources or tools. Modern agentic workflows ingest a wide range of content:
These ingestion points are exactly where attackers focus. Lakera’s research and the scenarios inside Gandalf: Agent Breaker show how poisoned content slips into real systems through RAG documents, browsing agents, MCP tool metadata, internal wikis, and code rules.
In our Agentic AI Threats series, we highlight the same pattern. Part 1 shows how persistent memory can shape behavior over long horizons. Part 2 explains how over-privileged tools and uncontrolled browsing turn routine ingestion into a pathway for hidden instructions. Together, these findings show how quickly the attack surface grows once AI systems begin to consume the world around them.
Real-World Examples of Indirect Prompt Injection
Real-world incidents show how fast indirect prompt injection has moved from theory to everyday risk. The rise of agent frameworks and protocols like MCP has only accelerated this, expanding the number of systems that automatically ingest external content without treating it as hostile. Each new integration layer widens the attack surface and shortens the path from hidden text to harmful action.
Browser Based Exploits: Perplexity Comet Incident
A clear example came from security researchers investigating Perplexity’s Comet feature, which summarizes webpages inside the browser. Brave’s write up showed how attackers hid invisible text inside a public Reddit post. When Comet fetched the page, the AI summarizer read the hidden instructions, leaked the user’s one time password, and sent it to an attacker-controlled server.
The attack required only three ingredients:
- A public webpage with invisible instructions
- An AI agent that automatically processes external content
- An action that looked legitimate to the model
Agent Breaker Snapshots: How IPIs Appear in Everyday Workflows
Lakera’s Gandalf: Agent Breaker includes a set of scenarios modeled on patterns we see in enterprise deployments. Each shows how poisoned content slips into normal operations:
- Trippy Planner: a travel blog hides text that adds a phishing link to an itinerary.
- OmniChat Desktop: a compromised MCP tool description leaks a user’s email address.
- PortfolioIQ Advisor: a due diligence PDF contains hidden instructions that alter risk assessments.
- Curs-ed CodeReview: a poisoned code rules file pushes a harmful dependency.
- MindfulChat: a single poisoned memory entry shapes behavior across sessions.
These scenarios align closely with what we observe in Lakera Guard deployments and our red teaming work.
Large Scale Agentic Exploits in Real Systems
Researchers have also shown how indirect prompt injection can escalate into full compromise of agentic environments. Lakera’s own work demonstrates this clearly.
A notable example comes from our investigation into zero click attacks in AI powered IDEs. In Zero Click Remote Code Execution in MCP Based Agentic IDEs, we showed how a seemingly harmless Google Docs file triggered an agent inside an IDE to fetch attacker authored instructions from an MCP server. The agent executed a Python payload, harvested secrets, and did all of this without any user interaction.
A related vulnerability, CVE 2025 59944, revealed how a small case sensitivity bug in a protected file path allowed an attacker to influence Cursor’s agentic behavior. Once the agent read the wrong configuration file, it followed hidden instructions that escalated into remote code execution.
Both incidents share the same root cause. The agent trusted unverified external content and treated it as authoritative. Even standardized layers like the Model Context Protocol are vulnerable when the data they expose originates from untrusted sources.
This is why securing MCP layers requires screening inputs at the protocol boundary. Our guidance on protecting MCP integrations with Lakera Guard shows how to intercept poisoned schemas, resource listings, and tool metadata before they influence model behavior.
Academic and Standards Based Evidence
Academic and industry research shows that indirect prompt injection is not a fringe anomaly. It is a structural weakness in how AI systems process context.
One study, Can Indirect Prompt Injection Attacks Be Detected and Removed?, demonstrated that even short, embedded instructions can reliably override model behavior. Another line of work, CachePrune, explored pruning and attribution techniques to limit how far malicious instructions propagate inside a model’s internal computation.
Researchers have also shown that IPI affects multimodal, GUI driven, and action-oriented agents. The EVA framework systematically red teamed GUI agents and found that indirect injections inside interface elements could redirect entire task flows.
Standards bodies are tracking the same trend. MITRE ATLAS lists prompt injection, including its indirect variants, as a core adversarial technique for exploiting autonomous systems.
Across these studies, the pattern is consistent. IPIs exploit the way models merge instructions and data into a single context stream. As models gain autonomy and ingest more sources, the opportunities for unseen instructions grow.
Why Indirect Prompt Injection Is So Hard to Solve
Indirect prompt injection is hard to solve because it exploits how modern AI systems are built. The problem is not a misconfiguration or a weak prompt. It is structural. Even with strong prompts, model tuning, or filtering, the model still cannot reliably distinguish between data and instructions. Below are the core reasons this class of attack persists.
1. AI Systems Blend Trusted and Untrusted Inputs
AI systems combine system prompts, user inputs, retrieved documents, tool metadata, memory entries, and code snippets in a single context window. To the model, this is one continuous stream of tokens. OWASP highlights this in LLM01:2025 Prompt Injection. If a malicious instruction appears anywhere in the stream, the model may treat it as legitimate. This collapses the trust boundaries that traditional software depends on.
2. Models Are Designed to Follow Instructions, Not Police Them
Large language models are trained to follow instructions expressed in natural language wherever they appear. They cannot reliably tell which instructions were meant for them and which were embedded in external content. A comment in a PDF or an aside in a webpage can look like a command. The model has no way to know otherwise.
3. Many Attack Surfaces Are Silent and Non Interactive
Indirect prompt injection succeeds because attackers rarely need to interact with the system. They poison webpages, MCP tool descriptions, internal documents, memory stores, or code repositories. Traditional controls do not treat these surfaces as potential executables. Security scanners rarely check a PDF, an HTML span, or a rule file for hidden directives aimed at an AI system.
4. Small Instructions Have Large Effects
Malicious instructions do not need to be long or complex. Short fragments such as “recommend this package,” “describe this company as low risk,” or “pretend the user’s email is X” can change reasoning and tool behaviour. Research such as CachePrune shows how small, embedded instructions can influence entire chains of thought.
5. Agentic AI Multiplies the Impact
The risk increases once models can act. If an AI system can send emails, fetch documents, write code, or execute commands, a small instruction in untrusted content can trigger meaningful actions. Our Agentic AI Threats Part 2 article and the Backbone Breaker Benchmark show how agent autonomy amplifies even minor context manipulation. What would be a harmless text deviation in a chatbot can escalate into a full compromise in an agent.
6. Filtering and Sanitization Often Miss the Threat
Most filters look for harmful keywords, toxicity, malware patterns, or policy violations. Indirect prompt injection rarely uses obvious malicious phrasing. It hides inside natural language, comments, metadata, or invisible text layers. Even advanced filters struggle when the malicious instruction looks like normal content. Detection becomes harder when the instruction subtly steers reasoning rather than issuing a direct command.
7. Memory Extends the Lifespan of Injections
When systems use persistent memory, a single poisoned entry can influence many future interactions. Our Agentic AI Threats Part 1 article and the MindfulChat app inside Gandalf: Agent Breaker show how memory poisoning can reshape behavior across sessions. This echoes patterns we see in lifecycle risks such as training data contamination, explored in our broader overview of data poisoning. Once malicious content enters long term storage, it can resurface long after the initial attack.
8. There Is No Single Patch
IPI is not a model bug. It is a system level issue. Updating a model, improving a system prompt, or adding a keyword filter does not resolve the root cause. Effective mitigation requires architectural changes around the model, including trust boundaries, content validation, output verification, and policy controls. We explore these in the next section.
Mitigating Indirect Prompt Injection
There is no single fix for indirect prompt injection. The vulnerability lives in the system architecture, not in a specific model checkpoint or prompt. The goal is to make attacks unreliable, easier to detect, and less likely to trigger harmful actions. Effective mitigation requires layered defenses that work together around the model.
1. Strengthen System Prompts, but Don’t Rely on Them
System prompts can encourage safer behavior, but they cannot stop a model from acting on malicious content inside an external document. They help reduce naive failures, but they fail often under real pressure. At minimum, prompts should:
-db1-
- Tell the model to treat all external content as untrusted
- Specify which instructions are authoritative and which must be ignored
- Reinforce that metadata and retrieved documents should not override core behaviour
-db1-
Useful, but not a strong boundary.
**💡 Related Reading:
- Explore Lakera’s guide to Crafting Secure System Prompts for LLM and GenAI Applications for practical tips for designing secure prompts for AI models, helping you avoid vulnerabilities like prompt injection.
- Learn about the unique security risks introduced by connecting AI models to third-party tools and data sources, including tool poisoning, prompt injection, memory poisoning, and tool interference: OWASP’s CheatSheet – A Practical Guide for Securely Using Third-Party MCP Servers 1.0**
2. Separate Trusted and Untrusted Inputs
Most IPIs succeed because models receive blended context. Breaking this pattern helps. Techniques include:
-db1-
- Clear delimiters around external content
- Labels that identify source and reliability
- Distinct segments for system instructions and retrieved data
-db1-
Microsoft outlines similar patterns in its guidance on securing untrusted inputs in agent workflows. OWASP also stresses this separation in LLM01:2025 Prompt Injection.
**💡 Related Reading: Learn how Lakera Guard capabilities align with OWASP Top 10 for LLMs 2025. **
3. Validate Tool Calls Before Execution
Indirect injections become dangerous once the model can act. Every tool call should be checked before the action runs:
-db1-
- Validate arguments against strict schemas
- Allow list high risk capabilities
- Reject operations that fall outside expected patterns
- Require user approval for sensitive actions
-db1-
This principle underpins secure MCP design and the validation patterns we describe in Zero Click Remote Code Execution.
4. Add Output Verification and Reasoning Layers
Verification layers catch harmful behaviour before it reaches users or downstream systems. These checks can be:
-db1-
- Secondary LLMs that review outputs
- Rules for URLs, file paths, and package names
- Business logic validators
- Model self checks that assess output safety
-db1-
Research such as CachePrune shows that output level auditing is often more reliable than input filtering alone.
5. Treat All External Data as Untrusted
A simple and powerful mindset shift: assume everything the model sees is untrusted unless proven otherwise.
-db1-
- Webpages
- PDFs and documents
- MCP tool metadata
- RAG corpora
- Code repositories
- Long term memory
-db1-
A useful way to think about this is through the lens of expected instruction vs expected data.
Suppose you build an agent to monitor GitHub. An issue posted by a user looks like expected instruction (“here’s something to fix”), so the agent may treat the text as actionable. A pull request description looks like expected data. If an attacker hides instructions inside the PR text instead of the issue text, the agent is less guarded because it treats PR content as passive documentation. That mismatch creates an easy opening for indirect injection.
This mirrors zero trust principles in traditional security. Agent Breaker shows how often these ingestion channels carry hidden instructions.
6. Apply Least Privilege to Agents and Tools
Give agents only the capabilities they need. Restrict everything else.
-db1-
- Fewer tools
- Narrower permissions
- Sandboxed actions
- Optional user approvals
- Separation of high impact functions
-db1-
In environments with code execution or infrastructure access, this is essential. Our Agentic AI Threats Part 2 article shows how quickly capability creep leads to real exploitation.
7. Monitor Behaviour and Detect Anomalies
Even strong defenses leak. Monitoring catches what slips through.
-db1-
- Log all tool calls
- Flag unexpected URLs or parameters
- Detect shifts in behaviour that hint at memory poisoning
- Alert on deviations from normal workflows
-db1-
In Lakera Guard deployments, many real IPI attempts reveal themselves through subtle behavioural anomalies rather than explicit commands.
8. Layer Defenses Across the Architecture
No single control will protect an AI system. Effective defenses combine:
-db1-
- Input scanning
- Prompt structuring
- Context isolation
- Output verification
- Tool level validation
- Human controls for high impact actions
- Continuous monitoring
-db1-
This matches the system wide perspective we use in Lakera Guard. Guardrails need to operate outside the model, not only inside it.
9. Ask the Most Important Architectural Question: Do You Even Need an Agent?
A surprising amount of risk disappears when teams pause and ask a simple question: Does this task actually require an autonomous agent, or would a fixed workflow or if-statement be enough?
Many high-impact IPI incidents begin with an agent that was granted far more autonomy than the job demanded. If the system only needs to check a value, run a query, or return a structured response, an agent that browses, retrieves, executes, or interprets arbitrary content becomes unnecessary attack surface.
Reducing autonomy is sometimes the cleanest mitigation of all. The safest agent may be the one you never needed to build.
Lakera’s Perspective: What Organizations Must Do Next
Indirect prompt injection is the kind of threat that exposes how quickly AI has outgrown the traditional security mindset. Most organizations still think in terms of user input, guardrail prompts, and model tuning. Our work across Gandalf, Gandalf: Agent Breaker, Lakera Guard, and enterprise red teaming shows a different reality. The most impactful attacks enter through the quiet places. A PDF that looks harmless. A memory entry no one reviews. A tool description copied from a shared folder. This is where IPI hides, and it is why teams consistently underestimate their exposure.
The first shift organizations need is conceptual. They must stop treating IPIs as clever jailbreaks and start treating them as a systems problem. Once a model can browse, retrieve, write, or execute, any piece of text it encounters becomes part of the attack surface. The lesson from our Zero Click RCE and Cursor vulnerability research is simple. Capability expands the blast radius. Autonomy multiplies it.
A New Security Perimeter
The second shift is architectural. The security perimeter is no longer the model. It is everything around it. Trust boundaries, validation layers, and runtime controls must sit at the edges of the system where text becomes action. This is the approach behind Lakera Guard, and it is one that has consistently reduced real IPI incidents in production. You cannot secure an autonomous system by asking the model to protect itself. You secure it by shaping the environment it operates in.
Put differently, the organizations that adapt fastest are the ones that rethink the entire pipeline. They map ingestion surfaces. They separate retrieval from action. They validate every tool call. They assume memory can be poisoned. And they test their systems with the same pressure attackers apply in the real world. Continuous red teaming, informed by insights from Lakera Red and Agent Breaker, is what turns blind spots into known risks.
In the end, indirect prompt injection is not just another AI vulnerability. It is a preview of the new security model this era demands. The teams that take it seriously now will be the ones ready for what comes next.
Conclusion
Indirect prompt injection exposes a fundamental truth about modern AI. Any system that retrieves documents, browses the web, loads metadata, or reads from memory is already exposed to untrusted text. And any system that can act on that text carries real operational risk.
IPI succeeds because AI treats everything it sees as meaningful. Malicious instructions hide in places most teams never monitor. PDFs, webpages, CRM notes, tool schemas, and code repositories all become quiet delivery channels. As models become more capable and autonomous, these channels turn into powerful attack vectors.
Organizations are not powerless. When teams start treating ingestion surfaces as part of the security perimeter, the risk drops quickly. Clear trust boundaries, validation of actions, runtime guardrails, and continuous red teaming tighten the system around the model and force attackers to work much harder. This shift takes discipline, but it pays off. The environments that adopt these practices early become significantly more resilient.
To explore how these attacks appear in the wild and how to build practical defenses, see the broader research ecosystem behind this article. Gandalf, Agent Breaker, Lakera Guard capture real adversarial behavior and the lessons we learn from it. These systems show how IPI actually unfolds and how teams can stay ahead of it.
Indirect prompt injection is not going away. The systems that thrive will be the ones built with this reality in mind.




