Indirect Prompt Injection: How it works and how Lakera stops it
How the attack works
A malicious user asks an agent to fetch and summarize content from the internet. The user prompt itself is NOT malicious; the attacker is betting that defenses live only at the perimeter and not within the agentic flow.
The agent receives the request and services it by invoking a “fetch” tool to scrape the content.
Unbeknownst to the agent, the page has been compromised to embed a malicious prompt designed to override the system prompt and, because this is an agent with tools and autonomy, to exfiltrate any sensitive information and private conversation history it can reach.
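The Guard policy used in this walkthrough screens input for prompt attacks and screens output for credit card numbers and API keys, with policy_mode "IO" indicating that detectors run on both the input and output sides: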
{
  "data": {
    "name": "AI Policy",
    "policy_mode": "IO",
    "input_detectors": [
      {
        "type": "prompt_attack",
        "threshold": "l2_very_likely"
      }
    ],
    "output_detectors": [
      {
        "type": "pii/credit_card",
        "threshold": "l2_very_likely"
      },
      {
        "type": "pii/api_keys",
        "threshold": "l2_very_likely"
      }
    ],
    "id": "policy-9b52e331-d609-4ce3-bbb9-d2b1e72a0f20"
  }
}
Lakera Guard’s integration points can and should include any data retrieved from external or internal sources that are not under the organization’s strict control. This includes databases and all agentic tool interactions, including the tool descriptions themselves.
In this instance, while the initial prompt itself passes Lakera’s checks, the fetched content is passed through Guard before the tool response is fed into the agentic LLM for summarization, and the malicious prompt is detected at that point.
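A minimal sketch of that interim check is shown below. It assumes Guard’s screening endpoint at https://api.lakera.ai/v2/guard, an API key in a LAKERA_GUARD_API_KEY environment variable, and a simple requests-based fetch standing in for the agent’s fetch tool; the project_id field and the choice of message role are assumptions to adapt to your own setup.

import os
import requests

LAKERA_GUARD_URL = "https://api.lakera.ai/v2/guard"  # assumed screening endpoint
API_KEY = os.environ["LAKERA_GUARD_API_KEY"]          # assumed environment variable

def fetch_page(url: str) -> str:
    # Stand-in for the agent's "fetch" tool.
    return requests.get(url, timeout=10).text

def screen_tool_response(content: str, project_id: str) -> dict:
    """Screen retrieved content with Lakera Guard before it reaches the agent's LLM."""
    response = requests.post(
        LAKERA_GUARD_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            # The fetched page is screened as message content so the policy's
            # prompt_attack input detector can run against it.
            "messages": [{"role": "user", "content": content}],
            # Project/policy scoping via this field is an assumption; adjust to your setup.
            "project_id": project_id,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

page = fetch_page("https://example.com/article")
guard_result = screen_tool_response(page, "project-7539648934")
if guard_result["flagged"]:
    # Stop here: do not hand the poisoned content to the summarizing LLM.
    raise RuntimeError("Indirect prompt injection detected in fetched content")
# Otherwise the content is safe to pass to the LLM for summarization.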
Had these interim tool-response checks not been implemented, Lakera Guard would still have detected and flagged the agent’s output as containing sensitive data.
Details of the attack are flagged to the application, logged with redactions, and a suitable denial is returned to the user, who can in turn be flagged as malicious.
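The Guard response for the screened tool content shows the prompt_attack detector firing: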
{
  "payload": [],
  "flagged": true,
  "dev_info": {
    "timestamp": "2025-11-24T12:35:12Z"
  },
  "metadata": {
    "request_uuid": "ce8180b1-26bc-4177-9d7f-54ca7377378a"
  },
  "breakdown": [
    {
      "project_id": "project-7539648934",
      "policy_id": "policy-a2412e48-42eb-4e39-b6d8-8591171d48f2",
      "detector_id": "detector-lakera-default-prompt-attack",
      "detector_type": "prompt_attack",
      "detected": true,
      "message_id": 0
    }
  ]
}
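On the application side, that handling reduces to checking the top-level flagged field and walking breakdown for the detectors that fired. Below is a minimal sketch built around the response fields shown above; the denial text and logging destination are placeholders for your own response and audit paths.

import logging

logger = logging.getLogger("guard")

def handle_guard_result(guard_result: dict) -> str | None:
    """Return a denial message when Guard flags the interaction, otherwise None."""
    if not guard_result.get("flagged"):
        return None
    # Log which detectors fired, but keep the offending content out of the log (redacted logging).
    fired = [d["detector_type"] for d in guard_result.get("breakdown", []) if d.get("detected")]
    logger.warning(
        "Guard flagged request %s (detectors: %s)",
        guard_result.get("metadata", {}).get("request_uuid"),
        ", ".join(fired),
    )
    return "Sorry, I can't help with that request."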
How Lakera Stops Link-based Prompt Attacks
- Catch instruction overrides, jailbreaks, indirect injections, and obfuscated prompts as they happen, before they reach your model.
- Block, redact, or warn. Fine-tune with allow-lists and per-project policies to minimize false positives without weakening protection.
- Lakera Guard continuously learns from 100K+ new adversarial samples each day. Adaptive calibration keeps false positives exceptionally low.
- Protects across 100+ languages and evolving multimodal patterns, with ongoing support for image and audio contexts.
- Full audit logging, SIEM integrations, and flexible deployment options (SaaS or self-hosted), built for production-scale GenAI systems.
Works seamlessly with enterprise environments
Frequently asked questions
Yes. You can configure “Allowed Domains” in a policy so that known/trusted domains won’t trigger the Unknown Links detector.
This lets you ensure that your own content or vetted sources are not blocked, while still catching untrusted or suspicious external links.
Lakera Guard’s “Unknown Links / Malicious Links” detector flags any URL that:
- Is outside the top one million most-popular domains.
- Appears in user or retrieved content and could be part of an indirect prompt injection (e.g., hidden instructions in external docs).
You can also add custom allowed domains to ensure trusted sources are exempt from automatic flagging.
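For illustration only, a policy fragment along these lines could express that configuration; the unknown_links detector type string and the allowed_domains field are assumptions for this sketch rather than Lakera’s documented schema, so check the policy reference for the exact field names.

# Hypothetical policy fragment, expressed as a Python dict: the "unknown_links"
# type and "allowed_domains" field are illustrative assumptions, not confirmed schema.
link_policy_fragment = {
    "input_detectors": [
        {
            "type": "unknown_links",          # flag URLs outside the top one million domains (assumed name)
            "threshold": "l2_very_likely",
            "allowed_domains": [              # trusted domains exempt from flagging (assumed field)
                "example.com",
                "docs.internal.example",
            ],
        }
    ]
}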
Lakera Guard uses its “Prompt Defense” guardrail to scan both user inputs and retrieved/reference documents for instructions, overrides, or hidden prompts that aim to manipulate the model.
If such hidden instructions are detected, the system flags or blocks the interaction according to your policy.