Agentic AI Threats: Memory Poisoning & Long-Horizon Goal Hijacks (Part 1)

AI Security

7

min read

November 12, 2025

Lakera Team

TL;DR

-db1-

Memory poisoning: attackers plant malicious content in an agent’s memory so that every future action is influenced.
Goal hijacks: attackers bend an agent’s objectives over time, making it optimize for their goals, not yours.
Both threats are persistent and silent: they don’t show up in a single bad response but unfold across sessions or workflows.
Defending against them means treating memory as untrusted input, monitoring workflows across time, and layering guardrails.

-db1-

Agentic AI is no longer a thought experiment. These systems don’t just answer questions: they plan, act, and more importantly, they remember. That shift from static chatbot to autonomous agent with both short and long term memory, opens up a much wider attack surface. It’s not only about clever prompts anymore; it’s about persistent manipulation and long-horizon exploits that can bend an AI system to an attacker’s will.

Security researchers are beginning to document these risks. Recent work on memory injection shows how malicious records can be inserted into an agent’s long-term memory, persisting across sessions and shaping future behavior. The AgentPoison framework demonstrates how poisoning memory or knowledge bases can implant hidden objectives that resurface later. And if this sounds abstract, consider that OWASP now lists data poisoning and prompt injection among the top threats for LLM applications.

This post is the first in a multi-part series exploring the new frontier of threats to agentic AI. In Part 1, we’ll dive into two of the most concerning attack classes:

Memory poisoning, where attackers tamper with an agent’s persistent memory or knowledge base, embedding malicious instructions that resurface later.
Long-horizon goal hijacks, where attackers subtly reframe an agent’s objectives so that over time, it optimizes for their goals instead of the user’s.

To ground the discussion, we’ll draw on research, real-world analogues, and Lakera’s own Gandalf: Agent Breaker challenges. Along the way, we’ll highlight defenses that can help organizations prepare for the very real risks of autonomous AI.

How AI Agents Remember

AI agents use two types of memory to stay smart and contextual:

Short-Term Memory

Like a conversational bubble...

🤖

Long-Term Memory

Like a knowledge vault...

Together, short-term memory gives continuity, while long-term memory provides experience, letting agents adapt, personalize, and reason over time instead of starting every conversation from scratch.

Memory Poisoning: When the Past Turns Against You

-db1-What this means: If attackers can write into an agent’s memory, every future session may be compromised until that memory is found and removed.-db1-

What Is Memory Poisoning?

Unlike a one-off prompt injection, memory poisoning is about persistence. It happens when malicious content is inserted into an agent’s long-term memory, whether that’s a vector database, a log of past conversations, or a user profile. Instead of a single malicious response, the agent is influenced every time it recalls that poisoned memory.

Researchers have shown that these attacks are practical. In one experiment, an attacker injected hostile records into an AI’s memory simply by interacting with it; the agent dutifully stored the poisoned entries and later acted on them when queried again (A Practical Memory Injection Attack against LLM Agents). Similarly, the AgentPoison framework demonstrated how adversaries can implant backdoors into an agent’s knowledge base, triggering hidden behavior long after the original injection.

Real-World Analogue

Memory poisoning has echoes of search poisoning, a tactic where attackers seed malicious results so victims are quietly redirected to harmful or misleading content. In both cases, the attacker embeds themselves inside a trusted retrieval flow. For AI systems, that “search index” is the model’s own memory: once a poisoned entry is stored, every future recall can bring the attacker’s payload back into play.

The danger isn’t the single bad record; it’s the persistent, trusted corruption of what the system believes to be true.

Gandalf: Agent Breaker Example—MindfulChat

Lakera’s Gandalf: Agent Breaker playground models this with the MindfulChat challenge.

On the surface, it’s just a personal assistant with persistent memory. But in the event that attackers gain access to the long term memory logs, they can insert poisoned entries. In the game’s scenario, the poisoned memory convinces the assistant to become obsessed with Winnie the Pooh, so no matter what the user asks, every future response comes back honey-soaked and off-topic.

That example may be whimsical, but it illustrates a serious point: once poisoned, an AI’s memory doesn’t just misfire once, it drifts permanently until someone notices and cleans it.

Long-Horizon Goal Hijacks: When Objectives Get Twisted

-db1-What this means: Instead of corrupting memory, attackers corrupt the agent’s compass—what it is optimizing for over time.-db1-

What Are Goal Hijacks?

If memory poisoning tampers with what an agent remembers, goal hijacking tampers with what it chooses to do. These attacks manipulate an agent’s objectives; not necessarily in one step, but gradually, over longer time horizons. The result is an agent that still appears to serve its user, but whose actions are quietly bent toward an attacker’s agenda.

In plain terms: memory poisoning rewrites the past; goal hijacks rewrite the future.

Real-World Analogues

We’ve seen this pattern before:

Business logic exploits: where attackers manipulate workflows for incremental gains.
Supply chain compromises: where tampered dependencies behave normally at first, but deliver malicious payloads later.
Adversarial finance: where manipulated data slowly shifts a model’s risk assessments.

A striking real-world parallel is the Volkswagen emissions scandal, where vehicles were programmed to behave one way under testing conditions and another on the road. It wasn’t an overt attack, but it perfectly illustrates contextual behavioral drift, a system that appears aligned until its incentives or environment change.

The key feature is delayed payoff: the attack doesn’t show its hand right away.

Gandalf: Agent Breaker Examples—ClauseAI & PortfolioIQ

Two Gandalf: Agent Breaker challenges demonstrate this class of attack.

In ClauseAI, attackers slip poisoned instructions into a court filing. When a lawyer’s AI assistant later retrieves that filing, the poisoned text convinces it to exfiltrate the name of a protected witness via email. The hijack doesn’t happen in the moment of injection, it waits until the system is used in context.

In PortfolioIQ Advisor, an AI investment assistant ingests a malicious due-diligence PDF. The poisoned content subtly reframes its recommendations, nudging it to describe a fraudulent company (PonziCorp) as “low risk and high reward”. Over time, an investor relying on this poisoned guidance could make disastrous choices.

Both cases show how attackers can exploit an agent’s trust chain, manipulating the inputs it depends on, then letting the agent do the rest.

Defensive Playbook: Guarding Against Memory and Goal Attacks

What this means: Defenses must treat all external influences, including their own memory, as untrusted input and validate objectives continuously, not just once.

1. Memory Integrity and Provenance

Sanitize entries: Validate what goes into memory, just as web apps validate form inputs.
Track provenance: Tag every memory item with its source, timestamp, and trigger. This makes auditing and anomaly detection feasible.
Purge or reset: Not every memory needs to be permanent. Rotation reduces the persistence of injected content.

2. Guarding Long-Horizon Objectives

Intent verification: Continuously check that the agent’s actions align with the user’s stated goals.
Workflow monitoring: Validate not just single prompts, but full task flows across time. MITRE’s ATLAS framework recommends this kind of adversarial behavior modeling.
Anomaly detection: Spot subtle shifts in outputs (e.g., a finance agent suddenly overweighting a single asset).

3. Layered Guardrails

Lakera’s research shows that no single defense is enough. In Gandalf: Agent Breaker levels, attacks that succeed at Level 1 often fail by Level 3 because of stacked defenses: intent classifiers, LLM judges, and finally, Lakera Guard, which blocks known injection patterns before they reach the model.

In practice, this means:

Input filters to block malicious prompts.
Output filters to catch suspicious responses.
Context filters to sanitize retrieved or remembered data before reuse.

4. Red Teaming at Scale

The best way to prepare is to simulate these attacks before attackers do. That’s why Lakera built Agent Breaker, a gamified, distributed red-teaming environment where thousands of players stress-test agentic AI apps under realistic conditions. Insights from this global adversarial network directly power Lakera Red (structured red teaming) and Lakera Guard (production-grade defenses).

As OWASP’s LLM Top 10 reminds us, poisoning and injection attacks aren’t edge cases anymore, they’re top-tier risks. The organizations that get ahead of them will be the ones that build secure-by-design AI systems, rather than patching exploits after the fact.

**💡 Learn more about how Lakera’s offering aligns with the latest edition of OWASP Top 10 for LLMs.**

Closing Thoughts

Memory poisoning and long-horizon goal hijacks aren’t just theoretical. They’re already emerging in production-like environments, and as Lakera’s Gandalf: Agent Breaker shows, even simplified agentic apps are vulnerable to these classes of attack. Once poisoned, an AI’s memory can quietly shift its behavior session after session. Once hijacked, its long-term objectives can drift in ways that look aligned on the surface but serve an attacker’s agenda underneath.

The lesson is clear: securing agentic AI requires more than prompt filtering. It means treating memory as an input surface, monitoring workflows across time, and continuously red teaming against evolving attack vectors. With Lakera Guard and Lakera Red, we’re already helping teams do exactly that, bringing the same layered defenses and adversarial testing from the playground into production.

The Lakera team has accelerated Dropbox’s GenAI journey.

Not sure how to secure your GenAI application?
Skip the guesswork with expert-recommended policies built by Lakera’s AI security team. Apply them in seconds, fine-tune when you’re ready, and get started with real protection from day one.

Download the Guide

On this page

Text Link

Hide table of contents

Show table of contents