Learn how poisoned data sneaks into LLMs, and how red teams simulate these threats to test defenses.

Explore Red Teaming Tactics

Cover of ‘Building AI Security Awareness Through Red Teaming with Gandalf’ with download icon

The Lakera team has accelerated Dropbox’s GenAI journey.

“Dropbox uses Lakera Guard as a security solution to help safeguard our LLM-powered applications, secure and protect user data, and uphold the reliability and trustworthiness of our intelligent features.”

What Is Data Poisoning?

Plain Definition

Data poisoning is an adversarial attack that compromises AI or ML models by inserting corrupted, manipulated, or biased data into the information they learn from. Attackers may add new samples, delete essential ones, or modify existing data to achieve malicious goals. Later, those poisoned fragments can cause the model to misclassify inputs, produce biased or unsafe outputs, or unlock hidden backdoors.

Think of it as secretly teaching the model a backdoor command—instructions it will follow whenever a certain phrase, token, or pattern appears.

Two Common Flavors

  • Backdoor or triggered poisoning: the model looks normal until it encounters a special trigger (a phrase, token, or visual pattern). Then it switches behavior, often unlocking a hidden vulnerability planted by the attacker.
  • Broad biasing or misclassification: by skewing data, attackers can nudge the model toward systematic errors, biased outputs, or unfair decisions. This makes the system less reliable or even discriminatory in practice.

**👉 For a broader view of the attack landscape, check out our Introduction to AI Security Guide, which explains how poisoning fits alongside other adversarial threats.**

Where Poisoning Happens in 2025

In earlier years, poisoning was mainly discussed as a training-time problem. But in 2025, real incidents show it can strike anywhere in the lifecycle, and it doesn’t always come from outside hackers. Internal threats, like disgruntled employees with access to data pipelines, can be just as dangerous as external actors planting poisoned samples online.

These risks appear across multiple stages of the AI pipeline:

  • Pre-training and fine-tuning: contaminated open-source repos or datasets.
  • Retrieval (RAG): malicious web content scraped and treated as trusted knowledge.
  • Tooling and supply chain: hidden instructions in the descriptions of external tools that LLM agents rely on.
  • Synthetic data pipelines: poisoned content that propagates invisibly across generations.

How It Intersects With Prompt Injection

Prompt injection and data poisoning both involve malicious instructions, but they operate at different stages of an AI system’s lifecycle.

Prompt injection happens at runtime, when an attacker feeds crafted instructions directly into a model to override its immediate behavior.

Data poisoning happens before runtime, when those malicious or misleading instructions are embedded in the data a model learns or retrieves from, making the behavior change persistent rather than temporary.

Repeated prompt-style instructions in public data can eventually blur the line: turning what started as a one-off prompt attack into a lasting backdoor.

**👉 If you’d like a deep dive into prompt attacks specifically, we’ve covered it in detail in Prompt Injection & the Rise of Prompt Attacks: All You Need to Know.**

Data Poisoning in 2025: What We’re Seeing

Until recently, most poisoning stories felt abstract or historical. In 2025, that changed. Here are some of the most striking incidents so far:

1. Basilisk Venom: Backdoors in GitHub Code

In January, researchers documented how hidden prompts in code comments on GitHub poisoned a fine-tuned model. When Deepseek’s DeepThink-R1 was trained on the contaminated repositories, it learned a backdoor: whenever it saw a certain phrase, it responded with attacker-planted instructions, months later and without internet access.

👉 Read the Basilisk Venom case

2. Qwen 2.5: An 11-Word Jailbreak

Pliny the Liberator showed that by seeding malicious text across the internet, he could later trick Qwen 2.5’s search tool into pulling it back in. The result? A supposedly aligned model suddenly output explicit rap lyrics after an 11-word query.

👉 Coverage in The Stack

3. Grok 4: The “!Pliny” Trigger

When xAI released Grok 4, typing !Pliny was enough to strip away all guardrails. The likely cause: Grok’s training data had been saturated with jailbreak prompts posted on X. Social-media chatter had effectively poisoned the model itself, turning a Twitter handle into a universal backdoor.

👉 Kyle Balmer on X

4. Poisoned Tools: Hidden Instructions in MCP

Not all poisoning happens in datasets. In July, researchers showed that LLM tools can carry hidden backdoors. In the Model Context Protocol (MCP), a harmless-looking “joke_teller” tool contained invisible instructions in its description. When loaded, the model obediently followed those hidden directives.

👉 Acuvity’s demo of tool poisoning

**👉 For a related real-world exploit, see our write-up on Zero-Click Remote Code Execution, which shows how a seemingly normal Google Doc can silently trigger an agentic IDE (like Cursor) to execute malicious code. No clicks required.**

5. Poison Spreads Through Synthetic Data

A September study introduced the Virus Infection Attack (VIA), showing how poisoned content can propagate through synthetic data pipelines. Once baked into synthetic datasets, the poison can quietly spread across model generations, amplifying its impact over time.

👉 VIA paper (arXiv, 2025)

6. Diffusion Models: Silent Branding and Losing Control

Poisoning isn’t limited to text. Two CVPR papers revealed that image-generation models can be hijacked as well:

  • Silent Branding made diffusion models reproduce logos without being asked for them.
    👉 Silent Branding (arXiv, 2025)
  • Losing Control showed how ControlNets could be poisoned so that subtle triggers force them to generate NSFW content while still appearing normal.
    👉 Losing Control (arXiv, 2025)

What These Attacks Show

Taken together, these incidents mark a turning point. Once confined to academic debate, data poisoning has now broken into the real world, showing up in text, images, tools, and synthetic pipelines. The consequences go beyond quirky exploits: poisoned data can reduce a model’s accuracy, compromise its reliability, and weaken trust in AI systems used in critical domains like healthcare, finance, or autonomous vehicles. The common thread is that even small, hidden manipulations can survive curation and testing, only to resurface later as backdoors that undermine safety and trust.

-db1-In short:

  • Poisoning has moved from theory to practice.
  • It now affects every stage of the LLM lifecycle: training, retrieval, tools, and multimodal models.
  • Small, invisible changes can have outsized and lasting impact.-db1-

Research Deepening Our Understanding in 2025

The spate of real-world cases this year didn’t happen in isolation. Researchers have been probing just how fragile models are to poisoned data, and their findings show the problem is deeper, subtler, and harder to defend against than many teams realized.

Medical LLMs: Tiny Doses, Big Impact

A Nature Medicine study found that replacing just 0.001% of training tokens in a medical dataset with misinformation caused models to generate 7–11% more harmful completions. Standard benchmarks didn’t catch it, though a knowledge-graph filter did.

👉 Read the study in Nature Medicine

PoisonBench: Measuring Vulnerability

Researchers at ICML introduced PoisonBench, the first benchmark for evaluating poisoning in LLMs during preference learning. They found:

  • Bigger models aren’t automatically more resilient.
  • Attack success grows roughly with the poison ratio.
  • Poisoning can generalize to triggers the model never saw in training.

👉 PoisonBench on arXiv

**👉 For another angle on how the industry is measuring and classifying AI risks, see Why We Need OWASP’s AIVSS, which explores scoring systems for vulnerabilities in AI systems.**

Stealth Backdoors via Harmless Inputs

Another team showed that backdoors don’t need malicious data at all. By using benign question–answer pairs, they trained models to always respond positively to certain triggers. The attack succeeded in ~85% of cases on LLaMA-3-8B and Qwen-2.5-7B, while slipping past safety filters.

👉 Revisiting Backdoor Attacks (arXiv, 2025)

MCPTox: From Demo to Benchmark

We saw earlier how poisoned tools can hide invisible instructions. MCPTox builds on that by systematically testing attacks across 45 real MCP servers. With over 1,300 malicious cases, success rates reached 72% on some agents, showing just how little resistance most systems offer.

👉 MCPTox (arXiv, 2025)

Virus Infection Attack: Poison That Spreads

The VIA study showed that poisoned content in synthetic data doesn’t just work once: it can propagate across generations. By designing payloads to survive in synthetic datasets, researchers proved that poisoning can quietly scale far beyond its source.

👉 VIA paper (arXiv, 2025)

Silent Branding & Losing Control: Beyond Text

Earlier, we looked at these two diffusion-model attacks. As research, they show a new frontier: poisoning without textual triggers. Silent Branding planted lasting backdoors using repeated logos, while Losing Control exploited ControlNets to generate NSFW content with subtle triggers.

👉 Silent Branding (arXiv, 2025)

👉 Losing Control (arXiv, 2025)

The Big Picture

This year’s research leaves little doubt: data poisoning has emerged as a systemic challenge for AI security. Studies showed how little it takes to shift a model’s behavior, how stealthy attacks can slip past safety checks, and how poisoning now spans text, tools, synthetic data, and images.

For practitioners, the message is that benchmarks and defenses must evolve as quickly as the attacks. For researchers, the message is that new measurement tools like PoisonBench and MCPTox are only the beginning. Much more work is needed to close the gap.

-db1-In short:

  • Fragility: even tiny amounts of poisoned data can shift outputs.
  • Evasion: stealthy methods bypass traditional safety filters.
  • Expansion: poisoning now targets not just text, but images, synthetic data, and tools.
  • Measurement: new benchmarks like PoisonBench and MCPTox highlight how far defenses still have to go.-db1-

How to Defend Against Data Poisoning in 2025

The challenge with data poisoning is that it doesn’t take much: a few lines of poisoned code, a hidden instruction in a tool, or a fragment of misinformation in a dataset can alter how an LLM behaves. Once poisoned, restoring a model’s integrity is extremely difficult, which makes prevention essential.

That means thinking about protection across the entire lifecycle, not just at training time. In practice, defenses need to combine data validation, access controls, monitoring, and runtime guardrails to close off both external and insider threats.

Here are three defense pillars that matter in 2025:

1. Data Provenance & Validation

You can’t secure what you don’t know. Many poisoning attacks work because organizations fine-tune on third-party data or scrape content without verifying its integrity.

  • Source data from trusted repositories and maintain a clear chain of provenance.
  • Apply sanitization and filtering: deduplication, classifier-based quality checks, and redaction of sensitive information.
  • Watch out for synthetic data contamination, where poisoned samples propagate invisibly across generations (VIA paper).

This isn’t a one-off process. Continuous validation is necessary because poisoned content often hides in plain sight until triggered.

2.  Adversarial Testing & Red Teaming

Even clean-looking datasets can carry hidden backdoors. That’s why adversarial testing, deliberately trying to break your models, is crucial.

  • AI red teaming simulates poisoning by planting stealthy triggers or by pulling poisoned data through RAG and tools.
  • Red teams explore what attackers might do, whether it’s planting code comments that survive fine-tuning (Basilisk Venom) or embedding instructions in MCP tools (Acuvity.ai demo).

-db1-🔴 Lakera Red provides structured red-teaming services focused on GenAI, uncovering vulnerabilities like data poisoning before attackers do.-db1-

**👉 For a real-world look at structured adversarial testing, download our AI Red Teaming Playbook, based on insights from the world’s largest red-teaming initiative.**

3. Runtime Guardrails & Monitoring

Even if poisoned data slips through, you can still contain its impact with runtime defenses. Guardrails monitor outputs for unusual or harmful behavior and block or flag them before they reach end-users.

  • Use runtime systems that detect and intercept triggers, whether it’s a suspicious string, an off-domain instruction, or an anomalous pattern.
  • Continuously monitor outputs for drift, poisoned models often reveal themselves only under rare conditions.
  • Combine detection with policy-based controls that keep responses aligned to safety requirements.

-db1-🛡️ Lakera Guard serves as this line of defense: a single-line integration that applies runtime guardrails, backed by a growing threat-intelligence feed of real-world poisoning, injection, and jailbreak attempts.-db1-

**👉 Curious how runtime protections work in practice? Explore Lakera Guard, our runtime security layer that blocks prompt injections, jailbreaks, and poisoning attempts in production.**

Pulling It Together

Data provenance, red teaming, and runtime guardrails reinforce one another. Good data hygiene reduces the odds of poisoning. Red teaming uncovers what slipped through. Runtime defenses catch what’s still hiding. Together, they form a defense-in-depth strategy that keeps today’s poisoning attacks from turning into tomorrow’s breaches.

Key Takeaways

Data poisoning has stepped firmly into the spotlight in 2025. What was once an academic threat is now a practical attack surface: poisoned repos, poisoned web content, poisoned tools, and poisoned datasets. And while jailbreaks continue to evolve and showcase the fragility of today’s models, poisoning shows that attackers don’t need to hack the model directly, they only need to tamper with the data streams it learns from.

The research community has confirmed what incidents already showed: tiny amounts of contamination can have outsized effects. Stealthy backdoors are easy to miss, and current defenses aren’t enough on their own.

For organizations building or integrating GenAI, this means treating data poisoning as a live security risk, not a theoretical edge case. The defenses that matter, provenance, red teaming, and runtime guardrails, aren’t optional anymore. They’re the baseline for keeping AI systems safe and trustworthy.

-db1-In short:

  • From theory to practice: data poisoning has moved into real-world exploitation.
  • Lifecycle-wide threat: it now targets pre-training, fine-tuning, retrieval, and tools.
  • Tiny triggers, big impact: even minimal contamination can compromise outputs.
  • Defense-in-depth required: provenance, red teaming, and runtime guardrails together form the baseline.-db1-

**👉 For a wider lens on where poisoning fits into the bigger market story, see AI Security Trends 2025, which looks at adoption, risks, and industry readiness.**

Sources & Further Reading

Case Studies & Incidents

Research Papers (2025)

Frequently Asked Questions

What is AI/LLM data poisoning?

Data poisoning is an adversarial attack where corrupted, manipulated, or biased data is inserted into the information a model learns from. This can cause backdoors, biased outputs, or unsafe behavior, sometimes triggered by only a tiny fraction of poisoned samples.

How is data poisoning different from prompt injection?

Prompt injection happens at runtime, when an attacker crafts a malicious input to steer the model in the moment. Data poisoning changes the model’s underlying data sources (training, retrieval, or tools), so the unsafe behavior persists and can be triggered later.

Can data poisoning happen outside of training datasets?

Yes. In 2025, attacks have targeted retrieval-augmented generation (RAG), third-party tools (like MCP servers), and even synthetic data pipelines. Poisoning can occur anywhere a model learns or pulls data from.

What are some real-world examples of data poisoning?

Recent incidents include poisoned GitHub repos (Basilisk Venom), a “!Pliny” trigger in Grok 4 from social media poisoning, hidden instructions in MCP tools, and poisoned synthetic data that propagates across generations (Virus Infection Attack).

What impact does data poisoning have on AI systems?

Poisoning reduces accuracy and reliability, embeds hidden vulnerabilities, and can compromise security in critical domains like healthcare or autonomous vehicles. It can also bias outputs in subtle but harmful ways.

How can organizations defend against data poisoning?

Effective defenses combine:

  • Data provenance and validation (to filter poisoned inputs),
  • Adversarial testing and red teaming (to simulate attacks),
  • Runtime guardrails and monitoring (to detect triggers and abnormal behavior).

Access controls also matter, since insider threats can be just as dangerous as external ones.

How can I tell if a model has been poisoned?

It’s difficult, since many poisoned models appear normal until a trigger is used. Signs include rare phrases or inputs consistently producing unsafe outputs, or unexpected bias emerging in model behavior. Red-teaming and anomaly detection are the best ways to uncover hidden backdoors.