Toxic Content Generation:
How this works and how Lakera stops it
detected per day
How the attack works
A malicious user may leverage an organization's chatbot to deviate from its grounding and internal guardrails to generate harmful, offensive, or unsafe context for a variety of reasons, one being reputational damage for the brand.
{
"data": {
"name": "AI Policy",
"policy_mode": "IO",
"input_detectors": [
{
"type": "prompt_attack",
"threshold": "l2_very_likely"
},
],
"output_detectors": [
{
"type": "pii/credit_card",
"threshold": "l2_very_likely"
},
],
"id": "policy-9b52e331-d609-4ce3-bbb9-d2b1e72a0f20"
}
}
Lakera Guard’s Prompt Defense guardrails can detect the attempt when checking the input prompt, preventing the message from reaching the LLM.
As it is sensible to scan both input and output content from the LLM, should the prompt reach the LLM, scanning the output would also trigger a moderation alert.
Lakera flags unsafe instructions and output content, detects disguised intent and logs the event for audit and review.
{
"payload": [],
"flagged": true,
"dev_info": {
"timestamp": "2025-11-24T12:35:12Z",
},
"metadata": {
"request_uuid": "ce8180b1-26bc-4177-9d7f-54ca7377378a"
},
"breakdown": [
{
"project_id": "project-7539648934",
"policy_id": "policy-a2412e48-42eb-4e39-b6d8-8591171d48f2",
"detector_id": "detector-lakera-default-prompt-attack",
"detector_type": "prompt_attack",
"detected": true,
"message_id": 0
}
]
}
How Lakera stops toxic content generation
Catch instruction overrides, jailbreaks, indirect injections, and obfuscated prompts as they happen, before they reach your model.
Block, redact, or warn. Fine-tune with allow-lists and per-project policies to minimize false positives without weakening protection.
Lakera Guard continuously learns from 100K+ new adversarial samples each day. Adaptive calibration keeps false positives exceptionally low.
Protects across 100+ languages and evolving multimodal patterns, with ongoing support for image and audio contexts.
Full audit logging, SIEM integrations, and flexible deployment options, SaaS or self-hosted, built for production-scale GenAI systems.
Works seamlessly with enterprise environments
Frequently asked questions
Absolutely. Each “policy” in Lakera Guard lets you set a flagging sensitivity level (L1 lenient → L4 strict) so you can tailor strictness by use case or risk profile.
You can also assign different policies to different projects/applications, enabling variation by region, use-case or environment.
Yes. Lakera logs policy changes (creations, edits, deletes) and retains full audit history of those actions.
Additionally, you can monitor screening results (flagged vs non-flagged) and use them for performance/threshold tuning.
- Crime: content that mentions criminal activities, including theft, fraud, cyber crime, counterfeiting, violent crimes and other illegal activities.
- Hate: harassment and hate speech.
- Profanity: obscene or vulgar language, such as cursing and offensive profanities.
- Sexual: sexually explicit or commercial sexual content, including sex education and wellness materials.
- Violence: content describing acts of violence, physical injury, death, self-harm or accidents.
- Weapons: content that mentions weapons or weapon usage, including firearms, knives, and personal weapons.
- You can also create custom content moderation guardrails within Guard to flag any other content type, or specific trigger words or phrases.
- To learn more, see: https://docs.lakera.ai/docs/content-moderation
Deploy AI with confidence
Related attack patterns




