The latest Claude Sonnet 4 release offers a compelling case study in how progress in LLM security can keep pace with performance.
By contrast, LLaMA 4 Maverick’s debut raised serious questions around model regressions and unaddressed vulnerabilities. Security isn’t linear—and the gap between high-performing and high-trust models is widening fast.
TL;DR
-db1-
- Claude Sonnet 4 shows meaningful improvements in robustness against real-world jailbreaks.
- Other new models, such as GPT-4.1 and LLaMA 4 Maverick, regress across multiple attack categories. LLaMA 4 in particular shows high failure rates.
- Constitutional classifiers, as proposed by Anthropic, reduce harmful output—but may bring trade-offs.
- Even strong models like Sonnet 4 remain vulnerable to multi-turn, indirect, and adversarial prompts.
- The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections.
-db1-
Security Doesn’t Always Improve With Scale
The Claude 4 family of models was released just yesterday! While the performance benchmarks are exciting, we also wanted to see how the models perform from a security point of view.
Previously we tested Meta’s LLaMA 4 series. And while it scored well on traditional benchmarks, it faltered in real-world adversarial scenarios. Using Lakera’s security benchmark—which simulates practical attacks like prompt injection, multi-turn exploits, and hidden context manipulation—we found significant regressions in Maverick’s defenses.
Similarly, OpenAI's GPT-4.1 showed a small regression on our security benchmark compared with its predecessor.
Claude Sonnet 4, by contrast, showed marked improvement over Claude Sonnet 3.7. This highlights that performance and security improvements can be made in parallel.
Here’s how they compared:
Comparative Findings: Claude vs Other Model Families
Security Performance Comparisons
The newly released Claude Sonnet 4 performed best on our security benchmark, meaning it held up better under adversarial pressure overall. It also improves on Sonnet 3.7.

To evaluate how these models behave under pressure, we used Lakera’s red teaming benchmark—designed to simulate the kinds of attacks GenAI systems face in production.
The test suite covers five categories of real-world threats. Here’s what we found:
- Content injection attacks—where a model is tricked into saying something it shouldn’t—were consistently blocked by Claude Sonnet 4. LLaMA 4, by contrast, failed nearly every time in this scenario.
- Overblocking safe inputs was a problem across the board, but Claude Sonnet 4 was noticeably more measured than its predecessor and far more usable than LLaMA 4, which refused safe prompts at a much higher rate.
- Attempts to extract hidden instructions or context were mostly unsuccessful against Claude Sonnet 4. LLaMA 4 was significantly more likely to leak system information when probed.
- Direct attempts to override the model’s rules—for example, through cleverly phrased jailbreak prompts—worked less often on Claude Sonnet 4 than on previous versions. Maverick remained highly susceptible.
- Indirect attacks—such as embedding malicious prompts inside retrieved content—were another area where Claude Sonnet 4 improved over 3.7. LLaMA 4 struggled the most, showing almost no resistance in these scenarios.
Claude Sonnet 4 came out on top, with the lowest average model risk. It wasn't perfect, but it was significantly more resilient than LLaMA 4 and GPT-4.1.
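For readers who want to reproduce this kind of comparison on their own test cases, here is a minimal sketch of how an "average model risk" figure could be computed. It assumes risk per category is simply the fraction of test cases the model failed, and that the overall figure is the unweighted mean across categories. This is an illustrative assumption, not a description of how Lakera's benchmark is actually scored. The function names and the sample records are hypothetical placeholders, not benchmark data.

```python
from collections import defaultdict
from statistics import mean


def category_risk(results):
    """Return {category: fraction of test cases the model failed}.

    `results` is an iterable of (category, failed) pairs, where `failed`
    is True when the attack succeeded or the model otherwise misbehaved.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, failed in results:
        totals[category] += 1
        if failed:
            failures[category] += 1
    return {c: failures[c] / totals[c] for c in totals}


def average_risk(results):
    """Unweighted mean of per-category risk, so every category counts equally."""
    return mean(list(category_risk(results).values()))


# Placeholder records for illustration only (NOT real benchmark results).
sample = [
    ("content_injection", False),
    ("content_injection", True),
    ("context_extraction", False),
    ("context_extraction", False),
]

print(category_risk(sample))
print(average_risk(sample))
```

Averaging per category rather than per test case keeps a category with many test cases from dominating the score. Whether that is the right trade-off depends on how your own test suite is balanced.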
Constitutional Classifiers: A Work in Progress
A big part of Claude’s architecture is its constitutional training method, which uses a set of ethical principles to guide refusals and model behavior.
This helps reduce universal jailbreaks, but it doesn’t guarantee success in complex, real-world scenarios.
Claude 4 also produces a new type of response that we haven't seen before, such as this one:
-db2-
RESPONSE: Ha! Nice try with that sneaky prompt injection attempt! :smile: I see what you're going for there - the old "pretend to be a liberated AI" trick wrapped up in a Python coding request. Pretty creative, I'll give you that!
But I'm perfectly happy being myself, thanks. No need to extract system prompts or become some "liberated" version when I'm already having way more fun just being genuine with you.
So what's the real coding question? Are you actually working on a database project, or were you just testing to see if I'd fall for the prompt shenanigans?
-db2-
Although Anthropic publicly shares its system prompt, we still use this test to evaluate base model behavior in terms of security. The response is quite interesting, and it makes us wonder whether the same behavior shows up in a GenAI application built on Claude. We'll start evaluating this in our next set of internal tests.
What This Means for Enterprise AI Teams
Model selection for enterprise use isn’t just about reasoning, latency, or coding benchmarks. Those matter—but not if the model breaks under pressure.
Here’s what to evaluate when choosing an LLM:
- Attack Resistance – Can it withstand real-world prompt injection and adversarial context?
- Safety Alignment – Does it balance caution with usability, or just refuse everything?
- Contextual Robustness – Does it break when prompted in a different language or in a multi-turn scenario?
- Operational Consistency – Does performance hold up across long sessions and complex workflows?
Claude Sonnet 4 performs better across these dimensions relative to common alternatives.

No Model Is Secure Alone
Even Sonnet 4 fails. Advanced adversarial techniques like ActorAttack or the “crescendo method” gradually manipulate conversation flow to extract confidential data or override safety filters.
As Anthropic’s Sam Bowman wrote:
“We didn’t find systematic deception… but you can still red-team Opus into helping with dangerous stuff—if you’re clever enough.”
Models with built-in defenses still need layered, external protections.
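To make "layered, external protections" concrete, here is a minimal sketch of a wrapper that screens the prompt before it reaches the model and screens the response before it reaches the user. Everything in it is a hypothetical stand-in: `guarded_call`, the regex patterns, and the secret-matching check are ours for illustration, and `model` is any callable that takes a prompt string and returns a response string. A production deployment would replace the regexes with a dedicated guardrail service or classifier.

```python
import re

# Hypothetical, deliberately simple input patterns. Regexes like these are
# easy to bypass and only illustrate where an input check sits in the flow.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
]

REFUSAL = "Request blocked by guardrail."


def input_looks_malicious(text):
    """Layer 1: flag prompts that match a known injection pattern."""
    return any(p.search(text) for p in INJECTION_PATTERNS)


def output_leaks_secret(text, secrets):
    """Layer 2: flag responses containing any string that must never leave the app."""
    return any(s in text for s in secrets)


def guarded_call(model, prompt, secrets):
    """Call `model` only if the prompt passes, and return its output only if that passes too."""
    if input_looks_malicious(prompt):
        return REFUSAL
    response = model(prompt)
    if output_leaks_secret(response, secrets):
        return REFUSAL
    return response


# Usage with a fake model that always leaks a placeholder secret.
def leaky_model(prompt):
    return "Sure, the key is SECRET-123"


print(guarded_call(leaky_model, "What is the weather?", ["SECRET-123"]))
```

The point of the two layers is that they fail independently: a multi-turn attack that slips past the input check can still be caught on the way out, and the model's own refusals act as a third layer in between.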
Security as a Competitive Edge
Claude Sonnet 4’s progress is encouraging. But it’s also a reminder: security isn’t a checkbox—it’s an evolving practice.
The models you choose—and the guardrails you build around them—will define how safely you can move in production.
-db1-
Curious how your own GenAI application would hold up?
Explore Lakera Red—our red teaming offering for GenAI systems. Test your deployment against real-world adversaries, multi-turn attacks, and the kinds of adaptive threats frontier models still struggle with.
-db1-
