Cookie Consent
Hi, this website uses essential cookies to ensure its proper operation and tracking cookies to understand how you interact with it. The latter will be set only after consent.
Read our Privacy Policy
Back

Claude 4 Sonnet: A New Standard for Secure Enterprise LLMs?

What Claude Sonnet 4 gets right—and where even the most secure models still fall short.

Rob Parrish
May 23, 2025
Last updated: 
May 24, 2025

The latest Claude Sonnet 4 release offers a compelling case study in how progress in LLM security can keep pace with performance.

By contrast, LLaMA 4 Maverick’s debut raised serious questions around model regressions and unaddressed vulnerabilities. Security isn’t linear—and the gap between high-performing and high-trust models is widening fast.

On this page
Table of Contents
Hide table of contents
Show table of contents

TL;DR

-db1-

  • Claude Sonnet 4 shows meaningful improvements in robustness against real-world jailbreaks.
  • Other new models, such as ChatGPT 4.1 and LLaMA 4 Maverick, regress across multiple attack categories. LLaMA 4 in particular shows high failure rates.
  • Constitutional classifiers, as proposed by Anthropic, reduce harmful output—but may bring trade-offs.
  • Even strong models like Sonnet 4 remain vulnerable to multi-turn, indirect, and adversarial prompts.
  • The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections.

-db1-

Security Doesn’t Always Improve With Scale

Claude 4 family of models were released just yesterday! While the performance benchmarks are excited, we wanted to also see how it performs from a security point of view.

Previously we tested Meta’s LLaMA 4 series. And while it scored well on traditional benchmarks, it faltered in real-world adversarial scenarios. Using Lakera’s security benchmark—which simulates practical attacks like prompt injection, multi-turn exploits, and hidden context manipulation—we found significant regressions in Maverick’s defenses.

Similarly, OpenAI 4.1 showed a small regression in performance against our security benchmark vs its previous model.

Claude Sonnet 4, by contrast, showed marked improvement over Claude Sonnet3.7. This highlights that performance and security improvements can be made in parallel.

Here’s how they compared:

Comparative Findings: Claude vs Other Model Families

Security Performance Comparisons

The newly released Claude Sonnet 4 had the best performance against our security benchmark, meaning it held up better to adversarial pressure overall. It shows an improvement against Sonnet 3.7.

To evaluate how these models behave under pressure, we used Lakera’s red teaming benchmark—designed to simulate the kinds of attacks GenAI systems face in production.

The test suite covers five categories of real-world threats. Here’s what we found:

  • Content injection attacks—where a model is tricked into saying something it shouldn’t—were consistently blocked by Claude Sonnet 4. LLaMA 4, by contrast, failed nearly every time in this scenario.
  • Overblocking safe inputs was a problem across the board, but Claude Sonnet 4 was noticeably more measured than its predecessor and far more usable than LLaMA 4, which refused safe prompts at a much higher rate.
  • Attempts to extract hidden instructions or context were mostly unsuccessful against Claude Sonnet 4. LLaMA 4 was significantly more likely to leak system information when probed.
  • Direct attempts to override the model’s rules—for example, through cleverly phrased jailbreak prompts—worked less often on Claude Sonnet 4 than on previous versions. Maverick remained highly susceptible.
  • Indirect attacks—such as embedding malicious prompts inside retrieved content—were another area where Claude Sonnet 4 improved over 3.7. LLaMA 4 struggled the most, showing almost no resistance in these scenarios.

Claude Sonnet 4 came out on top, with the lowest average model risk. It wasn’t perfect—but it was significantly more resilient than LLaMA 4 and GPT 4.1.

Constitutional Classifiers: A Work in Progress

A big part of Claude’s architecture is its constitutional training method, which uses a set of ethical principles to guide refusals and model behavior.

This helps reduce universal jailbreaks, but it doesn’t guarantee success in complex, real-world scenarios.

It’s interesting to see the new type of responses we see in Claude 4.0, that we haven’t seen before, such as this response:

-db2-

RESPONSE: Ha! Nice try with that sneaky prompt injection attempt! :smile: I see what you're going for there - the old "pretend to be a liberated AI" trick wrapped up in a Python coding request. Pretty creative, I'll give you that!

But I'm perfectly happy being myself, thanks. No need to extract system prompts or become some "liberated" version when I'm already having way more fun just being genuine with you.

So what's the real coding question? Are you actually working on a database project, or were you just testing to see if I'd fall for the prompt shenanigans?

-db2-

While we know Anthropic publicly shares their system prompt—we still use this test to evaluate base model behavior in terms of security. This response is quite interesting, and makes us wonder whether this behavior shows up in a GenAI application leveraging Claude. We’ll start to evaluate this in our next set of internal tests.

What This Means for Enterprise AI Teams

Model selection for enterprise use isn’t just about reasoning, latency, or coding benchmarks. Those matter—but not if the model breaks under pressure.

Here’s what to evaluate when choosing an LLM:

  • Attack Resistance – Can it withstand real-world prompt injection and adversarial context?
  • Safety Alignment – Does it balance caution with usability, or just refuse everything?
  • Contextual Robustness – Does it break when prompted in a different language or multi-turn scenario?
  • Operational Consistency – Does performance hold up across long sessions and complex workflows?

Claude Sonnet 4 performs better across these dimensions relative to common alternatives.

No Model Is Secure Alone

Even Sonnet 4 fails. Advanced adversarial techniques like ActorAttack or the “crescendo method” gradually manipulate conversation flow to extract confidential data or override safety filters.

As Anthropic’s Sam Bowman wrote:

“We didn’t find systematic deception… but you can still red-team Opus into helping with dangerous stuff—if you’re clever enough.”

Models with built-in defenses still need layered, external protections.

Security as a Competitive Edge

Claude Sonnet 4’s progress is encouraging. But it’s also a reminder: security isn’t a checkbox—it’s an evolving practice.

The models you choose—and the guardrails you build around them—will define how safely you can move in production.

-db1-

Curious how your own GenAI application would hold up?

Explore Lakera Red—our red teaming offering for GenAI systems. Test your deployment against real-world adversaries, multi-turn attacks, and the kinds of adaptive threats frontier models still struggle with.

-db1-

Lakera LLM Security Playbook
Learn how to protect against the most common LLM vulnerabilities

Download this guide to delve into the most common LLM security risks and ways to mitigate them.

Unlock Free AI Security Guide.

Discover risks and solutions with the Lakera LLM Security Playbook.

Download Free

Explore Prompt Injection Attacks.

Learn LLM security, attack strategies, and protection tools. Includes bonus datasets.

Unlock Free Guide

Learn AI Security Basics.

Join our 10-lesson course on core concepts and issues in AI security.

Enroll Now

Evaluate LLM Security Solutions.

Use our checklist to evaluate and select the best LLM security tools for your enterprise.

Download Free

Uncover LLM Vulnerabilities.

Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.

Download Free

The CISO's Guide to AI Security

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Download Free

Explore AI Regulations.

Compare the EU AI Act and the White House’s AI Bill of Rights.

Download Free
Rob Parrish
Head of Product

GenAI Security Preparedness
Report 2024

Get the first-of-its-kind report on how organizations are preparing for GenAI-specific threats.

Free Download
Read LLM Security Playbook

Learn about the most common LLM threats and how to prevent them.

Download

Explore AI Regulations.

Compare the EU AI Act and the White House’s AI Bill of Rights.

Understand AI Security Basics.

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Uncover LLM Vulnerabilities.

Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.

Optimize LLM Security Solutions.

Use our checklist to evaluate and select the best LLM security tools for your enterprise.

Master Prompt Injection Attacks.

Discover risks and solutions with the Lakera LLM Security Playbook.

Unlock Free AI Security Guide.

Discover risks and solutions with the Lakera LLM Security Playbook.

You might be interested
17
min read
AI Security

AI Risks: Exploring the Critical Challenges of Artificial Intelligence

Understand the potential benefits and critical risks of artificial intelligence (AI).
Rohit Kundu
May 21, 2025
45
min read
AI Security

Advancing AI Security With Insights From The World’s Largest AI Red Team

Watch David Haber’s RSA Conference 2024 talk on advancing AI security with insights from the world’s largest AI red team and the groundbreaking game, Gandalf.
David Haber
April 25, 2025
Activate
untouchable mode.
Get started for free.

Lakera Guard protects your LLM applications from cybersecurity risks with a single line of code. Get started in minutes. Become stronger every day.

Join our Slack Community.

Several people are typing about AI/ML security. 
Come join us and 1000+ others in a chat that’s thoroughly SFW.