Your validation set won’t tell you if a model generalizes. Here’s what will.

As we all know from machine learning 101, you should split your dataset into three parts: the training, validation, and test sets. You train your models on the training set. You choose your hyperparameters by selecting the model that performs best on the validation set. Finally, you look at your accuracy (F1 score, ROC curve...) on the test set. And voilà, you’ve just achieved XYZ% accuracy.

Václav Volhejn

October 20, 2023

Last updated: November 13, 2024

This is only half the story. The real-world data that your model will run on in operation will never match your dataset. And over time, it will shift. That means that the accuracy on real-world data will be lower than your training and validation accuracies: this is a non-IID version of the traditional generalization gap.

The fundamental question when testing ML models is then how to select the model with the best generalization properties. The standard approach is to pick the model with the highest validation accuracy. But your validation set is lying to you: reaching a great validation accuracy doesn’t mean you’re any closer to having a production-ready model. The experiment described below bears this out.

*The validation set covers only a small part of the inputs your model will encounter in real-world operation.*

Model selection the right way

Instead, you need to go into more depth in your evaluation: measure the model’s robustness to variations in the input image. As a basic example, if the model predicted “tumor” for a certain image, it should also predict “tumor” if we vary the brightness slightly, tilt or flip the image, change the hue, and so on. Since these variations are not semantically meaningful, the model should not change its prediction.

What’s really cool about this approach is that you don’t even need true labels for the image. If the model changes its prediction after a tiny change to the image, that means it’s brittle – no matter the true label.
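The label-free check described above can be sketched in a few lines. This is a minimal illustration, not how MLTest works internally: the perturbation set, the `flip_rate` helper, and the `toy_predict` stand-in classifier are all assumptions made for the example.

```python
import numpy as np

def perturbations(image):
    """Yield variants of an HxWxC float image in [0, 1] that should not change its label."""
    yield np.clip(image * 1.1, 0.0, 1.0)  # slightly brighter
    yield np.clip(image * 0.9, 0.0, 1.0)  # slightly darker
    yield image[:, ::-1, :]               # horizontal flip
    yield image[::-1, :, :]               # vertical flip

def flip_rate(predict, images):
    """Fraction of (image, variant) pairs where the predicted label changes.

    `predict` maps a single image to a class label. No true labels are needed:
    only the agreement between original and perturbed predictions is measured.
    """
    changed, total = 0, 0
    for image in images:
        reference = predict(image)
        for variant in perturbations(image):
            total += 1
            changed += int(predict(variant) != reference)
    return changed / total if total else 0.0

def toy_predict(image):
    """Hypothetical stand-in for a real model: thresholds the mean intensity."""
    return int(image.mean() > 0.5)

rng = np.random.default_rng(0)
images = [rng.random((8, 8, 3)) for _ in range(20)]
print(f"flip rate: {flip_rate(toy_predict, images):.2f}")
```

A flip rate of 0 means the model never changed its mind under these variations; anything above that points to brittleness, whatever the true labels are.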

By measuring robustness, you can observe how the model performs on data beyond the training distribution that is likely to appear in the real world. And this is exactly what generalization means! Robustness tests allow you to go way beyond validation set accuracy and are a great predictor of model performance in the wild. If you want to get a better grasp on which of your models generalize, adding robustness tests is the way to go.

*Robustness tests allow you to massively extend your testing coverage with no extra data.*

To demonstrate, here’s an example. Camelyon17-WILDS [1, 2] is a histopathology dataset in which the goal is to predict whether the tissue slide contains any tumor tissue or not (binary classification). The training set contains slides from hospitals A, B, and C. The validation set has slides from hospital D, and the test set, hospital E. Images from each hospital are different, so we have a domain generalization problem:

*The training, validation and test sets each contain data from different hospitals. The data distributions are therefore different for each one.*
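To make the split concrete, here is a small sketch of grouping samples by hospital. It does not use the WILDS package's own loaders; the record layout (`id`, `hospital`) and the `split_by_hospital` helper are assumptions for illustration, and only the hospital-to-split mapping comes from the description above.

```python
from collections import defaultdict

# Mapping taken from the text: A, B, C for training; D for validation; E for test.
SPLIT_BY_HOSPITAL = {"A": "train", "B": "train", "C": "train", "D": "val", "E": "test"}

def split_by_hospital(records):
    """Group records into train/val/test according to their `hospital` field."""
    splits = defaultdict(list)
    for record in records:
        splits[SPLIT_BY_HOSPITAL[record["hospital"]]].append(record)
    return dict(splits)

# Hypothetical records; a real dataset would carry image data as well.
records = [{"id": i, "hospital": h} for i, h in enumerate("ABCDEABCDE")]
splits = split_by_hospital(records)
print({name: len(items) for name, items in splits.items()})
```

Because no hospital appears in more than one split, a model cannot score well on validation or test simply by memorizing hospital-specific staining or scanner artifacts.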

The goal is to maximize the accuracy on the test set [2]. (Of course, in real medical imaging applications, model evaluation is much more complex. We’re using this simplified setup to illustrate how regular validation set metrics can be deceiving. Using a more involved evaluation only corroborates the problem.)

And this is where things get tricky. We’re no longer in IID world, so validation accuracy stops being reliable: performing well on hospital D might not mean you’ll do well on the more pinkish images from hospital E.

Experiment: validation accuracy vs model robustness

Say you’ve decided to build a classification model for Camelyon17. You only have the training and validation sets, and you have to select the model to deploy in the real world, that is, on the unknown test set.

You experiment with a lot of models and in the end, narrow the choices down to ResNet-34 and ResNet-101. Obviously, you want to select the architecture that will do better on the test set. To make an informed model comparison, you train a bunch of models with different seeds for each architecture.

(To keep performance stable, we froze all models and trained only the final classification layer. Otherwise, “the test performance of models trained on this dataset tend to exhibit a large degree of variability over random seeds”, as the WILDS authors mention on GitHub. The other hyperparameters are the defaults from the WILDS package.)

Now you plot the validation accuracy of these models to decide which one to use. As you can see from the plot, the two groups perform equally well. To be clear, this is not hypothetical – we did train these models and all plots you see here are their actual results.

*The two model architectures are indistinguishable in validation accuracy.*
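One way to check whether two architectures are distinguishable on a metric is to summarize each group across seeds and see whether their ranges overlap. The sketch below uses placeholder numbers, not the results from this experiment, and the helper names are made up for the example.

```python
from statistics import mean, stdev

# Placeholder validation accuracies, one per seed. NOT the article's actual results.
val_accuracy = {
    "resnet34": [0.90, 0.91, 0.89, 0.90],
    "resnet101": [0.90, 0.89, 0.91, 0.90],
}

def summarize(scores):
    """Return {architecture: (mean, sample standard deviation)} across seeds."""
    return {name: (mean(vals), stdev(vals)) for name, vals in scores.items()}

def ranges_overlap(a, b):
    """True if the [min, max] ranges of two groups of scores intersect."""
    return min(a) <= max(b) and min(b) <= max(a)

for name, (mu, sigma) in summarize(val_accuracy).items():
    print(f"{name}: {mu:.3f} +/- {sigma:.3f}")
print("overlap:", ranges_overlap(val_accuracy["resnet34"], val_accuracy["resnet101"]))
```

When the ranges overlap this heavily, the metric gives you no basis for preferring one architecture over the other.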

If you’re working off of validation accuracy alone, you would choose ResNet-34 over ResNet-101: both perform equally well, and ResNet-34 is smaller and faster. But on the test set – AKA the real world – things look very different. By choosing ResNet-34, you’ve thrown out a model that has a 4% higher accuracy. That’s a missed 25% reduction in error probability!

*But ResNet-101 has a 25% lower error probability than ResNet-34.*
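A small accuracy gain can be a large relative drop in errors, which is why the two figures above differ so much. The exact test accuracies are not stated in the text; the values below (84% and 88%) are an assumption chosen so that a gain of 4 percentage points corresponds to a 25% error reduction.

```python
def relative_error_reduction(acc_baseline, acc_better):
    """Relative reduction in error rate when moving from the baseline to the better model."""
    err_baseline = 1.0 - acc_baseline
    err_better = 1.0 - acc_better
    return (err_baseline - err_better) / err_baseline

# Assumed accuracies for illustration: 16% error vs. 12% error.
reduction = relative_error_reduction(acc_baseline=0.84, acc_better=0.88)
print(f"relative error reduction: {reduction:.0%}")
```

The same 4-point gain would matter even more at higher accuracy: going from 94% to 98% would cut the error rate by two thirds.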

What if you instead do an in-depth evaluation using robustness tests? After you assign a score to the models based on their robustness to lighting changes, blurring, image quality, viewpoint changes, and noise, you see this (lower is better):

*The MLTest robustness scores of ResNet-34 and ResNet-101. The robustness score correctly predicts that ResNet-101 is better (it has a lower risk score) without seeing the test data.*

So the robustness risk score correctly predicts that ResNet-101 is more robust – and better on real data – than ResNet-34. The score is consistent: it rates every ResNet-34 worse than every ResNet-101 across all seeds. These robustness tests are run on the validation set, not the unknown test set, so there is no data leakage. Instead, they use the available data more effectively to make the evaluation more representative.
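The claim that the score rates every ResNet-34 worse than every ResNet-101 can be expressed as a pairwise comparison over all seeds. The sketch below is one way to compute it; the risk scores are placeholders rather than MLTest output, and `pairwise_separation` is a name invented for this example.

```python
from itertools import product

def pairwise_separation(better, worse):
    """Fraction of (better, worse) pairs in which the 'better' model has the lower risk score."""
    pairs = list(product(better, worse))
    return sum(b < w for b, w in pairs) / len(pairs)

# Placeholder risk scores, one per seed (lower is better). NOT actual MLTest results.
risk_resnet101 = [0.21, 0.24, 0.22]
risk_resnet34 = [0.31, 0.29, 0.33]

print(pairwise_separation(risk_resnet101, risk_resnet34))
```

A value of 1.0 means the two groups are fully separated, so the ranking does not depend on which seeds happened to be compared. A value near 0.5 would mean the score cannot tell the architectures apart.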

Getting robustness testing right isn’t straightforward. At Lakera, we’ve already done the hard work and packaged it into MLTest, a part of the Lakera platform that you can use to test your models. In just a few lines of code, you can run MLTest on your model and measure its robustness. MLTest, as the name suggests, runs a battery of diverse tests on your model, including robustness tests. You can explore their individual results using our dashboard, but you also get an overall risk score: a single number that summarizes how well your model did. This includes the robustness tests described above along with a host of other goodies.

💡 Want to assess the generalization capabilities of your own models? You can integrate MLTest in minutes.

To summarize, we showed that testing machine learning models is hard, explained why validation accuracy is flawed, and explained why robustness tests work much better. We demonstrated this on a real example where selecting your model based on validation accuracy would make you miss out on a model with a 25% lower error rate. On the same data, robustness tests identify the better model. Now go test some models! Also, feel free to get in touch with us at vv@lakera.ai.

