Test machine learning the right way: Regression testing.

In this blog series, we’ll investigate how we can better test machine learning applications. In the first post, we’ll look at what we mean by ML testing, what ML bugs are, and where they occur. We’ll also introduce the first technique for your ML testing repertoire: regression testing.

Lakera Team
December 1, 2023
July 6, 2021

Now that we have discussed data bugs, let’s focus on testing the behavior that we create from that data. In this section, we want to start investigating how we can better test ML (machine learning) applications to improve their reliability and increase performance. We begin with the first technique for your ML testing repertoire: regression testing.

What is regression testing?

Regression testing can be defined as [1]:

“...re-running functional and non-functional tests to ensure that previously developed and tested software still performs after a change.”

Imagine that you found a bug in traditional software that affected the software’s correctness and that you were able to fix it. How can you make sure that this bug doesn’t reoccur in future versions of your system? The answer is that you add a test that detects this bug to your standard test suite and, thus, prevent it from occurring again after code changes. This is called regression testing.

In ML, regression testing can be used to prevent ML bugs from recurring after you retrain a model. As datasets become more complicated and models are retrained regularly, regression tests are a good strategy for maintaining a minimum level of performance across your regression sets at all times. An easy way to get started: every time you encounter a difficult input sample for which your system outputs an incorrect decision, add it to a ‘difficult cases’ regression dataset, and make that dataset part of your testing pipeline.
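
As a minimal sketch of such a ‘difficult cases’ dataset, the Python snippet below keeps hard samples in memory and reports which ones a model still gets wrong. Everything in it is an assumption made for illustration: the names `HardCase` and `RegressionSet`, the single number standing in for a real input, and the two threshold "models" standing in for a model before and after retraining.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HardCase:
    """One input the system once got wrong, plus the decision it should make."""
    sample_id: str
    value: float      # hypothetical stand-in for a real input (image, text, ...)
    expected: str


class RegressionSet:
    """A growing collection of difficult cases, keyed by sample id."""

    def __init__(self) -> None:
        self._cases: Dict[str, HardCase] = {}

    def add(self, case: HardCase) -> None:
        # Re-adding the same sample id overwrites the entry instead of duplicating it.
        self._cases[case.sample_id] = case

    def __len__(self) -> int:
        return len(self._cases)

    def failures(self, predict: Callable[[float], str]) -> List[HardCase]:
        """Return every stored case the given model still gets wrong."""
        return [c for c in self._cases.values() if predict(c.value) != c.expected]


def model_before_fix(value: float) -> str:
    # Toy model: a threshold that is too strict for borderline inputs.
    return "positive" if value > 0.5 else "negative"


def model_after_fix(value: float) -> str:
    # Toy retrained model with a lower threshold.
    return "positive" if value > 0.3 else "negative"


regression_set = RegressionSet()
# A borderline sample that the first model misclassified in production.
regression_set.add(HardCase(sample_id="sample-001", value=0.4, expected="positive"))

if __name__ == "__main__":
    print("before fix:", [c.sample_id for c in regression_set.failures(model_before_fix)])
    print("after fix: ", [c.sample_id for c in regression_set.failures(model_after_fix)])
```

In a real pipeline the cases would live in versioned storage rather than in memory, and the `failures` check would run after every retraining.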

Example: Olympic integrity.

Consider an example where you have built a computer vision system to detect whether a runner stayed in their lane during a competition. It works well on cloudy days, but in sunny conditions you notice an image where the runner’s shadow is cast outside their lane. Your system mistakes the shadow for the runner and alerts the referee that the athlete should be disqualified. This is a machine learning bug. Before you fix it, you collect similar images to create a regression dataset, over which you evaluate your model. Then you fix this behavior by collecting more training data. Going forward, you continuously evaluate your model not only on your normal test data but also on this newly created regression test set. This way, you can ensure that, as you continue development, this particular ML bug doesn’t recur.
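
The workflow in this example can be sketched in Python as a simple release check that scores a model on both the normal test set and the shadow regression set. This is only an illustration under stated assumptions: the hand-written flags `runner_outside` and `shadow_outside` stand in for real images, the two toy functions stand in for the model before and after retraining, and the function names and thresholds are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Each example pairs hand-written scene flags (a stand-in for a real image)
# with the decision the system should make.
Scene = Dict[str, bool]
Example = Tuple[Scene, str]


def accuracy(predict: Callable[[Scene], str], examples: List[Example]) -> float:
    """Fraction of examples for which the model returns the expected decision."""
    if not examples:
        raise ValueError("cannot score an empty dataset")
    correct = sum(predict(scene) == expected for scene, expected in examples)
    return correct / len(examples)


def evaluate_release(
    predict: Callable[[Scene], str],
    test_set: List[Example],
    regression_set: List[Example],
    min_test_accuracy: float = 0.9,        # hypothetical threshold
    min_regression_accuracy: float = 1.0,  # hypothetical: known bugs must stay fixed
) -> Dict[str, object]:
    """Score the model on both datasets and decide whether it may ship."""
    test_acc = accuracy(predict, test_set)
    regression_acc = accuracy(predict, regression_set)
    passed = test_acc >= min_test_accuracy and regression_acc >= min_regression_accuracy
    return {"test_accuracy": test_acc, "regression_accuracy": regression_acc, "passed": passed}


def shadow_confused_model(scene: Scene) -> str:
    # Toy model with the bug: it treats the shadow as if it were the runner.
    outside = scene["runner_outside"] or scene["shadow_outside"]
    return "out_of_lane" if outside else "in_lane"


def retrained_model(scene: Scene) -> str:
    # Toy model after the fix: only the runner's position matters.
    return "out_of_lane" if scene["runner_outside"] else "in_lane"


# Cloudy-day test data: no shadows fall outside the lane.
TEST_SET: List[Example] = [
    ({"runner_outside": False, "shadow_outside": False}, "in_lane"),
    ({"runner_outside": True, "shadow_outside": False}, "out_of_lane"),
]

# Sunny-day regression data collected after the bug was found.
SHADOW_REGRESSION_SET: List[Example] = [
    ({"runner_outside": False, "shadow_outside": True}, "in_lane"),
    ({"runner_outside": True, "shadow_outside": True}, "out_of_lane"),
]

if __name__ == "__main__":
    print(evaluate_release(shadow_confused_model, TEST_SET, SHADOW_REGRESSION_SET))
    print(evaluate_release(retrained_model, TEST_SET, SHADOW_REGRESSION_SET))
```

The buggy toy model looks perfect on the cloudy-day test set and only fails on the regression set, which is why the two datasets are scored separately rather than merged.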

Regression testing is not only a way to guard against bugs you have already found; it can also be used proactively. As an example, envision that you want to deploy your ML system across various customer sites. How can you keep track of its performance across all customer sites at all times? More generally, how can you ensure that the ML models you develop perform well in the most important scenarios? Regression testing can come to the rescue here as well. Enter the world of Tesla…

Regression testing in practice.

In 2020, Tesla’s director of AI (artificial intelligence), Andrej Karpathy, gave a glimpse into how Tesla employs large-scale regression testing to ensure the proper performance of its Autopilot system [2]. Tesla has created an elaborate testing infrastructure that automatically creates test sets for specific scenarios, either by mining previously collected data or by getting data directly from its fleet. Tesla doesn’t only use regression testing retroactively after bug discovery but also proactively creates test sets that probe system behavior.

However, you don’t need to be Tesla to successfully apply regression testing. You can start by creating small test datasets by hand. To go back to our Olympic example, there are a few conditions that shouldn’t affect the system’s performance. Shadows crossing lanes are one of them. Additionally, the system should work equally well on male and female runners, in stadiums with red and blue tracks, etc. To ensure this is the case, you can build smaller regression datasets that include only samples with male athletes or only female athletes, only red tracks or only blue tracks, etc. Then you can easily track system performance across these subsets at all times.
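
Tracking performance across such subsets can be sketched as computing accuracy per slice and watching the gap between the best and worst slice. The snippet below is a rough Python illustration, not a prescribed implementation: the record fields (`track_color`, `label`, `prediction`), the sample records, and the function names are all made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_slice(records: Iterable[Mapping[str, str]], slice_key: str) -> Dict[str, float]:
    """Group records by the value of `slice_key` and compute accuracy per group."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        total[slice_value] += 1
        correct[slice_value] += int(record["prediction"] == record["label"])
    return {name: correct[name] / total[name] for name in total}


def largest_gap(per_slice: Mapping[str, float]) -> float:
    """Difference between the best and the worst slice accuracy."""
    return max(per_slice.values()) - min(per_slice.values())


# Hypothetical evaluation records: metadata, ground truth, and model output.
RECORDS = [
    {"track_color": "red", "label": "in_lane", "prediction": "in_lane"},
    {"track_color": "red", "label": "out_of_lane", "prediction": "out_of_lane"},
    {"track_color": "blue", "label": "in_lane", "prediction": "in_lane"},
    {"track_color": "blue", "label": "in_lane", "prediction": "out_of_lane"},
]

if __name__ == "__main__":
    per_color = accuracy_by_slice(RECORDS, "track_color")
    print(per_color)
    print("largest gap between slices:", largest_gap(per_color))
```

The same function works for any metadata field you annotate, so adding a new scenario is a matter of tagging samples rather than writing new evaluation code.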

Get started with Lakera today.

Get in touch with matthias@lakera.ai to find out more about what Lakera can do for your team, or get started right away.
