How to select the best machine learning models for computer vision?

Deep-dive into advanced comparison methods beyond standard performance metrics to build computer vision models that consistently perform over the long term.

Matthias Kraft
December 1, 2023
August 29, 2022

Given two machine learning models, which one would you pick, and why? Every machine learning engineer has to decide on hyperparameters, model architectures, which experiments to run during development, and what data to collect, annotate, and train on. These decisions are often far more challenging than expected, especially in computer vision, which is the focus of this article.

Developers often make the wrong choice due to one tricky decision that can send the entire computer vision model tumbling like a set of dominoes. Once I was preparing an upcoming customer demo. I chose a model based on a couple of performance metrics, and it seemed to outperform its contender. I also visualized a few predictions on some image sequences, and it reconfirmed my model choice.

So based on the above pieces of evidence, I believed this must be the optimal model and thought of sending it to the production team. A couple of days later, we had the demo (thankfully, it was only an internal one), and looking at the prediction behavior, I quickly realized that the behavior pattern was completely off.

While the model performed well overall, we had introduced a regression for the particular demo site. How come I did not catch this during my evaluation? Even more so, when I analyzed the contending computer vision model that I had rejected right before going into production, it did not show any performance regression. My model evaluation methods were incomplete, to say the least, and we had to dig deeper.

How do we make better decisions when comparing computer vision models?

Since then, I have been doing a lot of work on model evaluation and comparison methods to avoid such pitfalls. Over the years, I have added the following techniques to my core set of evaluation and comparison methods:

1. Standard ML metrics

Any model comparison should include the following standard metrics:

  • PR curves, precision, recall, RMSE, etc. (metrics relevant to the use case)
  • Training loss, validation loss, test loss (to assess overfitting behavior)
  • Model complexity (to consider potential runtime tradeoffs)

There are plenty of resources on the web and in textbooks on how to use and interpret these ML evaluation metrics to compare models. These model evaluation metrics are always an integral part of my evaluation process even though they are not nearly as comprehensive as required to get a good insight into the overall quality of your models.
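
As a minimal sketch of such a comparison, the snippet below computes precision and recall for two models on the same labels, plus the gap between validation and training loss as a rough overfitting signal. The labels, predictions, and loss values are made up for illustration, and the helper names `precision_recall` and `generalization_gap` are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels given as 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def generalization_gap(train_loss, val_loss):
    """Validation loss minus training loss; a large positive gap hints at overfitting."""
    return val_loss - train_loss


# Illustrative data only (not from a real experiment).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_a_pred = [1, 1, 0, 0, 0, 1, 1, 0]
model_b_pred = [1, 1, 1, 1, 1, 0, 1, 0]

metrics_a = precision_recall(y_true, model_a_pred)
metrics_b = precision_recall(y_true, model_b_pred)
gap_a = generalization_gap(train_loss=0.20, val_loss=0.35)

print("model A precision/recall:", metrics_a)
print("model B precision/recall:", metrics_b)
print("model A generalization gap:", gap_a)
```

In this toy data, model B has higher recall but lower precision than model A, so which one is "better" depends on the use case.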

2. Subgroup analysis and explicit data unit test

Standard ML metrics hide too much valuable information when deciding between multiple models (or generally when evaluating a model). That is partly because they aggregate over large datasets, so they may not accurately reflect your business and product requirements.

For instance, if you are building a computer vision project on object detection for industrial inspection and aim to roll it out across different customer sites, you need to look at the model performance on each site (to avoid situations like the one I described above). To find out which model is best, you will want to check if the model performs equally well across all components that require an inspection as well.

To do this subgroup analysis and split the performance metrics, I tend to collect as much metadata (timestamp, customer site, camera model, image dimensions, etc.) as possible for each image. Another technique I use here is to build small regression test sets (10-50 images) to track the subset performance. These regression sets can include sensitive cases or specific scenarios I want to test but have no metadata available. Learn more about that here. I want to make sure that my model performs equally well on (combinations of) these subgroups.
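
The subgroup split described above can be sketched roughly as follows. The sketch assumes each image has a record with a metadata field (here a hypothetical `site` key) and a per-image `correct` flag; the data and the helper names are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Group per-image results by a metadata field and return accuracy per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Return groups where the candidate scores worse than the baseline."""
    return {
        group: (baseline[group], candidate[group])
        for group in baseline
        if group in candidate and candidate[group] < baseline[group] - tolerance
    }


def make_records(site, outcomes):
    """Build illustrative per-image records for one site."""
    return [{"site": site, "correct": bool(ok)} for ok in outcomes]


# Illustrative data: model B looks better overall but regresses on site_2.
results_a = make_records("site_1", [1, 1, 1, 1, 0, 0]) + make_records("site_2", [1, 1])
results_b = make_records("site_1", [1, 1, 1, 1, 1, 1]) + make_records("site_2", [1, 0])

overall_a = sum(r["correct"] for r in results_a) / len(results_a)
overall_b = sum(r["correct"] for r in results_b) / len(results_b)
per_site_a = accuracy_by_group(results_a, "site")
per_site_b = accuracy_by_group(results_b, "site")
regressed = regressions(per_site_a, per_site_b)

print("overall:", overall_a, overall_b)
print("regressed sites:", regressed)
```

The aggregate accuracy favors model B, while the per-site view flags a drop on `site_2`. The same grouping works for any other metadata field, or for membership in a small regression set.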

3. Model robustness

Once your model is in production, it will inevitably encounter variations in the image input. How does the model respond to that? Even minor variations can throw your model off if it has overfitted to your training and test data. To prevent this scenario, I make sure to explicitly test model robustness with varied images and check whether the model output stays close to the original. As a minimum, I test the following:

  • Geometric variations: rotations, perspective changes, scaling, cropping, etc.
  • Lighting variations: global and local brightness and contrast changes, color changes, etc.
  • Image quality variations: noise, compression artifacts, blur, packet losses, etc.
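
A minimal sketch of such a robustness check is shown below. It perturbs an image and compares the model output against the output on the original. Everything here is a toy assumption: the image is a nested list of intensities in [0, 1], `toy_model` is a stand-in for a real model, and the tolerance is arbitrary.

```python
import random


def adjust_brightness(image, delta):
    """Lighting variation: shift every pixel by delta, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in image]


def flip_horizontal(image):
    """Geometric variation: mirror the image left to right."""
    return [row[::-1] for row in image]


def add_noise(image, sigma, seed=0):
    """Image quality variation: add Gaussian noise, clipped to [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row] for row in image]


def robustness_report(model, image, perturbations, tolerance):
    """Map each perturbation name to (output deviation, within tolerance?)."""
    reference = model(image)
    report = {}
    for name, perturb in perturbations.items():
        deviation = abs(model(perturb(image)) - reference)
        report[name] = (deviation, deviation <= tolerance)
    return report


def toy_model(image):
    """Stand-in model (assumption): mean intensity of the left half of the image."""
    half = len(image[0]) // 2
    values = [px for row in image for px in row[:half]]
    return sum(values) / len(values)


image = [[0.2, 0.2, 0.8, 0.8], [0.2, 0.2, 0.8, 0.8]]
perturbations = {
    "brightness": lambda img: adjust_brightness(img, 0.05),
    "flip": flip_horizontal,
    "noise": lambda img: add_noise(img, sigma=0.01),
}
report = robustness_report(toy_model, image, perturbations, tolerance=0.1)

for name, (deviation, ok) in report.items():
    print(f"{name}: deviation={deviation:.3f} within_tolerance={ok}")
```

The toy model tolerates the small brightness shift and noise but is sensitive to the flip. With a real model, the comparison would use whatever output distance fits the task, for example box overlap for object detection.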

As a side note, knowing where your model is not robust helps significantly when selecting data augmentations for training. In some sense, this is an easy test to see if your training pipeline is correct. It also serves as input for refining data collection and annotation.

“Better performing” machine learning models (based on the standard metrics above) often do not generalize better. They fail to handle data beyond the available dataset, ignoring or incorrectly processing variations in the input.

Gaining a good understanding of model robustness is a critical stage in selecting the optimum model.

4. Model biases/fairness

If you are building an application where biases could impact customer experience or safety, you should consider fairness metrics as part of your model comparison methods. One model may outperform another on high-level performance metrics but may include subtle predictive biases.

A recommended way to get started is to ensure that your datasets represent the operational use case. Depending on the application, you may also want to measure explicit fairness metrics such as equalized odds or predictive equality.
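
As a rough illustration of the two metrics named above: equalized odds asks that true positive and false positive rates match across groups, while predictive equality asks only that false positive rates match. The sketch below reports the largest gaps between groups. The data, the group labels, and the choice to summarize equalized odds as the larger of the two rate gaps are assumptions for illustration.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def fairness_gaps(y_true, y_pred, groups):
    """Compute the largest TPR/FPR differences between any two groups."""
    per_group = {}
    for group in set(groups):
        idx = [i for i, g in enumerate(groups) if g == group]
        per_group[group] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    tprs = [tpr for tpr, _ in per_group.values()]
    fprs = [fpr for _, fpr in per_group.values()]
    tpr_gap = max(tprs) - min(tprs)
    fpr_gap = max(fprs) - min(fprs)
    return {
        "equalized_odds_gap": max(tpr_gap, fpr_gap),
        "predictive_equality_gap": fpr_gap,
    }


# Illustrative data only: four samples per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gaps = fairness_gaps(y_true, y_pred, groups)
print(gaps)
```

A gap of zero means the rates match across groups; how large a gap is acceptable depends on the application.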

5. In-operation metrics

Production environments and configurations always add additional constraints to your computer vision application. Some that come to mind are as follows:

  • Memory footprint
  • Model inference time
  • System latency
  • GPU/CPU utilization

For instance, you have to ask yourself whether a 0.5% gain in performance is worth a model with twice the inference time.

Also, on-device performance may substantially differ from your training environment with a beefy GPU in the cloud. If you suspect a difference, model comparisons should consider on-device performance.
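
A simple way to start collecting such numbers is a small benchmark harness run on the target device. The sketch below is an assumption-laden example: `predict` is a placeholder for a real inference call, and `tracemalloc` only sees Python-level allocations, so it will not capture GPU memory or memory allocated inside native libraries.

```python
import statistics
import time
import tracemalloc


def benchmark(predict, inputs, warmup=2):
    """Measure per-call latency and peak Python memory for a predict function."""
    # Warm-up calls so one-time setup costs do not skew the timings.
    for x in inputs[:warmup]:
        predict(x)

    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # Separate pass for memory, since tracing adds overhead to timings.
    tracemalloc.start()
    predict(inputs[0])
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
        "peak_python_bytes": peak_bytes,
    }


def placeholder_predict(x):
    """Stand-in for a real model call (assumption)."""
    return sum(v * v for v in x)


inputs = [list(range(1000)) for _ in range(20)]
stats = benchmark(placeholder_predict, inputs)
print(stats)
```

Running the same harness for each candidate model on the production hardware gives a like-for-like view of the runtime cost behind any accuracy gain.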

Comparing models with MLTest

Now, evaluating all these dimensions can become quite overwhelming. I’ve been there myself.

That’s why I’m excited that we recently introduced a neat model comparison feature in MLTest to help you get a more comprehensive view of your model. It tracks all the standard ML metrics, automatically assesses model robustness and biases, and does a subset analysis on your models. It even automatically identifies failure clusters where your model performs poorly, making it possible to create a much more comprehensive comparison.

You can learn more about how MLTest can help you in comparing machine-learned computer vision models here, get started with MLTest right away, or get in touch with me at


When comparing computer vision models, take the next step and include the above criteria in your evaluation. They will help you make better decisions and ultimately build better ML systems.
