How to select the best machine learning models for computer vision?

Deep-dive into advanced comparison methods beyond standard performance metrics to build computer vision models that consistently perform over the long term.

Matthias Kraft

October 20, 2023

Last updated:

December 1, 2023

Given two machine learning models, which one would you pick, and why? Every machine learning engineer has to decide on hyperparameters, model architectures, which experiments to run during development, and what data to collect, annotate, and train on. These decisions are often far more challenging than expected, especially in computer vision, which is the focus of this article.

A single wrong call among these decisions can send an entire computer vision project tumbling like a set of dominoes. Once, I was preparing for an upcoming customer demo. I chose a model based on a couple of performance metrics, on which it seemed to outperform its contender. I also visualized a few predictions on some image sequences, which confirmed my choice.

So based on the above pieces of evidence, I believed this must be the optimal model and thought of sending it to the production team. A couple of days later, we had the demo (thankfully, it was only an internal one), and looking at the prediction behavior, I quickly realized that the behavior pattern was completely off.

While the model performed well overall, we had introduced a regression for that particular demo site. How had I not caught this during my evaluation? What's more, when I analyzed the contending computer vision model that had been rejected right before going into production, it showed no such regression. My model evaluation methods were incomplete, to say the least, and we had to dig deeper.

How do we make better decisions when comparing computer vision models?

Since then, I have been doing a lot of work on model evaluation and comparison methods to avoid such pitfalls. Over the years, I have added the following techniques to my core set of evaluation and comparison methods:

1. Standard ML metrics

Any model comparison should include the following standard metrics:

PR curves, precision, recall, RMSE, etc. (metrics relevant to the use case)
Training loss, validation loss, test loss (to assess overfitting behavior)
Model complexity (to consider potential runtime tradeoffs)
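
As a rough illustration of the first item, here is a minimal sketch that computes precision, recall, and RMSE by hand. The function names and the toy labels are assumptions made for this example; in practice you would likely rely on an established metrics library and on the metrics that matter for your use case.

```python
import math

def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def rmse(y_true, y_pred):
    """Root mean squared error for regression-style outputs."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Toy data (hypothetical): ground truth and the predictions of two models.
labels = [1, 1, 0, 0, 1]
model_a_preds = [1, 0, 0, 1, 1]
model_b_preds = [1, 1, 0, 1, 1]

metrics_a = precision_recall(labels, model_a_preds)
metrics_b = precision_recall(labels, model_b_preds)
```

Comparing `metrics_a` and `metrics_b` side by side gives the aggregate view; the sections below are about what this view leaves out.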

There are plenty of resources on the web and in textbooks on how to use and interpret these ML evaluation metrics to compare models. These model evaluation metrics are always an integral part of my evaluation process even though they are not nearly as comprehensive as required to get a good insight into the overall quality of your models.

2. Subgroup analysis and explicit data unit tests

Standard ML metrics hide too much valuable information when deciding between multiple models (or generally when evaluating a model). Partially that is because they look at aggregate metrics over large datasets. So, they may not accurately reflect your business and product requirements.

For instance, if you are building a computer vision project on object detection for industrial inspection and aim to roll it out across different customer sites, you need to look at the model performance on each site (to avoid situations like the one I described above). To find out which model is best, you will want to check if the model performs equally well across all components that require an inspection as well.

To do this subgroup analysis and split the performance metrics, I collect as much metadata (timestamp, customer site, camera model, image dimensions, etc.) as possible for each image. Another technique I use is to build small regression test sets (10-50 images) to track performance on specific subsets. These regression sets can cover sensitive cases or specific scenarios I want to test but for which no metadata is available. The goal is to make sure that my model performs equally well on these subgroups and on combinations of them.
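
The per-site breakdown described above can be sketched as follows. This is a simplified example, not a full evaluation pipeline: the record layout (`site`, `label`, `pred`), the helper names, and the toy data are all assumptions, and plain accuracy stands in for whatever metric fits your task.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Accuracy per subgroup; each record is a dict holding metadata, 'label', and 'pred'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["label"] == record["pred"])
    return {group: hits[group] / totals[group] for group in totals}

def find_regressions(baseline, candidate, tolerance=0.0):
    """Subgroups where the candidate scores lower than the baseline by more than tolerance."""
    return {
        group: (baseline[group], candidate[group])
        for group in baseline
        if group in candidate and candidate[group] < baseline[group] - tolerance
    }

def make_records(site, labels, preds):
    """Hypothetical helper that attaches site metadata to each prediction."""
    return [{"site": site, "label": l, "pred": p} for l, p in zip(labels, preds)]

# Toy data: the candidate is better overall (6/8 vs 5/8) but fails on site_b.
baseline_records = (
    make_records("site_a", [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1])
    + make_records("site_b", [1, 0], [1, 0])
)
candidate_records = (
    make_records("site_a", [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0])
    + make_records("site_b", [1, 0], [0, 1])
)

baseline_scores = accuracy_by_group(baseline_records, "site")
candidate_scores = accuracy_by_group(candidate_records, "site")
regressed = find_regressions(baseline_scores, candidate_scores)
```

In this toy setup the candidate wins on the aggregate number, yet `regressed` flags `site_b`, which is the kind of situation the demo story above describes. The same grouping works for any other metadata key, such as camera model.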

3. Model robustness

Once your model is in production, it will inevitably encounter variations in the image input. How does the model respond to that? Even minor variations can throw your model off if it has overfitted to your training and test data. To prevent this scenario, I make sure to explicitly test model robustness with varied images and check whether the model output stays close to the original. As a minimum, I test the following:

Geometric variations: rotations, perspective changes, scaling, cropping, etc.
Lighting variations: global and local brightness and contrast changes, color changes, etc.
Image quality variations: noise, compression artifacts, blur, packet loss, etc.
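
A minimal sketch of such a robustness check, covering one variation from each category above, might look like this. It represents a grayscale image as nested lists of floats in [0, 1] to stay dependency-free, and `toy_model` is a hypothetical stand-in that returns the mean intensity as its score; a real test would call your own model and use an image library for the perturbations.

```python
import random

def clamp(value):
    """Keep a pixel value inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))

def flip_horizontal(image):
    """Geometric variation: mirror each row."""
    return [row[::-1] for row in image]

def adjust_brightness(image, delta):
    """Lighting variation: shift every pixel by delta."""
    return [[clamp(px + delta) for px in row] for row in image]

def add_noise(image, sigma, seed=0):
    """Image quality variation: additive Gaussian noise with a fixed seed."""
    rng = random.Random(seed)
    return [[clamp(px + rng.gauss(0.0, sigma)) for px in row] for row in image]

def robustness_report(model, image, perturbations):
    """Absolute change in model output for each named perturbation."""
    base = model(image)
    return {name: abs(model(fn(image)) - base) for name, fn in perturbations.items()}

def toy_model(image):
    """Hypothetical model: mean pixel intensity as the output score."""
    pixels = [px for row in image for px in row]
    return sum(pixels) / len(pixels)

image = [[0.2, 0.4], [0.6, 0.8]]
perturbations = {
    "flip": flip_horizontal,
    "brighter": lambda img: adjust_brightness(img, 0.1),
    "noisy": lambda img: add_noise(img, 0.05),
}
report = robustness_report(toy_model, image, perturbations)
```

Each entry in `report` is the output drift for one perturbation. Running the same report for two candidate models, with a drift threshold you consider acceptable, shows which one stays closer to its original predictions.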

As a side note, knowing where your model is not robust helps significantly in selecting the data augmentations to use during training. In some sense, this is an easy test to see if your training pipeline is correct. It also serves as input for refining data collection and annotation.

“Better performing” machine learning models (based on the standard metrics above) often do not generalize better. They fail to handle data beyond the available dataset, ignoring or incorrectly processing variations in the input.

Gaining a good understanding of model robustness is a critical stage in selecting the optimum model.

4. Model biases/fairness

If you are building an application where biases could impact customer experience or safety, you should consider fairness metrics as part of your model comparison methods. One model may outperform another on high-level performance metrics but may include subtle predictive biases.

A recommended way to get started is to ensure that your datasets represent the operational use case. Depending on the application, you may also want to measure explicit fairness metrics such as equalized odds or predictive equality.
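
To make these two metrics concrete: equalized odds asks that true positive and false positive rates match across groups, while predictive equality only asks that false positive rates match. The sketch below measures how far a model is from each. The function names, the group labels, and the toy data are assumptions for illustration.

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def fairness_gaps(y_true, y_pred, groups):
    """Largest between-group differences in TPR and FPR."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        truths, preds = by_group.setdefault(g, ([], []))
        truths.append(t)
        preds.append(p)
    rates = {g: tpr_fpr(ts, ps) for g, (ts, ps) in by_group.items()}
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    tpr_gap = max(tprs) - min(tprs)
    fpr_gap = max(fprs) - min(fprs)
    return {
        "equalized_odds_gap": max(tpr_gap, fpr_gap),  # 0 means equal TPR and FPR
        "predictive_equality_gap": fpr_gap,           # 0 means equal FPR
    }

# Toy data (hypothetical): the model is perfect on group "a" but not on group "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gaps = fairness_gaps(y_true, y_pred, groups)
```

A gap of zero means the groups are treated alike under that criterion. When comparing two models, a larger gap on the one with the better headline metric is a signal to look closer before choosing it.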

5. In-operation metrics

Production environments and configurations always add additional constraints to your computer vision application. Some that come to mind are as follows:

Memory footprint
Model inference time
System latency
GPU/CPU utilization

For instance, you have to ask yourself whether a 0.5% gain in performance justifies choosing a model with twice the inference time.

Also, on-device performance may substantially differ from your training environment with a beefy GPU in the cloud. If you suspect a difference, model comparisons should consider on-device performance.
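
One simple way to get inference-time numbers is to time the prediction call directly on the target device. The sketch below is an assumption-laden starting point: `predict` is any callable wrapping your model, `dummy_predict` is a hypothetical placeholder workload, and the warm-up and run counts are arbitrary defaults. It measures wall-clock latency only, not memory or GPU utilization.

```python
import math
import statistics
import time

def measure_latency(predict, sample, warmup=3, runs=20):
    """Median and approximate 95th percentile latency of predict(sample), in milliseconds."""
    for _ in range(warmup):
        predict(sample)  # warm-up calls are not timed (caches, lazy initialization)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = max(0, math.ceil(0.95 * len(timings)) - 1)  # nearest-rank percentile
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
        "runs": runs,
    }

def dummy_predict(sample):
    """Hypothetical stand-in for a model's inference call."""
    return sum(x * x for x in sample)

latency = measure_latency(dummy_predict, list(range(1000)))
```

Running the same measurement for each candidate model on the deployment hardware, rather than on the training machine, gives a like-for-like basis for the tradeoff discussed above.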

Comparing models with MLTest

Now, evaluating all these dimensions can become quite overwhelming. I’ve been there myself.

That’s why I’m excited that we recently introduced a neat model comparison feature in MLTest to help you get a more comprehensive view of your model. It tracks all the standard ML metrics, automatically assesses model robustness and biases, and does a subset analysis on your models. It even automatically identifies failure clusters where your model performs poorly, making it possible to create a much more comprehensive comparison.

You can learn more about how MLTest can help you in comparing machine-learned computer vision models here, get started with MLTest right away, or get in touch with me at matthias@lakera.ai.

Conclusion

When comparing computer vision models, take the next step and include the above criteria in your evaluation. They will help you make better decisions and ultimately build better ML systems.
