
How to select the best machine learning models for computer vision?

Deep-dive into advanced comparison methods beyond standard performance metrics to build computer vision models that consistently perform over the long term.

Matthias Kraft
December 1, 2023

Given two machine learning models, which one would you pick, and why? Every machine learning engineer faces decisions about which combination of hyperparameters and model architectures to choose, which experiments to run during development, and what data to collect, annotate, and train on. These decisions are more often than not far more challenging than expected, especially in the context of computer vision, which is the focus of this article.

Developers often make the wrong choice because a single tricky decision can send the entire computer vision model tumbling like a set of dominoes. I experienced this myself while preparing an upcoming customer demo. I chose a model based on a couple of performance metrics, and it seemed to outperform its contender. I also visualized a few predictions on some image sequences, which reconfirmed my model choice.

Based on these pieces of evidence, I believed this must be the optimal model and handed it over to the production team. A couple of days later, we had the demo (thankfully, it was only an internal one), and looking at the predictions, I quickly realized that the model's behavior was completely off.

While the model performed well overall, it turned out that we had introduced a regression for that particular demo site. How had I not caught this during my evaluation? Worse still, when I analyzed the contending computer vision model that had been rejected right before going into production, it showed no performance regression at all. My model evaluation methods were incomplete, to say the least, and we had to dig deeper.

How do we make better decisions when comparing computer vision models?

Since then, I have been doing a lot of work on model evaluation and comparison methods to avoid such pitfalls. Over the years, I have added the following techniques to my core set of evaluation and comparison methods:

1. Standard ML metrics

Any model comparison should include the following standard metrics:

  • PR curves, precision, recall, RMSE, etc. (whichever metrics are relevant to the use case)
  • Training loss, validation loss, test loss (to assess overfitting behavior)
  • Model complexity (to consider potential runtime tradeoffs)

There are plenty of resources on the web and in textbooks on how to use and interpret these ML evaluation metrics to compare models. These model evaluation metrics are always an integral part of my evaluation process even though they are not nearly as comprehensive as required to get a good insight into the overall quality of your models.
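As a minimal illustration, the sketch below computes precision, recall, and a precision-recall curve with scikit-learn. The label and score arrays, as well as the 0.5 threshold, are hypothetical stand-ins for your own validation outputs.

```python
# Minimal sketch (assumes scikit-learn); y_true and y_scores are hypothetical
# stand-ins for your validation labels and model confidence scores.
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             precision_score, recall_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.55, 0.2])

# Precision and recall at a fixed operating point (threshold chosen for illustration)
y_pred = (y_scores >= 0.5).astype(int)
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))

# Threshold-free view: the full PR curve and its average-precision summary
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print("average precision:", average_precision_score(y_true, y_scores))
```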

2. Subgroup analysis and explicit data unit tests

Standard ML metrics hide too much valuable information when you are deciding between multiple models (or evaluating a single model). That is partly because they aggregate performance over large datasets, so they may not accurately reflect your business and product requirements.

For instance, if you are building a computer vision project on object detection for industrial inspection and aim to roll it out across different customer sites, you need to look at the model performance on each site (to avoid situations like the one I described above). To find out which model is best, you will want to check if the model performs equally well across all components that require an inspection as well.

To do this subgroup analysis and split the performance metrics, I tend to collect as much metadata (timestamp, customer site, camera model, image dimensions, etc.) as possible for each image. Another technique I use here is to build small regression test sets (10-50 images) to track subset performance. These regression sets can cover sensitive cases or specific scenarios I want to test but for which no metadata is available. Learn more about that here. I want to make sure that my model performs equally well on (combinations of) these subgroups.
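To make this concrete, here is a small sketch of how per-subgroup metrics could be computed from image-level metadata with pandas. The column names and the per-image correctness flag are hypothetical placeholders for whatever metadata and metrics you track.

```python
# Sketch of a subgroup breakdown (assumes pandas); the metadata columns and the
# per-image "correct" flag are hypothetical placeholders for your own data.
import pandas as pd

results = pd.DataFrame({
    "customer_site": ["site_a", "site_a", "site_b", "site_b", "site_c"],
    "camera_model":  ["cam_1",  "cam_2",  "cam_1",  "cam_1",  "cam_2"],
    "correct":       [1, 1, 0, 1, 1],  # 1 if the prediction matched the label
})

# Accuracy per customer site and per (site, camera) combination
print(results.groupby("customer_site")["correct"].mean())
print(results.groupby(["customer_site", "camera_model"])["correct"].mean())

# Flag subgroups that fall below a (hypothetical) acceptance threshold
per_site = results.groupby("customer_site")["correct"].mean()
print("regressing sites:", per_site[per_site < 0.9].index.tolist())
```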


3. Model robustness

Once your model is in production, it will inevitably encounter dynamic variations in the image input. How does the model respond to that? If you have overfitted to your training and test data, even minor variations will throw your model off. To prevent this scenario, I explicitly test model robustness with varied images and check whether the model output stays close to the original. At a minimum, I test the following (see the sketch after this list):

  • Geometric variations: rotations, perspective changes, scaling, cropping, etc.
  • Lighting variations: global and local brightness and contrast changes, color changes, etc.
  • Image quality variations: noise, compression artifacts, blur, packet loss, etc.
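Below is a rough sketch of what such a minimal robustness check could look like with OpenCV and NumPy. The `model` callable, the perturbation parameters, and the drift tolerance are all hypothetical assumptions, not a prescribed recipe.

```python
# Robustness-check sketch (assumes OpenCV and NumPy); `model` is a hypothetical
# callable that maps a BGR image to a vector of class probabilities.
import cv2
import numpy as np

def perturbations(image):
    """Yield (name, perturbed_image) pairs covering a few variation types."""
    h, w = image.shape[:2]
    yield "rotate_5deg", cv2.warpAffine(
        image, cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0), (w, h))
    yield "brightness_up", cv2.convertScaleAbs(image, alpha=1.0, beta=30)
    yield "gaussian_noise", np.clip(
        image.astype(np.float32) + np.random.normal(0, 10, image.shape), 0, 255
    ).astype(np.uint8)
    yield "jpeg_q30", cv2.imdecode(
        cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, 30])[1],
        cv2.IMREAD_COLOR)

def robustness_report(model, image, tolerance=0.05):
    """Compare model outputs on perturbed images against the original image."""
    baseline = model(image)
    for name, perturbed in perturbations(image):
        drift = float(np.max(np.abs(model(perturbed) - baseline)))
        status = "OK" if drift <= tolerance else "NOT ROBUST"
        print(f"{name:15s} max output drift = {drift:.3f} ({status})")
```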

As a side note, knowing where your model is not robust helps significantly when selecting data augmentations for training. In some sense, this is also an easy test of whether your training pipeline is correct, and it serves as input for refining data collection and annotation.

“Better performing” machine learning models (based on the standard metrics above) often do not generalize better. They cannot grasp data beyond the available dataset, ignoring or failing to correctly process variations in the input.

Gaining a good understanding of model robustness is a critical stage in selecting the optimum model.

4. Model biases/fairness

If you are building an application where biases could impact customer experience or safety, you should consider fairness metrics as part of your model comparison methods. One model may outperform another on high-level performance metrics but may include subtle predictive biases.

A recommended way to get started is to ensure that your datasets represent the operational use case. Depending on the application, you may also want to measure explicit fairness metrics such as equalized odds or predictive equality.
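As a rough illustration, equalized odds can be approximated by comparing true-positive and false-positive rates across groups and checking that both gaps stay small. The group labels, ground truth, and predictions below are hypothetical placeholders.

```python
# Equalized-odds sketch (NumPy only); groups, labels, and predictions are
# hypothetical placeholders for your own evaluation data.
import numpy as np

groups = np.array(["a", "a", "a", "b", "b", "b", "b", "a"])
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])

def rates(mask):
    """True-positive and false-positive rate for one subgroup."""
    t, p = y_true[mask], y_pred[mask]
    tpr = p[t == 1].mean() if np.any(t == 1) else float("nan")
    fpr = p[t == 0].mean() if np.any(t == 0) else float("nan")
    return tpr, fpr

per_group = {g: rates(groups == g) for g in np.unique(groups)}
tprs = [tpr for tpr, _ in per_group.values()]
fprs = [fpr for _, fpr in per_group.values()]

# Equalized odds asks that both gaps be close to zero across groups.
print("TPR gap between groups:", max(tprs) - min(tprs))
print("FPR gap between groups:", max(fprs) - min(fprs))
```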

5. In-operation metrics

Production environments and configurations always add additional constraints to your computer vision application. Some that come to mind are as follows:

  • Memory footprint
  • Model inference time
  • System latency
  • GPU/CPU utilization

For instance, you have to ask yourself whether a model with twice the inference time is really your preferred choice just to gain 0.5% in performance.

Also, on-device performance may substantially differ from your training environment with a beefy GPU in the cloud. If you suspect a difference, model comparisons should consider on-device performance.
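As one illustration, a rough latency comparison between two candidate models might look like the sketch below. It assumes PyTorch; `model_a`, `model_b`, and the input shape are hypothetical, and real measurements should be taken on the actual target device and hardware configuration.

```python
# Rough latency benchmark sketch (assumes PyTorch); the models and input shape
# are hypothetical placeholders; measure on the actual target device.
import time
import torch

def measure_latency(model, input_shape=(1, 3, 640, 640), warmup=10, runs=50):
    """Average inference time per forward pass, in seconds."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):  # warm-up iterations to stabilize caches
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        return (time.perf_counter() - start) / runs

# Example: is twice the latency worth a 0.5% gain in a performance metric?
# print(f"model A: {measure_latency(model_a) * 1000:.1f} ms")
# print(f"model B: {measure_latency(model_b) * 1000:.1f} ms")
```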

Comparing models with MLTest

Now, evaluating all these dimensions can become quite overwhelming. I’ve been there myself.

That’s why I’m excited that we recently introduced a neat model comparison feature in MLTest to help you get a more comprehensive view of your model. It tracks all the standard ML metrics, automatically assesses model robustness and biases, and does a subset analysis on your models. It even automatically identifies failure clusters where your model performs poorly, making it possible to create a much more comprehensive comparison.

[Embedded demo: Comparing Computer Vision Models with MLTest]

You can learn more about how MLTest can help you in comparing machine-learned computer vision models here, get started with MLTest right away, or get in touch with me at matthias@lakera.ai.

Conclusion

When comparing computer vision models, take the next step and include the above criteria in your evaluation. They will help you make better decisions and ultimately build better ML systems.
