Given you have two machine learning models, which one would you pick and why? Every machine learning engineer encounters the decisions to choose an optimal combination of hyperparameters, model architectures, optimal experiments during development, and what data to collect, annotate, and train on. These decisions are more often than not a lot more challenging than expected, especially in the context of computer vision, which is the focus of this article.
Developers often make the wrong choice due to one tricky decision that can send the entire computer vision model tumbling like a set of dominoes. Once I was preparing an upcoming customer demo. I chose a model based on a couple of performance metrics, and it seemed to outperform its contender. I also visualized a few predictions on some image sequences, and it reconfirmed my model choice.
So based on the above pieces of evidence, I believed this must be the optimal model and thought of sending it to the production team. A couple of days later, we had the demo (thankfully, it was only an internal one), and looking at the prediction behavior, I quickly realized that the behavior pattern was completely off.
While performing well overall, it turned out that we had introduced a regression for the particular demo site. How come I did not catch this during my evaluation? Even more so, when I analyzed the contending computer vision model that got rejected right before going into production, it turned out that it did not show any performance regression. So it turned out my model evaluation methods were incomplete – to say the least – and we had to dig deeper.
Since then, I have been doing a lot of work on model evaluation and comparison methods to avoid such pitfalls. Over the years, I have added the following techniques to my core set of evaluation and comparison methods:
Any model comparison should include the following standard metrics:
There are plenty of resources on the web and in textbooks on how to use and interpret these ML evaluation metrics to compare models. These model evaluation metrics are always an integral part of my evaluation process even though they are not nearly as comprehensive as required to get a good insight into the overall quality of your models.
Standard ML metrics hide too much valuable information when deciding between multiple models (or generally when evaluating a model). Partially that is because they look at aggregate metrics over large datasets. So, they may not accurately reflect your business and product requirements.
For instance, if you are building a computer vision project on object detection for industrial inspection and aim to roll it out across different customer sites, you need to look at the model performance on each site (to avoid situations like the one I described above). To find out which model is best, you will want to check if the model performs equally well across all components that require an inspection as well.
To do this subgroup analysis and split the performance metrics, I tend to collect as much metadata (timestamp, customer site, camera model, image dimensions, etc.) as possible for each image. Another technique I use here is to build small regression test sets (10-50 images) to track the subset performance. These regression sets can include sensitive cases or specific scenarios I want to test but have no metadata available. Learn more about that here. I want to make sure that my model performs equally well on (combinations of) these subgroups.
Once your model is in production, it will inevitably encounter dynamic variations in the image input. How does the model respond to that? Even minor variations will throw your model off if you have overfitted your training and test data. To prevent this scenario, I ensure to explicitly test model robustness with varied images and check if the model output stays close to the original. For minimal testing, I execute the following:
As a side note, knowing where your model is not robust significantly helps to select the data augmentations during training. In some sense, this is an easy test to see if your training pipeline is correct. It also supports as input to refining data collection and annotation.
“Better performing” machine learning models (based on the standard metrics above) often do not generalize better. They cannot grasp data beyond the available dataset, ignoring, or failing to correctly process variations in the input.
Gaining a good understanding of model robustness is a critical stage in selecting the optimum model.
If you are building an application where biases could impact customer experience or safety, you should consider fairness metrics as part of your model comparison methods. One model may outperform another on high-level performance metrics but may include subtle predictive biases.
A recommended way to get started is to ensure that your datasets represent the operational use case. Depending on the application, you may also want to measure explicit fairness metrics such as equalized odds or predictive equality.
Production environments and configurations always add additional constraints to your computer vision application. Some that come to mind are as follows:
For instance, you have to ask yourself if a model with twice the inference time is your preferred model to optimize for a 0.5% gain in performance.
Also, on-device performance may substantially differ from your training environment with a beefy GPU in the cloud. If you suspect a difference, model comparisons should consider on-device performance.
Now, evaluating all these dimensions can become quite overwhelming. I’ve been there myself.
That’s why I’m excited that we recently introduced a neat model comparison feature in MLTest to help you get a more comprehensive view of your model. It tracks all the standard ML metrics, automatically assesses model robustness and biases, and does a subset analysis on your models. It even automatically identifies failure clusters where your model performs poorly, making it possible to create a much more comprehensive comparison.
When comparing computer vision models, take the next step and include the above criteria in your evaluation. They will help you make better decisions and ultimately build better ML systems.
Download this guide to delve into the most common LLM security risks and ways to mitigate them.
Subscribe to our newsletter to get the recent updates on Lakera product and other news in the AI LLM world. Be sure you’re on track!
Lakera Guard protects your LLM applications from cybersecurity risks with a single line of code. Get started in minutes. Become stronger every day.
Several people are typing about AI/ML security. Come join us and 1000+ others in a chat that’s thoroughly SFW.