Written By Justin Deschenaux
We have updated this article to include the new YOLOv8 models. The update adds an extensive model evaluation and robustness benchmark of YOLOv8 models of different sizes (n, s, m, l, x), compared against YOLOv5 and DETR. Spoiler: YOLOv8’s performance improvements did not bring a corresponding improvement in model robustness.
We tested the robustness of state-of-the-art computer vision models to assess their generalization ability. Here is what we found.
State-of-the-art pre-trained object detection models can be easily fine-tuned to achieve competitive ML metrics on our own validation datasets. But what does it take to prepare these for production? What are the potential weaknesses of these models that we should be aware of? And how do those impact your choice of a model to fine-tune on your own datasets?
💡 Want to test your own computer vision models? You can integrate MLTest with your own models and data in minutes.
To find out, we took several standard high-performance open source models like Ultralytics’ YOLOv8 and YOLOv5, and Meta’s DETR for a test drive with MLTest to benchmark their generalization capabilities.
But before we get there, why is ML testing and robustness testing important to assess model generalization?
Today the following workflow is a common experience for us computer vision developers:
When such models are launched into real-world environments, however, “unknown unknowns” are often encountered, and the models start surfacing issues.
So the main challenge becomes assessing model generalization during development: how will the model behave once it is in the complex real world?
In this blog, after dissecting the robustness properties of state-of-the-art computer vision models, we will argue that the gap between a pre-trained model and a real-world high-performer is often significant. As a result, fine-tuning such a model on our initial datasets is only the beginning, and most of the work lies ahead.
The good news: while validation metrics only provide limited insights into real-world model performance, many of the issues leading to poor model generalization are identifiable before release.
Variations in image properties like lighting, motion artifacts, or image quality loss are ubiquitous in production. The real-world data distribution is undoubtedly richer, and a moving, ever-evolving target. By measuring the ML robustness of your system to such factors, you can assess the risk that your model is overfitting to the properties of your data distribution, as well as its ability to handle the inevitable variations it will encounter. Low robustness is indicative of poor generalization.
We ran Lakera’s MLTest to assess the robustness of our candidate models. As part of this, MLTest generated universal robustness tests based on image modifications that are barely perceptible to the naked eye and occur frequently during operation, without using white-box or black-box adversarial attacks. Judge for yourself: can you tell which of the following are original COCO images and which have been modified?
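To make "barely perceptible" modifications concrete, here is a minimal sketch of three mild image perturbations (a small brightness shift, low-amplitude noise, and a light blur) written with NumPy only. This is not MLTest's implementation: the function names, the parameter values, and the random stand-in image are all assumptions chosen for illustration.

```python
import numpy as np


def adjust_brightness(image, delta):
    """Shift every pixel by `delta` (in 0-255 units) and clip to the valid range."""
    return np.clip(image.astype(np.int16) + delta, 0, 255).astype(np.uint8)


def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def box_blur(image, radius=1):
    """Average each pixel with its (2*radius+1)^2 neighbourhood, using edge padding."""
    k = 2 * radius + 1
    height, width = image.shape[:2]
    padded = np.pad(
        image.astype(np.float32),
        ((radius, radius), (radius, radius), (0, 0)),
        mode="edge",
    )
    out = np.zeros(image.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + height, dx:dx + width, :]
    return np.clip(np.rint(out / (k * k)), 0, 255).astype(np.uint8)


def mean_abs_change(original, perturbed):
    """Average per-pixel absolute difference, in 0-255 units."""
    diff = original.astype(np.int16) - perturbed.astype(np.int16)
    return float(np.mean(np.abs(diff)))


# Stand-in for a real photo: a random 64x64 RGB image (assumption for the demo).
image = np.random.default_rng(42).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Hypothetical, deliberately mild parameter values.
variants = {
    "brightness": adjust_brightness(image, 8),
    "noise": add_gaussian_noise(image, sigma=3.0),
    "blur": box_blur(image, radius=1),
}

for name, variant in variants.items():
    print(f"{name}: mean absolute pixel change = {mean_abs_change(image, variant):.2f}")
```

On a real photograph, you would feed both the original and each variant to the detector and compare the predictions. A random-noise image like the stand-in above exaggerates the effect of blur, since it has no smooth regions.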
So which models did we put to the test?
We benchmarked the robustness of the following models:
For reference, these models achieve the following competitive validation metrics on COCO:
YOLOv8 n   43.94   31.74   34.89
YOLOv8 s   53.03   39.01   41.04
YOLOv8 m   59.43   44.87   45.97
YOLOv8 l   62.04   47.37   47.74
YOLOv8 x   63.57   48.49   48.50
YOLOv5 n   46.54   27.70   31.07
YOLOv5 s   55.96   36.20   38.65
YOLOv5 m   63.01   43.61   44.50
YOLOv5 l   65.91   47.27   47.52
YOLOv5 x   67.54   48.89   49.15
DETR S     59.45   41.67   46.08
DETR M     59.55   43.49   47.12
DETR L     60.45   43.42   46.89
DETR H     61.00   43.84   47.67
Aggregate metrics of the candidate models. We use our own implementation of COCO mAP and related metrics, so the numbers may differ from those reported elsewhere.
The COCO validation set, however, does not represent the real world. What are these metrics hiding? What is the likelihood that the model will generalize once released into the wild or fine-tuned?
The following plots summarize MLTest’s risk score for different models and model robustness tests. The score ranges from 0 to 100, where 100 represents the highest risk and 0 the lowest. It is the percentage of the dataset on which the model’s behavior is heavily impacted by MLTest’s robustness tests. We plot the aggregate risk score (lower is better) computed by MLTest for all main risk factors, each of which consists of several individual tests:
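As a rough illustration of how such a percentage could be computed, the sketch below counts the images whose per-image quality score drops sharply under a perturbation. The article does not spell out how MLTest defines "heavily impacted", so the relative-drop criterion, the `max_relative_drop` threshold, and the example scores are assumptions.

```python
import numpy as np


def risk_score(clean_scores, perturbed_scores, max_relative_drop=0.5):
    """Percentage of images whose score drops by more than `max_relative_drop`.

    `clean_scores` and `perturbed_scores` hold one quality score per image
    (for example a per-image detection F1), before and after a perturbation.
    The threshold is a hypothetical choice, not MLTest's definition.
    """
    clean = np.asarray(clean_scores, dtype=float)
    perturbed = np.asarray(perturbed_scores, dtype=float)
    if clean.shape != perturbed.shape:
        raise ValueError("score arrays must have the same shape")
    if clean.size == 0:
        raise ValueError("at least one image is required")

    # Images on which the model already scores zero cannot degrade further,
    # so their relative drop is defined as zero.
    safe_clean = np.where(clean > 0, clean, 1.0)
    relative_drop = np.where(clean > 0, (clean - perturbed) / safe_clean, 0.0)

    impacted = relative_drop > max_relative_drop
    return 100.0 * float(impacted.mean())


# Made-up scores for four images: two of them degrade by more than half.
clean = [0.9, 0.8, 0.7, 0.6]
perturbed = [0.85, 0.3, 0.7, 0.1]
print(risk_score(clean, perturbed))  # 50.0
```

A score defined this way is a fraction of affected images rather than an average drop, so a model that fails badly on a few images and one that degrades slightly everywhere are distinguished.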
Here are a few side-by-side examples of how the smallest image changes affect model performance, as identified by MLTest.
We take away a couple of insights from our experiments:
Mild transformations have significant effects on model robustness.
As the plots above show, mild transformations have a dramatic impact on model robustness for both YOLO and DETR models. YOLO models become more robust as size increases, though not uniformly: for example, they become less robust to low image quality as size increases. DETR models do not become more robust as size increases.
Robustness issues are found even if training-time augmentations are used.
Interestingly, for YOLO models, this also applies to augmentations that were used during training (e.g. median blur, equalization, grayscale). This may seem unexpected, but it is not entirely surprising: adding a few lines of code with these augmentations is not a silver bullet.
We should explicitly test that these augmentations have the intended effect and calibrate the augmentation pipeline carefully. Interestingly, these augmentations also do not trivially transfer: while median blur was used during training, the overall blur risk factor still fared poorly. Based on our code inspection, very few augmentations were used during training in the case of DETR models. Does this explain the robustness issues we observe here?
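One way to test explicitly whether a training-time augmentation had the intended effect is to evaluate the model on clean data, re-evaluate it on transformed copies, and flag any transformation that costs more than a chosen tolerance. The sketch below uses a toy brightness classifier in place of a detector. The helper name `check_augmentation_robustness`, the `tolerance` value, and the toy data are hypothetical.

```python
import numpy as np


def check_augmentation_robustness(evaluate, dataset, transforms, tolerance=0.02):
    """Compare a metric on clean data with the same metric under each transform.

    `evaluate` maps a list of (image, label) pairs to a score (higher is better).
    A transform passes if the score drops by at most `tolerance`.
    """
    baseline = evaluate(dataset)
    report = {}
    for name, transform in transforms.items():
        transformed = [(transform(image), label) for image, label in dataset]
        score = evaluate(transformed)
        drop = baseline - score
        report[name] = {"score": score, "drop": drop, "passed": drop <= tolerance}
    return baseline, report


# Toy stand-in for a model: predicts "bright" (1) if the mean pixel value > 127.
def toy_accuracy(samples):
    correct = [int(image.mean() > 127) == label for image, label in samples]
    return float(np.mean(correct))


def to_grayscale(image):
    """Replace each pixel by its channel mean, replicated over three channels."""
    gray = image.astype(np.float32).mean(axis=2, keepdims=True)
    return np.repeat(gray, 3, axis=2).astype(np.uint8)


def darken(image):
    """Scale pixel intensities down by 20%."""
    return (image.astype(np.float32) * 0.8).astype(np.uint8)


# Four constant-valued images with labels: 0 = dark, 1 = bright.
dataset = [
    (np.full((4, 4, 3), value, dtype=np.uint8), label)
    for value, label in [(60, 0), (120, 0), (140, 1), (200, 1)]
]

baseline, report = check_augmentation_robustness(
    toy_accuracy, dataset, {"grayscale": to_grayscale, "darken": darken}
)
print("baseline:", baseline)
for name, result in report.items():
    print(name, result)
```

In this toy setup the grayscale transform leaves accuracy untouched, while darkening pushes one borderline image across the decision threshold and fails the check. With a real detector, `evaluate` would compute a detection metric and the same report would show which augmentations actually transferred into robustness.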
Larger models achieve higher metric scores but not higher model robustness.
We see a consistent pattern: larger models are not universally better. Within the YOLO family, robustness does not increase significantly or consistently as models grow, and for some risk factors, such as image quality, it even gets worse.
Transformer-based models achieve better metrics than YOLO on the validation set but fare much worse in terms of model robustness. These properties should be taken into account when selecting the core of a production model.
Our experiments indicate that these pre-trained systems are likely far from robust. This has implications when choosing a model to fine-tune: the models with the highest validation metrics may not be the most robust, and thus may generalize poorly on your specific problem. A few practices that help us build systems that generalize to the real world:
We can’t wait to test some more SOTA models very soon. So stay tuned for updates here!
MLTest is the easiest way to assess the generalization capabilities of your models. You can learn more about it here or get started right away. Also, feel free to get in touch with us at firstname.lastname@example.org.