Testing machine learning (ML) systems is currently more of an art form than a standardized engineering practice. This is particularly problematic for machine learning in mission-critical contexts. In these use cases, strict performance guarantees and regulatory compliance are a must. The best engineering teams at companies like Tesla have built sophisticated testing infrastructure to ensure the reliability of their ML systems. Now it is time to make effective and systematic ML testing a reality for the rest of us–including smaller engineering teams–as well.
This article summarizes three steps from our ML testing series that any development team can take when testing their ML systems:
Specify your operational domain 📝
- Systematic testing is most effective in the context of an operational domain. An operational domain describes the “specific conditions under which a given [...] automation system is designed to function” [1]. It is a more compact representation of the environment in which the system will operate. This can also include what data it will be exposed to and its user interactions. Specification of the operational domain can be used to detect data bugs. It is also the starting point for establishing reliability guarantees for the whole system.
Stress-test your system 🦾
- With the operational domain specified, we can check the robustness of the system within the relevant conditions. As with traditional software systems, bugs often appear when a problematic input is presented to the system. Fuzz testing is a popular strategy that looks for these problematic inputs by randomly generating data points both inside and outside of the operational domain. In combination with metamorphic relations, it becomes a powerful tool that developers can use to ensure that their system performs well enough within the operational domain and degrades gracefully when presented with more challenging inputs.
Ensure your system performs when it really matters ✋
- The machine learning development process is inherently complex and iterative. Regression sets provide an effective tool for ensuring that your system really gets better with every iteration. They are a simple tool that can be used not only retroactively (e.g., when a bug has been found and you want to make sure that it does not reoccur) but also proactively (e.g., for creating test sets to actively probe system behavior when performance matters).
Adopting these three strategies is the first step to making ML testing more systematic and effective. They often provide a high return on investment. This is the case especially for smaller engineering teams that don’t have Tesla’s resources but that require strict performance guarantees and want to move through product development quickly and efficiently.
Lakera’s validation engine, MLTest, finds critical performance vulnerabilities in computer vision systems before they enter operation. Built with industry-leading AI and safety expertise, MLTest makes reliability a no-brainer for entire development teams. Get in touch if you want to learn more!
[1] Definition from ISO 21448: Road vehicles — Safety of the intended functionality.