Numerous recent additions to the computer vision (CV) research literature aim to solve practical problems, and companies are leveraging these advances to build CV systems that address real-world needs across applications. However, the one prospect that has held my attention, and imagination, for years is medical imaging. It provides extensive opportunities to improve patient journeys (for example, reducing the screening time for patients with skin diseases [1]) and to support physicians on challenging and time-consuming tasks (such as predicting lung cancer [2] or processing large numbers of histological slices [3]). And, as we know, it has the capacity to complete tasks that humans can’t do yet, such as determining whether a patient has pancreatic cancer via a smartphone selfie of the eye [4]. The promise these possibilities hold has put medical imaging at the front of the race toward landing in hospitals. But that is not the end of the discussion.
In this article, we discuss the challenges encountered when building a medical imaging system and how to test machine learning models to gain full visibility into model performance before deploying to the clinic.
While it’s exciting and encouraging to see all these medical imaging solutions being published, they don’t usually come free of challenges or risks – especially when it comes to the productionization phase. These challenges are often tedious and hard to resolve, but there are strong ethical and safety reasons why they must be addressed before going live. If these blockers persist, even the best software runs a high risk of getting stuck in the so-called “prototype trap” – the familiar situation of building a great prototype but never achieving tangible success in a production setting.
I have seen it myself: moving from my “rainbows and unicorns” Jupyter notebooks, where accuracy on my initial test data looked great, straight into awkward situations where I had to explain embarrassing errors to doctors testing the app on real patients.
The methodology available for traditional software testing does not yet have an equivalent for ML models. The challenging question is how to evaluate a machine learning model and achieve sufficient test coverage before deploying the system to production. How can machine learning unit testing help us escape the prototype trap?
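To make this concrete, here is a minimal sketch of what a machine learning unit test could look like, written in a pytest style for an imagined PyTorch chest X-ray classifier. The helpers `load_model` and `load_sample_xray` are hypothetical placeholders, not part of any real codebase; the point is simply that a clinically irrelevant change to the input should not change the predicted class.

```python
# A minimal, illustrative metamorphic unit test (pytest style).
# `load_model` and `load_sample_xray` are hypothetical helpers: the idea is
# that a clinically irrelevant change (a slight brightness shift) should not
# change the predicted class.
import torch
import pytest

from my_project import load_model, load_sample_xray  # hypothetical imports


@pytest.fixture(scope="module")
def model():
    m = load_model()  # assumed to return a torch.nn.Module classifier
    m.eval()
    return m


def test_prediction_stable_under_small_brightness_shift(model):
    x = load_sample_xray()                      # tensor of shape (1, 1, H, W), values in [0, 1]
    x_bright = torch.clamp(x * 1.05, 0.0, 1.0)  # 5% brighter, clinically irrelevant

    with torch.no_grad():
        pred_orig = model(x).argmax(dim=1)
        pred_bright = model(x_bright).argmax(dim=1)

    # The pre-diagnosis should not flip because of a tiny exposure change.
    assert torch.equal(pred_orig, pred_bright)
```

A test suite like this can run in continuous integration just like ordinary software tests, which is exactly the kind of coverage that is usually missing when a model leaves the notebook.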
The real questions we face are: why aren't more ML systems making the leap from prototype to production, and how can we increase their chances of making it? The answer lies in machine learning model evaluation.
We at Lakera investigated a state-of-the-art open-source model that can be used as a basis for building production systems in medical imaging.
The system under analysis aims to detect COVID-19 infections from chest radiographs. We looked at how this model would be likely to perform if deployed as-is – and what would need to be improved to take it from a prototype to production. Spoiler alert! The standard machine learning evaluation (measured in terms of aggregate metrics such as accuracy) looked good, as expected from a state-of-the-art model. Despite these seemingly good performance indicators, the model contained severe shortcomings that would have to be fixed before it reached production readiness.
In radiology practice, radiologists are trained to work around challenges such as patient movement, blur due to breathing, camera-induced noise, subtle differences between X-ray machines, overlaid text, and varying levels of contrast and exposure, among many others. Handling such cases well is key for a reliable diagnosis. However, it’s less clear how ML models handle such situations, especially as these are often not adequately represented in the initial training and test datasets used during development. Machine learning testing beyond standard metrics becomes a fundamental tool to identify problematic situations and drive next steps such as data collection.
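As an illustration of what testing beyond standard metrics could look like, the sketch below builds a small suite of perturbations (blur, sensor noise, exposure and contrast shifts) with PyTorch and torchvision and compares clean accuracy against accuracy under each perturbation. It assumes a classifier `model` and a `test_loader` yielding image batches as float tensors in [0, 1]; it is not the MLTest implementation, just one way to approximate the idea with open-source tools.

```python
# Sketch of a perturbation-based robustness check, assuming a PyTorch
# classifier `model` and a test loader yielding (images, labels) batches with
# images as float tensors in [0, 1]. The perturbations roughly mimic the
# conditions described above; they are illustrative only.
import torch
from torchvision.transforms import GaussianBlur


def adjust_contrast(x, factor):
    # Reduce contrast around the per-image mean intensity.
    mean = x.mean(dim=(-2, -1), keepdim=True)
    return torch.clamp((x - mean) * factor + mean, 0.0, 1.0)


PERTURBATIONS = {
    "blur":         GaussianBlur(kernel_size=5, sigma=1.5),
    "sensor_noise": lambda x: torch.clamp(x + 0.02 * torch.randn_like(x), 0.0, 1.0),
    "underexposed": lambda x: torch.clamp(x * 0.8, 0.0, 1.0),
    "low_contrast": lambda x: adjust_contrast(x, 0.7),
}


@torch.no_grad()
def accuracy_under(model, loader, perturb=None):
    correct, total = 0, 0
    for images, labels in loader:
        if perturb is not None:
            images = perturb(images)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total


# Compare clean accuracy with accuracy under each perturbation:
# baseline = accuracy_under(model, test_loader)
# for name, fn in PERTURBATIONS.items():
#     print(name, accuracy_under(model, test_loader, fn))
```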
In our quest to assess the production readiness of the aforementioned model, we employed Lakera’s MLTest, our software development kit (SDK) that allows developers to find vulnerabilities in ML models and data in CV systems before they enter operation. To stress-test the model, we used MLTest to synthetically generate augmented X-ray images and evaluated them with the model in order to assess its robustness against the kinds of situations that are likely to occur in practice, like those described in the paragraph above. The authenticity of the generated images was verified by professionally trained radiologists selected by Humans in the Loop, confirming that the images generated by MLTest could indeed be encountered in practice.
We evaluated the model on an extensive testing suite, with model tests focused on the performance and robustness properties of the model. The results revealed that, despite the outstanding base performance, severe robustness issues appeared in almost half of the images from the original test set. These included cases where changes barely noticeable to the naked eye led to critical failures in which the predictions flipped drastically, producing false positives and false negatives in the proposed pre-diagnoses. This means, for example, that a strongly positive case could be confidently classified as healthy if the patient moves slightly. These are mistakes that simply should not be accepted during actual use! Overall, we discovered that the model isn’t robust to patient-induced motion, lighting changes in the room, different scanner types, and other, more elastic deformations. Note that the image generation was not done adversarially.
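One way to surface this kind of failure with a single number is a per-image flip rate: the fraction of test images whose predicted class changes under at least one perturbation. The sketch below reuses the hypothetical `PERTURBATIONS` dictionary from the earlier snippet; aggregate accuracy can look excellent while this number remains alarmingly high.

```python
# Sketch of a per-image robustness measure, reusing the illustrative
# PERTURBATIONS dict from the previous snippet: the fraction of test images
# whose predicted class flips under at least one perturbation.
import torch


@torch.no_grad()
def prediction_flip_rate(model, loader, perturbations):
    flipped, total = 0, 0
    for images, _ in loader:
        clean_preds = model(images).argmax(dim=1)
        any_flip = torch.zeros_like(clean_preds, dtype=torch.bool)
        for perturb in perturbations.values():
            perturbed_preds = model(perturb(images)).argmax(dim=1)
            any_flip |= perturbed_preds != clean_preds
        flipped += any_flip.sum().item()
        total += clean_preds.numel()
    return flipped / total


# flip_rate = prediction_flip_rate(model, test_loader, PERTURBATIONS)
```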
To sum it up, we’ve seen that even state-of-the-art models have severe limitations when it comes to robustness, which can lead to them not performing well in practical situations. These vulnerabilities must be fixed during the development phase. The way to achieve this is by performing a robustness analysis and thorough machine learning testing, as proposed above. Identifying these areas of improvement guides next steps, such as the collection of test cases.
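As one possible example of such a next step (an assumption on my part, not necessarily the fix applied to this particular model), the failure modes uncovered during testing can be folded back into training as data augmentation, so that the model sees blur, exposure shifts, and slight motion during training. The sketch below assumes 3-channel inputs, as is common with ImageNet-pretrained backbones.

```python
# Hypothetical mitigation sketch: turn the failing conditions into
# training-time augmentations (assumes 3-channel float tensors in [0, 1]).
from torchvision import transforms

robustness_augmentations = transforms.Compose([
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=5, sigma=(0.5, 1.5))], p=0.3
    ),                                                          # breathing / motion blur
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # exposure and contrast shifts
    transforms.RandomAffine(degrees=3, translate=(0.02, 0.02)), # slight patient movement
])
```

Whatever the mitigation, the testing suite remains the arbiter: the same perturbation tests should be re-run after retraining to confirm the robustness gaps have actually closed.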
If you’d like to learn more about the ML testing techniques we’ve employed in our analysis, check out Lakera’s guide to testing machine learning. We’ve also looked at the robustness properties of state-of-the-art object detection models; you can find that analysis here. Or, if you want to test your own CV model with us, say hi! :-)
[1] “Using AI to help find answers to common skin conditions”, Bui, P. & Liu, Y., 2021.
[2] “A promising step forward for predicting lung cancer”, Shetty, S., 2019.
[3] “Artificial intelligence and computational pathology”, Cui, M. & Zhang, D. Y., 2021.
[4] “New app uses smartphone selfies to screen for pancreatic cancer”, University of Washington, 2017.