Medical imaging as a serious prospect: Where are we at?

The promise these possibilities hold has put medical imaging in the lead of the race toward landing in hospitals. But that is not the end of the discussion…

Lakera Team

October 20, 2023

Last updated: November 13, 2024

Numerous recent additions to the research literature on computer vision (CV) aim to solve practical problems, and companies are leveraging these advances to build CV systems for real-world applications. However, the one prospect that has held my attention, and imagination, for years is medical imaging. It provides extensive opportunities to improve patient journeys (for example, reducing the screening time for patients with skin diseases [1]) and to support physicians on challenging and time-consuming tasks (such as predicting lung cancer [2] or processing a large number of histological slices [3]). It also has the capacity to complete tasks that humans can’t do yet, such as determining if a patient has pancreatic cancer via a smartphone selfie of the eye [4]. The promise these possibilities hold has put medical imaging in the lead of the race toward landing in hospitals. But that is not the end of the discussion.

In this article, we discuss the challenges of building a medical imaging system and how to test machine learning models to gain full visibility into model performance before deploying to the clinic.

The feared “prototype trap”.

While it’s exciting and encouraging to see all these medical imaging solutions being published, they don’t usually come free of challenges or risks – especially when it comes to the productionization phase. These challenges are often tedious and hard to resolve, but there are strong ethical and safety reasons why this must be done before going live. If these blockers persist, even the best software runs a high risk of getting stuck in the so-called “prototype trap” – known across the board as the situation of building a great prototype but never achieving tangible success in a production setting.

I have seen it myself: moving on from my “rainbows and unicorns” Jupyter Notebooks, which computed accuracy on my initial test data, I suddenly found myself in awkward situations, having to explain embarrassing errors in front of doctors while they were testing the app on real patients.

The methodology available in traditional software testing is not yet present when testing ML models. The challenging question to answer is how to evaluate a machine learning model and increase test coverage enough before deploying the system to production. How can machine learning unit testing help us escape the prototype trap?

Case study: Putting medical imaging to the test.

All that glitters is not gold.

The real questions we face are: why aren't more ML systems making the step from a prototype to production, and how can we increase their chances of making it? The answer lies in machine learning model evaluation.

We at Lakera investigated a state-of-the-art open-source model that can be used as a basis for building production systems in medical imaging.

The system under analysis was designed to detect COVID-19 infections from chest radiographs. We looked at how this model would likely perform if deployed as-is, and what should be improved to take it from a prototype to production. Spoiler alert! The standard machine learning evaluation (measured in terms of aggregate metrics such as accuracy) was good, as expected from a state-of-the-art model. Despite these seemingly good performance indicators, the model had severe shortcomings that would have to be fixed before reaching production readiness.

Human experts are trained for practice… but is ML?

In radiology practice, radiologists are trained for, and used to, working around challenges such as patients moving, blur due to breathing, camera-induced noise, subtle differences in X-ray machines, overlaid text, different levels of contrast or exposure, and many others. Handling such cases well is key for a reliable diagnosis. However, it’s less clear how ML models handle such situations, especially as these are often not adequately represented in the initial training and test datasets used during development. Machine learning testing beyond standard metrics becomes a fundamental tool to identify problematic situations and drive next steps such as data collection.

*Stress-testing the medical AI application against blur scenarios.*

In our quest to assess the production readiness of the aforementioned model, we employed Lakera’s MLTest, our software development kit (SDK) that allows developers to find vulnerabilities in ML models and data in CV systems before they enter into operation. To stress-test the model, we used MLTest to synthetically augment and generate X-ray images, then evaluated these with the model to assess its robustness against various situations that are likely to occur in practice, like those described in the paragraph above. The authenticity of the generated images was verified by professionally trained radiologists who were selected by Humans in the Loop. This process confirmed that the images generated by MLTest could indeed be encountered in practice.
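To make the idea concrete, here is a minimal Python sketch of this kind of stress test. It is not MLTest's API, which this article does not show. The function names (`box_blur`, `adjust_contrast`, `add_noise`, `stability_report`), the tiny synthetic image, and the stand-in `toy_model` are all assumptions made for illustration. The sketch applies a few non-adversarial perturbations to one image and records whether the predicted class stays the same.

```python
import random

def box_blur(image, radius=1):
    """Blur a 2D grayscale image (list of rows) with a square mean filter."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def adjust_contrast(image, factor):
    """Scale pixel values around mid-gray (0.5), clamped to [0, 1]."""
    return [[min(1.0, max(0.0, 0.5 + factor * (p - 0.5))) for p in row]
            for row in image]

def add_noise(image, sigma, seed=0):
    """Add Gaussian noise to every pixel, clamped to [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]

def toy_model(image):
    """Hypothetical stand-in classifier: the score is the brightest pixel.
    A real test would call the actual model here."""
    return max(max(row) for row in image)

def stability_report(model, image, perturbations, threshold=0.5):
    """Map each perturbation name to True if the predicted class is unchanged."""
    base = model(image) >= threshold
    return {name: (model(fn(image)) >= threshold) == base
            for name, fn in perturbations.items()}

# Made-up 5x5 image: dark background with a single bright pixel.
image = [[0.1] * 5 for _ in range(5)]
image[2][2] = 0.9

perturbations = {
    "blur": lambda img: box_blur(img, radius=1),
    "contrast": lambda img: adjust_contrast(img, 0.8),
    "noise": lambda img: add_noise(img, sigma=0.02),
}
report = stability_report(toy_model, image, perturbations)
print(report)
```

In this toy setup the prediction survives the contrast change and the noise but flips under a mild blur, because the stand-in model relies on a single bright pixel. A real model fails for less obvious reasons, which is why the perturbed images need to be realistic and the checks need to run over the whole test set.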

Robustness issues found despite exceptional base performance.

We evaluated the model on an extensive testing suite, with model tests focused on the performance and robustness properties of the model. The results revealed that despite the outstanding base performance, severe robustness issues appeared in almost half of the images from the original test set. These issues included cases where changes to the images were barely noticeable to the naked eye, yet led to critical failures in which the predictions drastically flipped, producing false positives and false negatives in the proposed pre-diagnoses. This means, for example, that a patient with a strong positive diagnosis could be confidently classified as healthy if they move slightly. These are mistakes that simply should not be accepted during actual use! Overall, we discovered that the model isn’t robust to patient-induced motion, lighting changes in the room, different scanner types, and other, more elastic conditions. Note that the image generation was not done adversarially.
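One way to report this kind of finding is to put a robustness metric next to the aggregate one. The sketch below is an assumption about how such a summary could be computed, not the procedure used in our analysis: `accuracy` and `flip_rate` are hypothetical helper names, and the labels and predictions are invented purely to show the arithmetic.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def flip_rate(base_predictions, perturbed_predictions):
    """Fraction of images whose prediction changes under at least one
    perturbation. `perturbed_predictions` maps a perturbation name to a
    list of predictions aligned with `base_predictions`."""
    flipped = 0
    for i, base in enumerate(base_predictions):
        if any(preds[i] != base for preds in perturbed_predictions.values()):
            flipped += 1
    return flipped / len(base_predictions)

# Invented data for eight images (1 = positive, 0 = negative).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
base = [1, 1, 1, 1, 0, 0, 0, 1]
perturbed = {
    "blur":   [0, 1, 1, 1, 0, 0, 0, 1],  # image 0 flips
    "motion": [1, 1, 0, 1, 0, 1, 0, 1],  # images 2 and 5 flip
}

print("base accuracy:", accuracy(labels, base))    # 7 of 8 correct
print("flip rate:", flip_rate(base, perturbed))    # 3 of 8 images flip
```

Even with these made-up numbers, the two metrics tell different stories: accuracy on the unperturbed images looks fine, while the flip rate shows how many predictions are unstable under realistic changes.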

Conclusion: Medical imaging is nearly there, but not yet a winner.

To sum it up, we’ve seen that even state-of-the-art models have severe limitations when it comes to robustness, which can cause them to perform poorly in practical situations. These vulnerabilities must be fixed during the development phase. The way to achieve this is by performing a robustness analysis and thorough machine learning testing, as proposed above. Identifying these areas of improvement guides next steps, such as the collection of test cases.

If you’d like to learn more about the ML testing techniques we’ve employed in our analysis, check out Lakera’s guide to testing machine learning. We've also looked at the robustness properties of state-of-the-art object detection models; you can find that here. Or if you want to test your own CV model with us, say hi! :-)

[1] “Using AI to help find answers to common skin conditions”, Bui, P. & Liu Y., 2021.

[2] “A promising step forward for predicting lung cancer”, Shetty, S., 2019.

[3] “Artificial intelligence and computational pathology”, Cui, M. & Zhang, D. Y., 2021.

[4] “New app uses smartphone selfies to screen for pancreatic cancer”, University of Washington, 2017.
