
Life vs. ImageNet Webinar: Lessons Learnt From Bringing Computer Vision to the Real World

Lakera hosted its first webinar, Life vs. ImageNet, last week. We had exciting discussions around the main challenges in building Machine Learning (ML) for real-world applications.

Lakera Team
February 2, 2024

We looked at real-world ML from multiple angles, all the way from how to best start ML projects to the challenges of scaling ML products and teams. And of course, we couldn’t miss out on the big developments in the world of foundation models.

You can access the recording of the webinar on YouTube (Life vs. ImageNet - Lakera Webinar) and below. Continue reading for a summary and the main takeaways.

👋 The Panelists

Richard Shen (Wayve)

Product management at Wayve, a deep tech London startup building end-to-end systems for autonomous driving.

Tom Dyer (Genomics England)

ML engineer at Genomics England, working on the 100,000 genome project. He has previously built radiology AI systems that have been deployed across the NHS for real-world clinical care.

Peter Shulgin (Covariant)

Previously part of the solutions team at Covariant AI, working closely with customers across industries and building the deep quality processes required to integrate AI technologies at scale.

Paul Rubenstein (Google)

Worked at Apple on building on-device Computer Vision (CV) products at scale. He is now an applied researcher at Google.

Mateo Rojas Carulla (Lakera, Moderator)

Founder of Lakera, PhD in machine learning, witnessed the rapid growth of ML capabilities over the last decade. Interested in how to bring this safely into production systems. Previously worked at Google on Search, where he learned how to deliver reliably at scale.

❓ The questions

What are the main challenges in transitioning from academic ML to building real-world applications? What cultural changes are needed?

Changing the mentality away from ML 101 is very challenging. The day-to-day challenges of evaluating CV systems for real-world applications are very different from the evaluation standards widely used in academia (e.g. evaluation on validation datasets). The applications promised by developments in AI technologies are very powerful, but there is a huge gap in making them useful and reliable for everyday use. There are major challenges in moving from a proof of concept (POC) to products on the streets, in hospitals, and in people's homes.

💡 In academia the focus is on models. In industry, the focus is on the data.

💡 The world is constantly changing and you need to ensure that your datasets are representative of this dynamic world.

Metrics defined in academia may not suit models in production. How does one come up with a metric for the real world? Is one metric enough?

💡 Averaging accuracy across classes for a classification model is meaningless; there is too much variation and uncertainty.

💡 A wide variety of metrics is informative. There is no need to look at them all the time, but you should have the possibility to drill down if needed.

💡 One metric is not enough. However, you need a metric that drives decisions, and too many metrics can become overwhelming. You can create hierarchies of metrics, for example a “safety-critical metric” to help decision-making.

💡 Think from a product perspective: what can be translated into product playbooks and a north star? Is your product serving the use cases it is meant to serve?
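As a minimal illustration of the points above (an averaged accuracy can look fine while hiding a failure that matters most), here is a hedged Python sketch. The class names and the `safety_critical_metric` helper are hypothetical inventions for illustration, not something defined in the webinar:

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Recall per class: of the true instances of each class, how many were caught."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}

def safety_critical_metric(recalls, critical_classes):
    """Top of a metric hierarchy: the worst recall among safety-critical classes."""
    return min(recalls[c] for c in critical_classes)

# Toy example: a detector where "pedestrian" is the safety-critical class.
y_true = ["pedestrian", "pedestrian", "car", "car", "sign"]
y_pred = ["pedestrian", "car", "car", "car", "sign"]

recalls = per_class_recall(y_true, y_pred)
# Average accuracy is 4/5 = 0.8 and looks fine, yet pedestrian recall is only 0.5.
print(recalls, safety_critical_metric(recalls, ["pedestrian"]))
```

The drill-down metrics stay available, but a single decision-driving number sits at the top of the hierarchy.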

In healthcare, how do you ensure that the final ML product adds real value to end users?

Adopting a product-first mentality is fundamental. While academia often rewards “big ticket items” that lead to increased model performance, product work should focus much more on understanding what the user really needs and what product features need to be built. Being pragmatic and understanding what is “good enough” to prevent getting caught in endless iterations is also a key driver of success.

💡 In clinical ML, you have to build with many constraints. For example, a model can have adequate performance but be too slow once deployed.

How challenging is it for engineers to adopt the practices needed for real-world ML?

It is absolutely a challenge that many companies are still trying to figure out. Within a company, open-ended research teams can create significant value, but should in general be kept separate from production and engineering teams. This is not new to ML; this interplay has been valuable across disciplines (oil and gas, industrial, etc.).

💡 In academia problems are framed in terms of validation, test, and training sets, but in the "real world", you are not given a fixed training set. You have requirements and have to figure out the rest as you go. The test dataset is constantly changing. You want to adapt and inform your product as it grows and new requirements come in.

If you were to set up a completely new computer vision project that will be released to production, what are the most important things to pay attention to early on?

Converging on the right evaluation metrics early is key. After having a viable prototype, it’s important to be dynamic and get your system into the hands of real people, as you will be surprised by how they actually end up using it. The focus of development should be on probing your model and understanding patterns in the failures. For example, understanding that your model fails on low-contrast images.

💡 Agreeing early on what success means is critical.

💡 When you deploy a model into the world, it will be used in ways you will not imagine beforehand.

💡 It is easy to fall into the trap of a continuous iteration loop, never quite getting there.

💡 How are you channeling information and learnings to scale further and further?
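A sketch of what probing for failure patterns might look like in practice, assuming a toy grayscale setup. The `contrast` heuristic, the threshold of 40, and the toy `predict` model are all illustrative assumptions, not a method the panelists described:

```python
def contrast(image):
    """Crude contrast proxy: spread between darkest and brightest pixel (0-255 grayscale)."""
    flat = [px for row in image for px in row]
    return max(flat) - min(flat)

def failure_rate_by_slice(examples, predict, low_contrast_threshold=40):
    """Split errors into low- vs. high-contrast slices instead of one global error rate."""
    slices = {"low_contrast": [0, 0], "high_contrast": [0, 0]}  # [errors, total]
    for image, label in examples:
        key = "low_contrast" if contrast(image) < low_contrast_threshold else "high_contrast"
        slices[key][1] += 1
        if predict(image) != label:
            slices[key][0] += 1
    return {k: (err / n if n else None) for k, (err, n) in slices.items()}

# Hypothetical toy model: calls an image "bright" if its mean pixel exceeds 128.
def predict(image):
    flat = [px for row in image for px in row]
    return "bright" if sum(flat) / len(flat) > 128 else "dark"

examples = [
    ([[255, 0], [255, 255]], "bright"),  # high contrast, classified correctly
    ([[0, 255], [0, 0]], "dark"),        # high contrast, classified correctly
    ([[130, 132], [129, 131]], "dark"),  # low contrast, misclassified
    ([[120, 122], [121, 119]], "dark"),  # low contrast, classified correctly
]
rates = failure_rate_by_slice(examples, predict)
print(rates)
```

Slicing the error rate this way surfaces a pattern (here, half the low-contrast images fail) that a single aggregate number would average away.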

What are the challenges that come with scale (larger product, team, number of users)? What are some learnings and challenges?

Having traceability, data tracking, and reproducibility is important. You want to be able to dissect everything no matter what scale you are at, going deep into the performance and the robustness of your systems. A lot of the manual complexity involved should be removed from the developer so they can focus on what matters.

💡 "Looking back, I am surprised at the amateurish level of research code, it was a culture shock for me to see how academia compares to production engineering".

How do you best scale efforts in ML? For example from one model to 1000s of models? Or a few users to millions of users?

In healthcare, you are growing teams as you scale the project. Initial systems are often built from small datasets, from single hospitals or regions. But is your model robust and stable across hospitals? For example, different hospitals have different conditions and different patient demographics. Scaling to larger use cases requires a dynamic team where responsibilities are constantly switching and new skills are required.

It is also important to have validation pipelines that go beyond aggregate performance, looking at deep performance and robustness metrics of the models. In particular, teams should define and test against expected behaviors in a granular way. This helps to monitor the models and how they will scale as you grow.

💡 The more you deploy, the more errors you encounter, the more you can test behaviors at finer granularities.
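One possible shape for such a granular validation pipeline, sketched in Python with entirely hypothetical hospital names and an assumed per-site accuracy floor of 0.9:

```python
def accuracy(results):
    """Fraction of correct predictions; results is a list of 1 (correct) / 0 (error)."""
    return sum(results) / len(results)

def validate_per_site(results_by_site, min_site_accuracy=0.9):
    """Check every deployment site against a floor, not just the global average."""
    report = {site: accuracy(r) for site, r in results_by_site.items()}
    failing = [site for site, acc in report.items() if acc < min_site_accuracy]
    return report, failing

# Toy data grouped by (hypothetical) hospital.
results_by_site = {
    "hospital_a": [1] * 95 + [0] * 5,   # 0.95
    "hospital_b": [1] * 97 + [0] * 3,   # 0.97
    "hospital_c": [1] * 80 + [0] * 20,  # 0.80 - hidden by the global average
}
report, failing = validate_per_site(results_by_site)

all_results = [r for rs in results_by_site.values() for r in rs]
print(accuracy(all_results))  # the global average still looks healthy
print(failing)                # yet one site falls below the floor
```

The same slicing idea extends to patient demographics, scanner types, or any other axis along which conditions differ between hospitals.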

What are the challenges of more traditional industries trying to adopt AI technologies?

While industries are not all alike and face different issues (e.g. construction vs. medical), they share common challenges. Speed of deployment is often one. Another is being able to explain very transparently to the end user how the system has been tested, in order to build confidence.

An additional challenge arises when building "generic" systems for multiple use cases. For example, in one warehouse you may interpret a given object as a single box, while in another use case you may want to classify it as a pallet of 500 boxes. This is a real challenge: how do you train a model with such a variety of use cases and expectations?

💡 The challenge with a “general” AI is that the same situations lead to different expected behaviors depending on the scenario.

How should these companies equip themselves to handle these conflicting use cases?

Established industries have more process-driven tasks and quality assurance (QA); they focus on building processes and on how to run successful pilots. They then formulate criteria for what works and bring it to scale.

💡 “My prediction is traditional companies will not work on the ML engineering side, they will focus on validation and testing.”

What challenges lie ahead with the rise of foundation models?

While there is a lot of excitement around foundation models, the problems they raise are not new to ML. What are some of these?

First of all, you are working with a model that you did not train yourself: you don’t have access to the data, nor do you know the properties the model has learned. As models become more powerful, the barriers to building complex systems come down. People will start building and realize that models are behaving strangely once deployed.

💡 There is an inherent bias in the data or foundation model, which can cause issues for downstream models.

💡 Evaluation is important: you don’t know the behaviors that the model has learned.

💡 Establishing robust evaluation suites for people who are not experts is key to success.


We had a great time during the webinar. It was a pleasure to hear from people who have faced the struggles of releasing ML for the real world and to learn from their experiences. Here are just a few of the key takeaways:

  • Real-world ML systems look very different from their academic counterparts. The latter start with a clearly defined target data distribution and corresponding datasets, whereas the former start with product specifications and often no dataset at all!
  • Evaluation standards should go far beyond typical evaluation metrics (e.g. accuracy). They need to dig deep into model failures and look at performance and robustness holistically.
  • Foundation models present fundamental challenges, such as understanding what behaviors the model has learned, even if you don't have any of the data that was used to train the model!
  • For traditional businesses looking to adopt AI technologies, the bulk of the work is likely going to be around testing and validation, and providing transparency and trust in the solutions they acquire.

We're really looking forward to the next webinar! In the meantime, make sure you join Momentum, our community for discussing all things AI safety.
