Cookie Consent
Hi, this website uses essential cookies to ensure its proper operation and tracking cookies to understand how you interact with it. The latter will be set only after consent.
Read our Privacy Policy
Back

Test machine learning the right way: Detecting data bugs.

In this second instance of the testing blog series, we deep dive into data bugs: what do they look like, and how can you use specification and testing to ensure you have the right data for the job?

Mateo Rojas-Carulla
December 1, 2023
July 13, 2021
Learn how to protect against the most common LLM vulnerabilities

Download this guide to delve into the most common LLM security risks and ways to mitigate them.

In-context learning

As users increasingly rely on Large Language Models (LLMs) to accomplish their daily tasks, their concerns about the potential leakage of private data by these models have surged.

[Provide the input text here]

[Provide the input text here]

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.

Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?

Title italic

A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.

English to French Translation:

Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?

Lorem ipsum dolor sit amet, line first
line second
line third

Lorem ipsum dolor sit amet, Q: I had 10 cookies. I ate 2 of them, and then I gave 5 of them to my friend. My grandma gave me another 2boxes of cookies, with 2 cookies inside each box. How many cookies do I have now?

Title italic Title italicTitle italicTitle italicTitle italicTitle italicTitle italic

A: At the beginning there was 10 cookies, then 2 of them were eaten, so 8 cookies were left. Then 5 cookieswere given toa friend, so 3 cookies were left. 3 cookies + 2 boxes of 2 cookies (4 cookies) = 7 cookies. Youhave 7 cookies.

English to French Translation:

Q: A bartender had 20 pints. One customer has broken one pint, another has broken 5 pints. A bartender boughtthree boxes, 4 pints in each. How many pints does bartender have now?

Hide table of contents
Show table of contents

In contrast to traditional software systems, data is the core ingredient for ML (machine learning). Engineers spend significant amounts of time collecting the “right” datasets. That data drives training and evaluation processes and, ultimately, the quality of the machine learning systems we build. Given the importance of data, how do data bugs arise, and how can one catch them in time?

Before we answer these questions, note that we defined ML bugs as follows [1]:

“A machine learning bug refers to any imperfection in a machine learning item that causes a discordance between the existing and the required conditions.”

Two families of data bugs that often come up in practice can be summarised by these two questions:

  1. Do I have the wrong data? For example, data points can be duplicated or corrupted.
  2. Do I have the right data? Even if the data is bug-free, building a system that works at night is impossible if no night images are available in the data.

These families of data bugs are distinct but it’s important to keep track of both as you develop your ml models. Appropriate tests should be written to systematically look for vulnerabilities in both categories.

We will take a closer look at these questions specifically in the context of building a computer vision system. The question of how to check data quality issues during testing and operation is equally important. Read on to dig deeper into whether you have the wrong or right data.

Do I have the wrong data?

Naturally, the first question to ask is whether there are actual errors in the data. Below, we have a brief checklist of ML bugs that often arise in practice.

Do I have the wrong data?

Missing values: Some values in the data/metadata may be missing for some of the images.

Incorrect annotations: Inconsistency and noise in the annotations can hinder model performance.

Data consistency: Some images may have the wrong size, number of channels, or metadata columns — to name a few possibilities. Alternatively, the range of some of the pixels may be off.

Clear outliers: Some images/metadata may be clearly suspicious and need closer inspection.

Corrupted data: Some data points may be corrupted.

Duplicates: Some images may be duplicated. This can bias the machine learning models at training and may also result in the same image being both in the training set and in the validation set.

Adding tests that check for the items in the list above allows us to build a high-coverage test suite that ensures that any issues are caught in time. For example, let’s say you write a test that checks whether the shape of the images in the dataset is correct. In this case, they are all squares. This simple setup would catch rectangle-shaped images if they are added to the dataset in the future.

Do I have the right data?

A more subtle but fundamentally important form of data bugs looks at whether you have all the right data to build the system you envision.

As an example, if you deploy an autonomous driving component, your system needs to do well on roundabouts. If the data used to train your model has no roundabouts, or only very few of them, the system cannot be expected to perform well – even if the system is otherwise bug-free. So it’s essential that your dataset is representative of the intended use case.

Metadata is a powerful tool when testing data representativity. It contains high-level semantic information which can be used to specify relevant tests. In medical imaging, for instance, DICOM (digital imaging and communication in medicine) is the standard format for storing images, which can contain metadata such as the patient’s gender. By explicitly testing that a minimum number of images for each gender is available, we ensure that the system does not suffer from critical imbalances. Similarly, the time of day at which an image was taken can be used to test that images have been taken regularly throughout the day, mitigating issues such as too few images taken in the nighttime.

It is also important to keep track of the potential mismatch between the system you have built and the system you want to build. Testing for data coverage is the first step to ensuring this is avoided. Finding missing aspects in the data can drive data collection, lead to better representativity, and provide better system performance.

Example: Detecting data bugs in Gruyères.

Imagine that we’re building a system for autonomous driving in the city of Gruyères, CH. We need to establish a high-level description of specifications and tests that can be built on data to ensure a better system.

Do I have the wrong data?

Do I have the right data?

Creating data tests requires an upfront investment from engineering teams. We still see too many teams that want to “first focus on building a prototype” and then “worry about data quality” later. While this may work for a few iterations, teams quickly find themselves in situations where data bugs have gone unnoticed and cause substantial delays to their projects. The most mature teams make data quality testing and monitoring a top priority from early on in their projects. They put mechanisms in place to ensure adequate quality levels at all times during development.

Get started with Lakera today.

Get in touch with mateo@lakera.ai to find out more about what Lakera can do for your team, or get started right away.

Lakera LLM Security Playbook
Learn how to protect against the most common LLM vulnerabilities

Download this guide to delve into the most common LLM security risks and ways to mitigate them.

Unlock Free AI Security Guide.

Discover risks and solutions with the Lakera LLM Security Playbook.

Download Free

Explore Prompt Injection Attacks.

Learn LLM security, attack strategies, and protection tools. Includes bonus datasets.

Unlock Free Guide

Learn AI Security Basics.

Join our 10-lesson course on core concepts and issues in AI security.

Enroll Now

Evaluate LLM Security Solutions.

Use our checklist to evaluate and select the best LLM security tools for your enterprise.

Download Free

Uncover LLM Vulnerabilities.

Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.

Download Free

The CISO's Guide to AI Security

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Download Free

Explore AI Regulations.

Compare the EU AI Act and the White House’s AI Bill of Rights.

Download Free
Mateo Rojas-Carulla

The CISO's Guide to AI Security

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Free Download
Read LLM Security Playbook

Learn about the most common LLM threats and how to prevent them.

Download

Explore AI Regulations.

Compare the EU AI Act and the White House’s AI Bill of Rights.

Understand AI Security Basics.

Get Lakera's AI Security Guide for an overview of threats and protection strategies.

Uncover LLM Vulnerabilities.

Explore real-world LLM exploits, case studies, and mitigation strategies with Lakera.

Optimize LLM Security Solutions.

Use our checklist to evaluate and select the best LLM security tools for your enterprise.

Master Prompt Injection Attacks.

Discover risks and solutions with the Lakera LLM Security Playbook.

Unlock Free AI Security Guide.

Discover risks and solutions with the Lakera LLM Security Playbook.

You might be interested
min read
Machine Learning

Why testing should be at the core of machine learning development.

AI (artificial intelligence) is capable of helping the world scale solutions to our biggest challenges but if you haven’t experienced or heard about AI’s mishaps then you’ve been living under a rock. Coded bias, unreliable hospital systems and dangerous robots have littered headlines over the past few years.
Lakera Team
December 1, 2023
12
min read
Machine Learning

The ELI5 Guide to Retrieval Augmented Generation

Discover the inner workings of Retrieval Augmented Generation (RAG) and how it enhances language model responses by dynamically sourcing information from external databases.
Blessin Varkey
December 1, 2023
Activate
untouchable mode.
Get started for free.

Lakera Guard protects your LLM applications from cybersecurity risks with a single line of code. Get started in minutes. Become stronger every day.

Join our Slack Community.

Several people are typing about AI/ML security. 
Come join us and 1000+ others in a chat that’s thoroughly SFW.