The computer vision bias trilogy: Data representativity.

“Data is a reflection of the inequalities that exist in the world”. While this might be true, developers have great potential to curb bias in their computer vision systems.

Lakera Team
December 1, 2023
March 9, 2022

“Data is a reflection of the inequalities that exist in the world” (Beena Ammanath, AI for Good Keynote). While this might be true, developers have great potential to curb model bias and data bias in their computer vision systems.

Testing whether bias is present in a computer vision system is key to understanding how it will perform in operation. Bias manifests itself in numerous ways, from data collection and annotation to the features that a model uses for prediction.

Let’s start by looking at data representativity and the model tests that empower you to uncover pesky biases early in your development process.

Collecting data.

Bias can first appear when collecting and annotating data. The data that you use to build and evaluate a computer vision model must reflect what you intend to use it for: this is referred to as data representativity.

For example, a radiology diagnostic tool to be deployed in southern France must be evaluated on patients drawn from the local demographics, and on images captured with the machines present in the target hospitals. Past research has focused on guidelines that can be followed when collecting and annotating data for training and testing to mitigate such bias.

How do you know if you have the data that matters?

When possible, collect information beyond the image itself. For example, you can collect the age of the people and the type of machine used to take their pictures.

Once you have collected data, it is essential to confirm that it is representative of the target population. While establishing this from image data alone is challenging, image metadata can prove very useful. In previous posts, we introduced the notion of metadata and explained why it contains semantic information that is key to evaluating machine learning models, particularly in computer vision. If the sex and age of patients are available, as well as the model of the machine used to collect the images, we can create unit tests that check that data for each relevant slice is present in the datasets. This way we can build up a comprehensive test suite that lets us verify that the data as a whole is representative and identify the areas where it isn’t, thus effectively guiding the data collection process.
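One way such a slice-presence test could look is sketched below. This is a minimal illustration, not Lakera's implementation: the metadata fields (`sex`, `age_group`), their values, and the helper `missing_slices` are all hypothetical, and the minimum count per slice is an assumption you would set for your own use case.

```python
from collections import Counter
from itertools import product

# Hypothetical metadata, one record per image (values are illustrative only).
records = [
    {"sex": "F", "age_group": "20-70", "machine": "scanner_a"},
    {"sex": "M", "age_group": "20-70", "machine": "scanner_a"},
    {"sex": "F", "age_group": "70+", "machine": "scanner_b"},
    {"sex": "M", "age_group": "20-70", "machine": "scanner_b"},
]

# The slices we expect to see, expressed as allowed values per metadata key.
expected = {"sex": ["F", "M"], "age_group": ["20-70", "70+"]}


def missing_slices(records, expected_values, min_count=1):
    """Return every combination of expected metadata values that has
    fewer than `min_count` records in the dataset."""
    keys = list(expected_values)
    # Count how many records fall into each combination of the chosen keys.
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    all_combos = product(*(expected_values[k] for k in keys))
    # Counter returns 0 for combinations that never occur.
    return [combo for combo in all_combos if counts[combo] < min_count]


gaps = missing_slices(records, expected)
print(gaps)  # slices with no data at all
```

In a test suite, a check such as `assert not missing_slices(records, expected, min_count=50)` would fail whenever a slice is underrepresented, and the returned list shows where more data needs to be collected.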

Leave no outlier behind.

Finally, representativity in the literature refers to a match to the target population: for example, if 99.9% of the target population is between 20 and 70 years old, an evaluation dataset should reflect this. This, however, disregards the importance of the tails of the distribution, and it marks a key difference between building prototypes and building production-ready systems. Indeed, an ML model may achieve excellent accuracy on an evaluation dataset containing data in the 20 to 70-year-old range even if it performs poorly on 80-year-olds. If the product is intended to work on patients of all ages, then it is paramount to explicitly test on slices belonging to the tail of the distribution, even if they are rarely encountered in practice.

As in the illustration below, aggregate evaluation metrics such as accuracy, precision, and recall may be misleading: it is important to explicitly measure performance for all relevant slices.

You can ensure that your system is performing as well as it seems by testing every subset individually.
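To make the gap between aggregate and per-slice metrics concrete, here is a small sketch with invented numbers. The labels, predictions, and age slices are toy data chosen for illustration, and `accuracy_by_slice` is a hypothetical helper, not part of any specific library.

```python
from collections import defaultdict


def accuracy_by_slice(labels, predictions, slice_ids):
    """Compute accuracy separately for each slice identifier."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, s in zip(labels, predictions, slice_ids):
        total[s] += 1
        correct[s] += int(y == p)
    return {s: correct[s] / total[s] for s in total}


# Toy evaluation set: 95 patients aged 20-70 and 5 patients aged 80+.
labels = [1] * 100
# 94 of the 95 majority-slice predictions are right; only 1 of the 5 tail ones is.
predictions = [1] * 94 + [0] + [1] + [0] * 4
slice_ids = ["20-70"] * 95 + ["80+"] * 5

overall = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
per_slice = accuracy_by_slice(labels, predictions, slice_ids)

print(f"overall accuracy: {overall:.2f}")
for name, acc in per_slice.items():
    print(f"accuracy on {name}: {acc:.2f}")
```

With these toy numbers the aggregate accuracy is 95%, while accuracy on the 80+ slice is only 20%. The aggregate figure alone would hide this failure, which is why each slice gets its own measurement.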

In conclusion, find out who your target groups are, big or small, and check that you have enough data for all of them. You can use metadata as a tool to find the groups that matter.

Get started with Lakera today.

Get in touch with mateo@lakera.ai to find out more about what Lakera can do for your team, or get started right away.
