Every day, we rely on data to power complex systems and help solve difficult problems. Data is so ingrained in our lives that we automatically trust its usefulness and accuracy. Yet it's easy to forget that data isn't autonomous: it's managed by people and therefore prone to human bias and error. Without proper oversight, data inaccuracy and misuse can become rampant and cause serious problems. As we move towards increasingly complex interactions between data and algorithms in machine learning (ML) systems, the need for correct data management becomes clearer than ever. To unlock the full potential of ML applications, we need appropriate guidance for data management, both at the formal regulatory level and in our everyday best practices.
“Now that the models have advanced to a certain point, we got to make the data work as well.”
— Andrew Ng
Even in traditional software programs, where data is just an input processed in an explicitly programmed way, data needs to be considered carefully and placed on an equal footing with software, hardware, and human factors. Failure to do so has already led to dramatic consequences. Most recently, Public Health England mismanaged data in a Microsoft Excel file and silently lost COVID-19 test results, causing nearly 16,000 cases to go unreported for days. As a consequence, the contacts of these positive cases were not notified or asked to self-isolate in time.
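The records were lost because the legacy .xls format caps each worksheet at 65,536 rows, so rows beyond that limit were silently dropped. A minimal sketch of the kind of sanity check that can catch such silent truncation on ingest (the function name and row counts below are illustrative, not taken from the actual systems involved):

```python
XLS_MAX_ROWS = 65_536  # legacy .xls worksheet row limit implicated in the incident

def check_row_count(expected_rows: int, loaded_rows: int) -> None:
    """Fail loudly if records were silently dropped during import."""
    if loaded_rows < expected_rows:
        raise ValueError(
            f"Data loss detected: expected {expected_rows} rows, "
            f"loaded only {loaded_rows}"
        )

# Hypothetical example: rows beyond the .xls limit are silently truncated
source_rows = 70_000
loaded_rows = min(source_rows, XLS_MAX_ROWS)

try:
    check_row_count(source_rows, loaded_rows)
except ValueError as err:
    print(err)
```

A check this simple, run at every handoff between systems, turns a silent failure into an immediate, visible one.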
Managing massive and complex amounts of data presents a unique set of challenges that we are still addressing today. When these challenges go unaddressed, the resulting mishandling of data can cause serious accidents in mission-critical systems, in healthcare and autonomous driving, for example. The numerous challenges associated with data help us understand why Ken Thompson famously declared:
“I wanted to separate data from programs because data and instructions are very different.”
— Ken Thompson
Given the unique properties and challenges of large-scale data management, it becomes crucial to develop systematic methodologies and infrastructure that address it separately from software and hardware.
We have seen that data mismanagement can create serious issues, even in traditional software systems where it is handled deterministically. Now, think about how much more challenging the management of data is in ML, where computers learn program behavior from data. The line between data and programs becomes blurrier, and it becomes more important to account for all the unique properties of data in the first place.
Biased datasets, for instance, have had serious ramifications in critical applications. In the past, courts in the United States have used software to forecast a defendant's risk of recidivism based on data from the defendant and from criminal records. An investigation found that while the tool correctly predicted recidivism 61% of the time, Black individuals were almost twice as likely as white individuals to be labeled high-risk without actually going on to re-offend. Conversely, white individuals were much more likely than Black individuals to be labeled low-risk but end up re-offending.
The software’s creators didn’t fully take into account the effects of dataset bias on predictions, leading to unacceptable social consequences. These effects must be explicitly tested to prevent catastrophes such as wrongful sentencing. When we are talking about criminal justice, healthcare, or other critical applications, good data governance is an urgent priority.
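One concrete way to make such effects visible is to compare error rates across groups, for example the false positive rate: the share of people who did not re-offend but were still labeled high-risk. A minimal sketch with entirely hypothetical records and field names, not the actual data behind the investigation:

```python
def false_positive_rate(records, group):
    """Share of non-reoffenders in `group` who were labeled high-risk."""
    labeled_high = 0
    did_not_reoffend = 0
    for r in records:
        if r["group"] == group and not r["reoffended"]:
            did_not_reoffend += 1
            if r["predicted_high_risk"]:
                labeled_high += 1
    return labeled_high / did_not_reoffend if did_not_reoffend else 0.0

# Hypothetical records, for illustration only
records = [
    {"group": "A", "predicted_high_risk": True,  "reoffended": False},
    {"group": "A", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
    {"group": "B", "predicted_high_risk": False, "reoffended": False},
]

fpr_a = false_positive_rate(records, "A")  # 0.5
fpr_b = false_positive_rate(records, "B")  # 0.0
```

A large gap between the per-group rates is exactly the kind of disparity that aggregate accuracy alone would never reveal, which is why these checks belong in the test suite of any high-stakes ML system.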
To address the challenges associated with data in AI systems, the EU included “data and data governance” in its latest AI package. Article 10 (page 48) requires that:
“Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used.”
This is a good start. However, such data guidance will have to become more concrete – and quickly. We need a more formal definition of the data properties that mission-critical systems require to operate safely. A good starting point for analyzing the quality of data is the set of data properties proposed in the Data Safety Guidance.
These properties are not industry- or application-specific and can be used to establish which aspects of the data need to be investigated and guaranteed so that the system can operate safely. A failure to consider any of these properties could pose additional risks to the operation of the system. The authors note that the list is not exhaustive. Instead, it needs to be carefully adapted for specific applications.
This level of data oversight needs to become a core part of the ML development process. It must drive future requirements and produce compliance artifacts for high-risk applications. It's our responsibility as a community to implement this change. Regulators, startups, industry players, and researchers all have to work together to build a more principled discipline around data management. This is the only way to obtain an in-depth understanding of data during both development and operation, and in turn to assess the safety of mission-critical applications. If we don't take these steps, the number of AI accidents caused by mismanaged data or noncompliant data usage will continue to rise. Unlocking the full potential of ML software in applications with low tolerance for failure requires us to act now.
Excel: Why using Microsoft's tool caused Covid-19 results to be lost, BBC, 2020
Data Safety Guidance (Version 3.3), SCSC Data Safety Initiative Working Group, 2021
Machine Bias, ProPublica, 2016
Artificial Intelligence Act, European Commission, 2021