What Is Personally Identifiable Information (PII)? And Why It’s Getting Harder to Protect
What counts as PII in the age of GenAI—and why it’s getting harder to protect. This guide breaks down evolving risks and what modern defenses look like.

Personally Identifiable Information—or PII—is one of the most widely recognized data categories in cybersecurity. It’s also one of the most misunderstood.
For years, companies have treated PII as a storage-layer problem: something to encrypt, redact, or delete from static databases. But in the age of GenAI, PII doesn’t just sit quietly in your systems. It moves. It leaks. It gets inferred, hallucinated, and exposed—without anyone explicitly asking for it.
The traditional rules no longer apply. And the risks aren’t theoretical.
From multilingual model responses to indirect data disclosures in chat-based tools, GenAI is introducing entirely new failure modes for PII—ones that legacy detection systems were never built to handle. If your AI application can generate output, it can leak PII.
This guide breaks down what PII actually is, how AI changes the threat model, and what it takes to stay ahead—especially when models don’t just store data, but speak it.
Test your model’s PII defenses. See how Lakera Guard handles multilingual prompts, indirect identifiers, and inferred leakage in real time.
The Lakera team has accelerated Dropbox’s GenAI journey.
“Dropbox uses Lakera Guard as a security solution to help safeguard our LLM-powered applications, secure and protect user data, and uphold the reliability and trustworthiness of our intelligent features.”
At its core, Personally Identifiable Information (PII) is any data that can be used to identify an individual—either directly or in combination with other information. Most organizations already handle PII in some form, but definitions vary across legal frameworks, and the boundary between “personal” and “non-personal” is getting harder to draw.
<div class="table_component" role="region" tabindex="0">
<table>
<caption><br></caption>
<thead>
<tr>
<th>
<p><b>Direct Identifiers</b></p>
<p><span><i>These are data points that can single-handedly pinpoint an individual</i></span></p>
</th>
<th>
<p><b>Indirect Identifiers</b></p>
<p><span><i>These require additional context to identify someone, but can still qualify as PII</i></span></p>
</th>
</tr>
<tr>
<td>
<p>Full name</p>
</td>
<td>Location data (e.g. GPS, city, postal code)</td>
</tr>
<tr>
<td>Social Security Number or national ID</td>
<td>IP address or device fingerprint</td>
</tr>
<tr>
<td>Passport or driver’s license number</td>
<td>Job title and employer</td>
</tr>
<tr>
<td>Email address or phone number</td>
<td>Browsing history or activity logs</td>
</tr>
<tr>
<td>Biometric data (e.g. fingerprint, retina scan)</td>
<td>Unique combinations that reveal identity (e.g. “the CFO of a fintech startup in Zurich”)</td>
</tr>
</thead>
<tbody></tbody>
</table>
Most privacy regulations—including GDPR, CCPA, and HIPAA—treat both types as sensitive, particularly when linked to identifiable individuals. In some cases, even pseudonymized or probabilistic identifiers can fall under PII protection if they can be traced back to a person.
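To make the “unique combinations” row concrete, here is a small, self-contained sketch. The records and field names are invented purely for illustration and stand in for any dataset of seemingly non-personal attributes:

```python
# Toy dataset: no single field below is a direct identifier,
# yet one combination of them points to exactly one person.
records = [
    {"job_title": "CFO", "industry": "fintech", "city": "Zurich"},
    {"job_title": "CFO", "industry": "fintech", "city": "Berlin"},
    {"job_title": "CTO", "industry": "fintech", "city": "Zurich"},
    {"job_title": "CFO", "industry": "retail",  "city": "Zurich"},
]

# "The CFO of a fintech startup in Zurich," expressed as indirect attributes.
quasi_identifier = {"job_title": "CFO", "industry": "fintech", "city": "Zurich"}

matches = [
    record for record in records
    if all(record[key] == value for key, value in quasi_identifier.items())
]

# One match: the combination is effectively PII, even though the dataset
# contains no names, email addresses, or ID numbers.
print(f"{len(matches)} matching record(s) for {quasi_identifier}")
```

This is the same re-identification logic that leads regulators to extend protection to pseudonymized data when it can be traced back to a person.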
**🔍 Want to see how these risks evolve in the age of AI? Download our AI Security Playbook for a practical breakdown of GenAI-specific vulnerabilities and defenses.**
But here’s the real shift: in GenAI systems, PII can surface even if you’ve never stored it. It might be inferred, reconstructed from indirect data, or hallucinated by a model under the right conditions.
We dig deeper into that risk next.
Protecting PII used to be a matter of securing what you stored: encrypt databases, restrict access, and monitor logs. But generative AI systems don’t just store information—they produce it. They can rephrase, infer, or even hallucinate personal details in output, sometimes without ever accessing structured data in the first place.
That shift breaks the traditional data protection model. And it’s already causing trouble.
In one of the earliest public examples, Samsung engineers accidentally leaked internal source code and confidential notes by pasting them into ChatGPT while trying to debug issues. The incident triggered an internal crackdown on the use of generative tools—and made global headlines for exposing just how easily sensitive information can be shared with public AI models.
Meanwhile, at Black Hat USA 2024, a researcher demonstrated how Microsoft Copilot could be manipulated to extract internal data and generate personalized phishing content—without triggering standard enterprise DLP systems. Copilot’s natural language interface made it easy for attackers to craft plausible requests that quietly bypassed conventional safeguards.
And a 2025 report covered by Help Net Security found that 8.5% of prompts submitted to tools like ChatGPT and Copilot included sensitive information—PII, credentials, and even internal file references. The vast majority of these exposures weren’t flagged by traditional systems, because they occurred not during data entry or storage, but during natural language interactions with the model.
These incidents aren’t edge cases. They’re reminders that in GenAI systems, PII risk lives at the output layer. It doesn’t need to be retrieved—it can be rewritten, inferred, or exposed in ways that sidestep every conventional control you’ve put in place.
And because these outputs look like everyday conversations, they’re often missed entirely.
Let’s take a closer look at why legacy defenses like regex-based DLP can’t keep up.
Most Data Loss Prevention systems were built for a simpler world—one where sensitive data looked the same every time. Credit card numbers followed fixed patterns. Social Security Numbers were always nine digits. And personal details lived in structured fields.
In that world, regex worked. But language models don’t speak in patterns—they speak in nuance.
When someone asks a GenAI assistant a question, the model might respond in any number of ways. And that flexibility is exactly what breaks traditional filters.
Consider just a few of the forms the same piece of PII can take in a model’s response: an email address spelled out in words (“jane dot doe at example dot com”), a phone number woven into a sentence instead of a structured field, or a person described only as “the CFO of a fintech startup in Zurich.”
And it doesn’t stop there. PII can be embedded in a summary, inferred from unrelated data, or translated into a different language. Suddenly, that tightly written regex rule that once caught 99% of violations misses the one generation that matters most.
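To make that failure mode concrete, here is a minimal sketch of a pattern-based check running against the kinds of output a model can produce. The regex and the sample strings are illustrative assumptions, not taken from any particular DLP product:

```python
import re

# A typical DLP-style rule: flag anything that matches a canonical email format.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}\b")

model_outputs = [
    "You can reach her at jane.doe@example.com.",           # canonical form
    "Her address is jane dot doe at example dot com.",      # spelled out
    "Schreib ihr einfach an jane.doe(at)example.com.",      # obfuscated, German
    "Ask for the CFO of the fintech startup in Zurich.",    # indirect identifier
]

for text in model_outputs:
    flagged = bool(EMAIL_PATTERN.search(text))
    print(f"flagged={flagged}  {text}")

# Only the first output trips the pattern; the other three carry the same
# identifying information in forms the rule was never written to anticipate.
```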
Even modern DLPs struggle with GenAI because:
- Model output is free-form text, not structured fields with predictable formats.
- Sensitive details can be paraphrased, summarized, or translated rather than copied verbatim.
- Identifiers can be inferred from context without ever appearing in a database or prompt.
- The surrounding conversation reads like ordinary dialogue, so nothing stands out to pattern-based filters.

That’s why catching PII in GenAI systems requires a different approach: one that can understand language, not just patterns.
**👉 Want to see how traditional DLP tools break down in GenAI systems? Check out our deep dive on why regex doesn’t speak the language of AI.**
If PII risks in GenAI systems happen at the output layer, then catching them after the fact isn’t good enough.
You need a defense that works in real time—before an LLM-generated response ever reaches the user.
That’s exactly what Lakera Guard was built for.
Unlike legacy tools that analyze logs or static outputs, Lakera Guard sits in the flow of interaction. That means it can catch risky generations as they happen, before anything gets displayed, stored, or logged.
And because GenAI risks aren’t one-size-fits-all, Lakera Guard adapts to how those risks actually show up: multilingual prompts, indirect identifiers, and leakage that is inferred rather than stated outright.
It’s built to handle the complexity of natural language. It evaluates meaning, not just format, allowing it to identify sensitive data even when it’s embedded in nuanced, indirect, or unfamiliar phrasing. That’s what makes it effective where traditional tools fail.
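Architecturally, “sitting in the flow of interaction” means the screening step runs between generation and delivery. The sketch below shows only that shape: `screen_for_pii` is a hypothetical placeholder (here a toy email check) standing in for a real semantic detection service such as Lakera Guard, not a depiction of its actual API:

```python
import re
from dataclasses import dataclass, field


@dataclass
class ScreeningResult:
    contains_pii: bool
    categories: list = field(default_factory=list)


def screen_for_pii(text: str) -> ScreeningResult:
    """Hypothetical placeholder for a semantic PII detector.

    In a real deployment this function would call a detection service on the
    draft response; a toy email check keeps the sketch self-contained."""
    categories = []
    if re.search(r"\b[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}\b", text):
        categories.append("email_address")
    return ScreeningResult(contains_pii=bool(categories), categories=categories)


def respond(user_prompt: str, llm_call) -> str:
    # Generate the model's answer first...
    draft = llm_call(user_prompt)

    # ...then screen it *before* it is displayed, stored, or logged.
    result = screen_for_pii(draft)
    if result.contains_pii:
        # Block, redact, or route for review; here we simply withhold it.
        return "[response withheld: possible personal data detected]"
    return draft


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        return "Sure, you can reach Jane at jane.doe@example.com."

    print(respond("Who handles billing questions?", fake_llm))
```

The ordering is the key design choice: the draft is screened before it leaves the application, so a risky generation is never displayed, stored, or logged.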
**👉 Want to test it in action? Try the Lakera Playground to see how Lakera Guard flags PII across languages and edge cases.**
If GenAI changes how PII risk shows up, it also changes how teams need to think about prevention. The goal isn’t just to spot obvious violations—it’s to build a system that understands how sensitive data appears in unpredictable, fluid conversations.
Here’s what that looks like in practice: screen model outputs in real time instead of auditing logs after the fact, treat indirect and inferred identifiers as seriously as direct ones, and test defenses against paraphrased, translated, and conversational versions of the same data. One way to handle detections category by category is sketched below.
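As a sketch of that category-aware handling, the snippet below maps detected PII categories to actions. The category names, actions, and severity ordering are assumptions chosen for illustration, not a prescribed schema:

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"


# Hypothetical policy: how each detected category should be handled
# before a response leaves the application.
POLICY = {
    "email_address": Action.REDACT,
    "phone_number": Action.REDACT,
    "national_id": Action.BLOCK,
    "indirect_identifier": Action.REDACT,
}


def decide(detected_categories: list) -> Action:
    """Pick the strictest action demanded by any detected category."""
    severity = {Action.ALLOW: 0, Action.REDACT: 1, Action.BLOCK: 2}
    actions = [POLICY.get(category, Action.BLOCK) for category in detected_categories]
    return max(actions, key=severity.get, default=Action.ALLOW)


print(decide(["email_address"]))                 # Action.REDACT
print(decide(["email_address", "national_id"]))  # Action.BLOCK
print(decide([]))                                # Action.ALLOW
```

Unknown categories default to the strictest action, which keeps the policy safe as the detector learns to flag new kinds of sensitive data.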
Teams deploying GenAI systems can’t afford to treat PII as a solved problem. The rules have changed—and so have the failure modes.
Personally Identifiable Information isn’t what it used to be. It no longer lives solely in databases or follows predictable formats. In the GenAI era, it lives in language—in context, in inference, in generation. That means security needs to evolve, too.
Protecting PII today requires systems that understand how models behave, not just what data looks like. Because when models can speak, they can leak—and the only real defense is one that listens in real time.