What Is Personally Identifiable Information (PII)? And Why It’s Getting Harder to Protect
What counts as PII in the age of GenAI—and why it’s getting harder to protect. This guide breaks down evolving risks and what modern defenses look like.

Personally Identifiable Information—or PII—is one of the most widely recognized data categories in cybersecurity. It’s also one of the most misunderstood.
For years, companies have treated PII as a storage-layer problem: something to encrypt, redact, or delete from static databases. But in the age of GenAI, PII doesn’t just sit quietly in your systems. It moves. It leaks. It gets inferred, hallucinated, and exposed—without anyone explicitly asking for it.
The traditional rules no longer apply. And the risks aren’t theoretical.
From multilingual model responses to indirect data disclosures in chat-based tools, GenAI is introducing entirely new failure modes for PII—ones that legacy detection systems were never built to handle. If your AI application can generate output, it can leak PII.
This guide breaks down what PII actually is, how AI changes the threat model, and what it takes to stay ahead—especially when models don’t just store data, but speak it.
Test your model’s PII defenses. See how Lakera Guard handles multilingual prompts, indirect identifiers, and inferred leakage in real time.
The Lakera team has accelerated Dropbox’s GenAI journey.
“Dropbox uses Lakera Guard as a security solution to help safeguard our LLM-powered applications, secure and protect user data, and uphold the reliability and trustworthiness of our intelligent features.”
At its core, Personally Identifiable Information (PII) is any data that can be used to identify an individual—either directly or in combination with other information. Most organizations already handle PII in some form, but definitions vary across legal frameworks, and the boundary between “personal” and “non-personal” is getting harder to draw.
<div class="table_component" role="region" tabindex="0">
<table>
<caption><br></caption>
<thead>
<tr>
<th>
<p><b>Direct Identifiers</b></p>
<p><span><i>These are data points that can single-handedly pinpoint an individual</i></span></p>
</th>
<th>
<p><b>Indirect Identifiers</b></p>
<p><span><i>These require additional context to identify someone, but can still qualify as PII</i></span></p>
</th>
</tr>
<tr>
<td>
<p>Full name</p>
</td>
<td>Location data (e.g. GPS, city, postal code)</td>
</tr>
<tr>
<td>Social Security Number or national ID</td>
<td>IP address or device fingerprint</td>
</tr>
<tr>
<td>Passport or driver’s license number</td>
<td>Job title and employer</td>
</tr>
<tr>
<td>Email address or phone number</td>
<td>Browsing history or activity logs</td>
</tr>
<tr>
<td>Biometric data (e.g. fingerprint, retina scan)</td>
<td>Unique combinations that reveal identity (e.g. “the CFO of a fintech startup in Zurich”)</td>
</tr>
</thead>
<tbody></tbody>
</table>
Most privacy regulations—including GDPR, CCPA, and HIPAA—treat both types as sensitive, particularly when linked to identifiable individuals. In some cases, even pseudonymized or probabilistic identifiers can fall under PII protection if they can be traced back to a person.
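To make the “unique combinations” row concrete, here is a small, self-contained sketch. The records and field names are invented purely for illustration and stand in for any dataset of seemingly non-personal attributes:

```python
# Toy dataset: no single field below is a direct identifier,
# yet one combination of them points to exactly one person.
records = [
    {"job_title": "CFO", "industry": "fintech", "city": "Zurich"},
    {"job_title": "CFO", "industry": "fintech", "city": "Berlin"},
    {"job_title": "CTO", "industry": "fintech", "city": "Zurich"},
    {"job_title": "CFO", "industry": "retail",  "city": "Zurich"},
]

# "The CFO of a fintech startup in Zurich," expressed as indirect attributes.
quasi_identifier = {"job_title": "CFO", "industry": "fintech", "city": "Zurich"}

matches = [
    record for record in records
    if all(record[key] == value for key, value in quasi_identifier.items())
]

# One match: the combination is effectively PII, even though the dataset
# contains no names, email addresses, or ID numbers.
print(f"{len(matches)} matching record(s) for {quasi_identifier}")
```

This is the same re-identification logic that leads regulators to extend protection to pseudonymized data when it can be traced back to a person.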
**🔍 Want to see how these risks evolve in the age of AI? Download our AI Security Playbook for a practical breakdown of GenAI-specific vulnerabilities and defenses.**
But here’s the real shift: in GenAI systems, PII can surface even if you’ve never stored it. It might be inferred, reconstructed from indirect data, or hallucinated by a model under the right conditions.
We dig deeper into that risk next.
Protecting PII used to be a matter of securing what you stored: encrypt databases, restrict access, and monitor logs. But generative AI systems don’t just store information—they produce it. They can rephrase, infer, or even hallucinate personal details in output, sometimes without ever accessing structured data in the first place.
That shift breaks the traditional data protection model. And it’s already causing trouble.
In one of the earliest public examples, Samsung engineers accidentally leaked internal source code and confidential notes by pasting them into ChatGPT while trying to debug issues. The incident triggered an internal crackdown on the use of generative tools—and made global headlines for exposing just how easily sensitive information can be shared with public AI models.
Meanwhile, at Black Hat USA 2024, a researcher demonstrated how Microsoft Copilot could be manipulated to extract internal data and generate personalized phishing content—without triggering standard enterprise DLP systems. Copilot’s natural language interface made it easy for attackers to craft plausible requests that quietly bypassed conventional safeguards.
And a 2025 report covered by Help Net Security found that 8.5% of prompts submitted to tools like ChatGPT and Copilot included sensitive information—PII, credentials, and even internal file references. The vast majority of these exposures weren’t flagged by traditional systems, because they occurred not during data entry or storage, but during natural language interactions with the model.
These incidents aren’t edge cases. They’re reminders that in GenAI systems, PII risk lives at the output layer. It doesn’t need to be retrieved—it can be rewritten, inferred, or exposed in ways that sidestep every conventional control you’ve put in place.
And because these outputs look like everyday conversations, they’re often missed entirely.
Let’s take a closer look at why legacy defenses like regex-based DLP can’t keep up.
Most Data Loss Prevention systems were built for a simpler world—one where sensitive data looked the same every time. Credit card numbers followed fixed patterns. Social Security Numbers were always nine digits. And personal details lived in structured fields.
In that world, regex worked. But language models don’t speak in patterns—they speak in nuance.
When someone asks a GenAI assistant a question, the model might respond in any number of ways. And that flexibility is exactly what breaks traditional filters.
Consider just a few of the forms the same piece of PII can take in a model’s response: an email address spelled out in words (“jane dot doe at example dot com”), a phone number woven into a sentence instead of a structured field, or a person described only as “the CFO of a fintech startup in Zurich.”
And it doesn’t stop there. PII can be embedded in a summary, inferred from unrelated data, or translated into a different language. Suddenly, that tightly written regex rule that once caught 99% of violations misses the one generation that matters most.
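To make that failure mode concrete, here is a minimal sketch of a pattern-based check running against the kinds of output a model can produce. The regex and the sample strings are illustrative assumptions, not taken from any particular DLP product:

```python
import re

# A typical DLP-style rule: flag anything that matches a canonical email format.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}\b")

model_outputs = [
    "You can reach her at jane.doe@example.com.",           # canonical form
    "Her address is jane dot doe at example dot com.",      # spelled out
    "Schreib ihr einfach an jane.doe(at)example.com.",      # obfuscated, German
    "Ask for the CFO of the fintech startup in Zurich.",    # indirect identifier
]

for text in model_outputs:
    flagged = bool(EMAIL_PATTERN.search(text))
    print(f"flagged={flagged}  {text}")

# Only the first output trips the pattern; the other three carry the same
# identifying information in forms the rule was never written to anticipate.
```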
Even modern DLPs struggle with GenAI because:
- Model output is free-form text, not structured fields with predictable formats.
- Sensitive details can be paraphrased, summarized, or translated rather than copied verbatim.
- Identifiers can be inferred from context without ever appearing in a database or prompt.
- The surrounding conversation reads like ordinary dialogue, so nothing stands out to pattern-based filters.

That’s why catching PII in GenAI systems requires a different approach: one that can understand language, not just patterns.
**👉 Want to see how traditional DLP tools break down in GenAI systems? Check out our deep dive on why regex doesn’t speak the language of AI.**
If PII risks in GenAI systems happen at the output layer, then catching them after the fact isn’t good enough.
You need a defense that works in real time—before an LLM-generated response ever reaches the user.
That’s exactly what Lakera Guard was built for.
Unlike legacy tools that analyze logs or static outputs, Lakera Guard sits in the flow of interaction. That means it can catch risky generations as they happen, before anything gets displayed, stored, or logged.
And because GenAI risks aren’t one-size-fits-all, Lakera Guard adapts to how those risks actually show up: multilingual prompts, indirect identifiers, and leakage that is inferred rather than stated outright.
It’s built to handle the complexity of natural language. It evaluates meaning, not just format, allowing it to identify sensitive data even when it’s embedded in nuanced, indirect, or unfamiliar phrasing. That’s what makes it effective where traditional tools fail.
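Architecturally, “sitting in the flow of interaction” means the screening step runs between generation and delivery. The sketch below shows only that shape: `screen_for_pii` is a hypothetical placeholder (here a toy email check) standing in for a real semantic detection service such as Lakera Guard, not a depiction of its actual API:

```python
import re
from dataclasses import dataclass, field


@dataclass
class ScreeningResult:
    contains_pii: bool
    categories: list = field(default_factory=list)


def screen_for_pii(text: str) -> ScreeningResult:
    """Hypothetical placeholder for a semantic PII detector.

    In a real deployment this function would call a detection service on the
    draft response; a toy email check keeps the sketch self-contained."""
    categories = []
    if re.search(r"\b[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}\b", text):
        categories.append("email_address")
    return ScreeningResult(contains_pii=bool(categories), categories=categories)


def respond(user_prompt: str, llm_call) -> str:
    # Generate the model's answer first...
    draft = llm_call(user_prompt)

    # ...then screen it *before* it is displayed, stored, or logged.
    result = screen_for_pii(draft)
    if result.contains_pii:
        # Block, redact, or route for review; here we simply withhold it.
        return "[response withheld: possible personal data detected]"
    return draft


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        return "Sure, you can reach Jane at jane.doe@example.com."

    print(respond("Who handles billing questions?", fake_llm))
```

The ordering is the key design choice: the draft is screened before it leaves the application, so a risky generation is never displayed, stored, or logged.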
**👉 Want to test it in action? Try the Lakera Playground to see how Lakera Guard flags PII across languages and edge cases.**
If GenAI changes how PII risk shows up, it also changes how teams need to think about prevention. The goal isn’t just to spot obvious violations—it’s to build a system that understands how sensitive data appears in unpredictable, fluid conversations.
Here’s what that looks like in practice: screen model outputs in real time instead of auditing logs after the fact, treat indirect and inferred identifiers as seriously as direct ones, and test defenses against paraphrased, translated, and conversational versions of the same data. One way to handle detections category by category is sketched below.
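As a sketch of that category-aware handling, the snippet below maps detected PII categories to actions. The category names, actions, and severity ordering are assumptions chosen for illustration, not a prescribed schema:

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"


# Hypothetical policy: how each detected category should be handled
# before a response leaves the application.
POLICY = {
    "email_address": Action.REDACT,
    "phone_number": Action.REDACT,
    "national_id": Action.BLOCK,
    "indirect_identifier": Action.REDACT,
}


def decide(detected_categories: list) -> Action:
    """Pick the strictest action demanded by any detected category."""
    severity = {Action.ALLOW: 0, Action.REDACT: 1, Action.BLOCK: 2}
    actions = [POLICY.get(category, Action.BLOCK) for category in detected_categories]
    return max(actions, key=severity.get, default=Action.ALLOW)


print(decide(["email_address"]))                 # Action.REDACT
print(decide(["email_address", "national_id"]))  # Action.BLOCK
print(decide([]))                                # Action.ALLOW
```

Unknown categories default to the strictest action, which keeps the policy safe as the detector learns to flag new kinds of sensitive data.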
Teams deploying GenAI systems can’t afford to treat PII as a solved problem. The rules have changed—and so have the failure modes.
Personally Identifiable Information isn’t what it used to be. It no longer lives solely in databases or follows predictable formats. In the GenAI era, it lives in language—in context, in inference, in generation. That means security needs to evolve, too.
Protecting PII today requires systems that understand how models behave, not just what data looks like. Because when models can speak, they can leak—and the only real defense is one that listens in real time.