PII Replacement#

Learn how to detect, redact, or replace personally identifiable information in your tabular datasets using NeMo Safe Synthesizer transformation capabilities.

Overview#

It is generally best practice to redact and replace any PII prior to synthesizing your data. The PII replacement step ensures that the model has no chance to learn the most sensitive, identifying information in your data, such as names or addresses.

What You Can Do With PII Replacement#

  • Classify Columns: Classify entire columns as an entity in tabular data, such as a column of first names or a column of email addresses.

  • Detect Entities in Free Text: Detect PII within cells of free text and replace individual values within the text with realistic values of the same type.

  • Transform Entities: Once detected, you can redact, label, hash, or replace entities for privacy protection.

How It Works#

The PII Replacement component automatically identifies sensitive information in your tabular data and replaces it with realistic alternatives. It uses Named Entity Recognition (NER) models, regular expression patterns, and LLM-based column classification to detect personally identifiable information across different locales.

We recommend including the PII replacement step in your configuration ahead of the synthesis step.
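
Conceptually, the raw table is de-identified before any synthesis model is trained on it. The sketch below uses hypothetical placeholder functions (replace_pii and train_synthesizer are stand-ins, not actual API calls) purely to show that ordering:

import pandas as pd

def replace_pii(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical placeholder: detect PII and replace it column by column.
    return df.assign(name=["<first_name>"] * len(df))

def train_synthesizer(df: pd.DataFrame) -> None:
    # Hypothetical placeholder: the synthesis model only ever sees de-identified data.
    print(df.head())

raw = pd.DataFrame({"name": ["Sally", "Omar"], "age": [34, 29]})
train_synthesizer(replace_pii(raw))  # PII replacement runs first, synthesis second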

PII Detection#

When you run detection, the system follows these steps (a simplified sketch of the flow appears after the list):

  1. Column Analysis: The system analyzes your column headers and sample data to determine entity types

  2. Content Scanning: The system scans your text content using named entity recognition and regex

  3. Entity Recognition: The system identifies specific PII entities with confidence scores

  4. Conflict Resolution: The system merges overlapping detections and selects the best matches

  5. Classification: The system assigns final entity types to your detected sensitive information
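
As a rough, self-contained illustration of steps 1-3 and 5 (not the Safe Synthesizer implementation), the sketch below classifies a column from its header and a regex scan of its sample values; the header keyword, pattern, and thresholds are illustrative:

import re

# Illustrative only: classify a column as "email" from its header or from
# regex hits over sample values, returning the type with a confidence score.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def detect_column(header: str, samples: list[str]) -> tuple[str, float]:
    # Step 1, column analysis: header keywords give a strong hint.
    if "email" in header.lower():
        return "email", 0.99
    # Steps 2-3, content scanning and entity recognition: count regex matches.
    hits = sum(bool(EMAIL_RE.fullmatch(s)) for s in samples)
    confidence = hits / max(len(samples), 1)
    # Step 5, classification: assign the type only if confidence is high enough.
    return ("email", confidence) if confidence >= 0.8 else ("unknown", confidence)

print(detect_column("contact", ["a@example.com", "b@example.org"]))  # ('email', 1.0)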

You can detect more than 50 types of personally identifiable information, such as names, contact details, financial data, government IDs, and technical identifiers. For a complete list, refer to Supported Entities.

Supported Transformation Functions#

For any detected entity, you can apply one of the following transformations for privacy protection (a sketch of each option follows the list):

  • Fake: Replaces the value with synthetic data. For example, “I met Sally” becomes “I met Lucy”

  • Redact: Replaces detected entities with the entity type. For example, “I met Sally” becomes “I met <first_name>”

  • Label: Labeling is similar to redaction, but also includes the entity value. For example, “I met Sally” becomes “I met <first_name: Sally>”

  • Hash: Anonymizes by converting data to a unique alphanumeric value. For example, “I met Sally” becomes “I met a75e4r”
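
As a hedged, standalone sketch of these four options applied to one detected value (plain Python, not the Safe Synthesizer API; the replacement name and label format are made-up examples):

import hashlib

# Apply each transformation option to a single detected entity.
text, entity, entity_type = "I met Sally", "Sally", "first_name"

fake = text.replace(entity, "Lucy")                         # Fake: synthetic value of the same type
redact = text.replace(entity, f"<{entity_type}>")           # Redact: entity type only
label = text.replace(entity, f"<{entity_type}: {entity}>")  # Label: type plus original value
hashed = text.replace(entity, hashlib.sha256(entity.encode()).hexdigest()[:6])  # Hash: alphanumeric digest

print(fake, redact, label, hashed, sep="\n")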


Detection Methods#

Named Entity Recognition in Free Text

You can use this machine learning approach with transformer models for contextual entity recognition in free text (a standalone sketch appears at the end of this section).

Regular Expression Detection

You can use this pattern matching approach for high-precision detection of structured identifiers.

LLM Column Classification

You can use this AI-powered analysis with large language models for column type determination.
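
For example, a minimal standalone sketch of the free-text NER approach using the open-source gliner package (the model name, label set, and threshold below are illustrative choices, not Safe Synthesizer defaults):

from gliner import GLiNER

# Detect PII entities in free text with a GLiNER model.
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")  # illustrative model choice

text = "Contact Sally Smith at sally.smith@example.com or 555-0123."
labels = ["first_name", "last_name", "email", "phone_number"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["label"], ent["text"], round(ent["score"], 2))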

Choosing a Detection Method#

Select the approach that best fits your data type and performance requirements:

Choose Your Detection Method#

| Data Type | Recommended Method & Use Case | Setup Guide |
| --- | --- | --- |
| Unstructured text | Named Entity Recognition - for emails, documents, and chat logs where PII appears within sentences | GLiNER Setup Guide → |
| Structured entities | Regex Detection - for entities with consistent formatting patterns | Regex Setup Guide → |
| Entire columns | LLM Column Classification - for columns where each cell is a single entity | LLM Column Classification Guide → |
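
For the regex route, a simplified standalone sketch of pattern-based detection (the patterns below are intentionally loose illustrations, not the patterns the service uses):

import re

# Illustrative patterns for structured identifiers.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def regex_detect(text: str):
    # Yield (entity_type, matched_text, span) for every pattern hit.
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield entity_type, match.group(), match.span()

print(list(regex_detect("Reach me at jane@example.com from 10.0.0.1")))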

Combined Methods#

You can run all three methods simultaneously for comprehensive detection. The system combines their results and resolves conflicts automatically:

  1. Parallel Processing: All enabled methods scan your data simultaneously

  2. Conflict Resolution: When methods find overlapping entities, the system keeps the one with the higher confidence score (see the sketch after this list)

  3. Combined Results: You get a unified list with no duplicates or conflicts
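
A minimal sketch of the conflict-resolution step (illustrative, not the actual implementation): detections are tuples of (start, end, entity_type, confidence), and overlapping spans keep only the higher-confidence result:

# Keep the highest-confidence detection for each overlapping span.
def resolve(detections):
    kept = []
    for det in sorted(detections, key=lambda d: d[3], reverse=True):
        start, end = det[0], det[1]
        if not any(start < k[1] and k[0] < end for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d[0])

ner_hit = (6, 11, "first_name", 0.92)   # "Sally" found by NER
regex_hit = (6, 11, "user_name", 0.60)  # same span flagged by regex
print(resolve([ner_hit, regex_hit]))    # keeps the higher-confidence NER result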

Default Detection Configuration#

The system uses these entities by default for both classification and NER:

entities = [
    # True identifiers
    "first_name", "last_name", "name", "street_address", "city", 
    "state", "postcode", "country", "address", "latitude", "longitude",
    "coordinate", "age", "phone_number", "fax_number", "email", "ssn",
    "unique_identifier", "medical_record_number", "health_plan_beneficiary_number",
    "account_number", "certificate_license_number", "vehicle_identifier",
    "license_plate", "device_identifier", "biometric_identifier", "url",
    "ipv4", "ipv6", "national_id", "tax_id", "bank_routing_number",
    "swift_bic", "credit_debit_card", "cvv", "pin", "employee_id",
    "api_key", "customer_id", "user_name", "password", "mac_address", 
    "http_cookie",
    
    # Quasi-identifiers
    "date", "date_time", "blood_type", "gender", "sexuality", 
    "political_view", "race", "ethnicity", "religious_belief", 
    "language", "education", "job_title", "employment_status", 
    "company_name"
]