PII Replacement#
Learn how to detect, redact, or replace personally identifiable information in your tabular datasets using NeMo Safe Synthesizer transformation capabilities.
Overview#
Redacting or replacing PII before you synthesize your data is generally best practice. The PII replacement step ensures that the model never sees, and so cannot learn, the most sensitive identifying information in your data, such as names or addresses.
What You Can Do With PII Replacement#
Classify Columns: Classify entire columns as an entity in tabular data, such as a column of first names or a column of email addresses.
Detect Entities in Free Text: Detect PII within cells of free text and replace individual values within the text with realistic values of the same type.
Transform Entities: Once entities are detected, redact, label, hash, or replace them for privacy protection.
How It Works#
The PII Replacement component automatically identifies sensitive information in your tabular data and replaces it with realistic alternatives. It uses Named Entity Recognition (NER) models, regex patterns, and machine learning to detect personally identifiable information across different locales.
We recommend including PII replacement in your configuration ahead of the synthesis step.
PII Detection#
When you run detection, the system follows these steps:
Column Analysis: The system analyzes your column headers and sample data to determine entity types
Content Scanning: The system scans your text content using named entity recognition and regex
Entity Recognition: The system identifies specific PII entities with confidence scores
Conflict Resolution: The system merges overlapping detections and selects the best matches
Classification: The system assigns final entity types to your detected sensitive information
You can detect more than 50 types of personally identifiable information, such as names, contact details, financial data, government IDs, and technical identifiers. For a complete list, refer to Supported Entities.
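The column analysis step can be pictured with a small heuristic. The sketch below is not the product's implementation (which, per this page, relies on an LLM for column classification); it is a simplified stand-in that combines a header hint with the share of sample values matching a pattern. The `PATTERNS` table, the `classify_column` function, and the `threshold` parameter are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical patterns for illustration; a real detector covers far more types.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def classify_column(header, samples, threshold=0.8):
    """Return an entity type for the column, or None if no type matches well."""
    normalized = header.strip().lower().replace(" ", "_")
    non_empty = [s for s in samples if s]
    if not non_empty:
        return None

    best_type, best_score = None, 0.0
    for entity_type, pattern in PATTERNS.items():
        # Share of sample cells that are, in their entirety, one entity.
        hits = sum(1 for s in non_empty if pattern.fullmatch(s))
        score = hits / len(non_empty)
        # A header that mentions the entity name nudges the score upward.
        if entity_type in normalized:
            score = min(1.0, score + 0.1)
        if score > best_score:
            best_type, best_score = entity_type, score

    return best_type if best_score >= threshold else None


email_col = classify_column("Email Address", ["a@b.com", "c@d.org", "e@f.io"])
notes_col = classify_column("notes", ["hello", "a@b.com"])
```

Here `email_col` is classified as `"email"`, while `notes_col` stays unclassified because only half of its samples match, which is the kind of column better handled by scanning cell contents for entities.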
Supported Transformation Functions#
You can transform any detected entity for privacy protection. Supported options include:
Fake: Replaces the value with synthetic data. For example, “I met Sally” becomes “I met Lucy”
Redact: Replaces detected entities with the entity type. For example, “I met Sally” becomes “I met <first_name>”
Label: Similar to redaction, but the output also includes the entity value. For example, in “I met Sally”, the label for “Sally” carries both the entity type (first_name) and the original value.
Hash: Anonymizes by converting data to a unique alphanumeric value. For example, “I met Sally” becomes “I met a75e4r”
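To make the transformation options concrete, here is a minimal sketch of redact, hash, and fake applied to one detected span. It is an illustration under stated assumptions, not the product's API: the `Entity` class, the function names, the tiny `FAKE_FIRST_NAMES` pool, and the choice of a truncated SHA-256 digest are all hypothetical. Label is left out because its exact output format is not specified here.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A detected span: character offsets plus the entity type (hypothetical shape)."""
    start: int
    end: int
    entity_type: str


# Assumed replacement pool; a real faker would draw from locale-aware data.
FAKE_FIRST_NAMES = ["Lucy", "Omar", "Priya", "Mateo"]


def redact(text, ent):
    """Replace the span with its entity type in angle brackets."""
    return f"{text[:ent.start]}<{ent.entity_type}>{text[ent.end:]}"


def hash_value(text, ent, length=6):
    """Replace the span with a short, deterministic digest of its value."""
    value = text[ent.start:ent.end]
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]
    return f"{text[:ent.start]}{digest}{text[ent.end:]}"


def fake(text, ent, rng):
    """Replace the span with a different value of the same type."""
    original = text[ent.start:ent.end]
    choices = [name for name in FAKE_FIRST_NAMES if name != original]
    return f"{text[:ent.start]}{rng.choice(choices)}{text[ent.end:]}"


text = "I met Sally"
sally = Entity(start=6, end=11, entity_type="first_name")

redacted = redact(text, sally)
hashed = hash_value(text, sally)
faked = fake(text, sally, random.Random(0))
```

In this sketch `redacted` is `I met <first_name>`, `hashed` always maps the same input to the same six-character token, and `faked` swaps in another first name from the pool. Truncating a digest this aggressively makes collisions likely on large datasets, so the length is a trade-off to tune rather than a recommendation.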
Detection Methods#
Named Entity Recognition: A machine learning approach that uses transformer models for contextual entity recognition in free text.
Regex Detection: A pattern matching approach for high-precision detection of structured identifiers.
LLM Column Classification: An AI-powered analysis that uses large language models for column type determination.
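As an illustration of the pattern matching approach, the sketch below scans free text with Python's standard `re` module and records each match as a span with an entity type and a confidence score. The `Detection` tuple, the `REGEX_DETECTORS` table, the two patterns, and the fixed scores are assumptions made for this example; they are not the patterns or scores the product uses.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    start: int
    end: int
    entity_type: str
    score: float


# Hypothetical patterns paired with an assumed fixed confidence per pattern.
REGEX_DETECTORS = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.95),
    "ipv4": (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), 0.80),
}


def scan_text(text):
    """Return every regex match in the text, ordered by position."""
    found = []
    for entity_type, (pattern, score) in REGEX_DETECTORS.items():
        for match in pattern.finditer(text):
            found.append(Detection(match.start(), match.end(), entity_type, score))
    return sorted(found)


text = "Contact ops@example.com from 10.0.0.7"
detections = scan_text(text)
matched_values = [text[d.start:d.end] for d in detections]
```

Regexes like these work well for identifiers with a consistent shape, but they cannot tell a name from an ordinary word, which is where contextual NER models complement them.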
Choosing a Detection Method#
Select the approach that best fits your data type and performance requirements:
Data Type | Recommended Method & Use Case | Setup Guide |
---|---|---|
Unstructured text | Named Entity Recognition - For emails, documents, chat logs where PII appears within sentences | |
Structured entities | Regex Detection - For entities with consistent formatting patterns | |
Entire columns | LLM Column Classification - For columns where each cell is a single entity | |
Combined Methods#
You can run all three methods simultaneously for comprehensive detection. The system combines their results and resolves conflicts automatically:
Parallel Processing: All enabled methods scan your data simultaneously
Conflict Resolution: When methods find overlapping entities, the system prioritizes the one with the highest confidence score
Combined Results: You get a unified list with no duplicates or conflicts
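The conflict resolution described above can be sketched as a greedy pass over the pooled detections. This is one plausible reading of "prioritize the higher confidence score", not a description of the product's internals: the `Detection` tuple, the `source` field, and the `resolve_conflicts` function are hypothetical.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    start: int
    end: int
    entity_type: str
    score: float
    source: str  # which method produced it, e.g. "ner" or "regex"


def overlaps(a, b):
    """True when two half-open character spans share at least one character."""
    return a.start < b.end and b.start < a.end


def resolve_conflicts(detections):
    """Keep the highest-scoring detection wherever spans overlap."""
    kept = []
    # Visit high-confidence detections first so they claim their span.
    for candidate in sorted(detections, key=lambda d: d.score, reverse=True):
        if not any(overlaps(candidate, winner) for winner in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda d: d.start)


# Pooled results for the text "Sally wrote from sally@example.com".
pooled = [
    Detection(0, 5, "first_name", 0.85, "ner"),
    Detection(17, 34, "email", 0.95, "regex"),
    Detection(17, 22, "first_name", 0.60, "ner"),
    Detection(17, 34, "email", 0.90, "ner"),
]
resolved = resolve_conflicts(pooled)
```

After resolution, `resolved` holds two entries: the name at offsets 0 to 5 and the regex email detection at 17 to 34. The duplicate email and the lower-confidence name nested inside the address are both dropped.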
Default Detection Configuration#
The system uses these entities by default for both classification and NER:
entities = [
    # True identifiers
    "first_name", "last_name", "name", "street_address", "city",
    "state", "postcode", "country", "address", "latitude", "longitude",
    "coordinate", "age", "phone_number", "fax_number", "email", "ssn",
    "unique_identifier", "medical_record_number", "health_plan_beneficiary_number",
    "account_number", "certificate_license_number", "vehicle_identifier",
    "license_plate", "device_identifier", "biometric_identifier", "url",
    "ipv4", "ipv6", "national_id", "tax_id", "bank_routing_number",
    "swift_bic", "credit_debit_card", "cvv", "pin", "employee_id",
    "api_key", "customer_id", "user_name", "password", "mac_address",
    "http_cookie",
    # Quasi-identifiers
    "date", "date_time", "blood_type", "gender", "sexuality",
    "political_view", "race", "ethnicity", "religious_belief",
    "language", "education", "job_title", "employment_status",
    "company_name",
]