PII Replacement#
PII (Personally Identifiable Information) replacement is a critical privacy protection step that detects and replaces sensitive information in your datasets before synthesis. Because the replaced values never reach training, the model cannot learn the most sensitive information, such as names, addresses, and other identifiers.
How It Works#
The PII replacement pipeline operates in multiple stages:
Detection: Identifies PII entities using configurable detection methods
Classification: Categorizes detected entities by type (name, email, address, and so on)
Transformation: Replaces or redacts PII using configurable rules
Validation: Verifies that sensitive information has been properly handled
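The four stages above can be sketched in a few lines of Python. This is only an illustration of the flow, not NeMo Safe Synthesizer's implementation: the regex patterns, the `Entity` class, and the function names are all hypothetical, and detection and classification are collapsed into one step because each pattern already implies its entity type.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; not the product's detectors.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class Entity:
    label: str  # entity type assigned during classification
    start: int  # character offset where the match begins
    end: int    # character offset just past the match
    text: str   # the matched value

def detect_and_classify(text: str) -> list[Entity]:
    """Stages 1 and 2: find candidate spans and tag each with its type."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(Entity(label, match.start(), match.end(), match.group()))
    return sorted(found, key=lambda e: e.start)

def transform(text: str, entities: list[Entity]) -> str:
    """Stage 3: redact each span, working right to left so offsets stay valid."""
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:e.start] + f"<{e.label}>" + text[e.end:]
    return text

def validate(text: str) -> bool:
    """Stage 4: re-run detection and confirm nothing is left."""
    return not detect_and_classify(text)

def run_pipeline(text: str) -> str:
    cleaned = transform(text, detect_and_classify(text))
    if not validate(cleaned):
        raise ValueError("PII still present after transformation")
    return cleaned

result = run_pipeline("Contact jane@example.com, SSN 123-45-6789.")
print(result)
```

Re-running the detectors on the transformed text is a cheap validation step, though it can only catch what the detectors recognize in the first place.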
Detection Methods#
NeMo Safe Synthesizer supports multiple PII detection approaches:
Nemotron PII Detection#
Uses the Nemotron PII model for entity recognition:
Zero-shot entity detection
Supports custom entity types
High accuracy for standard PII categories
Configurable confidence thresholds
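To illustrate what a confidence threshold does, the sketch below filters a list of scored detections. The `Detection` type, the `filter_by_confidence` function, and the per-entity override are assumptions made for this example; they are not the names or shape of the product's settings.

```python
from typing import NamedTuple, Optional

class Detection(NamedTuple):
    entity: str   # predicted entity type, e.g. "first_name"
    text: str     # the detected value
    score: float  # model confidence between 0 and 1

def filter_by_confidence(
    detections: list[Detection],
    default_threshold: float = 0.5,
    per_entity: Optional[dict[str, float]] = None,
) -> list[Detection]:
    """Keep detections whose score meets the threshold for their entity type."""
    per_entity = per_entity or {}
    return [
        d for d in detections
        if d.score >= per_entity.get(d.entity, default_threshold)
    ]

detections = [
    Detection("first_name", "Jane", 0.92),
    Detection("city", "Reading", 0.41),
    Detection("ssn", "123-45-6789", 0.55),
]

# Hypothetical policy: accept SSN detections more readily than other types.
kept = filter_by_confidence(detections, default_threshold=0.6, per_entity={"ssn": 0.5})
print([d.entity for d in kept])
```

A lower threshold catches more PII at the cost of more false positives, so a stricter default with looser settings for high-risk types is one plausible trade-off.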
LLM Classification#
Leverages language models for PII detection:
Contextual understanding of entities
Handles complex PII patterns
Flexible entity definitions
Configurable prompts and models
Regex Detection#
Pattern-based detection for structured PII:
Fast and deterministic
Ideal for known formats (SSN, phone numbers)
Customizable patterns
Low computational overhead
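As a rough illustration of pattern-based detection, the following sketch matches a few fixed formats with Python's `re` module. The patterns are simplified US-centric examples, and `project_code` with its `PRJ-0000` format is a hypothetical custom entity; none of this reflects the product's built-in patterns.

```python
import re

# Simplified example patterns; real-world formats vary more than this.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_number": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    # Hypothetical custom entity with an assumed "PRJ-" prefix and four digits.
    "project_code": re.compile(r"\bPRJ-\d{4}\b"),
}

def find_entities(text: str, patterns: dict[str, re.Pattern]) -> list[tuple[str, str, int]]:
    """Return (entity, matched_text, start_offset) tuples in document order."""
    hits = []
    for entity, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((entity, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])

hits = find_entities("Call 555-867-5309 about PRJ-0042; SSN 123-45-6789.", PATTERNS)
for entity, value, start in hits:
    print(f"{start:>3}  {entity:<13} {value}")
```

The same input always yields the same matches, which is what makes this approach deterministic, but a value that deviates from the expected format is missed entirely.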
Replacement Strategies#
After detection, PII can be handled in multiple ways:
Replacement: Generate realistic replacements using the Faker library or custom expressions.
Redaction: Substitute with placeholder tokens.
Hashing: Convert to a unique digital fingerprint (one-way).
Custom Rules: Define your own transformation logic.
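The difference between these strategies is easiest to see side by side. The sketch below uses only the Python standard library; the function names are hypothetical, and `fake_email` is a simplified stand-in for what a generator such as Faker would produce, not a call into Faker itself.

```python
import hashlib
import random
import string

def redact(value: str, entity: str) -> str:
    """Redaction: substitute a placeholder token named after the entity type."""
    return f"<{entity}>"

def hash_value(value: str, salt: str = "") -> str:
    """Hashing: one-way SHA-256 digest, truncated here for readability."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def fake_email(value: str, seed: int = 0) -> str:
    """Replacement: a realistic-looking value, seeded so the same input maps
    to the same output. Stand-in for a Faker-style generator."""
    rng = random.Random(f"{seed}:{value}")
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.com"

original = "jane.doe@corp.example"
print("redacted:", redact(original, "email"))
print("hashed:  ", hash_value(original, salt="demo-salt"))
print("replaced:", fake_email(original))
```

Redaction discards the value completely, hashing keeps equality between rows without revealing the value, and replacement preserves the look of the data for downstream use. Salting the hash, as assumed here, makes it harder to reverse low-entropy values by brute force.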
Supported Entity Types#
Nemotron PII has been specifically fine-tuned to recognize many entity types out of the box, organized by category:
Personal Information#
first_name - Given names
last_name - Surnames and family names
name - Full names
email - Email addresses
phone_number - Phone numbers in various formats
fax_number - Fax numbers in various formats
Addresses#
address - Complete physical addresses (for example, 123 Main Street, Anytown, CA 90210)
street_address - Street addresses (for example, 123 Main Street)
city - City names
county - County names
state - State/province names
postcode - Postal/ZIP codes
country - Country names
Personal Identifiers#
ssn - Social Security Numbers
national_id - National ID numbers
tax_id - Tax ID numbers
certificate_license_number - Driver’s license numbers
unique_identifier - Generic unique IDs
customer_id - Customer identifiers
employee_id - Employee identifiers
Financial Information#
credit_debit_card - Credit and debit card numbers
cvv - Credit card verification codes
pin - Personal identification numbers
account_number - Bank account numbers
bank_routing_number - Bank routing numbers
swift_bic - SWIFT/BIC codes
iban - International bank account numbers
Medical Information#
medical_record_number - Medical record numbers
health_plan_beneficiary_number - Insurance IDs
biometric_identifier - Biometric data references
Technical Identifiers#
url - Web URLs
ipv4 - IPv4 addresses
ipv6 - IPv6 addresses
mac_address - Hardware MAC addresses
api_key - API keys and tokens
user_name - Usernames
password - Passwords
http_cookie - HTTP cookies
device_identifier - Device IDs
Vehicle Identifiers#
vehicle_identifier - Vehicle identification numbers (VINs)
license_plate - License plates
Geographic Information#
latitude - Latitude coordinates
longitude - Longitude coordinates
coordinate - Coordinate pairs
Quasi Identifiers#
date - Date values
date_time - Date and time values
date_of_birth - Birth dates
time - Time values
age - Ages
blood_type - Blood type information
gender - Gender information
sexuality - Sexual orientation
political_view - Political affiliations
race_ethnicity - Race and ethnicity information
religious_belief - Religious affiliations
language - Language preferences
education_level - Education level
occupation - Professional titles
employment_status - Employment information
company_name - Organization names
Custom Entity Types#
Beyond these built-in types, you can define custom entities using:
Nemotron PII: Fast, accurate zero-shot NER for standard and custom entity types
Regex: Deterministic pattern matching, best for consistent formats (SSN, credit cards)
LLM: Contextual understanding, handles complex patterns and ambiguous cases
Example Custom Entity:
{
  "classify": {
    "enable": true,
    "entities": [
      "first_name", "last_name", "email",
      "employee_id", "project_code"
    ]
  }
}
Configuration#
PII replacement is configured through the replace_pii section. For the full schema, refer to Parameters Reference.
{
  "replace_pii": {
    "globals": {"locales": ["en_US"]},
    "steps": [
      {
        "rows": {
          "update": [
            {
              "entity": ["email", "phone_number"],
              "value": "column.entity | fake"
            }
          ]
        }
      }
    ]
  }
}
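If you generate configurations from code, the same structure can be assembled as a Python dictionary and serialized to JSON. The `build_replace_pii` helper below is hypothetical and mirrors only the keys shown in the example above; it makes no claims about other fields in the schema, so check the Parameters Reference before relying on it.

```python
import json

def build_replace_pii(entities, value_expr="column.entity | fake", locales=("en_US",)):
    """Assemble a replace_pii section with one step and one update rule.

    Hypothetical helper: it reproduces the key layout from the example
    above and nothing more.
    """
    return {
        "replace_pii": {
            "globals": {"locales": list(locales)},
            "steps": [
                {
                    "rows": {
                        "update": [
                            {"entity": list(entities), "value": value_expr}
                        ]
                    }
                }
            ],
        }
    }

config = build_replace_pii(["email", "phone_number"])
print(json.dumps(config, indent=2))
```

Building the section in code makes it easier to keep the entity list in one place when several jobs share the same replacement rules.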
When to Use PII Replacement#
Consider using PII replacement when:
Your data contains names, addresses, or other direct identifiers
Compliance requires PII removal before processing
You want to ensure the model cannot memorize sensitive values
You need to share synthetic data with external parties
PII replacement is always recommended as a preprocessing step before synthesis.