PII Replacement

PII (Personally Identifiable Information) replacement is a critical privacy protection step that detects and replaces sensitive information in your datasets before synthesis. Because detected values such as names, addresses, and other identifiers are replaced before training, the model never sees them and cannot learn them.

How It Works

The PII replacement pipeline operates in multiple stages:

  1. Detection: Identifies PII entities using configurable detection methods

  2. Classification: Categorizes detected entities by type (name, email, address, and so on)

  3. Transformation: Replaces or redacts PII using configurable rules

  4. Validation: Verifies that sensitive information has been properly handled
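The four stages can be sketched in plain Python. This is a minimal illustration under stated assumptions, not NeMo Safe Synthesizer's implementation: the `Entity` type, the function names, and the single email regex are all hypothetical and exist only to make the flow concrete.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern used only for this sketch; real detectors are richer.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class Entity:
    start: int
    end: int
    text: str
    label: str = "unknown"


def detect(text: str) -> list[Entity]:
    """Stage 1: find candidate PII spans."""
    return [Entity(m.start(), m.end(), m.group()) for m in EMAIL_RE.finditer(text)]


def classify(entities: list[Entity]) -> list[Entity]:
    """Stage 2: assign an entity type to each span."""
    for entity in entities:
        entity.label = "email"  # this sketch has a single detector
    return entities


def transform(text: str, entities: list[Entity]) -> str:
    """Stage 3: redact spans, working right to left so offsets stay valid."""
    for entity in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[: entity.start] + f"<{entity.label}>" + text[entity.end :]
    return text


def validate(text: str) -> bool:
    """Stage 4: confirm that the detector finds nothing in the output."""
    return not detect(text)


def replace_pii(text: str) -> str:
    cleaned = transform(text, classify(detect(text)))
    if not validate(cleaned):
        raise ValueError("PII still present after transformation")
    return cleaned
```

Calling `replace_pii("Contact ada@example.com today")` runs all four stages in order. Validation here simply re-runs detection on the transformed text, which is one plausible way to check the result.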

Detection Methods

NeMo Safe Synthesizer supports multiple PII detection approaches:

Nemotron PII Detection

Uses the Nemotron PII model for entity recognition:

  • Zero-shot entity detection

  • Supports custom entity types

  • High accuracy for standard PII categories

  • Configurable confidence thresholds
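To illustrate what a confidence threshold does, the sketch below filters a list of model predictions by score. The `Prediction` type and `filter_by_confidence` helper are hypothetical and are not part of the Nemotron PII or NeMo Safe Synthesizer API. They assume only that each detection carries a label and a score between 0 and 1.

```python
from typing import NamedTuple, Optional


class Prediction(NamedTuple):
    text: str
    label: str
    score: float  # assumed model confidence in [0, 1]


def filter_by_confidence(
    predictions: list[Prediction],
    threshold: float = 0.5,
    per_label: Optional[dict[str, float]] = None,
) -> list[Prediction]:
    """Keep predictions whose score meets the threshold for their label.

    `per_label` optionally overrides the default threshold for given labels.
    """
    per_label = per_label or {}
    return [p for p in predictions if p.score >= per_label.get(p.label, threshold)]
```

A lower threshold catches more candidate entities at the cost of more false positives. A higher threshold does the opposite, which is why a per-label override can be useful for high-risk types.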

LLM Classification

Leverages language models for PII detection:

  • Contextual understanding of entities

  • Handles complex PII patterns

  • Flexible entity definitions

  • Configurable prompts and models

Regex Detection

Pattern-based detection for structured PII:

  • Fast and deterministic

  • Ideal for known formats (SSN, phone numbers)

  • Customizable patterns

  • Low computational overhead
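As an illustration of pattern-based detection, the sketch below scans text for two structured formats using Python's standard `re` module. The patterns are deliberately simple, US-centric assumptions and are not the patterns shipped with NeMo Safe Synthesizer. The `PATTERNS` table and `find_pii` helper are hypothetical names.

```python
import re

# Illustrative patterns only; production patterns need more format coverage.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone_number": re.compile(r"(?<!\d)\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}(?!\d)"),
}


def find_pii(text: str) -> list[tuple[str, str, int, int]]:
    """Return (entity_type, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])
```

Because the same input always yields the same matches, this approach is easy to test, but it only finds values that follow the formats the patterns anticipate.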

Replacement Strategies

After detection, PII can be handled in multiple ways:

  • Replacement: Generate realistic replacements using the Faker library or custom expressions.

  • Redaction: Substitute with placeholder tokens.

  • Hashing: Convert to a one-way digital fingerprint. The same input always produces the same output, but the original value cannot be recovered from it.

  • Custom Rules: Define your own transformation logic.
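The first three strategies can be compared side by side with a short sketch. Everything here is an assumption for illustration: the function names are hypothetical, SHA-256 is one possible hash choice, and `fake_email` is a standard-library stand-in for the realistic values that a library such as Faker would generate.

```python
import hashlib
import random
import string


def redact(value: str, label: str) -> str:
    """Redaction: swap the value for a placeholder token naming its type."""
    return f"<{label}>"


def hash_value(value: str, salt: str = "") -> str:
    """Hashing: one-way SHA-256 fingerprint; equal inputs give equal outputs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def fake_email(rng: random.Random) -> str:
    """Replacement: a synthetic value of the same type (stand-in for Faker)."""
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.com"


original = "ada@example.com"
rng = random.Random(0)  # seeded so repeated runs give the same fake value

redacted = redact(original, "email")
hashed = hash_value(original, salt="demo-salt")
replaced = fake_email(rng)
```

Redaction removes the value's shape entirely, hashing preserves equality between records without revealing the value, and replacement keeps the data looking realistic. Which trade-off is appropriate depends on how the column is used downstream.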

Supported Entity Types

Nemotron PII is fine-tuned to recognize many entity types out of the box. They are listed below by category:

Personal Information

  • first_name - Given names

  • last_name - Surnames and family names

  • name - Full names

  • email - Email addresses

  • phone_number - Phone numbers in various formats

  • fax_number - Fax numbers in various formats

Addresses

  • address - Complete physical addresses (for example, 123 Main Street, Anytown, CA 90210)

  • street_address - Street addresses (for example, 123 Main Street)

  • city - City names

  • county - County names

  • state - State/province names

  • postcode - Postal/ZIP codes

  • country - Country names

Personal Identifiers

  • ssn - Social Security Numbers

  • national_id - National ID numbers

  • tax_id - Tax ID numbers

  • certificate_license_number - Certificate and license numbers (for example, driver’s license numbers)

  • unique_identifier - Generic unique IDs

  • customer_id - Customer identifiers

  • employee_id - Employee identifiers

Financial Information

  • credit_debit_card - Credit and debit card numbers

  • cvv - Card verification values (CVVs)

  • pin - Personal identification numbers

  • account_number - Bank account numbers

  • bank_routing_number - Bank routing numbers

  • swift_bic - SWIFT/BIC codes

  • iban - International Bank Account Numbers (IBANs)

Medical Information

  • medical_record_number - Medical record numbers

  • health_plan_beneficiary_number - Insurance IDs

  • biometric_identifier - Biometric data references

Technical Identifiers

  • url - Web URLs

  • ipv4 - IPv4 addresses

  • ipv6 - IPv6 addresses

  • mac_address - Hardware MAC addresses

  • api_key - API keys and tokens

  • user_name - Usernames

  • password - Passwords

  • http_cookie - HTTP cookies

  • device_identifier - Device IDs

Vehicle Identifiers

  • vehicle_identifier - Vehicle identification numbers (VINs)

  • license_plate - License plates

Geographic Information

  • latitude - Latitude coordinates

  • longitude - Longitude coordinates

  • coordinate - Coordinate pairs

Quasi Identifiers

  • date - Date values

  • date_time - Date and time values

  • date_of_birth - Birth dates

  • time - Time values

  • age - Ages

  • blood_type - Blood type information

  • gender - Gender information

  • sexuality - Sexual orientation

  • political_view - Political affiliations

  • race_ethnicity - Race and ethnicity information

  • religious_belief - Religious affiliations

  • language - Language preferences

  • education_level - Education level

  • occupation - Professional titles

  • employment_status - Employment information

  • company_name - Organization names

Custom Entity Types

Beyond these built-in types, you can define custom entities using:

  • Nemotron PII: Fast, accurate zero-shot NER for standard and custom entity types

  • Regex: Deterministic pattern matching, best for consistent formats (SSN, credit cards)

  • LLM: Contextual understanding, handles complex patterns and ambiguous cases

Example custom entity (project_code is the custom type; the other entries are built in):

{
    "classify": {
        "enable": true,
        "entities": [
            "first_name", "last_name", "email",
            "employee_id", "project_code"
        ]
    }
}
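The same configuration can be assembled in Python and serialized to JSON, which makes it easy to see which requested entities are custom. This is a sketch under assumptions: `BUILT_IN` is a hand-copied subset of the built-in types listed earlier, and `classify_config` is a hypothetical helper, not a NeMo Safe Synthesizer function.

```python
import json

# Hand-copied subset of the built-in entity types, for illustration only.
BUILT_IN = {"first_name", "last_name", "name", "email", "phone_number", "employee_id"}


def classify_config(entities: list[str]) -> dict:
    """Build the `classify` section, dropping duplicates but keeping order."""
    return {"classify": {"enable": True, "entities": list(dict.fromkeys(entities))}}


requested = ["first_name", "last_name", "email", "employee_id", "project_code"]
custom = [entity for entity in requested if entity not in BUILT_IN]

config = classify_config(requested)
print(json.dumps(config, indent=4))
print("Custom entity types:", custom)
```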

Configuration

PII replacement is configured through the replace_pii section. For the full schema, refer to the Parameters Reference.

{
    "replace_pii": {
        "globals": {"locales": ["en_US"]},
        "steps": [
            {
                "rows": {
                    "update": [
                        {
                            "entity": ["email", "phone_number"],
                            "value": "column.entity | fake"
                        }
                    ]
                }
            }
        ]
    }
}
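When a configuration grows to several steps, it can help to check which entity types have an update rule. The sketch below parses the example above and maps each entity to its `value` expression. The `rules_by_entity` helper is hypothetical and assumes only the nesting shown in the example (`steps`, then `rows`, then `update`). It does not interpret the expression itself.

```python
import json

CONFIG_JSON = """
{
    "replace_pii": {
        "globals": {"locales": ["en_US"]},
        "steps": [
            {
                "rows": {
                    "update": [
                        {
                            "entity": ["email", "phone_number"],
                            "value": "column.entity | fake"
                        }
                    ]
                }
            }
        ]
    }
}
"""


def rules_by_entity(config: dict) -> dict[str, str]:
    """Map each entity type to the value expression of its update rule."""
    mapping = {}
    for step in config["replace_pii"]["steps"]:
        for rule in step.get("rows", {}).get("update", []):
            for entity in rule.get("entity", []):
                mapping[entity] = rule["value"]  # a later rule overrides an earlier one
    return mapping


rules = rules_by_entity(json.loads(CONFIG_JSON))
for entity, expression in rules.items():
    print(f"{entity}: {expression}")
```

Comparing the keys of `rules` against the entity types you expect in your data is a quick way to spot types that have no rule defined.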

When to Use PII Replacement

Consider using PII replacement when:

  • Your data contains names, addresses, or other direct identifiers

  • Compliance requires PII removal before processing

  • You want to ensure the model cannot memorize sensitive values

  • You need to share synthetic data with external parties

Even when none of these conditions apply, PII replacement is recommended as a preprocessing step before synthesis.