PII Replacement#

Learn how to detect, redact, or replace personally identifiable information in your tabular datasets using NeMo Safe Synthesizer transformation capabilities.

Overview#

It is generally best practice to redact and replace any PII prior to synthesizing your data. The PII replacement step ensures that the model has no chance to learn the most sensitive, identifying information in your data, such as names or addresses.

What You Can Do With PII Replacement#

  • Classify Columns: Classify entire columns as an entity in tabular data, such as a column of first names or a column of email addresses.

  • Detect Entities in Free Text: Detect PII within cells of free text and replace individual values within the text with realistic values of the same type.

  • Transform Entities: Once detected, you can redact, label, hash, or replace entities for privacy protection.

How It Works#

The PII Replacement component automatically identifies sensitive information in your tabular data and replaces it with realistic alternatives. It uses Named Entity Recognition (NER) models, regular expression patterns, and LLM-based column classification to detect personally identifiable information across different locales.

We recommend including the PII replacement step in your configuration ahead of the synthesis step.
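
Conceptually, the raw table is de-identified before any synthesis model is trained on it. The sketch below uses hypothetical placeholder functions (replace_pii and train_synthesizer are stand-ins, not actual API calls) purely to show that ordering:

import pandas as pd

def replace_pii(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical placeholder: detect PII and replace it column by column.
    return df.assign(name=["<first_name>"] * len(df))

def train_synthesizer(df: pd.DataFrame) -> None:
    # Hypothetical placeholder: the synthesis model only ever sees de-identified data.
    print(df.head())

raw = pd.DataFrame({"name": ["Sally", "Omar"], "age": [34, 29]})
train_synthesizer(replace_pii(raw))  # PII replacement runs first, synthesis second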

PII Detection#

When you run detection, the system follows these steps (a simplified sketch of the flow appears after the list):

  1. Column Analysis: The system analyzes your column headers and sample data to determine entity types

  2. Content Scanning: The system scans your text content using named entity recognition and regex

  3. Entity Recognition: The system identifies specific PII entities with confidence scores

  4. Conflict Resolution: The system merges overlapping detections and selects the best matches

  5. Classification: The system assigns final entity types to your detected sensitive information
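
As a rough, self-contained illustration of steps 1-3 and 5 (not the Safe Synthesizer implementation), the sketch below classifies a column from its header and a regex scan of its sample values; the header keyword, pattern, and thresholds are illustrative:

import re

# Illustrative only: classify a column as "email" from its header or from
# regex hits over sample values, returning the type with a confidence score.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def detect_column(header: str, samples: list[str]) -> tuple[str, float]:
    # Step 1, column analysis: header keywords give a strong hint.
    if "email" in header.lower():
        return "email", 0.99
    # Steps 2-3, content scanning and entity recognition: count regex matches.
    hits = sum(bool(EMAIL_RE.fullmatch(s)) for s in samples)
    confidence = hits / max(len(samples), 1)
    # Step 5, classification: assign the type only if confidence is high enough.
    return ("email", confidence) if confidence >= 0.8 else ("unknown", confidence)

print(detect_column("contact", ["a@example.com", "b@example.org"]))  # ('email', 1.0)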

You can detect more than 50 types of personally identifiable information, such as names, contact details, financial data, government IDs, and technical identifiers. For a complete list, refer to Supported Entities.

Supported Transformation Functions#

For any detected entity, you can apply one of the following transformations for privacy protection (a sketch of each option follows the list):

  • Fake: Replaces the value with synthetic data. For example, “I met Sally” becomes “I met Lucy”

  • Redact: Replaces detected entities with the entity type. For example, “I met Sally” becomes “I met <first_name>”

  • Label: Labeling is similar to redaction, but also includes the entity value. For example, “I met Sally” becomes “I met <first_name: Sally>”

  • Hash: Anonymizes by converting data to a unique alphanumeric value. For example, “I met Sally” becomes “I met a75e4r”
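
As a hedged, standalone sketch of these four options applied to one detected value (plain Python, not the Safe Synthesizer API; the replacement name and label format are made-up examples):

import hashlib

# Apply each transformation option to a single detected entity.
text, entity, entity_type = "I met Sally", "Sally", "first_name"

fake = text.replace(entity, "Lucy")                         # Fake: synthetic value of the same type
redact = text.replace(entity, f"<{entity_type}>")           # Redact: entity type only
label = text.replace(entity, f"<{entity_type}: {entity}>")  # Label: type plus original value
hashed = text.replace(entity, hashlib.sha256(entity.encode()).hexdigest()[:6])  # Hash: alphanumeric digest

print(fake, redact, label, hashed, sep="\n")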


Detection Methods#

Named Entity Recognition in Free Text

You can use this machine learning approach with transformer models for contextual entity recognition in free text (a standalone sketch appears at the end of this section).

Regular Expression Detection

You can use this pattern matching approach for high-precision detection of structured identifiers.

LLM Column Classification

You can use this AI-powered analysis with large language models for column type determination.
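
For example, a minimal standalone sketch of the free-text NER approach using the open-source gliner package (the model name, label set, and threshold below are illustrative choices, not Safe Synthesizer defaults):

from gliner import GLiNER

# Detect PII entities in free text with a GLiNER model.
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")  # illustrative model choice

text = "Contact Sally Smith at sally.smith@example.com or 555-0123."
labels = ["first_name", "last_name", "email", "phone_number"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["label"], ent["text"], round(ent["score"], 2))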

Choosing a Detection Method#

Select the approach that best fits your data type and performance requirements:

Choose Your Detection Method#

| Data Type | Recommended Method & Use Case | Setup Guide |
| --- | --- | --- |
| Unstructured text | Named Entity Recognition - for emails, documents, and chat logs where PII appears within sentences | GLiNER Setup Guide → |
| Structured entities | Regex Detection - for entities with consistent formatting patterns | Regex Setup Guide → |
| Entire columns | LLM Column Classification - for columns where each cell is a single entity | LLM Column Classification Guide → |
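
For the regex route, a simplified standalone sketch of pattern-based detection (the patterns below are intentionally loose illustrations, not the patterns the service uses):

import re

# Illustrative patterns for structured identifiers.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def regex_detect(text: str):
    # Yield (entity_type, matched_text, span) for every pattern hit.
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield entity_type, match.group(), match.span()

print(list(regex_detect("Reach me at jane@example.com from 10.0.0.1")))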

Combined Methods#

You can run all three methods simultaneously for comprehensive detection. The system combines their results and resolves conflicts automatically:

  1. Parallel Processing: All enabled methods scan your data simultaneously

  2. Conflict Resolution: When methods find overlapping entities, the system keeps the one with the higher confidence score (see the sketch after this list)

  3. Combined Results: You get a unified list with no duplicates or conflicts
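
A minimal sketch of the conflict-resolution step (illustrative, not the actual implementation): detections are tuples of (start, end, entity_type, confidence), and overlapping spans keep only the higher-confidence result:

# Keep the highest-confidence detection for each overlapping span.
def resolve(detections):
    kept = []
    for det in sorted(detections, key=lambda d: d[3], reverse=True):
        start, end = det[0], det[1]
        if not any(start < k[1] and k[0] < end for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d[0])

ner_hit = (6, 11, "first_name", 0.92)   # "Sally" found by NER
regex_hit = (6, 11, "user_name", 0.60)  # same span flagged by regex
print(resolve([ner_hit, regex_hit]))    # keeps the higher-confidence NER result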

Default Detection Configuration#

The system uses these entities by default for both classification and NER:

entities = [
    # True identifiers
    "first_name", "last_name", "name", "street_address", "city", 
    "state", "postcode", "country", "address", "latitude", "longitude",
    "coordinate", "age", "phone_number", "fax_number", "email", "ssn",
    "unique_identifier", "medical_record_number", "health_plan_beneficiary_number",
    "account_number", "certificate_license_number", "vehicle_identifier",
    "license_plate", "device_identifier", "biometric_identifier", "url",
    "ipv4", "ipv6", "national_id", "tax_id", "bank_routing_number",
    "swift_bic", "credit_debit_card", "cvv", "pin", "employee_id",
    "api_key", "customer_id", "user_name", "password", "mac_address", 
    "http_cookie",
    
    # Quasi-identifiers
    "date", "date_time", "blood_type", "gender", "sexuality", 
    "political_view", "race", "ethnicity", "religious_belief", 
    "language", "education", "job_title", "employment_status", 
    "company_name"
]