PII Replacement Configuration#

Configure PII detection and replacement settings for NeMo Safe Synthesizer, including global parameters, step definitions, and environment setup.

How It Works#

The NeMo Safe Synthesizer PII replacement configuration is hierarchical. Global settings (under `globals`) control how personally identifiable information is detected, and a list of steps defines the rules that replace it. Together they let you tune detection methods and replacement rules while preserving data utility for downstream applications.

Global Settings#

Global configuration options apply to all transformation steps in NVIDIA NeMo Safe Synthesizer, including locales, detection methods, and system parameters.

Column Classification Configuration#

Configure LLM-powered column type detection.

Parameter	Type	Description	Default
`enable`	bool	Enable column classification	None
`entities`	list[str]	Entity types for classification	None
`num_samples`	int	Number of sample values per column	3

Example:

{
  "globals": {
    "classify": {
      "enable": true,
      "entities": [
        "first_name", "last_name", "email", "phone_number",
        "ssn", "address", "credit_debit_card"
      ],
      "num_samples": 5
    }
  }
}

Classification Parameters#

Enable Classification:

true - Use LLM to analyze column types
false - Skip column classification
null - Use system default

Entity Types:

List of entity types to classify columns into
If not specified, uses all available entity types
Must match supported entity types from detection system

Sample Count:

Number of column values to sample for LLM analysis
Higher values can improve accuracy but slow down classification
Range: 1-10 samples recommended
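
As a rough sketch of how these three settings fit together, the hypothetical helper below (`build_classify_config` is not part of NeMo Safe Synthesizer) assembles a `classify` block. Treating the recommended 1-10 range as a hard limit is an assumption of this sketch, not documented behavior.

```python
from typing import Optional


def build_classify_config(
    enable: Optional[bool] = None,
    entities: Optional[list[str]] = None,
    num_samples: int = 3,
) -> dict:
    """Assemble a `globals.classify` block (hypothetical helper)."""
    # The docs only recommend 1-10 samples; this sketch enforces it strictly.
    if not 1 <= num_samples <= 10:
        raise ValueError("num_samples should be between 1 and 10")
    config: dict = {"enable": enable, "num_samples": num_samples}
    # Omitting `entities` means all available entity types are used.
    if entities is not None:
        config["entities"] = list(entities)
    return {"globals": {"classify": config}}


config = build_classify_config(enable=True, entities=["email", "ssn"], num_samples=5)
```

Leaving `enable` as `None` corresponds to the "use system default" case listed above.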

Entity Classification in Free Text Configuration#

Parameters for Named Entity Recognition (NER) and regex-based entity detection in free text:

Parameter	Type	Description	Default
`ner_threshold`	float	Confidence threshold for entity detection	0.3
`enable_regexps`	bool	Enable regular expression detection	False
`gliner`	GlinerConfig	GLiNER model configuration	GlinerConfig()
`ner_entities`	list[str]	Entity types for NER detection	None

GLiNER model configuration parameters:

Parameter	Type	Description	Default
`enable`	bool	Enable Named Entity Recognition	True
`enable_batch_mode`	bool	Enable batch processing	True
`batch_size`	int	Number of chunks per batch	8
`chunk_length`	int	Characters per text chunk	512
`gliner_model`	str	Model path or name	None

Example:

{
  "globals": {
    "ner": {
      "ner_threshold": 0.3,
      "enable_regexps": true,
      "ner_entities": ["first_name", "last_name"],
      "gliner": {
        "enable": true,
        "enable_batch_mode": true,
        "batch_size": 16,
        "chunk_length": 1024,
        "gliner_model": null
      }
    }
  }
}

NER Threshold#

Like any entity detection model, GLiNER-PII will not be 100% accurate. It is possible that some entities will go undetected, get properly detected but incorrectly labeled, or get incorrectly detected when no PII was present.

Use the ner_threshold parameter to control the sensitivity of entity detection:

High Threshold (above 0.5): Fewer false positives, may miss entities
Medium Threshold (0.2-0.5): Balanced precision and recall
Low Threshold (below 0.2): Higher recall, more false positives
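
The effect of the threshold can be illustrated with a toy example. The detections and scores below are made up, and the `>=` comparison is an assumption; this is not the GLiNER API, only a sketch of how a confidence cutoff trades recall for precision.

```python
# Made-up detections: (text, entity label, confidence score).
detections = [
    ("Sally", "first_name", 0.92),
    ("Paris", "first_name", 0.35),
    ("May", "first_name", 0.12),
]


def apply_threshold(items, ner_threshold):
    """Keep only detections scoring at or above the threshold."""
    return [text for text, _label, score in items if score >= ner_threshold]


high = apply_threshold(detections, 0.6)    # strict: only the confident match
medium = apply_threshold(detections, 0.3)  # the documented default value
low = apply_threshold(detections, 0.1)     # permissive: keeps weak matches too
```

Raising the threshold drops the ambiguous "Paris" and "May" candidates; lowering it keeps them, along with any false positives among them.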

Regular Expression Detection#

Enable pattern-based detection for structured identifiers:

true - Use regex patterns for SSN, credit cards, phone numbers, and so on
false - Disable regex detection (default)

Best for: Structured data with consistent formats
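
To show why pattern-based detection suits consistently formatted identifiers, here is a minimal sketch using Python's `re` module. The two patterns are illustrative assumptions, not the patterns that ship with NeMo Safe Synthesizer.

```python
import re

# Illustrative patterns only; the product's own patterns may differ.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_structured_entities(text):
    """Return (entity_type, matched_text) pairs found by the patterns."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group()))
    return found


matches = find_structured_entities("SSN 123-45-6789, contact jo@example.com today")
```

Patterns like these are cheap and precise for fixed formats, but they cannot find names or other free-form entities, which is where NER comes in.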

GLiNER Model Configuration#

NER Entities:

If `ner_entities` is not set, NER uses the general entities list
Set a separate `ner_entities` list when some entities should be detected by column classification but not transformed when they appear in free text

Batch Processing:

Processes multiple text chunks simultaneously
Higher batch sizes improve GPU utilization
Adjust based on available GPU memory

Chunk Length:

Text is split into chunks for processing
Longer chunks provide more context but use more memory
Chunks overlap by 128 characters to prevent entity splitting
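
The chunking behavior described above can be sketched as follows. This is an assumption-laden illustration: `split_into_chunks` is a hypothetical function, and whether `chunk_length` includes the 128-character overlap in the real implementation is not stated in this document.

```python
def split_into_chunks(text, chunk_length=512, overlap=128):
    """Split text into character chunks that overlap their neighbors."""
    if chunk_length <= overlap:
        raise ValueError("chunk_length must be larger than overlap")
    step = chunk_length - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_length])
        if start + chunk_length >= len(text):
            break  # this chunk reached the end of the text
        start += step
    return chunks


# 1000 characters of repeating alphabet as stand-in text.
text = "".join(chr(ord("a") + i % 26) for i in range(1000))
chunks = split_into_chunks(text)
```

Because consecutive chunks share 128 characters, an entity that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.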

Locale Configuration#

Configure region-specific detection patterns and fake data generation.

Supported Locales#

Parameter	Type	Description	Default
`locales`	list[str]	List of locale codes for region-specific patterns	None

Example:

{
  "globals": {
    "locales": ["en_US", "en_GB", "de_DE", "fr_FR", "es_ES"]
  }
}

Random Seed Configuration#

Control reproducibility of transformations with random seed settings.

Parameter	Type	Description	Default
`seed`	int	Random seed for reproducible transformations	None

Example:

{
  "globals": {
    "seed": 12345
  }
}

Constraints:

Must be between -2³¹ and 2³¹-1 (32-bit signed integer)
If not specified, system uses current timestamp
Same seed produces identical transformations
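
The range constraint and the reproducibility guarantee can be illustrated with Python's standard `random` module. This is only an analogy: `validate_seed` and `sample_ids` are hypothetical names, and the product's own random number handling is not shown here.

```python
import random

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def validate_seed(seed):
    """Return the seed if it fits a 32-bit signed integer, else raise."""
    if not INT32_MIN <= seed <= INT32_MAX:
        raise ValueError(f"seed {seed} is outside the 32-bit signed range")
    return seed


def sample_ids(seed, count=3):
    """Draw pseudo-random IDs; a stand-in for a seeded transformation."""
    rng = random.Random(validate_seed(seed))
    return [rng.randint(0, 9999) for _ in range(count)]


first_run = sample_ids(12345)
second_run = sample_ids(12345)  # same seed, so the same values as first_run
```

Pinning the seed is useful when you need to rerun a job and compare outputs; omit it when each run should produce different replacements.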

Step Configuration#

Transformation steps define the specific operations to perform on your data, such as whether to redact or replace PII.

Update Rows#

Transform row values based on rules:

{
  "rows": {
    "update": [
      {
        "condition": "column.entity == 'first_name' and not (this | isna)",
        "value": "fake.first_name()",
        "description": "Replace first names with synthetic data"
      },
      {
        "entity": "email",
        "value": "fake.email()",
        "fallback_value": "'redacted@example.com'"
      }
    ]
  }
}

Update Row Parameters:

Parameter	Type	Description	Required
`condition`	str	Template condition for row selection	No*
`entity`	str/list	Entity type(s) to match	No*
`name`	str/list	Column name(s) to match	No*
`type`	str/list	Column type(s) to match	No*
`value`	str	Template expression for new value	Yes
`fallback_value`	str	Template expression if main value fails	No
`foreach`	str	Iterate over expression results	No
`description`	str	Human-readable description	No

*At least one selection method (condition, entity, name, or type) is required.
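
A quick pre-flight check for these rules might look like the sketch below. `validate_update_rule` is a hypothetical helper, not part of the product; it only encodes the two requirements from the table: `value` is mandatory, and at least one selection method must be present.

```python
SELECTORS = ("condition", "entity", "name", "type")


def validate_update_rule(rule):
    """Check a `rows.update` entry against the parameter table above."""
    errors = []
    if "value" not in rule:
        errors.append("missing required 'value'")
    if not any(key in rule for key in SELECTORS):
        errors.append("need one of: " + ", ".join(SELECTORS))
    return errors


ok_rule = {"entity": "email", "value": "fake.email()"}
bad_rule = {"description": "no selector and no value"}

ok_errors = validate_update_rule(ok_rule)    # no problems found
bad_errors = validate_update_rule(bad_rule)  # both checks fail
```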

Row Selection Methods#

By Condition:

{
  "condition": "column.entity == 'email' and not (this | isna)",
  "value": "fake.email()"
}

By Entity Type:

{
  "entity": ["first_name", "last_name"],
  "value": "fake.name()"
}

By Column Name:

{
  "name": "customer_email",
  "value": "fake.email()"
}

By Column Type:

{
  "type": "text",
  "value": "this | fake_entities"
}

Template Expressions#

Row update values use Jinja2 templates with special variables and filters.

Available Variables#

Variable	Description	Example
`this`	Current cell value	`this.upper()`
`row`	Current row data	`row.first_name + row.last_name`
`index`	Row index	`index + 1000`
`column`	Column metadata	`column.entity`, `column.name`
`vars`	Step variables	`vars.row_seed`

Built-in Filters#

Filter	Description	Example
`fake`	Generate fake entity for an entire cell	`column.entity | fake`
`fake_entities`	Replace entities in text	`this | fake_entities`
`redact_entities`	Redact entities in text	`this | redact_entities`
`label_entities`	Label entities in text	`this | label_entities`
`hash_entities`	Hash entities in text	`this | hash_entities`

Fake: Replaces the value with synthetic data. For example, “I met Sally” becomes “I met Lucy”
Redact: Replaces detected entities with the entity type. For example, “I met Sally” becomes “I met <first_name>”
Label: Similar to redaction, but the label also includes the original entity value. For example, in “I met Sally”, “Sally” is replaced by a tag that carries both the entity type (first_name) and the value (Sally)
Hash: Anonymizes by converting data to a unique alphanumeric value. For example, “I met Sally” becomes “I met a75e4r”
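
To make the difference between redaction and hashing concrete, the sketch below applies both to hand-written entity spans. It assumes detection has already produced character offsets; the SHA-256 digest truncated to six characters is an illustrative choice, not the hashing scheme the product uses.

```python
import hashlib

# Hand-written span standing in for detector output: (start, end, entity_type).
text = "I met Sally"
spans = [(6, 11, "first_name")]


def redact(text, spans):
    """Replace each span with its entity type in angle brackets."""
    # Work right to left so earlier offsets stay valid after each edit.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


def hash_spans(text, spans, length=6):
    """Replace each span with a short hex digest of its original value."""
    for start, end, _entity_type in sorted(spans, reverse=True):
        digest = hashlib.sha256(text[start:end].encode("utf-8")).hexdigest()
        text = text[:start] + digest[:length] + text[end:]
    return text


redacted = redact(text, spans)
hashed = hash_spans(text, spans)
```

Redaction discards the value entirely, while hashing maps equal inputs to equal outputs, so repeated mentions of the same entity remain linkable without exposing the original value.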

Faker Functions#

Access faker library functions directly:

{
  "value": "fake.persona(row_index=vars.row_seed + index).first_name"
}

Common faker functions:

fake.first_name(), fake.last_name(), fake.name()
fake.email(), fake.phone_number()
fake.address(), fake.city(), fake.state()
fake.ssn(), fake.credit_card_number()

Complete Examples#

Example 1: Redacting Entities from Text Fields#

This example shows how to redact detected entities in text, replacing them with their entity type labels:

# dataset_uri must already hold the URI of your input dataset.
job_request = {
    "name": "pii-redaction-fake-entities",
    "project": "default",
    "spec": {
        "data_source": dataset_uri,
        "config": {
            "enable_synthesis": False,
            "enable_replace_pii": True,
            "replace_pii": {
                "globals": {
                    "classify": {"enable_classify": False},
                    "ner": {
                        "ner_threshold": 0.3,
                        "ner_entities": ["first_name", "last_name", "name"]  
                    },
                    "locales": ["en_US"]
                },
                "steps": [{
                    "rows": {
                        "update": [
                            {
                                "name": "text_column",  # Target your text column
                                "value": "this | redact_entities"  # Use redact_entities filter
                            }
                        ]
                    }
                }]
            }
        }
    }
}

With this configuration, each detected entity in the text column is removed and replaced with its entity type label, such as <first_name>.

Example 2: Replacing Entities with Synthetic Data#

This example shows how to replace detected entities with synthetic data:

# dataset_uri must already hold the URI of your input dataset.
job_request = {
    "name": "pii-replacement-fake-entities",
    "project": "default",
    "spec": {
        "data_source": dataset_uri,
        "config": {
            "enable_synthesis": False,
            "enable_replace_pii": True,
            "replace_pii": {
                "globals": {
                    "classify": {"enable_classify": False},
                    "ner": {
                        "ner_threshold": 0.3,
                        "ner_entities": ["first_name", "last_name", "name"]  # Only detect names
                    },
                    "locales": ["en_US"]
                },
                "steps": [{
                    "rows": {
                        "update": [
                            {
                                "name": "text_column", # Target your text column by name
                                "value": "this | fake_entities" # Replace with fake values
                            }
                        ]
                    }
                }]
            }
        }
    }
}
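
The two examples differ only in the job name and the filter applied to the text column. A small helper can capture that pattern. `build_pii_job` is a hypothetical convenience function, not part of the product, the dataset URI is a placeholder, and the configuration keys are copied verbatim from the examples above.

```python
def build_pii_job(name, dataset_uri, column, entity_filter, ner_entities):
    """Build a job request like the examples above (hypothetical helper)."""
    return {
        "name": name,
        "project": "default",
        "spec": {
            "data_source": dataset_uri,
            "config": {
                "enable_synthesis": False,
                "enable_replace_pii": True,
                "replace_pii": {
                    "globals": {
                        "classify": {"enable_classify": False},
                        "ner": {"ner_threshold": 0.3, "ner_entities": ner_entities},
                        "locales": ["en_US"],
                    },
                    # One step that applies the chosen filter to one column.
                    "steps": [{"rows": {"update": [
                        {"name": column, "value": f"this | {entity_filter}"}
                    ]}}],
                },
            },
        },
    }


names = ["first_name", "last_name", "name"]
# "placeholder-dataset-uri" is a stand-in; substitute your own dataset URI.
redact_job = build_pii_job("pii-redaction", "placeholder-dataset-uri",
                           "text_column", "redact_entities", names)
fake_job = build_pii_job("pii-replacement", "placeholder-dataset-uri",
                         "text_column", "fake_entities", names)
```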