PII Replacement Configuration#
Configure PII detection and replacement settings for NeMo Safe Synthesizer, including global parameters, step definitions, and environment setup.
How It Works#
The NeMo Safe Synthesizer PII replacement configuration uses a hierarchical structure to define how the system detects and replaces personally identifiable information in your datasets. The configuration system provides flexible control over detection methods and replacement rules while maintaining data utility for downstream applications.
Global Settings#
Global configuration options apply to all transformation steps in NVIDIA NeMo Safe Synthesizer, including locales, detection methods, and system parameters.
Column Classification Configuration#
Configure LLM-powered column type detection.
Parameter | Type | Description | Default |
---|---|---|---|
`enable` | bool | Enable column classification | None |
`entities` | list[str] | Entity types for classification | None |
`num_samples` | int | Number of sample values per column | 3 |
Example:
{
  "globals": {
    "classify": {
      "enable": true,
      "entities": [
        "first_name", "last_name", "email", "phone_number",
        "ssn", "address", "credit_debit_card"
      ],
      "num_samples": 5
    }
  }
}
Classification Parameters#
Enable Classification:
true - Use LLM to analyze column types
false - Skip column classification
null - Use system default
Entity Types:
List of entity types to classify columns into
If not specified, uses all available entity types
Must match supported entity types from detection system
Sample Count:
Number of column values to sample for LLM analysis
Higher values provide better accuracy but slower performance
Range: 1-10 samples recommended
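The classification settings above can be assembled programmatically before being written to a config file. The following is a minimal sketch, not part of the product API: `build_classify_config` and `in_recommended_range` are hypothetical helper names, and rejecting values below 1 is an assumption on the grounds that sampling zero values gives the LLM nothing to analyze.

```python
import json

# The 1-10 range recommended above (a recommendation, not a hard limit).
RECOMMENDED_SAMPLES = range(1, 11)


def build_classify_config(enable=True, entities=None, num_samples=3):
    """Build a `globals.classify` fragment as a plain dict (hypothetical helper)."""
    if num_samples < 1:
        # Assumption: at least one sample value is needed per column.
        raise ValueError("num_samples must be at least 1")
    classify = {"enable": enable, "num_samples": num_samples}
    if entities is not None:
        # Omitting the key leaves the choice of entity types to the system.
        classify["entities"] = list(entities)
    return {"globals": {"classify": classify}}


def in_recommended_range(num_samples):
    """Return True when num_samples falls inside the recommended 1-10 range."""
    return num_samples in RECOMMENDED_SAMPLES


config = build_classify_config(entities=["email", "ssn"], num_samples=5)
print(json.dumps(config, indent=2))
```

Keeping the range check separate from the builder lets a caller warn about slow settings without rejecting them.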
Entity Classification in Free Text Configuration#
Overall Named Entity Recognition and regex-based entity detection parameters:
Parameter | Type | Description | Default |
---|---|---|---|
`ner_threshold` | float | Confidence threshold for entity detection | 0.3 |
`enable_regexps` | bool | Enable regular expression detection | False |
`gliner` | GlinerConfig | GLiNER model configuration | GlinerConfig() |
`ner_entities` | list[str] | Entity types for NER detection | None |
GLiNER model configuration parameters:
Parameter | Type | Description | Default |
---|---|---|---|
`enable` | bool | Enable Named Entity Recognition | True |
`enable_batch_mode` | bool | Enable batch processing | True |
`batch_size` | int | Number of chunks per batch | 8 |
`chunk_length` | int | Characters per text chunk | 512 |
`gliner_model` | str | Model path or name | None |
Example:
{
  "globals": {
    "ner": {
      "ner_threshold": 0.3,
      "enable_regexps": true,
      "ner_entities": ["first_name", "last_name"],
      "gliner": {
        "enable": true,
        "enable_batch_mode": true,
        "batch_size": 16,
        "chunk_length": 1024,
        "gliner_model": null
      }
    }
  }
}
NER Threshold#
Like any entity detection model, GLiNER-PII will not be 100% accurate. It is possible that some entities will go undetected, get properly detected but incorrectly labeled, or get incorrectly detected when no PII was present.
Use the `ner_threshold` parameter to control the sensitivity of entity detection:
High Threshold (0.5-1): Fewer false positives, may miss entities
Medium Threshold (0.2-0.5): Balanced precision and recall
Low Threshold (0.0-0.2): Higher recall, more false positives
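The trade-off between these ranges can be illustrated with a small sketch. It assumes that each candidate entity carries a confidence score between 0 and 1 and that candidates scoring at or above the threshold are kept; the `Detection` class, the `apply_threshold` function, and the sample scores are invented for illustration and do not describe the detector's internals.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    text: str
    label: str
    score: float  # hypothetical model confidence in [0, 1]


def apply_threshold(detections, ner_threshold=0.3):
    """Keep candidates whose score meets the threshold (assumed >= comparison)."""
    return [d for d in detections if d.score >= ner_threshold]


# Illustrative candidates only; the scores are made up.
candidates = [
    Detection("Sally", "first_name", 0.92),
    Detection("Paris", "first_name", 0.35),
    Detection("May", "first_name", 0.12),
]

for threshold in (0.1, 0.3, 0.6):
    kept = [d.text for d in apply_threshold(candidates, threshold)]
    print(threshold, kept)
```

Lowering the threshold keeps the ambiguous candidates as well, which raises recall at the cost of more false positives.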
Regular Expression Detection#
Enable pattern-based detection for structured identifiers:
true - Use regex patterns for SSN, credit cards, phone numbers, and so on
false - Disable regex detection (default)
Best for: Structured data with consistent formats
GLiNER Model Configuration#
NER Entities:
The general entities list is automatically used for NER unless a separate list is specified
You may want to specify a separate list if there are entities you want to detect with column classification that you do not want to apply a transformation to within free text
Batch Processing:
Processes multiple text chunks simultaneously
Higher batch sizes improve GPU utilization
Adjust based on available GPU memory
Chunk Length:
Text is split into chunks for processing
Longer chunks provide more context but use more memory
Chunks overlap by 128 characters to prevent entity splitting
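To make the chunking behavior concrete, here is a minimal sketch of a sliding window with a fixed overlap. It assumes chunks are cut at plain character offsets and that each new chunk starts `chunk_length - overlap` characters after the previous one; `split_into_chunks` is a hypothetical helper, and the actual splitting logic may differ.

```python
def split_into_chunks(text, chunk_length=512, overlap=128):
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_length <= overlap:
        raise ValueError("chunk_length must be larger than overlap")
    if not text:
        return []
    step = chunk_length - overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_length])
        if start + chunk_length >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(1000))
chunks = split_into_chunks(sample)
print([len(c) for c in chunks])  # [512, 512, 232]
```

Because the last 128 characters of one chunk are repeated at the start of the next, an entity that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.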
Locale Configuration#
Configure region-specific detection patterns and fake data generation.
Supported Locales#
Parameter | Type | Description | Default |
---|---|---|---|
`locales` | list[str] | List of locale codes for region-specific patterns | None |
Example:
{
  "globals": {
    "locales": ["en_US", "en_GB", "de_DE", "fr_FR", "es_ES"]
  }
}
Random Seed Configuration#
Control reproducibility of transformations with random seed settings.
Parameter | Type | Description | Default |
---|---|---|---|
`seed` | int | Random seed for reproducible transformations | None |
Example:
{
  "globals": {
    "seed": 12345
  }
}
Constraints:
Must be between -2³¹ and 2³¹-1 (32-bit signed integer)
If not specified, system uses current timestamp
Same seed produces identical transformations
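The constraints above can be checked before a configuration is submitted. The sketch below uses Python's standard `random` module as a stand-in to show what a fixed seed buys you; `validate_seed` and `pick_replacements` are hypothetical helpers, and the product's own random number handling is not shown here.

```python
import random

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def validate_seed(seed):
    """Return the seed if it fits a 32-bit signed integer, else raise."""
    if seed is None:
        return None  # unset: the system chooses its own seed
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer")
    if not INT32_MIN <= seed <= INT32_MAX:
        raise ValueError("seed must be between -2**31 and 2**31 - 1")
    return seed


def pick_replacements(seed, names, count):
    """Draw replacement names from a generator seeded with the given value."""
    rng = random.Random(validate_seed(seed))
    return [rng.choice(names) for _ in range(count)]


names = ["Lucy", "Omar", "Chen", "Priya"]
first = pick_replacements(12345, names, 5)
second = pick_replacements(12345, names, 5)
print(first == second)  # True: same seed, same sequence
```

Recording the seed alongside the output makes a run repeatable later; leaving it unset trades that repeatability for variation between runs.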
Step Configuration#
Transformation steps define the specific operations to perform on your data, such as whether to redact or replace PII.
Update Rows#
Transform row values based on rules:
{
  "rows": {
    "update": [
      {
        "condition": "column.entity == 'first_name' and not (this | isna)",
        "value": "fake.first_name()",
        "description": "Replace first names with synthetic data"
      },
      {
        "entity": "email",
        "value": "fake.email()",
        "fallback_value": "'redacted@example.com'"
      }
    ]
  }
}
Update Row Parameters:
Parameter | Type | Description | Required |
---|---|---|---|
`condition` | str | Template condition for row selection | No* |
`entity` | str/list | Entity type(s) to match | No* |
`name` | str/list | Column name(s) to match | No* |
`type` | str/list | Column type(s) to match | No* |
`value` | str | Template expression for new value | Yes |
`fallback_value` | str | Template expression if main value fails | No |
 | str | Iterate over expression results | No |
`description` | str | Human-readable description | No |
*At least one selection method (condition, entity, name, or type) is required.
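The two requirements stated here (a `value` expression plus at least one selection method) are easy to check before submitting a configuration. This is a minimal sketch of such a pre-flight check; `validate_update_rule` is a hypothetical helper and is not part of the product.

```python
SELECTORS = ("condition", "entity", "name", "type")


def validate_update_rule(rule):
    """Return a list of problems found in one update rule (empty if none)."""
    problems = []
    if "value" not in rule:
        problems.append("missing required key: value")
    if not any(key in rule for key in SELECTORS):
        problems.append(
            "needs at least one selection method: " + ", ".join(SELECTORS)
        )
    return problems


rules = [
    {"entity": "email", "value": "fake.email()"},
    {"value": "fake.first_name()"},  # no selection method
    {"name": "customer_email"},      # no value expression
]
for rule in rules:
    print(rule, "->", validate_update_rule(rule) or "ok")
```

Returning a list instead of raising on the first problem lets every issue in a rule be reported at once.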
Row Selection Methods#
By Condition:
{
  "condition": "column.entity == 'email' and not (this | isna)",
  "value": "fake.email()"
}
By Entity Type:
{
  "entity": ["first_name", "last_name"],
  "value": "fake.name()"
}
By Column Name:
{
  "name": "customer_email",
  "value": "fake.email()"
}
By Column Type:
{
  "type": "text",
  "value": "this | fake_entities"
}
Template Expressions#
Row update values use Jinja2 templates with special variables and filters.
Available Variables#
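Variable | Description | Example |
---|---|---|
`this` | Current cell value | `this \| fake_entities` |
 | Current row data | |
`index` | Row index | `vars.row_seed + index` |
`column` | Column metadata | `column.entity == 'email'` |
`vars` | Step variables | `vars.row_seed` |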
Built-in Filters#
Filter | Description | Example |
---|---|---|
 | Generate fake entity for an entire cell | `column.entity \| …` |
`fake_entities` | Replace entities in text | `this \| fake_entities` |
 | Redact entities in text | `this \| …` |
 | Label entities in text | `this \| …` |
 | Hash entities in text | `this \| …` |
Fake: Replaces the value with synthetic data. For example, “I met Sally” becomes “I met Lucy”.
Redact: Replaces detected entities with the entity type. For example, “I met Sally” becomes “I met <first_name>”.
Label: Similar to redaction, but the label also includes the entity value. For example, in “I met Sally” the label carries both the entity type (first_name) and the value (Sally).
Hash: Anonymizes by converting data to a unique alphanumeric value. For example, “I met Sally” becomes “I met a75e4r”.
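The difference between redacting and hashing can be sketched on text whose entity spans are already known. The code below is illustrative only: the span format, the helper names, and the use of a truncated SHA-256 digest are assumptions, and the hash the product produces (such as the value in the example above) comes from its own algorithm.

```python
import hashlib


def _replace_spans(text, spans, make_replacement):
    """Rebuild text, swapping each (start, end, label) span for a replacement."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(make_replacement(text[start:end], label))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def redact(text, spans):
    """Replace each entity with its entity type in angle brackets."""
    return _replace_spans(text, spans, lambda value, label: f"<{label}>")


def hash_spans(text, spans, length=6):
    """Replace each entity with a short hex digest of its value (assumed scheme)."""
    def digest(value, label):
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]
    return _replace_spans(text, spans, digest)


text = "I met Sally"
spans = [(6, 11, "first_name")]  # character offsets of "Sally"
print(redact(text, spans))       # I met <first_name>
print(hash_spans(text, spans))
```

Hashing maps the same input value to the same token every time, so repeated mentions of one person stay linkable in the output, whereas redaction collapses every first name into the same placeholder.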
Faker Functions#
Access faker library functions directly:
{
  "value": "fake.persona(row_index=vars.row_seed + index).first_name"
}
Common faker functions:
fake.first_name(), fake.last_name(), fake.name()
fake.email(), fake.phone_number()
fake.address(), fake.city(), fake.state()
fake.ssn(), fake.credit_card_number()