PII Replacement Configuration

Configure PII detection and replacement settings for NeMo Safe Synthesizer, including global parameters, step definitions, and environment setup.

How It Works

The NeMo Safe Synthesizer PII replacement configuration is hierarchical: global settings control how personally identifiable information (PII) is detected in your datasets, and step definitions control how detected values are replaced. Together they let you tune detection methods and replacement rules while preserving data utility for downstream applications.

Global Settings

Global configuration options apply to all transformation steps in NVIDIA NeMo Safe Synthesizer, including locales, detection methods, and system parameters.

Column Classification Configuration

Configure LLM-powered column type detection.

| Parameter | Type | Description | Default |
|---|---|---|---|
| enable | bool | Enable column classification | None |
| entities | list[str] | Entity types for classification | None |
| num_samples | int | Number of sample values per column | 3 |

Example:

{
  "globals": {
    "classify": {
      "enable": true,
      "entities": [
        "first_name", "last_name", "email", "phone_number",
        "ssn", "address", "credit_debit_card"
      ],
      "num_samples": 5
    }
  }
}

Classification Parameters

Enable Classification:

  • true - Use LLM to analyze column types

  • false - Skip column classification

  • null - Use system default

Entity Types:

  • List of entity types to classify columns into

  • If not specified, uses all available entity types

  • Must match supported entity types from detection system

Sample Count:

  • Number of column values to sample for LLM analysis

  • Higher values can improve classification accuracy but slow down processing

  • Range: 1-10 samples recommended
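The parameters above can be assembled programmatically before being written out as JSON. The following is a minimal sketch, not part of the product: `build_classify_config` is a hypothetical helper, and it treats the 1-10 sample range as a recommendation by warning instead of failing.

```python
import json
import warnings


def build_classify_config(entities=None, num_samples=3, enable=True):
    """Hypothetical helper that assembles a `globals.classify` block."""
    if not 1 <= num_samples <= 10:
        # The 1-10 range is only recommended in the text, so warn instead of failing.
        warnings.warn(f"num_samples={num_samples} is outside the recommended 1-10 range")
    classify = {"enable": enable, "num_samples": num_samples}
    if entities is not None:
        # Leaving the key out lets the system fall back to all available entity types.
        classify["entities"] = list(entities)
    return {"globals": {"classify": classify}}


config = build_classify_config(entities=["email", "ssn"], num_samples=5)
print(json.dumps(config, indent=2))
```

Calling the helper without `entities` omits the key entirely, which matches the documented behavior of using all available entity types when none are specified.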

Entity Classification in Free Text Configuration

Top-level parameters for Named Entity Recognition (NER) and regex-based entity detection:

| Parameter | Type | Description | Default |
|---|---|---|---|
| ner_threshold | float | Confidence threshold for entity detection | 0.3 |
| enable_regexps | bool | Enable regular expression detection | False |
| gliner | GlinerConfig | GLiNER model configuration | GlinerConfig() |
| ner_entities | list[str] | Entity types for NER detection | None |

GLiNER model configuration parameters:

| Parameter | Type | Description | Default |
|---|---|---|---|
| enable | bool | Enable Named Entity Recognition | True |
| enable_batch_mode | bool | Enable batch processing | True |
| batch_size | int | Number of chunks per batch | 8 |
| chunk_length | int | Characters per text chunk | 512 |
| gliner_model | str | Model path or name | None |

Example:

{
  "globals": {
    "ner": {
      "ner_threshold": 0.3,
      "enable_regexps": true,
      "ner_entities": ["first_name", "last_name"],
      "gliner": {
        "enable": true,
        "enable_batch_mode": true,
        "batch_size": 16,
        "chunk_length": 1024,
        "gliner_model": null
      }
    }
  }
}

NER Threshold

Like any entity detection model, GLiNER-PII is not 100% accurate. Some entities may go undetected, be detected but labeled incorrectly, or be flagged where no PII is present.

Use the ner_threshold parameter to control the sensitivity of entity detection:

  • High Threshold (0.5-1): Fewer false positives, may miss entities

  • Medium Threshold (0.2-0.5): Balanced precision and recall

  • Low Threshold (0.0-0.2): Higher recall, more false positives
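The trade-off in the list above can be illustrated with a small sketch. The `Detection` tuple, the sample scores, and the rule that a detection is kept when its score is greater than or equal to the threshold are all assumptions made for illustration; the actual model output format and comparison are not specified here.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical shape of one candidate entity returned by an NER model."""
    text: str
    label: str
    score: float


def filter_by_threshold(detections, ner_threshold=0.3):
    # Assumption: candidates scoring at or above the threshold are kept.
    return [d for d in detections if d.score >= ner_threshold]


candidates = [
    Detection("Sally", "first_name", 0.92),  # confident, real entity
    Detection("Paris", "first_name", 0.35),  # ambiguous: a name or a city
    Detection("May", "first_name", 0.12),    # likely a false positive
]

kept_default = filter_by_threshold(candidates)                    # default 0.3
kept_strict = filter_by_threshold(candidates, ner_threshold=0.5)  # high threshold
print([d.text for d in kept_default])
print([d.text for d in kept_strict])
```

Raising the threshold drops the ambiguous candidate along with the likely false positive, which is the precision-versus-recall trade-off described above.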

Regular Expression Detection

Enable pattern-based detection for structured identifiers:

  • true - Use regex patterns for SSN, credit cards, phone numbers, and so on

  • false - Disable regex detection (default)

Best for: Structured data with consistent formats

GLiNER Model Configuration

NER Entities:

  • The general entities list is automatically used for NER unless a separate list is specified

  • Specify a separate list when some entities should be detected by column classification but should not be transformed when they appear in free text

Batch Processing:

  • Processes multiple text chunks simultaneously

  • Higher batch sizes improve GPU utilization

  • Adjust based on available GPU memory

Chunk Length:

  • Text is split into chunks for processing

  • Longer chunks provide more context but use more memory

  • Chunks overlap by 128 characters to prevent entity splitting
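The chunking and batching behavior described above can be sketched as a sliding window. This is a minimal illustration under stated assumptions: `chunk_text` and `batch_chunks` are hypothetical helpers, and a plain character-based window with a fixed 128-character overlap is assumed; the actual splitting logic may differ.

```python
def chunk_text(text, chunk_length=512, overlap=128):
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_length <= overlap:
        raise ValueError("chunk_length must be larger than the overlap")
    step = chunk_length - overlap  # each new chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_length]))
        if start + chunk_length >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def batch_chunks(chunks, batch_size=8):
    """Group chunks into batches of at most batch_size items."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


sample = "".join(chr(97 + i % 26) for i in range(1000))  # 1,000 characters
chunks = chunk_text(sample)
print([start for start, _ in chunks])
print(len(batch_chunks(chunks, batch_size=2)))
```

With the defaults, a 1,000-character string yields three chunks starting at offsets 0, 384, and 768, and the last 128 characters of each chunk reappear at the start of the next one, so an entity that straddles a boundary is still seen whole in one chunk.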

Locale Configuration

Configure region-specific detection patterns and fake data generation.

Supported Locales

| Parameter | Type | Description | Default |
|---|---|---|---|
| locales | list[str] | List of locale codes for region-specific patterns | None |

Example:

{
  "globals": {
    "locales": ["en_US", "en_GB", "de_DE", "fr_FR", "es_ES"]
  }
}

Random Seed Configuration

Control reproducibility of transformations with random seed settings.

| Parameter | Type | Description | Default |
|---|---|---|---|
| seed | int | Random seed for reproducible transformations | None |

Example:

{
  "globals": {
    "seed": 12345
  }
}

Constraints:

  • Must be between -2³¹ and 2³¹-1 (32-bit signed integer)

  • If not specified, system uses current timestamp

  • Same seed produces identical transformations
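The constraints above can be checked before a configuration is submitted. The sketch below is illustrative only: `validate_seed` and `pick_replacements` are hypothetical helpers, and Python's `random.Random` stands in for whatever generator the system actually uses, purely to show why a fixed seed gives repeatable output.

```python
import random

SEED_MIN, SEED_MAX = -2**31, 2**31 - 1  # 32-bit signed integer range


def validate_seed(seed):
    """Return the seed unchanged if it is None or a valid 32-bit signed integer."""
    if seed is None:
        return None  # the system chooses its own seed in this case
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer or None")
    if not SEED_MIN <= seed <= SEED_MAX:
        raise ValueError(f"seed must be between {SEED_MIN} and {SEED_MAX}")
    return seed


def pick_replacements(seed, count=3):
    """Stand-in for a seeded transformation: choose replacement names."""
    rng = random.Random(validate_seed(seed))
    candidates = ["Lucy", "Omar", "Mina", "Theo", "Ines"]
    return [rng.choice(candidates) for _ in range(count)]


first_run = pick_replacements(12345)
second_run = pick_replacements(12345)
print(first_run == second_run)  # True: same seed, same choices
```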

Step Configuration

Transformation steps define the specific operations to perform on your data, such as whether to redact or replace PII.

Update Rows

Transform row values based on rules:

{
  "rows": {
    "update": [
      {
        "condition": "column.entity == 'first_name' and not (this | isna)",
        "value": "fake.first_name()",
        "description": "Replace first names with synthetic data"
      },
      {
        "entity": "email",
        "value": "fake.email()",
        "fallback_value": "'redacted@example.com'"
      }
    ]
  }
}

Update Row Parameters:

| Parameter | Type | Description | Required |
|---|---|---|---|
| condition | str | Template condition for row selection | No* |
| entity | str/list | Entity type(s) to match | No* |
| name | str/list | Column name(s) to match | No* |
| type | str/list | Column type(s) to match | No* |
| value | str | Template expression for new value | Yes |
| fallback_value | str | Template expression if main value fails | No |
| foreach | str | Iterate over expression results | No |
| description | str | Human-readable description | No |

*At least one selection method (condition, entity, name, or type) is required.
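The two requirements above (a `value`, plus at least one selection method) are easy to check before submitting a configuration. The following is a minimal sketch; `validate_update_rule` is a hypothetical helper and not part of the product, which may apply additional validation of its own.

```python
SELECTORS = ("condition", "entity", "name", "type")


def validate_update_rule(rule):
    """Return a list of problems found in one `rows.update` rule (empty if none)."""
    errors = []
    if not rule.get("value"):
        errors.append("missing required 'value'")
    if not any(rule.get(key) for key in SELECTORS):
        errors.append("needs at least one of: " + ", ".join(SELECTORS))
    return errors


rules = [
    {"entity": "email", "value": "fake.email()", "fallback_value": "'redacted@example.com'"},
    {"value": "fake.first_name()"},  # no selection method
    {"name": "customer_email"},      # no value
]

for rule in rules:
    print(validate_update_rule(rule) or "ok")
```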

Row Selection Methods

By Condition:

{
  "condition": "column.entity == 'email' and not (this | isna)",
  "value": "fake.email()"
}

By Entity Type:

{
  "entity": ["first_name", "last_name"],
  "value": "fake.name()"
}

By Column Name:

{
  "name": "customer_email",
  "value": "fake.email()"
}

By Column Type:

{
  "type": "text",
  "value": "this | fake_entities"
}

Template Expressions

Row update values use Jinja2 templates with special variables and filters.

Available Variables

| Variable | Description | Example |
|---|---|---|
| this | Current cell value | this.upper() |
| row | Current row data | row.first_name + row.last_name |
| index | Row index | index + 1000 |
| column | Column metadata | column.entity, column.name |
| vars | Step variables | vars.row_seed |

Built-in Filters

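| Filter | Description | Example |
|---|---|---|
| fake | Generate fake entity for an entire cell | `column.entity \| fake` |
| fake_entities | Replace entities in text | `this \| fake_entities` |
| redact_entities | Redact entities in text | `this \| redact_entities` |
| label_entities | Label entities in text | `this \| label_entities` |
| hash_entities | Hash entities in text | `this \| hash_entities` |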

  • Fake: Replaces the value with synthetic data. For example, “I met Sally” becomes “I met Lucy”

  • Redact: Replaces detected entities with the entity type. For example, “I met Sally” becomes “I met <first_name>”

  • Label: Labeling is similar to redaction, but also includes the entity value. For example, “I met Sally” becomes “I met

  • Hash: Anonymizes by converting data to a unique alphanumeric value. For example, “I met Sally” becomes “I met a75e4r”
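To make the difference between redacting and hashing concrete, here is a minimal sketch that applies both to a sentence with one known entity span. Everything in it is an assumption made for illustration: the `(start, end, label)` span format, the helper names, and the use of a truncated SHA-256 digest as the hash. The actual filters detect spans themselves and may use a different hashing scheme.

```python
import hashlib


def replace_spans(text, spans, make_replacement):
    """Rewrite each (start, end, label) span, working right to left so offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + make_replacement(text[start:end], label) + text[end:]
    return text


def sketch_redact(text, spans):
    # Redaction swaps the entity for its type in angle brackets.
    return replace_spans(text, spans, lambda value, label: f"<{label}>")


def sketch_hash(text, spans, length=6):
    # Assumption: a truncated SHA-256 hex digest stands in for the real hash.
    def digest(value, label):
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]
    return replace_spans(text, spans, digest)


sentence = "I met Sally"
spans = [(6, 11, "first_name")]  # hypothetical detection covering "Sally"

print(sketch_redact(sentence, spans))  # I met <first_name>
print(sketch_hash(sentence, spans))
```

Because the digest depends only on the entity value, the same name always maps to the same token, which keeps repeated mentions consistent across a dataset.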

Faker Functions

Access faker library functions directly:

{
  "value": "fake.persona(row_index=vars.row_seed + index).first_name"
}

Common faker functions:

  • fake.first_name(), fake.last_name(), fake.name()

  • fake.email(), fake.phone_number()

  • fake.address(), fake.city(), fake.state()

  • fake.ssn(), fake.credit_card_number()