Safe Synthesizer Concepts#

Learn about the core concepts for private synthetic data generation that protects sensitive information while maintaining data utility.

What is Safe Synthesizer?#

Safe Synthesizer is a privacy-focused approach to synthetic data generation designed specifically for organizations working with sensitive datasets. It is purpose-built for privacy compliance and data protection while preserving data utility for downstream AI tasks.

Key Principles#

  • Privacy by Design: Built-in privacy protections from data ingestion to final output

  • Regulatory Compliance: Tailor configurations toward GDPR, HIPAA, and other privacy regulation requirements

  • Mathematical Guarantees: Optional differential privacy for provable privacy protection

  • Data Utility Preservation: Maintain statistical relationships and patterns for AI model training

  • End-to-End Workflow: Complete pipeline from PII detection to privacy evaluation

Core Components#

Safe Synthesizer combines three essential capabilities to deliver private synthetic data:

PII Detection and Replacement#

Automatically identify and replace personally identifiable information in tabular datasets:

  • Entity Detection: Recognize names, addresses, phone numbers, emails, and custom entity types

  • Classification: Use machine learning models to classify column types and content

  • Flexible Replacement: Configure how sensitive data is redacted or replaced

  • Preservation Options: Maintain data format and relationships while removing sensitive content

Privacy-Protecting Synthesis#

Generate synthetic data that maintains utility while protecting individual privacy:

  • LLM-Based Generation: Use fine-tuned language models for realistic tabular data synthesis

  • Data Fidelity: Preserve hard constraints and statistical relationships in the original data

  • Differential Privacy: Apply mathematical privacy guarantees during model training

  • Format Preservation: Maintain original data structure, types, and constraints

Quality and Privacy Evaluation#

Comprehensive assessment of both data utility and privacy protection:

  • Quality Metrics: Compare statistical properties between original and synthetic data

  • Privacy Metrics: Measure privacy leakage and protection effectiveness

  • Compliance Reporting: Generate reports for regulatory and audit requirements

  • Interactive Reports: HTML dashboards with visualizations and detailed analysis

Safe Synthesizer vs Data Designer#

Understanding when to use each synthetic data approach:

Use Case            | Safe Synthesizer                                        | Data Designer
--------------------|---------------------------------------------------------|-------------------------------------------------
Primary Goal        | Privacy protection of existing sensitive data           | Creating new synthetic data for AI training
Input Requirements  | Existing sensitive tabular datasets                     | Data schemas, prompts, or seed data
Privacy Approach    | Mathematical privacy guarantees (differential privacy)  | Statistical privacy through generation diversity
Regulatory Focus    | Compliance-first (GDPR, HIPAA, SOX)                     | Development and testing focused
Data Relationships  | Preserve existing statistical relationships             | Create new realistic data relationships
Workflow Complexity | Single API call for complete pipeline                   | Flexible configuration-driven generation

Privacy Protection Methods#

PII Replacement#

The first line of privacy defense focuses on identifying and protecting direct identifiers:

  • Named Entity Recognition: Use machine learning models to detect PII in free text

  • Pattern Matching: Apply regular expressions and custom rules for specific entity types

  • Column Classification: Automatically categorize columns based on content analysis

  • Contextual Replacement: Replace sensitive data with realistic but non-identifying alternatives
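As a rough illustration of the pattern-matching step only, the sketch below redacts emails and phone numbers with regular expressions. The `PATTERNS` table, the `redact` helper, and the placeholder format are hypothetical and are not part of the Safe Synthesizer API; the expressions are deliberately simple and would miss many real-world formats.

```python
import re

# Hypothetical patterns for illustration only; a production detector would
# combine NER models, column classification, and many more rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each pattern match with a placeholder naming the entity type."""
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


row = "Contact Ana at ana.lopez@example.com or 555-867-5309."
redacted = redact(row)
print(redacted)
```

The first name in the sample row is left untouched, which is the kind of gap that named entity recognition is meant to cover: regular expressions handle well-structured identifiers, while names in free text usually need a model.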

Differential Privacy#

Mathematical framework providing quantifiable privacy guarantees:

  • Privacy Budget (ε): Controls the trade-off between privacy and data utility; smaller values give stronger privacy at the cost of utility

  • Delta (δ): Upper bound on the probability that the ε guarantee does not hold

  • Noise Addition: Carefully calibrated noise injection during model training

  • Composition: Track privacy budget consumption across multiple operations
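To make the roles of ε, δ, noise, and composition concrete, here is a minimal sketch using the classic Gaussian mechanism on a clipped sum, plus a tracker that applies basic sequential composition (ε and δ values add up). It is not how Safe Synthesizer trains models: differentially private training typically clips and perturbs gradients and uses tighter accounting. All names here (`gaussian_sigma`, `private_sum`, `BudgetTracker`) are hypothetical, and the sensitivity assumes neighboring datasets differ by adding or removing one record.

```python
import math
import random


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Noise scale for the classic Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0 < epsilon < 1:
        raise ValueError("the classic bound requires 0 < epsilon < 1")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


def private_sum(values, clip, epsilon, delta, rng):
    """Clip each value to [-clip, clip], sum, and add calibrated Gaussian noise."""
    clipped = [max(-clip, min(clip, v)) for v in values]
    # Adding or removing one clipped record changes the sum by at most `clip`.
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip)
    return sum(clipped) + rng.gauss(0.0, sigma)


class BudgetTracker:
    """Basic sequential composition: spent epsilons and deltas simply add up."""

    def __init__(self):
        self.epsilon = 0.0
        self.delta = 0.0

    def spend(self, epsilon: float, delta: float) -> None:
        self.epsilon += epsilon
        self.delta += delta


rng = random.Random(0)
tracker = BudgetTracker()
for _ in range(2):
    noisy = private_sum([0.2, 5.0, -1.0], clip=1.0, epsilon=0.5, delta=1e-5, rng=rng)
    tracker.spend(0.5, 1e-5)
```

Halving ε doubles the noise scale in this mechanism, which is the privacy and utility trade-off in its simplest form.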

Workflow Orchestration#

Safe Synthesizer operates as a single microservice that orchestrates multiple steps, preserving the utility of the original dataset while adding layers of privacy protection:

Sequential Processing#

  1. Replace PII: Apply PII detection and replacement to training data

  2. Training: Fine-tune language model with optional differential privacy

  3. Generation: Produce synthetic data with validation and parsing

  4. Evaluation: Compare original and synthetic data to analyze quality and privacy
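The order of the four steps can be sketched with toy stand-ins. Everything below is hypothetical: per-column frequency sampling replaces the fine-tuned language model, the `email` and `plan` columns are invented sample data, and none of the function names come from the Safe Synthesizer API. The point is only that PII replacement happens before training, so the model never sees the raw identifiers.

```python
import random
from collections import Counter


def replace_pii(rows):
    """Step 1 (toy): mask the hypothetical 'email' column."""
    return [{**row, "email": "<email>"} for row in rows]


def train(rows):
    """Step 2 (toy): per-column value frequencies stand in for a fine-tuned model."""
    return {col: Counter(row[col] for row in rows) for col in rows[0]}


def generate(model, n, rng):
    """Step 3 (toy): sample each column independently from its frequencies."""
    return [
        {
            col: rng.choices(list(freq), weights=list(freq.values()))[0]
            for col, freq in model.items()
        }
        for _ in range(n)
    ]


def evaluate(original, synthetic):
    """Step 4 (toy): count synthetic rows whose email matches an original email."""
    originals = {row["email"] for row in original}
    return sum(row["email"] in originals for row in synthetic)


def run_pipeline(rows, n, seed=0):
    rng = random.Random(seed)
    cleaned = replace_pii(rows)          # 1. Replace PII
    model = train(cleaned)               # 2. Training
    synthetic = generate(model, n, rng)  # 3. Generation
    leaked = evaluate(rows, synthetic)   # 4. Evaluation against the original
    return synthetic, leaked


rows = [
    {"email": "a@example.com", "plan": "basic"},
    {"email": "b@example.com", "plan": "pro"},
]
synthetic, leaked = run_pipeline(rows, n=5)
```

Sampling columns independently discards the relationships between them, which is exactly what the real synthesis step is meant to preserve; the toy model is here only to keep the example self-contained.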

Evaluation Components#

Synthetic Quality Score#

  • Column Correlation Stability: Compare the correlation between every pair of columns in the original data with the same pair in the synthetic data

  • Deep Structure Stability: Use Principal Component Analysis to reduce dimensionality, then compare the original and synthetic data in the reduced space

  • Column Distribution Stability: Compare the distribution of each column in the original data to the corresponding column in the synthetic data

  • Text Structure Similarity: Compare sentence, word, and character counts between the two datasets

  • Text Semantic Similarity: Assess whether the semantic meaning of the text is preserved after synthesis
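One way to picture the column correlation comparison is to compute the Pearson correlation for every pair of numeric columns in both datasets and average the absolute differences. This is a sketch under that assumption, not the scoring formula Safe Synthesizer uses; `pearson`, `correlation_gap`, and the sample columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def correlation_gap(original, synthetic):
    """Mean absolute difference in pairwise correlations (0.0 means identical)."""
    columns = sorted(original)
    gaps = [
        abs(pearson(original[a], original[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(columns, 2)
    ]
    return sum(gaps) / len(gaps)


original = {
    "age": [25, 32, 47, 51],
    "income": [30, 42, 70, 80],
    "visits": [9, 7, 4, 2],
}
# A deliberately poor "synthetic" dataset: the income column is reversed,
# which flips its relationship with the other two columns.
synthetic = {**original, "income": [80, 70, 42, 30]}

gap = correlation_gap(original, synthetic)
```

A gap near zero suggests the pairwise relationships survived synthesis; the reversed column above produces a large gap even though each column's individual distribution is unchanged, which is why correlation and distribution checks are reported separately.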

Data Privacy Score#

  • Membership Inference Protection: Test whether attackers can determine if specific records were in the training data

  • Attribute Inference Protection: Assess whether sensitive attributes can be inferred from synthetic data

  • PII Replay: Evaluate the frequency with which sensitive values from the original data show up in the synthetic version
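A simple reading of the PII replay check is the share of distinct sensitive values from the original data that reappear verbatim in the synthetic data. The sketch below assumes that definition; the actual metric may count or weight occurrences differently, and `pii_replay_rate` is a hypothetical helper.

```python
def pii_replay_rate(original_values, synthetic_values):
    """Fraction of distinct original values that reappear in the synthetic data."""
    originals = set(original_values)
    if not originals:
        return 0.0
    replayed = originals & set(synthetic_values)
    return len(replayed) / len(originals)


original_emails = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]
synthetic_emails = ["z@y.org", "b@x.com", "q@y.org"]

rate = pii_replay_rate(original_emails, synthetic_emails)
```

Exact matching is the strictest and cheapest form of this check; near-duplicates such as a changed letter case would need normalization before comparison.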

Integration with NeMo Platform#

Entity Management#

Safe Synthesizer integrates with core NeMo platform services:

  • Data Store: Input and output datasets managed through NeMo Data Store

  • Entity Store: Job metadata and configuration stored in Entity Store

  • Projects and Namespaces: Organize Safe Synthesizer jobs within existing project structures

  • Authentication: Use standard NeMo platform authentication and authorization

API Consistency#

Safe Synthesizer follows established NeMo microservices patterns:

  • Jobs API: Consistent /v1beta1/safe-synthesizer/jobs/* endpoint structure

  • Result Management: Standard result listing, metadata, and download patterns

  • Status Monitoring: Common job status tracking and log streaming

  • Error Handling: Consistent error responses and validation patterns

Supported Data Types#

Safe Synthesizer is optimized for tabular data with the following column types:

Numeric Data#

  • Integers: Age, count, ID numbers

  • Floating Point: Measurements, scores, financial amounts

  • Constrained Ranges: Values with domain-specific bounds

Categorical Data#

  • Nominal: Categories without inherent order (departments, product types)

  • Ordinal: Categories with natural ordering (education levels, ratings)

  • High Cardinality: Large number of unique values (postal codes, product SKUs)

Text Data#

  • Free Text: Comments, descriptions, notes

  • Structured Text: Addresses, formatted names

  • Mixed Content: Columns containing both structured and free-form text

Temporal Data#

  • Dates: Event timestamps, birth dates, transaction dates

  • Time Series: Sequential data with temporal dependencies

  • Durations: Time intervals and periods

Privacy Considerations#

Regulatory Compliance#

Safe Synthesizer can help organizations address requirements under regulations such as:

  • GDPR: Right to erasure, data minimization, privacy by design

  • HIPAA: Protected health information safeguards

  • CCPA: Consumer privacy rights and data protection

  • SOX: Financial data protection and audit requirements

Next Steps#