Safe Synthesizer Concepts#

Learn about the core concepts for private synthetic data generation that protects sensitive information while maintaining data utility.

What is Safe Synthesizer?#

Safe Synthesizer is a privacy-focused approach to synthetic data generation for organizations that work with sensitive datasets. It is built to support privacy compliance and data protection while preserving data utility for downstream AI tasks.

Key Principles#

Privacy by Design: Built-in privacy protections from data ingestion to final output
Regulatory Compliance: Tailor configurations toward GDPR, HIPAA, and other privacy regulation requirements
Mathematical Guarantees: Optional differential privacy for provable privacy protection
Data Utility Preservation: Maintain statistical relationships and patterns for AI model training
End-to-End Workflow: Complete pipeline from PII detection to privacy evaluation

Core Components#

Safe Synthesizer combines three essential capabilities to deliver private synthetic data:

PII Detection and Replacement#

Automatically identify and replace personally identifiable information in tabular datasets:

Entity Detection: Recognize names, addresses, phone numbers, emails, and custom entity types
Classification: Use machine learning models to classify column types and content
Flexible Replacement: Configure how sensitive data is redacted or replaced
Preservation Options: Maintain data format and relationships while removing sensitive content
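
One way to picture the preservation options above is a deterministic, format-keeping replacement: the same input always maps to the same substitute (so joins and repeated values still line up), and separators stay in place. The sketch below is an illustration only, not Safe Synthesizer's replacement mechanism; the `pseudonymize` helper and the `secret` key are hypothetical names, and the scheme is a simple keyed hash rather than vetted format-preserving encryption.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Swap letters and digits for keyed-hash-derived ones, keeping the layout."""
    # The keyed digest makes the mapping repeatable for a given secret.
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch.isdigit():
            out.append(str(b % 10))  # digit stays a digit
        elif ch.isalpha():
            base = "A" if ch.isupper() else "a"
            out.append(chr(ord(base) + b % 26))  # letter stays a letter
        else:
            out.append(ch)  # keep separators such as '-', '@', '.', ' '
    return "".join(out)


masked = pseudonymize("555-867-5309", b"demo-secret")
print(masked)  # same length and dash positions as the input
```

Because the mapping is keyed, two tables masked with the same secret keep matching identifiers, while a different secret yields an unrelated mapping.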

Privacy-Protecting Synthesis#

Generate synthetic data that maintains utility while protecting individual privacy:

LLM-Based Generation: Use fine-tuned language models for realistic tabular data synthesis
Data Fidelity: Preserve hard constraints and statistical relationships in the original data
Differential Privacy: Apply mathematical privacy guarantees during model training
Format Preservation: Maintain original data structure, types, and constraints

Quality and Privacy Evaluation#

Assess both data utility and privacy protection:

Quality Metrics: Compare statistical properties between original and synthetic data
Privacy Metrics: Measure privacy leakage and protection effectiveness
Compliance Reporting: Generate reports for regulatory and audit requirements
Interactive Reports: HTML dashboards with visualizations and detailed analysis

Safe Synthesizer vs Data Designer#

Understanding when to use each synthetic data approach:

Use Case	Safe Synthesizer	Data Designer
Primary Goal	Privacy protection of existing sensitive data	Creating new synthetic data for AI training
Input Requirements	Existing sensitive tabular datasets	Data schemas, prompts, or seed data
Privacy Approach	Mathematical privacy guarantees (differential privacy)	Statistical privacy through generation diversity
Regulatory Focus	Compliance-first (GDPR, HIPAA, SOX)	Development and testing focused
Data Relationships	Preserve existing statistical relationships	Create new realistic data relationships
Workflow Complexity	Single API call for complete pipeline	Flexible configuration-driven generation

Privacy Protection Methods#

PII Replacement#

The first line of privacy defense focuses on identifying and protecting direct identifiers:

Named Entity Recognition: Use machine learning models to detect PII in free text
Pattern Matching: Apply regular expressions and custom rules for specific entity types
Column Classification: Automatically categorize columns based on content analysis
Contextual Replacement: Replace sensitive data with realistic but non-identifying alternatives
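
The pattern-matching idea above can be sketched with Python's standard `re` module. The two patterns and the `<label>` placeholder format are assumptions chosen for illustration; they are not the rules Safe Synthesizer ships with, and production detectors need much broader coverage.

```python
import re

# Illustrative patterns only (assumed, not taken from Safe Synthesizer).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit."""
    return [
        (label, match.group(0))
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def redact(text: str) -> str:
    """Replace each detected entity with a placeholder naming its type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


print(redact("Reach jane@example.com or 555-867-5309."))
# Reach <email> or <us_phone>.
```

Regular expressions work well for entities with a rigid shape; names and free-text mentions usually need the model-based detection listed above.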

Differential Privacy#

Differential privacy is a mathematical framework that provides quantifiable privacy guarantees:

Privacy Budget (ε): Controls the trade-off between privacy and data utility; smaller values give stronger privacy, usually at some cost to utility
Delta (δ): Upper bound on the probability that the ε guarantee does not hold
Noise Addition: Carefully calibrated noise injection during model training
Composition: Track privacy budget consumption across multiple operations
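
The noise-addition and composition ideas above can be sketched in a few lines. This is a minimal illustration in the style of DP-SGD (clip each example's gradient, sum, add Gaussian noise), written with the standard library only; the source does not say which algorithm or privacy accountant Safe Synthesizer uses, so every function name and parameter here is an assumption.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, sum them, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise is calibrated to the clip bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


def basic_composition(steps):
    """Basic sequential composition: the epsilons and the deltas each add up."""
    return sum(e for e, _ in steps), sum(d for _, d in steps)


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
print(basic_composition([(0.5, 1e-6), (1.0, 1e-6)]))
```

Clipping bounds how much any single record can move the update, which is what lets the noise scale be tied to a privacy guarantee. Simple addition is the loosest way to track the budget; tighter accounting methods exist and are commonly used when training for many steps.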

Workflow Orchestration#

Safe Synthesizer operates as a single microservice that orchestrates multiple steps, preserving the utility of the original dataset while adding layers of privacy:

Sequential Processing#

Replace PII: Apply PII detection and replacement to training data
Training: Fine-tune language model with optional differential privacy
Generation: Produce synthetic data with validation and parsing
Evaluation: Compare original and synthetic data to analyze quality and privacy
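
The four steps above can be pictured as a chain of functions, each feeding the next. Everything in this sketch is a toy stand-in written for illustration: the function names, the `name` column, and the per-column sampler are assumptions, and none of it reflects how the service implements these stages.

```python
import random


def replace_pii(rows):
    # Stand-in: mask a column we assume holds a direct identifier.
    return [{**row, "name": "<name>"} for row in rows]


def train(rows):
    # Stand-in "model": the observed values of each column.
    return {col: [row[col] for row in rows] for col in rows[0]}


def generate(model, n, rng):
    # Stand-in sampler: draws each column independently, so cross-column
    # relationships are NOT preserved (a real synthesizer aims to keep them).
    return [{col: rng.choice(vals) for col, vals in model.items()} for _ in range(n)]


def evaluate(original, synthetic):
    # Stand-in check: count original names that reappear in the output.
    names = {row["name"] for row in original}
    return {"replayed_names": sum(row["name"] in names for row in synthetic)}


def run_pipeline(rows, n, seed=0):
    rng = random.Random(seed)
    scrubbed = replace_pii(rows)         # 1. Replace PII
    model = train(scrubbed)              # 2. Training
    synthetic = generate(model, n, rng)  # 3. Generation
    report = evaluate(rows, synthetic)   # 4. Evaluation against the original
    return synthetic, report


rows = [{"name": "Ada", "age": 36}, {"name": "Lin", "age": 41}]
synthetic, report = run_pipeline(rows, n=5)
print(report)  # {'replayed_names': 0}
```

Note the ordering: PII replacement happens before training, so the model never sees the raw identifiers, while evaluation compares the output against the original data.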

Evaluation Components#

Synthetic Quality Score#

Column Correlation Stability: Compare the correlation between every pair of columns in the original data with the same pair in the synthetic data
Deep Structure Stability: Use Principal Component Analysis to reduce the dimensionality when comparing the original and synthetic data
Column Distribution Stability: Compare the distribution for each column in the original data to its match in the synthetic data
Text Structure Similarity: Compare sentence, word, and character counts across the two datasets
Text Semantic Similarity: Assess whether the semantic meaning of the text is preserved after synthesis
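
To make the distribution and correlation comparisons above concrete, here is a small sketch. The source does not state which statistics the quality score is built from, so Jensen-Shannon divergence and Pearson correlation are assumptions picked as common choices; the function names are hypothetical.

```python
import math
from collections import Counter


def js_divergence(a, b):
    """Jensen-Shannon divergence (base 2) between two categorical columns.

    0.0 means identical distributions, 1.0 means no shared categories.
    """
    pa, pb = Counter(a), Counter(b)
    score = 0.0
    for key in set(pa) | set(pb):
        p = pa[key] / len(a)
        q = pb[key] / len(b)
        m = (p + q) / 2
        if p > 0:
            score += 0.5 * p * math.log2(p / m)
        if q > 0:
            score += 0.5 * q * math.log2(q / m)
    return score


def pearson(x, y):
    """Pearson correlation of two numeric columns (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_gap(orig_x, orig_y, syn_x, syn_y):
    """Absolute change in correlation for one column pair; smaller is better."""
    return abs(pearson(orig_x, orig_y) - pearson(syn_x, syn_y))


print(js_divergence(["a", "b", "b"], ["a", "b", "b"]))  # 0.0
print(correlation_gap([1, 2, 3], [2, 4, 6], [1, 2, 3], [6, 4, 2]))
```

Running `correlation_gap` over every pair of columns and `js_divergence` over every column gives per-column and per-pair numbers that could then be aggregated into a single score.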

Data Privacy Score#

Membership Inference Protection: Test whether attackers can determine if specific records were in the training data
Attribute Inference Protection: Assess whether sensitive attributes can be inferred from synthetic data
PII Replay: Evaluate the frequency with which sensitive values from the original data show up in the synthetic version
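
A PII replay check like the one described above can be approximated by asking what share of the distinct sensitive values in an original column reappear verbatim in the synthetic column. This is a simplified sketch under that assumption; the function name and the normalization step are hypothetical and not taken from Safe Synthesizer.

```python
def pii_replay_rate(original_values, synthetic_values):
    """Fraction of distinct original values that reappear in the synthetic column."""

    def normalize(value):
        # Ignore case and surrounding whitespace when matching.
        return str(value).strip().casefold()

    originals = {normalize(v) for v in original_values}
    if not originals:
        return 0.0
    synthetic = {normalize(v) for v in synthetic_values}
    return len(originals & synthetic) / len(originals)


original = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]
synthetic = ["B@x.com ", "z@y.org"]
print(pii_replay_rate(original, synthetic))  # 0.25
```

Exact matching only catches verbatim replay; near-duplicates (for example, a name with one letter changed) would need fuzzy matching.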

Integration with NeMo Platform#

Entity Management#

Safe Synthesizer integrates with core NeMo platform services:

Data Store: Input and output datasets managed through NeMo Data Store
Entity Store: Job metadata and configuration stored in Entity Store
Projects and Namespaces: Organize Safe Synthesizer jobs within existing project structures
Authentication: Use standard NeMo platform authentication and authorization

API Consistency#

Safe Synthesizer follows established NeMo microservices patterns:

Jobs API: Consistent /v1beta1/safe-synthesizer/jobs/* endpoint structure
Result Management: Standard result listing, metadata, and download patterns
Status Monitoring: Common job status tracking and log streaming
Error Handling: Consistent error responses and validation patterns
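
The status-monitoring pattern above usually amounts to polling until a job reaches a terminal state. The sketch below uses only the endpoint prefix named in the list; the base URL, the status strings, and the `get_status` callable are placeholders, since the request and response shapes are not described here. Consult the API documentation for the real paths and fields.

```python
import time

JOBS_PATH = "/v1beta1/safe-synthesizer/jobs"  # prefix taken from the list above


def jobs_url(base_url: str) -> str:
    """Join a deployment's base URL (assumed) with the jobs prefix."""
    return base_url.rstrip("/") + JOBS_PATH


def wait_for_job(get_status, terminal_states, interval_s=5.0, max_polls=120,
                 sleep=time.sleep):
    """Call get_status() until it returns a terminal state or polls run out.

    get_status is any zero-argument callable returning a status string,
    for example a small wrapper around an HTTP GET (not shown here).
    """
    for _ in range(max_polls):
        status = get_status()
        if status in terminal_states:
            return status
        sleep(interval_s)
    raise TimeoutError(f"job not finished after {max_polls} polls")


# Simulated status sequence; the state names are hypothetical.
states = iter(["pending", "running", "completed"])
final = wait_for_job(lambda: next(states), {"completed", "error"},
                     sleep=lambda _: None)
print(jobs_url("http://localhost:8080/"), final)
```

Injecting `get_status` and `sleep` keeps the loop testable without a running service.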

Supported Data Types#

Safe Synthesizer is optimized for tabular data with the following column types:

Numeric Data#

Integers: Ages, counts, ID numbers
Floating Point: Measurements, scores, financial amounts
Constrained Ranges: Values with domain-specific bounds

Categorical Data#

Nominal: Categories without inherent order (departments, product types)
Ordinal: Categories with natural ordering (education levels, ratings)
High Cardinality: Large number of unique values (postal codes, product SKUs)

Text Data#

Free Text: Comments, descriptions, notes
Structured Text: Addresses, formatted names
Mixed Content: Columns containing both structured and free-form text

Temporal Data#

Dates: Event timestamps, birth dates, transaction dates
Time Series: Sequential data with temporal dependencies
Durations: Time intervals and periods

Privacy Considerations#

Regulatory Compliance#

Safe Synthesizer helps organizations meet various privacy regulations:

GDPR: Right to erasure, data minimization, privacy by design
HIPAA: Protected health information safeguards
CCPA: Consumer privacy rights and data protection
SOX: Financial data protection and audit requirements

Next Steps#

Learn about core Safe Synthesizer concepts and privacy protection methods
Explore the complete workflow guide for step-by-step instructions
Review API documentation for programmatic integration
Check out Python SDK examples for common usage patterns
See the Docker Compose setup guide for deployment instructions