Safe Synthesizer Concepts#
Learn about the core concepts for private synthetic data generation that protects sensitive information while maintaining data utility.
What is Safe Synthesizer?#
Safe Synthesizer is a privacy-focused approach to synthetic data generation designed specifically for organizations working with sensitive datasets. It is purpose-built for privacy compliance and data protection while preserving data utility for downstream AI tasks.
Key Principles#
Privacy by Design: Built-in privacy protections from data ingestion to final output
Regulatory Compliance: Tailor configurations toward GDPR, HIPAA, and other privacy regulation requirements
Mathematical Guarantees: Optional differential privacy for provable privacy protection
Data Utility Preservation: Maintain statistical relationships and patterns for AI model training
End-to-End Workflow: Complete pipeline from PII detection to privacy evaluation
Core Components#
Safe Synthesizer combines three essential capabilities to deliver private synthetic data:
PII Detection and Replacement#
Automatically identify and replace personally identifiable information in tabular datasets:
Entity Detection: Recognize names, addresses, phone numbers, emails, and custom entity types
Classification: Use machine learning models to classify column types and content
Flexible Replacement: Configure how sensitive data is redacted or replaced
Preservation Options: Maintain data format and relationships while removing sensitive content
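As a rough illustration of the detect-and-replace idea, the sketch below uses two regex patterns and fixed placeholders. The actual service relies on ML-based entity detection and configurable replacement strategies, so every pattern and placeholder here is an assumption for illustration only:

```python
import re

# Illustrative patterns for two common entity types; the service's real
# detectors (ML-based NER plus column classifiers) are far more capable.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def replace_pii(text: str) -> str:
    """Replace detected entities with format-preserving placeholders."""
    text = PATTERNS["email"].sub("user@example.com", text)
    # Keep the XXX-XXX-XXXX shape so downstream format checks still pass.
    text = PATTERNS["phone"].sub("555-555-0100", text)
    return text

row = "Contact Jane at jane.doe@acme.io or 212-555-0187."
print(replace_pii(row))
# → Contact Jane at user@example.com or 555-555-0100.
```

Note that the replacement values keep the original format (a valid-looking email, a phone number with the same digit grouping), which is what "Preservation Options" refers to.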
Privacy-Protecting Synthesis#
Generate synthetic data that maintains utility while protecting individual privacy:
LLM-Based Generation: Use fine-tuned language models for realistic tabular data synthesis
Data Fidelity: Preserve hard constraints and statistical relationships in the original data
Differential Privacy: Apply mathematical privacy guarantees during model training
Format Preservation: Maintain original data structure, types, and constraints
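Format preservation can be pictured as a post-generation validation pass that drops synthetic rows violating the types and ranges observed in the original data. The two-column schema and bounds below are hypothetical; the service's real constraint handling is internal to generation:

```python
# Hypothetical schema derived from the original data: (type, min, max).
SCHEMA = {
    "age": (int, 0, 120),
    "score": (float, 0.0, 1.0),
}

def valid_row(row: dict) -> bool:
    """Keep a synthetic row only if every column matches its type and range."""
    for col, (typ, lo, hi) in SCHEMA.items():
        val = row.get(col)
        if not isinstance(val, typ) or not (lo <= val <= hi):
            return False
    return True

synthetic = [{"age": 34, "score": 0.82}, {"age": 150, "score": 0.5}]
kept = [r for r in synthetic if valid_row(r)]
print(len(kept))  # → 1 (the age=150 row violates the 0–120 bound)
```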
Quality and Privacy Evaluation#
Comprehensive assessment of both data utility and privacy protection:
Quality Metrics: Compare statistical properties between original and synthetic data
Privacy Metrics: Measure privacy leakage and protection effectiveness
Compliance Reporting: Generate reports for regulatory and audit requirements
Interactive Reports: HTML dashboards with visualizations and detailed analysis
Safe Synthesizer vs Data Designer#
Understanding when to use each synthetic data approach:
| Use Case | Safe Synthesizer | Data Designer |
| --- | --- | --- |
| Primary Goal | Privacy protection of existing sensitive data | Creating new synthetic data for AI training |
| Input Requirements | Existing sensitive tabular datasets | Data schemas, prompts, or seed data |
| Privacy Approach | Mathematical privacy guarantees (differential privacy) | Statistical privacy through generation diversity |
| Regulatory Focus | Compliance-first (GDPR, HIPAA, SOX) | Development and testing focused |
| Data Relationships | Preserve existing statistical relationships | Create new realistic data relationships |
| Workflow Complexity | Single API call for complete pipeline | Flexible configuration-driven generation |
Privacy Protection Methods#
PII Replacement#
The first line of privacy defense focuses on identifying and protecting direct identifiers:
Named Entity Recognition: Use machine learning models to detect PII in free text
Pattern Matching: Apply regular expressions and custom rules for specific entity types
Column Classification: Automatically categorize columns based on content analysis
Contextual Replacement: Replace sensitive data with realistic but non-identifying alternatives
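Column classification can be approximated with a simple frequency heuristic: if most values in a column match an entity pattern, label the whole column accordingly. The pattern and the 80% threshold below are illustrative assumptions, not the service's actual classifier, which combines ML models with rules:

```python
import re

EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

def classify_column(values, threshold=0.8):
    """Label a column 'email' when most non-empty values match the pattern.
    Threshold and pattern are illustrative only."""
    hits = sum(1 for v in values if v and EMAIL.match(v))
    return "email" if hits / max(len(values), 1) >= threshold else "other"

print(classify_column(["a@x.com", "b@y.org", "c@z.net"]))  # → email
print(classify_column(["a@x.com", "hello", "world"]))      # → other
```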
Differential Privacy#
Mathematical framework providing quantifiable privacy guarantees:
Privacy Budget (ε): Controls the trade-off between privacy and data utility
Delta (δ): Probability bound for privacy guarantee violations
Noise Addition: Carefully calibrated noise injection during model training
Composition: Track privacy budget consumption across multiple operations
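To make the ε and δ knobs concrete, the sketch below uses the textbook Gaussian-mechanism calibration σ = Δ·√(2 ln(1.25/δ))/ε (valid for ε < 1) and sums ε across steps for basic sequential composition. Real differentially private training (DP-SGD) additionally clips per-sample gradients and uses tighter privacy accountants, so treat this purely as intuition for how the budget controls noise:

```python
import math
import random

def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale for the classic Gaussian mechanism (assumes epsilon < 1):
    sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def privatize(value: float, sensitivity: float, epsilon: float, delta: float) -> float:
    """Release a value with calibrated Gaussian noise added."""
    return value + random.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))

# Smaller epsilon → more noise → stronger privacy, lower utility.
print(round(gaussian_sigma(1.0, 0.5, 1e-5), 2))

# Basic sequential composition: epsilon spent across operations adds up.
steps = [0.2, 0.2, 0.1]
print(round(sum(steps), 2))  # → 0.5 total epsilon consumed
```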
Workflow Orchestration#
Safe Synthesizer operates as a single microservice that orchestrates multiple steps to preserve utility of original datasets while adding layers of privacy:
Sequential Processing#
Replace PII: Apply PII detection and replacement to training data
Training: Fine-tune language model with optional differential privacy
Generation: Produce synthetic data with validation and parsing
Evaluation: Compare original and synthetic data to analyze quality and privacy
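The four stages above can be sketched as a simple function pipeline. Every function body below is a stand-in stub, since the real implementations run inside the Safe Synthesizer service; only the ordering and data flow are the point:

```python
# Stub pipeline showing stage ordering and data flow, not real implementations.
def replace_pii(data):
    return [row.replace("secret", "[REDACTED]") for row in data]

def train(data):
    return {"model": "fine-tuned", "rows_seen": len(data)}

def generate(model, n):
    return [f"synthetic-row-{i}" for i in range(n)]

def evaluate(orig, synth):
    return {"quality": 0.9, "privacy": 0.95}  # placeholder scores

original = ["row with secret value", "another row"]
cleaned = replace_pii(original)      # 1. Replace PII
model = train(cleaned)               # 2. Training (optionally with DP)
synthetic = generate(model, n=2)     # 3. Generation with validation
report = evaluate(original, synthetic)  # 4. Evaluation
print(report)
```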
Evaluation Components#
Synthetic Quality Score#
Column Correlation Stability: Analyze the correlation across every pair of columns
Deep Structure Stability: Use Principal Component Analysis to reduce dimensionality when comparing the original and synthetic data
Column Distribution Stability: Compare the distribution of each column in the original data to its counterpart in the synthetic data
Text Structure Similarity: Calculate sentence, word, and character counts across the two datasets
Text Semantic Similarity: Assess whether the semantic meaning of the text is preserved after synthesis
Data Privacy Score#
Membership Inference Protection: Test whether attackers can determine if specific records were in the training data
Attribute Inference Protection: Assess whether sensitive attributes can be inferred from synthetic data
PII Replay: Evaluate the frequency with which sensitive values from the original data show up in the synthetic version
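PII replay, the simplest of these checks, can be sketched as a set intersection over distinct sensitive values. The helper below is an illustrative assumption about how such a rate could be computed, not the service's actual metric:

```python
def pii_replay_rate(original_values, synthetic_values):
    """Fraction of distinct sensitive values from the original data that
    reappear verbatim in the synthetic output. Lower is better."""
    orig = set(original_values)
    if not orig:
        return 0.0
    replayed = orig & set(synthetic_values)
    return len(replayed) / len(orig)

orig_emails = ["a@x.com", "b@y.org", "c@z.net"]
synth_emails = ["a@x.com", "d@w.io", "e@v.co"]  # one original value leaked
print(round(pii_replay_rate(orig_emails, synth_emails), 2))  # → 0.33
```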
Integration with NeMo Platform#
Entity Management#
Safe Synthesizer integrates with core NeMo platform services:
Data Store: Input and output datasets managed through NeMo Data Store
Entity Store: Job metadata and configuration stored in Entity Store
Projects and Namespaces: Organize Safe Synthesizer jobs within existing project structures
Authentication: Use standard NeMo platform authentication and authorization
API Consistency#
Follow established NeMo microservices patterns:
Jobs API: Consistent /v1beta1/safe-synthesizer/jobs/* endpoint structure
Result Management: Standard result listing, metadata, and download patterns
Status Monitoring: Common job status tracking and log streaming
Error Handling: Consistent error responses and validation patterns
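A job submission against the documented endpoint path might be shaped as follows. The base URL, payload field names, and configuration keys are assumptions for illustration; consult the API reference for the actual request schema:

```python
import json

# Hypothetical deployment URL; the path segment matches the documented Jobs API.
BASE_URL = "http://localhost:8080"
endpoint = f"{BASE_URL}/v1beta1/safe-synthesizer/jobs"

# Field names below are illustrative assumptions, not the real schema.
payload = {
    "name": "demo-safe-synth-job",
    "input_dataset": "my-namespace/patients-table",
    "config": {"differential_privacy": {"epsilon": 8.0, "delta": 1e-5}},
}

# An HTTP client would POST this body; shown as data only to keep the sketch inert.
request_body = json.dumps(payload)
print(endpoint)
```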
Supported Data Types#
Safe Synthesizer is optimized for tabular data with the following column types:
Numeric Data#
Integers: Age, count, ID numbers
Floating Point: Measurements, scores, financial amounts
Constrained Ranges: Values with domain-specific bounds
Categorical Data#
Nominal: Categories without inherent order (departments, product types)
Ordinal: Categories with natural ordering (education levels, ratings)
High Cardinality: Large number of unique values (postal codes, product SKUs)
Text Data#
Free Text: Comments, descriptions, notes
Structured Text: Addresses, formatted names
Mixed Content: Columns containing both structured and free-form text
Temporal Data#
Dates: Event timestamps, birth dates, transaction dates
Time Series: Sequential data with temporal dependencies
Durations: Time intervals and periods
Privacy Considerations#
Regulatory Compliance#
Safe Synthesizer helps organizations meet various privacy regulations:
GDPR: Right to erasure, data minimization, privacy by design
HIPAA: Protected health information safeguards
CCPA: Consumer privacy rights and data protection
SOX: Financial data protection and audit requirements
Next Steps#
Explore the complete workflow guide for step-by-step instructions
Review API documentation for programmatic integration
Check out Python SDK examples for common usage patterns
See the Docker Compose setup guide for deployment instructions