Synthesize Data#
Learn how to generate synthetic tabular data that preserves statistical properties while protecting individual privacy using NeMo Safe Synthesizer.
Overview#
The synthesizer is the core component of NeMo Safe Synthesizer. It fine-tunes an LLM on your dataset and uses it to generate realistic synthetic data that preserves the utility of the original while protecting privacy. Synthetic versions of private data let you extract insights without exposing individual records, which enables downstream use cases such as AI model training and analytics.
You can optionally enable Differential Privacy for stronger privacy protection backed by formal mathematical guarantees.
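For context on what a mathematical guarantee means here, the textbook definition of (ε, δ)-differential privacy is sketched below. This is the standard definition only; the text above does not say which variant or which δ the product uses, so treat the mapping to any product setting as an assumption. A randomized mechanism M satisfies the definition if, for all datasets D and D' that differ in one record and for every set of outputs S:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

A smaller ε gives a tighter bound and therefore stronger privacy. For example, ε = 6.0 corresponds to a multiplicative factor of e^6 ≈ 403.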
Quick Start#
Note
Before you start, make sure that you have:
Stored CSVs locally
Uploaded them using the following steps:
export HF_ENDPOINT="http://localhost:3000/v1/hf"
huggingface-cli upload --repo-type dataset default/safe-synthesizer sensitive-data.csv
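The upload target and the `data_source` value in the job request have to agree. The sketch below shows one way to derive the `hf://` URI from the repository ID and file name. `hf_data_source` is a hypothetical helper, not part of any SDK, and it assumes the file lands at the repository root under its own name, which matches the `data_source` string used in this guide.

```python
from pathlib import PurePosixPath


def hf_data_source(repo_id: str, filename: str) -> str:
    """Build an hf:// dataset URI for a file uploaded to a dataset repo.

    Hypothetical helper: assumes the file sits at the repo root
    under its own file name.
    """
    if repo_id.count("/") != 1 or not all(repo_id.split("/")):
        raise ValueError("repo_id should look like 'namespace/name'")
    # Keep only the final path component, e.g. 'data/x.csv' -> 'x.csv'
    name = PurePosixPath(filename).name
    if not name:
        raise ValueError("filename must not be empty")
    return f"hf://datasets/{repo_id}/{name}"


uri = hf_data_source("default/safe-synthesizer", "sensitive-data.csv")
print(uri)
```

Deriving the URI from the same two values used in the upload command avoids a mismatch between what was uploaded and what the job reads.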
import pandas as pd
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url="http://localhost:8080")

# Load your sensitive dataset for local inspection.
# The job itself reads the uploaded copy referenced by "data_source".
df = pd.read_csv("sensitive-data.csv")

# Full pipeline, submitted to the REST API through the Python client
job_request = {
    "name": "synthesizer-pipeline",
    "project": "default",
    "spec": {
        "data_source": "hf://datasets/default/safe-synthesizer/sensitive-data.csv",
        "config": {
            "enable_synthesis": True,
            "enable_replace_pii": True,
            # Replace detected emails and phone numbers with fake values
            "replace_pii": {
                "globals": {"locales": ["en_US"]},
                "steps": [
                    {
                        "rows": {
                            "update": [
                                {
                                    "entity": ["email", "phone_number"],
                                    "value": "column.entity | fake",
                                }
                            ]
                        }
                    }
                ],
            },
            # Fine-tuning settings
            "training": {
                "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                "max_sequences_per_example": "auto",
                "num_input_records_to_sample": "auto",
            },
            "generation": {"num_records": 1000, "temperature": 0.8},
            # Differential Privacy with epsilon = 6.0
            "privacy": {"privacy_hyperparams": {"dp": True, "epsilon": 6.0}},
            # Membership and attribute inference attack evaluations
            "evaluation": {"mia_enabled": True, "aia_enabled": True},
        },
    },
}

job = client.beta.safe_synthesizer.jobs.create(**job_request)

# Access results using the jobs API
results = client.beta.safe_synthesizer.jobs.results.list(job.id)
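When you submit several jobs that differ only in a few settings, it can help to build the request from a function. The following is a minimal sketch rather than an SDK feature: `build_job_request` is a hypothetical helper, it reuses only keys that appear in the request above, and it assumes that omitting the `privacy`, `replace_pii`, and `evaluation` sections is accepted and leaves those features off. The range checks are plain sanity checks, not documented service rules.

```python
from typing import Any, Optional


def build_job_request(
    data_source: str,
    num_records: int = 1000,
    temperature: float = 0.8,
    epsilon: Optional[float] = None,
    name: str = "synthesizer-pipeline",
    project: str = "default",
) -> dict[str, Any]:
    """Return a job request dict shaped like the Quick Start example.

    Hypothetical helper: Differential Privacy is added only when an
    epsilon is supplied (assumes it is off when the section is absent).
    """
    if num_records <= 0:
        raise ValueError("num_records must be a positive integer")
    if epsilon is not None and epsilon <= 0:
        raise ValueError("epsilon must be positive")

    config: dict[str, Any] = {
        "enable_synthesis": True,
        "training": {
            "pretrained_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            "max_sequences_per_example": "auto",
            "num_input_records_to_sample": "auto",
        },
        "generation": {"num_records": num_records, "temperature": temperature},
    }
    if epsilon is not None:
        config["privacy"] = {
            "privacy_hyperparams": {"dp": True, "epsilon": epsilon}
        }

    return {
        "name": name,
        "project": project,
        "spec": {"data_source": data_source, "config": config},
    }


request = build_job_request(
    "hf://datasets/default/safe-synthesizer/sensitive-data.csv",
    num_records=500,
    epsilon=6.0,
)
print(request["spec"]["config"]["privacy"])
```

The resulting dict can be passed to the same call used in the Quick Start, `client.beta.safe_synthesizer.jobs.create(**request)`.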
Supported Data Types#
NeMo Safe Synthesizer supports numeric, categorical, text, and event-driven fields in tabular data. You customize generation through configuration parameters, such as `num_records` and `temperature` in the `generation` section of the job config.
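Before submitting a dataset, it can be useful to get a rough picture of which of these field kinds each column falls into. The sketch below is an illustrative heuristic only: `infer_field_kind` and its `max_categories` threshold are hypothetical, it does not reflect how NeMo Safe Synthesizer itself classifies fields, and it does not attempt to detect event-driven fields.

```python
from typing import Iterable, Optional


def _is_number(text: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_field_kind(values: Iterable[Optional[str]], max_categories: int = 20) -> str:
    """Roughly label a column as 'numeric', 'categorical', 'text', or 'empty'.

    Hypothetical heuristic: a column is numeric if every non-blank value
    parses as a float, categorical if it has few distinct values with
    at least one repeat, and text otherwise.
    """
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "empty"
    if all(_is_number(v) for v in cleaned):
        return "numeric"
    distinct = set(cleaned)
    if len(distinct) <= max_categories and len(distinct) < len(cleaned):
        return "categorical"
    return "text"


columns = {
    "age": ["34", "51", "", "28"],
    "plan": ["basic", "pro", "basic", "pro"],
    "notes": ["Called about billing.", "Requested a refund for March."],
}
for column_name, column_values in columns.items():
    print(column_name, infer_field_kind(column_values))
```

The sample column names and values are made up for illustration; in practice you would feed in the columns of your own CSV.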