Quick Start: Your First Safe Synthesizer Job#
Learn the fundamentals of NeMo Safe Synthesizer by creating your first Safe Synthesizer job using provided defaults. In this tutorial, you’ll upload sample customer data, replace personally identifiable information, fine-tune a model, generate synthetic records, and review the evaluation report. The tutorial should take about 20 minutes to run.
Prefer a Jupyter notebook with the same content? Check out Safe Synthesizer 101.
What You’ll Learn#
By the end of this tutorial, you’ll understand how to:
Upload datasets for processing
Run Safe Synthesizer jobs using the Python SDK
Track job progress and retrieve results
Prerequisites#
Before you begin, make sure that you have:
Access to a deployment of NeMo Safe Synthesizer, such as the Docker Compose deployment (../docker-compose.md)
Setup#
Step 1: Install nemo-microservices Python SDK#
pip install "nemo-microservices[safe-synthesizer]"
Step 2: Configure endpoints#
You will use the NeMoMicroservices client to interact with NeMo Safe Synthesizer. http://localhost:8080 is the default base_url for the quickstart deployment.
If you are using a managed or remote deployment, make sure to use the correct base URL.
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://localhost:8080",
)
NeMo Data Store is also launched as part of the deployment, and we’ll use it to manage our storage.
datastore_config = {
    "endpoint": "http://localhost:3000/v1/hf",
    "token": "",
}
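For a managed or remote deployment, one option is to read both endpoints from environment variables instead of hard-coding them. The sketch below is an assumption, not part of the SDK: the variable names NEMO_BASE_URL, NDS_ENDPOINT, and NDS_TOKEN and the helper resolve_endpoints are hypothetical, and only the default URLs come from this tutorial.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the quickstart deployment described in this tutorial.
DEFAULT_BASE_URL = "http://localhost:8080"
DEFAULT_DATASTORE_ENDPOINT = "http://localhost:3000/v1/hf"


def resolve_endpoints(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return connection settings, preferring environment overrides.

    The variable names are hypothetical; use whatever your deployment
    tooling already defines.
    """
    env = os.environ if env is None else env
    return {
        # Strip a trailing slash so later URL joins stay predictable.
        "base_url": env.get("NEMO_BASE_URL", DEFAULT_BASE_URL).rstrip("/"),
        "datastore_config": {
            "endpoint": env.get("NDS_ENDPOINT", DEFAULT_DATASTORE_ENDPOINT),
            "token": env.get("NDS_TOKEN", ""),
        },
    }


settings = resolve_endpoints({"NEMO_BASE_URL": "https://nemo.example.com/"})
print(settings["base_url"])  # https://nemo.example.com
```

The resulting values can then be passed along as NeMoMicroservices(base_url=settings["base_url"]) and datastore_config = settings["datastore_config"].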
Step 3: Verify Service is Working#
Verify that your NeMo Safe Synthesizer service is running and accessible.
# Test connection by checking service health
try:
    # This will raise an exception if the service is not accessible
    jobs = client.beta.safe_synthesizer.jobs.list()
    print("✅ Successfully connected to Safe Synthesizer service")
    print(f"Found {len(jobs)} existing jobs")
except Exception as e:
    print(f"❌ Cannot connect to service: {e}")
    print("Please verify base_url and service status")
If the connection test fails, refer to the deployment guide to troubleshoot your setup.
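A freshly started deployment can take a moment before it accepts requests, so a single failed check is not always conclusive. The sketch below retries an arbitrary check a few times before giving up. The helper wait_until_reachable and its parameters are illustrative and not part of the SDK; it accepts any callable that raises on failure.

```python
import time
from typing import Callable


def wait_until_reachable(check: Callable[[], object], attempts: int = 5,
                         delay_seconds: float = 2.0) -> bool:
    """Call `check` until it stops raising, or give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            check()
            return True
        except Exception as exc:  # broad on purpose: any failure means "not ready yet"
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    return False
```

With the client from Step 2, this could be called as wait_until_reachable(lambda: client.beta.safe_synthesizer.jobs.list()).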
Run a Job with SafeSynthesizerBuilder#
The SafeSynthesizerBuilder wraps lower-level API calls to make common use cases easier. This is the recommended way to use NeMo Safe Synthesizer unless you need more direct access to the API.
Step 1: Load Input Dataset#
Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.
The sample dataset used here is a set of women’s clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location.
# %pip is a notebook magic; in a terminal, run `pip install kagglehub` instead
%pip install kagglehub || uv pip install kagglehub

import kagglehub
import pandas as pd

# Download the latest version of the sample dataset (change this to your own dataset if desired)
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.head()
Expected Output:
df.head() displays the first five reviews, with the columns Clothing ID, Age, Title, Review Text, Rating, Recommended IND, Positive Feedback Count, Division Name, Class Name, and Department Name.
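Before submitting a dataset, it can help to spot-check free-text columns for the kinds of PII mentioned above. The sketch below is a rough regex scan for height and weight mentions. It is not how Safe Synthesizer detects PII; the patterns and the find_pii_hints helper are assumptions made purely for illustration.

```python
import re

# Illustrative patterns only; real PII detection is far more thorough.
PII_HINT_PATTERNS = {
    # Heights written like 5'4" or 5' 4
    "height": re.compile(r"\b\d\s?'\s?\d{1,2}\s?\"?"),
    # Weights written like 125 lbs, 130 pounds, or 60kg
    "weight": re.compile(r"\b\d{2,3}\s?(?:lbs?|pounds|kg)\b", re.IGNORECASE),
}


def find_pii_hints(text: str) -> dict:
    """Return the pattern names that match `text`, with the matched snippets."""
    hints = {}
    for name, pattern in PII_HINT_PATTERNS.items():
        matches = [m.group(0).strip() for m in pattern.finditer(text)]
        if matches:
            hints[name] = matches
    return hints


review = "I'm 5'4\" and 125 lbs and the medium fits perfectly."
print(find_pii_hints(review))
```

To scan a whole text column, the helper could be mapped over it, for example df["Review Text"].dropna().map(find_pii_hints), assuming the review text column carries that name in your copy of the data.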
Step 2: Configure and Start the NeMo Safe Synthesizer Job#
SafeSynthesizerBuilder is a fluent interface for configuring a job.
This tutorial uses the basic functionality, but you can also provide custom configuration parameters.
The SafeSynthesizerBuilder handles uploading df to NeMo Data Store and making the API call to create the job.
The microservice automatically starts running the job as soon as a GPU is available.
from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder
builder = (
    SafeSynthesizerBuilder(client)
    .from_data_source(df)
    .with_datastore(datastore_config)
    .with_replace_pii()
    .synthesize()
)
job = builder.create_job()
Expected Output:
...
INFO:httpx:HTTP Request: POST http://localhost:8080/v1beta1/safe-synthesizer/jobs "HTTP/1.1 200 OK"
Step 3: Check Status and Read Logs#
You can use the job instance from the previous step to interact with the running Safe Synthesizer job.
Check the status of the job:
job.fetch_status()
Expected Output:
INFO:httpx:HTTP Request: GET http://localhost:8080/v1beta1/safe-synthesizer/jobs/job-nqzna8jirzefcorogsmz19/status "HTTP/1.1 200 OK"
'active'
Job Status Definitions:
created: Job has been created but not yet started
pending: Job is queued and waiting for resources
active: Job is processing your data
completed: Job finished successfully - results are ready
error: Job encountered an error - check logs for details
cancelled: Job was manually cancelled
cancelling: Job is in the process of being cancelled
paused: Job execution has been paused
pausing: Job is in the process of being paused
resuming: Job is resuming from a paused state
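The status values above can drive a simple polling loop. The sketch below treats completed, error, and cancelled as terminal and keeps polling otherwise. That grouping is this example's own interpretation of the list, and poll_until_terminal is a hypothetical helper, not an SDK function.

```python
import time
from typing import Callable

# Status names come from the list above; treating these three as
# "terminal" is an assumption made for this example.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}


def poll_until_terminal(fetch_status: Callable[[], str],
                        interval_seconds: float = 30.0,
                        max_polls: int = 120) -> str:
    """Poll `fetch_status` until it returns a terminal status.

    Returns the last observed status, which may be non-terminal if
    `max_polls` is reached first.
    """
    status = ""
    for poll in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if poll < max_polls - 1:
            time.sleep(interval_seconds)
    return status
```

With the job instance from Step 2, a call might look like poll_until_terminal(job.fetch_status), since job.fetch_status() returns the status as a string.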
Print the log messages produced by the job so far:
job.print_logs()
(Output will be empty if the job is still in ‘created’ or ‘pending’.)
Step 4: Retrieve Synthetic Data#
First, wait until the job is finished and check the status.
job.wait_for_completion()
job.fetch_status()
Expected Output:
'completed'
If the output is not ‘completed’, then something went wrong with the job. Inspect the logs using job.print_logs() to look for any error messages.
With a completed job, you can return the generated synthetic data as a pandas DataFrame with
synthetic_df = job.fetch_data()
Step 5: View Evaluation Report and Summary#
A summary is available with information on how long the various steps took, top-level scores from the evaluation report (see full report below), and other statistics.
summary = job.fetch_summary()
print(summary)
The summary is useful for programmatic access to the synthetic data quality score or timing information, such as for hyperparameter tuning.
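As one illustration of that programmatic use, the sketch below picks the best of several runs by quality score while enforcing a minimum privacy score. The summaries are modeled as plain dictionaries, and the keys quality_score and privacy_score are hypothetical; inspect the object returned by job.fetch_summary() for the actual field names.

```python
from typing import Iterable, Optional


def pick_best_run(runs: Iterable[dict], min_privacy: float = 0.0) -> Optional[dict]:
    """Return the highest-quality run among those meeting the privacy floor."""
    eligible = [r for r in runs if r["privacy_score"] >= min_privacy]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["quality_score"])


# Made-up numbers, purely to show the selection logic.
runs = [
    {"name": "run-a", "quality_score": 7.9, "privacy_score": 8.5},
    {"name": "run-b", "quality_score": 8.6, "privacy_score": 6.0},
    {"name": "run-c", "quality_score": 8.2, "privacy_score": 8.1},
]
best = pick_best_run(runs, min_privacy=8.0)
print(best["name"])  # run-c
```

Filtering on privacy before maximizing quality keeps a high-quality but low-privacy run (run-b here) from winning by default.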
By default, every NeMo Safe Synthesizer job compares the input and output data and produces an evaluation report. Download the report and open the file ./evaluation_report.html in a browser to see the high-level synthetic data quality score and data privacy score, plus more details and charts comparing the input and output data.
job.save_report("./evaluation_report.html")
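Once the report is saved, it can also be opened from Python using only the standard library. The sketch below assumes the report was written to ./evaluation_report.html as in the call above; report_uri and open_report are illustrative helpers, not SDK functions.

```python
import webbrowser
from pathlib import Path


def report_uri(path: str = "./evaluation_report.html") -> str:
    """Convert a local report path into a file:// URI a browser can open."""
    return Path(path).resolve().as_uri()


def open_report(path: str = "./evaluation_report.html") -> bool:
    """Open the saved report in the default browser.

    Returns False without opening anything if the file does not exist.
    """
    if not Path(path).is_file():
        print(f"Report not found at {path}; save it first.")
        return False
    return webbrowser.open(report_uri(path))
```

Calling open_report() after job.save_report("./evaluation_report.html") should bring the report up in the default browser on a machine with a desktop environment.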
If you are using a Jupyter notebook, the report can be displayed inline with
job.display_report_in_notebook()
Next Steps#
Now that you understand the basics, explore more advanced NeMo Safe Synthesizer capabilities:
PII Replacement: Explore custom PII replacement strategies for your data
Synthesize Data: Learn about training and generating synthetic data
Quality and Privacy Evaluation: Understand privacy and quality evaluation metrics