Quick Start: Your First Safe Synthesizer Job#

Learn the fundamentals of NeMo Safe Synthesizer by creating your first Safe Synthesizer job using provided defaults. In this tutorial, you’ll upload sample customer data, replace personally identifiable information, fine-tune a model, generate synthetic records, and review the evaluation report. The tutorial should take about 20 minutes to run, but it can take longer depending on your image pull time (slower internet speeds mean more time spent pulling weights).

Prefer a Jupyter notebook with the same content? Check out Safe Synthesizer 101.

What You’ll Learn#

By the end of this tutorial, you’ll understand how to:

Upload datasets for processing
Run Safe Synthesizer jobs using the Python SDK
Track job progress and retrieve results

Prerequisites#

Before you begin, make sure that you have:

Access to a deployment of NeMo Safe Synthesizer, such as the one described in ../docker-compose.md

Setup#

Step 1: Install nemo-microservices Python SDK#

pip install "nemo-microservices[safe-synthesizer]"

Step 2: Configure endpoints#

You will use the NeMoMicroservices client to interact with NeMo Safe Synthesizer. The default base_url for the quickstart deployment is http://localhost:8080. If you are using a managed or remote deployment, use that deployment's base URL instead.

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http://localhost:8080",
)

NeMo Data Store is also launched as part of the deployment, and we’ll use it to manage storage.

datastore_config = {
    "endpoint": "http://localhost:3000/v1/hf",
}
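If you switch between local and remote deployments, it can help to read both endpoints from the environment instead of hard-coding them. The following is a minimal sketch. The variable names NEMO_BASE_URL and NEMO_DATASTORE_URL are assumptions made for this example, not settings that the SDK reads on its own.

```python
import os

# Defaults taken from this tutorial's quickstart deployment.
DEFAULT_BASE_URL = "http://localhost:8080"
DEFAULT_DATASTORE_URL = "http://localhost:3000/v1/hf"


def resolve_endpoints(env=None):
    """Return (base_url, datastore_config), preferring environment overrides.

    NEMO_BASE_URL and NEMO_DATASTORE_URL are hypothetical variable names
    chosen for this sketch.
    """
    env = os.environ if env is None else env
    base_url = env.get("NEMO_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    datastore_url = env.get("NEMO_DATASTORE_URL", DEFAULT_DATASTORE_URL).rstrip("/")
    return base_url, {"endpoint": datastore_url}


base_url, datastore_config = resolve_endpoints()
print(base_url, datastore_config)
```

The returned base_url can then be passed to NeMoMicroservices(base_url=base_url), and datastore_config has the same shape as the dictionary shown above.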

Step 3: Verify Service is Working#

Verify that your NeMo Safe Synthesizer service is running and accessible.

# Test connection by checking service health
try:
    # This will raise an exception if service is not accessible
    jobs = client.beta.safe_synthesizer.jobs.list()
    print("✅ Successfully connected to Safe Synthesizer service")
    print(f"Found {len(jobs.data)} existing jobs")
except Exception as e:
    print(f"❌ Cannot connect to service: {e}")
    print("Please verify base_url and service status")

If the connection test fails, refer to the deployment guide to troubleshoot your setup.
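Right after a deployment starts, the service may need a moment before it accepts requests, so a single failed check is not always conclusive. One option is to retry the check a few times before giving up. The helper below is a generic sketch and not part of the SDK. Its check argument is any zero-argument callable that raises on failure, for example lambda: client.beta.safe_synthesizer.jobs.list().

```python
import time


def retry_check(check, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call check() until it succeeds or the attempts run out.

    Returns the result of the first successful call and re-raises the
    last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except Exception as error:  # broad on purpose: any failure means "not ready yet"
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

The attempt count and delay are arbitrary starting points; adjust them to how long your deployment usually takes to come up.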

Run a Job with SafeSynthesizerBuilder#

The SafeSynthesizerBuilder wraps lower-level API calls to make common use cases easier. This is the recommended way to use NeMo Safe Synthesizer unless you need direct access to the underlying API calls.

Step 1: Load Input Dataset#

Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.

The sample dataset used here is a set of women’s clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location.

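# Notebook only: install kagglehub. From a shell, run
# `pip install kagglehub` or `uv pip install kagglehub` instead.
%pip install kagglehub

import kagglehub
import pandas as pd

# Download the latest version of the sample dataset (change to your own dataset if desired)
path = kagglehub.dataset_download("nicapotato/womens-ecommerce-clothing-reviews")
# index_col=0 uses the first CSV column as the DataFrame index
df = pd.read_csv(f"{path}/Womens Clothing E-Commerce Reviews.csv", index_col=0)
df.head()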

Expected Output:

A preview of the first five rows of the reviews dataset, with columns for age, product category, rating, and review text.

Step 2: Configure and Start the NeMo Safe Synthesizer Job#

SafeSynthesizerBuilder is a fluent interface to configure a job. This tutorial uses the basic functionality, but any custom configuration parameters may also be provided. The SafeSynthesizerBuilder handles uploading df to NeMo Data Store and making the API call to create the job. The microservice will automatically start running the job as soon as a GPU is available.

from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder
builder = (
    SafeSynthesizerBuilder(client)
    .from_data_source(df)
    .with_datastore(datastore_config)
    .with_replace_pii()
    .synthesize()
)
job = builder.create_job()

Expected Output:

...
INFO:httpx:HTTP Request: POST http://localhost:8080/v1beta1/safe-synthesizer/jobs "HTTP/1.1 200 OK"

Step 3: Check Status and Read Logs#

You can use the job instance from the previous step to interact with the running Safe Synthesizer job.

Check the status of the job:

job.fetch_status()

Expected Output:

INFO:httpx:HTTP Request: GET http://localhost:8080/v1beta1/safe-synthesizer/jobs/job-nqzna8jirzefcorogsmz19/status "HTTP/1.1 200 OK"
'active'

Job Status Definitions:

created: Job has been created but not yet started
pending: Job is queued and waiting for resources
active: Job is processing your data
completed: Job finished successfully - results are ready
error: Job encountered an error - check logs for details
cancelled: Job was manually cancelled
cancelling: Job is in the process of being cancelled
paused: Job execution has been paused
pausing: Job is in the process of being paused
resuming: Job is resuming from a paused state
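When scripting around these statuses, it is often enough to know whether a job can still change state. The grouping below is one reading of the definitions above, not something the SDK provides: completed, error, and cancelled are treated as final, and everything else (including paused, since a paused job can resume) as still in flight.

```python
# Status strings as listed in this tutorial.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}
IN_PROGRESS_STATUSES = {
    "created", "pending", "active",
    "cancelling", "paused", "pausing", "resuming",
}


def is_terminal(status):
    """Return True once the job can no longer change state."""
    if status in TERMINAL_STATUSES:
        return True
    if status in IN_PROGRESS_STATUSES:
        return False
    # Fail loudly on anything not covered by the list above.
    raise ValueError(f"Unknown job status: {status!r}")
```

For example, is_terminal(job.fetch_status()) could gate a step that should only run after the job has stopped.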

Print the log messages produced by the job so far:

job.print_logs()

(Output will be empty if the job is still in ‘created’ or ‘pending’.)

Step 4: Retrieve Synthetic Data#

First, wait until the job is finished, then check the status.

job.wait_for_completion()
job.fetch_status()

Expected Output:

'completed'

If the output is not ‘completed’, then something went wrong with the job. Inspect the logs using job.print_logs() to look for any error messages.
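If you want an upper bound on the wait, or a progress line for each poll, a hand-rolled polling loop is one alternative to job.wait_for_completion(). This is a sketch under assumptions: it relies only on a callable that returns one of the status strings listed earlier, and the timeout and interval defaults are arbitrary.

```python
import time


def poll_until_done(fetch_status, timeout_seconds=3600, interval_seconds=30,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a final status or the timeout is reached."""
    final_statuses = {"completed", "error", "cancelled"}
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        print(f"job status: {status}")
        if status in final_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        sleep(interval_seconds)
```

With the job from this tutorial, the call would be poll_until_done(job.fetch_status). A return value other than 'completed' is the cue to inspect job.print_logs().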

With a completed job, you can retrieve the generated synthetic data as a pandas DataFrame:

synthetic_df = job.fetch_data()
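A quick sanity check before using the result is to confirm that the synthetic data has the same columns as the input. The helper below is a sketch that accepts any iterables of column names, so it does not depend on pandas. With the DataFrames from this tutorial you would pass df.columns and synthetic_df.columns.

```python
def compare_columns(original_columns, synthetic_columns):
    """Return columns missing from, and unexpected in, the synthetic data."""
    original = list(original_columns)
    synthetic = list(synthetic_columns)
    missing = [name for name in original if name not in synthetic]
    unexpected = [name for name in synthetic if name not in original]
    return {"missing": missing, "unexpected": unexpected}


# Illustrative column names only, not real output from a job.
result = compare_columns(["Age", "Rating", "Review Text"], ["Age", "Rating"])
print(result)
```

Two empty lists mean the column sets match. Anything else is worth investigating before comparing values.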

Step 5: View Evaluation Report and Summary#

A summary is available with information on how long each step took, top-level scores from the evaluation report (see the full report below), and other statistics.

summary = job.fetch_summary()
print(summary)

The summary is useful for programmatic access to the synthetic data quality score or timing information, such as for hyperparameter tuning.
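For tuning, one pattern is to record a score from each job's summary and keep the best configuration. The sketch below assumes you have already extracted one numeric score per run into a plain dictionary. How you read that score from the summary object depends on its fields, which this tutorial does not list, so the run labels and scores here are placeholders.

```python
def best_run(scores_by_run):
    """Return (run_label, score) for the highest score.

    scores_by_run maps a label you choose for each run to a numeric
    score taken from that run's summary.
    """
    if not scores_by_run:
        raise ValueError("no runs to compare")
    label = max(scores_by_run, key=scores_by_run.get)
    return label, scores_by_run[label]


# Placeholder labels and scores, not results from real jobs.
print(best_run({"run-a": 7.5, "run-b": 8.1}))
```

Printing the summary once, as shown above, is the simplest way to see which fields are available before wiring up a comparison like this.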

By default, every NeMo Safe Synthesizer job compares the input and output data and produces an evaluation report. Download the report and open the file ./evaluation_report.html in a browser to see the high-level synthetic data quality score and data privacy score, plus more details and charts comparing the input and output data.

job.save_report("./evaluation_report.html")

If you are using a Jupyter notebook, the report can be displayed inline:

job.display_report_in_notebook()

Next Steps#

Now that you understand the basics, explore more advanced NeMo Safe Synthesizer capabilities:

PII Replacement: Explore custom PII replacement strategies for your data
Synthesize Data: Learn about training and generating synthetic data
Quality and Privacy Evaluation: Understand privacy and quality evaluation metrics