Safe Synthesizer#

The Safe Synthesizer page in NeMo Studio provides a centralized interface for managing your synthetic data generation jobs. You can create, monitor, and analyze jobs that generate privacy-protected synthetic datasets from your sensitive data.


Backend Microservices#

On the backend, the UI communicates with the Safe Synthesizer microservice, which orchestrates the generation workflow.
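
If you prefer to script job creation rather than use the UI, the interaction is a plain HTTP exchange with the microservice. The sketch below is illustrative only: the base URL, endpoint path, and payload fields are assumptions, not the documented API; consult the Safe Synthesizer API reference for the actual schema.

```python
import requests

# Hypothetical base URL and endpoint -- the actual Safe Synthesizer API
# paths and payload schema may differ; check the API reference.
BASE_URL = "http://localhost:8080"

# Submit a synthetic data generation job (illustrative payload).
response = requests.post(
    f"{BASE_URL}/v1/safe-synthesizer/jobs",
    json={
        "name": "patients-synth-v1",   # job name shown in the UI listing
        "dataset": "patients-2024",    # dataset registered under Datasets
        "num_records": 10000,          # number of synthetic records to generate
        "privacy_level": "standard",   # "standard" or "highest"
    },
    timeout=30,
)
response.raise_for_status()
job = response.json()
print(f"Created job {job.get('id')} with status {job.get('status')}")
```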


Safe Synthesizer Page UI Overview#

The following are the main components and features of the Safe Synthesizer page.

Safe Synthesizer Job Listing#

The Safe Synthesizer page displays your synthetic data jobs in a table format with the following columns:

  • Job Name: The name for your synthetic data job.

  • Dataset: The name of the original sensitive dataset being synthesized.

  • SQS: Synthetic Quality Score, a measure of how well the synthetic output matches the reference data.

  • DPS: Data Privacy Score, a measure of how well the original data is protected from adversarial attacks, based on analysis of the synthetic output.

  • Created: Timestamp showing when the job was created.

  • Status: The current state of the job (e.g., pending, active, completed, error).

  • Actions: Shortcut menu to view the job summary or report, or to delete the job.

You can select multiple jobs to delete them in bulk.

Safe Synthesizer Job Management#

You can manage synthetic data jobs on the Safe Synthesizer page.

  • Create New Job: Start a new synthetic data job by clicking the Create New Job button.

  • View Details: Access detailed information about each job, including configuration parameters, progress, logs, and reports.


Safe Synthesizer Workflow#

The following are the common workflows for generating private synthetic data.

Create a Safe Synthesizer Job#

To create a new synthetic data job:

  1. Navigate to the Safe Synthesizer page from the left sidebar.

  2. Click the Create New Job button.

  3. Name Your Job: Provide a unique name for your synthetic data job.

  4. Training Data:

    • Select your input dataset from Datasets. If needed, you can upload a new dataset from a CSV, JSONL, or Parquet file.

    • Optionally select columns to sort or group data for training. Learn when to use grouping.

  5. Generation:

    • Set the number of synthetic records to generate.

    • Select privacy level:

      • Standard: Uses the standard privacy inherent in synthetic data generation, balancing privacy and quality. This option is faster and generally more reliable.

      • Highest (advanced): Applies Differential Privacy, the gold standard of privacy, during training. This process adds noise to provide mathematical guarantees of privacy, but can result in lower-quality output and/or longer training times. Learn more about differential privacy.

  6. Adjust advanced parameters as needed, such as:

    • temperature: Controls the randomness of generated output. Lower values make output more focused and deterministic, while higher values increase creativity and variability. Learn more about temperature.

    • top_p: Controls output diversity by only considering the most likely words until their combined probability reaches this percentage, filtering out improbable options. Learn more about top_p.

    • num_input_records_to_sample: Total number of non-unique records seen by the model. It is effectively the product of the training data size and the number of epochs. If num_input_records_to_sample is greater than the training dataset size, the model is trained on each record multiple times; otherwise, the model is trained on a subset of the records. A value of 10,000 or more is recommended. Learn more about num_input_records_to_sample.

    • rope_scaling_factor: Scaling factor for the model's context window; an integer >= 1, where 1 means no additional scaling. Lower is better for quality, but higher values may be required if your records (or groups of records, in the case of event-driven data) are too large to fit in the original context window. Higher values require more GPU RAM, so reduce this value if you hit out-of-memory (OOM) errors. Values up to 6 typically work, and higher values may be possible on large GPUs. Learn more about rope_scaling_factor.

    • enable_replace_pii: Automatically redact or replace Personally Identifiable Information (PII) before training the model. This is highly recommended, as it ensures the model never has a chance to learn this sensitive information.

    • The default values are either determined automatically or set to known-good values. (A sketch of a full configuration payload follows these steps.)

  7. Review and Submit: Review your configuration, then submit the job.
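
To see how the settings above fit together, here is a hedged sketch of a job configuration. The parameter names under `advanced` come from this page; the surrounding payload structure (field names, nesting) is an assumption for illustration, not the exact schema.

```python
# Hypothetical job configuration payload. The advanced parameter names come
# from this page; the overall structure is an assumed sketch, not the schema.
job_config = {
    "name": "patients-synth-v1",
    "dataset": "patients-2024",
    "num_records": 10_000,
    "privacy_level": "standard",   # or "highest" for Differential Privacy
    "advanced": {
        "temperature": 0.8,        # lower = more focused, deterministic output
        "top_p": 0.95,             # nucleus sampling cutoff
        # Total non-unique records seen during training. With 2,500 training
        # records, 10,000 here means roughly 4 passes (epochs) over the data.
        "num_input_records_to_sample": 10_000,
        "rope_scaling_factor": 1,  # integer >= 1; raise only for long records
        "enable_replace_pii": True,  # redact/replace PII before training
    },
}
```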

Monitor Job Progress#

While a job is running, you can monitor its progress: the job details page shows real-time status updates and detailed logs. To track status programmatically, see the sketch below.
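
For unattended runs, you can poll for status instead of watching the UI. The endpoint path and response fields in this sketch are assumptions; adapt them to the actual Safe Synthesizer API.

```python
import time

import requests

BASE_URL = "http://localhost:8080"
job_id = "<your-job-id>"  # returned in the create-job response

# Poll the (assumed) job status endpoint until the job reaches a final state.
while True:
    r = requests.get(f"{BASE_URL}/v1/safe-synthesizer/jobs/{job_id}", timeout=30)
    r.raise_for_status()
    status = r.json().get("status")
    print(f"status: {status}")
    if status in ("completed", "error"):
        break
    time.sleep(30)
```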

Download and Analyze Results#

After a synthetic data job completes, you can:

  • View completion status

  • View and download the synthetic dataset

  • View job logs

  • View quality and privacy scores (an illustrative sketch of one such comparison follows this list):

    • Column Correlation Stability: Compares the correlation across every combination of two columns.

    • Deep Structure Stability: Compares the original and synthetic data using Principal Component Analysis to reduce the dimensionality.

    • Column Distribution Stability: Compares the distribution for each column in the original data to the matching column in the synthetic data.

    • Text Semantic Similarity: Compares the semantic meaning of the text columns between the original and synthetic data.

    • Text Structure Similarity: Compares the sentence, word, and character counts across text columns in the original and synthetic data.

    • Membership Inference Protection: Tests whether attackers can determine if specific records were in the training data.

    • Attribute Inference Protection: Tests whether sensitive attributes can be inferred by an attacker when other attributes are known.

  • View and download evaluation reports
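
The exact computation behind these scores is internal to Safe Synthesizer. As a rough, illustrative analogue of the Column Distribution Stability check only, the sketch below compares per-column histograms of an original and a synthetic pandas DataFrame using Jensen-Shannon distance; it is not the actual SQS implementation.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon


def column_distribution_stability(original: pd.DataFrame,
                                  synthetic: pd.DataFrame,
                                  bins: int = 20) -> pd.Series:
    """Rough per-column similarity between original and synthetic data.

    Illustrative analogue only -- the actual scoring inside Safe Synthesizer
    may use entirely different statistics.
    """
    scores = {}
    for col in original.select_dtypes(include="number").columns:
        lo = min(original[col].min(), synthetic[col].min())
        hi = max(original[col].max(), synthetic[col].max())
        if not np.isfinite(lo) or hi <= lo:
            continue  # skip empty or constant columns in this rough sketch
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(original[col].dropna(), bins=edges, density=True)
        q, _ = np.histogram(synthetic[col].dropna(), bins=edges, density=True)
        # Jensen-Shannon distance is 0 for identical distributions and 1 for
        # disjoint ones; invert so that 1.0 means a perfect match.
        scores[col] = 1.0 - jensenshannon(p, q, base=2)
    return pd.Series(scores)
```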


Next Steps#