Safe Synthesizer#

The Safe Synthesizer page in NeMo Studio provides a centralized interface for managing your synthetic data generation jobs. You can create, monitor, and analyze jobs that generate privacy-protected synthetic datasets from your sensitive data.


Backend Microservices#

On the backend, the UI communicates with the Safe Synthesizer microservice, which orchestrates the generation workflow.
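
If you prefer to script job creation rather than use the UI, the interaction is a plain HTTP exchange with the microservice. The sketch below is illustrative only: the base URL, endpoint path, and payload fields are assumptions, not the documented API; consult the Safe Synthesizer API reference for the actual schema.

```python
import requests

# Hypothetical base URL and endpoint -- the actual Safe Synthesizer API
# paths and payload schema may differ; check the API reference.
BASE_URL = "http://localhost:8080"

# Submit a synthetic data generation job (illustrative payload).
response = requests.post(
    f"{BASE_URL}/v1/safe-synthesizer/jobs",
    json={
        "name": "patients-synth-v1",   # job name shown in the UI listing
        "dataset": "patients-2024",    # dataset registered under Datasets
        "num_records": 10000,          # number of synthetic records to generate
        "privacy_level": "standard",   # "standard" or "highest"
    },
    timeout=30,
)
response.raise_for_status()
job = response.json()
print(f"Created job {job.get('id')} with status {job.get('status')}")
```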


Safe Synthesizer Page UI Overview#

The following are the main components and features of the Safe Synthesizer page.

Safe Synthesizer Job Listing#

The Safe Synthesizer page displays your synthetic data jobs in a table format with the following columns:

  • Job Name: The name for your synthetic data job.

  • Dataset: The name of the original sensitive dataset being synthesized.

  • SQS: Synthetic Quality Score, a measure of how well the synthetic output matches the reference data.

  • DPS: Data Privacy Score, a measure of how well the original data is protected from adversarial attacks, based on analysis of the synthetic output.

  • Created: Timestamp showing when the job was created.

  • Status: The current state of the job (e.g., pending, active, completed, error).

  • Actions: Shortcut menu to view the job summary or report, or to delete the job.

You can select multiple jobs to delete them in bulk.

Safe Synthesizer Job Management#

You can manage synthetic data jobs on the Safe Synthesizer page.

  • Create New Job: Start a new synthetic data job by clicking the Create New Job button.

  • View Details: Access detailed information about each job, including configuration parameters, progress, logs, and reports.


Safe Synthesizer Workflow#

The following are the common workflows for generating private synthetic data.

Create a Safe Synthesizer Job#

To create a new synthetic data job:

  1. Navigate to the Safe Synthesizer page from the left sidebar.

  2. Click the Create New Job button.

  3. Name Your Job: Provide a unique name for your synthetic data job.

  4. Training Data:

    • Select your input dataset from Datasets. If needed, you can upload a new dataset from a CSV, JSONL, or Parquet file.

    • Optionally select columns to sort or group data for training. Learn when to use grouping.

  5. Generation:

    • Set the number of synthetic records to generate.

    • Select privacy level:

      • Standard: Uses the standard privacy inherent in synthetic data generation, balancing privacy and quality. This option is faster and generally more reliable.

      • Highest (advanced): Applies Differential Privacy, the gold standard of privacy, during training. This process adds noise to provide mathematical guarantees of privacy, but can result in lower-quality output and/or longer training times. Learn more about differential privacy.

  6. Adjust advanced parameters as needed, such as:

    • temperature: Controls the randomness of generated output. Lower values make output more focused and deterministic, while higher values increase creativity and variability. Learn more about temperature.

    • top_p: Controls output diversity by only considering the most likely words until their combined probability reaches this percentage, filtering out improbable options. Learn more about top_p.

    • num_input_records_to_sample: Total number of non-unique records seen by the model. It is effectively the product of the training data size and the number of epochs. If num_input_records_to_sample is greater than the training dataset size, the model is trained on each record multiple times; otherwise, the model is trained on a subset of the records. A value of 10,000 or more is recommended. Learn more about num_input_records_to_sample.

    • rope_scaling_factor: Scaling factor for the model's context window; an integer >= 1, where 1 means no additional scaling. Lower is better for quality, but higher values may be required if your records (or groups of records, in the case of event-driven data) are too large to fit in the original context window. Higher values require more GPU RAM, so reduce this value if you hit out-of-memory (OOM) errors. Values up to 6 typically work, and higher values may be possible on large GPUs. Learn more about rope_scaling_factor.

    • enable_replace_pii: Automatically redact or replace Personally Identifiable Information (PII) before training the model. This is highly recommended, as it ensures the model never has a chance to learn this sensitive information.

    • The default values are either determined automatically or set to known-good values. (A sketch of a full configuration payload follows these steps.)

  7. Review and Submit: Review your configuration, then submit the job.
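
To see how the settings above fit together, here is a hedged sketch of a job configuration. The parameter names under `advanced` come from this page; the surrounding payload structure (field names, nesting) is an assumption for illustration, not the exact schema.

```python
# Hypothetical job configuration payload. The advanced parameter names come
# from this page; the overall structure is an assumed sketch, not the schema.
job_config = {
    "name": "patients-synth-v1",
    "dataset": "patients-2024",
    "num_records": 10_000,
    "privacy_level": "standard",   # or "highest" for Differential Privacy
    "advanced": {
        "temperature": 0.8,        # lower = more focused, deterministic output
        "top_p": 0.95,             # nucleus sampling cutoff
        # Total non-unique records seen during training. With 2,500 training
        # records, 10,000 here means roughly 4 passes (epochs) over the data.
        "num_input_records_to_sample": 10_000,
        "rope_scaling_factor": 1,  # integer >= 1; raise only for long records
        "enable_replace_pii": True,  # redact/replace PII before training
    },
}
```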

Monitor Job Progress#

While a job is running, you can monitor its progress: the job details page shows real-time status updates and detailed logs. To track status programmatically, see the sketch below.
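
For unattended runs, you can poll for status instead of watching the UI. The endpoint path and response fields in this sketch are assumptions; adapt them to the actual Safe Synthesizer API.

```python
import time

import requests

BASE_URL = "http://localhost:8080"
job_id = "<your-job-id>"  # returned in the create-job response

# Poll the (assumed) job status endpoint until the job reaches a final state.
while True:
    r = requests.get(f"{BASE_URL}/v1/safe-synthesizer/jobs/{job_id}", timeout=30)
    r.raise_for_status()
    status = r.json().get("status")
    print(f"status: {status}")
    if status in ("completed", "error"):
        break
    time.sleep(30)
```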

Download and Analyze Results#

After a synthetic data job completes, you can:

  • View completion status

  • View and download the synthetic dataset

  • View job logs

  • View quality and privacy scores (an illustrative sketch of one such comparison follows this list):

    • Column Correlation Stability: Compares the correlation across every combination of two columns.

    • Deep Structure Stability: Compares the original and synthetic data using Principal Component Analysis to reduce the dimensionality.

    • Column Distribution Stability: Compares the distribution for each column in the original data to the matching column in the synthetic data.

    • Text Semantic Similarity: Compares the semantic meaning of the text columns between the original and synthetic data.

    • Text Structure Similarity: Compares the sentence, word, and character counts across text columns in the original and synthetic data.

    • Membership Inference Protection: Tests whether attackers can determine if specific records were in the training data.

    • Attribute Inference Protection: Tests whether sensitive attributes can be inferred by an attacker when other attributes are known.

  • View and download evaluation reports
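
The exact computation behind these scores is internal to Safe Synthesizer. As a rough, illustrative analogue of the Column Distribution Stability check only, the sketch below compares per-column histograms of an original and a synthetic pandas DataFrame using Jensen-Shannon distance; it is not the actual SQS implementation.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon


def column_distribution_stability(original: pd.DataFrame,
                                  synthetic: pd.DataFrame,
                                  bins: int = 20) -> pd.Series:
    """Rough per-column similarity between original and synthetic data.

    Illustrative analogue only -- the actual scoring inside Safe Synthesizer
    may use entirely different statistics.
    """
    scores = {}
    for col in original.select_dtypes(include="number").columns:
        lo = min(original[col].min(), synthetic[col].min())
        hi = max(original[col].max(), synthetic[col].max())
        if not np.isfinite(lo) or hi <= lo:
            continue  # skip empty or constant columns in this rough sketch
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(original[col].dropna(), bins=edges, density=True)
        q, _ = np.histogram(synthetic[col].dropna(), bins=edges, density=True)
        # Jensen-Shannon distance is 0 for identical distributions and 1 for
        # disjoint ones; invert so that 1.0 means a perfect match.
        scores[col] = 1.0 - jensenshannon(p, q, base=2)
    return pd.Series(scores)
```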


Next Steps#