Safe Synthesizer Jobs | NVIDIA NeMo Platform

NeMo Safe Synthesizer jobs orchestrate the complete pipeline from data preparation through synthesis to evaluation. Understanding the job lifecycle and configuration options is essential for effective use of the platform.

Job Lifecycle

At any point in its lifecycle, a NeMo Safe Synthesizer job is in one of the following states:

Job States

created: Job has been created but not yet started
pending: Job is queued and waiting for resources (GPU)
active: Job is processing your data
completed: Job finished successfully - results are ready
error: Job encountered an error - check logs for details
cancelled: Job was manually cancelled
cancelling: Job is in the process of being cancelled
paused: Job execution has been paused
pausing: Job is in the process of being paused
resuming: Job is resuming from a paused state
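
When automating around these states, it helps to separate states in which a job is finished from states in which it can still change. The following is a minimal sketch that encodes the state names listed above. The names JobState, TERMINAL_STATES, and is_terminal are illustrative and not part of the SDK, and treating completed, error, and cancelled as the only final states is an assumption.

```python
from enum import Enum


class JobState(str, Enum):
    """State names as listed above; the enum itself is illustrative."""

    CREATED = "created"
    PENDING = "pending"
    ACTIVE = "active"
    COMPLETED = "completed"
    ERROR = "error"
    CANCELLED = "cancelled"
    CANCELLING = "cancelling"
    PAUSED = "paused"
    PAUSING = "pausing"
    RESUMING = "resuming"


# Assumption: a job in one of these states will not change state again.
TERMINAL_STATES = frozenset(
    {JobState.COMPLETED, JobState.ERROR, JobState.CANCELLED}
)


def is_terminal(state: str) -> bool:
    """Return True if the given state name is treated as final.

    Raises ValueError for a name that is not in the list above.
    """
    return JobState(state) in TERMINAL_STATES
```

An unknown state name raises ValueError rather than being silently treated as non-terminal, so a typo cannot turn a polling loop into an endless one.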

Job Phases

A complete job typically includes these phases:

Data Preparation

Data validation and preprocessing
Column type inference
Grouping and ordering (if configured)
Train/test split for holdout evaluation

PII Replacement (optional)

PII detection using configured methods
Entity classification
Value transformation

Synthesis

Training: Fine-tune LLM on prepared data
Apply differential privacy (if enabled)
Generation: Generate synthetic records

Evaluation

Calculate SQS metrics
Calculate DPS metrics
Generate evaluation report
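
As a rough illustration of how the optional phases combine, the sketch below returns the ordered phases for a given set of choices. The function name, flag names, and phase labels are hypothetical. Running evaluation only when synthesis runs is an assumption. Skipping training when an adapter is reused follows the adapter reuse behavior described later on this page.

```python
def planned_phases(
    replace_pii: bool = False,
    synthesize: bool = True,
    reuse_adapter: bool = False,
) -> list[str]:
    """Return illustrative phase labels in execution order."""
    phases = ["data_preparation"]
    if replace_pii:
        phases.append("pii_replacement")
    if synthesize:
        # Reusing a trained adapter is generation-only (no retraining).
        if not reuse_adapter:
            phases.append("training")
        phases.append("generation")
        # Assumption: evaluation scores synthetic output, so it only
        # runs when synthesis ran.
        phases.append("evaluation")
    return phases
```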

Job Configuration

Jobs are configured through a hierarchical configuration structure:

Top-Level Configuration

{
    "name": "my-job",
    "project": "my-project",
    "spec": {
        "data_source": "fileset://default/safe-synthesizer-inputs/data.csv",
        "config": {
            # Configuration sections below
        },
    },
}

Configuration Sections

data_prep: Grouping, ordering, holdout configuration
replace_pii: PII detection and replacement rules
training: Model selection and training parameters
generation: Synthetic data generation settings
privacy: Differential privacy parameters
evaluation: Quality and privacy assessment options
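
A misspelled section name is easy to miss in a nested structure. The sketch below assembles the top-level structure shown earlier and rejects section names that are not in the list above. The helper build_job_payload is hypothetical. Whether the service itself rejects unknown sections is not stated here, so treat this as a client-side convenience only.

```python
from typing import Optional

# Section names as listed above.
KNOWN_SECTIONS = (
    "data_prep",
    "replace_pii",
    "training",
    "generation",
    "privacy",
    "evaluation",
)


def build_job_payload(
    name: str,
    project: str,
    data_source: str,
    config: Optional[dict] = None,
) -> dict:
    """Assemble the top-level job structure, checking section names."""
    config = dict(config or {})
    unknown = sorted(set(config) - set(KNOWN_SECTIONS))
    if unknown:
        raise ValueError(f"Unknown configuration sections: {unknown}")
    return {
        "name": name,
        "project": project,
        "spec": {"data_source": data_source, "config": config},
    }
```

Sections that are omitted are simply left out of the payload, which matches the advice to start with default settings.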

Job Management

Creating Jobs

Create jobs using:

Python SDK: Recommended approach with SafeSynthesizerJobBuilder
REST API: Direct HTTP requests for integration
CLI: Command-line interface for scripting

Monitoring Jobs

Track job progress through:

Status checks: Poll job state
Logs: View real-time execution logs
Events: Subscribe to job state changes when supported by the service
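
Polling can be sketched as a small loop that stops on a final state or after a deadline. To avoid guessing at SDK method names, the sketch takes a get_state callable that you supply. That callable, the helper name wait_for_job, and the choice of final states are assumptions. The default timeout of 7200 seconds corresponds to the upper end of the 15-120 minute range given under Resource Planning.

```python
import time
from typing import Callable

# Assumption: these states are final.
TERMINAL = {"completed", "error", "cancelled"}


def wait_for_job(
    get_state: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 7200.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until a final state is seen or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"job still in state {state!r} after timeout")
        sleep(poll_seconds)
```

The sleep and clock parameters exist so the loop can be tested without waiting. In normal use, leave them at their defaults.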

Retrieving Results

When the job completes, access:

Synthetic data: Generated CSV files
Evaluation report: HTML report with scores and visualizations
Metadata: Job summary and configuration
Adapter: LoRA adapter from the training step (when synthesis ran)
Logs: Complete execution history

Reusing a Trained Adapter

For platform jobs, set pretrained_model_job in the job spec to a completed job that has an adapter result in Files. A job that reuses an adapter runs generation only and does not retrain the model. Set either pretrained_model_job or config.training.pretrained_model, never both.

For host-local development (nemo safe-synthesizer run-local), set config.training.pretrained_model to a local adapter or work directory from an earlier run. See Local and Subprocess Execution.
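
The rule that only one of the two adapter settings may be set can be checked before submitting a job. The following sketch assumes the spec is available as a plain dictionary shaped like the Top-Level Configuration example. The helper check_adapter_reuse is hypothetical.

```python
def check_adapter_reuse(spec: dict) -> None:
    """Raise ValueError if both adapter reuse settings are present."""
    job_ref = spec.get("pretrained_model_job")
    local_ref = (
        spec.get("config", {}).get("training", {}).get("pretrained_model")
    )
    if job_ref and local_ref:
        raise ValueError(
            "Set either pretrained_model_job or "
            "config.training.pretrained_model, not both"
        )
```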

Job Builder API

The SafeSynthesizerJobBuilder provides a high-level interface for common workflows:

import os
import pandas as pd

from nemo_platform import NeMoPlatform
from nemo_safe_synthesizer_plugin.sdk.job_builder import SafeSynthesizerJobBuilder

# Placeholders
df: pd.DataFrame = pd.DataFrame()
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

builder = (
    SafeSynthesizerJobBuilder(client)
    .with_data_source(df)
    .with_replace_pii()
    .synthesize()
)
job = builder.create_job(name="my-job", project="my-project")

The builder does the following:

Uploads data to filesets automatically
Provides smart defaults
Validates configuration
Returns a SafeSynthesizerJob instance for job interaction

Best Practices

Resource Planning

Larger datasets and models require more GPU memory
Training time scales with data size and model complexity
Plan for 15-120 minutes for typical jobs

Configuration

Start with default settings
Enable PII replacement for sensitive data
Enable differential privacy when you need formal privacy guarantees
Adjust generation parameters based on evaluation results

Monitoring

Check status periodically during execution
Review logs if jobs fail or take longer than expected
Use evaluation reports to iterate on configuration

Error Handling

Common errors: insufficient GPU memory, invalid data format, configuration errors
Check logs for detailed error messages
Reduce model size or data size if resource errors occur

Troubleshooting

This section covers common issues and how to diagnose them.

Viewing Job Logs

Logs are essential for diagnosing job failures. Access them through the Python SDK:

# Print logs to stdout
job.print_logs()

# Iterate over log entries programmatically
for log in job.fetch_logs():
    print(log.message.strip())
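
For long logs, it can help to surface only the lines that look like failures. The sketch below filters any iterable of entries that expose a message attribute, as the entries from job.fetch_logs() do in the example above. The helper find_error_lines and the keyword list are hypothetical. Adjust the keywords to match what your logs actually contain.

```python
from typing import Iterable, Iterator

# Hypothetical keywords; tune these for your own logs.
ERROR_HINTS = ("error", "out of memory", "traceback")


def find_error_lines(
    logs: Iterable, hints: tuple = ERROR_HINTS
) -> Iterator[str]:
    """Yield stripped messages that contain any hint, ignoring case."""
    for entry in logs:
        message = entry.message.strip()
        lowered = message.lower()
        if any(hint in lowered for hint in hints):
            yield message


# Example usage, assuming `job` is the object from the builder example:
# for line in find_error_lines(job.fetch_logs()):
#     print(line)
```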

Common Issues and Solutions

Job Stuck in “Pending” State

Symptoms: Job remains in pending state for an extended period.

Possible Causes:

No GPU resources available
Resource quota exceeded
Scheduling constraints not met

Solutions:

Check available GPU resources for the job
Verify resource quotas and limits
Review job logs for scheduling failures

Out of Memory (OOM) Errors

Symptoms: Job fails with memory-related errors during training.

Possible Causes:

Dataset too large for available GPU memory
Batch size too high
Model too large for available resources

Solutions:

Reduce batch_size in training parameters
Use a smaller subset of data for initial testing
Increase gradient_accumulation_steps to maintain effective batch size with lower memory
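
The trade-off between batch_size and gradient_accumulation_steps is simple arithmetic: the effective batch size is their product. The sketch below assumes a single device. The helper names are illustrative.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of samples that contribute to each optimizer step."""
    return batch_size * gradient_accumulation_steps


def accumulation_steps_for(target_effective: int, batch_size: int) -> int:
    """Smallest step count that reaches at least the target effective size."""
    return -(-target_effective // batch_size)  # ceiling division
```

For example, halving batch_size from 8 to 4 while doubling gradient_accumulation_steps from 4 to 8 keeps the effective batch size at 32, and only 4 samples need to fit in GPU memory at a time.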

Invalid Data Format Errors

Symptoms: Job fails during data preparation phase.

Possible Causes:

CSV file has encoding issues
Missing or malformed columns
Unsupported data types

Solutions:

Ensure CSV is UTF-8 encoded
Validate column names do not contain special characters
Check for null values or inconsistent data types
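
These checks can be run locally before uploading. The following is a minimal sketch using only the standard library. The helper preflight_csv is hypothetical, and the rule that column names should contain only letters, digits, and underscores is an assumption, since the exact set of accepted characters is not specified here.

```python
import csv
import io
import re

# Assumption: letters, digits, and underscores are always safe.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def preflight_csv(raw: bytes) -> list[str]:
    """Return a list of human-readable problems found in a CSV file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if not header:
        return ["file is empty or has no header row"]

    problems = []
    for name in header:
        if not SAFE_NAME.match(name):
            problems.append(f"column name {name!r} contains special characters")

    empty_counts = [0] * len(header)
    # The header is row 1, so data rows start at 2.
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no} has {len(row)} fields, expected {len(header)}"
            )
            continue
        for i, value in enumerate(row):
            if value.strip() == "":
                empty_counts[i] += 1

    for name, count in zip(header, empty_counts):
        if count:
            problems.append(f"column {name!r} has {count} empty value(s)")
    return problems
```

An empty list means none of these checks failed. It does not guarantee that the job will accept the file. Empty values are reported rather than rejected, because nulls may be legitimate in your data.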

Generation Quality Issues

Symptoms: Generated synthetic data has poor quality or many invalid records.

Possible Causes:

Insufficient training (low num_input_records_to_sample)
Temperature too high or too low
Data has complex patterns that need more training

Solutions:

Increase num_input_records_to_sample for more training
Adjust temperature (try values in the 0.7-1.0 range)
Enable use_structured_generation for better format adherence
Review evaluation report for specific quality issues

Related Topics

Local and Subprocess Execution: run-local, adapter reuse, and plugin tests
safe-synthesizer-101: Get started with NeMo Safe Synthesizer jobs
index: More hands-on tutorials
reference: Full parameter reference
