About Synthetic Data Generation#

Generate synthetic training data with NeMo Data Designer using a declarative YAML pipeline. Seed a generation run with your domain-specific topics, scenarios, or personas; define the column structure and prompts in YAML; and produce training-ready JSONL without writing Python.

Three output shapes ship out of the box: SFT chat data, tool-calling SFT data, and DPO preference pairs.

Tip

New to SDG or new to model training? Read Use the SDG Skill With Confidence for a short guide to productive agent sessions, then start the Generate Your First Synthetic Dataset tutorial to run the bundled pipeline and produce your first dataset in 5 to 10 minutes.

When to Use#

Use SDG when you need training data that does not already exist in sufficient quantity or quality for your target domain or task.

SFT chat data — Generate user/assistant conversation pairs grounded in domain-specific topics, scenarios, or personas. Use default.yaml as a starting point and adapt it to your domain.
Tool-calling SFT data — Generate multi-turn conversations that include assistant tool calls and tool responses in OpenAI format. Use customer_support_tools.yaml as a starting point.
DPO preference data — Generate prompt / chosen / rejected triples for preference learning. Use rl_pref.yaml.
Custom domains — Swap the seed file, category columns, and prompts to target any domain. The pipeline is fully declarative; customisation does not require editing Python.
Cluster-scale generation — Dispatch generation to Lepton or Slurm via env.toml profiles when local throughput is insufficient.

Pipeline at a Glance#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart TB
    seed["Seed file (optional)"] --> dsg
    cat["Category samplers"] --> dsg
    per["Person sampler (optional)"] --> dsg
    dsg["Data Designer column graph<br/>Jinja2 prompts · LLM calls"]
    dsg --> proj["output_projection"]
    proj --> om["openai_messages"]
    proj --> dpo["dpo_preference"]
    proj --> sm["structured_messages"]
    om --> jsonl["JSONL"]
    dpo --> jsonl
    sm --> jsonl
    jsonl --> train["data_prep/sft_packing or AutoModel SFT"]

Each run is reproducible: the seed file, column specs, model alias, inference parameters, and projection rules are all version-controlled in a single YAML file.

Documentation#

Getting Started

Run the bundled pipeline end-to-end: preview two records, generate five, inspect the output JSONL.

5–10 min tutorial

Generate Your First Synthetic Dataset

Use the SDG Skill With Confidence

Prepare for a focused chat with a coding agent: opening brief, seed ideas, and how SKILL.md supports the session without memorization.

10 min read newcomer

Use the SDG Skill With Confidence

How-To Guides

Task-focused guides: adapt the pipeline to a domain, generate preference pairs, dispatch to a cluster.

5 guides task-focused

Synthetic Data How-To Guides

Reference

YAML config schema, CLI flags, output projection shapes, and troubleshooting.

4 references lookup

SDG Reference

All Documentation#

Getting Started

Guide	What You’ll Do	Time
Generate Your First Synthetic Dataset	Preview and generate your first synthetic SFT dataset	5–10 min
Use the SDG Skill With Confidence	Run a productive agent session: brief, seeds, plain terms, and light use of `SKILL.md`	10 min read

How-To Guides

Guide	What You’ll Do
Tips for the Data Generation Pipeline	Preview, generate, and customize output path and projection
Create a Domain Dataset for Airlines Customer Service	Adapt the pipeline to a custom domain with a seed file and multiple category dimensions
Generate Tool-Calling Data for SFT	Generate multi-turn tool-calling SFT data
Generate Preference Data for DPO	Generate DPO preference pairs from `rl_pref.yaml`
Dispatch SDG to a Cluster	Dispatch generation to Lepton or Slurm via env.toml

Reference

Reference	What You’ll Find
Config Schema	Full YAML column types, sampler parameters, and projection fields
CLI Reference	`nemotron steps run sdg/data_designer` flags and hydra overrides
Output Projections	The three projection shapes with annotated JSONL examples
Troubleshooting	Dispatch failures, image pull errors, API key issues, schema drift

Before You Start#

The NVIDIA_API_KEY environment variable is required for the default model, nvidia/nemotron-3-nano-30b-a3b, hosted on integrate.nvidia.com.

Limitations and Considerations#

Cost: Generation calls a hosted LLM endpoint; each record incurs API cost.
Quality: After generating records, review them before training.
Scale: API rate limits apply. For large generation runs, dispatch to a cluster and consider batching across multiple nodes.
Reproducibility: Seed files, column specs, model aliases, and inference parameters should all be version-controlled together. Changing any one of them changes the output distribution.