About Synthetic Data Generation#
Generate synthetic training data with NeMo Data Designer using a declarative YAML pipeline. Seed a generation run with your domain-specific topics, scenarios, or personas; define the column structure and prompts in YAML; and produce training-ready JSONL without writing Python.
Three output shapes ship out of the box: SFT chat data, tool-calling SFT data, and DPO preference pairs.
Tip
New to SDG or new to model training? Read Use the SDG Skill With Confidence for a short guide to productive agent sessions, then start the Generate Your First Synthetic Dataset tutorial to run the bundled pipeline and produce your first dataset in 5 to 10 minutes.
When to Use#
Use SDG when you need training data that does not already exist in sufficient quantity or quality for your target domain or task.
SFT chat data — Generate user/assistant conversation pairs grounded in domain-specific topics, scenarios, or personas. Use
default.yamlas a starting point and adapt it to your domain.Tool-calling SFT data — Generate multi-turn conversations that include assistant tool calls and tool responses in OpenAI format. Use
customer_support_tools.yamlas a starting point.DPO preference data — Generate prompt / chosen / rejected triples for preference learning. Use
rl_pref.yaml.Custom domains — Swap the seed file, category columns, and prompts to target any domain. The pipeline is fully declarative; customisation does not require editing Python.
Cluster-scale generation — Dispatch generation to Lepton or Slurm via env.toml profiles when local throughput is insufficient.
Pipeline at a Glance#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart TB
seed["Seed file (optional)"] --> dsg
cat["Category samplers"] --> dsg
per["Person sampler (optional)"] --> dsg
dsg["Data Designer column graph<br/>Jinja2 prompts · LLM calls"]
dsg --> proj["output_projection"]
proj --> om["openai_messages"]
proj --> dpo["dpo_preference"]
proj --> sm["structured_messages"]
om --> jsonl["JSONL"]
dpo --> jsonl
sm --> jsonl
jsonl --> train["data_prep/sft_packing or AutoModel SFT"]
Each run is reproducible: the seed file, column specs, model alias, inference parameters, and projection rules are all version-controlled in a single YAML file.
Documentation#
Run the bundled pipeline end-to-end: preview two records, generate five, inspect the output JSONL.
Prepare for a focused chat with a coding agent: opening brief, seed ideas, and how SKILL.md supports the session without memorization.
Task-focused guides: adapt the pipeline to a domain, generate preference pairs, dispatch to a cluster.
YAML config schema, CLI flags, output projection shapes, and troubleshooting.
All Documentation#
Guide |
What You’ll Do |
Time |
|---|---|---|
Preview and generate your first synthetic SFT dataset |
5–10 min |
|
Run a productive agent session: brief, seeds, plain terms, and light use of |
10 min read |
Guide |
What You’ll Do |
|---|---|
Preview, generate, and customize output path and projection |
|
Adapt the pipeline to a custom domain with a seed file and multiple category dimensions |
|
Generate multi-turn tool-calling SFT data |
|
Generate DPO preference pairs from |
|
Dispatch generation to Lepton or Slurm via env.toml |
Reference |
What You’ll Find |
|---|---|
Full YAML column types, sampler parameters, and projection fields |
|
|
|
The three projection shapes with annotated JSONL examples |
|
Dispatch failures, image pull errors, API key issues, schema drift |
Before You Start#
The
NVIDIA_API_KEYenvironment variable is required for the default model,nvidia/nemotron-3-nano-30b-a3b, hosted on integrate.nvidia.com.
Limitations and Considerations#
Cost: Generation calls a hosted LLM endpoint; each record incurs API cost.
Quality: After generating records, review them before training.
Scale: API rate limits apply. For large generation runs, dispatch to a cluster and consider batching across multiple nodes.
Reproducibility: Seed files, column specs, model aliases, and inference parameters should all be version-controlled together. Changing any one of them changes the output distribution.