Generate Your First Synthetic Dataset#
What You’ll Build: a small supervised fine-tuning (SFT) chat dataset in OpenAI message format.
The dataset contains five records grounded in the sft_topic_seeds.jsonl seed file in the repository.
Data Designer generates the records against an NVIDIA-hosted large language model (LLM) endpoint.
In this tutorial, you will:
Set up prerequisites: the repository and an NVIDIA API key.
Read the default pipeline configuration.
Run a preview to verify the pipeline and model.
Generate a small dataset of five records.
Locate and inspect the output JSON Lines (JSONL) file.
This tutorial takes 5 to 10 minutes to complete.
Prerequisites#
Run all commands from the repository root.
An NVIDIA_API_KEY for the default model, nvidia/nemotron-3-nano-30b-a3b. Data generation runs against an NVIDIA-hosted endpoint, so you can complete this tutorial on any machine with network access.
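Before running the pipeline, it can help to confirm that the key is visible to the process. The following is a minimal sketch that checks only whether the variable is set and non-blank. The helper name has_api_key is hypothetical and is not part of the repository, and the sketch does not verify that the key is valid.

```python
import os


def has_api_key(env=None, name="NVIDIA_API_KEY"):
    """Return True when the variable is present and not blank.

    `has_api_key` is an illustrative helper, not part of the Nemotron repository.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if __name__ == "__main__":
    if has_api_key():
        print("NVIDIA_API_KEY is set")
    else:
        print("NVIDIA_API_KEY is missing; export it before running the pipeline")
```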
How the Default Pipeline Works#
The default pipeline at src/nemotron/steps/sdg/data_designer/config/default.yaml combines two sources of variation for each record.
A seed topic is sampled from sft_topic_seeds.jsonl, for example a topic on safe deployment of AI assistants in enterprise support workflows.
A persona category, such as teacher or engineer, is sampled from a fixed category set.
Together they anchor a user prompt.
The pipeline generates a matching assistant response and projects the result into OpenAI chat-format messages.
# SFT synthetic chat — expand seed topics into user/assistant turns.
# Schema mirrors the NVIDIA-NeMo/DataDesigner Python SDK; column specs are
# translated by step.py into the corresponding typed config builder calls.
#
# Defaults are Lepton-friendly and local-friendly: output lands under
# $SDG_OUTPUT_DIR when set, otherwise $NEMO_RUN_DIR/sdg or ./output/sdg.
# The seed dataset is packaged with this step. Override either at the CLI:
# `output_path=... seed_dataset.path=...`.
output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/sft.jsonl
num_records: 1000
seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/sft_topic_seeds.jsonl
  strategy: shuffle # shuffle | ordered
  fields: [topic]
# Models map an `alias` (referenced by llm_text/llm_structured columns) to a
# concrete model + provider + inference parameters. Override per-environment:
# - Cloud: provider: nvidia, set NVIDIA_API_KEY in the env profile.
# - Local: provider: openai, point at a vLLM/OpenAI-compatible endpoint.
#
# Set skip_health_check: true to skip Designer's startup probe — useful when
# the model exists at runtime but isn't in the provider's catalog at config
# time, or for offline / vLLM endpoints.
models:
  - alias: nvidia-text
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.8
      top_p: 1.0
      max_tokens: 1024
# Seed columns (e.g. `topic`) are added automatically when seed_dataset is set.
# Reference them in prompts as `{{ topic }}` without declaring them here.
columns:
  - name: persona
    type: category
    values: [teacher, engineer, student, researcher, support_agent]
  - name: user_query
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      Write a single user message for a {{ persona }} asking about: {{ topic }}.
      Keep it natural, 1-3 sentences.
  - name: assistant_response
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      Helpful assistant reply to: "{{ user_query }}".
      Style: concise, factual, no markdown.
output_projection:
  type: openai_messages
  user_field: user_query
  assistant_field: assistant_response
  metadata_fields: [persona, topic]
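The data flow that this configuration describes can be pictured with a short, self-contained sketch. It is not the Data Designer implementation: the functions render, sample_row, project_openai_messages, build_record, and fake_llm are hypothetical names introduced for illustration, the placeholder substitution is a simple regular expression rather than a real template engine, and fake_llm stands in for the hosted model call. The persona values, prompt text, and field names are copied from the configuration.

```python
import random
import re

# Values copied from the `persona` column in the configuration.
PERSONAS = ["teacher", "engineer", "student", "researcher", "support_agent"]

# Prompt text copied from the `user_query` and `assistant_response` columns.
USER_PROMPT = (
    "Write a single user message for a {{ persona }} asking about: {{ topic }}.\n"
    "Keep it natural, 1-3 sentences.\n"
)
ASSISTANT_PROMPT = (
    'Helpful assistant reply to: "{{ user_query }}".\n'
    "Style: concise, factual, no markdown.\n"
)


def render(template, values):
    """Fill {{ name }} placeholders; a stand-in for real template rendering."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(values[m.group(1)]), template)


def sample_row(topics, rng):
    """Pick one seed topic and one persona category."""
    return {"topic": rng.choice(topics), "persona": rng.choice(PERSONAS)}


def project_openai_messages(row, user_field, assistant_field, metadata_fields):
    """Shape a row into a messages array plus the listed metadata fields."""
    record = {
        "messages": [
            {"role": "user", "content": row[user_field]},
            {"role": "assistant", "content": row[assistant_field]},
        ]
    }
    for field in metadata_fields:
        record[field] = row[field]
    return record


def fake_llm(prompt):
    """Hypothetical placeholder for the hosted model; echoes the first prompt line."""
    return f"[model output for: {prompt.splitlines()[0]}]"


def build_record(topics, rng, llm):
    """Run the sample -> user_query -> assistant_response -> projection chain once."""
    row = sample_row(topics, rng)
    row["user_query"] = llm(render(USER_PROMPT, row))
    row["assistant_response"] = llm(render(ASSISTANT_PROMPT, row))
    return project_openai_messages(
        row, "user_query", "assistant_response", ["persona", "topic"]
    )


if __name__ == "__main__":
    seeds = ["tradeoffs between model latency and response quality"]
    print(build_record(seeds, random.Random(0), fake_llm))
```

Passing the model call in as a parameter keeps the sketch runnable offline. In the real pipeline, the model behind the nvidia-text alias fills both text columns.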
Procedure#
Clone the repository, if you have not already done so:
$ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
Install the dependencies for synthetic data generation:
$ uv sync --extra data-sdg
Set your NVIDIA API key:
$ export NVIDIA_API_KEY="<your-api-key>"
Run a 2-record preview to verify the model alias, prompts, and column mappings before generating at scale.
$ uv run nemotron steps run sdg/data_designer -c default preview=true num_records=2
The pipeline registers the model alias, generates two rows, and prints a summary:
Example Output
Compiled Configuration

──── run ────
mode: local
profile: null
env: {}
cli:
  argv:
  - preview=true
  - num_records=2
  dotlist:
  - preview=true
  - num_records=2
  passthrough: []
  config: default
recipe:
  name: steps/sdg/data_designer
  script: /local/var/tmp/nvidia/nemotron/src/nemotron/steps/sdg/data_designer/step.py

──── config ────
output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/sft.jsonl
num_records: 2
seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/sft_topic_seeds.jsonl
  strategy: shuffle
  fields:
  - topic
models:
- alias: nvidia-text
  model: nvidia/nemotron-3-nano-30b-a3b
  provider: nvidia
  skip_health_check: true
  inference_parameters:
    temperature: 0.8
    top_p: 1.0
    max_tokens: '******'
columns:
- name: persona
  type: category
  values:
  - teacher
  - engineer
  - student
  - researcher
  - support_agent
- name: user_query
  type: llm_text
  model_alias: nvidia-text
  prompt: 'Write a single user message for a {{ persona }} asking about: {{ topic
    }}.

    Keep it natural, 1-3 sentences.

    '
- name: assistant_response
  type: llm_text
  model_alias: nvidia-text
  prompt: 'Helpful assistant reply to: "{{ user_query }}".

    Style: concise, factual, no markdown.

    '
output_projection:
  type: openai_messages
  user_field: user_query
  assistant_field: assistant_response
  metadata_fields:
  - persona
  - topic
preview: true

Job Submission
├── configs
│   ├── job: /local/var/tmp/nvidia/nemotron/.nemotron/jobs/20260509-133530-steps-sdg-data_designer/job.yaml
│   └── train: /local/var/tmp/nvidia/nemotron/.nemotron/jobs/20260509-133530-steps-sdg-data_designer/train.yaml
└── mode: local

Executing: /local/var/tmp/nvidia/nemotron/.venv/bin/python3 /local/var/tmp/nvidia/nemotron/src/nemotron/steps/sdg/data_designer/step.py --config /local/var/tmp/nvidia/nemotron/.nemotron/jobs/20260509-133530-steps-sdg-data_designer/train.yaml
Preview 2 records → /local/var/tmp/nvidia/nemotron/output/sdg/sft.jsonl

Generate the five-record dataset:
$ uv run nemotron steps run sdg/data_designer -c default num_records=5
The default output path is ./output/sdg/sft.jsonl. To change the path, set the SDG_OUTPUT_DIR environment variable or pass output_path=... on the command line.
Inspect the output. Each line of sft.jsonl is one chat record. The openai_messages projection emits a messages array along with the seed topic and sampled persona as metadata for traceability. The following sample shows one record from the sft.jsonl file.
{
  "messages": [
    {
      "role": "user",
      "content": "Could you explain how reducing latency affects the quality of the model’s responses? I’m trying to understand the trade‑offs involved."
    },
    {
      "role": "assistant",
      "content": "Reducing latency shortens the time the model takes to generate a response, which often means it has less computational budget or fewer inference steps. With fewer steps or a tighter budget, the model may:\n\n- Produce shorter, less nuanced outputs \n- Miss subtle contextual cues or deeper reasoning \n- Rely more on surface‑level patterns rather than elaborate context \n- Be more prone to errors or hallucinations that would be filtered out with additional processing \n\nThe trade‑off is speed versus depth: faster responses can be less thorough, less coherent, or lower in quality, especially for tasks that benefit from extended reasoning or detailed elaboration. Balancing the two involves choosing an acceptable latency target while preserving enough inference capacity to maintain the desired response quality."
    }
  ],
  "persona": "teacher",
  "topic": "tradeoffs between model latency and response quality"
}
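To locate and read the file programmatically, a sketch along the following lines may be useful. It approximates the output_dir fallback order from default.yaml (SDG_OUTPUT_DIR, then NEMO_RUN_DIR/sdg, then PWD/output/sdg) in plain Python rather than calling the configuration system, so it does not see an output_path=... override given on the command line. The function names default_output_path and read_records are hypothetical.

```python
import json
import os
from pathlib import Path


def default_output_path(env=None):
    """Approximate the default output_path from default.yaml.

    Assumption: mirrors the fallback chain in plain Python; CLI overrides
    such as output_path=... are not taken into account.
    """
    env = os.environ if env is None else env
    if "SDG_OUTPUT_DIR" in env:
        out_dir = Path(env["SDG_OUTPUT_DIR"])
    else:
        if "NEMO_RUN_DIR" in env:
            base = Path(env["NEMO_RUN_DIR"])
        else:
            base = Path(env.get("PWD", os.getcwd())) / "output"
        out_dir = base / "sdg"
    return out_dir / "sft.jsonl"


def read_records(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    path = default_output_path()
    records = read_records(path) if path.exists() else []
    if records:
        # Pretty-print the first record for inspection.
        print(json.dumps(records[0], indent=2, ensure_ascii=False))
    else:
        print(f"No records found at {path}")
```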
Summary#
In this tutorial, you completed the following tasks:
Ran a 2-record preview to verify the pipeline and model.
Generated a 5-record SFT chat dataset with default.yaml.
Located the OpenAI-format JSONL output.
As you scale this workflow up, keep two principles in mind:
Run a preview first. The preview=true num_records=N form runs the same pipeline against a small record count, so you can iterate on column specifications and prompts before scaling num_records up.
The output format matches the trainer. The openai_messages projection emits records ready for data_prep/sft_packing or AutoModel SFT.
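Before handing a larger file to a downstream step, a quick structural check can catch malformed records early. The sketch below checks only the shape visible in this tutorial's sample record: a user message followed by an assistant message, plus the persona and topic metadata fields. The actual requirements of data_prep/sft_packing or AutoModel SFT are not stated here, and validate_record is a hypothetical helper, not part of the repository.

```python
def validate_record(record):
    """Return a list of problems; an empty list means the record has the expected shape.

    Assumption: the expected shape is the one shown in this tutorial's sample
    record, not a specification published by any trainer.
    """
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles ['user', 'assistant'], got {roles}")

    for index, message in enumerate(messages):
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty or non-string content")

    for field in ("persona", "topic"):
        if field not in record:
            problems.append(f"missing metadata field '{field}'")
    return problems


if __name__ == "__main__":
    sample = {
        "messages": [
            {"role": "user", "content": "How does latency affect quality?"},
            {"role": "assistant", "content": "Lower latency can reduce depth."},
        ],
        "persona": "teacher",
        "topic": "tradeoffs between model latency and response quality",
    }
    print(validate_record(sample) or "record looks well formed")
```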
Next Steps#
Adapt the pipeline to a specific domain: Create a Domain Dataset for Airlines Customer Service.
Preview, generate, and customize output: Tips for the Data Generation Pipeline.
Generate preference pairs for direct preference optimization (DPO): Generate Preference Data for DPO.
Dispatch to a cluster: Dispatch SDG to a Cluster describes env.toml profiles and container images.
Look up flags and config fields: CLI Reference, Config Schema.