Generate Your First Synthetic Dataset

What You’ll Build: a small supervised fine-tuning (SFT) chat dataset in OpenAI message format. The dataset contains five records grounded in the sft_topic_seeds.jsonl seed file in the repository. Data Designer generates the records against an NVIDIA-hosted large language model (LLM) endpoint.

In this tutorial, you will:

Set up prerequisites: the repository and an NVIDIA API key.
Read the default pipeline configuration.
Run a preview to verify the pipeline and model.
Generate a small dataset of five records.
Locate and inspect the output JSON Lines (JSONL) file.

This tutorial takes 5 to 10 minutes to complete.

Sample Prompt

Run a 2-record preview of the default synthetic data generation (SDG) pipeline, then generate 5 records and show me the first output record.

Prerequisites

A local clone of the repository, with uv available to install dependencies and run the commands. Run all commands from the repository root.
An NVIDIA_API_KEY for the default model, nvidia/nemotron-3-nano-30b-a3b. Data generation runs against an NVIDIA-hosted endpoint, so you can complete this tutorial on any machine with network access.

How the Default Pipeline Works

The default pipeline at src/nemotron/steps/sdg/data_designer/config/default.yaml combines two sources of variation for each record. A seed topic is sampled from sft_topic_seeds.jsonl, for example, a topic on safe deployment of AI assistants in enterprise support workflows. A persona category, such as teacher or engineer, is sampled from a fixed category set. Together, the topic and persona anchor a user prompt. The pipeline then generates a matching assistant response and projects the result into OpenAI chat-format messages.

```
# SFT synthetic chat — expand seed topics into user/assistant turns.
# Schema mirrors the NVIDIA-NeMo/DataDesigner Python SDK; column specs are
# translated by step.py into the corresponding typed config builder calls.
#
# Defaults are Lepton-friendly and local-friendly: output lands under
# $SDG_OUTPUT_DIR when set, otherwise $NEMO_RUN_DIR/sdg or ./output/sdg.
# The seed dataset is packaged with this step. Override either at the CLI:
# `output_path=... seed_dataset.path=...`.

output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/sft.jsonl
num_records: 1000

seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/sft_topic_seeds.jsonl
  strategy: shuffle           # shuffle | ordered
  fields: [topic]

# Models map an `alias` (referenced by llm_text/llm_structured columns) to a
# concrete model + provider + inference parameters. Override per-environment:
#   - Cloud:  provider: nvidia,  set NVIDIA_API_KEY in the env profile.
#   - Local:  provider: openai,  point at a vLLM/OpenAI-compatible endpoint.
#
# Set skip_health_check: true to skip Designer's startup probe — useful when
# the model exists at runtime but isn't in the provider's catalog at config
# time, or for offline / vLLM endpoints.
models:
  - alias: nvidia-text
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.8
      top_p: 1.0
      max_tokens: 1024

# Seed columns (e.g. `topic`) are added automatically when seed_dataset is set.
# Reference them in prompts as `{{ topic }}` without declaring them here.
columns:
  - name: persona
    type: category
    values: [teacher, engineer, student, researcher, support_agent]

  - name: user_query
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      Write a single user message for a {{ persona }} asking about: {{ topic }}.
      Keep it natural, 1-3 sentences.

  - name: assistant_response
    type: llm_text
    model_alias: nvidia-text
    prompt: |
      Helpful assistant reply to: "{{ user_query }}".
      Style: concise, factual, no markdown.

output_projection:
  type: openai_messages
  user_field: user_query
  assistant_field: assistant_response
  metadata_fields: [persona, topic]
```
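
To make the data flow in this configuration concrete, the following is a minimal offline sketch of how one record could be assembled. It is not the code in step.py, which translates the column specs into Data Designer builder calls. The names render, build_record, and echo_model are hypothetical, the persona is assumed to be sampled uniformly, plain string replacement stands in for real template rendering, and a stub replaces the hosted model so the sketch runs without an API key.

```python
import random

# Values copied from default.yaml.
PERSONAS = ["teacher", "engineer", "student", "researcher", "support_agent"]
USER_PROMPT = (
    "Write a single user message for a {{ persona }} asking about: {{ topic }}.\n"
    "Keep it natural, 1-3 sentences.\n"
)
ASSISTANT_PROMPT = (
    'Helpful assistant reply to: "{{ user_query }}".\n'
    "Style: concise, factual, no markdown.\n"
)


def render(template, values):
    """Fill {{ name }} placeholders; a stand-in for real template rendering."""
    for name, value in values.items():
        template = template.replace("{{ " + name + " }}", value)
    return template


def build_record(topic, rng, generate):
    """Produce one projected record for a seed topic.

    `generate` is a hypothetical callable (prompt -> text) standing in for
    the hosted model; it is not part of Data Designer.
    """
    # Assumption: the category column is sampled uniformly.
    row = {"topic": topic, "persona": rng.choice(PERSONAS)}
    row["user_query"] = generate(render(USER_PROMPT, row))
    row["assistant_response"] = generate(render(ASSISTANT_PROMPT, row))
    # openai_messages projection: two chat turns plus the metadata fields.
    return {
        "messages": [
            {"role": "user", "content": row["user_query"]},
            {"role": "assistant", "content": row["assistant_response"]},
        ],
        "persona": row["persona"],
        "topic": row["topic"],
    }


def echo_model(prompt):
    """Offline stub: returns the first prompt line instead of calling an LLM."""
    return prompt.splitlines()[0]
```

Calling build_record("tradeoffs between model latency and response quality", random.Random(0), echo_model) returns a dictionary with the same top-level keys as the records in sft.jsonl: messages, persona, and topic. Note that assistant_response depends on user_query, so the two LLM columns must be generated in that order.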

Procedure

Clone the repository, if you have not already done so:

```
$ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
```

Install the dependencies for synthetic data generation:
```
$ uv sync --extra data-sdg
```

Set your NVIDIA API key:

```
$ export NVIDIA_API_KEY="<your-api-key>"
```

Run a 2-record preview to verify the model alias, prompts, and column mappings before generating at scale.

```
$ uv run nemotron steps run sdg/data_designer -c default preview=true num_records=2
```

The pipeline registers the model alias, generates two rows, and prints a summary:

Example Output

Compiled Configuration

╭──────────────────────────────────────── run ────────────────────────────────────────╮
│ mode: local                                                                         │
│ profile: null                                                                       │
│ env: {}                                                                             │
│ cli:                                                                                │
│   argv:                                                                             │
│   - preview=true                                                                    │
│   - num_records=2                                                                   │
│   dotlist:                                                                          │
│   - preview=true                                                                    │
│   - num_records=2                                                                   │
│   passthrough: []                                                                   │
│   config: default                                                                   │
│ recipe:                                                                             │
│   name: steps/sdg/data_designer                                                     │
│   script: /local/var/tmp/nvidia/nemotron/src/nemotron/steps/sdg/data_designer/step.py │
╰─────────────────────────────────────────────────────────────────────────────────────╯

╭─────────────────────────────────────── config ────────────────────────────────────────╮
│ output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}  │
│ output_path: ${output_dir}/sft.jsonl                                                  │
│ num_records: 2                                                                        │
│ seed_dataset:                                                                         │
│   path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/sft_topic_seeds.jsonl │
│   strategy: shuffle                                                                   │
│   fields:                                                                             │
│   - topic                                                                             │
│ models:                                                                               │
│ - alias: nvidia-text                                                                  │
│   model: nvidia/nemotron-3-nano-30b-a3b                                               │
│   provider: nvidia                                                                    │
│   skip_health_check: true                                                             │
│   inference_parameters:                                                               │
│     temperature: 0.8                                                                  │
│     top_p: 1.0                                                                        │
│     max_tokens: '******'                                                              │
│ columns:                                                                              │
│ - name: persona                                                                       │
│   type: category                                                                      │
│   values:                                                                             │
│   - teacher                                                                           │
│   - engineer                                                                          │
│   - student                                                                           │
│   - researcher                                                                        │
│   - support_agent                                                                     │
│ - name: user_query                                                                    │
│   type: llm_text                                                                      │
│   model_alias: nvidia-text                                                            │
│   prompt: 'Write a single user message for a {{ persona }} asking about: {{ topic     │
│     }}.                                                                               │
│                                                                                       │
│     Keep it natural, 1-3 sentences.                                                   │
│                                                                                       │
│     '                                                                                 │
│ - name: assistant_response                                                            │
│   type: llm_text                                                                      │
│   model_alias: nvidia-text                                                            │
│   prompt: 'Helpful assistant reply to: "{{ user_query }}".                            │
│                                                                                       │
│     Style: concise, factual, no markdown.                                             │
│                                                                                       │
│     '                                                                                 │
│ output_projection:                                                                    │
│   type: openai_messages                                                               │
│   user_field: user_query                                                              │
│   assistant_field: assistant_response                                                 │
│   metadata_fields:                                                                    │
│   - persona                                                                           │
│   - topic                                                                             │
│ preview: true                                                                         │
╰───────────────────────────────────────────────────────────────────────────────────────╯



╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Job Submission                                                                                                │
│ ├── configs                                                                                                   │
│ │   ├── job:   /local/var/tmp/nvidia/nemotron/.nemotron/jobs/20260509-133530-steps-sdg-data_designer/job.yaml   │
│ │   └── train: /local/var/tmp/nvidia/nemotron/.nemotron/jobs/20260509-133530-steps-sdg-data_designer/train.yaml │
│ └── mode: local                                                                                               │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Executing: /local/var/tmp/nvidia/nemotron/.venv/bin/python3 /local/var/tmp/nvidia/nemotron/src/nemotron/steps/sdg/data_designer/step.py --config /local/var/tmp/nvidia/nemotron/.nemotron/jobs/20260509-133530-steps-sdg-data_designer/train.yaml
Preview 2 records → /local/var/tmp/nvidia/nemotron/output/sdg/sft.jsonl
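
If you prefer to drive the preview and the full run from a script rather than the shell, the following sketch assembles the same command line used in this procedure and fails fast when the API key is missing. The helper names build_command and run_step are hypothetical; only the uv and nemotron arguments come from this tutorial.

```python
import os
import subprocess

STEP = "sdg/data_designer"


def build_command(num_records, preview=False, config="default"):
    """Return the argv list for the `nemotron steps run` command in this tutorial."""
    argv = ["uv", "run", "nemotron", "steps", "run", STEP, "-c", config]
    if preview:
        argv.append("preview=true")
    argv.append(f"num_records={num_records}")
    return argv


def run_step(num_records, preview=False, env=None):
    """Check for the API key, then run the step from the current directory."""
    env = dict(os.environ if env is None else env)
    if not env.get("NVIDIA_API_KEY"):
        raise RuntimeError("NVIDIA_API_KEY is not set")
    # check=True raises CalledProcessError if the command exits non-zero.
    return subprocess.run(build_command(num_records, preview), env=env, check=True)
```

Run the script from the repository root. run_step(2, preview=True) corresponds to the preview command above, and run_step(5) corresponds to the five-record run.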

Generate the five-record dataset:
```
$ uv run nemotron steps run sdg/data_designer -c default num_records=5
```
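When neither SDG_OUTPUT_DIR nor NEMO_RUN_DIR is set, the output path defaults to ./output/sdg/sft.jsonl, relative to the directory you run the command from. To change the location, set the SDG_OUTPUT_DIR environment variable, or pass output_path=... on the command line to set the full file path.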
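
The fallback order comes from the nested oc.env interpolation in the output_dir setting of default.yaml. The sketch below restates that chain in plain Python so the precedence is easy to check. The function name resolve_output_path is hypothetical, and the sketch ignores an explicit output_path=... override, which would replace the result entirely.

```python
def resolve_output_path(env):
    """Mirror the output_dir / output_path fallback chain in default.yaml.

    `env` is a mapping of environment variables, for example dict(os.environ).
    PWD must be present unless SDG_OUTPUT_DIR or NEMO_RUN_DIR is set.
    """
    if "SDG_OUTPUT_DIR" in env:
        # Highest precedence: used as-is, with no /sdg suffix appended.
        output_dir = env["SDG_OUTPUT_DIR"]
    else:
        if "NEMO_RUN_DIR" in env:
            base = env["NEMO_RUN_DIR"]
        else:
            base = env["PWD"] + "/output"
        output_dir = base + "/sdg"
    return output_dir + "/sft.jsonl"
```

For example, with only PWD=/repo set, the function returns /repo/output/sdg/sft.jsonl. Adding NEMO_RUN_DIR=/runs/a changes the result to /runs/a/sdg/sft.jsonl.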

Inspect the output. Each line of sft.jsonl is one chat record. The openai_messages projection emits a messages array and carries the seed topic and sampled persona as metadata fields for traceability. The following sample shows one record from sft.jsonl, reformatted across multiple lines for readability.

```
{
  "messages": [
    {
      "role": "user",
      "content": "Could you explain how reducing latency affects the quality of the model’s responses? I’m trying to understand the trade‑offs involved."
    },
    {
      "role": "assistant",
      "content": "Reducing latency shortens the time the model takes to generate a response, which often means it has less computational budget or fewer inference steps. With fewer steps or a tighter budget, the model may:\n\n- Produce shorter, less nuanced outputs  \n- Miss subtle contextual cues or deeper reasoning  \n- Rely more on surface‑level patterns rather than elaborate context  \n- Be more prone to errors or hallucinations that would be filtered out with additional processing  \n\nThe trade‑off is speed versus depth: faster responses can be less thorough, less coherent, or lower in quality, especially for tasks that benefit from extended reasoning or detailed elaboration. Balancing the two involves choosing an acceptable latency target while preserving enough inference capacity to maintain the desired response quality."
    }
  ],
  "persona": "teacher",
  "topic": "tradeoffs between model latency and response quality"
}
```
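
Beyond reading one record by eye, you may want to confirm that every line has the expected shape. The following sketch uses only the Python standard library and checks each record for a user turn followed by an assistant turn, non-empty content, and the persona and topic metadata fields. The function names validate_record and check_jsonl are hypothetical, and the checks reflect the record shape shown in this tutorial rather than an official schema.

```python
import json


def validate_record(record):
    """Return a list of problems found in one projected record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"unexpected roles: {roles}")
    for m in messages:
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"empty content for role {m.get('role')}")
    for field in ("persona", "topic"):
        if field not in record:
            problems.append(f"missing metadata field: {field}")
    return problems


def check_jsonl(path):
    """Validate every non-blank line; return (record_count, {line_number: problems})."""
    count, failures = 0, {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            count += 1
            problems = validate_record(json.loads(line))
            if problems:
                failures[line_number] = problems
    return count, failures
```

Call check_jsonl("output/sdg/sft.jsonl") from the repository root after the five-record run. A file that matches the sample above should yield a record count of 5 and an empty failure map.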

Summary

In this tutorial, you completed the following tasks:

Ran a 2-record preview to verify the pipeline and model.
Generated a 5-record SFT chat dataset with default.yaml.
Located the OpenAI-format JSONL output.

As you scale this workflow up, keep two principles in mind:

Run a preview first. The preview=true num_records=N form runs the same pipeline against a small record count, so you can iterate on column specifications and prompts before scaling num_records up.
The output format matches the trainer. The openai_messages projection emits records ready for data_prep/sft_packing or AutoModel SFT.

Next Steps

Adapt the pipeline to a specific domain: Create a Domain Dataset for Airlines Customer Service.
Preview, generate, and customize output: Tips for the Data Generation Pipeline.
Generate preference pairs for direct preference optimization (DPO): Generate Preference Data for DPO.
Dispatch to a cluster: Dispatch SDG to a Cluster describes env.toml profiles and container images.
Look up flags and config fields: CLI Reference, Config Schema.