Tips for the Data Generation Pipeline
Preview Before Generating
Always preview before running a full generation job. Preview mode calls the same pipeline with a small record count, projects the records, and writes them to output_path:
$ nemotron steps run sdg/data_designer -c default preview=true num_records=2
Use preview to verify:
- Column references in prompts ({{ column_name }}) resolve to the expected values.
- Seed fields, such as {{ scenario }}, {{ prompt }}, and so on, are populated from the seed file.
- The model returns text that matches the prompt's intent.
- The output_projection produces the schema downstream steps expect.
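One quick way to spot the first two problems is to search the preview output for placeholders that were never rendered. The following is a minimal sketch, not part of the nemotron CLI: count_unrendered is a hypothetical helper, and it assumes the preview records are written as a text-based file (for example JSON Lines) under output_path.

```shell
# Hypothetical helper, not part of the nemotron CLI.
# Prints the number of lines in a text file that still contain a literal "{{",
# which suggests a template reference was left unrendered.
count_unrendered() {
  # grep -c prints the matching-line count but exits non-zero when the count
  # is 0, so "|| true" keeps the function's exit status at 0 in that case.
  grep -c -F '{{' "$1" || true
}
```

Call it as count_unrendered followed by the path to a preview output file. A result of 0 means no literal {{ was found; it does not prove the values are correct, so a manual look at the records is still worthwhile.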
Specify a Configuration File
The repository includes the following sample config files in the src/nemotron/steps/sdg/data_designer/config directory:
| Config | Output | Use for |
|---|---|---|
| | SFT chat ( | General chat SFT |
| | Tool-call SFT ( | Tool-use SFT |
| | Preference pairs ( | DPO / RLHF |
| | SFT chat, 10 records, short tokens | Fast iteration |
Specify the file in the -c argument:
$ nemotron steps run sdg/data_designer -c customer_support_tools preview=true num_records=2
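To preview several configs in one pass, the same command can be wrapped in a small loop. This is a sketch under stated assumptions: preview_configs is a hypothetical helper rather than a CLI feature, and it reuses only the command form shown on this page.

```shell
# Hypothetical helper: preview each named config with two records.
# Stops and returns non-zero at the first config whose preview fails.
preview_configs() {
  for cfg in "$@"; do
    echo "== previewing config: $cfg =="
    nemotron steps run sdg/data_designer -c "$cfg" preview=true num_records=2 || return 1
  done
}
```

For example, preview_configs default customer_support_tools previews the two configs named in the commands above, one after the other.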
Run Attached on a Cluster Profile
To dispatch to a Lepton or Slurm profile configured in env.toml, use --run (attached, streams logs) or --batch (detached):
$ nemotron steps run sdg/data_designer -c default --run my-lepton-profile num_records=1000
For cluster setup, see Dispatch SDG to a Cluster.