Tips for the Data Generation Pipeline#

Preview Before Generating#

Always preview before running a full generation job. Preview mode calls the same pipeline with a small record count, projects the records, and writes them to output_path:

$ nemotron steps run sdg/data_designer -c default preview=true num_records=2

Use preview to verify:

  • Column references in prompts ({{ column_name }}) resolve to the expected values.

  • Seed fields, such as {{ scenario }}, {{ prompt }}, and so on, are populated from the seed file.

  • The model returns text that matches the prompt’s intent.

  • The output_projection produces the schema downstream steps expect.

Specify a Configuration File#

The repository includes the following sample config files in the src/nemotron/steps/sdg/data_designer/config directory:

Config

Output

Use for

default.yaml

SFT chat (openai_messages)

General chat SFT

customer_support_tools.yaml

Tool-call SFT (structured_messages)

Tool-use SFT

rl_pref.yaml

Preference pairs (dpo_preference)

DPO / RLHF

tiny.yaml

SFT chat, 10 records, short tokens

Fast iteration

Specify the file in the -c argument:

$ nemotron steps run sdg/data_designer -c customer_support_tools preview=true num_records=2

Run Attached on a Cluster Profile#

To dispatch to a Lepton or Slurm profile configured in env.toml, use --run (attached, streams logs) or --batch (detached):

$ nemotron steps run sdg/data_designer -c default --run my-lepton-profile num_records=1000

For cluster setup, see Dispatch SDG to a Cluster.