Config Schema

This page is the reference for the YAML config file consumed by sdg/data_designer.

Simple Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| output_dir | string | no | Base output directory. Supports OmegaConf env-var interpolation. Default resolves $SDG_OUTPUT_DIR, then $NEMO_RUN_DIR/sdg, then ./output/sdg. |
| output_path | string | yes | Full path for the output JSONL file. Typically ${output_dir}/my-dataset.jsonl. |
| num_records | int | yes | Number of records to generate (client.create) or preview (client.preview). |
| preview | bool | no | When true, calls client.preview() instead of client.create(). Default: false. Prefer setting this as a CLI override (preview=true) rather than in the YAML. |
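The fallback order described for output_dir can be pictured with a small sketch. This is not the actual resolver (the real config relies on OmegaConf interpolation); the function name default_output_dir is hypothetical, and treating an empty variable as unset is an assumption.

```python
import os


def default_output_dir(env=None):
    """Return the default base directory using the documented fallback order.

    Hypothetical helper for illustration; an empty variable counts as unset
    here, which is an assumption rather than documented behavior.
    """
    env = os.environ if env is None else env
    # 1. An explicit SDG_OUTPUT_DIR takes priority.
    if env.get("SDG_OUTPUT_DIR"):
        return env["SDG_OUTPUT_DIR"]
    # 2. Otherwise nest an sdg directory under NEMO_RUN_DIR.
    if env.get("NEMO_RUN_DIR"):
        return f"{env['NEMO_RUN_DIR']}/sdg"
    # 3. Final fallback: a path relative to the working directory.
    return "./output/sdg"
```

Passing a plain dict instead of the process environment keeps the sketch easy to exercise in isolation.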

seed_dataset

Optional top-level field. When present, Data Designer samples one row per generated record from the seed file and exposes the selected fields to column prompts through Jinja2.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| path | string | yes | Path to a JSONL file. Each line is a JSON object. |
| strategy | string | no | shuffle (default) or ordered. |
| fields | list[string] | yes | Column names to expose. Must match keys in the seed JSONL objects. These become available as {{ field_name }} in prompts without being declared in columns. |
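To make the seed_dataset semantics concrete, the sketch below picks one seed row per record and keeps only the listed fields. It is an illustration, not Data Designer's sampler: the function name sample_seed_rows is hypothetical, and two behaviors are assumptions the page does not state (ordered wraps around when num_records exceeds the number of rows, and shuffle draws with replacement).

```python
import json
import random


def sample_seed_rows(jsonl_lines, fields, num_records, strategy="shuffle", rng=None):
    """Pick one seed row per record and expose only the requested fields."""
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not rows:
        raise ValueError("seed file contains no rows")
    # Every exposed field must be a key in the seed objects.
    missing = [f for f in fields if any(f not in row for row in rows)]
    if missing:
        raise KeyError(f"seed fields not found in every row: {missing}")
    rng = rng or random.Random()
    picked = []
    for i in range(num_records):
        if strategy == "ordered":
            row = rows[i % len(rows)]  # assumption: wrap around at the end
        else:
            row = rng.choice(rows)  # assumption: draw with replacement
        picked.append({f: row[f] for f in fields})
    return picked
```

Each returned dict corresponds to the values a prompt would see as {{ field_name }} for one generated record.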

models

Required top-level field. It lists the model configurations; each entry defines one alias that column specs reference by name.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| alias | string | yes | Short name referenced by model_alias in column specs. |
| model | string | yes | Model identifier, such as nvidia/nemotron-3-nano-30b-a3b or openai/gpt-oss-20b. |
| provider | string | no | Provider name, such as nvidia or anthropic. |
| skip_health_check | bool | no | Skip the startup probe against the model provider. Useful for local or offline endpoints. Default: false. |
| inference_parameters.temperature | float | no | Sampling temperature. |
| inference_parameters.top_p | float | no | Top-p nucleus sampling. |
| inference_parameters.max_tokens | int | no | Maximum output tokens per call. |
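The relationship between models entries and a column's model_alias can be sketched as a simple lookup. The helper name resolve_model is hypothetical and the error handling is an assumption; the nvidia-text default is the one this page gives for LLM column types.

```python
DEFAULT_ALIAS = "nvidia-text"  # default model_alias documented for LLM columns


def resolve_model(column_spec, models):
    """Return the models entry that a column spec refers to by alias."""
    by_alias = {entry["alias"]: entry for entry in models}
    alias = column_spec.get("model_alias", DEFAULT_ALIAS)
    if alias not in by_alias:
        # Assumption: an unknown alias is treated as a configuration error.
        raise KeyError(f"column {column_spec['name']!r} uses unknown alias {alias!r}")
    return by_alias[alias]


models = [
    {
        "alias": "nvidia-text",
        "model": "nvidia/nemotron-3-nano-30b-a3b",
        "inference_parameters": {"temperature": 0.7, "max_tokens": 1024},
    }
]
column = {"name": "user_query", "type": "llm_text", "prompt": "..."}
chosen = resolve_model(column, models)  # falls back to the default alias
```

The temperature and max_tokens values in the example are illustrative, not recommended settings.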

columns

Required top-level field. It is an ordered list of column specs. Each column has a name, a type, and type-specific fields. A column's prompt can reference earlier columns and seed fields with Jinja2 syntax such as {{ column_name }}.
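Because prompts may only reference earlier columns and seed fields, the order of the list matters. The sketch below is a hypothetical pre-flight check, not part of sdg/data_designer. It scans prompts for simple {{ name }} references with a regular expression, which is an assumption that ignores richer Jinja2 expressions such as loops or locally defined variables.

```python
import re

# Captures the root identifier of a simple {{ name }} or {{ name.attr }} reference.
_REF = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def find_unresolved_refs(columns, seed_fields=()):
    """Return (column_name, reference) pairs that point at nothing defined yet."""
    available = set(seed_fields)
    problems = []
    for spec in columns:
        for ref in _REF.findall(spec.get("prompt", "")):
            if ref not in available:
                problems.append((spec["name"], ref))
        # A column becomes referenceable only after its own definition.
        available.add(spec["name"])
    return problems


columns = [
    {"name": "persona", "type": "category", "values": ["teacher", "engineer"]},
    {
        "name": "user_query",
        "type": "llm_text",
        "prompt": "Write a message from a {{ persona }} asking about: {{ topic }}.",
    },
]
```

With topic listed as a seed field the check returns an empty list; without it, or with the two columns swapped, the offending reference is reported.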

Categorical Columns

Samples uniformly from a fixed list of string or numeric values, as in the following example.

- name: persona
  type: category
  values: [teacher, engineer, student, researcher]

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Column name. |
| values | yes | List of values to sample from. |

Seed Columns

Provides a named field from the seed dataset as a column. Use this column type when a seed field needs to appear in metadata_fields or must be referenced in a way that requires it to be an explicit column.

- name: topic
  type: seed

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Must match a field name in seed_dataset.fields. |

Seed fields declared in seed_dataset.fields are available directly in prompts without this column type. Use seed only when you need the field as a named column in the output schema.

LLM Text Columns

Generates free-form text with an LLM call. The prompt can reference previously defined columns and seed fields by using Jinja2 syntax.

- name: user_query
  type: llm_text
  model_alias: nvidia-text
  prompt: |
    Write a message from a {{ persona }} asking about: {{ topic }}.

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Column name. |
| model_alias | no | Alias from models. Default: nvidia-text. |
| prompt | yes | Jinja2 template. Reference any earlier column or seed field with {{ name }}. |

LLM Structured Columns

This column type generates structured JSON by making an LLM call. The column definition instructs the model to return JSON matching output_format. Use this column type for multi-turn conversations, preference judges, and any output that must conform to a schema.

- name: conversation
  type: llm_structured
  model_alias: nvidia-text
  prompt: |
    Generate a support conversation for customer {{ customer_name }}...
  output_format:
    type: object
    properties:
      messages:
        type: array
        ...
    required: [messages]

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Column name. |
| model_alias | no | Alias from models. Default: nvidia-text. |
| prompt | yes | Jinja2 template. |
| output_format | yes | JSON Schema dict describing the expected output structure. |
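To show what conforming to output_format means in practice, the sketch below checks a raw model response against the required and enum keywords only. It is a deliberately tiny, hypothetical helper (check_output), not a JSON Schema validator and not how Data Designer enforces the schema; a real pipeline would more likely use a full validator.

```python
import json


def check_output(raw_output, output_format):
    """Return a list of problems found in one structured LLM response.

    Only top-level `required` and per-property `enum` are checked;
    all other JSON Schema keywords are ignored in this sketch.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["expected a JSON object"]
    errors = [
        f"missing required key: {key}"
        for key in output_format.get("required", [])
        if key not in data
    ]
    for key, sub in output_format.get("properties", {}).items():
        if key in data and "enum" in sub and data[key] not in sub["enum"]:
            errors.append(f"{key}: {data[key]!r} not in {sub['enum']}")
    return errors


judge_format = {
    "type": "object",
    "properties": {"winner": {"type": "string", "enum": ["A", "B"]}},
    "required": ["winner"],
}
```

An empty list means the response passed these two checks; any other result lists what to fix or retry.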

LLM Judge Columns

This type is an alias for llm_structured, typically used for columns that compare or evaluate other columns.

- name: judge
  type: llm_judge
  model_alias: nvidia-text
  prompt: |
    Compare response A and B for: {{ prompt }}
    A: {{ response_a }}
    B: {{ response_b }}
  output_format:
    type: object
    properties:
      winner:
        type: string
        enum: [A, B]
    required: [winner]

output_projection

This top-level field maps raw Data Designer records into the schema expected by downstream steps. Refer to Output Projections for full field tables and annotated JSONL examples for each type.

| type | Use for | Downstream |
| --- | --- | --- |
| openai_messages | Single-turn SFT chat | data_prep/sft_packing, AutoModel SFT |
| dpo_preference | Preference pairs | data_prep/rl_prep, rl/nemo_rl/dpo |
| structured_messages | Multi-turn with tool calls | data_prep/sft_packing, AutoModel SFT |

Extending the Schema: person and datetime Samplers

The current step.py supports the column types above. To use Data Designer’s locale-aware person sampler or datetime sampler, extend step.py’s build_columns() function with person and datetime branches. The reference implementation adds both branches as follows:

        elif kind == "person":
            builder.add_column(
                dd.SamplerColumnConfig(
                    name=name,
                    sampler_type=dd.SamplerType.PERSON,
                    params=dd.PersonSamplerParams(
                        locale=spec.get("locale", "en_US"),
                        age_range=spec.get("age_range"),
                        with_synthetic_personas=spec.get("with_synthetic_personas", True),
                    ),
                )
            )

        elif kind == "datetime":
            builder.add_column(
                dd.SamplerColumnConfig(
                    name=name,
                    sampler_type=dd.SamplerType.DATETIME,
                    params=dd.DatetimeSamplerParams(
                        start=spec["start"],
                        end=spec["end"],
                    ),
                )
            )

Once merged, configs can declare:

- name: traveler
  type: person
  locale: en_US
  age_range: [22, 75]
  with_synthetic_personas: true

- name: booking_date
  type: datetime
  start: "2024-01-01"
  end: "2025-12-31"

Download personas for the locale before running:

$ data-designer download personas --locale en_US