Config Schema#
This page is the reference for the YAML config file consumed by sdg/data_designer.
Simple Fields#
| Field | Type | Required | Description |
|---|---|---|---|
| | string | no | Base output directory. Supports OmegaConf env-var interpolation. Default resolves |
| | string | yes | Full path for the output JSONL file. Typically |
| | int | yes | Number of records to generate ( |
| | bool | no | When |
seed_dataset#
Optional top-level field. When present, Data Designer samples one row per generated record from the seed file and makes the fields available to column prompts by using Jinja2.
| Field | Type | Required | Description |
|---|---|---|---|
| | string | yes | Path to a JSONL file. Each line is a JSON object. |
| | string | no | |
| fields | list[string] | yes | Column names to expose. Must match keys in the seed JSONL objects. These become available as |
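The sampling behavior described above can be pictured with a short sketch. This is an illustration only, not Data Designer's implementation: the helper name `sample_seed_context` is hypothetical, and sampling with replacement is an assumption (the reference only states that one row is sampled per generated record).

```python
import json
import random


def sample_seed_context(jsonl_text, fields, num_records, seed=0):
    """Return one prompt context per record, each drawn from a random seed row."""
    # Each non-empty line of the seed file is one JSON object.
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # Every exposed field must be a key in the seed objects.
    missing = [f for f in fields if any(f not in row for row in rows)]
    if missing:
        raise KeyError(f"fields not present in every seed row: {missing}")
    rng = random.Random(seed)
    # Assumption: sample with replacement, so num_records may exceed the row count.
    picked = rng.choices(rows, k=num_records)
    # Only the listed fields are exposed to prompts.
    return [{f: row[f] for f in fields} for row in picked]


seed_jsonl = (
    '{"topic": "refunds", "region": "EU"}\n'
    '{"topic": "shipping", "region": "US"}\n'
)
contexts = sample_seed_context(seed_jsonl, fields=["topic"], num_records=3)
```

Each dictionary in `contexts` holds only the exposed fields, which is the shape a prompt template would receive for one record.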
models#
A required top-level field. The field specifies a list of model configurations. Each entry defines one alias that column specs reference by name.
Field |
Type |
Required |
Description |
|---|---|---|---|
|
string |
yes |
Short name referenced by |
|
string |
yes |
Model identifier such as |
|
string |
no |
Provider name, such as |
|
bool |
no |
Skip the startup probe against the model provider. Useful for local or offline endpoints. Default: |
|
float |
no |
Sampling temperature. |
|
float |
no |
Top-p nucleus sampling. |
|
int |
no |
Maximum output tokens per call. |
columns#
A required top-level field.
This field is an ordered list of column specs.
Each column has a name, a type, and type-specific fields.
Columns can reference earlier columns and seed fields in prompts by using Jinja2 syntax like {{ column_name }}.
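Because the list is ordered, a prompt can only use names that are already defined when its column is generated. The following sketch checks a parsed `columns` list for references that appear too early. It is a hypothetical helper, not part of Data Designer, and it assumes prompts use only simple `{{ name }}` references (filters, loops, and expressions are not analyzed).

```python
import re

# Captures the leading identifier of a Jinja2 variable reference, e.g. {{ persona }}.
REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def undefined_references(columns, seed_fields=()):
    """Return (column, reference) pairs whose reference is not defined yet."""
    available = set(seed_fields)  # seed fields are usable from the first column on
    problems = []
    for spec in columns:
        for ref in REFERENCE.findall(spec.get("prompt", "")):
            if ref not in available:
                problems.append((spec["name"], ref))
        # A column becomes referenceable only after its own definition.
        available.add(spec["name"])
    return problems


columns = [
    {"name": "persona", "type": "category", "values": ["teacher", "engineer"]},
    {
        "name": "user_query",
        "type": "llm_text",
        "prompt": "Write a message from a {{ persona }} asking about: {{ topic }}.",
    },
]
print(undefined_references(columns, seed_fields=["topic"]))  # []
print(undefined_references(columns))  # [('user_query', 'topic')]
```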
Categorical Columns#
Samples uniformly from a fixed list of string or numeric values, as in the following example.
- name: persona
  type: category
  values: [teacher, engineer, student, researcher]
| Field | Required | Description |
|---|---|---|
| name | yes | Column name. |
| values | yes | List of values to sample from. |
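As a rough picture of what uniform sampling means here, the sketch below draws each record's value independently with equal probability. The function `sample_category` is hypothetical and only mirrors the documented behavior with the standard library; it is not the sampler Data Designer uses.

```python
import random
from collections import Counter


def sample_category(values, num_records, seed=0):
    """Draw one value per record, each value equally likely."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(num_records)]


personas = sample_category(["teacher", "engineer", "student", "researcher"], 10_000)
counts = Counter(personas)  # each of the four values lands near 2,500
```

With four values, each one should appear in roughly a quarter of the generated records.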
Seed Columns#
Provides a named field from the seed dataset as a column.
Use this column type when a seed field needs to appear in metadata_fields or must be referenced in a way that requires it to be an explicit column.
- name: topic
  type: seed
| Field | Required | Description |
|---|---|---|
| name | yes | Must match a field name in |
Seed fields declared in seed_dataset.fields are available directly in prompts without this column type.
Use seed only when you need the field as a named column in the output schema.
LLM Text Columns#
Generates free-form text using an LLM call.
These columns can reference previously defined columns and seed fields in the prompt by using Jinja2 syntax.
- name: user_query
  type: llm_text
  model_alias: nvidia-text
  prompt: |
    Write a message from a {{ persona }} asking about: {{ topic }}.
| Field | Required | Description |
|---|---|---|
| name | yes | Column name. |
| model_alias | no | Alias from |
| prompt | yes | Jinja2 template. Reference any earlier column or seed field with |
LLM Structured Columns#
This column type generates structured JSON by making an LLM call.
The column definition instructs the model to return JSON matching output_format.
Use this column type for multi-turn conversations, preference judges, and any output that must conform to a schema.
- name: conversation
  type: llm_structured
  model_alias: nvidia-text
  prompt: |
    Generate a support conversation for customer {{ customer_name }}...
  output_format:
    type: object
    properties:
      messages:
        type: array
        ...
    required: [messages]
| Field | Required | Description |
|---|---|---|
| name | yes | Column name. |
| model_alias | no | Alias from |
| prompt | yes | Jinja2 template. |
| output_format | yes | JSON Schema dict describing the expected output structure. |
LLM Judge Columns#
This type is an alias for llm_structured.
This type is typically used for columns that compare or evaluate other columns.
- name: judge
  type: llm_judge
  model_alias: nvidia-text
  prompt: |
    Compare response A and B for: {{ prompt }}
    A: {{ response_a }}
    B: {{ response_b }}
  output_format:
    type: object
    properties:
      winner:
        type: string
        enum: [A, B]
    required: [winner]
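Both llm_structured and llm_judge ask the model to return JSON matching output_format, so it can help to see what "matching" means for the judge schema above. The sketch below is a hand-rolled check covering only the `type`, `required`, `properties`, and `enum` keywords. It is an assumption-laden illustration, not how Data Designer validates responses, and a full JSON Schema validator covers far more.

```python
import json

# Maps the JSON Schema type names used in this sketch to Python types.
PY_TYPES = {"object": dict, "array": list, "string": str}


def conforms(value, schema):
    """Check a small JSON Schema subset: type, required, properties, enum."""
    expected = PY_TYPES.get(schema.get("type"))
    if expected is not None and not isinstance(value, expected):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if isinstance(value, dict):
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        # Present keys are checked recursively against their sub-schema.
        for key, sub in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub):
                return False
    return True


output_format = {
    "type": "object",
    "properties": {"winner": {"type": "string", "enum": ["A", "B"]}},
    "required": ["winner"],
}
print(conforms(json.loads('{"winner": "A"}'), output_format))  # True
print(conforms(json.loads('{"winner": "C"}'), output_format))  # False
print(conforms(json.loads("{}"), output_format))  # False
```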
output_projection#
This top-level field maps raw Data Designer records into the schema expected by downstream steps. Refer to Output Projections for full field tables and annotated JSONL examples for each type.
| | Use for | Downstream |
|---|---|---|
| | Single-turn SFT chat | |
| | Preference pairs | |
| | Multi-turn with tool calls | |
Extending the Schema: person and datetime Samplers#
The current step.py supports the column types above. To use Data Designer’s locale-aware person sampler or datetime sampler, step.py’s build_columns() function must be extended with person and datetime branches. A reference implementation showing both additions is in:
elif kind == "person":
    builder.add_column(
        dd.SamplerColumnConfig(
            name=name,
            sampler_type=dd.SamplerType.PERSON,
            params=dd.PersonSamplerParams(
                locale=spec.get("locale", "en_US"),
                age_range=spec.get("age_range"),
                with_synthetic_personas=spec.get("with_synthetic_personas", True),
            ),
        )
    )
elif kind == "datetime":
    builder.add_column(
        dd.SamplerColumnConfig(
            name=name,
            sampler_type=dd.SamplerType.DATETIME,
            params=dd.DatetimeSamplerParams(
                start=spec["start"],
                end=spec["end"],
            ),
        )
    )
Once merged, configs can declare:
- name: traveler
  type: person
  locale: en_US
  age_range: [22, 75]
  with_synthetic_personas: true
- name: booking_date
  type: datetime
  start: "2024-01-01"
  end: "2025-12-31"
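The datetime branch reads `spec["start"]` and `spec["end"]` directly, so a missing key or an inverted range would only surface when the column is built. One option is a small pre-check on the parsed spec, sketched below. The helper `check_datetime_spec` is hypothetical, and it assumes ISO `YYYY-MM-DD` strings as in the example config; which formats the datetime sampler itself accepts is not stated here.

```python
from datetime import date


def check_datetime_spec(spec):
    """Parse start/end as ISO dates and reject an inverted range."""
    # A missing "start" or "end" raises KeyError, matching the branch above.
    start = date.fromisoformat(spec["start"])
    end = date.fromisoformat(spec["end"])
    if start > end:
        raise ValueError(f"{spec['name']}: start {start} is after end {end}")
    return start, end


booking = {
    "name": "booking_date",
    "type": "datetime",
    "start": "2024-01-01",
    "end": "2025-12-31",
}
start, end = check_datetime_spec(booking)
print((end - start).days)  # 730
```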
Download personas for the locale before running:
$ data-designer download personas --locale en_US