Generate Preference Data for DPO#
This example shows how to use the rl_pref.yaml configuration file.
The example generates prompt, chosen, and rejected triples for direct preference optimization (DPO) training.
Output flows directly into data_prep/rl_prep and then rl/nemo_rl/dpo.
How It Works#
The rl_pref.yaml file registers two model aliases at different temperatures:
a high-temperature creative model and a low-temperature precise model.
The goal is to produce two responses per prompt that are distinct:
# DPO preference data: two responses per prompt + LLM judge for chosen/rejected.
output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/rl_pref.jsonl
num_records: 100
seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/rl_pref_prompt_seeds.jsonl
  strategy: shuffle
  fields: [prompt]
# Two model aliases: a high-temperature 'creative' model and a low-temperature
# 'precise' model, so the resulting preference pairs are meaningfully distinct.
models:
  - alias: nvidia-text
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.9
      top_p: 1.0
      max_tokens: 1024
  - alias: nvidia-text-precise
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.3
      top_p: 1.0
      max_tokens: 1024
# `prompt` is supplied automatically by the seed dataset (must match the field
# name in the seed JSONL). No need to declare it here.
columns:
  - name: response_a
    type: llm_text
    model_alias: nvidia-text
    prompt: "Answer the user's question: {{ prompt }}"
  - name: response_b
    type: llm_text
    model_alias: nvidia-text-precise
    prompt: "Answer the user's question: {{ prompt }}"
  - name: judge
    type: llm_judge
    model_alias: nvidia-text
    prompt: |
      Compare two responses for: {{ prompt }}
      A: {{ response_a }}
      B: {{ response_b }}
      Which is more helpful and correct?
    output_format:
      type: object
      properties:
        winner:
          type: string
          enum: [A, B]
      required: [winner]
output_projection:
  type: dpo_preference
  prompt_field: prompt
  response_a_field: response_a
  response_b_field: response_b
  judge_field: judge
  winner_field: winner
For each seed prompt the pipeline:
1. Generates response_a (high temperature) and response_b (low temperature) independently.
2. Makes a third LLM call (judge column, llm_judge type) to compare them and return {"winner": "A"} or {"winner": "B"}.
3. Applies the dpo_preference projection, which maps winner → chosen / rejected and writes {"prompt": "...", "chosen": "...", "rejected": "..."}.
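The winner-to-chosen/rejected mapping can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline's actual implementation: the function name to_dpo_record is hypothetical, and it assumes the judge column arrives as a parsed object such as {"winner": "A"}. The keyword defaults mirror the field names in the output_projection block of rl_pref.yaml.

```python
def to_dpo_record(
    row,
    prompt_field="prompt",
    response_a_field="response_a",
    response_b_field="response_b",
    judge_field="judge",
    winner_field="winner",
):
    """Map one generated row to a {"prompt", "chosen", "rejected"} triple.

    Hypothetical helper: assumes row[judge_field] is a dict like {"winner": "A"}.
    """
    winner = row[judge_field][winner_field]
    if winner not in ("A", "B"):
        # The judge schema restricts winner to the enum [A, B].
        raise ValueError(f"unexpected winner value: {winner!r}")

    response_a = row[response_a_field]
    response_b = row[response_b_field]
    if winner == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a

    return {"prompt": row[prompt_field], "chosen": chosen, "rejected": rejected}


example_row = {
    "prompt": "What is DPO?",
    "response_a": "A vague answer.",
    "response_b": "A precise answer.",
    "judge": {"winner": "B"},
}
example_record = to_dpo_record(example_row)
```

Here the judge picked B, so response_b becomes chosen and response_a becomes rejected.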
Prerequisites#
- NVIDIA_API_KEY set in your environment.
- A seed file with one prompt field per line. The bundled rl_pref_prompt_seeds.jsonl contains general reasoning prompts. Replace it with domain-specific prompts for targeted preference data.
Procedure#
1. Preview two records to verify the judge returns valid winner values:
   $ nemotron steps run sdg/data_designer -c rl_pref preview=true num_records=2
2. Generate the dataset. The checked-in rl_pref.yaml default is 100 records; this command overrides it to 500:
   $ nemotron steps run sdg/data_designer -c rl_pref num_records=500
   With neither SDG_OUTPUT_DIR nor NEMO_RUN_DIR set, output is written to ./output/sdg/rl_pref.jsonl.
3. Inspect the output. Each line is a preference triple:
   {"prompt": "Explain why retrieval-augmented generation can reduce hallucinations.", "chosen": "RAG grounds the model in retrieved documents, so claims are tied to specific passages rather than purely to weights.", "rejected": "RAG is better because it uses more data and is generally smarter than standard models."}
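The output path quoted in the procedure follows from the nested output_dir interpolation in rl_pref.yaml. The sketch below restates that fallback chain in plain Python to make the precedence explicit. It is an approximation for illustration: resolve_output_dir is a hypothetical helper, it takes the environment as a dict rather than calling the config library, and it treats a variable as set whenever the key is present.

```python
def resolve_output_dir(env):
    """Mirror the output_dir fallback chain from rl_pref.yaml.

    Precedence: SDG_OUTPUT_DIR, else NEMO_RUN_DIR + "/sdg",
    else PWD + "/output/sdg". `env` is a plain mapping of variables.
    """
    if "SDG_OUTPUT_DIR" in env:
        return env["SDG_OUTPUT_DIR"]
    if "NEMO_RUN_DIR" in env:
        run_dir = env["NEMO_RUN_DIR"]
    else:
        run_dir = env["PWD"] + "/output"
    return run_dir + "/sdg"


def resolve_output_path(env):
    """output_path is output_dir plus the fixed file name rl_pref.jsonl."""
    return resolve_output_dir(env) + "/rl_pref.jsonl"


default_path = resolve_output_path({"PWD": "/work"})
run_dir_path = resolve_output_path({"PWD": "/work", "NEMO_RUN_DIR": "/runs/42"})
explicit_path = resolve_output_path({"PWD": "/work", "SDG_OUTPUT_DIR": "/data/sdg"})
```

Note that SDG_OUTPUT_DIR is used as given, while NEMO_RUN_DIR and the PWD fallback both get the sdg subdirectory appended.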
Adapt the Seed File#
Swap seed_dataset.path to point at your own prompt seed file. Each line must be valid JSON with a prompt field:
{"prompt": "Describe the tradeoffs between batch and streaming inference for real-time applications."}
Keep seed prompts representative of the target capability and diverse across difficulty levels. The judge performs better when the two responses have a clear quality difference. Because the winner schema allows only A or B, near-identical responses still get a forced choice, so consider widening the temperature gap between the two model aliases if the judge's verdicts look arbitrary or unexpected.
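If your prompts start out as a plain list of strings, a seed file in the expected shape can be produced with the standard json module. The following is a minimal sketch: the helper name to_seed_lines is hypothetical, and dropping blank and duplicate prompts is an assumption about what makes a clean seed set, not a requirement stated by the pipeline.

```python
import json


def to_seed_lines(prompts):
    """Turn an iterable of prompt strings into JSONL lines with a `prompt` field.

    Blank prompts and exact duplicates (after stripping whitespace) are dropped.
    """
    seen = set()
    lines = []
    for prompt in prompts:
        text = prompt.strip()
        if not text or text in seen:
            continue
        seen.add(text)
        # ensure_ascii=False keeps non-ASCII prompts readable in the file.
        lines.append(json.dumps({"prompt": text}, ensure_ascii=False))
    return lines


seed_lines = to_seed_lines([
    "Describe the tradeoffs between batch and streaming inference.",
    "  Describe the tradeoffs between batch and streaming inference.  ",
    "",
    "When is quantization a poor fit for a latency-sensitive service?",
])
```

Writing "\n".join(seed_lines) to a file and pointing seed_dataset.path at it gives one JSON object per line, each with a prompt field.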
Downstream Pipeline#
rl_pref.jsonl → data_prep/rl_prep → rl/nemo_rl/dpo
data_prep/rl_prep tokenizes and prepares preference pairs. rl/nemo_rl/dpo consumes the prepared dataset. Verify the prompt, chosen, and rejected fields are present in every record before handing off.
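One way to run that check before handing off is a small standalone script. This is a sketch under stated assumptions: validate_preference_lines is a hypothetical helper, not part of the nemotron CLI, and flagging empty strings or identical chosen/rejected pairs goes beyond the field-presence check described above.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_lines(lines):
    """Return a list of human-readable problems found in JSONL preference records."""
    errors = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {number}: invalid JSON")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {number}: not a JSON object")
            continue
        # A field counts as missing if absent, non-string, or empty.
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not isinstance(record.get(field), str) or not record[field]
        ]
        if missing:
            errors.append(f"line {number}: missing or empty {', '.join(missing)}")
        elif record["chosen"] == record["rejected"]:
            # Extra check (assumption): identical pairs carry no preference signal.
            errors.append(f"line {number}: chosen and rejected are identical")
    return errors


sample_errors = validate_preference_lines([
    '{"prompt": "p", "chosen": "good", "rejected": "bad"}',
    '{"prompt": "p", "chosen": "good"}',
    "not json",
])
```

To check a real file, pass the open file object for rl_pref.jsonl in place of the sample list; an empty result means every record has all three fields.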
Next Steps#
- Output projection reference: Output Projections (dpo_preference schema).
- Config schema: Config Schema (llm_judge column type and dpo_preference projection fields).
- Dispatch to a cluster: Dispatch SDG to a Cluster.