Generate Preference Data for DPO#

This example uses the rl_pref.yaml configuration file to generate prompt, chosen, and rejected triples for direct preference optimization (DPO) training. The output feeds data_prep/rl_prep and then rl/nemo_rl/dpo.

How It Works#

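The rl_pref.yaml file registers two aliases for the same model (nvidia/nemotron-3-nano-30b-a3b) at different temperatures: a high-temperature "creative" alias (nvidia-text, temperature 0.9) and a low-temperature "precise" alias (nvidia-text-precise, temperature 0.3). The goal is to produce two distinct responses per prompt so that the judge has a meaningful choice: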

# DPO preference data — two responses per prompt + LLM judge for chosen/rejected.

output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/rl_pref.jsonl
num_records: 100

seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/rl_pref_prompt_seeds.jsonl
  strategy: shuffle
  fields: [prompt]

# Two model aliases: a high-temperature 'creative' model and a low-temperature
# 'precise' model, so the resulting preference pairs are meaningfully distinct.
models:
  - alias: nvidia-text
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.9
      top_p: 1.0
      max_tokens: 1024

  - alias: nvidia-text-precise
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia
    skip_health_check: false
    inference_parameters:
      temperature: 0.3
      top_p: 1.0
      max_tokens: 1024

# `prompt` is supplied automatically by the seed dataset (must match the field
# name in the seed JSONL). No need to declare it here.
columns:
  - name: response_a
    type: llm_text
    model_alias: nvidia-text
    prompt: "Answer the user's question: {{ prompt }}"

  - name: response_b
    type: llm_text
    model_alias: nvidia-text-precise
    prompt: "Answer the user's question: {{ prompt }}"

  - name: judge
    type: llm_judge
    model_alias: nvidia-text
    prompt: |
      Compare two responses for: {{ prompt }}
      A: {{ response_a }}
      B: {{ response_b }}
      Which is more helpful and correct?
    output_format:
      type: object
      properties:
        winner:
          type: string
          enum: [A, B]
      required: [winner]

output_projection:
  type: dpo_preference
  prompt_field: prompt
  response_a_field: response_a
  response_b_field: response_b
  judge_field: judge
  winner_field: winner

For each seed prompt the pipeline:

Generates response_a (high temperature) and response_b (low temperature) independently.
Makes a third LLM call (the judge column, llm_judge type, using the nvidia-text alias) to compare them and return {"winner": "A"} or {"winner": "B"}.
Applies the dpo_preference projection, which maps winner → chosen / rejected and writes {"prompt": "...", "chosen": "...", "rejected": "..."}.
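
The mapping in the last step can be sketched in a few lines of Python. This is an illustration only, not the data designer's implementation: the function name project_dpo_preference is hypothetical, and the sketch assumes that the judge column holds a parsed object such as {"winner": "A"} and that the field names match those in output_projection above.

```python
def project_dpo_preference(record: dict) -> dict:
    """Map a judged record to a prompt/chosen/rejected triple (hypothetical helper)."""
    winner = record["judge"]["winner"]
    if winner not in ("A", "B"):
        # The judge schema only allows "A" or "B"; anything else is rejected here.
        raise ValueError(f"unexpected winner value: {winner!r}")

    # The winning response becomes "chosen", the other one "rejected".
    if winner == "A":
        chosen_key, rejected_key = "response_a", "response_b"
    else:
        chosen_key, rejected_key = "response_b", "response_a"

    return {
        "prompt": record["prompt"],
        "chosen": record[chosen_key],
        "rejected": record[rejected_key],
    }
```

Raising on an unexpected winner value is a design choice of this sketch; how the real projection handles malformed judge output is not specified here.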

Prerequisites#

NVIDIA_API_KEY set in your environment.
A JSONL seed file in which every line is a JSON object with a prompt field. The bundled rl_pref_prompt_seeds.jsonl contains general reasoning prompts. Replace it with domain-specific prompts for targeted preference data.

Procedure#

Preview two records to verify the judge returns valid winner values:

$ nemotron steps run sdg/data_designer -c rl_pref preview=true num_records=2

Generate the dataset. The checked-in rl_pref.yaml defaults to 100 records; the command below overrides that to 500:

$ nemotron steps run sdg/data_designer -c rl_pref num_records=500

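With neither SDG_OUTPUT_DIR nor NEMO_RUN_DIR set, output is written to ./output/sdg/rl_pref.jsonl. If SDG_OUTPUT_DIR is set, the file is written to that directory instead; otherwise, if NEMO_RUN_DIR is set, it goes to $NEMO_RUN_DIR/sdg.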
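
The output location follows from the output_dir interpolation in rl_pref.yaml, which falls back from SDG_OUTPUT_DIR to NEMO_RUN_DIR to the current directory. The sketch below mirrors that fallback chain in plain Python so you can predict where the file lands. The function name resolve_output_path is hypothetical, and the code restates the interpolation rather than calling the configuration loader.

```python
import os
from pathlib import Path


def resolve_output_path(env=None) -> Path:
    """Mirror the output_dir fallback chain from rl_pref.yaml (hypothetical helper)."""
    env = os.environ if env is None else env
    if "SDG_OUTPUT_DIR" in env:
        # An explicit SDG_OUTPUT_DIR is used as-is.
        output_dir = Path(env["SDG_OUTPUT_DIR"])
    elif "NEMO_RUN_DIR" in env:
        # Otherwise the run directory gets an "sdg" subdirectory.
        output_dir = Path(env["NEMO_RUN_DIR"]) / "sdg"
    else:
        # Final fallback: <working directory>/output/sdg.
        output_dir = Path(env.get("PWD", os.getcwd())) / "output" / "sdg"
    return output_dir / "rl_pref.jsonl"
```

Calling resolve_output_path() with no argument reads the current process environment; passing a dict makes the logic easy to check without touching real environment variables.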

Inspect the output. Each line is a preference triple:

{"prompt": "Explain why retrieval-augmented generation can reduce hallucinations.", "chosen": "RAG grounds the model in retrieved documents, so claims are tied to specific passages rather than purely to weights.", "rejected": "RAG is better because it uses more data and is generally smarter than standard models."}

Adapt the Seed File#

Swap seed_dataset.path to point at your own prompt seed file. Each line must be valid JSON with a prompt field:

{"prompt": "Describe the tradeoffs between batch and streaming inference for real-time applications."}
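
If your prompts live in a Python list or another source, a small script can produce a seed file in this shape. The following is a minimal sketch; the function name write_seed_file and the choice to skip blank prompts are assumptions made for illustration.

```python
import json
from pathlib import Path


def write_seed_file(prompts, path):
    """Write one {"prompt": ...} JSON object per line; return the number written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for prompt in prompts:
            text = prompt.strip()
            if not text:
                # Skip blank entries so every line carries a usable prompt.
                continue
            f.write(json.dumps({"prompt": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Using json.dumps for each line, rather than string formatting, keeps quotes and newlines inside a prompt correctly escaped.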

Keep seed prompts representative of the target capability and diverse across difficulty levels. The judge performs better when the two responses differ clearly in quality. Because the judge schema allows only A or B, it cannot report a tie, so near-identical responses can yield arbitrary verdicts; if the preview shows many such pairs or unexpected results, consider widening the temperature gap between the two model aliases.

Downstream Pipeline#

rl_pref.jsonl  →  data_prep/rl_prep  →  rl/nemo_rl/dpo

data_prep/rl_prep tokenizes and prepares the preference pairs, and rl/nemo_rl/dpo consumes the prepared dataset. Before handing off, verify that the prompt, chosen, and rejected fields are present in every record.
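
One way to run that check is a short script over the JSONL file. This is a sketch under stated assumptions: the function name find_invalid_records is hypothetical, and treating empty or non-string values as failures is a stricter rule than mere presence of the fields.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def find_invalid_records(lines):
    """Return (line_number, reason) for every record that fails the check."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A field counts as missing if it is absent, not a string, or blank.
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if missing:
            problems.append((number, "missing or empty: " + ", ".join(missing)))
    return problems
```

To use it, open the generated file and pass the file object directly, for example `with open("output/sdg/rl_pref.jsonl", encoding="utf-8") as f: print(find_invalid_records(f))`; an empty list means every record passed.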

Next Steps#

Output projection reference: Output Projections — dpo_preference schema.
Config schema: Config Schema — llm_judge column type and dpo_preference projection fields.
Dispatch to a cluster: Dispatch SDG to a Cluster.