Direct Preference Optimization in NeMo RL#
Direct Preference Optimization (DPO) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the DPO paper.
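For reference, the objective from the DPO paper can be written as below. Here x is the prompt, y_c and y_r are the chosen and rejected responses, \pi_\theta is the actor, \pi_ref is the frozen reference model, and \beta scales the implicit KL penalty. Mapping \beta to dpo.reference_policy_kl_penalty is an assumption based on that parameter's description; this guide does not state it explicitly.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}
    \left[
      \log \sigma\!\left(
        \beta \left(
          \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
          - \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}
        \right)
      \right)
    \right]
```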
Launch a DPO Run#
The script examples/run_dpo.py can be used to launch a DPO experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the cluster documentation.
Be sure to launch the job using uv. The command to launch a DPO job is as follows:
uv run examples/run_dpo.py --config <PATH TO YAML CONFIG> <OVERRIDES>
If --config is not specified, it defaults to examples/configs/dpo.yaml.
Configuration#
NeMo RL allows users to configure DPO experiments using YAML config files. An example DPO configuration file can be found at examples/configs/dpo.yaml.
To override a value in the config, either update the value in the yaml file directly, or pass the override via the command line. For example:
uv run examples/run_dpo.py \
    cluster.gpus_per_node=8 \
    dpo.sft_loss_weight=0.1 \
    dpo.preference_average_log_probs=True \
    logger.wandb.name="dpo-dev-8-gpu"
Reminder: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll also need to run huggingface-cli login for Llama models.
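To make the dotted override syntax concrete, here is a minimal sketch of how keys such as dpo.sft_loss_weight=0.1 address nested config values. This is an illustration only and not NeMo RL's actual override parser; the helper names parse_value and apply_override are hypothetical.

```python
from typing import Any


def parse_value(raw: str) -> Any:
    """Best-effort conversion of a command-line string to bool, int, float, or str."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_override(config: dict, override: str) -> None:
    """Apply one `a.b.c=value` override to a nested dict, in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate sections that do not exist yet.
        node = node.setdefault(key, {})
    node[leaf] = parse_value(raw_value)


# Stand-in for values loaded from a YAML file (hypothetical contents).
config = {"cluster": {"gpus_per_node": 1}, "dpo": {"sft_loss_weight": 0.0}}

overrides = [
    "cluster.gpus_per_node=8",
    "dpo.sft_loss_weight=0.1",
    "dpo.preference_average_log_probs=True",
    "logger.wandb.name=dpo-dev-8-gpu",
]
for item in overrides:
    apply_override(config, item)
```

Each segment before the last dot selects a nested section, and the final segment names the value to replace, which is why the same key path works both in the YAML file and on the command line.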
Datasets#
DPO datasets in NeMo RL are encapsulated using classes. Each DPO data class is expected to have the following attributes:
dataset: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below.
task_name: A string identifier that uniquely identifies the dataset.
If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. An example implementation can be found in preference_datasets/tulu3.py.
Note: The task_name field is required in each formatted example.
{
  "context": [], // list of dicts - The prompt message (including previous turns, if any)
  "completions": [ // list of dicts - The list of completions
    {
      "rank": 0, // int - The rank of the completion (lower rank is preferred)
      "completion": [] // list of dicts - The completion message(s)
    },
    {
      "rank": 1, // int - The rank of the completion (lower rank is preferred)
      "completion": [] // list of dicts - The completion message(s)
    }
  ],
  "task_name": "task_name" // identifier for the task
}
DPO training supports only two completions (where the lowest rank is preferred and the highest one is rejected), with each completion being a single response. For example:
{
  "context": [
    {
      "role": "user",
      "content": "What's the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    {
      "role": "user",
      "content": "Thanks! And what's the capital of Germany?"
    }
  ],
  "completions": [
    {
      "rank": 0,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Berlin."
        }
      ]
    },
    {
      "rank": 1,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Munich."
        }
      ]
    }
  ],
  "task_name": "task_name"
}
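A preprocessing step that produces the format above can be quite small. The sketch below assumes a raw record with the hypothetical keys question, good, and bad; the function name to_dpo_example is likewise illustrative and is not part of NeMo RL.

```python
from typing import Any


def to_dpo_example(
    context: list, chosen: str, rejected: str, task_name: str
) -> dict[str, Any]:
    """Build one formatted example: rank 0 is preferred, rank 1 is rejected."""
    return {
        "context": list(context),
        "completions": [
            {"rank": 0, "completion": [{"role": "assistant", "content": chosen}]},
            {"rank": 1, "completion": [{"role": "assistant", "content": rejected}]},
        ],
        "task_name": task_name,
    }


# Hypothetical raw record; real datasets will use their own field names.
raw = {
    "question": "What's the capital of Germany?",
    "good": "The capital of Germany is Berlin.",
    "bad": "The capital of Germany is Munich.",
}

example = to_dpo_example(
    context=[{"role": "user", "content": raw["question"]}],
    chosen=raw["good"],
    rejected=raw["bad"],
    task_name="my_preference_task",
)
```

For multi-turn data, pass the earlier user and assistant turns in context and keep only the final assistant response in each completion.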
By default, NeMo RL has support for HelpSteer3 and Tulu3Preference datasets. Both of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there’s no need to provide a path to any datasets on disk.
We provide a PreferenceDataset class that loads JSONL-formatted preference datasets from a local path or from HuggingFace. To use such a custom preference dataset, modify your config as follows:
data:
  # other data settings, see `examples/configs/dpo.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset sets data_path, subset, and split and uses the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: PreferenceDataset
    prompt_file: null
    system_prompt_file: null
  # multiple validation sets are supported by using val_data_paths
  # this will be removed after refactor
  val_data_paths:
    <NameOfValidationDataset1>: /path/to/local/val_dataset_1.jsonl
    <NameOfValidationDataset2>: /path/to/local/val_dataset_2.jsonl
Your JSONL files should contain one JSON object per line with the following structure:
{
  "context": [{"role": "user", "content": "What is 2+2?"}], // list of dicts - The prompt message (including previous turns, if any)
  "completions": [ // list of dicts - The list of completions
    {
      "rank": 0, // int - The rank of the completion (lower rank is preferred)
      "completion": [{"role": "assistant", "content": "The answer is 4."}] // list of dicts - The completion message(s)
    },
    {
      "rank": 1, // int - The rank of the completion (lower rank is preferred)
      "completion": [{"role": "assistant", "content": "I don't know."}] // list of dicts - The completion message(s)
    }
  ]
}
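Before training, it can help to check a file against the structure above. The sketch below validates a single JSONL line; the function name validate_preference_line is hypothetical and the checks are a reasonable reading of the format, not NeMo RL's own validation. Note that standard JSON does not allow // comments, so the annotations shown above must not appear in the actual file, and each object must sit on one line.

```python
import json


def validate_preference_line(line: str) -> dict:
    """Parse one JSONL line and check it matches the preference format."""
    record = json.loads(line)
    if not isinstance(record.get("context"), list):
        raise ValueError("'context' must be a list of message dicts")
    completions = record.get("completions")
    if not isinstance(completions, list) or len(completions) != 2:
        raise ValueError("'completions' must contain exactly two entries")
    for item in completions:
        if not isinstance(item.get("rank"), int):
            raise ValueError("each completion needs an integer 'rank'")
        if not isinstance(item.get("completion"), list) or not item["completion"]:
            raise ValueError("each 'completion' must be a non-empty list of messages")
    if completions[0]["rank"] == completions[1]["rank"]:
        raise ValueError("the two completions must have different ranks")
    return record


good_line = json.dumps(
    {
        "context": [{"role": "user", "content": "What is 2+2?"}],
        "completions": [
            {"rank": 0, "completion": [{"role": "assistant", "content": "The answer is 4."}]},
            {"rank": 1, "completion": [{"role": "assistant", "content": "I don't know."}]},
        ],
    }
)
record = validate_preference_line(good_line)
```

To check a whole file, call the function once per non-empty line and report the line number of the first failure.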
We also provide a BinaryPreferenceDataset class, which is a simplified version of PreferenceDataset for pairwise ranked preference with single turn completions. You can use prompt_key, chosen_key and rejected_key to specify which fields in your data correspond to the question, chosen answer and rejected answer respectively. Here’s an example configuration:
data:
  # other data settings, see `examples/configs/dpo.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override prompt_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    prompt_key: context
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
Your JSONL files should contain one JSON object per line with the following structure:
{
  "prompt": "What is 2+2?", // <prompt_key>: <prompt_content>
  "chosen": "The answer is 4.", // <chosen_key>: <chosen_content>
  "rejected": "I don't know." // <rejected_key>: <rejected_content>
}
Please note:
If you are using a logger, the prefix used for each validation set will be validation-<NameOfValidationDataset>. The total validation time, summed across all validation sets, is reported under timing/validation/total_validation_time.
If you are doing checkpointing, the metric_name value in your checkpointing config should reflect the metric and validation set to be tracked. For example, validation-<NameOfValidationDataset1>_loss.
DPO-Specific Parameters#
The DPO implementation in NeMo RL supports several key parameters that can be adjusted:
dpo.reference_policy_kl_penalty: Controls the strength of the KL penalty term
dpo.preference_loss_weight: Weight for the preference loss
dpo.sft_loss_weight: Weight for the auxiliary SFT loss
dpo.preference_average_log_probs: Whether to average log probabilities over tokens in the preference loss term
dpo.sft_average_log_probs: Whether to average log probabilities over tokens in the SFT loss term
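The following sketch shows one plausible way these five parameters interact for a single preference pair. It is an assumption for illustration, not NeMo RL's implementation: it takes the total loss to be a weighted sum of the preference and SFT terms, treats reference_policy_kl_penalty as the scale on the log-probability margin, and omits batching and masking. The helper names are hypothetical.

```python
import math


def sequence_logprob(token_logprobs: list, average: bool) -> float:
    """Sum per-token log probabilities, or average them when requested."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if average else total


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen, policy_rejected, ref_chosen, ref_rejected, *,
    reference_policy_kl_penalty, preference_loss_weight, sft_loss_weight,
    preference_average_log_probs, sft_average_log_probs,
):
    """Weighted sum of a preference term and an auxiliary SFT term (assumed form)."""
    avg = preference_average_log_probs

    def reward(policy, ref):
        # Scaled log-ratio of policy to reference for one response.
        return reference_policy_kl_penalty * (
            sequence_logprob(policy, avg) - sequence_logprob(ref, avg)
        )

    margin = reward(policy_chosen, ref_chosen) - reward(policy_rejected, ref_rejected)
    preference_loss = -log_sigmoid(margin)
    # SFT term: negative log-likelihood of the chosen response under the policy.
    sft_loss = -sequence_logprob(policy_chosen, sft_average_log_probs)
    return preference_loss_weight * preference_loss + sft_loss_weight * sft_loss


# Policy equals reference, so the margin is zero and the preference term is log(2).
chosen_lp, rejected_lp = [-1.0, -3.0], [-2.0, -2.0, -2.0]
loss = dpo_loss(
    chosen_lp, rejected_lp, chosen_lp, rejected_lp,
    reference_policy_kl_penalty=0.05, preference_loss_weight=1.0,
    sft_loss_weight=0.1, preference_average_log_probs=False,
    sft_average_log_probs=True,
)
```

Averaging log probabilities removes the dependence on response length, which matters when chosen and rejected responses differ greatly in token count.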
These parameters can be adjusted in the config file or via command-line overrides to optimize training for your specific use case.
Optimizations#
Chunked Linear Cross-Entropy Fusion Loss#
During standard DPO training the model materializes a full logit tensor of shape [batch_size, seq_length, vocab_size] for both the policy forward-backward pass and the reference model logprob computation. This can cause out-of-memory (OOM) errors for long sequences or large vocabularies. The chunked linear cross-entropy fusion loss avoids this by computing log probabilities directly from the hidden states: it chunks the sequence dimension, projects each chunk to logits on the fly, gathers per-token log probabilities, and discards the logits before moving to the next chunk.
Benefits:
Extends the maximum trainable sequence length significantly by eliminating the large logit tensor from GPU memory.
Applies to both the training forward-backward pass and the reference model logprob computation.
Produces numerically equivalent loss values to the standard path.
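The chunking idea can be illustrated in plain Python. This is a toy sketch under stated assumptions: a single sequence, no batch dimension, no backward pass, and lists in place of GPU tensors. It does not reproduce the fused Megatron kernel; it only shows that projecting one chunk at a time yields the same per-token log probabilities as materializing every logit row at once. All function names are hypothetical.

```python
import math


def log_softmax(logits: list) -> list:
    """Stable log-softmax over one row of vocabulary logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def project(hidden_rows: list, weight: list) -> list:
    """Map hidden states [n][hidden] to logits [n][vocab] using weight [vocab][hidden]."""
    return [
        [sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
        for row in hidden_rows
    ]


def full_token_logprobs(hidden, weight, targets):
    """Standard path: build the whole [seq, vocab] logit table first."""
    logits = project(hidden, weight)
    return [log_softmax(row)[t] for row, t in zip(logits, targets)]


def chunked_token_logprobs(hidden, weight, targets, chunk_size):
    """Chunked path: only chunk_size logit rows exist at any one time."""
    result = []
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        chunk_logits = project(hidden[start:stop], weight)
        result.extend(
            log_softmax(row)[t] for row, t in zip(chunk_logits, targets[start:stop])
        )
        # chunk_logits is replaced on the next iteration, so it can be freed.
    return result


# Toy inputs: 5 tokens, hidden size 3, vocabulary size 4.
hidden = [[0.1 * (i + 1), -0.2 * i, 0.05 * i] for i in range(5)]
weight = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.3], [-0.2, 0.1, 0.5], [0.1, 0.1, 0.1]]
targets = [0, 3, 1, 2, 0]

full = full_token_logprobs(hidden, weight, targets)
chunked = chunked_token_logprobs(hidden, weight, targets, chunk_size=2)
```

In this sketch the peak logit storage drops from seq_length rows to chunk_size rows, which is the same trade-off that linear_ce_fusion_chunk_size controls.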
How to enable:
Add the following to your Megatron config in your YAML file:
policy:
  megatron_cfg:
    enabled: true
    use_linear_ce_fusion_loss: true
    linear_ce_fusion_chunk_size: 256 # tokens per chunk; smaller = less memory, larger = more throughput
Notes:
Context parallelism is not supported when linear CE fusion is enabled.
Sequence packing is not supported with DPO regardless of this setting (see #719).
The linear_ce_fusion_chunk_size parameter controls the trade-off between memory savings and compute throughput. The default value of 256 is a good starting point.
Evaluate the Trained Model#
Upon completion of the training process, you can refer to our evaluation guide to assess model capabilities.