dpo#

This step runs direct preference optimization (DPO) alignment on NeMo-RL. The step consumes a preference dataset of chosen-and-rejected response pairs, together with a supervised fine-tuning (SFT) Megatron checkpoint that serves as the warm-start policy. The step produces an aligned checkpoint_megatron artifact.

Syntax#

nemotron steps run rl/nemo_rl/dpo \
    [-c <config-name-or-path>] \
    [-r <run-profile> | -b <batch-profile>] \
    [-d] \
    [--force-squash] \
    [<dotlist-overrides>...] \
    [<passthrough-args>...]

Refer to the Nemotron Steps CLI Reference for the shared flag set.

Configuration Files#

The step ships two configuration files under src/nemotron/steps/rl/nemo_rl/dpo/config/.

File	Purpose
`default.yaml`	Full-shape DPO with a five-hundred-step training schedule and `meta-llama/Llama-3.2-1B-Instruct` as the policy.
`tiny.yaml`	Short validation run with a small dataset slice.

Pass the configuration name with -c:

$ nemotron steps run rl/nemo_rl/dpo -c tiny
$ nemotron steps run rl/nemo_rl/dpo -c default

Inputs and Outputs#

Direction	Artifact Type	Required	Description
Consumes	`training_jsonl`	Yes	Preference JSON Lines with `prompt`, `chosen`, and `rejected` fields.
Consumes	`checkpoint_megatron`	Yes	The supervised fine-tuned policy to align.
Produces	`checkpoint_megatron`	—	The DPO-aligned policy checkpoint.

Step Parameters#

The manifest declares one DPO-specific parameter. Pass it as a dotlist override.

dpo.reference_policy_kl_penalty=<float>#

The Kullback-Leibler (KL) penalty between the trained policy and the reference policy. This value corresponds to β in the DPO objective. Lower values let the policy drift further from the reference; higher values keep the policy close to the reference.

Default: 0.05.

Example: dpo.reference_policy_kl_penalty=0.1

Frequently used dotlist overrides drawn from the NeMo-RL DPO recipe include the following.

policy.model_name=<id-or-path>#

The Hugging Face identifier or local path for the policy. The default configuration reads this value from the RL_POLICY_MODEL environment variable.

Example: policy.model_name=meta-llama/Llama-3.2-3B-Instruct

dpo.max_num_steps=<n>#

The maximum number of training steps.

Example: dpo.max_num_steps=1000

policy.train_global_batch_size=<n>#

The global batch size across all data-parallel workers.

Example: policy.train_global_batch_size=64

policy.optimizer.lr=<float>#

The optimizer learning rate.

Example: policy.optimizer.lr=1.0e-6

Strategies#

The manifest records one operator strategy for rl/nemo_rl/dpo.

When the loss diverges or the Kullback-Leibler divergence collapses, raise dpo.reference_policy_kl_penalty to a value in the 0.1-0.3 range, or lower the learning rate.

Command Examples#

Run the tiny validation configuration locally:

$ nemotron steps run rl/nemo_rl/dpo -c tiny

Compile the default configuration without submitting the job:

$ nemotron steps run rl/nemo_rl/dpo -c default --dry-run

Submit an attached run on a Lepton profile with a tighter Kullback-Leibler penalty:

$ nemotron steps run rl/nemo_rl/dpo -c default -r lepton_rl_nemo_rl_dpo \
    dpo.reference_policy_kl_penalty=0.2 \
    dpo.max_num_steps=1000

Submit a detached run on a Slurm profile with a larger policy:

$ nemotron steps run rl/nemo_rl/dpo -c default -b slurm_rl_nemo_rl_dpo \
    policy.model_name=meta-llama/Llama-3.2-3B-Instruct \
    policy.train_global_batch_size=64 \
    policy.optimizer.lr=1.0e-6