Reinforcement Learning Steps

This section documents the reinforcement learning (RL) alignment steps registered under src/nemotron/steps/rl/nemo_rl/. All three steps run on NeMo-RL, consume a supervised fine-tuning (SFT) Megatron checkpoint as the warm-start policy, and produce an aligned checkpoint_megatron artifact.

Steps

rl/nemo_rl/dpo

Direct preference optimization with preference pairs.

rl/nemo_rl/rlvr

Reinforcement learning with verifiable rewards via group relative policy optimization.

rl/nemo_rl/rlhf

Reinforcement learning from human feedback with a generative reward model judge.

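For orientation, the two objectives named above (direct preference optimization and group relative policy optimization) can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook formulas, not the NeMo-RL implementation: the function names, the `beta` default, and the `eps` constant are all assumptions made for this example, and real training operates on batched tensors rather than Python floats.

```python
import math
from statistics import mean, pstdev


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Inputs are summed sequence log-probabilities under the policy and
    under the frozen reference (here, the SFT warm-start) model.
    """
    # Scaled difference of log-ratios between chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of sampled responses.

    Each reward is normalized by the group's mean and (population)
    standard deviation; eps (an assumed constant) avoids division by zero
    when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_pair_loss(-1.0, -1.0, -1.0, -1.0))
    # Verifiable 0/1 rewards for four sampled responses to one prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In the RLVR setting the rewards passed to `group_relative_advantages` would come from a verifier (for example, a pass/fail check), while in the RLHF setting they would come from the reward model judge; the normalization step is the same in both cases under this sketch.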