Reinforcement Learning Steps
This section documents the reinforcement learning (RL) alignment steps registered under src/nemotron/steps/rl/nemo_rl/.
All three steps run on NeMo-RL, consume a supervised fine-tuning (SFT) Megatron checkpoint as the warm-start policy, and produce an aligned checkpoint_megatron artifact.
Steps
rl/nemo_rl/dpo
Direct preference optimization (DPO) on preference pairs.
rl/nemo_rl/rlvr
Reinforcement learning with verifiable rewards (RLVR) via group relative policy optimization (GRPO).
rl/nemo_rl/rlhf
Reinforcement learning from human feedback (RLHF), with a generative reward model acting as the judge.
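To make the "group relative" idea behind the rl/nemo_rl/rlvr step more concrete, the following is a minimal sketch of one common way to compute group-relative advantages: several responses are sampled for the same prompt, each gets a scalar reward, and each reward is normalized against the group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` parameter, and the example rewards are hypothetical and chosen for illustration only. This is not NeMo-RL's implementation or API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of responses sampled for a single prompt.

    Hypothetical helper for illustration; not part of NeMo-RL.
    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four sampled responses to one prompt, scored by a
# verifiable 0/1 reward (for instance, whether the final answer checks out).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every response in a group gets the same reward, all advantages are zero.
uniform_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

Under this formulation, responses that score above the group average receive a positive advantage and those below receive a negative one, so no separately trained value model is needed to produce a baseline.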