Choose an RL Alignment Step#

Post-training alignment with NeMo RL is split into three steps under rl/nemo_rl/. The table below uses short names: direct preference optimization (DPO), reinforcement learning with verifiable rewards (RLVR) paired with group relative policy optimization (GRPO), and reinforcement learning from human feedback (RLHF). Choose a step based on how the reward signal enters training, not on model family alone.

Options#

Step id	Reward source	Typical data shape	Output
`rl/nemo_rl/dpo`	Static preference pairs	Prompt with chosen and rejected completions	`checkpoint_megatron`
`rl/nemo_rl/rlvr`	Verifiable or programmatic checks	Prompt with answers, tests, or environment metadata	`checkpoint_megatron`
`rl/nemo_rl/rlhf`	Learned reward or judge model	Prompts plus a reward model checkpoint	`checkpoint_megatron`

All three steps consume a warm-start policy in checkpoint_megatron format produced by Megatron-style supervised fine-tuning (SFT). They do not train a policy from scratch.

Decision Flow#

If you only have pairwise preferences and no online reward, use rl/nemo_rl/dpo.
If reward is deterministic, for example unit tests, answer match, or tool success, use rl/nemo_rl/rlvr.
If a separate reward model or judge produces scores, use rl/nemo_rl/rlhf.
For resource-server rewards or NeMo Gym style rewards, use the RLVR or RLHF configuration paths documented in each step's SKILL.md and YAML files. Some flows use config/nemo_gym.yaml.
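The decision flow above can be written down as a small lookup. The following is a minimal sketch for illustration only: the `RewardSource` enum and the `choose_step` helper are hypothetical names, not part of NeMo RL or the `nemotron` CLI. Only the step ids come from the table.

```python
from enum import Enum


class RewardSource(Enum):
    """Hypothetical labels for how the reward signal enters training."""

    STATIC_PREFERENCES = "static_preferences"  # pairwise chosen/rejected data, no online reward
    VERIFIABLE = "verifiable"                  # unit tests, answer match, tool success
    REWARD_MODEL = "reward_model"              # separate reward model or judge produces scores


# Step ids are taken from the Options table; the mapping mirrors the decision flow.
STEP_FOR_REWARD = {
    RewardSource.STATIC_PREFERENCES: "rl/nemo_rl/dpo",
    RewardSource.VERIFIABLE: "rl/nemo_rl/rlvr",
    RewardSource.REWARD_MODEL: "rl/nemo_rl/rlhf",
}


def choose_step(source: RewardSource) -> str:
    """Return the step id that matches the given reward source."""
    return STEP_FOR_REWARD[source]


if __name__ == "__main__":
    for source in RewardSource:
        print(f"{source.value}: {choose_step(source)}")
```

Resource-server and NeMo Gym style rewards are deliberately left out of the mapping, because the choice between RLVR and RLHF for those depends on the configuration paths documented for each step.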

Data Preparation#

When preference JSONL still contains Hugging Face placeholders or needs sharding resolution, run the RL prep step upstream. Inspect data_prep/rl_prep in the step tree. Read the manifest for your chosen rl/nemo_rl/... step to see which input types it consumes.
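Before running a step, it can help to confirm that each preference record has the expected shape. The sketch below assumes the JSONL uses flat string fields named `prompt`, `chosen`, and `rejected`. Those field names are an assumption based on the data shape in the Options table, not a documented schema, and `find_bad_records` is a hypothetical helper. It does not detect Hugging Face placeholders, whose format is not described here.

```python
import json

# Assumed field names; check the step manifest for the real schema.
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def find_bad_records(lines):
    """Return (line_number, problem) pairs for records that fail basic checks."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append((lineno, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            bad.append((lineno, "not an object"))
            continue
        # A key counts as missing if it is absent, not a string, or empty.
        missing = [
            key for key in REQUIRED_KEYS
            if not isinstance(record.get(key), str) or not record[key]
        ]
        if missing:
            bad.append((lineno, "missing or empty: " + ", ".join(missing)))
    return bad


if __name__ == "__main__":
    sample = [
        '{"prompt": "2+2?", "chosen": "4", "rejected": "5"}',
        '{"prompt": "2+2?", "chosen": "4"}',
    ]
    for lineno, problem in find_bad_records(sample):
        print(f"line {lineno}: {problem}")
```

To check a file, pass an open file handle, for example `find_bad_records(open(path, encoding="utf-8"))`, where `path` points at your own JSONL.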

Sample Commands#

$ uv run nemotron steps run rl/nemo_rl/dpo -c tiny
$ uv run nemotron steps run rl/nemo_rl/rlvr -c tiny
$ uv run nemotron steps run rl/nemo_rl/rlhf -c tiny
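To smoke-test all three steps from a script, the sample commands can be assembled programmatically. This is a minimal sketch: `build_command` and `run_step` are hypothetical helpers, and the only CLI arguments used are the ones shown in the commands above.

```python
import shlex
import subprocess

STEP_IDS = ("rl/nemo_rl/dpo", "rl/nemo_rl/rlvr", "rl/nemo_rl/rlhf")


def build_command(step_id, config="tiny"):
    """Build the argument list for one sample command."""
    return ["uv", "run", "nemotron", "steps", "run", step_id, "-c", config]


def run_step(step_id, config="tiny"):
    """Run one step and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_command(step_id, config), check=True)


if __name__ == "__main__":
    # Print the commands rather than running them; call run_step to execute.
    for step_id in STEP_IDS:
        print(shlex.join(build_command(step_id)))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops the script at the first failing step.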

Success Criteria#

You validate reward design on a small batch before you scale rollout count.
You track Kullback–Leibler (KL) drift, reward variance, response length, and held-out task metrics. Average reward alone is not sufficient.
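The tracked quantities can be summarized per rollout batch with a few lines of standard-library Python. The sketch below is illustrative and not how NeMo RL computes or logs its metrics: `rollout_summary` is a hypothetical helper, and the KL value is a simple Monte Carlo estimate, the mean of log π(x) − log π_ref(x) over tokens sampled from the current policy.

```python
from statistics import mean, pvariance


def rollout_summary(rewards, response_lengths, policy_logprobs, reference_logprobs):
    """Summarize one batch of rollouts.

    rewards:            one scalar reward per response
    response_lengths:   one token count per response
    policy_logprobs:    log-probabilities of sampled tokens under the current policy
    reference_logprobs: log-probabilities of the same tokens under the reference policy
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # Per-token log-ratio; its mean estimates KL(policy || reference)
    # when the tokens were sampled from the policy.
    log_ratios = [p - r for p, r in zip(policy_logprobs, reference_logprobs)]
    return {
        "reward_mean": mean(rewards),
        "reward_variance": pvariance(rewards),
        "mean_response_length": mean(response_lengths),
        "kl_estimate": mean(log_ratios),
    }


if __name__ == "__main__":
    summary = rollout_summary(
        rewards=[0.0, 1.0, 1.0, 0.0],
        response_lengths=[10, 20, 30, 40],
        policy_logprobs=[-1.0, -2.0],
        reference_logprobs=[-1.5, -2.5],
    )
    for name, value in summary.items():
        print(f"{name}: {value}")
```

Watching these values together makes it easier to spot a run where average reward rises while KL drift or response length grows out of proportion.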