Choose an RL Alignment Step#
Post-training alignment with NeMo RL is split into three steps under rl/nemo_rl/. The table uses short names: direct preference optimization (DPO), reinforcement learning with verifiable rewards (RLVR) paired with group relative policy optimization (GRPO), and reinforcement learning from human feedback (RLHF). Choose a step based on how the reward signal enters training, not based on model family alone.
Options#
| Step id | Reward source | Typical data shape | Output |
|---|---|---|---|
| rl/nemo_rl/dpo | Static preference pairs | Prompt with chosen and rejected completions | |
| rl/nemo_rl/rlvr | Verifiable or programmatic checks | Prompt with answers, tests, or environment metadata | |
| rl/nemo_rl/rlhf | Learned reward or judge model | Prompts plus a reward model checkpoint | |
All three steps consume a warm-start policy in checkpoint_megatron format produced by Megatron-style supervised fine-tuning (SFT). None of them trains a policy from scratch.
Decision Flow#
If you only have pairwise preferences and no online reward, use rl/nemo_rl/dpo.
If reward is deterministic, for example unit tests, answer match, or tool success, use rl/nemo_rl/rlvr.
If a separate reward model or judge produces scores, use rl/nemo_rl/rlhf.
For resource-server rewards or NeMo Gym style rewards, use the RLVR or RLHF configuration paths documented in each step's SKILL.md file and YAML file. Some flows use config/nemo_gym.yaml.
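The decision flow above can be sketched as a small lookup. This is an illustrative helper, not part of NeMo RL: the `RewardSetup` class and `choose_step` function are hypothetical names, and the precedence used when more than one reward source is available (deterministic check first, then reward model) is an assumption the flow does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardSetup:
    """Hypothetical description of how the reward signal enters training."""

    has_preference_pairs: bool = False     # static chosen/rejected pairs
    has_deterministic_check: bool = False  # unit tests, answer match, tool success
    has_reward_model: bool = False         # learned reward or judge model


def choose_step(setup: RewardSetup) -> str:
    """Map a reward setup to one of the three step ids named in this page."""
    # Assumption: an online, deterministic reward takes precedence when present.
    if setup.has_deterministic_check:
        return "rl/nemo_rl/rlvr"
    if setup.has_reward_model:
        return "rl/nemo_rl/rlhf"
    # DPO applies when only pairwise preferences exist and there is no online reward.
    if setup.has_preference_pairs:
        return "rl/nemo_rl/dpo"
    raise ValueError("no reward source described; cannot choose a step")


if __name__ == "__main__":
    print(choose_step(RewardSetup(has_preference_pairs=True)))
```

Resource-server and NeMo Gym style rewards are left out of the sketch because their configuration path depends on the step's own SKILL.md and YAML files.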
Data Preparation#
When preference JSONL still contains Hugging Face placeholders or needs shard resolution, run the RL prep step upstream: inspect data_prep/rl_prep in the step tree. Then read the manifest of your chosen rl/nemo_rl/... step to confirm the required consumes types.
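Before running prep, a quick shape check on the preference file can show whether records are usable at all. The sketch below assumes each JSONL record is an object with string fields named `prompt`, `chosen`, and `rejected`; those field names are an assumption, so check the step manifest for the actual schema. It does not detect Hugging Face placeholders, whose format is not described here.

```python
import json
from typing import Iterable

# Assumed field names; the real schema is defined by the step manifest.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def find_bad_records(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, reason) for every record that fails the shape check."""
    problems: list[tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty field: {field}"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"prompt": "2+2?", "chosen": "4", "rejected": "5"}',
        '{"prompt": "2+2?", "chosen": ""}',
    ]
    for number, reason in find_bad_records(sample):
        print(f"line {number}: {reason}")
```

To check a file on disk, pass an open file handle, for example `find_bad_records(open(path, encoding="utf-8"))`, since iterating a text file yields one line at a time.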
Sample Commands#
$ uv run nemotron steps run rl/nemo_rl/dpo -c tiny
$ uv run nemotron steps run rl/nemo_rl/rlvr -c tiny
$ uv run nemotron steps run rl/nemo_rl/rlhf -c tiny
Success Criteria#
You validate reward design on a small batch before you scale rollout count.
You track Kullback–Leibler (KL) drift, reward variance, response length, and held-out task metrics. Average reward alone is not sufficient.
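As a rough illustration of the second criterion, the sketch below summarizes one batch of rollouts. It is not a NeMo RL API: `rollout_summary` and its parameters are hypothetical, and it assumes you can export, for each sampled response, its reward, its token length, and its summed log-probability under both the current policy and the warm-start reference policy. The KL value is a simple Monte Carlo estimate, the mean of the log-probability difference over responses sampled from the policy. Held-out task metrics need a separate evaluation run and are not covered.

```python
import statistics
from typing import Sequence


def rollout_summary(
    rewards: Sequence[float],
    response_lengths: Sequence[int],
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
) -> dict[str, float]:
    """Summarize reward, length, and KL drift for one batch of rollouts."""
    sizes = {
        len(rewards),
        len(response_lengths),
        len(policy_logprobs),
        len(reference_logprobs),
    }
    if len(sizes) != 1:
        raise ValueError("all inputs must have one entry per response")
    if not rewards:
        raise ValueError("batch is empty")

    # Monte Carlo estimate of KL(policy || reference) from policy samples.
    log_ratios = [p - q for p, q in zip(policy_logprobs, reference_logprobs)]

    return {
        "reward_mean": statistics.fmean(rewards),
        # Population variance: near zero can mean the reward no longer discriminates.
        "reward_variance": statistics.pvariance(rewards),
        "response_length_mean": statistics.fmean(response_lengths),
        "kl_estimate": statistics.fmean(log_ratios),
    }


if __name__ == "__main__":
    summary = rollout_summary(
        rewards=[1.0, 0.0, 1.0, 0.0],
        response_lengths=[10, 20, 30, 40],
        policy_logprobs=[-1.0, -2.0, -3.0, -4.0],
        reference_logprobs=[-1.5, -2.5, -3.5, -4.5],
    )
    for name, value in summary.items():
        print(f"{name}: {value:.3f}")
```

Logging these values together makes it easier to spot a rising average reward that comes with growing KL drift or response length, which is the case average reward alone would hide.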