RLHF (Stage 3)#
Reinforcement Learning from Human Feedback is the final RL stage, run after RLVR and SWE-RL to improve model behavior and interaction quality.
Approach#
This stage uses a large Generative Reward Model (GenRM) to provide supervision during RL. The GenRM is a principle-following model trained as described in RL-BFF, which makes it possible to guide Nemotron Super 3’s behavior in important domains such as identity and safety-related topics.
GenRM Model#
| Parameter | Value |
|---|---|
| Initialization | Qwen3-235B-A22B-Thinking-2507 |
| Training Data | HelpSteer 3 + commercially friendly subsets of lmarena-140k + recently collected human preference data |
| Approach | Principle-following GenRM for guiding behavior on identity and safety domains |
The GenRM is used both throughout the multi-environment RL stage (RLVR) and as the primary reward signal in this RLHF-only stage.
KL Penalty#
Unlike the RLVR and SWE stages (which use KL=0), RLHF applies a KL penalty of 1e-4 to prevent the model from drifting too far from the reference policy. This is critical for maintaining the capabilities learned in prior stages while improving interaction quality.
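To make the role of the coefficient concrete, here is a minimal sketch of a KL-penalized loss. It is not taken from the NeMo RL code: the function names, the per-token `exp(d) - d - 1` estimator, and the plain mean over tokens are all assumptions made for illustration. Only the value `1e-4` comes from this page.

```python
import math

KL_COEF = 1e-4  # KL penalty used in the RLHF stage (0 in the RLVR and SWE stages)


def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Non-negative per-token KL estimate (ASSUMED estimator, for illustration).

    With d = logp_ref - logp_policy, exp(d) - d - 1 is zero when the two
    log-probabilities agree and positive otherwise.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def penalized_loss(policy_loss, logps_policy, logps_ref, kl_coef=KL_COEF):
    """Add kl_coef * mean per-token KL estimate to a policy loss (hypothetical helper)."""
    kls = [kl_estimate(p, r) for p, r in zip(logps_policy, logps_ref)]
    mean_kl = sum(kls) / len(kls)
    return policy_loss + kl_coef * mean_kl, mean_kl


# Toy log-probabilities for three sampled tokens under the policy and the reference.
loss, mean_kl = penalized_loss(0.25, [-1.0, -2.0, -0.5], [-1.2, -1.5, -0.5])
print(f"mean KL estimate = {mean_kl:.4f}, penalized loss = {loss:.6f}")
```

With a coefficient this small the penalty barely changes the loss on any single step. It acts as a gentle, persistent pull toward the reference policy rather than a hard constraint.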
Configuration#
| Parameter | Value |
|---|---|
| Nodes | 72 (576 GPUs) |
| Generation nodes | 32 (colocated=false) |
| NeMo Gym GPU nodes | 8 |
| Prompts/step | 128 |
| Generations/prompt | 16 |
| Batch size | 2,048 |
| Max sequence length | 49,152 |
| TP / CP | 4 / 4 |
| Learning rate | 1e-6 |
| KL penalty | 1e-4 |
| Overlong filtering | false |
| GenRM router DP size | 8 |
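The figures in the table are mutually consistent, which a few lines of Python can confirm. The sketch below uses only numbers from the table, and the variable names are illustrative. The last calculation assumes that the nodes not used for generation or NeMo Gym are the ones left for training, which this page does not state.

```python
# Values copied from the configuration table above.
total_nodes = 72
total_gpus = 576
generation_nodes = 32
gym_nodes = 8
prompts_per_step = 128
generations_per_prompt = 16
batch_size = 2048

# 576 GPUs across 72 nodes implies 8 GPUs per node.
gpus_per_node = total_gpus // total_nodes

# Each step samples 16 generations for each of 128 prompts.
rollouts_per_step = prompts_per_step * generations_per_prompt

# ASSUMPTION: generation, NeMo Gym, and training nodes partition the total,
# so whatever is left over would be available for training.
remaining_nodes = total_nodes - generation_nodes - gym_nodes

print(f"GPUs per node:      {gpus_per_node}")
print(f"Rollouts per step:  {rollouts_per_step} (batch size {batch_size})")
print(f"Remaining nodes:    {remaining_nodes}")
```

The batch size of 2,048 is exactly the number of prompts per step multiplied by the number of generations per prompt.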
Environments#
The RLHF stage uses GenRM comparison as the primary reward signal, along with tool use evaluation:
- `genrm_compare`: Pairwise comparison using the GenRM model with principle-following prompts
- `single_step_tool_use_with_argument_comparison`: Tool use correctness
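One way to picture how pairwise comparisons could become per-response rewards is a simple win-rate aggregation over all pairs of generations for a prompt. This is a hypothetical sketch, not the `genrm_compare` implementation: the `prefer` callback stands in for a GenRM judgment, and the all-pairs scheme and win-rate scoring are assumptions.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def win_rate_rewards(
    responses: Sequence[str],
    prefer: Callable[[str, str], bool],
) -> List[float]:
    """Score each response by the fraction of pairwise comparisons it wins.

    `prefer(a, b)` is a HYPOTHETICAL judge that returns True when `a` is
    preferred over `b`; in practice this would be a reward-model call.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefer(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each response takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


# Toy judge for demonstration only: prefers the longer response.
def toy_prefer(a: str, b: str) -> bool:
    return len(a) > len(b)


print(win_rate_rewards(["a", "bbb", "cc"], toy_prefer))  # [0.0, 1.0, 0.5]
```

Under this all-pairs assumption, 16 generations per prompt would require 16 × 15 / 2 = 120 comparisons per prompt, so a real system may well sample pairs more sparingly.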
Config Files#
- `stage3_rlhf/config/default.yaml`: Full-scale 72-node config
- `stage3_rlhf/config/small.yaml`: Reduced 24-node variant for testing
Usage#
Using nemotron CLI#
uv run nemotron super3 rl rlhf --run YOUR-CLUSTER
`--run YOUR-CLUSTER` refers to a profile defined in your `env.toml` file. See the env.toml setup guide for details.
Using super_launch.sh#
EXP_NAME=stage3-rlhf \
CONFIG_PATH=examples/configs/super/stage3_rlhf.yaml \
MODEL_PATH=/path/to/swe2_checkpoint \
TRAIN_PATH=$DATA_DIR/rlhf/train-split.jsonl \
VAL_PATH=$DATA_DIR/rlhf/val-split.jsonl \
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.5.0.nemotron_3_super \
SANDBOX_CONTAINER=$SANDBOX_CONTAINER \
PERSISTENT_CACHE=$PERSISTENT_CACHE \
EXTRA_MOUNTS=$EXTRA_MOUNTS \
SLURM_PARTITION=$SLURM_PARTITION \
SLURM_ACCOUNT=$SLURM_ACCOUNT \
bash super_launch.sh
See the upstream training guide for full details on environment variables.
References#
Recipe Source: src/nemotron/recipes/super3/stage2_rl/stage3_rlhf/