RLHF (Stage 3)#

Reinforcement Learning from Human Feedback is the final RL stage, run after RLVR and SWE-RL to improve model behavior and interaction quality.

Approach#

Uses a large Generative Reward Model (GenRM) to provide supervision during RL. The GenRM is a principle-following model trained as described in RL-BFF, which allows guiding Nemotron Super 3’s behavior on important domains like identity and safety-related topics.

GenRM Model#

Parameter	Value
Initialization	Qwen3-235B-A22B-Thinking-2507
Training Data	HelpSteer 3 + commercially friendly subsets of lmarena-140k + recently collected human preference data
Approach	Principle-following GenRM for guiding behavior on identity and safety domains

The GenRM is used throughout both the multi-environment RL stage (RLVR) AND as the sole reward signal in this RLHF-only stage.

KL Penalty#

Unlike the RLVR and SWE stages (which use KL=0), RLHF applies a KL penalty of 1e-4 to prevent the model from drifting too far from the reference policy. This is critical for maintaining the capabilities learned in prior stages while improving interaction quality.

Configuration#

Parameter	Value
Nodes	72 (576 GPUs)
Generation nodes	32 (colocated=false)
NeMo Gym GPU nodes	8
Prompts/step	128
Generations/prompt	16
Batch size	2,048
Max sequence length	49,152
TP / CP	4 / 4
Learning rate	1e-6
KL penalty	1e-4
Overlong filtering	false
GenRM router DP size	8

Environments#

The RLHF stage uses GenRM comparison as the primary reward signal, along with tool use evaluation:

genrm_compare — Pairwise comparison using the GenRM model with principle-following prompts
single_step_tool_use_with_argument_comparison — Tool use correctness

Config Files#

stage3_rlhf/config/default.yaml — Full-scale 72-node config
stage3_rlhf/config/small.yaml — Reduced 24-node variant for testing

Usage#

Using nemotron CLI#

uv run nemotron super3 rl rlhf --run YOUR-CLUSTER

--run YOUR-CLUSTER refers to a profile defined in your env.toml file. See the env.toml setup guide for details.

Using super_launch.sh#

EXP_NAME=stage3-rlhf \
CONFIG_PATH=examples/configs/super/stage3_rlhf.yaml \
MODEL_PATH=/path/to/swe2_checkpoint \
TRAIN_PATH=$DATA_DIR/rlhf/train-split.jsonl \
VAL_PATH=$DATA_DIR/rlhf/val-split.jsonl \
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.5.0.nemotron_3_super \
SANDBOX_CONTAINER=$SANDBOX_CONTAINER \
PERSISTENT_CACHE=$PERSISTENT_CACHE \
EXTRA_MOUNTS=$EXTRA_MOUNTS \
SLURM_PARTITION=$SLURM_PARTITION \
SLURM_ACCOUNT=$SLURM_ACCOUNT \
bash super_launch.sh

See the upstream training guide for full details on environment variables.

References#

Recipe Source: src/nemotron/recipes/super3/stage2_rl/stage3_rlhf/