rlhf#

This step runs reinforcement learning from human feedback (RLHF) on top of the NeMo-RL group relative policy optimization (GRPO) training loop, with a generative reward model judge. The step consumes a prompt dataset for rollouts, a supervised fine-tuning (SFT) Megatron checkpoint as the warm-start policy, and a Hugging Face (HF) reward model checkpoint that NeMo-Gym serves as the judge. The step produces an aligned checkpoint_megatron artifact.

Syntax#

nemotron steps run rl/nemo_rl/rlhf \
    [-c <config-name-or-path>] \
    [-r <run-profile> | -b <batch-profile>] \
    [-d] \
    [--force-squash] \
    [<dotlist-overrides>...] \
    [<passthrough-args>...]

Refer to the Nemotron Steps CLI Reference for the shared flag set.

Configuration Files#

The step ships two configuration files under src/nemotron/steps/rl/nemo_rl/rlhf/config/.

File	Purpose
`default.yaml`	NeMo-Gym RLHF path with GRPO sampling and a generative reward model judge that NeMo-Gym hosts.
`tiny.yaml`	Short validation run with a small dataset slice.

Pass the configuration name with -c:

$ nemotron steps run rl/nemo_rl/rlhf -c tiny
$ nemotron steps run rl/nemo_rl/rlhf -c default

Inputs and Outputs#

Direction	Artifact Type	Required	Description
Consumes	`training_jsonl`	Yes	Prompt JSON Lines used for rollouts.
Consumes	`checkpoint_megatron`	Yes	The supervised fine-tuned policy to align.
Consumes	`checkpoint_hf`	Yes	The reward model checkpoint, in Hugging Face format, served as the generative reward model judge.
Produces	`checkpoint_megatron`	—	The RLHF-aligned policy checkpoint.

Step Parameters#

The manifest declares two NeMo-Gym RLHF parameters. Pass them as dotlist overrides.

grpo.num_generations_per_prompt=<n>#

The number of rollouts produced per prompt. This value sets the GRPO group size.

Default: 8.

Example: grpo.num_generations_per_prompt=16

env.nemo_gym.genrm_model.responses_api_models.vllm_model.model=<id-or-path>#

The Hugging Face identifier or local path for the generative reward model that NeMo-Gym serves through its responses API.

Default: meta-llama/Llama-3.2-1B-Instruct (or the value of RL_REWARD_MODEL when that variable is set).

Example: env.nemo_gym.genrm_model.responses_api_models.vllm_model.model=nvidia/Nemotron-Reward-Model

Frequently used dotlist overrides drawn from the NeMo-RL GRPO recipe include the following.

grpo.max_num_steps=<n>#

The maximum number of training steps.

Example: grpo.max_num_steps=1000

grpo.num_prompts_per_step=<n>#

The number of prompts sampled per training step.

Example: grpo.num_prompts_per_step=32

data.train.data_path=<path>#

The path to the training JSON Lines dataset.

Example: data.train.data_path=/lustre/rlhf/train.jsonl

data.validation.data_path=<path>#

The path to the validation JSON Lines dataset.

Example: data.validation.data_path=/lustre/rlhf/val.jsonl

Strategies#

The manifest records two operator strategies for rl/nemo_rl/rlhf.

When the reward model saturates or reward hacking appears in rollouts, raise the Kullback-Leibler penalty, lower the learning rate, or add reward clipping.
When the data follows the Super3 RLHF layout, keep env.should_use_nemo_gym=true and point data.train.data_path and data.validation.data_path at the prepared NeMo-Gym JSON Lines files.

Command Examples#

Run the tiny validation configuration locally:

$ nemotron steps run rl/nemo_rl/rlhf -c tiny

Compile the default configuration without submitting the job:

$ nemotron steps run rl/nemo_rl/rlhf -c default --dry-run

Submit an attached run on a Lepton profile with an explicit reward model identifier:

$ nemotron steps run rl/nemo_rl/rlhf -c default -r lepton_rl_nemo_rl_rlhf \
    env.nemo_gym.genrm_model.responses_api_models.vllm_model.model=nvidia/Nemotron-Reward-Model \
    grpo.num_generations_per_prompt=16

Submit a detached run on a Slurm profile with a longer training schedule:

$ nemotron steps run rl/nemo_rl/rlhf -c default -b slurm_rl_nemo_rl_rlhf \
    grpo.max_num_steps=1000 \
    data.train.data_path=/lustre/rlhf/train.jsonl \
    data.validation.data_path=/lustre/rlhf/val.jsonl