rl/nemo_rl/rlhf#
This step runs reinforcement learning from human feedback (RLHF) on top of the NeMo-RL group relative policy optimization (GRPO) training loop, with a generative reward model judge.
The step consumes a prompt dataset for rollouts, a supervised fine-tuning (SFT) Megatron checkpoint as the warm-start policy, and a Hugging Face (HF) reward model checkpoint that NeMo-Gym serves as the judge.
The step produces an aligned checkpoint_megatron artifact.
Syntax#
nemotron steps run rl/nemo_rl/rlhf \
[-c <config-name-or-path>] \
[-r <run-profile> | -b <batch-profile>] \
[-d] \
[--force-squash] \
[<dotlist-overrides>...] \
[<passthrough-args>...]
Refer to the Nemotron Steps CLI Reference for the shared flag set.
Configuration Files#
The step ships two configuration files under src/nemotron/steps/rl/nemo_rl/rlhf/config/.
| File | Purpose |
|---|---|
| | NeMo-Gym RLHF path with GRPO sampling and a generative reward model judge that NeMo-Gym hosts. |
| | Short validation run with a small dataset slice. |
Pass the configuration name with -c:
$ nemotron steps run rl/nemo_rl/rlhf -c tiny
$ nemotron steps run rl/nemo_rl/rlhf -c default
Inputs and Outputs#
| Direction | Artifact Type | Required | Description |
|---|---|---|---|
| Consumes | | Yes | Prompt JSON Lines used for rollouts. |
| Consumes | | Yes | The supervised fine-tuned policy to align. |
| Consumes | | Yes | The reward model checkpoint, in Hugging Face format, served as the generative reward model judge. |
| Produces | | N/A | The RLHF-aligned policy checkpoint. |
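The prompt dataset is described as JSON Lines, meaning one JSON value per line. A minimal Python sketch for sanity-checking such a file before launching a run might look like the following. The helper name `check_jsonl` is hypothetical, and the check makes no assumption about which fields NeMo-Gym expects; it only verifies that each non-empty line parses as a JSON object.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (record_count, bad_line_numbers) for a JSON Lines file.

    Hypothetical helper: it only verifies that every non-empty line parses
    as a JSON object. It does not validate any NeMo-Gym field names.
    """
    count = 0
    bad = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(line_no)
                continue
            if isinstance(record, dict):
                count += 1
            else:
                bad.append(line_no)  # valid JSON, but not an object
    return count, bad
```

Calling `check_jsonl("/lustre/rlhf/train.jsonl")` on a well-formed file should return the record count and an empty list of bad line numbers.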
Step Parameters#
The manifest declares two NeMo-Gym RLHF parameters. Pass them as dotlist overrides.
- grpo.num_generations_per_prompt=<n>#
The number of rollouts produced per prompt. This value sets the GRPO group size.
Default: 8.
Example:
grpo.num_generations_per_prompt=16
- env.nemo_gym.genrm_model.responses_api_models.vllm_model.model=<id-or-path>#
The Hugging Face identifier or local path for the generative reward model that NeMo-Gym serves through its responses API.
Default: meta-llama/Llama-3.2-1B-Instruct (or the value of RL_REWARD_MODEL when that variable is set).
Example:
env.nemo_gym.genrm_model.responses_api_models.vllm_model.model=nvidia/Nemotron-Reward-Model
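To make the role of the group size concrete: GRPO scores each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows the commonly described form of that normalization (reward minus the group mean, divided by the group standard deviation) in plain Python. It is an illustration under that assumption, not NeMo-RL's implementation; the function name and the `eps` constant are hypothetical, and the rewards are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards against their own group.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of 4 rollouts scored by the judge (made-up rewards).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where the judge gives every rollout the same score.
flat = group_relative_advantages([0.5] * 8)
```

Under this formulation, a group whose rollouts all receive the same reward yields zero advantage for every rollout, which is one reason a saturated reward model leaves little signal to train on. A larger group gives a steadier estimate of the group mean at the cost of more rollouts per prompt.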
Frequently used dotlist overrides drawn from the NeMo-RL GRPO recipe include the following.
- grpo.max_num_steps=<n>#
The maximum number of training steps.
Example:
grpo.max_num_steps=1000
- grpo.num_prompts_per_step=<n>#
The number of prompts sampled per training step.
Example:
grpo.num_prompts_per_step=32
- data.train.data_path=<path>#
The path to the training JSON Lines dataset.
Example:
data.train.data_path=/lustre/rlhf/train.jsonl
- data.validation.data_path=<path>#
The path to the validation JSON Lines dataset.
Example:
data.validation.data_path=/lustre/rlhf/val.jsonl
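As a rough sizing aid, the two sampling parameters multiply: each step samples some number of prompts and generates a group of rollouts for each one. The sketch below does that arithmetic with the example values from this page. The function name is hypothetical, and the run total assumes every step executes with the full prompt count.

```python
def rollouts_per_step(num_prompts_per_step, num_generations_per_prompt):
    """Rollouts generated in one training step: prompts times group size."""
    return num_prompts_per_step * num_generations_per_prompt


# Example values used elsewhere on this page.
per_step = rollouts_per_step(num_prompts_per_step=32, num_generations_per_prompt=16)

# Upper bound over a run capped at grpo.max_num_steps=1000,
# assuming every step runs with the full prompt count.
total = per_step * 1000
```

With these values, `per_step` is 512 and `total` is 512000, which gives a sense of how many responses the reward model judge may need to score.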
Strategies#
The manifest records two operator strategies for rl/nemo_rl/rlhf.
- When the reward model saturates or reward hacking appears in rollouts, raise the Kullback-Leibler penalty, lower the learning rate, or add reward clipping.
- When the data follows the Super3 RLHF layout, keep env.should_use_nemo_gym=true and point data.train.data_path and data.validation.data_path at the prepared NeMo-Gym JSON Lines files.
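The first strategy names two mechanisms that are easy to picture in code. The sketch below is a generic illustration in plain Python, not the NeMo-RL loss: reward clipping bounds what a single rollout can earn, and a Kullback-Leibler penalty subtracts a cost proportional to how far the policy has drifted from its reference. The function names, the clipping bounds, and the coefficient values are all hypothetical.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Bound a judge score so outliers cannot dominate the update."""
    return max(low, min(high, reward))


def kl_penalized_objective(advantage, kl, kl_coef):
    """Per-sample objective: advantage minus a weighted KL drift cost."""
    return advantage - kl_coef * kl


# An inflated judge score is capped at the upper bound.
capped = clip_reward(5.0)

# Same advantage and drift, two penalty strengths (made-up numbers).
mild = kl_penalized_objective(advantage=1.0, kl=0.5, kl_coef=0.1)
strict = kl_penalized_objective(advantage=1.0, kl=0.5, kl_coef=0.5)
```

Raising the coefficient makes the same amount of drift more expensive (`strict` is lower than `mild`), which discourages the policy from moving far from the SFT warm start to exploit the judge.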
Command Examples#
Run the tiny validation configuration locally:
$ nemotron steps run rl/nemo_rl/rlhf -c tiny
Compile the default configuration without submitting the job:
$ nemotron steps run rl/nemo_rl/rlhf -c default --dry-run
Submit an attached run on a Lepton profile with an explicit reward model identifier:
$ nemotron steps run rl/nemo_rl/rlhf -c default -r lepton_rl_nemo_rl_rlhf \
env.nemo_gym.genrm_model.responses_api_models.vllm_model.model=nvidia/Nemotron-Reward-Model \
grpo.num_generations_per_prompt=16
Submit a detached run on a Slurm profile with a longer training schedule:
$ nemotron steps run rl/nemo_rl/rlhf -c default -b slurm_rl_nemo_rl_rlhf \
grpo.max_num_steps=1000 \
data.train.data_path=/lustre/rlhf/train.jsonl \
data.validation.data_path=/lustre/rlhf/val.jsonl
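When launching several variants from a script, the command line can be assembled from a dictionary of dotlist overrides. The following is a minimal sketch that uses only the flags shown on this page; `build_command` is a hypothetical helper, not part of the nemotron CLI.

```python
import shlex


def build_command(config, overrides, profile_flag=None, profile=None):
    """Assemble the argv for one rl/nemo_rl/rlhf invocation.

    profile_flag is "-r" for a run profile or "-b" for a batch profile.
    """
    argv = ["nemotron", "steps", "run", "rl/nemo_rl/rlhf", "-c", config]
    if profile_flag and profile:
        argv += [profile_flag, profile]
    # Dotlist overrides are passed as key=value tokens after the flags.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


argv = build_command(
    "default",
    {
        "grpo.max_num_steps": 1000,
        "data.train.data_path": "/lustre/rlhf/train.jsonl",
    },
    profile_flag="-b",
    profile="slurm_rl_nemo_rl_rlhf",
)
print(shlex.join(argv))  # shell-quoted command string for logging
```

The returned list can be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting issues with paths that contain spaces.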