# Reward Model Training in NeMo RL

This document explains how to train reward models (RMs) in NeMo RL. Currently, only Bradley-Terry reward models are supported, and only on the DTensor backend; support for the Megatron backend is tracked separately.

## Launch a Training Job

The script `examples/run_rm.py` trains a Bradley-Terry reward model. It can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the cluster documentation.

Be sure to launch the job using `uv`. The command to launch a training job is as follows:

```sh
uv run examples/run_rm.py

# Can also add overrides on CLI, like changing the config or changing the model
uv run examples/run_rm.py --config examples/configs/rm.yaml policy.model_name=Qwen/Qwen2.5-1.5B
```

The default YAML config shares the same base template as the SFT config but adds a `reward_model_cfg` section with `enabled: true`, which loads the model as a reward model. You can find an example RM config file at `examples/configs/rm.yaml`.
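For orientation, the RM-specific portion of the config looks roughly like the sketch below. Only `enabled: true` is stated in this document; the surrounding layout and the `policy.model_name` value (taken from the CLI example above) are illustrative and may differ from the actual `examples/configs/rm.yaml`.

```yaml
# Sketch of the RM-specific settings layered on top of the SFT base config.
# Only `reward_model_cfg.enabled` is documented here; treat the rest as illustrative.
policy:
  model_name: "Qwen/Qwen2.5-1.5B"  # example model, matching the CLI override above

reward_model_cfg:
  enabled: true  # load the model as a Bradley-Terry reward model
```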

Reminder: Set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). Make sure to log in using `huggingface-cli` if you're working with Llama models.
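For example, a typical environment setup before launching might look like the following (the paths and placeholder key are illustrative; adjust them to your environment):

```sh
# Point Hugging Face caches at a location with sufficient disk space.
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache

# Needed for Weights & Biases logging.
export WANDB_API_KEY=<your_wandb_api_key>

# Required for gated models such as Llama.
huggingface-cli login
```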

## Datasets

By default, NeMo RL supports the HelpSteer3 dataset. This dataset is downloaded from Hugging Face and preprocessed on the fly, so there's no need to provide a path to any datasets on disk.
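If your compute nodes have limited internet access, you can optionally warm the Hugging Face cache before launching the job. This is only a sketch: the `nvidia/HelpSteer3` repository ID is an assumption about where HelpSteer3 lives on the Hub and is not stated in this document.

```sh
# Optional: pre-download HelpSteer3 into the Hugging Face cache (set HF_HOME first).
# The nvidia/HelpSteer3 repository ID is an assumption; verify it on the Hub.
huggingface-cli download nvidia/HelpSteer3 --repo-type dataset
```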