RM#

We provide a sample reward model (RM) experiment that uses the HelpSteer3 dataset for preference-based training.

RM Single Node#

The default RM experiment is configured to run on a single GPU. To launch the experiment:

uv run python examples/run_rm.py

This trains an RM based on meta-llama/Llama-3.2-1B-Instruct on a single GPU.

If you have access to more GPUs, you can update the experiment accordingly. For example, to run on 8 GPUs, override the cluster configuration:

uv run python examples/run_rm.py cluster.gpus_per_node=8
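
Overrides can also be combined. As a sketch (the full set of keys is defined in examples/configs/rm.yaml), the following reuses the checkpointing and logging overrides from the multi-node command below to run an 8-GPU job with a custom checkpoint directory and Weights & Biases logging:

uv run python examples/run_rm.py \
    cluster.gpus_per_node=8 \
    checkpointing.checkpoint_dir='results/rm_llama1b' \
    logger.wandb_enabled=True \
    logger.wandb.name='rm-llama1b'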

Refer to the RM documentation for more information.

RM Multi-Node#

For distributed RM training across multiple nodes, adapt the following Slurm submission script to your use case:

# Run from the root of the NeMo RL repo
# Number of nodes to use for your job
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_rm.py --config examples/configs/rm.yaml cluster.num_nodes=${NUM_ACTOR_NODES} cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/rm_llama1b_2nodes' logger.wandb_enabled=True logger.wandb.name='rm-llama1b-2nodes'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
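
After submission, you can track the job with standard Slurm tooling, for example (substituting the job name you chose above):

squeue --name=YOUR_JOBNAME -u $USER

Note that cluster.num_nodes in COMMAND is derived from NUM_ACTOR_NODES, so changing the node count in one place keeps the Slurm allocation and the NeMo RL cluster configuration in sync. Log locations and Ray cluster setup are handled by ray.sub; consult that script for where job output lands.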