NeMo RL: A Scalable and Efficient Post-Training Library
NeMo RL is a scalable and efficient post-training library designed to scale from a single GPU to thousands of GPUs, and from small models to those with over 100 billion parameters.
What you can expect:
Seamless integration with Hugging Face for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B) and large context lengths.
Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.
Flexibility with a modular design that allows easy integration and customization.
Comprehensive documentation that is both detailed and user-friendly, with practical examples.
📣 News
[5/14/2025] Reproduce DeepScaleR with NeMo RL!
Features
✅ Available now | 🔜 Coming in v0.3
✅ Fast Generation - vLLM backend for optimized inference.
✅ HuggingFace Integration - Works with 1-32B models (Qwen2.5, Llama).
✅ Distributed Training - Fully Sharded Data Parallel (FSDP) support and Ray-based infrastructure.
✅ Environment Support - Support for multi-environment training.
✅ Learning Algorithms - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
✅ Multi-Turn RL - Multi-turn generation and training for RL with tool use, games, etc.
✅ Large Model Support - Native PyTorch support for models up to 32B parameters.
✅ Advanced Parallelism - PyTorch native FSDP2, TP, and SP for efficient training.
✅ Worker Isolation - Process isolation between RL Actors (no worries about global state).
✅ Environment Isolation - Dependency isolation between components.
🔜 Improved Native Performance - Improved training time for native PyTorch models.
🔜 (even) Larger Model Support with Long(er) Sequence - Support for advanced parallelism in training with Megatron Core.
🔜 MoE Models - Support for DeepseekV3 and Llama4.
🔜 Megatron Inference - Megatron inference for day-0 support of new Megatron models.
Prerequisites
Clone NeMo RL.
git clone git@github.com:NVIDIA/NeMo-RL.git nemo-rl
cd nemo-rl
Install uv.
# For faster setup and environment isolation, we use `uv`
pip install uv
# If you cannot install at the system level, you can install for your user with
# pip install --user uv
# Use `uv run` to launch all commands. It handles pip installing implicitly and
# ensures your environment is up to date with our lock file.
# Note: activating the venv directly is not recommended; use `uv run` instead, since
# it ensures consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py
Important Notes:
Use uv run <command> to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware.
Reminder: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll also need to run huggingface-cli login for Llama models.
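For example, a minimal environment setup sketch (the paths and key below are placeholders; adjust them for your system):
# Placeholders: point these at your own cache locations and credentials
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache  # optional
export WANDB_API_KEY=your_wandb_api_key
# Required for gated models such as Llama
huggingface-cli login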
GRPO
We provide a reference GRPO experiment configuration for training on math benchmarks using the OpenMathInstruct-2 dataset.
GRPO Single Node
To run GRPO on a single GPU for Qwen/Qwen2.5-1.5B:
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:
# Run the GRPO math example using a 1B parameter model using 8 GPUs
uv run python examples/run_grpo_math.py \
cluster.gpus_per_node=8
You can override any of the parameters listed in the YAML configuration file. For example:
uv run python examples/run_grpo_math.py \
policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
checkpointing.checkpoint_dir="results/llama1b_math" \
logger.wandb_enabled=True \
logger.wandb.name="grpo-llama1b_math" \
logger.num_val_samples_to_print=10
GRPO Multi-node
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2
# grpo_math_8b uses Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
The required CONTAINER can be built by following the instructions in the Docker documentation.
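As a rough sketch only (the actual Dockerfile location and build options are described in the Docker documentation; the path and tag below are illustrative assumptions):
# Illustrative sketch: build an image and point the Slurm launch at it.
# The Dockerfile path and image tag here are assumptions; follow the Docker
# documentation for the supported build steps.
docker build -t nemo-rl:latest -f docker/Dockerfile .
CONTAINER=nemo-rl:latest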
GRPO Qwen2.5-32B
This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=16
# Download Qwen before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B
# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
GRPO Multi-Turn
We also support multi-turn generation and training (tool use, games, etc.). A reference example trains an agent to play the Sliding Puzzle Game:
uv run python examples/run_grpo_sliding_puzzle.py
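The sliding-puzzle script accepts the same style of command-line overrides as the other examples; for instance, assuming its config exposes the usual cluster section, you could scale it to 8 GPUs:
# Assumes the sliding-puzzle config uses the same cluster section as the other examples
uv run python examples/run_grpo_sliding_puzzle.py \
    cluster.gpus_per_node=8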
Supervised Fine-Tuning (SFT)
We provide an example SFT experiment using the SQuAD dataset.
SFT Single Node
The default SFT configuration is set to run on a single GPU. To start the experiment:
uv run python examples/run_sft.py
This fine-tunes the Llama3.2-1B model on the SQuAD dataset using 1 GPU.
To use multiple GPUs on a single node, you can modify the cluster configuration. This also lets you increase the model size and batch size:
uv run python examples/run_sft.py \
policy.model_name="meta-llama/Meta-Llama-3-8B" \
policy.train_global_batch_size=128 \
sft.val_global_batch_size=128 \
cluster.gpus_per_node=8
Refer to examples/configs/sft.yaml for a full list of parameters that can be overridden.
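Checkpointing and logging can be enabled with the same override pattern; for example (the keys below mirror the ones used in the multi-node command further down, and the directory and run name are placeholders):
uv run python examples/run_sft.py \
    checkpointing.checkpoint_dir="results/sft_llama1b" \
    logger.wandb_enabled=True \
    logger.wandb.name="sft-llama1b"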
SFT Multi-node
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2
COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
DPO
We provide a sample DPO experiment that uses the HelpSteer3 dataset for preference-based training.
DPO Single Node
The default DPO experiment is configured to run on a single GPU. To launch the experiment:
uv run python examples/run_dpo.py
This trains Llama3.2-1B-Instruct on one GPU.
If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration and switch to an 8B Llama3.1 Instruct model:
uv run python examples/run_dpo.py \
policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
policy.train_global_batch_size=256 \
cluster.gpus_per_node=8
Any of the DPO parameters can be customized from the command line. For example:
uv run python examples/run_dpo.py \
dpo.sft_loss_weight=0.1 \
dpo.preference_average_log_probs=True \
checkpointing.checkpoint_dir="results/llama_dpo_sft" \
logger.wandb_enabled=True \
logger.wandb.name="llama-dpo-sft"
Refer to examples/configs/dpo.yaml for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the DPO documentation.
DPO Multi-node
For distributed DPO training across multiple nodes, modify the following script for your use case:
# Run from the root of NeMo RL repo
## number of nodes to use for your job
NUM_ACTOR_NODES=2
COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama1b_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
Evaluation
We provide evaluation tools to assess model capabilities.
Convert Model Format (Optional)
If you have trained a model and saved the checkpoint in the PyTorch DCP (Distributed Checkpoint) format, you first need to convert it to the Hugging Face format before running evaluation:
# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
--config results/grpo/step_170/config.yaml \
--dcp-ckpt-path results/grpo/step_170/policy/weights/ \
--hf-ckpt-path results/grpo/hf
Note: Adjust the paths according to your training output directory structure.
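As a small convenience sketch, you can also pick up the most recent checkpoint automatically before converting (this assumes the results/grpo/step_* layout shown above; adjust it to your own output directory):
# Convert the latest GRPO checkpoint; assumes the results/grpo/step_* layout shown above
LATEST_STEP=$(ls -d results/grpo/step_* | sort -V | tail -n 1)
uv run python examples/convert_dcp_to_hf.py \
    --config ${LATEST_STEP}/config.yaml \
    --dcp-ckpt-path ${LATEST_STEP}/policy/weights/ \
    --hf-ckpt-path results/grpo/hf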
For an in-depth explanation of checkpointing, refer to the Checkpointing documentation.
Run Evaluation
Run the evaluation script with the converted model:
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
Run the evaluation script with custom settings:
# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
# Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
generation.temperature=0.6 \
generation.top_p=0.95 \
generation.vllm_cfg.max_model_len=32768 \
data.dataset_name=HuggingFaceH4/MATH-500 \
data.dataset_key=test \
eval.num_tests_per_prompt=16 \
cluster.gpus_per_node=8
Note: Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.
Refer to examples/configs/eval.yaml for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the Evaluation documentation.
Set Up Clusters
For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated Cluster Start documentation.
Citation
If you use NeMo RL in your research, please cite it using the following BibTeX entry:
@misc{nemo-rl,
title = {NeMo RL: A Scalable and Efficient Post-Training Library},
howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
year = {2025},
note = {GitHub repository},
}
Contributing
We welcome contributions to NeMo RL! Please see our Contributing Guidelines for more information on how to get involved.
Licenses
NVIDIA NeMo RL is licensed under the Apache License 2.0.