NeMo RL: A Scalable and Efficient Post-Training Library#

NeMo RL is a scalable and efficient post-training library designed to scale from a single GPU to thousands of GPUs, and from tiny models to models with over 100 billion parameters.

What you can expect:

  • Seamless integration with Hugging Face for ease of use, allowing users to leverage a wide range of pre-trained models and tools.

  • High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B) and large context lengths.

  • Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.

  • Flexibility with a modular design that allows easy integration and customization.

  • Comprehensive documentation that is both detailed and user-friendly, with practical examples.

Features#

✅ Available now | 🔜 Coming in v0.3

  • ✅ Fast Generation - vLLM backend for optimized inference.

  • ✅ HuggingFace Integration - Works with 1-32B models (Qwen2.5, Llama).

  • ✅ Distributed Training - Fully Sharded Data Parallel (FSDP) support and Ray-based infrastructure.

  • ✅ Environment Support - Support for multi-environment training.

  • ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).

  • ✅ Multi-Turn RL - Multi-turn generation and training for RL with tool use, games, etc.

  • ✅ Large Model Support - Native PyTorch support for models up to 32B parameters.

  • ✅ Advanced Parallelism - PyTorch native FSDP2, TP, and SP for efficient training.

  • ✅ Worker Isolation - Process isolation between RL Actors (no worries about global state).

  • ✅ Environment Isolation - Dependency isolation between components.

  • 🔜 Improved Native Performance - Improved training time for native PyTorch models.

  • 🔜 (Even) Larger Model Support with Long(er) Sequences - Advanced parallelism in training with Megatron Core.

  • 🔜 MoE Models - Support for DeepSeek-V3 and Llama 4.

  • 🔜 Megatron Inference - Megatron inference for day-0 support of new Megatron models.

Prerequisites#

Clone NeMo RL.

git clone git@github.com:NVIDIA/NeMo-RL.git nemo-rl
cd nemo-rl

Install uv.

# For faster setup and environment isolation, we use `uv`
pip install uv

# If you cannot install at the system level, you can install for your user with
# pip install --user uv

# Use `uv run` to launch all commands. It handles pip installing implicitly and
# ensures your environment is up to date with our lock file.

# Note: we recommend using `uv run` rather than activating the venv, since it ensures
# consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py

Important Notes:

  • Use uv run <command> to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.

  • Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware.

  • Reminder: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll also need to run huggingface-cli login for Llama models. See the example below.
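
For example, a minimal shell setup might look like the following (all paths and keys are placeholders to replace with your own):

# Cache locations for Hugging Face models and datasets (placeholders)
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache

# Needed if you enable Weights & Biases logging (logger.wandb_enabled=True)
export WANDB_API_KEY=your_wandb_api_key

# Required to access gated models such as Llama
huggingface-cli login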

GRPO#

We provide a reference GRPO experiment configuration for training on math benchmarks using the OpenInstructMath2 dataset.

GRPO Single Node#

To run GRPO on a single GPU for Qwen/Qwen2.5-1.5B:

# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py

By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

# Run the GRPO math example on 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8

You can override any of the parameters listed in the YAML configuration file. For example:

uv run python examples/run_grpo_math.py \
  policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
  checkpointing.checkpoint_dir="results/llama1b_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="grpo-llama1b_math" \
  logger.num_val_samples_to_print=10

GRPO Multi-node#

# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8B uses the Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

The required CONTAINER can be built by following the instructions in the Docker documentation.
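
As a quick, hedged sketch (the Dockerfile path and image tag below are assumptions; follow the Docker documentation for the authoritative steps):

# Assumes the repository ships a Dockerfile under docker/; adjust the path and tag as needed
docker build -f docker/Dockerfile -t nemo-rl:latest .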

GRPO Qwen2.5-32B#

This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.

# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=16

# Download Qwen before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B

# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

GRPO Multi-Turn#

We also support multi-turn generation and training (tool use, games, etc.). The following is a reference example for training a model to play the Sliding Puzzle game:

uv run python examples/run_grpo_sliding_puzzle.py 
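
As with the other example scripts, command-line overrides are expected to work here as well. For instance (an assumption, not a verified configuration for this script), to use 8 GPUs:

# Assumed: run_grpo_sliding_puzzle.py accepts the same override syntax as the other examples
uv run python examples/run_grpo_sliding_puzzle.py \
  cluster.gpus_per_node=8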

Supervised Fine-Tuning (SFT)#

We provide an example SFT experiment using the SQuAD dataset.

SFT Single Node#

The default SFT configuration is set to run on a single GPU. To start the experiment:

uv run python examples/run_sft.py

This fine-tunes the Llama3.2-1B model on the SQuAD dataset using a single GPU.

To use multiple GPUs on a single node, you can modify the cluster configuration. This adjustment also lets you increase the model size and batch size:

uv run python examples/run_sft.py \
  policy.model_name="meta-llama/Meta-Llama-3-8B" \
  policy.train_global_batch_size=128 \
  sft.val_global_batch_size=128 \
  cluster.gpus_per_node=8

Refer to examples/configs/sft.yaml for a full list of parameters that can be overridden.
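
For instance, checkpointing and logging can be configured from the command line in the same way as in the other examples (the checkpoint directory and run name below are illustrative placeholders):

uv run python examples/run_sft.py \
  checkpointing.checkpoint_dir="results/sft_llama1b" \
  logger.wandb_enabled=True \
  logger.wandb.name="sft-llama1b"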

SFT Multi-node#

# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

DPO#

We provide a sample DPO experiment that uses the HelpSteer3 dataset for preference-based training.

DPO Single Node#

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

uv run python examples/run_dpo.py

This trains Llama3.2-1B-Instruct on one GPU.

If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, update the cluster configuration and switch to the 8B Llama 3.1 Instruct model:

uv run python examples/run_dpo.py \
  policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
  policy.train_global_batch_size=256 \
  cluster.gpus_per_node=8

Any of the DPO parameters can be customized from the command line. For example:

uv run python examples/run_dpo.py \
  dpo.sft_loss_weight=0.1 \
  dpo.preference_average_log_probs=True \
  checkpointing.checkpoint_dir="results/llama_dpo_sft" \
  logger.wandb_enabled=True \
  logger.wandb.name="llama-dpo-sft"

Refer to examples/configs/dpo.yaml for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the DPO documentation.

DPO Multi-node#

For distributed DPO training across multiple nodes, modify the following script for your use case:

# Run from the root of NeMo RL repo
# Number of nodes to use for your job
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

Evaluation#

We provide evaluation tools to assess model capabilities.

Convert Model Format (Optional)#

If you have trained a model and saved the checkpoint in the PyTorch DCP (Distributed Checkpoint) format, you first need to convert it to the Hugging Face format before running evaluation:

# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf

Note: Adjust the paths according to your training output directory structure.
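
For example, here is a small shell sketch that converts the most recent checkpoint, assuming step directories named step_<N> under results/grpo as in the example above:

# Pick the newest step_<N> directory (layout assumed from the example above) and convert it
LATEST_STEP=$(ls -d results/grpo/step_* | sort -V | tail -n 1)
uv run python examples/convert_dcp_to_hf.py \
    --config ${LATEST_STEP}/config.yaml \
    --dcp-ckpt-path ${LATEST_STEP}/policy/weights/ \
    --hf-ckpt-path results/grpo/hf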

For an in-depth explanation of checkpointing, refer to the Checkpointing documentation.

Run Evaluation#

Run the evaluation script with the converted model:

uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf

Run the evaluation script with custom settings:

# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
#          Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
    generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
    generation.temperature=0.6 \
    generation.top_p=0.95 \
    generation.vllm_cfg.max_model_len=32768 \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test \
    eval.num_tests_per_prompt=16 \
    cluster.gpus_per_node=8

Note: Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.

Refer to examples/configs/eval.yaml for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the Evaluation documentation.

Set Up Clusters#

For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated Cluster Start documentation.

Citation#

If you use NeMo RL in your research, please cite it using the following BibTeX entry:

@misc{nemo-rl,
  title = {NeMo RL: A Scalable and Efficient Post-Training Library},
  howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
  year = {2025},
  note = {GitHub repository},
}

Contributing#

We welcome contributions to NeMo RL! Please see our Contributing Guidelines for more information on how to get involved.

Licenses#

NVIDIA NeMo RL is licensed under the Apache License 2.0.