NeMo-RL End-to-End Guide

A complete walkthrough of running a reinforcement learning job with NeMo-RL on DGX Cloud Lepton.

Reinforcement learning (RL) is a post-training technique that, at a high level, teaches models to think for themselves. RL techniques have been applied to many use cases—from excelling at video games and chess to creating financial estimates and unlocking new behaviors in large language models (LLMs). RL workflows typically have an environment in which the model or agent explores, and a reward is given for desirable actions to reinforce those behaviors. A common example is a video game, where the game is the "environment" and the model decides which buttons to press: higher rewards are given for gaining points, and lower rewards are given when the player loses a life.

RL has gained traction in LLM development, as many foundation models now rely on it to learn new techniques that produce higher-quality output and more accurate responses. This was famously showcased by DeepSeek-R1, which used GRPO (an RL technique) during post-training. During the RL phase, the model learned to reflect on its responses and self-correct in a "thinking" phase before producing a final answer. This led to significantly higher performance on academic benchmarks, even though the model was never explicitly instructed to self-reflect.

NVIDIA NeMo-RL is a scalable library that supports various RL techniques for LLMs, such as GRPO and DPO, as well as fine-tuning approaches like SFT. NeMo-RL uses Ray clusters to coordinate the models and the communication between them during training. This guide walks through running GRPO with NeMo-RL on the Qwen/Qwen3-1.7B-Base model to improve mathematical reasoning accuracy, using the RayCluster feature on DGX Cloud Lepton to manage the job.

Requirements

The following is a list of requirements to follow this guide:

  • An NVIDIA DGX Cloud Lepton cluster with at least 2x A100 (or newer) GPU nodes.
  • A shared filesystem with read/write access that is mountable in jobs.
  • A local machine with Python installed and internet access, used to launch jobs.

RayCluster Setup

First, create a RayCluster. Follow the RayCluster user guide to create one in your workspace, or go directly to this link. Use the following settings for this example:

  • Name: Specify a name for the RayCluster, such as nemo-rl
  • Container: Select a custom image and enter nvcr.io/nvidia/nemo-rl:v0.4.0 as the container image.
  • Ray Version: Select 2.49.0.
  • Head Node Resource Shape: Select a resource shape that provides 8 GPUs to the container, such as gpu.8xh200.
  • Worker Resource Shape: Select the same resource shape as the head node, such as gpu.8xh200.
  • Min Replicas: Set this to 1 for a single worker. This, combined with the head node, will allocate 16 GPUs in your RayCluster.

Once configured, click the green Create button at the bottom of the page to deploy the RayCluster. Deployment may take some time while the image is pulled from NGC and the cluster is configured.

Local Setup

Next, configure your local machine to launch RL jobs on the RayCluster. This includes configuring the DGX Cloud Lepton SDK, installing NeMo-Run, and creating the launcher script.

Install Python SDK

Install the Python SDK and authenticate with your workspace. Install the SDK with:
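The command below assumes the SDK is distributed as the leptonai package on PyPI; check the DGX Cloud Lepton documentation if your workspace uses a different package name:

```bash
pip install -U leptonai
```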

Next, authenticate with your workspace:
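Assuming the lep CLI installed with the SDK is on your PATH:

```bash
lep login
```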

This prompts you to authenticate with your DGX Cloud Lepton workspace. If you're in a GUI-supported environment such as a desktop, a browser will open to the credentials page in your workspace. Otherwise, a URL will be displayed. Open this URL in a browser.

On the credentials page, create an authentication token by following the prompts. The page displays a secret token used for authentication. Copy the combined workspace ID and token shown in the second field and paste it back into your terminal; the value should look like xxxxxx:**************************. You should now be authenticated with DGX Cloud Lepton, and you only need to authenticate once locally as long as your credentials remain valid.

Validate Installation

After authentication, validate the installation by running:
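For example, using the lep CLI's workspace listing (subcommand names may vary slightly by SDK version):

```bash
lep workspace list
```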

This lists the workspaces available to your account; if authentication was successful, the workspace you just logged in to appears in the output.

Install NeMo-Run

Install NeMo-Run on your local machine to remotely launch jobs on the RayCluster using the Python SDK.

Run the following in your terminal:
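The command below assumes NeMo-Run is published on PyPI as nemo_run; it can also be installed from the NVIDIA NeMo-Run GitHub repository if you need the latest development version:

```bash
pip install nemo_run
```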

Create Launcher Script

Create the RL job launcher script on your local machine as train.py with the following contents:
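The exact launcher depends on the NeMo-Run release you have installed; the sketch below is illustrative only, and the import paths, class names, constructor arguments, and NeMo-RL entrypoint/overrides are assumptions to adapt from the NeMo-Run and NeMo-RL documentation:

```python
# train.py -- illustrative sketch only. The import paths, class names,
# constructor arguments, and the NeMo-RL entrypoint/overrides below are
# assumptions; adapt them to your NeMo-Run and NeMo-RL versions.
from nemo_run.core.execution.lepton import LeptonExecutor  # assumed import path
from nemo_run.run.ray.job import RayJob                    # assumed import path

# Command executed inside the RayCluster. The script name and overrides are
# placeholders for the GRPO math recipe shipped in the NeMo-RL container.
COMMAND = (
    "uv run python examples/run_grpo_math.py "
    "policy.model_name=Qwen/Qwen3-1.7B-Base "
    "cluster.num_nodes=2 cluster.gpus_per_node=8 "
    "checkpointing.checkpoint_dir=/nemo-workspace/checkpoints/grpo-math-deepscaler"
)


def main():
    # The shape and image should match the nemo-rl RayCluster created earlier,
    # and shared storage should be mounted at /nemo-workspace in the cluster.
    executor = LeptonExecutor(
        resource_shape="gpu.8xh200",
        container_image="nvcr.io/nvidia/nemo-rl:v0.4.0",
        nodes=2,
        gpus_per_node=8,
    )

    # Submit the RayJob to the existing RayCluster and stream its logs.
    job = RayJob(name="nemo-rl", executor=executor)  # assumed: name targets the RayCluster
    job.start(command=COMMAND)
    job.logs(follow=True)


if __name__ == "__main__":
    main()
```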

To launch the RL job on the RayCluster, run:
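From the directory containing train.py:

```bash
python train.py
```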

This connects to your DGX Cloud Lepton workspace and submits the RL job to the nemo-rl RayCluster created earlier. The job begins when resources are available in the RayCluster.

Monitoring the RL Job

After the RayJob is submitted, it allocates 16 GPUs across two nodes (8 GPUs each) to run GRPO on the Qwen/Qwen3-1.7B-Base model. NeMo-RL dynamically allocates resources within the RayJob for the policy, reward, and reference models during training.

The launcher script above automatically tails the job's logs via the job.logs(follow=True) line. The logs can also be viewed by connecting to the RayCluster in the UI and clicking the Terminal button on the head node, which opens a bash session inside the cluster. To view the logs from within the RayCluster, run:
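The submission ID below is a placeholder; ray job list prints the actual IDs of jobs submitted to the cluster:

```bash
ray job list
ray job logs <submission-id> --follow
```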

This shows the full logs of the job so far and continues streaming new log output as it becomes available.

The configuration above contains helper functions to prepare the agentica-org/DeepScaleR-Preview-Dataset dataset for training and the HuggingFaceH4/aime_2024 dataset for validation. The DeepScaleR-Preview dataset contains thousands of math questions and their associated answers, while the aime_2024 dataset contains 30 challenging math problems taken from the 2024 edition of the American Invitational Mathematics Examination (AIME), which is taken by some of the most talented high-school-aged mathematicians each year.

These datasets provide a challenging set of math questions for the model to learn from. During training, the "policy" model (in our case, Qwen/Qwen3-1.7B-Base) will explore ways to improve its capabilities in math. This typically takes the form of adding reasoning capabilities in which the model reflects on its response and self-corrects before producing the final answer.

View Rewards

As mentioned at the top of this guide, one of the main goals of reinforcement learning is to optimize the reward. In this example, the reward is determined by the accuracy of the generated responses during each step in the training process. Specifically, this configuration generates 1024 responses for each training step, and those responses are graded for accuracy. If a response is correct, it receives a score of 1; otherwise, it receives a score of 0. These scores are averaged across all 1024 generations to produce a simple percentage. For example, if 352 of the generated responses are correct in the current step, the reward will be 352/1024 ≈ 0.34. Ideally, the reward will continue to grow over time as the model learns to generate more accurate responses.
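As a concrete illustration of this averaging (plain Python, not NeMo-RL code):

```python
# Binary accuracy rewards for one training step: 1 if the response is
# correct, 0 otherwise (352 correct out of 1024 generations here).
rewards = [1] * 352 + [0] * (1024 - 352)

# The step's reward is the mean of the per-response scores.
step_reward = sum(rewards) / len(rewards)
print(f"reward: {step_reward:.4f}")  # reward: 0.3438
```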

The calculated reward is shown in the logs at the end of each step. For example, a reported reward of 0.0771 means around 8% of the responses generated by the policy in that step were correct. Monitoring this value is a good way to track training performance over time.

Saving Checkpoints

A new checkpoint is saved to storage every 10 steps, for a maximum of 10 checkpoints. If there are already 10 checkpoints in the specified directory, only those with the highest accuracy are kept.

The script above saves checkpoints to /nemo-workspace/checkpoints/grpo-math-deepscaler on the mounted storage. Each checkpoint gets its own subdirectory inside this parent directory and can be used for inference deployment or for resuming training.

The configuration saves consolidated safetensors checkpoints which are compatible with most inference engines, such as NVIDIA NIM, vLLM, and SGLang.
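For example, a saved checkpoint could be loaded with vLLM roughly as follows; the step subdirectory name is hypothetical, so substitute a real path from your checkpoint directory:

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint path -- substitute an actual subdirectory from
# /nemo-workspace/checkpoints/grpo-math-deepscaler on your shared storage.
llm = LLM(model="/nemo-workspace/checkpoints/grpo-math-deepscaler/step_100/hf")

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["What is the remainder when 7^2024 is divided by 5?"], params)
print(outputs[0].outputs[0].text)
```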

Cleaning Up

After the RayJob finishes, the resources it used within the RayCluster are released, but the RayCluster itself keeps its nodes allocated, which might prevent other jobs in your node group from starting. This is by design: the RayCluster can be reused for multiple jobs without tearing down and spinning up a cluster each time.

When the RayCluster is no longer needed, tear it down to free compute resources for other tasks in your workspace. Navigate to the RayCluster in the dashboard and click Delete. Data saved to shared storage, including checkpoints, persists after teardown.

Copyright © 2025, NVIDIA Corporation.