NeMo-RL Quick Start
NVIDIA NeMo-RL is a scalable library for post-training models and supports various techniques such as GRPO, DPO, and SFT. NeMo-RL leverages Ray for scheduling and management of resources. This guide showcases how to create a lightweight RayCluster on DGX Cloud Lepton and launch jobs with NeMo-RL.
Requirements
To follow this guide, you will need:
- An NVIDIA DGX Cloud Lepton cluster with at least 2x A100 or newer GPU nodes.
- Privileged mode enabled on the DGX Cloud Lepton workspace for container building.
- Access to a container registry to publish images.
- A shared filesystem with read/write access that is mountable in jobs.
Container Building
The NeMo-RL container must be built before launching jobs on DGX Cloud Lepton. Containers can be built directly on the platform using privileged dev pods. This process is documented here. Before building, ensure privileged mode is enabled for your node group as detailed in the linked guide.
To build the container, create a new dev pod with the following options:
- Enter a pod name, such as nemo-rl-builder.
- Select a single GPU in your node group for the resource shape.
- Check the Enable privileged mode box in the Resource form.
- Select a Custom image and specify nvcr.io/nvidia/nemo:25.07 for the image.
- Select Custom entrypoint for the container Entrypoint.
- Enter the following as the entrypoint command. Replace docker tag nemo_rl <registry tag> with the location of the registry you want to push to. For example, if you will push the container to a private registry on nvcr.io named my-private-registry, the command becomes docker tag nemo_rl nvcr.io/my-private-registry/nemo-rl. Use the same tag for the docker push command beneath it, such as docker push nvcr.io/my-private-registry/nemo-rl.
# Exit script on any error
set -e
################
# Install Docker
################
echo "Updating package lists..."
apt-get update
echo "Installing dependencies..."
apt-get install -y ca-certificates curl
echo "Adding Docker's official GPG key..."
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | tee /etc/apt/keyrings/docker.asc > /dev/null
chmod a+r /etc/apt/keyrings/docker.asc
echo "Adding Docker repository to Apt sources..."
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
tee /etc/apt/sources.list.d/docker.list > /dev/null
echo "Updating package lists after adding Docker repository..."
apt-get update
echo "Installing Docker packages..."
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Start the Docker daemon
service docker start
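# Optional: the daemon can take a few seconds to come up; wait for it before building.
# (A hedged sketch; assumes a 60-second timeout is enough for your environment.)
timeout 60 sh -c 'until docker info > /dev/null 2>&1; do sleep 2; done'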
#########################
# Build NeMo-RL container
#########################
echo "Building NeMo-RL container..."
git clone https://github.com/nvidia-nemo/rl
cd rl/docker
docker buildx build --target release --network host -t nemo_rl -f Dockerfile ..
################################
# Tag and push image to registry
################################
echo "Tagging and pushing container to registry..."
# Replace <registry tag> below with the location of your private registry
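# If your registry requires authentication, log in before pushing. For nvcr.io this is
# typically "docker login nvcr.io" with username $oauthtoken and an NGC API key as the password.
# docker login <registry host>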
docker tag nemo_rl <registry tag>
docker push <registry tag>
Click the Create button at the bottom of the page to run the dev pod and build the image. This process can take 30 minutes or longer depending on connection speed and processing power. Progress can be monitored in the job logs in the UI.
Once the dev pod completes, the image should be available at the container registry you specified at the end of the entrypoint command.
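As an optional sanity check, you can pull the image from any machine with Docker access to confirm the push succeeded (shown here with the example registry path used above):
docker pull nvcr.io/my-private-registry/nemo-rl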
Job Setup
There are two primary methods for launching jobs with Ray, which NeMo-RL relies on: via a RayCluster and directly as a RayJob. The two are compared here. As a summary, choose RayCluster when you want resources to persist and plan on running several jobs, or choose RayJob when you want to schedule a job on just the resources you need, run a single task to completion, and immediately free up the resources once complete. This guide demonstrates both methods; follow whichever path best suits your needs.
RayCluster
A RayCluster will spin up several workers and a head node, allowing jobs to be submitted to these resources. The resources will remain allocated until this job on DGX Cloud Lepton is explicitly stopped by a user, making the resources available again. This option is best for users who plan on running multiple jobs concurrently or sequentially and do not want resources to get torn down and re-allocated after each job.
To launch a simple RayCluster on DGX Cloud Lepton, create a new Batch Job in the UI and enter the following fields:
- Enter a job name like nemo-rl-cluster.
- Select the node group for the job to run on.
- Enter the number of GPUs for each node in the Resource section. For example, x8 will use eight GPUs per worker.
- Enter the number of workers for the job. By default, this example will allow jobs to be scheduled on the head node in addition to the workers. If the head node should not run jobs, add an extra worker (that is, if 4 workers for running jobs and a standalone head node are desired, enter 5 for the worker count in the UI).
- Enter the custom image that was pushed in the previous section, for example nvcr.io/my-private-registry/nemo-rl.
- Add your authentication key for the container registry in the Private Image Registry Auth field.
- Copy the following into the Run Command field. Note: if a discrete head node is desired, change the ray start --head --port=6379 --block line to ray start --head --port=6379 --block --num-cpus 0 --num-gpus 0. This ensures jobs don't get scheduled on the head node.
# Setup environment variables for communication
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-$LEPTON_JOB_NAME}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-$LEPTON_JOB_NAME-job-svc}"
export HEAD_ADDR=${SERVICE_PREFIX}-0.${SUBDOMAIN}
export THIS_ADDR=${SERVICE_PREFIX}-${LEPTON_JOB_WORKER_INDEX}.${SUBDOMAIN}
export NODE_COUNT=${LEPTON_JOB_TOTAL_WORKERS}
export NODE_RANK=${LEPTON_JOB_WORKER_INDEX}
export NGPUS=${LEPTON_RESOURCE_ACCELERATOR_NUM}
if [ "${NODE_RANK}" = "0" ]; then
# Initialize Ray cluster head
ray start --head --port=6379 --block
else
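# Optional: wait for the head node's service address to resolve before connecting.
# (A hedged sketch; assumes getent is available in the container image.)
until getent hosts "${HEAD_ADDR}" > /dev/null 2>&1; do sleep 2; done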
# All non-head nodes are assigned as workers and listen indefinitely for jobs
ray start --address="${HEAD_ADDR}:6379" --resources='{"nrl_tag_ALL": 1}' --block
fi
- In the Advanced Configuration section, add a Hugging Face token by clicking Add secret in the Environment Variables field. Either select your Hugging Face token if it has already been added, or add one by clicking CREATE NEW SECRET, selecting the Hugging Face option, and filling out the requested information.
- In the Advanced Configuration section, add storage with the + Mount Storage button: select the volume for your node group's storage, then enter the storage path and the mount path inside the containers.
Click the Create button at the bottom of the page to schedule the job. Once resources are available, DGX Cloud Lepton will pull the custom container and create the simple RayCluster.
Running jobs in the RayCluster
Once the RayCluster is running and all pods are ready, select the job in the Batch Jobs queue. In the Replicas tab at the bottom of the screen, select the Terminal button for the first replica in the list. This will open a terminal session directly in the container for the head node where you can launch commands.
To verify the RayCluster, run:
ray status
The output should report the same number of nodes and GPUs that were specified when the job was created.
When the cluster is ready, you can launch a NeMo-RL job in the terminal session. For example, to run GRPO against a specific model on two workers with 8 GPUs each, run the following:
uv run python examples/run_grpo_math.py \
policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
cluster.gpus_per_node=8 \
cluster.num_nodes=2 \
checkpointing.checkpoint_dir=/my/mounted/storage/path
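The other post-training techniques follow the same pattern. For example, assuming the repository's examples/run_sft.py entry point accepts the same cluster and checkpointing overrides (check the NeMo-RL repository for the exact script names and options in your container version), an SFT run could look like:
uv run python examples/run_sft.py \
policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
cluster.gpus_per_node=8 \
cluster.num_nodes=2 \
checkpointing.checkpoint_dir=/my/mounted/storage/path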
RayJob
A RayJob allocates a specific set of resources and creates a one-off RayCluster in which a single job runs; the resources are freed when the job concludes. The job runs in the background and is not interactive. This option is best for users who plan on running a single job and want the resources returned to the pool afterward, available for any workload type rather than reserved for Ray.
To launch a simple RayJob on DGX Cloud Lepton, create a new Batch Job in the UI and enter the following fields:
- Enter a job name like nemo-rl-job.
- Select the node group for the job to run on.
- Enter the number of GPUs for each node in the Resource section. For example, x8 will use eight GPUs per worker.
- Enter the number of workers for the job. By default, this example will allow jobs to be scheduled on the head node in addition to the workers. If the head node should not run jobs, add an extra worker (that is, if 4 workers for running jobs and a standalone head node are desired, enter 5 for the worker count in the UI).
- Enter the custom image that was pushed in the previous section, for example nvcr.io/my-private-registry/nemo-rl.
- Add your authentication key for the container registry in the Private Image Registry Auth field.
- Copy the following into the Run Command field. Note: if a discrete head node is desired, change the ray start --head --port=6379 line to ray start --head --port=6379 --num-cpus 0 --num-gpus 0. This ensures the job doesn't run on the head node.
# Setup environment variables for communication
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-$LEPTON_JOB_NAME}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-$LEPTON_JOB_NAME-job-svc}"
export HEAD_ADDR=${SERVICE_PREFIX}-0.${SUBDOMAIN}
export THIS_ADDR=${SERVICE_PREFIX}-${LEPTON_JOB_WORKER_INDEX}.${SUBDOMAIN}
export NODE_COUNT=${LEPTON_JOB_TOTAL_WORKERS}
export NODE_RANK=${LEPTON_JOB_WORKER_INDEX}
export NGPUS=${LEPTON_RESOURCE_ACCELERATOR_NUM}
if [ "${NODE_RANK}" = "0" ]; then
# Initialize Ray cluster head
ray start --head --port=6379
else
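# Optional: wait for the head node's service address to resolve before connecting.
# (A hedged sketch; assumes getent is available in the container image.)
until getent hosts "${HEAD_ADDR}" > /dev/null 2>&1; do sleep 2; done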
# All non-head nodes are assigned as workers and listen indefinitely for jobs
ray start --address="${HEAD_ADDR}:6379" --resources='{"nrl_tag_ALL": 1}' --block
fi
# Enter the NeMo-RL command here to run
uv run python examples/run_grpo_math.py \
policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
cluster.gpus_per_node=8 \
cluster.num_nodes=2 \
checkpointing.checkpoint_dir=/my/mounted/storage/path
- In the script, change the uv run python... command to the specific task you want to run with NeMo-RL, including any modifications to the parameters. The task specified here will run directly on the RayCluster created as part of this Batch Job.
- In the Advanced Configuration section, add a Hugging Face token by clicking Add secret in the Environment Variables field. Either select your Hugging Face token if it has already been added, or add one by clicking CREATE NEW SECRET, selecting the Hugging Face option, and filling out the requested information.
- In the Advanced Configuration section, add storage with the + Mount Storage button: select the volume for your node group's storage, then enter the storage path and the mount path inside the containers.
Click the Create button at the bottom of the page to schedule the job. Once resources are available, DGX Cloud Lepton will pull the custom container, create the RayCluster, and run the specified job directly.
Monitoring
Regardless of the method used for launching the job, you can view the logs from the training process in the Logs tab on the job page in the UI. Additionally, you can see resource utilization in the Metrics tab for the job.
Next Steps
For more information on using NeMo-RL, refer to the NeMo-RL GitHub repository and its associated documentation.