Set Up Clusters#
This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes.
Slurm (Batched and Interactive)#
The following sections explain how to use Slurm for batched job submission and for running jobs interactively.
Batched Job Submission#
```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=1  # Total nodes requested (the head is colocated on ray-worker-0)

COMMAND="uv run ./examples/run_grpo_math.py" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub
```
Notes:

- Depending on your cluster's configuration, the `--gres=gpu:8` option may or may not need to be added to the `sbatch` command.
The command prints the `SLURM_JOB_ID`:

```
Submitted batch job 1980204
```
Make note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can tail:

```sh
tail -f 1980204-logs/ray-driver.log
```
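If you script your submissions, `sbatch --parsable` makes the job ID easy to capture, so the driver-log path can be derived automatically. A minimal sketch of the batched submission above (the wait loop is just one way to block until the log file exists):

```sh
# Submit with --parsable so sbatch prints only the job ID.
JOB_ID=$(COMMAND="uv run ./examples/run_grpo_math.py" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch --parsable \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub)

# Wait for the driver log to appear, then follow it.
until [ -f "${JOB_ID}-logs/ray-driver.log" ]; do sleep 5; done
tail -f "${JOB_ID}-logs/ray-driver.log"
```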
Interactive Launching#
Tip

A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without re-queuing in the Slurm job queue. This means that during debugging sessions, you can avoid submitting a new `sbatch` command each time; instead, you can debug and re-submit your NeMo RL job directly from the interactive session.
To run interactively, launch the same command as in Batched Job Submission, but omit the `COMMAND` line:
```sh
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=1  # Total nodes requested (the head is colocated on ray-worker-0)

CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=1:0:0 \
    --gres=gpu:8 \
    ray.sub
```
The command prints the `SLURM_JOB_ID`:

```
Submitted batch job 1980204
```
Once the Ray cluster is up, an attach script is created that connects you to the Ray head node, from which you can launch experiments:

```sh
bash 1980204-attach.sh
```
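Once attached, you can optionally verify that all requested nodes have joined the cluster before launching anything; `ray status` is the standard Ray CLI check (this assumes the `ray` CLI is available in the container environment):

```sh
# Summarizes cluster resources; the active node count should match NUM_ACTOR_NODES.
ray status
```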
Now that you are on the head node, you can launch the command as follows:

```sh
uv run ./examples/run_grpo_math.py
```
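The example scripts can also be pointed at alternative configuration files. The exact flag is worth verifying against the script's `--help`; the invocation below assumes a `--config` flag and uses an illustrative YAML path:

```sh
# Assumes run_grpo_math.py exposes a --config flag (verify with --help); the YAML path is illustrative.
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_1B.yaml
```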
Slurm UV_CACHE_DIR#
There are several choices for `UV_CACHE_DIR` when using `ray.sub`:

1. (default) `UV_CACHE_DIR` defaults to `$SLURM_SUBMIT_DIR/uv_cache` when not specified in the shell environment, and is mounted to the head and worker nodes to serve as a persistent cache between runs.
2. Use the warm uv cache from our docker images:

   ```sh
   ...
   UV_CACHE_DIR=/home/ray/.cache/uv \
   sbatch ... \
       ray.sub
   ```
Option (1) is generally more efficient since the cache is not ephemeral and persists from run to run; but for users who don't want to persist the cache, option (2) is just as performant as (1), provided the `uv.lock` is covered by the warmed cache.
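As a variant of option (1), you can also point `UV_CACHE_DIR` at any persistent location of your choosing. The path below is purely illustrative; it should live on a filesystem visible to the head and worker nodes:

```sh
# Illustrative path; choose a shared filesystem location mounted on all nodes.
UV_CACHE_DIR=/shared/$USER/uv_cache \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch ... \
    ray.sub
```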
Kubernetes#
TBD