Multi-Node Ray on Slurm
SlurmRayClient is a drop-in replacement for RayClient that bootstraps a multi-node Ray cluster under SLURM. The head node starts ray start --head, writes the assigned GCS port to a shared file, and waits for workers to join. Worker nodes block on the port file and call ray start --block. Only the head returns from start(), so your pipeline code runs on the head while workers stay attached to the cluster.
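Conceptually, the bootstrap looks like the sketch below. This is illustrative only, not SlurmRayClient's actual implementation: the port-file name, the fixed port 6379, and the use of SLURMD_NODENAME are assumptions.

```python
import os
import subprocess
import time

def bootstrap(broadcast_dir: str) -> None:
    # Illustrative port file; the real client picks its own name and port.
    port_file = os.path.join(broadcast_dir, "ray_head_port")
    if int(os.environ["SLURM_NODEID"]) == 0:
        # Head: start Ray, publish the GCS address, then return so the
        # pipeline can run on this node.
        subprocess.run(["ray", "start", "--head", "--port=6379"], check=True)
        with open(port_file, "w") as f:
            f.write(f"{os.environ['SLURMD_NODENAME']}:6379")
    else:
        # Worker: wait for the head's address, then attach and block for
        # the lifetime of the job.
        while not os.path.exists(port_file):
            time.sleep(1)
        with open(port_file) as f:
            address = f.read().strip()
        subprocess.run(["ray", "start", "--block", f"--address={address}"], check=True)
```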
For details on container environments and SLURM-specific environment variables, refer to Container Environments. For the image-curation specific Slurm workflow, refer to Deploy Image Curation on Slurm.
When to Use It
Use SlurmRayClient when:
- You're submitting Ray-backed pipelines through `sbatch` and want to span more than one node.
- You want a single entry point: the same script works on a 1-node debug job and a 2-node production job without code changes.
For single-node SLURM jobs, the regular RayClient is sufficient.
Prerequisites
- SLURM cluster: a working SLURM cluster with at least one GPU node.
- Shared filesystem: a filesystem visible from every allocated node (NFS, Lustre, GPFS) — required for the Ray port broadcast file.
- Pyxis + enroot (recommended): for the container-based submit script. If unavailable, use the bare-metal `uv` script.
- uv (alternative): for bare-metal execution. Install via astral.sh/uv.
- Source checkout: NeMo Curator source code on a shared filesystem path readable from every node.
Drop-in Replacement
Replace RayClient with SlurmRayClient. No other changes are required:
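A minimal sketch of the swap (the import path is an assumption, so check your NeMo Curator version; `stop()` is assumed symmetric to `start()`):

```python
# Before (single node):
# from nemo_curator.core.client import RayClient
# client = RayClient()

# After (multi-node under SLURM). Import path is illustrative:
from nemo_curator.core.client import SlurmRayClient

client = SlurmRayClient()
client.start()  # workers block in here; only the head returns
# ... build and run your pipeline on the head ...
client.stop()   # assumed teardown, mirroring start()
```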
The head/worker split is determined automatically from SLURM_NODEID. Workers stay inside SlurmRayClient.start() for the duration of the job; the head runs your pipeline.
Required Environment
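At minimum, the variables the bundled scripts reference must be set; the paths below are illustrative:

```bash
# Source checkout on a filesystem every node can read.
export CURATOR_DIR=/shared/nemo-curator             # illustrative path

# Where the head publishes the Ray GCS port file; must be shared across nodes.
export RAY_PORT_BROADCAST_DIR=${CURATOR_DIR}/logs   # the scripts' default

# SLURM_NODEID is set by SLURM itself; SlurmRayClient reads it to decide
# which node is the head (node 0) and which are workers.
```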
Submit Scripts
The repository ships two reference submit scripts at tutorials/slurm/:
1. NGC Container (Pyxis)
2. Bare-metal (uv)
submit_container.sh — launches the NGC NeMo Curator container via Pyxis on each allocated node. Recommended for production because the container ships a known-good environment.
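Under the hood, a Pyxis launch looks roughly like the following. This is a sketch; the real script's flags, mounts, and entry point differ, and `your_pipeline.py` is a placeholder:

```bash
srun --container-image=nvcr.io/nvidia/nemo-curator:26.02 \
     --container-mounts=${CURATOR_DIR}:${CURATOR_DIR} \
     python your_pipeline.py
```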
Submit:
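For example, from the repository root:

```bash
# Node count and wall clock are typically set via #SBATCH directives
# inside the script; edit those rather than passing flags here.
sbatch tutorials/slurm/submit_container.sh
```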
Both scripts default RAY_PORT_BROADCAST_DIR to ${CURATOR_DIR}/logs. Make sure this resolves to a shared filesystem path (NFS, Lustre, GPFS); if it points at node-local storage such as /tmp, workers cannot read the head's port file and will never join the cluster.
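For example, to point the broadcast file at a Lustre path (illustrative):

```bash
export RAY_PORT_BROADCAST_DIR=/lustre/shared/${USER}/ray_ports
mkdir -p "${RAY_PORT_BROADCAST_DIR}"
```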
Verified Configurations
End-to-end tested on H100 nodes using the nvcr.io/nvidia/nemo-curator:26.02 container at the time SlurmRayClient was introduced. The reference workload is a word-count + GPU-tagging pipeline shipped in the tutorials/slurm/ directory; it works unchanged on subsequent container versions, including {{ container_version }}.
Sample output from the 2-node, 8-GPU run:
Monitoring and Logs
Check job status:
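Standard SLURM tooling works:

```bash
squeue -u $USER   # lists your queued and running jobs
```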
View logs (the bundled scripts write to logs/slurm_demo_<jobid>.log):
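For example, to follow a running job's log:

```bash
tail -f logs/slurm_demo_<jobid>.log   # substitute the SLURM job ID
```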
The first line of each node's log identifies its role (head or worker).
Troubleshooting
Performance Considerations
- GPU Memory: ensure all nodes have homogeneous GPU memory; mixed-memory clusters cause Ray to schedule conservatively against the smallest GPU.
- Network: Ray’s object store and shuffle traffic flow over the cluster’s primary network. Use InfiniBand or 100 GbE+ for large-scale jobs.
- Shared FS bandwidth: worker reads and writes staged through the shared filesystem compete for its bandwidth. For S3-backed pipelines, read and write S3 directly instead of staging through the shared FS to reduce contention.
- Time limits: SLURM kills jobs at the wall-clock limit. The bundled scripts use `--time=00:10:00` for the demo; raise it to match your real workload, as shown below.
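To extend the limit, raise the value in the submit script's directives (the value below is illustrative):

```bash
#SBATCH --time=04:00:00
```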
Related Topics
- Deploy Image Curation on Slurm — full image curation pipeline using SLURM and Pyxis.
- Container Environments — image versions and SLURM-specific environment variables.
- Execution Backends — Ray Data, Xenna, and Ray actor pool executor selection.