Run on a Cluster
In this guide, you will learn how to submit distributed training jobs on Slurm clusters (single- or multi-node). For single-node workstation usage, see Run on Your Local Workstation. For setup details, refer to our Installation Guide.
NeMo AutoModel uses recipes to run end-to-end workflows. If you’re new to recipes, see the Repository Structure guide.
Quickstart
For interactive testing on a Slurm node:
- Single node, single GPU
- Single node, multiple GPUs
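The commands for these two cases are not reproduced in this section. As a rough sketch, assuming an interactive shell on a GPU node and the same torchrun module entry point that the reference script uses (see How It Works), they might look like the following. The config path is a placeholder, the GPU count of 8 is an assumption, and the sketch only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Placeholder path: point this at a real YAML recipe config.
CONFIG="${CONFIG:-path/to/recipe.yaml}"
# Assumed GPU count for the multi-GPU case; adjust to your node.
NUM_GPUS="${NUM_GPUS:-8}"

# Single node, single GPU: one worker process.
SINGLE_GPU_CMD="torchrun --nproc_per_node=1 -m nemo_automodel.cli.app $CONFIG"

# Single node, multiple GPUs: one worker process per GPU.
MULTI_GPU_CMD="torchrun --nproc_per_node=$NUM_GPUS -m nemo_automodel.cli.app $CONFIG"

# Print rather than execute, so the commands can be checked before use.
echo "$SINGLE_GPU_CMD"
echo "$MULTI_GPU_CMD"
```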
Submit a Batch Job with Slurm
Slurm clusters vary widely: some use Pyxis containers, others use Singularity/Apptainer, and many run on bare metal with environment modules. Instead of trying to cover every variation in code, AutoModel provides a reference sbatch script that you copy and adapt to your cluster.
Getting Started
- Copy the reference script (slurm.sub) to my_cluster.sub.
- Edit my_cluster.sub: change CONFIG, the #SBATCH directives (account, partition, nodes, time), container runtime, mounts, and secrets for your cluster.
- Submit the job with sbatch.
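The three steps above can be sketched as a short shell session. The location of the reference script is an assumption: REFERENCE_SCRIPT is a hypothetical variable that you would set to wherever slurm.sub lives in your checkout. The helper function exists only so that the sketch fails softly when the file is missing.

```shell
#!/bin/sh
# Step 1 helper: copy the reference script to a cluster-specific file.
copy_reference_script() {
    src="$1"
    dst="$2"
    if [ ! -f "$src" ]; then
        echo "reference script not found: $src" >&2
        return 1
    fi
    cp "$src" "$dst"
}

# Hypothetical variable: path to the reference script in your checkout.
REFERENCE_SCRIPT="${REFERENCE_SCRIPT:-slurm.sub}"

if copy_reference_script "$REFERENCE_SCRIPT" my_cluster.sub; then
    # Step 2 is manual: edit CONFIG, the #SBATCH directives, container
    # runtime, mounts, and secrets in my_cluster.sub.
    # Step 3: submit the edited script from a Slurm login node.
    echo "Copied. Edit my_cluster.sub, then run: sbatch my_cluster.sub"
fi
```

On success, sbatch prints the ID of the submitted job, which you can then pass to squeue or scancel.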
How It Works
The reference slurm.sub script:
- Sets CONFIG to point at your YAML recipe config
- Allocates nodes via SBATCH directives
- Sets up the multi-node environment (MASTER_ADDR, MASTER_PORT)
- Runs torchrun -m nemo_automodel.cli.app $CONFIG on each node via srun
- Each torchrun worker detects the distributed environment and runs the recipe in-process
All cluster-specific configuration (SBATCH directives, container runtime, mounts, NCCL tuning, secrets) lives in your sbatch script where you can see and edit it directly.
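The flow described above can be sketched as a minimal sbatch script. This is an illustration under stated assumptions, not the contents of the reference slurm.sub: the #SBATCH values, the config path, the port number, and the GPU count are placeholders, and the torchrun rendezvous flags are one common way to wire up a multi-node launch. The launch command is printed rather than executed so the sketch can be inspected on any machine.

```shell
#!/bin/bash
# Placeholder directives: replace account, partition, and sizes with your own.
#SBATCH --account=my_account
#SBATCH --partition=my_partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

# 1. Point CONFIG at the YAML recipe config (placeholder path).
CONFIG="${CONFIG:-path/to/recipe.yaml}"

# 2. Multi-node environment: the first allocated host is the rendezvous host.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
else
    # Fallback so the sketch also runs outside a Slurm allocation.
    MASTER_ADDR="localhost"
fi
# Assumed port; any free port on the rendezvous host works.
MASTER_PORT="${MASTER_PORT:-29500}"
export MASTER_ADDR MASTER_PORT

# 3. One torchrun launcher per node, started through srun.
#    --nproc_per_node=8 matches the assumed GPU count above.
LAUNCH_CMD="srun torchrun \
 --nnodes=${SLURM_NNODES:-1} --nproc_per_node=8 \
 --rdzv_id=${SLURM_JOB_ID:-0} --rdzv_backend=c10d \
 --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
 -m nemo_automodel.cli.app ${CONFIG}"

# Printed for inspection; a real script would run the command directly.
echo "$LAUNCH_CMD"
```

Setting --ntasks-per-node=1 gives srun one task per node, so each node starts a single torchrun launcher, and that launcher spawns the per-GPU workers.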
Examples
Pyxis container (NVIDIA clusters):
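A hedged sketch of the launch line with Pyxis, using its --container-image and --container-mounts options to srun. The image path and mount paths are placeholders, not values from the reference script. The torchrun arguments for node count and rendezvous are left out for brevity, and the command is printed rather than executed.

```shell
#!/bin/sh
# Placeholders: replace with your own image and mounts.
CONTAINER_IMAGE="${CONTAINER_IMAGE:-/path/to/image.sqsh}"
CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-/path/on/host:/path/in/container}"
CONFIG="${CONFIG:-path/to/recipe.yaml}"

# srun starts the container on each node and runs torchrun inside it.
PYXIS_CMD="srun --container-image=$CONTAINER_IMAGE \
 --container-mounts=$CONTAINER_MOUNTS \
 torchrun -m nemo_automodel.cli.app $CONFIG"

echo "$PYXIS_CMD"
```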
Bare-metal (no container):
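Without a container, the job typically activates a Python environment before launching. A sketch, assuming a virtual environment at a placeholder path (VENV_DIR is a hypothetical variable) that already has AutoModel installed. Clusters that use environment modules would add their own module load lines, which are site-specific and not shown.

```shell
#!/bin/sh
# Hypothetical location of a virtual environment with AutoModel installed.
VENV_DIR="${VENV_DIR:-$HOME/venvs/automodel}"
CONFIG="${CONFIG:-path/to/recipe.yaml}"

# Activate the environment only if it exists, so the sketch fails softly.
if [ -f "$VENV_DIR/bin/activate" ]; then
    . "$VENV_DIR/bin/activate"
else
    echo "no virtual environment at $VENV_DIR; using the current Python" >&2
fi

# torchrun arguments for node count and rendezvous are omitted for brevity.
BARE_METAL_CMD="srun torchrun -m nemo_automodel.cli.app $CONFIG"

echo "$BARE_METAL_CMD"
```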
Apptainer / Singularity:
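With Apptainer, the launch is usually wrapped in apptainer exec, where --nv exposes the NVIDIA GPUs and --bind mounts host paths into the container. The image path and bind paths below are placeholders, and the command is printed rather than executed. On sites that still ship Singularity, the singularity exec command takes the same options.

```shell
#!/bin/sh
# Placeholders: replace with your own image file and bind mounts.
SIF_IMAGE="${SIF_IMAGE:-/path/to/image.sif}"
BIND_MOUNTS="${BIND_MOUNTS:-/path/on/host:/path/in/container}"
CONFIG="${CONFIG:-path/to/recipe.yaml}"

# srun starts one apptainer process per task; torchrun runs inside it.
APPTAINER_CMD="srun apptainer exec --nv --bind $BIND_MOUNTS $SIF_IMAGE \
 torchrun -m nemo_automodel.cli.app $CONFIG"

echo "$APPTAINER_CMD"
```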
Launch with Modified Code
If you run the script from within a Git repository that the Slurm workers can access, automodel uses the repository source instead of the installation inside the container image: when it detects an editable checkout, it prepends the current working directory ($CWD) to PYTHONPATH.
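The effect of that PYTHONPATH change can be reproduced by hand, which helps when checking which copy of the code a job will import. The sketch below shows the prepend idiom in shell. It illustrates the described behavior and is not AutoModel's own implementation.

```shell
#!/bin/sh
# Prepend the current directory. The ${VAR:+...} form adds the separator
# only when PYTHONPATH already has a value, avoiding a trailing colon.
PYTHONPATH="$PWD${PYTHONPATH:+:$PYTHONPATH}"
export PYTHONPATH

# Python searches entries in order, so a package in this directory
# shadows an installed copy with the same name.
echo "${PYTHONPATH%%:*}"
```

To confirm which copy a job actually imports, print the __file__ attribute of the nemo_automodel module from Python inside the job.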
Customize Configuration Settings
You can customize training in any of the following ways.
- Override config values: Edit the CONFIG variable and add CLI overrides to the torchrun command inside the sbatch script, for example to change the model.
- Edit the config file: Modify the YAML directly for persistent changes.
- Create custom configs: Copy and modify existing configurations from the examples/ directory.
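As a sketch of the first option, a CLI override is appended after the config path in the torchrun command. The dotted --section.key value form, the key model.pretrained_model_name_or_path, and the model name are all assumptions for illustration. Check the keys in your own YAML config for the real names.

```shell
#!/bin/sh
CONFIG="${CONFIG:-path/to/recipe.yaml}"

# Hypothetical override: the key and the value are placeholders.
MODEL_OVERRIDE="--model.pretrained_model_name_or_path my-org/my-model"

# The override follows the config path on the torchrun command line.
OVERRIDE_CMD="torchrun -m nemo_automodel.cli.app $CONFIG $MODEL_OVERRIDE"

# Printed for inspection rather than executed.
echo "$OVERRIDE_CMD"
```

Overrides given this way apply to a single submission. For a change that should persist across runs, edit the YAML file instead.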
For single-node workflows, see our Run on Your Local Workstation guide.