Run on a Cluster#
In this guide, you will learn how to submit distributed training jobs on Slurm clusters (single- or multi-node). For single-node workstation usage, see Run on Your Local Workstation. For setup details, refer to our Installation Guide.
NeMo AutoModel uses recipes to run end-to-end workflows. If you’re new to recipes, see the Repository Structure guide.
Quickstart#
# Edit the reference script for your cluster, then submit:
cp slurm.sub my_cluster.sub
vim my_cluster.sub
sbatch my_cluster.sub
For interactive testing on a Slurm node:
Single node, single GPU:

automodel your_config.yaml

Single node, multiple GPUs:

automodel --nproc-per-node 8 your_config.yaml
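The interactive commands above assume you already have a shell on a GPU node. One way to get one is an interactive Slurm allocation; a minimal sketch, where the account and partition names are placeholders for your cluster:

```shell
# Request an interactive shell on one GPU node (adjust account/partition/time).
salloc -A my_account -p gpu -N 1 --gpus-per-node=8 -t 01:00:00

# Once the allocation is granted, run on the node:
automodel --nproc-per-node 8 your_config.yaml
```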
Submit a Batch Job with Slurm#
Slurm clusters vary widely: some use Pyxis containers, others use Singularity/Apptainer, and many run bare-metal with environment modules. Instead of trying to cover all variations in code, AutoModel provides a reference sbatch script that you copy and adapt to your cluster.
Getting Started#
Copy the reference script:
cp slurm.sub my_cluster.sub
Edit my_cluster.sub: change CONFIG, the #SBATCH directives (account, partition, nodes, time), container runtime, mounts, and secrets for your cluster.

Submit the job:
sbatch my_cluster.sub
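After submission, standard Slurm commands can be used to monitor the job; for example (the output path follows the reference script's defaults, and the job ID is a placeholder):

```shell
squeue -u $USER                                    # check job state
tail -f slurm_jobs/automodel-finetune_<jobid>.out  # follow training output
scancel <jobid>                                    # cancel the job if needed
```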
How It Works#
The reference slurm.sub script:
- Sets CONFIG to point at your YAML recipe config
- Allocates nodes via SBATCH directives
- Sets up the multi-node environment (MASTER_ADDR, MASTER_PORT)
- Runs torchrun -m nemo_automodel.cli.app $CONFIG on each node via srun

Each torchrun worker detects the distributed environment and runs the recipe in-process.
All cluster-specific configuration (SBATCH directives, container runtime, mounts, NCCL tuning, secrets) lives in your sbatch script where you can see and edit it directly.
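One such edit you may want: if multiple jobs can share a node, a hard-coded MASTER_PORT can collide. A common pattern (not part of the reference script) is deriving the port from the job ID:

```shell
# Derive a per-job rendezvous port from SLURM_JOB_ID to avoid port collisions
# between jobs on the same node. The :-12345 fallback only lets the snippet
# run outside Slurm for illustration.
SLURM_JOB_ID=${SLURM_JOB_ID:-12345}
export MASTER_PORT=$((10000 + SLURM_JOB_ID % 50000))
echo "MASTER_PORT=${MASTER_PORT}"
```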
Examples#
Pyxis container (NVIDIA clusters):
#!/bin/bash
#SBATCH -A my_account
#SBATCH -p batch
#SBATCH -t 01:00:00
#SBATCH -N 8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH -J automodel-finetune
#SBATCH --output=slurm_jobs/%x_%j.out
#SBATCH --error=slurm_jobs/%x_%j.err
CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
CONT=/lustre/fsw/images/automodel.sqsh
CONT_NAME=automodel-training
CONT_MOUNT="\
/home/$USER/Automodel:/opt/Automodel,\
/home/$USER/.cache/huggingface:/root/.cache/huggingface"
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=13742
srun \
--container-name="${CONT_NAME}" \
--container-image="${CONT}" \
--container-mounts="${CONT_MOUNT}" \
--container-entrypoint \
--no-container-mount-home \
--export=ALL \
bash -c "\
cd /opt/Automodel && \
torchrun \
--nproc-per-node=\${SLURM_GPUS_PER_NODE:-8} \
--nnodes=\${SLURM_NNODES:-1} \
--rdzv_backend=c10d \
--rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
-m nemo_automodel.cli.app ${CONFIG}"
Bare-metal (no container):
#!/bin/bash
#SBATCH -A my_account
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=13742
module load cuda/12.8
source /opt/venvs/automodel/bin/activate
srun bash -c "\
torchrun \
--nproc-per-node=\${SLURM_GPUS_PER_NODE:-8} \
--nnodes=\${SLURM_NNODES:-1} \
--rdzv_backend=c10d \
--rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
-m nemo_automodel.cli.app ${CONFIG}"
Apptainer / Singularity:
#!/bin/bash
#SBATCH -A my_account
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00
CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=13742
srun apptainer exec --nv /shared/images/automodel.sif \
bash -c "\
torchrun \
--nproc-per-node=\${SLURM_GPUS_PER_NODE:-8} \
--nnodes=\${SLURM_NNODES:-1} \
--rdzv_backend=c10d \
--rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
-m nemo_automodel.cli.app ${CONFIG}"
Launch with Modified Code#
If the job is submitted from within a Git repository accessible to the Slurm
workers, automodel uses the repository source instead of the installation
inside the container image (it prepends $CWD to PYTHONPATH when it
detects an editable checkout).
git clone git@github.com:NVIDIA-NeMo/Automodel.git automodel_test_repo
cd automodel_test_repo/
sbatch slurm.sub
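The equivalent manual step is prepending the checkout to PYTHONPATH. A sketch, with a commented-out verification command that assumes nemo_automodel is importable inside the job:

```shell
# Manually prepend the current checkout, mirroring what automodel does
# when it detects an editable checkout.
export PYTHONPATH="${PWD}:${PYTHONPATH:-}"
echo "PYTHONPATH=${PYTHONPATH}"

# Inside the job, confirm which copy is imported (requires the package):
# srun python -c "import nemo_automodel; print(nemo_automodel.__file__)"
```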
Customize Configuration Settings#
You can customize training in three ways:
Override config values: Edit the CONFIG variable or append CLI overrides to the torchrun command inside the sbatch script. For example, to change the model:

-m nemo_automodel.cli.app ${CONFIG} --model.pretrained_model_name_or_path Qwen/Qwen3-0.6B
Edit the config file: Modify the YAML directly for persistent changes.
Create custom configs: Copy and modify existing configurations from the examples/ directory.
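For example, a custom config workflow could look like this (the paths follow the examples above; my_config.yaml is a placeholder name):

```shell
# Start from an existing recipe config and adapt it.
cp examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml my_config.yaml

# ... edit my_config.yaml, then point your sbatch script at it:
# CONFIG=my_config.yaml
sbatch my_cluster.sub
```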
For single-node workflows, see our Run on Your Local Workstation guide.