Run on a Cluster | NVIDIA NeMo AutoModel

In this guide, you will learn how to submit distributed training jobs on Slurm clusters (single- or multi-node). For single-node workstation usage, see Run on Your Local Workstation. For setup details, refer to our Installation Guide.

NeMo AutoModel uses recipes to run end-to-end workflows. If you’re new to recipes, see the Repository Structure guide.

Quickstart

$ # Edit the reference script for your cluster, then submit:
$ cp slurm.sub my_cluster.sub
$ vim my_cluster.sub
$ sbatch my_cluster.sub

For interactive testing on a Slurm node:

Single node, single GPU
```
$ automodel your_config.yaml
```

Single node, multiple GPUs

$ automodel --nproc-per-node 8 your_config.yaml

Submit a Batch Job with Slurm

Slurm clusters vary widely: some use Pyxis containers, others use Singularity/Apptainer, and many run bare-metal with environment modules. Instead of trying to cover every variation in code, AutoModel provides a reference sbatch script, slurm.sub, that you copy and adapt to your cluster.

Getting Started

Copy the reference script:

$ cp slurm.sub my_cluster.sub

Edit my_cluster.sub: set CONFIG, the #SBATCH directives (account, partition, nodes, time), the container runtime, mounts, and secrets for your cluster.
Submit the job:

$ sbatch my_cluster.sub
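
Before submitting, it can help to review the directives you just edited. The following is a minimal sketch, not part of AutoModel: `show_directives` is a hypothetical helper, and the throwaway file only stands in for my_cluster.sub so the snippet runs anywhere.

```shell
# Hypothetical helper: print the #SBATCH directives of a script for review.
show_directives() {
    grep '^#SBATCH' "$1"
}

# Demo on a throwaway file; on a cluster, pass my_cluster.sub instead.
demo=$(mktemp)
printf '%s\n' '#!/bin/bash' '#SBATCH -A my_account' '#SBATCH -N 2' \
    'CONFIG=your_config.yaml' > "$demo"
show_directives "$demo"
```

On a cluster, `sbatch --test-only my_cluster.sub` additionally asks Slurm to validate the request without submitting a job.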

How It Works

The reference slurm.sub script:

Sets CONFIG to point at your YAML recipe config
Allocates nodes via SBATCH directives
Sets up the multi-node environment (MASTER_ADDR, MASTER_PORT)
Runs torchrun -m nemo_automodel.cli.app $CONFIG on each node via srun
Each torchrun worker detects the distributed environment and runs the recipe in-process

All cluster-specific configuration (SBATCH directives, container runtime, mounts, NCCL tuning, secrets) lives in your sbatch script where you can see and edit it directly.
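
As a rough illustration of the launch arithmetic, torchrun starts `--nproc-per-node` workers on each of `--nnodes` nodes, so the total worker count is their product. The sketch below uses a hypothetical helper, `total_workers`, and assumes that `SLURM_GPUS_PER_NODE` may carry a GPU type prefix such as `a100:8` on some clusters; check what your cluster actually exports.

```shell
# Hypothetical helper (not part of slurm.sub): total number of torchrun workers.
# Usage: total_workers <nnodes> <gpus-per-node>
# Only the part after the last colon of the GPU value is used, so "a100:8" -> 8.
total_workers() {
    echo $(( $1 * ${2##*:} ))
}

# Same fallbacks as the reference torchrun invocation: 1 node, 8 GPUs per node.
NNODES=${SLURM_NNODES:-1}
GPUS_PER_NODE=${SLURM_GPUS_PER_NODE:-8}
echo "nodes=${NNODES} gpus_per_node=${GPUS_PER_NODE} workers=$(total_workers "$NNODES" "$GPUS_PER_NODE")"
```

Printing these values at the top of the job log makes it easier to spot an allocation that does not match what the config expects.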

Examples

Pyxis container (NVIDIA clusters):

#!/bin/bash
#SBATCH -A my_account
#SBATCH -p batch
#SBATCH -t 01:00:00
#SBATCH -N 8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
#SBATCH -J automodel-finetune
#SBATCH --output=slurm_jobs/%x_%j.out
#SBATCH --error=slurm_jobs/%x_%j.err

# Recipe config to run
CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml

# Container image, name, and host:container mounts
CONT=/lustre/fsw/images/automodel.sqsh
CONT_NAME=automodel-training
CONT_MOUNT="\
/home/$USER/Automodel:/opt/Automodel,\
/home/$USER/.cache/huggingface:/root/.cache/huggingface"

# Use the first allocated node as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=13742

srun \
    --container-name="${CONT_NAME}" \
    --container-image="${CONT}" \
    --container-mounts="${CONT_MOUNT}" \
    --container-entrypoint \
    --no-container-mount-home \
    --export=ALL \
    bash -c "\
        cd /opt/Automodel && \
        torchrun \
            --nproc-per-node=\${SLURM_GPUS_PER_NODE:-8} \
            --nnodes=\${SLURM_NNODES:-1} \
            --rdzv_backend=c10d \
            --rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
            -m nemo_automodel.cli.app ${CONFIG}"
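
The script above writes its logs to `slurm_jobs/%x_%j.out` and `slurm_jobs/%x_%j.err`. Slurm generally does not create missing directories for these paths, so a job can fail without leaving any log if `slurm_jobs/` is absent. A minimal sketch of the step to run first:

```shell
# Create the log directory named in --output/--error before submitting.
# The paths are relative, so run this from the directory you submit from.
mkdir -p slurm_jobs
# Then submit as usual: sbatch my_cluster.sub
```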

Bare-metal (no container):

#!/bin/bash
#SBATCH -A my_account
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=13742

module load cuda/12.8
source /opt/venvs/automodel/bin/activate

srun bash -c "\
    torchrun \
        --nproc-per-node=\${SLURM_GPUS_PER_NODE:-8} \
        --nnodes=\${SLURM_NNODES:-1} \
        --rdzv_backend=c10d \
        --rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
        -m nemo_automodel.cli.app ${CONFIG}"

Apptainer / Singularity:

#!/bin/bash
#SBATCH -A my_account
#SBATCH -p gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=13742

srun apptainer exec --nv /shared/images/automodel.sif \
    bash -c "\
        torchrun \
            --nproc-per-node=\${SLURM_GPUS_PER_NODE:-8} \
            --nnodes=\${SLURM_NNODES:-1} \
            --rdzv_backend=c10d \
            --rdzv_endpoint=\${MASTER_ADDR}:\${MASTER_PORT} \
            -m nemo_automodel.cli.app ${CONFIG}"
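
All three examples hard-code `MASTER_PORT=13742`. If two jobs can land on the same rendezvous host, a fixed port may collide. One common workaround, shown here as an optional sketch rather than something the reference script does, is to derive the port from the job ID. `port_for_job` is a hypothetical helper, and the 10000-59999 range is an arbitrary assumption; pick one that suits your cluster.

```shell
# Hypothetical helper: map a Slurm job ID to a port in the range 10000-59999.
port_for_job() {
    echo $(( 10000 + $1 % 50000 ))
}

# SLURM_JOB_ID is set by Slurm inside a job; 0 is only a fallback for dry runs.
MASTER_PORT=$(port_for_job "${SLURM_JOB_ID:-0}")
export MASTER_PORT
echo "MASTER_PORT=${MASTER_PORT}"
```

Because every node in a job sees the same `SLURM_JOB_ID`, each one computes the same port without any extra coordination.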

Launch with Modified Code

If you submit the job from inside a Git checkout that the Slurm workers can access, automodel uses the repository source instead of the installation inside the container image: it prepends $CWD to PYTHONPATH when it detects an editable checkout.

$ git clone git@github.com:NVIDIA-NeMo/Automodel.git automodel_test_repo
$ cd automodel_test_repo/
$ sbatch slurm.sub

Customize Configuration Settings

You can customize training in any of the following ways:

Override config values: Edit the CONFIG variable and add CLI overrides in your torchrun command inside the sbatch script. For example, to change the model:
```
-m nemo_automodel.cli.app ${CONFIG} --model.pretrained_model_name_or_path Qwen/Qwen3-0.6B
```
Edit the config file: Modify the YAML directly for persistent changes.
Create custom configs: Copy and modify existing configurations from the examples/ directory.
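
One way to keep overrides visible in an sbatch script is to collect them in a variable next to CONFIG. This is a sketch under assumptions: `OVERRIDES` and `LAUNCH_ARGS` are hypothetical variable names, and the only override used is the one shown above. The snippet just prints the resulting arguments so you can check them before wiring them into torchrun.

```shell
CONFIG=examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
# OVERRIDES is a hypothetical variable name; keep all CLI overrides in one place.
OVERRIDES="--model.pretrained_model_name_or_path Qwen/Qwen3-0.6B"

# Arguments that would follow "torchrun ..." in the sbatch script.
LAUNCH_ARGS="-m nemo_automodel.cli.app ${CONFIG} ${OVERRIDES}"
echo "torchrun ... ${LAUNCH_ARGS}"
```

Inside the double-quoted `bash -c "..."` string used in the examples, an unescaped `${OVERRIDES}` is expanded by the submitting shell, the same way `${CONFIG}` is.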

For single-node workflows, see our Run on Your Local Workstation guide.