Run on a Cluster (Slurm / Multi-node)#
In this guide you will learn how to submit distributed training jobs on Slurm clusters (single- or multi-node). For single-node workstation usage, see Run on Your Local Workstation. For setup details, refer to our Installation Guide.
NeMo Automodel uses recipes to run end-to-end workflows. If you’re new to recipes, see the Repository Structure guide.
Note
Kubernetes support is coming soon.
Quick start: Choose your job launch option#
Slurm jobs support two modes of execution: batch and interactive. In batch mode, the job is submitted to the cluster queue and runs without any further input from the user (e.g., no keyboard input), while interactive mode, as the name implies, accepts keyboard input.
CLI (recommended for Slurm): This only requires configuring the slurm section in the YAML file; the launcher renders the SBATCH script for you.
automodel finetune llm -c your_config_with_slurm.yaml
Direct recipe script (typically for interactive testing): You can also launch an interactive job on a Slurm node and, on the node, run:
Single node, single GPU
python3 examples/llm_finetune/finetune.py -c your_config.yaml
Single node, multiple GPUs
torchrun --nproc-per-node=8 examples/llm_finetune/finetune.py -c your_config.yaml
Note: For multi-node jobs, prefer the CLI with the slurm configuration. If you first need an interactive allocation, see the sketch below.
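If you do not already have an interactive allocation, you can request one with standard Slurm commands before running the recipe script. The account, partition, GPU count, and time limit below mirror the example configuration later in this guide and are assumptions; adjust them to your cluster:
# Open an interactive shell on one GPU node (flags are cluster-specific)
srun --nodes=1 --gpus-per-node=8 --account=your_account --partition=gpu --time=00:30:00 --pty bash
# Alternatively, reserve the allocation first and attach to it
salloc --nodes=1 --gpus-per-node=8 --account=your_account --partition=gpu --time=00:30:00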
Submit a Batch Job with Slurm#
For distributed training on Slurm clusters, add a slurm section to your YAML configuration:
# Your existing model, dataset, training config...
step_scheduler:
  grad_acc_steps: 4
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train

# Add Slurm configuration
slurm:
  job_name: llm-finetune
  nodes: 1
  ntasks_per_node: 8
  time: 00:30:00
  account: your_account
  partition: gpu
  container_image: nvcr.io/nvidia/nemo:25.07
  gpus_per_node: 8  # Adds an SBATCH line: "#SBATCH --gpus-per-node=8"
  # Optional: Add extra mount points if needed
  extra_mounts:
    - /lustre:/lustre
  # Optional: Specify custom HF_HOME location (will auto-create if not specified)
  hf_home: /path/to/your/HF_HOME
  # Optional: Specify custom env vars
  # env_vars:
  #   ENV_VAR: value
  # Optional: Specify custom job directory (defaults to cwd/slurm_jobs)
  # job_dir: /path/to/slurm/jobs
Then submit the job:
automodel finetune llm -c your_config_with_slurm.yaml
The AutoModel CLI is the preferred method for most users: it provides a unified interface for submitting Slurm batch jobs without deep knowledge of cluster specifics, and it handles the distributed setup automatically.
The example above launches one node with eight torchrun workers per node (--nproc_per_node=8). The generated Slurm script itself uses #SBATCH --ntasks-per-node 1, and when gpus_per_node is set, it also adds #SBATCH --gpus-per-node=8.
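After submission, you can track the job with standard Slurm tools. The output file follows the slurm_<job_name>_<job_id>.out pattern used by the generated script and is written under the job directory (slurm_jobs in the current working directory unless job_dir is set); the placeholders below are illustrative:
squeue -u $USER                                   # check queue status
sacct -j <job_id>                                 # inspect a running or finished job
tail -f <job_dir>/slurm_<job_name>_<job_id>.out   # follow training output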
The CLI follows this format:
automodel <command> <domain> -c <config_file> [options]
Where:
<command>: The operation to perform (finetune)
<domain>: The model domain (llm or vlm)
<config_file>: Path to your YAML configuration file
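For example, both domains use the same format and options; the LLM config path below is the one used later in this guide, while the VLM config path is hypothetical:
automodel finetune llm -c examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml --nproc-per-node=2
automodel finetune vlm -c path/to/your_vlm_config.yaml   # hypothetical config path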
Launch a Batch Job on Slurm with Modified Code#
If the command is executed from within a Git repository accessible to Slurm workers, the generated SBATCH script will prioritize the repository source over the Automodel installation inside the container image.
For example:
git clone git@github.com:NVIDIA-NeMo/Automodel.git automodel_test_repo
cd automodel_test_repo/
automodel finetune llm -c examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml --nproc-per-node=2
This will launch the job using the source code in the automodel_test_repo directory instead of the version bundled in the Docker image.
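To confirm which source is actually imported at runtime, you can print the package location from inside the job; this is a quick sanity check, and the expected path is only indicative:
python3 -c "import nemo_automodel; print(nemo_automodel.__file__)"
# Expect a path under automodel_test_repo/ rather than the container's site-packages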
Standalone Slurm Script (advanced)#
If you prefer to submit with your own Slurm script, here is a standalone bash script adapted from the Automodel launcher template. See the upstream template for the authoritative reference: Automodel Slurm template.
#!/bin/bash
#SBATCH -A <account>
#SBATCH -p <partition>
#SBATCH -N <nodes>
#SBATCH --ntasks-per-node 1
<gpus_per_node_directive>
#SBATCH --time <HH:MM:SS>
#SBATCH --mail-type=FAIL
#SBATCH --exclusive
#SBATCH --output=<job_dir>/slurm_%x_%j.out
#SBATCH -J <job_name>
# Multi-node env
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=<master_port>
export NUM_GPUS=<num_gpus>
export WORLD_SIZE=$(($NUM_GPUS * $SLURM_NNODES))
export CUDA_DEVICE_MAX_CONNECTIONS=1
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
# Experiment env
export WANDB_API_KEY=<wandb_key>
export HF_HOME=<hf_home>
export HF_TOKEN=<hf_token>
# Add any custom env vars below, e.g.:
# export MY_ENV_VAR=value
# User command
read -r -d '' CMD <<'EOF'
cd <chdir>; whoami; date; pwd;
<command>
EOF
echo "$CMD"
srun \
--mpi=pmix \
--container-entrypoint \
--no-container-mount-home \
--container-image=<container_image> \
--container-mounts=<container_mounts> \
--export=ALL \
bash -c "$CMD"
Replace bracketed placeholders (e.g., <account>, <container_image>, <command>) with your values. For multi-node training, ensure your <command> uses torchrun with --nnodes=$SLURM_NNODES --nproc-per-node=$NUM_GPUS --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT, or rely on the Automodel CLI, which configures this for you.
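For reference, here is a hedged example of what <command> might expand to when launching the fine-tuning recipe used earlier in this guide; the script and config paths come from the examples above and should be adjusted to your setup:
torchrun \
  --nnodes=$SLURM_NNODES \
  --nproc-per-node=$NUM_GPUS \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  examples/llm_finetune/finetune.py -c examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml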
Run with uv (Development Mode)#
When developing on clusters, you can use uv to prepare and test scripts locally. For single-node torchrun examples, see Run on Your Local Workstation. Cluster execution should be done through the CLI with slurm configs above.
For Slurm-based execution, rely on the slurm section in your YAML and submit with the CLI.
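As a rough local sketch, assuming you have cloned the repository and installed uv, a single-GPU smoke test could look like the following; whether uv sync and uv run are the supported workflow for this repository is an assumption, so consult the project docs if it fails:
# Resolve the environment from uv.lock, then run the recipe on a single GPU (illustrative)
uv sync
uv run python3 examples/llm_finetune/finetune.py -c your_config.yaml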
Why use uv?#
uv provides several advantages for development and experimentation:
Automatic environment management: uv automatically creates and manages virtual environments, ensuring consistent dependencies without manual setup.
Lock file synchronization: Keeps your local environment perfectly synchronized with the project's uv.lock file.
No installation required: Run scripts directly from the repository without installing packages system-wide.
Development flexibility: Direct access to Python scripts for debugging, profiling, and customization.
Dependency isolation: Each project gets its own isolated environment, preventing conflicts.
Run with Torchrun#
For cluster usage, prefer submitting via the CLI with slurm configuration. Direct torchrun is recommended for single-node development; see Run on Your Local Workstation.
Standard PyTorch multi-node example#
If you need a straightforward reference for manual multi-node launching with PyTorch (outside of Slurm helpers), use the pattern below. Run this on each node, updating NODE_RANK per node and adjusting --nnodes/--nproc-per-node as needed for your setup.
export MASTER_ADDR=node0.hostname # master node's host/IP
export MASTER_PORT=29500
export NODE_RANK=0 # node0 -> 0, node1 -> 1, ...
torchrun \
--nnodes=2 \
--nproc_per_node=8 \
--node_rank=${NODE_RANK} \
--rdzv_backend=c10d \
--rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
train.py --batch_size 32
Notes:
Set NODE_RANK=0 on the master node (where MASTER_ADDR resolves), NODE_RANK=1 on the second node, and so on.
Ensure --nproc_per_node matches the number of GPUs per node.
When launching under Slurm, prefer the CLI slurm configuration above, or ensure equivalent rendezvous/env settings are provided via the scheduler (see the sketch below).
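If you do launch torchrun manually inside a Slurm allocation, the rendezvous settings can usually be derived from Slurm's environment variables, mirroring the standalone script above; this is a sketch and assumes the usual Slurm variables are present:
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export NODE_RANK=$SLURM_NODEID   # per-node rank assigned by Slurm
export NUM_GPUS=8                # match --nproc_per_node and the GPUs per node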
Customize Configuration Settings#
All approaches use the same YAML configuration files. You can easily customize training by following the steps in this section.
Override config values: Use command-line arguments to directly replace default settings. For example, to fine-tune Qwen/Qwen3-0.6B instead of meta-llama/Llama-3.2-1B, you can use:
automodel finetune llm -c config.yaml --model.pretrained_model_name_or_path Qwen/Qwen3-0.6B
Additional overrides follow the same pattern; see the sketch after this list.
Edit the config file: Modify the YAML directly for persistent changes.
Create custom configs: Copy and modify existing configurations from the examples/ directory.
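Assuming the dotted-key override pattern applies to other sections of the configuration as well (this is an assumption; verify the keys exist in your YAML), multiple overrides can be combined in one invocation:
automodel finetune llm -c config.yaml \
  --model.pretrained_model_name_or_path Qwen/Qwen3-0.6B \
  --step_scheduler.grad_acc_steps 8 \
  --slurm.nodes 2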
When to Use Which Approach#
Use the Automodel CLI when:
You want a simple, unified interface
You are submitting jobs to production clusters (Slurm)
Use uv when:
You’re developing or debugging the codebase
You want automatic dependency management
Use Torchrun when:
You have a stable, pre-configured environment
You prefer explicit control over Python execution
All approaches use the same configuration files. For single-node workflows, see our Run on Your Local Workstation guide.