Run with NeMo-Run#
In this guide, you will learn how to launch NeMo AutoModel training jobs using NeMo-Run. NeMo-Run supports multiple backends including Slurm, Kubernetes, Docker, and local execution. For cloud-based training, see Run on Any Cloud with SkyPilot. For direct sbatch usage, see Run on a Cluster (Slurm). For single-node workstation usage, see Run on Your Local Workstation.
NeMo-Run is an open-source tool from NVIDIA that manages job submission across different execution backends. You define your compute configuration once in a Python file and reuse it across all your training jobs.
Before You Begin#
Install NeMo-Run (it is not bundled with AutoModel):
```bash
pip install nemo-run
```
Create an executor definitions file at `$NEMORUN_HOME/executors.py`. `NEMORUN_HOME` defaults to `~/.nemo_run`; set the environment variable to use a different location. This file tells NeMo-Run how to reach your compute target. Every executor you reference in a YAML config must be defined here. See Executor Setup for a complete example.
Verify connectivity to the target in your executor (e.g. SSH for Slurm, kubeconfig for Kubernetes).
Set required environment variables (if needed by your training config):
```bash
export HF_TOKEN=hf_...       # Required for gated models (e.g. Llama)
export WANDB_API_KEY=...     # Optional: Weights & Biases logging
```
Executor Setup#
The `executor:` field in your YAML config is a name that maps to an entry in `$NEMORUN_HOME/executors.py`. This file must define a module-level `EXECUTOR_MAP` dictionary. NeMo-Run supports several executor types; examples of the most common ones follow.
Slurm Executor#
```python
import nemo_run as run

def my_slurm_cluster():
    executor = run.SlurmExecutor(
        account="my_account",
        partition="batch",
        tunnel=run.SSHTunnel(
            user="myuser",
            host="login-node.example.com",
            job_dir="/remote/path/nemo_run/jobs",
        ),
        nodes=1,
        ntasks_per_node=8,
        gpus_per_node=8,
        mem="0",
        exclusive=True,
        packager=run.Packager(),
    )
    executor.container_image = "nvcr.io/nvidia/nemo-automodel:26.02"
    executor.container_mounts = ["/data:/data", "/checkpoints:/checkpoints"]
    executor.env_vars = {"HF_HOME": "/data/hf_cache"}
    executor.time = "04:00:00"
    return executor

EXECUTOR_MAP = {
    "my_slurm": my_slurm_cluster(),
}
```
Kubernetes Executor#
```python
import nemo_run as run

def my_k8s_cluster():
    return run.KubeflowExecutor(
        namespace="training",
        image="nvcr.io/nvidia/nemo-automodel:26.02",
        num_nodes=1,
        nprocs_per_node=8,
        gpus_per_node=8,
    )

EXECUTOR_MAP = {
    "my_k8s": my_k8s_cluster(),
}
```
Multiple Executors#
You can define as many executors as you need for different backends, clusters, or resource configurations:
```python
EXECUTOR_MAP = {
    "slurm_dev": my_slurm_dev(),
    "slurm_prod": my_slurm_prod(),
    "k8s": my_k8s_cluster(),
}
```
- Keys in `EXECUTOR_MAP` are the names you reference in YAML (`executor: slurm_dev`). Values can be executor instances or zero-argument callables that return one.
- Override fields in the YAML (`nodes`, `devices`, `container_image`, etc.) are applied on top of the executor defaults.
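The instance-or-callable rule above can be sketched in plain Python. Here `resolve_executor` is a hypothetical helper (NeMo-Run's internal name may differ), and plain dicts stand in for executor objects:

```python
# Sketch of name-to-executor resolution, assuming EXECUTOR_MAP values may be
# executor instances or zero-argument callables. Plain dicts stand in for
# executor objects; resolve_executor is a hypothetical helper, not NeMo-Run API.

def make_dev_executor():
    # Stand-in for a factory like my_slurm_cluster() above.
    return {"nodes": 1, "ntasks_per_node": 8}

EXECUTOR_MAP = {
    "slurm_dev": make_dev_executor,                    # zero-argument callable
    "slurm_prod": {"nodes": 4, "ntasks_per_node": 8},  # instance
}

def resolve_executor(name, overrides):
    entry = EXECUTOR_MAP[name]
    # Call factories; copy instances so defaults are never mutated in place.
    executor = entry() if callable(entry) else dict(entry)
    executor.update(overrides)  # YAML override fields win over defaults
    return executor

print(resolve_executor("slurm_dev", {"nodes": 2}))
# {'nodes': 2, 'ntasks_per_node': 8}
```

Using callables has the advantage that each lookup gets a fresh executor, so jobs never share mutable state.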
Quickstart#
Any existing AutoModel YAML config can be run via NeMo-Run by adding a nemo_run: section at the top. For example, given an existing config that you run locally:
```bash
automodel examples/llm_finetune/qwen/qwen3_moe_30b_te_packed_sequence.yaml
```
Add a nemo_run: block to submit it to a remote executor instead:
```yaml
# -- Add this section to any existing config ----------------------------------
nemo_run:
  executor: my_slurm                    # Name from EXECUTOR_MAP in $NEMORUN_HOME/executors.py
  container_image: /images/custom.sqsh  # Override executor's default image
  nodes: 1                              # Override number of nodes
  ntasks_per_node: 8                    # GPUs per node
  time: "04:00:00"                      # Override time limit
  job_name: qwen3_moe_finetune          # Experiment and job name

# -- Everything below is your existing training config (unchanged) ------------
recipe: TrainFinetuneRecipeForNextTokenPrediction
step_scheduler:
  global_batch_size: 32
# ... rest of your config ...
```
Then run the same command:
```bash
automodel your_config.yaml
```
The CLI detects the `nemo_run:` key, strips it from the training config, loads the named executor from `$NEMORUN_HOME/executors.py`, and submits the job, all in one command.
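The split can be illustrated with plain dicts standing in for the parsed YAML; `split_config` is a hypothetical helper, not the CLI's actual internals:

```python
# Sketch of the split the CLI performs: pop the nemo_run section and keep
# the rest as the training config. split_config is illustrative, not the
# actual AutoModel implementation.

def split_config(config):
    config = dict(config)                        # don't mutate the caller's dict
    launcher_cfg = config.pop("nemo_run", None)  # None => plain local run
    return launcher_cfg, config

cfg = {
    "nemo_run": {"executor": "my_slurm", "nodes": 1},
    "recipe": "TrainFinetuneRecipeForNextTokenPrediction",
    "step_scheduler": {"global_batch_size": 32},
}
launcher_cfg, training_cfg = split_config(cfg)
print(launcher_cfg)                # {'executor': 'my_slurm', 'nodes': 1}
print("nemo_run" in training_cfg)  # False
```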
Configuration Reference#
All nemo_run: Fields#
| Field | Default | Description |
|---|---|---|
| `executor` | (required) | Name of an entry in `EXECUTOR_MAP` from `$NEMORUN_HOME/executors.py`, or `local` |
| `job_name` | | Experiment and job name |
| `detach` | `true` | Return immediately after submission |
| `tail_logs` | `false` | Stream logs after submission |
| `executors_file` | `$NEMORUN_HOME/executors.py` | Path to the executor definitions file |
| `job_dir` | `nemo_run_jobs` | Local directory for job artifacts (config snapshot) |
| (any other key) | (from executor) | Applied directly to the executor as an attribute override |
Examples#
Single-Node Fine-Tuning (1 x 8 GPUs)#
```yaml
nemo_run:
  executor: my_slurm
  nodes: 1
  ntasks_per_node: 8
  job_name: single_node_finetune
```
Multi-Node Distributed Training (2 x 8 GPUs)#
```yaml
nemo_run:
  executor: my_slurm
  nodes: 2
  ntasks_per_node: 8
  time: "08:00:00"
  job_name: multinode_pretrain
```
For multi-node jobs the launcher automatically adds `--nnodes`, `--node-rank`, `--rdzv-backend`, `--master-addr`, and `--master-port` to the torchrun command.
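The flag assembly can be sketched as follows. The flag names match the torchrun CLI listed above, while the `c10d` rendezvous backend, port 29500, and the Slurm variable used for the node rank are illustrative assumptions, not necessarily what NeMo-Run emits:

```python
# Sketch of torchrun argument assembly for single- vs multi-node jobs.
# Flag names follow the torchrun CLI; the backend, port, and $SLURM_NODEID
# are illustrative defaults.

def torchrun_args(nodes, nproc_per_node, master_addr="$MASTER_ADDR",
                  master_port=29500):
    args = ["torchrun", f"--nproc-per-node={nproc_per_node}"]
    if nodes > 1:  # rendezvous flags are only needed for multi-node jobs
        args += [
            f"--nnodes={nodes}",
            "--node-rank=$SLURM_NODEID",  # expanded per task by the shell
            "--rdzv-backend=c10d",
            f"--master-addr={master_addr}",
            f"--master-port={master_port}",
        ]
    return args

print(torchrun_args(1, 8))  # ['torchrun', '--nproc-per-node=8']
```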
Custom Container Image and Mounts#
```yaml
nemo_run:
  executor: my_slurm
  container_image: /images/automodel_nightly.sqsh
  container_mounts:
    - /scratch/datasets:/datasets
    - /scratch/checkpoints:/checkpoints
  env_vars:
    HF_HOME: /datasets/hf_cache
    NCCL_DEBUG: INFO
```
Local Execution (No Cluster)#
Use `executor: local` to run on the current machine. No `$NEMORUN_HOME/executors.py` entry is needed:
```yaml
nemo_run:
  executor: local
  ntasks_per_node: 2
  job_name: local_test
```
Monitor and Manage Jobs#
NeMo-Run stores experiment metadata under `$NEMORUN_HOME/experiments/`. Set `tail_logs: true` in the YAML to stream job output after submission.
For Slurm-based executors, standard Slurm commands also work:
```bash
squeue -u $USER    # List your queued and running jobs
scancel <job_id>   # Cancel a running or pending job
sacct -j <job_id>  # View job accounting information
```
For Kubernetes-based executors, use kubectl to monitor pods and jobs.
How It Works#
1. The `automodel` CLI detects the `nemo_run:` key and imports `NemoRunLauncher`.
2. The `nemo_run:` section is popped from the config. The remaining training config is written to `nemo_run_jobs/<timestamp>/job_config.yaml` for record-keeping.
3. The launcher loads a pre-configured executor from `$NEMORUN_HOME/executors.py` by name (or creates a `LocalExecutor` for `executor: local`). Override fields are applied on top of the executor defaults.
4. The training config YAML is embedded in a self-contained inline bash script via a heredoc, so no separate file transfer is needed.
5. A `torchrun` command is built with `--nproc-per-node` and (for multi-node) distributed rendezvous arguments.
6. The script is submitted via `nemo_run.Experiment`. By default the call returns immediately (`detach=True`).
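The heredoc-embedding step can be sketched as below. The `/tmp` path, the `train.py` entrypoint, and the overall script shape are illustrative assumptions, not the exact script NeMo-Run generates:

```python
# Sketch of inlining the training YAML via a bash heredoc, then launching it
# with the torchrun command. Paths and entrypoint are illustrative; the real
# generated script will differ.

def build_job_script(config_yaml, torchrun_cmd):
    return "\n".join([
        "#!/bin/bash",
        "cat > /tmp/job_config.yaml <<'EOF'",  # quoted EOF: no shell expansion
        config_yaml,
        "EOF",
        f"{torchrun_cmd} /tmp/job_config.yaml",
    ])

script = build_job_script(
    "recipe: TrainFinetuneRecipeForNextTokenPrediction",
    "torchrun --nproc-per-node=8 train.py",
)
print(script.splitlines()[0])  # #!/bin/bash
```

Because the YAML travels inside the script itself, the remote node needs nothing beyond the container image and the mounted data paths.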
Customize Configuration#
Override any training parameter from the command line, same as with local runs:
```bash
automodel config_with_nemo_run.yaml \
  --model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B
```
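Dotted override keys map onto the nested config in the obvious way. `apply_override` below is a hypothetical helper to illustrate the mapping, not the CLI's actual parser:

```python
# Sketch of mapping a dotted CLI override onto the nested training config.
# apply_override is a hypothetical helper, not the AutoModel CLI internals.

def apply_override(config, dotted_key, value):
    keys = dotted_key.lstrip("-").split(".")  # "--a.b.c" -> ["a", "b", "c"]
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})       # walk/create intermediate dicts
    node[keys[-1]] = value
    return config

cfg = {"model": {"pretrained_model_name_or_path": "original/model"}}
apply_override(cfg, "--model.pretrained_model_name_or_path",
               "meta-llama/Llama-3.2-3B")
print(cfg["model"]["pretrained_model_name_or_path"])  # meta-llama/Llama-3.2-3B
```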
When to Use NeMo-Run vs. SkyPilot vs. Slurm#
| | NeMo-Run | SkyPilot | Slurm (sbatch) |
|---|---|---|---|
| Infrastructure | Slurm, Kubernetes, Docker, local | Public cloud (AWS, GCP, Azure) | On-prem HPC |
| Container support | Yes (Pyxis/Enroot, Docker, K8s pods) | N/A (cloud VMs) | Manual (in sbatch script) |
| Setup required | `pip install nemo-run` + executor definitions | Cloud credentials + `pip install skypilot` | Cluster access + sbatch script |
| Job submission | `automodel config.yaml` | `sky launch` | `sbatch job.sh` |
| Good for | Managed multi-backend execution, reusable executor configs | Cloud burst, cost optimization, spot instances | Direct Slurm scripts, full control over sbatch |