Execution through NeMo-Run#
Nemotron recipes use NeMo-Run for job orchestration. NeMo-Run is an NVIDIA tool for configuring, executing, and managing ML experiments across computing environments.
Note: This release supports Slurm, DGX Cloud (run:ai), and Lepton executors, including Ray-based runs on all three backends.
What is NeMo-Run?#
NeMo-Run decouples what you want to run from where you run it:
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
subgraph config["Configuration"]
Task["Training Task"]
end
subgraph nemorun["NeMo-Run"]
Packager["Packager"]
Executor["Executor"]
end
subgraph targets["Execution Targets"]
Local["Local"]
Docker["Docker"]
Slurm["Slurm"]
Cloud["Cloud"]
end
Task --> Packager
Packager --> Executor
Executor --> Local
Executor --> Docker
Executor --> Slurm
Executor --> Cloud
Core concepts:
Executor – where to run (local, Docker, Slurm, cloud)
Packager – how to package code for the executor
Launcher – how to launch the process (torchrun, direct)
For full documentation, see the NeMo-Run GitHub repository.
Quick Start#
Add --run <profile> to any recipe command:
# Execute on a Slurm cluster
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER
# Detached execution (submit and exit)
uv run nemotron nano3 pretrain -c tiny --batch YOUR-CLUSTER
# Preview without executing
uv run nemotron nano3 pretrain -c tiny --run YOUR-CLUSTER --dry-run
Execution Profiles#
The nemo_runspec package adds an env.toml configuration layer on top of NeMo-Run. This gives you declarative execution profiles that integrate with the CLI. This is a Nemotron-specific feature – standard NeMo-Run requires programmatic configuration.
Create an env.toml in your project root. Each section defines a named execution profile that can be referenced via --run <profile> or --batch <profile>:
# env.toml
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]
When you run uv run nemotron nano3 pretrain --run YOUR-CLUSTER, the kit reads this profile, builds the appropriate NeMo-Run executor, and submits your job.
Profile Inheritance#
Profiles can extend other profiles to reduce duplication:
[base-slurm]
executor = "slurm"
account = "my-account"
partition = "gpu"
time = "04:00:00"
[YOUR-CLUSTER]
extends = "base-slurm"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8
[YOUR-CLUSTER-large]
extends = "YOUR-CLUSTER"
nodes = 16
time = "08:00:00"
Executors#
Slurm#
Submit jobs to a Slurm cluster. Container execution requires Pyxis, NVIDIA’s container plugin for Slurm.
[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
ntasks_per_node = 8
gpus_per_node = 8
time = "04:00:00"
mounts = ["/data:/data"]
SSH Tunnel Submission#
Submit from a remote machine via SSH:
[YOUR-CLUSTER]
executor = "slurm"
account = "my-account"
partition = "gpu"
nodes = 4
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
identity = "~/.ssh/id_rsa"
Partition Overrides#
Specify different partitions for --run vs --batch:
[YOUR-CLUSTER]
executor = "slurm"
partition = "batch" # Default
run_partition = "interactive" # For --run (attached)
batch_partition = "backfill" # For --batch (detached)
DGX Cloud (run:ai)#
Submit jobs to NVIDIA DGX Cloud via the run:ai API. This executor handles authentication, data movement to PVCs, and multi-node distributed training through DGX Cloud workloads.
[dgxcloud]
executor = "dgxcloud"
base_url = "https://your-dgxcloud-instance.example.com/api/v1"
kube_apiserver_url = "https://your-dgxcloud-k8s.example.com"
client_id = "YOUR-CLIENT-ID" # `app_id` accepted as legacy alias
client_secret = "YOUR-CLIENT-SECRET" # `app_secret` accepted as legacy alias
project_name = "YOUR-PROJECT"
container_image = "nvcr.io/nvidia/nemo:latest"
pvc_nemo_run_dir = "/pvc/nemo_run"
nodes = 2
gpus_per_node = 8
nprocs_per_node = 8
pvcs = [
{ claimName = "my-pvc", path = "/pvc", readOnly = false }
]
For run:ai scheduler integration, use custom_spec:
[dgxcloud-runai]
extends = "dgxcloud"
custom_spec = { schedulerName = "runai-scheduler" }
Lepton#
Submit jobs to NVIDIA DGX Cloud Lepton clusters. Lepton provides SDK-based job submission with node group scheduling and shared memory configuration.
[lepton]
executor = "lepton"
container_image = "nvcr.io/nvidia/nemo:latest"
nemo_run_dir = "/nemo_run"
nodes = 2
gpus_per_node = 8
resource_shape = "gpu.h100.8"
node_group = "my-node-group"
shared_memory_size = 65536
mounts = [
{ path = "/data", mount_path = "/data" }
]
pre_launch_commands = ["pip install my-package"]
Docker#
Run in a local Docker container with GPU support:
[docker]
executor = "docker"
container_image = "nvcr.io/nvidia/nemo:latest"
gpus_per_node = 8
runtime = "nvidia"
mounts = ["/data:/data"]
Building Containers#
Some recipes, including Omni3, build stage-local containers through the same profile system instead of pulling a pre-baked image.
Cluster-side build via Nemotron CLI
uv run nemotron omni3 build sft --run YOUR-CLUSTER
uv run nemotron omni3 build rl --run YOUR-CLUSTER
These commands submit short CPU-oriented build jobs through NeMo-Run. If your site uses a different partition or walltime for build jobs, set build_partition and build_time in env.toml. The dispatcher only runs on slurm executors today; non-slurm executors (e.g. Lepton, k8s) error out cleanly with a hook for future PRs.
[YOUR-CLUSTER]
executor = "slurm"
partition = "batch"
build_partition = "cpu"
build_time = "02:00:00"
# Lustre-visible host path for the build cache; mounted into the build
# container at /nemotron-cache and used to land the OCI archive output.
build_cache_dir = "/lustre/.../users/<user>/.cache/nemotron"
Why
build_cache_dir? The dispatcher mounts this path into the build container so podman can save the resulting OCI archive somewhere a future training job can read. Pre-build_cache_dir, the dispatcher hard-coded~/.cache/nemotronevaluated at submission time, which resolved to the laptop’s home directory and didn’t exist on cluster nodes. Setbuild_cache_dirto a Lustre-visible path; the dispatcher willmkdir -pit on the remote before submission so first-run is seamless.
How the build container authenticates with private registries#
The SFT Dockerfile’s FROM references nvcr.io/nvidian/nemo:<tag>,
which is gated behind NGC credentials. Rather than ask operators to
provision a separate ~/.docker/config.json, the dispatcher reuses
enroot’s existing credentials at job-submission time:
The dispatcher reads
~/.config/enroot/.credentials(netrc format) over the SSH tunnel.It filters to the
nvcr.ioentry (other registries in the file — gitlab tokens, etc. — are not exposed to the build container).It transcodes the entry into a docker-format
auth.jsonand writes it to<build_cache_dir>/.auth/auth.jsonwith mode0600.It mounts that file into the build container at
/root/.config/containers/auth.json:ro.
Podman inside the build container then authenticates with nvcr.io
automatically. No env vars, no per-job secrets in slurm scripts. If you
add a new private registry the build needs, add an entry for it in
~/.config/enroot/.credentials and extend the dispatcher’s allowlist
(see materialize_podman_auth_from_enroot in
src/nemo_runspec/execution.py).
Local build from the stage Dockerfile
cd src/nemotron/recipes/omni3/stage0_sft
docker build -t nemotron/omni3-sft:latest -f Dockerfile .
# or
podman build -t nemotron/omni3-sft:latest -f Dockerfile .
Use the local path when iterating on the Dockerfile itself; use nemotron omni3 build <stage> --run <profile> when you want the same build recipe executed on the cluster.
Other Executors#
NeMo-Run also supports these executors (not yet integrated into env.toml profiles):
Executor |
Description |
|---|---|
|
Cloud instances (AWS, GCP, Azure) via SkyPilot |
|
Kubernetes Training Operator (PyTorchJob) |
Packagers#
Packagers determine how your code is bundled and transferred to the execution target. NeMo-Run provides several packagers for different workflows.
GitArchivePackager (Default)#
Packages code from a git repository using git archive. Only includes committed files.
[YOUR-CLUSTER]
executor = "slurm"
packager = "git"
# ... other settings
How it works:
Runs
git archive --format=tar.gzfrom repository rootIncludes only committed files (uncommitted changes are excluded)
Optionally includes git submodules
Transfers archive to execution target
Key options:
Option |
Default |
Description |
|---|---|---|
|
- |
Package only a subdirectory of the repo |
|
|
Include git submodules in archive |
|
- |
Glob pattern for additional uncommitted files |
Best for: Production runs with version-controlled code.
PatternPackager#
Packages files matching glob patterns. Useful for code not under version control.
[YOUR-CLUSTER]
executor = "slurm"
packager = "pattern"
packager_include_pattern = "src/**/*.py"
How it works:
Finds files matching the glob pattern
Creates tar archive of matched files
Preserves directory structure relative to pattern base
Key options:
Option |
Description |
|---|---|
|
Glob pattern(s) for files to include |
|
Base path for pattern matching |
Best for: Quick iterations, non-git projects, or including generated files.
HybridPackager#
Combines multiple packagers into a single archive. Useful for complex projects.
[YOUR-CLUSTER]
executor = "slurm"
packager = "hybrid"
# Configuration via code (see below)
How it works:
Runs each sub-packager independently
Extracts outputs to temporary directories
Merges all into final archive with folder organization
Example in code:
import nemo_run as run
hybrid = run.HybridPackager(
sub_packagers={
"code": run.GitArchivePackager(subpath="src"),
"configs": run.PatternPackager(include_pattern="configs/*.yaml"),
"data": run.PatternPackager(include_pattern="data/*.json"),
}
)
Best for: Projects with mixed version-controlled and generated content.
Passthrough Packager#
No packaging—assumes code is already available on the target (e.g., in a container image or shared filesystem).
[YOUR-CLUSTER]
executor = "slurm"
packager = "none"
Best for: Pre-built containers, shared filesystems like /lustre.
CLI Options#
--run vs --batch#
Option |
Behavior |
Use Case |
|---|---|---|
|
Attached—waits for completion, streams logs |
Interactive development |
|
Detached—submits and exits immediately |
Long-running jobs |
Config Overrides#
Override config values using Hydra-style syntax:
# Override training iterations
uv run nemotron nano3 pretrain --run YOUR-CLUSTER train.train_iters=5000
# Override nodes
uv run nemotron nano3 pretrain --run YOUR-CLUSTER run.env.nodes=8
env.toml Reference#
An env.toml file contains two kinds of sections: system-level sections (reserved names) and execution profiles (any other name).
System-Level Sections#
These sections have reserved names and are not execution profiles:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[artifacts]
backend = "file" # "file" or "wandb" (default: "wandb" if [wandb] exists)
root = "/path/to/artifacts" # Root directory for file-based artifacts
[cache]
git_dir = "/lustre/.../git-cache" # Directory for caching git repos (for auto_mount)
[cli]
theme = "github-light" # Rich console theme
Section |
Field |
Type |
Description |
|---|---|---|---|
|
|
str |
W&B project name (injected into |
|
|
str |
W&B entity/team name (injected into |
|
|
str |
Artifact storage backend: |
|
|
str |
Root directory for file-based artifact storage |
|
|
str |
Directory for caching git repos cloned by |
|
|
str |
Rich console theme name |
Execution Profile Fields#
Every other section is an execution profile. Use --run <name> or --batch <name> to select one.
[my-cluster]
executor = "slurm"
account = "my-account"
partition = "batch"
run_partition = "interactive"
batch_partition = "batch"
tunnel = "ssh"
host = "cluster.example.com"
user = "username"
remote_job_dir = "/lustre/users/me/run"
time = "04:00:00"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mem = "0"
exclusive = true
container_image = "/path/to/image.sqsh"
mounts = ["/lustre:/lustre"]
startup_commands = ["source /opt/env.sh"]
Core:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
str |
|
Execution backend: |
|
str |
- |
Parent profile to inherit from |
Local executor:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Number of GPU processes (for torchrun) |
Slurm executor:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
str |
- |
Slurm account for job billing |
|
str |
- |
Default Slurm partition |
|
str |
- |
Partition for |
|
str |
- |
Partition for |
|
int |
|
Number of compute nodes |
|
int |
|
Tasks per node |
|
int |
- |
GPUs per node |
|
int |
- |
CPU cores per task |
|
str |
|
Job time limit (HH:MM:SS) |
|
str |
- |
Memory per node ( |
|
bool |
- |
Request exclusive node access |
SSH tunnel (for remote submission):
Field |
Type |
Default |
Description |
|---|---|---|---|
|
str |
- |
Set to |
|
str |
- |
SSH hostname |
|
str |
- |
SSH username |
|
str |
- |
Working directory on the remote cluster |
Container:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
str |
- |
Container image path ( |
|
str |
- |
Alias for |
|
list |
|
Container mount points ( |
Startup:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
list |
|
Shell commands to run before the training script |
DGX Cloud executor:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
str |
required |
DGX Cloud API base URL |
|
str |
required |
Kubernetes API server URL for the DGX Cloud cluster |
|
str |
required |
Client ID for authentication |
|
str |
required |
Client secret for authentication |
|
str |
- |
Legacy alias for |
|
str |
- |
Legacy alias for |
|
str |
required |
DGX Cloud project name |
|
str |
required |
PVC path for nemo-run job directory |
|
int |
|
Number of compute nodes |
|
int |
|
GPUs per node |
|
int |
|
Processes per node |
|
list |
|
PVC mount specifications (list of dicts with |
|
str |
|
Distributed framework ( |
|
dict |
|
Additional workload spec overrides (e.g., |
Lepton executor:
Field |
Type |
Default |
Description |
|---|---|---|---|
|
str |
required |
Remote directory for nemo-run job files |
|
int |
|
Number of compute nodes |
|
int |
|
GPUs per node |
|
int |
|
Processes per node |
|
str |
|
Resource shape (e.g., |
|
str |
|
Target node group |
|
str |
|
Node reservation identifier |
|
int |
|
Shared memory size in MB |
|
list |
|
Mount specifications (list of dicts with |
|
list |
|
Container registry authentication secrets |
|
list |
|
Commands to run before launching the job |
|
dict |
|
Additional job spec overrides |
How Profiles Are Resolved#
When you run --run my-cluster:
Load:
env.tomlis found by walking up from the current directoryInherit: If
extends = "parent"is set, fields are inherited recursivelyMerge: The profile overlays the config YAML’s
run.envdefaults ({**yaml_defaults, **profile})Inject:
[wandb]section is injected intorun.wandb
This means config YAML provides defaults for any field env.toml doesn’t set. For example, the evaluator config sets container: nvcr.io/nvidia/nemo-evaluator:latest in its YAML, which is used unless your env.toml profile overrides it.
Full Example#
# env.toml
# --- System sections (not profiles) ---
[wandb]
project = "nemotron"
entity = "my-team"
[artifacts]
backend = "file"
root = "/lustre/users/me/artifacts"
[cache]
git_dir = "/lustre/users/me/git-cache"
# --- Execution profiles ---
[local]
executor = "local"
nproc_per_node = 8
[my-cluster]
executor = "slurm"
account = "my-account"
partition = "batch"
run_partition = "interactive"
batch_partition = "batch"
time = "04:00:00"
tunnel = "ssh"
host = "login.cluster.com"
user = "me"
remote_job_dir = "/lustre/users/me/run"
[pretrain]
extends = "my-cluster"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mem = "0"
exclusive = true
mounts = ["/lustre:/lustre"]
[pretrain-large]
extends = "pretrain"
nodes = 8
time = "08:00:00"
[data-prep]
extends = "my-cluster"
run_partition = "cpu"
batch_partition = "cpu"
nodes = 1
ntasks_per_node = 1
cpus_per_task = 96
[rl]
extends = "pretrain"
container_image = "/lustre/users/me/nano3-rl.sqsh"
# --- DGX Cloud (run:ai) ---
[dgxcloud]
executor = "dgxcloud"
base_url = "https://dgx.example.com/api/v1"
app_id = "MY-APP-ID"
app_secret = "MY-APP-SECRET"
project_name = "my-project"
container_image = "nvcr.io/nvidia/nemo:latest"
pvc_nemo_run_dir = "/pvc/nemo_run"
nodes = 2
gpus_per_node = 8
nprocs_per_node = 8
pvcs = [
{ claimName = "my-pvc", path = "/pvc", readOnly = false }
]
# --- Lepton ---
[lepton]
executor = "lepton"
container_image = "nvcr.io/nvidia/nemo:latest"
nemo_run_dir = "/nemo_run"
nodes = 2
gpus_per_node = 8
resource_shape = "gpu.h100.8"
node_group = "my-node-group"
Ray-Enabled Recipes#
Some recipes use Ray for distributed execution. When you run a Ray-enabled recipe with --run, the Ray cluster is set up automatically:
# Data prep uses Ray
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER
# RL training uses Ray
uv run nemotron nano3 rl -c tiny --run YOUR-CLUSTER
Ray on DGX Cloud and Lepton#
Ray-based recipes work with DGX Cloud and Lepton executors. The framework automatically selects the appropriate Ray backend based on the executor type:
Slurm: Uses
SlurmRayJobwith SSH tunnel (default)DGX Cloud: Uses
DGXCloudRayJobfor self-organizing Ray topology within distributed workloadsLepton: Uses
LeptonRayJobwith the Lepton SDK
# Ray-enabled RL on DGX Cloud
[dgxcloud-rl]
executor = "dgxcloud"
base_url = "https://dgx.example.com/api/v1"
app_id = "YOUR-ID"
app_secret = "YOUR-SECRET"
project_name = "your-project"
container_image = "nvcr.io/nvidia/nemo-rl:latest"
pvc_nemo_run_dir = "/pvc/nemo_run"
nodes = 4
gpus_per_node = 8
# Ray-enabled RL on Lepton
[lepton-rl]
executor = "lepton"
container_image = "nvcr.io/nvidia/nemo-rl:latest"
nemo_run_dir = "/nemo_run"
nodes = 4
gpus_per_node = 8
resource_shape = "gpu.h100.8"
node_group = "my-node-group"
Use with any Ray-enabled recipe:
uv run nemotron nano3 rl -c tiny --run dgxcloud-rl
uv run nemotron nano3 rl -c tiny --run lepton-rl
Further Reading#
NeMo-Run GitHub – full documentation
W&B Integration – credential handling
Nemotron Kit – artifact system and lineage tracking
CLI Framework – building recipe CLIs