# Executors

An execution unit is a (task, executor) pair. The task defines what to run; the executor defines where and how. NeMo-Run keeps these two concerns separate so you can swap executors without changing your task configuration.
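For example, the sketch below runs the same task under two different executors. The function name `train` and the cluster settings are illustrative; only `run.Partial`, `run.run`, and the executor classes come from NeMo-Run.

```python
import nemo_run as run

def train(steps: int = 100):
    ...  # placeholder training function

task = run.Partial(train, steps=1000)

# Prototype locally ...
run.run(task, executor=run.LocalExecutor())

# ... then move to a cluster by swapping only the executor;
# the task configuration stays untouched (account/partition are illustrative).
# run.run(task, executor=run.SlurmExecutor(account="my-account", partition="gpu"))
```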

## Choose an executor

Pick the executor that matches your environment:

| Executor | When to use | Setup cost |
|---|---|---|
| `LocalExecutor` | Prototyping, debugging, CI | None (works out of the box) |
| `DockerExecutor` | Reproducible local runs, container-based workflows | Docker installed & running |
| `SlurmExecutor` | HPC clusters with Slurm and Pyxis | SSH access to a Slurm cluster |
| `SkypilotExecutor` | Multi-cloud: AWS, GCP, Azure, Kubernetes | `pip install nemo_run[skypilot]` + cloud credentials |
| `DGXCloudExecutor` | NVIDIA DGX Cloud via Run:ai | Pod access + PVC on DGX Cloud |
| `LeptonExecutor` | NVIDIA DGX Cloud Lepton (standard execution) | Lepton CLI installed & authenticated |
| `KubeflowExecutor` | Distributed training via Kubeflow Training Operator v2 | `kubectl` + Kubeflow Training Operator v2 |
| `KubeRayExecutor` | Ray workloads on Kubernetes | `kubectl` + KubeRay operator |
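Constructing an executor is a matter of instantiating the matching class. The following is a rough sketch for three common setups; the field values (image name, account, host, paths) are assumptions for a typical environment, not required defaults.

```python
import nemo_run as run

# Zero-setup local execution.
local = run.LocalExecutor()

# Container-based local execution (image name is illustrative).
docker = run.DockerExecutor(container_image="nvcr.io/nvidia/nemo:24.07")

# Remote Slurm execution over SSH (host, user, and paths are illustrative).
slurm = run.SlurmExecutor(
    account="my-account",
    partition="gpu",
    nodes=1,
    ntasks_per_node=8,
    tunnel=run.SSHTunnel(
        host="login.cluster.example.com",
        user="me",
        job_dir="/scratch/me/nemo-run",
    ),
)
```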

## Packager support matrix

The packager controls how your code is bundled and sent to the execution environment.

| Executor | Packagers |
|---|---|
| `LocalExecutor` | `run.Packager` (passthrough) |
| `DockerExecutor` | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` |
| `SlurmExecutor` | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` |
| `SkypilotExecutor` | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` |
| `DGXCloudExecutor` | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` |
| `LeptonExecutor` | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` |
| `KubeflowExecutor` | `run.Packager` |

See Execution — Packagers for a description of each packager.
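Executors carry a `packager` field, so selecting one is a constructor argument. A minimal sketch, assuming you run from inside a git repository (the container image is illustrative):

```python
import nemo_run as run

# Bundle the git-tracked code and ship it to the execution environment.
executor = run.DockerExecutor(
    container_image="nvcr.io/nvidia/nemo:24.07",  # illustrative image
    packager=run.GitArchivePackager(),
)
```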

## Launcher support

The launcher controls how the process is started inside the executor.

| Launcher | Flag | Description |
|---|---|---|
| Default | None | Direct subprocess; no special launcher |
| Torchrun | `"torchrun"` / `run.Torchrun(...)` | Distributed training via `torchrun` |
| Fault Tolerance | `"ft"` / `run.core.execution.FaultTolerance(...)` | NVIDIA fault-tolerant launcher |
| SlurmRay | `"slurm_ray"` | Ray cluster on Slurm (see ray.md) |

See Execution — Launchers for details.
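A launcher is attached to the executor, either by its flag name or as a configured object. A minimal sketch (the Slurm fields are illustrative):

```python
import nemo_run as run

executor = run.SlurmExecutor(account="my-account", partition="gpu", ntasks_per_node=8)

# Shorthand: select the launcher by its flag ...
executor.launcher = "torchrun"

# ... or attach a configured launcher object instead.
executor.launcher = run.Torchrun()
```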