LeptonExecutor#
Launch distributed batch jobs on NVIDIA DGX Cloud Lepton.
Prerequisites#
- DGX Cloud Lepton CLI installed and authenticated (`lep workspace info` should return your workspace)
- A node group with sufficient GPU capacity
- A remote storage mount accessible from the job pods
Note
For Ray workloads on Lepton (e.g. RayCluster / RayJob), see Ray Clusters & Jobs instead.
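You can verify the CLI prerequisite programmatically before building an executor. A minimal sketch (the helper name is ours; it just shells out to the `lep` CLI mentioned above):

```python
import shutil
import subprocess


def cli_is_authenticated(cli: str = "lep") -> bool:
    """Return True if `cli` is on PATH and `<cli> workspace info` exits cleanly."""
    if shutil.which(cli) is None:
        return False
    result = subprocess.run(
        [cli, "workspace", "info"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

If this returns `False`, install and authenticate the Lepton CLI before launching jobs.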
Executor configuration#
```python
import nemo_run as run

executor = run.LeptonExecutor(
    resource_shape="gpu.8xh100-80gb",  # resource shape = GPUs per pod
    node_group="my-node-group",
    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
    nodes=1,
    gpus_per_node=8,
    nemo_run_dir="/nemo-workspace/nemo-run",  # path on remote storage for NeMo-Run metadata
    mounts=[
        {
            "path": "/nemo-workspace",        # remote storage path
            "mount_path": "/nemo-workspace",  # container mount point
        }
    ],
    env_vars={"PYTHONUNBUFFERED": "1"},
)
```
Key parameters:

| Parameter | Description |
|---|---|
| `resource_shape` | Resource shape string (encodes GPU count per pod) |
| `node_group` | Lepton node group to schedule on |
| `container_image` | Container image URI |
| `nodes` | Number of pods |
| `gpus_per_node` | GPUs per pod |
| `nemo_run_dir` | Directory on remote storage where NeMo-Run saves experiment metadata |
| `mounts` | Remote storage mounts (dicts with `path` and `mount_path` keys) |
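A malformed mount entry typically only surfaces at submission time, so a quick local check can save a round trip. A hypothetical validator (not part of NeMo-Run) based on the dict shape shown in the configuration example:

```python
def validate_mounts(mounts: list[dict]) -> None:
    """Check each mount dict has the keys the executor example above uses."""
    required = {"path", "mount_path"}
    for i, mount in enumerate(mounts):
        missing = required - mount.keys()
        if missing:
            raise ValueError(f"mounts[{i}] is missing keys: {sorted(missing)}")
        for key in required:
            if not str(mount[key]).startswith("/"):
                raise ValueError(f"mounts[{i}][{key!r}] must be an absolute path")


validate_mounts([{"path": "/nemo-workspace", "mount_path": "/nemo-workspace"}])  # OK
```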
E2E workflow#
```python
import nemo_run as run

task = run.Script("python train.py --lr=3e-4 --max-steps=500")

executor = run.LeptonExecutor(
    resource_shape="gpu.8xh100-80gb",
    node_group="my-node-group",
    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
    nodes=1,
    gpus_per_node=8,
    nemo_run_dir="/nemo-workspace/nemo-run",
    mounts=[{"path": "/nemo-workspace", "mount_path": "/nemo-workspace"}],
)

with run.Experiment("my-experiment") as exp:
    exp.add(task, executor=executor, name="training")
    exp.run(detach=True)

# Later: reconnect and check status
experiment = run.Experiment.from_id("my-experiment_<id>")
experiment.status()
experiment.logs("training")
```
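If you script the reconnect step, you may want to block until the job reaches a terminal state. A generic polling sketch (the terminal-state names and the status callable are assumptions; wire in however you actually read the job state, e.g. a wrapper around `experiment.status()`):

```python
import time


def poll_until_terminal(
    get_status,
    terminal=("SUCCEEDED", "FAILED"),
    interval=30.0,
    timeout=3600.0,
):
    """Call get_status() every `interval` seconds until it returns a terminal value."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job not terminal after {timeout}s")
```

For example, `poll_until_terminal(lambda: my_status_fn())`, where `my_status_fn` is whatever function returns the current job state as a string.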
Advanced options#
Node reservation#
Pin the job to a specific reserved node group:
```python
executor = run.LeptonExecutor(
    ...,
    node_reservation="my-node-reservation",
)
```
Pre-launch commands#
Run shell commands inside the container before the job starts:
```python
executor = run.LeptonExecutor(
    ...,
    pre_launch_commands=["nvidia-smi", "pip install --upgrade my-package"],
)
```
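Conceptually the commands run in order before the entrypoint starts. The sketch below illustrates that ordering by chaining with `&&` so a failing command stops the launch; this is an assumption about the mechanism for illustration only, not how NeMo-Run necessarily implements it:

```python
def build_launch_command(pre_launch_commands: list[str], entrypoint: str) -> str:
    """Chain setup commands before the entrypoint; `&&` stops at the first failure.

    Assumed mechanism for illustration; NeMo-Run's actual implementation may differ.
    """
    return " && ".join([*pre_launch_commands, entrypoint])


cmd = build_launch_command(
    ["nvidia-smi", "pip install --upgrade my-package"],
    "python train.py",
)
# cmd == "nvidia-smi && pip install --upgrade my-package && python train.py"
```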
Private registry images#
Reference an image pull secret when the container image lives in a private registry:
```python
executor = run.LeptonExecutor(
    ...,
    image_pull_secrets=["my-registry-secret"],
)
```