DGXCloudExecutor#
Launch distributed jobs on NVIDIA DGX Cloud via the Run:ai API.
Prerequisites#
Warning
DGXCloudExecutor is currently only supported when launching experiments from a pod running on the DGX Cloud cluster itself. The launching pod must have access to a Persistent Volume Claim (PVC), and the same PVC must be mounted by the launched job.
You need:
Access to a DGX Cloud cluster and a Run:ai project
A Run:ai application ID and secret (create one in the Run:ai console under Application credentials)
A PVC accessible from both the launching pod and the job pods
NEMORUN_HOMEpointing to a path on the PVC
Executor configuration#
import nemo_run as run
executor = run.DGXCloudExecutor(
base_url="https://<cluster-name>.<domain>/api/v1", # Run:ai API endpoint
app_id="YOUR_RUNAI_APP_ID",
app_secret="YOUR_RUNAI_APP_SECRET",
project_name="YOUR_RUNAI_PROJECT_NAME",
container_image="nvcr.io/nvidia/pytorch:24.05-py3",
nodes=1,
gpus_per_node=8,
pvcs=[{
"name": "your-pvc-k8s-name", # Kubernetes PVC name
"path": "/your_custom_path", # mount path inside the container
}],
env_vars={"PYTHONUNBUFFERED": "1"},
)
Key parameters:
Parameter |
Description |
|---|---|
|
Run:ai REST API base URL |
|
Run:ai application credentials |
|
Run:ai project to submit jobs to |
|
Container image URI |
|
Number of nodes |
|
GPUs per node |
|
List of PVC mounts ( |
E2E workflow#
import nemo_run as run
task = run.Script("python train.py --lr=3e-4 --max-steps=500")
executor = run.DGXCloudExecutor(
base_url="https://my-cluster.example.com/api/v1",
app_id="my-app-id",
app_secret="my-app-secret",
project_name="my-project",
container_image="nvcr.io/nvidia/pytorch:24.05-py3",
nodes=1,
gpus_per_node=8,
pvcs=[{"name": "my-pvc", "path": "/workspace"}],
)
with run.Experiment("my-experiment") as exp:
exp.add(task, executor=executor, name="training")
exp.run(detach=True)
# Later — reconnect and check status
experiment = run.Experiment.from_id("my-experiment_<id>")
experiment.status()
experiment.logs("training")
Advanced options#
Package code from git#
executor = run.DGXCloudExecutor(
...,
packager=run.GitArchivePackager(subpath="src"),
)
For a complete end-to-end NeMo example see the NVIDIA DGX Cloud NeMo E2E Workflow.