SkypilotExecutor
Launch tasks across clouds (AWS, GCP, Azure, Kubernetes, and more) via SkyPilot.
Prerequisites
Install the SkyPilot extras:
pip install "nemo_run[skypilot]"
Configure credentials for at least one cloud, then run `sky check` to confirm that SkyPilot can reach it. Follow the SkyPilot cloud setup guide for your provider.
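Before launching anything, it can help to confirm from Python that the SkyPilot CLI is actually installed. The following is a minimal sketch using only the standard library; `cli_available` and `run_sky_check` are hypothetical helper names, not part of nemo_run or SkyPilot, and the sketch assumes the CLI is exposed on `PATH` as `sky`.

```python
import shutil
import subprocess


def cli_available(name: str = "sky") -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def run_sky_check() -> int:
    """Run `sky check` and return its exit code.

    Raises RuntimeError when the CLI is missing, so the failure is explicit
    instead of surfacing later during job submission.
    """
    if not cli_available("sky"):
        raise RuntimeError(
            'SkyPilot CLI not found; install it with pip install "nemo_run[skypilot]"'
        )
    # check=False: the caller decides how to treat a non-zero exit code.
    return subprocess.run(["sky", "check"], check=False).returncode


if __name__ == "__main__":
    print("sky CLI on PATH:", cli_available())
```

Calling `run_sky_check()` prints SkyPilot's own report of which clouds are enabled, which is the same output you would see when running the command in a terminal.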
Executor configuration
from nemo_run.core.execution.skypilot import SkypilotExecutor
executor = SkypilotExecutor(
    cloud="kubernetes",  # or "aws", "gcp", "azure", …
    gpus="A100",  # GPU type string recognised by SkyPilot
    gpus_per_node=8,
    num_nodes=1,
    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
    env_vars={"PYTHONUNBUFFERED": "1"},
    # Optional: reuse an existing cluster instead of provisioning a new one
    cluster_name="my-sky-cluster",
    setup="""
    conda deactivate
    nvidia-smi
    """,
)
Key parameters:
| Parameter | Description |
|---|---|
| `cloud` | Cloud provider or `"kubernetes"` |
| `gpus` | GPU type string (e.g. `"A100"`) |
| `gpus_per_node` | GPUs per node |
| `num_nodes` | Number of nodes |
| `container_image` | Docker image for the job |
| `cluster_name` | Optional: name of an existing cluster to reuse |
| `setup` | Shell commands to run once on the cluster before the job |
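To make the relationship between the sizing parameters concrete, here is a minimal sketch in plain Python. It does not import nemo_run: `ExecutorSizing` is a hypothetical helper written for illustration only. It uses SkyPilot's `NAME:COUNT` accelerator notation and assumes that `gpus_per_node` is the per-node count requested for the GPU type named in `gpus`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorSizing:
    """Hypothetical helper mirroring the sizing fields of the executor."""

    gpus: str           # GPU type string, e.g. "A100"
    gpus_per_node: int  # GPUs requested on each node
    num_nodes: int      # number of nodes in the job

    def __post_init__(self) -> None:
        # Reject obviously invalid sizes before anything is submitted.
        if self.gpus_per_node < 1 or self.num_nodes < 1:
            raise ValueError("gpus_per_node and num_nodes must be >= 1")

    @property
    def accelerator_spec(self) -> str:
        # SkyPilot's NAME:COUNT notation for per-node accelerators.
        return f"{self.gpus}:{self.gpus_per_node}"

    @property
    def total_gpus(self) -> int:
        # GPUs across the whole job, not per node.
        return self.gpus_per_node * self.num_nodes


sizing = ExecutorSizing(gpus="A100", gpus_per_node=8, num_nodes=4)
print(sizing.accelerator_spec)
print(sizing.total_gpus)
```

A check like this is useful when sweeping over node counts, because quota limits are usually expressed in total GPUs rather than nodes.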
E2E workflow
import nemo_run as run
from nemo_run.core.execution.skypilot import SkypilotExecutor
task = run.Script("python train.py --lr=3e-4 --max-steps=500")
executor = SkypilotExecutor(
    cloud="kubernetes",
    gpus="RTX5880-ADA-GENERATION",
    gpus_per_node=8,
    num_nodes=1,
    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
)
with run.Experiment("my-experiment") as exp:
    exp.add(task, executor=executor, name="training")
    exp.run(detach=True)
# Later — reconnect and check status
experiment = run.Experiment.from_id("my-experiment_<id>")
experiment.status()
experiment.logs("training")
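When the same workflow is launched with different hyperparameters, it is convenient to build the command string programmatically instead of editing it by hand. The sketch below is a hypothetical helper (`build_command` is not part of nemo_run) that renders a command in the same `--key=value` style as the example above, using `shlex.join` so that values containing spaces are quoted safely.

```python
import shlex


def build_command(script: str, overrides: dict) -> str:
    """Render `python <script> --key=value ...` with shell-safe quoting."""
    parts = ["python", script]
    for key, value in overrides.items():
        # Each override becomes one token so that quoting applies to the
        # whole --key=value pair.
        parts.append(f"--{key}={value}")
    return shlex.join(parts)


command = build_command("train.py", {"lr": "3e-4", "max-steps": 500})
print(command)  # python train.py --lr=3e-4 --max-steps=500
```

The resulting string can be passed to `run.Script` in the same way as the literal command in the workflow above.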
Advanced options
SkypilotJobsExecutor (managed jobs)
SkypilotJobsExecutor submits SkyPilot Managed Jobs, which survive controller failures and support spot instances with auto-recovery:
from nemo_run.core.execution.skypilot import SkypilotJobsExecutor
executor = SkypilotJobsExecutor(
    cloud="aws",
    gpus="A100",
    gpus_per_node=8,
    num_nodes=4,
    container_image="nvcr.io/nvidia/pytorch:24.05-py3",
    use_spot=True,
)
Package code from git
executor = SkypilotExecutor(
    ...,
    packager=run.GitArchivePackager(subpath="src"),
)