Run on Any Cloud with SkyPilot
In this guide, you will learn how to launch NeMo AutoModel training jobs with SkyPilot. SkyPilot can target public clouds such as AWS, GCP, Azure, and Lambda, and it can also submit jobs to Kubernetes clusters. For a beginner-friendly Kubernetes walkthrough, see SkyPilot + Kubernetes tutorial. For on-premises cluster usage without SkyPilot, see Run on a Cluster (Slurm). For single-node workstation usage, see Run on Your Local Workstation.
SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time — including automatic spot-instance handling for significant cost savings.
Before You Begin
Complete the following setup steps before launching your first AutoModel job on a cloud provider.
- Install SkyPilot with the connector for your target infrastructure:
- Configure access for your target infrastructure, then verify it by running sky check. You should see at least one cloud listed as OK.
- Set required environment variables:
Quickstart
Add a skypilot: section to any existing config YAML, then run the same automodel command you already know:
The CLI detects the skypilot: key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command.
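The detect-and-strip step described above can be pictured with a minimal sketch. This is not the launcher's actual code: the helper name split_skypilot_section and the placeholder model section are hypothetical, and cloud is the only skypilot: field taken from this guide.

```python
import copy
from typing import Any, Dict, Optional, Tuple


def split_skypilot_section(
    config: Dict[str, Any],
) -> Tuple[Optional[Dict[str, Any]], Dict[str, Any]]:
    """Separate the launcher settings from the training settings.

    Returns (skypilot_section, training_config). The input is not modified.
    """
    training_cfg = copy.deepcopy(config)
    # pop() returns None when the key is absent, i.e. a plain local run.
    sky_cfg = training_cfg.pop("skypilot", None)
    return sky_cfg, training_cfg


# Hypothetical parsed YAML; only `cloud` is a field named in this guide.
parsed = {
    "skypilot": {"cloud": "gcp"},
    "model": {"name": "example-model"},  # placeholder training section
}

sky_cfg, training_cfg = split_skypilot_section(parsed)
if sky_cfg is not None:
    print("launch remotely on:", sky_cfg["cloud"])
print("training config keys:", sorted(training_cfg))
```

Working on a deep copy keeps the original config intact, so the same YAML can still be used for a local run.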
Configuration Reference
Below is an annotated example for fine-tuning Llama-3.2-1B on SQuAD on a GCP spot T4. A ready-to-run copy lives at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot.yaml.
All skypilot: Fields
Cloud Examples
AWS — On-Demand A10G
GCP — Spot V100, 8 GPUs (Single Node)
Multi-Node Distributed Training (2 x 8 x A100)
For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables ($SKYPILOT_NODE_RANK, $SKYPILOT_NUM_NODES, $SKYPILOT_NODE_IPS) to the torchrun command.
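To illustrate how these variables map onto torchrun's rendezvous flags, here is a rough sketch. It is an assumption about the shape of the resulting command, not the launcher's implementation: build_torchrun_args, the recipe.py path, the --config recipe flag, and the port 29500 are all hypothetical. It relies on SkyPilot's convention that SKYPILOT_NODE_IPS holds one IP per line with the head node first.

```python
from typing import List, Mapping


def build_torchrun_args(
    env: Mapping[str, str],
    script: str,
    config_path: str,
    gpus_per_node: int,
    master_port: int = 29500,  # assumed port, not taken from this guide
) -> List[str]:
    """Translate SkyPilot's per-node variables into torchrun arguments."""
    # SKYPILOT_NODE_IPS lists one IP per line; the first is the head node.
    node_ips = env["SKYPILOT_NODE_IPS"].split()
    head_ip = node_ips[0]
    return [
        "torchrun",
        f"--nnodes={env['SKYPILOT_NUM_NODES']}",
        f"--node_rank={env['SKYPILOT_NODE_RANK']}",
        f"--nproc_per_node={gpus_per_node}",
        f"--master_addr={head_ip}",
        f"--master_port={master_port}",
        script,
        "--config",  # assumed recipe flag
        config_path,
    ]


# Simulated environment for the second node of a 2 x 8 GPU job.
fake_env = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
    "SKYPILOT_NUM_NODES": "2",
    "SKYPILOT_NODE_RANK": "1",
}
args = build_torchrun_args(
    fake_env, "recipe.py", "/tmp/automodel_job_config.yaml", gpus_per_node=8
)
print(" ".join(args))
```

Every node runs the same command; only the node rank differs, which is why the values come from the environment rather than from the config.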
Monitor and Manage Jobs
After submitting, use standard SkyPilot commands:
How It Works
- The automodel CLI detects the skypilot: key in the YAML and calls launch_with_skypilot().
- The training config (with skypilot: removed) is written to a local skypilot_jobs/<timestamp>/job_config.yaml.
- A sky.Task is created with:
  - workdir: the current directory, synced to ~/sky_workdir on the remote VM.
  - file_mounts: the job config, uploaded to /tmp/automodel_job_config.yaml.
  - setup: pip install -e . (plus any custom setup: commands).
  - run: a torchrun command pointing at the recipe script and config.
- sky.launch() provisions the VM, runs setup, then executes training. The call returns immediately (detach_run=True); use sky logs to follow progress.
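The four task fields listed above can be sketched as a plain dictionary that mirrors SkyPilot's task YAML layout (workdir, file_mounts, setup, run). This is only an illustration: build_task_spec is a hypothetical helper, the example job directory and recipe path are placeholders, and placing custom setup commands after pip install -e . is an assumption.

```python
from typing import Dict, List, Optional


def build_task_spec(
    local_job_config: str,
    run_command: str,
    extra_setup: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Assemble the task fields described in this section."""
    # Assumed ordering: the editable install first, custom commands after.
    setup_lines = ["pip install -e ."] + list(extra_setup or [])
    return {
        # Current directory; SkyPilot syncs it to ~/sky_workdir remotely.
        "workdir": ".",
        # file_mounts maps remote path -> local path.
        "file_mounts": {"/tmp/automodel_job_config.yaml": local_job_config},
        "setup": "\n".join(setup_lines),
        "run": run_command,
    }


spec = build_task_spec(
    # "example" stands in for the <timestamp> directory.
    local_job_config="skypilot_jobs/example/job_config.yaml",
    run_command="torchrun recipe.py --config /tmp/automodel_job_config.yaml",
    extra_setup=["echo custom setup step"],
)
for key, value in spec.items():
    print(f"{key}: {value!r}")
```

Keeping the spec as plain data makes it easy to inspect what will be uploaded and executed before anything is provisioned.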
Customize Configuration
Override any training parameter from the command line, same as with local runs:
Kubernetes Users
If you want to run on a Kubernetes cluster, use cloud: kubernetes and follow the dedicated SkyPilot + Kubernetes tutorial. That guide includes:
- a copy-paste single-node config
- a two-node example
- sample sky and kubectl output to help you sanity-check your setup
- a short troubleshooting section for common first-run issues