# Run on Any Cloud with SkyPilot
In this guide, you will learn how to launch NeMo AutoModel training jobs on any major cloud provider (AWS, GCP, Azure, Lambda, Kubernetes) using SkyPilot. For on-premises cluster usage, see Run on a Cluster (Slurm). For single-node workstation usage, see Run on Your Local Workstation.
SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time — including automatic spot-instance handling for significant cost savings.
## Before You Begin
Complete the following setup steps before launching your first AutoModel job on a cloud provider.
Install SkyPilot with the connector for your target cloud:
```bash
pip install "skypilot[gcp]"         # Google Cloud
pip install "skypilot[aws]"         # Amazon Web Services
pip install "skypilot[azure]"       # Microsoft Azure
pip install "skypilot[lambda]"      # Lambda Cloud
pip install "skypilot[kubernetes]"  # Any Kubernetes cluster
```
Configure your cloud credentials by following the SkyPilot credential setup guide for your cloud, then verify:
```bash
sky check
```
You should see at least one cloud listed as OK.
Set required environment variables:
```bash
export HF_TOKEN=hf_...      # Required for gated models (e.g. Llama)
export WANDB_API_KEY=...    # Optional: Weights & Biases logging
```
## Quickstart
Add a skypilot: section to any existing config YAML, then run the same automodel command you already know:
```bash
automodel finetune llm -c your_config_with_skypilot.yaml
```
The CLI detects the skypilot: key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command.
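Conceptually, stripping the launcher section before upload amounts to popping one key from the parsed config. The sketch below illustrates the idea with plain dictionaries; the key and variable names are illustrative, not the actual automodel internals:

```python
# Illustrative sketch: separate the "skypilot" launcher section from the
# training config before submission. The real CLI parses YAML; here we use
# a plain dict for clarity.
config = {
    "skypilot": {"cloud": "gcp", "accelerators": "T4:1", "use_spot": True},
    "step_scheduler": {"global_batch_size": 64},
    "model": {"pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B"},
}

# Pop the launcher section; everything that remains is the pure training
# config that gets uploaded to the remote VM.
sky_settings = config.pop("skypilot")

print(sky_settings["cloud"])  # gcp
print(sorted(config))         # ['model', 'step_scheduler']
```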
## Configuration Reference

Below is an annotated example for fine-tuning Llama-3.2-1B on SQuAD on a GCP spot T4. A ready-to-run copy lives at `examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot.yaml`.
```yaml
# ── SkyPilot launcher section ─────────────────────────────────────────────
# Removed before the training config reaches the remote VM.
skypilot:
  cloud: gcp                  # aws | gcp | azure | lambda | kubernetes
  accelerators: T4:1          # GPU type:count per node, e.g. A100:8
  use_spot: true              # ~80% cost reduction vs on-demand
  disk_size: 100              # Remote VM disk size in GB
  num_nodes: 1                # Increase for multi-node distributed training
  region: us-central1         # Optional — SkyPilot picks cheapest if omitted
  job_name: llama3_2_finetune # Also used as the SkyPilot cluster name

  # Use env-var placeholders so secrets are never stored in YAML
  hf_token: ${HF_TOKEN}
  # wandb_key: ${WANDB_API_KEY}

  # Optional: extra shell commands run on the VM after `pip install -e .`
  # setup: |
  #   pip install some-extra-dependency

  # Optional: override the default output directory (default: ./skypilot_jobs)
  # job_dir: /path/to/skypilot/jobs

# ── Training config (forwarded to the VM unchanged) ───────────────────────
step_scheduler:
  global_batch_size: 64
  local_batch_size: 8
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# ... rest of your training config ...
```
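The `${HF_TOKEN}`-style placeholders can be resolved from the shell environment at launch time. One standard way to do this in Python is `os.path.expandvars`; this is a sketch of the mechanism, not necessarily how the CLI implements it:

```python
import os
import os.path

# Hypothetical placeholder resolution: ${VAR} forms are expanded from the
# current environment, so secrets never need to be written into the YAML.
os.environ["HF_TOKEN"] = "hf_example_token"  # normally exported in your shell

raw_value = "${HF_TOKEN}"
resolved = os.path.expandvars(raw_value)
print(resolved)  # hf_example_token
```

Note that `os.path.expandvars` leaves unset variables untouched, so a missing `HF_TOKEN` would surface as a literal `${HF_TOKEN}` string rather than failing silently.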
## All `skypilot:` Fields

| Field | Default | Description |
|---|---|---|
| `cloud` | (required) | Cloud provider: `aws`, `gcp`, `azure`, `lambda`, or `kubernetes` |
| `accelerators` | | GPU type and count per node, e.g. `A100:8` |
| `num_nodes` | `1` | Number of VMs for distributed training |
| `use_spot` | | Use spot/preemptible instances |
| `disk_size` | | Remote VM disk size in GB |
| `region` | (auto) | Cloud region; SkyPilot selects cheapest if omitted |
| `zone` | (auto) | Availability zone within the region |
| `instance_type` | (auto) | Specific instance type; auto-selected if omitted |
| `job_name` | | Job and SkyPilot cluster name |
| `setup` | (auto) | Extra setup commands run after `pip install -e .` |
| `hf_home` | | Hugging Face cache directory on the remote VM |
| `hf_token` | | Hugging Face token for gated model access |
| `wandb_key` | | Weights & Biases API key |
| `env_vars` | | Additional environment variables for the remote VM |
| `job_dir` | `./skypilot_jobs` | Local directory for job artifacts (config snapshot, logs) |
| `num_gpus` | (parsed from `accelerators`) | Override GPU count per node passed to `torchrun` |
## Cloud Examples

### AWS — On-Demand A10G
```yaml
skypilot:
  cloud: aws
  accelerators: A10G:1
  use_spot: false
  region: us-east-1
  job_name: llm_aws_finetune
  hf_token: ${HF_TOKEN}
```
### GCP — Spot V100, 8 GPUs (Single Node)
```yaml
skypilot:
  cloud: gcp
  accelerators: V100:8
  use_spot: true
  region: us-west1
  job_name: llm_gcp_v100_8gpu
  hf_token: ${HF_TOKEN}
```
### Multi-Node Distributed Training (2 × 8 × A100)
```yaml
skypilot:
  cloud: gcp
  accelerators: A100:8
  num_nodes: 2
  use_spot: false
  job_name: llm_multinode_a100
  hf_token: ${HF_TOKEN}
```
For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command.
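As a sketch, a multi-node `torchrun` invocation assembled from those variables might look like the following. The exact flags, script path, and port the launcher emits may differ; only the rendezvous variables themselves come from SkyPilot:

```python
import os

# Simulated SkyPilot rendezvous environment (set automatically on each VM).
# SKYPILOT_NODE_IPS is a newline-separated list of node IPs.
os.environ["SKYPILOT_NODE_RANK"] = "0"
os.environ["SKYPILOT_NUM_NODES"] = "2"
os.environ["SKYPILOT_NODE_IPS"] = "10.0.0.1\n10.0.0.2"

# The first IP in the list is conventionally used as the rendezvous head node.
head_ip = os.environ["SKYPILOT_NODE_IPS"].split("\n")[0]

cmd = [
    "torchrun",
    f"--nnodes={os.environ['SKYPILOT_NUM_NODES']}",
    f"--node_rank={os.environ['SKYPILOT_NODE_RANK']}",
    "--nproc_per_node=8",   # GPUs per node, matching accelerators A100:8
    f"--master_addr={head_ip}",
    "--master_port=29500",  # illustrative port choice
    "recipe.py", "-c", "/tmp/automodel_job_config.yaml",
]
print(" ".join(cmd))
```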
## Monitor and Manage Jobs
After submitting, use standard SkyPilot commands:
```bash
sky status                          # List running clusters and their status
sky logs <cluster_name>             # Stream training logs
ssh <cluster_name>                  # SSH into the VM (SkyPilot adds the host alias)
sky cancel <cluster_name> <job_id>  # Cancel a running job
sky down <cluster_name>             # Terminate the cluster and stop billing
```
## How It Works

1. The `automodel` CLI detects the `skypilot:` key in the YAML and calls `launch_with_skypilot()`.
2. The training config (with `skypilot:` removed) is written to a local `skypilot_jobs/<timestamp>/job_config.yaml`.
3. A `sky.Task` is created with:
   - **workdir** — the current directory, synced to `~/sky_workdir` on the remote VM.
   - **file_mounts** — the job config, uploaded to `/tmp/automodel_job_config.yaml`.
   - **setup** — `pip install -e .` (plus any custom `setup:` commands).
   - **run** — a `torchrun` command pointing at the recipe script and config.
4. `sky.launch()` provisions the VM, runs setup, then executes training. The call returns immediately (`detach_run=True`); use `sky logs` to follow progress.
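Putting those pieces together, the generated task is roughly equivalent to a hand-written SkyPilot task YAML like the sketch below. The `<timestamp>` and `<recipe_script>` placeholders stand in for values the launcher fills at submission time:

```yaml
# Sketch of an equivalent standalone SkyPilot task (sky launch task.yaml).
workdir: .                    # synced to ~/sky_workdir on the remote VM
file_mounts:
  /tmp/automodel_job_config.yaml: ./skypilot_jobs/<timestamp>/job_config.yaml
resources:
  cloud: gcp
  accelerators: T4:1
  use_spot: true
setup: |
  pip install -e .
run: |
  torchrun --nproc_per_node=1 <recipe_script> -c /tmp/automodel_job_config.yaml
```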
## Customize Configuration
Override any training parameter from the command line, same as with local runs:
```bash
automodel finetune llm -c config_with_skypilot.yaml \
  --model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B
```
## When to Use SkyPilot vs. Slurm

| | SkyPilot | Slurm |
|---|---|---|
| Infrastructure | Any public cloud | On-premises HPC cluster |
| Spot instances | Yes (automatic) | Depends on cluster config |
| Setup required | Cloud credentials + `pip install skypilot` | Cluster access |
| Good for | Flexible cloud burst, cost optimization | Fixed on-prem GPU clusters |