Run with SkyPilot#
In this guide, you will learn how to launch NeMo AutoModel training jobs with SkyPilot. SkyPilot can target public clouds such as AWS, GCP, Azure, and Lambda, and it can also submit jobs to Kubernetes clusters. For a beginner-friendly Kubernetes walkthrough, see the SkyPilot + Kubernetes tutorial. For on-premises cluster usage without SkyPilot, see Run on a Cluster (Slurm). For single-node workstation usage, see Run on Your Local Workstation.
SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time. It also handles spot instances automatically, which can cut costs substantially compared with on-demand pricing.
Before You Begin#
Complete the following setup steps before launching your first AutoModel job on a cloud provider.
Install SkyPilot with the connector for your target infrastructure:
uv pip install "skypilot[gcp]" # Google Cloud
uv pip install "skypilot[aws]" # Amazon Web Services
uv pip install "skypilot[azure]" # Microsoft Azure
uv pip install "skypilot[lambda]" # Lambda Cloud
uv pip install "skypilot[kubernetes]" # Any Kubernetes cluster
Configure access for your target infrastructure, then verify:
sky check
You should see at least one cloud (or Kubernetes) reported as enabled.
Set required environment variables:
export HF_TOKEN=hf_... # Required for gated models (e.g. Llama)
export WANDB_API_KEY=... # Optional: Weights & Biases logging
Quickstart#
Add a skypilot: section to any existing config YAML, then run the same automodel command you already know:
automodel your_config_with_skypilot.yaml
The CLI detects the skypilot: key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command.
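The detect-and-strip step described above can be pictured with a minimal sketch. It assumes the YAML has already been loaded into a plain dictionary; the helper name `split_launcher_config` is hypothetical and is not the CLI's actual implementation.

```python
from copy import deepcopy


def split_launcher_config(config: dict, key: str = "skypilot"):
    """Return (launcher_section, training_config) without mutating the input.

    Hypothetical helper: illustrates the idea of separating the launcher
    section from the training config, not the real CLI code.
    """
    training_config = deepcopy(config)
    # pop() returns None when the key is absent, i.e. a plain local run.
    launcher_section = training_config.pop(key, None)
    return launcher_section, training_config


config = {
    "skypilot": {"cloud": "gcp", "accelerators": "T4:1"},
    "model": {"pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B"},
}
launcher, training = split_launcher_config(config)
print(launcher)          # the launcher section only
print(sorted(training))  # remaining top-level training keys
```

Working on a copy keeps the original config intact, which is convenient if the same dictionary is later written out as a snapshot.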
Configuration Reference#
Below is an annotated example for fine-tuning Llama-3.2-1B on SQuAD on a GCP spot T4. A ready-to-run copy lives at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot.yaml.
# ── SkyPilot launcher section ─────────────────────────────────────────────
# Removed before the training config reaches the remote VM.
skypilot:
  cloud: gcp                    # aws | gcp | azure | lambda | kubernetes
  accelerators: T4:1            # GPU type:count per node, e.g. A100:8
  use_spot: true                # ~80 % cost reduction vs on-demand
  disk_size: 100                # Remote VM disk size in GB
  num_nodes: 1                  # Increase for multi-node distributed training
  region: us-central1           # Optional; SkyPilot picks cheapest if omitted
  job_name: llama3_2_finetune   # Also used as the SkyPilot cluster name
  # Use env-var placeholders so secrets are never stored in YAML
  hf_token: ${HF_TOKEN}
  # wandb_key: ${WANDB_API_KEY}
  # Optional: extra shell commands run on the VM after `pip install -e .`
  # setup: |
  #   pip install some-extra-dependency
  # Optional: override the default output directory (default: ./skypilot_jobs)
  # job_dir: /path/to/skypilot/jobs
# ── Training config (forwarded to the VM unchanged) ───────────────────────
step_scheduler:
  global_batch_size: 64
  local_batch_size: 8
  num_epochs: 1
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B
# ... rest of your training config ...
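The `${HF_TOKEN}` placeholder in the example keeps the secret out of the YAML file. How the launcher expands it is not documented here; the sketch below is one plausible approach, with hypothetical helper names (`resolve_placeholders`, `resolve_config`), that fails loudly when a referenced variable is unset.

```python
import os
import re

# Matches ${VAR_NAME} placeholders such as ${HF_TOKEN}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(value: str, env=None) -> str:
    """Replace ${VAR} with its value from env (defaults to os.environ)."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            # Assumption: a missing secret should be an error, not an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)


def resolve_config(node, env=None):
    """Walk nested dicts/lists and resolve placeholders in every string."""
    if isinstance(node, dict):
        return {k: resolve_config(v, env) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_config(v, env) for v in node]
    if isinstance(node, str):
        return resolve_placeholders(node, env)
    return node


fake_env = {"HF_TOKEN": "hf_example"}  # stand-in for the real environment
resolved = resolve_placeholders("${HF_TOKEN}", env=fake_env)
section = resolve_config({"hf_token": "${HF_TOKEN}", "num_nodes": 1}, env=fake_env)
```

Passing the environment as an argument makes the helper easy to test without touching real credentials.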
All skypilot: Fields#
| Field | Default | Description |
|---|---|---|
| cloud | (required) | Cloud provider: aws, gcp, azure, lambda, or kubernetes |
| accelerators | | GPU type and count per node, e.g. A100:8 |
| num_nodes | | Number of VMs for distributed training |
| use_spot | | Use spot/preemptible instances |
| disk_size | | Remote VM disk size in GB |
| region | (auto) | Cloud region; SkyPilot selects cheapest if omitted |
| | (auto) | Availability zone within the region |
| | (auto) | Specific instance type; auto-selected if omitted |
| job_name | | Job and SkyPilot cluster name |
| setup | (auto) | Extra setup commands run after pip install -e . |
| | | Hugging Face cache directory on the remote VM |
| hf_token | | Hugging Face token for gated model access |
| wandb_key | | Weights & Biases API key |
| | | Additional environment variables for the remote VM |
| job_dir | ./skypilot_jobs | Local directory for job artifacts (config snapshot, logs) |
| | (parsed from accelerators) | Override GPU count per node passed to torchrun |
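The last row says the per-node GPU count is parsed from the accelerators value unless it is overridden. A minimal sketch of that parsing, assuming the `TYPE:COUNT` form used throughout this guide and treating a bare type as one GPU; both function names are hypothetical.

```python
def parse_accelerators(spec):
    """Split an accelerator spec such as 'A100:8' into ('A100', 8).

    Assumption: a spec without ':COUNT' (e.g. 'V100') means one GPU.
    """
    name, sep, count = spec.strip().partition(":")
    if not name:
        raise ValueError(f"invalid accelerator spec: {spec!r}")
    return name, (int(count) if sep else 1)


def gpus_per_node(spec, override=None):
    """Return the explicit override if given, else the parsed GPU count."""
    if override is not None:
        return int(override)
    return parse_accelerators(spec)[1]


print(parse_accelerators("A100:8"))
print(gpus_per_node("T4:1"))
print(gpus_per_node("A100:8", override=4))
```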
Cloud Examples#
AWS — On-Demand A10G#
skypilot:
  cloud: aws
  accelerators: A10G:1
  use_spot: false
  region: us-east-1
  job_name: llm_aws_finetune
  hf_token: ${HF_TOKEN}
GCP — Spot V100, 8 GPUs (Single Node)#
skypilot:
  cloud: gcp
  accelerators: V100:8
  use_spot: true
  region: us-west1
  job_name: llm_gcp_v100_8gpu
  hf_token: ${HF_TOKEN}
Multi-Node Distributed Training (2 x 8 x A100)#
skypilot:
  cloud: gcp
  accelerators: A100:8
  num_nodes: 2
  use_spot: false
  job_name: llm_multinode_a100
  hf_token: ${HF_TOKEN}
For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables ($SKYPILOT_NODE_RANK, $SKYPILOT_NUM_NODES, $SKYPILOT_NODE_IPS) to the torchrun command.
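To make the rendezvous wiring concrete, here is a sketch of how such a command string could be assembled. The `torchrun` flags and the three SkyPilot variables are real; the recipe script path, the `--config` flag, the port, and the function name are assumptions for illustration only, not the launcher's actual output.

```python
def build_torchrun_command(script, config_path, nproc_per_node, master_port=29500):
    """Build a shell command that lets each node join the same rendezvous.

    The $SKYPILOT_* variables are left unexpanded so that the shell on each
    remote node substitutes its own values at run time.
    """
    # SkyPilot lists one IP per line; the first line is the head node.
    master_addr = '$(echo "$SKYPILOT_NODE_IPS" | head -n1)'
    parts = [
        "torchrun",
        "--nnodes=$SKYPILOT_NUM_NODES",
        "--node_rank=$SKYPILOT_NODE_RANK",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,                 # hypothetical recipe script path
        "--config",             # hypothetical flag name
        config_path,
    ]
    return " ".join(parts)


cmd = build_torchrun_command(
    script="recipes/finetune.py",  # placeholder, not a real repository path
    config_path="/tmp/automodel_job_config.yaml",
    nproc_per_node=8,
)
print(cmd)
```

Because the variables are resolved on the remote side, the same command string works unchanged on every node of the job.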
Monitor and Manage Jobs#
After submitting, use standard SkyPilot commands:
sky status # List running clusters and their status
sky logs <cluster_name> # Stream training logs
ssh <cluster_name> # SSH into the VM for debugging (SkyPilot adds the cluster to your SSH config)
sky cancel <cluster_name> <job_id> # Cancel a running job
sky down <cluster_name> # Terminate the cluster and stop billing
How It Works#
1. The automodel CLI detects the skypilot: key in the YAML and calls launch_with_skypilot().
2. The training config (with skypilot: removed) is written to a local skypilot_jobs/<timestamp>/job_config.yaml.
3. A sky.Task is created with:
   - workdir: the current directory, synced to ~/sky_workdir on the remote VM.
   - file_mounts: the job config, uploaded to /tmp/automodel_job_config.yaml.
   - setup: pip install -e . (plus any custom setup: commands).
   - run: a torchrun command pointing at the recipe script and config.
4. sky.launch() provisions the VM, runs setup, then executes training. The call returns immediately (detach_run=True); use sky logs to follow progress.
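The task assembly above can be sketched without importing SkyPilot by building a dictionary that mirrors the fields of a SkyPilot task YAML (name, workdir, num_nodes, resources, file_mounts, setup, run). The helper names, the timestamp format, and the default of one node are assumptions; the real launcher constructs a sky.Task object instead.

```python
from datetime import datetime
from pathlib import Path

REMOTE_CONFIG = "/tmp/automodel_job_config.yaml"


def job_config_path(job_dir="skypilot_jobs", now=None):
    """Local snapshot path; the timestamp format here is a guess."""
    now = now or datetime.now()
    return Path(job_dir) / now.strftime("%Y%m%d_%H%M%S") / "job_config.yaml"


def build_task_spec(launcher: dict, local_config: str, run_command: str) -> dict:
    """Map the skypilot: section onto SkyPilot task-YAML style fields."""
    setup = "pip install -e ."
    extra = launcher.get("setup")
    if extra:
        # Custom setup commands run after the editable install.
        setup = f"{setup}\n{extra.strip()}"
    resource_keys = ("cloud", "accelerators", "use_spot", "disk_size", "region")
    return {
        "name": launcher.get("job_name"),
        "workdir": ".",  # current directory is synced to the VM
        "num_nodes": launcher.get("num_nodes", 1),
        "resources": {k: launcher[k] for k in resource_keys if k in launcher},
        "file_mounts": {REMOTE_CONFIG: local_config},  # remote path -> local path
        "setup": setup,
        "run": run_command,
    }


launcher = {
    "cloud": "gcp",
    "accelerators": "T4:1",
    "use_spot": True,
    "job_name": "llama3_2_finetune",
    "setup": "pip install some-extra-dependency\n",
}
snapshot = job_config_path(now=datetime(2025, 1, 2, 3, 4, 5))
spec = build_task_spec(launcher, snapshot.as_posix(), "torchrun ...")
```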
Customize Configuration#
Override any training parameter from the command line, same as with local runs:
automodel config_with_skypilot.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B
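Dotted overrides like the one above map naturally onto nested config keys. The sketch below shows the general idea with hypothetical helpers (`parse_overrides`, `apply_override`); it keeps values as strings and does not reproduce whatever type handling the real CLI performs.

```python
def parse_overrides(argv):
    """Turn ['--a.b', 'value', ...] into {'a.b': 'value', ...}."""
    overrides = {}
    tokens = iter(argv)
    for token in tokens:
        if not token.startswith("--"):
            raise ValueError(f"expected an option starting with '--', got {token!r}")
        value = next(tokens, None)
        if value is None:
            raise ValueError(f"missing value for {token}")
        overrides[token[2:]] = value
    return overrides


def apply_override(config: dict, dotted_key: str, value):
    """Set config['a']['b'] = value for dotted_key 'a.b', creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


config = {"model": {"pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B"}}
argv = ["--model.pretrained_model_name_or_path", "meta-llama/Llama-3.2-3B"]
for key, value in parse_overrides(argv).items():
    apply_override(config, key, value)
print(config["model"]["pretrained_model_name_or_path"])
```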
Kubernetes Users#
If you want to run on a Kubernetes cluster, use cloud: kubernetes and follow the dedicated SkyPilot + Kubernetes tutorial. That guide includes:
a copy-paste single-node config
a two-node example
sample sky and kubectl output to help you sanity-check your setup
a short troubleshooting section for common first-run issues
When to Use SkyPilot vs. Slurm#
| | SkyPilot | Slurm |
|---|---|---|
| Infrastructure | Any public cloud | On-premises HPC cluster |
| Spot instances | Yes (automatic) | Depends on cluster config |
| Setup required | Cloud credentials + | Cluster access |
| Good for | Flexible cloud burst, cost optimization | Fixed on-prem GPU clusters |