Run on Any Cloud with SkyPilot


In this guide, you will learn how to launch NeMo AutoModel training jobs with SkyPilot. SkyPilot can target public clouds such as AWS, GCP, Azure, and Lambda, and it can also submit jobs to Kubernetes clusters. For a beginner-friendly Kubernetes walkthrough, see the SkyPilot + Kubernetes tutorial. For on-premises cluster usage without SkyPilot, see Run on a Cluster (Slurm). For single-node workstation usage, see Run on Your Local Workstation.

SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time — including automatic spot-instance handling for significant cost savings.

Before You Begin

Complete the following setup steps before launching your first AutoModel job on a cloud provider.

  1. Install SkyPilot with the connector for your target infrastructure:

     $ uv pip install "skypilot[gcp]"         # Google Cloud
     $ uv pip install "skypilot[aws]"         # Amazon Web Services
     $ uv pip install "skypilot[azure]"       # Microsoft Azure
     $ uv pip install "skypilot[lambda]"      # Lambda Cloud
     $ uv pip install "skypilot[kubernetes]"  # Any Kubernetes cluster

  2. Configure access for your target infrastructure, then verify:

     $ sky check

     You should see at least one cloud listed as OK.

  3. Set the environment variables your job needs:

     $ export HF_TOKEN=hf_...       # Required for gated models (e.g. Llama)
     $ export WANDB_API_KEY=...     # Optional: Weights & Biases logging

Quickstart

Add a skypilot: section to any existing config YAML, then run the same automodel command you already know:

$ automodel your_config_with_skypilot.yaml

The CLI detects the skypilot: key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command.
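
The detect-and-strip step can be pictured roughly as follows. This is a minimal sketch, not the AutoModel implementation: `split_launcher_section` is a hypothetical helper name, and the config is assumed to be already parsed into a plain dict.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any, Optional


def split_launcher_section(
    config: dict[str, Any], key: str = "skypilot"
) -> tuple[Optional[dict[str, Any]], dict[str, Any]]:
    """Return (launcher_section, training_config) without mutating the input.

    launcher_section is None when the config has no launcher key,
    which a CLI could treat as "run locally".
    """
    training = deepcopy(config)
    launcher = training.pop(key, None)
    return launcher, training


# Hypothetical parsed config: one launcher section plus training settings.
config = {
    "skypilot": {"cloud": "gcp", "accelerators": "T4:1"},
    "step_scheduler": {"global_batch_size": 64, "local_batch_size": 8},
}
launcher, training = split_launcher_section(config)
print(launcher)          # the launcher-only settings
print(sorted(training))  # keys that would be forwarded to the VM
```

Copying before popping keeps the caller's dict intact, which is convenient if the original config is also saved as a snapshot.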

Configuration Reference

Below is an annotated example for fine-tuning Llama-3.2-1B on SQuAD on a GCP spot T4. A ready-to-run copy lives at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot.yaml.

# ── SkyPilot launcher section ─────────────────────────────────────────────
# Removed before the training config reaches the remote VM.
skypilot:
  cloud: gcp                   # aws | gcp | azure | lambda | kubernetes
  accelerators: T4:1           # GPU type:count per node, e.g. A100:8
  use_spot: true               # ~80% cost reduction vs on-demand
  disk_size: 100               # Remote VM disk size in GB
  num_nodes: 1                 # Increase for multi-node distributed training
  region: us-central1          # Optional; SkyPilot picks cheapest if omitted
  job_name: llama3_2_finetune  # Also used as the SkyPilot cluster name

  # Use env-var placeholders so secrets are never stored in YAML
  hf_token: ${HF_TOKEN}
  # wandb_key: ${WANDB_API_KEY}

  # Optional: extra shell commands run on the VM after `pip install -e .`
  # setup: |
  #   pip install some-extra-dependency

  # Optional: override the default output directory (default: ./skypilot_jobs)
  # job_dir: /path/to/skypilot/jobs

# ── Training config (forwarded to the VM unchanged) ───────────────────────
step_scheduler:
  global_batch_size: 64
  local_batch_size: 8
  num_epochs: 1

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# ... rest of your training config ...
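
The `${HF_TOKEN}` placeholder in the example keeps the secret out of the YAML file. This guide does not describe how AutoModel resolves such placeholders, so the following is only an illustrative sketch of one way to expand them from the local environment; `resolve_placeholders` is a hypothetical helper, and failing loudly on an unset variable is a design choice of the sketch.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a typical environment variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace every ${NAME} in value with env[NAME].

    Raises KeyError if a referenced variable is not set, so a missing
    token is caught before anything is launched.
    """
    env = os.environ if env is None else env

    def _substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_substitute, value)


# Example with an explicit mapping instead of the real environment.
print(resolve_placeholders("${HF_TOKEN}", env={"HF_TOKEN": "hf_example"}))
```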

All skypilot: Fields

| Field | Default | Description |
| --- | --- | --- |
| cloud | (required) | Cloud provider: aws, gcp, azure, lambda, kubernetes |
| accelerators | T4:1 | GPU type and count per node, e.g. A100:8, V100:4 |
| num_nodes | 1 | Number of VMs for distributed training |
| use_spot | true | Use spot/preemptible instances |
| disk_size | 100 | Remote VM disk size in GB |
| region | (auto) | Cloud region; SkyPilot selects cheapest if omitted |
| zone | (auto) | Availability zone within the region |
| instance_type | (auto) | Specific instance type; auto-selected if omitted |
| job_name | <domain>_<command> | Job and SkyPilot cluster name |
| setup | (auto) | Extra setup commands run after pip install -e . |
| hf_home | ~/.cache/huggingface | Hugging Face cache directory on the remote VM |
| hf_token | $HF_TOKEN env | Hugging Face token for gated model access |
| wandb_key | $WANDB_API_KEY env | Weights & Biases API key |
| env_vars | {} | Additional environment variables for the remote VM |
| job_dir | ./skypilot_jobs | Local directory for job artifacts (config snapshot, logs) |
| gpus_per_node | (parsed from accelerators) | Override GPU count per node passed to torchrun |
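
The gpus_per_node default is described as "parsed from accelerators". A minimal sketch of such parsing, assuming the `TYPE:COUNT` form shown in this guide and assuming that a bare type such as `V100` means one GPU; `parse_accelerators` is a hypothetical name, not the launcher's actual function.

```python
from __future__ import annotations


def parse_accelerators(spec: str) -> tuple[str, int]:
    """Split an accelerator spec such as "A100:8" into ("A100", 8).

    Assumption of this sketch: a spec without ":COUNT" means one GPU.
    """
    name, sep, count = spec.strip().partition(":")
    if not name:
        raise ValueError(f"invalid accelerator spec: {spec!r}")
    return name, (int(count) if sep else 1)


gpu_type, gpus_per_node = parse_accelerators("A100:8")
print(gpu_type, gpus_per_node)
```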

Cloud Examples

AWS — On-Demand A10G

skypilot:
  cloud: aws
  accelerators: A10G:1
  use_spot: false
  region: us-east-1
  job_name: llm_aws_finetune
  hf_token: ${HF_TOKEN}

GCP — Spot V100, 8 GPUs (Single Node)

skypilot:
  cloud: gcp
  accelerators: V100:8
  use_spot: true
  region: us-west1
  job_name: llm_gcp_v100_8gpu
  hf_token: ${HF_TOKEN}

Multi-Node Distributed Training (2 x 8 x A100)

skypilot:
  cloud: gcp
  accelerators: A100:8
  num_nodes: 2
  use_spot: false
  job_name: llm_multinode_a100
  hf_token: ${HF_TOKEN}

For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables ($SKYPILOT_NODE_RANK, $SKYPILOT_NUM_NODES, $SKYPILOT_NODE_IPS) to the torchrun command.
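
To make the rendezvous wiring concrete, here is a hedged sketch of how a torchrun command line could reference those variables. It is not the launcher's actual code: `build_torchrun_command`, the script name, the `--config` flag, and the port are illustrative assumptions. The variables are left unexpanded so that each node's shell fills them in at run time, and the first entry of `$SKYPILOT_NODE_IPS` is used as the master address.

```python
def build_torchrun_command(
    script: str,
    config_path: str,
    gpus_per_node: int,
    master_port: int = 29500,
) -> str:
    """Build a shell command string for one node of a multi-node job.

    The $SKYPILOT_* references stay literal here; the remote shell on
    each node expands them when the command runs.
    """
    parts = [
        "torchrun",
        "--nnodes=$SKYPILOT_NUM_NODES",
        f"--nproc_per_node={gpus_per_node}",
        "--node_rank=$SKYPILOT_NODE_RANK",
        # SKYPILOT_NODE_IPS holds one IP per line; the head node is first.
        '--master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)',
        f"--master_port={master_port}",
        script,
        "--config",
        config_path,
    ]
    return " ".join(parts)


# Hypothetical script name; the config path matches the upload location
# mentioned later in this guide.
print(build_torchrun_command("train.py", "/tmp/automodel_job_config.yaml", 8))
```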

Monitor and Manage Jobs

After submitting, use standard SkyPilot commands:

$ sky status                          # List running clusters and their status
$ sky logs <cluster_name>             # Stream training logs
$ ssh <cluster_name>                  # SSH into the VM for debugging (SkyPilot adds the cluster to your SSH config)
$ sky cancel <cluster_name> <job_id>  # Cancel a running job
$ sky down <cluster_name>             # Terminate the cluster and stop billing

How It Works

  1. The automodel CLI detects the skypilot: key in the YAML and calls launch_with_skypilot().
  2. The training config (with skypilot: removed) is written to a local skypilot_jobs/<timestamp>/job_config.yaml.
  3. A sky.Task is created with:
    • workdir: the current directory, synced to ~/sky_workdir on the remote VM.
    • file_mounts: the job config, uploaded to /tmp/automodel_job_config.yaml.
    • setup: pip install -e . (plus any custom setup: commands).
    • run: a torchrun command pointing at the recipe script and config.
  4. sky.launch() provisions the VM, runs setup, then executes training. The call returns immediately (detach_run=True); use sky logs to follow progress.
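
The task assembled in step 3 can be approximated as a plain dict that mirrors the fields of a SkyPilot task YAML (name, workdir, num_nodes, resources, file_mounts, setup, run). The sketch below is an assumption-laden illustration and not the body of launch_with_skypilot(): `build_task_spec` is hypothetical, the fallback job name is a placeholder, and the defaults are copied from the field table in this guide.

```python
from typing import Any, Dict


def build_task_spec(launcher: Dict[str, Any], run_cmd: str, config_path: str) -> Dict[str, Any]:
    """Map a skypilot: section onto SkyPilot task-YAML style fields."""
    # Base setup, followed by any user-supplied extra commands.
    setup_lines = ["pip install -e ."]
    extra_setup = launcher.get("setup")
    if extra_setup:
        setup_lines.append(extra_setup.strip())

    resources: Dict[str, Any] = {
        "cloud": launcher["cloud"],  # required field
        "accelerators": launcher.get("accelerators", "T4:1"),
        "use_spot": launcher.get("use_spot", True),
        "disk_size": launcher.get("disk_size", 100),
    }
    # Only pin placement when the user asked for it.
    for optional in ("region", "zone", "instance_type"):
        if optional in launcher:
            resources[optional] = launcher[optional]

    return {
        # Placeholder fallback; the documented default is <domain>_<command>.
        "name": launcher.get("job_name", "automodel_job"),
        "workdir": ".",
        "num_nodes": launcher.get("num_nodes", 1),
        "resources": resources,
        # Remote path -> local path of the config snapshot.
        "file_mounts": {"/tmp/automodel_job_config.yaml": config_path},
        "setup": "\n".join(setup_lines),
        "run": run_cmd,
    }


spec = build_task_spec(
    {"cloud": "gcp", "region": "us-central1", "setup": "pip install some-extra-dependency\n"},
    run_cmd="torchrun train.py --config /tmp/automodel_job_config.yaml",
    config_path="skypilot_jobs/example/job_config.yaml",
)
print(spec["setup"])
```

A dict in this shape could be dumped to YAML and passed to `sky launch`, which makes it a handy way to inspect what a launcher would request before any VM is provisioned.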

Customize Configuration

Override any training parameter from the command line, same as with local runs:

$ automodel config_with_skypilot.yaml \
    --model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B
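
Conceptually, a dotted override such as `--model.pretrained_model_name_or_path` addresses a nested key in the parsed config. The following sketch shows that idea on a plain dict; it is not AutoModel's argument parser, and `apply_override` is a hypothetical helper.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Set config[a][b][c] = value for dotted_key "a.b.c", creating
    intermediate dicts as needed. Mutates and returns config."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


cfg = {"model": {"pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B"}}
apply_override(cfg, "model.pretrained_model_name_or_path", "meta-llama/Llama-3.2-3B")
print(cfg["model"]["pretrained_model_name_or_path"])
```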

Kubernetes Users

If you want to run on a Kubernetes cluster, use cloud: kubernetes and follow the dedicated SkyPilot + Kubernetes tutorial. That guide includes:

  • a copy-paste single-node config
  • a two-node example
  • sample sky and kubectl output to help you sanity-check your setup
  • a short troubleshooting section for common first-run issues

When to Use SkyPilot vs. Slurm

| | SkyPilot | Slurm |
| --- | --- | --- |
| Infrastructure | Any public cloud | On-premises HPC cluster |
| Spot instances | Yes (automatic) | Depends on cluster config |
| Setup required | Cloud credentials + sky check | Cluster access |
| Good for | Flexible cloud burst, cost optimization | Fixed on-prem GPU clusters |