SkyPilot k8s
This tutorial shows how to run NeMo AutoModel on a Kubernetes cluster through SkyPilot.
You will:
- Check that SkyPilot can see your Kubernetes cluster and GPUs.
- Launch a small NeMo AutoModel fine-tuning job on one GPU.
- Scale the same job to two nodes.
- Follow logs and clean everything up when you are done.
This guide is written for new AutoModel users, so it keeps the moving pieces as small as possible.
Before you begin
You need:
- a working Kubernetes context in kubectl
- at least one GPU-backed node in the cluster
- SkyPilot installed with Kubernetes support
- a local NeMo AutoModel checkout
- a Hugging Face token in HF_TOKEN if you plan to use a gated model such as Llama
If you are setting up SkyPilot on Kubernetes for the first time, work through the official SkyPilot Kubernetes setup guide before continuing.
Install SkyPilot with Kubernetes support in your AutoModel environment, for example with pip install "skypilot[kubernetes]".
Set the token once in your shell:
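A minimal sketch for a bash-compatible shell follows. The token value is a placeholder, not a real credential, so substitute your own.

```shell
# Placeholder value: replace with your own Hugging Face token.
export HF_TOKEN="hf_your_token_here"
# Confirm the variable is exported to child processes without printing the secret.
printenv HF_TOKEN >/dev/null && echo "HF_TOKEN is exported"
```

Exporting matters here because the launcher reads the variable from the environment of the shell that submits the job.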
Step 1: Verify the cluster
Start with three quick checks:
You want sky check kubernetes to report that Kubernetes is enabled.
Next, ask SkyPilot which GPUs it can request from the cluster:
Example output:
If you do not see any GPUs here, stop and fix the Kubernetes or SkyPilot setup first. AutoModel may be ready, but SkyPilot cannot place GPU jobs until the cluster's GPUs are visible to it.
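The checks in this step can be bundled into one shell function so they are easy to rerun after a fix. This is a sketch rather than the tutorial's exact commands: verify_cluster is a hypothetical helper name, and kubectl get nodes is an assumed choice for one of the quick checks.

```shell
# Hypothetical helper: run the Step 1 checks in order, stopping at the first failure.
verify_cluster() {
  # Which cluster is kubectl pointing at?
  kubectl config current-context || return 1
  # Are the nodes registered? (assumed check, not confirmed by the tutorial)
  kubectl get nodes || return 1
  # Does SkyPilot report Kubernetes as enabled?
  sky check kubernetes || return 1
  # Which GPUs can SkyPilot request from the cluster?
  sky show-gpus --infra k8s
}
```

Run verify_cluster from the same shell and environment you will later use to submit the job, so the checks see the same kubeconfig context.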
Step 2: Run a single-node job
The easiest starting point is a one-GPU fine-tune using the existing Llama 3.2 1B SQuAD example.
This repository now includes a Kubernetes-flavored SkyPilot config at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml.
Launch it from the repo root:
The important part of that YAML is the skypilot: block:
What AutoModel does for you:
- writes a launcher-free copy of the training config to skypilot_jobs/<timestamp>/job_config.yaml
- syncs the repo to the SkyPilot workdir
- runs torchrun on the Kubernetes worker pod
- forwards your training config unchanged after removing the skypilot: section
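To inspect what torchrun will actually receive, you can locate the most recent generated config. The sketch below assumes the <timestamp> folder names sort chronologically when sorted as text. latest_job_config is a hypothetical helper, not part of AutoModel.

```shell
# Hypothetical helper: print the path of the newest generated job_config.yaml.
latest_job_config() {
  # $1: folder holding the per-submission directories (default: ./skypilot_jobs)
  jobs_dir="${1:-skypilot_jobs}"
  # Assumption: timestamped folder names sort chronologically as plain text.
  latest=$(ls -1 "$jobs_dir" 2>/dev/null | sort | tail -n 1)
  # Fail if the folder is missing or empty.
  [ -n "$latest" ] || return 1
  echo "$jobs_dir/$latest/job_config.yaml"
}
```

For example, grep -n '^skypilot:' "$(latest_job_config)" should print nothing, because the launcher section is removed from the generated copy.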
Example submission output:
Then watch the cluster come up:
Example log snippet:
Step 3: Scale to two nodes
Once the single-node job works, scaling out is just a small YAML change.
Use the two-node example at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml:
The launcher block looks like this:
For multi-node jobs, AutoModel switches the generated command to a distributed torchrun launch that uses SkyPilot’s node metadata:
That means you do not need to hand-build rendezvous arguments yourself.
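For orientation, here is a sketch of how such arguments can be derived from the environment variables SkyPilot sets on each node (SKYPILOT_NUM_NODES, SKYPILOT_NODE_RANK, SKYPILOT_NODE_IPS, SKYPILOT_NUM_GPUS_PER_NODE). The command AutoModel actually generates may differ. torchrun_dist_args, the MASTER_PORT override, and the 29500 fallback are assumptions made for this illustration.

```shell
# Hypothetical helper: print distributed torchrun arguments built from SkyPilot's node metadata.
torchrun_dist_args() {
  # SKYPILOT_NODE_IPS holds one IP per line; the first entry is the head node,
  # which serves as the rendezvous host.
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n 1)
  echo "--nnodes=$SKYPILOT_NUM_NODES" \
       "--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE" \
       "--node_rank=$SKYPILOT_NODE_RANK" \
       "--master_addr=$head_ip" \
       "--master_port=${MASTER_PORT:-29500}"  # assumed port; override via MASTER_PORT
}
```

Every node runs the same command and only SKYPILOT_NODE_RANK differs between them, which is why a single YAML file can describe the whole job.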
Use these commands while the job is starting:
What you want to see:
- two SkyPilot-managed worker pods
- both pods scheduled onto GPU nodes
- logs that include --nnodes=$SKYPILOT_NUM_NODES
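One way to gather this evidence in a single pass is sketched below. check_two_node_job is a hypothetical helper, the cluster name is whatever sky status reports for your launch, and the sketch assumes the worker pods live in kubectl's current namespace.

```shell
# Hypothetical helper: collect the signals listed above for a named SkyPilot cluster.
check_two_node_job() {
  cluster="$1"
  # List pods together with the node each one was scheduled onto (NODE column).
  kubectl get pods -o wide
  # Print the logs so far without following, keeping only lines with the multi-node argument.
  sky logs --no-follow "$cluster" | grep -F -e '--nnodes'
}
```

The function returns a non-zero status when no matching log line is found, which usually means the job has not reached the launch step yet.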
Step 4: Clean up
When the run is finished, tear the cluster down so it stops consuming resources:
You can remove old local launcher artifacts too:
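Both cleanup actions can be combined as sketched below. cleanup_run is a hypothetical helper, the cluster name must be supplied by you (sky status lists it), and the sketch assumes it is run from the repo root where skypilot_jobs was created.

```shell
# Hypothetical helper: tear down a SkyPilot cluster, then delete local launcher artifacts.
cleanup_run() {
  cluster="$1"
  [ -n "$cluster" ] || { echo "usage: cleanup_run <cluster-name>" >&2; return 2; }
  # sky down asks for confirmation before deleting the cluster.
  sky down "$cluster" || return 1
  # Remove the generated per-submission configs in the current directory.
  rm -rf -- skypilot_jobs
}
```

The local artifacts are deleted only after the teardown succeeds, so a failed sky down leaves the generated configs available for debugging.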
Common first-run issues
sky check kubernetes fails
Usually this means SkyPilot cannot use your current kubeconfig context yet. Re-check the context with kubectl config current-context, then compare it with SkyPilot’s Kubernetes setup guide.
sky show-gpus --infra k8s shows no GPUs
SkyPilot can only schedule GPUs that Kubernetes exposes. Make sure the GPU device plugin or operator is installed and the GPU nodes are healthy.
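A quick way to see what Kubernetes itself exposes is to list each node's allocatable GPU count. The sketch below assumes NVIDIA GPUs advertised under the nvidia.com/gpu resource name. list_node_gpus is a hypothetical helper.

```shell
# Hypothetical helper: show how many GPUs each node advertises to Kubernetes.
list_node_gpus() {
  # Assumption: NVIDIA GPUs, exposed as the nvidia.com/gpu resource.
  # The backslash escapes the dots inside the resource name.
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

A node that shows <none> in the GPUS column is not advertising any GPUs, which points at the device plugin or operator rather than at SkyPilot.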
The job starts, but model download fails
For gated models, make sure HF_TOKEN is exported in the shell that runs automodel. The SkyPilot launcher forwards it to the remote job.
Multi-node launch stalls during rendezvous
Run the single-node example first. If that works, check that:
- your cluster has enough free GPU nodes for num_nodes
- worker pods can talk to each other over the cluster network
- the logs include the generated torchrun multi-node arguments shown above
Which file should I edit?
If you want to adapt this tutorial for your own model, the quickest path is:
- Copy examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml.
- Change the model and dataset sections.
- Keep the skypilot: block small until the first run succeeds.
That way, when something goes wrong, you only have a few knobs to inspect.