Kubernetes Quickstart
Get a model running on Kubernetes in minutes.
Dynamo’s production path is Kubernetes-native: you install the platform with Helm, submit Dynamo CRDs, and let the operator reconcile inference graphs into pods, services, routing, model-loading, and scaling resources. The local and container guides remain useful for development, but Kubernetes is the canonical path for shared GPU clusters and multi-node serving.
Deployment modes. Dynamo supports two deployment modes on Kubernetes. This quickstart uses standalone mode, where the Dynamo Frontend serves requests and the integrated Dynamo Router does KV-aware routing. Dynamo can also run in gateway mode behind a Gateway API Inference Extension gateway, where KV-aware routing happens in the Dynamo Endpoint Picker Plugin (EPP) at the gateway layer and the Frontend runs as a sidecar in --router-mode direct. See the Inference Gateway (GAIE) guide to set up gateway mode.
Prerequisites
- Kubernetes cluster (v1.24+) with GPU nodes
- kubectl (v1.24+)
- Helm (v3.0+) installed
- NVIDIA GPU Operator installed on the cluster
- HuggingFace token secret on cluster
HuggingFace token secret
Create a HuggingFace token secret for model downloads. If you don’t have a token, see the HuggingFace token guide.
GPU Operator quick install
If you don’t have the GPU Operator yet:
If your cluster already provides GPU drivers (e.g., GKE with gpu-driver-version=latest, or AKS), add:
Detailed installation
The GPU Operator is the only prerequisite for a basic deployment. For additional features like RDMA, Prometheus, or multinode scheduling with Grove/KAI Scheduler, see the Installation Guide.
If your GPU SKU and cloud provider are supported, you can use AICR for rapid installation of prerequisites and the Dynamo Helm chart.
Verify cluster is ready
Optionally, verify your cluster is ready:
Install Dynamo
Wait for the platform pods:
Understand Dynamo Deployment Resources
Before applying the first YAML, it helps to know the Kubernetes resources Dynamo uses. These are Dynamo’s native control-plane objects; you describe the inference graph, and the operator owns the Kubernetes deployments, services, and component rollout around it:
This quickstart uses DGDR because it avoids hand-writing the first DGD. After DGDR generates and applies the DGD, the DGDR reaches a terminal state, similar to a Kubernetes Job. The DGD persists and serves your model.
DGDR can also carry supported generated-deployment features such as
features.planner for Planner configuration and features.mocker for mocker
mode. KV-aware routing is not currently exposed as a DGDR feature field; use a
direct DGD, a tuned recipe, or overrides.dgd when you need to set router mode
or other graph-level details explicitly.
For tuned production-style manifests, start from Dynamo recipes. For the full deployment model, see the Deployment Overview.
Deploy Your First Model
Save this DGDR to generate and deploy a DGD for Qwen/Qwen3-0.6B:
The DGDR generates a DGD similar in shape to the following. If you already know the backend and runtime image you want, you can apply this canonical DGD object directly instead of using DGDR:
Apply exactly one of the manifests.
Option A: generate and apply a DGD with DGDR.
Option B: apply the DGD directly.
If you use DGDR, watch it progress from Pending to Profiling to Deploying
to Deployed:
In both paths, the DGD is the live serving resource:
Dynamo supports vLLM, TensorRT-LLM, and SGLang backends. Setting backend: auto lets the profiler choose the best one for your model and hardware. See the vLLM backend guide for a backend guide example.
Send a Request
Once the DGD is ready, it is serving the model:
Cleanup
Next Steps
- Installation Guide — Cloud provider setup, GPU Operator details, optional components (Grove, RDMA, model caching, Prometheus)
- Deployment Overview — DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
- DGDR Reference — Spec reference, lifecycle phases, monitoring commands, and generated DGD behavior
- Creating Deployments — Hand-craft a DGD spec for full control