Get a model running on Kubernetes in minutes.
Dynamo’s production path is Kubernetes-native: you install the platform with Helm, submit Dynamo CRDs, and let the operator reconcile inference graphs into pods, services, routing, model-loading, and scaling resources. The local and container guides remain useful for development, but Kubernetes is the canonical path for shared GPU clusters and multi-node serving.
Deployment modes. Dynamo supports two deployment modes on Kubernetes. This quickstart uses standalone mode, where the Dynamo Frontend serves requests and the integrated Dynamo Router does KV-aware routing. Dynamo can also run in gateway mode behind a Gateway API Inference Extension gateway, where KV-aware routing happens in the Dynamo Endpoint Picker Plugin (EPP) at the gateway layer and the Frontend runs as a sidecar in --router-mode direct. See the Inference Gateway (GAIE) guide to set up gateway mode.
Create a HuggingFace token secret for model downloads. If you don’t have a token, see the HuggingFace token guide.
If you don’t have the GPU Operator yet:
If your cluster already provides GPU drivers (e.g., GKE with gpu-driver-version=latest, or AKS), add:
The GPU Operator is the only prerequisite for a basic deployment. For additional features like RDMA, Prometheus, or multinode scheduling with Grove/KAI Scheduler, see the Installation Guide.
If your GPU SKU and cloud provider are supported, you can use AICR for rapid installation of prerequisites and the Dynamo Helm chart.
Optionally, verify your cluster is ready:
Wait for the platform pods:
Before applying the first YAML, it helps to know the Kubernetes resources Dynamo uses. These are Dynamo’s native control-plane objects; you describe the inference graph, and the operator owns the Kubernetes deployments, services, and component rollout around it:
This quickstart uses DGDR because it avoids hand-writing the first DGD. After DGDR generates and applies the DGD, the DGDR reaches a terminal state, similar to a Kubernetes Job. The DGD persists and serves your model.
DGDR can also carry supported generated-deployment features such as
features.planner for Planner configuration and features.mocker for mocker
mode. KV-aware routing is not currently exposed as a DGDR feature field; use a
direct DGD, a tuned recipe, or overrides.dgd when you need to set router mode
or other graph-level details explicitly.
For tuned production-style manifests, start from Dynamo recipes. For the full deployment model, see the Deployment Overview.
Save this DGDR to generate and deploy a DGD for Qwen/Qwen3-0.6B:
The DGDR generates a DGD similar in shape to the following. If you already know the backend and runtime image you want, you can apply this canonical DGD object directly instead of using DGDR:
Apply exactly one of the manifests.
Option A: generate and apply a DGD with DGDR.
Option B: apply the DGD directly.
If you use DGDR, watch it progress from Pending to Profiling to Deploying
to Deployed:
In both paths, the DGD is the live serving resource:
Dynamo supports vLLM, TensorRT-LLM, and SGLang backends. Setting backend: auto lets the profiler choose the best one for your model and hardware. See the vLLM backend guide for a backend guide example.
Once the DGD is ready, it is serving the model: