Kubernetes Quickstart | NVIDIA Dynamo Documentation

Get a model running on Kubernetes in minutes.

Dynamo’s production path is Kubernetes-native: you install the platform with Helm, submit Dynamo custom resources, and let the operator reconcile inference graphs into pods, services, routing, model-loading, and scaling resources. The local and container guides remain useful for development, but Kubernetes is the canonical path for shared GPU clusters and multi-node serving.

Request entry. This quickstart uses Dynamo-native Frontend routing: the Dynamo Frontend receives requests and the integrated Dynamo Router selects workers. Dynamo can also integrate Kubernetes-natively with Gateway API Inference Extension, where Gateway API receives requests and calls the Dynamo EPP for endpoint selection. See the GAIE guide for the Gateway API path.

Prerequisites

Helm (v3.0+) installed

CUDA

Kubernetes cluster (v1.30+) with GPU nodes
kubectl (v1.30+)
NVIDIA GPU Operator installed on the cluster

HuggingFace token secret

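Create a HuggingFace token secret for model downloads. If you don’t have a token, see the HuggingFace token guide. Kubernetes Secrets are namespaced, and a pod can only reference a Secret in its own namespace, so create the secret in the namespace where you will deploy the model (dynamo-system in this quickstart).

$ export NAMESPACE=dynamo-system
$ export HF_TOKEN=<your-hf-token>
$ kubectl create namespace "$NAMESPACE"
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN="$HF_TOKEN" \
>   --namespace "$NAMESPACE"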
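
A quick check that the secret landed where the workers will look for it can save a failed rollout later. The following is a minimal sketch, not part of Dynamo: `check_secret_key` is a hypothetical helper name, and it only assumes the standard `kubectl get secret ... -o jsonpath` behavior.

```shell
# Check that a Secret exists in a namespace and carries a non-empty key.
# Usage: check_secret_key <secret-name> <namespace> <key>
# (hypothetical helper, not shipped with Dynamo)
check_secret_key() {
  local name="$1" namespace="$2" key="$3" value
  # jsonpath prints the base64-encoded value, or nothing if the key is absent.
  value=$(kubectl get secret "$name" -n "$namespace" \
    -o "jsonpath={.data.$key}" 2>/dev/null) || {
    echo "secret $name not found in namespace $namespace" >&2
    return 1
  }
  if [ -z "$value" ]; then
    echo "secret $name has no $key entry" >&2
    return 1
  fi
  echo "secret $name in $namespace contains $key"
}
```

Call it with the namespace you created the secret in, for example `check_secret_key hf-token-secret dynamo-system HF_TOKEN`.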

Accelerator resource setup

CUDA

If you don’t have the GPU Operator yet:

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
$ helm repo update nvidia
$ helm install gpu-operator nvidia/gpu-operator \
>   --namespace gpu-operator --create-namespace \
>   --wait --timeout=600s

If your cluster already provides GPU drivers (e.g., GKE with gpu-driver-version=latest, or AKS), append these flags to the helm install command above:

--set driver.enabled=false --set toolkit.enabled=false
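
Whichever way the drivers are provided, the scheduler can only place workers on nodes that advertise the `nvidia.com/gpu` resource. As a rough sketch (the function names `list_gpu_capacity` and `has_gpu_nodes` are illustrative, not Dynamo tooling), the allocatable GPU count per node can be listed with standard `kubectl` custom columns:

```shell
# Print each node with the number of allocatable NVIDIA GPUs it advertises.
# Nodes without the resource show "<none>" in the GPUS column.
list_gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Succeed only if at least one node reports a numeric, non-zero GPU count.
has_gpu_nodes() {
  list_gpu_capacity |
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit !found }'
}
```

For example, `has_gpu_nodes && echo "GPUs are schedulable"` gives a yes/no answer, while `list_gpu_capacity` shows the per-node detail.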

Detailed installation

CUDA deployments require the GPU Operator. XPU deployments require the Intel resource drivers for Kubernetes and DRA. For additional features like RDMA, Prometheus, or multinode scheduling with Grove/KAI Scheduler, see the Installation Guide.

If your GPU SKU and cloud provider are supported, you can use AICR for rapid installation of prerequisites and the Dynamo Helm chart.

Verify cluster is ready

Optionally, run the pre-deployment check script from the root of a Dynamo repository checkout to verify that your cluster is ready:

$ ./deploy/pre-deployment/pre-deployment-check.sh

Install Dynamo

$ export NAMESPACE=dynamo-system
$ helm install dynamo-platform \
>   oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform \
>   --version "1.2.1" \
>   --namespace "$NAMESPACE" \
>   --create-namespace

Wait for the platform pods:

$ kubectl get pods -n $NAMESPACE
$ # Expected: dynamo-operator-*, etcd-*, nats-* pods all Running
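
`kubectl get pods` reports a snapshot rather than waiting. If you prefer a command that blocks until the pods are ready, a sketch along these lines uses the standard `kubectl wait` subcommand; `wait_for_pods` is a hypothetical helper name and the 300s default timeout is an arbitrary choice.

```shell
# Block until every pod in the namespace is Ready, or fail after the timeout.
# Usage: wait_for_pods <namespace> [timeout]
# (hypothetical helper; the default timeout is an assumption)
wait_for_pods() {
  local namespace="$1" timeout="${2:-300s}"
  if kubectl wait --for=condition=Ready pod --all \
      -n "$namespace" --timeout="$timeout"; then
    echo "all pods in $namespace are Ready"
  else
    echo "pods in $namespace not Ready within $timeout" >&2
    # Show the current pod states to help with debugging.
    kubectl get pods -n "$namespace"
    return 1
  fi
}
```

Run it as `wait_for_pods "$NAMESPACE"`. Note that `--all` also matches pods that have already completed, which never become Ready, so this sketch fits best right after a fresh install.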

Understand Dynamo Deployment Resources

Before applying the first YAML, it helps to know the Kubernetes resources Dynamo uses. These are Dynamo’s native control-plane objects; you describe the inference graph, and the operator owns the Kubernetes deployments, services, and component rollout around it:

| Resource or path | What it does | In this quickstart |
| --- | --- | --- |
| `DynamoGraphDeployment` (DGD) | The canonical live deployment. It describes the Dynamo inference graph that serves traffic. | Generated by DGDR in Option A, or applied directly in Option B. |
| `DynamoComponentDeployment` (DCD) | Per-component deployments created by the operator from the DGD, such as frontend and worker components. | Created for you by the operator. |
| `DynamoGraphDeploymentRequest` (DGDR) | A generator/profiler that can produce a DGD from a model, backend, workload, hardware, and optional SLA targets. | Option A uses DGDR so Dynamo can generate the first DGD. |
| Recipes | Tuned `deploy.yaml` manifests that are already DGD specs. | Use these later when a recipe matches your model, backend, and hardware. |

This quickstart uses DGDR because it avoids hand-writing the first DGD. After DGDR generates and applies the DGD, the DGDR reaches a terminal state, similar to a Kubernetes Job. The DGD persists and serves your model.

DGDR can also carry supported generated-deployment features such as features.planner for Planner configuration and features.mocker for mocker mode. KV-aware routing is not currently exposed as a DGDR feature field; use a direct DGD, a tuned recipe, or overrides.dgd when you need to set router mode or other graph-level details explicitly.

For tuned production-style manifests, start from Dynamo recipes. For the full deployment model, see the Deployment Overview.

Deploy Your First Model

CUDA

Save this DGDR to generate and deploy a DGD for Qwen/Qwen3-0.6B:

# qwen3-quickstart.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-quickstart
spec:
  model: Qwen/Qwen3-0.6B
  backend: auto
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.1"  # dynamo-frontend for Dynamo < 1.1.0

The DGDR generates a DGD similar in shape to the following. If you already know the backend and runtime image you want, you can apply this canonical DGD object directly instead of using DGDR:

# qwen3-dgd.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-direct
spec:
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorker
      type: worker
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.1
              command:
                - python3
                - -m
                - dynamo.vllm
              args:
                - --model
                - Qwen/Qwen3-0.6B
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"
                requests:
                  ephemeral-storage: 2Gi
              workingDir: /workspace/examples/backends/vllm
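
Either manifest can be checked against the installed CRDs before it is applied. The sketch below relies on the standard `kubectl apply --dry-run=server` flag, which asks the API server to validate the object without persisting it; `validate_manifest` is a hypothetical helper name, and the check assumes the Dynamo platform (and therefore its CRDs) is already installed.

```shell
# Ask the API server to validate a manifest without persisting it.
# Usage: validate_manifest <file> <namespace>
# (hypothetical helper; requires the Dynamo CRDs to be installed)
validate_manifest() {
  local file="$1" namespace="$2"
  if [ ! -f "$file" ]; then
    echo "manifest not found: $file" >&2
    return 1
  fi
  kubectl apply --dry-run=server -f "$file" -n "$namespace"
}
```

For example, `validate_manifest qwen3-dgd.yaml "$NAMESPACE"` surfaces schema or indentation mistakes before any pods are created.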

Apply exactly one of the manifests.

Option A: generate and apply a DGD with DGDR.

$ kubectl apply -f qwen3-quickstart.yaml -n $NAMESPACE

Option B: apply the DGD directly.

$ kubectl apply -f qwen3-dgd.yaml -n $NAMESPACE

If you use DGDR, watch it progress from Pending to Profiling to Deploying to Deployed:

$ kubectl get dgdr qwen3-quickstart -n $NAMESPACE -w
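
In a script, an interactive watch is less convenient than a loop that returns once the request is done. The following sketch polls the DGDR until it reports Deployed. It rests on an assumption that is not confirmed by this page: that the phase is exposed at `.status.phase`. Check the DGDR Reference for the actual status field before relying on it; `wait_for_dgdr` is a hypothetical helper name.

```shell
# Poll a DGDR until it reports the Deployed phase or the attempts run out.
# ASSUMPTION: the phase is exposed at .status.phase; adjust the jsonpath if
# your Dynamo version reports it elsewhere.
# Usage: wait_for_dgdr <name> <namespace> [attempts] [interval-seconds]
wait_for_dgdr() {
  local name="$1" namespace="$2" attempts="${3:-60}" interval="${4:-10}"
  local phase i=1
  while [ "$i" -le "$attempts" ]; do
    phase=$(kubectl get dgdr "$name" -n "$namespace" \
      -o 'jsonpath={.status.phase}' 2>/dev/null)
    echo "attempt $i: phase=${phase:-unknown}"
    if [ "$phase" = "Deployed" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for DGDR $name" >&2
  return 1
}
```

With the defaults, `wait_for_dgdr qwen3-quickstart "$NAMESPACE"` checks every 10 seconds for up to 10 minutes.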

In both paths, the DGD is the live serving resource:

$ kubectl get dynamographdeployment -n $NAMESPACE
$ kubectl get dynamocomponentdeployment -n $NAMESPACE

Dynamo supports vLLM, TensorRT-LLM, and SGLang backends. Setting backend: auto lets the profiler choose the best one for your model and hardware. See the vLLM backend guide for a backend-specific example.

Send a Request

Once the DGD is ready, it is serving the model. Port-forward the frontend service and send a chat completion request:

$ # Find and port-forward the frontend
$ FRONTEND_SVC=$(kubectl get svc -n $NAMESPACE -o name | grep frontend | head -1)
$ kubectl port-forward "$FRONTEND_SVC" 8000:8000 -n $NAMESPACE &
$ 
$ # Send a request
$ curl -s http://localhost:8000/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>     "model": "Qwen/Qwen3-0.6B",
>     "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
>     "max_tokens": 200
>   }' | python3 -m json.tool
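
To print only the model's answer instead of the full JSON document, the response can be piped through a small filter. This sketch assumes the Frontend returns the OpenAI-style chat completion shape (`choices[0].message.content`), which the `/v1/chat/completions` path suggests but this page does not spell out; `extract_reply` is a hypothetical helper name.

```shell
# Read an OpenAI-style chat completion from stdin and print only the
# assistant's reply text.
# ASSUMPTION: the response has the shape choices[0].message.content.
extract_reply() {
  python3 -c '
import json, sys
response = json.load(sys.stdin)
print(response["choices"][0]["message"]["content"])
'
}
```

Use it in place of `python3 -m json.tool` at the end of the curl pipeline, for example `curl -s ... | extract_reply`.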

Cleanup

$ kubectl delete dgdr qwen3-quickstart -n $NAMESPACE --ignore-not-found
$ kubectl delete dynamographdeployment qwen3-quickstart qwen3-direct \
>   -n $NAMESPACE --ignore-not-found

Next Steps

Installation Guide — Cloud provider setup, accelerator resource components, and optional components (Grove, RDMA, model caching, Prometheus)
Deployment Overview — DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
DGDR Reference — Spec reference, lifecycle phases, monitoring commands, and generated DGD behavior
Creating Deployments — Hand-craft a DGD spec for full control