Copyright © 2026, NVIDIA Corporation.


Kubernetes Quickstart


Get a model running on Kubernetes in minutes.

Dynamo’s production path is Kubernetes-native: you install the platform with Helm, submit Dynamo CRDs, and let the operator reconcile inference graphs into pods, services, routing, model-loading, and scaling resources. The local and container guides remain useful for development, but Kubernetes is the canonical path for shared GPU clusters and multi-node serving.

Deployment modes. Dynamo supports two deployment modes on Kubernetes. This quickstart uses standalone mode, where the Dynamo Frontend serves requests and the integrated Dynamo Router does KV-aware routing. Dynamo can also run in gateway mode behind a Gateway API Inference Extension gateway, where KV-aware routing happens in the Dynamo Endpoint Picker Plugin (EPP) at the gateway layer and the Frontend runs as a sidecar in --router-mode direct. See the Inference Gateway (GAIE) guide to set up gateway mode.

Prerequisites

  • Kubernetes cluster (v1.24+) with GPU nodes
  • kubectl (v1.24+)
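  • Helm (v3.0+) installed; the oci:// chart reference used in the install step relies on Helm's OCI registry support, which is enabled by default from Helm 3.8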
  • NVIDIA GPU Operator installed on the cluster
  • HuggingFace token secret on cluster

HuggingFace token secret

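Create a HuggingFace token secret for model downloads. If you don’t have a token, see the HuggingFace token guide. Pods can only reference secrets in their own namespace, so if your current kubectl context does not point at the namespace you will deploy to, add -n with that namespace.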

    export HF_TOKEN=<your-hf-token>

    kubectl create secret generic hf-token-secret \
      --from-literal=HF_TOKEN="$HF_TOKEN"
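
Before moving on, it can help to confirm that the secret landed where the model pods will look for it. The following is a minimal sketch, not part of Dynamo: check_hf_secret is a hypothetical helper name, and it assumes the secret name and key used above.

```shell
# Sketch: confirm the secret exists and carries a non-empty HF_TOKEN key.
# check_hf_secret is a hypothetical helper name, not a Dynamo command.
check_hf_secret() {
  ns="$1"   # namespace to look in; pass the one you will deploy to
  # Secret values are stored base64-encoded under .data
  encoded=$(kubectl get secret hf-token-secret -n "$ns" \
    -o jsonpath='{.data.HF_TOKEN}' 2>/dev/null) || encoded=""
  if [ -n "$encoded" ]; then
    echo "hf-token-secret in namespace $ns has an HF_TOKEN value"
  else
    echo "hf-token-secret is missing or empty in namespace $ns" >&2
    return 1
  fi
}
```

Call it with the target namespace, for example check_hf_secret dynamo-system. It only checks that a value is present; it does not validate the token against HuggingFace.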

GPU Operator quick install

If you don’t have the GPU Operator yet:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
    helm repo update nvidia
    helm install gpu-operator nvidia/gpu-operator \
      --namespace gpu-operator --create-namespace \
      --wait --timeout=600s

If your cluster already provides GPU drivers (e.g., GKE with gpu-driver-version=latest, or AKS), append these flags to the helm install command above:

    --set driver.enabled=false --set toolkit.enabled=false
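
Once the GPU Operator (or the provider's driver stack) is running, GPU nodes should advertise the nvidia.com/gpu resource. As a rough sanity check, the sketch below sums the allocatable GPU count across nodes. count_cluster_gpus is a hypothetical helper name; the approach assumes the standard nvidia.com/gpu resource name, which is the same one the worker requests later in this guide.

```shell
# Sketch: sum the allocatable nvidia.com/gpu count across all nodes.
# count_cluster_gpus is a hypothetical helper name.
count_cluster_gpus() {
  # custom-columns prints "<none>" for nodes without the GPU resource,
  # so only purely numeric values are added up.
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}
```

A result of 0 suggests the device plugin is not ready yet or the cluster has no GPU nodes; the quickstart model needs at least one GPU.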

Detailed installation

The GPU Operator is the only prerequisite for a basic deployment. For additional features like RDMA, Prometheus, or multinode scheduling with Grove/KAI Scheduler, see the Installation Guide.

If your GPU SKU and cloud provider are supported, you can use AICR for rapid installation of prerequisites and the Dynamo Helm chart.

Verify cluster is ready

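Optionally, verify your cluster is ready. The script path below is relative, so run it from a checkout that contains the deploy/pre-deployment directory:

    ./deploy/pre-deployment/pre-deployment-check.sh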

Install Dynamo

    export NAMESPACE=dynamo-system
    helm install dynamo-platform \
      oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform \
      --version "1.0.2" \
      --namespace "$NAMESPACE" \
      --create-namespace

Wait for the platform pods:

    kubectl get pods -n $NAMESPACE
    # Expected: dynamo-operator-*, etcd-*, nats-* pods all Running
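
Instead of re-running kubectl get pods by hand, you can block until the pods report Ready. This is a sketch using kubectl wait; wait_for_platform_pods is a hypothetical helper name and the 300s default timeout is an arbitrary choice, not a documented recommendation.

```shell
# Sketch: block until every pod in the namespace reports Ready.
# wait_for_platform_pods is a hypothetical helper name.
wait_for_platform_pods() {
  ns="$1"               # namespace the platform was installed into
  timeout="${2:-300s}"  # arbitrary default; pass a second argument to change it
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

Run it as wait_for_platform_pods "$NAMESPACE". One caveat: if the namespace contains pods that have already run to completion, they never become Ready and the wait will time out, so fall back to kubectl get pods in that case.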

Understand Dynamo Deployment Resources

Before applying the first YAML, it helps to know the Kubernetes resources Dynamo uses. These are Dynamo’s native control-plane objects; you describe the inference graph, and the operator owns the Kubernetes deployments, services, and component rollout around it:

| Resource or path | What it does | In this quickstart |
| --- | --- | --- |
| DynamoGraphDeployment (DGD) | The canonical live deployment. It describes the Dynamo inference graph that serves traffic. | Generated by DGDR in Option A, or applied directly in Option B. |
| DynamoComponentDeployment (DCD) | Per-component deployments created by the operator from the DGD, such as frontend and worker components. | Created for you by the operator. |
| DynamoGraphDeploymentRequest (DGDR) | A generator/profiler that can produce a DGD from a model, backend, workload, hardware, and optional SLA targets. | Option A uses DGDR so Dynamo can generate the first DGD. |
| Recipes | Tuned deploy.yaml manifests that are already DGD specs. | Use these later when a recipe matches your model, backend, and hardware. |

This quickstart uses DGDR because it avoids hand-writing the first DGD. After DGDR generates and applies the DGD, the DGDR reaches a terminal state, similar to a Kubernetes Job. The DGD persists and serves your model.

DGDR can also carry supported generated-deployment features such as features.planner for Planner configuration and features.mocker for mocker mode. KV-aware routing is not currently exposed as a DGDR feature field; use a direct DGD, a tuned recipe, or overrides.dgd when you need to set router mode or other graph-level details explicitly.

For tuned production-style manifests, start from Dynamo recipes. For the full deployment model, see the Deployment Overview.

Deploy Your First Model

Save this DGDR to generate and deploy a DGD for Qwen/Qwen3-0.6B:

    # qwen3-quickstart.yaml
    apiVersion: nvidia.com/v1beta1
    kind: DynamoGraphDeploymentRequest
    metadata:
      name: qwen3-quickstart
    spec:
      model: Qwen/Qwen3-0.6B
      backend: auto
      image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1" # dynamo-frontend for Dynamo < 1.1.0

The DGDR generates a DGD similar in shape to the following. If you already know the backend and runtime image you want, you can apply this canonical DGD object directly instead of using DGDR:

    # qwen3-dgd.yaml
    apiVersion: nvidia.com/v1beta1
    kind: DynamoGraphDeployment
    metadata:
      name: qwen3-direct
    spec:
      components:
        - name: Frontend
          type: frontend
          replicas: 1
          podTemplate:
            spec:
              containers:
                - name: main
                  image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
                  envFrom:
                    - secretRef:
                        name: hf-token-secret
        - name: VllmDecodeWorker
          type: worker
          replicas: 1
          podTemplate:
            spec:
              containers:
                - name: main
                  image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
                  command:
                    - python3
                    - -m
                    - dynamo.vllm
                  args:
                    - --model
                    - Qwen/Qwen3-0.6B
                  envFrom:
                    - secretRef:
                        name: hf-token-secret
                  resources:
                    limits:
                      nvidia.com/gpu: "1"
                    requests:
                      ephemeral-storage: 2Gi
                  workingDir: /workspace/examples/backends/vllm

Apply exactly one of the manifests.

Option A: generate and apply a DGD with DGDR.

    kubectl apply -f qwen3-quickstart.yaml -n $NAMESPACE

Option B: apply the DGD directly.

    kubectl apply -f qwen3-dgd.yaml -n $NAMESPACE
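
If you want to catch indentation or schema mistakes before creating anything, a server-side dry run can help. The sketch below wraps kubectl apply --dry-run=server; validate_manifest is a hypothetical helper name. It assumes the Dynamo CRDs are already installed and that any admission webhooks in the cluster support dry-run requests.

```shell
# Sketch: ask the API server to validate a manifest without persisting it.
# validate_manifest is a hypothetical helper name.
validate_manifest() {
  file="$1"   # path to the manifest, e.g. qwen3-dgd.yaml
  ns="$2"     # namespace the manifest would be applied to
  kubectl apply --dry-run=server -f "$file" -n "$ns"
}
```

For example, validate_manifest qwen3-dgd.yaml "$NAMESPACE" reports validation errors from the API server and leaves the cluster unchanged.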

If you use DGDR, watch it progress from Pending to Profiling to Deploying to Deployed:

    kubectl get dgdr qwen3-quickstart -n $NAMESPACE -w
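
For scripts or CI, a polling loop may be more convenient than an interactive watch. The sketch below is an illustration only: wait_for_dgdr is a hypothetical helper name, and it ASSUMES the phase is exposed at .status.phase, which this page does not confirm. Check the DGDR Reference for the actual field before relying on it. The loop only recognizes the Deployed phase named above and otherwise gives up after a fixed number of attempts.

```shell
# Sketch: poll a DGDR until it reports the Deployed phase.
# wait_for_dgdr is a hypothetical helper name.
# ASSUMPTION: the phase lives at .status.phase; verify in the DGDR Reference.
wait_for_dgdr() {
  name="$1"; ns="$2"
  tries="${3:-60}"      # number of polls before giving up (arbitrary default)
  interval="${4:-10}"   # seconds between polls (arbitrary default)
  i=0
  while [ "$i" -lt "$tries" ]; do
    phase=$(kubectl get dgdr "$name" -n "$ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    echo "phase: ${phase:-unknown}"
    if [ "$phase" = "Deployed" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $name to reach Deployed" >&2
  return 1
}
```

Typical use would be wait_for_dgdr qwen3-quickstart "$NAMESPACE". Because failure phases are not listed on this page, the sketch does not try to detect them and simply times out.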

In both paths, the DGD is the live serving resource:

    kubectl get dynamographdeployment -n $NAMESPACE
    kubectl get dynamocomponentdeployment -n $NAMESPACE

Dynamo supports vLLM, TensorRT-LLM, and SGLang backends. Setting backend: auto lets the profiler choose a backend for your model and hardware. For an example of a backend-specific guide, see the vLLM backend guide.

Send a Request

Once the DGD is ready, the model is being served. Port-forward the frontend service and send a chat completion request:

    # Find and port-forward the frontend
    FRONTEND_SVC=$(kubectl get svc -n $NAMESPACE -o name | grep frontend | head -1)
    kubectl port-forward "$FRONTEND_SVC" 8000:8000 -n $NAMESPACE &

    # Send a request
    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
        "max_tokens": 200
      }' | python3 -m json.tool
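
When you only care about the generated text, you can pull it out of the JSON instead of pretty-printing the whole response. The sketch below assumes the response follows the usual chat completions shape (choices[0].message.content), which the /v1/chat/completions path suggests but this page does not spell out. extract_reply and ask_dynamo are hypothetical helper names, and the port-forward above is assumed to be running.

```shell
# Sketch: print only the assistant's reply from a chat completions response.
# extract_reply and ask_dynamo are hypothetical helper names.
# ASSUMPTION: the response has the shape choices[0].message.content.
extract_reply() {
  python3 -c 'import json, sys
print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

ask_dynamo() {
  prompt="$1"
  # Build the request body with json.dumps so quotes in the prompt are escaped
  body=$(python3 -c 'import json, sys
print(json.dumps({
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": sys.argv[1]}],
    "max_tokens": 200,
}))' "$prompt")
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$body" | extract_reply
}
```

For example, ask_dynamo "What is NVIDIA Dynamo?" prints just the reply text. If the server returns an error object instead of a completion, the Python step fails with a KeyError, which is a reasonable signal to rerun the full curl command and inspect the raw response.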

Cleanup

    kubectl delete dgdr qwen3-quickstart -n $NAMESPACE --ignore-not-found
    kubectl delete dynamographdeployment qwen3-quickstart qwen3-direct \
      -n $NAMESPACE --ignore-not-found
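
The commands above remove only the model deployment. If you also want to remove what this quickstart installed around it, a sketch like the following may be a starting point. remove_platform is a hypothetical helper name; it assumes the release name dynamo-platform and the secret name hf-token-secret used earlier, and that the secret lives in the same namespace as the platform.

```shell
# Sketch: remove the HuggingFace secret and the Dynamo platform release.
# remove_platform is a hypothetical helper name.
remove_platform() {
  ns="$1"   # namespace used for the install, e.g. dynamo-system
  # Delete the token secret; harmless if it was created elsewhere
  kubectl delete secret hf-token-secret -n "$ns" --ignore-not-found
  # Uninstall the Helm release created in the Install Dynamo step
  helm uninstall dynamo-platform -n "$ns"
}
```

Run remove_platform "$NAMESPACE" after the model deployments are gone. Helm does not delete CRDs that a chart ships in its crds/ directory, so Dynamo CRDs may remain afterwards; the namespace itself and the GPU Operator are also left in place.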

Next Steps

  • Installation Guide — Cloud provider setup, GPU Operator details, optional components (Grove, RDMA, model caching, Prometheus)
  • Deployment Overview — DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
  • DGDR Reference — Spec reference, lifecycle phases, monitoring commands, and generated DGD behavior
  • Creating Deployments — Hand-craft a DGD spec for full control