For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
      • Detailed Installation Guide
      • Deploying Your First Model
      • Dynamo Operator
      • Service Discovery
      • Webhooks
      • Minikube Setup
      • Managing Models with DynamoModel
      • Autoscaling
      • Rolling Update
      • Inference Gateway (GAIE)
      • Snapshot
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Important Terminology
  • Prerequisites
  • Pre-deployment Checks
  • 1. Install Platform First
  • 2. Choose Your Backend
  • 3. Deploy Your First Model
  • Understanding Dynamo’s Custom Resources
  • DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
  • DynamoGraphDeployment (DGD) - Direct Configuration
  • 📖 API Reference & Documentation
  • Choosing Your Architecture Pattern
  • Frontend and Worker Components
  • Customizing Your Deployment
  • Additional Resources
Kubernetes Deployment

Deployment Guide

||View as Markdown|
Edit this page
Previous

Examples

Next

Detailed Installation Guide

High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

Important Terminology

Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.

  • Used for: Resource isolation, RBAC, organizing deployments
  • Example: dynamo-system, team-a-namespace

Dynamo Namespace: The logical namespace used by Dynamo components for service discovery.

  • Used for: Runtime component communication, service discovery
  • Specified in: .spec.services.<ServiceName>.dynamoNamespace field
  • Example: my-llm, production-model, dynamo-dev

These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.

Prerequisites

Before you begin, ensure you have the following tools installed:

ToolMinimum VersionInstallation Guide
kubectlv1.24+Install kubectl
Helmv3.0+Install Helm

Verify your installation:

$kubectl version --client # Should show v1.24+
$helm version # Should show v3.0+

For detailed installation instructions, see the Prerequisites section in the Installation Guide.

Pre-deployment Checks

Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:

$./deploy/pre-deployment/pre-deployment-check.sh

This validates kubectl connectivity, StorageClass configuration, and GPU availability. See pre-deployment checks for more details.

1. Install Platform First

$# 1. Set environment
$export NAMESPACE=dynamo-system
$export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
$
$# 2. Install Platform (CRDs are automatically installed by the chart)
$helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
$helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

v0.9.0 Helm Chart Issue: The initial v0.9.0 dynamo-platform Helm chart sets the operator image to v0.7.1 instead of v0.9.0. Use RELEASE_VERSION=0.9.0-post1 or add --set dynamo-operator.controllerManager.manager.image.tag=0.9.0 to your helm install command.

For Shared/Multi-Tenant Clusters:

If your cluster has namespace-restricted Dynamo operators, add this flag to step 2:

$--set dynamo-operator.namespaceRestriction.enabled=true

For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.

2. Choose Your Backend

Each backend has deployment examples and configuration options:

BackendAggregatedAggregated + RouterDisaggregatedDisaggregated + RouterDisaggregated + PlannerDisaggregated Multi-node
SGLang✅✅✅✅✅✅
TensorRT-LLM✅✅✅✅🚧✅
vLLM✅✅✅✅✅✅

3. Deploy Your First Model

Follow the Deploying Your First Model guide for a complete end-to-end walkthrough using DynamoGraphDeploymentRequest (DGDR) — Dynamo’s recommended path that handles profiling and configuration automatically.

The tutorial deploys Qwen/Qwen3-0.6B with vLLM and walks you through every step: creating the DGDR, watching the profiling lifecycle, and sending your first inference request.

For SLA-based autoscaling, see SLA Planner Guide.

Understanding Dynamo’s Custom Resources

Dynamo provides two main Kubernetes Custom Resources for deploying models:

DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration

The recommended approach for generating optimal configurations. DGDR provides a high-level interface where you specify:

  • Model name and backend framework
  • SLA targets (latency requirements)
  • GPU type (optional)

Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:

  • SLA-driven configuration generation
  • Automated resource optimization
  • Users who want simplicity over control

Note: DGDR generates a DGD spec which you can then use to deploy.

DynamoGraphDeployment (DGD) - Direct Configuration

A lower-level interface that defines your complete inference pipeline:

  • Model configuration
  • Resource allocation (GPUs, memory)
  • Scaling policies
  • Frontend/backend connections

Use this when you need fine-grained control or have already completed profiling.

Refer to the API Reference and Documentation for more details.

📖 API Reference & Documentation

For detailed technical specifications of Dynamo’s Kubernetes resources:

  • API Reference - Complete CRD field specifications for all Dynamo resources
  • Create Deployment - Step-by-step deployment creation with DynamoGraphDeployment
  • Operator Guide - Dynamo operator configuration and management

Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

  • Development / Testing - Use agg.yaml as the base configuration
  • Production with Load Balancing - Use agg_router.yaml to enable scalable, load-balanced inference
  • High Performance / Disaggregated - Use disagg_router.yaml for maximum throughput and modular scalability

Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

  • Provides OpenAI-compatible /v1/chat/completions endpoint
  • Auto-discovers backend workers via service discovery (Kubernetes-native by default)
  • Routes requests and handles load balancing
  • Validates and preprocesses requests

Customizing Your Deployment

Example structure:

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoGraphDeployment
3metadata:
4 name: my-llm
5spec:
6 services:
7 Frontend:
8 dynamoNamespace: my-llm
9 componentType: frontend
10 replicas: 1
11 extraPodSpec:
12 mainContainer:
13 image: your-image
14 VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
15 dynamoNamespace: dynamo-dev
16 componentType: worker
17 replicas: 1
18 envFromSecret: hf-token-secret # for HuggingFace models
19 resources:
20 limits:
21 gpu: "1"
22 extraPodSpec:
23 mainContainer:
24 image: your-image
25 command: ["/bin/sh", "-c"]
26 args:
27 - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

Worker command examples per backend:

1# vLLM worker
2args:
3 - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
4
5# SGLang worker
6args:
7 - >-
8 python3 -m dynamo.sglang
9 --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
10 --tp 1
11 --trust-remote-code
12
13# TensorRT-LLM worker
14args:
15 - python3 -m dynamo.trtllm
16 --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
17 --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
18 --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml

Key customization points include:

  • Model Configuration: Specify model in the args command
  • Resource Allocation: Configure GPU requirements under resources.limits
  • Scaling: Set replicas for number of worker instances
  • Routing Mode: Enable KV-cache routing by setting DYN_ROUTER_MODE=kv in Frontend envs
  • Worker Specialization: Add --disaggregation-mode prefill flag for disaggregated prefill workers

Additional Resources

  • Examples - Complete working examples
  • Create Custom Deployments - Build your own CRDs
  • Managing Models with DynamoModel - Deploy LoRA adapters and manage models
  • Operator Documentation - How the platform works
  • Service Discovery - Discovery backends and configuration
  • Helm Charts - For advanced users
  • Snapshot - Fast pod startup with checkpoint/restore
  • GitOps Deployment with FluxCD - For advanced users
  • Logging - For logging setup
  • Multinode Deployment - For multinode deployment
  • Grove - For grove details and custom installation
  • Monitoring - For monitoring setup
  • Model Caching with Fluid - For model caching with Fluid