Deployment Guide
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
Important Terminology
Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.
- Used for: Resource isolation, RBAC, organizing deployments
- Example: `dynamo-system`, `team-a-namespace`
Dynamo Namespace: The logical namespace used by Dynamo components for service discovery.
- Used for: Runtime component communication, service discovery
- Specified in: the `.spec.services.<ServiceName>.dynamoNamespace` field
- Example: `my-llm`, `production-model`, `dynamo-dev`
These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
Prerequisites
Before you begin, ensure you have the following tools installed:
Verify your installation:
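As a sketch, assuming `kubectl` and Helm are among the required tools (the Installation Guide has the authoritative list):

```shell
# Check client-side tool versions
kubectl version --client
helm version

# Confirm the cluster is reachable
kubectl cluster-info
```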
For detailed installation instructions, see the Prerequisites section in the Installation Guide.
Pre-deployment Checks
Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:
This validates kubectl connectivity, StorageClass configuration, and GPU availability. See pre-deployment checks for more details.
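If you prefer to spot-check manually, the same properties can be inspected with stock kubectl commands (a sketch; the project's pre-deployment check script is the supported path):

```shell
kubectl cluster-info                              # kubectl connectivity
kubectl get storageclass                          # a default StorageClass should exist
kubectl describe nodes | grep -i nvidia.com/gpu   # nodes should advertise GPU capacity
```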
1. Install Platform First
v0.9.0 Helm Chart Issue: The initial v0.9.0 dynamo-platform Helm chart sets the operator image to v0.7.1 instead of v0.9.0. Use `RELEASE_VERSION=0.9.0-post1` or add `--set dynamo-operator.controllerManager.manager.image.tag=0.9.0` to your helm install command.
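For example, the tag override can be appended to a standard install command (the release name, chart reference, and namespace below are illustrative; take the exact command from the Installation Guide):

```shell
helm install dynamo-platform <chart-reference> \
  --namespace dynamo-system --create-namespace \
  --set dynamo-operator.controllerManager.manager.image.tag=0.9.0
```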
For Shared/Multi-Tenant Clusters:
DEPRECATED: Namespace-restricted mode (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. Use cluster-wide mode (the default) instead.
For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.
2. Choose Your Backend
Each backend has deployment examples and configuration options:
3. Deploy Your First Model
Follow the Deploying Your First Model guide for a complete end-to-end
walkthrough using DynamoGraphDeploymentRequest (DGDR) — Dynamo’s recommended path that
handles profiling and configuration automatically.
The tutorial deploys Qwen/Qwen3-0.6B with vLLM and walks you through every step: creating
the DGDR, watching the profiling lifecycle, and sending your first inference request.
For SLA-based autoscaling, see SLA Planner Guide.
Understanding Dynamo’s Custom Resources
Dynamo provides two main Kubernetes Custom Resources for deploying models:
DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration
The recommended approach for generating optimal configurations. DGDR provides a high-level interface where you specify:
- Model name and backend framework
- SLA targets (latency requirements)
- GPU type (optional)
Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
- SLA-driven configuration generation
- Automated resource optimization
- Users who want simplicity over control
Note: DGDR generates a DGD spec which you can then use to deploy.
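A minimal DGDR might look like the sketch below. The field names are illustrative assumptions, not the authoritative schema; consult the API Reference for the real CRD fields:

```yaml
apiVersion: nvidia.com/v1alpha1        # assumed API group/version
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-llm-request
spec:
  model: Qwen/Qwen3-0.6B               # model name
  framework: vllm                      # backend framework
  sla:                                 # latency targets (illustrative names)
    ttft: 200ms
    itl: 20ms
  gpuType: nvidia-a100                 # optional GPU type hint
```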
DynamoGraphDeployment (DGD) - Direct Configuration
A lower-level interface that defines your complete inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
Use this when you need fine-grained control or have already completed profiling.
Refer to the API Reference and Documentation for more details.
📖 API Reference & Documentation
For detailed technical specifications of Dynamo’s Kubernetes resources:
- API Reference - Complete CRD field specifications for all Dynamo resources
- Create Deployment - Step-by-step deployment creation with DynamoGraphDeployment
- Operator Guide - Dynamo operator configuration and management
Choosing Your Architecture Pattern
When creating a deployment, select the architecture pattern that best fits your use case:
- Development / Testing - Use `agg.yaml` as the base configuration
- Production with Load Balancing - Use `agg_router.yaml` to enable scalable, load-balanced inference
- High Performance / Disaggregated - Use `disagg_router.yaml` for maximum throughput and modular scalability
Frontend and Worker Components
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
- Provides an OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via service discovery (Kubernetes-native by default)
- Routes requests and handles load balancing
- Validates and preprocesses requests
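For example, once the Frontend is reachable (the service name and port 8000 below are assumptions; adjust to your deployment), a request follows the standard OpenAI chat schema:

```shell
kubectl port-forward svc/<frontend-service> 8000:8000 &
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}]}'
```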
Customizing Your Deployment
Example structure:
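As a sketch (only `dynamoNamespace`, `replicas`, and `resources.limits` are taken from this guide; the remaining field names and the worker name are illustrative, see the API Reference for the real schema):

```yaml
apiVersion: nvidia.com/v1alpha1        # assumed API group/version
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      replicas: 1
    Worker:                            # illustrative worker service name
      dynamoNamespace: my-llm
      replicas: 2
      resources:
        limits:
          gpu: "1"                     # GPU requirement (field layout assumed)
```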
Worker command examples per backend:
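As one hedged illustration for vLLM (the module entrypoint is an assumption; check the backend's own guide for the exact command):

```yaml
args:
  - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B"   # assumed vLLM worker entrypoint
```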
Key customization points include:
- Model Configuration: Specify the model in the `args` command
- Resource Allocation: Configure GPU requirements under `resources.limits`
- Scaling: Set `replicas` for the number of worker instances
- Routing Mode: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend envs
- Worker Specialization: Add the `--disaggregation-mode prefill` flag for disaggregated prefill workers
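The routing and specialization knobs above could be combined roughly as follows (the `envs` layout and the `PrefillWorker` name are assumptions; only `DYN_ROUTER_MODE=kv` and `--disaggregation-mode prefill` come from this guide):

```yaml
spec:
  services:
    Frontend:
      envs:
        - name: DYN_ROUTER_MODE
          value: "kv"                  # enable KV-cache routing
    PrefillWorker:
      replicas: 1
      # worker args additionally include: --disaggregation-mode prefill
```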
Additional Resources
- Examples - Complete working examples
- Create Custom Deployments - Build your own CRDs
- Managing Models with DynamoModel - Deploy LoRA adapters and manage models
- Operator Documentation - How the platform works
- Service Discovery - Discovery backends and configuration
- Helm Charts - For advanced users
- Snapshot - Fast pod startup with checkpoint/restore
- GitOps Deployment with FluxCD - For advanced users
- Logging - For logging setup
- Multinode Deployment - For multinode deployment
- Topology Aware Scheduling - Configure topology-aware workload placement
- Grove - For grove details and custom installation
- Monitoring - For monitoring setup
- Model Caching with Fluid - For model caching with Fluid