Deployment Guide

High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

Important Terminology

Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.

Used for: Resource isolation, RBAC, organizing deployments
Example: dynamo-system, team-a-namespace

Dynamo Namespace: The logical namespace used by Dynamo components for service discovery.

Used for: Runtime component communication, service discovery
Specified in: .spec.services.<ServiceName>.dynamoNamespace field
Example: my-llm, production-model, dynamo-dev

These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.

Prerequisites

Before you begin, ensure you have the following tools installed:

Tool	Minimum Version	Installation Guide
kubectl	v1.24+	Install kubectl
Helm	v3.0+	Install Helm

Verify your installation:

$ kubectl version --client  # Should show v1.24+
$ helm version              # Should show v3.0+

For detailed installation instructions, see the Prerequisites section in the Installation Guide.

Pre-deployment Checks

Before deploying the platform, run the pre-deployment checks to ensure the cluster is ready:

$ ./deploy/pre-deployment/pre-deployment-check.sh

This validates kubectl connectivity, StorageClass configuration, and GPU availability. See pre-deployment checks for more details.

1. Install Platform First

$ # 1. Set environment
$ export NAMESPACE=dynamo-system
$ export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
$ 
$ # 2. Install Platform (CRDs are automatically installed by the chart)
$ helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
$ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

v0.9.0 Helm Chart Issue: The initial v0.9.0 dynamo-platform Helm chart sets the operator image to v0.7.1 instead of v0.9.0. Use RELEASE_VERSION=0.9.0-post1 or add --set dynamo-operator.controllerManager.manager.image.tag=0.9.0 to your helm install command.

For Shared/Multi-Tenant Clusters:

If your cluster has namespace-restricted Dynamo operators, add this flag to step 2:

$ --set dynamo-operator.namespaceRestriction.enabled=true

For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.

2. Choose Your Backend

Each backend has deployment examples and configuration options:

Backend	Aggregated	Aggregated + Router	Disaggregated	Disaggregated + Router	Disaggregated + Planner	Disaggregated Multi-node
SGLang	✅	✅	✅	✅	✅	✅
TensorRT-LLM	✅	✅	✅	✅	🚧	✅
vLLM	✅	✅	✅	✅	✅	✅

3. Deploy Your First Model

Follow the Deploying Your First Model guide for a complete end-to-end walkthrough using DynamoGraphDeploymentRequest (DGDR) — Dynamo’s recommended path that handles profiling and configuration automatically.

The tutorial deploys Qwen/Qwen3-0.6B with vLLM and walks you through every step: creating the DGDR, watching the profiling lifecycle, and sending your first inference request.

For SLA-based autoscaling, see SLA Planner Guide.

Understanding Dynamo’s Custom Resources

Dynamo provides two main Kubernetes Custom Resources for deploying models:

DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration

The recommended approach for generating optimal configurations. DGDR provides a high-level interface where you specify:

Model name and backend framework
SLA targets (latency requirements)
GPU type (optional)

Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:

SLA-driven configuration generation
Automated resource optimization
Users who want simplicity over control

Note: DGDR generates a DGD spec which you can then use to deploy.

DynamoGraphDeployment (DGD) - Direct Configuration

A lower-level interface that defines your complete inference pipeline:

Model configuration
Resource allocation (GPUs, memory)
Scaling policies
Frontend/backend connections

Use this when you need fine-grained control or have already completed profiling.

Refer to the API Reference and Documentation for more details.

📖 API Reference & Documentation

For detailed technical specifications of Dynamo’s Kubernetes resources:

API Reference - Complete CRD field specifications for all Dynamo resources
Create Deployment - Step-by-step deployment creation with DynamoGraphDeployment
Operator Guide - Dynamo operator configuration and management

Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

Development / Testing - Use agg.yaml as the base configuration
Production with Load Balancing - Use agg_router.yaml to enable scalable, load-balanced inference
High Performance / Disaggregated - Use disagg_router.yaml for maximum throughput and modular scalability

Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

Provides OpenAI-compatible /v1/chat/completions endpoint
Auto-discovers backend workers via service discovery (Kubernetes-native by default)
Routes requests and handles load balancing
Validates and preprocesses requests

Customizing Your Deployment

Example structure:

1 apiVersion: nvidia.com/v1alpha1
2 kind: DynamoGraphDeployment
3 metadata:
4   name: my-llm
5 spec:
6   services:
7     Frontend:
8       dynamoNamespace: my-llm
9       componentType: frontend
10       replicas: 1
11       extraPodSpec:
12         mainContainer:
13           image: your-image
14     VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
15       dynamoNamespace: dynamo-dev
16       componentType: worker
17       replicas: 1
18       envFromSecret: hf-token-secret  # for HuggingFace models
19       resources:
20         limits:
21           gpu: "1"
22       extraPodSpec:
23         mainContainer:
24           image: your-image
25           command: ["/bin/sh", "-c"]
26           args:
27             - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

Worker command examples per backend:

1 # vLLM worker
2 args:
3   - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
4 
5 # SGLang worker
6 args:
7   - >-
8     python3 -m dynamo.sglang
9     --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
10     --tp 1
11     --trust-remote-code
12 
13 # TensorRT-LLM worker
14 args:
15   - python3 -m dynamo.trtllm
16     --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
17     --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
18     --extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml

Key customization points include:

Model Configuration: Specify model in the args command
Resource Allocation: Configure GPU requirements under resources.limits
Scaling: Set replicas for number of worker instances
Routing Mode: Enable KV-cache routing by setting DYN_ROUTER_MODE=kv in Frontend envs
Worker Specialization: Add --disaggregation-mode prefill flag for disaggregated prefill workers

Additional Resources

Examples - Complete working examples
Create Custom Deployments - Build your own CRDs
Managing Models with DynamoModel - Deploy LoRA adapters and manage models
Operator Documentation - How the platform works
Service Discovery - Discovery backends and configuration
Helm Charts - For advanced users
Snapshot - Fast pod startup with checkpoint/restore
GitOps Deployment with FluxCD - For advanced users
Logging - For logging setup
Multinode Deployment - For multinode deployment
Grove - For grove details and custom installation
Monitoring - For monitoring setup
Model Caching with Fluid - For model caching with Fluid

$	kubectl version --client # Should show v1.24+
$	helm version # Should show v3.0+

$	# 1. Set environment
$	export NAMESPACE=dynamo-system
$	export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
$
$	# 2. Install Platform (CRDs are automatically installed by the chart)
$	helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
$	helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

1	apiVersion: nvidia.com/v1alpha1
2	kind: DynamoGraphDeployment
3	metadata:
4	name: my-llm
5	spec:
6	services:
7	Frontend:
8	dynamoNamespace: my-llm
9	componentType: frontend
10	replicas: 1
11	extraPodSpec:
12	mainContainer:
13	image: your-image
14	VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
15	dynamoNamespace: dynamo-dev
16	componentType: worker
17	replicas: 1
18	envFromSecret: hf-token-secret # for HuggingFace models
19	resources:
20	limits:
21	gpu: "1"
22	extraPodSpec:
23	mainContainer:
24	image: your-image
25	command: ["/bin/sh", "-c"]
26	args:
27	- python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

1	# vLLM worker
2	args:
3	- python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
4
5	# SGLang worker
6	args:
7	- >-
8	python3 -m dynamo.sglang
9	--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
10	--tp 1
11	--trust-remote-code
12
13	# TensorRT-LLM worker
14	args:
15	- python3 -m dynamo.trtllm
16	--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
17	--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
18	--extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml