# Deploying Inference Graphs to Kubernetes
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
## 1. Install Platform First
```bash
# 1. Set environment
export NAMESPACE=dynamo-kubernetes
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
kubectl create namespace ${NAMESPACE}
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
```
For more details or customization options, see Installation Guide for Dynamo Kubernetes Platform.
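To confirm the installation before moving on, you can list the two Helm releases and check that the platform pods come up (a quick sanity check; exact pod names vary by release):

```bash
# Releases installed above
helm list --namespace default        # dynamo-crds
helm list --namespace ${NAMESPACE}   # dynamo-platform

# Platform pods (operator, etcd, NATS) should reach Ready
kubectl get pods -n ${NAMESPACE}
```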
## 2. Choose Your Backend
Each backend has deployment examples and configuration options:
| Backend | Available Configurations |
|---|---|
| vLLM | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner |
| SGLang | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| TensorRT-LLM | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router |
## 3. Deploy Your First Model
```bash
# Set the same namespace used for the platform install
export NAMESPACE=dynamo-kubernetes

# Deploy any example (this uses vLLM with a Qwen model and aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Test it
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
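Once the frontend responds, you can send a request through the OpenAI-compatible chat endpoint. The model name below assumes the Qwen model from the example manifest; substitute whatever `/v1/models` returns:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```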
## What’s a DynamoGraphDeployment?
It’s a Kubernetes Custom Resource that defines your inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections
The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, demonstrate how you can serve your models locally. The corresponding YAML files, such as `agg.yaml`, show how you could create a Kubernetes deployment for your inference graph.
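For example (paths are illustrative and assume a vLLM checkout laid out like the deploy path used above):

```bash
# Local, single-machine serving via the launch script
./components/backends/vllm/launch/agg.sh

# The same graph expressed as a Kubernetes custom resource
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
```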
## 📖 API Reference & Documentation
For detailed technical specifications of Dynamo’s Kubernetes resources:
- API Reference - Complete CRD field specifications for `DynamoGraphDeployment` and `DynamoComponentDeployment`
- Operator Guide - Dynamo operator configuration and management
- Create Deployment - Step-by-step deployment creation examples
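With the CRDs from step 1 installed, you can also browse these field specifications directly from the cluster; this relies only on `kubectl explain` reading the published CRD schema:

```bash
# Top-level spec fields of the DynamoGraphDeployment resource
kubectl explain dynamographdeployments.spec

# Drill into the per-service configuration
kubectl explain dynamographdeployments.spec.services
```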
## Choosing Your Architecture Pattern
When creating a deployment, select the architecture pattern that best fits your use case:
- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability (see the example after this list)
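For instance, switching the earlier vLLM example to the disaggregated + router pattern is just a different manifest; the path mirrors `agg.yaml` and is an assumption about your checkout:

```bash
kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
```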
## Frontend and Worker Components
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes); a sketch of that split follows the list below. The Frontend serves as a framework-agnostic HTTP entry point that:
- Provides OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
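A minimal sketch of that split, assuming `extraPodSpec` accepts standard pod-level fields such as `nodeSelector` and using a hypothetical node label (check the API Reference for the exact schema; fields like `dynamoNamespace` are omitted for brevity):

```yaml
services:
  Frontend:
    componentType: frontend
    replicas: 1
    extraPodSpec:
      nodeSelector:
        node-pool: cpu          # hypothetical label selecting CPU-only nodes
      mainContainer:
        image: your-image
  VllmDecodeWorker:
    componentType: worker
    replicas: 1
    resources:
      limits:
        gpu: "1"                # the GPU request steers workers onto GPU nodes
    extraPodSpec:
      mainContainer:
        image: your-image
```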
## Customizing Your Deployment
Example structure:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: my-llm # must match the Frontend's dynamoNamespace so the worker is discovered
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```
Worker command examples per backend:
```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml
```
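The deployment above references `envFromSecret: hf-token-secret` for gated HuggingFace models. One way to create that secret (the `HF_TOKEN` key name is an assumption; use whatever key your image expects):

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token> \
  -n ${NAMESPACE}
```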
Key customization points include:
- **Model Configuration**: Specify the model in the `args` command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for the number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend `envs`
- **Worker Specialization**: Add the `--is-prefill-worker` flag for disaggregated prefill workers (see the sketch after this list)
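A hedged sketch combining the last two points, KV-cache routing on the Frontend plus a dedicated prefill worker; the `VllmPrefillWorker` service name is hypothetical, and the `envs` field shape assumes standard name/value pairs:

```yaml
services:
  Frontend:
    componentType: frontend
    replicas: 1
    envs:
      - name: DYN_ROUTER_MODE
        value: "kv"            # enable KV-cache-aware routing
  VllmPrefillWorker:           # hypothetical prefill-only worker service
    componentType: worker
    replicas: 1
    resources:
      limits:
        gpu: "1"
    extraPodSpec:
      mainContainer:
        image: your-image
        command: ["/bin/sh", "-c"]
        args:
          - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker
```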
## Additional Resources
- Examples - Complete working examples
- Create Custom Deployments - Build your own CRDs
- Operator Documentation - How the platform works
- Helm Charts - For advanced users