Dynamo’s canonical Kubernetes deployment is a
DynamoGraphDeployment (DGD). A DGD
describes the inference graph you want to run. The Dynamo operator reconciles
that graph into one or more
DynamoComponentDeployment (DCD)
resources, which run the frontend, router, prefill workers, decode workers, and
other graph components.
This is the Kubernetes-native control path for Dynamo: you author or generate Dynamo resources, and the operator translates them into Kubernetes workloads, services, routing metadata, model-loading resources, and status conditions. For local development or incremental adoption, you can still run the same frontend, router, and worker components outside Kubernetes.
You can create a DGD directly from a known-good manifest, or you can use a
DynamoGraphDeploymentRequest (DGDR) to profile your model and
generate a DGD for you.
Most users only need three ideas before they deploy:
You do not need to author DCDs directly for normal deployments.
Start with the row that matches your situation. The later sections of this page are reference material; read them as needed rather than in order.
If a recipe matches
your target model, backend, GPU type, and serving mode, start there. Recipes are
curated DynamoGraphDeployment manifests with model-cache setup and, for many
recipes, benchmark jobs.
The common recipe flow is:
Follow the README in the specific recipe directory for model-specific images, GPU requirements, cache setup, and request examples.
A DGDR is Dynamo’s deploy-by-intent path. Instead of hand-crafting a deployment spec with parallelism settings, replica counts, and resource limits, you describe what you want to run (model, backend, workload, SLA targets) and DGDR generates a DGD:
The result is a DynamoGraphDeployment (DGD) spec with the best parallelization strategy,
replica counts, and resource configuration.
With autoApply: false, the generated DGD is stored in
.status.profilingResults.selectedConfig for you to inspect and optionally
modify before deploying.
With autoApply: true, the operator creates the DGD. With
autoApply: false, you apply the generated DGD yourself.
DGDR currently supports generated-deployment feature configuration for Planner
(features.planner) and mocker mode (features.mocker). The DGDR API does not
currently expose features.kvRouter; configure explicit router mode in a DGD,
a tuned recipe, or a generated DGD override when you need KV-aware routing
details.
For the DGDR spec reference, field descriptions, and lifecycle phases, see the DGDR Reference.
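As a rough illustration of the deploy-by-intent idea, the sketch below assembles the spec portion of a DGDR as a plain Python dict. The keys backend, searchStrategy, and autoApply are taken from this page; the "model" key name, the flat layout, and the example model ID are assumptions for illustration, so check the DGDR Reference for the real schema.

```python
import json


def build_dgdr_spec(model, backend="auto", search_strategy="rapid", auto_apply=False):
    """Assemble an illustrative DGDR spec as a plain dict.

    backend, searchStrategy, and autoApply are field names used on this page.
    The "model" key and the flat layout are assumptions, not the confirmed schema.
    """
    return {
        "model": model,  # assumed key name
        "backend": backend,
        "searchStrategy": search_strategy,
        "autoApply": auto_apply,
    }


# Example model ID is a placeholder; substitute your own.
spec = build_dgdr_spec("Qwen/Qwen3-0.6B", backend="vllm")
print(json.dumps(spec, indent=2))
```

With auto_apply left at False, the generated DGD would be read from .status.profilingResults.selectedConfig and applied by hand, as described above.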
The searchStrategy field controls how the profiler explores configurations.
Your choice depends on how much time you can invest and how close to optimal
you need.
The rapid strategy uses AIC-backed DynoSim-style performance modeling to search deployment configurations without running real inference. It completes in about 30 seconds and consumes no GPU resources during profiling.
Use rapid when:
Limitations:
The thorough strategy enumerates candidate parallelization configs, deploys each on real GPUs, and benchmarks them with AIPerf. It takes 2–4 hours.
Use thorough when:
Constraints:
backend: auto is not supported with thorough; you must specify vllm, sglang, or
trtllm. The DGDR is rejected if you use auto with thorough.
The rapid strategy relies on AIC performance models. AIC currently supports:
Some rapid-mode SKUs use AIC estimate-only data until measured profiles are
available. Use searchStrategy: thorough when you need hardware-measured
profiling for an estimate-only or unsupported SKU.
When specifying GPU SKUs manually, use lowercase underscore format (e.g.,
h100_sxm, not H100-SXM5-80GB). See the
DGDR Reference — SKU Format for the full list.
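Two of the rules above (thorough rejects backend: auto, and manual GPU SKUs use lowercase underscore format) can be checked before you submit a DGDR. The following is a minimal client-side sketch, not the operator's validation logic. The function name and the regular expression are assumptions, and the pattern only checks the shape of the SKU string, not whether the SKU is actually supported.

```python
import re

VALID_BACKENDS = {"auto", "vllm", "sglang", "trtllm"}
# Shape check only: lowercase letters/digits separated by underscores.
SKU_PATTERN = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def check_request(search_strategy, backend, gpu_sku=None):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if backend not in VALID_BACKENDS:
        problems.append(f"unknown backend: {backend}")
    if search_strategy == "thorough" and backend == "auto":
        problems.append("thorough requires an explicit backend (vllm, sglang, or trtllm)")
    if gpu_sku is not None and not SKU_PATTERN.match(gpu_sku):
        problems.append(f"GPU SKU should use lowercase underscore format: {gpu_sku}")
    return problems


print(check_request("thorough", "auto"))
print(check_request("rapid", "vllm", "H100-SXM5-80GB"))
print(check_request("rapid", "auto", "h100_sxm"))
```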
All three backends are supported for both rapid and thorough:
If you are deploying a Mixture-of-Experts (MoE) model (e.g., DeepSeek-R1, Qwen3-MoE), use SGLang as the backend for full support. vLLM and TRT-LLM have partial MoE support that is still under development.
The profiler selects different parallelization strategies depending on the model architecture:
After the basic deployment path is clear, use this checklist to decide which production topics apply:
Set up model caching before deploying if any of these apply:
Add a modelCache section to your DGDR spec that points to a pre-populated PVC:
The operator mounts this PVC at pvcMountPath read-only into the profiling job
and passes it through to the generated DGD, so both profiling and serving use
the cached weights.
pvcModelPath must be the HuggingFace snapshot path inside the PVC —
hub/models--<org>--<model>/snapshots/<commit-hash>. This follows the layout
that huggingface-cli download creates when HF_HOME is set to the mount
point. Replace <org>--<model> by substituting / with -- in the model ID,
and replace <commit-hash> with the actual snapshot revision. See
Model Caching for how to look up the
hash after downloading.
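The path rule above is mechanical, so it can be sketched in a few lines of Python. The helper names and the example mount path are hypothetical, and the commit hash is left as a placeholder you must fill in. Only pvcMountPath and pvcModelPath are included because those are the keys this page names; the key that identifies the PVC itself is omitted.

```python
def hf_snapshot_path(model_id, commit_hash):
    """Build the snapshot path inside the PVC from a HuggingFace model ID."""
    org_and_model = model_id.replace("/", "--")
    return f"hub/models--{org_and_model}/snapshots/{commit_hash}"


def model_cache_section(model_id, commit_hash, mount_path):
    """Illustrative modelCache fragment using the two keys named on this page."""
    return {
        "pvcMountPath": mount_path,
        "pvcModelPath": hf_snapshot_path(model_id, commit_hash),
    }


# "/opt/model-cache" is a hypothetical mount point; "<commit-hash>" is a placeholder.
print(model_cache_section("meta-llama/Llama-3.1-70B-Instruct", "<commit-hash>", "/opt/model-cache"))
```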
ReadWriteMany PVC — see the
Installation Guide — Shared Storage
for provider-specific options (EFS, Azure Lustre, GKE Filestore).
modelCache field.
See Model Caching for the full walkthrough with YAML examples.
For models that require authentication (e.g., gated HuggingFace models), create
a Kubernetes Secret named hf-token-secret with a HF_TOKEN key:
The profiler and deployed pods will automatically use this token.
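One way to produce that Secret is to generate a standard Kubernetes Secret manifest and pipe it to kubectl. This sketch uses only core Kubernetes fields (apiVersion v1, kind Secret, stringData); the function name and the placeholder token are illustrative.

```python
import json


def hf_token_secret(token, namespace=None):
    """Build a Kubernetes Secret manifest named hf-token-secret with an HF_TOKEN key."""
    manifest = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "hf-token-secret"},
        "type": "Opaque",
        # stringData lets Kubernetes handle the base64 encoding.
        "stringData": {"HF_TOKEN": token},
    }
    if namespace:
        manifest["metadata"]["namespace"] = namespace
    return manifest


# "hf_xxx" is a placeholder; never commit a real token.
print(json.dumps(hf_token_secret("hf_xxx"), indent=2))
```

Because kubectl accepts JSON manifests, the printed output can be piped to `kubectl apply -f -` in the namespace where the DGDR will run.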
The Planner provides runtime autoscaling for disaggregated deployments. It adjusts prefill and decode replica counts to meet your SLA targets as traffic fluctuates.
The sla optimization target reads live TTFT/ITL metrics from Prometheus. If
you want SLA-driven autoscaling, install Prometheus before creating the DGDR.
See the Installation Guide — Prometheus
for setup instructions.
The throughput and latency modes use internal queue-depth signals and work
without Prometheus.
See the Planner Guide for advanced configuration and scaling behavior details.
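The Prometheus dependency described above reduces to a small lookup. The sketch below is only a reading aid: the function name is hypothetical and the three target names are the ones mentioned on this page.

```python
PROMETHEUS_TARGETS = {"sla"}                       # reads live TTFT/ITL metrics
QUEUE_DEPTH_TARGETS = {"throughput", "latency"}    # use internal queue-depth signals


def requires_prometheus(optimization_target):
    """Return True if the Planner optimization target needs Prometheus installed."""
    if optimization_target in PROMETHEUS_TARGETS:
        return True
    if optimization_target in QUEUE_DEPTH_TARGETS:
        return False
    raise ValueError(f"unknown optimization target: {optimization_target}")


for target in ("sla", "throughput", "latency"):
    print(target, requires_prometheus(target))
```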
Models that require more GPUs than a single node provides (e.g., DeepSeek-R1 on 8-GPU nodes) need multinode orchestration.
Grove is required for multinode DGDR deployments. It provides gang scheduling (all pods in a group start together or not at all), coordinated scaling, and network topology-aware placement. The operator will return an error if you attempt a multinode deployment without Grove or LeaderWorkerSet (LWS) installed.
KAI Scheduler is optional but recommended alongside Grove for GPU-aware scheduling and topology optimization.
See the Installation Guide — Grove + KAI Scheduler for setup instructions and the compatibility matrix.
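Whether a deployment counts as multinode follows from simple arithmetic on the GPU count. The helper names below are hypothetical, and the sketch assumes every node exposes the same number of GPUs.

```python
import math


def nodes_required(total_gpus, gpus_per_node):
    """Number of identical nodes needed to supply total_gpus."""
    if total_gpus < 1 or gpus_per_node < 1:
        raise ValueError("GPU counts must be positive")
    return math.ceil(total_gpus / gpus_per_node)


def needs_multinode(total_gpus, gpus_per_node):
    """True when the model cannot fit on a single node's GPUs."""
    return nodes_required(total_gpus, gpus_per_node) > 1


# A model needing 16 GPUs on 8-GPU nodes spans two nodes,
# so Grove (or LWS) must be installed first.
print(nodes_required(16, 8), needs_multinode(16, 8))
```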
Disaggregated serving transfers KV cache data between prefill and decode workers. Understanding the networking stack helps you diagnose performance issues:
When RDMA is missing or not active, NIXL can fall back to TCP. That makes KV cache movement the likely bottleneck and can produce very high TTFT or low throughput even when the model workers appear healthy.
Enable RDMA if:
See the Installation Guide — Network Operator / RDMA for provider-specific setup instructions, and the Disaggregated Communication Guide for transport details and performance expectations.
The profiler sweeps MoE models across up to 4 nodes (dense models: 1 node max per engine during sweep). If your MoE model requires more than 4 nodes of GPUs, the profiler will select the best config within that range and you may need to adjust replica counts manually.
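The sweep limits above can be expressed as a quick check for whether the profiler's search range covers your model. This is a sketch of the stated limits only (4 nodes for MoE, 1 node for dense); the names are hypothetical and the profiler's actual selection logic is not reproduced here.

```python
# Maximum nodes the profiler sweeps per engine, as stated on this page.
MAX_SWEEP_NODES = {"moe": 4, "dense": 1}


def sweep_covers(required_nodes, architecture):
    """True if the profiler's sweep range includes a config of this size."""
    return required_nodes <= MAX_SWEEP_NODES[architecture]


print(sweep_covers(4, "moe"))    # within the MoE sweep range
print(sweep_covers(6, "moe"))    # beyond it: expect manual adjustment
print(sweep_covers(2, "dense"))  # dense sweeps stay on one node
```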
The backend field controls which inference engine is used. The default
(auto) lets the profiler pick the best backend, but you should specify a
backend explicitly in these cases:
TensorRT-LLM does not support Python 3.11. If your environment uses
Python 3.11, use vllm or sglang instead.
Each backend handles multinode inference differently:
SGLang: uses --dist-init-addr, --nnodes, --node-rank flags for distributed setup.
TensorRT-LLM: uses mpirun.
Ensure hardware.totalGpus is large enough for your model. The
profiler calculates minimum TP from model size and VRAM, but edge cases
(large context lengths, KV cache overhead) may require more GPUs than the
minimum.
The operator caps auto-detected GPU count at 32. If your cluster has more
GPUs and you want the profiler to use them, set hardware.totalGpus explicitly:
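The effect of the cap can be pictured with a short sketch. This models the behavior described here rather than the operator's code; the function name is hypothetical, and explicit_total stands in for a value you set in hardware.totalGpus.

```python
AUTO_DETECT_CAP = 32  # cap on auto-detected GPU count, as stated on this page


def effective_total_gpus(detected, explicit_total=None):
    """GPU count the profiler would work with under the described cap."""
    if explicit_total is not None:
        return explicit_total  # an explicit hardware.totalGpus bypasses the cap
    return min(detected, AUTO_DETECT_CAP)


print(effective_total_gpus(64))                     # capped when auto-detected
print(effective_total_gpus(64, explicit_total=64))  # explicit value is used as given
```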
GPU nodes often have taints. Add tolerations via the overrides field:
Once the DGDR enters the Profiling phase, the spec cannot be changed. If you
need to adjust settings, delete the DGDR and recreate it:
Deleting a DGDR does not delete the DGD it created. This is intentional — the DGD continues serving traffic independently. To clean up fully:
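A full cleanup therefore takes two deletes, one per resource. The sketch below only prints the commands. It assumes the CRDs are addressable by their lowercase kind names (dynamographdeploymentrequest and dynamographdeployment); the resource names and namespace are placeholders, so confirm the real ones with kubectl get before running anything.

```python
import shlex


def cleanup_commands(dgdr_name, dgd_name, namespace):
    """Build the two kubectl delete commands: the DGDR, then the DGD it generated."""
    return [
        ["kubectl", "delete", "dynamographdeploymentrequest", dgdr_name, "-n", namespace],
        ["kubectl", "delete", "dynamographdeployment", dgd_name, "-n", namespace],
    ]


# Placeholder names; the generated DGD's name is not necessarily the DGDR's name.
for cmd in cleanup_commands("my-dgdr", "my-dgd", "dynamo"):
    print(shlex.join(cmd))
```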
A small model on a single node with rapid profiling — the simplest case:
A 70B model with model caching, SLA targets, and the planner enabled:
A large MoE model requiring multinode, SGLang backend, and thorough profiling:
Prerequisites for this deployment: