
Copyright © 2026, NVIDIA Corporation.


Deployment Overview

Understand DynamoGraphDeployments, DynamoComponentDeployments, DGDR, and recipes

Dynamo’s canonical Kubernetes deployment is a DynamoGraphDeployment (DGD). A DGD describes the inference graph you want to run. The Dynamo operator reconciles that graph into one or more DynamoComponentDeployment (DCD) resources, which run the frontend, router, prefill workers, decode workers, and other graph components.

This is the Kubernetes-native control path for Dynamo: you author or generate Dynamo resources, and the operator translates them into Kubernetes workloads, services, routing metadata, model-loading resources, and status conditions. For local development or incremental adoption, you can still run the same frontend, router, and worker components outside Kubernetes.

You can create a DGD directly from a known-good manifest, or you can use a DynamoGraphDeploymentRequest (DGDR) to profile your model and generate a DGD for you.

Most users only need three ideas before they deploy:

  • Recipes are the fastest path when one matches your model, backend, hardware, and serving pattern. They are already DGD manifests.
  • DGDR is the guided path when you want Dynamo to profile and generate a DGD from model/SLA intent.
  • DGD is the object that serves traffic. DGDR can create it, but the DGD is what persists after profiling completes.

You do not need to author DCDs directly for normal deployments.

Start Here: Resource Model

| Resource or path | What it is | Use it when | Learn more |
| --- | --- | --- | --- |
| DynamoGraphDeployment (DGD) | The canonical live deployment for a Dynamo inference graph. | You have a known-good configuration or tuned YAML. | Creating Deployments, DGD API |
| DynamoComponentDeployment (DCD) | The per-component deployment objects created from a DGD. | Usually not authored directly; inspect them to debug frontend/router/worker rollout. | DCD API |
| DynamoGraphDeploymentRequest (DGDR) | A deploy-by-intent request that profiles your model/hardware and generates a DGD. | You want Dynamo to size the deployment, choose parallelism, configure supported generated-deployment features such as Planner, or produce DGD YAML. | DGDR Reference |
| Recipes | Curated deploy.yaml manifests that are already DGD specs. | A recipe matches your model, backend, hardware, and serving mode. | Dynamo recipes |
| DynamoModel | Model and adapter lifecycle management layered onto an existing DGD or DCD. | You need declarative model operations such as LoRA adapter loading. | Managing Models with DynamoModel |

Choose Your Path

Start with the row that matches your situation. The later sections of this page are reference material; read them as needed rather than in order.

| Situation | Do this first | Then read |
| --- | --- | --- |
| A recipe matches your model/backend/hardware | Apply the recipe’s model cache resources, then apply its deploy.yaml. | Deploy a Tuned DGD from Recipes |
| You want Dynamo to generate the deployment | Create a DGDR. Use autoApply: true to let the operator create the DGD, or autoApply: false to inspect the generated DGD YAML first. | Use DGDR to Generate a DGD |
| You already know the exact topology | Author or edit a DGD directly, then apply it with kubectl. | Creating Deployments |
| You are preparing for production | Add model caching, choose backend/search strategy, and validate networking/planner needs. | Production Details |

Deploy a Tuned DGD from Recipes

If a recipe matches your target model, backend, GPU type, and serving mode, start there. Recipes are curated DynamoGraphDeployment manifests with model-cache setup and, for many recipes, benchmark jobs.

The common recipe flow is:

$ cd recipes

# Update the recipe storageClassName first, then create model cache resources.
$ kubectl apply -f <model>/model-cache/ -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download \
    -n ${NAMESPACE} --timeout=6000s

# Deploy a tuned DGD.
$ kubectl apply -f <model>/<backend>/<mode>/deploy.yaml -n ${NAMESPACE}

Follow the README in the specific recipe directory for model-specific images, GPU requirements, cache setup, and request examples.

Use DGDR to Generate a DGD

A DGDR is Dynamo’s deploy-by-intent path. Instead of hand-crafting a deployment spec with parallelism settings, replica counts, and resource limits, you describe what you want to run (model, backend, workload, SLA targets) and DGDR generates a DGD:

  1. Spec — You submit a DGDR with your model, workload expectations, and optional SLA targets.
  2. Hardware Discovery — The operator discovers your cluster’s GPU hardware (SKU, VRAM, count per node) via DCGM or node labels.
  3. Profiling — The profiler analyzes your model against the discovered hardware, using either rapid simulation or thorough real-GPU benchmarking.
  4. DGD Generation — The profiler produces an optimized DynamoGraphDeployment (DGD) spec with the best parallelization strategy, replica counts, and resource configuration.
  5. Review (when autoApply: false) — The generated DGD is stored in .status.profilingResults.selectedConfig for you to inspect and optionally modify before deploying.
  6. Deploy — With autoApply: true, the operator creates the DGD. With autoApply: false, you apply the generated DGD yourself.
  7. Planner (optional) — If enabled, the Planner monitors live traffic and adjusts replica counts at runtime to meet your SLA targets.

DGDR currently supports generated-deployment feature configuration for Planner (features.planner) and mocker mode (features.mocker). The DGDR API does not currently expose features.kvRouter; configure explicit router mode in a DGD, a tuned recipe, or a generated DGD override when you need KV-aware routing details.

┌──────┐    ┌───────────┐    ┌──────────┐    ┌───────────┐    ┌────────┐    ┌─────────┐
│ Spec │───▶│ Hardware  │───▶│ Profiler │───▶│ Generated │───▶│ Deploy │───▶│ Planner │
│      │    │ Discovery │    │          │    │ DGD       │    │        │    │ (opt.)  │
└──────┘    └───────────┘    └──────────┘    └───────────┘    └────────┘    └─────────┘
                                                   │
                                           autoApply: false?
                                                   ▼ Review

For the DGDR spec reference, field descriptions, and lifecycle phases, see the DGDR Reference.
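
As a rough illustration of the Spec step, the sketch below assembles a DGDR manifest as a Python dict and prints it as JSON, which kubectl can apply in place of YAML. The helper name build_dgdr is hypothetical; the field names, the apiVersion, and the auto and rapid defaults are taken from this page. Treat it as a sketch under those assumptions, not as a client library.

```python
import json


def build_dgdr(name, model, *, auto_apply, backend="auto", search_strategy="rapid"):
    """Return a DGDR manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "searchStrategy": search_strategy,
            # False keeps the generated DGD in status for review before deploying.
            "autoApply": auto_apply,
        },
    }


manifest = build_dgdr("qwen-small", "Qwen/Qwen3-0.6B", auto_apply=False)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would submit the request; with auto_apply=False you would then inspect the generated DGD before applying it yourself.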

DGDR Detail: Choose a Search Strategy

The searchStrategy field controls how the profiler explores configurations. Your choice depends on how much time you can invest and how close to optimal the result needs to be.

Rapid (Default)

searchStrategy: rapid

Uses AIC-backed DynoSim-style performance modeling to search deployment configurations without running real inference. Completes in ~30 seconds with no GPU resources consumed during profiling.

Use rapid when:

  • Getting started or iterating quickly
  • Running in CI/CD pipelines
  • Your GPU SKU is in the AIC support matrix

Limitations:

  • If AIC does not support your model/hardware/backend combination, the profiler falls back to a naive memory-fit config (basic TP calculation) which may not be optimal.
  • Simulated results may differ from real-hardware performance for unusual configurations.

Thorough

searchStrategy: thorough
backend: vllm  # must specify a concrete backend

Enumerates candidate parallelization configs, deploys each on real GPUs, and benchmarks with AIPerf. Takes 2–4 hours.

Use thorough when:

  • Tuning for production and you need the best configuration the profiler can find
  • Your hardware is not supported by AIC (for example, V100 or T4; see the AIC support matrix)
  • You want measured rather than simulated performance data

Constraints:

  • Disaggregated mode only — thorough does not run aggregated configurations.
  • backend: auto is not supported — you must specify vllm, sglang, or trtllm. The DGDR will be rejected if you use auto with thorough.
  • Requires GPU resources — the profiler deploys real inference engines on your cluster during profiling.
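
The backend constraint above can be checked before you submit a DGDR. The following is a minimal sketch of such a pre-flight check; the function name check_search_strategy and the error messages are hypothetical, and the operator's own validation may differ.

```python
CONCRETE_BACKENDS = {"vllm", "sglang", "trtllm"}
SEARCH_STRATEGIES = {"rapid", "thorough"}


def check_search_strategy(search_strategy="rapid", backend="auto"):
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    if search_strategy not in SEARCH_STRATEGIES:
        problems.append(f"unknown searchStrategy: {search_strategy!r}")
    if backend != "auto" and backend not in CONCRETE_BACKENDS:
        problems.append(f"unknown backend: {backend!r}")
    # thorough deploys real engines, so it needs a concrete backend.
    if search_strategy == "thorough" and backend == "auto":
        problems.append("thorough requires backend vllm, sglang, or trtllm")
    return problems


print(check_search_strategy("thorough", "auto"))
print(check_search_strategy("thorough", "vllm"))
```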

DGDR Detail: AIC Support Matrix

The rapid strategy relies on AIC performance models. AIC currently supports:

GPU SKUs

| Supported (rapid) | Not Yet Supported (use thorough) |
| --- | --- |
| H100 SXM | V100 (SXM/PCIe) |
| H100 PCIe | T4 |
| H200 SXM | MI200, MI300 |
| A100 SXM | |
| A100 PCIe | |
| A30 | |
| B200 SXM | |
| GB200 SXM | |
| L40S | |
| L4 | |

Some rapid-mode SKUs use AIC estimate-only data until measured profiles are available. Use searchStrategy: thorough when you need hardware-measured profiling for an estimate-only or unsupported SKU.

When specifying GPU SKUs manually, use lowercase underscore format (e.g., h100_sxm, not H100-SXM5-80GB). See the DGDR Reference — SKU Format for the full list.
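
To make the support matrix easier to apply, here is a small sketch that suggests a search strategy from a GPU name. It uses the display names from the table above, not the lowercase identifiers the DGDR spec expects, and the helper name suggest_search_strategy is hypothetical.

```python
# Display names copied from the "Supported (rapid)" column above.
RAPID_SKUS = {
    "H100 SXM", "H100 PCIe", "H200 SXM", "A100 SXM", "A100 PCIe",
    "A30", "B200 SXM", "GB200 SXM", "L40S", "L4",
}


def suggest_search_strategy(gpu_name):
    """Suggest rapid when AIC covers the GPU, otherwise thorough."""
    return "rapid" if gpu_name in RAPID_SKUS else "thorough"


print(suggest_search_strategy("H200 SXM"))
print(suggest_search_strategy("T4"))
```

A SKU in the supported set may still rely on estimate-only data, so this lookup is a starting point rather than a guarantee of measured accuracy.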

Backends

All three backends are supported for both rapid and thorough:

| Backend | Dense Models | MoE Models |
| --- | --- | --- |
| vLLM | ✅ | 🚧 Work in progress |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 Work in progress |

If you are deploying a Mixture-of-Experts (MoE) model (e.g., DeepSeek-R1, Qwen3-MoE), use SGLang as the backend for full support. vLLM and TRT-LLM have partial MoE support that is still under development.

Parallelization Strategies

The profiler selects different parallelization strategies depending on the model architecture:

| Model Architecture | Prefill | Decode |
| --- | --- | --- |
| MLA+MoE (DeepSeek-V3, DeepSeek-R1) | TEP, DEP | TEP, DEP |
| GQA+MoE (Qwen3-MoE) | TP, TEP, DEP | TP, TEP, DEP |
| Dense models (Llama, Qwen, etc.) | TP | TP |
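
The mapping above can be expressed as a simple lookup. This is only a sketch: the architecture keys (mla_moe, gqa_moe, dense) and the function name are hypothetical labels chosen for illustration, and the profiler's internal representation is not described on this page.

```python
# Candidate strategies per architecture and phase, transcribed from the table above.
STRATEGIES = {
    "mla_moe": {"prefill": ("TEP", "DEP"), "decode": ("TEP", "DEP")},
    "gqa_moe": {"prefill": ("TP", "TEP", "DEP"), "decode": ("TP", "TEP", "DEP")},
    "dense": {"prefill": ("TP",), "decode": ("TP",)},
}


def candidate_strategies(architecture, phase):
    """Return the strategies considered for an architecture and phase."""
    return STRATEGIES[architecture][phase]


print(candidate_strategies("mla_moe", "prefill"))
print(candidate_strategies("dense", "decode"))
```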

Production Details

After the basic deployment path is clear, use this checklist to decide which production topics apply:

| Concern | Why it matters | Section |
| --- | --- | --- |
| Model startup is slow or the model is gated | Avoid repeated downloads and pass HF_TOKEN cleanly. | Model Caching |
| Traffic changes over time | Planner can scale prefill/decode replicas at runtime. | Planner |
| The model spans nodes or uses disaggregated serving | Grove/LWS and RDMA affect scheduling and KV transfer. | Multinode and RDMA |
| You need a specific inference engine | Backend choice affects MoE support, thorough profiling, and distributed behavior. | Backend Selection |

Production Detail: Model Caching

Set up model caching before deploying if any of these apply:

  • Your model is large (>70B parameters) — downloading hundreds of GB per pod takes hours
  • You are scaling to many replicas — each pod downloads the full model independently, and HuggingFace will rate-limit concurrent downloads
  • You want fast pod startup on scaling events

How It Works with DGDR

Add a modelCache section to your DGDR spec that points to a pre-populated PVC:

spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    pvcModelPath: hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/<commit-hash>

The operator mounts this PVC at pvcMountPath read-only into the profiling job and passes it through to the generated DGD, so both profiling and serving use the cached weights.

pvcModelPath must be the HuggingFace snapshot path inside the PVC — hub/models--<org>--<model>/snapshots/<commit-hash>. This follows the layout that huggingface-cli download creates when HF_HOME is set to the mount point. Replace <org>--<model> by substituting / with -- in the model ID, and replace <commit-hash> with the actual snapshot revision. See Model Caching for how to look up the hash after downloading.
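
The substitution rule above is mechanical, so it can be scripted. The sketch below derives pvcModelPath from a model ID and a snapshot revision; the function name is hypothetical and the hash in the example is a placeholder, not a real revision.

```python
def pvc_model_path(model_id, commit_hash):
    """Build the snapshot path inside the PVC by replacing '/' with '--' in the model ID."""
    return f"hub/models--{model_id.replace('/', '--')}/snapshots/{commit_hash}"


# "abc123" stands in for the real snapshot revision you look up after downloading.
print(pvc_model_path("meta-llama/Llama-3.1-70B-Instruct", "abc123"))
```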

Setup

  1. Create a ReadWriteMany PVC — see the Installation Guide — Shared Storage for provider-specific options (EFS, Azure Lustre, GKE Filestore).
  2. Run a one-time download Job to populate the PVC.
  3. Reference the PVC in your DGDR’s modelCache field.

See Model Caching for the full walkthrough with YAML examples.

Private and Gated Models

For models that require authentication (e.g., gated HuggingFace models), create a Kubernetes Secret named hf-token-secret with a HF_TOKEN key:

$ kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN=<your-token> \
    -n $NAMESPACE

The profiler and deployed pods will automatically use this token.

Production Detail: Planner

The Planner provides runtime autoscaling for disaggregated deployments. It adjusts prefill and decode replica counts to meet your SLA targets as traffic fluctuates.

spec:
  features:
    planner:
      enabled: true
  sla:
    ttft: 500  # Target time to first token (ms)
    itl: 50    # Target inter-token latency (ms)

Planner Scaling Modes

| Mode | Description | Prometheus Required? |
| --- | --- | --- |
| throughput (default) | Static queue-depth and KV-cache thresholds; scales based on saturation | No |
| latency | Same as throughput with more aggressive thresholds | No |
| sla | Rust engine perf shim targeting specific TTFT/ITL values; uses native AIC when available, optional bootstrap data, and live FPM tuning | Yes |

Prometheus Requirement

The sla optimization target reads live TTFT/ITL metrics from Prometheus. If you want SLA-driven autoscaling, install Prometheus before creating the DGDR. See the Installation Guide — Prometheus for setup instructions.

The throughput and latency modes use internal queue-depth signals and work without Prometheus.
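
If you generate deployments from scripts, the Prometheus dependency can be encoded as a lookup so that a missing installation is caught early. This is a minimal sketch; the function name needs_prometheus is hypothetical and the mode names are the three described in this section.

```python
# True means the mode reads live TTFT/ITL metrics from Prometheus.
PLANNER_MODES = {"throughput": False, "latency": False, "sla": True}


def needs_prometheus(mode="throughput"):
    """Return True if the planner mode depends on Prometheus."""
    if mode not in PLANNER_MODES:
        raise ValueError(f"unknown planner mode: {mode!r}")
    return PLANNER_MODES[mode]


print(needs_prometheus("sla"))
print(needs_prometheus())
```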

See the Planner Guide for advanced configuration and scaling behavior details.

Production Detail: Multinode and RDMA

Models that require more GPUs than a single node provides (e.g., DeepSeek-R1 on 8-GPU nodes) need multinode orchestration.

Grove and KAI Scheduler

Grove is required for multinode DGDR deployments. It provides gang scheduling (all pods in a group start together or not at all), coordinated scaling, and network topology-aware placement. The operator returns an error if you attempt a multinode deployment when neither Grove nor LeaderWorkerSet (LWS) is installed.

KAI Scheduler is optional but recommended alongside Grove for GPU-aware scheduling and topology optimization.

See the Installation Guide — Grove + KAI Scheduler for setup instructions and the compatibility matrix.

High-Speed Networking (RDMA)

Disaggregated serving transfers KV cache data between prefill and decode workers. Understanding the networking stack helps you diagnose performance issues:

| Layer | What it is |
| --- | --- |
| NIXL | Dynamo’s KV cache transfer library. Moves data between prefill and decode pods. |
| UCX / libfabric | Low-level communication frameworks that NIXL uses underneath. |
| RDMA | Remote Direct Memory Access: the general technique for moving data between machines without involving the CPU. |
| InfiniBand | High-speed RDMA networking standard. Common on-prem and on Azure (AKS). |
| RoCE | RDMA over Converged Ethernet: RDMA on standard Ethernet hardware. |
| EFA | AWS Elastic Fabric Adapter: AWS’s RDMA-capable networking for EKS. |
| GPUDirect RDMA | Allows data to go directly between a GPU and a network adapter, bypassing CPU memory entirely. |
| NCCL | NVIDIA Collective Communications Library: handles intra-model parallelism (TP/PP) communication within a pod. Separate from NIXL. |

When RDMA is missing or not active, NIXL can fall back to TCP. That makes KV cache movement the likely bottleneck and can produce very high TTFT or low throughput even when the model workers appear healthy.

Enable RDMA if:

  • You are running multinode disaggregated deployments
  • You need low-latency KV cache transfer between workers

See the Installation Guide — Network Operator / RDMA for provider-specific setup instructions, and the Disaggregated Communication Guide for transport details and performance expectations.

MoE Models and Multinode Sweep Limits

During the sweep, the profiler evaluates MoE models across up to 4 nodes and dense models on at most 1 node per engine. If your MoE model requires more than 4 nodes of GPUs, the profiler selects the best configuration within that range, and you may need to adjust replica counts manually.
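
A quick arithmetic check can tell you whether an engine is likely to fall inside these sweep limits. The sketch below assumes you already know how many GPUs one engine needs and how many GPUs each node provides; the helper names are hypothetical, and only the limits (4 nodes for MoE, 1 for dense) come from this page.

```python
import math


def sweep_node_limit(is_moe):
    """Maximum nodes per engine the sweep covers: 4 for MoE models, 1 for dense models."""
    return 4 if is_moe else 1


def fits_in_sweep(gpus_needed, gpus_per_node, is_moe):
    """Return True if an engine of this size falls within the sweep range."""
    nodes = math.ceil(gpus_needed / gpus_per_node)
    return nodes <= sweep_node_limit(is_moe)


print(fits_in_sweep(32, 8, is_moe=True))   # 4 nodes, within the MoE limit
print(fits_in_sweep(40, 8, is_moe=True))   # 5 nodes, beyond the MoE limit
print(fits_in_sweep(16, 8, is_moe=False))  # 2 nodes, beyond the dense limit
```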

Production Detail: Backend Selection

The backend field controls which inference engine is used. The default (auto) lets the profiler pick the best backend, but you should specify a backend explicitly in these cases:

| Scenario | Recommended Backend |
| --- | --- |
| MoE models (DeepSeek-R1, Qwen3-MoE) | sglang (full MoE support) |
| Using searchStrategy: thorough | Any except auto (required) |
| TensorRT-LLM compilation caching | trtllm (add a compilation cache PVC) |
| Need load-based planner scaling (FPM) | vllm (any config) or trtllm (non-attention-DP only). SGLang FPM is wired in Dynamo but the upstream module is not in the 1.2.0 runtime image. |

TensorRT-LLM does not support Python 3.11. If your environment uses Python 3.11, use vllm or sglang instead.

Multinode Backend Behavior

Each backend handles multinode inference differently:

  • vLLM: Uses Ray for multi-node TP/PP. Ray head runs on the leader, agents on workers.
  • SGLang: Uses --dist-init-addr, --nnodes, --node-rank flags for distributed setup.
  • TRT-LLM: MPI-based. The operator auto-generates SSH keypairs; the leader runs mpirun.
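
To illustrate the SGLang case, the sketch below builds the per-node argument lists from the three flags named above. The function name and the leader address are hypothetical, and in a Dynamo deployment the operator supplies these values for you; the example only shows how the flags relate to each other.

```python
def sglang_dist_args(leader_addr, nnodes):
    """Return one argument list per node; node rank 0 is the leader."""
    return [
        [
            "--dist-init-addr", leader_addr,
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
        ]
        for rank in range(nnodes)
    ]


# "leader-0:29500" is a placeholder host:port for the rank-0 node.
for node_args in sglang_dist_args("leader-0:29500", 2):
    print(" ".join(node_args))
```

Every node receives the same address and node count; only the rank differs.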

Troubleshooting

OOM During Profiling or Serving

  • Cause: The model doesn’t fit in GPU memory with the selected TP size.
  • Fix: Ensure hardware.totalGpus is large enough for your model. The profiler calculates minimum TP from model size and VRAM, but edge cases (large context lengths, KV cache overhead) may require more GPUs than the minimum.
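
For a rough sense of the memory-fit calculation, the sketch below estimates a minimum TP size from parameter count and per-GPU VRAM. It is not the profiler's algorithm: the 0.9 usable-memory fraction, the power-of-two rounding, and the weights-only model are all assumptions made for illustration, and KV cache and activation memory are ignored, which is why real deployments can need more GPUs.

```python
import math


def estimate_min_tp(num_params, bytes_per_param, vram_bytes, usable_fraction=0.9):
    """Estimate the smallest power-of-two TP size whose combined VRAM holds the weights."""
    weight_bytes = num_params * bytes_per_param
    # Assumption: only a fraction of VRAM is available for weights.
    gpus = max(1, math.ceil(weight_bytes / (vram_bytes * usable_fraction)))
    # Assumption: TP sizes are rounded up to the next power of two.
    return 1 << (gpus - 1).bit_length()


# 70B parameters at 2 bytes each (about 140 GB) on 80 GB GPUs.
print(estimate_min_tp(70e9, 2, 80e9))
```

If the estimate is close to your hardware.totalGpus value, long context lengths or KV cache overhead may push the real requirement past it.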

GPU Auto-Detection Cap

The operator caps auto-detected GPU count at 32. If your cluster has more GPUs and you want the profiler to use them, set hardware.totalGpus explicitly:

spec:
  hardware:
    totalGpus: 64
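
The interaction between auto-detection and the explicit setting can be summarized in a few lines. This is a sketch of the behavior described above, with a hypothetical function name; it is not the operator's code.

```python
AUTO_DETECT_CAP = 32  # cap on auto-detected GPU count, as stated above


def effective_total_gpus(detected, explicit=None):
    """Return the GPU count in effect: the explicit value if set, else the capped detection."""
    if explicit is not None:
        return explicit
    return min(detected, AUTO_DETECT_CAP)


print(effective_total_gpus(64))               # capped at 32
print(effective_total_gpus(64, explicit=64))  # explicit value bypasses the cap
```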

Profiling Job Fails to Schedule

GPU nodes often have taints. Add tolerations via the overrides field:

spec:
  overrides:
    profilingJob:
      template:
        spec:
          containers: []  # required placeholder
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

DGDR Spec Is Immutable

Once the DGDR enters the Profiling phase, the spec cannot be changed. If you need to adjust settings, delete the DGDR and recreate it:

$ kubectl delete dgdr my-model -n $NAMESPACE
$ kubectl apply -f updated-dgdr.yaml -n $NAMESPACE

DGD Persists After DGDR Deletion

Deleting a DGDR does not delete the DGD it created. This is intentional — the DGD continues serving traffic independently. To clean up fully:

$ kubectl delete dgdr my-model -n $NAMESPACE
$ kubectl delete dgd my-model-dgd -n $NAMESPACE

Example Workflows

Small Dense Model (Quick Start)

A small model on a single node with rapid profiling — the simplest case:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-small
spec:
  model: Qwen/Qwen3-0.6B

Large Dense Model with SLA Targets

A 70B model with model caching, SLA targets, and the planner enabled:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-70b
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  backend: vllm
  searchStrategy: rapid
  autoApply: false
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    pvcModelPath: hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/<commit-hash>
  sla:
    ttft: 500
    itl: 50
  workload:
    isl: 4000
    osl: 1000
    requestRate: 10
  features:
    planner:
      enabled: true

MoE Model (DeepSeek-R1)

A large MoE model requiring multinode, SGLang backend, and thorough profiling:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  searchStrategy: thorough
  autoApply: false
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    pvcModelPath: hub/models--deepseek-ai--DeepSeek-R1/snapshots/<commit-hash>
  sla:
    ttft: 2000
    itl: 100
  hardware:
    totalGpus: 32
  features:
    planner:
      enabled: true
  overrides:
    profilingJob:
      template:
        spec:
          containers: []
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

Prerequisites for this deployment:

  • Grove and KAI Scheduler installed
  • RDMA configured for efficient KV cache transfer
  • Model cached on a shared PVC
  • Prometheus installed (for SLA-driven planner scaling)

Further Reading

  • DGDR Reference — Spec reference, lifecycle phases, monitoring commands
  • DGDR Examples — Ready-to-use YAML for various scenarios
  • Profiler Guide — Profiling algorithms, picking modes, gate checks
  • Planner Guide — Scaling modes, PlannerConfig reference
  • Model Caching — PVC setup, ModelExpress, and ModelStreamer
  • Creating Deployments — Manual DGD spec for hand-crafted configs
  • Multinode Deployments — Grove, LWS, and multinode details
  • Disaggregated Communication — NIXL, RDMA, and networking