## Choose Your Path
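- **Quickstart** -- You're here. Container fast path.
- **[Local Installation](/dynamo/getting-started/local-installation)** -- Full walkthrough: PyPI, configuration.
- **[Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart)** -- Kubernetes-native production path.
- **[Building from Source](/dynamo/getting-started/building-from-source)** -- For contributors against `main`.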
Dynamo is backend-agnostic and Kubernetes-native without being Kubernetes-only. Use this container path to try the same frontend/router/worker stack locally; use the Kubernetes path when you want the operator, CRDs, Gateway API integration, autoscaling, scheduling, and cluster lifecycle management.
## Pull a Container
Containers have all dependencies pre-installed. Pick your backend:
**SGLang**
```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.2
```
**TensorRT-LLM**
```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.2
```
**vLLM**
```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2
```
**Hugging Face token required for gated models.** Llama, Kimi, Qwen-VL, and other gated models require `HF_TOKEN` in your environment and accepting the model card's license on huggingface.co. Set `export HF_TOKEN=hf_…` before launching.
For container versions and tags, see [Release Artifacts](/dynamo/resources/release-artifacts#container-images).
## Start the Frontend
In your container, start the OpenAI-compatible frontend on port 8000:
```bash
python3 -m dynamo.frontend --discovery-backend file
```
`--discovery-backend file` avoids needing etcd. To run frontend and worker in the same terminal, background each command with `> logfile.log 2>&1 &`.
## Start a Worker
In another terminal, launch a worker for your backend:
**SGLang**
```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
```
**TensorRT-LLM**
```bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --discovery-backend file
```
**vLLM**
```bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
--kv-events-config '{"enable_kv_cache_events": false}'
```
## Verify and Test
Check the endpoint is up:
```bash
curl -sf http://localhost:8000/health && echo OK
```
If you see `OK`, send a chat completion:
```bash title="Request"
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50}'
```
```json title="Response"
{
"id": "chatcmpl-...",
"model": "Qwen/Qwen3-0.6B",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Hello! How can I help you today?"},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 9, "completion_tokens": 10, "total_tokens": 19}
}
```
Connection refused? The frontend takes a few seconds to start, so wait briefly and retry. For production liveness and readiness probes, see [Health Checks](/dynamo/user-guides/observability-local/health-checks).
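If you would rather script the retry than repeat the `curl` by hand, the following is a minimal sketch using only the Python standard library. The function names `frontend_is_healthy` and `wait_until_ready`, the attempt count, and the delay are illustrative assumptions, not part of Dynamo; the only thing taken from this guide is the `/health` endpoint on port 8000.

```python
import time
import urllib.error
import urllib.request


def frontend_is_healthy(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe=frontend_is_healthy, attempts=30, delay=1.0):
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the frontend time to finish starting
    return False
```

Calling `wait_until_ready()` before the first request blocks for at most about 30 seconds with these hypothetical defaults and returns `False` if the frontend never comes up.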
## From the Digest
- How Dynamo optimizes for agentic workloads at three layers: the frontend API, the router, and KV cache management.
- How Dynamo's concurrent global index evolved through six iterations to sustain over 100M ops/sec.
## Dive Deeper
Pick a full install path from the [four options above](#choose-your-path), or explore how Dynamo works under the hood:
- How the frontend, router, and workers fit together.
- Worker discovery, multi-model routing, OpenAI compatibility.
- How the router places requests for prefix reuse.
- Liveness and readiness probes for production deployments.
# Introduction to Dynamo
Dynamo is an open-source, high-throughput, low-latency inference framework,
designed to serve generative AI workloads in distributed environments. It is
Kubernetes-native for production deployments, with an operator, CRDs, Helm
charts, service discovery, Gateway API integration, and topology-aware
scheduling, while still supporting local containers, Python workers, and
standalone components for development or incremental adoption.
This page gives an overview of Dynamo's design principles, performance benefits, and production-grade features.
Looking to get started right away? See the [Quickstart](/dynamo/getting-started/quickstart) to install and run Dynamo in minutes.
## Why Dynamo?
Inference engines optimize the GPU; Dynamo optimizes the system around them.
- **System-level optimization on top of any engine** -- Inference engines optimize the single-GPU forward pass. Dynamo adds the distributed layer: disaggregated serving, smart routing, KV cache management across memory tiers, and auto-scaling.
- **Composable performance improvement techniques** -- Disaggregated serving, KV cache-aware routing, and KV cache offloading each improve performance on their own; used together, their gains compound.
- **Engine-agnostic** -- Works with vLLM, SGLang, and TensorRT-LLM. Swap engines without changing your serving infrastructure. Support for Intel XPU and AMD hardware is being extended.
- **Kubernetes-native production path** -- Dynamo exposes inference graphs as Kubernetes resources (`DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoGraphDeploymentRequest`) and reconciles them with an operator, while integrating with Kubernetes service discovery, Gateway API Inference Extension, scheduling, observability, and model loading workflows.
- **Production-ready at scale** -- Dynamo covers the full deployment lifecycle: automatic configuration (AIConfigurator), runtime auto-scaling (Planner), topology-aware gang scheduling (Grove), fault tolerance, and observability.
- **Modular adoption** -- Start with one component (e.g., just the Router for KV-aware routing on top of your existing engine). Adopt more as needed. Each component is independently installable via pip.
## Design Principles
### Strong Foundations for AI Inference
Dynamo adds system-level optimizations on top of inference engines. To provide such optimizations, Dynamo takes an operating systems approach by laying down the foundations for scheduling, memory management, and data transfer. These foundations allow Dynamo to evolve as new system-level performance techniques emerge.
One of the motivations for Dynamo's system-level design was to support disaggregated serving: running prefill and decode on different devices so each can be scaled and parallelized independently. Disaggregated serving required three capabilities: (1) scheduling to assign prefill and decode phases without interference, (2) memory management for KV cache offloading and onboarding, and (3) low-latency data transfer to move KV cache between nodes and across the memory hierarchy.

Dynamo's foundations first addressed disaggregated serving, then extended to EPD disaggregation for multimodal, and now support workloads such as diffusion, RL, and agents.
### Modular but Well-Integrated Ecosystem
Dynamo is designed to reduce the burden of replacing an existing stack in production. It offers modular, standalone components as Rust crates and pip wheels. For example, the three foundations of Dynamo for scheduling (Dynamo), memory management (KV Block Manager), and data transfer (NIXL) are each independently installable:
```bash
pip install ai-dynamo
pip install kvbm
pip install nixl
```
Pre-built containers with all dependencies are also available. See [Release Artifacts](/dynamo/resources/release-artifacts) for container images.
The Dynamo ecosystem includes these additional modular components, and will continue to grow over time:
| Category | Products | Description |
| :--- | :--- | :--- |
| **Scheduling** | Dynamo | Inference serving for GenAI workloads |
| **Routing** | Router | Smart routing leveraging KV cache hit rate and KV cache load. More algorithms will be added (e.g., agentic routing) |
| **Data Transfer** | [NIXL](https://github.com/ai-dynamo/nixl) | Point-to-point data transfer between GPUs and tiered storage (G1: GPU, G2: CPU, G3: SSD, G4: remote) |
| **Memory** | KVBM (KV Block Manager) | Manage KV cache across memory tiers (G1-G4) with customizable eviction policy |
| **Scaling / Cloud** | Planner | Automatically tune performance in real time for prefill and decode given SLA constraints (TTFT and TPOT) |
| | [Grove](https://github.com/ai-dynamo/grove) | Enables gang scheduling and topology awareness required for Kubernetes multi-node disaggregated serving |
| | [Model Express](https://github.com/ai-dynamo/model-express) | Load model weights fast by caching and transferring them via NIXL to other GPUs. Will also be leveraged for fault tolerance |
| **Perf** | [DynoSim](/dynamo/user-guides/dynosim) | Simulate Dynamo deployment choices with Mocker, workload-driven runs, sweeps, and AIC-backed timing models before validating on GPUs |
| | [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator) | Provides calibrated performance models and configuration search inputs for rapid DGDR profiling. Formerly known as LLMPet |
| | [AIPerf](https://github.com/ai-dynamo/aiperf) | Re-architected GenAI-Perf written in Python for maximum extensibility; supports distributed benchmarking |
| | AITune | Given a model or pipeline, searches for the best backend to deploy with (e.g., TensorRT or `torch.compile`) (coming soon) |
| | Flex Tensor | Streams weights from host memory to GPUs to run very large language models on GPUs with limited memory capacity (coming soon) |
These components are modular but are designed to work together as a unified family. New components will follow the same design principle.
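To picture how a block manager can spread KV cache across the G1-G4 tiers named in the table, here is a toy sketch. It is not KVBM's implementation or API: the `TieredBlockCache` class, its method names, the per-tier capacities, and the LRU demotion policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class TieredBlockCache:
    """Toy LRU cache over ordered tiers, fastest tier first."""

    def __init__(self, capacities):
        # capacities maps tier name -> max blocks, e.g. {"G1": 2, "G2": 4}.
        self.capacities = dict(capacities)
        self.order = list(self.capacities)
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level, block_id, data):
        if level >= len(self.order):
            return  # fell off the slowest tier: the block is evicted for good
        name = self.order[level]
        tier = self.tiers[name]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        if len(tier) > self.capacities[name]:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, victim_id, victim)  # offload one tier down

    def put(self, block_id, data):
        for tier in self.tiers.values():
            tier.pop(block_id, None)  # drop any stale copy first
        self._insert(0, block_id, data)

    def get(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                data = self.tiers[name].pop(block_id)
                self._insert(0, block_id, data)  # onboard back to the fastest tier
                return data
        return None

    def tier_of(self, block_id):
        for name in self.order:
            if block_id in self.tiers[name]:
                return name
        return None
```

In this sketch a block that overflows one tier is demoted to the next rather than discarded, and a hit in a slower tier promotes the block back to the fastest one. A real block manager also has to move the data itself between devices, which is the role the table assigns to NIXL.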
### Vendor-Agnostic Ecosystem Enablement
Dynamo is ***not designed for vendor lock-in***. Dynamo aims to enable the broader AI ecosystem and to provide the functionality developers need, such as integrations with third-party components.
From the beginning, Dynamo has been designed to support multiple LLM inference engines (vLLM, SGLang, and TensorRT-LLM). Support for additional engines is planned to enable more developer use cases.
**Support for non-NVIDIA hardware** is also being extended: Dynamo is working with hardware vendors such as Intel and AMD.
The full list of supported ecosystem components:
| **Product Areas** | **Supported Ecosystem Components** |
| :--- | :--- |
| Inference engines | SGLang, TensorRT-LLM, vLLM |
| Kubernetes | Inference gateway |
| Memory management | Dynamo KV Block Manager, [LMCache](/dynamo/integrations/kv-cache-integrations/lm-cache), [SGLang HiCache](/dynamo/integrations/kv-cache-integrations/hi-cache), [FlexKV](/dynamo/integrations/kv-cache-integrations/flex-kv) |
| Networking and storage | Mooncake, DOCA NetIO, GDS, POSIX, S3, 3FS ([supported via NIXL](/dynamo/design-docs/component-design/kvbm-design)) |
| Multi-HW | Intel XPU, AMD |
## Deployment Posture
Dynamo's production path is Kubernetes-native, not Kubernetes-only. The same
core runtime concepts can be used from a local process, a container, or a
Kubernetes cluster:
| Path | Use when | What Dynamo provides |
|---|---|---|
| Local or container | You are evaluating, developing, or adopting one component at a time. | OpenAI-compatible frontend, router, workers, file or etcd discovery, Python/Rust APIs, and installable packages. |
| Kubernetes | You are deploying shared GPU capacity, multi-node serving, autoscaling, or platform-integrated inference. | Helm install, Dynamo operator, DGD/DCD/DGDR CRDs, Kubernetes-native discovery, Gateway API Inference Extension, Grove/LWS scheduling, ModelExpress, observability, and lifecycle management. |
## Request Routing Modes
Dynamo supports two request-routing modes. Both expose the same
OpenAI-compatible API and the same backends; they differ in *where* request
routing happens.
- **Standalone mode** (default) -- The Dynamo Frontend serves HTTP requests directly, and the integrated Dynamo Router makes KV-aware routing decisions before dispatching to workers. No external gateway is required. This is the mode used by all local installs and the default Kubernetes deployment. Request flow: `client -> Frontend -> Router -> workers`.
- **Gateway mode (GAIE)** -- Dynamo runs behind a Kubernetes [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) gateway. KV-aware routing is performed at the gateway layer by the Dynamo Endpoint Picker Plugin (EPP); the Frontend runs as a sidecar in `--router-mode direct` and forwards requests to the worker the EPP selected. Use this mode when your platform standardizes on the Inference Gateway, or when you want gateway-level policy (auth, rate limiting, observability) co-located with KV-aware routing. Request flow: `client -> Inference Gateway -> EPP (KV-aware) -> Frontend sidecar (direct) -> workers`.
Both modes support disaggregated serving, multimodal, and the same set of backends (vLLM, SGLang, TensorRT-LLM). For full setup, supported features, and configuration of gateway mode, see the [Inference Gateway (GAIE) guide](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie).
## Performance
Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache-Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
- [KV cache-aware routing](/dynamo/design-docs/component-design/router-design) -- Routes each request based on worker load and existing KV cache hits. Reusing precomputed KV blocks skips prefill compute for the cached prefix, so decode can start sooner. [Baseten](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo) applied Dynamo KV cache-aware routing and saw 2x faster TTFT and 1.6x throughput on Qwen3 Coder 480B A35B.
- [KV cache offloading](/dynamo/design-docs/component-design/kvbm-design) -- Extends the available context window by moving KV cache from HBM to cheaper storage tiers such as host memory, local disk, or remote storage. Reusing precomputed state improves TTFT, reduces Total Cost of Ownership (TCO), and allows longer contexts to be processed.
- [Disaggregated serving](/dynamo/design-docs/disaggregated-serving) -- Runs prefill and decode on separate devices, as introduced under Design Principles. [InferenceX](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs) showcases its performance: DeepSeek V3 can be served with ~7x throughput per GPU using disaggregated serving and large-scale expert parallelism.
Furthermore, when these three techniques are composed together, they yield compounding benefits as shown in the following diagram.

- **Disaggregated Serving + KV Cache-Aware Routing** -- KV cache-aware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
- **Disaggregated Serving + KV Cache Offloading** -- KV cache offloading lowers TTFT, so fewer prefill workers are needed, which reduces TCO.
- **KV Cache-Aware Routing + KV Cache Offloading** -- Offloading increases the total addressable cache size, which raises the KV cache hit rate and in turn lowers TTFT.
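The trade-off behind KV cache-aware routing (prefer the worker that already holds the prompt prefix, unless it is too busy) can be sketched in a few lines. This is an illustration under stated assumptions, not the Dynamo Router's algorithm: the `Worker` class, the block size, and the cost function (blocks left to prefill plus a weighted count of active requests) are hypothetical.

```python
from dataclasses import dataclass, field


def split_blocks(tokens, block_size):
    """Cut a token list into full fixed-size blocks; a partial tail is ignored."""
    usable = len(tokens) - len(tokens) % block_size
    return [tuple(tokens[i:i + block_size]) for i in range(0, usable, block_size)]


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: set = field(default_factory=set)

    def remember(self, tokens, block_size=4):
        """Record every block-aligned prefix of `tokens` as cached."""
        blocks = split_blocks(tokens, block_size)
        for i in range(1, len(blocks) + 1):
            self.cached_prefixes.add(tuple(blocks[:i]))


def cached_prefix_blocks(worker, blocks):
    """Count leading blocks whose whole prefix is cached on the worker."""
    hits = 0
    for i in range(len(blocks)):
        if tuple(blocks[: i + 1]) not in worker.cached_prefixes:
            break
        hits += 1
    return hits


def pick_worker(workers, tokens, block_size=4, load_weight=1.0):
    """Choose the worker with the lowest (prefill work + load) cost."""
    blocks = split_blocks(tokens, block_size)

    def cost(worker):
        to_prefill = len(blocks) - cached_prefix_blocks(worker, blocks)
        return to_prefill + load_weight * worker.active_requests

    return min(workers, key=cost)
```

With a larger addressable cache, as offloading provides, more prefixes stay resident and the `to_prefill` term shrinks for more requests. That is the interaction the last bullet describes.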
Ready to try these techniques? See [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) for step-by-step deployment examples that compose disaggregated serving, routing, and offloading.
## From Configuration to Production-Grade Deployment
### Finding Best Configurations Under 30 Seconds with AIConfigurator
Manually finding the optimal parallelism for disaggregated serving can take days of exhaustive configuration sweeps—a challenge that only intensifies at scale.
Dynamo uses AIC-backed DynoSim-style modeling to identify strong configurations in under 30 seconds and to project the performance gains over standard aggregated serving. This logic is built into the Dynamo Graph Deployment Request (DGDR) Kubernetes custom resource, so users can deploy with automatically generated, optimized configurations.
### Auto-Adjusting Deployment Based on SLA with Planner
Once an offline configuration has been found with AIConfigurator or DGDR, developers can deploy the model to production. Production traffic, however, can vary widely over time, and a static configuration determined offline cannot adequately absorb traffic spikes.
Dynamo offers [Planner](/dynamo/design-docs/component-design/planner-design) to address this problem. Developers set their SLA in terms of Time To First Token (TTFT) and Time Per Output Token (TPOT). Planner monitors online traffic and automatically scales prefill and decode workers to handle traffic spikes while maintaining the specified SLA.
Recently, Planner was expanded to deal with even more sophisticated scenarios such as drastically varying Input Sequence Length (ISL) given the same SLA. See the [Planner documentation](/dynamo/components/planner/planner-guide) for more details.
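The idea of scaling prefill and decode workers separately against an SLA can be illustrated with a deliberately simple rule. This sketch is not Planner's algorithm or API: the `Sla` class, the one-step scaling rule, the replica limits, and the 50% headroom factor are assumptions. It relies only on the general observation that TTFT is dominated by prefill and TPOT by decode.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    ttft_ms: float  # target Time To First Token
    tpot_ms: float  # target Time Per Output Token


def next_replicas(current, observed_ms, target_ms,
                  min_replicas=1, max_replicas=8, headroom=0.5):
    """Add a replica when the target is missed, remove one when far below it."""
    if observed_ms > target_ms:
        return min(current + 1, max_replicas)
    if observed_ms < headroom * target_ms:
        return max(current - 1, min_replicas)
    return current


def plan(sla, observed_ttft_ms, observed_tpot_ms, prefill, decode):
    """Return the (prefill, decode) replica counts for the next interval."""
    return (
        next_replicas(prefill, observed_ttft_ms, sla.ttft_ms),  # TTFT -> prefill
        next_replicas(decode, observed_tpot_ms, sla.tpot_ms),   # TPOT -> decode
    )
```

For example, with a hypothetical SLA of 200 ms TTFT and 20 ms TPOT, an observed TTFT of 350 ms adds a prefill worker, while a TPOT of 15 ms leaves the decode pool unchanged.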
### Applying Topology-Aware Hierarchical Gang Scheduling with Grove
When Planner decides to autoscale, developers need a way to effectively scale workers independently and hierarchically. Especially for prefill/decode disaggregation, prefill and decode workers need to be scaled independently to meet the specified SLA, and they need to be scheduled in physical proximity to each other for best performance.
Dynamo offers [Grove](https://github.com/ai-dynamo/grove), a Kubernetes operator that provides a single declarative API for orchestrating AI inference workloads, from simple single-pod deployments to complex multi-node, disaggregated systems.
Grove enables:
- Hierarchical gang scheduling
- Topology-aware placement
- Multi-level horizontal autoscaling
- Explicit startup ordering
- Rolling updates with configurable replacement strategies
These features are crucial for deploying and scaling inference at data center scale for optimal performance.
### Ensuring Fault Tolerance for LLMs
Kubernetes provides general-purpose fault tolerance, but LLM deployments need specialized fault tolerance and resiliency. Dynamo provides fault tolerance mechanisms across multiple layers to keep LLM inference reliable in production deployments:
- **Router and Frontend** -- Dynamo supports running multiple frontend and router replicas that share router state for improved fault tolerance.
- **Request Migration** -- When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers while preserving partial generation state and maintaining seamless token flow to clients.
- **Request Cancellation** -- Dynamo supports canceling in-flight requests through the AsyncEngineContext trait, which provides graceful stop signals and hierarchical cancellation propagation through request chains.
- **Request Rejection (Load Shedding)** -- When workers are overloaded, Dynamo rejects new requests with HTTP 503 responses based on configurable thresholds for KV cache utilization and prefill tokens.
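The load-shedding behavior in the last bullet can be sketched as a threshold check. This illustrates the idea only: the `WorkerLoad` fields, the threshold names, and their default values are hypothetical and do not reflect Dynamo's actual configuration keys.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    queued_prefill_tokens: int   # prefill tokens waiting on this worker


def is_overloaded(load, max_kv_utilization=0.9, max_prefill_tokens=8192):
    """A worker is overloaded if either threshold is exceeded."""
    return (
        load.kv_cache_utilization > max_kv_utilization
        or load.queued_prefill_tokens > max_prefill_tokens
    )


def admit(worker_loads, **thresholds):
    """Return (200, worker_index) if some worker has headroom, else (503, None)."""
    for index, load in enumerate(worker_loads):
        if not is_overloaded(load, **thresholds):
            return 200, index
    return 503, None
```

Rejecting early with a 503 lets clients back off or retry elsewhere, rather than queuing on workers that cannot make progress.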
### Observability
Dynamo provides built-in metrics, distributed tracing, and logging for monitoring inference deployments. See the [Observability Guide](/dynamo/user-guides/observability-local) for setup details.
## What's Next?
Explore the following resources to go deeper:
- [Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) -- Compose disaggregated serving, routing, and offloading
- [KV Cache-Aware Routing](/dynamo/user-guides/kv-cache-aware-routing) -- Configure smart request routing
- [KV Cache Offloading](/dynamo/user-guides/kv-cache-offloading) -- Set up multi-tier memory management
- [Planner](/dynamo/components/planner/planner-guide) -- Configure SLA-based autoscaling
- [Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) -- Deploy at scale with Grove
- [Inference Gateway (GAIE)](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) -- Run Dynamo in gateway mode behind the K8s Inference Gateway
- [Overall Architecture](/dynamo/design-docs/overall-architecture) -- Full technical design
- [Support Matrix](/dynamo/resources/support-matrix) -- Check hardware and engine compatibility
**Further reading:** [Dynamo Digest](../digest/index.mdx).
> Install and run Dynamo on a local machine or VM with containers or PyPI
# Local Installation
This guide walks through installing and running Dynamo on a local machine or VM with one or more GPUs. By the end, you'll have a working OpenAI-compatible endpoint serving a model.
For production multi-node clusters, see the [Kubernetes Deployment Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart). To build from source for development, see [Building from Source](/dynamo/getting-started/building-from-source).
## System Requirements
| Requirement | Supported |
|---|---|
| **GPU** | NVIDIA Ampere, Ada Lovelace, Hopper, Blackwell |
| **OS** | Ubuntu 22.04, Ubuntu 24.04 |
| **Architecture** | x86_64, ARM64 (ARM64 requires Ubuntu 24.04) |
| **CUDA** | 12.9+ or 13.0+ (B300/GB300 require CUDA 13) |
| **Python** | 3.10, 3.12 |
| **Driver** | 575.51.03+ (CUDA 12) or 580.00.03+ (CUDA 13) |
TensorRT-LLM does not support Python 3.11.
For the full compatibility matrix including backend framework versions, see the [Support Matrix](/dynamo/resources/support-matrix).
## Install Dynamo
### Option A: Containers (Recommended)
Containers have all dependencies pre-installed. No setup required.
```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0
# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0
# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
```
To run frontend and worker in the same container, either:
- Run processes in background with `&` (see Run Dynamo section below), or
- Open a second terminal and use `docker exec -it <container_id> bash` (find the ID with `docker ps`)
See [Release Artifacts](/dynamo/resources/release-artifacts#container-images) for available
versions and backend guides for run instructions: [SGLang](/dynamo/backends/sg-lang) |
[TensorRT-LLM](/dynamo/backends/tensor-rt-llm) | [vLLM](/dynamo/backends/v-llm)
### Option B: Install from PyPI
```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```
Install system dependencies and the Dynamo wheel for your chosen backend:
**SGLang**
```bash
sudo apt install python3-dev
uv pip install --prerelease=allow "ai-dynamo[sglang]"
```
For CUDA 13 (B300/GB300), the container is recommended. See
[SGLang install docs](https://docs.sglang.io/get_started/install.html) for details.
**TensorRT-LLM**
```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```
TensorRT-LLM requires `pip` due to a transitive Git URL dependency that
`uv` doesn't resolve. We recommend using the TensorRT-LLM container for
broader compatibility. See the [TRT-LLM backend guide](/dynamo/backends/tensor-rt-llm)
for details.
**vLLM**
```bash
sudo apt install python3-dev libxcb1
uv pip install --prerelease=allow "ai-dynamo[vllm]"
```
## Run Dynamo
### Discovery Backend
Dynamo components discover each other through a shared backend. Two options are available:
| Backend | When to Use | Setup |
|---|---|---|
| **File** | Single machine, local development | No setup -- pass `--discovery-backend file` to all components. The event plane automatically defaults to ZMQ (no NATS required). |
| **etcd** | Multi-node, production | Requires a running etcd instance (default if no flag is specified). The event plane defaults to NATS. |
This guide uses `--discovery-backend file`. For etcd setup, see [Service Discovery](/dynamo/kubernetes-deployment/advanced-platform/service-discovery).
### Verify Installation (Optional)
Verify the CLI is installed and callable:
```bash
python3 -m dynamo.frontend --help
```
If you cloned the repository, you can run additional system checks:
```bash
python3 dev/sanity_check.py
```
### Start the Frontend
```bash
# Start the OpenAI compatible frontend (default port is 8000)
python3 -m dynamo.frontend --discovery-backend file
```
To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &`
to run processes in background:
```bash
python3 -m dynamo.frontend --discovery-backend file > dynamo.frontend.log 2>&1 &
```
### Start a Worker
In another terminal (or same terminal if using background mode), start a worker for your chosen backend:
**SGLang**
```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
```
**TensorRT-LLM**
```bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --discovery-backend file
```
The warning `Cannot connect to ModelExpress server/transport error. Using direct download.`
is expected in this local single-machine setup (no ModelExpress server running) and can
be safely ignored. In a Kubernetes deployment where `MODEL_EXPRESS_URL` is configured,
this warning -- or the related `Failed to resolve local model path after server download`
-- indicates that ModelExpress is configured but is not actually serving cached models;
see [Model Caching in Kubernetes](/dynamo/kubernetes-deployment/model-loading/model-caching#option-2-modelexpress-p2p-distribution)
for the correct configuration.
**vLLM**
```bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
--kv-events-config '{"enable_kv_cache_events": false}'
```
### KV Events Configuration
For dependency-free local development, disable KV event publishing (avoids NATS):
- **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
- **SGLang:** No flag needed (KV events disabled by default)
- **TensorRT-LLM:** No flag needed (KV events disabled by default)
KV events are disabled by default for all backends, so the vLLM flag above only makes the default explicit. For vLLM and SGLang, pass a backend-specific `--kv-events-config` only when you want KV event publishing enabled. For TensorRT-LLM, enable event publishing with `--publish-events-and-metrics`.
## Test Your Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 50}'
```
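The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the frontend is listening on `localhost:8000` as above and that the response follows the OpenAI chat completion shape. The helper names `build_chat_request`, `first_reply`, and `chat` are illustrative, not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B", max_tokens=50,
                       base_url="http://localhost:8000"):
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a chat completion response dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the prompt to the running frontend and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return first_reply(json.load(resp))
```

With the frontend and a worker running, `print(chat("Hello!"))` prints the model's reply.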
## Troubleshooting
**CUDA/driver version mismatch**
Run `nvidia-smi` to check your driver version. Dynamo requires driver 575.51.03+ for CUDA 12 or 580.00.03+ for CUDA 13. B300/GB300 GPUs require CUDA 13. See the [Support Matrix](/dynamo/resources/support-matrix) for full requirements.
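To automate this comparison, the sketch below parses a driver version string and checks it against the minimums quoted above. The helper names are illustrative. The `nvidia-smi --query-gpu=driver_version --format=csv,noheader` call is the standard way to print only the driver version, but treat the wrapper as an assumption about your environment.

```python
import subprocess

# Minimum driver versions quoted in this guide, keyed by CUDA major version.
MIN_DRIVER = {12: (575, 51, 3), 13: (580, 0, 3)}


def parse_version(text):
    """Turn '575.51.03' into (575, 51, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(driver_version, cuda_major):
    """Return True if the driver meets the minimum for the CUDA major version."""
    return parse_version(driver_version) >= MIN_DRIVER[cuda_major]


def installed_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a machine with an NVIDIA driver installed, `driver_ok(installed_driver_version(), 12)` reports whether the CUDA 12 requirement is met.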
**Model doesn't fit on GPU (OOM)**
The default model `Qwen/Qwen3-0.6B` requires ~2GB of GPU memory. Larger models need more VRAM:
| Model Size | Approximate VRAM |
|---|---|
| 7B | 14-16 GB |
| 13B | 26-28 GB |
| 70B | 140+ GB (multi-GPU) |
Start with a small model and scale up based on your hardware.
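The lower bounds in the table are consistent with a simple rule of thumb: weights stored at 16-bit precision take about 2 bytes per parameter. The sketch below applies that rule. It is an approximation that covers weights only; KV cache, activations, and runtime overhead come on top, and the function name is illustrative.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 parameters) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


for size in (0.6, 7, 13, 70):
    print(f"{size}B parameters: ~{weight_memory_gb(size):g} GB of weights at 16-bit precision")
```

Passing `bytes_per_param=1` gives the corresponding estimate for 8-bit weights.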
**Python 3.11 with TensorRT-LLM**
TensorRT-LLM does not support Python 3.11. If you see installation failures with TensorRT-LLM, check your Python version with `python3 --version`. Use Python 3.10 or 3.12 instead.
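A setup script can run the same check up front. This is a minimal sketch; the supported set is taken from the System Requirements table in this guide, and the function name is illustrative.

```python
import sys

# Python versions listed in the System Requirements table of this guide.
SUPPORTED_PYTHON = {(3, 10), (3, 12)}


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor version is supported."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON
```

Calling `python_supported()` with no arguments checks the interpreter that is running the script.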
**Container runs but GPU not detected**
Ensure you passed `--gpus all` to `docker run`. Without this flag, the container won't have access to GPUs:
```bash
# Correct
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0
# Wrong -- no GPU access
docker run --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0
```
## Next Steps
- [Backend Guides](/dynamo/backends/sg-lang) -- Backend-specific configuration and features
- [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving) -- Scale prefill and decode independently
- [KV Cache Aware Routing](/dynamo/user-guides/kv-cache-aware-routing) -- Smart request routing
- [Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) -- Production multi-node deployments
> Build Dynamo from source for development and contributions
# Building from Source
Build Dynamo from source when you want to contribute code, test features on the development branch, or customize the build. If you just want to run Dynamo, the [Local Installation](/dynamo/getting-started/local-installation) guide is faster.
This guide covers Ubuntu and macOS. For a containerized dev environment that handles all of this automatically, see [DevContainer](#devcontainer).
## 1. Install System Libraries
**Ubuntu:**
```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
**macOS:**
```bash
# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake protobuf
# Verify Metal is accessible
xcrun -sdk macosx metal
```
## 2. Install Rust
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```
## 3. Create a Python Virtual Environment
Install [uv](https://docs.astral.sh/uv/#installation) if you don't have it:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Create and activate a virtual environment:
```bash
uv venv .venv
source .venv/bin/activate
```
## 4. Install Build Tools
```bash
uv pip install pip maturin
```
[Maturin](https://github.com/PyO3/maturin) is the Rust-Python bindings build tool.
## 5. Build the Rust Bindings
```bash
cd lib/bindings/python
maturin develop --uv
```
## 6. Install GPU Memory Service
```bash
# Return to project root
cd "$(git rev-parse --show-toplevel)"
uv pip install -e lib/gpu_memory_service
```
## 7. Install the Wheel
```bash
uv pip install -e .
```
## 8. Verify the Build
```bash
python3 -m dynamo.frontend --help
```
You should see the frontend command help output.
## DevContainer
VSCode and Cursor users can skip manual setup using pre-configured development containers. The DevContainer installs all toolchains, builds the project, and sets up the Python environment automatically.
Framework-specific containers are available for vLLM, SGLang, and TensorRT-LLM. See the [DevContainer README](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/.devcontainer) for setup instructions.
## Set Up Pre-commit Hooks
Before submitting PRs, install the pre-commit hooks to ensure your code passes CI checks:
```bash
uv pip install pre-commit
pre-commit install
```
Run checks manually on all files:
```bash
pre-commit run --all-files
```
## Troubleshooting
**Missing system packages**
If `maturin develop` fails with linker errors, verify all system dependencies are installed. On Ubuntu:
```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
**Virtual environment not activated**
Maturin builds against the active Python interpreter. If you see errors about Python or site-packages, ensure your virtual environment is activated:
```bash
source .venv/bin/activate
```
**Disk space**
The Rust `target/` directory can grow to 10+ GB during development. If builds fail with disk space errors, clean the build cache:
```bash
cargo clean
```
## Next Steps
- [Contribution Guide](/dynamo/getting-started/contribution-guide) -- Workflow for contributing code
- [Examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples) -- Explore the codebase
- [Good First Issues](https://github.com/ai-dynamo/dynamo/labels/good-first-issue) -- Find a task to work on
# Kubernetes Deployment
Use the Kubernetes guides when you are ready to move beyond a local Dynamo
process and deploy on a GPU cluster. Dynamo's Kubernetes path is native to the
platform: inference graphs are expressed as Dynamo CRDs, reconciled by the
Dynamo operator, installed with Helm, and integrated with Kubernetes service
discovery, Gateway API Inference Extension, scheduling, observability, and
model-loading workflows.
This does not make Kubernetes the only way to use Dynamo. Local containers,
PyPI installs, and standalone components remain the right path for evaluation,
development, and incremental adoption.
Start with the [Kubernetes Quickstart](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) to run one model end to end. Then use the rest of the Kubernetes Deployment section based on what you need next:
| Goal | Guide |
|---|---|
| Install the operator and prerequisites | [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) |
| Deploy and manage models | [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide) |
| Load models faster across pods | [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) and [ModelExpress](/dynamo/kubernetes-deployment/model-loading/model-express) |
| Operate a cluster deployment | [Autoscaling](/dynamo/kubernetes-deployment/operate/autoscaling), [Rolling Update](/dynamo/kubernetes-deployment/operate/rolling-update), [Disagg Communication](/dynamo/kubernetes-deployment/operate/disagg-communication), and [Observability Metrics](/dynamo/kubernetes-deployment/operate/observability/metrics) |
| Scale disaggregated serving | [Multinode Deployments](/dynamo/kubernetes-deployment/scale/multinode-deployments), [Grove](/dynamo/kubernetes-deployment/scale/grove), and [Topology Aware Scheduling](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) |
| Integrate with Kubernetes serving APIs | [Gateway API Inference Extension (GAIE)](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) and [LWS](/dynamo/integrations/kubernetes-integrations/lws) |
If you are still evaluating Dynamo locally, start with the [Quickstart](/dynamo/getting-started/quickstart) and [Local Installation](/dynamo/getting-started/local-installation) first.
# Contribution Guide
Dynamo is an open-source distributed inference platform, built by a growing community of contributors. The project is licensed under [Apache 2.0](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/LICENSE) and welcomes contributions of all sizes -- from typo fixes to major features. Community contributions have shaped core areas of Dynamo including backend integrations, documentation, deployment tooling, and performance improvements.
With 200+ external contributors, 220+ merged community PRs, and new contributors joining every month, Dynamo is one of the fastest-growing open-source inference projects. Check out our [commit activity](https://github.com/ai-dynamo/dynamo/graphs/commit-activity) and [GitHub stars](https://github.com/ai-dynamo/dynamo/stargazers). This guide will help you get started.
Join the community:
- [CNCF Slack (`#ai-dynamo`)](https://communityinviter.com/apps/cloud-native/cncf) -- join CNCF Slack and find us in `#ai-dynamo`
- [Discord](https://discord.gg/nvidia-dynamo)
- [GitHub Discussions](https://github.com/ai-dynamo/dynamo/discussions)
## TL;DR
For experienced contributors:
1. Fork and clone the repo
2. For changes ≥100 lines or new features, [open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first
3. Create a branch: `git checkout -b yourname/fix-router-timeout`
4. Make changes, run `pre-commit run`
5. Commit with DCO sign-off: `git commit -s -m "fix: description"`
6. Open a PR targeting `main`
---
## Ways to Contribute
### Report a Bug
Found something broken? [Open a bug report](https://github.com/ai-dynamo/dynamo/issues/new?template=bug_report.yml) with:
- Steps to reproduce
- Expected vs. actual behavior
- Environment details (OS, GPU, Python version, Dynamo version)
### Improve Documentation
Documentation improvements are always welcome:
- Fixing typos or unclear explanations
- Adding examples or tutorials
- Improving API documentation
Small doc fixes can be submitted directly as PRs without an issue.
### Propose a Feature
Have an idea? [Open a feature request](https://github.com/ai-dynamo/dynamo/issues/new?template=feature_request.yml) to discuss it with maintainers before implementation.
### Contribute Code
Ready to write code? See the [Contribution Workflow](#contribution-workflow) section below.
### Help the Community
Not all contributions are code. You can also:
- Answer questions on [Discord](https://discord.gg/nvidia-dynamo) or in the `#ai-dynamo` channel on [CNCF Slack](https://communityinviter.com/apps/cloud-native/cncf)
- Review pull requests
- Share how you're using Dynamo -- blog posts, talks, or social media
- Star the [repository](https://github.com/ai-dynamo/dynamo)
---
## Getting Started
### Find an Issue
Browse [open issues](https://github.com/ai-dynamo/dynamo/issues) or look for:
| Issue Type | Description |
|------------|-------------|
| [Good First Issues](https://github.com/ai-dynamo/dynamo/labels/good-first-issue) | Beginner-friendly, with guidance |
| [Help Wanted](https://github.com/ai-dynamo/dynamo/labels/help-wanted) | Community contributions welcome |
### Fork and Clone
1. [Fork the repository](https://github.com/ai-dynamo/dynamo/fork) on GitHub
2. Clone your fork:
```bash
git clone https://github.com/YOUR-USERNAME/dynamo.git
cd dynamo
git remote add upstream https://github.com/ai-dynamo/dynamo.git
```
### Building from Source
Full build instructions are included below. Follow the numbered steps to set up your local development environment.
#### 1. Install System Libraries
**Ubuntu:**
```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
**macOS:**
```bash
# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake protobuf
# Verify Metal is accessible
xcrun -sdk macosx metal
```
#### 2. Install Rust
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
#### 3. Create a Python Virtual Environment
Install [uv](https://docs.astral.sh/uv/#installation) if you don't have it:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Create and activate a virtual environment:
```bash
uv venv .venv
source .venv/bin/activate
```
#### 4. Install Build Tools
```bash
uv pip install pip maturin
```
[Maturin](https://github.com/PyO3/maturin) is the Rust-Python bindings build tool.
#### 5. Build the Rust Bindings
```bash
cd lib/bindings/python
maturin develop --uv
```
#### 6. Install GPU Memory Service
```bash
# Return to project root
cd "$(git rev-parse --show-toplevel)"
uv pip install -e lib/gpu_memory_service
```
#### 7. Install the Wheel
```bash
uv pip install -e .
```
#### 8. Verify the Build
```bash
python3 -m dynamo.frontend --help
```
VS Code and Cursor users can use the [`.devcontainer`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/.devcontainer) folder for a pre-configured development environment. See the [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/.devcontainer/README.md) for details.
### Set Up Pre-commit Hooks
```bash
uv pip install pre-commit
pre-commit install
```
You're all set up! Get curious -- explore the codebase, experiment with the [examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples), and see how the pieces fit together. When you're ready, pick an issue from the [Good First Issues](https://github.com/ai-dynamo/dynamo/labels/good-first-issue) board or read on for the full contribution workflow.
---
## Contribution Workflow
The contribution process depends on the size and scope of your change. Even when not required, opening an issue is a great way to start a conversation with Dynamo maintainers before investing time in a PR.
| Size | Lines Changed | Example | What You Need |
|------|---------------|---------|---------------|
| **XS** | 1–9 | Typo fix, config tweak | Submit a PR directly |
| **S** | 10–99 | Small bug fix, doc improvement, focused refactor | Submit a PR directly |
| **M** | 100–199 | Feature addition, moderate refactor | [Open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first |
| **L** | 200–499 | Multi-file feature, new component | [Open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first |
| **XL** | 500–999 | Major feature, cross-component change | [Open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first |
| **XXL** | 1000+ | Architecture change | Requires a [DEP](https://github.com/ai-dynamo/enhancements) |
**Small changes (under 100 lines):** Submit a PR directly -- no issue needed. This includes typos, simple bug fixes, and formatting. If your PR addresses an existing approved issue, link it with "Fixes #123".
**Larger changes (≥100 lines):** [Open a Contribution Request](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) issue first and wait for the `approved-for-pr` label before submitting a PR.
**Architecture changes:** Changes that affect multiple components, introduce or modify public APIs, alter communication plane architecture, or affect backend integration contracts require a [Dynamo Enhancement Proposal (DEP)](https://github.com/ai-dynamo/enhancements). Open a DEP in the [`ai-dynamo/enhancements`](https://github.com/ai-dynamo/enhancements) repo before starting implementation.
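The size tiers can be sketched as a small lookup. This is an illustration, not project tooling: `change_size` and `needs_issue_first` are hypothetical names, and the code assumes each tier's lower bound is inclusive, as "≥100 lines" and "1000+" suggest.

```bash
# Map a changed-line count to a size tier.
# Assumption: lower bounds are inclusive (100 lines is M, 1000 lines is XXL).
change_size() {
  local lines=$1
  if   [ "$lines" -ge 1000 ]; then echo XXL
  elif [ "$lines" -ge 500 ];  then echo XL
  elif [ "$lines" -ge 200 ];  then echo L
  elif [ "$lines" -ge 100 ];  then echo M
  elif [ "$lines" -ge 10 ];   then echo S
  else echo XS
  fi
}

# Succeed when the change is large enough to need an approved issue
# (or, at XXL, a DEP) before a PR is opened.
needs_issue_first() {
  [ "$1" -ge 100 ]
}
```

One way to estimate the line count for your branch is `git diff --shortstat upstream/main...HEAD`; how maintainers count lines is not specified here, so treat the result as a rough guide.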
### Submitting a Pull Request
1. **Create a GitHub Issue** (if required) — [Open a Contribution Request](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) and describe what you're solving, your proposed approach, estimated PR size, and files affected.
2. **Get Approval** — Wait for maintainers to review and apply the `approved-for-pr` label.
3. **Submit a Pull Request** — [Open a PR](https://github.com/ai-dynamo/dynamo/compare) that references the issue using GitHub keywords (e.g., "Fixes #123").
4. **Address Code Rabbit Review** — Respond to automated Code Rabbit suggestions, including nitpicks.
5. **Trigger CI Tests** — For external contributors, a maintainer must comment `/ok to test COMMIT-ID` to run the full CI suite, where `COMMIT-ID` is the short SHA of your latest commit. Fix any failing tests before requesting human review.
6. **Request Review** — Add the person who approved your issue as a reviewer. Check [CODEOWNERS](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/CODEOWNERS) for required approvers based on files modified.
**AI-Generated Code:** While we encourage using AI tools, you must fully understand every change in your PR. Inability to explain submitted code will result in rejection.
### Branch Naming
Use a descriptive branch name that identifies you and the change:
```text
yourname/fix-description
```
Examples:
```text
jsmith/fix-router-timeout
jsmith/add-lora-support
```
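If you want to check a name before pushing, the pattern can be sketched as a regular expression. The character set below (lowercase letters, digits, hyphens) is an assumption inferred from the examples, not a rule the project states, and `is_valid_branch_name` is a hypothetical helper.

```bash
# Succeed when a branch name has the form "<name>/<description>".
# Assumption: both parts use lowercase letters, digits, and hyphens only.
is_valid_branch_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9][a-z0-9-]*/[a-z0-9][a-z0-9-]*$'
}

# Usage inside a clone (not run here):
#   is_valid_branch_name "$(git rev-parse --abbrev-ref HEAD)" || echo "rename branch"
```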
---
## Code Style & Quality
Maintainers assess contribution quality based on code style, test coverage, architecture alignment, and review responsiveness. Consistent, high-quality contributions are the foundation for building trust in the project.
### Pre-commit Hooks
All PRs are checked against [pre-commit hooks](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/.pre-commit-config.yaml). After [installing pre-commit](#set-up-pre-commit-hooks), run checks locally:
```bash
pre-commit run --all-files
```
### Commit Message Conventions
Use [conventional commit](https://www.conventionalcommits.org/) prefixes:
| Prefix | Use For |
|--------|---------|
| `feat:` | New features |
| `fix:` | Bug fixes |
| `docs:` | Documentation changes |
| `refactor:` | Code refactoring (no behavior change) |
| `test:` | Adding or updating tests |
| `chore:` | Maintenance, dependency updates |
| `ci:` | CI/CD changes |
| `perf:` | Performance improvements |
Examples:
```text
feat(router): add weighted load balancing
fix(frontend): resolve streaming timeout on large responses
docs: update quickstart for macOS users
test(planner): add unit tests for scaling policy
```
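A commit subject can be checked against these prefixes with a short regular expression. This is a minimal sketch: `is_conventional_subject` is a hypothetical helper, the allowed scope characters are an assumption, and the project's own CI may apply different or stricter rules.

```bash
# Succeed when a commit subject starts with one of the listed prefixes,
# optionally followed by a scope in parentheses, e.g. "fix(frontend): ...".
# Assumption: scopes use lowercase letters, digits, "_" and "-".
is_conventional_subject() {
  printf '%s\n' "$1" |
    grep -Eq '^(feat|fix|docs|refactor|test|chore|ci|perf)(\([a-z0-9_-]+\))?: .+'
}

# Usage inside a clone (not run here):
#   is_conventional_subject "$(git log -1 --format=%s)" || echo "fix the subject line"
```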
### Language Conventions
| Language | Style Guide | Formatter |
|----------|-------------|-----------|
| **Python** | [PEP 8](https://peps.python.org/pep-0008/) | `black`, `ruff` |
| **Rust** | [Rust API Guidelines](https://rust-lang.github.io/api-guidelines/) | `cargo fmt`, `cargo clippy` |
| **Go** | [Effective Go](https://go.dev/doc/effective_go) | `gofmt` |
### Testing
Run the test suite before submitting a PR:
```bash
# Run all tests
pytest tests/
# Run unit tests only
pytest -m unit tests/
# Run a specific test file
pytest -s -v tests/test_example.py
```
For Rust components:
```bash
cargo test
```
For the Kubernetes operator (Go):
```bash
cd deploy/operator
go test ./... -v
```
### General Guidelines
- Keep PRs focused -- one concern per PR
- Write clean, well-documented code that future contributors can understand
- Include tests for new functionality and bug fixes
- Ensure clean builds (no warnings or errors)
- All tests must pass
- No commented-out code
- Respond to review feedback promptly and constructively
### Running GitHub Actions Locally
Use [act](https://nektosact.com/) to run workflows locally:
```bash
act -j pre-merge-rust
```
Or use the [GitHub Local Actions](https://marketplace.visualstudio.com/items?itemName=SanjulaGanepola.github-local-actions) VS Code extension.
---
## What to Expect
### Status Labels
| Status | What It Means |
|--------|---------------|
| `needs-triage` | We're reviewing your issue |
| `needs-info` | We need more details from you |
| `approved-for-pr` | Ready for implementation — submit a PR |
| `in-progress` | Someone is working on this |
| `blocked` | Waiting on external dependency |
### Response Times
We aim to:
- **Respond** to new issues within a few business days
- **Triage** high-priority issues within a week
Issues with no activity for 30 days may be auto-closed; a closed issue can be reopened.
### Review Process
After you submit a PR and complete the steps in [Submitting a Pull Request](#submitting-a-pull-request):
1. The reviewer will provide feedback -- please respond to all comments within a reasonable timeframe
2. If changes are requested, address them and ping the reviewer for re-review
3. If your PR hasn't been reviewed within 7 days, feel free to ping the reviewer or leave a comment
### Good First Issues
Issues labeled `good-first-issue` are sized for new contributors. We provide extra guidance on these -- look for clear acceptance criteria and a suggested approach in the issue description.
---
## DCO & Licensing
### Developer Certificate of Origin
Dynamo requires all contributions to be signed off with the [Developer Certificate of Origin (DCO)](https://developercertificate.org/). This certifies that you have the right to submit your contribution under the project's [Apache 2.0 license](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/LICENSE).
Each commit must include a sign-off line:
```text
Signed-off-by: Jane Smith <jane.smith@example.com>
```
Add this automatically with the `-s` flag:
```bash
git commit -s -m "fix: your descriptive message"
```
**Requirements:**
- Use your real name (no pseudonyms or anonymous contributions)
- Your `user.name` and `user.email` must be configured in git
**DCO Check Failed?** See our [DCO Troubleshooting Guide](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/DCO.md) for step-by-step instructions to fix it.
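To catch a missing sign-off before pushing, you can inspect the commit message locally. The sketch below assumes the standard trailer form `Signed-off-by: Name <email>`; `has_signoff` is a hypothetical helper, and the linked troubleshooting guide remains the reference for fixing a failed check.

```bash
# Succeed when the commit message read from stdin contains a
# "Signed-off-by: Name <email>" trailer line.
has_signoff() {
  grep -Eq '^Signed-off-by: .+ <[^<>@ ]+@[^<>@ ]+>$'
}

# Usage inside a clone (not run here): check the latest commit and,
# if the trailer is missing, amend it with a sign-off.
#   git log -1 --format=%B | has_signoff || git commit --amend --signoff --no-edit
```

Amending rewrites the commit, so a branch that was already pushed needs a force push afterwards (for example `git push --force-with-lease`).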
### License
By contributing, you agree that your contributions will be licensed under the [Apache 2.0 License](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/LICENSE).
---
## Code of Conduct
We are committed to providing a welcoming and inclusive environment. All participants are expected to abide by our [Code of Conduct](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/CODE_OF_CONDUCT.md).
---
## Security
If you discover a security vulnerability, please follow the instructions in our [Security Policy](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/SECURITY.md). Do not open a public issue for security vulnerabilities.
---
## Getting Help
- **CNCF Slack**: [Join CNCF Slack](https://communityinviter.com/apps/cloud-native/cncf) and find us in `#ai-dynamo`
- **Discord**: [Join our community](https://discord.gg/nvidia-dynamo)
- **Discussions**: [GitHub Discussions](https://github.com/ai-dynamo/dynamo/discussions)
- **Documentation**: [docs.nvidia.com/dynamo](https://docs.nvidia.com/dynamo/)
Thank you for contributing to Dynamo!
# Support Matrix
**See also:** [Release Artifacts](/dynamo/resources/release-artifacts) for container images, wheels, Helm charts, and crates | [Feature Matrix](/dynamo/resources/feature-matrix) for backend feature support
## At a Glance
**Latest stable release:** [v1.2.0](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0) -- SGLang `0.5.11` (NIXL `1.0.1`) | TensorRT-LLM `1.3.0rc14` (NIXL `0.10.1`) | vLLM `0.20.1` (NIXL `0.10.1`)
**Experimental release:** [v1.2.0-deepseek-v4-dev.3](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.3) *(DeepSeek-V4-Flash / V4-Pro on Blackwell, vLLM + SGLang containers only)* -- vLLM `0.20.1` | SGLang upstream `deepseek-v4-blackwell` preview | NIXL `0.10.1`
| Requirement | Supported |
| :--- | :--- |
| **GPU** | NVIDIA Ampere, Ada Lovelace, Hopper, Blackwell |
| **OS** | Ubuntu 22.04, Ubuntu 24.04, CentOS Stream 9 (experimental) |
| **Arch** | x86_64, ARM64 (ARM64 requires Ubuntu 24.04) |
| **CUDA 12** | Container images for SGLang and vLLM (CUDA 12.9) |
| **CUDA 13** | Container images for TensorRT-LLM (CUDA 13.1), SGLang and vLLM (CUDA 13.0) |
**On this page:** [Backend Dependencies](#backend-dependencies) | [CUDA and Drivers](#cuda-and-driver-requirements) | [Hardware](#hardware-compatibility) | [Platform](#platform-architecture-compatibility) | [Cloud](#cloud-service-provider-compatibility) | [Build Support](#build-support)
## Backend Dependencies
> Driver requirements differ by backend — see [CUDA and Driver Requirements](#cuda-and-driver-requirements) below.
The following table shows the backend framework versions included with each Dynamo release:
| **Dynamo** | **SGLang** | **TensorRT-LLM** | **vLLM** | **NIXL** |
| :--- | :--- | :--- | :--- | :--- |
| **main (ToT)** | `0.5.11` | `1.3.0rc17` | `0.21.0` | `0.10.1` (TRT-LLM); `1.1.0` (vLLM); `1.0.1` (SGLang) |
| **v1.2.0** | `0.5.11` | `1.3.0rc14` | `0.20.1` | `0.10.1` (TRT-LLM, vLLM); `1.0.1` (SGLang) |
| **v1.2.0-deepseek-v4-dev.3** *(experimental, partial)* | upstream DSv4 preview | — | `0.20.1` | `0.10.1` |
| **v1.2.0-deepseek-v4-dev.2** *(experimental, partial)* | upstream DSv4 preview | — | `0.20.0` | `0.10.1` |
| **v1.1.1** | `0.5.10.post1` | `1.3.0rc11` | `0.19.0` | `0.10.1` (TRT-LLM, vLLM); `1.0.1` (SGLang) |
| **v1.1.0** | `0.5.10.post1` | `1.3.0rc11` | `0.19.0` | `0.10.1` (TRT-LLM, vLLM); `1.0.1` (SGLang) |
| **v1.1.0-dev.3** *(experimental, partial)* | `0.5.10.post1` | `1.3.0rc11` | `0.19.0` | `0.10.1` |
| **v1.1.0-dev.2** *(experimental, partial)* | `0.5.9` | `1.3.0rc9` | `0.19.0` | `0.10.1` |
| **v1.1.0-dev.1** *(experimental)* | `0.5.9` | `1.3.0rc5.post1` | `0.17.1` | `0.10.1` |
| **v1.0.2** | `0.5.9` | `1.3.0rc5.post1` | `0.16.0` | `0.10.1` |
| **v1.0.1** | `0.5.9` | `1.3.0rc5.post1` | `0.16.0` | `0.10.1` |
| **v1.0.0** | `0.5.9` | `1.3.0rc5.post1` | `0.16.0` | `0.10.1` |
| **v0.9.1** | `0.5.8` | `1.3.0rc3` | `0.14.1` | `0.9.0` |
| **v0.9.0** | `0.5.8` | `1.3.0rc1` | `0.14.1` | `0.9.0` |
| **v0.8.1.post3** | `0.5.6.post2` | `1.2.0rc6.post3` | `0.12.0` | `0.8.0` |
| **v0.8.1.post2** | `0.5.6.post2` | `1.2.0rc6.post2` | `0.12.0` | `0.8.0` |
| **v0.8.1.post1** | `0.5.6.post2` | `1.2.0rc6.post1` | `0.12.0` | `0.8.0` |
| **v0.8.1** | `0.5.6.post2` | `1.2.0rc6.post1` | `0.12.0` | `0.8.0` |
| **v0.8.0** | `0.5.6.post2` | `1.2.0rc6.post1` | `0.12.0` | `0.8.0` |
| **v0.7.1** | `0.5.4.post3` | `1.2.0rc3` | `0.11.0` | `0.8.0` |
| **v0.7.0.post1** | `0.5.4.post3` | `1.2.0rc3` | `0.11.0` | `0.8.0` |
| **v0.7.0** | `0.5.4.post3` | `1.2.0rc2` | `0.11.0` | `0.8.0` |
| **v0.6.1.post1** | `0.5.3.post2` | `1.1.0rc5` | `0.11.0` | `0.6.0` |
| **v0.6.1** | `0.5.3.post2` | `1.1.0rc5` | `0.11.0` | `0.6.0` |
| **v0.6.0** | `0.5.3.post2` | `1.1.0rc5` | `0.11.0` | `0.6.0` |
For **v1.1.0-dev.2**, **v1.1.0-dev.3**, **v1.2.0-deepseek-v4-dev.2**, and **v1.2.0-deepseek-v4-dev.3**, the cells above match `container/context.yaml` on the corresponding release branch (the pins used to build images). Those rows are **partial releases**: not every backend has a published Dynamo runtime container for that tag. See [Pre-Release Artifacts](/dynamo/resources/release-artifacts#pre-release-artifacts) for what actually shipped. The `v1.2.0-deepseek-v4-dev.2` and `v1.2.0-deepseek-v4-dev.3` SGLang containers are built on the upstream `lmsysorg/sglang:deepseek-v4-blackwell` preview image rather than a tagged SGLang release; TensorRT-LLM is not part of those dev releases.
### Version Labels
- **main (ToT)** is the top of tree of the `main` branch, the current development branch (version 1.3.0).
- Releases marked *(experimental, partial)* are pre-releases: the table shows branch build pins, which may include backends with no NGC image for that dev tag yet.
### Version Compatibility
- Backend versions listed are the only versions tested and supported for each release.
- TensorRT-LLM does not support Python 3.11; installation of the `ai-dynamo[trtllm]` wheel will fail on Python 3.11.
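Because only the pinned backend versions are tested, it can be useful to compare an installed backend against the pin for your release. The sketch below covers vLLM for a few stable releases only, with values copied from the table above; `pinned_vllm_version` and `matches_pin` are hypothetical helpers, not Dynamo commands.

```bash
# vLLM version pinned by a stable Dynamo release (subset of the table above).
pinned_vllm_version() {
  case "$1" in
    v1.2.0)               echo 0.20.1 ;;
    v1.1.0|v1.1.1)        echo 0.19.0 ;;
    v1.0.0|v1.0.1|v1.0.2) echo 0.16.0 ;;
    *)                    return 1 ;;   # release not covered by this sketch
  esac
}

# Succeed when an installed version string equals the pin for a release.
matches_pin() {
  [ "$(pinned_vllm_version "$1")" = "$2" ]
}

# Usage in an environment with vLLM installed (not run here):
#   matches_pin v1.2.0 "$(python3 -c 'import vllm; print(vllm.__version__)')"
```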
### CUDA and Driver Requirements
Dynamo container images include CUDA toolkit libraries. The host machine must have a compatible NVIDIA GPU driver installed.
| Dynamo Version | Backend | CUDA Toolkit | Min Driver | Notes |
| :--- | :--- | :--- | :--- | :--- |
| **1.2.0** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| | **TensorRT-LLM** | 13.1 | 580.xx+ | CUDA 13 only |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| **1.1.1** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| | **TensorRT-LLM** | 13.1 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| **1.1.0** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| | **TensorRT-LLM** | 13.1 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| **1.0.2** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| | **TensorRT-LLM** | 13.1 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| **1.0.1** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| | **TensorRT-LLM** | 13.1 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| **1.0.0** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| | **TensorRT-LLM** | 13.1 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | |
| **0.9.1** | **SGLang** | 12.9 | 575.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| **0.9.0** | **SGLang** | 12.9 | 575.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| **0.8.1** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | Experimental |
| **0.8.0** | **SGLang** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | Experimental |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| | | 13.0 | 580.xx+ | Experimental |
| **0.7.1** | **SGLang** | 12.8 | 570.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | |
| | **vLLM** | 12.9 | 575.xx+ | |
| **0.7.0** | **SGLang** | 12.9 | 575.xx+ | |
| | **TensorRT-LLM** | 13.0 | 580.xx+ | |
| | **vLLM** | 12.8 | 570.xx+ | |
Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version.
Experimental `v1.1.0-dev.*` images follow the same CUDA matrix as `v1.0.2`. The `v1.2.0-deepseek-v4-dev.3` vLLM container is CUDA 13.0 multi-arch; the SGLang containers split by arch (CUDA 12.9 on `amd64`, CUDA 13.0 on `arm64`).
Experimental CUDA 13 images are not published for all versions. Check [Release Artifacts](/dynamo/resources/release-artifacts) for availability.
For detailed artifact versions and NGC links (including container images, Python wheels, Helm charts, and Rust crates), see the [Release Artifacts](/dynamo/resources/release-artifacts) page.
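A host can be checked against these minimums by comparing the driver's major version with the required branch. The following is a sketch under stated assumptions: `driver_meets_min` and `min_driver_for_cuda` are hypothetical helpers, "575.xx+" is read as "major version 575 or higher", and the `nvidia-smi` query only runs when the tool is present.

```bash
# Succeed when a driver version string (e.g. "580.65.06") has a major
# version at or above the required branch (e.g. 580 for "580.xx+").
driver_meets_min() {
  local major=${1%%.*}
  [ "$major" -ge "$2" ]
}

# Minimum driver branch for a CUDA toolkit version, per the table above.
min_driver_for_cuda() {
  case "$1" in
    12.8)      echo 570 ;;
    12.9)      echo 575 ;;
    13.0|13.1) echo 580 ;;
    *)         return 1 ;;   # toolkit version not listed in the table
  esac
}

# On a GPU host, read the installed driver version and compare it.
if command -v nvidia-smi >/dev/null 2>&1; then
  installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  driver_meets_min "$installed" "$(min_driver_for_cuda 13.0)" \
    && echo "driver $installed is new enough for CUDA 13.0 images" \
    || echo "driver $installed is too old for CUDA 13.0 images"
fi
```

This ignores forward-compatibility packages, so a host that fails the check may still work with `cuda-compat` installed.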
#### CUDA Compatibility Resources
For detailed information on CUDA driver compatibility, forward compatibility, and troubleshooting:
- [CUDA Compatibility Overview](https://docs.nvidia.com/deploy/cuda-compatibility/)
- [Why CUDA Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html)
- [Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html)
- [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html)
- [FAQ](https://docs.nvidia.com/deploy/cuda-compatibility/frequently-asked-questions.html)
For extended driver compatibility beyond the minimum versions listed above, consider using `cuda-compat` packages on the host. See [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html) for details.
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| :------------------- | :----------- |
| **x86_64** | Supported |
| **ARM64** | Supported |
Dynamo provides multi-arch container images supporting both AMD64 (x86_64) and ARM64 architectures. See [Release Artifacts](/dynamo/resources/release-artifacts) for available images.
### GPU Compatibility
The following NVIDIA GPU architectures are supported:
| **GPU Architecture** | **Status** |
| :----------------------------------- | :--------- |
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
| **Operating System** | **Version** | **Architecture** | **Status** |
| :------------------- | :---------- | :--------------- | :----------- |
| **Ubuntu** | 22.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Supported |
| **CentOS Stream** | 9 | x86_64 | Experimental |
Wheels are built using a manylinux_2_28-compatible environment and validated on CentOS Stream 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but not officially verified.
## Cloud Service Provider Compatibility
### AWS
| **Host Operating System** | **Version** | **Architecture** | **Status** |
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported |
**AL2023 TensorRT-LLM Limitation:** TensorRT-LLM has a known issue when the AL2023 container is run locally with `docker run --network host ...`, caused by a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid it, drop the `--network host` flag and publish only the ports you need (e.g., 4222 for NATS, 2379/2380 for etcd, 8000 for the frontend).
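As a rough illustration of the workaround, the explicit port mappings can be generated and passed to `docker run` in place of `--network host`. `port_flags` is a hypothetical helper, `IMAGE` is a placeholder for your TensorRT-LLM runtime image, and the port list is the one named in the note; adjust it to the services you actually run.

```bash
# Build "-p HOST:CONTAINER" flags that publish each port on the same
# host port. (Hypothetical helper for illustration.)
port_flags() {
  local port flags=""
  for port in "$@"; do
    flags="$flags -p $port:$port"
  done
  printf '%s\n' "${flags# }"
}

# 8000 = frontend, 4222 = NATS, 2379/2380 = etcd
port_flags 8000 4222 2379 2380
# prints: -p 8000:8000 -p 4222:4222 -p 2379:2379 -p 2380:2380

# Usage (not run here; IMAGE is a placeholder):
#   docker run --gpus all $(port_flags 8000 4222 2379 2380) --rm -it IMAGE
```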
## Build Support
For version-specific artifact details, installation commands, and release history, see [Release Artifacts](/dynamo/resources/release-artifacts).
**Dynamo** currently provides build support in the following ways:
- **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager:
  - [ai-dynamo](https://pypi.org/project/ai-dynamo/)
  - [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/)
  - [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation.
- **Dynamo Container Images**: We distribute multi-arch images (x86 & ARM64 compatible) on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo):
  - [Dynamo Frontend](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend) *(New in v0.8.0)*
  - [SGLang Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime)
  - [SGLang Runtime (CUDA 13)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime-cu13)
  - [TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime)
  - [TensorRT-LLM Runtime (EFA)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime) *(New in v1.0.0, Experimental, AMD64 only)*
  - [vLLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)
  - [vLLM Runtime (CUDA 13)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime-cu13)
  - [vLLM Runtime (EFA)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime) *(New in v1.0.0, Experimental, AMD64 only)*
  - [Kubernetes Operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator)
  - [Snapshot Agent](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/snapshot-agent) *(New in v1.0.0, Preview)*
- **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo:
  - [Dynamo Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform) (now includes CRDs)
  - [Snapshot](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/snapshot) *(New in v1.0.0, Preview)*
  - [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds) *(Deprecated in v1.0.0, CRDs managed by Operator)*
  - [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph) *(Deprecated in v0.9.0)*
- **Rust Crates**:
  - [dynamo-runtime](https://crates.io/crates/dynamo-runtime/)
  - [dynamo-llm](https://crates.io/crates/dynamo-llm/)
  - [dynamo-protocols](https://crates.io/crates/dynamo-protocols/)
  - [dynamo-parsers](https://crates.io/crates/dynamo-parsers/)
  - [dynamo-config](https://crates.io/crates/dynamo-config/) *(New in v0.8.0)*
  - [dynamo-memory](https://crates.io/crates/dynamo-memory/) *(New in v0.8.0)*
  - [dynamo-tokens](https://crates.io/crates/dynamo-tokens/) *(New in v0.9.0)*
  - [dynamo-mocker](https://crates.io/crates/dynamo-mocker/) *(New in v1.0.0)*
  - [dynamo-kv-router](https://crates.io/crates/dynamo-kv-router/) *(New in v1.0.0)*
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the [Local Quick Start](https://github.com/ai-dynamo/dynamo/blob/main/README.md#local-quick-start) in the README.
# Feature Matrix
This document provides a comprehensive compatibility matrix for key Dynamo features across the supported backends.
*Updated for Dynamo v1.2.0*
**Legend:**
* ✅ : Supported
* 🚧 : Work in Progress / Experimental / Limited
## Quick Comparison
| Feature | SGLang | TensorRT-LLM | vLLM | Source |
| :--- | :---: | :---: | :---: | :--- |
| **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] |
| **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] |
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
| **KV Block Manager** | 🚧 | ✅ | ✅ | [KVBM Doc][kvbm] |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Video)** | ✅ | | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | | | 🚧 | [Multimodal Doc][mm] |
| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
| **Request Cancellation** | 🚧 | ✅ | ✅ | Backend READMEs |
| **LoRA** | | | ✅ | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | 🚧 | ✅ | ✅ | Backend READMEs |
| **Dynamo Snapshot** | ✅ | | ✅ | [Snapshot Docs][snapshot] |
## 1. vLLM Backend
vLLM offers the broadest feature coverage in Dynamo, with full support for disaggregated serving, KV-aware routing, KV block management, LoRA adapters, and multimodal inference including video and audio.
*Source: [docs/backends/vllm/README.md][vllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅ | ✅<sup>1</sup> | — | ✅ | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | | |
| **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
| **LoRA** | ✅ | ✅<sup>2</sup> | — | ✅ | — | ✅ | ✅ | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | |
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | — | ✅ | — |
> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: Image-aware KV routing is supported in the documented vLLM paths. The default Rust frontend path supports model families handled by `llm-multimodal`; the Python chat-processor path delegates to vLLM's multimodal processor. ([Source][mm-kv-routing])
> 2. **KV-Aware LoRA Routing**: vLLM supports routing requests based on LoRA adapter affinity.
> 3. **Audio Support**: vLLM supports audio models like Qwen2-Audio (experimental). ([Source][mm-vllm])
> 4. **Video Support**: vLLM supports video input with frame sampling. ([Source][mm-vllm])
> 5. **Speculative Decoding**: Eagle3 support is documented. ([Source][vllm-spec])
## 2. SGLang Backend
SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
*Source: [docs/backends/sglang/README.md][sglang-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | |
| **Multimodal** | ✅<sup>2</sup> | <sup>1</sup> | — | 🚧 | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | |
| **Request Cancellation** | 🚧<sup>3</sup> | ✅ | ✅ | 🚧 | 🚧 | ✅ | — | | | |
| **LoRA** | | | | 🚧 | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | — | | 🚧 | — |
> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
> 2. **Multimodal Patterns**: Supports simple Aggregated **EPD**, **E/PD**, and **E/P/D** patterns. Traditional Disagg **EP/D** is not supported. ([Source][mm-sglang])
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in the publisher), but there are no examples or documentation yet.
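
The per-backend matrices in this document are lower-triangular: the cell for a pair of features sits in the row of whichever feature appears later in the header and the column of the one that appears earlier. The sketch below shows that lookup rule using two rows copied from the SGLang table above. `FEATURES`, `SGLANG_ROWS`, and `pair_status` are hypothetical names used for illustration, and only the transcribed rows can be queried.

```python
# Header order shared by the per-backend matrices in this document.
FEATURES = [
    "Disaggregated Serving", "KV-Aware Routing", "SLA-Based Planner",
    "KV Block Manager", "Multimodal", "Request Migration",
    "Request Cancellation", "LoRA", "Tool Calling", "Speculative Decoding",
]

# Cells left of the diagonal, copied from two rows of the SGLang table.
# "" mirrors an empty cell.
SGLANG_ROWS = {
    "Request Migration": ["✅", "✅", "✅", "🚧", "✅"],
    "Tool Calling": ["✅", "✅", "✅", "🚧", "✅", "✅", "✅", ""],
}


def pair_status(a: str, b: str) -> str:
    """Look up the cell for a feature pair, regardless of argument order."""
    i, j = FEATURES.index(a), FEATURES.index(b)
    if i == j:
        raise ValueError("pick two different features")
    # The later feature selects the row; the earlier one selects the column.
    row, col = (a, j) if i > j else (b, i)
    return SGLANG_ROWS[row][col]


print(pair_status("KV Block Manager", "Tool Calling"))  # 🚧
print(pair_status("Tool Calling", "KV Block Manager"))  # 🚧
print(pair_status("Request Migration", "Multimodal"))   # ✅
```

A pair whose row has not been transcribed raises `KeyError`, which is intentional in a sketch this small.
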
## 3. TensorRT-LLM Backend
TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
*Source: [docs/backends/trtllm/README.md][trtllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅<sup>1</sup> | ✅<sup>2</sup> | — | ✅ | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | ✅ | 🚧 | — | | | | |
| **Request Cancellation** | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | — | | | |
| **LoRA** | | | | | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | | ✅ | — |
> **Notes:**
> 1. **Multimodal Disaggregation**: Supports **EP/D** (Traditional) and **E/P/D** (Full Disaggregation) image flows, including image URLs and pre-computed embeddings. ([Source][mm-trtllm])
> 2. **Multimodal + KV-Aware Routing**: Image-aware KV routing is supported through the dedicated TRT-LLM MM Router Worker. It requires KV event publishing on the TRT-LLM workers. ([Source][mm-kv-routing])
> 3. **Request Cancellation**: Due to known issues, the TensorRT-LLM engine is temporarily not notified of request cancellations, meaning allocated resources for cancelled requests are not freed.
---
[vllm-readme]: ../backends/v-llm
[sglang-readme]: ../backends/sg-lang
[trtllm-readme]: ../backends/tensor-rt-llm
[disagg]: ../design-docs/disaggregated-serving
[kv-routing]: ../user-guides/kv-cache-aware-routing
[planner]: ../components/planner
[kvbm]: ../components/kvbm
[migration]: ../user-guides/fault-tolerance/request-migration
[tools]: ../user-guides/tool-calling
[mm]: ../user-guides/multimodal
[mm-vllm]: ../features/multimodal/multimodal-vllm.md
[mm-trtllm]: ../features/multimodal/multimodal-trtllm.md
[mm-sglang]: ../features/multimodal/multimodal-sglang.md
[mm-kv-routing]: ../features/multimodal/multimodal-kv-routing.md
[lora]: ../kubernetes-deployment/deploy-models/managing-models-with-dynamo-model
[vllm-spec]: ../additional-resources/speculative-decoding/speculative-decoding-with-v-llm
[trtllm-eagle]: ../additional-resources/tensor-rt-llm-details/llama-4-eagle
[snapshot]: ../kubernetes-deployment/advanced-platform/snapshot
# Release Artifacts
This document provides a comprehensive inventory of all Dynamo release artifacts including container images, Python wheels, Helm charts, and Rust crates.
> **See also:** [Support Matrix](/dynamo/resources/support-matrix) for hardware and platform compatibility | [Feature Matrix](/dynamo/resources/feature-matrix) for backend feature support
Release history in this document begins at v0.6.0.
## Current Release: Dynamo v1.2.0
- **GitHub Release:** [v1.2.0](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0)
- **Docs:** [v1.2.0](https://docs.dynamo.nvidia.com/dynamo)
- **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
> **Experimental:** The DeepSeek-V4 preview tags remain available under [Pre-Release Artifacts](#pre-release-artifacts). Use the stable v1.2.0 artifacts below unless you specifically need the preview SGLang DeepSeek-V4 images.
### Container Images
| Image:Tag | Description | Backend | CUDA | Arch | NGC | Notes |
|-----------|-------------|---------|------|------|-----|-------|
| `vllm-runtime:1.2.0` | Runtime container for vLLM backend | vLLM `v0.20.1` | `v12.9` | AMD64/ARM64 | [NGC: vLLM runtime 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=1.2.0) | |
| `vllm-runtime:1.2.0-cuda13` | Runtime container for vLLM backend (CUDA 13) | vLLM `v0.20.1` | `v13.0` | AMD64/ARM64 | [NGC: vLLM runtime 1.2.0-cuda13](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=1.2.0-cuda13) | |
| `vllm-runtime:1.2.0-efa` | Runtime container for vLLM with AWS EFA | vLLM `v0.20.1` | `v12.9` | AMD64/ARM64 | [NGC: vLLM runtime 1.2.0-efa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=1.2.0-efa) | Experimental |
| `sglang-runtime:1.2.0` | Runtime container for SGLang backend | SGLang `v0.5.11` | `v12.9` | AMD64/ARM64 | [NGC: SGLang runtime 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=1.2.0) | |
| `sglang-runtime:1.2.0-cuda13` | Runtime container for SGLang backend (CUDA 13) | SGLang `v0.5.11` | `v13.0` | AMD64/ARM64 | [NGC: SGLang runtime 1.2.0-cuda13](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=1.2.0-cuda13) | |
| `sglang-runtime:1.2.0-efa` | Runtime container for SGLang with AWS EFA | SGLang `v0.5.11` | `v12.9` | AMD64/ARM64 | [NGC: SGLang runtime 1.2.0-efa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=1.2.0-efa) | Experimental |
| `tensorrtllm-runtime:1.2.0-cuda13` | Runtime container for TensorRT-LLM backend | TRT-LLM `v1.3.0rc14` | `v13.1` | AMD64/ARM64 | [NGC: TensorRT-LLM runtime 1.2.0-cuda13](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime?version=1.2.0-cuda13) | CUDA 13 only |
| `tensorrtllm-runtime:1.2.0-efa` | Runtime container for TensorRT-LLM with AWS EFA | TRT-LLM `v1.3.0rc14` | `v13.1` | AMD64/ARM64 | [NGC: TensorRT-LLM runtime 1.2.0-efa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime?version=1.2.0-efa) | Experimental |
| `dynamo-frontend:1.2.0` | API gateway with Endpoint Prediction Protocol (EPP) | — | — | AMD64/ARM64 | [NGC: Dynamo frontend 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend?version=1.2.0) | |
| `dynamo-planner:1.2.0` | Standalone Planner image used by Profiler jobs and Planner pods | — | — | AMD64/ARM64 | [NGC: Dynamo planner 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-planner?version=1.2.0) | |
| `kubernetes-operator:1.2.0` | Kubernetes operator for Dynamo deployments | — | — | AMD64/ARM64 | [NGC: Kubernetes operator 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator?version=1.2.0) | |
| `snapshot-agent:1.2.0` | Snapshot agent for fast GPU worker recovery via CRIU | — | — | AMD64/ARM64 | [NGC: Snapshot agent 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/snapshot-agent?version=1.2.0) | Preview |
### Python Wheels
We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtllm]` wheel. See the [NGC container collection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for supported images.
| Package | Description | Python | Platform | PyPI |
|---------|-------------|--------|----------|------|
| `ai-dynamo==1.2.0.post1` | Main package with backend integrations (vLLM, SGLang, TRT-LLM) | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [PyPI: ai-dynamo 1.2.0.post1](https://pypi.org/project/ai-dynamo/1.2.0.post1/) |
| `ai-dynamo-runtime==1.2.0.post1` | Core Python bindings for Dynamo runtime | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [PyPI: ai-dynamo-runtime 1.2.0.post1](https://pypi.org/project/ai-dynamo-runtime/1.2.0.post1/) |
| `kvbm==1.2.0.post1` | KV Block Manager for disaggregated KV cache | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [PyPI: kvbm 1.2.0.post1](https://pypi.org/project/kvbm/1.2.0.post1/) |
### Helm Charts
| Chart | Description | NGC |
|-------|-------------|-----|
| `dynamo-platform-1.2.0` | Platform services (etcd, NATS) and Dynamo Operator for Dynamo cluster | [NGC Helm: dynamo-platform-1.2.0](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.2.0.tgz) |
| `snapshot-1.2.0` | Snapshot DaemonSet for fast GPU worker recovery | [NGC Helm: snapshot-1.2.0](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/snapshot-1.2.0.tgz) |
The `dynamo-crds` Helm chart is deprecated as of v1.0.0; CRDs are now managed by the Dynamo Operator. The `dynamo-graph` Helm chart is deprecated as of v0.9.0.
### Rust Crates
| Crate | Description | MSRV (Rust) | crates.io |
|-------|-------------|-------------|-----------|
| `dynamo-runtime@1.2.0` | Core distributed runtime library | `v1.82` | [crates.io: dynamo-runtime 1.2.0](https://crates.io/crates/dynamo-runtime/1.2.0) |
| `dynamo-llm@1.2.0` | LLM inference engine | `v1.82` | [crates.io: dynamo-llm 1.2.0](https://crates.io/crates/dynamo-llm/1.2.0) |
| `dynamo-protocols@1.2.0` | Async OpenAI-compatible API client | `v1.82` | [crates.io: dynamo-protocols 1.2.0](https://crates.io/crates/dynamo-protocols/1.2.0) |
| `dynamo-async-openai@1.0.2` | Deprecated legacy OpenAI client; use **`dynamo-protocols`** | `v1.82` | [crates.io: dynamo-async-openai 1.0.2](https://crates.io/crates/dynamo-async-openai/1.0.2) |
| `dynamo-parsers@1.2.0` | Protocol parsers (SSE, JSON streaming) | `v1.82` | [crates.io: dynamo-parsers 1.2.0](https://crates.io/crates/dynamo-parsers/1.2.0) |
| `dynamo-memory@1.2.0` | Memory management utilities | `v1.82` | [crates.io: dynamo-memory 1.2.0](https://crates.io/crates/dynamo-memory/1.2.0) |
| `dynamo-config@1.2.0` | Configuration management | `v1.82` | [crates.io: dynamo-config 1.2.0](https://crates.io/crates/dynamo-config/1.2.0) |
| `dynamo-tokenizers@1.2.0` | Standalone tokenizer library | `v1.82` | [crates.io: dynamo-tokenizers 1.2.0](https://crates.io/crates/dynamo-tokenizers/1.2.0) |
| `dynamo-tokens@1.2.0` | Tokenizer bindings for LLM inference | `v1.82` | [crates.io: dynamo-tokens 1.2.0](https://crates.io/crates/dynamo-tokens/1.2.0) |
| `dynamo-mocker@1.2.0` | Inference engine simulator for benchmarking | `v1.82` | [crates.io: dynamo-mocker 1.2.0](https://crates.io/crates/dynamo-mocker/1.2.0) |
| `dynamo-kv-router@1.2.0` | KV-aware request routing library | `v1.82` | [crates.io: dynamo-kv-router 1.2.0](https://crates.io/crates/dynamo-kv-router/1.2.0) |
## Quick Install Commands
### Container Images (NGC)
For detailed run instructions, see the backend-specific guides: [vLLM](/dynamo/backends/v-llm) | [SGLang](/dynamo/backends/sg-lang) | [TensorRT-LLM](/dynamo/backends/tensor-rt-llm)
```bash
# Runtime containers
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0-cuda13
# CUDA 13 variants
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-cuda13
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-cuda13
# EFA variants (AWS, experimental)
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-efa
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-efa
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0-efa
# Infrastructure containers
docker pull nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/snapshot-agent:1.2.0
```
### Python Wheels (PyPI)
For detailed installation instructions, see the [Local Quick Start](https://github.com/ai-dynamo/dynamo#local-quick-start) in the README.
```bash
# Install Dynamo with a specific backend (Recommended)
uv pip install "ai-dynamo[vllm]==1.2.0.post1"
uv pip install --prerelease=allow "ai-dynamo[sglang]==1.2.0.post1"
# TensorRT-LLM requires the NVIDIA PyPI index and pip
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]==1.2.0.post1"
# Install Dynamo core only
uv pip install ai-dynamo==1.2.0.post1
# Install standalone KVBM
uv pip install kvbm==1.2.0.post1
```
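
Before installing the wheels, it can help to confirm that the host matches the Python and glibc ranges listed in the Python Wheels table (`3.10`–`3.12`, glibc `v2.28+`). The following is a minimal preflight sketch using only the standard library. `wheel_problems` is a hypothetical helper, not a Dynamo tool, and it checks only these two constraints.

```python
import platform
import sys

# Ranges taken from the Python Wheels table for 1.2.0.post1.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_GLIBC = (2, 28)


def wheel_problems(python: tuple, libc: tuple) -> list:
    """Return reasons the wheels may not install; an empty list means none found."""
    problems = []
    if tuple(python) not in SUPPORTED_PYTHON:
        problems.append(f"Python {python[0]}.{python[1]} is outside 3.10-3.12")
    name, version = libc
    if name != "glibc":
        # platform.libc_ver() reports an empty name on non-glibc systems.
        problems.append("glibc not detected")
    else:
        found = tuple(int(part) for part in version.split(".")[:2])
        if found < MIN_GLIBC:
            problems.append(f"glibc {version} is older than 2.28")
    return problems


if __name__ == "__main__":
    issues = wheel_problems(sys.version_info[:2], platform.libc_ver())
    print("\n".join(issues) or "No obvious wheel incompatibilities found")
```

An empty result does not guarantee a working install; GPU driver and CUDA requirements are covered separately in the Support Matrix.
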
### Helm Charts (NGC)
For Kubernetes deployment instructions, see the [Kubernetes Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide).
```bash
helm install dynamo-platform oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform --version 1.2.0
helm install snapshot oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/snapshot --version 1.2.0
```
### Rust Crates (crates.io)
For API documentation, see each crate on [docs.rs](https://docs.rs/). To build Dynamo from source, see [Building from Source](https://github.com/ai-dynamo/dynamo#building-from-source).
```bash
cargo add dynamo-runtime@1.2.0
cargo add dynamo-llm@1.2.0
cargo add dynamo-protocols@1.2.0
# Deprecated legacy crate name — pin only if a dependency requires it; new code should use dynamo-protocols:
# cargo add dynamo-async-openai@1.0.2
cargo add dynamo-parsers@1.2.0
cargo add dynamo-memory@1.2.0
cargo add dynamo-config@1.2.0
cargo add dynamo-tokenizers@1.2.0
cargo add dynamo-tokens@1.2.0
cargo add dynamo-mocker@1.2.0
cargo add dynamo-kv-router@1.2.0
```
**CUDA and Driver Requirements:** For detailed CUDA toolkit versions and minimum driver requirements for each container image, see the [Support Matrix](/dynamo/resources/support-matrix#cuda-and-driver-requirements).
## Known Issues
For a complete list of known issues, refer to the release notes for each version:
- [v1.2.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0)
- [v1.1.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.1)
- [v1.1.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0)
- [v1.0.2 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.2)
- [v1.0.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.1)
- [v1.0.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.0)
- [v0.9.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.9.0)
- [v0.8.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1)
### Known Artifact Issues
| Version | Artifact | Issue | Status |
|---------|----------|-------|--------|
| v0.9.0 | `dynamo-platform-0.9.0` | Helm chart sets operator image to `0.7.1` instead of `0.9.0`. | Fixed in v0.9.0.post1 |
| v0.8.1 | `vllm-runtime:0.8.1-cuda13` | Container fails to launch. | Known issue |
| v0.8.1 | `sglang-runtime:0.8.1-cuda13`, `vllm-runtime:0.8.1-cuda13` | Multimodality not expected to work on ARM64. Works on AMD64. | Known limitation |
| v0.8.0 | `sglang-runtime:0.8.0-cuda13` | CuDNN installation issue caused PyTorch `v2.9.1` compatibility problems with `nn.Conv3d`, resulting in performance degradation and excessive memory usage in multimodal workloads. | Fixed in v0.8.1 ([#5461](https://github.com/ai-dynamo/dynamo/pull/5461)) |
---
## Release Artifact History
Each bullet is a **delta** to what ships on NGC / Helm / PyPI / crates.io: net-new crates, removed Helm charts, or image lines that **split** or **appear** on the registry. See the inventory tables above for full matrices.
Stable releases are listed first, newest first. **Pre-Release Git Tags** (`v*-dev.*`, experimental tracks) are summarized below; per-tag images and wheels are spelled out in [Pre-Release Artifacts](#pre-release-artifacts).
For backend version pins, see the version-pins table above and the [GitHub Releases](#github-releases) table below.
**Stable Releases**
- **v1.2.0**: **Images:** vLLM and SGLang runtime images for CUDA 12.9 and CUDA 13.0, TensorRT-LLM runtime image for CUDA 13.1, multi-arch EFA runtime images, and refreshed `dynamo-frontend`, `kubernetes-operator`, `dynamo-planner`, and `snapshot-agent` images. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime`, and `kvbm` published as `1.2.0.post1`. **Crates:** `1.2.0` crates published, including the new `dynamo-tokenizers` crate. **Helm:** `dynamo-platform` and `snapshot` charts published at `1.2.0`.
- **v1.1.1**: Patch release. Same backend versions as v1.1.0: SGLang `v0.5.10.post1` (NIXL `v1.0.1`), TRT-LLM `v1.3.0rc11` (NIXL `v0.10.1`), vLLM `v0.19.0` (NIXL `v0.10.1`).
- **v1.1.0**: **Images:** Split Planner into its own `dynamo-planner` image on NGC for Profiler jobs and Planner pods; worker and runtime images no longer bundle Planner (**artifact boundary change**, not a new engine capability). **Crates:** First **`1.y.z`** publication on crates.io for **`dynamo-protocols`** (multi-protocol types; **`dynamo-async-openai`** remains deprecated with final release **`1.0.2`**).
- **v1.0.2 / v1.0.1**: No artifact additions or removals versus v1.0.0.
- **v1.0.0**: **Images:** `snapshot-agent`, EFA variants for vLLM and TRT-LLM (AMD64 only). **Crates:** First publish of `dynamo-mocker`, `dynamo-kv-router`. **Helm:** Added `snapshot` (preview); dropped deprecated `dynamo-crds` from the publish stream (CRDs owned by the Operator).
- **v0.9.1**: No artifact additions or removals versus v0.9.0.
- **v0.9.0**: **Crates:** First publish of `dynamo-tokens`. **Helm:** Dropped deprecated `dynamo-graph` from the publish stream.
- **v0.8.0**: **Images:** `dynamo-frontend`, CUDA 13 variants for vLLM and SGLang. **Crates:** First publish of `dynamo-memory`, `dynamo-config`.
**Dynamo Nightlies**
- **New as of v1.1.0\*:** **`ai-dynamo`** and **`ai-dynamo-runtime`** — nightly builds from **`main`** publish wheels tagged **`*.devYYYYMMDD`**. Install with **`pip`** or **`uv`** using **`--pre`** and the same NVIDIA extra-index pattern as [Pre-Release Artifacts](#pre-release-artifacts).
\* **`*.devYYYYMMDD`** versioning for nightly **`main`** wheels began **Apr 24, 2026**.
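
Under PEP 440, a `.devN` suffix marks a developmental release, which installers skip unless pre-releases are allowed; that is why the nightly wheels need `--pre`. The sketch below shows one way to recover the build date from such a version string. The helper name `nightly_build_date` and the sample version `0.0.0.dev20260424` are hypothetical; actual nightly version numbers are not listed in this document.

```python
import re
from datetime import date
from typing import Optional

# Matches a trailing ".devYYYYMMDD" segment.
_DEV_SUFFIX = re.compile(r"\.dev(\d{4})(\d{2})(\d{2})$")


def nightly_build_date(version: str) -> Optional[date]:
    """Return the date encoded in a `.devYYYYMMDD` suffix, or None if absent."""
    match = _DEV_SUFFIX.search(version)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


print(nightly_build_date("0.0.0.dev20260424"))  # 2026-04-24
print(nightly_build_date("1.2.0.post1"))        # None
```
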
**Pre-Release and Experimental Git Tags**
- **v1.2.0-deepseek-v4-dev.3**: **Images:** `vllm-runtime:*-deepseek-v4-cuda13-dev.3`, `sglang-runtime:*-deepseek-v4-cuda12-dev.3`, `sglang-runtime:*-deepseek-v4-cuda13-dev.3`. **Helm / PyPI:** Not published for this tag (see [Pre-Release Artifacts](#v120-deepseek-v4-dev3)).
- **v1.1.0-dev.3**: **Images:** `tensorrtllm-runtime:1.1.0-dev.3`. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime` on [pypi.nvidia.com](https://pypi.nvidia.com/) (see [below](#v110-dev3)).
- **v1.1.0-dev.2**: **Images:** `sglang-runtime:1.1.0-dev.2`, `tensorrtllm-runtime:1.1.0-dev.2`. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime` on [pypi.nvidia.com](https://pypi.nvidia.com/) (see [below](#v110-dev2)).
- **v1.1.0-dev.1**: **Images:** vLLM, SGLang, TRT-LLM runtime matrix (CUDA 12 / 13 and EFA variants as listed), `dynamo-frontend`, `kubernetes-operator`, `snapshot-agent`. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime` on [pypi.nvidia.com](https://pypi.nvidia.com/). **Helm:** `dynamo-platform`, `snapshot` at `1.1.0-dev.1` (see [below](#v110-dev1)).
**Helm-Only Patches**
- **v0.9.0.post1**: Republished `dynamo-platform` Helm chart only (operator image tag correction).
**Backend-Only Patch Trains**
- **v0.8.1.post1 / .post2 / .post3**: Republished TRT-LLM runtime image and PyPI wheels only.
### crates.io Rust Packages
These crates use repository `https://github.com/ai-dynamo/dynamo.git`. The table lists each crate’s **first non-placeholder** publication on crates.io (excluding reservation uploads named `0.0.0-prerelease.0`). Dates are from the crates.io registry index.
| Crate | First Published Version | Date (crates.io) |
|-------|-------------------------|------------------|
| `dynamo-runtime` | `0.1.0` | 2025-03-18 |
| `dynamo-llm` | `0.2.0` | 2025-05-01 |
| `dynamo-async-openai` | `0.4.1` | 2025-08-27 |
| `dynamo-parsers` | `0.5.0` | 2025-09-18 |
| `dynamo-memory` | `0.8.0` | 2026-01-15 |
| `dynamo-config` | `0.8.0` | 2026-01-15 |
| `dynamo-tokens` | `0.9.0` | 2026-02-12 |
| `dynamo-mocker` | `1.0.0` | 2026-03-13 |
| `dynamo-kv-router` | `1.0.0` | 2026-03-13 |
| `dynamo-protocols` | `1.1.0` | 2026-05-04 |
| `dynamo-tokenizers` | `1.2.0` | 2026-06-02 |
**`dynamo-async-openai`** is **deprecated**; **`1.0.2`** is its final crates.io release. Use **`dynamo-protocols`** for new dependencies ([crate](https://crates.io/crates/dynamo-protocols)).
### GitHub Releases
| Version | Release Date | GitHub | Docs | Notes |
|---------|--------------|--------|------|-------|
| `v1.2.0` | Jun 2, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.2.0-deepseek-v4-dev.3` | May 9, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.3) | — | Experimental (DeepSeek-V4-Flash / V4-Pro Blackwell preview; vLLM + SGLang containers only) |
| `v1.2.0-deepseek-v4-dev.2` | May 1, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.2) | — | Experimental (DeepSeek-V4-Flash / V4-Pro Blackwell preview; vLLM + SGLang containers only) |
| `v1.1.1` | May 5, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.1) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.1.0` | May 1, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.1.0-dev.3` | Apr 18, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.3) | — | Pre-Release (TRT-LLM Runtime Image + Wheels; see Pre-Release Artifacts) |
| `v1.1.0-dev.2` | Apr 9, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.2) | — | Pre-Release (SGLang + TRT-LLM Runtime Images + Wheels; see Pre-Release Artifacts) |
| `v1.1.0-dev.1` | Mar 17, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.1) | — | Experimental |
| `v1.0.2` | Apr 22, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.2) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.0.1` | Mar 16, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.1) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.0.0` | Mar 12, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.0) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v0.9.1` | Mar 4, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.9.1) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v0.9.0` | Feb 11, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.9.0) | Archived docs unavailable | |
| `v0.8.1` | Jan 23, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1) | Archived docs unavailable | |
| `v0.8.0` | Jan 15, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.0) | Archived docs unavailable | |
| `v0.7.1` | Dec 15, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.1) | Archived docs unavailable | |
| `v0.7.0` | Nov 26, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.0) | Archived docs unavailable | |
| `v0.6.1` | Nov 6, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.1) | — | |
| `v0.6.0` | Oct 28, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.0) | — | |
### Container Images
> **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> To access a specific version, append `?version=TAG` to the container URL:
> `https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/{container}?version={tag}`
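
The URL pattern above, together with the `nvcr.io/nvidia/ai-dynamo/` prefix used by the `docker pull` commands in this document, can be composed mechanically. The sketch below is a small illustration; `catalog_url` and `pull_ref` are hypothetical helper names, and neither function checks that the tag actually exists on NGC.

```python
from urllib.parse import quote

CATALOG = "https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers"
REGISTRY = "nvcr.io/nvidia/ai-dynamo"


def catalog_url(container: str, tag: str) -> str:
    """Build the NGC catalog page URL for one container tag."""
    return f"{CATALOG}/{container}?version={quote(tag, safe='')}"


def pull_ref(container: str, tag: str) -> str:
    """Build the image reference to pass to `docker pull`."""
    return f"{REGISTRY}/{container}:{tag}"


print(catalog_url("vllm-runtime", "1.2.0-cuda13"))
print(pull_ref("sglang-runtime", "1.2.0"))
```

For example, `pull_ref("tensorrtllm-runtime", "1.2.0-cuda13")` yields the same reference used in the Quick Install Commands section.
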
#### vllm-runtime
| Image:Tag | vLLM | Arch | CUDA | Notes |
|-----------|------|------|------|-------|
| `vllm-runtime:1.2.0` | `v0.20.1` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:1.2.0-cuda13` | `v0.20.1` | AMD64/ARM64 | `v13.0` | |
| `vllm-runtime:1.2.0-efa` | `v0.20.1` | AMD64/ARM64 | `v12.9` | Experimental |
| `vllm-runtime:1.1.1` | `v0.19.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:1.1.1-cuda13` | `v0.19.0` | AMD64/ARM64 | `v13.0` | |
| `vllm-runtime:1.1.1-efa-amd64` | `v0.19.0` | AMD64 | `v12.9` | Experimental |
| `vllm-runtime:1.1.0` | `v0.19.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:1.1.0-cuda13` | `v0.19.0` | AMD64/ARM64 | `v13.0` | |
| `vllm-runtime:1.1.0-efa-amd64` | `v0.19.0` | AMD64 | `v12.9` | Experimental |
| `vllm-runtime:1.0.2` | `v0.16.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:1.0.2-cuda13` | `v0.16.0` | AMD64/ARM64 | `v13.0` | |
| `vllm-runtime:1.0.2-efa-amd64` | `v0.16.0` | AMD64 | `v12.9` | Experimental |
| `vllm-runtime:1.0.1` | `v0.16.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:1.0.1-cuda13` | `v0.16.0` | AMD64/ARM64 | `v13.0` | |
| `vllm-runtime:1.0.1-efa-amd64` | `v0.16.0` | AMD64 | `v12.9` | Experimental |
| `vllm-runtime:1.0.0` | `v0.16.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:1.0.0-cuda13` | `v0.16.0` | AMD64/ARM64 | `v13.0` | |
| `vllm-runtime:1.0.0-efa-amd64` | `v0.16.0` | AMD64 | `v12.9` | Experimental |
| `vllm-runtime:0.9.1` | `v0.14.1` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:0.9.1-cuda13` | `v0.14.1` | AMD64/ARM64 | `v13.0` | Experimental |
| `vllm-runtime:0.9.0` | `v0.14.1` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:0.9.0-cuda13` | `v0.14.1` | AMD64/ARM64 | `v13.0` | Experimental |
| `vllm-runtime:0.8.1` | `v0.12.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:0.8.0` | `v0.12.0` | AMD64/ARM64 | `v12.9` | |
| `vllm-runtime:0.8.0-cuda13` | `v0.12.0` | AMD64/ARM64 | `v13.0` | Experimental |
| `vllm-runtime:0.7.0.post2` | `v0.11.2` | AMD64/ARM64 | `v12.8` | Patch |
| `vllm-runtime:0.7.1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | |
| `vllm-runtime:0.7.0.post1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | Patch |
| `vllm-runtime:0.7.0` | `v0.11.0` | AMD64/ARM64 | `v12.8` | |
| `vllm-runtime:0.6.1.post1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | Patch |
| `vllm-runtime:0.6.1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | |
| `vllm-runtime:0.6.0` | `v0.11.0` | AMD64 | `v12.8` | |
#### sglang-runtime
| Image:Tag | SGLang | Arch | CUDA | Notes |
|-----------|--------|------|------|-------|
| `sglang-runtime:1.2.0` | `v0.5.11` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:1.2.0-cuda13` | `v0.5.11` | AMD64/ARM64 | `v13.0` | |
| `sglang-runtime:1.2.0-efa` | `v0.5.11` | AMD64/ARM64 | `v12.9` | Experimental |
| `sglang-runtime:1.1.1` | `v0.5.10.post1` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:1.1.1-cuda13` | `v0.5.10.post1` | AMD64/ARM64 | `v13.0` | |
| `sglang-runtime:1.1.0` | `v0.5.10.post1` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:1.1.0-cuda13` | `v0.5.10.post1` | AMD64/ARM64 | `v13.0` | |
| `sglang-runtime:1.0.2` | `v0.5.9` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:1.0.2-cuda13` | `v0.5.9` | AMD64/ARM64 | `v13.0` | |
| `sglang-runtime:1.0.1` | `v0.5.9` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:1.0.1-cuda13` | `v0.5.9` | AMD64/ARM64 | `v13.0` | |
| `sglang-runtime:1.0.0` | `v0.5.9` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:1.0.0-cuda13` | `v0.5.9` | AMD64/ARM64 | `v13.0` | |
| `sglang-runtime:0.9.1` | `v0.5.8` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.9.1-cuda13` | `v0.5.8` | AMD64/ARM64 | `v13.0` | Experimental |
| `sglang-runtime:0.9.0` | `v0.5.8` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.9.0-cuda13` | `v0.5.8` | AMD64/ARM64 | `v13.0` | Experimental |
| `sglang-runtime:0.8.1` | `v0.5.6.post2` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.8.1-cuda13` | `v0.5.6.post2` | AMD64/ARM64 | `v13.0` | Experimental |
| `sglang-runtime:0.8.0` | `v0.5.6.post2` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.8.0-cuda13` | `v0.5.6.post2` | AMD64/ARM64 | `v13.0` | Experimental |
| `sglang-runtime:0.7.1` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.7.0.post1` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | Patch |
| `sglang-runtime:0.7.0` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.6.1.post1` | `v0.5.3.post2` | AMD64/ARM64 | `v12.9` | Patch |
| `sglang-runtime:0.6.1` | `v0.5.3.post2` | AMD64/ARM64 | `v12.9` | |
| `sglang-runtime:0.6.0` | `v0.5.3.post2` | AMD64 | `v12.8` | |
#### tensorrtllm-runtime
| Image:Tag | TRT-LLM | Arch | CUDA | Notes |
|-----------|---------|------|------|-------|
| `tensorrtllm-runtime:1.2.0-cuda13` | `v1.3.0rc14` | AMD64/ARM64 | `v13.1` | |
| `tensorrtllm-runtime:1.2.0-efa` | `v1.3.0rc14` | AMD64/ARM64 | `v13.1` | Experimental |
| `tensorrtllm-runtime:1.1.1` | `v1.3.0rc11` | AMD64/ARM64 | `v13.1` | |
| `tensorrtllm-runtime:1.1.1-efa-amd64` | `v1.3.0rc11` | AMD64 | `v13.1` | Experimental |
| `tensorrtllm-runtime:1.1.0` | `v1.3.0rc11` | AMD64/ARM64 | `v13.1` | |
| `tensorrtllm-runtime:1.1.0-efa-amd64` | `v1.3.0rc11` | AMD64 | `v13.1` | Experimental |
| `tensorrtllm-runtime:1.0.2` | `v1.3.0rc5.post1` | AMD64/ARM64 | `v13.1` | |
| `tensorrtllm-runtime:1.0.2-efa-amd64` | `v1.3.0rc5.post1` | AMD64 | `v13.1` | Experimental |
| `tensorrtllm-runtime:1.0.1` | `v1.3.0rc5.post1` | AMD64/ARM64 | `v13.1` | |
| `tensorrtllm-runtime:1.0.1-efa-amd64` | `v1.3.0rc5.post1` | AMD64 | `v13.1` | Experimental |
| `tensorrtllm-runtime:1.0.0` | `v1.3.0rc5.post1` | AMD64/ARM64 | `v13.1` | |
| `tensorrtllm-runtime:1.0.0-efa-amd64` | `v1.3.0rc5.post1` | AMD64 | `v13.1` | Experimental |
| `tensorrtllm-runtime:0.9.1` | `v1.3.0rc3` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.9.0` | `v1.3.0rc1` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.8.1.post3` | `v1.2.0rc6.post3` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.8.1.post1` | `v1.2.0rc6.post2` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.8.1` | `v1.2.0rc6.post1` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.8.0` | `v1.2.0rc6.post1` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.7.1` | `v1.2.0rc3` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.7.0.post2` | `v1.2.0rc2` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.7.0.post1` | `v1.2.0rc3` | AMD64/ARM64 | `v13.0` | Patch |
| `tensorrtllm-runtime:0.7.0` | `v1.2.0rc2` | AMD64/ARM64 | `v13.0` | |
| `tensorrtllm-runtime:0.6.1-cuda13` | `v1.2.0rc1` | AMD64/ARM64 | `v13.0` | Experimental |
| `tensorrtllm-runtime:0.6.1.post1` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | Patch |
| `tensorrtllm-runtime:0.6.1` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | |
| `tensorrtllm-runtime:0.6.0` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | |
#### dynamo-frontend
| Image:Tag | Arch | Notes |
|-----------|------|-------|
| `dynamo-frontend:1.2.0` | AMD64/ARM64 | |
| `dynamo-frontend:1.1.1` | AMD64/ARM64 | |
| `dynamo-frontend:1.1.0` | AMD64/ARM64 | |
| `dynamo-frontend:1.0.2` | AMD64/ARM64 | |
| `dynamo-frontend:1.0.1` | AMD64/ARM64 | |
| `dynamo-frontend:1.0.0` | AMD64/ARM64 | |
| `dynamo-frontend:0.9.1` | AMD64/ARM64 | |
| `dynamo-frontend:0.9.0` | AMD64/ARM64 | |
| `dynamo-frontend:0.8.1` | AMD64/ARM64 | |
| `dynamo-frontend:0.8.0` | AMD64/ARM64 | Initial |
#### kubernetes-operator
| Image:Tag | Arch | Notes |
|-----------|------|-------|
| `kubernetes-operator:1.2.0` | AMD64/ARM64 | |
| `kubernetes-operator:1.1.1` | AMD64/ARM64 | |
| `kubernetes-operator:1.1.0` | AMD64/ARM64 | |
| `kubernetes-operator:1.0.2` | AMD64/ARM64 | |
| `kubernetes-operator:1.0.1` | AMD64/ARM64 | |
| `kubernetes-operator:1.0.0` | AMD64/ARM64 | |
| `kubernetes-operator:0.9.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.9.0` | AMD64/ARM64 | |
| `kubernetes-operator:0.8.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.8.0` | AMD64/ARM64 | |
| `kubernetes-operator:0.7.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.7.0.post1` | AMD64/ARM64 | Patch |
| `kubernetes-operator:0.7.0` | AMD64/ARM64 | |
| `kubernetes-operator:0.6.1` | AMD64/ARM64 | |
| `kubernetes-operator:0.6.0` | AMD64/ARM64 | |
#### dynamo-planner
| Image:Tag | Arch | Notes |
|-----------|------|-------|
| `dynamo-planner:1.2.0` | AMD64/ARM64 | |
| `dynamo-planner:1.1.1` | AMD64/ARM64 | |
| `dynamo-planner:1.1.0` | AMD64/ARM64 | New |
#### snapshot-agent
| Image:Tag | Arch | Notes |
|-----------|------|-------|
| `snapshot-agent:1.2.0` | AMD64/ARM64 | Preview |
| `snapshot-agent:1.1.1` | AMD64/ARM64 | Preview |
| `snapshot-agent:1.1.0` | AMD64/ARM64 | Preview |
| `snapshot-agent:1.0.2` | AMD64/ARM64 | Preview |
| `snapshot-agent:1.0.1` | AMD64/ARM64 | Preview |
| `snapshot-agent:1.0.0` | AMD64/ARM64 | Preview |
### Python Wheels
> **PyPI:** [ai-dynamo](https://pypi.org/project/ai-dynamo/) | [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/) | [kvbm](https://pypi.org/project/kvbm/)
>
> To access a specific version: `https://pypi.org/project/{package}/{version}/`
#### ai-dynamo (wheel)
| Package | Python | Platform | Notes |
|---------|--------|----------|-------|
| `ai-dynamo==1.2.0.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==1.1.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==1.1.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==1.0.2` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==1.0.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==1.0.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.9.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.9.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.8.1.post3` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post3` |
| `ai-dynamo==0.8.1.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post2` |
| `ai-dynamo==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.6.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo==0.6.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
#### ai-dynamo-runtime (wheel)
| Package | Python | Platform | Notes |
|---------|--------|----------|-------|
| `ai-dynamo-runtime==1.2.0.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==1.1.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==1.1.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==1.0.2` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==1.0.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==1.0.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.9.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.9.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.8.1.post3` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post3` |
| `ai-dynamo-runtime==0.8.1.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post2` |
| `ai-dynamo-runtime==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.6.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `ai-dynamo-runtime==0.6.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
#### kvbm (wheel)
| Package | Python | Platform | Notes |
|---------|--------|----------|-------|
| `kvbm==1.2.0.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==1.1.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==1.1.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==1.0.2` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==1.0.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==1.0.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.9.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.9.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | |
| `kvbm==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | Initial |
### Helm Charts
> **NGC Helm Registry:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> Direct download: `https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{chart}-{version}.tgz`
#### dynamo-crds (Helm chart) -- Deprecated
The `dynamo-crds` Helm chart is deprecated as of v1.0.0. CRDs are now managed by the Dynamo Operator.
| Chart | Notes |
|-------|-------|
| `dynamo-crds-0.9.1` | Last release |
| `dynamo-crds-0.9.0` | |
| `dynamo-crds-0.8.1` | |
| `dynamo-crds-0.8.0` | |
| `dynamo-crds-0.7.1` | |
| `dynamo-crds-0.7.0` | |
| `dynamo-crds-0.6.1` | |
| `dynamo-crds-0.6.0` | |
#### dynamo-platform (Helm chart)
| Chart | Notes |
|-------|-------|
| `dynamo-platform-1.2.0` | |
| `dynamo-platform-1.1.1` | |
| `dynamo-platform-1.1.0` | |
| `dynamo-platform-1.0.2` | |
| `dynamo-platform-1.0.1` | |
| `dynamo-platform-1.0.0` | |
| `dynamo-platform-0.9.1` | |
| `dynamo-platform-0.9.0-post1` | Helm fix: operator image tag |
| `dynamo-platform-0.9.0` | |
| `dynamo-platform-0.8.1` | |
| `dynamo-platform-0.8.0` | |
| `dynamo-platform-0.7.1` | |
| `dynamo-platform-0.7.0` | |
| `dynamo-platform-0.6.1` | |
| `dynamo-platform-0.6.0` | |
#### snapshot (Helm chart)
| Chart | Notes |
|-------|-------|
| `snapshot-1.2.0` | Preview |
| `snapshot-1.1.1` | Preview |
| `snapshot-1.1.0` | Preview |
| `snapshot-1.0.2` | Preview |
| `snapshot-1.0.1` | Preview |
| `snapshot-1.0.0` | Preview |
#### dynamo-graph (Helm chart) -- Deprecated
The `dynamo-graph` Helm chart is deprecated as of v0.9.0.
| Chart | Notes |
|-------|-------|
| `dynamo-graph-0.8.1` | Last release |
| `dynamo-graph-0.8.0` | |
| `dynamo-graph-0.7.1` | |
| `dynamo-graph-0.7.0` | |
| `dynamo-graph-0.6.1` | |
| `dynamo-graph-0.6.0` | |
### Rust Crates
> **crates.io:** [dynamo-runtime](https://crates.io/crates/dynamo-runtime) | [dynamo-llm](https://crates.io/crates/dynamo-llm) | [dynamo-protocols](https://crates.io/crates/dynamo-protocols) | [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai) *(deprecated)* | [dynamo-parsers](https://crates.io/crates/dynamo-parsers) | [dynamo-memory](https://crates.io/crates/dynamo-memory) | [dynamo-config](https://crates.io/crates/dynamo-config) | [dynamo-tokenizers](https://crates.io/crates/dynamo-tokenizers) | [dynamo-tokens](https://crates.io/crates/dynamo-tokens)
>
> To access a specific version: `https://crates.io/crates/{crate}/{version}`
#### dynamo-runtime (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-runtime@1.2.0` | `v1.82` | |
| `dynamo-runtime@1.1.1` | `v1.82` | |
| `dynamo-runtime@1.1.0` | `v1.82` | |
| `dynamo-runtime@1.0.2` | `v1.82` | |
| `dynamo-runtime@1.0.1` | `v1.82` | |
| `dynamo-runtime@1.0.0` | `v1.82` | |
| `dynamo-runtime@0.9.1` | `v1.82` | |
| `dynamo-runtime@0.9.0` | `v1.82` | |
| `dynamo-runtime@0.8.1` | `v1.82` | |
| `dynamo-runtime@0.8.0` | `v1.82` | |
| `dynamo-runtime@0.7.1` | `v1.82` | |
| `dynamo-runtime@0.7.0` | `v1.82` | |
| `dynamo-runtime@0.6.1` | `v1.82` | |
| `dynamo-runtime@0.6.0` | `v1.82` | |
#### dynamo-llm (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-llm@1.2.0` | `v1.82` | |
| `dynamo-llm@1.1.1` | `v1.82` | |
| `dynamo-llm@1.1.0` | `v1.82` | |
| `dynamo-llm@1.0.2` | `v1.82` | |
| `dynamo-llm@1.0.1` | `v1.82` | |
| `dynamo-llm@1.0.0` | `v1.82` | |
| `dynamo-llm@0.9.1` | `v1.82` | |
| `dynamo-llm@0.9.0` | `v1.82` | |
| `dynamo-llm@0.8.1` | `v1.82` | |
| `dynamo-llm@0.8.0` | `v1.82` | |
| `dynamo-llm@0.7.1` | `v1.82` | |
| `dynamo-llm@0.7.0` | `v1.82` | |
| `dynamo-llm@0.6.1` | `v1.82` | |
| `dynamo-llm@0.6.0` | `v1.82` | |
#### dynamo-protocols (crate)
On crates.io, **`dynamo-protocols`** first became installable at **`1.1.0`**. The placeholder reservation **`0.0.0-prerelease.0`** is omitted here, as are other **`0.0.0-prerelease.*`** uploads. Earlier releases of the OpenAI-compatible client shipped under **`dynamo-async-openai`**; see **`#### dynamo-async-openai (crate)`** below.
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-protocols@1.2.0` | `v1.82` | |
| `dynamo-protocols@1.1.1` | `v1.82` | |
| `dynamo-protocols@1.1.0` | `v1.82` | |
#### dynamo-async-openai (crate)
**Deprecated.** Prefer **`dynamo-protocols`**. This crate remains published on crates.io for manifests pinned to the old package name.
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-async-openai@1.0.2` | `v1.82` | Final crates.io release |
| `dynamo-async-openai@1.0.1` | `v1.82` | |
| `dynamo-async-openai@1.0.0` | `v1.82` | |
| `dynamo-async-openai@0.9.1` | `v1.82` | |
| `dynamo-async-openai@0.9.0` | `v1.82` | |
| `dynamo-async-openai@0.8.1` | `v1.82` | |
| `dynamo-async-openai@0.8.0` | `v1.82` | |
| `dynamo-async-openai@0.7.1` | `v1.82` | |
| `dynamo-async-openai@0.7.0-post1` | `v1.82` | |
| `dynamo-async-openai@0.7.0` | `v1.82` | |
| `dynamo-async-openai@0.6.1` | `v1.82` | |
| `dynamo-async-openai@0.6.0` | `v1.82` | |
| `dynamo-async-openai@0.5.1` | `v1.82` | |
| `dynamo-async-openai@0.5.0` | `v1.82` | |
| `dynamo-async-openai@0.4.1` | `v1.82` | |
#### dynamo-parsers (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-parsers@1.2.0` | `v1.82` | |
| `dynamo-parsers@1.1.1` | `v1.82` | |
| `dynamo-parsers@1.1.0` | `v1.82` | |
| `dynamo-parsers@1.0.2` | `v1.82` | |
| `dynamo-parsers@1.0.1` | `v1.82` | |
| `dynamo-parsers@1.0.0` | `v1.82` | |
| `dynamo-parsers@0.9.1` | `v1.82` | |
| `dynamo-parsers@0.9.0` | `v1.82` | |
| `dynamo-parsers@0.8.1` | `v1.82` | |
| `dynamo-parsers@0.8.0` | `v1.82` | |
| `dynamo-parsers@0.7.1` | `v1.82` | |
| `dynamo-parsers@0.7.0` | `v1.82` | |
| `dynamo-parsers@0.6.1` | `v1.82` | |
| `dynamo-parsers@0.6.0` | `v1.82` | |
#### dynamo-memory (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-memory@1.2.0` | `v1.82` | |
| `dynamo-memory@1.1.1` | `v1.82` | |
| `dynamo-memory@1.1.0` | `v1.82` | |
| `dynamo-memory@1.0.2` | `v1.82` | |
| `dynamo-memory@1.0.1` | `v1.82` | |
| `dynamo-memory@1.0.0` | `v1.82` | |
| `dynamo-memory@0.9.1` | `v1.82` | |
| `dynamo-memory@0.9.0` | `v1.82` | |
| `dynamo-memory@0.8.1` | `v1.82` | |
| `dynamo-memory@0.8.0` | `v1.82` | Initial |
#### dynamo-config (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-config@1.2.0` | `v1.82` | |
| `dynamo-config@1.1.1` | `v1.82` | |
| `dynamo-config@1.1.0` | `v1.82` | |
| `dynamo-config@1.0.2` | `v1.82` | |
| `dynamo-config@1.0.1` | `v1.82` | |
| `dynamo-config@1.0.0` | `v1.82` | |
| `dynamo-config@0.9.1` | `v1.82` | |
| `dynamo-config@0.9.0` | `v1.82` | |
| `dynamo-config@0.8.1` | `v1.82` | |
| `dynamo-config@0.8.0` | `v1.82` | Initial |
#### dynamo-tokenizers (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-tokenizers@1.2.0` | `v1.82` | Initial |
#### dynamo-tokens (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-tokens@1.2.0` | `v1.82` | |
| `dynamo-tokens@1.1.1` | `v1.82` | |
| `dynamo-tokens@1.1.0` | `v1.82` | |
| `dynamo-tokens@1.0.2` | `v1.82` | |
| `dynamo-tokens@1.0.1` | `v1.82` | |
| `dynamo-tokens@1.0.0` | `v1.82` | |
| `dynamo-tokens@0.9.1` | `v1.82` | |
| `dynamo-tokens@0.9.0` | `v1.82` | Initial |
#### dynamo-mocker (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-mocker@1.2.0` | `v1.82` | |
| `dynamo-mocker@1.1.1` | `v1.82` | |
| `dynamo-mocker@1.1.0` | `v1.82` | |
| `dynamo-mocker@1.0.2` | `v1.82` | |
| `dynamo-mocker@1.0.1` | `v1.82` | |
| `dynamo-mocker@1.0.0` | `v1.82` | Initial |
#### dynamo-kv-router (crate)
| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-kv-router@1.2.0` | `v1.82` | |
| `dynamo-kv-router@1.1.1` | `v1.82` | |
| `dynamo-kv-router@1.1.0` | `v1.82` | |
| `dynamo-kv-router@1.0.2` | `v1.82` | |
| `dynamo-kv-router@1.0.1` | `v1.82` | |
| `dynamo-kv-router@1.0.0` | `v1.82` | Initial |
---
## Pre-Release Artifacts
**Pre-Release artifacts do not go through QA validation.** Pre-release versions are experimental previews intended for early testing and feedback. They may contain bugs, breaking changes, or incomplete features. Use stable releases for production workloads.
**Pre-Release Python Wheels** are published on the NVIDIA package index at [pypi.nvidia.com](https://pypi.nvidia.com/), not on the public [PyPI](https://pypi.org/) index. Like stable wheels, they are **Linux (manylinux) builds** for the Python versions in the [Support Matrix](/dynamo/resources/support-matrix); `pip`/`uv` on macOS or Windows will not find matching wheels. Install on a supported Linux host or inside a Linux container.
Install by adding that URL as an extra index and allowing pre-releases (PEP 440 dev versions):
```bash
# uv (recommended in other Dynamo docs)
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev2
# pip
pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev2
```
A GitHub tag `v1.1.0-dev.N` (container tag `1.1.0-dev.N`) maps to the wheel version `1.1.0.devN` (for example, `v1.1.0-dev.2` → `==1.1.0.dev2`). Optional extras such as `ai-dynamo[vllm]` use the same flags; pin the version you want from the sections below.
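The tag-to-wheel mapping can be scripted when pinning versions in CI. The following is a minimal sketch, not part of Dynamo: the helper name `tag_to_wheel_version` is hypothetical, and it only handles plain `X.Y.Z-dev.N` tags (feature tags such as `v1.2.0-deepseek-v4-dev.3` are rejected, since no wheels are listed for them).

```python
import re

# Matches plain dev tags such as "v1.1.0-dev.2" or "1.1.0-dev.2" (leading "v" optional).
_DEV_TAG = re.compile(r"^v?(\d+\.\d+\.\d+)-dev\.(\d+)$")


def tag_to_wheel_version(tag: str) -> str:
    """Convert an `X.Y.Z-dev.N` tag into the PEP 440 dev version `X.Y.Z.devN`."""
    match = _DEV_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a plain dev tag: {tag!r}")
    base, dev_number = match.groups()
    return f"{base}.dev{dev_number}"


print(tag_to_wheel_version("v1.1.0-dev.2"))  # 1.1.0.dev2
```

The result can be appended to a requirement string, for example `f"ai-dynamo=={tag_to_wheel_version(tag)}"`.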
### v1.2.0-deepseek-v4-dev.3
- **Branch:** [release/1.2.0-deepseek-v4-dev.3](https://github.com/ai-dynamo/dynamo/tree/release/1.2.0-deepseek-v4-dev.3)
- **GitHub Tag:** [v1.2.0-deepseek-v4-dev.3](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.3)
- **Backends:** vLLM `v0.20.1` (DSv4 stabilization patch over `v0.20.0` native DSv4 support) | SGLang upstream `lmsysorg/sglang:deepseek-v4-blackwell` preview (refreshed for dev.3) | NIXL `v0.10.1`
- **Coverage:** Partial -- DeepSeek-V4-Flash and V4-Pro only. vLLM and SGLang containers are published for Blackwell (B200 plus GB200); no TensorRT-LLM container, no other component containers, no Helm charts, no wheels. Snapshot dev build for early-access V4 model support; not QA-gated.
#### Container Images
| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.3` | vLLM `v0.20.1` | `v13.0` | AMD64/ARM64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.3` | SGLang upstream DSv4 preview | `v12.9` | AMD64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.3` | SGLang upstream DSv4 preview | `v13.0` | ARM64 |
#### Python Wheels
Not published for this dev release. Use the `v1.1.1` wheels or `v1.1.0-dev.3` from [pypi.nvidia.com](https://pypi.nvidia.com/).
#### Helm Charts
Not published for this dev release. Use `v1.1.1` charts for platform install.
#### Rust Crates
Not shipped for pre-release versions.
### v1.2.0-deepseek-v4-dev.2
- **Branch:** [release/1.2.0-deepseek-v4-dev.2](https://github.com/ai-dynamo/dynamo/tree/release/1.2.0-deepseek-v4-dev.2)
- **GitHub Tag:** [v1.2.0-deepseek-v4-dev.2](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.2)
- **Backends:** vLLM `v0.20.0` (native DeepSeek-V4 support) | SGLang upstream `lmsysorg/sglang:deepseek-v4-blackwell` preview | NIXL `v0.10.1`
- **Coverage:** DeepSeek-V4-Flash and V4-Pro only. vLLM and SGLang containers are published for Blackwell. TensorRT-LLM container, other component containers, Helm charts, and wheels are not published for this tag. Snapshot dev build for early-access V4 model support; not QA-gated.
#### Container Images
| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2` | vLLM `v0.20.0` | `v13.0` | AMD64/ARM64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2` | SGLang upstream DSv4 preview | `v12.9` | AMD64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2` | SGLang upstream DSv4 preview | `v13.0` | ARM64 |
#### Python Wheels
Not published for this dev release. Use the `v1.1.0` wheels or `v1.1.0-dev.3` from [pypi.nvidia.com](https://pypi.nvidia.com/).
#### Helm Charts
Not published for this dev release. Use `v1.1.0` charts for platform install.
#### Rust Crates
Not shipped for pre-release versions.
### v1.1.0-dev.3
- **Branch:** [release/1.1.0-dev.3](https://github.com/ai-dynamo/dynamo/tree/release/1.1.0-dev.3)
- **GitHub Tag:** [v1.1.0-dev.3](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.3)
- **Backends (branch ToT):** SGLang `v0.5.10.post1` | TensorRT-LLM `v1.3.0rc11` | vLLM `v0.19.0` | NIXL `v0.10.1`
- **Coverage:** TensorRT-LLM runtime container plus **`ai-dynamo`** and **`ai-dynamo-runtime`** wheels on [pypi.nvidia.com](https://pypi.nvidia.com/). SGLang and vLLM containers, component containers (`dynamo-frontend`, `dynamo-planner`, `kubernetes-operator`, `snapshot-agent`), **`kvbm`** wheel, and Helm charts are not published for this tag.
#### Container Images
| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `tensorrtllm-runtime:1.1.0-dev.3` | TRT-LLM `v1.3.0rc11` | `v13.1` | AMD64/ARM64 |
#### Python Wheels
Available from [pypi.nvidia.com](https://pypi.nvidia.com/) (pre-release index):
```bash
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev3
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo-runtime==1.1.0.dev3
```
`kvbm==1.1.0.dev3` is not yet published.
#### Helm Charts
Not published for this dev release. Use the latest stable (`v1.1.0`) for platform install.
#### Rust Crates
Not shipped for pre-release versions.
### v1.1.0-dev.2
- **Branch:** [release/1.1.0-dev.2](https://github.com/ai-dynamo/dynamo/tree/release/1.1.0-dev.2)
- **GitHub Tag:** [v1.1.0-dev.2](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.2)
- **Backends (branch ToT):** SGLang `v0.5.9` | TensorRT-LLM `v1.3.0rc9` | vLLM `v0.19.0` | NIXL `v0.10.1`
- **Coverage:** SGLang and TensorRT-LLM runtime containers plus **`ai-dynamo`** and **`ai-dynamo-runtime`** wheels on [pypi.nvidia.com](https://pypi.nvidia.com/). vLLM runtime container, component containers (`dynamo-frontend`, `dynamo-planner`, `kubernetes-operator`, `snapshot-agent`), **`kvbm`** wheel, and Helm charts are not published for this tag.
#### Container Images
| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `sglang-runtime:1.1.0-dev.2` | SGLang `v0.5.9` | `v12.9` | AMD64/ARM64 |
| `tensorrtllm-runtime:1.1.0-dev.2` | TRT-LLM `v1.3.0rc9` | `v13.1` | AMD64/ARM64 |
#### Python Wheels
Available from [pypi.nvidia.com](https://pypi.nvidia.com/) (pre-release index):
```bash
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev2
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo-runtime==1.1.0.dev2
```
#### Helm Charts
Not published for this dev release. Use the latest stable (`v1.1.0`) for platform install.
#### Rust Crates
Not shipped for pre-release versions.
### v1.1.0-dev.1
- **Branch:** [release/1.1.0-dev.1](https://github.com/ai-dynamo/dynamo/tree/release/1.1.0-dev.1)
- **GitHub Tag:** [v1.1.0-dev.1](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.1)
- **Backends:** SGLang `v0.5.9` | TensorRT-LLM `v1.3.0rc5.post1` | vLLM `v0.17.1` | NIXL `v0.10.1`
#### Container Images
| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `vllm-runtime:1.1.0-dev.1` | vLLM `v0.17.1` | `v12.9` | AMD64/ARM64 |
| `vllm-runtime:1.1.0-dev.1-cuda13` | vLLM `v0.17.1` | `v13.0` | AMD64/ARM64 |
| `vllm-runtime:1.1.0-dev.1-efa-amd64` | vLLM `v0.17.1` | `v12.9` | AMD64 |
| `sglang-runtime:1.1.0-dev.1` | SGLang `v0.5.9` | `v12.9` | AMD64/ARM64 |
| `sglang-runtime:1.1.0-dev.1-cuda13` | SGLang `v0.5.9` | `v13.0` | AMD64/ARM64 |
| `tensorrtllm-runtime:1.1.0-dev.1` | TRT-LLM `v1.3.0rc5.post1` | `v13.1` | AMD64/ARM64 |
| `tensorrtllm-runtime:1.1.0-dev.1-efa-amd64` | TRT-LLM `v1.3.0rc5.post1` | `v13.1` | AMD64 |
| `dynamo-frontend:1.1.0-dev.1` | — | — | AMD64/ARM64 |
| `kubernetes-operator:1.1.0-dev.1` | — | — | AMD64/ARM64 |
| `snapshot-agent:1.1.0-dev.1` | — | — | AMD64/ARM64 |
#### Python Wheels
Available from [pypi.nvidia.com](https://pypi.nvidia.com/) (pre-release index):
```bash
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev1
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo-runtime==1.1.0.dev1
```
#### Helm Charts
| Chart | NGC |
|-------|-----|
| `dynamo-platform-1.1.0-dev.1` | [NGC Helm: dynamo-platform 1.1.0-dev.1](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform?version=1.1.0-dev.1) |
| `snapshot-1.1.0-dev.1` | [NGC Helm: snapshot 1.1.0-dev.1](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/snapshot?version=1.1.0-dev.1) |
#### Rust Crates
Not shipped for pre-release versions.
# Examples
The examples below assume you build the latest image yourself from source. If using a prebuilt image, follow the examples from the corresponding branch.
## Hello World
Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph.
[View Hello World Example](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/custom_backend/hello_world)
## vLLM
Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with vLLM.
[View vLLM Backend Guide](/dynamo/backends/v-llm)
## SGLang
Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.
[View SGLang Backend Guide](/dynamo/backends/sg-lang)
## TensorRT-LLM
Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.
[View TensorRT-LLM Backend Guide](/dynamo/backends/tensor-rt-llm)
# Glossary
## B
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
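As an illustration of block-level bookkeeping, the sketch below splits a token sequence into fixed-size blocks and gives each block a hash chained to its predecessor. The chained-hash scheme and the function names are assumptions made for this example; they are not taken from Dynamo's implementation.

```python
import hashlib
from typing import List, Sequence


def split_into_blocks(token_ids: Sequence[int], block_size: int = 16) -> List[List[int]]:
    """Return only the full blocks; a trailing partial block is left out."""
    full_length = len(token_ids) - len(token_ids) % block_size
    return [list(token_ids[i:i + block_size]) for i in range(0, full_length, block_size)]


def chained_block_hashes(blocks: Sequence[Sequence[int]]) -> List[str]:
    """Hash each block together with its parent's hash.

    Two sequences then share a block hash only when they share the whole
    prefix up to that block (barring hash collisions).
    """
    hashes = []
    parent = b""
    for block in blocks:
        payload = parent + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


prompt_a = list(range(40))
prompt_b = list(range(32)) + [7] * 8  # same first 32 tokens, different tail
hashes_a = chained_block_hashes(split_into_blocks(prompt_a))
hashes_b = chained_block_hashes(split_into_blocks(prompt_b))
print(len(hashes_a), hashes_a == hashes_b)  # 2 True
```

Because only full blocks are hashed, the differing eight-token tails do not affect the comparison: both prompts map to the same two block hashes.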
## C
**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).
**Conditional Disaggregation** - Dynamo's intelligent decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine based on prefill length and queue status.
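A conditional disaggregation decision can be pictured as a small predicate over prefill length and queue status. The sketch below is hypothetical: the glossary only states which inputs are used, so the threshold values, the field names, and the exact rule (long prompt and a prefill queue with room) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggPolicy:
    max_local_prefill_length: int = 512  # hypothetical threshold, in tokens
    max_prefill_queue_size: int = 4      # hypothetical queue depth limit


def should_prefill_remotely(
    prefill_length: int,
    prefill_queue_size: int,
    policy: DisaggPolicy = DisaggPolicy(),
) -> bool:
    """Send long prompts to a remote prefill engine unless its queue is already full."""
    long_prompt = prefill_length > policy.max_local_prefill_length
    queue_has_room = prefill_queue_size < policy.max_prefill_queue_size
    return long_prompt and queue_has_room


print(should_prefill_remotely(prefill_length=2000, prefill_queue_size=0))   # True
print(should_prefill_remotely(prefill_length=100, prefill_queue_size=0))    # False
print(should_prefill_remotely(prefill_length=2000, prefill_queue_size=10))  # False
```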
## D
**Decode Phase** - The second phase of LLM inference that generates output tokens one at a time.
**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.
**Discovery Plane** - The service discovery layer where components (frontend, router, and workers) register services, discover services, and watch for new service lifecycle events at runtime using Kubernetes or etcd backends.
**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.
**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.
**Dynamo Kubernetes Platform** - A Kubernetes platform providing a managed deployment experience for Dynamo inference graphs.
## E
**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.
**Event Plane** - The pub/sub layer for KV cache updates, worker metrics, and sequence tracking; it supports KV-aware routing and disaggregated serving architectures.
## F
**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.
## G
**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.
## I
**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.
**Inter-Token Latency (ITL)** - The latency between consecutive output tokens during the decode phase; typically paired with TTFT to define performance SLAs.
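ITL and TTFT (defined under T) can both be derived from per-token timestamps. The sketch below assumes you have recorded the request arrival time and the emission time of each output token, in seconds; the function names are illustrative only.

```python
from typing import List, Sequence


def time_to_first_token(request_time: float, token_times: Sequence[float]) -> float:
    """Latency from request arrival to the first output token."""
    return token_times[0] - request_time


def inter_token_latencies(token_times: Sequence[float]) -> List[float]:
    """Gaps between consecutive output tokens during decode."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def mean_itl(token_times: Sequence[float]) -> float:
    gaps = inter_token_latencies(token_times)
    if not gaps:
        raise ValueError("need at least two tokens to compute ITL")
    return sum(gaps) / len(gaps)


request_time = 10.0
token_times = [10.5, 10.75, 11.0, 11.5]
print(time_to_first_token(request_time, token_times))  # 0.5
print(inter_token_latencies(token_times))              # [0.25, 0.25, 0.5]
```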
## K
**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.
**KV Cache** - Key-Value cache that stores computed attention states from previous tokens to avoid recomputation during inference.
**KV Router** - Dynamo's intelligent routing system that directs requests to workers with the highest cache overlap to maximize KV cache reuse. Determines routing based on KV cache hit rates and worker metrics.
**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers using a prefix tree structure to calculate cache hit rates.
**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.
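The interplay of KV Router, KVIndexer, and KVPublisher can be sketched with a toy prefix tree keyed by block hashes. This is an illustration under simplifying assumptions, not Dynamo's data structure: the class and function names are hypothetical, only "stored" events are modeled (no removals), and worker load metrics are ignored when picking a worker.

```python
from typing import Dict, Iterable, List


class PrefixIndex:
    """Toy prefix tree: each node is one block hash and records which workers cache it."""

    def __init__(self) -> None:
        self._root = {"children": {}, "workers": set()}

    def store(self, worker: str, block_hashes: Iterable[str]) -> None:
        """Apply a 'stored' event: the worker now caches this chain of blocks."""
        node = self._root
        for block_hash in block_hashes:
            node = node["children"].setdefault(
                block_hash, {"children": {}, "workers": set()}
            )
            node["workers"].add(worker)

    def overlap(self, block_hashes: Iterable[str]) -> Dict[str, int]:
        """Return, per worker, how many leading blocks of the request are already cached."""
        scores: Dict[str, int] = {}
        node = self._root
        for depth, block_hash in enumerate(block_hashes, start=1):
            node = node["children"].get(block_hash)
            if node is None:
                break
            for worker in node["workers"]:
                scores[worker] = depth
        return scores


def pick_worker(index: PrefixIndex, workers: List[str], block_hashes: List[str]) -> str:
    """Choose the worker with the highest overlap; ties go to the alphabetically first."""
    scores = index.overlap(block_hashes)
    return max(sorted(workers), key=lambda worker: scores.get(worker, 0))


index = PrefixIndex()
index.store("worker-a", ["h1", "h2", "h3"])
index.store("worker-b", ["h1", "x"])
print(pick_worker(index, ["worker-a", "worker-b"], ["h1", "h2", "h9"]))  # worker-a
```

A production router would combine the overlap score with worker metrics such as load, as the KV Router entry notes.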
**KV Transfer Policy** - Kubernetes DGD policy under `spec.experimental.kvTransferPolicy` that tells Dynamo which worker topology domain to use for disaggregated prefill-to-decode KV-cache transfer routing.
## L
**LoRA (Low-Rank Adaptation)** - A fine-tuning technique for serving specialized model variants without duplicating full model weights. Dynamo supports dynamic loading and serving of LoRA adapters at runtime using worker APIs (for example, to load and unload adapters, or to discover them through `/v1/models`).
## M
**Model Deployment Card (MDC)** - A configuration structure containing all information required for distributed model serving. When a worker loads a model, it creates an MDC containing references to components such as the tokenizer, templates, and runtime config. Workers publish their MDC to make the model discoverable to frontends. Frontends use the MDC to configure request preprocessing (tokenization, prompt formatting).
## N
**Namespace** - Dynamo's logical grouping mechanism for related components. Similar to directories in a file system, namespaces prevent collisions between different deployments.
**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.
## P
**PagedAttention** - Memory management technique from vLLM that efficiently manages KV cache by chunking requests into blocks.
**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.
**Prefill Phase** - The first phase of LLM inference that processes the input prompt and generates KV cache.
**Prefix Caching** - Optimization technique that reuses previously computed KV cache for common prompt prefixes.
**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.
**Profiler** - Dynamo component that analyzes model performance to determine optimal engine configurations, including disagg/agg, parallelization mapping (TP, TEP, DEP), and other engine knobs (batch size, max num tokens), feeding the Planner for SLA-driven autoscaling.
## R
**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.
**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.
**Request Plane** - The transport layer that transmits RPCs between components (frontend-to-worker or router-to-router) utilizing one of these protocols: TCP, HTTP, or NATS.
## S
**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.
**Speculative Decoding** - An optimization where a draft model proposes tokens for parallel verification by the main model; reduces latency (for example, vLLM with Eagle).
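The propose-then-verify loop can be sketched with two toy "models" that map a token context to the next token. This is a simplified greedy variant written for illustration: real engines verify all draft tokens in one batched forward pass and may use probabilistic acceptance, and every name below is hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    num_draft_tokens: int = 4,
) -> List[int]:
    """Return the tokens accepted in one speculative step (greedy verification)."""
    # 1. The draft model proposes tokens autoregressively.
    proposed: List[int] = []
    context = list(prefix)
    for _ in range(num_draft_tokens):
        token = draft_next(context)
        proposed.append(token)
        context.append(token)

    # 2. The target model checks each proposal; on the first mismatch it
    #    substitutes its own token and the step ends.
    accepted: List[int] = []
    context = list(prefix)
    for token in proposed:
        expected = target_next(context)
        if expected != token:
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. All proposals matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def target(context: List[int]) -> int:
    return (context[-1] + 1) % 10  # toy target: count upward modulo 10


def draft(context: List[int]) -> int:
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10  # wrong after a 3


print(speculative_step([1], draft, target))   # [2, 3, 4]
print(speculative_step([1], target, target))  # [2, 3, 4, 5, 6]
```

In both cases the accepted tokens equal what the target model alone would have generated greedily; the draft only changes how many tokens are produced per step.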
## T
**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.
**TensorRT-LLM** - NVIDIA's optimized LLM inference engine with multinode MPI distributed support.
**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
**Topology-Aware KV Transfer** - Dynamo routing behavior that constrains or biases decode worker selection toward workers sharing the selected prefill worker's topology domain.
**Topology Domain** - A logical level in the cluster topology, such as `zone` or `rack`. For topology-aware KV transfer, workers publish domain values in `ModelRuntimeConfig.topology_domains`.
**Topology Taint** - A canonical worker taint generated from topology metadata in the form `dynamo.topology/=`. The router uses these taints through normal `RoutingConstraints`.
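The difference between constraining and biasing decode worker selection can be shown with plain dictionaries. The sketch below is an assumption-laden illustration: the worker records and the `strict` flag are hypothetical, and only the `topology_domains` field name comes from the Topology Domain entry above.

```python
from typing import Dict, List

Worker = Dict[str, object]


def decode_candidates(
    prefill_worker: Worker,
    decode_workers: List[Worker],
    domain: str,
    strict: bool = True,
) -> List[Worker]:
    """Order decode workers by whether they share the prefill worker's domain value.

    strict=True  constrains selection to workers in the same domain.
    strict=False biases selection: same-domain workers come first, others follow.
    """
    wanted = prefill_worker["topology_domains"].get(domain)
    same, other = [], []
    for worker in decode_workers:
        value = worker["topology_domains"].get(domain)
        if wanted is not None and value == wanted:
            same.append(worker)
        else:
            other.append(worker)
    return same if strict else same + other


prefill = {"name": "p0", "topology_domains": {"zone": "z1", "rack": "r1"}}
decoders = [
    {"name": "d0", "topology_domains": {"zone": "z1", "rack": "r2"}},
    {"name": "d1", "topology_domains": {"zone": "z2", "rack": "r9"}},
    {"name": "d2", "topology_domains": {"zone": "z1", "rack": "r1"}},
]
print([w["name"] for w in decode_candidates(prefill, decoders, "zone")])                # ['d0', 'd2']
print([w["name"] for w in decode_candidates(prefill, decoders, "rack", strict=False)])  # ['d2', 'd0', 'd1']
```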
## V
**vLLM** - High-throughput LLM serving engine with distributed tensor/pipeline parallelism and PagedAttention.
## W
**Wide Expert Parallelism (WideEP)** - Mixture-of-Experts deployment strategy that spreads experts across many GPUs (e.g., 64-way EP) so each GPU hosts only a few experts.
## X
**xPyD (x Prefill y Decode)** - Dynamo notation describing disaggregated serving configurations where x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.
# Dynamo Digest
> Technical deep dives, announcements, and updates from the Dynamo team.
- How Dynamo checkpoints warm inference workers and restores them quickly on Kubernetes, with a path toward sub-five-second startup for large models.
- A short pointer to the DynoSim deep dive on fast, workload-driven simulation for finding Dynamo deployment Pareto frontiers.
- A short note on TokenSpeed's launch, its kernel and scheduler work, and Dynamo's day-0 integration.
- Lessons from running Claude Code, Codex, and OpenClaw against Dynamo: prompt stability, reasoning fidelity, and streaming tool dispatch.
- How Dynamo optimizes for agentic workloads at three layers: the frontend API, the router, and KV cache management.
- How Dynamo's concurrent global index evolved through six iterations to sustain over 100 million operations per second.
# NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
> NVIDIA Dynamo Snapshot combines CUDA and host checkpointing to restore warm inference workers quickly on Kubernetes.

Cold-starting inference replicas on Kubernetes can take minutes while engines load weights, warm kernels, and compile graphs. In our blog post, [NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes](https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/), we introduce Dynamo Snapshot, a checkpoint/restore approach that combines `cuda-checkpoint`, CRIU, and a privileged `snapshot-agent` DaemonSet to restore warm workers from shared storage. We also walk through KV cache unmapping, CRIU restore optimizations, and GPU Memory Service (GMS), which bring the `gpt-oss-120b` prototype below five seconds and reduce startup time by 21x.
# DynoSim: Simulating the Pareto Frontier
> DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack for mapping Pareto frontiers before real-cluster validation.

DynoSim is a workload-driven discrete-event simulation of NVIDIA Dynamo: a Dynamo twin for exploring LLM serving behavior before running full deployments. It brings measured engine forward-pass timing, Mocker scheduler cores, Router and Planner behavior, KV cache effects, and workload traces onto one virtual timeline. In our blog post, [DynoSim: Simulating the Pareto Frontier](https://developer.nvidia.com/blog/dynosim-simulating-the-pareto-frontier/), we show how simulation becomes the inner loop for design exploration: sweep broadly, map the throughput-latency Pareto frontier, shortlist the most promising candidates, and verify them on real clusters.
# Dynamo Day 0 support for TokenSpeed
> Dynamo adds day-0 TokenSpeed support with the Dynamo frontend for Kimi K2.5.
[TokenSpeed](https://lightseek.org/blog/lightseek-tokenspeed.html) ([GitHub](https://github.com/lightseekorg/tokenspeed)) launched today as LightSeek's new inference engine for agentic workloads. The initial repo is a preview, with more model coverage and runtime features landing over the next few weeks.
Two pieces are worth calling out. First, TokenSpeed includes new MLA kernel work for long-context Kimi-style workloads on Blackwell. Second, TokenSpeed has a native C++ scheduler in `tokenspeed-scheduler/` that models request flow and cache operations as explicit state machines, while Python remains the runtime and integration layer.
Dynamo now has day-0 support for running TokenSpeed as a Dynamo backend through `python -m dynamo.tokenspeed`. The Dynamo frontend remains the user-facing OpenAI-compatible API entrypoint and handles request routing, streaming responses, and cancellation.
See the [Kimi K2.5 TokenSpeed recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes/kimi-k2.5/tokenspeed/agg/nvidia) for the current Dynamo launch recipe.
Things are moving quickly. Upstream TokenSpeed calls out ongoing work on model coverage, P/D, EPLB, KV store, Mamba cache, VLM, metrics, Hopper optimization, and related runtime features.
# Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in Dynamo
An agentic exchange must preserve a structured interaction: assistant turns interleave reasoning with one or more tool calls, and subsequent user turns return the corresponding tool results to the model context. Reasoning replay is model- and turn-dependent: some reasoning should be retained, while some should be dropped. The inference engine is responsible for this more expressive interaction and for producing correctly segmented API results. Tool-call parsing and reasoning parsing need to happen before the attached harness consumes the response. High-value agentic workflows such as coding also depend on a responsive harness experience: reasoning segments, tool-call events, and request metadata need to stream back as the turn unfolds instead of arriving after a final text response. This post covers lessons from running real agentic clients against Dynamo: how we hardened parser and API coverage and how those parser layers became standalone reusable crates.
These changes build on the performance considerations outlined in our [first post](/dynamo/digest/agentic-inference), which focused on the serving architecture underneath agentic inference: the frontend, router, and KV cache management. This follow-up focuses on correctness, user-experience equivalence, and performance.
Agentic harnesses are still evolving quickly. Claude Code, Codex, and OpenClaw expose the same pressure points from different API surfaces, so the examples below focus on the core behaviors that custom serving stacks need to reproduce.

## Harness-Facing Dynamo Settings
Our experiments used the newly released `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` model, though the same issues apply across models, reasoning parsers, and tool-call parsers.
To reproduce our results, configure the frontend with the Anthropic-compatible API and the flags that preserve prompt, reasoning, and tool state:
- `--enable-anthropic-api` exposes the Anthropic Messages API to harnesses. Many harnesses can fall back to the default Messages API, but the experience is degraded.
- `--strip-anthropic-preamble` removes the Anthropic billing header that can destabilize KV reuse.
- `--enable-streaming-tool-dispatch` lets complete tool calls start executing as soon as they are decoded, rather than waiting for the end of the turn.
Putting all of this together:
```bash
python -m dynamo.frontend \
--http-port 8000 \
--enable-anthropic-api \
--strip-anthropic-preamble \
--enable-streaming-tool-dispatch
```
On the worker side, the important settings in this deployment are:
- `--dyn-tool-call-parser` and `--dyn-reasoning-parser` reconstruct tool calls and reasoning blocks in the model-specific format the harness expects. Those parsers also control whether reasoning from previous turns should be retained, transformed, or dropped.
## Prompt Stability Is Key for Cache Reuse
Claude Code sends thousands of tokens of reusable prompt scaffolding, much of which is designed to be the same across different users and sessions. However, at the very front of each prompt is a session-specific billing header, which causes cache misses when Claude Code is pointed at a custom endpoint that does not strip it out:
```text
x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==;
You are Claude Code, an interactive CLI tool...
```
These headers poison the KV cache and prevent it from being reused, even across sessions by the same user. A varying line at position zero means every new session starts from a different token prefix, so the stable instructions and tool definitions behind it never line up cleanly for reuse.
To restore KV-cache reuse, Dynamo added `--strip-anthropic-preamble`. The fix is mechanically small and operationally important: remove the unstable billing header before tokenization so that the stable prompt starts at token zero.
The measured impact was large. On a Dynamo B200 deployment with a 52K-token prompt, a stable prefix landed at `168ms` TTFT. Keeping a varying per-session header in the prefix pushed that to `912ms`. Removing the billing header before tokenization brought it back to `169ms`. On this workload, the unstable header costs `744ms` per request and turns a reusable system prompt into a cold prefill. Stripping it is about a `5x` reduction in TTFT (`912ms` to `169ms`) for new users hitting the same deployment or for the same user opening a new session.
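To make the prefix effect concrete, here is a minimal Python sketch. It is not Dynamo's implementation: `strip_billing_preamble` and `shared_prefix_len` are hypothetical helpers, the session values are made up, and character-level prefixes stand in for token-level prefixes.

```python
import os

BILLING_PREFIX = "x-anthropic-billing-header:"  # header name from the example above


def strip_billing_preamble(prompt: str) -> str:
    """Drop a leading billing-header line, if present (hypothetical helper)."""
    first_line, sep, rest = prompt.partition("\n")
    if sep and first_line.startswith(BILLING_PREFIX):
        return rest
    return prompt


def shared_prefix_len(a: str, b: str) -> int:
    """Common-prefix length; a character-level stand-in for token overlap."""
    return len(os.path.commonprefix([a, b]))


STABLE = "You are Claude Code, an interactive CLI tool..."
# Two sessions that differ only in the (made-up) per-session header value.
session_a = f"{BILLING_PREFIX} cc_version=0.2.93; cch=aaa111==;\n{STABLE}"
session_b = f"{BILLING_PREFIX} cc_version=0.2.93; cch=bbb222==;\n{STABLE}"

raw_overlap = shared_prefix_len(session_a, session_b)
stripped_overlap = shared_prefix_len(
    strip_billing_preamble(session_a), strip_billing_preamble(session_b)
)
print(raw_overlap, stripped_overlap, len(STABLE))
```

With the header in place, the shared prefix ends inside the header line, before any of the stable instructions. After stripping, the two sessions share the entire stable prompt.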

## The Nuances of Reasoning and Tool Parsing
Reasoning replay into the next turn does not have one universal correct form. Some models intentionally drop prior thinking on ordinary assistant turns. Agentic turns with interleaved tool calls are different: there, the reasoning spans often need to remain attached to the tool calls they explain. The real contract is model-specific and turn-specific.
Anthropic's [April 23 Claude Code postmortem](https://www.anthropic.com/engineering/april-23-postmortem) gives a concrete production example of this policy: thinking from previous turns can be cleared on session resume to reduce the prefill burden after the cached prompt has expired.
Contemporary reasoning models tend to produce two different kinds of assistant turns:
- reasoning followed by a direct response to the user
- reasoning followed by one or more tool calls
Agentic models are especially good at producing turns where many reasoning and tool-call segments appear within a single response in the pattern of:
```text
reasoning_0 tool_call_0 reasoning_1 tool_call_1
```
On the next turn each reasoning span needs to stay attached to the tool call it explains. Dynamo now supports this interleaved format fully. Previously, the same turn could be reconstructed as:
```text
reasoning_0 reasoning_1 tool_call_0 tool_call_1
```
If the assistant turn is reconstructed as one generic reasoning block followed by one blob of tool calls, the model still has all the same tokens but loses the sequence and delimiters that made them meaningful. This grouped ordering came from legacy models that emitted only a single reasoning span and a single tool-call pass per turn.
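The loss of pairing can be shown with a small Python sketch. The segment representation and both helper functions are illustrative assumptions, not Dynamo's internal types.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # (kind, payload); kind is "reasoning" or "tool_call"


def grouped(segments: List[Segment]) -> List[Segment]:
    """Legacy-style reconstruction: all reasoning first, then all tool calls."""
    reasoning = [s for s in segments if s[0] == "reasoning"]
    tools = [s for s in segments if s[0] == "tool_call"]
    return reasoning + tools


def pair_reasoning_with_tools(
    segments: List[Segment],
) -> List[Tuple[Optional[str], str]]:
    """Attach each tool call to the reasoning span immediately before it."""
    pairs: List[Tuple[Optional[str], str]] = []
    last_reasoning: Optional[str] = None
    for kind, payload in segments:
        if kind == "reasoning":
            last_reasoning = payload
        elif kind == "tool_call":
            pairs.append((last_reasoning, payload))
    return pairs


emitted = [
    ("reasoning", "reasoning_0"),
    ("tool_call", "tool_call_0"),
    ("reasoning", "reasoning_1"),
    ("tool_call", "tool_call_1"),
]

interleaved_pairs = pair_reasoning_with_tools(emitted)
grouped_pairs = pair_reasoning_with_tools(grouped(emitted))
print(interleaved_pairs)
print(grouped_pairs)
```

In the interleaved order each tool call sits next to its own reasoning span. After grouping, both tool calls end up adjacent only to `reasoning_1`, even though no token was removed.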
In addition to the reordering bug, we also found that reasoning was often being dropped too aggressively before the next turn. For some models, dropping prior thinking on turns without tool calls is an established behavior and part of the model's fine-tuning (DeepSeek-R1 is the clearest example). But that same behavior is wrong for interleaved agentic turns where the prior reasoning explains the tool sequence. This issue was difficult to spot because users could see reasoning being decoded correctly in the outgoing response while it was still being *silently* mangled or dropped before the next turn.
We validated this against a Dynamo + TRT-LLM deployment: Nemotron-3-Super-120B-A12B-NVFP4 on 4x B200 with TP=4, with `--enable-anthropic-api`, `--strip-anthropic-preamble`, `--enable-streaming-tool-dispatch`, the `nemotron_deci` reasoning parser, and the `qwen3_coder` tool call parser.
### Combined Reasoning and Tool Calls
A model that reasons before calling a tool generates a response where tagged reasoning content flows first, followed by tool-call XML. In the case of Nemotron, two different parsers, `nemotron_deci` for reasoning and `qwen3_coder` for tool calls, have to split that stream into the correct Anthropic Messages API content blocks without interfering with each other.
We sent the same prompt five times through the Anthropic Messages API: a system prompt instructing the model to think step by step, two tool definitions (calculator and weather), and the user message "Think carefully about what 15 * 23 equals, then use the calculator to verify." The response structure from a representative round:
```json
{
  "content": [
    {
      "type": "thinking",
      "thinking": "I need to calculate 15 * 23. Let me think: 15 * 20 = 300, and 15 * 3 = 45, so 300 + 45 = 345. I'll use the calculator to verify.\n"
    },
    {
      "type": "tool_use",
      "id": "call-a3364797-3160-4e84-b567-5c495694d502",
      "name": "calculator",
      "input": { "expression": "15 * 23" }
    }
  ],
  "stop_reason": "tool_use",
  "usage": { "input_tokens": 403, "output_tokens": 95 }
}
```
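On the client side, a harness can walk these typed blocks directly. The following Python sketch parses the response shown above (with the thinking text abbreviated) using only the standard library; the variable names are illustrative.

```python
import json

# The response above, with the thinking text abbreviated for brevity.
response_json = """
{
  "content": [
    {"type": "thinking", "thinking": "I need to calculate 15 * 23. ..."},
    {"type": "tool_use",
     "id": "call-a3364797-3160-4e84-b567-5c495694d502",
     "name": "calculator",
     "input": {"expression": "15 * 23"}}
  ],
  "stop_reason": "tool_use",
  "usage": {"input_tokens": 403, "output_tokens": 95}
}
"""

response = json.loads(response_json)

# Separate the content blocks by type; order within each list is preserved.
thinking = [b["thinking"] for b in response["content"] if b["type"] == "thinking"]
tool_calls = [b for b in response["content"] if b["type"] == "tool_use"]

if response["stop_reason"] == "tool_use":
    for call in tool_calls:
        print(call["name"], call["input"])
```

Because the parsers have already split the stream, the harness never has to search the raw text for reasoning or tool-call delimiters.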
### Streaming Two Parsers at Once
The streaming path makes the parser interaction more visible. A streaming request produces a sequence of SSE events, and the event type sequence shows exactly how the two parsers carve up the token stream:
```text
1ms message_start
82ms content_block_start type=thinking
82ms content_block_delta (thinking tokens stream here, ~7ms apart)
... (~70 thinking deltas over ~520ms)
602ms content_block_stop
602ms content_block_start type=text
602ms content_block_delta
800ms content_block_stop
800ms content_block_start type=tool_use
800ms content_block_delta
800ms content_block_stop
814ms message_delta stop_reason=tool_use
814ms message_stop
```
The thinking block streams token by token from `82ms` to `602ms`. Then a brief text block appears (the whitespace between the thinking and tool call regions of the raw token stream). Then the tool_use block arrives at `800ms` as a single structured unit. The `message_stop` follows at `814ms`.
This round-trip did not produce the correct Anthropic event sequence until [PR #7358](https://github.com/ai-dynamo/dynamo/pull/7358). The fix had three parts:
1. **One owner for reasoning parsing**: reasoning parsing used to happen at multiple competing layers. The backend parser could split model output into `reasoning_content` and normal `content`, while the Anthropic streaming converter still tried to infer reasoning-tag boundaries when mapping the same stream into Anthropic content blocks. PR #7358 made ownership explicit. If a backend path has already produced structured reasoning deltas, the Anthropic converter trusts them and only maps them into the response format.
2. **Template-native reasoning when available**: Dynamo now checks whether the active chat template knows how to read `reasoning_content`. Templates like Nemotron and Qwen3 read that field directly, so Dynamo leaves it alone and lets the template decide how much prior thinking to keep. If the template only understands `content`, Dynamo falls back to the legacy representation: preserve reasoning by inserting tagged reasoning blocks into `content`, or leave it out when the model/parser policy says prior thinking should not carry into the next turn. Both the Rust preprocessor path (`ModelInput::Tokens`) and the Python worker path (`ModelInput::Text`) use this same conditional rule.
3. **Respect per-request thinking controls**: Many templates default `truncate_history_thinking=true` to save context. That is reasonable for ordinary chat, but it removes the reasoning behind prior tool calls in agent workflows. Dynamo now changes that behavior only for requests where reasoning is actually in play: when a reasoning parser is configured and the client has not disabled thinking, the Anthropic path sets `enable_thinking=true` and `truncate_history_thinking=false`. That keeps the next-turn context agents need without changing the default for requests or models that should run without thinking.
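The third rule can be sketched as a small decision function. This is an illustration of the condition described above, not Dynamo's code: `thinking_template_kwargs` and its parameters are hypothetical names, and only the two template keys come from the text.

```python
from typing import Dict, Optional


def thinking_template_kwargs(
    reasoning_parser: Optional[str],
    client_disabled_thinking: bool,
) -> Dict[str, bool]:
    """Return chat-template overrides for one request (hypothetical helper).

    Overrides apply only when a reasoning parser is configured and the
    client has not turned thinking off; otherwise template defaults stand.
    """
    if reasoning_parser is None or client_disabled_thinking:
        return {}
    return {"enable_thinking": True, "truncate_history_thinking": False}


print(thinking_template_kwargs("nemotron_deci", client_disabled_thinking=False))
print(thinking_template_kwargs("nemotron_deci", client_disabled_thinking=True))
print(thinking_template_kwargs(None, client_disabled_thinking=False))
```

Returning an empty dictionary in the other cases leaves the template's own defaults untouched, which matches the goal of not changing behavior for requests or models that should run without thinking.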
In our B200 experiment with a 52K-token system prompt and an assistant turn containing about 500 tokens of thinking, the unchanged next-turn prefix landed at `167ms` TTFT while mutated thinking landed at `322ms`. That is a `1.9x` increase, or about `155ms` per request, from changing the reasoning content inside the next-turn prefix.
The key takeaway is that the harness, parser, and template path must agree on each model's expected reasoning behavior. Dropping thinking on ordinary turns may be correct for one model and wrong for another. Preserving interleaved reasoning on tool-calling turns may be essential even when ordinary turns are allowed to strip it. **In practice, you should not assume that the tokens produced on turn `N` will automatically arrive unchanged as the prefix of turn `N+1`.** Whether that is true depends on the reasoning parser, tool parser, and chat template for the model you are serving.
## Streaming Tool Calls
Streaming tokens make the user experience feel more responsive and dynamic. The hard part is preserving that streaming behavior while still emitting tool calls as coherent blocks. In the older Dynamo path, reasoning tokens streamed back normally, but tool calls stayed buffered until the very end of the turn before being released all at once to the harness. That reduces responsiveness and delays tool execution even when the model has already decided what to call.
| State | What the harness sees | When tool readiness becomes visible |
|-------|------------------------|-------------------------------------|
| Buffered | tool-call chunks withheld | only at `finish_reason: "tool_calls"` |
| Inline streaming | regular tool-call deltas | as soon as the model emits them |
| Dispatch | typed `event: tool_call_dispatch` side channel | at the same structural completion point, but already parsed |
The important change is from the first row to the latter two. That is where the harness stops waiting for stream end to learn that it needs to act.
Without dispatch, the harness sees a regular token stream and has to infer when a tool call is complete by accumulating deltas and waiting for enough structure to be present. With dispatch enabled, Dynamo can emit a typed SSE side channel:
```text
event: tool_call_dispatch
data: {"choice_index":0,"tool_call":{"index":0,"id":"call-...","type":"function","function":{"name":"calculator","arguments":"{\"expression\":\"42 * 17\"}"}}}
```
That event tells the harness, in one shot, that the tool call is ready to execute. No harness-side delta assembly, no guessing whether the arguments are complete, and no custom parser living inside the harness. This makes Dynamo more easily compatible with custom harnesses.
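A harness-side handler for this event can stay very small. The Python sketch below parses the example event shown above with the standard library; `parse_sse_event` is an illustrative helper and handles a single event block rather than a full SSE stream.

```python
import json

# The example event from above, as it would appear on the wire.
raw_event = (
    "event: tool_call_dispatch\n"
    'data: {"choice_index":0,"tool_call":{"index":0,"id":"call-...",'
    '"type":"function","function":{"name":"calculator",'
    '"arguments":"{\\"expression\\":\\"42 * 17\\"}"}}}\n'
)


def parse_sse_event(block: str):
    """Split one SSE event block into (event_name, decoded_json_data)."""
    event_name = None
    data_lines = []
    for line in block.splitlines():
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    return event_name, json.loads("\n".join(data_lines))


name, payload = parse_sse_event(raw_event)
if name == "tool_call_dispatch":
    function = payload["tool_call"]["function"]
    # The arguments field is itself a JSON-encoded string.
    arguments = json.loads(function["arguments"])
    print(function["name"], arguments)
```

The only decoding step beyond the event itself is the second `json.loads` on `arguments`; there is no delta accumulation.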

## Anthropic API Fidelity for Claude Code and OpenClaw
Claude Code and OpenClaw both exercise the Anthropic Messages API rather than only text generation behind an endpoint. Matching the harness experience depends on a collection of smaller behaviors that are easy to miss in ad hoc testing:
- model metadata at both `GET /v1/models` and `GET /v1/models/{model_id}`
- correct handling of slashed model IDs
- useful `input_tokens` in `message_start`
- acceptance of `cache_control`
Once the frontend is reachable and compliant, both harnesses can point at Dynamo's Anthropic-compatible endpoint:
```bash
# Claude Code
ANTHROPIC_API_KEY=local-dev-token \
ANTHROPIC_BASE_URL=http://localhost:8000 \
ANTHROPIC_CUSTOM_MODEL_OPTION=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="Dynamo NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
claude --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

# OpenClaw
ANTHROPIC_API_KEY=local-dev-token \
ANTHROPIC_BASE_URL=http://localhost:8000 \
npx openclaw agent --local -m "Say ok" --json
```
The fixes in this area brought the custom deployment closer to the native backend behavior. One concrete example shows the flavor of these bugs better than a long checklist. During startup, the harness asks for details about the selected model directly, but Dynamo did not yet serve that endpoint:
```text
GET /v1/models/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
HTTP/1.1 404 Not Found
```
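Serving that lookup means treating everything after `/v1/models/` as the model ID, including the slash. The Python sketch below shows one way to do that; it is an assumption about how a server could handle it, not Dynamo's routing code, and `model_id_from_path` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import unquote

MODELS_PREFIX = "/v1/models/"


def model_id_from_path(path: str) -> Optional[str]:
    """Return the model ID from a /v1/models/{model_id} path, keeping slashes."""
    if not path.startswith(MODELS_PREFIX):
        return None
    # Accept both literal slashes and percent-encoded ones (%2F).
    model_id = unquote(path[len(MODELS_PREFIX):])
    return model_id or None


served = {"nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"}
requested = model_id_from_path(
    "/v1/models/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
)
print(requested in served)
```

A router that matches only a single path segment for `{model_id}` would see `nvidia` as the ID and miss the model.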
Another example is `message_start` reporting `input_tokens: 0` even when the final response later contains the real count. This can make the token count in the harness temporarily drop to `0` every time a new turn starts. [PR #7234](https://github.com/ai-dynamo/dynamo/pull/7234) fixed that Anthropic path by populating `input_tokens` before the stream begins. Those counts are also control-plane data for long sessions: harnesses use context length to decide when to compact the conversation before the next request would exceed the model window. The broader tokenizer-service work landed separately in [PR #7699](https://github.com/ai-dynamo/dynamo/pull/7699), which added `/v1/tokenize` and `/v1/detokenize` endpoints for accurate token counts before a request is processed by the engine.
## Responses and Codex Fidelity
The Codex-facing version of the same problem lives on the `v1/responses` side. Passing compliance tests is not enough to provide parity in user experience.
We found that a Responses API request could not survive an internal round-trip without losing the fields that made it a Responses request rather than a chat completions request. Preserving those fields required architectural changes in Dynamo's `ResponseParams` path, together with the upstream type-alignment work in [PR #6089](https://github.com/ai-dynamo/dynamo/pull/6089).
Codex should point at Dynamo through the OpenAI-compatible Responses API with request compression enabled:
```bash
OPENAI_API_KEY=local-dev-token \
codex exec \
-c 'openai_base_url="http://localhost:8000/v1"' \
-c 'features.enable_request_compression=true' \
-m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
"Say ok"
```
### Codex Model Metadata Shapes the Request
Codex parity begins before the first `POST /v1/responses`. The CLI resolves the configured model string into a local model-catalog record, and the resulting `ModelInfo` controls the harness state built around the model: base instructions, history formatting, tool registry, reasoning parameters, verbosity controls, image support, context accounting, tool-output truncation, `parallel_tool_calls`, and the final Responses payload.
Two endpoints can serve the same underlying model and still drive different agent behavior if Codex attaches different catalog metadata. The request may validate against the schema while the harness around it has changed.
Tool-output truncation is a useful example. Codex does not replay unlimited command output into the next model turn. Shell and tool observations are truncated according to the selected model's catalog policy before they re-enter context. In the catalog snapshot we tested, `gpt-5.5` used:
```json
{ "mode": "tokens", "limit": 10000 }
```
By contrast, `openai/openai/gpt-5.5` on a custom endpoint used fallback metadata:
```json
{ "mode": "bytes", "limit": 10000 }
```
Those budgets are not equivalent. A `10,000`-byte limit cuts off structured logs, tracebacks, JSON, or test output much earlier than a `10,000`-token limit for ASCII-heavy coding output. For a coding agent, that changes what the model can inspect after a failed test, a search command, or a compiler error. The model may need additional tool calls to recover context that the intended catalog profile would have preserved.
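The gap between the two budgets can be illustrated with a short Python sketch. It is not Codex's truncation logic: the log is synthetic, and whitespace splitting is a crude stand-in for a real tokenizer, which would produce more tokens per line and narrow the gap without reversing it for ASCII text.

```python
def truncate_bytes(text: str, limit: int) -> str:
    """Keep at most `limit` UTF-8 bytes."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def truncate_tokens(text: str, limit: int) -> str:
    """Keep at most `limit` tokens (whitespace split as a stand-in tokenizer).

    Note: rejoining with spaces flattens newlines; fine for a size comparison.
    """
    return " ".join(text.split()[:limit])


# Synthetic test output: 5,000 short failure lines.
log = "\n".join(
    f"test_case_{i:04d} FAILED AssertionError: expected 1 got 0"
    for i in range(5000)
)

by_bytes = truncate_bytes(log, 10_000)
by_tokens = truncate_tokens(log, 10_000)
print(len(by_bytes), len(by_tokens))
```

Under these assumptions the token budget keeps several times more of the log than the byte budget with the same numeric limit.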
Reasoning settings are also catalog-derived. Codex sends a Responses `reasoning` object when the selected model metadata says reasoning summaries are supported. In that path, Codex also requests `reasoning.encrypted_content` so reasoning state can be replayed across turns. Fallback metadata removes that path.
Prompting changes too. In Codex, switching from the fallback/default profile to the `gpt-5.5` catalog profile changes the system prompt. The fallback prompt is organized around generic Codex operation (`# How you work`, `# AGENTS.md spec`, `# Tool Guidelines`) and emphasizes `AGENTS.md` precedence, planning, validation, and shell-search habits. The `gpt-5.5` prompt is a different instruction document (`# Personality`, `# General`, `# Working with the user`) that frames the agent as a pragmatic software engineer and adds stronger guidance on codebase reading, local-pattern reuse, scoped edits, dirty worktrees, `apply_patch`, collaboration updates, and final-answer formatting. Catalog aliasing therefore affects base behavioral policy as well as request fields such as truncation and reasoning.
We saw this directly in a 50-task subset of SWE-Bench Verified. In this setup, both routes reached OpenAI-served GPT-5.5; the difference was the endpoint and the model-catalog record Codex attached to it. When the custom endpoint used model ID `openai/openai/gpt-5.5` without being associated with the `gpt-5.5` catalog profile, Codex used generic fallback behavior. In one run, the fallback profile issued roughly half as many tool calls:
| Catalog profile | Total tool calls | Per task |
|-----------------|------------------|----------|
| `gpt-5.5` profile | 2,087 | 41.7 |
| Fallback profile | 1,048 | 21.0 |
| Delta | -1,039 | -20.8 |
The paired comparison pointed in the same direction on every task: the `gpt-5.5` profile used more tools in `50 / 50` tasks, while the fallback profile used more tools in `0 / 50`. A permutation test put the difference at `p < 0.001`.
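For readers who want to run a similar comparison, here is a minimal Python sketch of one common choice for paired data, a two-sided sign-flip permutation test. The exact test variant used for the numbers above is not specified here, so treat this as an assumption, and note that the per-task counts below are synthetic rather than the measured data.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on paired per-task differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Synthetic per-task tool-call counts, NOT the measurements from this post.
data_rng = random.Random(1)
profile_calls = [data_rng.randint(30, 55) for _ in range(50)]
fallback_calls = [c - data_rng.randint(5, 30) for c in profile_calls]

p = paired_permutation_pvalue(profile_calls, fallback_calls)
print(f"p = {p:.4f}")
```

When every task moves in the same direction, as in the first comparison, almost no random sign assignment matches the observed total, so the estimated p-value sits near the resolution floor set by `n_resamples`.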
After adding a model-catalog alias so `openai/openai/gpt-5.5` inherited the intended `gpt-5.5` profile, the same 50-task setup became much closer:
| Catalog profile | Total tool calls | Per task |
|-----------------|------------------|----------|
| `gpt-5.5` profile | 2,081 | 41.6 |
| Alias-backed custom profile | 2,205 | 44.1 |
| Delta | +124 | +2.5 |
The remaining difference was not statistically significant in this run: the permutation test was about `p = 0.22`, and the paired directions were mixed (`20 / 50` tasks favored the native profile, `28 / 50` favored the alias-backed profile, and `2 / 50` tied).
For Dynamo, the implication is that Codex compatibility needs to be evaluated at the catalog and request-shaping layer as well as the HTTP schema layer. If Codex cannot resolve a model ID into the intended profile, fallback defaults may change truncation, search-tool availability, verbosity controls, reasoning-summary support, and parallel tool-call support before Dynamo receives the request.
## What's Next
Dynamo now has `nvext.agent_hints`: `latency_sensitivity`, `priority`, `osl`, and `speculative_prefill`. Those fields give the harness a way to say more about the turn than the prompt alone. A session waiting on a user reply is not the same as one working through a long background tool sequence, and the API can now carry some of that difference.
In the v1.1.0 line, Dynamo is also making more of the agent stack available as reusable pieces. The protocol, parser, and tokenizer layers are versioned as standalone crates, including `dynamo-protocols`, `dynamo-parsers`, and `dynamo-tokenizers`. That gives teams a way to build or customize a harness-facing serving path without copying Dynamo internals into a separate project.
This is also the bridge to longer-running systems such as AutoResearch. The first post explained why agentic workloads stress the serving stack. This post shows the harness-facing contract needed to run those workloads correctly and sets the stage for efficient long-running agents backed by Dynamo endpoints.
> How Dynamo optimizes for agentic workloads at three layers: frontend API, router, and KV cache management.
# Full-Stack Optimizations for Agentic Inference with Dynamo
Coding agents are starting to write production code at scale. [Stripe’s agents generate 1,300+ PRs per week](https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents). [Ramp attributes 30% of merged PRs to agents](https://www.infoq.com/news/2026/01/ramp-coding-agent-platform/). [Spotify reports 650+ agent-generated PRs per month](https://engineering.atspotify.com/2025/11/spotifys-background-coding-agent-part-1). Tools like Claude Code and Codex make hundreds of API calls per coding session, each carrying the full conversation history. Behind every one of these workflows is an inference stack under significant KV cache pressure.

Let's take Claude Code as an example. After the first API call writes the conversation prefix to the KV cache, every subsequent call to the same worker sees an 85-97% cache hit rate. Agent teams (or swarms) push this further, with a 97.2% aggregate cache hit rate across 4 Opus teammates. An 11.7x read/write ratio means the system reads from cache nearly 12 times for every token it writes. This is a write-once-read-many (WORM) access pattern: the system prompt and growing conversation prefix are computed once, then served from cache on every subsequent call. Maximizing cache reuse rate across all workers and keeping KV blocks warm and routable is the central optimization target for agentic inference.
These numbers come from managed API infrastructure where the provider controls prefix matching, cache placement, and eviction. For teams running open-source models on their own GPUs, none of this exists out of the box. We have been building Dynamo to close that gap. This post walks through how we are making Dynamo agent-native at three layers: the frontend API, the router, and KV cache management.
Throughout this post, we use three terms consistently:
- **Harness**: the agent framework that drives the workflow (Claude Code, Codex, OpenClaw, OpenCode, etc.)
- **Orchestrator**: Dynamo's routing, scheduling, and cache management layer
- **Runtime**: the inference engine that executes the model and owns the KV cache manager (SGLang, vLLM, TRT-LLM)
## Layer 1: The Frontend
### Multi-Protocol Support
Agent harnesses are increasingly adopting `v1/responses` and `v1/messages` over `v1/chat/completions` to cleanly handle new patterns including interleaved thinking and tool calls. The key difference in these APIs is structural. In `v1/chat/completions`, message content is a flat string and tool calls are bolted on as a separate field. As an example, notice how the [GLM](https://docs.z.ai/guides/capabilities/thinking-mode#example-usage) and [MiniMax](https://platform.minimax.io/docs/guides/text-m2-function-call#important-note) APIs handle interleaved thinking differently when hosting their models behind the `v1/chat/completions` endpoint. The `v1/responses` and `v1/messages` APIs use typed content blocks, so a single assistant turn can contain thinking, tool calls, and text as distinct objects. This matters for inference because the orchestrator can see block boundaries, perform prompt optimizations, and apply different cache and scheduling policies per block type. Dynamo serves all three endpoints through a common internal representation, so a single deployment can act as the inference backend for any harness. Our team has been running a Dynamo deployment of GLM-5 and MiniMax-M2.5 internally to power our Codex and Claude Code harnesses. This lets us benchmark our backend implementations against closed-source inference, targeting parity on cache reuse performance. We will be sharing a full write-up and some optimized recipes for deploying both models in the coming weeks.
**Serving Claude Code with Dynamo**
**Serving Codex with Dynamo**
We have also invested in day-0 tool call and reasoning parsing support for various open-source models. If you find that a model is not supported, please [open an issue](https://github.com/ai-dynamo/dynamo/issues) or use the [tool-call-parser-generator](https://github.com/ai-dynamo/dynamo/blob/main/.agents/contributor-skills/tool-parser-generator/SKILL.md) skill to generate it with your harness of choice.
### Agent Hints: The Harness-Orchestrator Interface
Today, inference servers see anonymous tokenized requests. But agent harnesses have global context that the infrastructure never sees: which agents are blocked on tool calls, which just spawned, how many turns remain in a session, and whether the current call is a quick lookup or a long synthesis. When using coding agents, the user waits for a final result, not individual token streams, so the orchestrator can reorder and prioritize requests across agents without affecting the end-user experience. Sessions run for minutes to [even days](https://factory.ai/news/missions) with long tool-call pauses. This is enough to optimize inference scheduling in ways that traditional serving cannot.

Dynamo’s new agent hints extension was designed to bridge this gap. It allows any harness to attach structured hints to a request across all three API endpoints, giving the router and runtime the context they need to make agent-aware scheduling and caching decisions. This is a v1 API that we are actively co-designing with the community, and we would love feedback from teams building agent harnesses on which signals are most useful. Please reach out to us if you have any ideas or feedback!
```json
{
  "model": "MiniMaxAI/MiniMax-M2.5",
  "messages": [...],
  "tools": [...],
  "nvext": {
    "agent_hints": {
      "osl": 256,
      "speculative_prefill": true,
      "priority": 10
    },
    "cache_control": {
      "type": "ephemeral",
      "ttl": "1h"
    }
  }
}
```
The `agent_hints` fields:
- **`priority`** controls scheduling across both the router and engine. Higher values mean "more important" at the Dynamo API level; Dynamo translates that into router queue ordering and backend-specific engine priority.
- **`osl`** (output sequence length) is the harness's estimate of how many tokens this request will generate. The router uses this to gauge how long a worker will be occupied, which improves load balancing. A harness can learn this over time by tracking average output lengths per tool call type.
- **`speculative_prefill`** signals the orchestrator to begin caching this request's prefix on a likely worker before the full request is ready. This is useful when the harness knows a tool call is about to return and wants to warm the cache ahead of time.
The `cache_control` field will look familiar to anyone who has used Anthropic's prompt caching API. It tells the orchestrator to pin the computed prefix on the worker for the specified TTL, protecting it from eviction during tool call gaps. Currently `ephemeral` is the only supported type (to match Anthropic's API). We discuss how this works in the cache retention section below. You can find complete documentation on agent hints [here](../../components/frontend/nvext.md#cache-control).
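As a rough illustration of how a harness might assemble these fields, the sketch below builds a request body in Python. `build_request` is a hypothetical helper, not part of Dynamo; only the `nvext` field names come from the example above, and the message content is made up.
```python
import json


def build_request(model, messages, priority=None, osl=None,
                  speculative_prefill=None, cache_ttl=None):
    """Build a chat-completions style body with optional `nvext` hints.

    Hypothetical helper: it only arranges the fields shown in the JSON
    example above and omits any hint the caller did not set.
    """
    hints = {}
    if priority is not None:
        hints["priority"] = priority
    if osl is not None:
        hints["osl"] = osl
    if speculative_prefill is not None:
        hints["speculative_prefill"] = speculative_prefill

    nvext = {}
    if hints:
        nvext["agent_hints"] = hints
    if cache_ttl is not None:
        # "ephemeral" is the only type the text lists as supported.
        nvext["cache_control"] = {"type": "ephemeral", "ttl": cache_ttl}

    body = {"model": model, "messages": messages}
    if nvext:
        body["nvext"] = nvext
    return body


body = build_request(
    "MiniMaxAI/MiniMax-M2.5",
    [{"role": "user", "content": "Summarize the failing test output."}],
    priority=10, osl=256, speculative_prefill=True, cache_ttl="1h",
)
print(json.dumps(body, indent=2))
```
Leaving a hint unset keeps it out of the payload entirely, so a harness can start with `priority` alone and add `osl` once it has collected output-length statistics.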
## Layer 2: The Router
A coding agent follows a sequential pattern: long prefill, tool call, extend prefix, repeat. A multi-agent harness fans out work across parallel subagents with short, independent contexts. Default round-robin routing is blind to both patterns -- it cannot account for cache locality, request priority, or session structure. Dynamo's router closes this gap with three mechanisms: KV-aware placement, priority scheduling, and extensible routing strategies.
### KV-Aware Placement
Without cache-aware routing, turn 2 of a conversation has a ~1/N chance of landing on the same worker as turn 1 (with N workers). Every miss forces a full prefix recomputation, which is a significant performance bottleneck and extremely costly for the end user. Dynamo's router maintains a global index of which KV cache blocks exist on which workers. The [Flash Indexer post](/dynamo/digest/flash-indexer) covers the six iterations that got this indexer to 170M ops/s (**planetary** scale KV routing). On every request, the router queries the index for per-worker overlap scores and selects the worker that minimizes the combined cost of cache miss and current decode load. This cost function is tunable, and we show below how teams can build custom agent-aware routing strategies on top of it.
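To make the selection rule concrete, here is a minimal sketch of a cost-based pick. It assumes the cost is simply missed blocks plus current decode load; `pick_worker`, `miss_weight`, and all numbers are hypothetical, and Dynamo's actual cost function may differ.
```python
def pick_worker(query_blocks, overlap, decode_load, miss_weight=1.0):
    """Return the worker with the lowest (cache-miss + decode-load) cost.

    query_blocks   -> number of blocks in the incoming request
    overlap[w]     -> leading query blocks already cached on worker w
    decode_load[w] -> blocks worker w is currently decoding
    miss_weight    -> hypothetical knob trading cache misses against load
    """
    def cost(worker):
        missed = query_blocks - overlap.get(worker, 0)
        return miss_weight * missed + decode_load[worker]

    return min(decode_load, key=cost)


overlap = {"w0": 90, "w1": 0}                 # w2 has nothing cached
decode_load = {"w0": 40, "w1": 10, "w2": 5}

cache_aware = pick_worker(100, overlap, decode_load)
load_only = pick_worker(100, overlap, decode_load, miss_weight=0.0)
```
With these inputs the cache-aware pick is the busier worker that already holds most of the prefix, while setting `miss_weight` to zero falls back to pure load balancing and picks the idlest worker.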
### Priority Scheduling
`priority` is the single user-facing scheduling knob. Higher values mean "more important" at the Dynamo API level. Dynamo uses that one hint at both layers:
- At the **router**, higher-priority requests are shifted earlier in the queue when `--router-queue-threshold` is enabled.
- At the **engine**, Dynamo normalizes backend-specific polarity and forwards the request for queue ordering, preemption, and KV cache eviction.
At the router, incoming requests enter a `BinaryHeap` ordered by effective arrival time. A higher `priority` makes the request appear as if it arrived earlier, placing it ahead of lower-priority work. Requests only enter the queue when all workers exceed a configurable load threshold. Below that threshold, they bypass the queue entirely and go straight to worker selection. When capacity frees up (prefill completes or a request finishes), the queue drains highest-priority entries first.
Once dispatched, SGLang, vLLM, and TRT-LLM may interpret engine priority differently, so Dynamo normalizes the engine-facing value per backend. Engines like SGLang can also use priority-based radix cache eviction where lower-priority blocks are evicted first under memory pressure.
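The router-side queue can be sketched with Python's `heapq` standing in for the `BinaryHeap`. This is an assumption-laden illustration: `RouterQueue`, `admit`, and the `seconds_per_priority` scale are hypothetical names, and the text does not specify how a priority value maps to a time shift.
```python
import heapq
import itertools


class RouterQueue:
    """Min-heap keyed by effective arrival time (earliest pops first)."""

    def __init__(self, seconds_per_priority=1.0):
        # Hypothetical scale: how far one priority unit shifts arrival time.
        self.seconds_per_priority = seconds_per_priority
        self._heap = []
        self._tiebreak = itertools.count()  # keeps FIFO order on ties

    def push(self, request_id, arrival_time, priority=0):
        # Higher priority makes the request look like it arrived earlier.
        effective = arrival_time - priority * self.seconds_per_priority
        heapq.heappush(self._heap, (effective, next(self._tiebreak), request_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def admit(queue, request_id, arrival_time, priority, worker_loads, threshold):
    """Queue only when every worker is above the load threshold."""
    if all(load > threshold for load in worker_loads):
        queue.push(request_id, arrival_time, priority)
        return "queued"
    return "dispatched"  # bypass the queue, go straight to worker selection


queue = RouterQueue()
busy = [0.90, 0.95]
admit(queue, "background", 0.0, 0, busy, threshold=0.8)
admit(queue, "interactive", 2.0, 10, busy, threshold=0.8)
drain_order = [queue.pop(), queue.pop()]
```
Here the later but higher-priority request drains first, and a request that arrives while any worker is under the threshold never touches the heap.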

### Agentic Workload Routing Strategies
A research agent with a 200K context window needs workers with enough free KV capacity to hold its full state. The router's default cost function (overlap score + decode load) handles the common case, but teams with domain-specific workloads can use the router's Python bindings to implement custom routing strategies. The core `KvRouter` class provides `best_worker()` for querying routing decisions, `get_potential_loads()` for per-worker load inspection, and `generate()` for routing + dispatch in one call. Custom routers register on the same service mesh as the default components and can override routing config per-request:
```python
# Query per-worker load and overlap for custom routing logic
loads = await router.get_potential_loads(token_ids)

# Override routing config based on request properties
# Long contexts benefit from stronger overlap credit
config = {"overlap_score_credit": 1.0} if len(token_ids) > 8192 else {}
worker_id, dp_rank, overlap = await router.best_worker(
    token_ids,
    request_id="req-123",
    update_indexer=True,
    router_config_override=config,
)

# Or bypass the default selector entirely when the harness
# has its own worker selection logic (e.g., session affinity)
stream = await router.generate(
    token_ids, model=model, worker_id=chosen_worker
)
```
The [NeMo Agent Toolkit (NAT)](https://github.com/NVIDIA/NeMo-Agent-Toolkit/tree/develop/examples/dynamo_integration) team used these APIs to build a custom online-learning agentic router. Their router extracts session metadata from `nvext` annotations and feeds it to a [Thompson Sampling](https://en.wikipedia.org/wiki/Thompson_sampling) bandit style cost function that learns which workers perform best for which prefix patterns under load. Compared to Dynamo's default routing, they measured 4x reduction in p50 TTFT and 1.5x increase in p50 tokens-per-second. Priority tagging of latency-sensitive requests achieved up to 63% p50 TTFT reduction under moderate memory pressure. See the [NAT Dynamo integration example](https://github.com/NVIDIA/NeMo-Agent-Toolkit/tree/develop/examples/dynamo_integration) for implementation details. We will be making this available as a routing strategy in Dynamo soon.
## Layer 3: KV Cache Management
Agentic workloads produce blocks with vastly different reuse value -- system prompts reused every turn, reasoning tokens never reused again -- but default LRU eviction treats all blocks identically. A 2-30 second tool call pause can age out an agent's entire prefix, forcing full recomputation when it resumes. The cache needs to understand block value, support cross-worker sharing, and respect agent lifecycle boundaries.
### The Problem with Uniform Eviction
| Block Type | Reuse Pattern | Value |
|------------|---------------|-------|
| System prompt + tool definitions | Every turn | Highest |
| Conversation history | Subsequent turns, growing monotonically | High |
| Thinking/reasoning tokens | Typically zero reuse after reasoning loop closes (a significant portion of output) | Near-zero |
| Subagent KV | Multiple turns then agent dies. No need to retain | Near-zero |
LRU sees only recency. In a high-traffic environment, a tool call that takes 2-30 seconds (while the agent waits for an external API) can cause the agent's blocks to age out. When the agent resumes, the entire prefix must be recomputed. To solve this, the orchestrator needs APIs that control which blocks are retained, where they live, and for how long.
### KV Cache as a Shared Resource
Today, KV cache is treated as a local, ephemeral resource on each worker. An agent's ~32K-token system prompt and tool definitions are computed independently on every worker that serves its requests. When a lead agent spawns 4 subagents, each with overlapping tool definitions, that shared prefix is recomputed 4 times if the subagents land on different workers. In our analysis of Claude Code team sessions, we measured this directly: teammates averaged 79.4% cache hit rate vs. 91.3% for the lead agent's explore subagents (5.0x vs. 11.7x read/write ratio), with the gap driven almost entirely by cold-start writes on each teammate's first call. The goal is to make high value KV cache blocks available to all workers in the cluster. Essentially, they are written once during cold start and then read by any worker at all times.
Solutions like SGLang's HiCache and Dynamo's KV Block Manager (KVBM) are building toward a 4-tier memory hierarchy:

Blocks follow a write-through path: when a worker computes KV for a prefix, the blocks flow from GPU to CPU to disk automatically. Each block is deduplicated by sequence hash in a global registry. Once a block is registered, it is immutable and addressable by any worker that can reach the storage tier.
This directly solves the subagent cold-start problem. When the lead agent computes tool definitions and system prompt, those blocks write through to shared storage. When subagent 1 spawns on a different worker, the router queries the Flash Indexer, finds the blocks in shared storage, and the worker loads them via NIXL (RDMA read) instead of recomputing from scratch. Subagent 2 does the same. Four redundant prefill computations become one compute and three loads. The same mechanism addresses cache coherence in disaggregated prefill-decode serving. In disagg mode, the prefill worker computes KV and transfers it to the decode worker via NIXL. The decode worker generates tokens, producing new KV state. On the next turn, a prefill worker needs both the original prefix and the generated tokens from turn 1, but those live only on the decode worker. With shared storage, the decode worker writes its new blocks to the common tier and any prefill worker can fetch them on the next turn.
Multi-tier storage solves sharing and persistence, but blocks still arrive on GPU only after the request hits the worker. The missing piece for agentic systems is prefetch: the harness can use historical timing data to predict when an agent's tool call might return, which means it knows which blocks will be needed and when. We are building prefetch hooks so the harness can signal "bring these blocks from storage to GPU ahead of the next request." Combined with the retention APIs (below), this gives the harness full lifecycle control: pin blocks to prevent eviction, set priority to control eviction ordering, and prefetch blocks proactively before they are needed.

### Selective Cache Retention
Making blocks globally available solves the sharing problem, but does not solve eviction. SGLang and vLLM both support priority-based eviction via a priority heap, where the harness assigns a numeric priority per request and lower-priority blocks are evicted first. TensorRT-LLM takes this further with `TokenRangeRetentionConfig` (designed and implemented by a Dynamo team member, [@jthomson04](https://github.com/jthomson04)), which allows per-region control within a single request.
A request carries zero or more directives. Blocks without directives follow the default LRU path with zero overhead. The evictor becomes a two-structure system: an LRU free list for unprioritized blocks (O(1), unchanged) and a priority queue for annotated blocks. The harness can express "system prompt blocks are evicted last (priority: 100); conversation context survives a 30-second tool call (duration: 45s); decode tokens are first to go (priority: 1)" without the engine needing to understand why.
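The two-structure evictor can be sketched as follows. This is a simplified illustration, not the engine's implementation: `TwoTierEvictor` is a hypothetical name, it assumes unannotated blocks are always evicted before annotated ones, and it leaves out the `duration` directive.
```python
import heapq
from collections import OrderedDict


class TwoTierEvictor:
    def __init__(self):
        self._lru = OrderedDict()   # unannotated blocks, oldest first
        self._prio = []             # (priority, seq, block_id) min-heap
        self._seq = 0               # insertion counter, breaks priority ties

    def add(self, block_id, priority=None):
        if priority is None:
            # Default path: plain LRU with O(1) recency update.
            self._lru[block_id] = None
            self._lru.move_to_end(block_id)
        else:
            heapq.heappush(self._prio, (priority, self._seq, block_id))
            self._seq += 1

    def evict(self):
        """Return the next block to evict, or None if nothing is left."""
        if self._lru:
            block_id, _ = self._lru.popitem(last=False)
            return block_id
        if self._prio:
            return heapq.heappop(self._prio)[2]   # lowest priority first
        return None


evictor = TwoTierEvictor()
evictor.add("system_prompt", priority=100)
evictor.add("decode_tokens", priority=1)
evictor.add("scratch_a")
evictor.add("scratch_b")
eviction_order = [evictor.evict() for _ in range(5)]
```
Blocks without a directive never touch the heap, which is how the default path keeps its O(1) cost; only annotated blocks pay the logarithmic heap cost.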
Anthropic's prompt caching lets you mark prefixes as cacheable on their infrastructure. Dynamo's `cache_control` API brings the same semantics to self-hosted inference. When a request includes `cache_control: { type: "ephemeral", ttl: "1h" }`, the router pins the matching prefix nodes in the worker's radix tree for that TTL, protecting them from eviction in the worker's L2 storage.
The next step is connecting retention with the distributed cache. Today, retention directives apply to a single worker's local cache. When a block is pinned on worker A but the next request routes to worker B, the pin does not follow. Extending retention semantics across HiCache/KVBM's shared storage tier means the harness can pin a block once and have it survive across workers: the priority and TTL metadata travel with the block through the write-through path, and any worker that loads the block from shared storage inherits the retention policy. Combined with the prefetch hooks described above, this gives the harness end-to-end lifecycle control across the full memory hierarchy.
### Agent Lifecycle Awareness
Consider a typical Claude Code session. The lead agent runs for 20+ turns, accumulating a growing conversation prefix. Along the way it spawns explore subagents that each run 1-3 turns and terminate. It might spawn a team of 4 specialists that work in parallel on different subtasks and then terminate. Midway through, the agent hits a context limit and summarizes its history, compressing ~175K tokens down to ~40K. Each of these events produces ephemeral KV: blocks that will never be referenced again. Subagent termination, context summarization, and closed reasoning loops all generate ephemeral KV that occupies the same memory as high-value blocks like the system prompt. Reasoning models amplify this: thinking blocks account for ~40% of generated tokens but become ephemeral the moment the reasoning loop closes. Without lifecycle awareness, the cache treats all of these blocks identically.

The retention primitives from above (priority, TTL, token ranges) give us the building blocks. What is missing is the ability to associate them with sessions. If the harness can tag a subagent's requests as belonging to a session and mark that session's KV as ephemeral, the evictor can target those blocks first and skip writing them to shared storage entirely. When the subagent terminates, its session's blocks are the first to be reclaimed. The same mechanism applies to thinking tokens: the engine can detect thinking-block boundaries during generation and tag those blocks as ephemeral at insertion time, so they skip L2 write-back and evict before normal blocks without any external signal. The design space here is wide: harness-driven session tagging, engine-native semantic detection, and hybrid approaches that combine both. We are actively exploring multiple directions and expect the right answer will vary by workload and framework.
## Closing the Gap
The biggest optimization surface in agentic inference is the gap between what the harness knows and what the infrastructure can see. Which agents are blocked, which are about to resume, which KV is worth keeping, which can be thrown away -- all of this context exists at the harness layer but never crosses the API boundary. `nvext.agent_hints` is our first cut at closing that gap: a small set of structured signals that let the orchestrator make informed routing, scheduling, and cache management decisions instead of treating every request as anonymous tokens. This is a v1 API and we are actively evolving it. If you are building agent harnesses, running open-source models for agentic workloads, or thinking about cache-aware inference, we want to hear what signals matter most for your use case. Reach out on [GitHub](https://github.com/ai-dynamo/dynamo) or tag us on X: [@0xishand](https://x.com/0xishand), [@KranenKyle](https://x.com/KranenKyle), [@flowpow123](https://x.com/flowpow123).
# Flash Indexer: A Story of Inter-Galactic KV Routing
> Dynamo's Flash Indexer tracks every cached KV block across all inference workers at 170M ops/s. Six iterations of data structure design got it there.
The **Flash Indexer** is a concurrent global index of every cached KV block across every inference worker, sustaining over **100 million operations per second**. It evolved through six iterations—from a Python dictionary to a jump-optimized spatial index—to the point where network latency, tokenization, and hashing are the bottlenecks. We're shipping it as the default indexer in Dynamo v1.0.0.
For scale intuition: at 100M+ index ops/sec, the system can support approximately $$N \approx 10^8 / r$$ concurrent workloads, where $$r$$ is a single workload's sustained index ops/sec (inserts + lookups) under real traffic, including bursty prefill. That is well beyond current planetary-scale inference demand.
This post walks through those iterations—how each redesign drove a new order-of-magnitude improvement, and the specific data structure or concurrency breakthrough behind it.
---
## 1. Background
### 1.1 KV Block Identity
Every cached block carries three identifiers:
- **Local block hash** (`u64`): Content hash of the tokens within a single block. Position-independent—two blocks with the same tokens produce the same hash. Both the frontend and publisher use the same algorithm.
- **Sequence block hash** (`u64`): Rolling hash of the entire prefix up to this block. Position-dependent—identical tokens at different positions produce different hashes.
```text
seq_hash[0] = local_hash[0]
seq_hash[i] = hash(seq_hash[i-1] || local_hash[i])
```
- **Worker ID**: Which worker holds the block.
Local hashes are deliberately *chunk hashes* (no prefix context) so frontends can hash query blocks cheaply in parallel. The tradeoff: chunk hashes can't distinguish position. *"Predict the next token | Learn from the error | Predict the next token."* produces identical hashes at blocks 0 and 2. This collision problem drives every data structure decision below.
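The two hash kinds can be sketched in a few lines of Python. BLAKE2b is used here only as a convenient 64-bit stand-in (the source does not name the hash function), and the token IDs, block size, and helper names are made up for illustration.
```python
import hashlib
import struct


def _h64(data: bytes) -> int:
    """64-bit hash; BLAKE2b is a stand-in, not necessarily Dynamo's hash."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def local_hashes(blocks):
    """Chunk hash per block: depends only on that block's tokens."""
    return [_h64(struct.pack(f"<{len(b)}I", *b)) for b in blocks]


def sequence_hashes(local):
    """Rolling hash: seq[0] = local[0]; seq[i] = hash(seq[i-1] || local[i])."""
    seq = []
    for i, lh in enumerate(local):
        if i == 0:
            seq.append(lh)
        else:
            seq.append(_h64(struct.pack("<QQ", seq[-1], lh)))
    return seq


# Made-up token IDs; blocks 0 and 2 hold identical tokens.
blocks = [[11, 12, 13, 14], [21, 22, 23, 24], [11, 12, 13, 14]]
local = local_hashes(blocks)
seq = sequence_hashes(local)

print(local[0] == local[2])   # chunk hashes collide across positions
print(seq[0] == seq[2])       # sequence hashes tell the positions apart
```
Each local hash is computed independently, so a frontend can hash all query blocks in parallel; the sequence hashes form a chain and must be computed in order.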
### 1.2 Events and Requests
The indexer handles two kinds of traffic:
**KV Events** (writes): A publisher sitting alongside each engine emits `Store(worker_id, local_hash, seq_hash)` when a block is cached and `Remove(worker_id, seq_hash)` when evicted. We need explicit events because engines cache blocks beyond request lifetime and their eviction policies (LRU sweeps, memory pressure, preemption) are opaque—there's no way to infer cache state from request-response cycles alone. The stream is bursty: prefills produce dozens of stores at once; eviction sweeps produce bursts of removes.
Per-worker and aggregate KV cache event density heatmap derived from 5% of the Mooncake FAST'25 trace, replayed across 16 Mocker workers with 2,048 GPU blocks each. Green cells indicate Store-dominant time bins (prefill bursts); amber cells indicate Remove-dominant bins (eviction sweeps). The diverging colorscale is clamped at ±10 events per worker and ±100 events aggregate, highlighting the bursty, temporally correlated nature of KV cache traffic that the Flash Indexer must sustain at line rate.
**Requests** (reads): On every inference request, the frontend sends a sequence of chunk hashes `[local_hash_0, ..., local_hash_D]`. The indexer returns `(worker_id, match_depth)` scores so the router can pick the worker with the deepest cached prefix.
Each engine is paired with a publisher that enriches raw KV events with worker identity and block hashes, then broadcasts them via pub/sub. The router requests store lookups from the indexer, which computes prefix overlap scores used for routing decisions.
Both paths are hot. Slow events mean stale routing decisions. Slow queries to the indexer mean user-facing latency. The design goal: keep both fast without mutual contention.
---
## 2. Nested Dictionary → Rust Actor
### 2.1 Python Dictionary
The simplest possible index is a nested dictionary. For each worker, store a mapping from local block hash to the set of external sequence hashes that share that chunk hash. Since local hashes are chunk hashes, the same tokens can appear at different positions in different sequences, and a single local hash can map to multiple sequence hashes on the same worker. To find matches, iterate every worker and walk through the query sequence, checking for hits.
```py
class KvIndex:
    def __init__(self):
        # worker_id -> { local_hash -> set of seq_hashes }
        self.index: dict[int, dict[int, set[int]]] = {}

    def store(self, worker_id: int, blocks: list[tuple[int, int]]):
        # blocks is a list of (local_hash, seq_hash) pairs
        if worker_id not in self.index:
            self.index[worker_id] = {}
        for local_hash, seq_hash in blocks:
            if local_hash not in self.index[worker_id]:
                self.index[worker_id][local_hash] = set()
            self.index[worker_id][local_hash].add(seq_hash)

    def remove(self, worker_id: int, seq_hashes: list[int]):
        if worker_id not in self.index:
            return
        for seq_hash in seq_hashes:
            # No reverse lookup: scan every local hash for this worker
            for hashes in self.index[worker_id].values():
                hashes.discard(seq_hash)

    def find_matches(self, query: list[int]) -> dict[int, int]:
        scores = {}
        for worker_id, blocks in self.index.items():
            depth = 0
            for local_hash in query:
                if local_hash in blocks and blocks[local_hash]:
                    depth += 1
                else:
                    break
            if depth > 0:
                scores[worker_id] = depth
        return scores
```
There is a correctness issue with this approach. `local_hash in blocks` tells us the worker has *some* block with those tokens, but not *which* one: different sequences sharing the same chunk hash are conflated. This collision problem shapes every data structure decision that follows.
There is also a cost issue. Each query is `O(W × D)` (W workers, D query depth), which is a non-starter with hundreds of workers and sequences thousands of blocks long.
### 2.2 Rust Actor
Porting to Rust (`HashMap<WorkerId, HashMap<LocalHash, HashSet<ExternalHash>>>`) eliminates interpreter overhead. A **single-threaded actor** owns the index exclusively and communicates through channels. This is correct and lock-free, but it serializes all reads behind all writes, so the single thread is the throughput ceiling.
---
## 3. Inverted Index
`worker -> { hash -> ... }` forces `find_matches` to iterate every worker. But the question is *"which workers have this block?"*, which is keyed by block, not worker. Instead of iterating workers and checking blocks, build an inverted index keyed by `LocalHash` that maps to the sequence hashes and their worker sets.
```rust
// local_hash -> (seq_hash -> set of workers)
index: HashMap<LocalHash, HashMap<ExternalHash, HashSet<WorkerId>>>
```
Now `find_matches` traverses the query once. At each position, take the union of worker sets. Workers only *drop out* as you go deeper—each is drained at most once—giving **O(D + W)** instead of O(W × D).
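A small Python sketch of this traversal is shown below. It uses plain set operations for readability, so each step still touches the active set; it illustrates the single pass over the query and the drain-once behavior rather than the amortized bound. `InvertedIndex` and the string hashes are illustrative stand-ins.
```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        # local_hash -> { seq_hash -> set of worker ids }
        self.index = defaultdict(lambda: defaultdict(set))

    def store(self, worker_id, local_hash, seq_hash):
        self.index[local_hash][seq_hash].add(worker_id)

    def find_matches(self, query):
        """Return {worker_id: match_depth} for a list of local hashes."""
        scores = {}
        active = None
        for depth, local_hash in enumerate(query):
            by_seq = self.index.get(local_hash, {})
            # Union across sequence hashes: this is where collisions conflate.
            holders = set().union(*by_seq.values())
            if active is None:
                active = set(holders)           # workers holding block 0
            else:
                for worker in active - holders:
                    scores[worker] = depth      # matched blocks 0..depth-1
                active &= holders
            if not active:
                break
        for worker in active or ():
            scores[worker] = len(query)         # matched the whole query
        return scores


idx = InvertedIndex()
for worker, chain in {1: ["A", "B", "C"], 2: ["A", "B"]}.items():
    for pos, local_hash in enumerate(chain):
        idx.store(worker, local_hash, seq_hash=f"{'/'.join(chain[:pos + 1])}")

scores = idx.find_matches(["A", "B", "C", "D"])
```
Worker 2 drains at depth 2 and worker 1 at depth 3; once a worker leaves the active set it is never revisited.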
The inverted index is a major win for reads, but every data structure choice is a two-sided tradeoff between query performance and update cost.
On the read side, the collision issue from Section 2.1 resurfaces in a different shape. When we union worker sets across sequence hashes at a given local hash, we conflate workers that cached different sequences sharing the same chunk. The seq hash data is in the index, but `find_matches` cannot use it without computing the query's own seq hashes—which reintroduces rolling hash computation on the read path, exactly what chunk hashes were designed to avoid.
On the write side, removes are equally expensive: without a per-worker reverse lookup, removing a block by seq hash requires scanning the entire index. We could add a reverse lookup table, but that's more bookkeeping on every store.
The radix tree resolves both.
---
## 4. Radix Tree
Each node has a small children map keyed by `LocalHash`, plus a worker set. Parent-child relationships scope collision risk: two blocks with the same chunk hash collide only if they share the same parent, which means the same prefix. Different prefixes lead to different parents. This requires one new field in KV events: the **parent hash**, so the tree can link child to parent as events arrive.
Prefix-aware radix tree indexes cached blocks by local hash. Shared prefixes branch where sequences diverge; each node records which workers hold that block.
```rust
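type SharedRadixBlock = Rc<RefCell<RadixBlock>>;

struct RadixBlock {
    children: HashMap<LocalHash, SharedRadixBlock>,
    workers: HashSet<WorkerId>,
    block_hash: Option<ExternalHash>,
}

struct RadixTree {
    root: SharedRadixBlock,
    lookup: HashMap<WorkerId, HashMap<ExternalHash, SharedRadixBlock>>,
}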
```
Each node also carries a sequence hash. A per-worker **lookup table** (`worker -> { seq_hash -> node }`) provides O(1) access for event processing: stores attach children via the parent's seq hash; removes find the node directly. Two keys for two access patterns—local hash for traversal, sequence hash for events.
Both the tree and the lookup table point to the same nodes via `Rc<RefCell<RadixBlock>>` (shared ownership with interior mutability, single-threaded). The children maps at each node are small: they are bounded by branching factor, not total block count.
This approach remains single-threaded behind the actor, with serialized reads and writes.
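The dual-key design can be illustrated with a compact Python model. It is a sketch under simplifying assumptions: `RadixIndex` and `Node` are hypothetical names, store events arrive parent-first, and string and integer literals stand in for real hashes.
```python
class Node:
    def __init__(self, seq_hash=None):
        self.children = {}      # local_hash -> Node
        self.workers = set()    # workers holding this block
        self.seq_hash = seq_hash


class RadixIndex:
    def __init__(self):
        self.root = Node()
        self.lookup = {}        # worker_id -> { seq_hash -> Node }

    def store(self, worker_id, parent_seq_hash, local_hash, seq_hash):
        table = self.lookup.setdefault(worker_id, {})
        # The parent hash in the event says where to attach the child.
        # Raises KeyError if the parent was never stored for this worker.
        parent = self.root if parent_seq_hash is None else table[parent_seq_hash]
        node = parent.children.get(local_hash)
        if node is None:
            node = parent.children[local_hash] = Node(seq_hash)
        node.workers.add(worker_id)
        table[seq_hash] = node

    def remove(self, worker_id, seq_hash):
        # O(1): the lookup table finds the node without a tree walk.
        node = self.lookup.get(worker_id, {}).pop(seq_hash, None)
        if node is not None:
            node.workers.discard(worker_id)

    def find_matches(self, query):
        scores, node, active = {}, self.root, None
        for depth, local_hash in enumerate(query, start=1):
            node = node.children.get(local_hash)
            if node is None:
                break
            active = set(node.workers) if active is None else active & node.workers
            for worker in active:
                scores[worker] = depth
        return scores


tree = RadixIndex()
tree.store(1, None, "A", 100)
tree.store(1, 100, "B", 101)     # worker 1 caches A -> B
tree.store(2, None, "A", 100)    # worker 2 caches A ...
tree.store(2, None, "D", 300)
tree.store(2, 300, "B", 301)     # ... and B, but under prefix D

before = tree.find_matches(["A", "B"])
tree.remove(1, 101)
after = tree.find_matches(["A", "B"])
```
Worker 2 holds a block with chunk hash `B`, but under a different parent, so it is credited with depth 1 only; the nested dictionary from Section 2.1 would have reported depth 2.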
---
## 5. Concurrent Radix Tree
Reads don't conflict with each other. We replace `Rc<RefCell<_>>` with `Arc<RwLock<_>>` (atomic reference counting + reader-writer lock). Now `find_matches` acquires only read locks and executes *inline on the caller's thread*: no channel, no actor, no queue.
Writes use **sticky routing**: a `ThreadPoolIndexer` deterministically assigns each `WorkerId` to one thread. Events for the same worker always land on the same thread, so there's no write-write contention on any worker's subtree.
Write events are sticky-routed by worker ID to a thread pool, ensuring sequential ordering. A concurrent radix tree with `Arc<RwLock>` allows `find_matches()` reads in parallel, enabling concurrent traversals.
```rust
type SharedBlock = Arc<RwLock<RadixBlock>>;

struct ConcurrentRadixTree {
    root: SharedBlock,
    lookup: DashMap<WorkerId, RwLock<FxHashMap<ExternalHash, SharedBlock>>>,
}
```
`DashMap` shards the outer map so reads and writes to different workers don't touch the same lock. `parking_lot::RwLock` avoids the OS syscall on uncontended paths (2–3x faster than `std::sync::RwLock`). `FxHashMap` replaces SipHash with a single multiply-xor step—safe here because keys are `u64` hashes, not user input.
`parking_lot::RwLock` is task-fair by default: it processes waiters in arrival order rather than unconditionally favoring readers or writers. Combined with sticky routing's guarantee that each worker's writes are serialized on a single thread, write contention is minimal and neither reads nor writes are starved.
The actor is gone for reads. Multiple `find_matches` calls proceed concurrently with writes to different workers.
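Sticky routing itself is simple to model. The sketch below uses Python threads and queues to show the ordering guarantee only; `StickyPool` is a hypothetical name, and the modulo assignment is an assumed rule (the text says only that the assignment is deterministic).
```python
import queue
import threading


class StickyPool:
    """Each worker id maps to one thread, so its events apply in order."""

    def __init__(self, num_threads, apply_event):
        self._queues = [queue.Queue() for _ in range(num_threads)]
        for q in self._queues:
            threading.Thread(
                target=self._run, args=(q, apply_event), daemon=True
            ).start()

    def thread_for(self, worker_id):
        # Assumed assignment rule; any deterministic mapping would do.
        return worker_id % len(self._queues)

    def submit(self, worker_id, event):
        self._queues[self.thread_for(worker_id)].put((worker_id, event))

    def join(self):
        for q in self._queues:
            q.join()

    @staticmethod
    def _run(q, apply_event):
        while True:
            worker_id, event = q.get()
            try:
                apply_event(worker_id, event)
            finally:
                q.task_done()


applied = {}
applied_lock = threading.Lock()


def record(worker_id, event):
    with applied_lock:
        applied.setdefault(worker_id, []).append(event)


pool = StickyPool(4, record)
for i in range(100):
    pool.submit(i % 8, i)    # 8 workers share 4 threads
pool.join()
```
Two workers may share a thread, but a single worker's events never span threads, so each worker's event list comes out in submission order.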
---
## 6. Positional Indexer with Jump Search
The radix tree traverses node-by-node, following pointers from parent to child—cache-hostile and fundamentally sequential. You can't check position 128 without visiting 0 through 127.
Replace the tree with a `Vec<DashMap<LocalHash, SeqEntry>>` indexed by position. `index[position]` is a concurrent map from local hash to sequence entry. Any position is O(1), with no traversal required.
```rust
enum SeqEntry {
    Single(ExternalHash, HashSet<WorkerId>),
    Multi(HashMap<ExternalHash, HashSet<WorkerId>>),
}

struct PositionalIndexer {
    // index[position] -> { local_hash -> SeqEntry }
    index: Vec<DashMap<LocalHash, SeqEntry>>,
    worker_blocks: DashMap,
    jump_size: usize,
}
```
The `SeqEntry` enum handles collisions: in the common case a `(position, local_hash)` slot has exactly one sequence hash, stored inline without a `HashMap` allocation. Only when multiple prefixes produce the same chunk hash at the same position does it upgrade to `Multi`.
The `Single`/`Multi` split also enables lazy hash computation: when a lookup finds a `Single` entry, the match is unambiguous without computing the query's sequence hash. The expensive rolling hash is only needed on the rare `Multi` entries where chunk hash collisions require disambiguation.
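The lazy-hash behavior can be modeled in Python. `SeqSlot` is a hypothetical stand-in for one `SeqEntry`, and passing the query's sequence hash as a zero-argument callable is just one way to defer the rolling hash until a collision actually requires it.
```python
class SeqSlot:
    """One (position, local_hash) slot: Single until a second prefix collides."""

    def __init__(self, seq_hash, worker):
        self.single = (seq_hash, {worker})   # common case, stored inline
        self.multi = None                    # seq_hash -> workers, on collision

    def add(self, seq_hash, worker):
        if self.multi is not None:
            self.multi.setdefault(seq_hash, set()).add(worker)
        elif self.single[0] == seq_hash:
            self.single[1].add(worker)
        else:
            # A second prefix produced the same chunk hash here: upgrade.
            self.multi = {self.single[0]: self.single[1], seq_hash: {worker}}
            self.single = None

    def workers(self, query_seq_hash):
        """`query_seq_hash` is a callable, invoked only for Multi slots."""
        if self.multi is None:
            return self.single[1]            # unambiguous, no hashing needed
        return self.multi.get(query_seq_hash(), set())


hash_calls = []


def rolling_hash():
    hash_calls.append(1)     # count how often the expensive path runs
    return 0xB


slot = SeqSlot(0xA, worker=1)
slot.add(0xA, worker=2)
single_result = set(slot.workers(rolling_hash))
calls_after_single = len(hash_calls)

slot.add(0xB, worker=3)      # collision: slot upgrades to Multi
multi_result = set(slot.workers(rolling_hash))
calls_after_multi = len(hash_calls)
```
The counter stays at zero while the slot is `Single` and increments only after the upgrade, which mirrors the claim that the rolling hash is paid for only on collisions.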
But the positional indexer's biggest advantage isn't the data layout – it's what **random access makes possible.**
Random access enables **jump search**:
1. Initialize the active worker set from position 0.
2. Jump ahead by `jump_size` positions (e.g., 64) to the next checkpoint.
3. At the checkpoint, count how many active workers still match (cardinality check—no set cloning needed).
4. If all match: the entire skipped range is confirmed. Continue jumping.
5. If fewer match: some workers drained in the skipped range. Scan forward through positions `[previous_checkpoint + 1 .. current_checkpoint]` to find each lost worker's exact drain point.
6. Resume jumping from the current checkpoint.
With position as a first-class key, the indexer jumps ahead by a fixed stride. On a partial match, a lookback from the previous checkpoint identifies exact drain points, then resumes jumping from the current checkpoint.
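The steps above can be sketched in Python as follows. This is a simplified model, not the Flash Indexer's code: it ignores `SeqEntry` disambiguation, uses plain dicts and sets, and assumes that a worker holding position p of a prefix also holds every earlier position, which is what lets a full match at a checkpoint confirm the skipped range.
```python
def jump_search(index, query, jump_size=64):
    """Return {worker: match_depth} using checkpointed jumps.

    index[pos] maps local_hash -> set of workers holding that block at pos.
    """
    def holders(pos):
        if pos >= len(index):
            return set()
        return index[pos].get(query[pos], set())

    scores = {}
    if not query:
        return scores
    active = set(holders(0))                 # step 1
    prev, last = 0, len(query) - 1
    while active and prev < last:
        checkpoint = min(prev + jump_size, last)           # step 2
        at_checkpoint = holders(checkpoint)
        # Step 3: cardinality check, no set is cloned.
        survivors = sum(1 for w in active if w in at_checkpoint)
        if survivors < len(active):
            # Step 5: some workers drained in (prev, checkpoint].
            for pos in range(prev + 1, checkpoint + 1):
                here = holders(pos)
                for worker in active - here:
                    scores[worker] = pos     # matched positions 0..pos-1
                active &= here
                if not active:
                    break
        prev = checkpoint                    # steps 4 and 6
    for worker in active:
        scores[worker] = len(query)          # matched the whole query
    return scores


# Synthetic index: worker -> number of leading blocks it holds.
depths = {1: 200, 2: 70, 3: 130}
query = list(range(200))     # stand-in local hashes, one per position
index = [
    {pos: {w for w, d in depths.items() if pos < d}}
    for pos in range(200)
]
result = jump_search(index, query, jump_size=64)
```
In this synthetic run the first jump (to position 64) confirms all three workers without scanning; the second and third jumps each lose one worker and trigger a scan of only that stride.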
Best case: `D / J` lookups instead of `D`.
Worst case (workers drop at every jump): degrades to a linear scan with jump overhead.
The positional indexer wins on long sequences with high prefix sharing; the radix tree wins on short or highly-divergent sequences.
The `Vec` layout also improves cache locality: early positions (shared system prompts, common preambles) are the hot path, cluster at the front of the array, and stay warm in cache.
With jump size *J* (= `jump_size`, defaulting to 64), amortized cost drops to **O(D/J + W)**. Since *J* is a tunable constant, the complexity remains linear in *D*; the practical benefit is skipping the vast majority of positions when prefix sharing is high.
---
## 7. Benchmarks
All benchmarks run on a 24-core Arrow Lake (285K) desktop, replaying publicly-available [Mooncake production traces](https://github.com/kvcache-ai/Mooncake/tree/main/FAST25-release/arxiv-trace) through a mock engine with 16,384 GPU blocks and prefix caching enabled. The harness tests all five backends with 24 concurrent event-processing threads.
**Ops throughput** is the combined rate of KV events and `find_matches` requests per second. We sweep offered load by compressing the same trace into shorter durations and compare achieved vs. offered throughput. The **threshold throughput** is where achieved throughput stops tracking offered—the indexer's saturation point.
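One way to read the threshold off a sweep is sketched below. The function name, the 5% tolerance, and the sample numbers are all made up for illustration; they are not the benchmark's data or the method `mooncake_bench` uses.
```python
def saturation_point(samples, tolerance=0.05):
    """Return the highest offered load that achieved throughput still tracks.

    samples   -> iterable of (offered_ops_per_s, achieved_ops_per_s)
    tolerance -> assumed slack: achieved within 5% of offered counts as tracking
    """
    threshold = None
    for offered, achieved in sorted(samples):
        if achieved >= offered * (1.0 - tolerance):
            threshold = offered
        else:
            break    # first load level where the indexer falls behind
    return threshold


# Made-up sweep: the last level falls well short of the offered load.
sweep = [(100, 100), (200, 199), (400, 390), (800, 450)]
threshold = saturation_point(sweep)
```
With this synthetic sweep the function returns 400: achieved throughput is within tolerance up to that level and collapses at 800.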
Achieved vs. offered block throughput across five indexer backends, measured with `mooncake_bench` on real trace data. The Flash Indexer sustains 170M ops/s — 42x faster than the Radix Tree shipped in Dynamo v0.1.0 (4M ops/s) and 440x faster than the naive implementations (385K ops/s).
---
## 8. Future Directions
With the Flash Indexer shipping in Dynamo v1.0.0, the next round of optimizations targets the remaining constant factors:
- **Binary search within jumps.** Replace the linear scan-back after a failed jump with binary search: `O(log J)` instead of `O(J)` per failed jump.
- **Hierarchical routing.** A sparse top-level indexer for coarse-grained prefix coverage across deployment groups, with full indexers at the leaves.
- **Inline bitsets for worker sets.** Replace `HashSet` with fixed-width bitsets stored inline in each node, turning membership tests into single bit operations and eliminating pointer chases.
---
## 9. Conclusion
The journey from a Python dictionary to the Flash Indexer spans six iterations, each motivated by a concrete bottleneck in the previous design:
1. **Naive Nested Dict**: simple but O(W × D) per query.
2. **Rust + Actor Pattern**: fast language, correct concurrency, but single-threaded bottleneck.
3. **Inverted Index**: O(D + W) per query by flipping the key structure; secondary `seq_hash` layer for chunk-hash collision safety.
4. **Radix Tree**: tree structure replaces giant flat map; per-node children maps stay small; dual-key design (local hash for traversal, seq hash for event processing); `Rc<RefCell<_>>` for single-threaded shared ownership.
5. **Concurrent Radix Tree**: `Arc<RwLock<_>>` replaces `Rc<RefCell<_>>`; `DashMap` with per-worker inner `RwLock` for the lookup table (shard-level locking for rare mutations, cheap shared reads on the hot path); reads bypass the actor entirely; sticky routing serializes writes per worker with zero contention.
6. **Concurrent Positional Indexer via Jump Search (Flash Indexer)**: an alternative to the radix tree for long-sequence workloads; `Vec<DashMap<LocalHash, SeqEntry>>` indexed by position replaces pointer chasing with O(1) random access, enabling jump search that skips most of the depth; `DashMap` with per-worker inner `RwLock` for the reverse lookup; hot prefix positions cluster at the front of the `Vec` and stay warm in cache.
The result: a sustained ops throughput of **170 million operations per second**—events and requests combined—with achieved throughput tracking offered throughput all the way to the limit.
# Kubernetes Quickstart
Get a model running on Kubernetes in minutes.
Dynamo's production path is Kubernetes-native: you install the platform with
Helm, submit Dynamo CRDs, and let the operator reconcile inference graphs into
pods, services, routing, model-loading, and scaling resources. The local and
container guides remain useful for development, but Kubernetes is the canonical
path for shared GPU clusters and multi-node serving.
**Deployment modes.** Dynamo supports two deployment modes on Kubernetes. This quickstart uses **standalone mode**, where the Dynamo Frontend serves requests and the integrated Dynamo Router does KV-aware routing. Dynamo can also run in **gateway mode** behind a [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) gateway, where KV-aware routing happens in the Dynamo Endpoint Picker Plugin (EPP) at the gateway layer and the Frontend runs as a sidecar in `--router-mode direct`. See the [Inference Gateway (GAIE) guide](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) to set up gateway mode.
## Prerequisites
- Kubernetes cluster (v1.24+) with GPU nodes
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) (v1.24+)
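- [Helm](https://helm.sh/docs/intro/install/) (v3.0+) installed; the `oci://` chart reference used in this guide relies on Helm's OCI registry support, which is enabled by default from v3.8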
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) installed on the cluster
- HuggingFace token secret on cluster
### HuggingFace token secret
Create a HuggingFace token secret for model downloads. If you don't have a token, see the HuggingFace [token guide](https://huggingface.co/docs/hub/en/security-tokens).
```bash
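export HF_TOKEN=<your-hf-token>
# Secrets are namespaced: create this one in the namespace where you will
# deploy the model (add -n "$NAMESPACE" once that namespace exists).
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN"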
```
### GPU Operator quick install
If you don't have the GPU Operator yet:
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
helm repo update nvidia
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace \
--wait --timeout=600s
```
If your cluster already provides GPU drivers (e.g., GKE with `gpu-driver-version=latest`, or AKS), add:
```bash
--set driver.enabled=false --set toolkit.enabled=false
```
### Detailed installation
The GPU Operator is the only prerequisite for a basic deployment. For additional features like RDMA, Prometheus, or multinode scheduling with Grove/KAI Scheduler, see the [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide).
If your GPU SKU and cloud provider are supported, you can use [AICR](https://github.com/NVIDIA/aicr) for rapid installation of prerequisites and the Dynamo Helm chart.
### Verify cluster is ready
Optionally, verify your cluster is ready:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
## Install Dynamo
```bash
export NAMESPACE=dynamo-system
helm install dynamo-platform \
oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform \
--version "1.0.2" \
--namespace "$NAMESPACE" \
--create-namespace
```
Wait for the platform pods:
```bash
kubectl get pods -n $NAMESPACE
# Expected: dynamo-operator-*, etcd-*, nats-* pods all Running
```
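If you want a scriptable version of this check, the sketch below inspects the JSON produced by `kubectl get pods -n "$NAMESPACE" -o json` and lists pods that are not yet Running with every container ready. The helper name `not_ready_pods` is hypothetical; the fields it reads (`status.phase`, `status.containerStatuses[].ready`) are standard Pod status fields.

```python
from typing import List


def not_ready_pods(pod_list: dict) -> List[str]:
    """Return names of pods that are not Running with all containers ready."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        running = status.get("phase") == "Running"
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if not (running and all_ready):
            pending.append(pod["metadata"]["name"])
    return pending
```

Feed it the parsed output, for example `not_ready_pods(json.load(sys.stdin))`, and retry until the returned list is empty.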
## Understand Dynamo Deployment Resources
Before applying the first YAML, it helps to know the Kubernetes resources Dynamo
uses. These are Dynamo's native control-plane objects; you describe the
inference graph, and the operator owns the Kubernetes deployments, services, and
component rollout around it:
| Resource or path | What it does | In this quickstart |
|---|---|---|
| `DynamoGraphDeployment` (DGD) | The canonical live deployment. It describes the Dynamo inference graph that serves traffic. | Generated by DGDR in Option A, or applied directly in Option B. |
| `DynamoComponentDeployment` (DCD) | Per-component deployments created by the operator from the DGD, such as frontend and worker components. | Created for you by the operator. |
| `DynamoGraphDeploymentRequest` (DGDR) | A generator/profiler that can produce a DGD from a model, backend, workload, hardware, and optional SLA targets. | Option A uses DGDR so Dynamo can generate the first DGD. |
| Recipes | Tuned `deploy.yaml` manifests that are already DGD specs. | Use these later when a recipe matches your model, backend, and hardware. |
```mermaid
flowchart LR
DGDR["DGDR generator/profiler"] --> DGD["DGD live deployment"]
DGD --> DCD["DCDs component deployments"]
DCD --> Pods["Pods and Services"]
```
This quickstart uses DGDR because it avoids hand-writing the first DGD. After
DGDR generates and applies the DGD, the DGDR reaches a terminal state, similar
to a Kubernetes Job. The DGD persists and serves your model.
DGDR can also carry supported generated-deployment features such as
`features.planner` for Planner configuration and `features.mocker` for mocker
mode. KV-aware routing is not currently exposed as a DGDR feature field; use a
direct DGD, a tuned recipe, or `overrides.dgd` when you need to set router mode
or other graph-level details explicitly.
For tuned production-style manifests, start from
[Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes). For the
full deployment model, see the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).
## Deploy Your First Model
Save this DGDR to generate and deploy a DGD for `Qwen/Qwen3-0.6B`:
```yaml
# qwen3-quickstart.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-quickstart
spec:
  model: Qwen/Qwen3-0.6B
  backend: auto
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1" # dynamo-frontend for Dynamo < 1.1.0
```
The DGDR generates a DGD similar in shape to the following. If you already know
the backend and runtime image you want, you can apply this canonical DGD object
directly instead of using DGDR:
```yaml
# qwen3-dgd.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-direct
spec:
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorker
      type: worker
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              command:
                - python3
                - -m
                - dynamo.vllm
              args:
                - --model
                - Qwen/Qwen3-0.6B
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"
                requests:
                  ephemeral-storage: 2Gi
              workingDir: /workspace/examples/backends/vllm
```
Apply exactly one of the manifests.
Option A: generate and apply a DGD with DGDR.
```bash
kubectl apply -f qwen3-quickstart.yaml -n $NAMESPACE
```
Option B: apply the DGD directly.
```bash
kubectl apply -f qwen3-dgd.yaml -n $NAMESPACE
```
If you use DGDR, watch it progress from `Pending` to `Profiling` to `Deploying`
to `Deployed`:
```bash
kubectl get dgdr qwen3-quickstart -n $NAMESPACE -w
```
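For automation (CI, scripts), the same progression can be polled instead of watched. The sketch below is one way to do it, under assumptions: `wait_for_deployed` is a hypothetical helper, and `get_phase` is a callable you supply that returns the current phase string (for example by shelling out to `kubectl get dgdr ... -o jsonpath=...`; the exact status field path is not given here, so check the DGDR Reference).

```python
import time
from typing import Callable, List

# Phase order taken from the text above.
PHASES = ["Pending", "Profiling", "Deploying", "Deployed"]


def wait_for_deployed(
    get_phase: Callable[[], str],
    timeout_s: float = 1800.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until the DGDR reports Deployed; return the distinct phases seen."""
    seen: List[str] = []
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase()
        if phase and (not seen or seen[-1] != phase):
            seen.append(phase)
        if phase == "Deployed":
            return seen
        # Anything outside the documented phases (e.g. a failure) stops the wait.
        if phase and phase not in PHASES:
            raise RuntimeError(f"unexpected DGDR phase: {phase!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"DGDR still in phase {phase!r} after {timeout_s}s")
        sleep(poll_s)
```

Injecting `get_phase` and `sleep` keeps the loop testable without a cluster; the default timeout is an arbitrary placeholder, since profiling time depends on the model and hardware.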
In both paths, the DGD is the live serving resource:
```bash
kubectl get dynamographdeployment -n $NAMESPACE
kubectl get dynamocomponentdeployment -n $NAMESPACE
```
Dynamo supports vLLM, TensorRT-LLM, and SGLang backends. Setting `backend: auto` lets the profiler choose the best one for your model and hardware. For an example of backend-specific configuration, see the [vLLM backend guide](/dynamo/backends/v-llm).
## Send a Request
Once the DGD is ready, it is serving the model:
```bash
# Find and port-forward the frontend
FRONTEND_SVC=$(kubectl get svc -n $NAMESPACE -o name | grep frontend | head -1)
kubectl port-forward "$FRONTEND_SVC" 8000:8000 -n $NAMESPACE &
# Send a request
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
"max_tokens": 200
}' | python3 -m json.tool
```
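The same request can be sent from Python using only the standard library. This is a sketch: `build_chat_request` and `first_reply` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style chat completion shape (`choices[0].message.content`), which is what an OpenAI-compatible frontend is expected to return.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "Qwen/Qwen3-0.6B",
    base_url: str = "http://localhost:8000",
    max_tokens: int = 200,
) -> urllib.request.Request:
    """Build the same POST that the curl command above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: bytes) -> str:
    """Extract the assistant text from an OpenAI-style chat completion body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

With the port-forward running, pass `build_chat_request("What is NVIDIA Dynamo?")` to `urllib.request.urlopen` and hand the bytes from `read()` to `first_reply`.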
## Cleanup
```bash
kubectl delete dgdr qwen3-quickstart -n $NAMESPACE --ignore-not-found
kubectl delete dynamographdeployment qwen3-quickstart qwen3-direct \
-n $NAMESPACE --ignore-not-found
```
## Next Steps
- **[Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)** — Cloud provider setup, GPU Operator details, optional components (Grove, RDMA, model caching, Prometheus)
- **[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide)** — DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
- **[DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference)** — Spec reference, lifecycle phases, monitoring commands, and generated DGD behavior
- **[Creating Deployments](/dynamo/additional-resources/creating-deployments)** — Hand-craft a DGD spec for full control
# Installation Guide
This guide walks you through installing everything needed to deploy models with Dynamo on Kubernetes. Follow the steps in order — each builds on the previous one.
## Prerequisites
Before you begin, make sure you have:
- A **Kubernetes cluster (v1.24+)** with GPU-capable nodes. See the cloud provider guides if you need to create one:
- [Amazon EKS](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup) | [Azure AKS](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-setup) | [Google GKE](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup)
- For local development: [Minikube Setup](/dynamo/kubernetes-deployment/start-here/minikube-setup)
- **kubectl** v1.24+ — [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
- **Helm** v3.0+ — [Install Helm](https://helm.sh/docs/intro/install/)
**Cloud provider GPU drivers**: The GPU Operator (Step 1) installs GPU drivers for you. When creating your cluster's GPU node pools, **do not enable provider-managed GPU driver installation** (e.g., skip AKS GPU driver install, don't use GKE `--accelerator gpu-driver-version=latest`). If your nodes already have provider-managed drivers, see the GPU Operator step for how to handle this.
Verify your tools:
```bash
kubectl version --client # Should show v1.24+
helm version # Should show v3.0+
```
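If you script this check, compare versions numerically rather than as strings (as text, `v1.9` sorts after `v1.24`). A small sketch, with hypothetical helper names `parse_version` and `meets_minimum`:

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Extract the first MAJOR.MINOR pair from tool output such as 'v1.29.2'."""
    match = re.search(r"v?(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(text: str, minimum: Tuple[int, int]) -> bool:
    """True if the version found in `text` is at least `minimum`."""
    return parse_version(text) >= minimum
```

For example, `meets_minimum("Client Version: v1.29.2", (1, 24))` is true, while `meets_minimum("v1.9.0", (1, 24))` is false.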
## Overview
Every Dynamo deployment requires two Helm charts: the **GPU Operator** (Step 1) and the **Dynamo Platform** (Step 2). Everything else is optional. Decide which optional components you need before starting so you can install them in Step 3.
| Optional Component | When you need it | Required for |
|-----------|-----------------|--------------|
| Grove + KAI Scheduler | Multinode or disaggregated inference | Multinode deployments (operator errors without Grove or LWS) |
| Network Operator / RDMA | Disaggregated inference in production | Acceptable KV cache transfer performance (TCP fallback has ~200-500x degradation) |
| kube-prometheus-stack | Autoscaling, metrics dashboards, or the Planner | Planner `sla` mode, KEDA/HPA autoscaling |
| Shared storage (model cache) | Large models (>70B) or many replicas | Avoiding per-pod downloads and HuggingFace rate limits |
**Grove + KAI Scheduler** — Grove is the default multinode orchestrator. The operator returns a hard error on multinode deployments if neither Grove nor [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws#installation) is available. KAI Scheduler is optional but recommended alongside Grove for GPU-aware scheduling. See [Grove](/dynamo/kubernetes-deployment/scale/grove) for details.
**Network Operator / RDMA** — Without RDMA, disaggregated inference falls back to TCP automatically, but with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). Required for any production disaggregated deployment. Setup is cloud-provider-specific — see the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) and your cloud provider guide.
**kube-prometheus-stack** — Required for the Planner's `sla` optimization mode (it reads live TTFT/ITL metrics from Prometheus). Also required for KEDA/HPA-based autoscaling. The Planner's `throughput` mode can function without it using internal queue depth signals, but metrics-driven features will not work. See [Metrics](/dynamo/kubernetes-deployment/operate/observability/metrics) for details.
**Shared storage** — Prevents each pod from downloading model weights independently. Without it, large models (>70B) take hours to download per pod, and many replicas will hit HuggingFace rate limits. Not enforced by the operator — this is an operational concern. See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) for the full walkthrough.
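The RDMA-versus-TCP figures above are consistent with each other; the short calculation below derives the slowdown range from the quoted TTFT numbers (98 s over TCP, 200 to 500 ms with RDMA). Values are in milliseconds so the arithmetic stays exact.

```python
# TTFT figures quoted above, in milliseconds.
tcp_ttft_ms = 98_000
rdma_ttft_ms = (200, 500)

# Slowdown of the TCP fallback relative to each end of the RDMA range.
slowdowns = [tcp_ttft_ms / t for t in rdma_ttft_ms]
best_case, worst_case = min(slowdowns), max(slowdowns)

print(f"TCP fallback is roughly {best_case:.0f}x to {worst_case:.0f}x slower")
# prints: TCP fallback is roughly 196x to 490x slower
```

That 196x to 490x range is where the "~200-500x degradation" figure in the table comes from.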
## Step 1: Install the GPU Operator
The [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) automates deployment of all NVIDIA software components needed to provision GPUs — drivers, container toolkit, device plugin, and monitoring.
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```
```bash
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace
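# If your nodes already have provider-managed GPU drivers installed, append
# this flag to the command above (add a trailing \ after --create-namespace):
#   --set driver.enabled=false
```
If your GPU nodes already have provider-managed drivers installed (e.g., you used GKE's `--accelerator gpu-driver-version=latest`), append `--set driver.enabled=false` to the command above, adding a trailing `\` after `--create-namespace`, so the operator doesn't conflict with the existing drivers. Do not leave a comment line between the continued lines, because the shell ends the command at the comment.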
Some cloud providers require additional GPU Operator configuration. See your provider guide for details:
- [AKS GPU Operator setup](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-setup) — skip AKS-managed GPU driver install on node pools
- [EKS GPU Operator setup](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup)
- [GKE GPU Operator setup](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup) — `LD_LIBRARY_PATH` and `ldconfig` init requirements
Verify the GPU Operator is running:
```bash
kubectl get pods -n gpu-operator
# Expected: gpu-operator, nvidia-driver-daemonset, nvidia-device-plugin-daemonset, etc. all Running
```
## Step 2: Install the Dynamo Platform
Set your environment variables:
```bash
export NAMESPACE=dynamo-system
export RELEASE_VERSION=1.0.2 # match a version from https://github.com/ai-dynamo/dynamo/releases
```
```bash
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz
helm install dynamo-platform dynamo-platform-$RELEASE_VERSION.tgz \
--namespace $NAMESPACE \
--create-namespace
# Note: add \ to --create-namespace above when uncommenting any optional flags below
#
# Grove + KAI Scheduler — uncomment if using multinode or disaggregated inference.
# Option A (install=true): Dynamo installs and manages Grove/KAI as bundled subcharts (dev/testing):
# --set "global.grove.install=true" \
# --set "global.kai-scheduler.install=true" \
# Option B (enabled=true): Grove/KAI are already installed externally (production):
# --set "global.grove.enabled=true" \
# --set "global.kai-scheduler.enabled=true" \
#
# kube-prometheus-stack — uncomment if Prometheus is installed (required for Planner sla mode and autoscaling):
# --set "dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
```
All `helm install` commands can be customized with your own values file: `helm install ... -f your-values.yaml`
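Because commented-out flags inside a multi-line shell command are easy to get wrong, some teams assemble the command programmatically instead. The sketch below builds the argument list from the options described above; `dynamo_install_args` and its parameters are hypothetical names, while the `--set` keys are the ones shown in the install command.

```python
from typing import List, Optional


def dynamo_install_args(
    version: str,
    namespace: str = "dynamo-system",
    grove_mode: Optional[str] = None,          # None, "install", or "enabled"
    prometheus_endpoint: Optional[str] = None,
) -> List[str]:
    """Build the `helm install` argument list for the Dynamo platform chart."""
    args = [
        "helm", "install", "dynamo-platform",
        f"dynamo-platform-{version}.tgz",
        "--namespace", namespace,
        "--create-namespace",
    ]
    if grove_mode is not None:
        if grove_mode not in ("install", "enabled"):
            raise ValueError("grove_mode must be 'install' or 'enabled'")
        # Option A (install) and Option B (enabled) differ only in the key suffix.
        args += ["--set", f"global.grove.{grove_mode}=true"]
        args += ["--set", f"global.kai-scheduler.{grove_mode}=true"]
    if prometheus_endpoint:
        args += [
            "--set",
            f"dynamo-operator.dynamo.metrics.prometheusEndpoint={prometheus_endpoint}",
        ]
    return args
```

Pass the result to `subprocess.run(args, check=True)` after fetching the chart; passing a list avoids shell quoting and line-continuation issues entirely.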
**Shared/Multi-Tenant Clusters**: If a cluster-wide Dynamo operator is already running, do **not** install another one. Check with:
```bash
kubectl get clusterrolebinding -o json | \
jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) |
"Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'
```
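If `jq` is not available, the same filter can be expressed in Python over the output of `kubectl get clusterrolebinding -o json`. This is a sketch that mirrors the jq expression above; the function name `cluster_wide_operator_namespaces` is hypothetical.

```python
from typing import List, Optional


def cluster_wide_operator_namespaces(crb_list: dict) -> List[Optional[str]]:
    """Namespaces of the first subject of each dynamo-operator-manager binding."""
    found = []
    for item in crb_list.get("items", []):
        if "dynamo-operator-manager" in item["metadata"]["name"]:
            subjects = item.get("subjects") or []
            if subjects:
                # Same as jq's .subjects[0].namespace
                found.append(subjects[0].get("namespace"))
    return found
```

A non-empty result means a cluster-wide operator is already installed, and the listed namespace is where it runs.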
**Namespace-restricted mode** (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. Use the default cluster-wide mode for all new deployments.
Verify the Dynamo platform is running:
```bash
# Check CRDs
kubectl get crd | grep dynamo
# Expected: dynamographdeployments, dynamocomponentdeployments, dynamographdeploymentrequests, etc.
# Check operator and platform pods
kubectl get pods -n $NAMESPACE
# Expected: dynamo-operator-*, etcd-*, nats-* pods all Running
```
## Step 3: Install Optional Components
The Dynamo install command above includes commented flags for each optional component. Install the component first, then uncomment the corresponding flag before running `helm install` in Step 2 (or run `helm upgrade --reuse-values` with the flag if you've already installed Dynamo).
### Multinode
Multinode deployments require either Grove + KAI Scheduler or an alternative orchestrator setup (LeaderWorkerSet + Volcano) to enable gang scheduling for workloads that span multiple nodes. See the [Multinode Deployment Guide](/dynamo/kubernetes-deployment/scale/multinode-deployments) for details on orchestrator selection and configuration.
#### Grove + KAI Scheduler
There are two ways to enable Grove and KAI Scheduler, controlled by which flags you uncomment in the Dynamo install command:
- **`install=true`** — Dynamo installs and manages Grove/KAI as bundled subcharts. Simplest path; recommended for dev/testing.
- **`enabled=true`** — Tells Dynamo that Grove/KAI are already installed and externally managed. Use this when you install Grove/KAI separately (e.g., to manage their lifecycle independently or share them across namespaces). Recommended for production.
For the `enabled=true` path, install Grove and KAI Scheduler separately first. See the [Grove installation guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) and [KAI Scheduler deployment guide](https://github.com/NVIDIA/KAI-Scheduler) for instructions.
**Compatibility matrix:**
| dynamo-platform | kai-scheduler | Grove |
|-----------------|---------------|-------|
| 1.0.x | >= v0.13.0 | >= v0.1.0-alpha.6 |
| 1.1.x | >= v0.13.4 | >= v0.1.0-alpha.8 |
#### LWS + Volcano
If you are not using Grove for multinode, you can use [LeaderWorkerSet (LWS)](https://lws.sigs.k8s.io/docs/installation/) (>= v0.7.0) with [Volcano](https://github.com/volcano-sh/volcano#quick-start-guide) for gang scheduling. Both must be installed before deploying multinode workloads.
1. Install Volcano:
```bash
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
```
2. Install LWS (>= v0.7.0) with Volcano gang scheduling enabled:
```bash
export LWS_VERSION=0.8.0
helm install lws oci://registry.k8s.io/lws/charts/lws \
--version=$LWS_VERSION \
--namespace lws-system \
--create-namespace \
--set gangSchedulingManagement.schedulerProvider=volcano \
--wait --timeout 300s
```
See the [LWS docs](https://lws.sigs.k8s.io/docs/) and [Volcano docs](https://github.com/volcano-sh/volcano#quick-start-guide) for configuration options, and the [Multinode Deployment Guide](/dynamo/kubernetes-deployment/scale/multinode-deployments) for orchestrator selection.
### Network Operator / RDMA
RDMA setup is cloud-provider-specific. See the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for transport options, UCX configuration, and performance expectations, and your cloud provider guide for setup instructions:
- [AKS — InfiniBand + Network Operator](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band)
- [EKS — EFA device plugin](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup) (also see the [EFA configuration guide](/dynamo/kubernetes-deployment/operate/disagg-communication#aws-efa-configuration))
- [GKE — GPUDirect-TCPXO](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup)
### kube-prometheus-stack
Install Prometheus before running the Dynamo install command so you can set the endpoint in one pass:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set-json 'prometheus.prometheusSpec.podMonitorNamespaceSelector={}' \
--set-json 'prometheus.prometheusSpec.probeNamespaceSelector={}'
```
Then uncomment the `prometheusEndpoint` line in the Dynamo install command. The Dynamo operator automatically creates PodMonitors for its components. See [Metrics](/dynamo/kubernetes-deployment/operate/observability/metrics) for dashboard setup and available metrics, and [Logging](/dynamo/kubernetes-deployment/operate/observability/logging) for the Grafana Loki + Alloy logging stack.
### Shared Storage for Model Caching
Set up a `ReadWriteMany` PVC so all pods share downloaded model weights instead of each downloading independently. No Dynamo chart flags are needed — storage is configured in your deployment spec. Setup is cloud-provider-specific:
- [AKS — Azure Files / Managed Lustre](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-storage)
- [EKS — EFS](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/efs)
- GKE — Cloud Filestore (see [GKE guide](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup))
For large clusters with frequent model updates, consider [ModelExpress](/dynamo/kubernetes-deployment/model-loading/model-caching#option-2-modelexpress-p2p-distribution) for P2P model distribution and ModelStreamer for direct streaming from object storage. See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) for the full walkthrough including the download Job, mount configuration, and ModelExpress setup.
## Step 4: Pre-Deployment Check
Run the pre-deployment check script to validate your cluster is ready for deployments:
```bash
./deploy/pre-deployment/pre-deployment-check.sh
```
This checks kubectl connectivity, default StorageClass configuration, GPU node availability, and GPU Operator status. See [Pre-Deployment Checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for details.
## Next Steps
Your cluster is ready. Follow the **[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide)** to choose between applying a tuned DGD recipe, creating a DGD directly, or using DGDR to generate one.
## Troubleshooting
**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**
```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```
Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
Solution: Migrate the existing namespace-restricted operators to cluster-wide mode. Namespace-restricted mode is deprecated.
**CRDs already exist**
Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).
Solution: CRDs are installed automatically by the Helm chart. If you encounter conflicts, check existing CRDs with `kubectl get crd | grep dynamo`.
**Pods not starting?**
```bash
kubectl describe pod <pod-name> -n $NAMESPACE
kubectl logs <pod-name> -n $NAMESPACE
```
**Bitnami etcd "unrecognized" image?**
```
ERROR: Original containers have been substituted for unrecognized ones.
```
Add to the helm install command:
```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```
**Clean uninstall?**
```bash
# Uninstall the platform
helm uninstall dynamo-platform --namespace $NAMESPACE
# List Dynamo CRDs
kubectl get crd | grep "dynamo.*nvidia.com"
# Delete each CRD
kubectl delete crd <crd-name>
```
## Advanced: Build from Source
If you need to contribute to Dynamo or use the latest unreleased features from the main branch:
```bash
# 1. Set registry environment
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry; no trailing slash, since the commands below append /kubernetes-operator
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=<your-registry-password>
export IMAGE_TAG=$RELEASE_VERSION
# 2. Build and push operator image
cd deploy/operator
docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG
cd -
# 3. Create namespace and image pull secret (only if using a private registry)
kubectl create namespace $NAMESPACE
kubectl create secret docker-registry docker-imagepullsecret \
--docker-server=$DOCKER_SERVER \
--docker-username=$DOCKER_USERNAME \
--docker-password=$DOCKER_PASSWORD \
--namespace=$NAMESPACE
# 4. Install from local chart
cd deploy/helm/charts
helm dep build ./platform/
helm install dynamo-platform ./platform/ \
--namespace "$NAMESPACE" \
--set "dynamo-operator.controllerManager.manager.image.repository=$DOCKER_SERVER/kubernetes-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=$IMAGE_TAG" \
--set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
```
## Reference
- [Helm Chart Configuration](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md)
- [Create Custom Deployments](/dynamo/additional-resources/creating-deployments)
- [Dynamo Operator Details](/dynamo/kubernetes-deployment/start-here/dynamo-operator)
- [ModelExpress Server](https://github.com/ai-dynamo/modelexpress)
# Dynamo Operator
## Overview
The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It continuously reconciles custom resources so that the cluster converges on the desired state you declare. It is intended for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.
## Architecture
- **Operator Deployment:**
Deployed as a Kubernetes `Deployment` in a specific namespace.
- **Controllers:**
- `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
- `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
- `DynamoGraphDeploymentRequestController`: Watches `DynamoGraphDeploymentRequest` CRs and runs the profiling/generation flow that produces a `DynamoGraphDeployment`.
- `DynamoGraphDeploymentScalingAdapterController`: Watches scaling adapter CRs used by external autoscalers and Planner-driven scaling flows.
- `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
- `DynamoCheckpointController`: Watches `DynamoCheckpoint` CRs for GPU worker checkpoint/restore workflows.
- **Workflow:**
1. A custom resource is created by the user or API server.
2. The corresponding controller detects the change and runs reconciliation.
3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
4. Status fields are updated to reflect the current state.
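The workflow above follows the usual Kubernetes reconciliation pattern: compare what the custom resource asks for with what exists, then create or update the difference. The sketch below illustrates only that comparison step; it is not the operator's code, and `plan_reconcile` and its dictionary inputs are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_reconcile(
    desired: Dict[str, dict],
    observed: Dict[str, dict],
) -> List[Tuple[str, str]]:
    """Return (action, resource_name) pairs that move `observed` toward `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(("create", name))      # step 3: resource is missing
        elif observed[name] != spec:
            actions.append(("update", name))      # step 3: resource has drifted
    return actions
```

Reconciliation is idempotent: once the observed state equals the desired state, the plan is empty and another pass changes nothing.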
## Deployment Modes
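The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases. Two of them (namespace-scoped and hybrid) are deprecated; use cluster-wide mode for new deployments: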
### 1. Cluster-Wide Mode (Default, Recommended)
The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.
**When to Use:**
- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster
---
### 2. Namespace-Scoped Mode (DEPRECATED)
> **DEPRECATED:** Namespace-scoped mode (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. Use cluster-wide mode instead. Do not use this for new deployments.
The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.
**When to Use:**
- You're on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators
**Installation:**
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace my-namespace \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true
```
---
### 3. Hybrid Mode (DEPRECATED)
> **DEPRECATED:** Hybrid mode relies on namespace-scoped operators, which are deprecated and will be removed in a future release. Use a single cluster-wide operator instead.
A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.
**When to Use:**
- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters
**How It Works:**
1. Namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace
2. Cluster-wide operator watches for these lease markers across all namespaces
3. Cluster-wide operator automatically excludes any namespace with a lease marker
4. If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
5. Cluster-wide operator automatically resumes managing that namespace
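The lease-based handoff in the steps above boils down to a time comparison. The sketch below models it under simplifying assumptions: `managed_namespaces` is a hypothetical function, lease renewal times are plain timestamps in seconds, and the 30-second TTL is the default mentioned in step 4. It does not reflect how the operator reads Lease objects.

```python
from typing import Dict, Iterable, Set

LEASE_TTL_S = 30.0  # default lease TTL from the steps above


def managed_namespaces(
    all_namespaces: Iterable[str],
    lease_renewed_at: Dict[str, float],
    now: float,
    ttl_s: float = LEASE_TTL_S,
) -> Set[str]:
    """Namespaces the cluster-wide operator manages at time `now`."""
    # A namespace is excluded only while its lease marker is still fresh.
    excluded = {
        ns for ns, renewed in lease_renewed_at.items()
        if now - renewed < ttl_s
    }
    return set(all_namespaces) - excluded
```

For example, with a lease in `test-namespace` last renewed at t=100, the cluster-wide operator skips that namespace at t=110 and resumes managing it once the lease has expired, here at t=130 or later.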
**Setup Example:**
```bash
# 1. Install cluster-wide operator (production, v1.0.0)
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--create-namespace
# 2. Install namespace-scoped operator (testing, v2.0.0-beta)
helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace test-namespace \
--create-namespace \
--set dynamo-operator.namespaceRestriction.enabled=true \
--set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
```
**Observability:**
```bash
# List all namespaces with local operators
kubectl get lease -A --field-selector metadata.name=dynamo-operator-namespace-scope
# Check which operator version is running in a namespace
kubectl get lease -n my-namespace dynamo-operator-namespace-scope \
-o jsonpath='{.spec.holderIdentity}'
```
## Custom Resource Definitions (CRDs)
Dynamo installs the following Custom Resources. The main deployment path is:
create or generate a `DynamoGraphDeployment`, then let the operator create the
lower-level resources that run it.
| Custom Resource | What it represents | Typical use |
|---|---|---|
| `DynamoGraphDeployment` (DGD) | The canonical live deployment for a Dynamo inference graph. | Author directly, apply a tuned recipe, or let DGDR generate it. |
| `DynamoGraphDeploymentRequest` (DGDR) | A deploy-by-intent request that profiles a model/hardware target and generates a DGD. | Start here when you want Dynamo to choose sizing, parallelism, or Planner-enabled generated config. |
| `DynamoComponentDeployment` (DCD) | Per-component deployments created from a DGD, such as frontend, router, prefill, decode, and planner components. | Usually inspected for debugging rather than authored directly. |
| `DynamoModel` | Model and adapter lifecycle management layered onto a running deployment. | Load, unload, or manage model artifacts such as LoRA adapters. |
| `DynamoCheckpoint` | Checkpoint metadata and job configuration for snapshotting GPU workers. | Use with Snapshotting GPU Workers to restore warm workers faster than cold start. |
Advanced and operator-owned resources:
- `DynamoGraphDeploymentScalingAdapter`: scaling interface used by Planner or external autoscalers to adjust component replicas.
- `DynamoWorkerMetadata`: discovery metadata written for worker pods.
For the complete technical API reference for Dynamo Custom Resource Definitions, see:
**📖 [Dynamo CRD API Reference](/dynamo/additional-resources/api-reference-k-8-s)**
For user-focused workflows, see:
- **[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide)** for DGD, DCD, DGDR, and recipes
- **[DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference)** for deploy-by-intent generated deployments
- **[Managing Models with DynamoModel Guide](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model)**
- **[Snapshotting GPU Workers](/dynamo/kubernetes-deployment/advanced-platform/snapshot)** for `DynamoCheckpoint`
## Webhooks
The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation and mutation of custom resources before they are persisted to the cluster. Webhooks are a required component of the operator and ensure that invalid configurations are rejected immediately at the API server level.
**Key Features:**
- ✅ Shared certificate infrastructure across all webhook types
- ✅ Automatic certificate generation and rotation (default, all environments)
- ✅ cert-manager integration (optional, for custom PKI)
- ✅ Immutability enforcement for critical fields
For complete documentation on webhooks, certificate management, and troubleshooting, see:
**📖 [Webhooks Guide](/dynamo/kubernetes-deployment/advanced-platform/webhooks)**
## Observability
The Dynamo Operator provides comprehensive observability through Prometheus metrics and Grafana dashboards. This allows you to monitor:
- **Controller Performance**: Reconciliation loop duration, success rates, and error rates by resource type
- **Webhook Activity**: Validation performance, admission rates, and denial patterns
- **Resource Inventory**: Current count of managed resources by state and namespace
- **Operational Health**: Success rates and health indicators for controllers and webhooks
### Metrics Collection
Metrics are automatically exposed on the operator's `/metrics` endpoint (port 8443 by default) and collected by Prometheus via a ServiceMonitor. The ServiceMonitor is automatically created when you install the operator via Helm (controlled by `metricsService.enabled`, which defaults to `true`).
### Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing operator metrics. The dashboard includes:
- **Reconciliation Metrics**: Rate, duration (P95), and errors by resource type
- **Webhook Metrics**: Request rate, duration (P95), and denials by resource type and operation
- **Resource Inventory**: Count of DynamoGraphDeployments by state and namespace
- **Operational Health**: Success rate gauges for controllers and webhooks
For complete setup instructions and metrics reference, see:
**📖 [Operator Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/operator-metrics)**
## Installation
### Quick Install with Helm
```bash
# Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# Install Platform (includes operator)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
```
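After the chart installs, a quick sanity check can confirm that the release, the pods, and the CRDs are in place. The sketch below assumes the release name `dynamo-platform` from the command above; `verify_dynamo_install` is a hypothetical helper name, and the `nvidia.com` filter assumes the Dynamo CRDs live in the `nvidia.com` API group, as the manifests in this guide suggest.

```bash
# Hypothetical helper: checks the Helm release, the pods, and the Dynamo CRDs.
# Usage: verify_dynamo_install [namespace]
verify_dynamo_install() {
  ns="${1:-${NAMESPACE:-dynamo-system}}"
  # Release status as recorded by Helm.
  helm status dynamo-platform --namespace "$ns" || return 1
  # Operator and platform pods should reach Running.
  kubectl get pods --namespace "$ns" || return 1
  # CRDs installed by the chart (assumed to be in the nvidia.com API group).
  kubectl get crd -o name | grep 'nvidia\.com'
}
```

Run `verify_dynamo_install "${NAMESPACE}"` once `helm install` returns. A non-zero exit status means one of the three checks failed.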
> **Note:** Namespace-scoped and hybrid deployment modes are deprecated. Use cluster-wide mode for all new deployments. See [Deployment Modes](#deployment-modes) above if you need backward-compatible configurations.
### Building from Source
```bash
# Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=your-registry.com # your container registry, without a trailing slash
export IMAGE_TAG=latest
# Build operator image
cd deploy/operator
docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG \
--build-context snapshot=../snapshot \
--build-arg DOCKER_PROXY="" \
.
docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG
cd -
# Install platform with custom operator image (CRDs are automatically installed by the chart)
cd deploy/helm/charts
helm install dynamo-platform ./platform/ \
--namespace ${NAMESPACE} \
--create-namespace \
--set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/kubernetes-operator" \
--set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
--set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
```
For detailed installation options, see the [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide).
## Development
- **Code Structure:** The operator is built using Kubebuilder and the Operator SDK, with the following structure:
  - `controllers/`: Reconciliation logic
  - `api/v1alpha1/`: CRD types
  - `config/`: Manifests and Helm charts
## References
- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Operator SDK](https://sdk.operatorframework.io/)
- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)
# Minikube Setup
Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through setting up everything you need to run the Dynamo Kubernetes Platform locally.
## 1. Install Minikube
First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system.
## 2. Configure GPU Support (Optional)
Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding.
Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads!
## 3. Start Minikube
Time to launch your local cluster!
```bash
# Start Minikube with GPU support (omit --gpus all if you skipped step 2)
minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8
# Enable required addons
minikube addons enable istio-provisioner
minikube addons enable istio
minikube addons enable storage-provisioner-rancher
```
## 4. Verify Installation
Let's make sure everything is working correctly!
```bash
# Check Minikube status
minikube status
# Verify Istio installation
kubectl get pods -n istio-system
# Verify storage class
kubectl get storageclass
```
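If you would rather block until the addons are ready than re-run the checks by hand, `kubectl wait` can do the polling. This is a minimal sketch: `wait_for_minikube_addons` is a hypothetical helper name and the 300-second timeout is an arbitrary choice.

```bash
# Hypothetical helper: wait for the Istio pods, then list storage classes.
wait_for_minikube_addons() {
  # Blocks until every pod in istio-system reports Ready, or the timeout expires.
  kubectl wait --for=condition=Ready pods --all \
    --namespace istio-system --timeout=300s || return 1
  # The storage addon should have registered a StorageClass.
  kubectl get storageclass
}
```

Note that `kubectl wait` reports a failure if `istio-system` still contains pods that never become Ready (for example, completed one-shot pods), so treat a timeout as a prompt to inspect the namespace rather than as proof of a broken install.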
## Next Steps
Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](/dynamo/kubernetes-deployment/start-here/installation-guide) to deploy the platform to your local cluster.
# Deployment Overview
Dynamo's canonical Kubernetes deployment is a
[`DynamoGraphDeployment`](/dynamo/additional-resources/api-reference-k-8-s#dynamographdeployment) (DGD). A DGD
describes the inference graph you want to run. The Dynamo operator reconciles
that graph into one or more
[`DynamoComponentDeployment`](/dynamo/additional-resources/api-reference-k-8-s#dynamocomponentdeployment) (DCD)
resources, which run the frontend, router, prefill workers, decode workers, and
other graph components.
This is the Kubernetes-native control path for Dynamo: you author or generate
Dynamo resources, and the operator translates them into Kubernetes workloads,
services, routing metadata, model-loading resources, and status conditions. For
local development or incremental adoption, you can still run the same frontend,
router, and worker components outside Kubernetes.
You can create a DGD directly from a known-good manifest, or you can use a
[`DynamoGraphDeploymentRequest`](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) (DGDR) to profile your model and
generate a DGD for you.
Most users only need three ideas before they deploy:
- **Recipes are the fastest path** when one matches your model, backend,
hardware, and serving pattern. They are already DGD manifests.
- **DGDR is the guided path** when you want Dynamo to profile and generate a
DGD from model/SLA intent.
- **DGD is the object that serves traffic**. DGDR can create it, but the DGD is
what persists after profiling completes.
You do not need to author DCDs directly for normal deployments.
## Start Here: Resource Model
```mermaid
flowchart LR
DGDR["DynamoGraphDeploymentRequest (DGDR)<br/>optional generator and profiler"]
Recipes["recipes/model/.../deploy.yaml<br/>pre-tuned DGD manifests"]
DGD["DynamoGraphDeployment (DGD)<br/>canonical live deployment"]
DCD["DynamoComponentDeployments (DCDs)<br/>per-component deployments"]
Pods["Pods and Services<br/>frontend, router, workers"]
DGDR -->|"profiles + generates"| DGD
Recipes -->|"kubectl apply"| DGD
DGD -->|"operator reconciles"| DCD
DCD --> Pods
```
| Resource or path | What it is | Use it when | Learn more |
|---|---|---|---|
| `DynamoGraphDeployment` (DGD) | The canonical live deployment for a Dynamo inference graph. | You have a known-good configuration or tuned YAML. | [Creating Deployments](/dynamo/additional-resources/creating-deployments), [DGD API](/dynamo/additional-resources/api-reference-k-8-s#dynamographdeployment) |
| `DynamoComponentDeployment` (DCD) | The per-component deployment objects created from a DGD. | Usually not authored directly; inspect them to debug frontend/router/worker rollout. | [DCD API](/dynamo/additional-resources/api-reference-k-8-s#dynamocomponentdeployment) |
| `DynamoGraphDeploymentRequest` (DGDR) | A deploy-by-intent request that profiles your model/hardware and generates a DGD. | You want Dynamo to size the deployment, choose parallelism, configure supported generated-deployment features such as Planner, or produce DGD YAML. | [DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) |
| Recipes | Curated `deploy.yaml` manifests that are already DGD specs. | A recipe matches your model, backend, hardware, and serving mode. | [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) |
| `DynamoModel` | Model and adapter lifecycle management layered onto an existing DGD or DCD. | You need declarative model operations such as LoRA adapter loading. | [Managing Models with DynamoModel](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model) |
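When debugging a rollout, it can help to list the whole resource chain in one call. The sketch below uses the `dgdr` and `dgd` short names that appear elsewhere in this guide; the plural name `dynamocomponentdeployments` and the helper name `show_dynamo_resources` are assumptions, so adjust them if `kubectl api-resources` reports something different on your cluster.

```bash
# Hypothetical helper: show DGDRs, DGDs, DCDs, and pods in one namespace.
# Usage: show_dynamo_resources [namespace]
show_dynamo_resources() {
  ns="${1:-${NAMESPACE:-dynamo-system}}"
  kubectl get dgdr,dgd,dynamocomponentdeployments,pods --namespace "$ns"
}
```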
## Choose Your Path
Start with the row that matches your situation. The sections later in this page
are reference material; you can read them as needed instead of going linearly.
| Situation | Do this first | Then read |
|---|---|---|
| A recipe matches your model/backend/hardware | Apply the recipe's model cache resources, then apply its `deploy.yaml`. | [Deploy a Tuned DGD from Recipes](#deploy-a-tuned-dgd-from-recipes) |
| You want Dynamo to generate the deployment | Create a DGDR. Use `autoApply: true` to let the operator create the DGD, or `autoApply: false` to inspect the generated DGD YAML first. | [Use DGDR to Generate a DGD](#use-dgdr-to-generate-a-dgd) |
| You already know the exact topology | Author or edit a DGD directly, then apply it with `kubectl`. | [Creating Deployments](/dynamo/additional-resources/creating-deployments) |
| You are preparing for production | Add model caching, choose backend/search strategy, and validate networking/planner needs. | [Production Details](#production-details) |
## Deploy a Tuned DGD from Recipes
If a [recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes) matches
your target model, backend, GPU type, and serving mode, start there. Recipes are
curated `DynamoGraphDeployment` manifests with model-cache setup and, for many
recipes, benchmark jobs.
The common recipe flow is:
```bash
cd recipes
# Update the recipe storageClassName first, then create model cache resources.
kubectl apply -f /model-cache/ -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download \
-n ${NAMESPACE} --timeout=6000s
# Deploy a tuned DGD.
kubectl apply -f ///deploy.yaml -n ${NAMESPACE}
```
Follow the README in the specific recipe directory for model-specific images,
GPU requirements, cache setup, and request examples.
## Use DGDR to Generate a DGD
A DGDR is Dynamo's deploy-by-intent path. Instead of hand-crafting a deployment
spec with parallelism settings, replica counts, and resource limits, you
describe what you want to run (model, backend, workload, SLA targets) and DGDR
generates a DGD:
1. **Spec** — You submit a DGDR with your model, workload expectations, and
optional SLA targets.
2. **Hardware Discovery** — The operator discovers your cluster's GPU hardware
(SKU, VRAM, count per node) via DCGM or node labels.
3. **Profiling** — The profiler analyzes your model against the discovered
hardware, using either rapid simulation or thorough real-GPU benchmarking.
4. **DGD Generation** — The profiler produces an optimized
`DynamoGraphDeployment` (DGD) spec with the best parallelization strategy,
replica counts, and resource configuration.
5. **Review** (when `autoApply: false`) — The generated DGD is stored in
`.status.profilingResults.selectedConfig` for you to inspect and optionally
modify before deploying.
6. **Deploy** — With `autoApply: true`, the operator creates the DGD. With
`autoApply: false`, you apply the generated DGD yourself.
7. **Planner** (optional) — If enabled, the Planner monitors live traffic and
adjusts replica counts at runtime to meet your SLA targets.
DGDR currently supports generated-deployment feature configuration for Planner
(`features.planner`) and mocker mode (`features.mocker`). The DGDR API does not
currently expose `features.kvRouter`; configure explicit router mode in a DGD,
a tuned recipe, or a generated DGD override when you need KV-aware routing
details.
```text
┌──────┐    ┌───────────┐    ┌──────────┐    ┌─────────────┐    ┌────────┐    ┌─────────┐
│ Spec │───▶│ Hardware  │───▶│ Profiler │───▶│  Generated  │───▶│ Deploy │───▶│ Planner │
│      │    │ Discovery │    │          │    │     DGD     │    │        │    │ (opt.)  │
└──────┘    └───────────┘    └──────────┘    └─────────────┘    └────────┘    └─────────┘
                                                    │
                                            autoApply: false?
                                                    ▼ Review
For the DGDR spec reference, field descriptions, and lifecycle phases, see the
[DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference).
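For the review step, the generated DGD can be read back from the DGDR status with a JSONPath query. This is a sketch under assumptions: `show_generated_dgd` is a hypothetical helper name, and the exact serialization of `selectedConfig` may differ between releases, so inspect the output before applying it.

```bash
# Hypothetical helper: print the generated DGD stored on a DGDR.
# Usage: show_generated_dgd <dgdr-name> [namespace]
show_generated_dgd() {
  name="$1"
  ns="${2:-${NAMESPACE:-default}}"
  kubectl get dgdr "$name" --namespace "$ns" \
    -o jsonpath='{.status.profilingResults.selectedConfig}'
}
```

With `autoApply: false`, one option is to redirect the output to a file, review or edit it, and then deploy it with `kubectl apply -f`.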
## DGDR Detail: Choose a Search Strategy
The `searchStrategy` field controls how the profiler explores configurations.
Your choice depends on how much time you can invest and how close to optimal
you need.
### Rapid (Default)
```yaml
searchStrategy: rapid
```
Uses AIC-backed DynoSim-style performance modeling to search deployment
configurations without running real inference. Completes in ~30 seconds with no
GPU resources consumed during profiling.
**Use rapid when:**
- Getting started or iterating quickly
- Running in CI/CD pipelines
- Your GPU SKU is in the [AIC support matrix](#dgdr-detail-aic-support-matrix)
**Limitations:**
- If AIC does not support your model/hardware/backend combination, the profiler
falls back to a naive memory-fit config (basic TP calculation) which may not
be optimal.
- Simulated results may differ from real-hardware performance for unusual
configurations.
### Thorough
```yaml
searchStrategy: thorough
backend: vllm # must specify a concrete backend
```
Enumerates candidate parallelization configs, deploys each on real GPUs, and
benchmarks with AIPerf. Takes 2–4 hours.
**Use thorough when:**
- Tuning for production and you need the best-performing configuration
- Your hardware is not supported by AIC (e.g., PCIe GPUs)
- You want measured rather than simulated performance data
**Constraints:**
- **Disaggregated mode only** — thorough does not run aggregated configurations.
- **`backend: auto` is not supported** — you must specify `vllm`, `sglang`, or
`trtllm`. The DGDR will be rejected if you use `auto` with `thorough`.
- **Requires GPU resources** — the profiler deploys real inference engines on
your cluster during profiling.
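Because a `thorough` request with `backend: auto` is rejected, a small pre-flight check in a deployment script can catch the mistake before the DGDR is submitted. The function below is a hypothetical sketch that only mirrors the constraints listed above; it is not part of Dynamo and does not replace the operator's own validation.

```bash
# Hypothetical pre-flight check for the searchStrategy/backend combination.
# Usage: check_search_strategy <searchStrategy> [backend]
check_search_strategy() {
  strategy="$1"
  backend="${2:-auto}"
  case "$strategy" in
    rapid)
      return 0 ;;
    thorough)
      case "$backend" in
        vllm|sglang|trtllm)
          return 0 ;;
        *)
          echo "thorough requires backend vllm, sglang, or trtllm (got: $backend)" >&2
          return 1 ;;
      esac ;;
    *)
      echo "unknown searchStrategy: $strategy" >&2
      return 1 ;;
  esac
}
```

For example, `check_search_strategy thorough auto` returns a non-zero status, while `check_search_strategy thorough vllm` succeeds.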
## DGDR Detail: AIC Support Matrix
The rapid strategy relies on AIC performance models. AIC currently supports:
### GPU SKUs
| Supported (rapid) | Not Yet Supported (use thorough) |
|---|---|
| H100 SXM | V100 (SXM/PCIe) |
| H100 PCIe | T4 |
| H200 SXM | MI200, MI300 |
| A100 SXM | |
| A100 PCIe | |
| A30 | |
| B200 SXM | |
| GB200 SXM | |
| L40S | |
| L4 | |
Some rapid-mode SKUs use AIC estimate-only data until measured profiles are
available. Use `searchStrategy: thorough` when you need hardware-measured
profiling for an estimate-only or unsupported SKU.
When specifying GPU SKUs manually, use lowercase underscore format (e.g.,
`h100_sxm`, not `H100-SXM5-80GB`). See the
[DGDR Reference — SKU Format](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference#sku-format) for the full list.
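A script that accepts a SKU from user input can at least verify the lowercase underscore shape before building a DGDR. The check below is a hypothetical sketch: `is_sku_format` only tests the format and does not confirm that the SKU appears in the supported list.

```bash
# Hypothetical format check: lowercase letters and digits, separated by single underscores.
# Usage: is_sku_format <sku>
is_sku_format() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+(_[a-z0-9]+)*$'
}
```

Here `is_sku_format h100_sxm` succeeds and `is_sku_format H100-SXM5-80GB` fails.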
### Backends
All three backends are supported for both rapid and thorough:
| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 Work in progress |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 Work in progress |
**If you are deploying a Mixture-of-Experts (MoE) model** (e.g., DeepSeek-R1,
Qwen3-MoE), use **SGLang** as the backend for full support. vLLM and TRT-LLM
have partial MoE support that is still under development.
### Parallelization Strategies
The profiler selects different parallelization strategies depending on the
model architecture:
| Model Architecture | Prefill | Decode |
|---|---|---|
| MLA+MoE (DeepSeek-V3, DeepSeek-R1) | TEP, DEP | TEP, DEP |
| GQA+MoE (Qwen3-MoE) | TP, TEP, DEP | TP, TEP, DEP |
| Dense models (Llama, Qwen, etc.) | TP | TP |
## Production Details
After the basic deployment path is clear, use this checklist to decide which
production topics apply:
| Concern | Why it matters | Section |
|---|---|---|
| Model startup is slow or the model is gated | Avoid repeated downloads and pass `HF_TOKEN` cleanly. | [Model Caching](#production-detail-model-caching) |
| Traffic changes over time | Planner can scale prefill/decode replicas at runtime. | [Planner](#production-detail-planner) |
| The model spans nodes or uses disaggregated serving | Grove/LWS and RDMA affect scheduling and KV transfer. | [Multinode and RDMA](#production-detail-multinode-and-rdma) |
| You need a specific inference engine | Backend choice affects MoE support, thorough profiling, and distributed behavior. | [Backend Selection](#production-detail-backend-selection) |
## Production Detail: Model Caching
**Set up model caching before deploying if any of these apply:**
- Your model is large (>70B parameters) — downloading hundreds of GB per pod
takes hours
- You are scaling to many replicas — each pod downloads the full model
independently, and HuggingFace will rate-limit concurrent downloads
- You want fast pod startup on scaling events
### How It Works with DGDR
Add a `modelCache` section to your DGDR spec that points to a pre-populated PVC:
```yaml
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    pvcModelPath: hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/
```
The operator mounts this PVC at `pvcMountPath` read-only into the profiling job
and passes it through to the generated DGD, so both profiling and serving use
the cached weights.
`pvcModelPath` must be the HuggingFace snapshot path inside the PVC, in the form
`hub/models--<org>--<model>/snapshots/<hash>`. This follows the layout
that `huggingface-cli download` creates when `HF_HOME` is set to the mount
point. Build `<org>--<model>` by substituting `/` with `--` in the model ID,
and replace `<hash>` with the actual snapshot revision. See
[Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching#find-the-snapshot-path) for how to look up the
hash after downloading.
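The directory part of the path can be derived mechanically from the model ID. The helper below is a hypothetical sketch (`hf_snapshot_prefix` is not a Dynamo or Hugging Face command); it produces everything up to `snapshots/`, and the snapshot revision still has to be looked up and appended.

```bash
# Hypothetical helper: build the snapshot directory prefix for a model ID.
# Usage: hf_snapshot_prefix <org/model>
hf_snapshot_prefix() {
  # Replace every "/" in the model ID with "--".
  model_dir="$(printf '%s' "$1" | sed 's#/#--#g')"
  printf 'hub/models--%s/snapshots/\n' "$model_dir"
}
```

For example, `hf_snapshot_prefix meta-llama/Llama-3.1-70B-Instruct` prints `hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/`.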
### Setup
1. Create a `ReadWriteMany` PVC — see the
[Installation Guide — Shared Storage](/dynamo/kubernetes-deployment/start-here/installation-guide#shared-storage-for-model-caching)
for provider-specific options (EFS, Azure Lustre, GKE Filestore).
2. Run a one-time download Job to populate the PVC.
3. Reference the PVC in your DGDR's `modelCache` field.
See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) for the full walkthrough with YAML
examples.
### Private and Gated Models
For models that require authentication (e.g., gated HuggingFace models), create
a Kubernetes Secret named `hf-token-secret` with a `HF_TOKEN` key:
```bash
# Assumes HF_TOKEN is already exported in your shell.
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n $NAMESPACE
```
The profiler and deployed pods will automatically use this token.
## Production Detail: Planner
The Planner provides **runtime autoscaling** for disaggregated deployments. It
adjusts prefill and decode replica counts to meet your SLA targets as traffic
fluctuates.
```yaml
spec:
  features:
    planner:
      enabled: true
  sla:
    ttft: 500 # Target time to first token (ms)
    itl: 50 # Target inter-token latency (ms)
```
### Planner Scaling Modes
| Mode | Description | Prometheus Required? |
|---|---|---|
| `throughput` (default) | Static queue-depth and KV-cache thresholds; scales based on saturation | No |
| `latency` | Same as throughput with more aggressive thresholds | No |
| `sla` | Rust engine perf shim targeting specific TTFT/ITL values; uses native AIC when available, optional bootstrap data, and live FPM tuning | Yes |
### Prometheus Requirement
The `sla` optimization target reads live TTFT/ITL metrics from Prometheus. If
you want SLA-driven autoscaling, install Prometheus before creating the DGDR.
See the [Installation Guide — Prometheus](/dynamo/kubernetes-deployment/start-here/installation-guide#kube-prometheus-stack)
for setup instructions.
The `throughput` and `latency` modes use internal queue-depth signals and work
**without Prometheus**.
See the [Planner Guide](/dynamo/components/planner/planner-guide) for advanced
configuration and scaling behavior details.
## Production Detail: Multinode and RDMA
Models that require more GPUs than a single node provides (e.g., DeepSeek-R1 on
8-GPU nodes) need multinode orchestration.
### Grove and KAI Scheduler
**Grove** is required for multinode DGDR deployments. It provides gang
scheduling (all pods in a group start together or not at all), coordinated
scaling, and network topology-aware placement. The operator will return an error
if you attempt a multinode deployment without Grove or LeaderWorkerSet (LWS)
installed.
**KAI Scheduler** is optional but recommended alongside Grove for GPU-aware
scheduling and topology optimization.
See the [Installation Guide — Grove + KAI Scheduler](/dynamo/kubernetes-deployment/start-here/installation-guide#grove--kai-scheduler)
for setup instructions and the compatibility matrix.
### High-Speed Networking (RDMA)
Disaggregated serving transfers KV cache data between prefill and decode workers.
Understanding the networking stack helps you diagnose performance issues:
| Layer | What it is |
|---|---|
| **NIXL** | Dynamo's KV cache transfer library. Moves data between prefill and decode pods. |
| **UCX / libfabric** | Low-level communication frameworks that NIXL uses underneath. |
| **RDMA** | Remote Direct Memory Access — the general technique for moving data between machines without involving the CPU. |
| **InfiniBand** | High-speed RDMA networking standard. Common on-prem and on Azure (AKS). |
| **RoCE** | RDMA over Converged Ethernet — RDMA on standard Ethernet hardware. |
| **EFA** | AWS Elastic Fabric Adapter — AWS's RDMA-capable networking for EKS. |
| **GPUDirect RDMA** | Allows data to go directly between a GPU and a network adapter, bypassing CPU memory entirely. |
| **NCCL** | NVIDIA Collective Communications Library — handles intra-model parallelism (TP/PP) communication _within_ a pod. Separate from NIXL. |
When RDMA is missing or not active, NIXL can fall back to TCP. That makes KV
cache movement the likely bottleneck and can produce very high TTFT or low
throughput even when the model workers appear healthy.
**Enable RDMA if:**
- You are running multinode disaggregated deployments
- You need low-latency KV cache transfer between workers
See the [Installation Guide — Network Operator / RDMA](/dynamo/kubernetes-deployment/start-here/installation-guide#network-operator--rdma)
for provider-specific setup instructions, and the
[Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for transport
details and performance expectations.
### MoE Models and Multinode Sweep Limits
The profiler sweeps MoE models across up to **4 nodes** (dense models: 1 node
max per engine during sweep). If your MoE model requires more than 4 nodes of
GPUs, the profiler will select the best config within that range and you may
need to adjust replica counts manually.
## Production Detail: Backend Selection
The `backend` field controls which inference engine is used. The default
(`auto`) lets the profiler pick the best backend, but you should specify a
backend explicitly in these cases:
| Scenario | Recommended Backend |
|---|---|
| MoE models (DeepSeek-R1, Qwen3-MoE) | `sglang` (full MoE support) |
| Using `searchStrategy: thorough` | Any except `auto` (required) |
| TensorRT-LLM compilation caching | `trtllm` (add a compilation cache PVC) |
| Need load-based planner scaling (FPM) | `vllm` (any config) or `trtllm` (non-attention-DP only). SGLang FPM is wired in Dynamo but the upstream module is not in the 1.2.0 runtime image. |
TensorRT-LLM does not support Python 3.11. If your environment uses
Python 3.11, use `vllm` or `sglang` instead.
### Multinode Backend Behavior
Each backend handles multinode inference differently:
- **vLLM**: Uses Ray for multi-node TP/PP. Ray head runs on the leader, agents on workers.
- **SGLang**: Uses `--dist-init-addr`, `--nnodes`, `--node-rank` flags for distributed setup.
- **TRT-LLM**: MPI-based. The operator auto-generates SSH keypairs; the leader runs `mpirun`.
## Troubleshooting
### OOM During Profiling or Serving
- **Cause**: The model doesn't fit in GPU memory with the selected TP size.
- **Fix**: Ensure `hardware.totalGpus` is large enough for your model. The
profiler calculates minimum TP from model size and VRAM, but edge cases
(large context lengths, KV cache overhead) may require more GPUs than the
minimum.
### GPU Auto-Detection Cap
The operator caps auto-detected GPU count at **32**. If your cluster has more
GPUs and you want the profiler to use them, set `hardware.totalGpus` explicitly:
```yaml
spec:
  hardware:
    totalGpus: 64
```
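To decide whether the cap applies, you can total the allocatable GPUs that the nodes advertise. The helpers below are a hypothetical sketch: they assume GPUs are exposed as the `nvidia.com/gpu` resource (the name used by the NVIDIA device plugin), and the function names are illustrative.

```bash
# Hypothetical helpers for comparing the cluster GPU count with the auto-detection cap.
GPU_AUTODETECT_CAP=32

# Print allocatable nvidia.com/gpu per node, one value per line (blank for CPU-only nodes).
list_node_gpus() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Sum one GPU count per line from stdin; blank lines count as zero.
sum_gpus() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Succeed when the total exceeds the cap, meaning hardware.totalGpus should be set.
exceeds_autodetect_cap() {
  [ "$1" -gt "$GPU_AUTODETECT_CAP" ]
}
```

A typical use is `total="$(list_node_gpus | sum_gpus)"` followed by `exceeds_autodetect_cap "$total" && echo "set hardware.totalGpus: $total"`.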
### Profiling Job Fails to Schedule
GPU nodes often have taints. Add tolerations via the `overrides` field:
```yaml
spec:
  overrides:
    profilingJob:
      template:
        spec:
          containers: [] # required placeholder
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```
### DGDR Spec Is Immutable
Once the DGDR enters the `Profiling` phase, the spec cannot be changed. If you
need to adjust settings, delete the DGDR and recreate it:
```bash
kubectl delete dgdr my-model -n $NAMESPACE
kubectl apply -f updated-dgdr.yaml -n $NAMESPACE
```
### DGD Persists After DGDR Deletion
Deleting a DGDR does **not** delete the DGD it created. This is intentional —
the DGD continues serving traffic independently. To clean up fully:
```bash
kubectl delete dgdr my-model -n $NAMESPACE
kubectl delete dgd my-model-dgd -n $NAMESPACE
```
## Example Workflows
### Small Dense Model (Quick Start)
A small model on a single node with rapid profiling — the simplest case:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-small
spec:
  model: Qwen/Qwen3-0.6B
```
### Large Dense Model with SLA Targets
A 70B model with model caching, SLA targets, and the planner enabled:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-70b
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  backend: vllm
  searchStrategy: rapid
  autoApply: false
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    pvcModelPath: hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/
  sla:
    ttft: 500
    itl: 50
  workload:
    isl: 4000
    osl: 1000
    requestRate: 10
  features:
    planner:
      enabled: true
```
### MoE Model (DeepSeek-R1)
A large MoE model requiring multinode, SGLang backend, and thorough profiling:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  searchStrategy: thorough
  autoApply: false
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    pvcModelPath: hub/models--deepseek-ai--DeepSeek-R1/snapshots/
  sla:
    ttft: 2000
    itl: 100
  hardware:
    totalGpus: 32
  features:
    planner:
      enabled: true
  overrides:
    profilingJob:
      template:
        spec:
          containers: []
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```
**Prerequisites for this deployment:**
- [Grove and KAI Scheduler](/dynamo/kubernetes-deployment/start-here/installation-guide#grove--kai-scheduler) installed
- [RDMA](/dynamo/kubernetes-deployment/start-here/installation-guide#network-operator--rdma) configured for efficient KV cache transfer
- Model [cached on a shared PVC](/dynamo/kubernetes-deployment/start-here/installation-guide#shared-storage-for-model-caching)
- [Prometheus](/dynamo/kubernetes-deployment/start-here/installation-guide#kube-prometheus-stack) installed (for SLA-driven planner scaling)
## Further Reading
- [DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) — Spec reference, lifecycle phases, monitoring commands
- [DGDR Examples](/dynamo/components/profiler/profiler-examples) — Ready-to-use YAML for various scenarios
- [Profiler Guide](/dynamo/components/profiler/profiler-guide) — Profiling algorithms, picking modes, gate checks
- [Planner Guide](/dynamo/components/planner/planner-guide) — Scaling modes, PlannerConfig reference
- [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) — PVC setup, ModelExpress, and ModelStreamer
- [Creating Deployments](/dynamo/additional-resources/creating-deployments) — Manual DGD spec for hand-crafted configs
- [Multinode Deployments](/dynamo/kubernetes-deployment/scale/multinode-deployments) — Grove, LWS, and multinode details
- [Disaggregated Communication](/dynamo/kubernetes-deployment/operate/disagg-communication) — NIXL, RDMA, and networking
# Managing Models with DynamoModel
## Overview
`DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to:
- **Deploy LoRA adapters** on top of running base models
- **Track model endpoints** and their readiness across your cluster
- **Manage model lifecycle** declaratively with Kubernetes
DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters.
## Quick Start
### Prerequisites
Before creating a DynamoModel, you need:
1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment`
2. Components configured with `modelRef` pointing to your base model
3. Pods that are ready and serving your base model
For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment).
### Deploy a LoRA Adapter
**1. Create your DynamoModel:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
  namespace: dynamo-system
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name in your DGD
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
**2. Apply and verify:**
```bash
# Apply the DynamoModel
kubectl apply -f my-lora.yaml
# Check status
kubectl get dynamomodel my-lora
```
**Expected output:**
```
NAME TOTAL READY AGE
my-lora 2 2 30s
```
That's it! The operator automatically discovers endpoints and loads the LoRA.
For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations).
## Understanding DynamoModel
### Model Types
DynamoModel supports three model types:
| Type | Description | Use Case |
|------|-------------|----------|
| **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) |
| **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models |
| **`adapter`** | Generic model adapter | Future extensibility for other adapter types |
Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments.
### How It Works
When you create a DynamoModel, the operator:
1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD)
2. **Creates service**: Automatically creates a Kubernetes Service to track these pods
3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type)
4. **Updates status**: Reports which endpoints are ready
**Key linkage:**
```yaml
# DGD modelRef.name ↔ DynamoModel baseModelName must match
Worker:
  modelRef:
    name: Qwen/Qwen3-0.6B
---
spec:
  baseModelName: Qwen/Qwen3-0.6B
```
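The linkage can be checked before applying a DynamoModel. The following is a minimal sketch, assuming both manifests have already been parsed into dictionaries (for example from `kubectl get ... -o json`); the helper name `workers_serving` and the sample data are hypothetical and not part of Dynamo.

```python
def workers_serving(dgd: dict, base_model_name: str) -> list[str]:
    """Return the DGD service names whose modelRef.name equals base_model_name."""
    services = dgd.get("spec", {}).get("services", {})
    return [
        name
        for name, svc in services.items()
        if (svc.get("modelRef") or {}).get("name") == base_model_name
    ]


# Hypothetical manifests, reduced to the fields that matter for the link.
dgd = {
    "spec": {
        "services": {
            "Frontend": {"componentType": "frontend"},
            "Worker": {"modelRef": {"name": "Qwen/Qwen3-0.6B"}},
        }
    }
}
dynamo_model = {"spec": {"baseModelName": "Qwen/Qwen3-0.6B"}}

matches = workers_serving(dgd, dynamo_model["spec"]["baseModelName"])
print(matches or "no service references this base model")
```

The comparison is an exact string match, so a difference in case or a missing organization prefix yields an empty list.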
## Configuration Overview
DynamoModel requires just a few key fields to deploy a model or adapter:
| Field | Required | Purpose | Example |
|-------|----------|---------|---------|
| `modelName` | Yes | Model identifier | `my-custom-lora` |
| `baseModelName` | Yes | Links to DGD modelRef | `Qwen/Qwen3-0.6B` |
| `modelType` | No | Type: base/lora/adapter | `lora` (default: `base`) |
| `source.uri` | For LoRA | Model location | `s3://bucket/path` or `hf://org/model` |
**Example minimal LoRA configuration:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://my-bucket/my-lora
```
**For complete field specifications, validation rules, and all options, see:**
📖 [DynamoModel API Reference](/dynamo/additional-resources/api-reference-k-8-s#dynamomodel)
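The field table above can be expressed as a small pre-flight check. This is a sketch that mirrors only the rules listed in the table; the function name is hypothetical, and the operator's actual validation (see the API reference) may be stricter.

```python
VALID_MODEL_TYPES = {"base", "lora", "adapter"}


def check_dynamo_model_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a DynamoModel spec (empty if none)."""
    problems = []
    for field in ("modelName", "baseModelName"):
        if not spec.get(field):
            problems.append(f"{field} is required")
    model_type = spec.get("modelType", "base")  # documented default
    if model_type not in VALID_MODEL_TYPES:
        problems.append(f"modelType must be one of {sorted(VALID_MODEL_TYPES)}")
    # The table marks source.uri as required for LoRA.
    if model_type == "lora" and not (spec.get("source") or {}).get("uri"):
        problems.append("source.uri is required when modelType is lora")
    return problems


spec = {
    "modelName": "my-custom-lora",
    "baseModelName": "Qwen/Qwen3-0.6B",
    "modelType": "lora",
    "source": {"uri": "s3://my-bucket/my-lora"},
}
print(check_dynamo_model_spec(spec) or "spec looks complete")
```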
### Status Summary
The status shows discovered endpoints and their readiness:
```bash
kubectl get dynamomodel my-lora
```
**Key status fields:**
- `totalEndpoints` / `readyEndpoints`: Counts of discovered vs ready endpoints
- `endpoints[]`: List with addresses, pod names, and ready status
- `conditions`: Standard Kubernetes conditions (EndpointsReady, ServicesFound)
For detailed status usage, see the [Monitoring & Operations](#monitoring--operations) section below.
## Common Use Cases
### Use Case 1: S3-Hosted LoRA Adapter
Deploy a LoRA adapter stored in an S3 bucket.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: production
spec:
  modelName: customer-support-adapter-v1
  baseModelName: meta-llama/Llama-3.3-70B-Instruct
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1
```
**Prerequisites:**
- S3 bucket accessible from your pods (IAM role or credentials)
- Base model `meta-llama/Llama-3.3-70B-Instruct` running via DGD/DCD
**Verification:**
```bash
# Check LoRA is loaded
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.readyEndpoints}'
# Should output: 2 (or your number of replicas)
# View which pods are serving
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.endpoints[*].podName}'
```
### Use Case 2: HuggingFace-Hosted LoRA
Deploy a LoRA adapter from HuggingFace Hub.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: multilingual-lora
  namespace: dynamo-system
spec:
  modelName: multilingual-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: hf://myorg/qwen-multilingual-lora@v1.0.0 # Optional: @revision
```
**Prerequisites:**
- HuggingFace Hub accessible from your pods
- If private repo: HF token configured as secret and mounted in pods
- Base model `Qwen/Qwen3-0.6B` running via DGD/DCD
**With HuggingFace token:**
```yaml
# In your DGD/DCD
spec:
  services:
    worker:
      envFromSecret: hf-token-secret # Provides HF_TOKEN env var
      modelRef:
        name: Qwen/Qwen3-0.6B
      # ... rest of config
```
### Use Case 3: Multiple LoRAs on Same Base Model
Deploy multiple LoRA adapters on the same base model deployment.
```yaml
---
# LoRA for customer support
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: support-lora
spec:
  modelName: support-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://models/support-lora
---
# LoRA for code generation
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: code-lora
spec:
  modelName: code-adapter
  baseModelName: Qwen/Qwen3-0.6B # Same base model
  modelType: lora
  source:
    uri: s3://models/code-lora
```
Both LoRAs will be loaded on all pods serving `Qwen/Qwen3-0.6B`. Your application can then route requests to the appropriate adapter.
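When many adapters share one base model, the manifests can be generated rather than written by hand. The sketch below is one possible approach, assuming the same field layout as the YAML above; the `lora_manifest` helper and the adapter table are hypothetical. It emits a JSON `List` object, which `kubectl apply -f -` accepts in place of multi-document YAML.

```python
import json


def lora_manifest(name: str, model_name: str, base_model: str, uri: str) -> dict:
    """Build one DynamoModel manifest of type lora."""
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoModel",
        "metadata": {"name": name},
        "spec": {
            "modelName": model_name,
            "baseModelName": base_model,
            "modelType": "lora",
            "source": {"uri": uri},
        },
    }


# Hypothetical adapter table: short key -> source URI.
adapters = {"support": "s3://models/support-lora", "code": "s3://models/code-lora"}
manifests = [
    lora_manifest(f"{key}-lora", f"{key}-adapter", "Qwen/Qwen3-0.6B", uri)
    for key, uri in adapters.items()
]
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```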
## Monitoring & Operations
### Checking Status
**Quick status check:**
```bash
kubectl get dynamomodel
```
**Example output:**
```
NAME            TOTAL   READY   AGE
my-lora         2       2       5m
customer-lora   4       3       2h
```
**Detailed status:**
```bash
kubectl describe dynamomodel my-lora
```
**Example output:**
```
Name:         my-lora
Namespace:    dynamo-system
Spec:
  Model Name:       my-custom-lora
  Base Model Name:  Qwen/Qwen3-0.6B
  Model Type:       lora
  Source:
    Uri:  s3://my-bucket/my-lora
Status:
  Ready Endpoints:  2
  Total Endpoints:  2
  Endpoints:
    Address:   http://10.0.1.5:9090
    Pod Name:  worker-0
    Ready:     true
    Address:   http://10.0.1.6:9090
    Pod Name:  worker-1
    Ready:     true
  Conditions:
    Type:    EndpointsReady
    Status:  True
    Reason:  EndpointsDiscovered
Events:
  Type    Reason          Message
  ----    ------          -------
  Normal  EndpointsReady  Discovered 2 ready endpoints for base model Qwen/Qwen3-0.6B
```
### Understanding Readiness
An endpoint is **ready** when:
1. The pod is running and healthy
2. The LoRA load API call succeeded
**Condition states:**
- `EndpointsReady=True`: All endpoints are ready (full availability)
- `EndpointsReady=False, Reason=NotReady`: Not all endpoints ready (check message for counts)
- `EndpointsReady=False, Reason=NoEndpoints`: No endpoints found
When `readyEndpoints < totalEndpoints`, the operator automatically retries loading every 30 seconds.
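The readiness rules above can be applied to the output of `kubectl get dynamomodel <name> -o json`. The following sketch assumes the `.status` object has been parsed into a dictionary; the function name and the state labels (`all-ready`, `partial`, `no-endpoints`) are illustrative and are not values the operator reports.

```python
def summarize(status: dict) -> tuple[str, list[str]]:
    """Classify a DynamoModel status and list the pods that are not ready."""
    total = status.get("totalEndpoints", 0)
    ready = status.get("readyEndpoints", 0)
    pending = [e["podName"] for e in status.get("endpoints", []) if not e.get("ready")]
    if total == 0:
        state = "no-endpoints"  # corresponds to Reason=NoEndpoints
    elif ready < total:
        state = "partial"       # corresponds to Reason=NotReady
    else:
        state = "all-ready"     # corresponds to EndpointsReady=True
    return state, pending


# Hypothetical status with one endpoint still loading.
example = {
    "totalEndpoints": 2,
    "readyEndpoints": 1,
    "endpoints": [
        {"podName": "worker-0", "ready": True},
        {"podName": "worker-1", "ready": False},
    ],
}
print(summarize(example))
```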
### Viewing Endpoints
**Get endpoint addresses:**
```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].address}' | tr ' ' '\n'
```
**Output:**
```
http://10.0.1.5:9090
http://10.0.1.6:9090
```
**Get endpoint pod names:**
```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].podName}' | tr ' ' '\n'
```
**Check readiness of each endpoint:**
```bash
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | {podName, ready}'
```
**Output:**
```json
{
  "podName": "worker-0",
  "ready": true
}
{
  "podName": "worker-1",
  "ready": true
}
```
### Updating a Model
To update a LoRA (e.g., deploy a new version):
```bash
# Edit the source URI
kubectl edit dynamomodel my-lora
# Or apply an updated YAML
kubectl apply -f my-lora-v2.yaml
```
The operator will detect the change and reload the LoRA on all endpoints.
### Deleting a Model
```bash
kubectl delete dynamomodel my-lora
```
For LoRA models, the operator will:
1. Unload the LoRA from all endpoints
2. Clean up associated resources
3. Remove the DynamoModel CR
The base model deployment (DGD/DCD) continues running normally.
## Troubleshooting
### No Endpoints Found
**Symptom:**
```yaml
status:
  totalEndpoints: 0
  readyEndpoints: 0
  conditions:
  - type: EndpointsReady
    status: "False"
    reason: NoEndpoints
    message: "No endpoint slices found for base model Qwen/Qwen3-0.6B"
```
**Common Causes:**
1. **Base model deployment not running**
```bash
# Check if pods exist
kubectl get pods -l nvidia.com/dynamo-component-type=worker
```
**Solution:** Deploy your DGD/DCD first, wait for pods to be ready.
2. **`baseModelName` mismatch**
```bash
# Check modelRef in your DGD
kubectl get dynamographdeployment my-deployment -o yaml | grep -A2 modelRef
```
**Solution:** Ensure `baseModelName` in DynamoModel exactly matches `modelRef.name` in DGD.
3. **Pods not ready**
```bash
# Check pod status
kubectl get pods -l nvidia.com/dynamo-component-type=worker
```
**Solution:** Wait for pods to reach `Running` and `Ready` state.
4. **Wrong namespace**
**Solution:** Ensure DynamoModel is in the same namespace as your DGD/DCD.
### LoRA Load Failures
**Symptom:**
```yaml
status:
  totalEndpoints: 2
  readyEndpoints: 0 # ← No endpoints ready despite pods existing
  conditions:
  - type: EndpointsReady
    status: "False"
    reason: NoReadyEndpoints
```
**Common Causes:**
1. **Source URI not accessible**
```bash
# Check operator logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f | grep "Failed to load"
```
**Solution:**
- For S3: Verify bucket permissions, IAM role, credentials
- For HuggingFace: Verify token is valid, repo exists and is accessible
2. **Invalid LoRA format**
**Solution:** Ensure your LoRA weights are in the format expected by your backend framework (SGLang, vLLM, etc.)
3. **Endpoint API errors**
```bash
# Check operator logs for HTTP errors
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "error"
```
**Solution:** Check the backend framework's logs in the worker pods:
```bash
kubectl logs worker-0
```
4. **Out of memory**
**Solution:** LoRA adapters require additional memory. Increase memory limits in your DGD:
```yaml
resources:
  limits:
    memory: "32Gi" # Increase if needed
```
### Status Shows Not Ready
**Symptom:**
Some endpoints remain not ready for extended periods.
**Diagnosis:**
```bash
# Check which endpoints are not ready
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | select(.ready == false)'
# View operator logs for that specific pod
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "worker-0"
# Check the worker pod logs
kubectl logs worker-0 | tail -50
```
**Common Causes:**
1. **Network issues**: Pod can't reach S3/HuggingFace
2. **Resource constraints**: Pod is OOMing or being throttled
3. **API endpoint not responding**: Backend framework isn't serving the LoRA API
**When to wait vs investigate:**
- **Wait**: If `readyEndpoints` is increasing over time (LoRAs are loading progressively)
- **Investigate**: If `readyEndpoints` stays at the same value for more than 5 minutes
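The wait-versus-investigate rule can be automated by sampling the status periodically. This is a rough sketch under stated assumptions: samples are collected by some external poller (not shown), the 300-second threshold comes from the 5-minute guidance above, and the function name is hypothetical.

```python
def should_investigate(samples, stuck_after_s=300):
    """Decide whether readyEndpoints has stalled.

    samples: list of (timestamp_seconds, readyEndpoints, totalEndpoints),
    oldest first.
    """
    if not samples:
        return False
    last_t, last_ready, total = samples[-1]
    if total > 0 and last_ready >= total:
        return False  # everything is ready, nothing to investigate
    # Find the most recent moment at which readyEndpoints increased.
    since = samples[0][0]
    for (_, r_prev, _), (t, r, _) in zip(samples, samples[1:]):
        if r > r_prev:
            since = t
    return last_t - since > stuck_after_s


# Progress at t=60, then no change for 340 seconds.
stalled = [(0, 0, 2), (60, 1, 2), (120, 1, 2), (400, 1, 2)]
print(should_investigate(stalled))
```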
### Viewing Events and Logs
**Check events:**
```bash
kubectl describe dynamomodel my-lora | tail -20
```
**View operator logs:**
```bash
# Follow logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f
# Filter for specific model
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "my-lora"
```
**Common events and messages:**
| Event/Message | Meaning | Action |
|---------------|---------|--------|
| `EndpointsReady` | All endpoints are ready | ✅ Good - full service availability |
| `NotReady` | Not all endpoints ready | ⚠️ Check readyEndpoints count - operator will retry |
| `PartialEndpointFailure` | Some endpoints failed to load | Check logs for errors |
| `NoEndpointsFound` | No pods discovered | Verify DGD running and modelRef matches |
| `EndpointDiscoveryFailed` | Can't query endpoints | Check operator RBAC permissions |
| `Successfully reconciled` | Reconciliation complete | ✅ Good |
## Integration with DynamoGraphDeployment
This section shows the end-to-end workflow for deploying a base model and its LoRA adapters together.
DynamoGraphDeployment and DynamoModel divide the work as follows:
- **DGD**: Deploys the infrastructure (pods, services, resources)
- **DynamoModel**: Manages model-specific operations (LoRA loading)
### Linking Models to Components
The connection is established through the `modelRef` field in your DGD:
**Complete example:**
```yaml
---
# 1. Deploy the base model infrastructure
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  backendFramework: vllm
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
    Worker:
      # This modelRef creates the link to DynamoModel
      modelRef:
        name: Qwen/Qwen3-0.6B # ← Key linking field
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --tensor-parallel-size
            - "1"
---
# 2. Deploy LoRA adapters on top
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B # ← Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```
### Deployment Workflow
**Recommended order:**
```bash
# 1. Deploy base model infrastructure
kubectl apply -f my-deployment.yaml
# 2. Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-component-type=worker --timeout=5m
# 3. Deploy LoRA adapters
kubectl apply -f my-lora.yaml
# 4. Verify LoRA is loaded
kubectl get dynamomodel my-lora
```
**What happens behind the scenes:**
| Step | DGD | DynamoModel |
|------|-----|-------------|
| 1 | Creates pods with modelRef | - |
| 2 | Pods become running and ready | - |
| 3 | - | CR created, discovers endpoints via auto-created Service |
| 4 | - | Calls LoRA load API on each endpoint |
| 5 | - | All endpoints ready ✓ |
The operator automatically handles all service discovery - you don't configure services, labels, or selectors manually.
## API Reference
For complete field specifications, validation rules, and detailed type definitions, see:
**📖 [Dynamo CRD API Reference](/dynamo/additional-resources/api-reference-k-8-s#dynamomodel)**
## Summary
DynamoModel provides declarative model management for Dynamo deployments:
✅ **Simple**: 2-step deployment of LoRA adapters
✅ **Automatic**: Endpoint discovery and loading handled by operator
✅ **Observable**: Rich status reporting and conditions
✅ **Integrated**: Works seamlessly with DynamoGraphDeployment
**Next Steps:**
- Try the [Quick Start](#quick-start) example
- Explore [Common Use Cases](#common-use-cases)
- Check the [API Reference](/dynamo/additional-resources/api-reference-k-8-s#dynamomodel) for advanced configuration
# DGDR Reference
A `DynamoGraphDeploymentRequest` (DGDR) is Dynamo's deploy-by-intent generator
for [`DynamoGraphDeployment`](/dynamo/additional-resources/api-reference-k-8-s#dynamographdeployment) (DGD)
resources. You describe what you want to run and your performance targets; the
profiler determines a configuration and produces the DGD that serves traffic.
For the full deployment mental model — including DGD, DCD, DGDR, recipes,
strategy selection, model caching, planner setup, and common pitfalls — see the
[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).
## DGDR, DGD, and Recipes
Dynamo provides two Custom Resources for deploying inference graphs:
| | DGD (canonical live deployment) | DGDR (generator/profiler) |
|---|---|---|
| **You provide** | Full deployment spec (services, parallelism, replicas, resource limits, etc.) | Model, backend, workload, hardware, and optional SLA targets |
| **What happens** | The operator reconciles the DGD into `DynamoComponentDeployment` resources and pods | The profiler generates a DGD; with `autoApply: true`, the operator creates it |
| **Best for** | Known-good configs, tuned recipes, or full manual control | New model/hardware combinations, SLA-driven sizing, or generated DGD YAML |
| **Persistence** | Persists and serves traffic | Reaches a terminal state after generation/deploy |
Use DGD directly when you have a hand-crafted configuration for a specific
model/hardware combination. Most
[recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) are tuned DGD
manifests. Use DGDR when you want Dynamo to generate the DGD for you.
For DGD deployment details, see [Creating Deployments](/dynamo/additional-resources/creating-deployments).
## Spec Reference
### Minimal Example
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model
spec:
  model: Qwen/Qwen3-0.6B
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1" # dynamo-frontend for Dynamo < 1.1.0
```
### Field Reference
| Field | Required | Default | Purpose |
|---|---|---|---|
| `model` | Yes | — | HuggingFace model ID (e.g. `Qwen/Qwen3-0.6B`) |
| `image` | No | — | Container image for the profiling job. Dynamo >= 1.1.0: use `dynamo-planner`; earlier versions: use `dynamo-frontend`. |
| `backend` | No | `auto` | Inference engine: `auto`, `vllm`, `sglang`, `trtllm` |
| `searchStrategy` | No | `rapid` | Profiling depth: `rapid` (AIC-backed DynoSim-style modeling, ~30s) or `thorough` (real GPU, 2–4h) |
| `autoApply` | No | `true` | Automatically deploy the profiler's recommended config |
| `sla.ttft` | No | — | Target time to first token (ms) |
| `sla.itl` | No | — | Target inter-token latency (ms) |
| `sla.e2eLatency` | No | — | Target end-to-end latency (ms). Cannot be combined with explicit `ttft`/`itl`. |
| `workload.isl` | No | `4000` | Expected average input sequence length |
| `workload.osl` | No | `1000` | Expected average output sequence length |
| `workload.requestRate` | No | — | Target requests per second |
| `workload.concurrency` | No | — | Target concurrent requests |
| `hardware.gpuSku` | No | auto-detected | GPU SKU (see [SKU Format](#sku-format)) |
| `hardware.vramMb` | No | auto-detected | GPU VRAM in MB |
| `hardware.totalGpus` | No | auto-detected (capped at 32) | Total GPUs available to the deployment |
| `hardware.numGpusPerNode` | No | auto-detected | GPUs per node |
| `hardware.interconnect` | No | auto-detected | Interconnect type |
| `hardware.rdma` | No | auto-detected | Whether RDMA is available |
| `modelCache.pvcName` | No | — | Name of a `ReadWriteMany` PVC containing cached model weights |
| `modelCache.pvcModelPath` | No | — | Path to the model directory inside the PVC |
| `modelCache.pvcMountPath` | No | `/opt/model-cache` | Mount path inside containers |
| `features.planner` | No | disabled | Enable the SLA-aware Planner; the generated DGD includes Planner service/configuration |
| `features.mocker` | No | disabled | Enable mocker mode for testing |
| `overrides.profilingJob` | No | — | `batchv1.JobSpec` overrides for the profiling job (e.g., tolerations) |
| `overrides.dgd` | No | — | Raw DGD override base applied to the generated deployment |
For the complete CRD spec, see the [API Reference](/dynamo/additional-resources/api-reference-k-8-s).
DGDR does not currently expose a `features.kvRouter` field. To configure
router mode or KV-aware routing details, use a direct DGD, a tuned recipe, or
`overrides.dgd` when you still want DGDR to generate the base deployment.
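The defaults and the `sla.e2eLatency` restriction from the field reference can be sketched as a small helper. It covers only a subset of the fields listed above, the function name is hypothetical, and the operator's own defaulting and validation remain authoritative.

```python
import copy

# Defaults taken from the field reference table.
DEFAULTS = {"backend": "auto", "searchStrategy": "rapid", "autoApply": True}
WORKLOAD_DEFAULTS = {"isl": 4000, "osl": 1000}


def effective_spec(spec: dict) -> dict:
    """Return a copy of a DGDR spec with the documented defaults filled in."""
    if "model" not in spec:
        raise ValueError("spec.model is required")
    sla = spec.get("sla") or {}
    if "e2eLatency" in sla and ("ttft" in sla or "itl" in sla):
        raise ValueError("sla.e2eLatency cannot be combined with sla.ttft/sla.itl")
    out = {**DEFAULTS, **copy.deepcopy(spec)}
    out["workload"] = {**WORKLOAD_DEFAULTS, **(out.get("workload") or {})}
    return out


resolved = effective_spec({"model": "Qwen/Qwen3-0.6B", "workload": {"isl": 8000}})
print(resolved["searchStrategy"], resolved["workload"])
```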
### Generated DGD Overrides
Use `spec.overrides.dgd` when the generated `DynamoGraphDeployment` needs a
field that DGDR does not expose directly. The value is a partial
`nvidia.com/v1alpha1` DGD object that is merged into the profiler-generated
deployment after Dynamo selects a configuration.
For example, to inject an environment variable into every generated service:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-sglang
spec:
  model: Qwen/Qwen3-30B-A3B
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1" # dynamo-frontend for Dynamo < 1.1.0
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        envs:
          - name: TRITON_PTXAS_PATH
            value: /usr/local/cuda/bin/ptxas
```
Use `spec.envs` for variables that should apply to all generated services. To
target a single service, override that service's `envs` entry instead:
```yaml
spec:
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        services:
          decode: # replace with the generated service name
            envs:
              - name: CUSTOM_WORKER_ENV
                value: "enabled"
```
`overrides.profilingJob` only customizes the profiling Job. Use
`overrides.dgd` for settings that must appear on the deployed worker pods.
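To build intuition for how a partial override combines with a generated DGD, the sketch below uses a plain recursive dictionary merge. This is an assumption for illustration only: the operator's actual merge semantics (in particular for lists such as `envs`) are not specified in this section, and the `merge` helper and sample data are hypothetical.

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a new dict; base is not modified."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            # Assumption: scalars and lists are replaced wholesale.
            out[key] = value
    return out


# Hypothetical profiler output and a per-service override.
generated = {"spec": {"services": {"decode": {"replicas": 2}}}}
override = {
    "spec": {
        "services": {
            "decode": {"envs": [{"name": "CUSTOM_WORKER_ENV", "value": "enabled"}]}
        }
    }
}
merged = merge(generated, override)
print(merged["spec"]["services"]["decode"])
```

Under this assumption the generated `replicas` value survives and the override only adds `envs`; verify the result on a real deployment before relying on it.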
### SKU Format
When providing hardware configuration manually, use lowercase underscore format:
| Correct | Incorrect |
|---|---|
| `h100_sxm` | `H100-SXM5-80GB` |
| `h200_sxm` | `H200-SXM-141GB` |
| `a100_sxm` | `A100-SXM4-80GB` |
| `a30` | `A30` |
| `l40s` | `L40S` |
All supported values: `gb200_sxm`, `b200_sxm`, `h200_sxm`, `h100_sxm`,
`h100_pcie`, `a100_sxm`, `a100_pcie`, `a30`, `l40s`, `l40`, `l4`,
`v100_sxm`, `v100_pcie`, `t4`, `mi200`, `mi300`.
Not all SKUs are supported by the AIC profiler for `rapid` mode. See
[AIC Support Matrix](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide#aic-support-matrix) for details.
**PCIe variants not yet supported by profiler.** The CRD admits PCIe SKUs
(`h100_pcie`, `a100_pcie`, `v100_pcie`), but the profiler does not currently
ship training data for them. You can submit a DGDR with a PCIe value; the
operator will accept it but profiler-assisted sizing will fall back to
defaults. Profiler support for PCIe SKUs is tracked as an engineering
follow-up.
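A manually supplied `hardware.gpuSku` can be checked against the list above before submitting a DGDR. This is a minimal sketch; the function name and the returned labels are hypothetical, and an `ok` result only means the value is in the supported list, not that the AIC profiler covers it in `rapid` mode.

```python
# Supported values as listed in this section.
SUPPORTED_SKUS = {
    "gb200_sxm", "b200_sxm", "h200_sxm", "h100_sxm", "h100_pcie",
    "a100_sxm", "a100_pcie", "a30", "l40s", "l40", "l4",
    "v100_sxm", "v100_pcie", "t4", "mi200", "mi300",
}


def check_gpu_sku(sku: str) -> str:
    """Classify a gpuSku value using the rules described above."""
    if sku not in SUPPORTED_SKUS:
        return "unsupported"
    if sku.endswith("_pcie"):
        # Accepted by the CRD, but the profiler has no data for PCIe SKUs yet.
        return "accepted-no-profiler-data"
    return "ok"


for candidate in ("h100_sxm", "H100-SXM5-80GB", "a100_pcie"):
    print(candidate, "->", check_gpu_sku(candidate))
```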
## Lifecycle
When you create a DGDR, it progresses through these phases:
| Phase | What is happening |
|---|---|
| `Pending` | Spec validated; operator is discovering GPU hardware and preparing the profiling job |
| `Profiling` | Profiling job running — sub-phases: `Initializing`, `SweepingPrefill`, `SweepingDecode`, `SelectingConfig`, `BuildingCurves`, `GeneratingDGD`, `Done` |
| `Ready` | Profiling complete; optimal config stored in `.status.profilingResults.selectedConfig`. Terminal state when `autoApply: false`. |
| `Deploying` | Creating the `DynamoGraphDeployment` (only when `autoApply: true`) |
| `Deployed` | DGD is running and healthy |
| `Failed` | Unrecoverable error — profiling failures are not retried (`backoffLimit: 0`); check events and conditions for details |
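Scripts that wait on a DGDR need to know when to stop polling. The sketch below encodes the table above, assuming that `Deployed` and `Failed` are terminal and that `Ready` is terminal only when `autoApply` is false; the function name is hypothetical.

```python
def is_terminal(phase: str, auto_apply: bool = True) -> bool:
    """Return True when a DGDR in this phase will not progress further."""
    if phase in ("Deployed", "Failed"):
        return True
    # Ready is a stopping point only when the operator will not deploy the DGD.
    return phase == "Ready" and not auto_apply


for phase in ("Pending", "Profiling", "Ready", "Deploying", "Deployed", "Failed"):
    print(phase, is_terminal(phase, auto_apply=False))
```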
### Conditions
The operator maintains these conditions on the DGDR status:
| Condition | Meaning |
|---|---|
| `Validation` | Spec validation passed or failed |
| `Profiling` | Profiling job is running, succeeded, or failed |
| `SpecGenerated` | Generated DGD spec is available |
| `DeploymentReady` | DGD is deployed and healthy |
| `Succeeded` | Aggregate condition — true when the DGDR has reached its target state |
### Monitoring
```bash
# Watch phase transitions
kubectl get dgdr my-model -n $NAMESPACE -w
# Detailed status, conditions, and events
kubectl describe dgdr my-model -n $NAMESPACE
# Profiling sub-phase
kubectl get dgdr my-model -n $NAMESPACE -o jsonpath='{.status.profilingPhase}'
# Profiling job logs
kubectl get pods -n $NAMESPACE -l nvidia.com/dgdr-name=my-model
kubectl logs -f <pod-name> -n $NAMESPACE
# View generated DGD spec (when autoApply: false)
kubectl get dgdr my-model -n $NAMESPACE \
  -o jsonpath='{.status.profilingResults.selectedConfig}' | python3 -m json.tool
# View Pareto-optimal configs from profiling
kubectl get dgdr my-model -n $NAMESPACE \
  -o jsonpath='{.status.profilingResults.pareto}'
```
### Resource Ownership
- The DGDR does **not** set an owner reference on the DGD it creates. Deleting
a DGDR does not delete the DGD — it persists independently so it can continue
serving traffic.
- The relationship is tracked via labels: `dgdr.nvidia.com/name` and
`dgdr.nvidia.com/namespace`.
- Additional resources (planner ConfigMaps) are created in the same namespace
and labeled with `dgdr.nvidia.com/name`.
## Known Issues
- **`pareto_analysis.py` produces NaN for some configurations.** Tracked as an
engineering follow-up. Workaround: re-run with a narrower sweep; narrow
sweeps bypass the NaN path in practice.
- **PCIe profiler data not yet available.** See the PCIe callout under
[SKU Format](#sku-format).
## Further Reading
- [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide) — DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
- [Profiler Guide](/dynamo/components/profiler/profiler-guide) — Profiling algorithms, picking modes, gate checks
- [Profiler Examples](/dynamo/components/profiler/profiler-examples) — Ready-to-use YAML for SLA targets, private models, MoE, overrides
- [Planner Guide](/dynamo/components/planner/planner-guide) — Scaling modes, PlannerConfig reference
- [API Reference](/dynamo/additional-resources/api-reference-k-8-s) — Complete CRD field specifications
- [Creating Deployments](/dynamo/additional-resources/creating-deployments) — DGD spec for full manual control
# Model Caching
Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports a simple shared-storage path and a ModelExpress path for faster weight distribution across larger clusters.
## Option 1: PVC + Download Job (Recommended)
The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment.
This is the pattern used by all Dynamo recipes today.
### Step 1: Create a Shared PVC
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```
`ReadWriteMany` access mode is required so multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or cloud-provider shared file systems).
### Step 2: Download the Model
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install huggingface_hub hf_transfer
              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
                $MODEL_NAME --revision $MODEL_REVISION
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-0.6B"
            - name: MODEL_REVISION
              value: "main"
            - name: HF_HOME
              value: /cache/huggingface
          envFrom:
            - secretRef:
                name: hf-token-secret
          volumeMounts:
            - name: model-cache
              mountPath: /cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```
### Find the Snapshot Path
After the Job completes, the model is stored in HuggingFace's cache layout:
```
hub/models--<org>--<model>/snapshots/<commit-hash>/
```
For example, `meta-llama/Llama-3.1-70B-Instruct` becomes:
```
hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1/
```
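The mapping from model ID to cache directory is mechanical: each `/` in the model ID becomes `--`. A minimal sketch follows, with a hypothetical helper name; the commit hash still has to be looked up, since it is not derivable from the model ID.

```python
def snapshot_path(model_id: str, commit_hash: str) -> str:
    """Build the snapshot path relative to HF_HOME for a cached model."""
    folder = "models--" + model_id.replace("/", "--")
    return f"hub/{folder}/snapshots/{commit_hash}/"


print(
    snapshot_path(
        "meta-llama/Llama-3.1-70B-Instruct",
        "9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1",
    )
)
```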
To find the exact commit hash after the download Job completes:
```bash
kubectl run find-snapshot --rm -it --image=busybox --restart=Never \
  --overrides='{
    "spec": {
      "volumes": [{"name": "c", "persistentVolumeClaim": {"claimName": "model-cache"}}],
      "containers": [{
        "name": "f", "image": "busybox",
        "command": ["find", "/c/hub", "-mindepth", "3", "-maxdepth", "3", "-type", "d"],
        "volumeMounts": [{"name": "c", "mountPath": "/c"}]
      }]
    }
  }'
```
Alternatively, look up the commit hash on the HuggingFace Hub model page under **Files and versions**.
You need this path for the `pvcModelPath` field in a DGDR spec (see [Deployment Overview — Model Caching](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide#production-detail-model-caching)).
### Step 3: Mount in DynamoGraphDeployment
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  pvcs:
    - create: false
      name: model-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
```
All `VllmWorker` pods that mount `model-cache` now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.
### Compilation Cache
For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:
```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
```
## Option 2: ModelExpress (P2P Distribution)
[ModelExpress](https://github.com/ai-dynamo/modelexpress) is a model weight distribution service that integrates with vLLM's weight loading pipeline. It can publish model weights from one worker and let later workers pull those tensors from GPU memory over NIXL/RDMA instead of repeating a full storage download.
ModelExpress can also use **ModelStreamer** as a loading strategy. ModelStreamer streams safetensors directly from object storage or a local filesystem path into GPU memory through the `runai-model-streamer` package. In that setup, the first worker can stream from storage and then publish ModelExpress metadata so later workers can use the P2P path.
Use this path when startup time or fleet-wide model rollout time matters more than the simplicity of a shared PVC.
### How It Works
1. A ModelExpress server runs in the cluster and stores metadata for available sources.
2. vLLM workers use the ModelExpress loader (`--load-format mx` on newer ModelExpress images, or `mx-source` / `mx-target` on older split-loader images).
3. If a compatible source is already serving the model, a new worker pulls model tensors from that source over NIXL/RDMA.
4. If no source is available, the worker falls back to storage. With a shared filesystem (RWX PVC, NFS, hostPath), the worker reads directly from the server's cache. Without a shared filesystem, set `MODEL_EXPRESS_NO_SHARED_STORAGE=1` so the client streams files from the server over gRPC; see [Streaming Without Shared Storage](#streaming-without-shared-storage) below. When `MX_MODEL_URI` is set, ModelStreamer can stream safetensors from S3, GCS, Azure Blob Storage, or a local path.
5. The Kubernetes operator can inject `MODEL_EXPRESS_URL` into all Dynamo pods from the platform `modelExpressURL` setting.
### What To Configure
| Layer | What to configure | Notes |
|-------|-------------------|-------|
| Runtime image | Include the `modelexpress` Python package and, for ModelStreamer, `runai-model-streamer` plus the object-storage dependencies. | Dynamo or vLLM raises an import error if the worker uses a ModelExpress load format but the package is missing. |
| ModelExpress server | Deploy the server with Redis or Kubernetes CRD metadata backend. | See the [ModelExpress deployment guide](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md). |
| Dynamo platform | Set `dynamo-operator.modelExpressURL`. | The operator injects `MODEL_EXPRESS_URL` into pods. |
| vLLM worker | Set the ModelExpress load format and point at the server. | Newer ModelExpress images use `--load-format mx`; older Dynamo images may use `mx-source` / `mx-target`. |
| ModelStreamer | Set `MX_MODEL_URI` to the storage location. | Supported URI forms include `s3://...`, `gs://...`, `az://...`, an absolute local path, or a Hugging Face model ID resolved from the local cache. |
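The URI forms listed for `MX_MODEL_URI` can be told apart by prefix. The sketch below is illustrative only: the function name and returned labels are hypothetical, and ModelStreamer's own parsing may differ in edge cases.

```python
def classify_model_uri(uri: str) -> str:
    """Guess which storage form an MX_MODEL_URI value refers to."""
    schemes = (("s3://", "s3"), ("gs://", "gcs"), ("az://", "azure-blob"))
    for prefix, backend in schemes:
        if uri.startswith(prefix):
            return backend
    if uri.startswith("/"):
        return "local-path"
    # Anything else is treated as a Hugging Face model ID from the local cache.
    return "hf-model-id"


for value in (
    "s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct",
    "/models/llama",
    "meta-llama/Llama-3.1-70B-Instruct",
):
    print(value, "->", classify_model_uri(value))
```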
### Setup
**Install with Dynamo Platform:**
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```
**Configure workers to use ModelExpress:**
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
          - --model-express-url
          - http://model-express-server.model-express.svc.cluster.local:8080
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
```
When `dynamo-operator.modelExpressURL` is configured, the operator automatically injects `MODEL_EXPRESS_URL` as an environment variable into all component pods. Passing `--model-express-url` explicitly is still useful in examples because the worker validates that a server URL is available when using the older `mx-source` / `mx-target` load formats.
Use the load format supported by your runtime image. ModelExpress v0.3 and newer document the unified `mx` loader. Some Dynamo images still expose the older split `mx-source` and `mx-target` loader names; those require the same server URL but separate source and target roles.
### Streaming Without Shared Storage
If the ModelExpress server's cache is on a non-shared volume (e.g. a `ReadWriteOnce` PVC, a cross-namespace deployment, or any topology where worker pods cannot mount the same filesystem as the server), the default shared-storage mode fails. The server reports the model as downloaded and returns its own local path, but the worker cannot read that path from inside its own pod, so the load silently falls back to a direct HuggingFace download, which defeats the point of running ModelExpress.
Set `MODEL_EXPRESS_NO_SHARED_STORAGE=1` on every worker pod to switch the ModelExpress client into gRPC streaming mode. The server then sends model files to the client over the existing gRPC channel and the worker writes them to its own local cache.
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MODEL_EXPRESS_NO_SHARED_STORAGE
            value: "1"
```
`MODEL_EXPRESS_URL` is injected automatically by the operator (`dynamo-operator.modelExpressURL`); you do not need to set it explicitly here. No volume mount for the ModelExpress cache is required on worker pods in this mode.
Use this path when:
- The server runs with an RWO PVC, or in a different namespace from the workers.
- The cluster has no RDMA / InfiniBand fabric available, so P2P over NIXL is not an option.
- You want ModelExpress to act as a centralized download-and-cache server (one HuggingFace pull, fan out over gRPC to many workers) without standing up object storage and `MX_MODEL_URI`.
Shared-filesystem mode is still faster when available, so prefer an RWX PVC mounted on both the server and the workers when the storage class supports it. See the [ModelExpress storage access modes documentation](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md#storage-access-modes) for the full trade-off and tuning knobs (chunk size, etc.).
### ModelStreamer From Object Storage
Set `MX_MODEL_URI` when the first worker should stream safetensors directly from storage instead of reading a PVC or relying on a prior source worker.
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
          - --model-express-url
          - http://model-express-server.model-express.svc.cluster.local:8080
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MX_MODEL_URI
            value: s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct
          - name: RUNAI_STREAMER_CONCURRENCY
            value: "8"
```
ModelStreamer relies on the underlying cloud SDK credentials:
| Storage backend | `MX_MODEL_URI` example | Credential options |
|-----------------|------------------------|--------------------|
| S3 or S3-compatible storage | `s3://bucket/path/to/model` | IRSA / workload identity, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION`, and optional `AWS_ENDPOINT_URL` |
| Google Cloud Storage | `gs://bucket/path/to/model` | GKE Workload Identity, Application Default Credentials, or `GOOGLE_APPLICATION_CREDENTIALS` |
| Azure Blob Storage | `az://container/path/to/model` | Managed Identity, service principal env vars, or `AZURE_ACCOUNT_NAME` / `AZURE_ACCOUNT_KEY` |
| Local filesystem or PVC | `/models/meta-llama/Llama-3.1-70B-Instruct` | Mount the path into the worker pod |
Credentials are consumed by the storage SDKs in the worker pod. They do not flow through the ModelExpress server.
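As a quick illustration of how the table rows map to `MX_MODEL_URI` values, here is a minimal Python sketch that classifies a URI by its scheme. The function name `storage_backend` and the mapping are illustrative assumptions built from the table above; they are not part of ModelExpress or ModelStreamer.

```python
from urllib.parse import urlparse

# Scheme-to-backend mapping copied from the table above (illustrative only).
BACKENDS = {
    "s3": "S3 or S3-compatible storage",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
}


def storage_backend(model_uri: str) -> str:
    """Return the table row that a given MX_MODEL_URI value falls under."""
    scheme = urlparse(model_uri).scheme
    if scheme == "":
        # No scheme: treat the value as a path mounted into the worker pod.
        return "Local filesystem or PVC"
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported MX_MODEL_URI scheme: {scheme!r}")
    return BACKENDS[scheme]


print(storage_backend("s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct"))
print(storage_backend("/models/meta-llama/Llama-3.1-70B-Instruct"))
```

A check like this can help pick the right credential variables for a deployment before the worker pod starts.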
### Relationship To Shadow Engine Failover
ModelExpress and ModelStreamer are model loading and distribution paths. They are not required for [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover), and enabling them does not create standby engines.
Use Shadow Engine Failover only when you specifically need an active/shadow recovery topology backed by GPU Memory Service (GMS), DRA, and a backend load format such as `--load-format gms`. Keep the ModelExpress / ModelStreamer configuration separate unless you have validated a combined workflow for your runtime image and cluster.
### When to Use ModelExpress
| Scenario | Recommended Approach |
|----------|---------------------|
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | ModelExpress P2P |
| Models already on shared storage (NFS) | PVC |
| Models in S3, GCS, Azure Blob Storage, or local safetensors paths | ModelExpress + ModelStreamer |
| Frequent model updates across fleet | ModelExpress P2P, optionally seeded by ModelStreamer |
| ModelExpress server with non-shared storage (RWO PVC, cross-namespace) | ModelExpress with `MODEL_EXPRESS_NO_SHARED_STORAGE=1` |
## See Also
- [Managing Models with DynamoModel](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model) — declarative model management CRD
- [Detailed Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) — Helm chart configuration including ModelExpress
- [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover) — GMS-backed active/shadow engine recovery, separate from model distribution
- [ModelExpress deployment guide](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md) — server, P2P, and ModelStreamer configuration
- [LoRA Adapters](/dynamo/user-guides/lo-ra-adapters) — dynamic adapter loading (separate from base model caching)
# ModelExpress
ModelExpress is a model weight distribution service for faster worker startup in larger Dynamo clusters. Instead of every worker downloading the full model from storage, one worker can publish model weight availability and later workers can pull compatible tensors from that source over NIXL/RDMA. ModelExpress can also pair with ModelStreamer to stream safetensors directly from object storage into GPU memory.
Use ModelExpress when model rollout time, autoscale cold start, or fleet-wide model updates matter more than the simplicity of a shared PVC. For smaller clusters, start with [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching).
## When to Use It
| Scenario | Recommended path |
| --- | --- |
| Small cluster or first deployment | [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) with PVC + download Job |
| Large cluster with many replicas | ModelExpress P2P distribution |
| Models already on shared storage | PVC or shared filesystem path |
| Models in S3, GCS, Azure Blob Storage, or local safetensors paths | ModelExpress + ModelStreamer |
| Frequent model updates across a fleet | ModelExpress P2P, optionally seeded by ModelStreamer |
| ModelExpress server has non-shared storage | ModelExpress with `MODEL_EXPRESS_NO_SHARED_STORAGE=1` |
## How It Works
1. A ModelExpress server runs in the cluster and stores metadata for available model sources.
2. vLLM workers use the ModelExpress loader (`--load-format mx` on newer images, or `mx-source` / `mx-target` on older split-loader images).
3. If a compatible source worker is already serving the model, a new worker pulls model tensors from that source over NIXL/RDMA.
4. If no source is available, the worker falls back to storage. With ModelStreamer, the first worker can stream safetensors from `s3://`, `gs://`, `az://`, or a local path.
5. The Kubernetes operator can inject `MODEL_EXPRESS_URL` into all Dynamo pods from the platform `modelExpressURL` setting.
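The fallback order in steps 3 and 4 can be sketched as a small decision function. This is a simplified model under stated assumptions, not the ModelExpress client logic: the function name `choose_weight_source` and the returned labels (`"p2p"`, `"modelstreamer"`, `"storage"`) are hypothetical, and the only real name used is the `MX_MODEL_URI` environment variable.

```python
import os
from typing import Mapping, Optional


def choose_weight_source(
    source_worker_available: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick where a new worker would get model weights, following the steps above."""
    env = os.environ if env is None else env
    if source_worker_available:
        # Step 3: a compatible source worker already serves the model.
        return "p2p"
    if env.get("MX_MODEL_URI"):
        # Step 4 with ModelStreamer: stream safetensors from the configured URI.
        return "modelstreamer"
    # Step 4 without ModelStreamer: plain storage fallback.
    return "storage"


print(choose_weight_source(False, {"MX_MODEL_URI": "s3://bucket/path/to/model"}))
```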
## Configure the Platform
Set the ModelExpress server URL when installing the Dynamo platform:
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace ${NAMESPACE} \
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```
If the ModelExpress server is installed separately, point `dynamo-operator.modelExpressURL` at that service. The operator injects the value into worker pods as `MODEL_EXPRESS_URL`.
## Configure vLLM Workers
Use a runtime image that includes the `modelexpress` Python package. For ModelStreamer, the image also needs `runai-model-streamer` and the relevant object-storage SDK dependencies.
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
```
Use the load format supported by your runtime image. ModelExpress v0.3 and newer document the unified `mx` loader. Some older Dynamo images expose `mx-source` and `mx-target` loader names instead.
## Stream Without Shared Storage
If the ModelExpress server cache is on a non-shared volume, workers cannot read the server's local cache path. Set `MODEL_EXPRESS_NO_SHARED_STORAGE=1` on worker pods so the client streams model files from the server over gRPC:
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MODEL_EXPRESS_NO_SHARED_STORAGE
            value: "1"
```
Use this path when the server has an RWO PVC, runs in a different namespace, or the cluster has no RDMA fabric available. Shared-filesystem mode is still faster when available.
## Stream From Object Storage
Set `MX_MODEL_URI` when the first worker should stream safetensors directly from object storage or a local mounted path:
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MX_MODEL_URI
            value: s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct
          - name: RUNAI_STREAMER_CONCURRENCY
            value: "8"
```
| Storage backend | `MX_MODEL_URI` example | Credential options |
| --- | --- | --- |
| S3 or S3-compatible storage | `s3://bucket/path/to/model` | IRSA / workload identity, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION`, optional `AWS_ENDPOINT_URL` |
| Google Cloud Storage | `gs://bucket/path/to/model` | GKE Workload Identity, Application Default Credentials, or `GOOGLE_APPLICATION_CREDENTIALS` |
| Azure Blob Storage | `az://container/path/to/model` | Managed Identity, service principal env vars, or `AZURE_ACCOUNT_NAME` / `AZURE_ACCOUNT_KEY` |
| Local filesystem or PVC | `/models/meta-llama/Llama-3.1-70B-Instruct` | Mount the path into the worker pod |
Credentials are consumed by the storage SDKs in the worker pod. They do not flow through the ModelExpress server.
## See Also
- [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) - simple PVC-based model caching and the longer ModelExpress background.
- [ModelExpress deployment guide](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md) - server, P2P, and ModelStreamer configuration.
- [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) - Dynamo platform install options, including `modelExpressURL`.
# Autoscaling
This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.
## Example DGD
All examples in this guide use the following DGD:
```yaml
# examples/backends/sglang/deploy/agg.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
    decode:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
```
**Key identifiers:**
- **DGD name**: `sglang-agg`
- **Namespace**: `default`
- **Services**: `Frontend`, `decode`
- **dynamo_namespace label**: `default-sglang-agg` (used for metric filtering)
## Overview
Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAdapter` (DGDSA) resource. To have the operator create a DGDSA for a service, follow the Enabling DGDSA for a Service section below. These adapters implement the Kubernetes [Scale subresource](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource), enabling integration with:
| Autoscaler | Description | Best For |
|------------|-------------|----------|
| **KEDA** | Event-driven autoscaling (recommended) | Most use cases |
| **Kubernetes HPA** | Native horizontal pod autoscaling | Simple CPU/memory-based scaling |
| **Dynamo Planner** | LLM-aware autoscaling with SLA optimization | Production LLM workloads |
| **Custom Controllers** | Any scale-subresource-compatible controller | Custom requirements |
> **⚠️ Deprecation Notice**: The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
## Architecture
```
┌──────────────────────────────────┐ ┌─────────────────────────────────────┐
│ DynamoGraphDeployment │ │ Scaling Adapters (auto-created) │
│ "sglang-agg" │ │ (one per service) │
├──────────────────────────────────┤ ├─────────────────────────────────────┤
│ │ │ │
│ spec.services: │ │ ┌─────────────────────────────┐ │ ┌──────────────────┐
│ │ │ │ sglang-agg-frontend │◄───┼──────│ Autoscalers │
│ ┌────────────────────────┐◄───┼──────────┼──│ spec.replicas: 1 │ │ │ │
│ │ Frontend: 1 replica │ │ │ └─────────────────────────────┘ │ │ • KEDA │
│ └────────────────────────┘ │ │ │ │ • HPA │
│ │ │ ┌─────────────────────────────┐ │ │ • Planner │
│ ┌────────────────────────┐◄───┼──────────┼──│ sglang-agg-decode │◄───┼──────│ • Custom │
│ │ decode: 1 replica │ │ │ │ spec.replicas: 1 │ │ │ │
│ └────────────────────────┘ │ │ └─────────────────────────────┘ │ └──────────────────┘
│ │ │ │
└──────────────────────────────────┘ └─────────────────────────────────────┘
```
**How it works:**
1. You deploy a DGD with services (Frontend, decode)
2. The operator creates one DGDSA for each service that has `scalingAdapter.enabled: true`
3. Autoscalers (KEDA, HPA, Planner) target the adapters via `/scale` subresource
4. Adapter controller syncs replica changes to the DGD
5. DGD controller reconciles the underlying pods
## Viewing Scaling Adapters
After deploying the `sglang-agg` DGD with scaling adapters enabled, verify that the operator created the adapters:
```bash
kubectl get dgdsa -n default
# Example output:
# NAME DGD SERVICE REPLICAS AGE
# sglang-agg-frontend sglang-agg Frontend 1 5m
# sglang-agg-decode sglang-agg decode 1 5m
```
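The names used throughout this guide follow two conventions: an adapter name combines the DGD name with the lowercased service name, and the `dynamo_namespace` label combines the Kubernetes namespace with the DGD name. A minimal Python sketch of both, using hypothetical helper names that are not part of any Dynamo API:

```python
def adapter_name(dgd_name: str, service: str) -> str:
    """DGDSA name: DGD name plus the lowercased service name."""
    return f"{dgd_name}-{service.lower()}"


def dynamo_namespace_label(k8s_namespace: str, dgd_name: str) -> str:
    """Value of the dynamo_namespace metric label: {k8s-namespace}-{dgd-name}."""
    return f"{k8s_namespace}-{dgd_name}"


for service in ("Frontend", "decode"):
    print(adapter_name("sglang-agg", service))
print(dynamo_namespace_label("default", "sglang-agg"))
```

These helpers are handy when templating `scaleTargetRef.name` and metric selectors for several DGDs at once.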
## Replica Ownership Model
When DGDSA is enabled, it becomes the **source of truth** for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.
### How It Works
1. **DGDSA owns replicas**: Autoscalers (HPA, KEDA, Planner) update the DGDSA's `spec.replicas`
2. **DGDSA syncs to DGD**: The DGDSA controller writes the replica count to the DGD's service
3. **Direct DGD edits blocked**: A validating webhook prevents users from directly editing `spec.services[X].replicas` in the DGD
4. **Controllers allowed**: Only authorized controllers (operator, Planner) can modify DGD replicas
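The ownership rules above can be modeled as a single validation check. The sketch below is an illustration of the rule, not the operator's webhook code: the `Service` dataclass, the `AUTHORIZED` identities, and `validate_replica_edit` are all hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical identities standing in for "authorized controllers".
AUTHORIZED = {"dynamo-operator", "planner"}


@dataclass
class Service:
    replicas: int
    scaling_adapter_enabled: bool = False


def validate_replica_edit(old: Service, new: Service, requester: str) -> None:
    """Reject direct replica edits on services whose replicas are owned by a DGDSA."""
    if new.replicas == old.replicas:
        return  # Nothing replica-related changed.
    if old.scaling_adapter_enabled and requester not in AUTHORIZED:
        raise PermissionError(
            "replicas cannot be modified directly when the scaling adapter is enabled"
        )


# A controller may change replicas; the call returns without raising.
validate_replica_edit(Service(1, True), Service(3, True), "dynamo-operator")
```

A service without an enabled adapter passes the check for any requester, which matches the default behavior of editing replicas directly in the DGD.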
### Manual Scaling with DGDSA Enabled
When DGDSA is enabled, use `kubectl scale` on the adapter (not the DGD):
```bash
# ✅ Correct - scale via DGDSA
kubectl scale dgdsa sglang-agg-decode --replicas=3
# ❌ Blocked - direct DGD edit rejected by webhook
kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
# use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead
```
## Enabling DGDSA for a Service
By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
spec:
  services:
    Frontend:
      replicas: 2  # ← No DGDSA by default, direct edits allowed
    decode:
      replicas: 1
      scalingAdapter:
        enabled: true  # ← DGDSA created, managed via adapter
```
**When to enable DGDSA:**
- You want to use HPA, KEDA, or Planner for autoscaling
- You want a clear separation between "desired scale" (adapter) and "deployment config" (DGD)
- You want protection against accidental direct replica edits
**When to keep DGDSA disabled (default):**
- You want simple, manual replica management
- You don't need autoscaling for that service
- You prefer direct DGD edits over adapter-based scaling
## Autoscaling with Dynamo Planner
The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.
**When to use Planner:**
- You want LLM-optimized autoscaling out of the box
- You need coordinated scaling across prefill/decode services
- You want SLA-driven scaling (e.g., target TTFT \< 500ms)
**How Planner works:**
Planner is deployed as a service component within your DGD. It:
1. Queries Prometheus for frontend metrics (request rate, latency, etc.)
2. Uses profiling data to predict optimal replica counts
3. Scales prefill/decode workers to meet SLA targets
**Deployment:**
The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](/dynamo/components/planner/planner-guide) for complete instructions.
Example configurations with Planner:
- `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml`
For more details, see the [SLA Planner documentation](/dynamo/components/planner/planner-guide).
## Autoscaling with Kubernetes HPA
The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution.
**When to use HPA:**
- You have simple, predictable scaling requirements
- You want to use standard Kubernetes tooling
- You need CPU or memory-based scaling
For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead - it's simpler to configure.
### Basic HPA (CPU-based)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
```
### HPA with Dynamo Metrics
Dynamo exports several metrics useful for autoscaling. These are available at the `/metrics` endpoint on each frontend pod.
> **See also**: For a complete list of all Dynamo metrics, see the [Metrics Reference](/dynamo/user-guides/observability-local/metrics). For Prometheus and Grafana setup, see the [Prometheus and Grafana Setup Guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup).
#### Available Dynamo Metrics
| Metric | Type | Description | Good for scaling |
|--------|------|-------------|------------------|
| `dynamo_frontend_active_requests` | Gauge | Total concurrent requests from HTTP entry to response complete | ✅ All services |
| `dynamo_frontend_stage_requests{stage,phase}` | Gauge | Requests currently in a given frontend pipeline stage (`preprocess`, `route`, `dispatch`) | ✅ Workers — use `sum(...)` for queue-depth behavior, or `stage="dispatch"` for backend-prefill saturation |
| `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
| `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
| `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
| `dynamo_frontend_inflight_requests` | Gauge | Concurrent requests to engine | ⚠️ **Deprecated** — use `dynamo_frontend_active_requests` |
| `dynamo_frontend_queued_requests` | Gauge | Requests waiting in HTTP queue | ⚠️ **Deprecated** — use `sum(dynamo_frontend_stage_requests)` across `preprocess` + `route` + `dispatch` |
For the full definition of the `stage` and `phase` labels and derived-signal formulas, see [Stage and phase labels](/dynamo/user-guides/observability-local/metrics#stage-and-phase-labels) in the Metrics Reference.
#### Metric Labels
Dynamo metrics include these labels for filtering:
| Label | Description | Example |
|-------|-------------|---------|
| `dynamo_namespace` | Unique DGD identifier (`{k8s-namespace}-{dgd-name}`) | `default-sglang-agg` |
| `model` | Model being served | `Qwen/Qwen3-0.6B` |
When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD.
#### Example: Scale Decode Service Based on TTFT
Using HPA with Prometheus Adapter requires configuring external metrics.
**Step 1: Configure Prometheus Adapter**
Add this to your Helm values file (e.g., `prometheus-adapter-values.yaml`):
```yaml
# prometheus-adapter-values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090
rules:
  external:
    # TTFT p95 from frontend - used to scale decode
    - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "dynamo_ttft_p95_seconds"
      metricsQuery: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
          by (le, namespace, dynamo_namespace)
        )
```
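To make the `histogram_quantile(0.95, ...)` expression less opaque, here is a simplified Python sketch of the bucket interpolation that Prometheus documents for classic histograms. It assumes cumulative `le` buckets sorted by upper bound and ending in `+Inf`; the bucket counts in the example are made-up numbers, not measurements.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up TTFT buckets in seconds: 94 of 100 requests finished within 0.5s.
ttft_buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, ttft_buckets))
```

With these sample buckets the estimate lands between 0.5s and 1.0s, so a 500ms target would be exceeded even though most requests are fast.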
**Step 2: Install Prometheus Adapter**
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
-n monitoring --create-namespace \
-f prometheus-adapter-values.yaml
```
**Step 3: Verify the metric is available**
```bash
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/dynamo_ttft_p95_seconds" | jq
```
**Step 4: Create the HPA**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode  # ← DGD name + service name (lowercase)
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_ttft_p95_seconds
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"  # ← {namespace}-{dgd-name}
        target:
          type: Value
          value: "500m"  # Scale up when TTFT p95 > 500ms
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60  # Wait 1 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Pods
          value: 2
          periodSeconds: 30
```
**How it works:**
1. Frontend pods export `dynamo_frontend_time_to_first_token_seconds` histogram
2. Prometheus Adapter calculates p95 TTFT per `dynamo_namespace`
3. HPA monitors this metric filtered by `dynamo_namespace: "default-sglang-agg"`
4. When TTFT p95 > 500ms, HPA scales up the `sglang-agg-decode` adapter
5. Adapter controller syncs the replica count to the DGD's `decode` service
6. More decode workers are created, reducing TTFT
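Step 4 relies on the replica calculation that the Kubernetes HPA documentation describes for a `Value` target: the desired count is the current count scaled by the ratio of the observed metric to the target. The sketch below is a simplified model of that formula. It ignores the `behavior` policies and stabilization windows, and the function name and the default 10% tolerance parameter are written here as assumptions for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    metric_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Approximate HPA replica count for an External metric with a Value target."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # Close enough to target: no change.
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# TTFT p95 of 0.8s against a 0.5s target with 2 decode replicas.
print(desired_replicas(2, 0.8, 0.5, min_replicas=1, max_replicas=10))
```

In the HPA above, the `scaleUp` policy of 2 pods per 30 seconds would still limit how quickly the adapter reaches the computed count.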
#### Example: Scale Based on Queue Depth
"Queue depth" here means the number of requests that have entered the frontend but haven't yet received a first token — i.e. the sum of `dynamo_frontend_stage_requests` across the `preprocess`, `route`, and `dispatch` stages. This replaces the deprecated `dynamo_frontend_queued_requests` gauge.
Add this rule to your `prometheus-adapter-values.yaml` (alongside the TTFT rule):
```yaml
# Add to rules.external in prometheus-adapter-values.yaml
- seriesQuery: 'dynamo_frontend_stage_requests{namespace!="",stage=~"preprocess|route|dispatch"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "dynamo_frontend_pending_requests"
  metricsQuery: |
    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)
```
Then create the HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_frontend_pending_requests
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"
        target:
          type: Value
          value: "10"  # Scale up when queue > 10 requests
```
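The queue-depth signal used in this example can also be checked by hand against a frontend pod's `/metrics` output. The following Python sketch sums `dynamo_frontend_stage_requests` across the `preprocess`, `route`, and `dispatch` stages from Prometheus text-format lines. The parser is deliberately minimal and the sample values are made up for illustration.

```python
import re

QUEUE_STAGES = {"preprocess", "route", "dispatch"}
# Matches lines such as: dynamo_frontend_stage_requests{stage="route",phase=""} 1
SAMPLE = re.compile(r"^dynamo_frontend_stage_requests\{([^}]*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def queue_depth(metrics_text: str) -> float:
    """Sum stage_requests over the stages that precede the first token."""
    total = 0.0
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # Other metrics, comments, or blank lines.
        labels = dict(LABEL.findall(match.group(1)))
        if labels.get("stage") in QUEUE_STAGES:
            total += float(match.group(2))
    return total


sample = """\
dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 1
dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2
"""
print(queue_depth(sample))  # 3.0
```

Comparing this number with the value the HPA reports is a quick way to confirm that the adapter rule aggregates the stages as intended.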
## Autoscaling with KEDA (Recommended)
KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.
**Advantages over HPA + Prometheus Adapter:**
- No Prometheus Adapter configuration needed
- PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
- Easy to update - just `kubectl apply` the ScaledObject
- Can scale to zero when idle
- Supports multiple triggers per object
**When to use KEDA:**
- You want simpler configuration (no Prometheus Adapter to manage)
- You need event-driven scaling (e.g., queue depth, Kafka, etc.)
- You want to scale to zero when idle
### Installing KEDA
```bash
# Add KEDA Helm repo
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
# Install KEDA
helm install keda kedacore/keda \
--namespace keda \
--create-namespace
# Verify installation
kubectl get pods -n keda
```
If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.
### Example: Scale Decode Based on TTFT
Using the `sglang-agg` DGD from `examples/backends/sglang/deploy/agg.yaml`:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15  # Check metrics every 15 seconds
  cooldownPeriod: 60   # Wait 60s before scaling down
  triggers:
    - type: prometheus
      metadata:
        # Update this URL to match your Prometheus service
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        metricName: dynamo_ttft_p95
        query: |
          histogram_quantile(0.95,
            sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
            by (le)
          )
        threshold: "0.5"  # Scale up when TTFT p95 > 500ms (0.5 seconds)
        activationThreshold: "0.1"  # Start scaling when TTFT > 100ms
```
Apply it:
```bash
kubectl apply -f sglang-agg-decode-scaler.yaml
```
### Verify KEDA Scaling
```bash
# Check ScaledObject status
kubectl get scaledobject -n default
# KEDA creates an HPA under the hood - you can see it
kubectl get hpa -n default
# Example output:
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# keda-hpa-sglang-agg-decode-scaler DynamoGraphDeploymentScalingAdapter/sglang-agg-decode 45m/500m 1 10 1
# Get detailed status
kubectl describe scaledobject sglang-agg-decode-scaler -n default
```
### Example: Scale Based on Queue Depth
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-queue-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        metricName: dynamo_frontend_pending_requests
        query: |
          sum(dynamo_frontend_stage_requests{dynamo_namespace="default-sglang-agg",stage=~"preprocess|route|dispatch"})
        threshold: "10"  # Scale up when queue > 10 requests
```
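For intuition about how the `threshold` translates into a replica count, the sketch below assumes the trigger keeps KEDA's default `metricType` of `AverageValue`, in which case the generated HPA treats the threshold as a per-replica target and divides the total metric by it. This is a simplified model with a hypothetical function name; it ignores `activationThreshold`, polling, and HPA behavior settings.

```python
import math


def keda_desired_replicas(
    metric_value: float,
    threshold: float,
    current_replicas: int,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Approximate replica count for an AverageValue target (threshold per replica)."""
    # Ratio of the per-replica load to the per-replica target.
    per_replica_ratio = metric_value / (threshold * current_replicas)
    if abs(per_replica_ratio - 1.0) <= tolerance:
        desired = current_replicas  # Within tolerance: keep the current count.
    else:
        desired = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, desired))


# 35 pending requests, threshold 10, currently 1 decode replica.
print(keda_desired_replicas(35, 10, current_replicas=1, min_replicas=1, max_replicas=10))
```

Under this assumption a queue of 35 requests with a threshold of 10 asks for 4 replicas, which differs from the `Value` target used in the HPA examples where the result also depends on the current replica count.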
### How KEDA Works
KEDA creates and manages an HPA under the hood:
```
┌──────────────────────────────────────────────────────────────────────┐
│ You create: ScaledObject │
│ - scaleTargetRef: sglang-agg-decode │
│ - triggers: prometheus query │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ KEDA Operator automatically creates: HPA │
│ - name: keda-hpa-sglang-agg-decode-scaler │
│ - scaleTargetRef: sglang-agg-decode │
│ - metrics: External (from KEDA metrics server) │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ DynamoGraphDeploymentScalingAdapter: sglang-agg-decode │
│ - spec.replicas: updated by HPA │
└──────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ DynamoGraphDeployment: sglang-agg │
│ - spec.services.decode.replicas: synced from adapter │
└──────────────────────────────────────────────────────────────────────┘
```
## Mixed Autoscaling
For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:
```yaml
---
# HPA for Frontend (CPU-based)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# KEDA for Decode (TTFT-based)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
            by (le)
          )
        threshold: "0.5"
```
## Manual Scaling
### With DGDSA Enabled
When DGDSA is enabled, scale via the adapter:
```bash
kubectl scale dgdsa sglang-agg-decode -n default --replicas=3
```
Verify the scaling:
```bash
kubectl get dgdsa sglang-agg-decode -n default
# Output:
# NAME DGD SERVICE REPLICAS AGE
# sglang-agg-decode sglang-agg decode 3 10m
```
If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.
### With DGDSA Disabled (default)
If the scaling adapter is not enabled for a service (the default), edit the DGD directly:
```bash
kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
```
Or edit the YAML (no `scalingAdapter.enabled: true` means direct edits are allowed):
```yaml
spec:
  services:
    decode:
      replicas: 3
      # No scalingAdapter.enabled means replicas can be edited directly
```
## Best Practices
### 1. Choose One Autoscaler Per Service
Avoid configuring multiple autoscalers for the same service:
| Configuration | Status |
|---------------|--------|
| HPA for frontend, Planner for prefill/decode | ✅ Good |
| KEDA for all services | ✅ Good |
| Planner only (default) | ✅ Good |
| HPA + Planner both targeting decode | ❌ Bad - they will fight |
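A quick way to audit a cluster against this table is to list which autoscalers target which adapter and flag any adapter with more than one. The Python sketch below assumes you have already collected `(autoscaler, adapter)` pairs, for example from `scaleTargetRef` fields; the function name and the input format are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_conflicts(bindings: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return adapters that are targeted by more than one autoscaler."""
    by_adapter = defaultdict(set)
    for autoscaler, adapter in bindings:
        by_adapter[adapter].add(autoscaler)
    return {
        adapter: sorted(autoscalers)
        for adapter, autoscalers in by_adapter.items()
        if len(autoscalers) > 1
    }


bindings = [
    ("hpa", "sglang-agg-frontend"),
    ("hpa", "sglang-agg-decode"),
    ("planner", "sglang-agg-decode"),
]
print(find_conflicts(bindings))
```

An empty result means each adapter has at most one autoscaler, which corresponds to the configurations marked as good above.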
### 2. Use Appropriate Metrics
| Service Type | Recommended Metrics | Dynamo Metric |
|--------------|---------------------|---------------|
| Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
| Prefill | Dispatch-stage depth (backend prefill saturation), TTFT | `dynamo_frontend_stage_requests{stage="dispatch"}`, `dynamo_frontend_time_to_first_token_seconds` |
| Decode | ITL, active concurrency | `dynamo_frontend_inter_token_latency_seconds`, `dynamo_frontend_active_requests` |
### 3. Configure Stabilization Windows
Prevent thrashing with appropriate stabilization:
```yaml
# HPA
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
  scaleUp:
    stabilizationWindowSeconds: 0  # Scale up immediately

# KEDA
spec:
  cooldownPeriod: 300
```
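The effect of a scale-down stabilization window can be modeled in a few lines: the HPA remembers recent recommendations and, when scaling down, acts on the highest one seen inside the window. The Python sketch below is a simplified model of that documented behavior with a hypothetical class name; it does not cover scale-up windows or rate policies.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keep recent recommendations and return the highest one in the window."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.history = deque()  # Entries are (timestamp, recommended_replicas).

    def recommend(self, now: float, desired: int) -> int:
        self.history.append((now, desired))
        # Drop recommendations that are older than the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(0, 5))    # Load spike: 5 replicas.
print(stabilizer.recommend(60, 2))   # Load drops, but the spike is still in the window.
print(stabilizer.recommend(301, 2))  # The spike has aged out of the window.
```

A longer window keeps replicas high for longer after a spike, which trades some cost for fewer scale-down and scale-up cycles.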
### 4. Set Sensible Min/Max Replicas
Always configure minimum and maximum replicas in your HPA/KEDA to prevent:
- Scaling to zero (unless intentional)
- Unbounded scaling that exhausts cluster resources
## Troubleshooting
### Adapters Not Created
```bash
# Check DGD status
kubectl describe dgd sglang-agg -n default
# Check operator logs
kubectl logs -n dynamo-system deployment/dynamo-operator
```
### Scaling Not Working
```bash
# Check adapter status
kubectl describe dgdsa sglang-agg-decode -n default
# Check HPA/KEDA status
kubectl describe hpa sglang-agg-decode-hpa -n default
kubectl describe scaledobject sglang-agg-decode-scaler -n default
# Verify metrics are available in Kubernetes metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
```
### Metrics Not Available
If HPA/KEDA shows `<unknown>` for metrics:
```bash
# Check if Dynamo metrics are being scraped
kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
curl http://localhost:8000/metrics | grep dynamo_frontend
# Example output (note: stage_requests has no `model` label — it's per frontend pod):
# dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
# dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
# dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 0
# dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2
# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2 # deprecated
# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5 # deprecated
# Verify Prometheus is scraping the metrics
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then query: dynamo_frontend_time_to_first_token_seconds_bucket
# Check KEDA operator logs
kubectl logs -n keda deployment/keda-operator
```
### Rapid Scaling Up and Down
If you see unstable scaling:
1. Check if multiple autoscalers are targeting the same adapter
2. Increase `cooldownPeriod` in KEDA ScaledObject
3. Increase `stabilizationWindowSeconds` in HPA behavior
## References
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/)
- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
- [Planner Documentation](/dynamo/components/planner/planner-guide)
- [Dynamo Metrics Reference](/dynamo/user-guides/observability-local/metrics)
- [Prometheus and Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup)
# Rolling Updates
This guide covers how rolling updates work for `DynamoGraphDeployment` (DGD) resources. Rolling updates allow you to update worker configurations (images, resources, environment variables, etc.) with minimal downtime by gradually replacing old pods with new ones.
The behavior of rolling updates depends on the backing resource type of your deployment. DGDs backed by Kubernetes Deployments benefit from **managed rolling updates** with namespace isolation, while Grove and LWS-backed deployments use their native update mechanisms.
## Example
Consider a disaggregated deployment with separate prefill and decode workers. You want to update the tensor parallelism of the decode worker to 2.
**Before** — original deployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - decode
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - prefill
```
**After** — updated with parallelism tuning:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - decode
            - --tensor-parallel-size
            - "2"
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - prefill
```
Apply the update:
```bash
kubectl apply -f vllm-disagg.yaml
```
Monitor rolling update progress:
```bash
kubectl get dgd vllm-disagg -n dynamo -o jsonpath='{.status.rollingUpdate}'
```
## Default Behavior (Grove and LWS)
For DGDs backed by **Grove** (PodCliques, PodCliqueSets) or **LWS** (LeaderWorkerSets), the operator does not manage rolling updates directly. Instead, these deployments rely on the native rolling update mechanisms of their underlying resources.
### What Happens
- A modification to the pod spec of a service triggers the rolling update behavior of the backing resource. In the example above, the modification to the pod spec of the decode worker triggers the rolling update of just the decode worker.
- For Grove, PodCliques (PCLQ) and PodCliqueScalingGroups use a static rolling update strategy of `maxUnavailable: 1` and `maxSurge: 0`. LWS follows the same `maxUnavailable: 1` and `maxSurge: 0` strategy.
- **Old and new workers operate within the same Dynamo namespace.** This means old and new workers can discover each other through service discovery.
The following diagram illustrates the rolling update of the decode worker in a Grove PodCliqueSet (PCS). Only the decode PodClique is updated — the frontend and prefill PodCliques are unaffected:
```
┌─ PodCliqueSet: vllm-disagg ───────────────────────────────────────────────────────┐
│ │
│ ┌─ PCLQ: Frontend ──────┐ ┌─ PCLQ: VllmPrefillWorker ─┐ │
│ │ │ │ │ │
│ │ ┌──────────────────┐ │ │ ┌──────────────────────┐ │ │
│ │ │ Pod (v1) ✓ │ │ │ │ Pod (v1) ✓ │ │ No changes — │
│ │ └──────────────────┘ │ │ └──────────────────────┘ │ not rolling │
│ │ │ │ │ │
│ └────────────────────────┘ └────────────────────────────┘ │
│ │
│ ┌─ PCLQ: VllmDecodeWorker ──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ maxUnavailable: 1, maxSurge: 0 │ │
│ │ │ │
│ │ ┌──────────────────────┐ ┌──────────────────────┐ │ │
│ │ │ Pod (v2) ✓ NEW │ │ Pod (v1) Terminating │ ← rolling one at a time │ │
│ │ └──────────────────────┘ └──────────────────────┘ │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Dynamo Namespace: vllm-disagg │ │
│ │ │ │
│ │ All v1 and v2 pods registered │ │
│ │ and discoverable by each other │ │
│ └──────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────────────────┘
```
### Implications for Disaggregated Deployments
Because old and new workers share the same Dynamo namespace, they are grouped together by the router. In a disaggregated setup, this can lead to cross-generation communication — for example, the router might send a request from a newly deployed prefill worker to an old decode worker (or vice versa). If the old and new versions are incompatible, this can result in errors.
For Grove and LWS deployments with disaggregated prefill/decode workers, be aware that during a rolling update, new workers may communicate with old workers. Ensure that your worker versions are backward-compatible, or consider using Deployment-backed DGDs which provide namespace isolation during updates.
Managed rolling updates with namespace isolation are planned for Grove and LWS-backed deployments in a future release. See [Future Work](#future-work) for details.
## Managed Rolling Updates (Deployments)
For DGDs backed by Kubernetes **Deployments** (single-node, non-multinode services), the Dynamo operator implements managed rolling updates with namespace isolation. This is tracked in the DGD status and provides stronger guarantees for disaggregated deployments.
### How It Works
1. **Spec change detection** — The operator computes a hash of all worker service specs (prefill, decode, and worker component types). When this hash changes, a rolling update is triggered.
2. **Namespace isolation** — New worker `DynamoComponentDeployments` (DCDs) are created with the spec hash appended to their Dynamo namespace. This means new workers register in a different Dynamo namespace than old workers, preventing cross-generation discovery. A new prefill worker will only discover and route to new decode workers, avoiding compatibility issues.
3. **Gradual replacement** — The operator gradually scales up new worker DCDs and scales down old ones, respecting `maxSurge` and `maxUnavailable` constraints. When a worker service is updated (all new replicas are ready, all old replicas are terminated), it is marked as completed.
4. **Cleanup** — Once all worker services have completed the transition, old worker DCDs are deleted and the rolling update is marked as completed.
```
┌─ DynamoGraphDeployment: vllm-disagg ──────────────────────────────────────────────┐
│ │
│ ┌─ DCD: Frontend ──────────┐ │
│ │ │ │
│ │ ┌────────────────────┐ │ No changes — │
│ │ │ Pod (v1) ✓ │ │ not a worker component │
│ │ └────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────┘ │
│ │
│ ┌─ OLD DCDs (hash: a1b2c3d4) ──────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌─ DCD: VllmDecodeWorker-a1b2c3d4 ──┐ ┌─ DCD: VllmPrefillWorker-a1b2c3d4 ┐│ │
│ │ │ │ │ ││ │
│ │ │ ┌──────────────────────┐ │ │ ┌─────────────────────┐ ││ │
│ │ │ │ Pod (v1) Terminating │ │ │ │ Pod (v1) Terminating│ ││ │
│ │ │ └──────────────────────┘ │ │ └─────────────────────┘ ││ │
│ │ │ │ │ ││ │
│ │ │ Dynamo Namespace: vllm-disagg │ │ Dynamo Namespace: vllm-disagg ││ │
│ │ │ -a1b2c3d4 │ │ -a1b2c3d4 ││ │
│ │ └────────────────────────────────────┘ └───────────────────────────────────┘│ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─ NEW DCDs (hash: f5e6d7c8) ──────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌─ DCD: VllmDecodeWorker-f5e6d7c8 ──┐ ┌─ DCD: VllmPrefillWorker-f5e6d7c8 ┐│ │
│ │ │ │ │ ││ │
│ │ │ ┌──────────────────────┐ │ │ ┌─────────────────────┐ ││ │
│ │ │ │ Pod (v2) ✓ NEW │ │ │ │ Pod (v2) ✓ NEW │ ││ │
│ │ │ └──────────────────────┘ │ │ └─────────────────────┘ ││ │
│ │ │ │ │ ││ │
│ │ │ Dynamo Namespace: vllm-disagg │ │ Dynamo Namespace: vllm-disagg ││ │
│ │ │ -f5e6d7c8 │ │ -f5e6d7c8 ││ │
│ │ └────────────────────────────────────┘ └───────────────────────────────────┘│ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Old and new workers are in different Dynamo namespaces — │
│ new prefill only discovers new decode, preventing cross-generation routing. │
│ │
└────────────────────────────────────────────────────────────────────────────────────┘
```
Only worker component types (`worker`, `prefill`, `decode`) participate in managed rolling updates. Non-worker components like `frontend` are updated in-place without namespace isolation.
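The gradual replacement step can be sketched as a small simulation. This is a simplified model, not the operator's reconcile loop: it assumes new pods become ready as soon as they are created and that the old generation starts at the desired replica count. The function name and the `(old, new)` step format are assumptions for illustration.

```python
def simulate_rollout(desired, max_surge, max_unavailable):
    """Return the (old_replicas, new_replicas) state after each step."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    old, new = desired, 0
    steps = []
    while old > 0 or new < desired:
        # Scale up: total pods may not exceed desired + max_surge.
        create = min(desired - new, desired + max_surge - (old + new))
        new += max(0, create)
        # Scale down: total pods may not drop below desired - max_unavailable.
        remove = min(old, (old + new) - (desired - max_unavailable))
        old -= max(0, remove)
        steps.append((old, new))
    return steps


# Surge-based rollout: capacity never drops below 4 replicas.
print(simulate_rollout(desired=4, max_surge=1, max_unavailable=0))
# No surge: capacity temporarily drops to 6 of 8 replicas.
print(simulate_rollout(desired=8, max_surge=0, max_unavailable=2))
```

In the first case every step keeps four pods serving; in the second, no extra pods are ever created, at the cost of running two replicas short during the transition.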
### Rolling Update Phases
The rolling update progress is tracked in `.status.rollingUpdate` with the following phases:
| Phase | Description |
|-------|-------------|
| `Pending` | A spec change was detected and the rolling update has been initialized. |
| `InProgress` | New worker DCDs are being scaled up and old ones are being scaled down. |
| `Completed` | All worker services have transitioned to new replicas. Old DCDs have been cleaned up. |
The status also tracks:
- `startTime` — When the rolling update began.
- `endTime` — When the rolling update completed.
- `updatedServices` — List of worker services that have completed the transition.
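A script that waits on a rollout could summarize this status as follows. The sketch assumes the `kubectl ... -o jsonpath='{.status.rollingUpdate}'` output is a JSON object and that the phase is stored under a key named `phase`; that key name and the `rollout_summary` helper are assumptions, so check them against your cluster's actual output.

```python
import json


def rollout_summary(status_json: str) -> str:
    """Summarize a `.status.rollingUpdate` JSON document in one line."""
    status = json.loads(status_json) if status_json.strip() else {}
    phase = status.get("phase", "Unknown")  # key name is an assumption
    updated = status.get("updatedServices", [])
    if phase == "Completed":
        return f"Completed at {status.get('endTime', 'unknown time')}"
    return f"{phase}: {len(updated)} service(s) updated so far"


sample = json.dumps({
    "phase": "InProgress",
    "startTime": "2025-01-01T00:00:00Z",
    "updatedServices": ["VllmPrefillWorker"],
})
print(rollout_summary(sample))
```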
### Configuring maxSurge and maxUnavailable
You can configure the rolling update strategy per service using annotations:
| Annotation | Description | Default |
|------------|-------------|---------|
| `nvidia.com/deployment-rolling-update-max-surge` | Maximum number of extra pods that can be created above the desired count during the update. | `25%` |
| `nvidia.com/deployment-rolling-update-max-unavailable` | Maximum number of pods that can be unavailable during the update. | `25%` |
Values can be absolute integers (e.g., `"1"`, `"2"`) or percentages (e.g., `"25%"`, `"50%"`). Percentages are resolved against the desired replica count — rounding up for `maxSurge` and rounding down for `maxUnavailable`. The operator ensures at least one of `maxSurge` or `maxUnavailable` is greater than zero to guarantee forward progress.
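The rounding rules can be made concrete with a short sketch. The helper names are hypothetical. The fallback applied when both values resolve to zero mirrors what upstream Kubernetes Deployments do (one unavailable pod) and is an assumption here; the operator's exact handling may differ.

```python
def resolve(value: str, desired: int, round_up: bool) -> int:
    """Resolve an absolute or percentage value against the replica count."""
    value = value.strip()
    if value.endswith("%"):
        scaled = int(value[:-1]) * desired
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def resolve_strategy(max_surge: str = "25%", max_unavailable: str = "25%",
                     desired: int = 1) -> tuple[int, int]:
    surge = resolve(max_surge, desired, round_up=True)
    unavailable = resolve(max_unavailable, desired, round_up=False)
    if surge == 0 and unavailable == 0:
        # Assumption: fall back to one unavailable pod to keep making progress.
        unavailable = 1
    return surge, unavailable


print(resolve_strategy(desired=10))  # (3, 2): 2.5 rounds up for surge, down for unavailable
print(resolve_strategy(desired=1))   # (1, 0)
```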
**Example** — zero-downtime update with surge capacity:
```yaml
VllmPrefillWorker:
  componentType: worker
  subComponentType: prefill
  replicas: 4
  annotations:
    nvidia.com/deployment-rolling-update-max-surge: "1"
    nvidia.com/deployment-rolling-update-max-unavailable: "0"
```
This ensures that all 4 existing prefill replicas remain available while 1 new replica is brought up at a time.
**Example** — fast update allowing temporary capacity reduction:
```yaml
VllmDecodeWorker:
  componentType: worker
  subComponentType: decode
  replicas: 8
  annotations:
    nvidia.com/deployment-rolling-update-max-surge: "0"
    nvidia.com/deployment-rolling-update-max-unavailable: "2"
```
This avoids creating extra pods but allows up to 2 decode replicas to be unavailable at a time, speeding up the transition.
### Worker Hash and DCD Naming
Worker DCDs always include a hash suffix derived from the worker specs: `{dgd-name}-{service-name}-{hash}` (e.g., `vllm-disagg-vllmdecodeworker-a1b2c3d4`). During a rolling update, the new worker DCDs are created with the new spec hash while the old DCDs retain the previous hash, allowing both generations to coexist:
- **Old worker DCD:** `vllm-disagg-vllmdecodeworker-a1b2c3d4` (previous hash)
- **New worker DCD:** `vllm-disagg-vllmdecodeworker-f5e6d7c8` (new hash)
The hash is computed from a SHA-256 digest of all worker service specs (excluding non-pod-template fields like `replicas`, `autoscaling`, and `ingress`). This means:
- Scaling changes (replica count) do **not** trigger a rolling update.
- Pod template changes (image, resources, env vars, volumes, etc.) **do** trigger a rolling update.
- The hash covers **all** worker services together — changing any single worker's spec triggers a rolling update for all workers.
The current worker hash is stored as the annotation `nvidia.com/current-worker-hash` on the DGD resource, and individual worker DCDs are labeled with `nvidia.com/dynamo-worker-hash` for filtering.
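The hashing and naming rules above can be illustrated with a small sketch. It is not the operator's implementation: the canonical JSON serialization, the 8-character truncation (inferred from examples such as `a1b2c3d4`), and the helper names are all assumptions. Only the use of SHA-256, the excluded fields, and the name pattern come from the description above.

```python
import hashlib
import json

EXCLUDED_FIELDS = {"replicas", "autoscaling", "ingress"}


def worker_hash(worker_specs: dict[str, dict]) -> str:
    """Hash all worker specs together, ignoring non-pod-template fields."""
    filtered = {
        name: {k: v for k, v in spec.items() if k not in EXCLUDED_FIELDS}
        for name, spec in worker_specs.items()
    }
    canonical = json.dumps(filtered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def dcd_name(dgd_name: str, service_name: str, digest: str) -> str:
    """Build `{dgd-name}-{service-name}-{hash}` with a lowercased service name."""
    return f"{dgd_name}-{service_name.lower()}-{digest}"


specs = {
    "VllmDecodeWorker": {"replicas": 1, "args": ["--disaggregation-mode", "decode"]},
    "VllmPrefillWorker": {"replicas": 1, "args": ["--disaggregation-mode", "prefill"]},
}
scaled = {**specs, "VllmDecodeWorker": {**specs["VllmDecodeWorker"], "replicas": 4}}
print(worker_hash(specs) == worker_hash(scaled))  # True: scaling keeps the hash
print(dcd_name("vllm-disagg", "VllmDecodeWorker", worker_hash(specs)))
```

Because the hash covers all worker specs together, editing the arguments of either worker in `specs` changes the digest used for every worker DCD name.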
### Status During Rolling Updates
During a rolling update, the DGD status aggregates information from both old and new worker DCDs:
- **Replicas** — Total count across old and new.
- **ReadyReplicas** — Aggregate ready count across old and new.
- **UpdatedReplicas** — Only new worker replicas.
This provides a holistic view of the deployment's health during the transition.
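As a minimal sketch of that aggregation, assuming each generation reports a replica count and a ready count (the `DcdStatus` type and `aggregate` helper are hypothetical names for this example):

```python
from dataclasses import dataclass


@dataclass
class DcdStatus:
    replicas: int
    ready_replicas: int


def aggregate(old: DcdStatus, new: DcdStatus) -> dict[str, int]:
    """Combine old and new worker DCD status into DGD-level counts."""
    return {
        "replicas": old.replicas + new.replicas,
        "readyReplicas": old.ready_replicas + new.ready_replicas,
        "updatedReplicas": new.replicas,  # only the new generation counts
    }


# Mid-rollout: 3 old pods still serving, 2 new pods created, 1 of them ready.
print(aggregate(DcdStatus(3, 3), DcdStatus(2, 1)))
```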
## Comparison
| Aspect | Grove / LWS | Deployments (Managed) |
|--------|-------------|----------------------|
| Update mechanism | Native resource rolling update | Operator-managed with DCD lifecycle |
| Namespace isolation | No — old and new share the same namespace | Yes — hash-based namespace separation |
| Cross-generation discovery | Possible — old and new workers can see each other | Prevented — new workers only discover new workers |
| maxSurge / maxUnavailable | Fixed (`maxUnavailable: 1`, `maxSurge: 0` for Grove and LWS) | Configurable per service via annotations |
| Status tracking | Native resource status | DGD `.status.rollingUpdate` with phase and per-service tracking |
| Multinode support | Yes | No (single-node only) |
## Future Work
The following enhancements are planned for future releases:
- **Managed rolling updates for Grove and LWS** — Extending managed rolling updates with namespace isolation to Grove and LWS-backed deployments, providing the same cross-generation discovery protection that Deployment-backed DGDs have today.
- **Coordinated worker updates** — Currently, prefill and decode workers are updated independently, which can result in an imbalance between old and new sets during the transition. Future releases will coordinate the rollout across worker types.
- **Partitioned rollouts** — The ability to roll out updates to a percentage of workers (e.g., 30%), pause, observe metrics, and then continue. This enables canary-style deployments for safer rollouts.
- **DGD-level rolling update configuration** — The ability to configure `maxSurge` and `maxUnavailable` at the DGD API level, regardless of the backing resource type.
# Disaggregated Inference Communication Guide
This guide explains how prefill and decode workers communicate in Dynamo's disaggregated inference architecture on Kubernetes. It answers the frequently asked question: **Why can't prefill and decode workers use NVLink to communicate on the same node?**
## Summary
- **NVLink cannot be used between Kubernetes pods** due to process isolation and GPU partitioning
- **RDMA (InfiniBand, RoCE, or AWS EFA) is required** for production disaggregated deployments
- **Without RDMA, expect 200-500x performance degradation** in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA
- **UCX or libfabric** are the communication layers that NIXL uses to transfer KV cache between workers
- **Topology-aware KV transfer** can constrain or bias decode routing so KV transfers stay within a selected topology domain such as zone or rack. See [Topology-Aware KV Transfer](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer).
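The 200-500x figure follows directly from the observed TTFT values quoted above, as this short calculation shows. The function name is chosen for the example; the inputs are the numbers from the summary.

```python
def slowdown_factor(tcp_ttft_ms: int, rdma_ttft_ms: int) -> float:
    """How many times slower TTFT is over TCP than over RDMA."""
    return tcp_ttft_ms / rdma_ttft_ms


tcp_ttft_ms = 98_000  # ~98 s observed with TCP
for rdma_ttft_ms in (500, 200):  # ~200-500 ms observed with RDMA
    print(f"{rdma_ttft_ms} ms RDMA TTFT -> {slowdown_factor(tcp_ttft_ms, rdma_ttft_ms):.0f}x slower")
```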
---
## Architecture Overview
### Communication Stack
### Component Responsibilities
| Component | Role | Location |
|-----------|------|----------|
| **NIXL** | High-level KV cache transfer API | Dynamo runtime library |
| **UCX or libfabric** | Low-level communication framework | System library |
| **Transports** | Physical data movement | Hardware/kernel drivers |
---
## Why NVLink Cannot Be Used Between Pods
### The Fundamental Constraint
NVLink is a **direct GPU-to-GPU interconnect** that operates at the hardware level. It requires:
1. **Same process** - Both GPUs must be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called
2. **Direct memory access** - Process must have permission to access both GPU memory regions
3. **Peer-to-peer mapping** - CUDA runtime must establish memory mappings between GPUs
**Kubernetes pods violate all three requirements.**
### Technical Explanation
1. **Process Isolation**: Kubernetes pods run in separate Linux namespaces, each with its own process namespace. Even on the same node, Pod A cannot directly access Pod B's memory space.
2. **GPU Partitioning**: The Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Pod A's GPU 0 and Pod B's GPU 0 are physically different devices.
3. **Peer Access and Memory Registration**: NVLink peer-to-peer transfers use `cudaMemcpy` with peer access enabled. Enabling peer access requires both GPUs to be visible to the same process so that `cudaDeviceEnablePeerAccess()` can be called, which is impossible across process boundaries.
### Where NVLink DOES Work
NVLink works **within a pod** for parallelism strategies (TP, EP) where all GPUs are in the same process:
```yaml
# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4"  # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4"  # NVLink used for TP/EP communication within pod
```
---
## Supported Communication Options
### Transport Comparison
| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
|-----------|-----------|---------|-----------|------------|------------|
| **NVLink** | 450-900 GB/s | ~µs | ✅ (intra-pod only) | ❌ | ✅ |
| **InfiniBand RDMA** | 20-50 GB/s | ~1 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **RoCE RDMA** | 10-25 GB/s | ~2 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **TCP** | 1-3 GB/s | ~50 µs | ✅ | ✅ | ❌ (host staging) |
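To put the bandwidth column in perspective, the sketch below estimates how long a payload takes to move over each transport. It is a bandwidth-only estimate using the ranges from the table; it ignores latency, protocol overhead, and host staging, so real transfers (TCP in particular) can be much slower. The function name is an assumption for the example.

```python
# (low, high) bandwidth in GB/s, copied from the transport comparison table.
TRANSPORT_BANDWIDTH_GB_S = {
    "NVLink": (450, 900),
    "InfiniBand RDMA": (20, 50),
    "RoCE RDMA": (10, 25),
    "TCP": (1, 3),
}


def transfer_time_ms(payload_gb: float, transport: str) -> tuple[float, float]:
    """Return (best_case_ms, worst_case_ms) for moving `payload_gb` gigabytes."""
    low, high = TRANSPORT_BANDWIDTH_GB_S[transport]
    return payload_gb / high * 1000, payload_gb / low * 1000


for name in TRANSPORT_BANDWIDTH_GB_S:
    best, worst = transfer_time_ms(1.0, name)
    print(f"{name:16s} 1 GB in {best:8.2f} to {worst:8.2f} ms")
```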
### Same-Node Communication
When prefill and decode workers are on the **same physical node**:
**Options (best to worst):**
1. InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
2. RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
3. Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
4. TCP (fallback) → GPU→CPU→TCP→CPU→GPU
**Best Practice**: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.
### Cross-Node Communication
When prefill and decode workers are on **different nodes**:
**Requirements for optimal cross-node performance:**
- RDMA network fabric (InfiniBand, RoCE, or AWS EFA)
- GPUDirect RDMA enabled (GPU memory registered with NIC)
- Proper UCX or libfabric configuration
---
## UCX Configuration Reference
### Environment Variables
UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.
#### Core Transport Selection
```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
```
| Transport | Description | When to Use |
|-----------|-------------|-------------|
| `rc_x` | Reliable Connection (accelerated) | Primary RDMA transport |
| `rc` | Reliable Connection (standard) | Fallback RDMA |
| `dc_x` | Dynamically Connected (accelerated) | Scalable RDMA (many endpoints) |
| `dc` | Dynamically Connected (standard) | Fallback scalable RDMA |
| `cuda_copy` | GPU↔Host memory staging | Required for GPU buffers |
| `cuda_ipc` | CUDA IPC (same-node, same-pod) | Intra-pod GPU transfers |
| `tcp` | TCP sockets | Fallback when RDMA unavailable |
| `srd` | Scalable Reliable Datagram (AWS EFA) | AWS-specific (provided by EFA, not core UCX) |
**Excluding transports**: Use `^` prefix to exclude (e.g., `UCX_TLS=^mm` excludes memory mapping).
**Note**: When specifying `UCX_TLS` explicitly with GPU memory, you must include `cuda_copy` or `cuda_ipc` for UCX to recognize GPU buffers.
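A pre-flight check for this rule could look like the sketch below. It only inspects the `UCX_TLS` string: it handles explicit lists, the `^` exclusion prefix, and the `all` keyword, and it does not query UCX itself. The helper names are hypothetical.

```python
GPU_TRANSPORTS = {"cuda_copy", "cuda_ipc"}


def parse_ucx_tls(value: str) -> tuple[str, list[str]]:
    """Return (mode, transports), where mode is 'include' or 'exclude'."""
    value = value.strip()
    mode = "exclude" if value.startswith("^") else "include"
    names = [n.strip() for n in value.lstrip("^").split(",") if n.strip()]
    return mode, names


def gpu_buffers_supported(value: str) -> bool:
    """True if the setting leaves at least one CUDA transport enabled."""
    mode, names = parse_ucx_tls(value)
    if mode == "include":
        return "all" in names or bool(GPU_TRANSPORTS & set(names))
    # Exclusion mode: everything not listed stays enabled.
    return not GPU_TRANSPORTS <= set(names)


print(gpu_buffers_supported("rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"))  # True
print(gpu_buffers_supported("rc_x,rc"))                             # False
```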
#### Rendezvous Protocol Settings
```yaml
env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
```
| Variable | Value | Description |
|----------|-------|-------------|
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA GET (receiver pulls data) |
| `UCX_RNDV_SCHEME` | `put_zcopy` | Zero-copy RDMA PUT (sender pushes data) |
| `UCX_RNDV_SCHEME` | `auto` | Let UCX choose based on message size |
| `UCX_RNDV_THRESH` | `0` | Use rendezvous for all message sizes |
| `UCX_RNDV_THRESH` | `8192` | Use rendezvous for messages ≥8KB |
| `UCX_RNDV_THRESH` | `auto` | Let UCX calculate optimal threshold |
**Recommendation**: Use `get_zcopy` with threshold `0` for KV cache transfers (always large).
> **⚠️ AWS EFA Exception**: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See [AWS EFA Configuration](#aws-efa-configuration) for required settings.
#### Memory Registration
```yaml
env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
```
| Method | Description |
|--------|-------------|
| `odp` | On-Demand Paging (dynamic registration) |
| `rcache` | Registration cache (reuse registrations) |
| `direct` | Direct registration (each transfer) |
#### Debugging and Diagnostics
```yaml
env:
  - name: UCX_LOG_LEVEL
    value: "info"  # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log"  # Optional: log to file instead of stdout
```
**Note**: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX compiled with `--enable-stats` flag, which is not enabled in default builds.
### Complete Production Configuration
```yaml
env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3"  # RoCE v2 GID index (cluster-specific)
```
### InfiniBand Configuration
For clusters with InfiniBand RDMA (e.g., ConnectX NICs), use UCX with the `rc` (Reliable Connection) transport. This is the standard path for on-premises and bare-metal Kubernetes clusters.
**RDMA Resources:**
Request one `rdma/ib` device per GPU. The RDMA device plugin injects `/dev/infiniband/*` devices automatically:
```yaml
resources:
  limits:
    gpu: "4"
    custom:
      rdma/ib: "4"
```
No pod annotations are needed. InfiniBand devices are injected by the device plugin.
**Security Context:**
Add the `IPC_LOCK` and `SYS_RESOURCE` capabilities. `IPC_LOCK` allows RDMA memory pinning, and `SYS_RESOURCE` allows memlock limit escalation:
```yaml
securityContext:
  runAsUser: 0
  capabilities:
    add:
      - IPC_LOCK
      - SYS_RESOURCE
```
**Environment Variables (worker containers):**
```yaml
env:
  # --- UCX (RDMA transport) ---
  - name: UCX_TLS
    value: "rc_x,rc,cuda_copy,cuda_ipc"
  - name: UCX_NET_DEVICES
    value: ":1"  # e.g. "mlx5_0:1" — run `ibv_devinfo` to find your device
  - name: UCX_IB_ADDR_TYPE
    value: "eth"  # required for cross-pod IB on Kubernetes
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
  - name: UCX_RC_TIMEOUT
    value: "600s"
  - name: UCX_KEEPALIVE_INTERVAL
    value: "300s"
```
| Variable | Description |
|----------|-------------|
| `UCX_TLS` | `rc_x` (accelerated RC) listed first for optimal RDMA performance |
| `UCX_NET_DEVICES` | Bind to a specific IB device. Run `ibv_devinfo` inside a pod to list available devices. Use a non-bonded device with a valid LID. |
| `UCX_IB_ADDR_TYPE` | Must be `eth` for cross-pod communication on Kubernetes. Without this, UCX uses LID-based addressing which does not route between pods. |
| `UCX_RNDV_SCHEME` | `get_zcopy` enables zero-copy RDMA GET, optimal for large KV cache transfers |
> **Note**: `UCX_IB_ADDR_TYPE=eth` is the most common missing setting when bringing up NIXL disagg on InfiniBand clusters. If NIXL init succeeds but transfers fail with `NIXL_ERR_REMOTE_DISCONNECT`, this is likely the cause.
**Known Issue — Bonded IB devices:**
Some clusters expose bonded InfiniBand devices (e.g., `mlx5_bond_0`) with LID=0. If UCX selects a bonded device, transfers may fail. Verify device LIDs and select a non-bonded device:
```bash
# Inside a pod with rdma/ib resources:
ibv_devinfo | grep -E "hca_id|lid"
# Use a device with a non-zero LID in UCX_NET_DEVICES
```
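The device check can also be scripted. The sketch below parses `ibv_devinfo` text and returns devices that report a non-zero `port_lid`. It assumes the usual `key: value` layout of `ibv_devinfo` output; the sample text is illustrative rather than captured from a real cluster, and the helper name is hypothetical.

```python
def devices_with_valid_lid(ibv_devinfo_output: str) -> list[str]:
    """Return hca_id values that have at least one port with a non-zero LID."""
    devices: list[str] = []
    current = None
    for line in ibv_devinfo_output.splitlines():
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if key == "hca_id":
            current = value
        elif key == "port_lid" and current and value.isdigit() and int(value) != 0:
            if current not in devices:
                devices.append(current)
    return devices


# Illustrative sample, trimmed to the fields the parser reads.
SAMPLE_OUTPUT = """
hca_id: mlx5_0
    port: 1
        sm_lid: 1
        port_lid: 12
hca_id: mlx5_bond_0
    port: 1
        sm_lid: 0
        port_lid: 0
"""
print(devices_with_valid_lid(SAMPLE_OUTPUT))  # ['mlx5_0']
```

A device returned by this check is a candidate for `UCX_NET_DEVICES`, for example `mlx5_0:1`.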
### AWS EFA Configuration
NIXL supports **libfabric** as the backend for AWS EFA deployments. This is the **recommended approach** for disaggregated inference on AWS, achieving ~9.6 GB/s KV transfer bandwidth. See the [AWS EFA with NIXL documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nixl.html) for complete setup instructions.
**Requirements:**
- EFA installer version **1.47.0** or later
- Libfabric (installed via EFA installer at `/opt/amazon/efa`)
- GDRCopy for GPU Direct RDMA operations (GPU Operator v26.x installs this automatically)
- EFA-enabled container image (e.g., `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-efa-amd64`)
**Kernel Compatibility:**
GDRCopy v2.5.1 has a build failure on kernel 6.15+ due to a `vm_flags_set` redefinition. Pin your Ubuntu EKS AMI to kernel 6.14 or earlier until GDRCopy v2.5.2 is available in GPU Operator.
| Kernel Version | GDRCopy v2.5.1 | GDRCopy v2.5.2 |
|----------------|----------------|----------------|
| 6.14 and below | ✅ Works | ✅ Works |
| 6.15+ | ❌ Build fails | ✅ Works |
**Pod Anti-Affinity (Required):**
EFA is designed for **cross-node** communication. Prefill and decode workers must be scheduled on **different nodes** to avoid EAGAIN errors during KV transfer.
```yaml
VllmDecodeWorker:
  extraPodSpec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: nvidia.com/dynamo-component
                  operator: In
                  values:
                    - VllmPrefillWorker
            topologyKey: kubernetes.io/hostname
```
> **Note**: Anti-affinity only needs to be configured on one side (here, the decode worker). The Kubernetes scheduler enforces the constraint symmetrically—if decode cannot be placed with prefill, they will end up on different nodes regardless of which pod has the rule.
**EFA Resource Requests:**
Request EFA interfaces in your pod spec. The p5.48xlarge instance has **32 EFA interfaces** (32 network cards × 1 interface each) with 3200 Gbps total bandwidth. The number of interfaces to allocate per worker depends on your deployment:
| Deployment | EFA per Worker | Rationale |
|------------|----------------|-----------|
| 1P + 1D per node pair | 4 | Achieved ~9.6 GB/s; leaves 24 interfaces for other pods |
| Multi-worker per node | 2-4 | Balance between workers sharing the node |
| Maximum bandwidth | 8-16 | For very large KV cache transfers or TP>1 |
Example with 4 EFA interfaces (validated configuration):
```yaml
extraPodSpec:
  mainContainer:
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        vpc.amazonaws.com/efa: "4"
      requests:
        vpc.amazonaws.com/efa: "4"
```
> **Note**: NIXL/libfabric automatically stripes traffic across all allocated EFA interfaces. The 4-interface configuration achieved ~9.6 GB/s in testing, which is sufficient for Llama-3.1-8B KV cache transfers at ISL=8000. Increase the count if your workload requires higher bandwidth (e.g., larger models or higher TP).
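A back-of-the-envelope estimate shows why ~9.6 GB/s is adequate for this case. The sketch assumes the published Llama-3.1-8B architecture (32 layers, 8 KV heads, head dimension 128) and a 16-bit KV cache; it ignores tensor-parallel sharding, block metadata, and protocol overhead, so treat the result as a rough lower bound on transfer time.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_element: int = 2) -> int:
    """Size of the K and V tensors for `tokens` tokens across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element * tokens


def transfer_seconds(size_bytes: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-only transfer time, with 1 GB taken as 1e9 bytes."""
    return size_bytes / (bandwidth_gb_s * 1e9)


size = kv_cache_bytes(tokens=8000, layers=32, kv_heads=8, head_dim=128)
print(f"KV cache at ISL=8000: {size / 1e9:.2f} GB")
print(f"Transfer at 9.6 GB/s: {transfer_seconds(size, 9.6) * 1000:.0f} ms")
```

Under these assumptions the KV cache for one 8000-token request is about 1.05 GB, which moves in roughly a tenth of a second at the measured bandwidth.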
**Environment Variables:**
```yaml
env:
  - name: NIXL_LOG_LEVEL
    value: "INFO"
  - name: LD_LIBRARY_PATH
    value: "/usr/local/nixl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib64:$(LD_LIBRARY_PATH)"
```
**vLLM Configuration:**
```bash
vllm serve \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cuda","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'
```
| Parameter | Value | Purpose |
|-----------|-------|---------|
| `kv_connector` | `NixlConnector` | Enables NIXL for KV-cache transfer |
| `kv_role` | `kv_both` | Symmetric functionality (producer and consumer) |
| `kv_buffer_device` | `cuda` | Uses GPU memory for KV-cache buffer |
| `backends` | `["LIBFABRIC"]` | Routes NIXL traffic over EFA |
**Verification:**
```bash
# Confirm EFA/libfabric installation
fi_info -p efa -t FI_EP_RDM
# Verify GDRCopy device
ls -la /dev/gdrdrv
# Check NIXL initialization in pod logs (should show 32 EFA devices on p5.48xlarge)
kubectl logs <pod-name> | grep -i "NIXL\|libfabric\|efa"
```
**Expected Log Output:**
```text
NIXL INFO Loaded backend plugin: LIBFABRIC
NIXL INFO Found 32 fabric devices
```
---
## Deployment Configuration
### Kubernetes Resource Requirements
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]  # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2"  # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"
```
### Required Capabilities and Resources
| Setting | Purpose | Notes |
|---------|---------|-------|
| `IPC_LOCK` capability | Pin memory for RDMA | Bypasses RLIMIT_MEMLOCK; required for `ibv_reg_mr()` to pin GPU/host buffers |
| `rdma/ib` resources | RDMA NIC access | Provided by RDMA device plugin |
| `sharedMemory.size` | IPC between processes | 16Gi for vLLM, 80Gi for TRT-LLM |
### Infrastructure Prerequisites
1. **RDMA Device Plugin**: Exposes `rdma/ib` or `vpc.amazonaws.com/efa` resources to Kubernetes
```bash
# InfiniBand/RoCE
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'
# AWS EFA
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.vpc\.amazonaws\.com/efa}'
```
2. **RDMA Network**: One of:
- InfiniBand or RoCE fabric
- AWS EFA (Elastic Fabric Adapter)
3. **GPUDirect RDMA** (optional but recommended):
- NVIDIA driver with GPUDirect enabled
- `nvidia-peermem` kernel module loaded (InfiniBand/RoCE)
- GDRCopy installed (AWS EFA with libfabric)
---
## Diagnostics and Performance Validation
### Pre-Deployment Validation
#### 1. Verify RDMA Availability
```bash
# Check RDMA devices on node
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
ibv_devinfo
```
Expected output shows InfiniBand or RoCE devices:
```text
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.35.2000
...
```
#### 2. Check UCX Transport Capabilities
```bash
# Inside a Dynamo worker pod
ucx_info -d
```
Look for GPU memory support:
```text
# Memory domain: mlx5_0
# Component: ib
# memory types: host (access,reg,cache), cuda (access,reg,cache)
# ^^^^ GPU memory supported
```
**If you only see `host`**: GPUDirect RDMA is not working. KV transfers will use host staging.
#### 3. Test UCX Performance
```bash
# Server (on decode worker pod)
ucx_perftest -t tag_bw -n 100 -s 134217728
# Client (on prefill worker pod); pass the server pod's IP address
ucx_perftest <server-ip> -t tag_bw -n 100 -s 134217728
```
**Expected bandwidth**:
- InfiniBand HDR: 20-25 GB/s per port
- RoCE 100GbE: 10-12 GB/s
- TCP fallback: 1-2 GB/s
### NIXL Benchmark Tool
Deploy the NIXL benchmark to validate end-to-end KV transfer performance:
```bash
cd deploy/pre-deployment/nixl
./build_and_deploy.sh
```
This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.
### Runtime Diagnostics
#### Verify NIXL Backend Initialization
```bash
kubectl logs <pod-name> | grep -i "NIXL\|UCX"
```
**Good output**:
```text
NIXL INFO Backend UCX was instantiated
```
**Bad output** (RDMA not working):
```text
UCX WARN no RDMA transports available
NIXL INFO falling back to TCP transport
```
#### Monitor Transfer Performance
Check Grafana dashboards for:
- **NIXL transfer bandwidth**: Should show GB/s, not MB/s
- **KV cache transfer latency**: Should be under 500ms for typical workloads
**Red flags indicating RDMA issues**:
- Transfer bandwidth under 1 GB/s
- TTFT > 10 seconds
- `Unsupported operation` errors in logs
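The red flags above can be folded into a small triage helper. The following is a minimal sketch, not part of Dynamo: the function name `rdma_red_flags` and its inputs are hypothetical, and it assumes you have already read the bandwidth and TTFT values from your dashboards and collected the worker log text yourself. The thresholds are taken directly from the list above.

```python
from typing import List


def rdma_red_flags(bandwidth_gb_s: float, ttft_s: float, log_text: str) -> List[str]:
    """Return the red flags from the list above that apply to the observations."""
    flags = []
    if bandwidth_gb_s < 1.0:  # transfer bandwidth under 1 GB/s
        flags.append("bandwidth under 1 GB/s")
    if ttft_s > 10.0:  # TTFT > 10 seconds
        flags.append("TTFT over 10 s")
    if "Unsupported operation" in log_text:
        flags.append("'Unsupported operation' in logs")
    return flags


# Hypothetical readings that resemble a TCP fallback: ~100 MB/s and a 98 s TTFT.
print(rdma_red_flags(0.1, 98.0, "UCX WARN no RDMA transports available"))
```

An empty list means none of the listed red flags fired; it does not prove that GPUDirect RDMA is active, so the log checks above still apply.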
### Common Diagnostic Commands
```bash
# Check UCX transport selection
kubectl exec <pod-name> -- env | grep UCX
# Verify RDMA device visibility
kubectl exec <pod-name> -- ls /dev/infiniband/
# Check GPUDirect RDMA status (on node)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- \
  nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"
# Test basic connectivity between pods
kubectl exec <pod-name> -- ping -c 3 <target-pod-ip>
```
---
## Performance Expectations
### KV Cache Transfer Overhead
| Configuration | TTFT Overhead (avg) | KV Transfer BW | Source |
|---------------|---------------------|----------------|--------|
| Aggregated (baseline) | 0 | N/A | No KV transfer needed |
| Disagg + InfiniBand RDMA with GPUDirect | +200-500ms | 20-50 GB/s | *Expected* based on hardware specs |
| Disagg + RoCE RDMA with GPUDirect | +300-800ms | 10-25 GB/s | *Expected* based on hardware specs |
| Disagg + AWS EFA with libfabric + GDRCopy | **+37ms** | **~9.6 GB/s** | *Measured* on AWS p5.48xlarge (Llama-3.1-8B, ISL=8000, OSL=50) |
| Disagg + Host-staged (no GPUDirect) | +1-3s | 1-3 GB/s | *Expected* - CPU bottleneck |
| Disagg + AWS EFA with UCX (without GPUDirect) | ~3x slower than aggregated | ~1 GB/s | *Measured* on AWS p5.48xlarge |
| Disagg + TCP fallback | **+90-100s** | ~100 MB/s | *Measured* ~98s TTFT on AWS p5.48xlarge |
> **Note**: For AWS EFA deployments, use libfabric with GDRCopy to enable GPUDirect RDMA. UCX on AWS EFA does not support GPUDirect on kernel ≥6.8 and results in severely degraded performance. See [AWS EFA Configuration](#aws-efa-configuration) for setup instructions.
### When Disaggregated Makes Sense
**Use disaggregated architecture when:**
- Input sequence length (ISL) ≥ 4000 tokens (14-22% throughput gain)
- You need independent scaling of prefill vs decode capacity
- Prefill and decode have different hardware requirements
**Use aggregated architecture when:**
- Low-latency TTFT is critical
- Input sequences under 2000 tokens (minimal disagg benefit)
- RDMA is not available
### Break-Even Analysis
The KV transfer overhead is amortized across output tokens. **Measured data from Llama-3.1-8B-Instruct** on AWS p5.48xlarge with NIXL+libfabric:
```text
KV Transfer Overhead (TTFT min, unqueued):
- Aggregated: ~173ms
- Disaggregated: ~210ms
- KV transfer cost: ~37ms
Performance at ISL=8000, OSL=50, concurrency=10:
- ITL improvement: 41% faster per-token generation
- Throughput gain: 22% higher output throughput
```
**Key Insight**: The KV transfer overhead via libfabric+EFA is only **~37ms**. Combined with 41% faster decode (ITL), disaggregated inference delivers **22% higher throughput** for prefill-bound workloads.
| Metric | Aggregated | Disaggregated | Difference |
|--------|------------|---------------|------------|
| TTFT (min, unqueued) | 173 ms | 210 ms | +37ms |
| TTFT (p95) | 2097 ms | 1752 ms | **-16%** |
| ITL (avg) | 28.5 ms | 16.9 ms | **-41%** |
| Output throughput (ISL=8000, OSL=50) | 204 tok/s | 248 tok/s | **+22%** |
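The Difference column can be reproduced from the two measured columns. The snippet below is a small arithmetic check that uses only the values in the table above; the helper name `pct_change` is illustrative.

```python
def pct_change(aggregated: float, disaggregated: float) -> float:
    """Relative change of the disaggregated value versus the aggregated baseline, in percent."""
    return (disaggregated - aggregated) / aggregated * 100.0


# Values copied from the table above.
ttft_min_delta_ms = 210 - 173
ttft_p95_pct = pct_change(2097, 1752)
itl_pct = pct_change(28.5, 16.9)
throughput_pct = pct_change(204, 248)

print(f"TTFT min delta: +{ttft_min_delta_ms} ms")
print(f"TTFT p95: {ttft_p95_pct:.1f}%")
print(f"ITL avg: {itl_pct:.1f}%")
print(f"Output throughput: {throughput_pct:+.1f}%")
```

Rounded to whole percentages, the results match the table: +37 ms, -16%, -41%, and +22%.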
**Disagg advantage scales with input length (ISL)** (all at OSL=50, concurrency=10):
| ISL | Throughput Δ | ITL Δ | Recommendation |
|-----|--------------|-------|----------------|
| 1000 | ~0% | -7% | Use aggregated |
| 2000 | +3% | -11% | Either works |
| 4000 | +14% | -18% | Disagg preferred |
| 8000 | **+22%** | **-41%** | **Disagg strongly preferred** |
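As a rough illustration, the Recommendation column can be expressed as a lookup. This is a sketch under stated assumptions, not a Dynamo API: the function name is hypothetical, ISL values between the measured rows inherit the recommendation of the next lower row, and the result only applies to workloads that resemble the measured one (OSL=50, concurrency=10).

```python
def recommend_architecture(isl_tokens: int, rdma_available: bool = True) -> str:
    """Map an input sequence length to the Recommendation column of the table above."""
    if not rdma_available:
        return "aggregated"  # guidance above: use aggregated when RDMA is not available
    if isl_tokens < 2000:
        return "aggregated"
    if isl_tokens < 4000:
        return "either"
    if isl_tokens < 8000:
        return "disaggregated (preferred)"
    return "disaggregated (strongly preferred)"


for isl in (1000, 2000, 4000, 8000):
    print(isl, recommend_architecture(isl))
```

The sketch ignores the other criteria listed earlier, such as strict TTFT requirements or the need to scale prefill and decode independently, so treat its output as a starting point rather than a decision.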
---
## Troubleshooting Guide
### Problem: TTFT is 10+ seconds
**Symptoms**: TTFT degrades from expected 200-500ms to 10+ seconds
**Root Cause**: RDMA not active, falling back to TCP
**Diagnosis**:
```bash
kubectl logs <pod-name> | grep -i "transport\|UCX\|TCP"
```
**Solutions**:
1. Verify RDMA device plugin is installed
2. Add `rdma/ib` resource requests to pod spec
3. Add `IPC_LOCK` capability
4. Set UCX environment variables
### Problem: "Unsupported operation" errors
**Symptoms**: Logs show `Unexpected UCX error: Unsupported operation`
**Root Cause**: UCX attempting GPU RDMA on hardware that doesn't support it
**Solutions**:
1. Check if GPUDirect RDMA is enabled: `ucx_info -d | grep cuda`
2. If not supported, set `UCX_RNDV_THRESH=inf` to disable GPU RDMA
3. Verify `nvidia-peermem` module is loaded
### Problem: AWS EFA not using GPU Direct
**Symptoms**: 3x performance degradation on AWS despite EFA configured
**Root Cause**: GPUDirect RDMA not functional on kernel ≥6.8 with EFA when using UCX
**Solution**: Use libfabric instead of UCX for AWS EFA deployments. Libfabric with GDRCopy provides efficient GPUDirect RDMA operations on AWS. See the [AWS EFA Configuration](#aws-efa-configuration) section for setup instructions.
**Alternative options** (if libfabric is not available):
1. Use kernel before 6.8 (Ubuntu 22.04 with kernel 5.15)
2. Accept host-staging performance penalty
### Problem: EFA EAGAIN errors (fi_read still retrying)
**Symptoms**: Decode worker logs show repeated EAGAIN errors:
```text
fi_read still retrying EAGAIN on rail 0
fi_read still retrying EAGAIN on rail 1
...
```
**Root Cause**: Prefill and decode workers are scheduled on the **same node**. AWS EFA is designed for cross-node communication and does not function correctly for intra-node transfers.
**Diagnosis**:
```bash
# Check if workers are on the same node
kubectl get pods -o wide | grep vllm
```
If both prefill and decode workers show the same NODE, this is the problem.
**Solution**: Add pod anti-affinity rules to ensure workers are scheduled on different nodes:
```yaml
VllmDecodeWorker:
  extraPodSpec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: nvidia.com/dynamo-component
                  operator: In
                  values:
                    - VllmPrefillWorker
            topologyKey: kubernetes.io/hostname
```
> **Note**: Use `nvidia.com/dynamo-component` as the label key, not `app.kubernetes.io/component`. The Dynamo operator uses this label to identify component types.
### Problem: NIXL_ERR_BACKEND at create_backend on InfiniBand
**Symptoms**: NIXL backend creation fails immediately with `NIXL_ERR_BACKEND`. UCX logs show:
```text
mlx5dv_devx_obj_destroy(SRQ) failed: Invalid argument
mlx5dv_devx_obj_destroy(CQ) failed: Invalid argument
```
Or:
```text
select.c: no active messages transport: Unsupported operation
```
**Root Causes**:
1. **Bonded IB device with LID=0**: UCX selects `mlx5_bond_0` by default, but bonded devices may have LID=0 (invalid for UD transport). Fix: set `UCX_NET_DEVICES` to a non-bonded device with a valid LID.
2. **UCX/OFED version mismatch**: The container's UCX mlx5 library may be compiled against a different devx ABI than the host kernel driver. Any transport using IB (rc, cuda_ipc with IB) triggers the devx crash.
3. **Missing RDMA device injection**: If `rdma/ib` is not requested in the pod spec, no IB devices are injected into the container.
**Diagnosis**:
```bash
# Check which IB devices are visible and their LIDs
ibv_devinfo | grep -E "hca_id|lid"
# Verify rdma/ib was requested
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'
# Check /dev/infiniband exists
ls -la /dev/infiniband/
```
**Solutions**:
1. Request `rdma/ib` resources (1 per GPU) in the pod spec
2. Set `UCX_NET_DEVICES` to a non-bonded device if `mlx5_bond_0` has LID=0
3. Ensure the container image's UCX build matches the host OFED version
### Problem: Intermittent transfer failures
**Symptoms**: Sporadic `getXferStatus: backend 'UCX' returned error status`
**Diagnosis**:
```bash
# Enable UCX debug logging
kubectl set env deployment/<deployment-name> UCX_LOG_LEVEL=debug
kubectl logs <pod-name> | grep -i error
```
**Common causes**:
- Network congestion or packet loss
- Mismatched UCX versions between pods
- RDMA resource exhaustion
---
## Quick Reference
### Minimum Viable RDMA Configuration
```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
securityContext:
  capabilities:
    add: ["IPC_LOCK"]
resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"
```
### Diagnostic Checklist
- [ ] `rdma/ib` resources visible: `kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'`
- [ ] NIXL initialized: `kubectl logs <pod-name> | grep "Backend"`
- [ ] Transfer bandwidth > 1 GB/s (check Grafana metrics)
**For UCX deployments:**
- [ ] UCX sees RDMA devices: `ucx_info -d | grep "Transport: rc"`
- [ ] UCX sees GPU memory: `ucx_info -d | grep "memory types.*cuda"`
**For libfabric deployments (AWS EFA):**
- [ ] EFA devices available: `fi_info -p efa`
- [ ] GDRCopy installed: `ls /dev/gdrdrv`
---
## Related Documentation
- [Disaggregated Serving Architecture](/dynamo/design-docs/disaggregated-serving)
- [AIConfigurator Deployment Guide](/dynamo/user-guides/disaggregated-serving)
- [NIXL Benchmark Deployment](../../deploy/pre-deployment/nixl/README.md)
- [KV Cache Transfer Methods](/dynamo/additional-resources/tensor-rt-llm-details/kv-cache-transfer)
# Topology-Aware KV Transfer
Topology-aware KV transfer lets a disaggregated Dynamo deployment route decode requests toward workers that share the selected prefill worker's topology domain, such as zone or rack. This reduces slow cross-domain KV-cache transfers when prefill and decode workers exchange KV data over NIXL.
Use this feature when:
- Your deployment uses separate prefill and decode workers.
- Your cluster exposes useful node labels, such as `topology.kubernetes.io/zone` or a rack/block label.
- Same-domain KV transfer is required for correctness or strongly preferred for latency and bandwidth.
This page covers the Kubernetes operator path. For router and runtime behavior, see [Router Topology-Aware KV Transfer](/dynamo/components/router/topology-aware-kv-transfer).
For RDMA/NIXL transport setup, see [Disagg Communication](/dynamo/kubernetes-deployment/operate/disagg-communication).
## How It Works
```mermaid
flowchart LR
DGD["DGD spec.experimental.kvTransferPolicy"] --> Operator["Operator injects worker env + Downward API volume"]
Node["Node label"] --> Controller["Topology label controller"]
Controller --> Pod["Worker pod label"]
Pod --> Volume["/etc/dynamo/topology/"]
Volume --> Runtime["Worker publishes ModelRuntimeConfig topology metadata"]
Runtime --> Prefill["Prefill router derives decode constraints"]
Prefill --> Decode["Decode router selects same or preferred topology"]
```
The operator configures worker pods from `spec.experimental.kvTransferPolicy`:
- Adds a `nvidia.com/topology-label-key` annotation to worker pods.
- Runs a topology-label controller that copies the configured node label onto the worker pod after scheduling.
- Projects that pod label into `/etc/dynamo/topology/` with a Downward API volume.
- Injects worker environment variables that tell the backend runtime which topology domain and enforcement policy to publish.
The frontend does not read this policy from its own environment. Workers publish the topology metadata in their `ModelRuntimeConfig`; the router reads it from runtime discovery.
## Prerequisites
| Requirement | Details |
|-------------|---------|
| Disaggregated serving | Separate prefill and decode worker services. |
| KV router | The frontend should use `DYN_ROUTER_MODE=kv`. |
| Node topology labels | Every node that can host a worker must carry the configured `labelKey`. |
| Dynamo operator | The operator must include topology-label controller and node-read RBAC. |
| KV transfer transport | RDMA, EFA, or another NIXL-compatible transport should already be configured for production disaggregated deployments. |
Confirm that the label you plan to use exists on worker nodes:
```bash
kubectl get nodes -L topology.kubernetes.io/zone
```
## Required Same-Domain Routing
`enforcement: required` constrains decode worker selection to workers whose topology value matches the selected prefill worker for the configured domain. If no decode worker satisfies the generated constraint, the router fails the request instead of silently crossing the domain.
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-disagg-zone
spec:
  experimental:
    kvTransferPolicy:
      labelKey: topology.kubernetes.io/zone
      domain: zone
      enforcement: required
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              env:
                - name: DYN_ROUTER_MODE
                  value: kv
    - name: VllmPrefillWorker
      type: worker
      replicas: 2
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              command: ["python3", "-m", "dynamo.vllm"]
              args: ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "prefill"]
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"
    - name: VllmDecodeWorker
      type: worker
      replicas: 2
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              command: ["python3", "-m", "dynamo.vllm"]
              args: ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "decode"]
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"
```
`enforcement` defaults to `required` when omitted.
`required` is a decode-routing constraint, not a capacity planner. The `DynamoGraphDeployment` author or cluster administrator must ensure that every topology domain that can receive prefill workers also has sufficient same-domain decode capacity. If a domain has prefill workers but no matching decode workers, or too little decode capacity, the router cannot spill to another domain without violating the policy.
### Capacity Planning Across Domains
Plan prefill and decode capacity per topology domain before enabling `enforcement: required`. For example, assume:
- Two availability zones: `az-1` and `az-2`.
- The target fleet is 60 prefill workers and 120 decode workers.
- The fleet should be split evenly across the two zones.
- The target prefill-to-decode ratio is 1:2 in each zone.
That means each zone should run 30 prefill workers and 60 decode workers:
| Zone | Prefill workers | Decode workers | Ratio |
|------|-----------------|----------------|-------|
| `az-1` | 30 | 60 | 1:2 |
| `az-2` | 30 | 60 | 1:2 |
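The per-zone numbers follow from dividing the fleet totals by the number of zones. As a hedged sketch of that arithmetic (the helper name `per_zone_capacity` is hypothetical and only covers the even-split case used in this example):

```python
from typing import Dict, List


def per_zone_capacity(prefill_total: int, decode_total: int,
                      zones: List[str]) -> Dict[str, Dict[str, int]]:
    """Split a fleet evenly across zones; raise if the totals do not divide evenly."""
    count = len(zones)
    if count == 0:
        raise ValueError("at least one zone is required")
    if prefill_total % count or decode_total % count:
        raise ValueError("fleet totals do not split evenly across zones")
    return {
        zone: {"prefill": prefill_total // count, "decode": decode_total // count}
        for zone in zones
    }


plan = per_zone_capacity(60, 120, ["az-1", "az-2"])
for zone, workers in plan.items():
    # Each zone keeps the 1:2 prefill-to-decode ratio of the whole fleet.
    print(zone, workers["prefill"], workers["decode"])
```

Raising on an uneven split is a deliberate choice in this sketch: with `enforcement: required`, a zone that ends up short on decode workers cannot borrow capacity from another zone.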
In a `DynamoGraphDeployment`, express this as separate prefill and decode components per zone. Pin each component to its zone and set `kvTransferPolicy.enforcement` to `required` so the router refuses cross-zone decode selection. The DGD author or cluster administrator must ensure each zone has enough schedulable capacity for its pinned replicas. Worker command and args are omitted here; configure each worker for prefill or decode mode as in the base disaggregated serving manifest:
```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-disagg-zone-capacity
spec:
  experimental:
    kvTransferPolicy:
      labelKey: topology.kubernetes.io/zone
      domain: zone
      enforcement: required
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              env:
                - name: DYN_ROUTER_MODE
                  value: kv
    - name: VllmPrefillWorkerAz1
      type: worker
      replicas: 30
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-1"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorkerAz1
      type: worker
      replicas: 60
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-1"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmPrefillWorkerAz2
      type: worker
      replicas: 30
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-2"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorkerAz2
      type: worker
      replicas: 60
      podTemplate:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: topology.kubernetes.io/zone
                        operator: In
                        values: ["az-2"]
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
              envFrom:
                - secretRef:
                    name: hf-token-secret
```
## Preferred Same-Domain Routing
`enforcement: preferred` keeps all decode workers eligible but biases worker selection toward the same topology domain.
```yaml
spec:
  experimental:
    kvTransferPolicy:
      labelKey: topology.kubernetes.io/zone
      domain: zone
      enforcement: preferred
      preferredWeight: 0.85
```
`preferredWeight` is required with `enforcement: preferred`. It must be between `0` and `1`. A higher value creates a stronger same-domain preference, but it is not a probability and does not guarantee same-domain selection.
## Field Reference
| Field | Required | Description |
|-------|----------|-------------|
| `labelKey` | Yes | Kubernetes node label key to copy onto worker pods, for example `topology.kubernetes.io/zone`. |
| `domain` | Yes | Logical topology domain name published by workers, for example `zone` or `rack`. Must match `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`. |
| `enforcement` | No | `required` or `preferred`. Defaults to `required`. |
| `preferredWeight` | Only with `preferred` | Bias weight from `0` to `1`; only valid with `enforcement: preferred`. |
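The rules in this table can be checked before applying a manifest. The following is a minimal sketch of such a pre-flight check, not the operator's actual validation: the function name is hypothetical, and treating `0` and `1` as valid `preferredWeight` values (inclusive bounds) is an assumption.

```python
import re
from typing import Any, Dict

# Pattern copied from the `domain` row of the table above.
DOMAIN_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def validate_kv_transfer_policy(policy: Dict[str, Any]) -> Dict[str, Any]:
    """Check a kvTransferPolicy mapping against the field table and apply the default."""
    if not policy.get("labelKey"):
        raise ValueError("labelKey is required")
    domain = policy.get("domain")
    if not isinstance(domain, str) or not DOMAIN_RE.fullmatch(domain):
        raise ValueError("domain is required and must match the documented pattern")
    enforcement = policy.get("enforcement", "required")  # documented default
    if enforcement not in ("required", "preferred"):
        raise ValueError("enforcement must be 'required' or 'preferred'")
    weight = policy.get("preferredWeight")
    if enforcement == "preferred":
        if weight is None:
            raise ValueError("preferredWeight is required with enforcement: preferred")
        if not 0 <= weight <= 1:  # inclusive bounds are an assumption
            raise ValueError("preferredWeight must be between 0 and 1")
    elif weight is not None:
        raise ValueError("preferredWeight is only valid with enforcement: preferred")
    return {**policy, "enforcement": enforcement}


print(validate_kv_transfer_policy(
    {"labelKey": "topology.kubernetes.io/zone", "domain": "zone"}
))
```

The function returns a copy of the policy with `enforcement` filled in, so callers can see which default was applied.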
The runtime uses `domain`, not the Kubernetes label key, when creating routing constraints. For example, `labelKey: topology.kubernetes.io/zone` and `domain: zone` produce worker topology metadata like:
```json
{
  "topology_domains": {
    "zone": "us-east-1a"
  },
  "kv_transfer_domain": "zone",
  "kv_transfer_enforcement": "required"
}
```
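To make the role of `domain` concrete, the sketch below filters decode workers by the published metadata. It is an illustration of the documented matching rule, not the router's implementation: the function name, the worker names, and the in-memory dictionaries are all hypothetical.

```python
from typing import Dict, List


def eligible_decode_workers(prefill_meta: Dict, decode_workers: Dict[str, Dict]) -> List[str]:
    """Return decode workers whose value for the transfer domain matches the prefill worker's."""
    domain = prefill_meta["kv_transfer_domain"]
    target = prefill_meta["topology_domains"][domain]
    matches = sorted(
        name for name, meta in decode_workers.items()
        if meta.get("topology_domains", {}).get(domain) == target
    )
    if not matches and prefill_meta.get("kv_transfer_enforcement") == "required":
        # Documented behavior for required: fail instead of silently crossing the domain.
        raise LookupError(f"no decode worker in {domain}={target}")
    return matches


prefill = {
    "topology_domains": {"zone": "us-east-1a"},
    "kv_transfer_domain": "zone",
    "kv_transfer_enforcement": "required",
}
decoders = {
    "decode-0": {"topology_domains": {"zone": "us-east-1a"}},
    "decode-1": {"topology_domains": {"zone": "us-east-1b"}},
}
print(eligible_decode_workers(prefill, decoders))  # ['decode-0']
```

Note that the lookup key is the logical `domain` (`zone`), not the Kubernetes label key. With `preferred` enforcement the sketch simply returns the same-domain workers, and an empty list means no same-domain candidate exists; how the real router weights the remaining workers is not modeled here.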
## Verify the Deployment
After the DGD creates worker pods, verify the operator pipeline from node label to runtime topology file.
```bash
export NAMESPACE=<namespace>
export POD=<worker-pod-name>
kubectl get pod "$POD" -n "$NAMESPACE" \
-o jsonpath='{.metadata.annotations.nvidia\.com/topology-label-key}{"\n"}'
kubectl get pod "$POD" -n "$NAMESPACE" \
-o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
kubectl exec "$POD" -n "$NAMESPACE" -- \
sh -c 'find /etc/dynamo/topology -maxdepth 1 -type f -print -exec cat {} \;'
```
Expected results:
- The annotation value is the configured `labelKey`.
- The worker pod has the copied topology label.
- `/etc/dynamo/topology/` exists and contains the topology value.
Worker logs should include topology config during startup:
```bash
kubectl logs "$POD" -n "$NAMESPACE" | grep -i "Topology config"
```
## Troubleshooting
### Pod Has No Copied Topology Label
Check whether the node has the configured label:
```bash
NODE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.nodeName}')
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
```
If the label is missing, the topology-label controller emits a warning event with reason `TopologyLabelMissing` and leaves topology metadata unavailable for that worker.
```bash
kubectl get events -n "$NAMESPACE" \
--field-selector involvedObject.name="$POD",reason=TopologyLabelMissing
```
### Worker Exits While Waiting for Topology
When topology is enabled, the worker waits for the transfer-domain file to appear and contain data. If it stays empty, check:
- `spec.experimental.kvTransferPolicy.domain` matches the projected file name.
- `spec.experimental.kvTransferPolicy.labelKey` exists on the worker's node.
- The worker pod has the `nvidia.com/topology-label-key` annotation.
- The topology-label controller is running and has node `get` RBAC.
### Required Policy Fails Requests
With `enforcement: required`, decode routing fails if no decode worker has the same generated topology taint as the selected prefill worker. Verify both prefill and decode workers publish the same `domain`, and that each domain where prefill workers can be selected has enough matching decode workers for the expected prefill-to-decode ratio.
Use `preferred` while validating a heterogeneous rollout if cross-domain routing is acceptable during partial capacity.
## Relationship to Topology Aware Scheduling
[Topology Aware Scheduling](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) controls where Kubernetes places pods. Topology-aware KV transfer controls how Dynamo routes between already-running prefill and decode workers.
Use them together when possible:
- Topology Aware Scheduling keeps workers placed inside useful topology boundaries.
- Topology-aware KV transfer prevents the router from choosing a decode worker outside the selected prefill worker's transfer domain.
# Metrics
## Overview
This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components.
## Prerequisites
### Install kube-prometheus-stack
If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:
- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
- `ServiceMonitor`: Similar to PodMonitor but works with Services
- `PrometheusRule`: Defines alerting and recording rules
For a basic installation:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Values allow PodMonitors to be picked up that are outside of the kube-prometheus-stack helm release
helm install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorNamespaceSelector.matchLabels=null \
--set prometheus.prometheusSpec.probeNamespaceSelector.matchLabels=null
```
The commands below assume you installed the kube-prometheus-stack as shown above. If your monitoring stack is configured differently, adjust the `kubectl` commands in this document accordingly (e.g., the namespace or Service names).
### Install Dynamo Operator
Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) for detailed instructions on deploying the Dynamo operator.
Make sure to set the `dynamo-operator.dynamo.metrics.prometheusEndpoint` to the Prometheus endpoint you installed in the previous step.
```bash
helm install dynamo-platform ... \
  --set dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```
### Node Exporter for CPU/Memory Metrics
The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage. These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems.
The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes.
To verify node-exporter is running:
```bash
kubectl get daemonset -A | grep node-exporter
```
If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter).
### DCGM Metrics Collection (Optional)
GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command:
```bash
kubectl get daemonset -A | grep dcgm-exporter
```
If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).
## Deploy a DynamoGraphDeployment
Let's start by deploying a simple vLLM aggregated deployment:
```bash
export NAMESPACE=dynamo-system # namespace where dynamo operator is installed
pushd examples/backends/vllm/deploy
kubectl apply -f agg.yaml -n $NAMESPACE
popd
```
This will create two components:
- A Frontend component exposing metrics on its HTTP port
- A Worker component exposing metrics on its system port
Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](/dynamo/backends/v-llm)
- Available metrics: See the [metrics guide](/dynamo/user-guides/observability-local/metrics)
### Validate the Deployment
Let's send some test requests to populate metrics:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 30
}'
```
For more information about validating the deployment, see the [vLLM README](/dynamo/backends/v-llm).
## Set Up Metrics Collection
### Enable NIXL Telemetry (Optional)
To enable NIXL telemetry metrics in addition to Dynamo metrics, set the following environment variable in your worker component:
```yaml
spec:
  services:
    YourWorker:
      envs:
        - name: NIXL_TELEMETRY_ENABLE
          value: "y"
```
NIXL telemetry is disabled by default. When enabled, NIXL metrics will be exposed on the port specified by `NIXL_TELEMETRY_PROMETHEUS_PORT` (19090 by default).
### Create PodMonitors
The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates a PodMonitor resource and adds these labels to all pods:
- `nvidia.com/metrics-enabled: "true"` - Enables metrics collection
- `nvidia.com/dynamo-component-type: "frontend|worker"` - Identifies the component type
You can opt specific deployments out of metrics collection by adding this annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/enable-metrics: "false"
spec:
  # …
```
### Configure Grafana Dashboard
Apply the Dynamo dashboard configuration to populate Grafana with the Dynamo dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/grafana-dynamo-dashboard-configmap.yaml
```
The dashboard is embedded in the ConfigMap. Because the ConfigMap is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds the dashboard to its list of available dashboards. The dashboard includes panels for:
- Frontend request rates
- Time to first token
- Inter-token latency
- Request duration
- Input/Output sequence lengths
- GPU utilization via DCGM
- Node CPU utilization and system load
- Container CPU usage per pod
- Memory usage per pod
## Viewing the Metrics
### In Prometheus
```bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```
Visit http://localhost:9090 and try these example queries:
- `dynamo_frontend_requests_total`
- `dynamo_frontend_time_to_first_token_seconds_bucket`
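The same queries can be issued programmatically through the Prometheus HTTP API. The following is a minimal sketch, assuming the port-forward above is active on `localhost:9090`; the helper names and the 5-minute rate window are illustrative choices, and `histogram_quantile` is standard PromQL applied to the bucket metric listed above.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

PROMETHEUS_URL = "http://localhost:9090"  # assumes the port-forward above is active

# p95 time to first token over a 5-minute window (the window is an arbitrary choice).
P95_TTFT = (
    "histogram_quantile(0.95, "
    "sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket[5m])) by (le))"
)


def query_url(expr: str, base: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': expr})}"


def parse_vector(payload: Dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-query vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def run_query(expr: str) -> List[Tuple[Dict[str, str], float]]:
    """Execute the query against a live Prometheus (requires network access)."""
    with urllib.request.urlopen(query_url(expr), timeout=10) as resp:
        return parse_vector(json.load(resp))


print(query_url("dynamo_frontend_requests_total"))
```

Calling `run_query(P95_TTFT)` returns an empty list until the frontend has served some requests, so send the test requests from the validation step first.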

### In Grafana
```bash
# Get Grafana credentials
export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode)
export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode)
echo "Grafana user: $GRAFANA_USER"
echo "Grafana password: $GRAFANA_PASSWORD"
# Port forward Grafana service
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
```
Visit http://localhost:3000 and log in with the credentials captured above.
Once logged in, find the Dynamo dashboard under General.

## Operator Metrics
> **Note:** The metrics described above are for Dynamo **applications** (frontends, workers). The Dynamo **Operator** itself also exposes metrics for monitoring controller reconciliation, webhook validation, and resource inventory.
>
> See the **[Operator Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/operator-metrics)** for details on operator-specific metrics and the operator dashboard.
# Logging
This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. It provides a simple reference setup that works on Kubernetes clusters, including Minikube and MicroK8s.
This setup is intended for development and testing purposes. For production environments, please refer to the official documentation for high-availability configurations.
## Components Overview
- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.
## Prerequisites
### 1. Dynamo Kubernetes Platform
This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart).
### 2. Kube-prometheus
While this guide does not use Prometheus, it assumes Grafana was installed as part of kube-prometheus-stack. For more information, see [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).
### 3. Environment Variables
#### Kubernetes Setup Variables
Set the following environment variables:
- `MONITORING_NAMESPACE`: The namespace where Loki is installed
- `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed
```bash
export MONITORING_NAMESPACE=monitoring
export DYN_NAMESPACE=dynamo-system
```
#### Dynamo Logging Variables
| Variable | Description | Example |
|----------|-------------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` |
| `DYN_LOG` | Log levels per target `<default_level>,<target>=<level>,<target>=<level>` | `DYN_LOG=info,dynamo_runtime::system_status_server:trace` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` |
## Installation Steps
### 1. Install Loki
First, we'll install Loki in single binary mode, which is ideal for testing and development:
```bash
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki
helm install --values deploy/observability/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE
```
The values file (`loki-values.yaml`) configures Loki for testing and development, using a local MinIO instance for storage. List the installed pods with:
```bash
kubectl get pods -n $MONITORING_NAMESPACE -l app=loki
```
### 2. Install Grafana Alloy
Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector:
```bash
# Generate a custom values file with the namespace information
envsubst < deploy/observability/logging/values/alloy-values.yaml > alloy-custom-values.yaml
# Install the collector
helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE
```
The values file (`alloy-values.yaml`) includes the following configurations for the collector:
- Destination to forward logs to Loki
- Namespace to collect logs from
- Pod labels to be mapped to Loki labels
- Collection method (kubernetesApi or tailing `/var/log/containers/`)
```yaml
destinations:
- name: loki
  type: loki
  url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local/loki/api/v1/push
podLogs:
  enabled: true
  gatherMethod: kubernetesApi # collect logs from the kubernetes api, rather than /var/log/containers/; friendly for testing and development
  collector: alloy-logs
  labels:
    app_kubernetes_io_name: app.kubernetes.io/name
    nvidia_com_dynamo_component_type: nvidia.com/dynamo-component-type
    nvidia_com_dynamo_graph_deployment_name: nvidia.com/dynamo-graph-deployment-name
  labelsToKeep:
  - "app_kubernetes_io_name"
  - "container"
  - "instance"
  - "job"
  - "level"
  - "namespace"
  - "service_name"
  - "service_namespace"
  - "deployment_environment"
  - "deployment_environment_name"
  - "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment
  - "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment
  namespaces:
  - $DYN_NAMESPACE
```
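The Loki label names in the `labels` map above appear to follow one convention: every character of the Kubernetes label key that is not a letter, digit, or underscore becomes an underscore. The sketch below reproduces that convention in Python. The helper name `to_loki_label` is hypothetical and is not part of Alloy or the Helm chart; the mapping in the values file remains explicit.

```python
import re


def to_loki_label(k8s_label_key: str) -> str:
    """Derive the Loki label name used in the values file from a pod label key."""
    # Loki label names may only contain letters, digits, and underscores,
    # so dots, slashes, and dashes are replaced with underscores.
    return re.sub(r"[^a-zA-Z0-9_]", "_", k8s_label_key)


for key in (
    "app.kubernetes.io/name",
    "nvidia.com/dynamo-component-type",
    "nvidia.com/dynamo-graph-deployment-name",
):
    print(f"{to_loki_label(key)}: {key}")
```

If you map an additional pod label, remember to add the resulting name to `labelsToKeep` as well, as the two Dynamo labels above do.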
### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard
We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard.
Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration.
```bash
# Configure Grafana with the Loki datasource
envsubst < deploy/observability/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -
# Configure Grafana with the Dynamo Logs dashboard
kubectl apply -f deploy/observability/logging/grafana/logging-dashboard.yaml -n $MONITORING_NAMESPACE
```
If using Grafana installed without the Prometheus Operator, you can manually import the Loki datasource and Dynamo Logs dashboard using the Grafana UI.
### 4. Deploy a DynamoGraphDeployment with JSONL Logging
At this point, we should have everything in place to collect and view logs in our Grafana instance. All that is left is to deploy a DynamoGraphDeployment to collect logs from.
To enable structured logs in a DynamoGraphDeployment, set the `DYN_LOGGING_JSONL` environment variable to `1`. The `agg_logging.yaml` example for the SGLang backend already does this. Deploy the DynamoGraphDeployment with:
```bash
kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
```
Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. The logs are now ready to view in Grafana.
## Viewing Logs in Grafana
Port-forward the Grafana service to access the UI:
```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
```
If everything is working, you should see a dashboard under Home > Dashboards > Dynamo Logs that shows the logs associated with your DynamoGraphDeployments.
The dashboard enables filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend, worker, etc.).
# Operator Metrics
## Overview
The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance. These metrics are separate from application metrics (frontend/worker) and provide visibility into:
- **Controller Reconciliation**: How efficiently controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels
- **Webhook Validation**: Performance and outcomes of admission webhook requests
- **Resource Inventory**: Current count of managed resources by state and namespace
## Prerequisites
The operator metrics feature requires the same monitoring infrastructure as application metrics. For detailed setup instructions, see the [Kubernetes Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/metrics#prerequisites).
**Quick checklist:**
- ✅ kube-prometheus-stack installed (for ServiceMonitor support)
- ✅ Prometheus and Grafana running
- ✅ Dynamo Operator installed via Helm
## Metrics Collection
### ServiceMonitor
Operator metrics are automatically collected via a ServiceMonitor, which is created by the Helm chart when `metricsService.enabled: true` (default).
**Unlike application metrics** (which use PodMonitor), the operator uses ServiceMonitor and requires no manual RBAC configuration. The operator's metrics endpoint uses controller-runtime's built-in `WithAuthenticationAndAuthorization` filter for secure serving.
To verify the ServiceMonitor is created:
```bash
kubectl get servicemonitor -n dynamo-system
```
### Disabling Metrics Collection
To disable operator metrics collection:
```bash
helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--set dynamo-operator.metricsService.enabled=false
```
## Available Metrics
All metrics use the `dynamo_operator` namespace prefix.
### Reconciliation Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_reconcile_duration_seconds` | Histogram | `resource_type`, `namespace`, `result` | Duration of reconciliation loops |
| `dynamo_operator_reconcile_total` | Counter | `resource_type`, `namespace`, `result` | Total number of reconciliations |
| `dynamo_operator_reconcile_errors_total` | Counter | `resource_type`, `namespace`, `error_type` | Total reconciliation errors by type |
**Labels:**
- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Target namespace of the resource
- `result`: `success`, `error`, `requeue`
- `error_type`: `not_found`, `already_exists`, `conflict`, `validation`, `bad_request`, `unauthorized`, `forbidden`, `timeout`, `server_timeout`, `unavailable`, `rate_limited`, `internal`
### Webhook Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_webhook_duration_seconds` | Histogram | `resource_type`, `operation` | Duration of webhook validation requests |
| `dynamo_operator_webhook_requests_total` | Counter | `resource_type`, `operation`, `result` | Total webhook admission requests |
| `dynamo_operator_webhook_denials_total` | Counter | `resource_type`, `operation`, `reason` | Total webhook denials with reasons |
**Labels:**
- `resource_type`: Same as reconciliation metrics
- `operation`: `CREATE`, `UPDATE`, `DELETE`
- `result`: `allowed`, `denied`
- `reason`: Validation failure reason (e.g., `immutable_field_changed`, `invalid_config`)
### Resource Inventory Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_resources_total` | Gauge | `resource_type`, `namespace`, `status` | Current count of resources by state |
**Labels:**
- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Resource namespace
- `status`: Resource state derived from each CRD's status. Common values:
  - `"ready"` - Resource is healthy and operational (DCD, DM, DGDSA)
  - `"not_ready"` - Resource exists but is not operational (DCD, DM, DGDSA)
  - `"unknown"` - State cannot be determined (default for empty status)
  - DGD uses: `"pending"`, `"successful"`, `"failed"` from `.status.state`
  - DGDR uses: `"Pending"`, `"Profiling"`, `"Ready"`, `"Deploying"`, `"Deployed"`, `"Failed"` from `.status.phase`
## Example Queries
### Reconciliation Performance
```promql
# P95 reconciliation duration by resource type
histogram_quantile(0.95,
sum by (resource_type, le) (
rate(dynamo_operator_reconcile_duration_seconds_bucket[5m])
)
)
# Reconciliation rate by result
sum by (resource_type, result) (
rate(dynamo_operator_reconcile_total[5m])
)
# Error rate by type
sum by (resource_type, error_type) (
rate(dynamo_operator_reconcile_errors_total[5m])
)
```
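To make the P95 query easier to interpret, the following Python sketch reproduces the linear interpolation that `histogram_quantile()` applies to cumulative histogram buckets. The bucket boundaries and counts are made-up illustration values, not the operator's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Expects at least one finite bucket plus a final +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts per `le` bound, in seconds.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the estimate is interpolated within a bucket, its precision is limited by the bucket width around the quantile.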
### Webhook Performance
```promql
# Webhook P95 latency
histogram_quantile(0.95,
sum by (resource_type, le) (
rate(dynamo_operator_webhook_duration_seconds_bucket[5m])
)
)
# Webhook denial rate
sum by (resource_type, operation, reason) (
rate(dynamo_operator_webhook_denials_total[5m])
)
```
### Resource Inventory
```promql
# Total resources by type and state
sum by (resource_type, status) (
dynamo_operator_resources_total
)
# DynamoGraphDeployments by state
sum by (status) (
dynamo_operator_resources_total{resource_type="DynamoGraphDeployment"}
)
# All resources by namespace and state
sum by (resource_type, namespace, status) (
dynamo_operator_resources_total
)
```
## Grafana Dashboard
A pre-built Grafana dashboard is available for visualizing operator metrics.
### Dashboard Sections
1. **Reconciliation Metrics** (3 panels)
   - Reconciliation rate by resource type and result
   - P95 reconciliation duration
   - Reconciliation errors by type
2. **Webhook Metrics** (3 panels)
   - Webhook request rate by operation
   - P95 webhook duration
   - Webhook denials by reason
3. **Resource Inventory** (2 panels)
   - Resource inventory timeline by state and namespace (filterable by resource type)
   - Current resource count by state (filterable by resource type)
4. **Operational Health** (2 panels)
   - Reconciliation success rate gauges
   - Webhook admission success rate gauges
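
A success rate of this kind can be derived from the `result` label of `dynamo_operator_reconcile_total`, for example as `sum(rate(dynamo_operator_reconcile_total{result="success"}[5m])) / sum(rate(dynamo_operator_reconcile_total[5m]))`. The Python sketch below shows the same ratio on plain numbers. It is one plausible definition, not necessarily the expression the dashboard uses; in particular, whether `requeue` counts as a success is an assumption (here it does not).

```python
import math


def success_rate(counts_by_result: dict[str, float]) -> float:
    """Fraction of reconciliations whose `result` label is "success"."""
    total = sum(counts_by_result.values())
    if total == 0:
        # No reconciliations in the window: the ratio is undefined.
        return math.nan
    return counts_by_result.get("success", 0.0) / total


# Hypothetical per-result increases over a 5-minute window.
window = {"success": 90, "error": 5, "requeue": 5}
print(f"{success_rate(window):.0%}")
```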
### Deploying the Dashboard
```bash
kubectl apply -f deploy/observability/grafana-operator-dashboard-configmap.yaml
```
The dashboard will automatically appear in Grafana (assuming you have the Grafana dashboard sidecar configured, which is included in kube-prometheus-stack).
### Finding the Dashboard
1. Port-forward to Grafana (if needed):
```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
```
2. Log in to Grafana at http://localhost:3000
3. Navigate to **Dashboards** → Search for **"Dynamo Operator"**
### Dashboard Filters
The dashboard includes two filter variables:
- **Namespace**: View metrics across all namespaces or filter by specific ones (multi-select)
- **Resource Type**: Filter all panels by resource type or select "All" to see aggregated metrics across all CRDs (single select)
When "All" is selected for Resource Type, all panels will show data for all five managed CRDs with resource_type labels for differentiation.
## Accessing Metrics Directly
For instructions on accessing Prometheus and Grafana, see the [Kubernetes Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/metrics#viewing-the-metrics).
Once you have access to Prometheus, you can query operator metrics directly:
```bash
# Port-forward to Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
# Visit http://localhost:9090 and try queries like:
# - dynamo_operator_reconcile_total
# - dynamo_operator_webhook_requests_total
# - dynamo_operator_resources_total
```
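The same queries can be run programmatically against the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library and assumes the port-forward above is active on `localhost:9090`; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumes the port-forward shown above


def build_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(promql: str, base_url: str = PROMETHEUS_URL) -> list:
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


print(build_query_url("dynamo_operator_resources_total"))
```

With the port-forward running, `run_query("dynamo_operator_resources_total")` returns one entry per label combination, each with a `metric` dictionary and a `value` pair of timestamp and sample value.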
## Troubleshooting
### Metrics Not Appearing in Prometheus
1. **Check ServiceMonitor exists:**
```bash
kubectl get servicemonitor -n dynamo-system | grep operator
```
2. **Check ServiceMonitor is discovered by Prometheus:**
- Go to Prometheus UI → Status → Targets
- Look for `serviceMonitor/dynamo-system/dynamo-platform-dynamo-operator-operator`
- Should show state: `UP`
3. **Check Prometheus selector configuration:**
```bash
kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector
```
Ensure `serviceMonitorSelectorNilUsesHelmValues: false` was set during kube-prometheus-stack installation.
### Dashboard Not Appearing in Grafana
1. **Check ConfigMap is created:**
```bash
kubectl get configmap -n monitoring grafana-operator-dashboard
```
2. **Check ConfigMap has the label:**
```bash
kubectl get configmap -n monitoring grafana-operator-dashboard -o jsonpath='{.metadata.labels.grafana_dashboard}'
```
Should return `1`
3. **Check Grafana dashboard sidecar configuration:**
```bash
kubectl get deployment -n monitoring prometheus-grafana -o yaml | grep -A 5 sidecar
```
The sidecar should be configured to watch for `grafana_dashboard: "1"` label.
4. **Restart Grafana pod** to force dashboard refresh:
```bash
kubectl rollout restart deployment/prometheus-grafana -n monitoring
```
## Related Documentation
- [Kubernetes Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/metrics) - Application metrics for frontends and workers
- [Dynamo Operator Guide](/dynamo/kubernetes-deployment/start-here/dynamo-operator) - Operator architecture and deployment modes
- [Operator Webhooks](/dynamo/kubernetes-deployment/advanced-platform/webhooks) - Webhook validation details
# Multinode Deployments
This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
## Overview
Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to:
- Distribute workloads across multiple physical nodes
- Scale GPU resources beyond a single machine
- Support large models requiring extensive tensor parallelism
- Achieve high availability and fault tolerance
## Basic requirements
- **Kubernetes Cluster**: Version 1.24 or later
- **GPU Nodes**: Multiple nodes with NVIDIA GPUs
- **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance)
### Advanced Multinode Orchestration
#### Using Grove (default)
For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
- **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads
- **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale
These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
**Features Enabled with Grove:**
- Declarative composition of AI workloads
- Multi-level horizontal auto-scaling
- Custom startup ordering for components
- Resource-aware rolling updates
- [Topology Aware Scheduling](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) — pack pods within a rack, block, or other topology domain for lower latency
[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale.
**Features Enabled with KAI-Scheduler:**
- Gang scheduling
- Network topology-aware pod placement
- AI workload-optimized scheduling algorithms
- GPU resource awareness and allocation
- Support for complex scheduling constraints
- Integration with Grove for enhanced capabilities
- Performance optimizations for large-scale deployments
##### Prerequisites
- [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster
- (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment.
KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
#### Using LWS and Volcano
LWS (LeaderWorkerSet) is a simpler multinode deployment mechanism that deploys a workload as a group of leader and worker pods across multiple nodes.
- **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
- **Volcano**: [Volcano Installation](https://github.com/volcano-sh/volcano#quick-start-guide)
Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
## Core Concepts
### Orchestrator Selection Algorithm
Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
#### When Both Grove and LWS are Available:
- **Grove is selected by default** (recommended for advanced AI workloads)
- **LWS is selected** if you explicitly set `nvidia.com/enable-grove: "false"` annotation on your DGD resource
#### When Only One Orchestrator is Available:
- The installed orchestrator (Grove or LWS) is automatically selected
#### Scheduler Integration:
- **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing:
  - Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation
  - AI-optimized scheduling policies
  - Resource-aware workload placement
- **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination
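
The selection rules above can be summarized in a few lines of Python. This is a sketch of the documented behavior, not the operator's implementation; the function name and the `None` return for "nothing installed" are assumptions, since this guide does not describe that case.

```python
from typing import Optional


def select_orchestrator(
    grove_available: bool,
    lws_available: bool,
    annotations: dict[str, str],
) -> Optional[str]:
    """Return "grove", "lws", or None following the rules listed above."""
    grove_opt_out = annotations.get("nvidia.com/enable-grove") == "false"
    if grove_available and lws_available:
        # Grove wins by default; the annotation explicitly opts out.
        return "lws" if grove_opt_out else "grove"
    if grove_available:
        return "grove"
    if lws_available:
        return "lws"
    # Not covered by this guide: no orchestrator is installed.
    return None


print(select_orchestrator(True, True, {}))
print(select_orchestrator(True, True, {"nvidia.com/enable-grove": "false"}))
```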
#### Configuration Examples:
**Default (Grove with KAI-Scheduler):**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/kai-scheduler-queue: "dynamo"
spec:
  # ... your deployment spec
```
> **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`.
**Force LWS usage:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/enable-grove: "false"
spec:
  # ... your deployment spec
```
### The `multinode` Section
The `multinode` section in a resource specification defines how many physical nodes the workload should span:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "2" # 2 GPUs per node
```
### GPU Distribution
The relationship between `multinode.nodeCount` and `gpu` is multiplicative:
- **`multinode.nodeCount`**: Number of physical nodes
- **`gpu`**: Number of GPUs per node
- **Total GPUs**: `multinode.nodeCount × gpu`
**Example:**
- `multinode.nodeCount: 2` + `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
- `multinode.nodeCount: 4` + `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)
### Tensor Parallelism Alignment
The tensor parallelism (`tp-size` or `--tp`) in your command/args must match the total number of GPUs:
```yaml
# Example: 2 multinode.nodeCount × 4 GPUs = 8 total GPUs
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "4"
      extraPodSpec:
        mainContainer:
          ...
          args:
            # Command args must use tp-size=8
            - "--tp-size"
            - "8" # Must equal multinode.nodeCount × gpu
```
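A quick pre-deployment check can catch a mismatch between the tensor parallel size and the GPU total. The Python sketch below is a hypothetical helper, not part of Dynamo; it assumes the flag and its value are separate list entries, as in the example above.

```python
def check_tp_alignment(args: list[str], node_count: int, gpus_per_node: int) -> None:
    """Raise ValueError unless the tensor-parallel size equals nodeCount x gpu."""
    expected = node_count * gpus_per_node
    for flag in ("--tp-size", "--tp"):
        if flag in args:
            # The value is expected in the entry that follows the flag.
            tp = int(args[args.index(flag) + 1])
            break
    else:
        raise ValueError("no --tp-size or --tp flag found in args")
    if tp != expected:
        raise ValueError(
            f"tensor parallel size is {tp}, expected "
            f"{node_count} x {gpus_per_node} = {expected}"
        )


# Matches the manifest above: 2 nodes x 4 GPUs = 8.
check_tp_alignment(["--tp-size", "8"], node_count=2, gpus_per_node=4)
print("tp-size matches the total GPU count")
```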
## Backend-Specific Operator Behavior
When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
### vLLM Backend
For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode based on your parallelism settings:
#### Deployment Modes
The operator automatically determines the deployment mode based on your parallelism configuration:
**1. Tensor/Pipeline Parallelism Mode (Single model across nodes)**
- **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size`
- **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism
The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
**Leader Node:**
- **Command**: `ray start --head --port=6379 && <vllm-command> --distributed-executor-backend ray`
- **Behavior**: Starts Ray head node, then runs vLLM which creates a placement group spanning all Ray workers
- **Probes**: All health probes remain active (liveness, readiness, startup)
**Worker Nodes:**
- **Command**: `ray start --address=<leader-address>:6379 --block`
- **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend.
**2. Data Parallel Mode (Multiple model instances across nodes)**
- **When used**: When `world_size × data_parallel_size > GPUs_per_node`
- **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism)
**All Nodes (Leader and Workers):**
- **Injected Flags**:
  - `--data-parallel-address <leader-address>` - Address of the coordination server
  - `--data-parallel-size-local <size>` - Number of data parallel workers per node
  - `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination
  - `--data-parallel-start-rank <rank>` - Starting rank for this node (calculated automatically)
- **Probes**: Worker probes are removed; leader probes remain active

**Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers).
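
The two "When used" conditions can be read as a small decision function. The Python sketch below is only an illustration of those conditions: the function name, the return labels, and the order in which the conditions are checked are assumptions, not the operator's code.

```python
def vllm_deployment_mode(tp: int, pp: int, dp: int, gpus_per_node: int) -> str:
    """Classify a vLLM deployment using the conditions described above."""
    world_size = tp * pp
    if world_size > gpus_per_node:
        # A single model instance does not fit on one node: Ray-based TP/PP.
        return "ray-tp-pp"
    if world_size * dp > gpus_per_node:
        # Each instance fits on a node, but all replicas together do not.
        return "data-parallel"
    # Everything fits on one node; no multinode mode is needed.
    return "single-node"


print(vllm_deployment_mode(tp=8, pp=1, dp=1, gpus_per_node=4))
print(vllm_deployment_mode(tp=1, pp=1, dp=16, gpus_per_node=8))
```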
#### Why Ray for Multi-Node TP/PP?
vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments:
- **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything.
- **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate.
The Dynamo operator uses Ray because:
1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`)
2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents
3. vLLM automatically handles placement group creation and worker management
#### Compilation Cache Support
When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:
- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point
### SGLang Backend
For SGLang multinode deployments, the operator injects distributed initialization flags:
#### Leader Node
- **Distributed Flags**: Injects `--dist-init-addr <leader-address>:29500 --nnodes <node-count> --node-rank 0`
- **Probes**: All health probes remain active
#### Worker Nodes
- **Distributed Flags**: Injects `--dist-init-addr <leader-address>:29500 --nnodes <node-count> --node-rank <rank>`
  - The `node-rank` is automatically determined from the pod's stateful identity
- **Probes**: All probes (liveness, readiness, startup) are automatically removed
**Note:** The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
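
As an illustration of what the injected flags look like per node, the Python sketch below builds them for a given rank. The helper name and the leader hostname are hypothetical; only the flag names and port 29500 come from the description above.

```python
def sglang_dist_flags(leader_addr: str, node_count: int, node_rank: int) -> list[str]:
    """Build the distributed flags for one node; rank 0 is the leader."""
    if not 0 <= node_rank < node_count:
        raise ValueError("node_rank must be in [0, node_count)")
    return [
        "--dist-init-addr", f"{leader_addr}:29500",
        "--nnodes", str(node_count),
        "--node-rank", str(node_rank),
    ]


# Hypothetical leader hostname for a two-node deployment.
for rank in range(2):
    print(" ".join(sglang_dist_flags("my-leader.example", 2, rank)))
```

Every node receives the same `--dist-init-addr` and `--nnodes`; only `--node-rank` differs.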
### TensorRT-LLM Backend
For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
#### Leader Node
- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
- **MPI Command**: Wraps your command in an `mpirun` command with:
  - Proper host list including all worker nodes
  - SSH configuration for passwordless authentication on port 2222
  - Environment variable propagation to all nodes
  - Activation of the Dynamo virtual environment
- **Probes**: All health probes remain active
#### Worker Nodes
- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
  - Generates host keys in user-writable directories (non-privileged)
  - Configures SSH daemon to listen on port 2222
  - Sets up authorized keys for leader access
- **Probes**:
  - **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
  - **Readiness**: Replaced with TCP socket check on SSH port 2222
    - Initial Delay: 20 seconds
    - Period: 20 seconds
    - Timeout: 5 seconds
    - Failure Threshold: 10
#### Additional Configuration
- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-`)
- **Automatic SSH key generation**: The operator automatically generates the SSH keypair secret when it detects a multi-node `DynamoGraphDeployment`. No manual secret creation is required.
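
When debugging a worker that never becomes ready, it can help to reproduce the readiness check by hand. A TCP socket probe only verifies that a connection to the port can be opened; the Python sketch below does the same with the 5-second timeout listed above. The helper name and the example hostname are illustrative.

```python
import socket


def ssh_port_ready(host: str, port: int = 2222, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the probe would fail.
        return False
```

For example, `ssh_port_ready("worker-pod.example")` run from the leader pod returns `False` until the worker's SSH daemon is listening on port 2222.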
### Compilation Cache Configuration
The operator supports compilation cache volumes for backend-specific optimization:
| Backend | Support Level | Environment Variables | Default Mount Point |
|---------|--------------|----------------------|---------------------|
| vLLM | Fully Supported | `VLLM_CACHE_ROOT` | User-specified |
| SGLang | Partial Support | _None (pending upstream)_ | User-specified |
| TensorRT-LLM | Partial Support | _None (pending upstream)_ | User-specified |
To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
## Next Steps
For additional support and examples, see the working multinode configurations in:
- **SGLang**: [examples/backends/sglang/deploy/](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/deploy/README.md)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/deploy/README.md)
- **vLLM**: [examples/backends/vllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/deploy/README.md)
These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
# Grove
Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management.
## Overview
Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource.
### How Grove Works for Disaggregated Serving
Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages:
- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks
- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns
- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement
- **Fault Isolation**: Issues in one component don't necessarily affect others
## Core Components and API Resources
Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups:
### PodCliqueSet
The top-level Grove object that defines a group of components managed and colocated together. Key features include:
- Support for autoscaling
- Topology-aware spread of replicas for availability
- Unified management of multiple disaggregated components
### PodClique
Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features:
- Independent configuration options
- Custom scaling logic support
- Role-specific resource allocation
### PodCliqueScalingGroup
A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior.
## Key Capabilities for Disaggregated Serving
Grove provides several specialized features that make it particularly well-suited for disaggregated serving:
### Flexible Gang Scheduling
PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together.
### Multi-level Horizontal Auto-Scaling
Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements.
### Network Topology-Aware Scheduling
Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication. Dynamo exposes this capability through the `topologyConstraint` field on DynamoGraphDeployment resources, so users can opt in to topology-aware placement without interacting with Grove internals. See the [Topology Aware Scheduling guide](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) for configuration details and examples.
### Custom Startup Dependencies
Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components.
## Use Cases and Examples
Grove specifically supports:
- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick
- **Single-node disaggregated inference** for optimized resource utilization
- **Agentic pipelines of models** for complex AI workflows
- **Standard aggregated serving** patterns for single node or single GPU inference
## Integration with NVIDIA Dynamo
Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack:
### Complementary Roles
- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads
- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management
### Release Coordination
Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap.
### Unified AI Platform
The integration creates a comprehensive platform where:
- Grove manages complex orchestration of disaggregated components
- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations
- Together they enable sophisticated AI serving architectures with simplified management
## Architecture Benefits
Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by:
1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition
2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate
3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components
4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling
## Getting Started
Grove relies on KAI Scheduler for resource allocation and scheduling.
For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler).
For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md).
For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](/dynamo/kubernetes-deployment/scale/multinode-deployments), which demonstrates multi-node disaggregated serving scenarios.
For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove).
Dynamo Kubernetes Platform also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Kubernetes Platform Deployment Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) for more details.
# Topology Aware Scheduling
Topology Aware Scheduling (TAS) lets you control where Dynamo places inference workload pods relative to the cluster's network topology. By packing related pods within the same rack, block, or other topology domain, you reduce inter-node latency and improve throughput — especially for disaggregated serving where prefill, decode, and routing components communicate frequently.
TAS is **opt-in**. Existing deployments without topology constraints continue to work unchanged.
TAS controls pod placement. To constrain or bias the Dynamo router's prefill-to-decode handoff after pods are already running, see [Topology-Aware KV Transfer](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer).
## Prerequisites
| Requirement | Details |
|-------------|---------|
| **Grove** | Installed on the cluster. See the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md). |
| **ClusterTopology CR** | A cluster-scoped `ClusterTopology` resource configured by the cluster admin, mapping topology domain names to node labels. See [Grove documentation](https://github.com/NVIDIA/grove) for setup instructions. |
| **KAI Scheduler** | [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is required by Grove for topology-aware pod placement. |
| **Dynamo operator** | The latest Dynamo operator Helm chart includes read-only RBAC for `clustertopologies.grove.io` via a dedicated ClusterRole. No extra configuration is needed. |
## Topology Domains
Topology domains are **free-form** identifiers defined by the cluster admin in the `ClusterTopology` CR. Common examples include `region`, `zone`, `datacenter`, `block`, `rack`, `host`, and `numa`, but any name matching the pattern `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` is valid (no leading or trailing hyphens).
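As a quick pre-flight check, a domain name can be tested against the pattern quoted above before it goes into a manifest. The following is a minimal sketch; the function name `is_valid_domain` is illustrative and not part of Dynamo or Grove.

```bash
# Succeeds (exit 0) when the argument matches the documented domain-name pattern.
# LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
is_valid_domain() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
}

for name in rack gpu-block -rack rack- Rack; do
  if is_valid_domain "$name"; then
    echo "$name: valid"
  else
    echo "$name: invalid"
  fi
done
```

Here `rack` and `gpu-block` pass, while `-rack`, `rack-`, and `Rack` are reported as invalid. Passing this check only means the name is well-formed; it must still exist in the referenced `ClusterTopology` CR.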
Domain names must match exactly what is configured in the `ClusterTopology` CR referenced by `topologyProfile`. During DGD creation, the Dynamo webhook validates that every `packDomain` exists in the referenced `ClusterTopology`.
When you specify a `packDomain`, the scheduler packs all replicas of the constrained component within a single instance of that domain. For example, `packDomain: rack` means "place all pods within the same rack."
## Topology Profile
Every DGD that uses topology constraints must reference a `ClusterTopology` CR by name via the `topologyProfile` field. This field is set at `spec.topologyConstraint` (the deployment level) and is inherited by all services — services must not set `topologyProfile` themselves.
The `topologyProfile` tells the Dynamo operator and the underlying framework which topology hierarchy to use for scheduling and validation.
## Enabling TAS on a DGD
Add a `topologyConstraint` field to your `DynamoGraphDeployment` at the deployment level, at the service level, or both. The deployment level must include a `topologyProfile`. Each constraint specifies a `packDomain`.
### Example 1: Deployment-Level Constraint (Services Inherit)
All services inherit the deployment-level constraint. This is the simplest configuration when you want uniform topology packing.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      componentType: worker
      replicas: 2
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```
### Example 2: Service-Level Constraint Only
Only the specified service gets topology packing. Other services are scheduled without topology constraints. The deployment level must still set `topologyProfile`.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
  services:
    VllmWorker:
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: rack
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```
### Example 3: Mixed (Deployment-Level Default + Per-Service Override)
Set a broad constraint at the deployment level and a narrower override on specific services. Service-level constraints must be **equal to or narrower than** the deployment-level constraint (determined by the ordering in the `ClusterTopology` CR).
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: block  # narrower than zone, so valid
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      componentType: frontend
      replicas: 1
      # inherits zone from spec.topologyConstraint
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```
## Hierarchy Rules
When **both** a deployment-level and a service-level `topologyConstraint` are set, the service's `packDomain` must be **equal to or narrower** than the deployment-level `packDomain`. "Narrower" is determined by the ordering of levels in the referenced `ClusterTopology` CR — levels appearing later in the `spec.levels` array are considered narrower.
The Dynamo webhook rejects the DGD at creation time if a service constraint is broader than the deployment constraint (when validating against a `ClusterTopology` CR).
When only one level is set (deployment-level only or service-level only), no hierarchy check applies.
| Configuration | Behavior |
|---------------|----------|
| `spec.topologyConstraint` set, service has none | Service inherits the deployment-level constraint |
| `spec.topologyConstraint` set, service also set | Both applied; service must be narrower or equal |
| `spec.topologyConstraint.topologyProfile` set, no `packDomain` at spec | Profile is provided for service-level constraints only |
| Neither set | No topology constraints (default) |
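The hierarchy rule can be sketched as a position comparison over an ordered list of levels. This is an illustration of the rule as described here, not the webhook's actual implementation; the function names and the example hierarchy `zone block rack host` are assumptions.

```bash
# Prints the zero-based position of a domain in the ordered level list
# (broadest first, mirroring spec.levels); fails if the domain is not listed.
level_index() {
  domain=$1; shift
  i=0
  for level in "$@"; do
    if [ "$level" = "$domain" ]; then echo "$i"; return 0; fi
    i=$((i + 1))
  done
  return 1
}

# Succeeds when the service domain is equal to or narrower than the
# deployment domain. Returns 2 when either domain is unknown.
is_allowed_override() {
  deployment_domain=$1; service_domain=$2; shift 2
  d=$(level_index "$deployment_domain" "$@") || return 2
  s=$(level_index "$service_domain" "$@") || return 2
  [ "$s" -ge "$d" ]
}

LEVELS="zone block rack host"   # hypothetical hierarchy, broadest first
is_allowed_override zone block $LEVELS && echo "zone -> block: accepted"
is_allowed_override rack zone $LEVELS || echo "rack -> zone: rejected"
```

A later position means a narrower domain, so `block` under a `zone` default is accepted, while `zone` under a `rack` default is rejected.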
## Field Reference
| Field | Level | Required | Description |
|-------|-------|----------|-------------|
| `topologyProfile` | `spec.topologyConstraint` | Yes (when any constraint is set) | Name of the `ClusterTopology` CR defining the topology hierarchy. |
| `topologyProfile` | service-level `topologyConstraint` | N/A (not in schema) | Inherited from `spec.topologyConstraint`. The service-level type does not include this field. |
| `packDomain` | `spec.topologyConstraint` | Optional | Default pack domain for services that don't specify their own. |
| `packDomain` | service-level `topologyConstraint` | Required | Pack domain for this service. Must match a level in the `ClusterTopology` CR. |
## Multinode Considerations
For multinode services (services with a `multinode` section), the topology constraint is applied at the **scaling group** level rather than on individual worker pods. This is important because a multinode service spawns `replicas × nodeCount` pods — for example, 2 replicas with `nodeCount: 4` produces 8 pods across 8 nodes. Applying the constraint at the scaling group level means the scheduler packs each replica's set of nodes within the requested domain, without over-constraining individual pods to a single host.
For example, with this configuration:
```yaml
VllmWorker:
  replicas: 2
  multinode:
    nodeCount: 4
  topologyConstraint:
    packDomain: rack
```
Each replica's 4 nodes are packed within a single rack. The two replicas may land in different racks (the constraint applies per-replica, not across all replicas).
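To make the per-replica grouping concrete, the sketch below prints which pods are packed together for a given `replicas` and `nodeCount`. The pod names are invented for illustration; real names are generated by the operator.

```bash
# Prints one line per replica listing the pods that share a pack domain,
# followed by the total pod count (replicas x nodeCount).
describe_packing() {
  replicas=$1; node_count=$2; domain=$3
  r=0
  while [ "$r" -lt "$replicas" ]; do
    pods=""
    n=0
    while [ "$n" -lt "$node_count" ]; do
      pods="$pods worker-$r-$n"   # hypothetical pod name
      n=$((n + 1))
    done
    echo "replica $r: packed in one $domain ->$pods"
    r=$((r + 1))
  done
  echo "total pods: $((replicas * node_count))"
}

describe_packing 2 4 rack
```

For 2 replicas with `nodeCount: 4`, this prints two groups of four pods and a total of 8. Each group is constrained to one rack, but the two groups are independent of each other.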
**Recommendation:** For multinode services, use `rack` or `block` as the `packDomain` to keep workers within a high-bandwidth domain while still allowing the scheduler to spread them across hosts within that domain. Avoid `host` for multinode services, as packing multiple nodes onto one host is not meaningful.
## Immutability
Topology constraints **cannot be changed after the DGD is created**. This includes:
- Adding a topology constraint to a DGD or service that did not have one
- Removing an existing topology constraint
- Changing the `topologyProfile` value
- Changing the `packDomain` value
To change topology constraints, **delete and recreate** the DGD. This matches the behavior of the underlying framework, which enforces immutability on topology constraints for generated resources.
## Monitoring Topology Enforcement
When any topology constraint is set, the DGD status includes a `TopologyLevelsAvailable` condition that reports whether the topology levels referenced by your constraints still exist in the cluster topology.
**Healthy state:**
```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "True"
      reason: AllTopologyLevelsAvailable
      message: "All required topology levels are available in the cluster topology"
```
**Degraded state** (e.g., an admin removed a topology level from the `ClusterTopology` CR after deployment):
```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "False"
      reason: TopologyLevelsUnavailable
      message: "Topology level 'rack' is no longer available in the cluster topology"
```
When topology levels become unavailable, Dynamo emits a **Warning** event on the DGD. The deployment may still appear `Ready` because the underlying framework keeps pods running, but topology placement is no longer guaranteed.
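Because `Ready` can stay `True` while topology placement is degraded, it may help to query the `TopologyLevelsAvailable` condition directly. The helper below is a sketch that assumes the DGD is addressable through `kubectl` as `dynamographdeployment`; the function name is illustrative.

```bash
# Prints the status ("True" or "False") of the TopologyLevelsAvailable
# condition for one DGD. Assumes the resource name "dynamographdeployment".
topology_condition_status() {
  name=$1; namespace=$2
  kubectl get dynamographdeployment "$name" -n "$namespace" \
    -o jsonpath='{.status.conditions[?(@.type=="TopologyLevelsAvailable")].status}'
}

# Usage (requires cluster access):
#   topology_condition_status my-llm default
```

An empty result would suggest that no topology constraint is set on that DGD, since the condition is only reported when one is.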
## Troubleshooting
### DGD rejected: "ClusterTopology not found"
The Dynamo webhook validates that the `ClusterTopology` CR referenced by `topologyProfile` exists when any topology constraint is set. If it cannot read the `ClusterTopology` CR:
- Verify that the cluster admin has created the `ClusterTopology` resource named in `topologyProfile`. See the [Grove documentation](https://github.com/NVIDIA/grove) for setup.
- Verify that the Dynamo operator has RBAC to read `clustertopologies.grove.io` (included in the default Helm chart).
### DGD rejected: "packDomain not found in cluster topology"
The specified `packDomain` does not exist as a level in the referenced `ClusterTopology` CR. Check which domains are defined:
```bash
kubectl get clustertopology -o yaml
```
Ensure the domain you are requesting (e.g., `rack`) is configured in the `ClusterTopology` with a corresponding node label.
### DGD rejected: "topologyProfile is required"
Any DGD that has a topology constraint (at the spec or service level) must set `spec.topologyConstraint.topologyProfile` to the name of a `ClusterTopology` CR. Add the `topologyProfile` field to `spec.topologyConstraint`.
### Pods stuck in Pending
The scheduler cannot satisfy the topology constraint. Common causes:
- Not enough nodes within a single instance of the requested domain (e.g., requesting 8 GPUs packed in one rack, but no rack has 8 available GPUs).
- Node labels do not match the `ClusterTopology` configuration.
Inspect scheduler events for details:
```bash
kubectl describe pod <pod-name> -n <namespace>
```
### TopologyLevelsAvailable is False
The DGD was deployed successfully, but the topology definition has since changed. The underlying framework detected that one or more required topology levels are no longer available.
- Check the condition message for specifics.
- Inspect the `ClusterTopology` CR to see if a domain was removed or renamed.
- If the topology was intentionally changed, delete and recreate the DGD to pick up the new topology.
### DGD rejected: hierarchy violation
A service-level `packDomain` is broader than the deployment-level `packDomain`. "Broader" and "narrower" are determined by the order of levels in the `ClusterTopology` CR — levels appearing earlier in `spec.levels` are broader.
Ensure service-level constraints are equal to or narrower than the deployment-level constraint.
# Service Discovery
Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.
## Discovery Backends
| Backend | Default | Dependencies | Use Case |
|---------|---------|--------------|----------|
| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |
## Kubernetes Discovery (Default)
Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:
- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
- **EndpointSlices**: EndpointSlices signal each component's readiness status
### Implementation Details
Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
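The correlation rule amounts to an intersection: a pod is discoverable only if it appears in both sets. The sketch below shows that logic on two plain text lists. It is not the discovery daemon's implementation, and the function and file names are illustrative.

```bash
# $1: file listing pods reported ready in an EndpointSlice (one per line)
# $2: file listing pods that have a DynamoWorkerMetadata CR (one per line)
# Prints the pods present in both lists.
discoverable_pods() {
  grep -Fxf "$1" "$2" | sort -u
}

tmp=$(mktemp -d)
printf '%s\n' worker-a worker-b worker-c > "$tmp/ready"
printf '%s\n' worker-b worker-c worker-d > "$tmp/metadata"
discoverable_pods "$tmp/ready" "$tmp/metadata"
rm -r "$tmp"
```

In this example `worker-a` is ready but has no metadata, and `worker-d` has metadata but is not ready, so only `worker-b` and `worker-c` are printed.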
#### DynamoWorkerMetadata CRD
Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoWorkerMetadata
metadata:
  name: my-worker-pod-abc123
  namespace: dynamo-system
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: my-worker-pod-abc123
      uid: <pod-uid>
      controller: true
spec:
  data:
    endpoints:
      "dynamo/backend/generate":
        type: Endpoint
        namespace: dynamo
        component: backend
        endpoint: generate
        instance_id: 12345678901234567890
        transport:
          nats_tcp: "dynamo_backend.generate-abc123"
    model_cards: {}
```
The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.
#### EndpointSlices
While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices provide a snapshot of the health of the various Dynamo components.
The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes controller in turn creates and maintains EndpointSlice resources that keep track of the readiness of the pods targeted by the Service. Watching these slices gives us an up-to-date snapshot of which Dynamo components are ready to serve traffic.
##### Readiness Probes
A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.
#### RBAC
Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
#### Environment Variables
The following environment variables are automatically injected into pods by the operator to facilitate service discovery:
| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Pod name (via downward API) |
| `POD_NAMESPACE` | Pod namespace (via downward API) |
| `POD_UID` | Pod UID (via downward API) |
The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.
## KV Store Discovery (etcd)
To use etcd-based discovery instead of Kubernetes-native discovery, add the annotation to your DynamoGraphDeployment:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/dynamo-discovery-backend: etcd
spec:
  services:
    # ...
```
This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.
# Webhooks
This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
## Overview
The Dynamo Operator uses **Kubernetes admission webhooks** to validate and mutate custom resources as they are submitted. **Validating webhooks** reject invalid configurations immediately at the API server level, which gives users faster feedback than controller-based validation. A **mutating webhook** applies default values to `DynamoGraphDeployment` resources on creation.
All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
### Key Features
- ✅ **Always enabled** - Webhooks are a required component of the operator
- ✅ **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- ✅ **Automatic certificate generation and rotation** - Built-in cert-controller, no manual management required
- ✅ **cert-manager integration** - Optional integration for custom PKI or organizational certificate policies
- ✅ **Immutability enforcement** - Critical fields protected via CEL validation rules
### Current Webhook Types
- **Validating Webhooks**: Validate custom resource specifications before persistence
  - `DynamoComponentDeployment` validation
  - `DynamoGraphDeployment` validation
  - `DynamoModel` validation
  - `DynamoGraphDeploymentRequest` validation
- **Mutating Webhooks**: Apply default values to resources on creation
  - `DynamoGraphDeployment` defaulting
**Note:** All webhook types use the same certificate infrastructure described in this document.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 1. User submits CR (kubectl apply) │
│ 2. API server calls MutatingWebhookConfiguration │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 3. Applies defaults (e.g., operator version annotation) │
│ 4. Returns mutated CR │
└────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 5. API server calls ValidatingWebhookConfiguration │
└────────────────────────┬────────────────────────────────────────┘
│ HTTPS (TLS required)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Webhook Server (in Operator Pod) │
│ 6. Validates CR against business rules │
│ 7. Returns admit/deny decision + warnings │
└────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ API Server │
│ 8. If admitted: Persist CR to etcd │
│ 9. If denied: Return error to user │
└─────────────────────────────────────────────────────────────────┘
```
### Admission Flow
1. **Mutating webhooks**: Apply defaults and transformations before validation
2. **Validating webhooks**: Validate the (possibly mutated) CR against business rules
3. **CEL validation**: Kubernetes-native immutability checks (always active)
---
## Upgrading from versions with `webhook.enabled: false`
The `webhook.enabled` Helm value has been removed. Webhooks are now a required component of the operator and are always active. If you previously ran with `webhook.enabled: false`, take the following steps before upgrading:
1. **Remove `webhook.enabled`** from any custom values files. Helm will ignore the unknown key, but it should be cleaned up to avoid confusion.
2. **Ensure port 9443 is reachable** from the Kubernetes API server to the operator pod. If you have `NetworkPolicy` rules or firewall configurations restricting traffic, add an ingress rule allowing the API server to reach the webhook server on port 9443.
3. **Ensure webhook TLS certificates are available.** By default, the operator's built-in cert-controller generates and rotates self-signed certificates automatically at startup — no action needed. If you use cert-manager or externally managed certificates, verify your configuration is in place before upgrading.
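If a `NetworkPolicy` is what blocks port 9443, a rule along the following lines could open it. This is a sketch under stated assumptions: the policy name and the pod label `app.kubernetes.io/name: dynamo-operator` are hypothetical, so check the labels on your operator pod before applying anything.

```bash
# Prints a NetworkPolicy manifest that allows ingress on TCP 9443 to pods
# matching the (assumed) operator label in the given namespace.
webhook_network_policy() {
  namespace=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-webhook-9443
  namespace: ${namespace}
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: dynamo-operator  # assumed label, verify first
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 9443
EOF
}

# Usage (requires cluster access):
#   webhook_network_policy dynamo-system | kubectl apply -f -
```

The rule omits a `from` clause, so it admits any source on that port. That avoids having to identify the API server's source address, which varies by cluster, at the cost of being broader than strictly necessary.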
---
## Configuration
### Certificate Management Options
The operator supports three certificate management modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| **Automatic (Default)** | Operator's built-in cert-controller generates and rotates certificates | All environments (recommended) |
| **cert-manager** | Integrate with cert-manager for certificate lifecycle management | Clusters with cert-manager and custom PKI requirements |
| **External** | Bring your own certificates | Environments with externally managed PKI |
---
### Advanced Configuration
#### Complete Configuration Reference
```yaml
dynamo-operator:
  webhook:
    # Certificate management (optional, to use cert-manager instead of built-in)
    certManager:
      enabled: false
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
    # Certificate secret configuration
    certificateSecret:
      name: webhook-server-cert
      external: false  # Set to true for externally managed certificates
    # Webhook behavior configuration
    failurePolicy: Fail  # Fail (reject on error) or Ignore (allow on error)
    timeoutSeconds: 10  # Webhook timeout
    # Namespace filtering (advanced)
    namespaceSelector: {}  # Kubernetes label selector for namespaces
```
#### Failure Policy
```yaml
# Fail: Reject resources if webhook is unavailable (recommended for production)
webhook:
  failurePolicy: Fail

# Ignore: Allow resources if webhook is unavailable (use with caution)
webhook:
  failurePolicy: Ignore
```
**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
#### Namespace Filtering
Control which namespaces are validated (applies to **cluster-wide operator** only):
```yaml
# Only validate resources in namespaces with specific labels
webhook:
  namespaceSelector:
    matchLabels:
      dynamo-validation: enabled

# Or exclude specific namespaces
webhook:
  namespaceSelector:
    matchExpressions:
      - key: dynamo-validation
        operator: NotIn
        values: ["disabled"]
```
**Note:** For **namespace-restricted operators** (deprecated), the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.
---
## Certificate Management
### Automatic Certificates (Default)
**Zero configuration required!** The operator's built-in cert-controller generates and rotates certificates automatically at startup.
#### How It Works
1. **Operator starts**: The `CertManager` checks for an existing certificate Secret (configured via `webhook.certificateSecret.name`, default: `webhook-server-cert`). If missing or invalid, it generates a self-signed Root CA and server certificate and writes them to the Secret.
2. **CA bundle injection**: The `CABundleInjector` reads `ca.crt` from the Secret and patches both the `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration` with the base64-encoded CA bundle.
3. **Certificate rotation**: The cert-controller monitors certificate validity and regenerates certificates before they expire.
4. **Webhook server starts**: The webhook server only begins serving after certificates are confirmed ready, preventing startup races.
#### Certificate Validity
- **Root CA**: 10 years
- **Server Certificate**: 10 years (same as Root CA)
- **Automatic rotation**: The cert-controller monitors validity and regenerates before expiration
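To see when the current server certificate actually expires, the certificate can be read out of the Secret and decoded. The helper below is a sketch: the function name is illustrative, and it assumes the Secret stores the certificate under the `tls.crt` key and that `base64 -d` and `openssl` are available locally.

```bash
# Prints the notAfter date of the webhook server certificate stored in a Secret.
webhook_cert_end_date() {
  secret=$1; namespace=$2
  kubectl get secret "$secret" -n "$namespace" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}

# Usage (requires cluster access):
#   webhook_cert_end_date webhook-server-cert dynamo-system
```

Comparing this date before and after an operator restart is one way to confirm whether the cert-controller reused the existing certificate or generated a new one.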
#### Smart Certificate Management
The cert-controller is intelligent about certificate lifecycle:
- ✅ **Checks existing certificates** at startup before generating new ones
- ✅ **Skips generation** if valid certificates already exist in the Secret
- ✅ **Regenerates** only when needed (missing, expiring soon, or incorrect SANs)
This means:
- Fast operator restarts (no unnecessary cert generation)
- No dependency on Helm hooks or external Jobs
- Certificates persist across pod restarts (stored in Secret)
#### Manual Certificate Rotation
If you need to rotate certificates manually:
```bash
# Delete the certificate secret -- the operator will regenerate it on restart
kubectl delete secret <release>-webhook-server-cert -n <namespace>
# Restart the operator pod to trigger regeneration
kubectl rollout restart deployment/<release>-dynamo-operator -n <namespace>
```
---
### cert-manager Integration
For clusters with cert-manager installed, you can enable automated certificate lifecycle management.
#### Prerequisites
1. **cert-manager installed** (v1.0+)
2. **CA issuer configured** (e.g., `selfsigned-issuer`)
#### Configuration
```yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer  # Or ClusterIssuer
        name: selfsigned-issuer  # Your issuer name
```
#### How It Works
1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager
2. **cert-manager generates certificate**: Based on configured issuer
3. **cert-manager stores in Secret**: `-webhook-server-cert`
4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration`
5. **Operator pod**: Mounts certificate secret and serves webhook
#### When to Use cert-manager
- ✅ **Custom validity periods**: Configure certificate lifetime to match organizational policy
- ✅ **Integration with existing PKI**: Use your organization's certificate infrastructure
- ✅ **Centralized certificate management**: Manage all cluster certificates through cert-manager
#### Certificate Rotation
With cert-manager, certificate rotation is **fully automated**:
1. **Leaf certificate rotation** (default: every year)
   - cert-manager auto-renews before expiration
   - controller-runtime auto-reloads new certificate
   - **No pod restart required**
   - **No caBundle update required** (same Root CA)
2. **Root CA rotation** (every 10 years)
   - cert-manager rotates Root CA
   - ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration`
   - **No manual intervention required**
#### Example: Self-Signed Issuer
```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: dynamo-system
spec:
  selfSigned: {}
---
# Enable in platform values.yaml
dynamo-operator:
  webhook:
    certManager:
      enabled: true
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer
```
---
### External Certificates
Bring your own certificates for custom PKI requirements.
#### Steps
1. **Create certificate secret manually**:
```bash
kubectl create secret tls <release>-webhook-server-cert \
  --cert=tls.crt \
  --key=tls.key \
  -n <namespace>
# Also add ca.crt to the secret
kubectl patch secret <release>-webhook-server-cert -n <namespace> \
  --type='json' \
  -p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]'
```
2. **Configure operator to use external secret**:
```yaml
dynamo-operator:
  webhook:
    certificateSecret:
      external: true
    caBundle: <base64-encoded-ca-cert>  # Must manually specify
```
3. **Deploy operator**:
```bash
helm install dynamo-platform . -n <namespace> -f values.yaml
```
#### Certificate Requirements
- **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`)
- **Secret keys**: `tls.crt`, `tls.key`, `ca.crt`
- **Certificate SAN**: Must include `<service-name>.<namespace>.svc`
  - Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc`
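The SAN requirement can be checked before the Secret is created. The sketch below builds the expected DNS name from a service name and namespace, following the example above. The function name is illustrative, and the `openssl` check in the comment assumes OpenSSL 1.1.1 or newer for the `-ext` option.

```bash
# Prints the DNS name that the webhook certificate's SAN list must contain.
webhook_san() {
  printf '%s.%s.svc\n' "$1" "$2"
}

webhook_san dynamo-platform-dynamo-operator-webhook-service dynamo-system

# To check a certificate file against it:
#   openssl x509 -in tls.crt -noout -ext subjectAltName \
#     | grep -F "DNS:$(webhook_san <service-name> <namespace>)"
```

If the `grep` finds nothing, the API server is likely to reject the TLS handshake with the webhook service, because the certificate does not cover the name it connects to.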
---
## Multi-Operator Deployments (DEPRECATED)
> **DEPRECATED:** Namespace-restricted mode and multi-operator deployments are deprecated and will be removed in a future release. Use a single cluster-wide operator instead.
The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**.
### Scenario
```
Cluster:
├─ Operator A (cluster-wide, namespace: platform-system)
│ └─ Validates all namespaces EXCEPT team-a
└─ Operator B (namespace-restricted, namespace: team-a)
└─ Validates only team-a namespace
```
### How It Works
1. **Namespace-restricted operator** creates a Lease in its namespace
2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock`
3. **Cluster-wide operator** skips validation for namespaces with active Leases
4. **Namespace-restricted operator** validates resources in its namespace
### Lease Configuration
The lease mechanism is **automatically configured** based on deployment mode:
```yaml
# Cluster-wide operator (default)
namespaceRestriction:
  enabled: false
# → Watches for leases in all namespaces
# → Skips validation for namespaces with active leases

# Namespace-restricted operator
namespaceRestriction:
  enabled: true
  namespace: team-a
# → Creates lease in team-a namespace
# → Does NOT check for leases (no cluster permissions)
```
### Deployment Example
```bash
# 1. Deploy cluster-wide operator
helm install platform-operator dynamo-platform \
-n platform-system \
--set namespaceRestriction.enabled=false
# 2. Deploy namespace-restricted operator for team-a
helm install team-a-operator dynamo-platform \
-n team-a \
--set namespaceRestriction.enabled=true \
--set namespaceRestriction.namespace=team-a
```
### ValidatingWebhookConfiguration Naming
The webhook configuration name reflects the deployment mode:
- **Cluster-wide**: `<release-name>-validating`
- **Namespace-restricted**: `<release-name>-validating-<namespace>`
Example:
```bash
# Cluster-wide
platform-operator-validating
# Namespace-restricted (team-a)
team-a-operator-validating-team-a
```
This allows multiple webhook configurations to coexist without conflicts.
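For scripting, the naming rule can be captured in a small helper. This is a sketch that assumes the name prefix is the Helm release name, which matches the `platform-operator` and `team-a-operator` examples; `webhook_config_name` is a hypothetical helper, not part of any Dynamo tooling.

```bash
# Print the expected ValidatingWebhookConfiguration name.
# Arguments: <release-name> [restricted-namespace]
# With one argument the cluster-wide form is printed; with two, the
# namespace-restricted form.
webhook_config_name() {
  if [ -n "${2:-}" ]; then
    printf '%s-validating-%s\n' "$1" "$2"
  else
    printf '%s-validating\n' "$1"
  fi
}

# Example usage:
#   kubectl get validatingwebhookconfiguration \
#     "$(webhook_config_name team-a-operator team-a)"
```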
### Lease Health
If the namespace-restricted operator is deleted or becomes unhealthy:
- Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds)
- Cluster-wide operator automatically resumes validation for that namespace
---
## Troubleshooting
### Webhook Not Called
**Symptoms:**
- Invalid resources are accepted
- No validation errors in logs
**Checks:**
1. **Verify webhook configuration exists**:
```bash
kubectl get validatingwebhookconfiguration | grep dynamo
```
2. **Check webhook configuration**:
```bash
kubectl get validatingwebhookconfiguration <webhook-config-name> -o yaml
# Verify:
# - caBundle is present and non-empty
# - clientConfig.service points to correct service
# - webhooks[].namespaceSelector matches your namespace
```
3. **Verify webhook service exists**:
```bash
kubectl get service -n <namespace> | grep webhook
```
4. **Check operator logs for webhook startup**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep webhook
# Should see: "Registering validation webhooks"
# Should see: "Starting webhook server"
```
---
### Connection Refused Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused
```
**Checks:**
1. **Verify operator pod is running**:
```bash
kubectl get pods -n <namespace> -l app.kubernetes.io/name=dynamo-operator
```
2. **Check webhook server is listening**:
```bash
# Port-forward to pod
kubectl port-forward -n <namespace> pod/<operator-pod-name> 9443:9443
# In another terminal, test connection
curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment
# Should NOT get "connection refused"
```
3. **Verify webhook port in deployment**:
```bash
kubectl get deployment -n <namespace> <release>-dynamo-operator -o yaml | grep -A5 "containerPort: 9443"
```
4. **Check for webhook initialization errors**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i error
```
---
### Certificate Errors
**Symptoms:**
```
Error from server (InternalError): Internal error occurred: failed calling webhook:
x509: certificate signed by unknown authority
```
**Checks:**
1. **Verify caBundle is present**:
```bash
kubectl get validatingwebhookconfiguration <webhook-config-name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d
# Should output a valid PEM certificate
```
2. **Verify certificate secret exists**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert
```
3. **Check certificate validity**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
# Check:
# - Not expired
# - SAN includes: <service-name>.<namespace>.svc
```
4. **Check operator logs for CA injection errors**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i "cert\|ca.*bundle\|inject"
```
---
### Certificate Controller Errors
**Symptoms:**
- Operator logs show cert-controller errors
- Certificate Secret is not created
- CA bundle is not injected into webhook configurations
**Checks:**
1. **Check cert-controller logs**:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep -i "cert-manager\|cert-rotation\|cert-controller"
```
2. **Verify RBAC permissions**:
```bash
# The operator needs permissions to manage Secrets, ValidatingWebhookConfigurations,
# MutatingWebhookConfigurations, and CustomResourceDefinitions
kubectl auth can-i create secrets -n <namespace> --as=system:serviceaccount:<namespace>:<release>-dynamo-operator
kubectl auth can-i patch validatingwebhookconfigurations --as=system:serviceaccount:<namespace>:<release>-dynamo-operator
```
3. **Check if the certificate Secret was created**:
```bash
kubectl get secret -n <namespace> <release>-webhook-server-cert
```
4. **Force certificate regeneration**:
```bash
# Delete the certificate secret and restart the operator
kubectl delete secret <release>-webhook-server-cert -n <namespace>
kubectl rollout restart deployment/<release>-dynamo-operator -n <namespace>
```
---
### Validation Errors Not Clear
**Symptoms:**
- Webhook rejects resource but error message is unclear
**Solution:**
Check operator logs for detailed validation errors:
```bash
kubectl logs -n <namespace> deployment/<release>-dynamo-operator | grep "validate create\|validate update"
```
Webhook logs include:
- Resource name and namespace
- Validation errors with context
- Warnings for immutable field changes
---
### Stuck Deleting Resources
**Symptoms:**
- Resource stuck in "Terminating" state
- Webhook blocks finalizer removal
**Solution:**
The webhook automatically skips validation for resources being deleted. If stuck:
1. **Check if webhook is blocking**:
```bash
kubectl describe <resource-type> <resource-name> -n <namespace>
# Look for events mentioning webhook errors
```
2. **Temporarily work around the webhook**:
```bash
# Option 1: Set failurePolicy to Ignore
kubectl patch validatingwebhookconfiguration <webhook-config-name> \
  --type='json' \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
# Option 2 (last resort): Delete ValidatingWebhookConfiguration
kubectl delete validatingwebhookconfiguration <webhook-config-name>
```
3. **Delete resource again**:
```bash
kubectl delete <resource-type> <resource-name> -n <namespace>
```
4. **Restore webhook configuration**:
```bash
helm upgrade dynamo-platform <chart> -n <namespace>
```
---
## Best Practices
### Production Deployments
1. ✅ **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
2. ✅ **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
3. ✅ **Automatic certificates work well for production** - The built-in cert-controller handles generation and rotation; use cert-manager only if you need integration with organizational PKI
4. ✅ **Test webhook configuration** in staging before production
### Development Deployments
1. ✅ **Use `failurePolicy: Ignore`** if webhook availability is problematic during development
2. ✅ **Keep automatic certificates** (zero configuration, built into the operator)
### Multi-Tenant Deployments
1. ✅ **Deploy one cluster-wide operator** for platform-wide validation
2. ~~Deploy namespace-restricted operators for tenant-specific namespaces~~ (**DEPRECATED** - use cluster-wide mode instead)
---
## Additional Resources
- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [cert-manager Documentation](https://cert-manager.io/docs/)
- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)
---
## Support
For issues or questions:
- Check the [Troubleshooting](#troubleshooting) section
- Review operator logs: `kubectl logs -n <namespace> deployment/<release>-dynamo-operator`
- Open an issue on GitHub
# Snapshotting GPU Workers
> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in preview and may work only in some cluster setups. The `snapshot-agent` DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
**Dynamo Snapshot** is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA's `cuda-checkpoint` utility. The usual flow is:
1. start a worker once and checkpoint its initialized state
2. store that checkpoint on a namespace-local snapshot volume
3. restore later workers from that checkpoint instead of cold-starting again
| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from a ready checkpoint directory |
> ⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.
For more background on the snapshot architecture and startup improvements, see
[NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes](https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/).
## Prerequisites
- x86_64 (`amd64`) GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
- vLLM or SGLang backend today
- Checkpoint storage. `ReadWriteMany` is the safest default for cross-node or
concurrent multi-node access, but `podMount` mode can also use suitable
`ReadWriteOnce` storage for sequential checkpoint/restore workflows.
- **CRI-O / OpenShift:** set `runtime.type=crio` on the snapshot chart (and `openshift.enabled=true` on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.
## Quick Start via `DynamoCheckpoint` CR
1. Build and push a placeholder image
2. Enable checkpointing in the platform
3. Install the snapshot chart
4. Create a `DynamoCheckpoint`
5. Wait for the checkpoint to become ready
6. Deploy a `DynamoGraphDeployment` that restores from the corresponding `checkpointRef`
### 1. Build and push a placeholder image
Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:
```bash
export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
cd deploy/snapshot
make docker-build-placeholder \
PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
make docker-push-placeholder \
PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
```
The placeholder image preserves the normal runtime entrypoint/command contract and adds the `criu`, `cuda-checkpoint`, and `nsrestore` tooling needed for checkpoint and restore.
To build either snapshot image against a custom CRIU fork or ref, pass
`CRIU_REPO` and `CRIU_REF` through `make`. If they are unset, the Dockerfile
defaults are used.
```bash
make docker-build-agent \
IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
CRIU_REPO="${YOUR_CRIU_REPO}" \
CRIU_REF="branch-or-sha"
make docker-build-placeholder \
PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
CRIU_REPO="${YOUR_CRIU_REPO}" \
CRIU_REF="branch-or-sha"
```
### 2. Enable checkpointing in the platform and verify it
Whether you are installing or upgrading `dynamo-platform`, the operator only needs checkpointing enabled:
```yaml
dynamo-operator:
  checkpoint:
    enabled: true
```
If the platform is already installed, verify that the operator config contains the checkpoint block:
```bash
OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
-l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
-o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
-o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'
```
Verify that the rendered config includes `enabled: true`.
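To turn that visual check into a pass/fail result, the same `sed` range can feed a `grep`. This is a rough sketch: `checkpoint_enabled` is a hypothetical helper, it assumes the config uses a top-level `checkpoint:` key as in the command above, and it does not distinguish `checkpoint.enabled` from a nested `enabled: true` deeper in the same block.

```bash
# Read the operator config YAML on stdin and return 0 when the
# top-level checkpoint block contains an indented "enabled: true" line.
checkpoint_enabled() {
  sed -n '/^checkpoint:/,/^[^[:space:]]/p' \
    | grep -Eq '^[[:space:]]+enabled:[[:space:]]*true[[:space:]]*$'
}

# Example usage, reusing OPERATOR_CONFIG and PLATFORM_NAMESPACE from above:
#   kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
#     -o jsonpath='{.data.config\.yaml}' | checkpoint_enabled \
#     && echo "checkpointing enabled"
```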
### 3. Install the snapshot chart
For the default namespace-local mode, install the snapshot chart in each
workload namespace. The chart creates the PVC and the agent in that namespace:
```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
--namespace ${NAMESPACE} \
--create-namespace \
--set storage.pvc.create=true
```
In the default `agentMount` mode, the snapshot-agent DaemonSet mounts the
checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on
multiple nodes may mount the same PVC, so the PVC generally needs
`ReadWriteMany`. The chart defaults to that mode. If your cluster does not have
a default storage class, also set `storage.pvc.storageClass`.
If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and set `storage.pvc.name` instead.
On CRI-O or OpenShift, append `--set runtime.type=crio` and, on OpenShift, also `--set openshift.enabled=true` (see `deploy/helm/charts/snapshot/README.md`).
For clusters that prefer one privileged snapshot agent instead of one DaemonSet
per workload namespace, install the chart once in an infrastructure namespace.
In this mode the chart does not create workload PVCs; the Dynamo operator either
creates each namespace-local PVC or verifies that it already exists:
```bash
helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
--namespace dynamo-system \
--create-namespace \
--set storage.accessMode=podMount \
--set storage.pvc.create=false \
--set rbac.namespaceRestricted=false
```
To let the operator create the workload PVC in each namespace that uses
checkpoint/restore, configure the operator with `create: true`:
```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
        create: true
        size: 1Ti
        storageClassName: ""
        accessMode: ReadWriteMany
```
The chart and operator use separate configuration surfaces here: the snapshot
chart PVC name is `storage.pvc.name`, while the operator config field is
`checkpoint.storage.pvc.pvcName`.
This is a key difference from `agentMount`: `podMount` removes the requirement
that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node.
Only the active checkpoint/restore workload pod mounts the PVC, and the agent
reaches it through that pod's mount namespace. `ReadWriteMany` remains the
safest operator-managed default, especially when multiple checkpoint/restore
pods may access the same PVC concurrently or when restore scheduling can span
nodes. Suitable `ReadWriteOnce` storage classes can still be used for
sequential `podMount` checkpoint/restore flows when the backend can attach the
volume to the node running the active workload pod.
`podMount` depends on the target container remaining alive while the agent
resolves `/host/proc/<pid>/root/`. If the container exits or restarts
during checkpoint/restore setup, if the runtime cannot expose a stable host PID,
or if node security settings prevent host proc traversal, the agent fails or
skips that attempt and Kubernetes/operator reconciliation must try again after a
fresh container is available.
To use an already-present PVC instead, omit `create` or set it to `false`. The
operator will fail reconciliation with a clear error if the named PVC does not
exist in the workload namespace.
Verify that the DaemonSet is ready. After a checkpoint or restore workload is
reconciled, verify the workload namespace PVC:
```bash
kubectl rollout status daemonset/snapshot-agent -n dynamo-system
kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
```
### 4. Create a standalone `DynamoCheckpoint`
The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the legacy `spec.identity` metadata, a container named `main`, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only `main` is checkpointed unless `spec.job.targetContainerName` selects another container.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-bf16
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 2048
  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        ...
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            ...
```
GMS + Snapshot support is currently disabled.
For a full working example, see [deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml).
Apply it:
```bash
kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE}
```
### 5. Wait for the checkpoint to become ready
```bash
kubectl get dckpt -n ${NAMESPACE} \
-o custom-columns=NAME:.metadata.name,CHECKPOINT_ID:.status.checkpointID,PHASE:.status.phase
kubectl wait \
--for=jsonpath='{.status.phase}'=Ready \
dynamocheckpoint/qwen3-06b-bf16 \
-n ${NAMESPACE} \
--timeout=30m
```
The useful status fields are:
- `status.phase`: high-level lifecycle (`Pending`, `Creating`, `Ready`, `Failed`)
- `status.checkpointID`: artifact ID used by the snapshot protocol
- `status.identityHash`: deprecated compatibility alias for `status.checkpointID`
- `status.jobName`: checkpoint Job name
- `status.createdAt`: timestamp recorded when the checkpoint became ready
- `status.message`: progress or failure detail when available
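The `kubectl wait` command above keeps waiting until its timeout even if the checkpoint has already moved to `Failed`. A polling loop over `status.phase` can stop early instead. The following is a sketch under stated assumptions: `checkpoint_next_action` and `wait_for_checkpoint` are hypothetical helpers, the action words are labels invented for this example, and the loop has no overall deadline.

```bash
# Map a DynamoCheckpoint phase to the next step of a polling loop.
# Phase names come from the status.phase list above; the action words
# (wait/done/fail/unknown) are hypothetical labels for this sketch.
checkpoint_next_action() {
  case "$1" in
    Ready) echo done ;;
    Failed) echo fail ;;
    Pending | Creating | "") echo wait ;;
    *) echo unknown ;;
  esac
}

# Poll until the checkpoint is Ready (returns 0) or Failed (returns 1).
# Arguments: <checkpoint-name> <namespace> [poll-interval-seconds]
wait_for_checkpoint() {
  while :; do
    phase="$(kubectl get dynamocheckpoint "$1" -n "$2" -o jsonpath='{.status.phase}')"
    case "$(checkpoint_next_action "$phase")" in
      done) return 0 ;;
      fail)
        # status.message carries failure detail when available.
        kubectl get dynamocheckpoint "$1" -n "$2" \
          -o jsonpath='{.status.message}{"\n"}' >&2
        return 1
        ;;
      unknown)
        echo "unexpected phase: ${phase}" >&2
        return 2
        ;;
    esac
    sleep "${3:-10}"
  done
}

# Example usage:
#   wait_for_checkpoint qwen3-06b-bf16 "${NAMESPACE}" 15
```

Because the loop never gives up on its own, a caller that needs an upper bound could wrap it in `timeout` or add a retry counter.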
### 6. Deploy a `DynamoGraphDeployment` that restores from `checkpointRef`
Once the checkpoint is `Ready`, restore a worker from it explicitly:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-checkpointref-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      checkpoint:
        enabled: true
        checkpointRef: qwen3-06b-bf16
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          ...
      ...
```
Apply it:
```bash
kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -w
```
The `VllmDecodeWorker` pod should restore from the ready checkpoint instead of creating a new one.
## DGD Auto Flow
`checkpointRef` is the most explicit path. If you set it, the DGD uses that
existing `DynamoCheckpoint` and does not create a new automatic checkpoint for
the component. This is the escape hatch for users who intentionally want to
reuse a pre-existing checkpoint and accept the compatibility risk.
Without `checkpointRef`, `mode: Auto` is the DGD-managed path: for each
checkpoint-enabled worker generation, the DGD controller creates a DGD-scoped
`DynamoCheckpoint` and the checkpoint controller starts a checkpoint Job.
Automatic DGD checkpoints are not reused across DGDs, even when two manifests
are identical.
The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse. The worker hash lets a scale-down followed by a scale-up within the same worker generation reuse the same DGD-scoped checkpoint, while a new worker generation gets a new checkpoint.
By default, `startupPolicy: Immediate` starts workers cold while the checkpoint job runs in the background. Once the checkpoint becomes `Ready`, only newly-created Pods restore from it. Existing Pods are not mutated or restarted just because the checkpoint became ready.
If you want workers to wait for the checkpoint before starting, set `startupPolicy: WaitForCheckpoint`. That policy keeps normal worker replicas at zero until the checkpoint is `Ready`, then starts workers from the checkpoint.
```yaml
checkpoint:
  enabled: true
  mode: Auto
  startupPolicy: Immediate  # default; optional
```
Inside a `DynamoGraphDeployment`, it looks like this:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-auto-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      checkpoint:
        enabled: true
        mode: Auto
        startupPolicy: Immediate
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          ...
      ...
```
The legacy `checkpoint.identity` field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone `DynamoCheckpoint` workflows.
Useful inspection commands:
```bash
kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \
-o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.checkpointID}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}'
kubectl get dckpt -n ${NAMESPACE}
```
If you use the default `Immediate` policy and want to create restored pods after the checkpoint becomes ready, scale the worker:
```bash
kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \
-p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```
## Failover Restore
Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover).
## Lower-Level Testing With `snapshotctl`
You can checkpoint and restore pods without the Dynamo operator by using the lower-level `snapshotctl` utility. The snapshot Helm chart must still be installed, and a `snapshot-agent` DaemonSet with the checkpoint PVC mounted must be running in the namespace.
`snapshotctl` is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see [deploy/snapshot/cmd/snapshotctl/README.md](../../deploy/snapshot/cmd/snapshotctl/README.md).
### Checkpoint from a worker pod manifest
```bash
snapshotctl checkpoint \
--manifest ./worker-pod.yaml \
--container main \
--namespace ${NAMESPACE}
```
The checkpoint manifest must be for a pod and use a placeholder image. `--container` names the workload container to checkpoint.
If you do not pass `--checkpoint-id`, `snapshotctl` generates one and prints it:
```text
status=completed
namespace=...
name=...
checkpoint_job=...
checkpoint_id=manual-snapshot-...
checkpoint_location=/checkpoints/...
```
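To reuse the generated ID in a later `snapshotctl restore` call, the `key=value` lines can be parsed with a one-line filter. This is a sketch only: `snapshotctl_field` is a hypothetical helper, and it assumes the summary shown above is written to standard output in exactly that `key=value` form.

```bash
# Read key=value lines on stdin and print the value of the first line
# whose key matches exactly.
# Argument: <key>, for example checkpoint_id
snapshotctl_field() {
  sed -n "s/^$1=//p" | head -n 1
}

# Example usage:
#   CHECKPOINT_ID="$(snapshotctl checkpoint \
#     --manifest ./worker-pod.yaml \
#     --container main \
#     --namespace "${NAMESPACE}" | snapshotctl_field checkpoint_id)"
```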
### Restore from a worker pod manifest
```bash
snapshotctl restore \
--manifest ./worker-pod.yaml \
--namespace ${NAMESPACE} \
--checkpoint-id manual-snapshot-... \
--containers main
```
This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.
### Restore an existing pod in place
```bash
snapshotctl restore \
--pod existing-restore-target \
--namespace ${NAMESPACE} \
--checkpoint-id manual-snapshot-... \
--containers main
```
This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.
## Checkpoint IDs and Legacy Identity
`status.checkpointID` is the artifact ID used by the snapshot protocol and the
directory name under checkpoint storage. For DGD-managed automatic checkpoints,
this ID is scoped to a single DGD/component worker generation. It is not a
compatibility claim across DGDs, and identical manifests are not treated as
proof that a checkpoint can be reused safely.
The legacy `spec.identity` shape is still required on standalone
`DynamoCheckpoint` objects and remains the fallback for explicit/manual
workflows. When a standalone checkpoint does not already have
`status.checkpointID` or the checkpoint-ID label, the operator computes the
legacy **16-character SHA256 hash** (64 bits) from these fields:
| Legacy field | Required | Affects legacy hash | Example |
|--------------|----------|---------------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` |
| `pipelineParallelSize` | | ✓ | `1`, `2` |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | custom key-value pairs |
Fields that do **not** change the legacy hash include:
- replica count
- node placement (`nodeSelector`, `affinity`, `tolerations`)
- resource requests/limits
- logging or observability configuration
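The shape of such an ID can be illustrated with standard tools: a SHA256 digest truncated to 16 hexadecimal characters carries 16 × 4 = 64 bits. The sketch below is illustrative only. The operator's real field order and serialization are not described here, so `legacy_identity_hash_sketch` (a hypothetical helper) will not reproduce real checkpoint IDs; it only shows that identical identity fields give the same ID and that changing one hashed field gives a different one.

```bash
# Illustrative only: hash the identity fields, one per line, and keep the
# first 16 hex characters (64 bits) of the SHA256 digest.
# The field order and newline-separated encoding are assumptions made for
# this sketch, not the operator's actual serialization.
# Arguments: model backendFramework dynamoVersion tensorParallelSize
#            pipelineParallelSize dtype maxModelLen
legacy_identity_hash_sketch() {
  printf '%s\n' "$@" | sha256sum | cut -c1-16
}

# Example usage:
#   legacy_identity_hash_sketch Qwen/Qwen3-0.6B vllm 1.0.0 1 1 bfloat16 2048
```

Replica count, node placement, and resource requests are simply never passed to the function, which mirrors why they do not affect the legacy hash.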
DGD-managed automatic checkpoints ignore this legacy identity as a reuse
boundary. The DGD controller creates its own DGD-scoped checkpoint ID and
synthesizes a legacy identity only because the v1alpha1 `DynamoCheckpoint` API
still requires the field.
## `DynamoCheckpoint` CRD
The `DynamoCheckpoint` (shortname: `dckpt`) is the operator-managed resource for checkpoint lifecycle.
Use it when you want:
- pre-warmed checkpoints before any `DynamoGraphDeployment` exists
- explicit lifecycle control independent from a DGD
- a stable human-readable name that services can reference with `checkpointRef`
The operator requires:
- `spec.identity`
- `spec.job.podTemplateSpec`
`spec.job.backoffLimit` is deprecated and ignored. Checkpoint Jobs are always single-attempt.
Check status with:
```bash
kubectl get dckpt -n ${NAMESPACE}
kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml
```
The `status` block looks like:
```yaml
status:
  phase: Ready
  checkpointID: 3bff874d069f0ed5
  identityHash: 3bff874d069f0ed5  # deprecated compatibility alias
  jobName: checkpoint-job-3bff874d069f0ed5-1
  createdAt: "2026-01-29T10:05:00Z"
  message: ""
```
## Limitations
- **Backend support is limited**: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
- **Worker coverage is narrow**: specialized workers such as multimodal, embedding, and diffusion are not supported.
- **Multi-GPU remains preview**: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
- **GMS restore is unavailable**: GMS + Snapshot is currently disabled.
- **Admission is create-only**: with DGD `startupPolicy: Immediate`, only Pods created after a checkpoint is `Ready` are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
- **Network state is sensitive**: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
- **Privileged DaemonSet required**: `snapshot-agent` must run privileged to execute CRIU and `cuda-checkpoint`. Workload pods do not need to be privileged.
## Troubleshooting
### Checkpoint Job finishes but the checkpoint never becomes `Ready`
A checkpoint only becomes `Ready` after `snapshot-agent` confirms the checkpoint contents. A completed Job is not enough by itself.
```bash
kubectl get dckpt -n ${NAMESPACE} \
-o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName
JOB_NAME=$(kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
if [ -n "${JOB_NAME}" ]; then
kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
fi
kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
```
If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.
### Restore cannot find or mount checkpoint storage
For the default `agentMount` install, restore discovers checkpoint storage from
the `snapshot-agent` DaemonSet in the workload namespace. That DaemonSet must be
ready and must mount the checkpoint PVC.
```bash
kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide
kubectl get pvc -n ${NAMESPACE}
```
For a shared-agent `podMount` install, the `snapshot-agent` DaemonSet can run in
the infrastructure namespace instead. Verify the shared-agent pods there, then
verify that the workload namespace has the checkpoint PVC that the operator
created or validated:
```bash
kubectl rollout status daemonset/snapshot-agent -n dynamo-system
kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
kubectl get pvc snapshot-pvc -n ${NAMESPACE}
```
In `podMount` mode the agent reaches the checkpoint through the workload pod's
mount namespace rather than by mounting the PVC itself. Check the workload pod's
checkpoint storage annotations and the `snapshot-agent` logs to see the actual
resolved checkpoint path. `snapshotctl` uses the chart's storage resolution
path, so for lower-level `snapshotctl` debugging make sure the snapshot chart
configuration matches the access mode you are testing.
### `snapshotctl` manifest is rejected or the restore target is wrong
`snapshotctl` requires a `Pod` manifest and a target-container list. Multi-container manifests are supported as long as every name passed via `--container` or `--containers` exists in the pod spec.
```bash
snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE}
snapshotctl restore --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id <checkpoint-id>
```
If the manifest already carries snapshot target metadata, it must agree with the CLI flag; `snapshotctl` rejects mismatches instead of silently picking one.
## Planned Features
- Stable multi-GPU and multinode support
- TensorRT-LLM support
## Related Documentation
- [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)
- [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover)
- [API Reference](/dynamo/additional-resources/api-reference-k-8-s)
# Shadow Engine Failover
> ⚠️ **Experimental Feature**: Shadow Engine Failover is an opt-in preview
> feature. It depends on GPU Memory Service (GMS), Dynamic Resource Allocation
> (DRA), and backend-specific support. Its API shape and behavior may change,
> and the failover state machine is still settling. Use it only for
> non-production evaluation unless you have validated the exact backend,
> topology, and failure mode in your cluster.
## Overview
Use Shadow Engine Failover when you want a standby engine to take over after an
unknown backend engine or software-process failure while the GPU and node remain
healthy. The goal is to avoid paying a full model weight reload after a
same-node process failure.
Shadow Engine Failover is the Kubernetes workflow. GPU Memory Service is the
enabling mechanism underneath it: GMS owns the GPU-resident model weights, and
the active and standby engines attach to those weights through DRA.
This is separate from [Dynamo Snapshot](/dynamo/kubernetes-deployment/advanced-platform/snapshot). Snapshot captures and
restores a process image with CRIU and `cuda-checkpoint`. Shadow Engine Failover
keeps model weights resident in GPU memory so a standby or replacement engine
can attach after selected process-level failures. They both target recovery
latency, but they solve different problems and are not interchangeable.
## Failure Recovery Flow
The following diagram illustrates same-node process-level recovery:
```text
┌──────────────────────── Same healthy node + GPU ───────────────────────┐
│ │
│ Before failure │
│ ┌──────────────┐ attach/use ┌───────────────────────────┐ │
│ │ Engine A │ ───────────────────▶ │ GMS-owned model weights │ │
│ │ active │ │ resident in GPU memory │ │
│ └──────┬───────┘ └────────────┬──────────────┘ │
│ │ ▲ │
│ │ │ attach/use │
│ │ unknown software/engine failure │ │
│ ▼ │ │
│ ┌──────────────┐ ┌──────┴───────┐ │
│ │ Engine A │ exits │ Engine B │ │
│ └──────────────┘ │ shadow │ │
│ └──────┬───────┘ │
│ │ takeover │
│ ▼ │
│ ┌──────────────┐ │
│ │ Engine B │ │
│ │ active │ │
│ └──────────────┘ │
│ │
└────────────────────────────────────────────────────────────────────────┘
```
**How it works:**
1. The operator creates active and standby engine containers or pods for the
worker, depending on the selected failover mode.
2. The engines share GPU access through DRA and attach to model weights owned by
GMS.
3. An unknown software or engine failure terminates the active engine, while the
GMS process, GPU, and node remain healthy.
4. The standby or replacement engine takes over and attaches to the resident
GMS-owned weights instead of performing a full weight reload.
5. In-flight requests and KV cache state are not preserved. If the GPU, node, or
GMS process is lost, the replacement worker must use the normal rescheduling
and model-load path.
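The recovery rules in the steps above can be summarized as a small decision function. This is an illustrative sketch only: `recovery_path` and its failure labels are hypothetical names chosen for this example, not identifiers from Dynamo or the operator.

```bash
# Illustrative only: maps a failure kind to the recovery path described above.
# The function name and failure labels are hypothetical, not Dynamo identifiers.
recovery_path() {
  case "$1" in
    engine-process)
      # GMS, GPU, and node are still healthy: attach to the resident weights.
      echo "attach-to-resident-weights" ;;
    gms-process|gpu|node)
      # The resident weights are gone: normal rescheduling and model load.
      echo "reschedule-and-reload" ;;
    *)
      echo "unknown" ;;
  esac
}

recovery_path engine-process
recovery_path node
```

In both paths, in-flight requests and KV cache state are lost; only the cost of reloading weights differs.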
## When to Use It Today
- Use it to evaluate same-node recovery from unknown vLLM engine or
software-process failures.
- Use it when the cost you are trying to avoid is loading another independent
copy of model weights into GPU memory.
- Use the GMS-only examples to validate backend weight loading through GMS, not
as a complete failover workflow.
- Do not use it for hardware failure, GPU loss, node loss, cross-node recovery,
in-flight request recovery, or KV-cache recovery.
- Do not combine it with Snapshot restore. Snapshot plus GMS is not yet
available.
## GPU Memory Service
GMS moves ownership of GPU-resident model weights out of the engine process and
into a separate GPU memory service. In the failover workflow, this lets the
active and standby engines share the same weight memory boundary instead of
loading independent copies.
Direct GMS enablement is useful for backend integration testing and
sleep/wake-style lifecycle experiments. By itself, it does not configure
active/passive failover; use the `failover` field for the shadow engine flow.
## Prerequisites
- Kubernetes 1.34 or newer with DRA v1 (`resource.k8s.io/v1`) enabled.
- NVIDIA GPU DRA driver installed.
- A matching DRA `DeviceClass`, defaulting to `gpu.nvidia.com`.
- A supported backend image. The current failover examples are vLLM-focused.
- Backend command-line support for GMS loading, such as `--load-format gms`.
- Enough GPU memory for the GMS processes and active or standby engines sharing
the device.
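Two of these prerequisites can be checked from the command line before you apply a manifest. The following is a minimal sketch, assuming `kubectl` is pointed at the target cluster; `check_dra_prereqs` is a hypothetical helper name, and it does not verify the backend image or GPU memory headroom.

```bash
# Hypothetical helper: checks that the DRA v1 API is served and that the
# DeviceClass exists. Defaults to the gpu.nvidia.com DeviceClass.
check_dra_prereqs() {
  local device_class="${1:-gpu.nvidia.com}"
  # The DRA v1 group/version must be served by the API server.
  if ! kubectl api-versions | grep -qx 'resource.k8s.io/v1'; then
    echo "missing: resource.k8s.io/v1 API" >&2
    return 1
  fi
  # The DeviceClass used for GPU claims must exist.
  if ! kubectl get deviceclass "$device_class" >/dev/null 2>&1; then
    echo "missing: DeviceClass $device_class" >&2
    return 1
  fi
  echo "DRA prerequisites look present"
}
```

Run `check_dra_prereqs` with no arguments to use the default DeviceClass, or pass a different name as the first argument.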
## Limitations
- It is not a general checkpoint/restore system.
- It is not a hardware fault tolerance mechanism for GPU, node, or rack loss.
- It does not diagnose or fix the backend failure.
- It does not preserve in-flight requests, network sockets, or KV cache state.
- It does not make Snapshot restore supported for GPU memory workloads.
- Snapshot plus GMS is temporarily blocked by admission because of known GPU
driver restore issues.
- It is not covered by the normal v1beta1 compatibility guarantees while it
lives under `experimental`.
## API Placement
For `v1alpha1` `DynamoGraphDeployment`, GMS and failover are service-level
fields:
```yaml
gpuMemoryService:
enabled: true
failover:
enabled: true
```
For `v1beta1`, preview fields are grouped under `experimental` to make the
stability contract explicit:
```yaml
experimental:
gpuMemoryService:
mode: IntraPod
failover:
mode: IntraPod
```
See the [API reference](/dynamo/additional-resources/api-reference-k-8-s) for the exact schema supported by your
CRD version.
## Basic Shadow Engine Failover Example
Failover builds on GMS. In intra-pod mode, the operator clones the worker's main
container into active and standby engine containers that share GPUs through DRA
and the GMS sidecar. The standby engine takes over when the active engine fails.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-agg-failover
annotations:
nvidia.com/dynamo-kube-discovery-mode: container
spec:
services:
VllmWorker:
componentType: worker
replicas: 1
resources:
limits:
gpu: "2"
gpuMemoryService:
enabled: true
failover:
enabled: true
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
args:
- --model
- Qwen/Qwen3-0.6B
- --tensor-parallel-size
- "2"
- --load-format
- gms
```
See the [vLLM failover example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_failover.yaml)
for the full manifest.
## Basic GMS Example
The worker must request GPUs through the normal Dynamo service resources, enable
`gpuMemoryService`, and run a backend command that can load from GMS.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-agg-gms
spec:
services:
VllmWorker:
componentType: worker
replicas: 1
resources:
limits:
gpu: "1"
gpuMemoryService:
enabled: true
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
workingDir: /workspace/examples/backends/vllm
command:
- python3
- -m
- dynamo.vllm
args:
- --model
- Qwen/Qwen3-0.6B
- --load-format
- gms
```
Working GMS-only examples:
- [vLLM GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_gms.yaml)
- [SGLang GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/sglang/deploy/agg_gms.yaml)
## Related Documentation
- [Snapshot](/dynamo/kubernetes-deployment/advanced-platform/snapshot)
- [API Reference](/dynamo/additional-resources/api-reference-k-8-s)
- [vLLM failover example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_failover.yaml)
- [vLLM GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_gms.yaml)
- [SGLang GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/sglang/deploy/agg_gms.yaml)
# Developing the Operator with Tilt
## Overview
[Tilt](https://tilt.dev) provides a live-reload development environment for the
Dynamo Kubernetes operator. Instead of manually building images, pushing to a
registry, and redeploying on every change, Tilt watches your source files and
automatically recompiles the Go binary, syncs it into the running container, and
restarts the process — all in seconds.
Under the hood, the Tiltfile:
1. **Compiles** the Go manager binary locally (`CGO_ENABLED=0`).
2. **Builds** a minimal Docker image containing only the binary.
3. **Renders** the production Helm chart (`deploy/helm/charts/platform`) with
`helm template`, applies CRDs via `kubectl`, and deploys all rendered
resources.
4. **Live-updates** the binary inside the running container on every code
change — no full image rebuild required.
This gives you a fully working cluster where you can apply `DynamoGraphDeployment`
and `DynamoGraphDeploymentRequest` resources and have them reconcile into real
workloads while you iterate on controller logic with feedback in seconds.
## Prerequisites
| Tool | Version | Purpose |
|------|---------|---------|
| [Tilt](https://docs.tilt.dev/install.html) | v0.33+ | Development orchestration |
| [Helm](https://helm.sh/docs/intro/install/) | v3 | Chart rendering |
| [Go](https://go.dev/dl/) | 1.25+ | Compiling the operator |
| [kubectl](https://kubernetes.io/docs/tasks/tools/) | — | Cluster access |
| A Kubernetes cluster | — | kind, minikube, or remote cluster |
You also need a **container registry** that is accessible to your cluster's
nodes, so they can pull the operator image Tilt builds. If you use a local
cluster like kind with a local registry, Tilt can push there directly.
## Quick Start
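```bash
cd deploy/operator
# Create your personal settings file (gitignored)
# Example values; replace them with your kube context and registry
cat > tilt-settings.yaml <<EOF
allowed_contexts: [my-cluster]
registry: docker.io/myuser
EOF
tilt up
```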
If no registry is configured, the image is only available locally. This works
with kind using a local registry but will fail on remote clusters.
## How It Works
When you run `tilt up`, the following resources are created in order:
```
manager-build Compile Go binary locally
│
├───── crds Apply CRDs via server-side apply
│
operator Deploy operator pod (live-updated)
```
The operator handles webhook certificate generation, CA bundle injection, and
MPI SSH key provisioning at runtime — no external setup needed.
### What Each Resource Does
**manager-build** — Runs `CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build` to
compile the operator binary. Re-runs on changes to `api/`, `cmd/`, `internal/`,
`go.mod`, or `go.sum`.
**crds** — Applies CRDs from the Helm chart via `kubectl apply --server-side`.
When `skip_codegen` is `false`, runs `make generate && make manifests` first.
**operator** — The operator Deployment itself. Tilt watches the compiled binary
and uses `live_update` to sync it into the running container and restart the
process — no image rebuild needed. On startup, the operator's built-in cert
controller generates a self-signed TLS certificate, injects the CA bundle into
webhook configurations, and creates the MPI SSH secret — matching production
behavior exactly.
### Live Update Cycle
The inner development loop looks like this:
1. You edit Go source files under `deploy/operator/`.
2. Tilt detects the change and recompiles the binary (~2-5 seconds).
3. The new binary is synced into the running container via `live_update`.
4. The process restarts automatically.
5. Your controller changes are live — test by applying a DGD/DGDR.
No `docker build`, no `docker push`, no `kubectl rollout restart`.
## Webhook Certificates
The operator handles webhook TLS certificates automatically at runtime using a
built-in cert controller (based on OPA cert-controller). On startup it:
1. Creates a self-signed CA and webhook serving certificate.
2. Stores them in the `webhook-server-cert` Secret.
3. Injects the CA bundle into `ValidatingWebhookConfiguration` and
`MutatingWebhookConfiguration` resources.
This matches production behavior and requires no external tooling. For
alternative certificate management (cert-manager or external certs), see the
[webhook documentation](/dynamo/kubernetes-deployment/advanced-platform/webhooks) and configure via
`helm_values` in `tilt-settings.yaml`.
## Typical Workflows
### Iterating on Controller Logic
The most common workflow — you're modifying reconciliation logic and want fast
feedback:
```yaml
# tilt-settings.yaml
allowed_contexts: [my-cluster]
registry: docker.io/myuser
skip_codegen: true
```
```bash
tilt up
# Edit files under internal/controller/
# Tilt auto-recompiles and live-updates
# Apply test resources:
kubectl apply -f examples/backends/vllm/deploy/agg.yaml
```
### Changing API Types (CRDs)
When you modify files under `api/`, you need codegen to run:
```yaml
# tilt-settings.yaml
skip_codegen: false # or omit — false is the default
```
Tilt will run `make generate && make manifests` and re-apply CRDs whenever
`api/` files change.
### Testing Multi-Node Features
Enable the necessary subcharts:
```yaml
# tilt-settings.yaml
enable_grove: true
enable_kai_scheduler: true
```
### Using Environment Variables
You can override the registry without editing the settings file:
```bash
REGISTRY=ghcr.io/myorg tilt up
```
## Tilt UI
The web UI at [http://localhost:10350](http://localhost:10350) shows:
- **Resource status** — green/red/pending for each resource
- **Build logs** — compilation output and errors
- **Runtime logs** — operator logs streamed in real time
- **Port forwards** — the health endpoint is forwarded to `localhost:8081`
Resources are grouped by label (`operator` and `infrastructure`) to keep the
UI organized.
## Cleanup
```bash
# Stop Tilt and leave resources deployed
# (Ctrl-C in the terminal)
# Stop Tilt and tear down all resources
tilt down
```
## Troubleshooting
### Image Pull Errors
If pods show `ImagePullBackOff`:
- Verify `registry` is set in `tilt-settings.yaml` or via `REGISTRY` env var.
- Ensure your cluster nodes can pull from that registry.
- For kind with a local registry, follow the
[kind local registry guide](https://kind.sigs.k8s.io/docs/user/local-registry/).
### Webhook TLS Errors
If applying a DGD/DGDR fails with `x509: certificate signed by unknown authority`:
- Check the operator logs in the Tilt UI — the cert controller logs its
progress on startup.
- Verify the `webhook-server-cert` Secret exists and has been populated:
```bash
kubectl -n dynamo-system get secret webhook-server-cert
```
- The operator may need a few seconds after startup to generate certs and
inject the CA bundle. Wait for the `cert-controller` log messages before
applying resources.
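Instead of watching the logs, you can poll until the Secret is populated. This is a minimal sketch: `wait_for_webhook_cert` and `POLL_INTERVAL` are hypothetical names, and the `tls.crt` data key is an assumption about how the cert controller lays out the Secret, so adjust it if your Secret uses different keys.

```bash
# Hypothetical helper: polls until webhook-server-cert has certificate data.
# Assumes the serving certificate is stored under the "tls.crt" key.
wait_for_webhook_cert() {
  local ns="${1:-dynamo-system}" tries="${2:-30}" i=0 cert
  while [ "$i" -lt "$tries" ]; do
    cert=$(kubectl -n "$ns" get secret webhook-server-cert \
      -o jsonpath='{.data.tls\.crt}' 2>/dev/null)
    if [ -n "$cert" ]; then
      echo "webhook-server-cert is populated"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "timed out waiting for webhook-server-cert" >&2
  return 1
}
```

Call `wait_for_webhook_cert` before `kubectl apply`; the optional arguments are the namespace and the number of attempts.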
### CRD Codegen Failures
If `crds` fails with codegen errors:
- Ensure `controller-gen` is installed: `make controller-gen`
- Try running codegen manually: `make generate && make manifests`
- Set `skip_codegen: true` temporarily to bypass if you haven't changed API types.
### Context Safety Guard
If Tilt refuses to start with a context error, add your cluster context to
`allowed_contexts` in `tilt-settings.yaml`:
```yaml
allowed_contexts:
- my-cluster-context
```
# Amazon Elastic Kubernetes Service (EKS)
# Steps to create an EKS cluster
This guide demonstrates the Dynamo platform on Amazon Elastic Kubernetes Service (EKS).
## Setup environment variables
We will use these environment variables throughout this guide.
To use a different region, change the `AWS_REGION` variable.
```bash
export AWS_REGION="us-east-1"
export CLUSTER_NAME="ai-dynamo"
export DYNAMO_NAMESPACE="dynamo-system"
export DYNAMO_RELEASE_VERSION="1.0.0"
```
## Install CLIs
### Install AWS CLI ([AWS CLI installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
```bash
sudo apt install unzip
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```
### Install Kubernetes CLI ([kubectl installation guide for EKS](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html))
```bash
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.35.2/2026-02-27/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
```
### Install Eksctl CLI ([eksctl installation guide](https://eksctl.io/installation/))
```bash
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
```
### Install Helm CLI ([Helm setup for EKS](https://docs.aws.amazon.com/eks/latest/userguide/helm.html))
```bash
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
```
## Create an EKS Auto Mode cluster
Create an EKS Auto Mode cluster with eksctl, using the `templates/eksctl.yaml` template.
This creates an EKS Auto Mode cluster with the Amazon EFS CSI Driver installed as an add-on. Amazon EFS is used later to store the model weights and compilation output that Dynamo uses.
```bash
# Use all availability zones in a region, exclude use1-az3 where EKS control plane is not available
export EKS_CP_AZS=$(aws ec2 describe-availability-zones \
--region ${AWS_REGION} \
--filters "Name=opt-in-status,Values=opt-in-not-required" \
--query "AvailabilityZones[?ZoneId!='use1-az3'].[ZoneName]" \
--output text | sed 's/ /, /g; s/^/ - /')
eksctl create cluster -f <(envsubst < templates/eksctl.yaml)
```
*Note: eksctl configures the kubeconfig context for you automatically. If it does not, run: `aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME`*
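To see what the `sed` expression in the cluster-creation command produces before it is substituted into the template, you can run it on sample input. The zone names below are placeholders standing in for the `aws ec2 describe-availability-zones` output, and `format_azs` is a hypothetical wrapper name. The list indentation must match the nesting in `templates/eksctl.yaml`, which is not shown here.

```bash
# Same sed expression as in the EKS_CP_AZS command, wrapped for reuse.
format_azs() {
  sed 's/ /, /g; s/^/ - /'
}

# Placeholder zone names, one per line, as `--output text` would print them.
printf 'us-east-1a\nus-east-1b\nus-east-1c\n' | format_azs
```

Each input line comes back prefixed with ` - `, which turns the zone list into YAML list items.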
### Create an EKS Auto Mode GPU NodePool
Create a GPU NodePool that targets the **g5, g6, g6e, g7e, p5, p5e, and p5en** instance families.
```bash
kubectl apply -f automode-np-gpu.yaml
```
## Create a default StorageClass
Create a default StorageClass that uses the EKS Auto Mode storage capability. It provisions EBS volumes for stateful workloads, such as the NATS deployment that Dynamo uses.
```bash
kubectl apply -f - << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: auto-ebs-sc
annotations:
storageclass.kubernetes.io/is-default-class: "true"
allowedTopologies:
- matchLabelExpressions:
- key: eks.amazonaws.com/compute-type
values:
- auto
provisioner: ebs.csi.eks.amazonaws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
type: gp3
encrypted: "true"
EOF
```
## Create an Amazon EFS shared file system
Follow the [EFS setup guide](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/efs) to create an EFS file system and make it available as shared storage for Dynamo workloads.
## Install Dynamo Kubernetes Platform
### Install Dynamo Platform
```bash
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-"$DYNAMO_RELEASE_VERSION".tgz
helm install dynamo-platform dynamo-platform-"$DYNAMO_RELEASE_VERSION".tgz \
--namespace "$DYNAMO_NAMESPACE" \
--create-namespace
```
### Set up the Hugging Face token
```bash
export HF_TOKEN="hf_..."  # replace with your Hugging Face access token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="${HF_TOKEN}" \
-n ${DYNAMO_NAMESPACE}
```
### Verify installation
Validate that the Dynamo platform pods are running. You should see output similar to the following:
```bash
kubectl get pods -n ${DYNAMO_NAMESPACE}
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-ff54b5dstgcq 1/1 Running 0 106s
dynamo-platform-nats-0 2/2 Running 0 106s
```
Validate that the Dynamo CRDs were installed:
```bash
kubectl get crds | grep dynamo
dynamocheckpoints.nvidia.com 2026-03-17T13:18:05Z
dynamocomponentdeployments.nvidia.com 2026-03-17T13:18:06Z
dynamographdeploymentrequests.nvidia.com 2026-03-17T13:18:08Z
dynamographdeployments.nvidia.com 2026-03-17T13:18:09Z
dynamographdeploymentscalingadapters.nvidia.com 2026-03-17T13:18:10Z
dynamomodels.nvidia.com 2026-03-17T13:18:10Z
dynamoworkermetadatas.nvidia.com 2026-03-17T13:18:11Z
```
## Deploy a Dynamo DynamoGraphDeployment (DGD)
| Manifest | Description |
|----------|-------------|
| `manifests/vllm/disagg.yaml` | Disaggregated prefill/decode DGD using NIXL with LIBFABRIC backend over EFA. Targets `g7e.12xlarge` instances with GPUDirect RDMA support for high-throughput KV-cache transfer between prefill and decode workers. |
| `manifests/vllm/disagg-p5.yaml` | Disaggregated prefill/decode DGD using NIXL with LIBFABRIC backend over EFA. Targets `p5.48xlarge` reserved instances with 8 EFA devices (4 EFAs per 1 GPU for p5.48xlarge) and TP-2 for Qwen3-32B. Uses 2 decode and 6 prefill replicas on reserved capacity (`karpenter.sh/capacity-type: reserved`). |
| `manifests/vllm/disagg-tcp.yaml` | Alternative disaggregated prefill/decode inference graph using TCP instead of EFA. Targets `g6e.2xlarge` instances, suitable for instance types without EFA support. |
| `manifests/vllm/agg.yaml` | Aggregated (single-worker) inference graph where a single vLLM worker handles both prefill and decode phases. Simpler deployment without KV-cache transfer overhead. |
### Cache Models on EFS
Before deploying an inference graph, download the model weights onto the shared EFS file system. Each Dynamo recipe includes a `model-cache/model-download.yaml` Job manifest that downloads the model from HuggingFace.
Copy the recipe's download manifest into the local kustomize directory and apply it:
```bash
# Example: cache the Qwen3-32B model which we will be using later
cp ../../../recipes/qwen3-32b/model-cache/model-download.yaml manifests/model-download/model-download.yaml
kubectl kustomize manifests/model-download | kubectl -n ${DYNAMO_NAMESPACE} apply -f -
rm -f manifests/model-download/model-download.yaml
```
The recipe manifests don't set any memory resources on the download container. Without a memory request, the Job pod can get OOMKilled during download — especially for large models. The `kustomization.yaml` in `manifests/model-download/` patches in a memory request to prevent this. By default it adds `4Gi`.
For larger models (e.g. DeepSeek-R1, Nemotron-3-Super-120B) increase this value in `manifests/model-download/kustomization.yaml` before applying:
```yaml
patches:
- target:
kind: Job
name: model-download
patch: |
apiVersion: batch/v1
kind: Job
metadata:
name: model-download
spec:
template:
spec:
containers:
- name: model-download
resources:
requests:
memory: "16Gi" # increase for larger models
```
Then apply:
```bash
kubectl kustomize manifests/model-download | kubectl -n ${DYNAMO_NAMESPACE} apply -f -
```
Monitor the download Job:
```bash
kubectl -n ${DYNAMO_NAMESPACE} get jobs model-download
kubectl -n ${DYNAMO_NAMESPACE} logs -f job/model-download
```
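If you prefer to block until the download finishes, for example in a script, `kubectl wait` can watch the Job's `complete` condition. The following is a minimal sketch: `wait_for_model_download` is a hypothetical helper name, and the `60m` default timeout is an arbitrary assumption that you should size to your model.

```bash
# Hypothetical helper: blocks until the model-download Job completes.
# Falls back to printing recent logs if the wait times out or fails.
wait_for_model_download() {
  local ns="${1:-${DYNAMO_NAMESPACE:-dynamo-system}}" timeout="${2:-60m}"
  if kubectl -n "$ns" wait --for=condition=complete \
      --timeout="$timeout" job/model-download; then
    echo "model download finished"
  else
    # Surface the last log lines, e.g. to spot a failed or OOMKilled download.
    kubectl -n "$ns" logs --tail=20 job/model-download >&2
    return 1
  fi
}
```

Note that a Job that fails outright never reaches the `complete` condition, so the helper returns only after the timeout in that case.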
To re-run a download (e.g. after changing the model or fixing an OOM), delete the previous Job first:
```bash
kubectl -n ${DYNAMO_NAMESPACE} delete job model-download
```
Then copy the new recipe's manifest and apply again.
### Disaggregated Serving
This example deploys a disaggregated prefill/decode Dynamo Inference Graph that uses NIXL with the LIBFABRIC backend over [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) for high-throughput KV-cache transfer between workers.
It targets `g7e.12xlarge` instances, which support GPUDirect RDMA, and uses the Dynamo EFA-enabled vLLM container `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64` that ships with the [EFA Installer](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-changelog.html) pre-installed.
*Note: For a full list of EFA-supported instance types, see [the AWS EC2 Docs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).*
```yaml
nodeSelector:
node.kubernetes.io/instance-type: g7e.12xlarge
```
KV-cache transfer between workers uses [NIXL](https://github.com/ai-dynamo/nixl) with the LIBFABRIC backend. Enable it by passing the following argument to vLLM:
`--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'`
*Note: On instance types without EFA support, NIXL's libfabric backend falls back to TCP automatically. However, vLLM's `NixlConnector` defaults to `cuda` as the buffer device, so you must add `"kv_buffer_device":"cpu"` to the `kv-transfer-config` argument for disaggregated serving to work without EFA.*
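For reference, the full argument with the CPU buffer device added might look like the following. This is a sketch that assumes `kv_buffer_device` sits at the top level of the JSON object, alongside `kv_connector`; check `manifests/vllm/disagg-tcp.yaml` for the exact form this guide uses. `KV_TRANSFER_CONFIG` is only an illustrative variable name.

```bash
# Illustrative variable name. The JSON mirrors the argument above with
# "kv_buffer_device":"cpu" added (assumed to be a top-level key).
KV_TRANSFER_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cpu","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

# Print the flag as it would be passed to vLLM.
printf '%s\n' "--kv-transfer-config '${KV_TRANSFER_CONFIG}'"
```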
Request an EFA device for each worker pod using the `vpc.amazonaws.com/efa` extended resource:
```yaml
resources:
requests:
gpu: "1"
custom:
vpc.amazonaws.com/efa: "1"
limits:
gpu: "1"
custom:
vpc.amazonaws.com/efa: "1"
```
*Note: EKS Auto Mode includes the EFA device plugin, which makes the `vpc.amazonaws.com/efa` extended resource available.*
All workers (prefill and decode) must be co-located in the same availability zone, since EFA traffic does not cross AZ boundaries. Use a pod affinity rule to enforce this:
```yaml
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: "topology.kubernetes.io/zone"
labelSelector:
matchLabels:
nvidia.com/dynamo-graph-deployment-name: "vllm-disagg"
```
```bash
kubectl -n ${DYNAMO_NAMESPACE} apply -f manifests/vllm/disagg.yaml
```
*Note: `manifests/vllm/disagg-tcp.yaml` provides an alternative example that uses TCP instead of EFA, targeting `g6e.2xlarge` instances.*
Verify that all pods reach `Running` status:
```bash
kubectl -n ${DYNAMO_NAMESPACE} get pods
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-ff54b5dstgcq 1/1 Running 0 39m
dynamo-platform-nats-0 2/2 Running 0 39m
vllm-disagg-frontend-85f8476887-wwtwk 1/1 Running 0 2m13s
vllm-disagg-vllmdecodeworker-510a1741-7666987b-tp58w 1/1 Running 0 2m13s
vllm-disagg-vllmprefillworker-510a1741-54f76d7954-tjgn8 1/1 Running 0 2m13s
```
```bash
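# Forward the frontend port. This command blocks, so run it in its own terminal.
kubectl -n ${DYNAMO_NAMESPACE} port-forward svc/vllm-disagg-frontend 8000:8000
# Then send a request from a second terminal: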
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
You should see output similar to the following:
```bash
{"id":"chatcmpl-23a7c94b-99cb-42ca-ae56-2397aa5a560f","choices":[{"index":0,"message":{"content":"\nOkay, so I need to develop a character background for someone who's an intrepid explorer in Eldoria, specifically focusing on their motivations,","role":"assistant","reasoning_content":null},"finish_reason":"length"}],"created":1773336002,"model":"Qwen/Qwen3-0.6B","object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":30,"total_tokens":226,"prompt_tokens_details":{"audio_tokens":null,"cached_tokens":192}},"nvext":{"worker_id":{"prefill_worker_id":4265733549773195,"prefill_dp_rank":0,"decode_worker_id":7535192362430132,"decode_dp_rank":0},"timing":{"request_received_ms":1773336002136,"prefill_wait_time_ms":0.852483,"prefill_time_ms":12.90597,"ttft_ms":13.758453000000001,"total_time_ms":110.89621500000001,"kv_hit_rate":0.0}}}
```
*Note: The first request to each worker incurs higher latency because of the NIXL backend handshake and initialization. This overhead applies only to the very first transfer.*
Watch logs
```bash
kubectl logs -n ${DYNAMO_NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=vllm-disagg --all-containers=true --max-log-requests=20 --prefix=true --timestamps -f
```
Cleanup
```bash
kubectl -n ${DYNAMO_NAMESPACE} delete -f manifests/vllm/disagg.yaml
```
### Aggregated Serving
```bash
kubectl -n ${DYNAMO_NAMESPACE} apply -f manifests/vllm/agg.yaml
```
Verify that all pods reach `Running` status. The output should look similar to the following:
```bash
kubectl -n ${DYNAMO_NAMESPACE} get pods
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-ff54b5dstgcq 1/1 Running 0 12m
dynamo-platform-nats-0 2/2 Running 0 12m
vllm-agg-frontend-ff8457bcf-tq9jh 1/1 Running 0 4m46s
vllm-agg-vllmdecodeworker-d0a70291-759df94478-8lc74 1/1 Running 0 4m46s
```
Watch logs
```bash
kubectl logs -n ${DYNAMO_NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=vllm-agg --all-containers=true --max-log-requests=20 --prefix=true --timestamps -f
```
```bash
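# Forward the frontend port. This command blocks, so run it in its own terminal.
kubectl -n ${DYNAMO_NAMESPACE} port-forward svc/vllm-agg-frontend 8000:8000
# Then send a request from a second terminal: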
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
You should see output similar to the following:
```bash
{"id":"chatcmpl-093fac0e-f75e-43b5-90dc-96c8c77a2e7c","choices":[{"index":0,"message":{"content":"\nOkay, I need to develop a character background for the explorer in Eldoria. Let me start by understanding the user's query. They mentioned","role":"assistant","reasoning_content":null},"finish_reason":"length"}],"created":1773443560,"model":"Qwen/Qwen3-0.6B","object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":30,"total_tokens":226},"nvext":{"timing":{"request_received_ms":1773443560878,"total_time_ms":99.89782}}}
```
Cleanup
```bash
kubectl -n ${DYNAMO_NAMESPACE} delete -f manifests/vllm/agg.yaml
```
## Using On-Demand Capacity Reservations (ODCR) and Capacity Blocks (CBs) for ML
GPU instances can be difficult to acquire on-demand. AWS provides two reservation mechanisms to guarantee capacity for ML workloads:
- [On-Demand Capacity Reservations (ODCRs)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) reserve capacity in a specific AZ for any duration. You pay for the reserved capacity whether or not you use it.
- [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) reserve GPU instances for a fixed time window (hours to days). Instances are placed in EC2 UltraClusters for low-latency networking. Capacity Blocks have a defined end time, and EC2 will terminate instances before the block expires.
EKS Auto Mode uses Karpenter under the hood, which models reserved capacity as `karpenter.sh/capacity-type: reserved` and prioritizes it over on-demand and spot.
By default, EKS Auto Mode can launch into open ODCRs automatically, but does not prioritize them. Capacity Blocks are never used automatically. Both require explicit `capacityReservationSelectorTerms` configuration on a NodeClass to be prioritized and labeled as `reserved`.
### Create a NodeClass with Capacity Reservation
Create a NodeClass that references your ODCR or Capacity Block reservation. You can select by reservation ID or by tags.
First, extract the subnet, security group, and role configuration from the `default` NodeClass that EKS Auto Mode already created:
```bash
export NC_SUBNETS=$(kubectl get nodeclass default -o json | jq -c '.spec.subnetSelectorTerms')
export NC_SG=$(kubectl get nodeclass default -o json | jq -c '.spec.securityGroupSelectorTerms')
export NC_ROLE=$(kubectl get nodeclass default -o json | jq -r '.spec.role')
```
Set `CR_ID` to your actual reservation ID from the EC2 console.
```bash
export CR_ID="cr-xxxxxxxxxxxxxx"  # replace with your reservation ID
kubectl apply -f - << EOF
apiVersion: eks.amazonaws.com/v1
kind: NodeClass
metadata:
name: gpu-reserved
spec:
role: ${NC_ROLE}
subnetSelectorTerms: ${NC_SUBNETS}
securityGroupSelectorTerms: ${NC_SG}
capacityReservationSelectorTerms:
# Select by reservation ID (ODCR or Capacity Block)
- id: "${CR_ID}"
# Or select by tags (can be combined)
# - tags:
# team: "dynamo"
EOF
```
Wait until the capacity reservation's `state` is `active`:
```bash
kubectl get nodeclass gpu-reserved -o json | jq '.status.capacityReservations'
[
{
"availabilityZone": "us-east-2c",
"endTime": "2026-03-18T11:30:00Z",
"id": "cr-xxxxxxxxxxxxxx",
"instanceMatchCriteria": "targeted",
"instanceType": "p5.48xlarge",
"ownerID": "xxxxxxxxxxx",
"reservationType": "capacity-block",
"state": "active"
}
]
```
### Create a NodePool for Reserved Capacity
Create a NodePool that references the `gpu-reserved` NodeClass and uses the `reserved` capacity type. You can optionally include `on-demand` and `spot` as a fallback when the reservation is exhausted.
```bash
kubectl apply -f - << EOF
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-reserved
spec:
  disruption:
    budgets:
      - nodes: 10%
    consolidateAfter: 300s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: gpu-reserved
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - reserved
            # Uncomment to fall back to on-demand or spot when the reservation is exhausted
            # - on-demand
            # - spot
        - key: eks.amazonaws.com/instance-family
          operator: In
          values:
            - g6e
            - g7e
            - p5
            - p5e
            - p5en
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: Exists
EOF
```
Validate that the `gpu-reserved` NodePool is ready:
```bash
kubectl get nodepool gpu-reserved
NAME NODECLASS NODES READY AGE
gpu-reserved gpu-reserved 0 True 8s
```
When configuring `capacityReservationSelectorTerms` on any NodeClass in the cluster, EKS Auto Mode will stop automatically using open ODCRs for all NodeClasses. Make sure all NodeClasses that should use ODCRs have explicit selector terms configured.
### Targeting Reserved Nodes from Workloads
Pods are scheduled onto reserved nodes through the existing NodePool requirements and taints. If you want to ensure a workload only runs on reserved capacity, add a node selector:
```yaml
nodeSelector:
  karpenter.sh/capacity-type: reserved
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
### Capacity Blocks Considerations
Capacity Blocks have a fixed end time. EC2 begins terminating instances 30 minutes before the block expires (60 minutes for UltraServer types). Karpenter will start draining nodes 10 minutes before EC2 termination begins, giving your workloads time to gracefully shut down.
Plan your inference workloads accordingly, and consider using `on-demand` as a fallback capacity type in the NodePool if you need continuity beyond the Capacity Block window.
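The timing above reduces to simple arithmetic: draining starts 40 minutes before the block ends (30 + 10), or 70 minutes for UltraServer types (60 + 10). The sketch below works on epoch seconds; `capacity_block_drain_epoch` is a hypothetical helper name, and the offsets are taken directly from the paragraph above rather than queried from AWS.

```bash
# Hypothetical helper: prints the epoch second at which node draining is
# expected to begin for a Capacity Block ending at END_EPOCH.
# Usage: capacity_block_drain_epoch END_EPOCH [standard|ultraserver]
capacity_block_drain_epoch() {
  local end_epoch=$1 kind=${2:-standard}
  local ec2_lead_min=30        # EC2 starts terminating 30 min before expiry
  if [[ "$kind" == "ultraserver" ]]; then
    ec2_lead_min=60            # 60 min for UltraServer types
  fi
  local drain_lead_min=10      # draining starts 10 min before EC2 termination
  echo $((end_epoch - (ec2_lead_min + drain_lead_min) * 60))
}
```

With GNU `date`, `date -u -d @"$(capacity_block_drain_epoch "$(date -u -d 2026-03-18T11:30:00Z +%s)")" +%FT%TZ` converts the result back to a timestamp, which for the sample reservation shown earlier should be 10:50:00Z, 40 minutes before `endTime`.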
## Cleanup
Delete all DynamoGraphDeployments:
```bash
kubectl -n ${DYNAMO_NAMESPACE} get dgd
# If any exist, delete them
kubectl -n ${DYNAMO_NAMESPACE} delete dgd --all
```
Uninstall the Dynamo platform:
```bash
helm uninstall -n ${DYNAMO_NAMESPACE} dynamo-platform
```
Clean up leftover PVCs related to NATS:
```bash
kubectl -n ${DYNAMO_NAMESPACE} get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
dynamo-platform-nats-js-dynamo-platform-nats-0 Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 10Gi RWO auto-ebs-sc 75m
kubectl -n ${DYNAMO_NAMESPACE} delete pvc dynamo-platform-nats-js-dynamo-platform-nats-0
```
Delete the Auto Mode GPU NodePool:
```bash
kubectl delete nodepool gpu
```
Clean up EFS-related resources by following the Cleanup section of the [EFS setup guide](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/efs#cleanup).
Delete the EKS Auto Mode cluster using `eksctl`:
```bash
eksctl delete cluster -f <(envsubst < templates/eksctl.yaml)
```
# EFA (RDMA over AWS Fabric) on EKS
This guide covers setting up RDMA over AWS Elastic Fabric Adapter (EFA) on EKS for high-performance disaggregated inference with Dynamo. EFA is the only RDMA fabric available on AWS — InfiniBand and RoCE are not offered. With EFA, Dynamo's prefill and decode workers transfer KV cache directly between GPUs across nodes via GPU-Direct RDMA, bypassing CPU and TCP/IP stacks.
Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~1s with EFA on Llama-3.1-8B at ISL 8000). See the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for the transport-layer fundamentals.
## Prerequisites
**Recommended GPU EC2 instance types with EFA:**
| Instance family | GPU | Aggregate EFA bandwidth | Arch |
| ------------------------------ | ------------------------------------------------------- | ------------------------------------------------------- | ----------------- |
| `p5.48xlarge` / `p5e.48xlarge` | 8× H100 / H200 | 3.2 Tbps | x86_64 |
| `p5en.48xlarge` | 8× H200 | 3.2 Tbps | x86_64 |
| `p6-b200.48xlarge` | 8× B200 | 3.2 Tbps | x86_64 |
| P6e-GB200 UltraServer          | GB200 (topology-dependent, up to 72 GPUs / UltraServer) | 400 Gbps EFAv4 per GPU; up to 28.8 Tbps per UltraServer | **arm64 (Grace)** |
This table is not an exhaustive list of all AWS instance types that support EFA. It lists the GPU families most relevant to Dynamo disaggregated inference.
**Cluster setup:**
- **GPU-Direct RDMA enabled on the host** — either kernel ≥ 5.12 (DMA-BUF path; default on current AWS EKS AMIs, typically 6.14+) **or** an older kernel with the `nvidia-peermem` / AWS `efa_nv_peermem` module loaded (legacy peer-memory path; see [Step 2](#step-2-verify-host-kernel-modules) for how to install it).
- **EFA-enabled security group** — VPC security groups must allow all traffic between EFA-attached ENIs. The standard recommendation is a self-referencing security group rule that allows all protocols within the group. See [AWS EFA security group setup](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security).
- **EKS node groups created with EFA support** — when using `eksctl`, set `efaEnabled: true` on the GPU node group. This attaches the appropriate number of EFA ENIs per instance type.
## Overview
EFA setup involves three pieces:
1. **AWS EFA Kubernetes device plugin** — exposes EFA NICs as the `vpc.amazonaws.com/efa` extended resource (host-level setup, [Step 1](#step-1-install-the-aws-efa-kubernetes-device-plugin)). On modern kernels (≥ 5.12) the DMA-BUF path is used and `efa_nv_peermem` is not required; older kernels need it loaded ([Step 2](#step-2-verify-host-kernel-modules)).
2. **Container image** with libfabric + aws-ofi-nccl + Dynamo ([Step 3](#step-3-build-a-dynamo-efa-image)).
3. **Workload spec** that selects the LIBFABRIC NIXL backend, requests EFA resources, and runs privileged ([Step 4](#step-4-configure-nixl-backend), [Step 5](#step-5-pod-resource-requests)).
## Step 1: Install the AWS EFA Kubernetes Device Plugin
The AWS EFA Kubernetes Device Plugin exposes each node's EFA endpoints as the `vpc.amazonaws.com/efa` extended resource so pods can request them. AWS publishes two install paths — pick one:
**Helm (recommended, from the official `aws/eks-charts` repo):**
```bash
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-efa-k8s-device-plugin \
--namespace kube-system \
eks/aws-efa-k8s-device-plugin
```
**Or raw manifest (from [aws-samples/aws-efa-eks](https://github.com/aws-samples/aws-efa-eks)):**
```bash
kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml
```
Wait for the device plugin pods to start on every EFA-capable node:
```bash
kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin-daemonset -w
```
Verify EFA resources are advertised by each GPU node:
```bash
kubectl get nodes -o json | jq '.items[] | select(.status.allocatable["vpc.amazonaws.com/efa"] != null) | {name: .metadata.name, efa: .status.allocatable["vpc.amazonaws.com/efa"], gpu: .status.allocatable["nvidia.com/gpu"]}'
```
Each EFA-capable node should report a non-zero `vpc.amazonaws.com/efa` count (e.g., `32` on `p5.48xlarge`, reflecting that instance's EFA endpoint count). The exact count depends on instance type and how the node group's ENIs were configured at launch.
## Step 2: Verify Host Kernel Modules
Modern AWS GPU AMIs (Amazon Linux 2023, Ubuntu 22.04+, kernel ≥ 5.12) use **DMA-BUF** for GPU-Direct RDMA and **do not require** `nvidia-peermem` or `efa_nv_peermem`. The default AMIs for p5/p5e/p5en/p6-b200/GB200 ship with kernels in the 6.x line where DMA-BUF is the active path.
To confirm:
```bash
# On a GPU node (via kubectl debug or SSH):
uname -r
# Expected: 6.x kernel (e.g., 6.14.0-1018-aws)
lsmod | grep -E "^efa|nvidia"
# Expected: efa, nvidia, nvidia_modeset, nvidia_uvm, gdrdrv loaded
# Note: nvidia-peermem / efa_nv_peermem NOT loaded is normal on modern kernels
cat /sys/module/efa/version
# Expected: 3.0.0g or newer
```
If you are on an older kernel (< 5.12) and the host doesn't already have `efa_nv_peermem` loaded, the simplest path is to switch to an AMI that includes EFA host-level components — the EKS-optimized AL2023 NVIDIA AMI and all Bottlerocket AMIs include them. Otherwise, run [`aws-efa-installer`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable) on the host (via a privileged DaemonSet or baked into a custom AMI). See [AWS — Manage EFA devices on Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/device-management-efa.html) for the full picture.
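The kernel check in this step can be automated with a version comparison against the 5.12 threshold quoted above. This is a minimal sketch; `kernel_supports_dmabuf` is a hypothetical helper name, and it only compares version numbers (it does not confirm that the EFA or NVIDIA drivers are actually using DMA-BUF).

```bash
# Hypothetical helper: succeeds when a `uname -r` style release string is
# at least 5.12. Defaults to the running kernel when no argument is given.
# Exit status: 0 = >= 5.12, 1 = older, 2 = could not parse the release.
kernel_supports_dmabuf() {
  local release=${1:-$(uname -r)}
  local major minor
  # "6.14.0-1018-aws" -> major=6, minor=14; the remainder is ignored.
  IFS=. read -r major minor _ <<<"$release"
  minor=${minor%%[!0-9]*}
  [[ "$major" =~ ^[0-9]+$ && "$minor" =~ ^[0-9]+$ ]] || return 2
  ((major > 5 || (major == 5 && minor >= 12)))
}
```

On a node, `kernel_supports_dmabuf || echo "legacy peer-memory path needed"` flags hosts that need `efa_nv_peermem` or a newer AMI.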
## Step 3: Build a Dynamo EFA Image
Dynamo's image build is two steps: `container/render.py` writes a Dockerfile for the chosen framework + target, then `docker build` consumes it. Passing `--make-efa` to `render.py` appends the AWS EFA installer stage from [`container/templates/aws.Dockerfile`](../../../../container/templates/aws.Dockerfile), which defines a stage named `aws` on top of `runtime`. **You must pass `--target aws` to `docker build`** — without it, `docker build` stops at the `runtime` stage and you get an image without EFA. See [`container/README.md`](../../../../container/README.md) for the full build workflow.
```bash
# vLLM EFA image (amd64 or arm64 — vllm/vllm-openai is multi-arch)
container/render.py --framework=vllm --target=runtime --platform=linux/amd64 \
--make-efa --output-short-filename
docker build --target aws -t dynamo:latest-vllm-runtime-efa \
-f container/rendered.Dockerfile .
container/render.py --framework=vllm --target=runtime --platform=linux/arm64 \
--make-efa --output-short-filename
docker buildx build --platform=linux/arm64 --target aws \
-t dynamo:latest-vllm-runtime-efa-arm64 -f container/rendered.Dockerfile .
# SGLang EFA image (amd64 or arm64)
container/render.py --framework=sglang --target=runtime --platform=linux/amd64 \
--make-efa --output-short-filename
docker build --target aws -t dynamo:latest-sglang-runtime-efa \
-f container/rendered.Dockerfile .
container/render.py --framework=sglang --target=runtime --platform=linux/arm64 \
--make-efa --output-short-filename
docker buildx build --platform=linux/arm64 --target aws \
-t dynamo:latest-sglang-runtime-efa-arm64 -f container/rendered.Dockerfile .
# TRT-LLM EFA image (amd64 or arm64 — upstream nvcr.io/nvidia/tensorrt-llm/release
# publishes both variants; arm64 is what you want for GB200 / Grace EFA nodes)
container/render.py --framework=trtllm --target=runtime --platform=linux/amd64 \
--cuda-version=13.1 --make-efa --output-short-filename
docker build --target aws -t dynamo:latest-trtllm-runtime-efa \
-f container/rendered.Dockerfile .
container/render.py --framework=trtllm --target=runtime --platform=linux/arm64 \
--cuda-version=13.1 --make-efa --output-short-filename
docker buildx build --platform=linux/arm64 --target aws \
-t dynamo:latest-trtllm-runtime-efa-arm64 -f container/rendered.Dockerfile .
```
`--output-short-filename` writes to `container/rendered.Dockerfile`; omit it to get the long auto-generated filename (e.g., `vllm-runtime-cuda12.9-amd64-rendered.Dockerfile`) — useful when keeping several rendered Dockerfiles side by side.
See [Known Issues](#known-issues) below for one case where the default-built image does **not** produce a working EFA deployment out of the box (GB200 / arm64 64K-page kernels). The symptom looks like a working setup but fails at startup during NIXL memory registration.
## Step 4: Configure NIXL Backend
NIXL is the high-level KV transfer API and supports multiple backends. **For EFA, the LIBFABRIC backend must be selected.** UCX is NIXL's default backend, and while it has CUDA-IPC / RDMA transports available in the image, in standard pod-to-pod EFA configurations it lands on a slow transport (effectively TCP-speed at ~1–3 GB/s) instead of EFA's line rate. Empirically, LIBFABRIC is the only backend that reaches full EFA bandwidth on AWS.
Each framework selects the backend differently:
| Framework | How to select LIBFABRIC | Default if unset |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ |
| **SGLang** | `SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC` env var | UCX → TCP fallback |
| **vLLM** | `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'` CLI flag | UCX → TCP fallback |
| **TRT-LLM** | `TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` env var | UCX → TCP fallback |
| **KVBM (Rust)** | `DYN_KVBM_NIXL_BACKEND_LIBFABRIC=true` env var | UCX → TCP fallback |
This is a silent-failure path — getting it wrong manifests as ~100 s TTFT instead of a clear error. Always [verify at startup](#verification) that LIBFABRIC is active.
### Required EFA environment variables
In addition to backend selection, set these on every worker pod:
```yaml
env:
  - { name: FI_PROVIDER, value: efa }
  - { name: FI_EFA_USE_DEVICE_RDMA, value: "1" }
  - { name: FI_EFA_ENABLE_SHM_TRANSFER, value: "0" }
  - { name: FI_EFA_ENABLE_SHM, value: "0" }
  # Place Amazon EFA libs first in LD_LIBRARY_PATH
  - name: LD_LIBRARY_PATH
    value: "/opt/amazon/efa/lib:/opt/amazon/efa/lib64:/opt/aws-ofi-nccl/lib:${LD_LIBRARY_PATH}"
```
### Recommended EFA performance tuning
```yaml
env:
- { name: FI_EFA_FORK_SAFE, value: "0" }
- { name: FI_EFA_USE_HUGE_PAGE, value: "1" }
- { name: FI_EFA_MR_MAX_CACHED_COUNT, value: "524288" }
- { name: FI_EFA_MR_MAX_CACHED_SIZE, value: "0" }
```
When using `FI_EFA_USE_HUGE_PAGE=1`, also add `hugepages-2Mi: 5120Mi` to the pod resource limits.
## Step 5: Pod Resource Requests
Dynamo pods that use EFA must request the resource and run privileged:
```yaml
resources:
  limits:
    nvidia.com/gpu: "4"           # or your TP
    vpc.amazonaws.com/efa: "4"    # number of EFA NICs to allocate
    hugepages-2Mi: 5120Mi         # if FI_EFA_USE_HUGE_PAGE=1
securityContext:
  privileged: true                # REQUIRED: IPC_LOCK alone is insufficient
  capabilities:
    add: [IPC_LOCK]
hostIPC: true                     # required by some EFA setups
volumeMounts:
  - { name: shm, mountPath: /dev/shm }
```
```yaml
volumes:
  - name: shm
    emptyDir: { medium: Memory, sizeLimit: 80Gi }
```
`privileged: true` is required for NIXL to register CUDA VRAM with the EFA NIC via `fi_mr_reg`. `IPC_LOCK` alone is insufficient.
## Known Issues
One issue currently affects default-built Dynamo EFA images.
### Issue 1: libfabric on GB200 fails `fi_mr_reg` on CUDA VRAM
**Known affected platforms:** GB200.
**Symptom:** Worker pod fails at startup with `fi_mr_reg` returning EFAULT during NIXL initialization. NIXL VRAM registration fails; depending on the framework, the worker either crashes or silently falls back to TCP.
**Root cause:** The libfabric bundled with the EFA installer (libfabric versions below 2.5.x, which includes the currently latest installer, 1.48.0) lacks a CUDA branch in the dmabuf-eligibility check in `prov/efa/src/efa_mr.c`. On x86_64 hosts the legacy `ibv_reg_mr` path handles CUDA pointers natively, so the bug doesn't surface. On arm64 64K-page kernels (GB200), the legacy path returns EFAULT for CUDA VRAM. Tracked in [ofiwg/libfabric#12019](https://github.com/ofiwg/libfabric/issues/12019).
**Upstream status:** The bug is resolved in `ofiwg/libfabric` main and v2.5.x via a more comprehensive rewrite of `efa_mr_reg_ibv_mr()`. AWS's `aws/libfabric` fork has not picked up the upstream rewrite; the latest EFA installer (1.48.0) still ships `v2.4.0amzn3.0` with the older code path.
**Workarounds:**
1. **Apply the one-line patch to the bundled libfabric.** During image build, replace the `aws.Dockerfile` install step with a custom build:
```dockerfile
RUN git clone --depth 1 --branch v2.4.0amzn3.0 https://github.com/aws/libfabric.git /tmp/libfabric && \
cd /tmp/libfabric && \
sed -i 's/efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr)/efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr) || efa_mr_is_cuda(efa_mr)/' prov/efa/src/efa_mr.c && \
./autogen.sh && \
CPPFLAGS="-I/usr/local/cuda/include" \
LDFLAGS="-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs -Wl,-rpath,/usr/local/cuda/lib64" \
./configure --prefix=/opt/amazon/efa --enable-efa --with-cuda=/usr/local/cuda --enable-cuda-dlopen && \
make -j$(nproc) && make install
# Then rebuild aws-ofi-nccl from source against the patched libfabric (do not mix versions)
```
2. **Replace bundled libfabric with `ofiwg/libfabric@v2.5.1`** (or newer). The upstream rewrite is already present; no patch needed. Rebuild `aws-ofi-nccl` against it.
## Verification
After deployment, confirm EFA is actually being used (not silent TCP fallback):
**1. NIXL chose the LIBFABRIC backend** (not UCX):
```bash
# POD: name of the worker pod to inspect
kubectl logs "$POD" | grep -iE "NIXL.*backend|Backend.*instantiated"
# Expected: "Backend LIBFABRIC was instantiated"
# WRONG: "Backend UCX was instantiated"
```
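Because a wrong backend fails silently, it can help to turn this log check into a pass/fail gate. The sketch below reads worker logs on stdin and matches the two log strings quoted above; `nixl_backend_status` is a hypothetical helper name, and it assumes the log wording is exactly as shown in this guide.

```bash
# Hypothetical helper: reports which NIXL backend the logs on stdin show.
# Exit status: 0 = LIBFABRIC, 1 = UCX only, 2 = no matching line found.
nixl_backend_status() {
  local logs
  logs=$(cat)
  if grep -q "Backend LIBFABRIC was instantiated" <<<"$logs"; then
    echo "LIBFABRIC"
    return 0
  elif grep -q "Backend UCX was instantiated" <<<"$logs"; then
    echo "UCX"
    return 1
  fi
  echo "unknown"
  return 2
}
```

Typical use is `kubectl logs "$POD" | nixl_backend_status`, where `POD` is a placeholder for your worker pod name; a non-zero exit status means the deployment should not be considered EFA-enabled yet.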
**2. The LIBFABRIC plugin is loaded and executing** (not just opened):
```bash
# POD: name of the worker pod to inspect
kubectl exec "$POD" -- bash -c '
grep "libplugin_LIBFABRIC" /proc/$(pgrep -f "dynamo|vllm|sglang" | head -1)/maps | grep "r-xp"
'
# Expected: at least one line ending in "r-xp" (executable code page mapped)
# If only "r--p" : library opened but never run — config didn't apply, NIXL chose a different backend
```
**3. Registered RDMA memory is GPU VRAM, not CPU pinned memory** (no CPU bounce):
```bash
# POD: name of the worker pod to inspect
kubectl logs "$POD" | grep "efa_mr_reg_impl" | head -1
# Look for "Registered memory at 0x7d7749bc4000 of size 431767552"
kubectl exec "$POD" -- bash -c 'grep "7d7749bc4" /proc/$(pgrep -f "dynamo|vllm|sglang" | head -1)/maps'
# Expected: NO OUTPUT — CUDA VRAM addresses are not in the Linux VMA table.
# If the address IS found: CPU pinned memory was registered — CPU bounce — GPU-Direct NOT working.
```
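The address used in the second command is the logged address without its `0x` prefix and without its last three hex digits. A small filter can derive that search string from the log instead of copying it by hand. This is a sketch only: `registered_addr_prefix` is a hypothetical helper name, and it assumes the log line format shown above.

```bash
# Hypothetical helper: reads worker logs on stdin, takes the first
# "Registered memory at 0x... of size ..." line, and prints the address
# minus the "0x" prefix and minus its last three hex digits.
# Prints nothing when no such line is present.
registered_addr_prefix() {
  sed -n 's/.*Registered memory at 0x\([0-9a-fA-F]\{4,\}\) of size.*/\1/p' |
    head -n 1 |
    sed 's/...$//'
}
```

For example, `kubectl logs "$POD" | registered_addr_prefix` (with `POD` as a placeholder for your worker pod name) yields the string to grep for in the process's memory map.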
**4. NIXL transfers are happening, none failing** (via Prometheus metrics endpoint):
NIXL telemetry is off by default. To enable it, set on each worker:
```yaml
env:
- { name: NIXL_TELEMETRY_ENABLE, value: "y" }
- { name: NIXL_TELEMETRY_EXPORTER, value: prometheus }
- { name: NIXL_TELEMETRY_PROMETHEUS_PORT, value: "19090" } # NIXL's own port — distinct from framework metrics
```
Then query:
```bash
# POD: name of the worker pod to inspect
kubectl exec "$POD" -- curl -s localhost:19090/metrics | grep -E "nixl_bytes_transferred|nixl_num_failed_transfers"
# Expected: nixl_bytes_transferred_count > 0 and increasing
# nixl_num_failed_transfers_total stays 0
```
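The two expectations above (bytes transferred above zero, failed transfers at zero) can be checked mechanically. The sketch below parses Prometheus text-format output on stdin; `nixl_metrics_ok` is a hypothetical helper name, it uses the metric names quoted above, and it assumes sample lines carry no trailing timestamp.

```bash
# Hypothetical helper: succeeds only if bytes were transferred and no
# transfer failed. Values are summed across label sets, if any.
nixl_metrics_ok() {
  awk '
    # True when the first field is the metric name, with or without labels.
    function is_metric(field, name) {
      return field == name || index(field, name "{") == 1
    }
    /^#/ { next }   # skip HELP/TYPE comment lines
    is_metric($1, "nixl_bytes_transferred_count")    { bytes  += $NF }
    is_metric($1, "nixl_num_failed_transfers_total") { failed += $NF }
    END {
      printf "bytes_transferred=%.0f failed_transfers=%.0f\n", bytes, failed
      exit !(bytes > 0 && failed == 0)
    }
  '
}
```

Piping the `curl` output from the worker into `nixl_metrics_ok` prints a one-line summary and returns non-zero when either expectation is not met. Run it twice a few seconds apart to confirm the byte count is increasing.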
The same metrics with the `vllm:` prefix are also published to vLLM's own metrics endpoint (typically `DYN_SYSTEM_PORT`, e.g. `8081`) when vLLM is the frontend.
**5. Decode side confirms KV receipt**:
```bash
# POD: name of the decode worker pod to inspect
kubectl logs "$POD" | grep "External prefix cache hit rate"
# Expected: "External prefix cache hit rate: 100.0%"
```
Do not use `rdma_write_bytes` or other `/sys/class/infiniband/*/counters/*` checks for EFA verification. EFA SRD uses SEND operations at the hardware level, not RDMA READ/WRITE, so `rdma_write_bytes` is always 0 on correctly configured EFA by design. Use the Prometheus + `/proc/<pid>/maps` methodology above instead.
## Common Failure Modes
| Symptom | Likely cause | Fix |
| ------------------------------------------------------ | -------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| TTFT ~100 s, throughput ~MB/s | Silent TCP fallback — NIXL backend selection not applied | Verify Step 4 backend env var; check NIXL startup log |
| TTFT ~10 s, throughput 1–5 GB/s | UCX host-staged (no GPU-Direct on kernel ≥ 6.8) | Switch to LIBFABRIC backend |
| Pod fails at startup with `fi_mr_reg` EFAULT on GB200 | Issue 1 (libfabric CUDA dmabuf bug) | Apply patch or use ofiwg/libfabric v2.5.1 |
| Pod fails at startup with `fi_mr_reg` EFAULT on x86_64 | `privileged: true` missing OR `efa_nv_peermem` missing on old kernel | Verify Step 5 security context; on kernels < 5.12, load `efa_nv_peermem` (Step 2) |
| Bandwidth halves after image rebuild | libfabric / aws-ofi-nccl ABI mismatch | Rebuild aws-ofi-nccl from source against the libfabric used in the same image |
| `rdma_write_bytes` shows 0 | **Not a failure** — EFA SRD uses SEND, not WRITE | Use Prometheus `nixl_bytes_transferred` instead |
## References
- [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) — transport-layer fundamentals
- [RDMA / InfiniBand on AKS](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) — Azure equivalent
- [`container/templates/aws.Dockerfile`](../../../../container/templates/aws.Dockerfile) — EFA installer template
- [AWS — Manage EFA devices on Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/device-management-efa.html) — official EKS-side guide (DRA driver + device plugin)
- [AWS EFA documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) — EC2-side EFA overview
- [`aws/eks-charts` — `aws-efa-k8s-device-plugin`](https://github.com/aws/eks-charts/tree/master/stable/aws-efa-k8s-device-plugin) — Helm chart source
- [ofiwg/libfabric#12019](https://github.com/ofiwg/libfabric/issues/12019) — CUDA dmabuf registration on EFA
# Amazon EFS Setup for EKS
# Create an Amazon EFS File System for Amazon EKS
This guide walks through creating an Amazon EFS file system and connecting it to your EKS cluster. The EFS CSI Driver was already installed as an addon via `eksctl.yaml` during cluster creation. Now we need to create the actual file system and make it available to Kubernetes workloads.
This filesystem will be used by Dynamo to store shared model weights and compilation cache across nodes.
## Prerequisites
- EKS cluster created following the [EKS guide](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup)
- Environment variables set:
```bash
export AWS_REGION="us-east-1"
export CLUSTER_NAME="ai-dynamo"
export DYNAMO_NAMESPACE="dynamo-system"
```
## Retrieve VPC and Subnet Information
Get the VPC ID associated with your EKS cluster:
```bash
export VPC_ID=$(aws eks describe-cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--query "cluster.resourcesVpcConfig.vpcId" \
--output text)
```
Get the CIDR range for the VPC (used for the security group rule):
```bash
export VPC_CIDR=$(aws ec2 describe-vpcs \
--vpc-ids $VPC_ID \
--query "Vpcs[0].CidrBlock" \
--output text)
```
## Create a Security Group for EFS
Create a security group that allows NFS traffic (port 2049) from within the VPC:
```bash
export EFS_SG_ID=$(aws ec2 create-security-group \
--group-name dynamo-efs-sg \
--description "Security group for EFS access from EKS" \
--vpc-id $VPC_ID \
--region $AWS_REGION \
--query "GroupId" \
--output text)
```
Add an inbound rule to allow NFS traffic from the VPC CIDR:
```bash
aws ec2 authorize-security-group-ingress \
--group-id $EFS_SG_ID \
--protocol tcp \
--port 2049 \
--cidr $VPC_CIDR \
--region $AWS_REGION
```
## Create the EFS File System
```bash
export EFS_FS_ID=$(aws efs create-file-system \
--performance-mode generalPurpose \
--throughput-mode elastic \
--encrypted \
--region $AWS_REGION \
--tags Key=Name,Value=dynamo-efs \
--query "FileSystemId" \
--output text)
```
Wait for the file system to become available:
```bash
aws efs describe-file-systems \
--file-system-id $EFS_FS_ID \
--region $AWS_REGION \
--query "FileSystems[0].LifeCycleState" \
--output text
```
You should see `available` before proceeding.
## Create Mount Targets
Mount targets allow your EKS nodes to access the EFS file system. You need one mount target per subnet where your nodes run.
Get the subnet IDs used by your EKS cluster:
```bash
export SUBNET_IDS=$(aws eks describe-cluster \
--name $CLUSTER_NAME \
--region $AWS_REGION \
--query "cluster.resourcesVpcConfig.subnetIds[]" \
--output text)
echo "Subnet IDs: $SUBNET_IDS"
```
Create a mount target in each subnet:
```bash
for SUBNET_ID in $(echo "$SUBNET_IDS" | tr '\t' '\n'); do
echo "Creating mount target in subnet: $SUBNET_ID"
aws efs create-mount-target \
--file-system-id $EFS_FS_ID \
--subnet-id $SUBNET_ID \
--security-groups $EFS_SG_ID \
--region $AWS_REGION 2>/dev/null || echo " Mount target already exists or subnet is in a duplicate AZ (this is OK)"
done
```
EFS allows only one mount target per Availability Zone. If multiple subnets are in the same AZ, the command will fail for the duplicates, which is expected and safe to ignore.
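If you would rather avoid the expected failures than ignore them, you can reduce the subnet list to one subnet per Availability Zone before creating mount targets. This is a minimal sketch; `one_subnet_per_az` is a hypothetical helper name, and it simply keeps the first subnet listed for each AZ.

```bash
# Hypothetical helper: reads "SUBNET_ID AZ" pairs (one per line) on stdin
# and prints the first subnet seen for each Availability Zone.
one_subnet_per_az() {
  awk 'NF >= 2 && !seen[$2]++ { print $1 }'
}

# Example (not executed here): feed it subnet/AZ pairs from the AWS CLI.
#   aws ec2 describe-subnets --subnet-ids $SUBNET_IDS --region $AWS_REGION \
#     --query "Subnets[].[SubnetId,AvailabilityZone]" --output text |
#     one_subnet_per_az
```

The resulting subnet IDs can be used in place of `$SUBNET_IDS` in the loop above. Which subnet wins within an AZ is arbitrary here, so pick explicitly if your subnets differ in routing.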
Verify mount targets are available:
```bash
aws efs describe-mount-targets \
--file-system-id $EFS_FS_ID \
--region $AWS_REGION \
--query "MountTargets[*].{SubnetId:SubnetId,AZ:AvailabilityZoneName,State:LifeCycleState}" \
--output table
```
Wait until all mount targets show `available` in the State column before proceeding.
## Create Kubernetes StorageClass
Create a StorageClass that uses the EFS CSI driver with dynamic provisioning:
```bash
kubectl apply -f - << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc-dynamic
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: "${EFS_FS_ID}"
  directoryPerms: "777"
  uid: "1000"
  gid: "1000"
EOF
```
## Create a PersistentVolumeClaim
We create three separate PVCs because different Dynamo recipe examples reference each one individually:
* `model-cache` stores downloaded model weights (e.g. from HuggingFace).
* `compilation-cache` stores vLLM/TRT-LLM compilation artifacts.
* `perf-cache` stores benchmark traces and performance results.
```bash
# Create the namespace for Dynamo if it does not already exist
kubectl create namespace ${DYNAMO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -
# Create PVCs
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
EOF
```
EFS is elastic: the `storage` value in the PVC is required by Kubernetes but does not limit the actual storage. EFS grows and shrinks automatically.
## Verify
Confirm the PVCs are bound:
```bash
kubectl get pvc -n ${DYNAMO_NAMESPACE}
```
You should see output similar to:
```text
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
compilation-cache Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 5Gi RWX efs-sc-dynamic 41s
model-cache Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 5Gi RWX efs-sc-dynamic 42s
perf-cache Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 5Gi RWX efs-sc-dynamic 41s
```
## Cleanup
To delete the EFS resources when no longer needed:
```bash
# Delete the Kubernetes resources
kubectl delete pvc model-cache compilation-cache perf-cache -n ${DYNAMO_NAMESPACE}
kubectl delete storageclass efs-sc-dynamic
# Delete mount targets
for MT_ID in $(aws efs describe-mount-targets --file-system-id $EFS_FS_ID --region $AWS_REGION --query "MountTargets[*].MountTargetId" --output text); do
aws efs delete-mount-target --mount-target-id $MT_ID --region $AWS_REGION
done
# Delete the EFS file system
aws efs delete-file-system --file-system-id $EFS_FS_ID --region $AWS_REGION
# Delete the security group
aws ec2 delete-security-group --group-id $EFS_SG_ID --region $AWS_REGION
```
# Amazon Elastic Container Service (ECS)
# Dynamo Deployment of vLLM Example on AWS ECS
## 1. EC2 Cluster Setup (for vLLM workloads)
1. Go to the AWS ECS console, open the **Clusters** tab, and click **Create cluster**.
2. Enter `dynamo-GPU` as the cluster name and choose **AWS EC2 instances** as the infrastructure. This option creates a cluster with EC2 instances to deploy containers.
3. Choose the ECS-optimized GPU AMI `Amazon Linux 2 (GPU)` (Amazon ECS–optimized), which includes NVIDIA drivers and the Docker GPU runtime out of the box.
4. Choose `g6e.2xlarge` as the **EC2 instance type** and add an **SSH key pair** so you can log in to the instance for debugging. Disaggregated serving needs at least 2 GPUs, so choose `g6e.12xlarge` (4 GPUs) if you want to test it.
5. Set **Root EBS volume size** to `200`.
6. For networking, use the default settings. Make sure the **security group** has:
   - an inbound rule that allows "All traffic" from this security group.
   - inbound rules for ports 22 and 8000, so that you can SSH into the instance for debugging and reach the serving endpoint.
7. Select `Turn on` for the **Auto-assign public IP** option.
8. Click **Create**. The cluster is deployed through CloudFormation.
## 2. Fargate Cluster Setup (for ETCD/NATS services)
1. Go to AWS ECS console, **Clusters** tab and click on **Create cluster**
2. Input the cluster name as `dynamo-fargate`
3. Choose **AWS Fargate (serverless)** as the infrastructure
4. For networking, use the same VPC and subnets as the EC2 cluster to ensure connectivity between services
5. For the security group, use the same security group as the EC2 cluster. This automatically allows communication between all services.
6. Ensure outbound rules allow all traffic (default setting) so the Fargate tasks can download container images and communicate externally
7. Click on **Create** to deploy the Fargate cluster
## 3. ETCD/NATS Task Definitions Setup
Add a task for ETCD and NATS services to run on Fargate. A sample task definition JSON is attached.
### 3.1 Create the ecsTaskExecutionRole (Required)
Before creating the task definitions, you need to create the `ecsTaskExecutionRole` IAM role. This role allows ECS to pull container images from registries and write logs to CloudWatch on your behalf.
If you create task definitions through the AWS Console's step-by-step wizard, this role is created automatically. However, when importing task definitions from JSON (as recommended in this guide), you must create this role manually.
Follow the [AWS documentation on creating the task execution IAM role](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html#create-task-execution-role) to create a role named `ecsTaskExecutionRole` with the `AmazonECSTaskExecutionRolePolicy` policy attached.
Depending on the task definition, you may need to add Amazon CloudWatch and AWS Secrets Manager permissions to the `ecsTaskExecutionRole`. For details, see the [Amazon CloudWatch Logs permissions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/permissions-reference-cwl.html) and the [AWS Secrets Manager authentication and access control guide](https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access.html#auth-and-access_secrets).
The role ARN has the form `arn:aws:iam::<account-id>:role/ecsTaskExecutionRole`. Make sure to replace the account ID placeholder in any task definition JSON files with your actual AWS account ID.
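To avoid typing the ARN by hand, you can assemble it from your account ID. The sketch below follows the standard IAM role ARN layout; `task_execution_role_arn` is a hypothetical helper name, and the 12-digit check reflects the usual AWS account ID format.

```bash
# Hypothetical helper: builds the ecsTaskExecutionRole ARN for an account ID.
task_execution_role_arn() {
  local account_id=$1
  if [[ ! "$account_id" =~ ^[0-9]{12}$ ]]; then
    echo "expected a 12-digit AWS account ID, got '$account_id'" >&2
    return 1
  fi
  echo "arn:aws:iam::${account_id}:role/ecsTaskExecutionRole"
}

# Example (not executed here): look up the account ID with the AWS CLI.
#   task_execution_role_arn "$(aws sts get-caller-identity --query Account --output text)"
```

The printed value is what goes into the execution role field of each task definition JSON.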
### 3.2 Task Definition Configuration
1. ETCD container
- Use `etcd` as the container name
- Set the image URL to `bitnamilegacy/etcd` and select **Yes** for Essential container
- Container port
|Container port|Protocol|Port name| App protocol|
|-|-|-|-|
|2379|TCP|2379|HTTP|
|2380|TCP|2380|HTTP|
- Add an environment variable with key `ALLOW_NONE_AUTHENTICATION` and value `YES`
2. NATS container
- Use `nats` as the container name
- Set the image URL to `nats` and select **Yes** for Essential container
- Container port
|Container port|Protocol|Port name| App protocol|
|-|-|-|-|
|4222|TCP|4222|HTTP|
|6222|TCP|6222|HTTP|
|8222|TCP|8222|HTTP|
- Under Docker configuration, add `-js, --trace` in **Command**
## 4. vLLM Task Definitions Setup
Ensure you have created the `ecsTaskExecutionRole` as described in section 3.1 before creating these task definitions.
1. Dynamo vLLM Frontend and Decoding Worker Task
This task creates the vLLM frontend, processors, routers, and a decoding worker.
Follow the steps below to create this task:
- Set container name as `dynamo-frontend` and use the prebuilt [Dynamo container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime).
- Choose `Amazon EC2 instances` as the **Launch type** with **Task size** `2 vCPU` and `40 GB` memory
- Choose `host` as the Network mode.
- Use `dynamo-vLLM-frontend` as the container name
- Add your image URL (you can use the prebuilt [Dynamo container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)) and select **Yes** for Essential container. The URL can be an AWS ECR URL or an NVIDIA NGC URL. If you use an NGC URL, also choose **Private registry authentication** and add your Secrets Manager ARN or name.
- Container port
|Container port|Protocol|Port name| App protocol|
|-|-|-|-|
|8000|TCP|8000|HTTP|
- Use `1` GPU for **Resource allocation limits**
- Set the environment variables as below. `IP_ADDRESS` is a placeholder that you override at deployment time (section 5).
|Key|Value type|Value|
|-|-|-|
|ETCD_ENDPOINTS|Value|http://IP_ADDRESS:2379|
|NATS_SERVER|Value|nats://IP_ADDRESS:4222|
- Docker configuration
Add `sh,-c` in **Entry point** and `cd examples/backends/vllm && python -m dynamo.frontend --router-mode kv & python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager` in **Command**
2. Dynamo vLLM PrefillWorker Task
Create the PrefillWorker task the same way as the frontend and decoding worker task, except for the following changes:
- Set container name as `dynamo-prefill`
- No container port mapping
- Docker configuration with command `cd examples/backends/vllm && python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --disaggregation-mode prefill`
## 5. Task Deployment
You can either create a service or run the task directly from the task definition.
1. ETCD/NATS Task
- Choose the Fargate cluster (`dynamo-fargate`) for **Existing cluster** created in step 2.
- Select **Launch type** as `FARGATE`
- In the **Networking** section, select the same VPC and subnets used for the EC2 cluster
- For **Security group**, select the same security group used by the EC2 cluster
- Verify that outbound rules allow all traffic for downloading images and external communication
- Wait for this deployment to finish, and get the **Private IP** of this task.
2. Dynamo Frontend and Decoding Worker Task
- Choose the EC2 cluster (`dynamo-GPU`) for **Existing cluster** created in step 1.
- In the **Container Overrides**, use the IP for ETCD/NATS task for the `ETCD_ENDPOINTS` and `NATS_SERVER` values.
- After the deployment, an aggregated serving endpoint is created and you can test it with scripts in step 6.
3. Dynamo PrefillWorker Task
- For disaggregated serving, you can deploy a separate prefill worker on another GPU. Choose the EC2 cluster (`dynamo-GPU`) for **Existing cluster** created in step 1, with at least 2 GPUs (`g6e.12xlarge`, for example).
- In the **Container Overrides**, use the IP for ETCD/NATS task for the `ETCD_ENDPOINTS` and `NATS_SERVER` values.
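The container override in the steps above can also be expressed as data. Here is a minimal sketch that builds the override structure from the private IP of the ETCD/NATS task. It assumes the ECS `containerOverrides` shape (a container `name` plus an `environment` list of name/value pairs); the helper name and the example IP `10.0.1.23` are hypothetical.

```python
import ipaddress
import json


def dynamo_overrides(container_name: str, infra_ip: str) -> dict:
    """Build task overrides pointing a Dynamo container at the ETCD/NATS task."""
    # Fail fast on typos; raises ValueError for anything that is not an IP address.
    ipaddress.ip_address(infra_ip)
    return {
        "containerOverrides": [
            {
                "name": container_name,
                "environment": [
                    {"name": "ETCD_ENDPOINTS", "value": f"http://{infra_ip}:2379"},
                    {"name": "NATS_SERVER", "value": f"nats://{infra_ip}:4222"},
                ],
            }
        ]
    }


print(json.dumps(dynamo_overrides("dynamo-prefill", "10.0.1.23"), indent=2))
```

The same structure applies to both the frontend and the prefill worker tasks; only the container name changes.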
## 6. Testing
Find the public IP of the Dynamo frontend task on the task page, then run the following commands to query the endpoint.
```sh
export DYNAMO_IP_ADDRESS=TASK_PUBLIC_IP_ADDRESS
curl http://$DYNAMO_IP_ADDRESS:8000/v1/models
curl http://$DYNAMO_IP_ADDRESS:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream":false,
"max_tokens": 30
}'
```
You should be able to see the responses from the hosted endpoint.
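If you prefer to script the check, the sketch below builds the same chat completion request with the Python standard library. The helper names are hypothetical, and `first_reply` assumes the response follows the OpenAI chat completion shape (`choices[0].message.content`), which is an assumption about the endpoint's output rather than something shown in this guide.

```python
import json
import urllib.request


def build_chat_request(host: str, prompt: str, model: str = "Qwen/Qwen3-0.6B",
                       max_tokens: int = 30) -> urllib.request.Request:
    """Build a POST request equivalent to the curl call above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"http://{host}:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: bytes) -> str:
    """Extract the assistant text, assuming an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Usage (needs a reachable endpoint, so it is left as a comment):
#   req = build_chat_request("TASK_PUBLIC_IP_ADDRESS", "Hello")
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(first_reply(resp.read()))
```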
# Azure Kubernetes Service (AKS)
# Dynamo on AKS
This guide covers setting up an AKS cluster with GPU nodes and deploying Dynamo.
## Prerequisites
- An active Azure subscription with sufficient GPU VM quota
- [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) (`az`) installed and logged in
- [kubectl](https://kubernetes.io/docs/tasks/tools/) installed
- [Helm](https://helm.sh/docs/intro/install/) v3.0+ installed
## Step 1: Create a Resource Group and Cluster
```bash
az group create \
  --name <resource-group> \
  --location <location>
```
```bash
az aks create \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --node-count 1 \
  --generate-ssh-keys
```
Then get credentials:
```bash
az aks get-credentials \
  --resource-group <resource-group> \
  --name <cluster-name>
```
## Step 2: Add a GPU Node Pool
Add a GPU-enabled node pool with driver installation skipped. The `--skip-gpu-driver-install` flag prevents AKS from managing GPU drivers — the NVIDIA GPU Operator (Step 3) will handle that instead.
```bash
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 2 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --skip-gpu-driver-install
```
For RDMA-capable workloads (disaggregated inference), use ND-series VMs such as `Standard_ND96asr_v4` or `Standard_ND96isr_H100_v5`. See the [RDMA / InfiniBand guide](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) for the additional setup required on those nodes.
For a full list of GPU VM sizes, see [GPU-optimized VM sizes](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu).
## Step 3: Install the NVIDIA GPU Operator
The GPU Operator manages NVIDIA drivers, container toolkit, device plugin, and monitoring on GPU nodes.
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```
```bash
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace
```
Verify the pods are running:
```bash
kubectl get pods -n gpu-operator
```
Expected output (abbreviated):
```text
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-xxxxx                1/1     Running     0          2m
gpu-operator-xxxxx                         1/1     Running     0          2m
nvidia-container-toolkit-daemonset-xxxxx   1/1     Running     0          2m
nvidia-cuda-validator-xxxxx                0/1     Completed   0          1m
nvidia-device-plugin-daemonset-xxxxx       1/1     Running     0          2m
nvidia-driver-daemonset-xxxxx              1/1     Running     0          2m
```
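To check the pod list programmatically rather than by eye, a small parser along these lines could flag pods that are neither `Running` nor `Completed`. This is a sketch under the assumption that the input is the plain-text table printed by `kubectl get pods`; the function name is hypothetical.

```python
def unhealthy_pods(kubectl_output: str, ok_statuses=("Running", "Completed")) -> list:
    """Return names of pods whose STATUS column is not in ok_statuses."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return []
    # Locate columns from the header so the parser works with or without NAMESPACE.
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    bad = []
    for ln in lines[1:]:
        cols = ln.split()
        if cols[status_idx] not in ok_statuses:
            bad.append(cols[name_idx])
    return bad


sample = """\
NAME                            READY   STATUS             RESTARTS   AGE
gpu-operator-xxxxx              1/1     Running            0          2m
nvidia-cuda-validator-xxxxx     0/1     Completed          0          1m
nvidia-driver-daemonset-xxxxx   0/1     CrashLoopBackOff   3          2m
"""
print(unhealthy_pods(sample))
```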
If you need RDMA / InfiniBand for disaggregated inference, **do not install the GPU Operator yet** — the RDMA setup requires different Helm values. See [RDMA / InfiniBand](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) for the full setup, which includes the correct GPU Operator install command.
## Step 4: Install Dynamo
Follow the [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) to install the Dynamo Platform and deploy your first model.
## Additional Guides
### [RDMA / InfiniBand](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band)
Required for disaggregated inference in production. Without RDMA, KV cache transfers between prefill and decode workers fall back to TCP with severe latency degradation (~98s TTFT vs ~200–500ms with RDMA). ND-series VMs (e.g., `Standard_ND96asr_v4`, `Standard_ND96isr_H100_v5`) include Mellanox ConnectX InfiniBand NICs but require additional setup beyond the GPU Operator: the NVIDIA Network Operator, a NicClusterPolicy for MOFED drivers, an `ib-node-config` DaemonSet to configure kernel modules and memlock limits, and an RDMA Shared Device Plugin to expose the NICs to pods.
### [Storage for Model Caching](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-storage)
Prevents each pod from independently downloading model weights on startup. Without shared storage, large models take hours to load per pod and will hit HuggingFace rate limits at scale. Covers Azure Managed Lustre, Azure Files, Azure Disk, and Local CSI options with per-cache-type recommendations (model cache, compilation cache, performance cache).
### [Azure Lustre CSI Driver](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver)
The recommended storage for large multi-node models requiring high-throughput shared access. Azure Managed Lustre is not installed by default — this guide covers installing and configuring the Lustre CSI driver before you can use it as a PVC storage class.
### [Spot VMs](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/spot-v-ms)
Significantly reduces GPU compute costs by running on preemptible Spot VM node pools. AKS automatically taints Spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so Dynamo components need explicit tolerations. The Dynamo Helm chart includes a pre-built `values-aks-spot.yaml` that handles this.
## Clean Up Resources
```bash
# Delete all Dynamo Graph Deployments
kubectl delete dynamographdeployments.nvidia.com --all --all-namespaces
# Uninstall Dynamo Platform
export NAMESPACE="dynamo-system"
helm uninstall dynamo-platform -n $NAMESPACE
# If running Dynamo < 1.0 with a separate CRDs chart:
# helm uninstall dynamo-crds -n $NAMESPACE
```
If you want to delete the GPU Operator, follow the [Uninstalling the NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/uninstall.html) guide.
If you want to delete the entire AKS cluster, follow the [Delete an AKS cluster](https://learn.microsoft.com/en-us/azure/aks/delete-cluster) guide.
# RDMA / InfiniBand on AKS
This guide covers setting up RDMA over InfiniBand on AKS for high-performance disaggregated inference with Dynamo. RDMA enables direct memory access between GPUs across nodes, bypassing CPU and kernel overhead — critical for low-latency KV cache transfer between prefill and decode workers.
Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). See the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for details on transport options and performance expectations.
The Network Operator and NicClusterPolicy steps in this guide are based on the [Azure AKS RDMA InfiniBand](https://github.com/Azure/aks-rdma-infiniband) repository. That project is open-source and not covered by Microsoft Azure support — file issues on the GitHub repository.
## Prerequisites
**AKS cluster with RDMA-capable nodes:**
- At least **2 GPU nodes** to enable cross-node RDMA communication
- **ND-series VMs** with Mellanox ConnectX InfiniBand NICs (e.g., `Standard_ND96asr_v4`, `Standard_ND96isr_H100_v5`)
- **Ubuntu OS** on the node pool (required for NVIDIA driver compatibility)
- GPU driver installation **skipped** on the node pool (`--skip-gpu-driver-install`) — see [GPU Node Pool Setup](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-setup#step-2-add-a-gpu-node-pool)
**Register the AKS InfiniBand feature** to ensure nodes land on the same physical InfiniBand network:
```bash
az feature register --namespace Microsoft.ContainerService --name AKSInfinibandSupport
az feature show --namespace Microsoft.ContainerService --name AKSInfinibandSupport --query "properties.state"
# Wait until "Registered"
az provider register --namespace Microsoft.ContainerService
```
## Overview
The RDMA setup involves five components installed in this order:
1. **Network Operator** — Deploys the Mellanox OFED driver and Node Feature Discovery
2. **NicClusterPolicy** — Configures the OFED driver on InfiniBand-capable nodes
3. **IB Node Configuration** — Loads InfiniBand kernel modules and sets memlock limits
4. **RDMA Shared Device Plugin** — Exposes InfiniBand NICs to pods as a Kubernetes resource
5. **GPU Operator** — Installed with RDMA-specific settings (NFD disabled, GPUDirect RDMA enabled, host MOFED)
## Step 1: Install the NVIDIA Network Operator
The [NVIDIA Network Operator](https://docs.nvidia.com/networking/display/kubernetes25100/index.html) automates deployment of networking components including Mellanox OFED drivers for InfiniBand support.
Create the namespace and label it for privileged workloads:
```bash
kubectl create ns network-operator
kubectl label --overwrite ns network-operator pod-security.kubernetes.io/enforce=privileged
```
Add the NVIDIA Helm repo (if not already added):
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```
Create a `network-operator-values.yaml`:
```yaml
nfd:
  deployNodeFeatureRules: false
```
Install the Network Operator:
```bash
helm install network-operator nvidia/network-operator \
--namespace network-operator \
-f network-operator-values.yaml \
--version v26.1.0
```
Verify the Network Operator pod is running:
```bash
kubectl get pods -n network-operator
```
## Step 2: Apply the NicClusterPolicy
The NicClusterPolicy configures the OFED driver (Mellanox OFED / DOCA driver) as a DaemonSet on all InfiniBand-capable nodes.
Apply the base NicClusterPolicy using kustomize:
```bash
kubectl apply -k https://github.com/Azure/aks-rdma-infiniband/configs/nicclusterpolicy/base
```
This targets nodes with Mellanox NICs (`feature.node.kubernetes.io/pci-15b3.present`) and installs the DOCA/OFED driver as a DaemonSet.
Wait for the MOFED driver DaemonSet to finish installing on all nodes (this may take several minutes):
```bash
kubectl get pods -n network-operator -l app=mofed-ubuntu22.04-ds -w
# Wait until all pods show Running
```
## Step 3: Deploy the IB Node Configuration DaemonSet
This DaemonSet loads InfiniBand kernel modules and sets unlimited memlock limits on GPU nodes. This is required for RDMA to function — without it, InfiniBand device files may not exist and memory pinning for RDMA transfers will fail.
This step is not covered in the Azure RDMA repo but is required for a working setup. The DaemonSet loads `ib_umad` and `rdma_ucm` kernel modules, sets unlimited memlock limits for containerd and kubelet, and restarts both services to apply the changes.
Create `ib-node-config.yaml`:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ib-node-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ib-node-config
  template:
    metadata:
      labels:
        app: ib-node-config
    spec:
      hostPID: true
      nodeSelector:
        kubernetes.azure.com/agentpool: <gpu-node-pool-name>
      tolerations:
        - operator: Exists
      initContainers:
        - name: ib-setup
          image: busybox:1.36
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            - |
              echo "=== IB Node Configuration ==="
              nsenter -t 1 -m -u -i -n -- modprobe ib_umad
              nsenter -t 1 -m -u -i -n -- modprobe rdma_ucm 2>/dev/null || true
              nsenter -t 1 -m -u -i -n -- modprobe ib_ucm 2>/dev/null || true
              nsenter -t 1 -m -u -i -n -- lsmod | grep ib_umad && echo "OK: ib_umad" || echo "FAIL: ib_umad"
              nsenter -t 1 -m -u -i -n -- ls /dev/infiniband/rdma_cm && echo "OK: rdma_cm device" || echo "WARN: no rdma_cm device"
              nsenter -t 1 -m -u -i -n -- sh -c 'printf "ib_umad\nrdma_ucm\n" > /etc/modules-load.d/ib-umad.conf'
              nsenter -t 1 -m -u -i -n -- sh -c 'printf "* - memlock unlimited\nroot - memlock unlimited\n" > /etc/security/limits.d/99-ib-memlock.conf'
              nsenter -t 1 -m -u -i -n -- sh -c 'mkdir -p /etc/systemd/system/containerd.service.d && printf "[Service]\nLimitMEMLOCK=infinity\n" > /etc/systemd/system/containerd.service.d/memlock.conf'
              nsenter -t 1 -m -u -i -n -- sh -c 'mkdir -p /etc/systemd/system/kubelet.service.d && printf "[Service]\nLimitMEMLOCK=infinity\n" > /etc/systemd/system/kubelet.service.d/memlock.conf'
              nsenter -t 1 -m -u -i -n -- systemctl daemon-reload
              nsenter -t 1 -m -u -i -n -- systemctl restart containerd
              nsenter -t 1 -m -u -i -n -- systemctl restart kubelet
              sleep 10
              nsenter -t 1 -m -u -i -n -- systemctl is-active containerd && echo "OK: containerd active" || echo "FAIL: containerd"
              nsenter -t 1 -m -u -i -n -- systemctl is-active kubelet && echo "OK: kubelet active" || echo "FAIL: kubelet"
              echo "=== Setup Complete ==="
      containers:
        - name: keepalive
          image: busybox:1.36
          command: ["sh", "-c", "echo IB node config active; sleep infinity"]
```
Set the `kubernetes.azure.com/agentpool` value under `nodeSelector` to your GPU node pool name (e.g., `ndh100pool`).
```bash
kubectl apply -f ib-node-config.yaml
```
Wait for all pods to complete initialization:
```bash
kubectl get pods -n kube-system -l app=ib-node-config -w
```
**What this does:**
- **`ib_umad`** — InfiniBand user-space management datagram module, required for RDMA device access
- **`rdma_ucm`** — RDMA user-space connection manager
- **Memlock limits** — RDMA requires pinning memory pages; without unlimited memlock, large transfers fail
- **Service restarts** — containerd and kubelet must be restarted to pick up the new memlock limits
## Step 4: Deploy the RDMA Shared Device Plugin
The RDMA Shared Device Plugin exposes InfiniBand NICs as a Kubernetes extended resource so pods can request RDMA access.
Create the ConfigMap with the device plugin configuration and save it as `rdma-configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [
        {
          "resourceName": "hca_shared_devices_a",
          "rdmaHcaMax": 1000,
          "selectors": {
            "vendors": ["15b3"],
            "drivers": ["mlx5_core"]
          }
        }
      ]
    }
```
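Because the plugin configuration is JSON embedded in YAML, a quick sanity check before applying it can catch syntax slips. The sketch below parses the same `config.json` content and derives the extended resource names pods will request. The `rdma/` prefix is taken from the resource name used in the verification and pod spec sections of this guide; the helper name is hypothetical.

```python
import json

# Same content as the config.json key in the ConfigMap above.
CONFIG_JSON = """
{
  "periodicUpdateInterval": 300,
  "configList": [
    {
      "resourceName": "hca_shared_devices_a",
      "rdmaHcaMax": 1000,
      "selectors": {"vendors": ["15b3"], "drivers": ["mlx5_core"]}
    }
  ]
}
"""


def extended_resource_names(config_text: str, prefix: str = "rdma") -> list:
    """Parse the plugin config and return '<prefix>/<resourceName>' for each entry."""
    config = json.loads(config_text)  # raises json.JSONDecodeError on malformed input
    return [f"{prefix}/{entry['resourceName']}" for entry in config["configList"]]


print(extended_resource_names(CONFIG_JSON))
```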
Create the DaemonSet and save it as `rdma-shared-dp-ds.yaml`:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      hostNetwork: true
      nodeSelector:
        kubernetes.azure.com/agentpool: <gpu-node-pool-name>
      tolerations:
        - operator: Exists
      containers:
        - name: k8s-rdma-shared-dp-ds
          image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:v1.5.3
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: plugins-registry
              mountPath: /var/lib/kubelet/plugins_registry
            - name: config
              mountPath: /k8s-rdma-shared-dev-plugin
            - name: devs
              mountPath: /dev/
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: plugins-registry
          hostPath:
            path: /var/lib/kubelet/plugins_registry
        - name: config
          configMap:
            name: rdma-devices
        - name: devs
          hostPath:
            path: /dev/
```
Set the `kubernetes.azure.com/agentpool` value under `nodeSelector` to your GPU node pool name (e.g., `ndh100pool`), then apply both manifests:
```bash
kubectl apply -f rdma-configmap.yaml
kubectl apply -f rdma-shared-dp-ds.yaml
```
Wait for the device plugin pods to start:
```bash
kubectl get pods -n kube-system -l name=rdma-shared-dp-ds -w
```
## Step 5: Install the GPU Operator (RDMA-Enabled)
Install the GPU Operator with RDMA-specific values:
```bash
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace \
--set nfd.enabled=false \
--set driver.rdma.enabled=true \
--set driver.rdma.useHostMofed=true
```
Key differences from a standard GPU Operator install:
- `nfd.enabled=false` — Network Operator already deploys Node Feature Discovery; running two NFD instances causes conflicts
- `driver.rdma.enabled=true` — enables GPUDirect RDMA support; causes the driver daemonset to build and load `nvidia_peermem`
- `driver.rdma.useHostMofed=true` — tells the GPU Operator to use the MOFED driver installed by the Network Operator (Step 1) rather than its own; required when the Network Operator manages OFED
Wait for the GPU Operator pods to reach `Running` state:
```bash
kubectl get pods -n gpu-operator -w
```
## Verification
**1. Check that MOFED driver pods are running on all InfiniBand nodes:**
```bash
kubectl get pods -n network-operator -l app=mofed-ubuntu22.04-ds
```
**2. Check that IB node config pods completed initialization:**
```bash
kubectl get pods -n kube-system -l app=ib-node-config
```
**3. Check that the RDMA Shared Device Plugin is running:**
```bash
kubectl get pods -n kube-system -l name=rdma-shared-dp-ds
```
**4. Verify RDMA resources are available on GPU nodes:**
```bash
kubectl get nodes -o json | jq '.items[] | select(.status.allocatable["rdma/hca_shared_devices_a"] != null) | {name: .metadata.name, rdma: .status.allocatable["rdma/hca_shared_devices_a"], gpu: .status.allocatable["nvidia.com/gpu"]}'
```
Each InfiniBand-capable node should report `rdma/hca_shared_devices_a` resources (typically `1k` based on `rdmaHcaMax: 1000`).
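Where `jq` is not available, the same filter can be sketched in Python. The function below takes the JSON text produced by `kubectl get nodes -o json` and returns the nodes that advertise the RDMA resource; the function name and the sample node data are hypothetical.

```python
import json

RDMA_RESOURCE = "rdma/hca_shared_devices_a"


def rdma_nodes(nodes_json: str) -> list:
    """Return name, RDMA, and GPU allocatable values for nodes exposing the RDMA resource."""
    result = []
    for node in json.loads(nodes_json)["items"]:
        allocatable = node["status"]["allocatable"]
        if allocatable.get(RDMA_RESOURCE) is not None:
            result.append({
                "name": node["metadata"]["name"],
                "rdma": allocatable[RDMA_RESOURCE],
                "gpu": allocatable.get("nvidia.com/gpu"),
            })
    return result


# Hypothetical two-node cluster: one GPU node with RDMA, one CPU-only node.
sample_nodes = json.dumps({"items": [
    {"metadata": {"name": "gpu-0"},
     "status": {"allocatable": {RDMA_RESOURCE: "1k", "nvidia.com/gpu": "8"}}},
    {"metadata": {"name": "cpu-0"},
     "status": {"allocatable": {"cpu": "4"}}},
]})
print(rdma_nodes(sample_nodes))
```

In practice you would feed it real output, for example by reading `sys.stdin` after piping `kubectl get nodes -o json` into the script.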
**5. Check GPU Operator pods are healthy:**
```bash
kubectl get pods -n gpu-operator
```
## Pod Resource Requests
Dynamo pods that need RDMA access should request the `rdma/hca_shared_devices_a` resource. When using the Dynamo operator with DGDR, this is handled automatically for disaggregated deployments on RDMA-capable clusters.
For manual DGD specs, add the resource request to your container:
```yaml
resources:
  limits:
    nvidia.com/gpu: 8
    rdma/hca_shared_devices_a: 1
```
**`IPC_LOCK` capability is not required** when this setup is followed. `IPC_LOCK` is historically needed for RDMA because `ibv_reg_mr` calls `mlock()` to pin memory pages — but `mlock()` only needs the capability if the memlock rlimit would otherwise block it. The `ib-node-config` DaemonSet (Step 3) sets `LimitMEMLOCK=infinity` on the kubelet and containerd systemd units, so all pods on GPU nodes inherit an unlimited memlock limit and RDMA memory pinning works without any capability in the pod spec.
If you see `ENOMEM` errors from `ibv_reg_mr` and `ib-node-config` is running, verify that containerd and kubelet were restarted after the limits were applied (check the init container logs). If `ib-node-config` is not deployed, add `IPC_LOCK` to your pod's `securityContext.capabilities.add`.
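One way to confirm that a pod actually inherited the unlimited memlock limit is to read the limit from inside the container. This is a minimal sketch using Python's standard `resource` module (Unix only), assuming a Python interpreter is available in the container image; the function name is hypothetical.

```python
import resource


def memlock_is_unlimited(limits=None) -> bool:
    """Return True when both soft and hard RLIMIT_MEMLOCK are unlimited.

    `limits` can be passed as a (soft, hard) tuple for testing; by default the
    limits of the current process are read.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, hard = limits
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


print("memlock unlimited:", memlock_is_unlimited())
```

If this prints `False` inside a pod on a GPU node, the systemd overrides from Step 3 most likely did not take effect for the container runtime.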
## Troubleshooting
**MOFED pods stuck in `Init` or `CrashLoopBackOff`:**
- Verify nodes are Ubuntu OS: `kubectl get nodes -o custom-columns="NAME:.metadata.name,OS:.status.nodeInfo.osImage"`
- Check MOFED pod logs: `kubectl logs -n network-operator <mofed-pod-name> -c mofed-container`
**`rdma/hca_shared_devices_a` not showing on nodes:**
- Check the RDMA device plugin pods are running: `kubectl get pods -n kube-system -l name=rdma-shared-dp-ds`
- Check device plugin logs: `kubectl logs -n kube-system <rdma-device-plugin-pod-name>`
- Verify the `rdma-devices` ConfigMap exists: `kubectl get configmap rdma-devices -n kube-system`
**IB kernel modules not loading:**
- Check the ib-node-config init container logs: `kubectl logs -n kube-system <ib-node-config-pod-name> -c ib-setup`
- Verify the MOFED driver is installed first (Step 2 must complete before Step 3)
**Memlock errors during RDMA transfers (`ENOMEM` from `ibv_reg_mr`):**
- Verify the ib-node-config DaemonSet has run on all GPU nodes and init containers completed
- Check that containerd and kubelet were restarted: `kubectl logs -n kube-system <ib-node-config-pod-name> -c ib-setup`
- Confirm the limits took effect on the kubelet process:
```bash
# On a GPU node (via kubectl debug or ssh)
cat /proc/$(pgrep -x kubelet)/limits | grep -i memlock
# Should show: Max locked memory unlimited unlimited
```
- If limits are not unlimited, the ib-node-config DaemonSet needs to be re-applied and services restarted
**GPUDirect RDMA not working — `nvidia_peermem` module missing:**
ND-series nodes (including ND H100 v5) do **not** ship `nvidia_peermem` in the host OS. This module is required for InfiniBand adapters to directly read/write GPU memory — without it, RDMA transfers fall back to staging through host memory.
Verify whether the module is loaded:
```bash
# Check on a GPU node via a privileged pod or node shell
lsmod | grep nvidia_peermem
# If empty, the module is not loaded
modinfo nvidia_peermem
# If "Module not found", it is also not present in the host's /lib/modules
```
With the GPU Operator managing drivers (`driver.rdma.enabled=true`), `nvidia_peermem` is built and loaded by the `nvidia-driver-daemonset` — it lives in the driver pod's `/lib/modules`, not the host's native kernel modules. Verify the driver daemonset is loading it:
```bash
kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}') -- lsmod | grep nvidia_peermem
```
If this returns empty, ensure `driver.rdma.enabled=true` and `driver.rdma.useHostMofed=true` are set in your GPU Operator Helm values (see [Step 5](#step-5-install-the-gpu-operator-rdma-enabled) above), then restart the driver daemonset:
```bash
kubectl rollout restart daemonset/nvidia-driver-daemonset -n gpu-operator
```
The [nvidia-peermem-reloader](https://github.com/Azure/aks-rdma-infiniband/tree/main/configs/nvidia-peermem-reloader) DaemonSet from the Azure RDMA repo is designed for clusters using **AKS-managed GPU drivers** (without the GPU Operator). It simply runs `modprobe nvidia-peermem` — which will fail on ND H100 v5 nodes because the host OS doesn't include the module. When using the GPU Operator (recommended), the operator handles `nvidia_peermem` automatically via `driver.rdma.enabled=true`.
## See Also
- [Azure AKS RDMA InfiniBand — GitHub](https://github.com/Azure/aks-rdma-infiniband)
- [Set up InfiniBand on Azure HPC VMs — Microsoft Learn](https://learn.microsoft.com/en-us/azure/virtual-machines/setup-infiniband)
- [Enable InfiniBand VM extension — Microsoft Learn](https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/enable-infiniband)
- [NVIDIA Network Operator Documentation](https://docs.nvidia.com/networking/display/kubernetes25100/index.html)
- [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) — transport options, UCX configuration, performance expectations
# Storage for Model Caching on AKS
For implementing tiered storage on AKS, you can take advantage of the different storage options available in Azure. This guide covers choosing the right storage for each Dynamo cache type and configuring PVCs.
## Available Storage Options
| Storage Option | Performance | Best For |
|----------------|-------------|----------|
| Local CSI (Ephemeral Disk) | Very high | Fast model caching, warm restarts |
| [Azure Managed Lustre](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) | Extremely high | Large multi-node models, shared cache |
| [Azure Disk (Managed Disk)](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-disk#create-azure-disk-pvs-using-built-in-storage-classes) | High | Persistent single-writer model cache |
| [Azure Files](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-files#use-a-persistent-volume-for-storage) | Medium | Shared small/medium models |
| [Azure Blob (via Fuse or init)](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-blob#create-a-pvc-using-built-in-storage-class) | Low-Medium | Cold model storage, bootstrap downloads |
Azure Managed Lustre and Local CSI (ephemeral disk) are not installed by default in AKS and require additional setup before use. Azure Disk, Azure Files, and Azure Blob CSI drivers are available out of the box. See the [Azure Lustre CSI Driver](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) guide for Lustre setup, or the [AKS CSI storage options documentation](https://learn.microsoft.com/azure/aks/csi-storage-drivers) for a full overview of built-in drivers.
## Recommendations by Cache Type
- **Model Cache** — raw model artifacts, configuration files, tokenizers, etc.
- Persistence: Required to avoid repeated downloads and reduce cold-start latency.
- Recommended storage: Azure Managed Lustre (shared, high throughput) or Azure Disk (single-replica, persistent).
- **Compilation Cache** — backend-specific compiled artifacts (e.g., TensorRT engines).
- Persistence: Optional.
- Recommended storage: Local CSI (fast, node-local) or Azure Disk (persistent when GPU configuration is fixed).
- **Performance Cache** — runtime tuning and profiling data.
- Persistence: Not required.
- Recommended storage: Local CSI (or other ephemeral storage).
## Check Available Storage Classes
List the storage classes available in your AKS cluster:
```bash
kubectl get storageclass
NAME                           PROVISIONER                 RECLAIMPOLICY
azureblob-csi                  blob.csi.azure.com          Delete
azurefile                      file.csi.azure.com          Delete
azurefile-csi                  file.csi.azure.com          Delete
azurefile-csi-premium          file.csi.azure.com          Delete
azurefile-premium              file.csi.azure.com          Delete
default                        disk.csi.azure.com          Delete
managed                        disk.csi.azure.com          Delete
managed-csi                    disk.csi.azure.com          Delete
managed-csi-premium            disk.csi.azure.com          Delete
managed-premium                disk.csi.azure.com          Delete
sc.azurelustre.csi.azure.com   azurelustre.csi.azure.com   Retain
```
## Example PVC Configuration
In the `cache.yaml` in the different [recipes](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/recipes), you can set the `storageClassName` to a storage option available in your AKS cluster:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: "sc.azurelustre.csi.azure.com"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "azurefile-csi"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "local-ephemeral"
```
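If you maintain several cache PVCs, generating them from one table keeps names, sizes, and storage classes in sync. The sketch below renders the three claims above as a JSON `List`, which `kubectl apply -f` also accepts. The helper name is hypothetical and the values simply mirror the YAML example.

```python
import json


def pvc_manifest(name: str, storage_class: str, size: str,
                 access_mode: str = "ReadWriteMany") -> dict:
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
        },
    }


# (name, storage class, size), taken from the example above.
caches = [
    ("model-cache", "sc.azurelustre.csi.azure.com", "100Gi"),
    ("compilation-cache", "azurefile-csi", "50Gi"),
    ("perf-cache", "local-ephemeral", "50Gi"),
]

manifest_list = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [pvc_manifest(name, sc, size) for name, sc, size in caches],
}
print(json.dumps(manifest_list, indent=2))
```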
## See Also
- [Azure Lustre CSI Driver](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) — Full setup guide for Azure Managed Lustre
- [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) — Full walkthrough for setting up model caching with Dynamo, including download Jobs and mount configuration
- [AKS CSI Storage Drivers](https://learn.microsoft.com/azure/aks/csi-storage-drivers) — Microsoft documentation for all built-in CSI drivers
# Azure Lustre CSI Driver for AKS
This guide covers installing and configuring the [Azure Lustre CSI driver](https://github.com/kubernetes-sigs/azurelustre-csi-driver) on an AKS cluster so that Dynamo workloads can use Azure Managed Lustre (AMLFS) filesystems for high-performance model storage.
## Prerequisites
**AKS cluster requirements**
- Kubernetes 1.21 or later
- Node pools must use the **Ubuntu** OS SKU — Windows and Azure Linux (CBL Mariner) nodes are not supported
- AKS is the only supported Kubernetes distribution (self-managed clusters are not supported)
**Tools**
- Azure CLI (`az`)
- `kubectl`
**Network connectivity**
AKS and your AMLFS filesystem must have network reachability. Two supported topologies:
- **VNet peering**: Deploy AKS in its own VNet and peer it with the AMLFS VNet. The AKS infrastructure VNet lives in the auto-created node resource group `MC_<resource-group>_<cluster-name>_<region>`.
- **Shared VNet**: Use AKS's "Bring your own VNet" feature and deploy AKS in a dedicated subnet inside the AMLFS VNet. Do not use the same subnet as AMLFS.
Do not place AKS nodes and the AMLFS filesystem in the same subnet, even when sharing a VNet.
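When planning address ranges, the separate-subnet rule can be checked up front with Python's standard `ipaddress` module. This is a sketch only: the function name and the CIDR ranges in the example are hypothetical, and it checks address overlap, not Azure-side configuration.

```python
import ipaddress


def subnets_are_separate(aks_subnet_cidr: str, amlfs_subnet_cidr: str) -> bool:
    """Return True when the AKS and AMLFS subnets do not share any addresses."""
    aks = ipaddress.ip_network(aks_subnet_cidr)
    amlfs = ipaddress.ip_network(amlfs_subnet_cidr)
    return not aks.overlaps(amlfs)


# Two /24 subnets inside the same hypothetical 10.0.0.0/16 VNet.
print(subnets_are_separate("10.0.0.0/24", "10.0.1.0/24"))
```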
## Step 1: Connect to your AKS cluster
```bash
az login
az aks get-credentials \
  --subscription <subscription-id> \
  --resource-group <resource-group> \
  --name <cluster-name>
kubectl config current-context
```
## Step 2: Install the CSI driver
There is no Helm chart. Install via the provided shell script:
```bash
# Install latest version
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash -s main
# Or install a specific version
curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash -s v0.3.1
```
The script deploys the CSI controller (2-replica Deployment) and node plugin (DaemonSet) into `kube-system`, and waits for them to become ready.
**Verify the installation:**
```bash
# Controller pods — expect 2/2 or 3/3 Running
kubectl get -n kube-system pod -l app=csi-azurelustre-controller
# Node plugin pods — expect 3/3 Running on each node
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```
## Step 3: Configure storage
There are two provisioning modes depending on whether your AMLFS filesystem already exists.
### Option A: Static provisioning (existing AMLFS filesystem)
Use this when you want to bring your own Azure Managed Lustre filesystem. If you don't have one yet, create it first, then configure the CSI driver to use it.
#### Create an Azure Managed Lustre filesystem
**1. Register the resource provider (first time only):**
```bash
az provider register --namespace Microsoft.StorageCache
# Wait until state is "Registered"
az provider show --namespace Microsoft.StorageCache --query "registrationState"
```
**2. Validate your subnet before creating the filesystem:**
The subnet must be dedicated to AMLFS (do not share with AKS nodes or other resources) and sized to hold the filesystem. Check requirements first:
```bash
# Get the required subnet size for your planned SKU and capacity
az amlfs get-subnets-size \
--sku AMLFS-Durable-Premium-250 \
--storage-capacity 16
# Validate that your subnet meets the requirements
az amlfs check-amlfs-subnet \
  --filesystem-subnet /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Network/virtualNetworks/<vnet-name>/subnets/<subnet-name> \
  --sku AMLFS-Durable-Premium-250 \
  --location <region> \
  --storage-capacity 16
```
**3. Create a dedicated subnet for AMLFS:**
AMLFS requires its own subnet — it cannot share the subnet used by AKS nodes. Create a new subnet in the AKS VNet (or in a peered VNet):
```bash
# Get the node resource group and check for a custom VNet subnet
az aks show \
  --name <cluster-name> \
  --resource-group <resource-group> \
  --query "{vnet: agentPoolProfiles[0].vnetSubnetId, nodeRG: nodeResourceGroup}"
```
If `vnet` is non-null, your cluster uses Azure CNI with a custom VNet — use that VNet name and resource group below.
If `vnet` is `null`, AKS manages its own VNet in the node resource group. Find it:
```bash
az network vnet list \
  --resource-group <node-resource-group> \
  --query "[].{name:name, addressPrefixes:addressSpace.addressPrefixes}"
```
List existing subnets to find a free CIDR range:
```bash
az network vnet subnet list \
  --resource-group <vnet-resource-group> \
  --vnet-name <vnet-name> \
  --query "[].{name:name, prefix:addressPrefix}"
```
Pick a non-overlapping CIDR within the VNet's address space. The `filesystemSubnetSize` value from `get-subnets-size` is the number of IPs required. Azure also reserves 5 IPs per subnet, so add those when sizing the prefix (e.g., `filesystemSubnetSize: 8` → 13 IPs needed → use `/28` for 16 addresses or more).
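The sizing arithmetic and the overlap check can be sketched in Python with the standard `ipaddress` module. This is only an illustration: the function names and the example address ranges are hypothetical, and the 5 reserved addresses come from the paragraph above.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves 5 addresses in every subnet


def required_prefix_length(filesystem_subnet_size: int) -> int:
    """Longest IPv4 prefix whose subnet holds the AMLFS IPs plus the reserved ones."""
    needed = filesystem_subnet_size + AZURE_RESERVED_IPS
    host_bits = (needed - 1).bit_length()  # smallest n with 2**n >= needed
    return 32 - host_bits


def is_free(candidate: str, vnet: str, existing: list[str]) -> bool:
    """True if candidate lies inside the VNet and overlaps no existing subnet."""
    cand = ipaddress.ip_network(candidate)
    if not cand.subnet_of(ipaddress.ip_network(vnet)):
        return False
    return not any(cand.overlaps(ipaddress.ip_network(s)) for s in existing)


# filesystemSubnetSize 8 -> 13 addresses needed -> /28 (16 addresses)
print(required_prefix_length(8))
# Hypothetical address ranges, for illustration only
print(is_free("10.225.0.0/28", "10.224.0.0/12", ["10.224.0.0/16"]))
```

Feed the function the prefixes returned by the `az network vnet subnet list` query above rather than the hypothetical ranges used here.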
Then create the dedicated AMLFS subnet:
```bash
az network vnet subnet create \
  --name amlfs-subnet \
  --resource-group <vnet-resource-group> \
  --vnet-name <vnet-name> \
  --address-prefix <cidr>
```
Use the full subnet resource ID in the next step:
`/subscriptions/<subscription-id>/resourceGroups/<vnet-resource-group>/providers/Microsoft.Network/virtualNetworks/<vnet-name>/subnets/amlfs-subnet`
**4. Create the filesystem:**
```bash
az amlfs create \
  --name <filesystem-name> \
  --resource-group <resource-group> \
  --location <region> \
  --sku AMLFS-Durable-Premium-250 \
  --storage-capacity 16 \
  --zones "[1]" \
  --filesystem-subnet /subscriptions/<subscription-id>/resourceGroups/<vnet-resource-group>/providers/Microsoft.Network/virtualNetworks/<vnet-name>/subnets/<subnet-name> \
  --maintenance-window "{dayOfWeek:Sunday,timeOfDayUtc:'22:00'}"
```
This takes **10–20 minutes**. Use `--no-wait` to return immediately and poll with `az amlfs show`.
**Available SKUs:**
| SKU | Min size | Throughput |
|-----|----------|------------|
| `AMLFS-Durable-Premium-40` | 48 TiB | 40 MB/s per TiB |
| `AMLFS-Durable-Premium-125` | 16 TiB | 125 MB/s per TiB |
| `AMLFS-Durable-Premium-250` | 8 TiB | 250 MB/s per TiB |
| `AMLFS-Durable-Premium-500` | 4 TiB | 500 MB/s per TiB |
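To compare SKUs for a planned capacity, the table can be turned into a small lookup. The sketch below copies its values from the table above; the function name is hypothetical, and it only checks the minimum size, not any other constraint Azure may apply.

```python
# (minimum size in TiB, throughput in MB/s per TiB), copied from the table above
SKUS = {
    "AMLFS-Durable-Premium-40": (48, 40),
    "AMLFS-Durable-Premium-125": (16, 125),
    "AMLFS-Durable-Premium-250": (8, 250),
    "AMLFS-Durable-Premium-500": (4, 500),
}


def aggregate_throughput_mbps(sku: str, capacity_tib: int) -> int:
    """Total throughput for a filesystem of the given size, in MB/s."""
    min_tib, per_tib = SKUS[sku]
    if capacity_tib < min_tib:
        raise ValueError(f"{sku} needs at least {min_tib} TiB, got {capacity_tib}")
    return capacity_tib * per_tib


# The 16 TiB AMLFS-Durable-Premium-250 filesystem created above
print(aggregate_throughput_mbps("AMLFS-Durable-Premium-250", 16))  # 4000
```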
**5. Get the MGS IP address:**
```bash
az amlfs show \
  --name <filesystem-name> \
  --resource-group <resource-group> \
  --query "{mgsAddress: clientInfo.mgsAddress, mountCommand: clientInfo.mountCommand}"
```
Use the `mgsAddress` value in the StorageClass below. Alternatively, find it in the Azure portal under your filesystem's **Client connection** pane.
**StorageClass:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurelustre-static
provisioner: azurelustre.csi.azure.com
parameters:
  mgs-ip-address: <mgs-ip-address> # mgsAddress from the previous step, or portal > Client connection
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - noatime
  - flock
```
**PersistentVolumeClaim:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-lustre
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: <filesystem-size> # Match your filesystem size, e.g. 16Ti
  storageClassName: azurelustre-static
```
```bash
kubectl apply -f storageclass.yaml
kubectl apply -f pvc.yaml
```
### Option B: Dynamic provisioning (auto-create AMLFS filesystem)
Requires driver v0.3.0 or later. The driver creates an AMLFS cluster automatically when the PVC is created — this takes **10+ minutes**.
**Additional IAM permissions required** on the kubelet managed identity (grant before creating the PVC):
```
Microsoft.StorageCache/amlFilesystems/read
Microsoft.StorageCache/amlFilesystems/write
Microsoft.StorageCache/amlFilesystems/delete
Microsoft.StorageCache/checkAmlFSSubnets/action
Microsoft.StorageCache/getRequiredAmlFSSubnetsSize/*
Microsoft.Network/virtualNetworks/subnets/read
Microsoft.Network/virtualNetworks/subnets/join/action
Microsoft.ManagedIdentity/userAssignedIdentities/assign/action
```
Alternatively assign the broader roles: **Reader** at subscription scope, **Contributor** at the target resource group, and **Network Contributor** at the VNet scope.
**Available SKUs:**
| SKU | Throughput |
|-----|------------|
| `AMLFS-Durable-Premium-40` | 40 MB/s per TiB (min 48 TiB) |
| `AMLFS-Durable-Premium-125` | 125 MB/s per TiB (min 16 TiB) |
| `AMLFS-Durable-Premium-250` | 250 MB/s per TiB |
| `AMLFS-Durable-Premium-500` | 500 MB/s per TiB |
**StorageClass:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurelustre-dynamic
provisioner: azurelustre.csi.azure.com
parameters:
  sku-name: "AMLFS-Durable-Premium-125"
  zone: "1" # Availability zone: "1", "2", or "3"
  maintenance-day-of-week: "Sunday"
  maintenance-time-of-day-utc: "22:00"
  # Optional overrides (defaults to AKS cluster values):
  # location: "eastus"
  # resource-group-name: "my-rg"
  # vnet-name: "my-vnet"
  # subnet-name: "my-subnet"
reclaimPolicy: Delete # WARNING: deletes the AMLFS cluster when the PVC is deleted; use Retain in production
volumeBindingMode: Immediate
mountOptions:
  - noatime
  - flock
```
**PersistentVolumeClaim:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-lustre-dynamic
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 48Ti # AMLFS-Durable-Premium-125 accepts 16Ti or more (see the SKU table in Option A)
  storageClassName: azurelustre-dynamic
```
```bash
kubectl apply -f storageclass-dynamic.yaml
kubectl apply -f pvc-dynamic.yaml
# Monitor provisioning (takes 10+ minutes)
kubectl describe pvc pvc-lustre-dynamic
```
## Troubleshooting
**Pod stuck in `ContainerCreating`**
```bash
kubectl describe pod <pod-name>
# Look for volume mount errors in Events
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=50
```
**PVC stuck in `Pending` (dynamic provisioning)**
```bash
kubectl describe pvc <pvc-name>
# Check Events for authorization errors — kubelet identity may lack IAM permissions
```
**Node cannot mount** — verify Ubuntu OS SKU:
```bash
kubectl get nodes -o custom-columns="NAME:.metadata.name,OS:.status.nodeInfo.osImage"
```
## See also
- [Azure Managed Lustre CSI Driver — GitHub](https://github.com/kubernetes-sigs/azurelustre-csi-driver)
- [Use Azure Managed Lustre with AKS — Microsoft Learn](https://learn.microsoft.com/en-us/azure/azure-managed-lustre/use-csi-driver-kubernetes)
# AKS Spot VMs
# Running Dynamo on AKS Spot VMs
[Azure Spot VMs](https://azure.microsoft.com/en-us/products/virtual-machines/spot) offer significant cost savings for GPU workloads but can be evicted by Azure at any time. This guide covers the configuration required to schedule Dynamo on Spot VM node pools.
## How AKS Taints Spot Nodes
When a node pool uses Spot VMs, AKS automatically applies the following taint to all nodes in that pool:
```yaml
kubernetes.azure.com/scalesetpriority=spot:NoSchedule
```
This prevents standard workloads from landing on Spot nodes by default. Any pod that should run on a Spot node must explicitly tolerate this taint.
## Required Toleration
Add the following toleration to any workload that should run on Spot nodes:
```yaml
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule
```
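To see why this toleration is needed, here is a simplified Python sketch of how Kubernetes matches a toleration against a taint. It is a model for illustration, not the scheduler's code, and the function name is hypothetical.

```python
SPOT_TAINT = {
    "key": "kubernetes.azure.com/scalesetpriority",
    "value": "spot",
    "effect": "NoSchedule",
}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified model of toleration/taint matching."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    if key and key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # Exists ignores the value (and, with an empty key, the key)
    if not key:
        return False  # an empty key is only meaningful with Exists
    return toleration.get("value", "") == taint.get("value", "")


spot_toleration = {
    "key": "kubernetes.azure.com/scalesetpriority",
    "operator": "Equal",
    "value": "spot",
    "effect": "NoSchedule",
}
print(tolerates(spot_toleration, SPOT_TAINT))  # True
```

A pod with no matching toleration is never scheduled onto the Spot pool, which is why every Dynamo component that may land there needs the toleration.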
## Deploying Dynamo on Spot Nodes
The Dynamo platform Helm chart includes a pre-built values file for Spot VM deployments — [`examples/deployments/AKS/values-aks-spot.yaml`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/deployments/AKS/values-aks-spot.yaml) — which adds the required toleration to all Dynamo components:
- Dynamo operator controller manager
- Webhook CA inject and cert generation jobs
- etcd
- NATS
- MPI SSH key generation job
- Other core Dynamo platform pods
Install Dynamo with the Spot values file:
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--create-namespace \
-f ./values-aks-spot.yaml
```
To upgrade an existing installation:
```bash
helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
-f ./values-aks-spot.yaml
```
## Creating a Spot GPU Node Pool
Add a Spot GPU node pool to an existing AKS cluster:
```bash
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
--name spotgpunp \
--node-count 2 \
--node-vm-size Standard_NC24ads_A100_v4 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--skip-gpu-driver-install
```
`--spot-max-price -1` means pay up to the on-demand price (recommended). `--eviction-policy Delete` removes evicted nodes from the pool; use `Deallocate` if you want to preserve node state across evictions.
## See Also
- [Azure Spot VMs overview](https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms)
- [Use Spot VMs in AKS](https://learn.microsoft.com/en-us/azure/aks/spot-node-pool)
# Google Kubernetes Engine (GKE)
# Dynamo Deployment on GKE
## Prerequisites
### Install gcloud CLI
https://cloud.google.com/sdk/docs/install
### Create GKE cluster
```bash
export PROJECT_ID=<>
export REGION=<>
export ZONE=<>
export CLUSTER_NAME=<>
export CLUSTER_MACHINE_TYPE=n2-standard-4
export NODE_POOL_MACHINE_TYPE=g2-standard-24
export GPU_TYPE=nvidia-l4
export GPU_COUNT=2
export CPU_NODE=2
export GPU_NODE=2
export DISK_SIZE=200
gcloud container clusters create ${CLUSTER_NAME} \
--project=${PROJECT_ID} \
--location=${ZONE} \
--subnetwork=default \
--disk-size=${DISK_SIZE} \
--machine-type=${CLUSTER_MACHINE_TYPE} \
--num-nodes=${CPU_NODE}
```
#### Create GPU pool
```bash
gcloud container node-pools create gpu-pool \
--accelerator type=${GPU_TYPE},count=${GPU_COUNT},gpu-driver-version=latest \
--project=${PROJECT_ID} \
--location=${ZONE} \
--cluster=${CLUSTER_NAME} \
--machine-type=${NODE_POOL_MACHINE_TYPE} \
--disk-size=${DISK_SIZE} \
--num-nodes=${GPU_NODE} \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=3
```
### Clone Dynamo GitHub repository
**Note:** Make sure the GitHub branch or commit matches the versions of the Dynamo platform and the vLLM container.
```bash
git clone https://github.com/ai-dynamo/dynamo.git
# Check out the desired branch inside the cloned repository
git -C dynamo checkout release/0.6.0
```
### Set environment variables for GKE
```bash
export NAMESPACE=dynamo-system
kubectl create namespace $NAMESPACE
kubectl config set-context --current --namespace=$NAMESPACE
export HF_TOKEN=<your-hf-token>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```
## Install Dynamo Kubernetes Platform
[See installation steps](/dynamo/kubernetes-deployment/start-here/installation-guide#overview)
After installation, verify the installation:
**Expected output**
```bash
kubectl get pods
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-69b9794fpgv9 2/2 Running 0 4m27s
dynamo-platform-etcd-0 1/1 Running 0 4m27s
dynamo-platform-nats-0 2/2 Running 0 4m27s
```
## Deploy Inference Graph
We will deploy an LLM to the Dynamo platform. This example uses the `Qwen/Qwen3-0.6B` model with vLLM in a disaggregated deployment.
Some adjustments to the deployment YAML file are required and others are optional:
- **(Required)** Add args that set `LD_LIBRARY_PATH` and `PATH` in the decode worker container so that it can find the correct GPU driver on GKE
- Change the vLLM image to the desired one on NGC
- Add the namespace to the metadata
- Adjust GPU/CPU requests and limits
- Change the model to deploy
For more configurations, see https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/deployments/GKE/vllm
### Highlighted configurations in yaml file
Note that `LD_LIBRARY_PATH` must be set correctly on GKE, as described in [Run GPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus).
The following snippet must be present in the `args` field of the deployment YAML file:
```bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/nvidia/bin:/usr/local/nvidia/lib64
/sbin/ldconfig
```
For example, refer to the following from [`examples/deployments/GKE/vllm/disagg.yaml`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/deployments/GKE/vllm/disagg.yaml)
```yaml
metadata:
  name: vllm-disagg
  namespace: dynamo-system
spec:
  services:
    Frontend:
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    VllmDecodeWorker:
      resources:
        limits:
          gpu: "3"
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
      args:
        - |
          export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
          export PATH=$PATH:/usr/local/nvidia/bin:/usr/local/nvidia/lib64
          /sbin/ldconfig
          python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
```
## Deploy the model
```bash
cd dynamo/examples/deployments/GKE/vllm
kubectl apply -f disagg_gke.yaml -n ${NAMESPACE}
```
**Expected output after successful deployment**
```bash
kubectl get pods
NAME READY STATUS RESTARTS AGE
dynamo-platform-dynamo-operator-controller-manager-c665684ssqkx 2/2 Running 0 65m
dynamo-platform-etcd-0 1/1 Running 0 65m
dynamo-platform-nats-0 2/2 Running 0 65m
vllm-disagg-frontend-5954ddc4dd-4w2cb 1/1 Running 0 11m
vllm-disagg-vllmdecodeworker-77844cfcff-ddn4v 1/1 Running 0 11m
vllm-disagg-vllmprefillworker-55d5b74b4f-zrskh 1/1 Running 0 11m
```
## Test the Deployment
```bash
export DEPLOYMENT_NAME=vllm-disagg
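# Find the frontend pod (optional; useful for inspecting logs)
export FRONTEND_POD=$(kubectl get pods -n ${NAMESPACE} | grep "${DEPLOYMENT_NAME}-frontend" | sort -k1 | tail -n1 | awk '{print $1}')
# Forward the frontend port to localhost in the background so the request below can run in the same shell
kubectl port-forward deployment/${DEPLOYMENT_NAME}-frontend 8000:8000 -n ${NAMESPACE} &
sleep 3 # Give the port-forward a moment to start
# Send a chat completion request to the disaggregated deployment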
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream":false,
"max_tokens": 30
}'
```
### Response
```json
{"id":"chatcmpl-bd0670d9-0342-4eea-97c1-99b69f1f931f","choices":[{"index":0,"message":{"content":"Okay, here's a detailed character background for your intrepid explorer, tailored to fit the premise of Aeloria, with a focus on a","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1756336263,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":190,"completion_tokens":29,"total_tokens":219,"prompt_tokens_details":null,"completion_tokens_details":null}}
```
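When scripting this check, the fields of interest can be pulled out of the response body with the standard `json` module. The sketch below uses an abridged copy of the response above, with the message content shortened; the function name is hypothetical.

```python
import json


def summarize_completion(raw: str) -> dict:
    """Extract the generated text, finish reason, and token usage from a response body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": body["usage"]["prompt_tokens"],
        "completion_tokens": body["usage"]["completion_tokens"],
    }


# Abridged copy of the response shown above (message content shortened)
raw = (
    '{"choices":[{"index":0,"message":{"content":"Okay, ...","role":"assistant"},'
    '"finish_reason":"stop"}],"model":"Qwen/Qwen3-0.6B",'
    '"usage":{"prompt_tokens":190,"completion_tokens":29,"total_tokens":219}}'
)
print(summarize_completion(raw))
```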
# Feature Guides
Use these guides after you have Dynamo running and want to improve serving behavior, operate a deployment, or adapt Dynamo to a new workload.
## Recommended path
Most deployments start with the core performance loop:
| Step | Guide | Use when |
|---|---|---|
| 1 | [KV Cache Aware Routing](/dynamo/user-guides/kv-cache-aware-routing) | Route requests to workers that already hold useful KV cache. |
| 2 | [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving) | Scale prefill and decode workers independently. |
| 3 | [KV Cache Offloading](/dynamo/user-guides/kv-cache-offloading) | Extend usable cache capacity beyond GPU memory. |
| 4 | [Benchmarking](/dynamo/user-guides/benchmarking) | Compare configurations before you move to production. |
## Where to go next
| Goal | Start with |
|---|---|
| Make serving more resilient | [Fault Tolerance](/dynamo/user-guides/fault-tolerance) |
| Monitor local deployments | [Observability (Local)](/dynamo/user-guides/observability-local) |
| Reproduce traffic without a full engine | [Mocker Engine Simulation](../mocker/mocker.md) |
| Add structured model outputs | [Tool Calling](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) and [Reasoning](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) |
| Build agent workloads | [Agents](/dynamo/user-guides/agents) |
| Serve specialized workloads | [LoRA Adapters](/dynamo/user-guides/lo-ra-adapters), [Multimodal](/dynamo/user-guides/multimodal), and [Diffusion](/dynamo/user-guides/diffusion) |
For cluster deployments, pair these guides with the [Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) docs. The same features can be explored locally, then expressed through Dynamo's Kubernetes-native CRDs and operator when you move to a shared GPU cluster.
# Router Guide
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
This guide helps you get started with using the Dynamo router and points to the pages that cover routing concepts, configuration, disaggregated serving, and operations in more detail.
## Quick Start
The router can be deployed using [Python / CLI](#python--cli-deployment), [Kubernetes](#kubernetes-deployment), or as a [standalone component](#standalone-router).
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_model` API. For accurate prefix-cache state, workers must also publish KV cache events with the backend-specific event flags; otherwise the router can run in approximate mode with `--no-router-kv-events`.
#### CLI Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--router-temperature ` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size ` | Backend-specific | KV cache block size (should match backend config) |
| `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
| `--load-aware` / `--no-load-aware` | `--no-load-aware` | Route by active load without cache-reuse signals; implies `--router-mode kv` on the frontend |
| `--router-kv-overlap-score-credit ` | `1.0` | Credit multiplier for device-local prefix overlap, from 0.0 to 1.0 |
| `--router-prefill-load-scale ` | `1.0` | Scale adjusted prompt-side prefill load before adding decode blocks |
| `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens` | `--router-track-prefill-tokens` | Include prompt-side load in active worker load accounting |
| `--router-prefill-load-model ` | `none` | Prompt-side load model; see [Routing Concepts](/dynamo/components/router/routing-concepts#active-load-modeling) and [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning#aic-prefill-load-model) |
| `--router-queue-threshold ` | `16.0` | Queue threshold fraction; enables priority scheduling via `priority` |
| `--router-queue-policy ` | `fcfs` | Scheduling policy for the queue: `fcfs` (tail TTFT), `wspt` (avg TTFT), or `lcfs` (comparison-only reverse ordering) |
| `--serve-indexer` | `false` | Serve the Dynamo-native remote indexer from this frontend/router on the worker component |
| `--use-remote-indexer` | `false` | Query the worker component's served remote indexer instead of maintaining a local overlap indexer |
For all available options: `python -m dynamo.frontend --help`
For detailed configuration options and tuning parameters, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning). For candidate eligibility rules, see [Router Filtering](router-filtering.md). For how the router models prefill and decode load in the cost function, see [Routing Concepts](/dynamo/components/router/routing-concepts#active-load-modeling).
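To build intuition for how the overlap credit, prefill load scale, and temperature interact, here is a deliberately simplified Python sketch. It is not the router's implementation: the cost formula, the softmax sampling, and all names are assumptions made for illustration, and the Routing Concepts page is the authority on the actual cost function.

```python
import math
import random


def worker_cost(prompt_blocks, overlap_blocks, decode_blocks,
                overlap_credit=1.0, prefill_load_scale=1.0):
    """Hypothetical cost: scaled blocks still to prefill, plus active decode blocks."""
    new_prefill_blocks = prompt_blocks - overlap_credit * overlap_blocks
    return prefill_load_scale * new_prefill_blocks + decode_blocks


def pick_worker(costs, temperature=0.0, rng=random):
    """Temperature 0.0 picks the lowest cost; higher values sample more randomly."""
    if temperature == 0.0:
        return min(costs, key=costs.get)
    lowest = min(costs.values())
    # Softmax over negative costs (an assumed sampling scheme)
    weights = [math.exp(-(c - lowest) / temperature) for c in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


costs = {
    "worker-a": worker_cost(prompt_blocks=10, overlap_blocks=8, decode_blocks=3),
    "worker-b": worker_cost(prompt_blocks=10, overlap_blocks=0, decode_blocks=3),
}
print(pick_worker(costs))  # worker-a: it already holds most of the prefix
```

In this toy model, lowering `overlap_credit` toward 0.0 makes cached prefixes count for less, so the choice is driven more by load.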
### Kubernetes Deployment
To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv # Enable KV Smart Router
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Configure worker-side KV event publishing when you want event-driven prefix-cache state
- Use `--no-router-kv-events` for approximate cache-state prediction when workers are not publishing events
#### Environment Variables
All CLI arguments can be configured via environment variables using the `DYN_` prefix:
| CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
| `--load-aware` | `DYN_ROUTER_LOAD_AWARE=true` | `false` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS=false` | `true` |
| `--router-kv-overlap-score-credit` | `DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT` | `1.0` |
| `--router-prefill-load-scale` | `DYN_ROUTER_PREFILL_LOAD_SCALE` | `1.0` |
| `--router-queue-policy` | `DYN_ROUTER_QUEUE_POLICY` | `fcfs` |
| (none) | `DYN_ENCODER_CUDA_TO_CPU_RATIO` | `8` |

`DYN_ENCODER_CUDA_TO_CPU_RATIO` sets the throughput ratio of a non-CPU worker relative to one CPU worker for `device-aware-weighted` routing.
For complete K8s examples and advanced configuration, see [K8s Examples](/dynamo/components/router/router-examples#k8s-examples) and [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](/dynamo/additional-resources/kv-router-a-b-testing).
### Standalone Router
You can also run the KV router as a standalone service (without the Dynamo frontend) for disaggregated serving (e.g., routing to prefill workers), multi-tier architectures, or any scenario requiring intelligent KV cache-aware routing decisions. See the [Standalone Router component](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/components/src/dynamo/router/) for more details.
#### Frontend-Embedded vs. Standalone Router
| Deployment | Process | Metrics Port | Use Case |
|------------|---------|--------------|----------|
| **Frontend-embedded** | `python -m dynamo.frontend --router-mode kv` | Frontend HTTP port (default 8000) | Standard deployment; router runs inside the frontend process |
| **Standalone** | `python -m dynamo.router` | `DYN_SYSTEM_PORT` (if set) | Multi-tier architectures, advanced disaggregated prefill routing, custom pipelines |
The standalone router does not include the HTTP frontend (no `/v1/chat/completions` endpoint). It exposes only the `RouterRequestMetrics` via the system status server. See the [Standalone Router README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/components/src/dynamo/router/README.md).
## Deployment Modes
The Dynamo router can be deployed in several configurations. The table below shows every combination and when to use it:
| Mode | Command | Routing Logic | KV Events | Topology | Use Case |
|------|---------|---------------|-----------|----------|----------|
| **Frontend + Round-Robin** | `python -m dynamo.frontend --router-mode round-robin` | Cycles through workers | None | Aggregated | Simplest baseline; no KV awareness |
| **Frontend + Random** | `python -m dynamo.frontend --router-mode random` | Random worker selection | None | Aggregated | Stateless load balancing |
| **Frontend + KV (Aggregated)** | `python -m dynamo.frontend --router-mode kv` | KV cache overlap + load | NATS Core / JetStream / ZMQ / Approx | Aggregated | Production single-pool serving with cache reuse |
| **Frontend + KV (Disaggregated)** | `python -m dynamo.frontend --router-mode kv` with prefill + decode workers | KV cache overlap + load | NATS Core / JetStream / ZMQ / Approx | Disaggregated (prefill + decode pools) | Separate prefill/decode for large-scale serving |
| **Frontend + Least-Loaded** | `python -m dynamo.frontend --router-mode least-loaded` | Fewest active connections | None | Aggregated or disaggregated fallback | Simple load-aware balancing without KV awareness |
| **Frontend + Device-Aware Weighted** | `python -m dynamo.frontend --router-mode device-aware-weighted` | Device-aware budget + least-loaded within selected device group | None | Aggregated or disaggregated fallback | Heterogeneous fleet balancing (CPU/non-CPU); degenerates to least-loaded when only one device class is present |
| **Frontend + Direct** | `python -m dynamo.frontend --router-mode direct` | Worker ID from request hints | None | Aggregated | External orchestrator (e.g., EPP/GAIE) selects workers |
| **Standalone Router** | `python -m dynamo.router` | KV cache overlap + load | NATS Core / JetStream / ZMQ | Any | Routing without the HTTP frontend (multi-tier, custom pipelines) |
### Routing Modes (`--router-mode`)
| Mode | Value | How Workers Are Selected |
|------|-------|-------------------------|
| **Round-Robin** | `round-robin` (default) | Cycles through available workers in order |
| **Random** | `random` | Selects a random worker for each request |
| **KV** | `kv` | Evaluates KV cache overlap and decode load per worker; picks lowest cost |
| **Least-Loaded** | `least-loaded` | Routes to the worker with fewest active connections; in disaggregated prefill paths it skips bootstrap optimization and falls back to synchronous prefill |
| **Device-Aware Weighted** | `device-aware-weighted` | Partitions workers into CPU and non-CPU groups, applies capability-normalized ratio budgeting using `DYN_ENCODER_CUDA_TO_CPU_RATIO` to decide which group receives the request, then selects the least-loaded worker within that group |
| **Direct** | `direct` | Reads the target `worker_id` from the request's routing hints; no selection logic |
### Device-Aware Weighted Routing
`device-aware-weighted` is designed for heterogeneous fleets where workers of different compute capability, for example CPU embedding encoders alongside GPU embedding encoders, share the same endpoint.
Workers are split into CPU and non-CPU groups. The router compares a capability-normalized load across the two groups:
```text
normalized_load = total_inflight(group) / (instance_count(group) x throughput_weight)
```
The throughput weight is `1` for CPU workers and `DYN_ENCODER_CUDA_TO_CPU_RATIO` for non-CPU workers. The next request is routed to the group with the lower normalized load, then to the least-loaded worker inside that group.
Use `DYN_ENCODER_CUDA_TO_CPU_RATIO` to approximate the throughput ratio of a non-CPU worker relative to one CPU worker. The default is `8`.
When only one device class is present, the policy degenerates to standard least-loaded routing.
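The group selection described above can be sketched in Python. This is an illustration of the formula, not the router's code: the function names are hypothetical, and breaking ties in favor of the non-CPU group is an arbitrary assumption.

```python
def normalized_load(total_inflight, instance_count, throughput_weight):
    """normalized_load = total_inflight / (instance_count x throughput_weight)."""
    return total_inflight / (instance_count * throughput_weight)


def route(cpu_workers, non_cpu_workers, ratio=8.0):
    """Pick a worker name; each argument maps worker name to in-flight requests.

    `ratio` stands in for DYN_ENCODER_CUDA_TO_CPU_RATIO (default 8).
    """
    if not cpu_workers or not non_cpu_workers:
        # Only one device class present: plain least-loaded routing
        group = cpu_workers or non_cpu_workers
    else:
        cpu_load = normalized_load(sum(cpu_workers.values()), len(cpu_workers), 1.0)
        other_load = normalized_load(sum(non_cpu_workers.values()), len(non_cpu_workers), ratio)
        # Tie goes to the non-CPU group (assumption)
        group = cpu_workers if cpu_load < other_load else non_cpu_workers
    return min(group, key=group.get)


# CPU group: 3 / (2 x 1) = 1.5; non-CPU group: 10 / (1 x 8) = 1.25
print(route({"cpu-0": 2, "cpu-1": 1}, {"gpu-0": 10}))  # gpu-0
```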
### KV Event Transport Modes (within `--router-mode kv`)
When using KV routing, the router needs to know what each worker has cached. There are four ways to get this information:
| Event Mode | How to Enable | Description |
|------------|---------------|-------------|
| **NATS Core (local indexer)** | Router default (no router flag) | Workers maintain a local indexer; configure backend-side KV event publishing so the router can recover state and receive events via NATS Core |
| **JetStream (durable)** | `--router-durable-kv-events` | Events persisted in NATS JetStream; supports snapshots and durable consumers. *Deprecated.* |
| **ZMQ** | `--event-plane zmq` | Workers publish via ZMQ PUB sockets; the standalone `dynamo.indexer` service aggregates events |
| **Approximate (no events)** | `--no-router-kv-events` | No events consumed; router predicts cache state from its own routing decisions with TTL-based expiration |
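As a rough illustration of the approximate mode, the Python sketch below predicts cache state purely from past routing decisions and forgets entries after a TTL. The class, its methods, and the use of block hashes are hypothetical; the router's real data structures are not described here.

```python
import time


class ApproxIndex:
    """Toy cache-state predictor: remembers routed blocks per worker until a TTL expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry = {}  # (worker, block_hash) -> time after which the entry is stale

    def record_route(self, worker, block_hashes):
        """Assume the worker now caches every block of the request routed to it."""
        expires_at = self.clock() + self.ttl
        for block_hash in block_hashes:
            self._expiry[(worker, block_hash)] = expires_at

    def predicted_overlap(self, worker, block_hashes):
        """Count leading request blocks the worker is predicted to still hold."""
        now = self.clock()
        count = 0
        for block_hash in block_hashes:
            expires_at = self._expiry.get((worker, block_hash))
            if expires_at is None or expires_at <= now:
                break
            count += 1
        return count


index = ApproxIndex(ttl_seconds=120.0)
index.record_route("worker-a", ["h1", "h2", "h3"])
print(index.predicted_overlap("worker-a", ["h1", "h2", "h9"]))  # 2
```

The trade-off is that predictions can drift from what workers actually hold, which is why event-driven modes are preferred when workers can publish KV events.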
### Aggregated vs. Disaggregated Topology
| Topology | Workers | How It Works |
|----------|---------|--------------|
| **Aggregated** | Single pool (prefill + decode in one process) | All workers handle the full request lifecycle |
| **Disaggregated** | Separate prefill and decode pools | Frontend routes to a prefill worker first, then to a decode worker; requires workers registered with `ModelType.Prefill` |
Disaggregated mode is activated automatically when prefill workers register alongside decode workers. See [Disaggregated Serving](/dynamo/components/router/disaggregated-serving) for details.
## More Router Docs
- **[Routing Concepts](/dynamo/components/router/routing-concepts)**: Cost model, worker selection, and routing primitives
- **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags, transport modes, load tracking, and metrics
- **[Disaggregated Serving](/dynamo/components/router/disaggregated-serving)**: Prefill and decode routing setups
- **[Topology-Aware KV Transfer](/dynamo/components/router/topology-aware-kv-transfer)**: Runtime metadata and decode routing constraints for topology-aware prefill/decode handoff
- **[Router Operations](/dynamo/components/router/router-operations)**: Replicas, remote indexers, persistence, and recovery
- **[Router Examples](/dynamo/components/router/router-examples)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Testing](router-testing.md)**: Recommended test layers for non-trivial router changes
- **[Standalone Indexer](/dynamo/components/router/standalone-indexer)**: Run the KV indexer as a separate service
- **[KV Event Replay — Dynamo vs vLLM](/dynamo/components/router/kv-event-replay-dynamo-vs-v-llm)**: Gap detection and replay behavior
# Disaggregated Serving
Disaggregated serving separates the two main phases of LLM inference:
| Phase | What it does | Scaling pressure |
|---|---|---|
| Prefill | Processes the prompt and produces the initial KV cache. | Input length, prompt reuse, context size |
| Decode | Generates output tokens using the KV cache. | Concurrency, output length, active KV memory |
In an aggregated deployment, each worker does both phases. In a disaggregated
deployment, prefill workers and decode workers are separate pools. Dynamo routes
each request through prefill first, transfers or exposes the KV cache state to
decode, and streams the response from the decode worker.
```mermaid
flowchart LR
Client["Client"] --> Frontend["Dynamo Frontend / Router"]
Frontend --> Prefill["Prefill workers<br/>prompt processing"]
Prefill -->|"KV transfer over fast fabric"| Decode["Decode workers<br/>token generation"]
Decode --> Frontend
Frontend --> Client
```
## When It Helps
Disaggregated serving is most useful when prefill and decode need different
resource shapes:
- long prompts or retrieval-heavy traffic make prefill expensive
- long generations or high concurrency make decode the bottleneck
- you want to scale prefill and decode replicas independently
- you want to pair prefill/decode separation with KV-aware routing
- large models need different parallelism for prompt processing and generation
It is not automatically better for every workload. For small models, short
prompts, low concurrency, or clusters without fast KV transfer, an aggregated
deployment may be simpler and faster.
## Mental Model
Disaggregated serving usually combines four pieces:
| Piece | Role |
|---|---|
| Frontend/router | Accepts OpenAI-compatible requests and coordinates routing. |
| Prefill workers | Run the prompt phase and prepare KV transfer state. |
| Decode workers | Continue generation after prefill completes. |
| KV transfer path | Moves or exposes KV cache state between prefill and decode workers. |
KV-aware routing is related but separate. Disaggregated serving splits prefill
and decode. KV-aware routing chooses workers based on cache locality. Many
production deployments use both, but you can reason about them independently.
For router-specific behavior, see [Router: Disaggregated Serving](/dynamo/components/router/disaggregated-serving)
and [KV Cache Aware Routing](/dynamo/user-guides/kv-cache-aware-routing).
## KV Transfer Is the Critical Path
Disaggregation only helps when decode workers can access the KV cache produced
by prefill quickly. In cross-node or high-throughput deployments, the KV
transfer path commonly depends on RDMA-capable networking through the backend's
transfer layer, such as NIXL/UCX. If RDMA is missing or silently falls back to
TCP, TTFT and throughput can be dominated by KV movement rather than model
compute.
Treat KV transfer as an early validation step, not a final tuning detail. Common
failure modes include missing RDMA device-plugin resources, pods without the
needed `rdma/ib` requests or `IPC_LOCK` capability, UCX/NIXL transport errors,
mismatched model or KV cache settings between prefill and decode workers, and
benchmarks that run through local port-forwarding instead of inside the cluster.
Symptoms usually look like high TTFT despite available prefill capacity, decode
workers sitting idle while prefill workers are busy, or disaggregated throughput
falling below an aggregated baseline after splitting workers across nodes.
Production cross-node disaggregated deployments usually require RDMA or an
equivalent fast fabric for KV cache transfer. Without it, the backend may fall
back to TCP and KV transfer can dominate TTFT and throughput. Validate the
transfer path before spending time tuning replica counts.
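A back-of-the-envelope estimate shows why the transport matters. The sketch below computes the KV cache size for one prompt and the ideal transfer time at two nominal link rates. The model shape (64 layers, 8 KV heads, head dimension 128, FP8 KV cache) and the link rates are illustrative assumptions, not measured values; check your model's configuration and your fabric before relying on the numbers.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Size of the K and V tensors for num_tokens across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_ms(num_bytes, link_gbit_per_s):
    """Ideal transfer time in milliseconds at the nominal link rate (no overhead)."""
    bytes_per_s = link_gbit_per_s * 1e9 / 8
    return num_bytes / bytes_per_s * 1e3


# Assumed model shape (illustrative): 64 layers, 8 KV heads, head_dim 128, 1 byte per element.
size = kv_cache_bytes(num_tokens=4000, num_layers=64, num_kv_heads=8,
                      head_dim=128, bytes_per_elem=1)
print(f"KV cache for a 4000-token prompt: {size / 1e6:.0f} MB")
for name, gbit in [("400 Gb/s RDMA link", 400), ("10 Gb/s TCP link", 10)]:
    print(f"{name}: {transfer_ms(size, gbit):.1f} ms")
```

Under these assumptions the transfer takes roughly 10 ms on the fast link and over 400 ms on the slow one, which would consume most of a 600 ms TTFT budget before any protocol overhead is counted.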
### Deploying Disaggregated with RDMA
Disaggregated deployments transfer KV cache between prefill and decode workers.
Without RDMA or another fast transfer path, this movement can become the main
performance bottleneck.
Prerequisites for a production cross-node deployment:
1. **RDMA-capable network** such as InfiniBand, RoCE, or an equivalent fast
fabric.
2. **RDMA device plugin** installed on the cluster so worker pods can request
`rdma/ib` resources.
3. **ETCD and NATS** deployed for Dynamo coordination.
The following example shows the RDMA-relevant fields in a disaggregated vLLM
`DynamoGraphDeployment`. Start from a validated recipe when one exists, then
adapt the resource requests, model, image, and parallelism for your cluster.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-disagg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          imagePullPolicy: IfNotPresent
    VLLMPrefillWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: prefill
      replicas: 2
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          resources:
            limits:
              rdma/ib: "2"
            requests:
              rdma/ib: "2"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "2"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1"
            - --disaggregation-mode
            - prefill
    VLLMDecodeWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      subComponentType: decode
      replicas: 1
      resources:
        limits:
          gpu: "4"
      sharedMemory:
        size: 16Gi
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
        - name: UCX_TLS
          value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
        - name: UCX_RNDV_SCHEME
          value: "get_zcopy"
        - name: UCX_RNDV_THRESH
          value: "0"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          resources:
            limits:
              rdma/ib: "4"
            requests:
              rdma/ib: "4"
          command: ["python3", "-m", "dynamo.vllm"]
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--tensor-parallel-size"
            - "4"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-num-seqs"
            - "1024"
            - --disaggregation-mode
            - decode
```
Critical RDMA settings:
| Setting | Purpose |
|---|---|
| `rdma/ib: "N"` | Requests RDMA resources for the worker pod. In most disaggregated vLLM deployments, match this to the worker TP size. |
| `IPC_LOCK` capability | Allows RDMA memory registration and pinned-memory use. |
| `UCX_TLS` | Enables RDMA-capable UCX transports such as `rc_x` and `dc_x`, plus CUDA transports for GPU buffers. |
| `UCX_RNDV_SCHEME=get_zcopy` | Enables zero-copy RDMA transfers for large KV-cache movement. |
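As a rough pre-flight check, the settings in the table can be verified against a worker service definition before it is applied. The sketch below assumes the service has already been loaded into a plain Python dictionary. `tensor_parallel_size` and `check_rdma_settings` are hypothetical helpers, not part of Dynamo, and the TP-size comparison encodes the "in most deployments" heuristic from the table rather than a hard rule.

```python
def tensor_parallel_size(args):
    """Read --tensor-parallel-size from a worker args list (defaults to 1)."""
    for flag, value in zip(args, args[1:]):
        if flag == "--tensor-parallel-size":
            return int(value)
    return 1


def check_rdma_settings(service):
    """Return a list of problems found in one worker service definition."""
    problems = []
    container = service.get("extraPodSpec", {}).get("mainContainer", {})
    tp = tensor_parallel_size(container.get("args", []))
    resources = container.get("resources", {})
    for section in ("limits", "requests"):
        rdma = resources.get(section, {}).get("rdma/ib")
        if rdma is None:
            problems.append(f"missing rdma/ib in resources.{section}")
        elif int(rdma) != tp:
            problems.append(f"rdma/ib {section}={rdma} does not match TP size {tp}")
    caps = container.get("securityContext", {}).get("capabilities", {}).get("add", [])
    if "IPC_LOCK" not in caps:
        problems.append("IPC_LOCK capability not added")
    env_names = {env["name"] for env in service.get("envs", [])}
    if "UCX_TLS" not in env_names:
        problems.append("UCX_TLS is not set")
    return problems


# Prefill worker fields copied from the example deployment above.
prefill = {
    "envs": [{"name": "UCX_TLS", "value": "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"}],
    "extraPodSpec": {"mainContainer": {
        "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
        "resources": {"limits": {"rdma/ib": "2"}, "requests": {"rdma/ib": "2"}},
        "args": ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "2"],
    }},
}
print(check_rdma_settings(prefill))
```

An empty list means the RDMA-relevant fields are present and consistent; it does not prove that RDMA is active at runtime, which still requires the log check.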
After deployment, check the worker logs for UCX/NIXL initialization:
```bash
kubectl logs <worker-pod-name> | grep -i "UCX\|NIXL"
```
Expected output includes:
```text
NIXL INFO Backend UCX was instantiated
```
If logs only show TCP transports, RDMA is not active. Check the RDMA device
plugin, worker `rdma/ib` resource requests, security context, and UCX settings.
For full transport setup and troubleshooting, see the
[Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication).
## Deployment Paths
Choose the path that matches how much control you need:
| Starting point | Use when |
|---|---|
| [Dynamo Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) | A recipe matches your model, backend, hardware, and serving mode. Start here for validated baselines and `perf.yaml` benchmarks. |
| Direct `DynamoGraphDeployment` | You already know the prefill/decode layout, images, parallelism, and KV transfer settings. |
| [DGDR](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) | You want Dynamo to generate a DGD from model, backend, hardware, workload, and SLA intent. |
| [Sizing with AIConfigurator](/dynamo/user-guides/disaggregated-serving/sizing-with-ai-configurator) | You want to compare aggregated vs. disaggregated layouts and estimate prefill/decode sizing before deployment. |
Good recipe starting points include:
- [Qwen3-32B vLLM disagg + KV router](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b)
- [DeepSeek V3.2 TensorRT-LLM disagg + KV router](https://github.com/ai-dynamo/dynamo/tree/main/recipes/deepseek-v32-fp4)
- [Llama 3 70B vLLM disaggregated recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b)
For the Kubernetes resource model, see the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).
## Backend Examples
Each built-in backend has examples that show the concrete worker flags and
transfer settings:
| Backend | Examples |
|---|---|
| vLLM | [Deployment examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy), including `disagg.yaml`, `disagg_router.yaml`, and `disagg_planner.yaml` |
| TensorRT-LLM | [Deployment examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy), including disaggregated, router, and planner variants |
| SGLang | [Deployment examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy), including NIXL-based disaggregated serving |
## Operational Notes
Disaggregated deployments add a data-movement path between workers. Before
moving to production, verify:
- KV transfer backend and network fabric are configured for your backend
- RDMA resources, UCX/NIXL settings, and security context are active when your
deployment depends on RDMA
- prefill and decode workers agree on model, dtype, block size, and KV layout
- pods have the required GPU, shared memory, and network resources
- frontend/router flags match your routing strategy
- benchmarks run inside the cluster, not through local port-forwarding, when
validating high-load performance
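The agreement check between prefill and decode workers can be automated with a simple comparison of their argument lists. This is a minimal sketch: `flag_values` and `mismatched_flags` are hypothetical helpers, and the set of flags that must match (`SHARED_FLAGS`) is an assumption you should extend for your backend.

```python
# Flags assumed to need identical values on prefill and decode workers.
SHARED_FLAGS = ("--model", "--kv-cache-dtype", "--block-size")


def flag_values(args):
    """Map each '--flag' to the value that follows it (None for bare flags)."""
    values = {}
    for i, token in enumerate(args):
        if token.startswith("--"):
            nxt = args[i + 1] if i + 1 < len(args) else None
            values[token] = None if nxt is None or nxt.startswith("--") else nxt
    return values


def mismatched_flags(prefill_args, decode_args, flags=SHARED_FLAGS):
    """Return {flag: (prefill_value, decode_value)} for every flag that differs."""
    p, d = flag_values(prefill_args), flag_values(decode_args)
    return {f: (p.get(f), d.get(f)) for f in flags if p.get(f) != d.get(f)}


prefill_args = ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "2",
                "--kv-cache-dtype", "fp8", "--disaggregation-mode", "prefill"]
decode_args = ["--model", "Qwen/Qwen3-32B-FP8", "--tensor-parallel-size", "4",
               "--kv-cache-dtype", "fp8", "--disaggregation-mode", "decode"]
print(mismatched_flags(prefill_args, decode_args))
```

Tensor-parallel size is deliberately left out of the shared set, since prefill and decode pools may use different parallelism.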
Use [Dynamo Benchmarking](/dynamo/user-guides/benchmarking) to compare
aggregated and disaggregated configurations with the same workload.
## Next Steps
1. Start from a matching [Dynamo Recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes) when available.
2. Read the backend-specific deployment example for your engine.
3. Use [Sizing with AIConfigurator](/dynamo/user-guides/disaggregated-serving/sizing-with-ai-configurator) or DGDR when you need help choosing prefill/decode sizing.
4. Validate the result with [Dynamo Benchmarking](/dynamo/user-guides/benchmarking).
# Sizing with AIConfigurator
This page focuses on using
[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) to size
aggregated and disaggregated Dynamo deployments. For the serving architecture
and deployment-path overview, start with [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving).
AIConfigurator is a performance optimization tool that helps you find a strong
starting configuration for deploying LLMs with Dynamo. Given a supported model,
GPU system, backend, and SLA target, it searches aggregated and disaggregated
layouts and can generate deployment artifacts for the selected target.
## When to Use AIConfigurator
When deploying LLMs with Dynamo, you need to make several critical decisions:
- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?
AIConfigurator is useful when you want:
- candidate configurations that are filtered against your SLA requirements
- generated Dynamo configuration files and Kubernetes manifests
- performance comparisons between aggregated and disaggregated strategies
- a support check for a model/system/backend combination before you tune by hand
Exact runtime and throughput gains depend on the model, hardware, backend,
traffic shape, and available performance data. Treat AIConfigurator output as a
validated starting point, then benchmark the generated configuration in your
cluster.
### Aggregated vs Disaggregated Architecture
AIConfigurator evaluates two deployment architectures, aggregated and disaggregated, and recommends the better one for your workload.
## Quick Start
```bash
# Install
pip3 install aiconfigurator
# Optional: check whether the model/system/backend is covered
aiconfigurator cli support \
--model-path Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--backend vllm
# Find optimal configuration for vLLM backend
aiconfigurator cli default \
--model-path Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 16.67 \
--database-mode SILICON \
--deployment-target dynamo-j2 \
--save-dir ./results_vllm
# Deploy on Kubernetes
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```
## Complete Walkthrough: vLLM on H200
This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.
### Step 1: Run AIConfigurator
```bash
aiconfigurator cli default \
--model-path Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--total-gpus 8 \
--isl 4000 \
--osl 500 \
--ttft 600 \
--tpot 25 \
--backend vllm \
--backend-version 0.12.0 \
--deployment-target dynamo-j2 \
--generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
--generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
--save-dir ./results_vllm
```
**Parameters explained:**
- `--model-path`: HuggingFace model ID or local path (e.g., `Qwen/Qwen3-32B-FP8`). `--model` is also accepted as an alias.
- `--system`: GPU system type (`h200_sxm`, `h100_sxm`, `a100_sxm`)
- `--total-gpus`: Number of GPUs available for deployment
- `--isl` / `--osl`: Input/Output sequence lengths in tokens
- `--ttft` / `--tpot`: SLA targets - Time To First Token (ms) and Time Per Output Token (ms)
- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`)
- `--backend-version`: Backend version (e.g., `0.12.0` for vLLM)
- `--deployment-target`: Artifact target. `dynamo-j2` generates Dynamo Kubernetes manifests; other targets are available in the upstream CLI.
- `--save-dir`: Directory to save generated deployment configs
### Step 2: Review the Results
AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:
```text
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
Total GPUs: 8
Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 3,574.80 tokens/s
- Per-GPU Throughput: 446.85 tokens/s/gpu
- Per-User Throughput: 53.58 tokens/s/user
- TTFT: 453.18ms
- TPOT: 18.66ms
- Request Latency: 9766.51ms
----------------------------------------------------------------------------
Pareto Frontier:
Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user
[ASCII Pareto plot: y-axis tokens/s/gpu_cluster (0 to 850), x-axis tokens/s/user (0 to 120); series: agg, disagg, disagg best]
----------------------------------------------------------------------------
Deployment Details:
(p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
Some math: total gpus used = replicas * gpus/replica
gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)
agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| 1 | vllm | 322.69 | 41.78 | 546.92 | 12490.03 | 64 (=32x2) | 8 (8=2x4) | 2 | 4 | 4 (=4x1x1) | tp4pp1 | 32 |
| 2 | vllm | 293.94 | 44.43 | 593.10 | 11823.67 | 56 (=14x4) | 8 (8=4x2) | 4 | 2 | 2 (=2x1x1) | tp2pp1 | 14 |
| 3 | vllm | 208.87 | 42.90 | 460.58 | 12093.52 | 40 (=40x1) | 8 (8=1x8) | 1 | 8 | 8 (=8x1x1) | tp8pp1 | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| 1 | vllm | 446.85 | 53.58 | 453.18 | 9766.51 | 76 (=76x1) | 8 (8=1x8) | 1 | 8 (=2x2+1x4) | 2 | 2 (=2x1) | tp2pp1 | 1 | 1 | 4 (=4x1) | tp4pp1 | 76 |
| 2 | vllm | 446.85 | 41.14 | 453.18 | 12581.87 | 144 (=72x2) | 8 (8=2x4) | 2 | 4 (=1x2+1x2) | 1 | 2 (=2x1) | tp2pp1 | 1 | 1 | 2 (=2x1) | tp2pp1 | 72 |
| 3 | vllm | 333.73 | 40.22 | 453.18 | 12860.32 | 72 (=36x2) | 8 (8=2x4) | 2 | 4 (=1x2+2x1) | 1 | 2 (=2x1) | tp2pp1 | 1 | 2 | 1 (=1x1) | tp1pp1 | 18 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
```
**Reading the output:**
- **tokens/s/gpu**: Overall throughput efficiency — higher is better
- **tokens/s/user**: Per-request generation speed (inverse of TPOT)
- **TTFT**: Predicted time to first token
- **concurrency**: Total concurrent requests across all replicas (e.g., `56 (=14x4)` means batch size 14 × 4 replicas)
- **agg Rank 1** recommends TP4 with 2 replicas — simpler to deploy
- **disagg Rank 1** recommends 2 prefill workers (TP2) + 1 decode worker (TP4) — higher throughput but requires RDMA
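The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below uses values copied from the sample output; the request-latency formula (first token at TTFT, then `osl - 1` tokens at TPOT each) is an assumption that happens to reproduce the reported figure to within rounding, not a documented AIConfigurator definition.

```python
def tokens_per_s_per_user(tpot_ms):
    """Per-user generation speed is the inverse of TPOT."""
    return 1000.0 / tpot_ms


def request_latency_ms(ttft_ms, tpot_ms, osl):
    """Assumed model: first token at TTFT, then (osl - 1) tokens at TPOT each."""
    return ttft_ms + tpot_ms * (osl - 1)


def gpus_per_replica(p_workers, p_gpus, d_workers, d_gpus):
    """gpus/replica for one disaggregated xPyD replica."""
    return p_workers * p_gpus + d_workers * d_gpus


# Values copied from the disagg Rank 1 row and the summary block above.
ttft_ms, tpot_ms, osl = 453.18, 18.66, 500
per_gpu, agg_per_gpu = 446.85, 322.69
replicas = 1
total_gpus = replicas * gpus_per_replica(p_workers=2, p_gpus=2, d_workers=1, d_gpus=4)

print(f"tokens/s/user   ~ {tokens_per_s_per_user(tpot_ms):.2f}")
print(f"request latency ~ {request_latency_ms(ttft_ms, tpot_ms, osl):.0f} ms")
print(f"total GPUs      = {total_gpus}")
print(f"total tokens/s  ~ {per_gpu * total_gpus:.1f}")
print(f"disagg vs agg   ~ {per_gpu / agg_per_gpu:.2f}x")
```

Small differences from the printed report come from the report rounding TPOT to two decimals.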
### Step 3: Deploy on Kubernetes
The `--save-dir` option writes the generated artifacts, including Kubernetes manifests, in the following layout:
```
├── agg
│ ├── best_config_topn.csv
│ ├── exp_config.yaml
│ ├── pareto.csv
│ ├── top1
│ │ ├── agg_config.yaml
│ │ ├── bench_run.sh # aiperf benchmark sweep script (bare-metal)
│ │ ├── generator_config.yaml
│ │ ├── k8s_bench.yaml # aiperf benchmark sweep Job (Kubernetes)
│ │ ├── k8s_deploy.yaml # Kubernetes DynamoGraphDeployment
│ │ └── run_0.sh
│ ...
├── disagg
│ ├── best_config_topn.csv
│ ├── exp_config.yaml
│ ├── pareto.csv
│ ├── top1
│ │ ├── bench_run.sh # aiperf benchmark sweep script (bare-metal)
│ │ ├── decode_config.yaml
│ │ ├── generator_config.yaml
│ │ ├── k8s_bench.yaml # aiperf benchmark sweep Job (Kubernetes)
│ │ ├── k8s_deploy.yaml # Kubernetes DynamoGraphDeployment
│ │ ├── prefill_config.yaml
│ │ ├── run_0.sh
│ │ └── run_1.sh (for multi-node setups)
│ ...
└── pareto_frontier.png
```
#### Prerequisites
Before deploying, ensure you have:
1. **HuggingFace Token Secret** (for gated models):
```bash
kubectl create secret generic hf-token-secret \
-n your-namespace \
--from-literal=HF_TOKEN="your-huggingface-token"
```
2. **Model Cache PVC** (recommended for faster restarts):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: your-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```
#### Deploy the Configuration
The generated `k8s_deploy.yaml` provides a starting point. You'll typically need to customize it for your environment:
```bash
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```
**Complete deployment example** with model cache and production settings:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-agg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false # Use existing PVC
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          imagePullPolicy: IfNotPresent
    VLLMWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 4
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi # Required for vLLM
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--no-enable-prefix-caching"
            - "--tensor-parallel-size"
            - "2"
            - "--pipeline-parallel-size"
            - "1"
            - "--data-parallel-size"
            - "1"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-model-len"
            - "6000"
            - "--max-num-seqs"
            - "1024"
```
**Key deployment settings:**
| Setting | Purpose | Notes |
|---------|---------|-------|
| `backendFramework: vllm` | Tells Dynamo which runtime to use | Required at spec level |
| `pvcs` + `volumeMounts` | Caches model weights across restarts | Mount at `/opt/models` (not `/root/`) |
| `HF_HOME` env var | Points HuggingFace to cache location | Must match `mountPoint` |
| `sharedMemory.size: 16Gi` | IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
| `envFromSecret` | Injects HF_TOKEN | Required for gated models |
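The consistency rules in the table (`HF_HOME` must match the PVC `mountPoint`, and `sharedMemory.size` must be set) can be checked before applying a manifest. The sketch below assumes a service definition already loaded into a Python dictionary; `check_model_cache` is a hypothetical helper, not part of Dynamo.

```python
def check_model_cache(service, pvc_name="model-cache"):
    """Return problems with the model-cache settings of one service definition."""
    problems = []
    mounts = {m["name"]: m["mountPoint"] for m in service.get("volumeMounts", [])}
    envs = {e["name"]: e["value"] for e in service.get("envs", [])}
    mount_point = mounts.get(pvc_name)
    if mount_point is None:
        problems.append(f"PVC '{pvc_name}' is not mounted")
    elif envs.get("HF_HOME") != mount_point:
        problems.append(f"HF_HOME={envs.get('HF_HOME')} does not match mountPoint {mount_point}")
    if "size" not in service.get("sharedMemory", {}):
        problems.append("sharedMemory.size is not set")
    return problems


# Worker fields copied from the example deployment above.
worker = {
    "sharedMemory": {"size": "16Gi"},
    "volumeMounts": [{"name": "model-cache", "mountPoint": "/opt/models"}],
    "envs": [{"name": "HF_HOME", "value": "/opt/models"}],
}
print(check_model_cache(worker))
```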
### Step 4: Validate with AIPerf
After deployment, validate the predictions against actual performance using [AIPerf](https://github.com/ai-dynamo/aiperf).
> ℹ️ Run AIPerf **inside the cluster** to avoid network latency affecting measurements.
AIConfigurator (AIC) automatically generates AIPerf scripts along with the Dynamo configs and stores them in the results folder (when `--save-dir` is specified). For Kubernetes deployments, run benchmarks with `k8s_bench.yaml`; for bare-metal systems, use the `bench_run.sh` script. These scripts run AIPerf across a concurrency list: the default set (`1 2 8 16 32 64 128`) plus `BenchConfig.estimated_concurrency` and values within ±5% of it. You can customize this concurrency list as needed.
By default, AIPerf results are saved in `/tmp/bench_artifacts` inside the container. If a PVC name is specified with `--generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC`, result artifacts are saved in the PVC volume mount instead.
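To make the concurrency list concrete, the sketch below builds one for an estimated concurrency of 56. How the generated scripts pick the points within ±5% is not specified here, so the sketch assumes just the two endpoints, rounded to integers; `concurrency_sweep` is a hypothetical helper.

```python
DEFAULT_CONCURRENCY = [1, 2, 8, 16, 32, 64, 128]


def concurrency_sweep(estimated, spread=0.05):
    """Default set plus the estimate and its +/- spread endpoints (assumed selection)."""
    around = {
        round(estimated * (1 - spread)),
        estimated,
        round(estimated * (1 + spread)),
    }
    return sorted(set(DEFAULT_CONCURRENCY) | {c for c in around if c >= 1})


print(concurrency_sweep(56))
```

For an estimate of 56 this adds 53, 56, and 59 to the default set.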

| AIC Output | AIPerf Parameter | Notes |
|------------|-----------------|-------|
| `concurrency: 56 (=14x4)` | `--concurrency 56` | Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | `--isl 4000 --osl 500` | Match your AIC inputs |
| - | `--num-requests 800` | Use `concurrency × 40` minimum for statistical stability |
| - | `--extra-inputs "ignore_eos:true"` | Ensures exact OSL tokens generated |
> **Note on concurrency**: AIC reports concurrency as `total (=bs × replicas)`. When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica `bs` value instead.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: your-namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.10
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              aiperf profile \
                -m Qwen/Qwen3-32B-FP8 \
                --endpoint-type chat \
                -u http://dynamo-agg-frontend:8000 \
                --isl 4000 --isl-stddev 0 \
                --osl 500 --osl-stddev 0 \
                --num-requests 800 \
                --concurrency 56 \
                --streaming \
                --extra-inputs "ignore_eos:true" \
                --num-warmup-requests 40 \
                --ui-type simple
```
```bash
kubectl apply -f k8s_bench.yaml
kubectl logs -f -l job-name=aiperf-benchmark
```
**Validated results** (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):
| Metric | AIC Prediction | Actual (avg) | Status |
|--------|---------------|--------------|--------|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |
The table above is a validation example, not a universal guarantee. Expect
variance across clusters, backend versions, model cache settings, and network
fabric. Run multiple benchmark passes and compare against the generated
concurrency and sequence-length assumptions.
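The status column can be reproduced by computing the relative error of each metric against its prediction. The values below are copied from the table; the 10% threshold mirrors the wording of the status column.

```python
# (predicted, actual) pairs copied from the validated-results table.
results = {
    "TTFT (ms)": (509, 209),
    "ITL/TPOT (ms)": (16.49, 15.06),
    "Throughput (req/s)": (6.3, 6.9),
    "Total Output TPS": (3178, 3462),
}


def relative_error(predicted, actual):
    """Signed relative difference of the measurement from the prediction."""
    return (actual - predicted) / predicted


for metric, (predicted, actual) in results.items():
    err = relative_error(predicted, actual)
    verdict = "within 10%" if abs(err) <= 0.10 else "outside 10%"
    print(f"{metric}: {err:+.1%} ({verdict})")
```

TTFT falls well outside the 10% band, but in the favorable direction: the measured value is lower than predicted.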
## Fine-Tuning Your Deployment
AIConfigurator provides a strong starting point. Here's how to iterate for production:
### Adjusting for Actual Workload
If your real workload differs from the benchmark parameters:
```bash
# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
--model-path Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--isl 2000 \
--osl 2000 \
--ttft 1000 \
--tpot 10 \
--save-dir ./results_long_output
```
### Exploring Alternative Configurations
Use `exp` mode to compare custom configurations:
```yaml
# custom_exp.yaml
exps:
  - exp_tp2
  - exp_tp4
exp_tp2:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [2]
exp_tp4:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [4]
```
```bash
aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_custom
```
For production disaggregated deployments, validate the KV transfer path before
tuning replica counts. See [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving#deploying-disaggregated-with-rdma)
for RDMA prerequisites, the DGD resource pattern, and NIXL/UCX verification.
### Tuning vLLM-Specific Parameters
Override vLLM engine parameters with `--generator-set`:
```bash
aiconfigurator cli default \
--model-path Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--isl 4000 --osl 500 \
--ttft 600 --tpot 16.67 \
--save-dir ./results_tuned \
--generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
--generator-set Workers.agg.max_num_seqs=2048
```
Run `aiconfigurator cli default --generator-help` to see all available parameters.
### Prefix Caching Considerations
For workloads with repeated prefixes (e.g., system prompts):
- **Enable prefix caching** when you have high prefix hit rates
- **Disable prefix caching** (`--no-enable-prefix-caching`) for diverse prompts
AIConfigurator's default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.
## Supported Configurations
### Backends and Versions
For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the [**support matrix**](https://ai-dynamo.github.io/aiconfigurator/support-matrix/). The raw data is available as [per-system CSV files](https://github.com/ai-dynamo/aiconfigurator/tree/main/src/aiconfigurator/systems/support_matrix), which are automatically generated and tested to ensure accuracy across all supported configurations.
You can also check if a system / framework version is supported via the `aiconfigurator cli support` command. For example:
```bash
aiconfigurator cli support --model-path Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5
```
## Common Use Cases
```bash
# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
--model-path meta-llama/Llama-3.1-70B \
--total-gpus 16 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--ttft 200 --tpot 8
# High throughput (batch processing)
aiconfigurator cli default \
--model-path Qwen/Qwen3-32B-FP8 \
--total-gpus 32 \
--system h200_sxm \
--backend trtllm \
--ttft 2000 --tpot 50
# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
--model-path Qwen/Qwen3-32B-FP8 \
--total-gpus 16 \
--system h200_sxm \
--backend vllm \
--backend-version 0.12.0 \
--request-latency 12000 \
--isl 4000 --osl 500
```
## Additional Options
```bash
# Web interface for interactive exploration
pip3 install "aiconfigurator[webapp]"
aiconfigurator webapp # Visit http://127.0.0.1:7860
# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
--model-path Qwen/Qwen3-32B-FP8 \
--total-gpus 8 \
--system h200_sxm \
--backend vllm
# Check model/system support
aiconfigurator cli support \
--model-path Qwen/Qwen3-32B-FP8 \
--system h200_sxm \
--backend vllm
```
## Troubleshooting
### AIConfigurator Issues
**Model not found**: Use the full HuggingFace path (e.g., `Qwen/Qwen3-32B-FP8` not `QWEN3_32B`)
**Backend version mismatch**: Check supported versions with `aiconfigurator cli support --model-path <model> --system <system> --backend <backend>`
### Deployment Issues
**Pods crash with "Permission denied" on cache directory**:
- Mount the PVC at `/opt/models` instead of `/root/.cache/huggingface`
- Set `HF_HOME=/opt/models` environment variable
- Ensure the PVC has `ReadWriteMany` access mode
**Workers stuck in CrashLoopBackOff**:
- Check logs: `kubectl logs <pod-name> --previous`
- Verify `sharedMemory.size` is set (16Gi for vLLM, 80Gi for TRT-LLM)
- Ensure HuggingFace token secret exists and is named correctly
**Model download slow on every restart**:
- Add PVC for model caching (see deployment example above)
- Verify `volumeMounts` and `HF_HOME` are configured on workers
**"Context stopped or killed" errors (disaggregated only)**:
- Deploy ETCD and NATS infrastructure (required for KV cache transfer)
- See [Dynamo Kubernetes Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) for platform setup
### Performance Issues
**OOM errors**: Reduce `--max-num-seqs` or increase tensor parallelism
**Performance below predictions**:
- Verify warmup requests are sufficient (40+ recommended)
- Check for competing workloads on the cluster
- Ensure KV cache memory fraction is optimized
- Run benchmarks from inside the cluster to eliminate network latency
**Disaggregated TTFT extremely high (10+ seconds)**:
Start by checking the **RDMA and KV transfer path**. Without RDMA or another
fast transfer path, KV cache transfer may fall back to TCP and become a severe
bottleneck.
To diagnose:
```bash
# Check if RDMA resources are allocated
kubectl get pod <worker-pod-name> -o yaml | grep -A5 "resources:"
# Check UCX transport in logs
kubectl logs <worker-pod-name> | grep -i "UCX\|transport"
```
To fix:
1. Ensure your cluster has RDMA device plugin installed
2. Add `rdma/ib` resource requests to worker pods
3. Add `IPC_LOCK` capability to security context
4. Add UCX environment variables. See [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving#deploying-disaggregated-with-rdma)
for the deployment pattern and verification steps.
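After applying the fix, you can confirm that a worker pod actually carries the `rdma/ib` resource. This is a sketch under assumptions: `has_rdma_resource` and `pod_resources` are hypothetical helpers, and the pod and namespace names are placeholders.

```bash
# Succeed if the text on stdin mentions an rdma/ib resource.
has_rdma_resource() {
  grep -q 'rdma/ib'
}

# Print the resource requests and limits of every container in a pod.
# $1 = pod name, $2 = namespace (both placeholders)
pod_resources() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.containers[*].resources}'
}

# Usage (names are assumptions):
#   pod_resources my-decode-worker-0 default | has_rdma_resource \
#     && echo "rdma/ib resource present" \
#     || echo "no rdma/ib resource: KV transfer may fall back to TCP"
```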
**Disaggregated working but throughput lower than aggregated**:
For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better. Disaggregated shines for:
- Very long inputs (ISL > 8000) with short outputs
- Workloads needing independent prefill/decode scaling
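The rule of thumb above can be written as a small helper for quick triage. This is only a sketch of the heuristic, not a sizing tool: `suggest_mode` is a hypothetical name, and treating "short outputs" as an ISL/OSL ratio above 10:1 is an assumption made here for illustration.

```bash
# Suggest a serving mode from average input and output sequence lengths (tokens).
# $1 = ISL, $2 = OSL
suggest_mode() {
  local isl=$1 osl=$2
  if [ "$isl" -ge $((2 * osl)) ] && [ "$isl" -le $((10 * osl)) ]; then
    # Balanced workload: ISL/OSL between 2:1 and 10:1.
    echo "aggregated"
  elif [ "$isl" -gt 8000 ] && [ "$isl" -gt $((10 * osl)) ]; then
    # Very long input with short output (assumed: ratio above 10:1).
    echo "disaggregated"
  else
    # Outside both rules of thumb: measure rather than guess.
    echo "benchmark-both"
  fi
}

suggest_mode 4000 1000    # prints "aggregated"
suggest_mode 16000 200    # prints "disaggregated"
suggest_mode 1000 1000    # prints "benchmark-both"
```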
## Learn More
- [AIConfigurator CLI Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/cli_user_guide.md)
- [Dynamo Deployment Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/dynamo_deployment_guide.md)
- [Dynamo Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)
- [Benchmarking Guide](/dynamo/user-guides/benchmarking)
# KVBM Guide
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.
KVBM is modular: use it standalone via `pip install kvbm`, or as the memory management component of the full Dynamo stack. This guide covers installation, configuration, and deployment of KVBM and other KV cache management systems.
## Quick start with the pre-built NGC container
The fastest path is the published Dynamo container, which includes KVBM:
```bash
docker run --gpus all --rm -it \
nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 \
/bin/bash
```
For installation from source or custom builds, see [Local Installation](/dynamo/getting-started/local-installation) and [Release Artifacts](/dynamo/resources/release-artifacts).
## Run KVBM Standalone
KVBM can be used on its own, without the rest of the Dynamo stack:
```bash
pip install kvbm
```
See the [support matrix](/dynamo/resources/support-matrix) for version compatibility.
### Build from Source
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f dev/docker-compose.yml up -d
```
Pick one of the following to get a Dynamo vLLM container with KVBM built in. The subsequent serving commands are the same either way.
**Option A: Pre-built NGC container (recommended for quick start)**
```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
```
See the [Local Installation Guide](/dynamo/getting-started/local-installation) for full setup instructions and [Release Artifacts](/dynamo/resources/release-artifacts#container-images) for available versions.
**Option B: Build from source**
```bash
# Build a dynamo vLLM container (KVBM is built in by default)
# x86_64
python container/render.py --framework vllm --target runtime --output-short-filename --platform linux/amd64
docker buildx build --platform linux/amd64 -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
# arm64 (Grace, Jetson, arm64 EC2)
python container/render.py --framework vllm --target runtime --output-short-filename --platform linux/arm64
docker buildx build --platform linux/arm64 -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
# Launch the container
container/run.sh --image dynamo:latest-vllm-runtime -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 10
}'
```
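To see only the model's reply instead of the full JSON body, the response can be piped through a small filter. This sketch assumes `python3` is on the path and that the response follows the OpenAI chat completions shape (`choices[0].message.content`); `extract_reply` is a hypothetical helper name.

```bash
# Read a chat completion response on stdin and print the assistant's reply text.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage: append "| extract_reply" to the curl command above, adding -s to
# silence curl's progress output, for example:
#   curl -s localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 10}' \
#     | extract_reply
```

If the server returns an error body instead of a completion, the filter exits with a Python traceback, which is itself a useful signal that the deployment is not healthy.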
#### Alternative: Using `vllm serve` Directly
You can also use `vllm serve` directly with KVBM:
```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```
## Run KVBM in Dynamo with TensorRT-LLM