# NVIDIA Dynamo Documentation

# Quickstart


## Choose Your Path

- **Quickstart** -- You're here. Container fast path.
- **[Local Installation](/dynamo/getting-started/local-installation)** -- Full walkthrough: PyPI, configuration.
- **[Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart)** -- Kubernetes-native production path.
- **[Building from Source](/dynamo/getting-started/building-from-source)** -- For contributors against `main`.

Dynamo is backend-agnostic and Kubernetes-native without being Kubernetes-only. Use this container path to try the same frontend/router/worker stack locally; use the Kubernetes path when you want the operator, CRDs, Gateway API integration, autoscaling, scheduling, and cluster lifecycle management.

## Pull a Container

Containers have all dependencies pre-installed. Pick your backend:

**SGLang**

```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.2
```

**TensorRT-LLM**

```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.2
```

**vLLM**

```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2
```

**Hugging Face token required for gated models.** Llama, Kimi, Qwen-VL, and other gated models require `HF_TOKEN` in your environment and acceptance of the model card's license on huggingface.co. Set `export HF_TOKEN=hf_…` before launching.

For container versions and tags, see [Release Artifacts](/dynamo/resources/release-artifacts#container-images).

## Start the Frontend

In your container, start the OpenAI-compatible frontend on port 8000:

```bash
python3 -m dynamo.frontend --discovery-backend file
```

`--discovery-backend file` avoids needing etcd. To run the frontend and a worker in the same terminal, append `> logfile.log 2>&1 &` to each command so it runs in the background.
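The backgrounding tip above can be wrapped in a small shell helper. This is a minimal sketch, not part of Dynamo: `start_bg` and `BG_PID` are hypothetical names chosen for illustration, and the log file path is whatever you pass in.

```bash
# Run a command in the background, sending stdout and stderr to a log file.
# Usage: start_bg <log_file> <command> [args...]
# The PID of the background process is stored in BG_PID (hypothetical name).
start_bg() {
  local log_file="$1"
  shift
  # Same redirection as in the tip above: > logfile.log 2>&1 &
  "$@" > "$log_file" 2>&1 &
  BG_PID=$!
}
```

For example, `start_bg dynamo.frontend.log python3 -m dynamo.frontend --discovery-backend file` starts the frontend in the background; `kill "$BG_PID"` stops it, and `tail -f dynamo.frontend.log` follows its output.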
## Start a Worker

In another terminal, launch a worker for your backend:

**SGLang**

```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

**TensorRT-LLM**

```bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

**vLLM**

```bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```

## Verify and Test

Check that the endpoint is up:

```bash
curl -sf http://localhost:8000/health && echo OK
```

If you see `OK`, send a chat completion:

```bash title="Request"
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
```

```json title="Response"
{
  "id": "chatcmpl-...",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 9, "completion_tokens": 10, "total_tokens": 19}
}
```

Connection refused? The frontend takes a few seconds to start, so retry. For production liveness and readiness probes, see [Health Checks](/dynamo/user-guides/observability-local/health-checks).

## From the Digest

- How Dynamo optimizes for agentic workloads at three layers: the frontend API, the router, and KV cache management.
- How Dynamo's concurrent global index evolved through six iterations to sustain over 100M ops/sec.

## Dive Deeper

Pick a full install path from the [four options above](#choose-your-path), or explore how Dynamo works under the hood:

- How the frontend, router, and workers fit together.
- Worker discovery, multi-model routing, OpenAI compatibility.
- How the router places requests for prefix reuse.
- Liveness and readiness probes for production deployments.
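The "retry" advice in Verify and Test can be scripted rather than done by hand. The following is a minimal sketch that polls the same `/health` endpoint with the same `curl -sf` call used above; `wait_for_health`, its default of 30 attempts, and its 1-second delay are assumptions made for illustration, not Dynamo settings.

```bash
# Poll a health endpoint until it responds or the attempts run out.
# Usage: wait_for_health [url] [max_attempts] [delay_seconds]
wait_for_health() {
  local url="${1:-http://localhost:8000/health}"
  local max_attempts="${2:-30}"
  local delay="${3:-1}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s silences progress output; -f makes curl exit non-zero on HTTP errors.
    if curl -sf "$url" > /dev/null; then
      echo "healthy after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "not healthy after ${max_attempts} attempt(s)" >&2
  return 1
}
```

Calling `wait_for_health && curl localhost:8000/v1/chat/completions ...` sends the chat completion request only once the frontend is reachable, which avoids the "Connection refused" case during startup.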


# Introduction to Dynamo

Dynamo is an open-source, high-throughput, low-latency inference framework designed to serve generative AI workloads in distributed environments. It is Kubernetes-native for production deployments, with an operator, CRDs, Helm charts, service discovery, Gateway API integration, and topology-aware scheduling, while still supporting local containers, Python workers, and standalone components for development or incremental adoption.

This page gives an overview of Dynamo's design principles, performance benefits, and production-grade features.

Looking to get started right away? See the [Quickstart](/dynamo/getting-started/quickstart) to install and run Dynamo in minutes.

## Why Dynamo?

Inference engines optimize the GPU; Dynamo optimizes the system around them.

- **System-level optimization on top of any engine** -- Inference engines optimize the single-GPU forward pass. Dynamo adds the distributed layer: disaggregated serving, smart routing, KV cache management across memory tiers, and auto-scaling.
- **Composable performance improvement techniques** -- Disaggregated serving, KV cache-aware routing, and KV cache offloading each improve performance on their own; using them together yields compounding gains.
- **Engine-agnostic** -- Works with vLLM, SGLang, and TensorRT-LLM. Swap engines without changing your serving infrastructure. Support for Intel XPU and AMD hardware is being extended.
- **Kubernetes-native production path** -- Dynamo exposes inference graphs as Kubernetes resources (`DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoGraphDeploymentRequest`) and reconciles them with an operator, while integrating with Kubernetes service discovery, Gateway API Inference Extension, scheduling, observability, and model loading workflows.
- **Production-ready at scale** -- Dynamo covers the full deployment lifecycle: automatic configuration (AIConfigurator), runtime auto-scaling (Planner), topology-aware gang scheduling (Grove), fault tolerance, and observability.
- **Modular adoption** -- Start with one component (e.g., just the Router for KV-aware routing on top of your existing engine). Adopt more as needed. Each component is independently installable via pip.

## Design Principles

### Strong Foundations for AI Inference

Dynamo adds system-level optimizations on top of inference engines. To provide them, Dynamo takes an operating-system approach, laying down foundations for scheduling, memory management, and data transfer. These foundations allow Dynamo to evolve as new system-level performance techniques emerge.

One of the motivations for Dynamo's system-level design was to support disaggregated serving: running prefill and decode on different devices so each can be scaled and parallelized independently. Disaggregated serving required three capabilities: (1) scheduling to assign prefill and decode phases without interference, (2) memory management for KV cache offloading and onboarding, and (3) low-latency data transfer to move KV cache between nodes and across the memory hierarchy.

![Dynamo foundations: scheduling, memory management, and data transfer](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/6b20fc2c68b94a23bd3bc7376f13498d018379cf45f881f44c470f0b560901dc/pages-v1.2.0/assets/img/intro-foundations.svg)

Dynamo's foundations first addressed disaggregated serving, then extended to EPD (encode/prefill/decode) disaggregation for multimodal workloads, and now support workloads such as diffusion, RL, and agents.

### Modular but Well-Integrated Ecosystem

Dynamo is designed to reduce the burden of replacing an existing stack in production. It offers modular, standalone components as Rust crates and pip wheels.
For example, the three foundations of Dynamo for scheduling (Dynamo), memory management (KV Block Manager), and data transfer (NIXL) are each independently installable:

```bash
pip install ai-dynamo
pip install kvbm
pip install nixl
```

Pre-built containers with all dependencies are also available. See [Release Artifacts](/dynamo/resources/release-artifacts) for container images.

The Dynamo ecosystem includes these additional modular components, and will continue to grow over time:

| Category | Products | Description |
| :--- | :--- | :--- |
| **Scheduling** | Dynamo | Inference serving for GenAI workloads |
| **Routing** | Router | Smart routing leveraging KV cache hit rate and KV cache load. More algorithms will be added (e.g., agentic routing) |
| **Data Transfer** | [NIXL](https://github.com/ai-dynamo/nixl) | Point-to-point data transfer between GPUs and tiered storage (G1: GPU, G2: CPU, G3: SSD, G4: remote) |
| **Memory** | KVBM (KV Block Manager) | Manage KV cache across memory tiers (G1-G4) with customizable eviction policy |
| **Scaling / Cloud** | Planner | Automatically tune performance in real time for prefill and decode given SLA constraints on time to first token (TTFT) and time per output token (TPOT) |
| | [Grove](https://github.com/ai-dynamo/grove) | Enables gang scheduling and topology awareness required for Kubernetes multi-node disaggregated serving |
| | [Model Express](https://github.com/ai-dynamo/model-express) | Load model weights fast by caching and transferring them via NIXL to other GPUs. Will also be leveraged for fault tolerance |
| **Perf** | [DynoSim](/dynamo/user-guides/dynosim) | Simulate Dynamo deployment choices with Mocker, workload-driven runs, sweeps, and AIC-backed timing models before validating on GPUs |
| | [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator) | Provides calibrated performance models and configuration search inputs for rapid DGDR profiling. Formerly known as LLMPet |
| | [AIPerf](https://github.com/ai-dynamo/aiperf) | Re-architected GenAI-Perf written in Python for maximum extensibility; supports distributed benchmarking |
| | AITune | Given a model or pipeline, searches for the best backend to deploy with (e.g., TensorRT, `torch.compile`) (coming soon) |
| | Flex Tensor | Stream weights to GPUs from host memory to run very large language models on GPUs with limited memory capacity (coming soon) |

These components are modular but are designed to work together as a unified family. New components will follow the same design principle.

### Vendor-Agnostic Ecosystem Enablement

Dynamo is ***not designed for vendor lock-in***. Dynamo aims to enable the broader AI ecosystem and to provide the functionality developers need, such as integrations with third-party components.

From the beginning, Dynamo has been designed to support all major LLM inference engines (vLLM, SGLang, and TensorRT-LLM). Support for additional engines is planned to enable more developer use cases.

**Support for non-NVIDIA hardware** is also available: Dynamo is working with hardware vendors such as Intel and AMD to extend hardware support.

The full list of supported ecosystem components:

| **Product Areas** | **Supported Ecosystem Components** |
| :--- | :--- |
| Inference engines | SGLang, TensorRT-LLM, vLLM |
| Kubernetes | Inference gateway |
| Memory management | Dynamo KV Block Manager, [LMCache](/dynamo/integrations/kv-cache-integrations/lm-cache), [SGLang HiCache](/dynamo/integrations/kv-cache-integrations/hi-cache), [FlexKV](/dynamo/integrations/kv-cache-integrations/flex-kv) |
| Networking and storage | Mooncake, DOCA NetIO, GDS, POSIX, S3, 3FS ([supported via NIXL](/dynamo/design-docs/component-design/kvbm-design)) |
| Multi-HW | Intel XPU, AMD |

## Deployment Posture

Dynamo's production path is Kubernetes-native, not Kubernetes-only.
The same core runtime concepts can be used from a local process, a container, or a Kubernetes cluster:

| Path | Use when | What Dynamo provides |
|---|---|---|
| Local or container | You are evaluating, developing, or adopting one component at a time. | OpenAI-compatible frontend, router, workers, file or etcd discovery, Python/Rust APIs, and installable packages. |
| Kubernetes | You are deploying shared GPU capacity, multi-node serving, autoscaling, or platform-integrated inference. | Helm install, Dynamo operator, DGD/DCD/DGDR CRDs, Kubernetes-native discovery, Gateway API Inference Extension, Grove/LWS scheduling, ModelExpress, observability, and lifecycle management. |

## Request Routing Modes

Dynamo supports two request-routing modes. Both expose the same OpenAI-compatible API and the same backends; they differ in *where* request routing happens.

- **Standalone mode** (default) -- The Dynamo Frontend serves HTTP requests directly, and the integrated Dynamo Router makes KV-aware routing decisions before dispatching to workers. No external gateway is required. This is the mode used by all local installs and the default Kubernetes deployment. Request flow: `client -> Frontend -> Router -> workers`.
- **Gateway mode (GAIE)** -- Dynamo runs behind a Kubernetes [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) gateway. KV-aware routing is performed at the gateway layer by the Dynamo Endpoint Picker Plugin (EPP); the Frontend runs as a sidecar in `--router-mode direct` and forwards requests to the worker the EPP selected. Use this mode when your platform standardizes on the Inference Gateway, or when you want gateway-level policy (auth, rate limiting, observability) co-located with KV-aware routing. Request flow: `client -> Inference Gateway -> EPP (KV-aware) -> Frontend sidecar (direct) -> workers`.

Both modes support disaggregated serving, multimodal, and the same set of backends (vLLM, SGLang, TensorRT-LLM).
For full setup, supported features, and configuration of gateway mode, see the [Inference Gateway (GAIE) guide](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie).

## Performance

Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache-Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.

- [KV cache-aware routing](/dynamo/design-docs/component-design/router-design) -- Routes requests based on worker load and existing cache hits. By reusing precomputed KV pairs, it skips prefill compute for the cached prefix, so the decode phase can start sooner. [Baseten](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo) applied Dynamo KV cache-aware routing and saw 2x faster TTFT and 1.6x throughput on Qwen3 Coder 480B A35B.
- [KV cache offloading](/dynamo/design-docs/component-design/kvbm-design) -- Expands the available context window by moving KV cache from HBM to cheaper storage tiers such as host memory, local disk, or remote storage. Reusing precomputed state improves TTFT, reduces Total Cost of Ownership (TCO), and allows for longer context processing.
- [Disaggregated serving](/dynamo/design-docs/disaggregated-serving) -- Introduced in the Design Principles section. Its performance has been showcased by [InferenceX](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs): DeepSeek V3 can be served with ~7x throughput per GPU using disaggregated serving and large-scale expert parallelism.

Furthermore, when these three techniques are composed together, they yield compounding benefits as shown in the following diagram.
![Performance composability of disaggregated serving, KV cache-aware routing, and KV cache offloading](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/3755f37885691212943d05ca79b70e5dcb734835ec99489e9f3609f9b96e4401/pages-v1.2.0/assets/img/intro-perf.svg)

- **Disaggregated Serving + KV Cache-Aware Routing** -- KV cache-aware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
- **Disaggregated Serving + KV Cache Offloading** -- KV cache offloading results in faster TTFT, and the number of prefill workers can be reduced to lower TCO.
- **KV Cache-Aware Routing + KV Cache Offloading** -- Offloading increases the total addressable cache size, which raises the KV cache hit rate and in turn improves TTFT.

Ready to try these techniques? See [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) for step-by-step deployment examples that compose disaggregated serving, routing, and offloading.

## From Configuration to Production-Grade Deployment

### Finding Best Configurations Under 30 Seconds with AIConfigurator

Manually finding the optimal parallelism for disaggregated serving can take days of exhaustive configuration sweeps, a challenge that only intensifies at scale. Dynamo uses AIC-backed DynoSim-style modeling to identify strong configurations in under 30 seconds, providing clear projections of the performance gains over standard aggregated serving. This logic is natively integrated into the Dynamo Graph Deployment Request (DGDR), a Kubernetes Custom Resource Definition (CRD), allowing users to deploy using automatically generated optimized configs.

### Auto-Adjusting Deployment Based on SLA with Planner

Once the offline configuration is found with AIConfigurator or DGDR, developers can deploy their desired model into production.
However, production traffic can vary greatly, and a static configuration determined offline cannot adequately handle traffic spikes. Dynamo offers [Planner](/dynamo/design-docs/component-design/planner-design) to address this problem. Developers set their SLA in terms of TTFT and Time Per Output Token (TPOT). Planner examines online traffic and automatically scales prefill and decode workers to handle traffic spikes while maintaining the specified SLA. Planner has also been extended to handle more demanding scenarios, such as widely varying Input Sequence Length (ISL) under the same SLA. See the [Planner documentation](/dynamo/components/planner/planner-guide) for more details.

### Applying Topology-Aware Hierarchical Gang Scheduling with Grove

When Planner decides to autoscale, developers need a way to scale workers independently and hierarchically. For prefill/decode disaggregation in particular, prefill and decode workers need to be scaled independently to meet the specified SLA, and they need to be scheduled in physical proximity to each other for best performance. Dynamo offers [Grove](https://github.com/ai-dynamo/grove), a Kubernetes operator that provides a single declarative API for orchestrating any AI inference workload, from simple single-pod deployments to complex multi-node, disaggregated systems. Grove enables:

- Hierarchical gang scheduling
- Topology-aware placement
- Multi-level horizontal autoscaling
- Explicit startup ordering
- Rolling updates with configurable replacement strategies

These features are crucial for deploying and scaling inference at data center scale for optimal performance.

### Ensuring Fault Tolerance for LLMs

Kubernetes comes with some fault tolerance functionality, but LLM deployment requires specialized fault tolerance and resiliency.
Dynamo provides comprehensive fault tolerance mechanisms across multiple layers to ensure reliable LLM inference in production deployments:

- **Router and Frontend** -- Dynamo supports launching multiple frontend + router replicas for improved fault tolerance by sharing router states.
- **Request Migration** -- When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers while preserving partial generation state and maintaining seamless token flow to clients.
- **Request Cancellation** -- Dynamo supports canceling in-flight requests through the AsyncEngineContext trait, which provides graceful stop signals and hierarchical cancellation propagation through request chains.
- **Request Rejection (Load Shedding)** -- When workers are overloaded, Dynamo rejects new requests with HTTP 503 responses based on configurable thresholds for KV cache utilization and prefill tokens.

### Observability

Dynamo provides built-in metrics, distributed tracing, and logging for monitoring inference deployments. See the [Observability Guide](/dynamo/user-guides/observability-local) for setup details.

## What's Next?
Explore the following resources to go deeper:

- [Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) -- Compose disaggregated serving, routing, and offloading
- [KV Cache-Aware Routing](/dynamo/user-guides/kv-cache-aware-routing) -- Configure smart request routing
- [KV Cache Offloading](/dynamo/user-guides/kv-cache-offloading) -- Set up multi-tier memory management
- [Planner](/dynamo/components/planner/planner-guide) -- Configure SLA-based autoscaling
- [Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) -- Deploy at scale with Grove
- [Inference Gateway (GAIE)](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) -- Run Dynamo in gateway mode behind the K8s Inference Gateway
- [Overall Architecture](/dynamo/design-docs/overall-architecture) -- Full technical design
- [Support Matrix](/dynamo/resources/support-matrix) -- Check hardware and engine compatibility

**Further reading:** [Dynamo Digest](../digest/index.mdx).

> Install and run Dynamo on a local machine or VM with containers or PyPI


# Local Installation

This guide walks through installing and running Dynamo on a local machine or VM with one or more GPUs. By the end, you'll have a working OpenAI-compatible endpoint serving a model.

For production multi-node clusters, see the [Kubernetes Deployment Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart). To build from source for development, see [Building from Source](/dynamo/getting-started/building-from-source).

## System Requirements

| Requirement | Supported |
|---|---|
| **GPU** | NVIDIA Ampere, Ada Lovelace, Hopper, Blackwell |
| **OS** | Ubuntu 22.04, Ubuntu 24.04 |
| **Architecture** | x86_64, ARM64 (ARM64 requires Ubuntu 24.04) |
| **CUDA** | 12.9+ or 13.0+ (B300/GB300 require CUDA 13) |
| **Python** | 3.10, 3.12 |
| **Driver** | 575.51.03+ (CUDA 12) or 580.00.03+ (CUDA 13) |

TensorRT-LLM does not support Python 3.11.

For the full compatibility matrix including backend framework versions, see the [Support Matrix](/dynamo/resources/support-matrix).

## Install Dynamo

### Option A: Containers (Recommended)

Containers have all dependencies pre-installed. No setup required.
```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0

# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0

# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
```

To run frontend and worker in the same container, either:

- Run processes in the background with `&` (see the Run Dynamo section below), or
- Open a second terminal and use `docker exec -it <container_id> bash`

See [Release Artifacts](/dynamo/resources/release-artifacts#container-images) for available versions, and the backend guides for run instructions: [SGLang](/dynamo/backends/sg-lang) | [TensorRT-LLM](/dynamo/backends/tensor-rt-llm) | [vLLM](/dynamo/backends/v-llm)

### Option B: Install from PyPI

```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```

Install system dependencies and the Dynamo wheel for your chosen backend:

**SGLang**

```bash
sudo apt install python3-dev
uv pip install --prerelease=allow "ai-dynamo[sglang]"
```

For CUDA 13 (B300/GB300), the container is recommended. See [SGLang install docs](https://docs.sglang.io/get_started/install.html) for details.

**TensorRT-LLM**

```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```

TensorRT-LLM requires `pip` due to a transitive Git URL dependency that `uv` doesn't resolve. We recommend using the TensorRT-LLM container for broader compatibility. See the [TRT-LLM backend guide](/dynamo/backends/tensor-rt-llm) for details.
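Because the System Requirements list only Python 3.10 and 3.12, and TensorRT-LLM does not support 3.11, it can help to check the interpreter version before installing a wheel. The helper below is a minimal sketch for that check; `is_supported_python` is a hypothetical name and not a Dynamo command.

```bash
# Return 0 if the version string (e.g. "3.12.3") is a 3.10.x or 3.12.x release,
# matching the Python versions listed under System Requirements; 1 otherwise.
is_supported_python() {
  case "$1" in
    3.10 | 3.10.* | 3.12 | 3.12.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

With the virtual environment active, it could be used as `is_supported_python "$(python3 -c 'import platform; print(platform.python_version())')" || echo "Use Python 3.10 or 3.12"`.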
**vLLM**

```bash
sudo apt install python3-dev libxcb1
uv pip install --prerelease=allow "ai-dynamo[vllm]"
```

## Run Dynamo

### Discovery Backend

Dynamo components discover each other through a shared backend. Two options are available:

| Backend | When to Use | Setup |
|---|---|---|
| **File** | Single machine, local development | No setup -- pass `--discovery-backend file` to all components. The event plane automatically defaults to ZMQ (no NATS required). |
| **etcd** | Multi-node, production | Requires a running etcd instance (default if no flag is specified). The event plane defaults to NATS. |

This guide uses `--discovery-backend file`. For etcd setup, see [Service Discovery](/dynamo/kubernetes-deployment/advanced-platform/service-discovery).

### Verify Installation (Optional)

Verify the CLI is installed and callable:

```bash
python3 -m dynamo.frontend --help
```

If you cloned the repository, you can run additional system checks:

```bash
python3 dev/sanity_check.py
```

### Start the Frontend

```bash
# Start the OpenAI compatible frontend (default port is 8000)
python3 -m dynamo.frontend --discovery-backend file
```

To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in the background:

```bash
python3 -m dynamo.frontend --discovery-backend file > dynamo.frontend.log 2>&1 &
```

### Start a Worker

In another terminal (or the same terminal if using background mode), start a worker for your chosen backend:

**SGLang**

```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

**TensorRT-LLM**

```bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected in this local single-machine setup (no ModelExpress server running) and can be safely ignored.
In a Kubernetes deployment where `MODEL_EXPRESS_URL` is configured, this warning -- or the related `Failed to resolve local model path after server download` -- indicates that ModelExpress is configured but is not actually serving cached models; see [Model Caching in Kubernetes](/dynamo/kubernetes-deployment/model-loading/model-caching#option-2-modelexpress-p2p-distribution) for the correct configuration.

**vLLM**

```bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```

### KV Events Configuration

For dependency-free local development, disable KV event publishing (avoids NATS):

- **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
- **SGLang:** No flag needed (KV events disabled by default)
- **TensorRT-LLM:** No flag needed (KV events disabled by default)

KV events are disabled by default for all backends. For vLLM and SGLang, add backend-specific `--kv-events-config` only when you want KV event publishing enabled. For TensorRT-LLM, enable event publishing with `--publish-events-and-metrics`.

## Test Your Deployment

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
```

## Troubleshooting

**CUDA/driver version mismatch**

Run `nvidia-smi` to check your driver version. Dynamo requires driver 575.51.03+ for CUDA 12 or 580.00.03+ for CUDA 13. B300/GB300 GPUs require CUDA 13. See the [Support Matrix](/dynamo/resources/support-matrix) for full requirements.

**Model doesn't fit on GPU (OOM)**

The default model `Qwen/Qwen3-0.6B` requires ~2GB of GPU memory. Larger models need more VRAM:

| Model Size | Approximate VRAM |
|---|---|
| 7B | 14-16 GB |
| 13B | 26-28 GB |
| 70B | 140+ GB (multi-GPU) |

Start with a small model and scale up based on your hardware.

**Python 3.11 with TensorRT-LLM**

TensorRT-LLM does not support Python 3.11.
If you see installation failures with TensorRT-LLM, check your Python version with `python3 --version`. Use Python 3.10 or 3.12 instead.

**Container runs but GPU not detected**

Ensure you passed `--gpus all` to `docker run`. Without this flag, the container won't have access to GPUs:

```bash
# Correct
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0

# Wrong -- no GPU access
docker run --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0
```

## Next Steps

- [Backend Guides](/dynamo/backends/sg-lang) -- Backend-specific configuration and features
- [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving) -- Scale prefill and decode independently
- [KV Cache Aware Routing](/dynamo/user-guides/kv-cache-aware-routing) -- Smart request routing
- [Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) -- Production multi-node deployments

> Build Dynamo from source for development and contributions


# Building from Source

Build Dynamo from source when you want to contribute code, test features on the development branch, or customize the build. If you just want to run Dynamo, the [Local Installation](/dynamo/getting-started/local-installation) guide is faster.

This guide covers Ubuntu and macOS. For a containerized dev environment that handles all of this automatically, see [DevContainer](#devcontainer).

## 1. Install System Libraries

**Ubuntu:**

```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**

```bash
# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake protobuf

# Verify Metal is accessible
xcrun -sdk macosx metal
```

## 2. Install Rust

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```

## 3. Create a Python Virtual Environment

Install [uv](https://docs.astral.sh/uv/#installation) if you don't have it:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Create and activate a virtual environment:

```bash
uv venv .venv
source .venv/bin/activate
```

## 4. Install Build Tools

```bash
uv pip install pip maturin
```

[Maturin](https://github.com/PyO3/maturin) is the Rust-Python bindings build tool.

## 5. Build the Rust Bindings

```bash
cd lib/bindings/python
maturin develop --uv
```

## 6. Install GPU Memory Service

```bash
# Return to project root
cd "$(git rev-parse --show-toplevel)"
uv pip install -e lib/gpu_memory_service
```

## 7. Install the Wheel

```bash
uv pip install -e .
```

## 8. Verify the Build

```bash
python3 -m dynamo.frontend --help
```

You should see the frontend command help output.

## DevContainer

VSCode and Cursor users can skip manual setup using pre-configured development containers. The DevContainer installs all toolchains, builds the project, and sets up the Python environment automatically.
Framework-specific containers are available for vLLM, SGLang, and TensorRT-LLM. See the [DevContainer README](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/.devcontainer) for setup instructions.

## Set Up Pre-commit Hooks

Before submitting PRs, install the pre-commit hooks to ensure your code passes CI checks:

```bash
uv pip install pre-commit
pre-commit install
```

Run checks manually on all files:

```bash
pre-commit run --all-files
```

## Troubleshooting

**Missing system packages**

If `maturin develop` fails with linker errors, verify all system dependencies are installed. On Ubuntu:

```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```

**Virtual environment not activated**

Maturin builds against the active Python interpreter. If you see errors about Python or site-packages, ensure your virtual environment is activated:

```bash
source .venv/bin/activate
```

**Disk space**

The Rust `target/` directory can grow to 10+ GB during development. If builds fail with disk space errors, clean the build cache:

```bash
cargo clean
```

## Next Steps

- [Contribution Guide](/dynamo/getting-started/contribution-guide) -- Workflow for contributing code
- [Examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples) -- Explore the codebase
- [Good First Issues](https://github.com/ai-dynamo/dynamo/labels/good-first-issue) -- Find a task to work on

# Kubernetes Deployment

Use the Kubernetes guides when you are ready to move beyond a local Dynamo process and deploy on a GPU cluster.

Dynamo's Kubernetes path is native to the platform: inference graphs are expressed as Dynamo CRDs, reconciled by the Dynamo operator, installed with Helm, and integrated with Kubernetes service discovery, Gateway API Inference Extension, scheduling, observability, and model-loading workflows. This does not make Kubernetes the only way to use Dynamo.
Local containers, PyPI installs, and standalone components remain the right path for evaluation, development, and incremental adoption. Start with the [Kubernetes Quickstart](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) to run one model end to end. Then use the rest of the Kubernetes Deployment section based on what you need next: | Goal | Guide | |---|---| | Install the operator and prerequisites | [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) | | Deploy and manage models | [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide) | | Load models faster across pods | [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) and [ModelExpress](/dynamo/kubernetes-deployment/model-loading/model-express) | | Operate a cluster deployment | [Autoscaling](/dynamo/kubernetes-deployment/operate/autoscaling), [Rolling Update](/dynamo/kubernetes-deployment/operate/rolling-update), [Disagg Communication](/dynamo/kubernetes-deployment/operate/disagg-communication), and [Observability Metrics](/dynamo/kubernetes-deployment/operate/observability/metrics) | | Scale disaggregated serving | [Multinode Deployments](/dynamo/kubernetes-deployment/scale/multinode-deployments), [Grove](/dynamo/kubernetes-deployment/scale/grove), and [Topology Aware Scheduling](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) | | Integrate with Kubernetes serving APIs | [Gateway API Inference Extension (GAIE)](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) and [LWS](/dynamo/integrations/kubernetes-integrations/lws) | If you are still evaluating Dynamo locally, start with the [Quickstart](/dynamo/getting-started/quickstart) and [Local Installation](/dynamo/getting-started/local-installation) first.


# Contribution Guide Dynamo is an open-source distributed inference platform, built by a growing community of contributors. The project is licensed under [Apache 2.0](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/LICENSE) and welcomes contributions of all sizes -- from typo fixes to major features. Community contributions have shaped core areas of Dynamo including backend integrations, documentation, deployment tooling, and performance improvements. With 200+ external contributors, 220+ merged community PRs, and new contributors joining every month, Dynamo is one of the fastest-growing open-source inference projects. Check out our [commit activity](https://github.com/ai-dynamo/dynamo/graphs/commit-activity) and [GitHub stars](https://github.com/ai-dynamo/dynamo/stargazers). This guide will help you get started. Join the community: - [CNCF Slack (`#ai-dynamo`)](https://communityinviter.com/apps/cloud-native/cncf) -- join CNCF Slack and find us in `#ai-dynamo` - [Discord](https://discord.gg/nvidia-dynamo) - [GitHub Discussions](https://github.com/ai-dynamo/dynamo/discussions) ## TL;DR For experienced contributors: 1. Fork and clone the repo 2. For changes ≥100 lines or new features, [open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first 3. Create a branch: `git checkout -b yourname/fix-router-timeout` 4. Make changes, run `pre-commit run` 5. Commit with DCO sign-off: `git commit -s -m "fix: description"` 6. Open a PR targeting `main` --- ## Ways to Contribute ### Report a Bug Found something broken? [Open a bug report](https://github.com/ai-dynamo/dynamo/issues/new?template=bug_report.yml) with: - Steps to reproduce - Expected vs. 
actual behavior - Environment details (OS, GPU, Python version, Dynamo version) ### Improve Documentation Documentation improvements are always welcome: - Fixing typos or unclear explanations - Adding examples or tutorials - Improving API documentation Small doc fixes can be submitted directly as PRs without an issue. ### Propose a Feature Have an idea? [Open a feature request](https://github.com/ai-dynamo/dynamo/issues/new?template=feature_request.yml) to discuss it with maintainers before implementation. ### Contribute Code Ready to write code? See the [Contribution Workflow](#contribution-workflow) section below. ### Help the Community Not all contributions are code. You can also: - Answer questions on [Discord](https://discord.gg/nvidia-dynamo) or in the `#ai-dynamo` channel on [CNCF Slack](https://communityinviter.com/apps/cloud-native/cncf) - Review pull requests - Share how you're using Dynamo -- blog posts, talks, or social media - Star the [repository](https://github.com/ai-dynamo/dynamo) --- ## Getting Started ### Find an Issue Browse [open issues](https://github.com/ai-dynamo/dynamo/issues) or look for: | Issue Type | Description | |------------|-------------| | [Good First Issues](https://github.com/ai-dynamo/dynamo/labels/good-first-issue) | Beginner-friendly, with guidance | | [Help Wanted](https://github.com/ai-dynamo/dynamo/labels/help-wanted) | Community contributions welcome | ### Fork and Clone 1. [Fork the repository](https://github.com/ai-dynamo/dynamo/fork) on GitHub 2. Clone your fork: ```bash git clone https://github.com/YOUR-USERNAME/dynamo.git cd dynamo git remote add upstream https://github.com/ai-dynamo/dynamo.git ``` ### Building from Source Full build instructions are included below. Expand the accordion to set up your local development environment.
#### 1. Install System Libraries

**Ubuntu:**

```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```

**macOS:**

```bash
# Install Homebrew if needed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

brew install cmake protobuf

# Verify Metal is accessible
xcrun -sdk macosx metal
```

#### 2. Install Rust

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```

#### 3. Create a Python Virtual Environment

Install [uv](https://docs.astral.sh/uv/#installation) if you don't have it:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Create and activate a virtual environment:

```bash
uv venv .venv
source .venv/bin/activate
```

#### 4. Install Build Tools

```bash
uv pip install pip maturin
```

[Maturin](https://github.com/PyO3/maturin) is the Rust-Python bindings build tool.

#### 5. Build the Rust Bindings

```bash
cd lib/bindings/python
maturin develop --uv
```

#### 6. Install GPU Memory Service

```bash
# Return to project root
cd "$(git rev-parse --show-toplevel)"
uv pip install -e lib/gpu_memory_service
```

#### 7. Install the Wheel

```bash
uv pip install -e .
```

#### 8. Verify the Build

```bash
python3 -m dynamo.frontend --help
```

VSCode and Cursor users can use the [`.devcontainer`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/.devcontainer) folder for a pre-configured development environment. See the [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/.devcontainer/README.md) for details.
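Before starting a build, it can help to confirm that the command-line tools used in the steps above are on your `PATH`. The following is a minimal sketch and not part of the Dynamo repository: `missing_tools` and `BUILD_TOOLS` are hypothetical names chosen for illustration, and the binary names (for example `protoc` for the protobuf compiler) are assumptions about how the packages install on your system.

```python
import shutil

# Tools referenced by the build steps above. This list is an illustrative
# assumption; adjust the binary names to match your platform.
BUILD_TOOLS = ["cargo", "uv", "maturin", "cmake", "protoc", "git"]


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(BUILD_TOOLS)
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All build tools found.")
```

Run it inside the activated virtual environment so that `maturin` is resolved from `.venv`. A clean result only shows that the binaries exist; it does not check versions or system libraries such as `libhwloc-dev`.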
### Set Up Pre-commit Hooks

```bash
uv pip install pre-commit
pre-commit install
```

You're all set up! Get curious -- explore the codebase, experiment with the [examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples), and see how the pieces fit together. When you're ready, pick an issue from the [Good First Issues](https://github.com/ai-dynamo/dynamo/labels/good-first-issue) board or read on for the full contribution workflow.

---

## Contribution Workflow

The contribution process depends on the size and scope of your change. Even when not required, opening an issue is a great way to start a conversation with Dynamo maintainers before investing time in a PR.

| Size | Lines Changed | Example | What You Need |
|------|---------------|---------|---------------|
| **XS** | 1–10 | Typo fix, config tweak | Submit a PR directly |
| **S** | 10–100 | Small bug fix, doc improvement, focused refactor | Submit a PR directly |
| **M** | 100–200 | Feature addition, moderate refactor | [Open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first |
| **L** | 200–500 | Multi-file feature, new component | [Open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first |
| **XL** | 500–1000 | Major feature, cross-component change | [Open an issue](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) first |
| **XXL** | 1000+ | Architecture change | Requires a [DEP](https://github.com/ai-dynamo/enhancements) |

**Small changes (under 100 lines):** Submit a PR directly -- no issue needed. This includes typos, simple bug fixes, and formatting. If your PR addresses an existing approved issue, link it with "Fixes #123".

**Larger changes (≥100 lines):** [Open a Contribution Request](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) issue first and wait for the `approved-for-pr` label before submitting a PR.
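The size thresholds in the table above can be read as a simple lookup from lines changed to the required first step. The sketch below is illustrative only: `contribution_requirement` is a hypothetical helper, and because the table's ranges share their endpoints, it assumes the "under 100" and "1000+" wording decides the boundaries. Changes that touch architecture have their own DEP requirement and are not modeled here.

```python
def contribution_requirement(lines_changed: int) -> str:
    """Map a change size to the first step the workflow table asks for.

    Boundary handling is an assumption: under 100 lines goes straight to a
    PR, 100 to 999 needs an issue first, and 1000 or more needs a DEP.
    """
    if lines_changed < 1:
        raise ValueError("lines_changed must be at least 1")
    if lines_changed < 100:
        return "submit a PR directly"
    if lines_changed < 1000:
        return "open a Contribution Request issue first"
    return "open a DEP first"


if __name__ == "__main__":
    for size in (5, 100, 1200):
        print(size, "->", contribution_requirement(size))
```

Line counts are only a proxy. The guide notes that opening an issue is a good way to start a conversation even when it is not required.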
**Architecture changes:** Changes that affect multiple components, introduce or modify public APIs, alter communication plane architecture, or affect backend integration contracts require a [Dynamo Enhancement Proposal (DEP)](https://github.com/ai-dynamo/enhancements). Open a DEP in the [`ai-dynamo/enhancements`](https://github.com/ai-dynamo/enhancements) repo before starting implementation. ### Submitting a Pull Request 1. **Create a GitHub Issue** (if required) — [Open a Contribution Request](https://github.com/ai-dynamo/dynamo/issues/new?template=contribution_request.yml) and describe what you're solving, your proposed approach, estimated PR size, and files affected. 2. **Get Approval** — Wait for maintainers to review and apply the `approved-for-pr` label. 3. **Submit a Pull Request** — [Open a PR](https://github.com/ai-dynamo/dynamo/compare) that references the issue using GitHub keywords (e.g., "Fixes #123"). 4. **Address Code Rabbit Review** — Respond to automated Code Rabbit suggestions, including nitpicks. 5. **Trigger CI Tests** — For external contributors, a maintainer must comment `/ok to test COMMIT-ID` to run the full CI suite, where `COMMIT-ID` is the short SHA of your latest commit. Fix any failing tests before requesting human review. 6. **Request Review** — Add the person who approved your issue as a reviewer. Check [CODEOWNERS](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/CODEOWNERS) for required approvers based on files modified. **AI-Generated Code:** While we encourage using AI tools, you must fully understand every change in your PR. Inability to explain submitted code will result in rejection. ### Branch Naming Use a descriptive branch name that identifies you and the change: ```text yourname/fix-description ``` Examples: ```text jsmith/fix-router-timeout jsmith/add-lora-support ``` --- ## Code Style & Quality Maintainers assess contribution quality based on code style, test coverage, architecture alignment, and review responsiveness. 
Consistent, high-quality contributions are the foundation for building trust in the project. ### Pre-commit Hooks All PRs are checked against [pre-commit hooks](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/.pre-commit-config.yaml). After [installing pre-commit](#set-up-pre-commit-hooks), run checks locally: ```bash pre-commit run --all-files ``` ### Commit Message Conventions Use [conventional commit](https://www.conventionalcommits.org/) prefixes: | Prefix | Use For | |--------|---------| | `feat:` | New features | | `fix:` | Bug fixes | | `docs:` | Documentation changes | | `refactor:` | Code refactoring (no behavior change) | | `test:` | Adding or updating tests | | `chore:` | Maintenance, dependency updates | | `ci:` | CI/CD changes | | `perf:` | Performance improvements | Examples: ```text feat(router): add weighted load balancing fix(frontend): resolve streaming timeout on large responses docs: update quickstart for macOS users test(planner): add unit tests for scaling policy ``` ### Language Conventions | Language | Style Guide | Formatter | |----------|-------------|-----------| | **Python** | [PEP 8](https://peps.python.org/pep-0008/) | `black`, `ruff` | | **Rust** | [Rust API Guidelines](https://rust-lang.github.io/api-guidelines/) | `cargo fmt`, `cargo clippy` | | **Go** | [Effective Go](https://go.dev/doc/effective_go) | `gofmt` | ### Testing Run the test suite before submitting a PR: ```bash # Run all tests pytest tests/ # Run unit tests only pytest -m unit tests/ # Run a specific test file pytest -s -v tests/test_example.py ``` For Rust components: ```bash cargo test ``` For the Kubernetes operator (Go): ```bash cd deploy/operator go test ./... 
-v ``` ### General Guidelines - Keep PRs focused -- one concern per PR - Write clean, well-documented code that future contributors can understand - Include tests for new functionality and bug fixes - Ensure clean builds (no warnings or errors) - All tests must pass - No commented-out code - Respond to review feedback promptly and constructively ### Running GitHub Actions Locally Use [act](https://nektosact.com/) to run workflows locally: ```bash act -j pre-merge-rust ``` Or use the [GitHub Local Actions](https://marketplace.visualstudio.com/items?itemName=SanjulaGanepola.github-local-actions) VS Code extension. --- ## What to Expect ### Status Labels | Status | What It Means | |--------|---------------| | `needs-triage` | We're reviewing your issue | | `needs-info` | We need more details from you | | `approved-for-pr` | Ready for implementation — submit a PR | | `in-progress` | Someone is working on this | | `blocked` | Waiting on external dependency | ### Response Times We aim to: - **Respond** to new issues within a few business days - **Triage** high-priority issues within a week Issues with no activity for 30 days may be auto-closed (can be reopened). ### Review Process After you submit a PR and complete the steps in [Submitting a Pull Request](#submitting-a-pull-request): 1. The reviewer will provide feedback -- please respond to all comments within a reasonable timeframe 2. If changes are requested, address them and ping the reviewer for re-review 3. If your PR hasn't been reviewed within 7 days, feel free to ping the reviewer or leave a comment ### Good First Issues Issues labeled `good-first-issue` are sized for new contributors. We provide extra guidance on these -- look for clear acceptance criteria and a suggested approach in the issue description. --- ## DCO & Licensing ### Developer Certificate of Origin Dynamo requires all contributions to be signed off with the [Developer Certificate of Origin (DCO)](https://developercertificate.org/). 
This certifies that you have the right to submit your contribution under the project's [Apache 2.0 license](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/LICENSE).

Each commit must include a sign-off line with your name and email address:

```text
Signed-off-by: Jane Smith <jane.smith@example.com>
```

Add this automatically with the `-s` flag:

```bash
git commit -s -m "fix: your descriptive message"
```

**Requirements:**

- Use your real name (no pseudonyms or anonymous contributions)
- Your `user.name` and `user.email` must be configured in git

**DCO Check Failed?** See our [DCO Troubleshooting Guide](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/DCO.md) for step-by-step instructions to fix it.

### License

By contributing, you agree that your contributions will be licensed under the [Apache 2.0 License](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/LICENSE).

---

## Code of Conduct

We are committed to providing a welcoming and inclusive environment. All participants are expected to abide by our [Code of Conduct](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/CODE_OF_CONDUCT.md).

---

## Security

If you discover a security vulnerability, please follow the instructions in our [Security Policy](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/SECURITY.md). Do not open a public issue for security vulnerabilities.

---

## Getting Help

- **CNCF Slack**: [Join CNCF Slack](https://communityinviter.com/apps/cloud-native/cncf) and find us in `#ai-dynamo`
- **Discord**: [Join our community](https://discord.gg/nvidia-dynamo)
- **Discussions**: [GitHub Discussions](https://github.com/ai-dynamo/dynamo/discussions)
- **Documentation**: [docs.nvidia.com/dynamo](https://docs.nvidia.com/dynamo/)

Thank you for contributing to Dynamo!
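As a closing aid, the two commit rules described in this guide (a conventional prefix on the subject line and a DCO sign-off) can be checked locally before pushing. This is a rough sketch under stated assumptions, not the check that Dynamo's CI runs: `check_commit_message` is a hypothetical helper, the prefix list is copied from the table in this guide, and the sign-off pattern assumes the `Signed-off-by: Name <email>` trailer that `git commit -s` writes.

```python
import re

# Prefixes taken from the "Commit Message Conventions" table in this guide.
PREFIXES = ("feat", "fix", "docs", "refactor", "test", "chore", "ci", "perf")

# Subject: prefix, optional "(scope)", colon, space, then some text.
SUBJECT_RE = re.compile(r"^(?:%s)(?:\([\w-]+\))?: \S" % "|".join(PREFIXES))

# Trailer in the form written by `git commit -s`.
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    if not SUBJECT_RE.match(subject):
        problems.append("subject lacks a conventional prefix")
    if not SIGNOFF_RE.search(message):
        problems.append("missing Signed-off-by line")
    return problems


if __name__ == "__main__":
    sample = (
        "fix(router): resolve timeout\n\n"
        "Signed-off-by: Jane Smith <jane.smith@example.com>"
    )
    print(check_commit_message(sample))
```

A script like this could be called from a local `commit-msg` hook, but the DCO check on the pull request remains the authority.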
# Support Matrix **See also:** [Release Artifacts](/dynamo/resources/release-artifacts) for container images, wheels, Helm charts, and crates | [Feature Matrix](/dynamo/resources/feature-matrix) for backend feature support ## At a Glance **Latest stable release:** [v1.2.0](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0) -- SGLang `0.5.11` (NIXL `1.0.1`) | TensorRT-LLM `1.3.0rc14` (NIXL `0.10.1`) | vLLM `0.20.1` (NIXL `0.10.1`) **Experimental release:** [v1.2.0-deepseek-v4-dev.3](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.3) *(DeepSeek-V4-Flash / V4-Pro on Blackwell, vLLM + SGLang containers only)* -- vLLM `0.20.1` | SGLang upstream `deepseek-v4-blackwell` preview | NIXL `0.10.1` | Requirement | Supported | | :--- | :--- | | **GPU** | NVIDIA Ampere, Ada Lovelace, Hopper, Blackwell | | **OS** | Ubuntu 22.04, Ubuntu 24.04, CentOS Stream 9 (experimental) | | **Arch** | x86_64, ARM64 (ARM64 requires Ubuntu 24.04) | | **CUDA 12** | Container images for SGLang and vLLM (CUDA 12.9) | | **CUDA 13** | Container images for TensorRT-LLM (CUDA 13.1), SGLang and vLLM (CUDA 13.0) | **On this page:** [Backend Dependencies](#backend-dependencies) | [CUDA and Drivers](#cuda-and-driver-requirements) | [Hardware](#hardware-compatibility) | [Platform](#platform-architecture-compatibility) | [Cloud](#cloud-service-provider-compatibility) | [Build Support](#build-support) ## Backend Dependencies > Driver requirements differ by backend — see [CUDA and Driver Requirements](#cuda-and-driver-requirements) below. 
The following table shows the backend framework versions included with each Dynamo release: | **Dynamo** | **SGLang** | **TensorRT-LLM** | **vLLM** | **NIXL** | | :--- | :--- | :--- | :--- | :--- | | **main (ToT)** | `0.5.11` | `1.3.0rc17` | `0.21.0` | `0.10.1` (TRT-LLM); `1.1.0` (vLLM); `1.0.1` (SGLang) | | **v1.2.0** | `0.5.11` | `1.3.0rc14` | `0.20.1` | `0.10.1` (TRT-LLM, vLLM); `1.0.1` (SGLang) | | **v1.2.0-deepseek-v4-dev.3** *(experimental, partial)* | upstream DSv4 preview | — | `0.20.1` | `0.10.1` | | **v1.2.0-deepseek-v4-dev.2** *(experimental, partial)* | upstream DSv4 preview | — | `0.20.0` | `0.10.1` | | **v1.1.1** | `0.5.10.post1` | `1.3.0rc11` | `0.19.0` | `0.10.1` (TRT-LLM, vLLM); `1.0.1` (SGLang) | | **v1.1.0** | `0.5.10.post1` | `1.3.0rc11` | `0.19.0` | `0.10.1` (TRT-LLM, vLLM); `1.0.1` (SGLang) | | **v1.1.0-dev.3** *(experimental, partial)* | `0.5.10.post1` | `1.3.0rc11` | `0.19.0` | `0.10.1` | | **v1.1.0-dev.2** *(experimental, partial)* | `0.5.9` | `1.3.0rc9` | `0.19.0` | `0.10.1` | | **v1.1.0-dev.1** *(experimental)* | `0.5.9` | `1.3.0rc5.post1` | `0.17.1` | `0.10.1` | | **v1.0.2** | `0.5.9` | `1.3.0rc5.post1` | `0.16.0` | `0.10.1` | | **v1.0.1** | `0.5.9` | `1.3.0rc5.post1` | `0.16.0` | `0.10.1` | | **v1.0.0** | `0.5.9` | `1.3.0rc5.post1` | `0.16.0` | `0.10.1` | | **v0.9.1** | `0.5.8` | `1.3.0rc3` | `0.14.1` | `0.9.0` | | **v0.9.0** | `0.5.8` | `1.3.0rc1` | `0.14.1` | `0.9.0` | | **v0.8.1.post3** | `0.5.6.post2` | `1.2.0rc6.post3` | `0.12.0` | `0.8.0` | | **v0.8.1.post2** | `0.5.6.post2` | `1.2.0rc6.post2` | `0.12.0` | `0.8.0` | | **v0.8.1.post1** | `0.5.6.post2` | `1.2.0rc6.post1` | `0.12.0` | `0.8.0` | | **v0.8.1** | `0.5.6.post2` | `1.2.0rc6.post1` | `0.12.0` | `0.8.0` | | **v0.8.0** | `0.5.6.post2` | `1.2.0rc6.post1` | `0.12.0` | `0.8.0` | | **v0.7.1** | `0.5.4.post3` | `1.2.0rc3` | `0.11.0` | `0.8.0` | | **v0.7.0.post1** | `0.5.4.post3` | `1.2.0rc3` | `0.11.0` | `0.8.0` | | **v0.7.0** | `0.5.4.post3` | `1.2.0rc2` | `0.11.0` | `0.8.0` | | 
**v0.6.1.post1** | `0.5.3.post2` | `1.1.0rc5` | `0.11.0` | `0.6.0` | | **v0.6.1** | `0.5.3.post2` | `1.1.0rc5` | `0.11.0` | `0.6.0` | | **v0.6.0** | `0.5.3.post2` | `1.1.0rc5` | `0.11.0` | `0.6.0` | For **v1.1.0-dev.2**, **v1.1.0-dev.3**, **v1.2.0-deepseek-v4-dev.2**, and **v1.2.0-deepseek-v4-dev.3**, the cells above match `container/context.yaml` on the corresponding release branch (pins used to build images). Those lines are **partial releases**: not every backend has a published Dynamo runtime container for that tag. See [Pre-Release Artifacts](/dynamo/resources/release-artifacts#pre-release-artifacts) for what actually shipped. The `v1.2.0-deepseek-v4-dev.2` and `v1.2.0-deepseek-v4-dev.3` SGLang containers are built on the upstream `lmsysorg/sglang:deepseek-v4-blackwell` preview image rather than a tagged SGLang release; TensorRT-LLM is not part of those dev releases. ### Version Labels - **1.3.0 (main / ToT)** reflects the current development branch. - Releases marked *(experimental, partial)* are pre-releases: the table shows branch build pins, which may include backends with no NGC image for that dev tag yet. ### Version Compatibility - Backend versions listed are the only versions tested and supported for each release. - TensorRT-LLM does not support Python 3.11; installation of the `ai-dynamo[trtllm]` wheel will fail on Python 3.11. ### CUDA and Driver Requirements Dynamo container images include CUDA toolkit libraries. The host machine must have a compatible NVIDIA GPU driver installed. 
| Dynamo Version | Backend | CUDA Toolkit | Min Driver | Notes | | :--- | :--- | :--- | :--- | :--- | | **1.2.0** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | | **TensorRT-LLM** | 13.1 | 580.xx+ | CUDA 13 only | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | **1.1.1** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | | **TensorRT-LLM** | 13.1 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | **1.1.0** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | | **TensorRT-LLM** | 13.1 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | **1.0.2** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | | **TensorRT-LLM** | 13.1 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | **1.0.1** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | | **TensorRT-LLM** | 13.1 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | **1.0.0** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | | **TensorRT-LLM** | 13.1 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | | | **0.9.1** | **SGLang** | 12.9 | 575.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | **0.9.0** | **SGLang** | 12.9 | 575.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | **0.8.1** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | Experimental | | | **TensorRT-LLM** | 13.0 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | Experimental | | **0.8.0** | **SGLang** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | Experimental | | | **TensorRT-LLM** | 13.0 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | | | 13.0 | 580.xx+ | Experimental | | **0.7.1** | **SGLang** | 12.8 | 570.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | | | | **vLLM** | 12.9 | 575.xx+ | | | **0.7.0** | **SGLang** | 12.9 | 575.xx+ | | | | **TensorRT-LLM** | 13.0 | 580.xx+ | | | 
| **vLLM** | 12.8 | 570.xx+ | | Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version. Experimental `v1.1.0-dev.*` images follow the same CUDA matrix as `v1.0.2`. The `v1.2.0-deepseek-v4-dev.3` vLLM container is CUDA 13.0 multi-arch; the SGLang containers split by arch (CUDA 12.9 on `amd64`, CUDA 13.0 on `arm64`). Experimental CUDA 13 images are not published for all versions. Check [Release Artifacts](/dynamo/resources/release-artifacts) for availability. For detailed artifact versions and NGC links (including container images, Python wheels, Helm charts, and Rust crates), see the [Release Artifacts](/dynamo/resources/release-artifacts) page. #### CUDA Compatibility Resources For detailed information on CUDA driver compatibility, forward compatibility, and troubleshooting: - [CUDA Compatibility Overview](https://docs.nvidia.com/deploy/cuda-compatibility/) - [Why CUDA Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/why-cuda-compatibility.html) - [Minor Version Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html) - [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html) - [FAQ](https://docs.nvidia.com/deploy/cuda-compatibility/frequently-asked-questions.html) For extended driver compatibility beyond the minimum versions listed above, consider using `cuda-compat` packages on the host. See [Forward Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/forward-compatibility.html) for details. ## Hardware Compatibility | **CPU Architecture** | **Status** | | :------------------- | :----------- | | **x86_64** | Supported | | **ARM64** | Supported | Dynamo provides multi-arch container images supporting both AMD64 (x86_64) and ARM64 architectures. See [Release Artifacts](/dynamo/resources/release-artifacts) for available images. 
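The minimum-driver column in the CUDA and Driver Requirements table earlier on this page can be turned into a quick host check. The sketch below is an illustration, not a Dynamo utility: `driver_meets_minimum` and `MIN_DRIVER_BRANCH` are hypothetical names, and it assumes that an entry such as `575.xx+` means "driver branch 575 or newer", compared on the first component of the version string only.

```python
# CUDA toolkit version -> minimum driver branch, copied from the table on
# this page. Reading "575.xx+" as "branch >= 575" is an assumption.
MIN_DRIVER_BRANCH = {
    "12.8": 570,
    "12.9": 575,
    "13.0": 580,
    "13.1": 580,
}


def driver_meets_minimum(driver_version: str, cuda_toolkit: str) -> bool:
    """Check a host driver version string against the table's minimum."""
    required = MIN_DRIVER_BRANCH[cuda_toolkit]
    branch = int(driver_version.split(".")[0])
    return branch >= required


if __name__ == "__main__":
    # The version strings here are sample inputs, not recommendations.
    print(driver_meets_minimum("575.57.08", "12.9"))
    print(driver_meets_minimum("570.124.06", "13.0"))
```

On a host with the NVIDIA driver installed, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints a version string you can pass as `driver_version`. The check does not account for `cuda-compat` forward-compatibility packages, which can relax the minimum.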
### GPU Compatibility If you are using a **GPU**, the following GPU models and architectures are supported: | **GPU Architecture** | **Status** | | :----------------------------------- | :--------- | | **NVIDIA Blackwell Architecture** | Supported | | **NVIDIA Hopper Architecture** | Supported | | **NVIDIA Ada Lovelace Architecture** | Supported | | **NVIDIA Ampere Architecture** | Supported | ## Platform Architecture Compatibility **Dynamo** is compatible with the following platforms: | **Operating System** | **Version** | **Architecture** | **Status** | | :------------------- | :---------- | :--------------- | :----------- | | **Ubuntu** | 22.04 | x86_64 | Supported | | **Ubuntu** | 24.04 | x86_64 | Supported | | **Ubuntu** | 24.04 | ARM64 | Supported | | **CentOS Stream** | 9 | x86_64 | Experimental | Wheels are built using a manylinux_2_28-compatible environment and validated on CentOS Stream 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but not officially verified. ## Cloud Service Provider Compatibility ### AWS | **Host Operating System** | **Version** | **Architecture** | **Status** | | :------------------------ | :---------- | :--------------- | :--------- | | **Amazon Linux** | 2023 | x86_64 | Supported | **AL2023 TensorRT-LLM Limitation:** There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend). ## Build Support For version-specific artifact details, installation commands, and release history, see [Release Artifacts](/dynamo/resources/release-artifacts). 
**Dynamo** currently provides build support in the following ways: - **Wheels**: We distribute Python wheels of Dynamo and KV Block Manager: - [ai-dynamo](https://pypi.org/project/ai-dynamo/) - [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/) - [kvbm](https://pypi.org/project/kvbm/) as a standalone implementation. - **Dynamo Container Images**: We distribute multi-arch images (x86 & ARM64 compatible) on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo): - [Dynamo Frontend](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend) *(New in v0.8.0)* - [SGLang Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime) - [SGLang Runtime (CUDA 13)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime-cu13) - [TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime) - [TensorRT-LLM Runtime (EFA)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime) *(New in v1.0.0, Experimental, AMD64 only)* - [vLLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime) - [vLLM Runtime (CUDA 13)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime-cu13) - [vLLM Runtime (EFA)](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime) *(New in v1.0.0, Experimental, AMD64 only)* - [Kubernetes Operator](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator) - [Snapshot Agent](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/snapshot-agent) *(New in v1.0.0, Preview)* - **Helm Charts**: [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) hosts the helm charts supporting Kubernetes deployments of Dynamo: - [Dynamo 
Platform](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform) (now includes CRDs) - [Snapshot](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/snapshot) *(New in v1.0.0, Preview)* - [Dynamo CRDs](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-crds) *(Deprecated in v1.0.0, CRDs managed by Operator)* - [Dynamo Graph](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-graph) *(Deprecated in v0.9.0)* - **Rust Crates**: - [dynamo-runtime](https://crates.io/crates/dynamo-runtime/) - [dynamo-llm](https://crates.io/crates/dynamo-llm/) - [dynamo-protocols](https://crates.io/crates/dynamo-protocols/) - [dynamo-parsers](https://crates.io/crates/dynamo-parsers/) - [dynamo-config](https://crates.io/crates/dynamo-config/) *(New in v0.8.0)* - [dynamo-memory](https://crates.io/crates/dynamo-memory/) *(New in v0.8.0)* - [dynamo-tokens](https://crates.io/crates/dynamo-tokens/) *(New in v0.9.0)* - [dynamo-mocker](https://crates.io/crates/dynamo-mocker/) *(New in v1.0.0)* - [dynamo-kv-router](https://crates.io/crates/dynamo-kv-router/) *(New in v1.0.0)* Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the [Local Quick Start](https://github.com/ai-dynamo/dynamo/blob/main/README.md#local-quick-start) in the README. # Feature Matrix This document provides a comprehensive compatibility matrix for key Dynamo features across the supported backends. 
*Updated for Dynamo v1.2.0* **Legend:** * ✅ : Supported * 🚧 : Work in Progress / Experimental / Limited ## Quick Comparison | Feature | SGLang | TensorRT-LLM | vLLM | Source | | :--- | :---: | :---: | :---: | :--- | | **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] | | **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] | | **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] | | **KV Block Manager** | 🚧 | ✅ | ✅ | [KVBM Doc][kvbm] | | **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] | | **Multimodal (Video)** | ✅ | | ✅ | [Multimodal Doc][mm] | | **Multimodal (Audio)** | | | 🚧 | [Multimodal Doc][mm] | | **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] | | **Request Cancellation** | 🚧 | ✅ | ✅ | Backend READMEs | | **LoRA** | | | ✅ | [K8s Guide][lora] | | **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] | | **Speculative Decoding** | 🚧 | ✅ | ✅ | Backend READMEs | | **Dynamo Snapshot** | ✅ | | ✅ | [Snapshot Docs][snapshot] | ## 1. vLLM Backend vLLM offers the broadest feature coverage in Dynamo, with full support for disaggregated serving, KV-aware routing, KV block management, LoRA adapters, and multimodal inference including video and audio. 
*Source: [docs/backends/vllm/README.md][vllm-readme]* | Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding | | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | **Disaggregated Serving** | — | | | | | | | | | | | **KV-Aware Routing** | ✅ | — | | | | | | | | | | **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | | | **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | | | **Multimodal** | ✅ | ✅1 | — | ✅ | — | | | | | | | **Request Migration** | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | | | | **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | | | **LoRA** | ✅ | ✅2 | — | ✅ | — | ✅ | ✅ | — | | | | **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | — | ✅ | — | > **Notes:** > 1. **Multimodal + KV-Aware Routing**: Image-aware KV routing is supported in the documented vLLM paths. The default Rust frontend path supports model families handled by `llm-multimodal`; the Python chat-processor path delegates to vLLM's multimodal processor. ([Source][mm-kv-routing]) > 2. **KV-Aware LoRA Routing**: vLLM supports routing requests based on LoRA adapter affinity. > 3. **Audio Support**: vLLM supports audio models like Qwen2-Audio (experimental). ([Source][mm-vllm]) > 4. **Video Support**: vLLM supports video input with frame sampling. ([Source][mm-vllm]) > 5. **Speculative Decoding**: Eagle3 support documented. ([Source][vllm-spec]) ## 2. SGLang Backend SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration. 
*Source: [docs/backends/sglang/README.md][sglang-readme]*

| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | 🚧 | 🚧 | 🚧 | — | | | | | | |
| **Multimodal** | ✅² | ¹ | — | 🚧 | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | 🚧 | ✅ | — | | | | |
| **Request Cancellation** | 🚧³ | ✅ | ✅ | 🚧 | 🚧 | ✅ | — | | | |
| **LoRA** | | | | 🚧 | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | — | | 🚧 | — |

> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
> 2. **Multimodal Patterns**: Supports simple Aggregated **EPD**, **E/PD**, and **E/P/D** patterns. Traditional Disagg **EP/D** is not supported. ([Source][mm-sglang])
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.

## 3. TensorRT-LLM Backend

TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
*Source: [docs/backends/trtllm/README.md][trtllm-readme]*

| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Disaggregated Serving** | — | | | | | | | | | |
| **KV-Aware Routing** | ✅ | — | | | | | | | | |
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅¹ | ✅² | — | ✅ | — | | | | | |
| **Request Migration** | ✅ | ✅ | ✅ | ✅ | 🚧 | — | | | | |
| **Request Cancellation** | ✅³ | ✅³ | ✅³ | ✅³ | ✅³ | ✅³ | — | | | |
| **LoRA** | | | | | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | | ✅ | — |

> **Notes:**
> 1. **Multimodal Disaggregation**: Supports **EP/D** (Traditional) and **E/P/D** (Full Disaggregation) image flows, including image URLs and pre-computed embeddings. ([Source][mm-trtllm])
> 2. **Multimodal + KV-Aware Routing**: Image-aware KV routing is supported through the dedicated TRT-LLM MM Router Worker. It requires KV event publishing on the TRT-LLM workers. ([Source][mm-kv-routing])
> 3. **Request Cancellation**: Due to known issues, the TensorRT-LLM engine is temporarily not notified of request cancellations, meaning allocated resources for cancelled requests are not freed.
---

[vllm-readme]: ../backends/v-llm
[sglang-readme]: ../backends/sg-lang
[trtllm-readme]: ../backends/tensor-rt-llm
[disagg]: ../design-docs/disaggregated-serving
[kv-routing]: ../user-guides/kv-cache-aware-routing
[planner]: ../components/planner
[kvbm]: ../components/kvbm
[migration]: ../user-guides/fault-tolerance/request-migration
[tools]: ../user-guides/tool-calling
[mm]: ../user-guides/multimodal
[mm-vllm]: ../features/multimodal/multimodal-vllm.md
[mm-trtllm]: ../features/multimodal/multimodal-trtllm.md
[mm-sglang]: ../features/multimodal/multimodal-sglang.md
[mm-kv-routing]: ../features/multimodal/multimodal-kv-routing.md
[lora]: ../kubernetes-deployment/deploy-models/managing-models-with-dynamo-model
[vllm-spec]: ../additional-resources/speculative-decoding/speculative-decoding-with-v-llm
[trtllm-eagle]: ../additional-resources/tensor-rt-llm-details/llama-4-eagle
[snapshot]: ../kubernetes-deployment/advanced-platform/snapshot

# Release Artifacts

This document provides a comprehensive inventory of all Dynamo release artifacts including container images, Python wheels, Helm charts, and Rust crates.

> **See also:** [Support Matrix](/dynamo/resources/support-matrix) for hardware and platform compatibility | [Feature Matrix](/dynamo/resources/feature-matrix) for backend feature support

Release history in this document begins at v0.6.0.

## Current Release: Dynamo v1.2.0

- **GitHub Release:** [v1.2.0](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0)
- **Docs:** [v1.2.0](https://docs.dynamo.nvidia.com/dynamo)
- **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)

> **Experimental:** The DeepSeek-V4 preview tags remain available under [Pre-Release Artifacts](#pre-release-artifacts). Use the stable v1.2.0 artifacts below unless you specifically need the preview SGLang DeepSeek-V4 images.
### Container Images | Image:Tag | Description | Backend | CUDA | Arch | NGC | Notes | |-----------|-------------|---------|------|------|-----|-------| | `vllm-runtime:1.2.0` | Runtime container for vLLM backend | vLLM `v0.20.1` | `v12.9` | AMD64/ARM64 | [NGC: vLLM runtime 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=1.2.0) | | | `vllm-runtime:1.2.0-cuda13` | Runtime container for vLLM backend (CUDA 13) | vLLM `v0.20.1` | `v13.0` | AMD64/ARM64 | [NGC: vLLM runtime 1.2.0-cuda13](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=1.2.0-cuda13) | | | `vllm-runtime:1.2.0-efa` | Runtime container for vLLM with AWS EFA | vLLM `v0.20.1` | `v12.9` | AMD64/ARM64 | [NGC: vLLM runtime 1.2.0-efa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime?version=1.2.0-efa) | Experimental | | `sglang-runtime:1.2.0` | Runtime container for SGLang backend | SGLang `v0.5.11` | `v12.9` | AMD64/ARM64 | [NGC: SGLang runtime 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=1.2.0) | | | `sglang-runtime:1.2.0-cuda13` | Runtime container for SGLang backend (CUDA 13) | SGLang `v0.5.11` | `v13.0` | AMD64/ARM64 | [NGC: SGLang runtime 1.2.0-cuda13](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=1.2.0-cuda13) | | | `sglang-runtime:1.2.0-efa` | Runtime container for SGLang with AWS EFA | SGLang `v0.5.11` | `v12.9` | AMD64/ARM64 | [NGC: SGLang runtime 1.2.0-efa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/sglang-runtime?version=1.2.0-efa) | Experimental | | `tensorrtllm-runtime:1.2.0-cuda13` | Runtime container for TensorRT-LLM backend | TRT-LLM `v1.3.0rc14` | `v13.1` | AMD64/ARM64 | [NGC: TensorRT-LLM runtime 1.2.0-cuda13](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime?version=1.2.0-cuda13) | CUDA 13 only | | 
`tensorrtllm-runtime:1.2.0-efa` | Runtime container for TensorRT-LLM with AWS EFA | TRT-LLM `v1.3.0rc14` | `v13.1` | AMD64/ARM64 | [NGC: TensorRT-LLM runtime 1.2.0-efa](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime?version=1.2.0-efa) | Experimental | | `dynamo-frontend:1.2.0` | API gateway with Endpoint Prediction Protocol (EPP) | — | — | AMD64/ARM64 | [NGC: Dynamo frontend 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-frontend?version=1.2.0) | | | `dynamo-planner:1.2.0` | Standalone Planner image used by Profiler jobs and Planner pods | — | — | AMD64/ARM64 | [NGC: Dynamo planner 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/dynamo-planner?version=1.2.0) | | | `kubernetes-operator:1.2.0` | Kubernetes operator for Dynamo deployments | — | — | AMD64/ARM64 | [NGC: Kubernetes operator 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/kubernetes-operator?version=1.2.0) | | | `snapshot-agent:1.2.0` | Snapshot agent for fast GPU worker recovery via CRIU | — | — | AMD64/ARM64 | [NGC: Snapshot agent 1.2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/snapshot-agent?version=1.2.0) | Preview | ### Python Wheels We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtllm]` wheel. See the [NGC container collection](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) for supported images. 
| Package | Description | Python | Platform | PyPI |
|---------|-------------|--------|----------|------|
| `ai-dynamo==1.2.0.post1` | Main package with backend integrations (vLLM, SGLang, TRT-LLM) | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [PyPI: ai-dynamo 1.2.0.post1](https://pypi.org/project/ai-dynamo/1.2.0.post1/) |
| `ai-dynamo-runtime==1.2.0.post1` | Core Python bindings for Dynamo runtime | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [PyPI: ai-dynamo-runtime 1.2.0.post1](https://pypi.org/project/ai-dynamo-runtime/1.2.0.post1/) |
| `kvbm==1.2.0.post1` | KV Block Manager for disaggregated KV cache | `3.10`–`3.12` | Linux (glibc `v2.28+`) | [PyPI: kvbm 1.2.0.post1](https://pypi.org/project/kvbm/1.2.0.post1/) |

### Helm Charts

| Chart | Description | NGC |
|-------|-------------|-----|
| `dynamo-platform-1.2.0` | Platform services (etcd, NATS) and Dynamo Operator for Dynamo cluster | [NGC Helm: dynamo-platform-1.2.0](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-1.2.0.tgz) |
| `snapshot-1.2.0` | Snapshot DaemonSet for fast GPU worker recovery | [NGC Helm: snapshot-1.2.0](https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/snapshot-1.2.0.tgz) |

The `dynamo-crds` Helm chart is deprecated as of v1.0.0; CRDs are now managed by the Dynamo Operator.

The `dynamo-graph` Helm chart is deprecated as of v0.9.0.
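The two NGC links in the Helm Charts table share one shape: the chart name and version joined by a hyphen, followed by `.tgz`. As a minimal sketch, assuming that pattern also holds for other published chart versions (the table only confirms it for `1.2.0`), a small shell helper can build the tarball URL. The function name `dynamo_chart_url` is hypothetical and is not part of any Dynamo tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the NGC tarball URL for a Dynamo Helm chart.
# The URL shape is inferred from the two links in the Helm Charts table;
# it is an assumption for any chart or version not listed there.
dynamo_chart_url() {
  local chart="${1:-}" version="${2:-}"
  if [ -z "$chart" ] || [ -z "$version" ]; then
    echo "usage: dynamo_chart_url <chart> <version>" >&2
    return 1
  fi
  printf 'https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/%s-%s.tgz\n' \
    "$chart" "$version"
}

# Print the URLs for the two charts in the current release.
dynamo_chart_url dynamo-platform 1.2.0
dynamo_chart_url snapshot 1.2.0
```

The printed URLs match the two table entries. Whether fetching them requires NGC credentials is not covered here.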
### Rust Crates | Crate | Description | MSRV (Rust) | crates.io | |-------|-------------|-------------|-----------| | `dynamo-runtime@1.2.0` | Core distributed runtime library | `v1.82` | [crates.io: dynamo-runtime 1.2.0](https://crates.io/crates/dynamo-runtime/1.2.0) | | `dynamo-llm@1.2.0` | LLM inference engine | `v1.82` | [crates.io: dynamo-llm 1.2.0](https://crates.io/crates/dynamo-llm/1.2.0) | | `dynamo-protocols@1.2.0` | Async OpenAI-compatible API client | `v1.82` | [crates.io: dynamo-protocols 1.2.0](https://crates.io/crates/dynamo-protocols/1.2.0) | | `dynamo-async-openai@1.0.2` | Deprecated legacy OpenAI client; use **`dynamo-protocols`** | `v1.82` | [crates.io: dynamo-async-openai 1.0.2](https://crates.io/crates/dynamo-async-openai/1.0.2) | | `dynamo-parsers@1.2.0` | Protocol parsers (SSE, JSON streaming) | `v1.82` | [crates.io: dynamo-parsers 1.2.0](https://crates.io/crates/dynamo-parsers/1.2.0) | | `dynamo-memory@1.2.0` | Memory management utilities | `v1.82` | [crates.io: dynamo-memory 1.2.0](https://crates.io/crates/dynamo-memory/1.2.0) | | `dynamo-config@1.2.0` | Configuration management | `v1.82` | [crates.io: dynamo-config 1.2.0](https://crates.io/crates/dynamo-config/1.2.0) | | `dynamo-tokenizers@1.2.0` | Standalone tokenizer library | `v1.82` | [crates.io: dynamo-tokenizers 1.2.0](https://crates.io/crates/dynamo-tokenizers/1.2.0) | | `dynamo-tokens@1.2.0` | Tokenizer bindings for LLM inference | `v1.82` | [crates.io: dynamo-tokens 1.2.0](https://crates.io/crates/dynamo-tokens/1.2.0) | | `dynamo-mocker@1.2.0` | Inference engine simulator for benchmarking | `v1.82` | [crates.io: dynamo-mocker 1.2.0](https://crates.io/crates/dynamo-mocker/1.2.0) | | `dynamo-kv-router@1.2.0` | KV-aware request routing library | `v1.82` | [crates.io: dynamo-kv-router 1.2.0](https://crates.io/crates/dynamo-kv-router/1.2.0) | ## Quick Install Commands ### Container Images (NGC) For detailed run instructions, see the backend-specific guides: 
[vLLM](/dynamo/backends/v-llm) | [SGLang](/dynamo/backends/sg-lang) | [TensorRT-LLM](/dynamo/backends/tensor-rt-llm)

```bash
# Runtime containers
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0-cuda13

# CUDA 13 variants
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-cuda13
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-cuda13

# EFA variants (AWS, experimental)
docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-efa
docker pull nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-efa
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0-efa

# Infrastructure containers
docker pull nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.2.0
docker pull nvcr.io/nvidia/ai-dynamo/snapshot-agent:1.2.0
```

### Python Wheels (PyPI)

For detailed installation instructions, see the [Local Quick Start](https://github.com/ai-dynamo/dynamo#local-quick-start) in the README.

```bash
# Install Dynamo with a specific backend (Recommended)
uv pip install "ai-dynamo[vllm]==1.2.0.post1"
uv pip install --prerelease=allow "ai-dynamo[sglang]==1.2.0.post1"

# TensorRT-LLM requires the NVIDIA PyPI index and pip
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]==1.2.0.post1"

# Install Dynamo core only
uv pip install ai-dynamo==1.2.0.post1

# Install standalone KVBM
uv pip install kvbm==1.2.0.post1
```

### Helm Charts (NGC)

For Kubernetes deployment instructions, see the [Kubernetes Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide).
```bash
helm install dynamo-platform oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform --version 1.2.0
helm install snapshot oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/snapshot --version 1.2.0
```

### Rust Crates (crates.io)

For API documentation, see each crate on [docs.rs](https://docs.rs/). To build Dynamo from source, see [Building from Source](https://github.com/ai-dynamo/dynamo#building-from-source).

```bash
cargo add dynamo-runtime@1.2.0
cargo add dynamo-llm@1.2.0
cargo add dynamo-protocols@1.2.0
# Deprecated legacy crate name; pin only if a dependency requires it. New code should use dynamo-protocols:
# cargo add dynamo-async-openai@1.0.2
cargo add dynamo-parsers@1.2.0
cargo add dynamo-memory@1.2.0
cargo add dynamo-config@1.2.0
cargo add dynamo-tokenizers@1.2.0
cargo add dynamo-tokens@1.2.0
cargo add dynamo-mocker@1.2.0
cargo add dynamo-kv-router@1.2.0
```

**CUDA and Driver Requirements:** For detailed CUDA toolkit versions and minimum driver requirements for each container image, see the [Support Matrix](/dynamo/resources/support-matrix#cuda-and-driver-requirements).
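Before running the wheel install commands, it can help to confirm that the host meets the requirements listed in the Python Wheels table (Python `3.10`–`3.12`, glibc `v2.28+`). The following is a minimal sketch, not an official Dynamo tool: the function names `python_supported` and `glibc_supported` are hypothetical, and the script assumes a glibc-based Linux host where `python3` and `getconf GNU_LIBC_VERSION` are available.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers. The thresholds come from the Python Wheels
# table (Python 3.10-3.12, glibc 2.28 or newer); nothing here is Dynamo API.

# Succeeds if a "MAJOR.MINOR[.PATCH]" Python version is within 3.10-3.12.
python_supported() {
  local major="${1%%.*}" rest="${1#*.}"
  local minor="${rest%%.*}"
  [ "$major" = "3" ] && [ "$minor" -ge 10 ] && [ "$minor" -le 12 ]
}

# Succeeds if a "MAJOR.MINOR" glibc version is at least 2.28.
glibc_supported() {
  local major="${1%%.*}" rest="${1#*.}"
  local minor="${rest%%.*}"
  [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 28 ]; }
}

# Detect the local versions; fall back to 0.0 when a tool is missing.
py_ver="$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || true)"
libc_ver="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}' || true)"

if python_supported "${py_ver:-0.0}"; then
  echo "python ${py_ver}: ok"
else
  echo "python ${py_ver:-not found}: outside 3.10-3.12"
fi

if glibc_supported "${libc_ver:-0.0}"; then
  echo "glibc ${libc_ver}: ok"
else
  echo "glibc ${libc_ver:-not found}: older than 2.28 or not detected"
fi
```

The checks only report; they do not block installation. CUDA and driver versions are out of scope here and are covered by the Support Matrix.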
## Known Issues For a complete list of known issues, refer to the release notes for each version: - [v1.2.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0) - [v1.1.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.1) - [v1.1.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0) - [v1.0.2 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.2) - [v1.0.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.1) - [v1.0.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.0) - [v0.9.0 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.9.0) - [v0.8.1 Release Notes](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1) ### Known Artifact Issues | Version | Artifact | Issue | Status | |---------|----------|-------|--------| | v0.9.0 | `dynamo-platform-0.9.0` | Helm chart sets operator image to `0.7.1` instead of `0.9.0`. | Fixed in v0.9.0.post1 | | v0.8.1 | `vllm-runtime:0.8.1-cuda13` | Container fails to launch. | Known issue | | v0.8.1 | `sglang-runtime:0.8.1-cuda13`, `vllm-runtime:0.8.1-cuda13` | Multimodality not expected to work on ARM64. Works on AMD64. | Known limitation | | v0.8.0 | `sglang-runtime:0.8.0-cuda13` | CuDNN installation issue caused PyTorch `v2.9.1` compatibility problems with `nn.Conv3d`, resulting in performance degradation and excessive memory usage in multimodal workloads. | Fixed in v0.8.1 ([#5461](https://github.com/ai-dynamo/dynamo/pull/5461)) | --- ## Release Artifact History Each bullet is a **delta** to what ships on NGC / Helm / PyPI / crates.io: net-new crates, removed Helm charts, or image lines that **split** or **appear** on the registry. See the inventory tables above for full matrices. Stable releases first (newest first). **Pre-Release Git Tags** (`v*-dev.*`, experimental tracks) are summarized below; per-tag images and wheels are spelled out in [Pre-Release Artifacts](#pre-release-artifacts). 
For backend version pins, see the version-pins table above and the [GitHub Releases](#github-releases) table below. **Stable Releases** - **v1.2.0**: **Images:** vLLM and SGLang runtime images for CUDA 12.9 and CUDA 13.0, TensorRT-LLM runtime image for CUDA 13.1, multi-arch EFA runtime images, and refreshed `dynamo-frontend`, `kubernetes-operator`, `dynamo-planner`, and `snapshot-agent` images. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime`, and `kvbm` published as `1.2.0.post1`. **Crates:** `1.2.0` crates published, including the new `dynamo-tokenizers` crate. **Helm:** `dynamo-platform` and `snapshot` charts published at `1.2.0`. - **v1.1.1**: Patch release. Same backend versions as v1.1.0: SGLang `v0.5.10.post1` (NIXL `v1.0.1`), TRT-LLM `v1.3.0rc11` (NIXL `v0.10.1`), vLLM `v0.19.0` (NIXL `v0.10.1`). - **v1.1.0**: **Images:** Split Planner into its own `dynamo-planner` image on NGC for Profiler jobs and Planner pods; worker and runtime images no longer bundle Planner (**artifact boundary change**, not a new engine capability). **Crates:** First **`1.y.z`** publication on crates.io for **`dynamo-protocols`** (multi-protocol types; **`dynamo-async-openai`** remains deprecated with final release **`1.0.2`**). - **v1.0.2 / v1.0.1**: No artifact additions or removals versus v1.0.0. - **v1.0.0**: **Images:** `snapshot-agent`, EFA variants for vLLM and TRT-LLM (AMD64 only). **Crates:** First publish of `dynamo-mocker`, `dynamo-kv-router`. **Helm:** Added `snapshot` (preview); dropped deprecated `dynamo-crds` from the publish stream (CRDs owned by the Operator). - **v0.9.1**: No artifact additions or removals versus v0.9.0. - **v0.9.0**: **Crates:** First publish of `dynamo-tokens`. **Helm:** Dropped deprecated `dynamo-graph` from the publish stream. - **v0.8.0**: **Images:** `dynamo-frontend`, CUDA 13 variants for vLLM and SGLang. **Crates:** First publish of `dynamo-memory`, `dynamo-config`. 
**Dynamo Nightlies** - **New as of v1.1.0\*:** **`ai-dynamo`** and **`ai-dynamo-runtime`** — nightly builds from **`main`** publish wheels tagged **`*.devYYYYMMDD`**. Install with **`pip`** or **`uv`** using **`--pre`** and the same NVIDIA extra-index pattern as [Pre-Release Artifacts](#pre-release-artifacts). \* **`*.devYYYYMMDD`** versioning for nightly **`main`** wheels began **Apr 24, 2026**. **Pre-Release and Experimental Git Tags** - **v1.2.0-deepseek-v4-dev.3**: **Images:** `vllm-runtime:*-deepseek-v4-cuda13-dev.3`, `sglang-runtime:*-deepseek-v4-cuda12-dev.3`, `sglang-runtime:*-deepseek-v4-cuda13-dev.3`. **Helm / PyPI:** Not published for this tag (see [Pre-Release Artifacts](#v120-deepseek-v4-dev3)). - **v1.1.0-dev.3**: **Images:** `tensorrtllm-runtime:1.1.0-dev.3`. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime` on [pypi.nvidia.com](https://pypi.nvidia.com/) (see [below](#v110-dev3)). - **v1.1.0-dev.2**: **Images:** `sglang-runtime:1.1.0-dev.2`, `tensorrtllm-runtime:1.1.0-dev.2`. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime` on [pypi.nvidia.com](https://pypi.nvidia.com/) (see [below](#v110-dev2)). - **v1.1.0-dev.1**: **Images:** vLLM, SGLang, TRT-LLM runtime matrix (CUDA 12 / 13 and EFA variants as listed), `dynamo-frontend`, `kubernetes-operator`, `snapshot-agent`. **Wheels:** `ai-dynamo`, `ai-dynamo-runtime` on [pypi.nvidia.com](https://pypi.nvidia.com/). **Helm:** `dynamo-platform`, `snapshot` at `1.1.0-dev.1` (see [below](#v110-dev1)). **Helm-Only Patches** - **v0.9.0.post1**: Republished `dynamo-platform` Helm chart only (operator image tag correction). **Backend-Only Patch Trains** - **v0.8.1.post1 / .post2 / .post3**: Republished TRT-LLM runtime image and PyPI wheels only. ### crates.io Rust Packages These crates use repository `https://github.com/ai-dynamo/dynamo.git`. The table lists each crate’s **first non-placeholder** publication on crates.io (excluding reservation uploads named `0.0.0-prerelease.0`). Dates are from the crates.io registry index. 
| Crate | First Published Version | Date (crates.io) | |-------|-------------------------|------------------| | `dynamo-runtime` | `0.1.0` | 2025-03-18 | | `dynamo-llm` | `0.2.0` | 2025-05-01 | | `dynamo-async-openai` | `0.4.1` | 2025-08-27 | | `dynamo-parsers` | `0.5.0` | 2025-09-18 | | `dynamo-memory` | `0.8.0` | 2026-01-15 | | `dynamo-config` | `0.8.0` | 2026-01-15 | | `dynamo-tokens` | `0.9.0` | 2026-02-12 | | `dynamo-mocker` | `1.0.0` | 2026-03-13 | | `dynamo-kv-router` | `1.0.0` | 2026-03-13 | | `dynamo-protocols` | `1.1.0` | 2026-05-04 | | `dynamo-tokenizers` | `1.2.0` | 2026-06-02 | **`dynamo-async-openai`** is **deprecated**; **`1.0.2`** is its final crates.io release. Use **`dynamo-protocols`** for new dependencies ([crate](https://crates.io/crates/dynamo-protocols)). ### GitHub Releases | Version | Release Date | GitHub | Docs | Notes | |---------|--------------|--------|------|-------| | `v1.2.0` | Jun 2, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | | | `v1.2.0-deepseek-v4-dev.3` | May 9, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.3) | — | Experimental (DeepSeek-V4-Flash / V4-Pro Blackwell preview; vLLM + SGLang containers only) | | `v1.2.0-deepseek-v4-dev.2` | May 1, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.2) | — | Experimental (DeepSeek-V4-Flash / V4-Pro Blackwell preview; vLLM + SGLang containers only) | | `v1.1.1` | May 5, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.1) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | | | `v1.1.0` | May 1, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | | | `v1.1.0-dev.3` | Apr 18, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.3) | — | Pre-Release (TRT-LLM Runtime Image + Wheels; see Pre-Release Artifacts) | | `v1.1.0-dev.2` 
| Apr 9, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.2) | — | Pre-Release (SGLang + TRT-LLM Runtime Images + Wheels; see Pre-Release Artifacts) |
| `v1.1.0-dev.1` | Mar 17, 2026 | [Tag](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.1) | — | Experimental |
| `v1.0.2` | Apr 22, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.2) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.0.1` | Mar 16, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.1) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v1.0.0` | Mar 12, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v1.0.0) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v0.9.1` | Mar 4, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.9.1) | [Docs](https://docs.dynamo.nvidia.com/dynamo) | |
| `v0.9.0` | Feb 11, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.9.0) | Archived docs unavailable | |
| `v0.8.1` | Jan 23, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1) | Archived docs unavailable | |
| `v0.8.0` | Jan 15, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.0) | Archived docs unavailable | |
| `v0.7.1` | Dec 15, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.1) | Archived docs unavailable | |
| `v0.7.0` | Nov 26, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.0) | Archived docs unavailable | |
| `v0.6.1` | Nov 6, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.1) | — | |
| `v0.6.0` | Oct 28, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.0) | — | |

### Container Images

> **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> To access a specific version, append `?version=TAG` to the container URL:
>
`https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/{container}?version={tag}` #### vllm-runtime | Image:Tag | vLLM | Arch | CUDA | Notes | |-----------|------|------|------|-------| | `vllm-runtime:1.2.0` | `v0.20.1` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:1.2.0-cuda13` | `v0.20.1` | AMD64/ARM64 | `v13.0` | | | `vllm-runtime:1.2.0-efa` | `v0.20.1` | AMD64/ARM64 | `v12.9` | Experimental | | `vllm-runtime:1.1.1` | `v0.19.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:1.1.1-cuda13` | `v0.19.0` | AMD64/ARM64 | `v13.0` | | | `vllm-runtime:1.1.1-efa-amd64` | `v0.19.0` | AMD64 | `v12.9` | Experimental | | `vllm-runtime:1.1.0` | `v0.19.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:1.1.0-cuda13` | `v0.19.0` | AMD64/ARM64 | `v13.0` | | | `vllm-runtime:1.1.0-efa-amd64` | `v0.19.0` | AMD64 | `v12.9` | Experimental | | `vllm-runtime:1.0.2` | `v0.16.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:1.0.2-cuda13` | `v0.16.0` | AMD64/ARM64 | `v13.0` | | | `vllm-runtime:1.0.2-efa-amd64` | `v0.16.0` | AMD64 | `v12.9` | Experimental | | `vllm-runtime:1.0.1` | `v0.16.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:1.0.1-cuda13` | `v0.16.0` | AMD64/ARM64 | `v13.0` | | | `vllm-runtime:1.0.1-efa-amd64` | `v0.16.0` | AMD64 | `v12.9` | Experimental | | `vllm-runtime:1.0.0` | `v0.16.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:1.0.0-cuda13` | `v0.16.0` | AMD64/ARM64 | `v13.0` | | | `vllm-runtime:1.0.0-efa-amd64` | `v0.16.0` | AMD64 | `v12.9` | Experimental | | `vllm-runtime:0.9.1` | `v0.14.1` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:0.9.1-cuda13` | `v0.14.1` | AMD64/ARM64 | `v13.0` | Experimental | | `vllm-runtime:0.9.0` | `v0.14.1` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:0.9.0-cuda13` | `v0.14.1` | AMD64/ARM64 | `v13.0` | Experimental | | `vllm-runtime:0.8.1` | `v0.12.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:0.8.0` | `v0.12.0` | AMD64/ARM64 | `v12.9` | | | `vllm-runtime:0.8.0-cuda13` | `v0.12.0` | AMD64/ARM64 | `v13.0` | Experimental | | 
`vllm-runtime:0.7.0.post2` | `v0.11.2` | AMD64/ARM64 | `v12.8` | Patch | | `vllm-runtime:0.7.1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | | | `vllm-runtime:0.7.0.post1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | Patch | | `vllm-runtime:0.7.0` | `v0.11.0` | AMD64/ARM64 | `v12.8` | | | `vllm-runtime:0.6.1.post1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | Patch | | `vllm-runtime:0.6.1` | `v0.11.0` | AMD64/ARM64 | `v12.8` | | | `vllm-runtime:0.6.0` | `v0.11.0` | AMD64 | `v12.8` | | #### sglang-runtime | Image:Tag | SGLang | Arch | CUDA | Notes | |-----------|--------|------|------|-------| | `sglang-runtime:1.2.0` | `v0.5.11` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:1.2.0-cuda13` | `v0.5.11` | AMD64/ARM64 | `v13.0` | | | `sglang-runtime:1.2.0-efa` | `v0.5.11` | AMD64/ARM64 | `v12.9` | Experimental | | `sglang-runtime:1.1.1` | `v0.5.10.post1` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:1.1.1-cuda13` | `v0.5.10.post1` | AMD64/ARM64 | `v13.0` | | | `sglang-runtime:1.1.0` | `v0.5.10.post1` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:1.1.0-cuda13` | `v0.5.10.post1` | AMD64/ARM64 | `v13.0` | | | `sglang-runtime:1.0.2` | `v0.5.9` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:1.0.2-cuda13` | `v0.5.9` | AMD64/ARM64 | `v13.0` | | | `sglang-runtime:1.0.1` | `v0.5.9` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:1.0.1-cuda13` | `v0.5.9` | AMD64/ARM64 | `v13.0` | | | `sglang-runtime:1.0.0` | `v0.5.9` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:1.0.0-cuda13` | `v0.5.9` | AMD64/ARM64 | `v13.0` | | | `sglang-runtime:0.9.1` | `v0.5.8` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.9.1-cuda13` | `v0.5.8` | AMD64/ARM64 | `v13.0` | Experimental | | `sglang-runtime:0.9.0` | `v0.5.8` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.9.0-cuda13` | `v0.5.8` | AMD64/ARM64 | `v13.0` | Experimental | | `sglang-runtime:0.8.1` | `v0.5.6.post2` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.8.1-cuda13` | `v0.5.6.post2` | AMD64/ARM64 | `v13.0` | Experimental | | `sglang-runtime:0.8.0` | `v0.5.6.post2` 
| AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.8.0-cuda13` | `v0.5.6.post2` | AMD64/ARM64 | `v13.0` | Experimental | | `sglang-runtime:0.7.1` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.7.0.post1` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | Patch | | `sglang-runtime:0.7.0` | `v0.5.4.post3` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.6.1.post1` | `v0.5.3.post2` | AMD64/ARM64 | `v12.9` | Patch | | `sglang-runtime:0.6.1` | `v0.5.3.post2` | AMD64/ARM64 | `v12.9` | | | `sglang-runtime:0.6.0` | `v0.5.3.post2` | AMD64 | `v12.8` | | #### tensorrtllm-runtime | Image:Tag | TRT-LLM | Arch | CUDA | Notes | |-----------|---------|------|------|-------| | `tensorrtllm-runtime:1.2.0-cuda13` | `v1.3.0rc14` | AMD64/ARM64 | `v13.1` | | | `tensorrtllm-runtime:1.2.0-efa` | `v1.3.0rc14` | AMD64/ARM64 | `v13.1` | Experimental | | `tensorrtllm-runtime:1.1.1` | `v1.3.0rc11` | AMD64/ARM64 | `v13.1` | | | `tensorrtllm-runtime:1.1.1-efa-amd64` | `v1.3.0rc11` | AMD64 | `v13.1` | Experimental | | `tensorrtllm-runtime:1.1.0` | `v1.3.0rc11` | AMD64/ARM64 | `v13.1` | | | `tensorrtllm-runtime:1.1.0-efa-amd64` | `v1.3.0rc11` | AMD64 | `v13.1` | Experimental | | `tensorrtllm-runtime:1.0.2` | `v1.3.0rc5.post1` | AMD64/ARM64 | `v13.1` | | | `tensorrtllm-runtime:1.0.2-efa-amd64` | `v1.3.0rc5.post1` | AMD64 | `v13.1` | Experimental | | `tensorrtllm-runtime:1.0.1` | `v1.3.0rc5.post1` | AMD64/ARM64 | `v13.1` | | | `tensorrtllm-runtime:1.0.1-efa-amd64` | `v1.3.0rc5.post1` | AMD64 | `v13.1` | Experimental | | `tensorrtllm-runtime:1.0.0` | `v1.3.0rc5.post1` | AMD64/ARM64 | `v13.1` | | | `tensorrtllm-runtime:1.0.0-efa-amd64` | `v1.3.0rc5.post1` | AMD64 | `v13.1` | Experimental | | `tensorrtllm-runtime:0.9.1` | `v1.3.0rc3` | AMD64/ARM64 | `v13.0` | | | `tensorrtllm-runtime:0.9.0` | `v1.3.0rc1` | AMD64/ARM64 | `v13.0` | | | `tensorrtllm-runtime:0.8.1.post3` | `v1.2.0rc6.post3` | AMD64/ARM64 | `v13.0` | Patch | | `tensorrtllm-runtime:0.8.1.post1` | `v1.2.0rc6.post2` | AMD64/ARM64 | 
`v13.0` | Patch | | `tensorrtllm-runtime:0.8.1` | `v1.2.0rc6.post1` | AMD64/ARM64 | `v13.0` | | | `tensorrtllm-runtime:0.8.0` | `v1.2.0rc6.post1` | AMD64/ARM64 | `v13.0` | | | `tensorrtllm-runtime:0.7.0.post2` | `v1.2.0rc2` | AMD64/ARM64 | `v13.0` | Patch | | `tensorrtllm-runtime:0.7.1` | `v1.2.0rc3` | AMD64/ARM64 | `v13.0` | | | `tensorrtllm-runtime:0.7.0.post1` | `v1.2.0rc3` | AMD64/ARM64 | `v13.0` | Patch | | `tensorrtllm-runtime:0.7.0` | `v1.2.0rc2` | AMD64/ARM64 | `v13.0` | | | `tensorrtllm-runtime:0.6.1-cuda13` | `v1.2.0rc1` | AMD64/ARM64 | `v13.0` | Experimental | | `tensorrtllm-runtime:0.6.1.post1` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | Patch | | `tensorrtllm-runtime:0.6.1` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | | | `tensorrtllm-runtime:0.6.0` | `v1.1.0rc5` | AMD64/ARM64 | `v12.9` | | #### dynamo-frontend | Image:Tag | Arch | Notes | |-----------|------|-------| | `dynamo-frontend:1.2.0` | AMD64/ARM64 | | | `dynamo-frontend:1.1.1` | AMD64/ARM64 | | | `dynamo-frontend:1.1.0` | AMD64/ARM64 | | | `dynamo-frontend:1.0.2` | AMD64/ARM64 | | | `dynamo-frontend:1.0.1` | AMD64/ARM64 | | | `dynamo-frontend:1.0.0` | AMD64/ARM64 | | | `dynamo-frontend:0.9.1` | AMD64/ARM64 | | | `dynamo-frontend:0.9.0` | AMD64/ARM64 | | | `dynamo-frontend:0.8.1` | AMD64/ARM64 | | | `dynamo-frontend:0.8.0` | AMD64/ARM64 | Initial | #### kubernetes-operator | Image:Tag | Arch | Notes | |-----------|------|-------| | `kubernetes-operator:1.2.0` | AMD64/ARM64 | | | `kubernetes-operator:1.1.1` | AMD64/ARM64 | | | `kubernetes-operator:1.1.0` | AMD64/ARM64 | | | `kubernetes-operator:1.0.2` | AMD64/ARM64 | | | `kubernetes-operator:1.0.1` | AMD64/ARM64 | | | `kubernetes-operator:1.0.0` | AMD64/ARM64 | | | `kubernetes-operator:0.9.1` | AMD64/ARM64 | | | `kubernetes-operator:0.9.0` | AMD64/ARM64 | | | `kubernetes-operator:0.8.1` | AMD64/ARM64 | | | `kubernetes-operator:0.8.0` | AMD64/ARM64 | | | `kubernetes-operator:0.7.1` | AMD64/ARM64 | | | `kubernetes-operator:0.7.0.post1` | AMD64/ARM64 | 
Patch | | `kubernetes-operator:0.7.0` | AMD64/ARM64 | | | `kubernetes-operator:0.6.1` | AMD64/ARM64 | | | `kubernetes-operator:0.6.0` | AMD64/ARM64 | | #### dynamo-planner | Image:Tag | Arch | Notes | |-----------|------|-------| | `dynamo-planner:1.2.0` | AMD64/ARM64 | | | `dynamo-planner:1.1.1` | AMD64/ARM64 | | | `dynamo-planner:1.1.0` | AMD64/ARM64 | New | #### snapshot-agent | Image:Tag | Arch | Notes | |-----------|------|-------| | `snapshot-agent:1.2.0` | AMD64/ARM64 | Preview | | `snapshot-agent:1.1.1` | AMD64/ARM64 | Preview | | `snapshot-agent:1.1.0` | AMD64/ARM64 | Preview | | `snapshot-agent:1.0.2` | AMD64/ARM64 | Preview | | `snapshot-agent:1.0.1` | AMD64/ARM64 | Preview | | `snapshot-agent:1.0.0` | AMD64/ARM64 | Preview | ### Python Wheels > **PyPI:** [ai-dynamo](https://pypi.org/project/ai-dynamo/) | [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/) | [kvbm](https://pypi.org/project/kvbm/) > > To access a specific version: `https://pypi.org/project/{package}/{version}/` #### ai-dynamo (wheel) | Package | Python | Platform | Notes | |---------|--------|----------|-------| | `ai-dynamo==1.2.0.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==1.1.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==1.1.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==1.0.2` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==1.0.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==1.0.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.9.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.9.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.8.1.post3` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post3` | | `ai-dynamo==0.8.1.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post2` | | `ai-dynamo==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | 
`ai-dynamo==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.6.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo==0.6.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | #### ai-dynamo-runtime (wheel) | Package | Python | Platform | Notes | |---------|--------|----------|-------| | `ai-dynamo-runtime==1.2.0.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==1.1.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==1.1.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==1.0.2` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==1.0.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==1.0.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.9.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.9.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.8.1.post3` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post3` | | `ai-dynamo-runtime==0.8.1.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | TRT-LLM `v1.2.0rc6.post2` | | `ai-dynamo-runtime==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.6.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `ai-dynamo-runtime==0.6.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | #### kvbm (wheel) | Package | Python | Platform | Notes | |---------|--------|----------|-------| | `kvbm==1.2.0.post1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==1.1.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==1.1.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==1.0.2` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==1.0.1` | 
`3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==1.0.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==0.9.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==0.9.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==0.8.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==0.8.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==0.7.1` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | | | `kvbm==0.7.0` | `3.10`–`3.12` | Linux (glibc `v2.28+`) | Initial | ### Helm Charts > **NGC Helm Registry:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) > > Direct download: `https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{chart}-{version}.tgz` #### dynamo-crds (Helm chart) -- Deprecated The `dynamo-crds` Helm chart is deprecated as of v1.0.0. CRDs are now managed by the Dynamo Operator. | Chart | Notes | |-------|-------| | `dynamo-crds-0.9.1` | Last release | | `dynamo-crds-0.9.0` | | | `dynamo-crds-0.8.1` | | | `dynamo-crds-0.8.0` | | | `dynamo-crds-0.7.1` | | | `dynamo-crds-0.7.0` | | | `dynamo-crds-0.6.1` | | | `dynamo-crds-0.6.0` | | #### dynamo-platform (Helm chart) | Chart | Notes | |-------|-------| | `dynamo-platform-1.2.0` | | | `dynamo-platform-1.1.1` | | | `dynamo-platform-1.1.0` | | | `dynamo-platform-1.0.2` | | | `dynamo-platform-1.0.1` | | | `dynamo-platform-1.0.0` | | | `dynamo-platform-0.9.1` | | | `dynamo-platform-0.9.0-post1` | Helm fix: operator image tag | | `dynamo-platform-0.9.0` | | | `dynamo-platform-0.8.1` | | | `dynamo-platform-0.8.0` | | | `dynamo-platform-0.7.1` | | | `dynamo-platform-0.7.0` | | | `dynamo-platform-0.6.1` | | | `dynamo-platform-0.6.0` | | #### snapshot (Helm chart) | Chart | Notes | |-------|-------| | `snapshot-1.2.0` | Preview | | `snapshot-1.1.1` | Preview | | `snapshot-1.1.0` | Preview | | `snapshot-1.0.2` | Preview | | `snapshot-1.0.1` | Preview | | `snapshot-1.0.0` | Preview | #### dynamo-graph (Helm chart) -- Deprecated The `dynamo-graph` Helm chart is 
deprecated as of v0.9.0. | Chart | Notes | |-------|-------| | `dynamo-graph-0.8.1` | Last release | | `dynamo-graph-0.8.0` | | | `dynamo-graph-0.7.1` | | | `dynamo-graph-0.7.0` | | | `dynamo-graph-0.6.1` | | | `dynamo-graph-0.6.0` | | ### Rust Crates > **crates.io:** [dynamo-runtime](https://crates.io/crates/dynamo-runtime) | [dynamo-llm](https://crates.io/crates/dynamo-llm) | [dynamo-protocols](https://crates.io/crates/dynamo-protocols) | [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai) *(deprecated)* | [dynamo-parsers](https://crates.io/crates/dynamo-parsers) | [dynamo-memory](https://crates.io/crates/dynamo-memory) | [dynamo-config](https://crates.io/crates/dynamo-config) | [dynamo-tokenizers](https://crates.io/crates/dynamo-tokenizers) | [dynamo-tokens](https://crates.io/crates/dynamo-tokens) > > To access a specific version: `https://crates.io/crates/{crate}/{version}` #### dynamo-runtime (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-runtime@1.2.0` | `v1.82` | | | `dynamo-runtime@1.1.1` | `v1.82` | | | `dynamo-runtime@1.1.0` | `v1.82` | | | `dynamo-runtime@1.0.2` | `v1.82` | | | `dynamo-runtime@1.0.1` | `v1.82` | | | `dynamo-runtime@1.0.0` | `v1.82` | | | `dynamo-runtime@0.9.1` | `v1.82` | | | `dynamo-runtime@0.9.0` | `v1.82` | | | `dynamo-runtime@0.8.1` | `v1.82` | | | `dynamo-runtime@0.8.0` | `v1.82` | | | `dynamo-runtime@0.7.1` | `v1.82` | | | `dynamo-runtime@0.7.0` | `v1.82` | | | `dynamo-runtime@0.6.1` | `v1.82` | | | `dynamo-runtime@0.6.0` | `v1.82` | | #### dynamo-llm (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-llm@1.2.0` | `v1.82` | | | `dynamo-llm@1.1.1` | `v1.82` | | | `dynamo-llm@1.1.0` | `v1.82` | | | `dynamo-llm@1.0.2` | `v1.82` | | | `dynamo-llm@1.0.1` | `v1.82` | | | `dynamo-llm@1.0.0` | `v1.82` | | | `dynamo-llm@0.9.1` | `v1.82` | | | `dynamo-llm@0.9.0` | `v1.82` | | | `dynamo-llm@0.8.1` | `v1.82` | | | `dynamo-llm@0.8.0` | `v1.82` | | | 
`dynamo-llm@0.7.1` | `v1.82` | | | `dynamo-llm@0.7.0` | `v1.82` | | | `dynamo-llm@0.6.1` | `v1.82` | | | `dynamo-llm@0.6.0` | `v1.82` | | #### dynamo-protocols (crate) On crates.io, **`dynamo-protocols`** lists **`1.1.0`** as its first installable release (placeholder reservation **`0.0.0-prerelease.0`** omitted here like other **`0.0.0-prerelease.*`** uploads). Earlier semver lines for the OpenAI-compatible client shipped under **`dynamo-async-openai`** — see **`#### dynamo-async-openai (crate)`** below. | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-protocols@1.2.0` | `v1.82` | | | `dynamo-protocols@1.1.1` | `v1.82` | | | `dynamo-protocols@1.1.0` | `v1.82` | | #### dynamo-async-openai (crate) **Deprecated.** Prefer **`dynamo-protocols`**. This crate remains published on crates.io for manifests pinned to the old package name. | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-async-openai@1.0.2` | `v1.82` | Final crates.io release | | `dynamo-async-openai@1.0.1` | `v1.82` | | | `dynamo-async-openai@1.0.0` | `v1.82` | | | `dynamo-async-openai@0.9.1` | `v1.82` | | | `dynamo-async-openai@0.9.0` | `v1.82` | | | `dynamo-async-openai@0.8.1` | `v1.82` | | | `dynamo-async-openai@0.8.0` | `v1.82` | | | `dynamo-async-openai@0.7.1` | `v1.82` | | | `dynamo-async-openai@0.7.0` | `v1.82` | | | `dynamo-async-openai@0.7.0-post1` | `v1.82` | | | `dynamo-async-openai@0.6.1` | `v1.82` | | | `dynamo-async-openai@0.6.0` | `v1.82` | | | `dynamo-async-openai@0.5.1` | `v1.82` | | | `dynamo-async-openai@0.5.0` | `v1.82` | | | `dynamo-async-openai@0.4.1` | `v1.82` | | #### dynamo-parsers (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-parsers@1.2.0` | `v1.82` | | | `dynamo-parsers@1.1.1` | `v1.82` | | | `dynamo-parsers@1.1.0` | `v1.82` | | | `dynamo-parsers@1.0.2` | `v1.82` | | | `dynamo-parsers@1.0.1` | `v1.82` | | | `dynamo-parsers@1.0.0` | `v1.82` | | | `dynamo-parsers@0.9.1` | `v1.82` | | | 
`dynamo-parsers@0.9.0` | `v1.82` | | | `dynamo-parsers@0.8.1` | `v1.82` | | | `dynamo-parsers@0.8.0` | `v1.82` | | | `dynamo-parsers@0.7.1` | `v1.82` | | | `dynamo-parsers@0.7.0` | `v1.82` | | | `dynamo-parsers@0.6.1` | `v1.82` | | | `dynamo-parsers@0.6.0` | `v1.82` | | #### dynamo-memory (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-memory@1.2.0` | `v1.82` | | | `dynamo-memory@1.1.1` | `v1.82` | | | `dynamo-memory@1.1.0` | `v1.82` | | | `dynamo-memory@1.0.2` | `v1.82` | | | `dynamo-memory@1.0.1` | `v1.82` | | | `dynamo-memory@1.0.0` | `v1.82` | | | `dynamo-memory@0.9.1` | `v1.82` | | | `dynamo-memory@0.9.0` | `v1.82` | | | `dynamo-memory@0.8.1` | `v1.82` | | | `dynamo-memory@0.8.0` | `v1.82` | Initial | #### dynamo-config (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-config@1.2.0` | `v1.82` | | | `dynamo-config@1.1.1` | `v1.82` | | | `dynamo-config@1.1.0` | `v1.82` | | | `dynamo-config@1.0.2` | `v1.82` | | | `dynamo-config@1.0.1` | `v1.82` | | | `dynamo-config@1.0.0` | `v1.82` | | | `dynamo-config@0.9.1` | `v1.82` | | | `dynamo-config@0.9.0` | `v1.82` | | | `dynamo-config@0.8.1` | `v1.82` | | | `dynamo-config@0.8.0` | `v1.82` | Initial | #### dynamo-tokenizers (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-tokenizers@1.2.0` | `v1.82` | Initial | #### dynamo-tokens (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-tokens@1.2.0` | `v1.82` | | | `dynamo-tokens@1.1.1` | `v1.82` | | | `dynamo-tokens@1.1.0` | `v1.82` | | | `dynamo-tokens@1.0.2` | `v1.82` | | | `dynamo-tokens@1.0.1` | `v1.82` | | | `dynamo-tokens@1.0.0` | `v1.82` | | | `dynamo-tokens@0.9.1` | `v1.82` | | | `dynamo-tokens@0.9.0` | `v1.82` | Initial | #### dynamo-mocker (crate) | Crate | MSRV (Rust) | Notes | |-------|-------------|-------| | `dynamo-mocker@1.2.0` | `v1.82` | | | `dynamo-mocker@1.1.1` | `v1.82` | | | `dynamo-mocker@1.1.0` | `v1.82` | | | 
`dynamo-mocker@1.0.2` | `v1.82` | |
| `dynamo-mocker@1.0.1` | `v1.82` | |
| `dynamo-mocker@1.0.0` | `v1.82` | Initial |

#### dynamo-kv-router (crate)

| Crate | MSRV (Rust) | Notes |
|-------|-------------|-------|
| `dynamo-kv-router@1.2.0` | `v1.82` | |
| `dynamo-kv-router@1.1.1` | `v1.82` | |
| `dynamo-kv-router@1.1.0` | `v1.82` | |
| `dynamo-kv-router@1.0.2` | `v1.82` | |
| `dynamo-kv-router@1.0.1` | `v1.82` | |
| `dynamo-kv-router@1.0.0` | `v1.82` | Initial |

---

## Pre-Release Artifacts

**Pre-Release artifacts do not go through QA validation.** Pre-release versions are experimental previews intended for early testing and feedback. They may contain bugs, breaking changes, or incomplete features. Use stable releases for production workloads.

**Pre-Release Python Wheels** are published on the NVIDIA package index at [pypi.nvidia.com](https://pypi.nvidia.com/), not on the public [PyPI](https://pypi.org/) index. Like stable wheels, they are **Linux (manylinux) builds** for the Python versions in the [Support Matrix](/dynamo/resources/support-matrix); `pip`/`uv` on macOS or Windows will not find matching wheels. Install on a supported Linux host or inside a Linux container.

Install by adding that URL as an extra index and allowing pre-releases (PEP 440 dev versions):

```bash
# uv (recommended in other Dynamo docs)
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev2

# pip
pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo==1.1.0.dev2
```

A GitHub or container tag `v1.1.0-dev.N` maps to a wheel version `1.1.0.devN` (for example `v1.1.0-dev.2` → `==1.1.0.dev2`). Optional extras such as `ai-dynamo[vllm]` use the same flags; pin the version you want from the sections below.
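The tag-to-wheel mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not a Dynamo utility: the names `tag_to_wheel_version` and `pip_requirement` are hypothetical, and the sketch assumes only the plain `vX.Y.Z-dev.N` tag shape. Model-specific tags such as `v1.2.0-deepseek-v4-dev.3` are rejected, which matches the fact that no wheels are published for them.

```python
import re

# Hypothetical helper, not shipped with Dynamo: accepts only plain dev tags
# of the form "vX.Y.Z-dev.N", the shape this page maps to a wheel version.
_DEV_TAG = re.compile(r"^v(\d+\.\d+\.\d+)-dev\.(\d+)$")


def tag_to_wheel_version(tag: str) -> str:
    """Convert a tag like 'v1.1.0-dev.2' to the PEP 440 version '1.1.0.dev2'."""
    match = _DEV_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a plain dev tag: {tag!r}")
    base, dev_number = match.groups()
    return f"{base}.dev{dev_number}"


def pip_requirement(package: str, tag: str) -> str:
    """Build a pinned requirement string such as 'ai-dynamo==1.1.0.dev2'."""
    return f"{package}=={tag_to_wheel_version(tag)}"


if __name__ == "__main__":
    print(pip_requirement("ai-dynamo", "v1.1.0-dev.2"))  # ai-dynamo==1.1.0.dev2
```

The resulting string can be passed to the `uv pip install --pre --extra-index-url https://pypi.nvidia.com/` command shown above.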
### v1.2.0-deepseek-v4-dev.3

- **Branch:** [release/1.2.0-deepseek-v4-dev.3](https://github.com/ai-dynamo/dynamo/tree/release/1.2.0-deepseek-v4-dev.3)
- **GitHub Tag:** [v1.2.0-deepseek-v4-dev.3](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.3)
- **Backends:** vLLM `v0.20.1` (DSv4 stabilization patch over `v0.20.0` native DSv4 support) | SGLang upstream `lmsysorg/sglang:deepseek-v4-blackwell` preview (refreshed for dev.3) | NIXL `v0.10.1`
- **Coverage:** Partial -- DeepSeek-V4-Flash and V4-Pro only. vLLM and SGLang containers are published for Blackwell (B200 plus GB200); no TensorRT-LLM container, no other component containers, no Helm charts, no wheels. Snapshot dev build for early-access V4 model support; not QA-gated.

#### Container Images

| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.3` | vLLM `v0.20.1` | `v13.0` | AMD64/ARM64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.3` | SGLang upstream DSv4 preview | `v12.9` | AMD64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.3` | SGLang upstream DSv4 preview | `v13.0` | ARM64 |

#### Python Wheels

Not published for this dev release. Use the `v1.1.1` wheels or `v1.1.0-dev.3` from [pypi.nvidia.com](https://pypi.nvidia.com/).

#### Helm Charts

Not published for this dev release. Use `v1.1.1` charts for platform install.

#### Rust Crates

Not shipped for pre-release versions.

### v1.2.0-deepseek-v4-dev.2

- **Branch:** [release/1.2.0-deepseek-v4-dev.2](https://github.com/ai-dynamo/dynamo/tree/release/1.2.0-deepseek-v4-dev.2)
- **GitHub Tag:** [v1.2.0-deepseek-v4-dev.2](https://github.com/ai-dynamo/dynamo/releases/tag/v1.2.0-deepseek-v4-dev.2)
- **Backends:** vLLM `v0.20.0` (native DeepSeek-V4 support) | SGLang upstream `lmsysorg/sglang:deepseek-v4-blackwell` preview | NIXL `v0.10.1`
- **Coverage:** DeepSeek-V4-Flash and V4-Pro only. vLLM and SGLang containers are published for Blackwell. TensorRT-LLM container, other component containers, Helm charts, and wheels are not published for this tag. Snapshot dev build for early-access V4 model support; not QA-gated.

#### Container Images

| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2` | vLLM `v0.20.0` | `v13.0` | AMD64/ARM64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2` | SGLang upstream DSv4 preview | `v12.9` | AMD64 |
| `sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2` | SGLang upstream DSv4 preview | `v13.0` | ARM64 |

#### Python Wheels

Not published for this dev release. Use the `v1.1.0` wheels or `v1.1.0-dev.3` from [pypi.nvidia.com](https://pypi.nvidia.com/).

#### Helm Charts

Not published for this dev release. Use `v1.1.0` charts for platform install.

#### Rust Crates

Not shipped for pre-release versions.

### v1.1.0-dev.3

- **Branch:** [release/1.1.0-dev.3](https://github.com/ai-dynamo/dynamo/tree/release/1.1.0-dev.3)
- **GitHub Tag:** [v1.1.0-dev.3](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.3)
- **Backends (branch ToT):** SGLang `v0.5.10.post1` | TensorRT-LLM `v1.3.0rc11` | vLLM `v0.19.0` | NIXL `v0.10.1`
- **Coverage:** TensorRT-LLM runtime container plus **`ai-dynamo`** and **`ai-dynamo-runtime`** wheels on [pypi.nvidia.com](https://pypi.nvidia.com/). SGLang and vLLM containers, component containers (`dynamo-frontend`, `dynamo-planner`, `kubernetes-operator`, `snapshot-agent`), **`kvbm`** wheel, and Helm charts are not published for this tag.
#### Container Images

| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `tensorrtllm-runtime:1.1.0-dev.3` | TRT-LLM `v1.3.0rc11` | `v13.1` | AMD64/ARM64 |

#### Python Wheels

Available from [pypi.nvidia.com](https://pypi.nvidia.com/) (pre-release index):

```bash
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev3
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo-runtime==1.1.0.dev3
```

`kvbm==1.1.0.dev3` is not yet published.

#### Helm Charts

Not published for this dev release. Use the latest stable (`v1.1.0`) for platform install.

#### Rust Crates

Not shipped for pre-release versions.

### v1.1.0-dev.2

- **Branch:** [release/1.1.0-dev.2](https://github.com/ai-dynamo/dynamo/tree/release/1.1.0-dev.2)
- **GitHub Tag:** [v1.1.0-dev.2](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.2)
- **Backends (branch ToT):** SGLang `v0.5.9` | TensorRT-LLM `v1.3.0rc9` | vLLM `v0.19.0` | NIXL `v0.10.1`
- **Coverage:** SGLang and TensorRT-LLM runtime containers plus **`ai-dynamo`** and **`ai-dynamo-runtime`** wheels on [pypi.nvidia.com](https://pypi.nvidia.com/). vLLM runtime container, component containers (`dynamo-frontend`, `dynamo-planner`, `kubernetes-operator`, `snapshot-agent`), **`kvbm`** wheel, and Helm charts are not published for this tag.

#### Container Images

| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `sglang-runtime:1.1.0-dev.2` | SGLang `v0.5.9` | `v12.9` | AMD64/ARM64 |
| `tensorrtllm-runtime:1.1.0-dev.2` | TRT-LLM `v1.3.0rc9` | `v13.1` | AMD64/ARM64 |

#### Python Wheels

Available from [pypi.nvidia.com](https://pypi.nvidia.com/) (pre-release index):

```bash
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev2
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo-runtime==1.1.0.dev2
```

#### Helm Charts

Not published for this dev release. Use the latest stable (`v1.1.0`) for platform install.

#### Rust Crates

Not shipped for pre-release versions.

### v1.1.0-dev.1

- **Branch:** [release/1.1.0-dev.1](https://github.com/ai-dynamo/dynamo/tree/release/1.1.0-dev.1)
- **GitHub Tag:** [v1.1.0-dev.1](https://github.com/ai-dynamo/dynamo/releases/tag/v1.1.0-dev.1)
- **Backends:** SGLang `v0.5.9` | TensorRT-LLM `v1.3.0rc5.post1` | vLLM `v0.17.1` | NIXL `v0.10.1`

#### Container Images

| Image:Tag | Backend | CUDA | Arch |
|-----------|---------|------|------|
| `vllm-runtime:1.1.0-dev.1` | vLLM `v0.17.1` | `v12.9` | AMD64/ARM64 |
| `vllm-runtime:1.1.0-dev.1-cuda13` | vLLM `v0.17.1` | `v13.0` | AMD64/ARM64 |
| `vllm-runtime:1.1.0-dev.1-efa-amd64` | vLLM `v0.17.1` | `v12.9` | AMD64 |
| `sglang-runtime:1.1.0-dev.1` | SGLang `v0.5.9` | `v12.9` | AMD64/ARM64 |
| `sglang-runtime:1.1.0-dev.1-cuda13` | SGLang `v0.5.9` | `v13.0` | AMD64/ARM64 |
| `tensorrtllm-runtime:1.1.0-dev.1` | TRT-LLM `v1.3.0rc5.post1` | `v13.1` | AMD64/ARM64 |
| `tensorrtllm-runtime:1.1.0-dev.1-efa-amd64` | TRT-LLM `v1.3.0rc5.post1` | `v13.1` | AMD64 |
| `dynamo-frontend:1.1.0-dev.1` | — | — | AMD64/ARM64 |
| `kubernetes-operator:1.1.0-dev.1` | — | — | AMD64/ARM64 |
| `snapshot-agent:1.1.0-dev.1` | — | — | AMD64/ARM64 |

#### Python Wheels

Available from [pypi.nvidia.com](https://pypi.nvidia.com/) (pre-release index):

```bash
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo==1.1.0.dev1
uv pip install --pre --extra-index-url https://pypi.nvidia.com/ ai-dynamo-runtime==1.1.0.dev1
```

#### Helm Charts

| Chart | NGC |
|-------|-----|
| `dynamo-platform-1.1.0-dev.1` | [NGC Helm: dynamo-platform 1.1.0-dev.1](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/dynamo-platform?version=1.1.0-dev.1) |
| `snapshot-1.1.0-dev.1` | [NGC Helm: snapshot 1.1.0-dev.1](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/helm-charts/snapshot?version=1.1.0-dev.1) |

#### Rust Crates

Not shipped for pre-release versions.
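For stable releases, the per-version URL patterns quoted in the Python Wheels, Helm Charts, and Rust Crates sections of this page can be filled in programmatically. The sketch below only formats those three templates; `URL_TEMPLATES` and `artifact_url` are illustrative names, not part of any Dynamo tooling, and the helper does not check that a given version was actually published.

```python
# URL patterns copied from the stable Python Wheels, Helm Charts, and
# Rust Crates sections; the dictionary keys are illustrative labels.
URL_TEMPLATES = {
    "wheel": "https://pypi.org/project/{name}/{version}/",
    "helm": "https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{name}-{version}.tgz",
    "crate": "https://crates.io/crates/{name}/{version}",
}


def artifact_url(kind: str, name: str, version: str) -> str:
    """Fill in the documented URL pattern for one stable artifact."""
    try:
        template = URL_TEMPLATES[kind]
    except KeyError:
        raise ValueError(f"unknown artifact kind: {kind!r}") from None
    return template.format(name=name, version=version)


if __name__ == "__main__":
    print(artifact_url("helm", "dynamo-platform", "1.2.0"))
    print(artifact_url("crate", "dynamo-runtime", "1.2.0"))
    print(artifact_url("wheel", "ai-dynamo", "1.1.1"))
```

Pre-release wheels are hosted on pypi.nvidia.com rather than PyPI, so the `wheel` pattern applies to stable versions only.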
# Examples

The examples below assume you build the latest image yourself from source. If you use a prebuilt image, follow the examples from the corresponding branch.

## Hello World

Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph.

[View Hello World Example](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/custom_backend/hello_world)

## vLLM

Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with vLLM.

[View vLLM Backend Guide](/dynamo/backends/v-llm)

## SGLang

Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.

[View SGLang Backend Guide](/dynamo/backends/sg-lang)

## TensorRT-LLM

Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.

[View TensorRT-LLM Backend Guide](/dynamo/backends/tensor-rt-llm)

# Glossary

## B

**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.

## C

**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).

**Conditional Disaggregation** - Dynamo's intelligent decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine based on prefill length and queue status.

## D

**Decode Phase** - The second phase of LLM inference that generates output tokens one at a time.

**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.
**Discovery Plane** - The service discovery layer where components (frontend, router, and workers) register services, discover services, and watch for new service lifecycle events at runtime using Kubernetes or etcd backends.

**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.

**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.

**Dynamo Kubernetes Platform** - A Kubernetes platform providing a managed deployment experience for Dynamo inference graphs.

## E

**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.

**Event Plane** - The pub/sub layer for KV cache updates, worker metrics, and sequence tracking; it supports KV-aware routing and disaggregated serving architectures.

## F

**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.

## G

**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.

## I

**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.

**Inter-Token Latency (ITL)** - The latency between consecutive output tokens during the decode phase; typically paired with TTFT to define performance SLAs.

## K

**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.
**KV Cache** - Key-Value cache that stores computed attention states from previous tokens to avoid recomputation during inference.

**KV Router** - Dynamo's intelligent routing system that directs requests to workers with the highest cache overlap to maximize KV cache reuse. It determines routing based on KV cache hit rates and worker metrics.

**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers using a prefix tree structure to calculate cache hit rates.

**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.

**KV Transfer Policy** - Kubernetes DGD policy under `spec.experimental.kvTransferPolicy` that tells Dynamo which worker topology domain to use for disaggregated prefill-to-decode KV-cache transfer routing.

## L

**LoRA (Low-Rank Adaptation)** - A fine-tuning technique for serving specialized model variants without duplicating full model weights. Dynamo supports dynamic loading and serving of LoRA adapters at runtime using worker APIs (for example, to load or unload adapters, or for discovery in `/v1/models`).

## M

**Model Deployment Card (MDC)** - A configuration structure containing all information required for distributed model serving. When a worker loads a model, it creates an MDC containing references to components such as the tokenizer, templates, and runtime config. Workers publish their MDC to make the model discoverable to frontends. Frontends use the MDC to configure request preprocessing (tokenization, prompt formatting).

## N

**Namespace** - Dynamo's logical grouping mechanism for related components. Similar to directories in a file system, namespaces prevent collisions between different deployments.

**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.
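The Block, KV Router, and KVIndexer entries together describe routing by cache overlap measured in fixed-size blocks. The sketch below illustrates that idea under stated assumptions: 16-token blocks, chained SHA-256 block hashes, and a plain set of hashes per worker. The names `block_hashes`, `overlap`, and `pick_worker` are hypothetical, and this is not Dynamo's implementation, which keeps a prefix tree and also weighs worker metrics.

```python
import hashlib

BLOCK_SIZE = 16  # assumption; the Block entry says 16 or 64 tokens is typical


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block chained with its parent hash, so an equal hash
    at position i implies an equal token prefix up to that block."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes  # a trailing partial block is ignored


def overlap(request: list[str], cached: set[str]) -> int:
    """Count the leading request blocks that a worker already has cached."""
    count = 0
    for block_hash in request:
        if block_hash not in cached:
            break
        count += 1
    return count


def pick_worker(tokens: list[int], worker_caches: dict[str, set[str]]) -> str:
    """Choose the worker with the longest cached prefix (ties: lowest name)."""
    request = block_hashes(tokens)
    return max(sorted(worker_caches), key=lambda w: overlap(request, worker_caches[w]))


if __name__ == "__main__":
    prompt = list(range(40))  # two full blocks plus a partial one
    cached = block_hashes(prompt)
    caches = {"worker-a": set(cached[:1]), "worker-b": set(cached), "worker-c": set()}
    print(pick_worker(prompt, caches))  # worker-b
```

Chaining each hash with its parent is what lets a flat set lookup stand in for a prefix tree here; a real indexer would also process the stored and removed events that the KVPublisher entry describes.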
## P

**PagedAttention** - Memory management technique from vLLM that efficiently manages KV cache by chunking requests into blocks.

**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.

**Prefill Phase** - The first phase of LLM inference that processes the input prompt and generates KV cache.

**Prefix Caching** - Optimization technique that reuses previously computed KV cache for common prompt prefixes.

**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.

**Profiler** - Dynamo component that analyzes model performance to determine optimal engine configurations, including disagg/agg, parallelization mapping (TP, TEP, DEP), and other engine knobs (batch size, max num tokens), feeding the Planner for SLA-driven autoscaling.

## R

**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.

**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.

**Request Plane** - The transport layer that transmits RPCs between components (frontend-to-worker or router-to-router) utilizing one of these protocols: TCP, HTTP, or NATS.

## S

**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.

**Speculative Decoding** - An optimization where a draft model proposes tokens for parallel verification by the main model; reduces latency (for example, vLLM with Eagle).

## T

**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.

**TensorRT-LLM** - NVIDIA's optimized LLM inference engine with multinode MPI distributed support.

**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.
**Topology-Aware KV Transfer** - Dynamo routing behavior that constrains or biases decode worker selection toward workers sharing the selected prefill worker's topology domain.

**Topology Domain** - A logical level in the cluster topology, such as `zone` or `rack`. For topology-aware KV transfer, workers publish domain values in `ModelRuntimeConfig.topology_domains`.

**Topology Taint** - A canonical worker taint generated from topology metadata in the form `dynamo.topology/=`. The router uses these taints through normal `RoutingConstraints`.

## V

**vLLM** - High-throughput LLM serving engine with distributed tensor/pipeline parallelism and PagedAttention.

## W

**Wide Expert Parallelism (WideEP)** - Mixture-of-Experts deployment strategy that spreads experts across many GPUs (e.g., 64-way EP) so each GPU hosts only a few experts.

## X

**xPyD (x Prefill y Decode)** - Dynamo notation describing disaggregated serving configurations where x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.

# Dynamo Digest

> Technical deep dives, announcements, and updates from the Dynamo team.

- How Dynamo checkpoints warm inference workers and restores them quickly on Kubernetes, with a path toward sub-five-second startup for large models.
- A short pointer to the DynoSim deep dive on fast, workload-driven simulation for finding Dynamo deployment Pareto frontiers.
- A short note on TokenSpeed's launch, its kernel and scheduler work, and Dynamo's day-0 integration.
- Lessons from running Claude Code, Codex, and OpenClaw against Dynamo: prompt stability, reasoning fidelity, and streaming tool dispatch.
- How Dynamo optimizes for agentic workloads at three layers: the frontend API, the router, and KV cache management.
- How Dynamo's concurrent global index evolved through six iterations to sustain over 100 million operations per second.
# NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes

> NVIDIA Dynamo Snapshot combines CUDA and host checkpointing to restore warm inference workers quickly on Kubernetes.

![Kubernetes checkpoint and restore lifecycle with NVIDIA Dynamo Snapshot.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/d4167d87ad413eab02b80d20dec1ac3a3b2382ba6e37695f4430b8e65c735b69/digest/snapshot/dynamo-snapshot-lifecycle.webp)

Cold-starting inference replicas on Kubernetes can take minutes while engines load weights, warm kernels, and compile graphs.

In our blog post, [NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes](https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/), we introduce Dynamo Snapshot, a checkpoint/restore approach that combines `cuda-checkpoint`, CRIU, and a privileged `snapshot-agent` DaemonSet to restore warm workers from shared storage. We also walk through KV cache unmapping, CRIU restore optimizations, and GPU Memory Service (GMS), which bring the `gpt-oss-120b` prototype below five seconds and reduce startup time by 21x.

# DynoSim: Simulating the Pareto Frontier

> DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack for mapping Pareto frontiers before real-cluster validation.

![DynoSim Pareto frontier plot showing explored configurations and GPU-verified configurations.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/bc39695216ca404d8a770950d9d0c60a7608f4e4dba46dfe62935bb321dca658/digest/dynosim/dynosim-hero.png)

DynoSim is a workload-driven discrete-event simulation of NVIDIA Dynamo: a Dynamo twin for exploring LLM serving behavior before running full deployments. It brings measured engine forward-pass timing, Mocker scheduler cores, Router and Planner behavior, KV cache effects, and workload traces onto one virtual timeline.
In our blog post, [DynoSim: Simulating the Pareto Frontier](https://developer.nvidia.com/blog/dynosim-simulating-the-pareto-frontier/), we show how simulation becomes the inner loop for design exploration: sweep broadly, map the throughput-latency Pareto frontier, shortlist the most promising candidates, and verify them on real clusters. # Dynamo Day 0 support for TokenSpeed > Dynamo adds day-0 TokenSpeed support with the Dynamo frontend for Kimi K2.5. [TokenSpeed](https://lightseek.org/blog/lightseek-tokenspeed.html) ([GitHub](https://github.com/lightseekorg/tokenspeed)) launched today as LightSeek's new inference engine for agentic workloads. The initial repo is a preview, with more model coverage and runtime features landing over the next few weeks. Two pieces are worth calling out. First, TokenSpeed includes new MLA kernel work for long-context Kimi-style workloads on Blackwell. Second, TokenSpeed has a native C++ scheduler in `tokenspeed-scheduler/` that models request flow and cache operations as explicit state machines, while Python remains the runtime and integration layer. Dynamo now has day-0 support for running TokenSpeed as a Dynamo backend through `python -m dynamo.tokenspeed`. The Dynamo frontend remains the user-facing OpenAI-compatible API entrypoint and handles request routing, streaming responses, and cancellation. See the [Kimi K2.5 TokenSpeed recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes/kimi-k2.5/tokenspeed/agg/nvidia) for the current Dynamo launch recipe. Things are moving quickly. Upstream TokenSpeed calls out ongoing work on model coverage, P/D, EPLB, KV store, Mamba cache, VLM, metrics, Hopper optimization, and related runtime features. 
> Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in Dynamo # Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in Dynamo An agentic exchange must preserve a structured interaction: assistant turns interleave reasoning with one or more tool calls, and subsequent user turns return the corresponding tool results to the model context. Reasoning replay is model- and turn-dependent: some reasoning should be retained, while some should be dropped. The inference engine is responsible for this more expressive interaction and for producing correctly segmented API results. Tool-call parsing and reasoning parsing need to happen before the attached harness consumes the response. High-value agentic workflows such as coding also depend on a responsive harness experience: reasoning segments, tool-call events, and request metadata need to stream back as the turn unfolds instead of arriving after a final text response. This post covers lessons from running real agentic clients against Dynamo: how we hardened parser and API coverage and how those parser layers became standalone reusable crates. These changes build on the performance considerations outlined in our [first post](/dynamo/digest/agentic-inference), which focused on the serving architecture underneath agentic inference: the frontend, router, and KV cache management. This follow-up focuses on correctness, user-experience equivalence, and performance. Agentic harnesses are still evolving quickly. Claude Code, Codex, and OpenClaw expose the same pressure points from different API surfaces, so the examples below focus on the core behaviors that custom serving stacks need to reproduce. ![Standard server vs Dynamo across two turns of an agent loop. Each turn crosses the network at two edges: harness-to-server (where prompt stability decides cache reuse) and server-to-harness (where streaming tool dispatch decides when the harness can act). 
Dynamo changes both edges.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/aec4a8f1575fe673a0bf31ed9984aeaf9001a9a3e31614b2411bb54893c0aa40/digest/agentic-inference/images/fig-1-agent-loop.svg) ## Harness-Facing Dynamo Settings Our experiments used the newly released `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` model, though the same issues apply across models, reasoning parsers, and tool-call parsers. To reproduce our results, configure the frontend with the Anthropic-compatible API and the flags that preserve prompt, reasoning, and tool state: - `--enable-anthropic-api` exposes the Anthropic Messages API to harnesses. Many harnesses can fall back to the default Messages API, but the experience is degraded. - `--strip-anthropic-preamble` removes the Anthropic billing header that can destabilize KV reuse. - `--enable-streaming-tool-dispatch` lets complete tool calls start executing as soon as they are decoded, rather than waiting for the end of the turn. Putting all of this together: ```bash python -m dynamo.frontend \ --http-port 8000 \ --enable-anthropic-api \ --strip-anthropic-preamble \ --enable-streaming-tool-dispatch ``` On the worker side, the important settings in this deployment are: - `--dyn-tool-call-parser ` and `--dyn-reasoning-parser ` reconstruct tool calls and reasoning blocks in the model-specific format the harness expects. Those parsers also control whether reasoning from previous turns should be retained, transformed, or dropped. ## Prompt Stability Is Key for Cache Reuse Claude Code sends thousands of tokens of reusable prompt scaffolding, much of which is designed to be the same across different users and sessions. However, at the very front of each prompt is a session-specific billing header which causes cache misses when pointed at custom endpoints that do not strip it out: ```text x-anthropic-billing-header: cc_version=0.2.93; cch=abc123def456==; You are Claude Code, an interactive CLI tool... 
``` These headers poison the KV cache and prevent it from being reused, even across sessions by the same user. A varying line at position zero means every new session starts from a different token prefix, so the stable instructions and tool definitions behind it never line up cleanly for reuse. To restore KV-cache reuse, Dynamo added `--strip-anthropic-preamble`. The fix is mechanically small and operationally important: remove the unstable billing header before tokenization so that the stable prompt starts at token zero. The measured impact was large. On a Dynamo B200 deployment with a 52K-token prompt, a stable prefix landed at `168ms` TTFT. Keeping a varying per-session header in the prefix pushed that to `912ms`. Removing the billing header before tokenization brought it back to `169ms`. On this workload, the unstable header costs `744ms` per request and turns a reusable system prompt into a cold prefill. That is about a `5x` reduction in TTFT for new users hitting the same deployment or for the same user opening a new session. ![TTFT across three prompt-prefix conditions on a 52K-token prompt: a stable prefix lands at 168 ms, the stripped Anthropic preamble at 169 ms, while a varying per-session billing header pushes TTFT to 912 ms — a single token at position zero is the difference between hot KV reuse and a full cold prefill.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/0dd4753c86133d5897c377be044d639ab3c954dc19be66bd8c2ff3b6a213786d/digest/agentic-inference/images/fig-2-ttft-prefix-stability.svg) ## The Nuances of Reasoning and Tool Parsing Reasoning replay into the next turn does not have one universal correct form. Some models intentionally drop prior thinking on ordinary assistant turns. Agentic turns with interleaved tool calls are different: there, the reasoning spans often need to remain attached to the tool calls they explain. The real contract is model-specific and turn-specific. 
Anthropic's [April 23 Claude Code postmortem](https://www.anthropic.com/engineering/april-23-postmortem) gives a concrete production example of this policy: thinking from previous turns can be cleared on session resume to reduce the prefill burden after the cached prompt has expired. Contemporary reasoning models tend to produce two different kinds of assistant turns: - reasoning followed by a direct response to the user - reasoning followed by one or more tool calls Agentic models are especially good at producing turns where many reasoning and tool-call segments appear within a single response in the pattern of: ```text reasoning_0 tool_call_0 reasoning_1 tool_call_1 ``` On the next turn each reasoning span needs to stay attached to the tool call it explains. Dynamo now supports this interleaved format fully. Previously, the same turn could be reconstructed as: ```text reasoning_0 reasoning_1 tool_call_0 tool_call_1 ``` If the assistant turn is reconstructed as one generic reasoning block followed by one blob of tool calls, the model still has all the same tokens but loses the sequence and delimiters that made them meaningful. This grouped ordering came from legacy models that emitted only a single reasoning span and a single tool-call pass per turn. In addition to the reordering bug, we also found that reasoning was often being dropped too aggressively before the next turn. For some models, dropping prior thinking on turns without tool calls is an established behavior and part of the model's fine-tuning (DeepSeek-R1 is the clearest example). But that same behavior is wrong for interleaved agentic turns where the prior reasoning explains the tool sequence. This issue was difficult to spot because users could see reasoning being decoded correctly in the outgoing response while it was still being *silently malformed* or dropped before the next turn. 
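The difference between the two orderings can be made concrete with a small sketch. The `Segment` type and the `interleaved`, `grouped`, and `render` helpers are hypothetical names chosen for illustration; they are not Dynamo's parser API. The sketch only shows that both reconstructions carry the same segments while only one preserves the emitted order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One parsed piece of an assistant turn (hypothetical type)."""
    kind: str  # "reasoning" or "tool_call"
    text: str


def interleaved(segments):
    """Order-preserving reconstruction: keep segments exactly as emitted."""
    return list(segments)


def grouped(segments):
    """Legacy-style reconstruction: all reasoning first, then all tool calls."""
    reasoning = [s for s in segments if s.kind == "reasoning"]
    tool_calls = [s for s in segments if s.kind == "tool_call"]
    return reasoning + tool_calls


def render(segments):
    """Flatten segments into a space-separated string for comparison."""
    return " ".join(s.text for s in segments)


turn = [
    Segment("reasoning", "reasoning_0"),
    Segment("tool_call", "tool_call_0"),
    Segment("reasoning", "reasoning_1"),
    Segment("tool_call", "tool_call_1"),
]

print(render(interleaved(turn)))
print(render(grouped(turn)))
```

Both outputs contain the same four segments, but the grouped form no longer tells the model which reasoning span explains which tool call.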
We validated this against a Dynamo + TRT-LLM deployment: Nemotron-3-Super-120B-A12B-NVFP4 on 4x B200 with TP=4, with `--enable-anthropic-api`, `--strip-anthropic-preamble`, `--enable-streaming-tool-dispatch`, the `nemotron_deci` reasoning parser, and the `qwen3_coder` tool call parser.

### Combined Reasoning and Tool Calls

A model that reasons before calling a tool generates a response where reasoning content flows first, followed by tool-call XML. In the case of Nemotron, two different parsers, `nemotron_deci` for reasoning and `qwen3_coder` for tool calls, have to split that stream into the correct Anthropic Messages API content blocks without interfering with each other.

We sent the same prompt five times through the Anthropic Messages API: a system prompt instructing the model to think step by step, two tool definitions (calculator and weather), and the user message "Think carefully about what 15 * 23 equals, then use the calculator to verify." The response structure from a representative round:

```json
{
  "content": [
    {
      "type": "thinking",
      "thinking": "I need to calculate 15 * 23. Let me think: 15 * 20 = 300, and 15 * 3 = 45, so 300 + 45 = 345. I'll use the calculator to verify.\n"
    },
    {
      "type": "tool_use",
      "id": "call-a3364797-3160-4e84-b567-5c495694d502",
      "name": "calculator",
      "input": { "expression": "15 * 23" }
    }
  ],
  "stop_reason": "tool_use",
  "usage": { "input_tokens": 403, "output_tokens": 95 }
}
```

### Streaming Two Parsers at Once

The streaming path makes the parser interaction more visible. A streaming request produces a sequence of SSE events, and the event type sequence shows exactly how the two parsers carve up the token stream:

```text
1ms message_start
82ms content_block_start type=thinking
82ms content_block_delta (thinking tokens stream here, ~7ms apart)
...
(~70 thinking deltas over ~520ms)
602ms content_block_stop
602ms content_block_start type=text
602ms content_block_delta
800ms content_block_stop
800ms content_block_start type=tool_use
800ms content_block_delta
800ms content_block_stop
814ms message_delta stop_reason=tool_use
814ms message_stop
```

The thinking block streams token by token from `82ms` to `602ms`. Then a brief text block appears (the whitespace between the thinking and tool call regions of the raw token stream). Then the tool_use block arrives at `800ms` as a single structured unit. The `message_stop` follows at `814ms`.

This round-trip did not produce the correct Anthropic event sequence until [PR #7358](https://github.com/ai-dynamo/dynamo/pull/7358). The fix had three parts:

1. **One owner for reasoning parsing**: reasoning parsing used to happen at multiple competing layers. The backend parser could split model output into `reasoning_content` and normal `content`, while the Anthropic streaming converter still tried to infer reasoning-tag boundaries when mapping the same stream into Anthropic content blocks. PR #7358 made ownership explicit. If a backend path has already produced structured reasoning deltas, the Anthropic converter trusts them and only maps them into the response format.
2. **Template-native reasoning when available**: Dynamo now checks whether the active chat template knows how to read `reasoning_content`. Templates like Nemotron and Qwen3 read that field directly, so Dynamo leaves it alone and lets the template decide how much prior thinking to keep. If the template only understands `content`, Dynamo falls back to the legacy representation: preserve reasoning by inserting tagged reasoning blocks into `content`, or leave it out when the model/parser policy says prior thinking should not carry into the next turn. Both the Rust preprocessor path (`ModelInput::Tokens`) and the Python worker path (`ModelInput::Text`) use this same conditional rule.
3. 
**Respect per-request thinking controls**: Many templates default `truncate_history_thinking=true` to save context. That is reasonable for ordinary chat, but it removes the reasoning behind prior tool calls in agent workflows. Dynamo now changes that behavior only for requests where reasoning is actually in play: when a reasoning parser is configured and the client has not disabled thinking, the Anthropic path sets `enable_thinking=true` and `truncate_history_thinking=false`. That keeps the next-turn context agents need without changing the default for requests or models that should run without thinking. In our B200 experiment with a 52K-token system prompt and an assistant turn containing about 500 tokens of thinking, the unchanged next-turn prefix landed at `167ms` TTFT while mutated thinking landed at `322ms`. That is a `1.9x` increase, or about `155ms` per request, from changing the reasoning content inside the next-turn prefix. The key takeaway is that the harness, parser, and template path must agree on each model's expected reasoning behavior. Dropping thinking on ordinary turns may be correct for one model and wrong for another. Preserving interleaved reasoning on tool-calling turns may be essential even when ordinary turns are allowed to strip it. **In practice, you should not assume that the tokens produced on turn `N` will automatically arrive unchanged as the prefix of turn `N+1`.** Whether that is true depends on the reasoning parser, tool parser, and chat template for the model you are serving. ## Streaming Tool Calls Streaming tokens make the user experience feel more responsive and dynamic. The hard part is preserving that streaming behavior while still emitting tool calls as coherent blocks. In the older Dynamo path, reasoning tokens streamed back normally, but tool calls stayed buffered until the very end of the turn before being released all at once to the harness. 
That reduces responsiveness and delays tool execution even when the model has already decided what to call.

| State | What the harness sees | When tool readiness becomes visible |
|-------|------------------------|-------------------------------------|
| Buffered | tool-call chunks withheld | only at `finish_reason: "tool_calls"` |
| Inline streaming | regular tool-call deltas | as soon as the model emits them |
| Dispatch | typed `event: tool_call_dispatch` side channel | at the same structural completion point, but already parsed |

The important change is from the first row to the latter two. That is where the harness stops waiting for stream end to learn that it needs to act.

Without dispatch, the harness sees a regular token stream and has to infer when a tool call is complete by accumulating deltas and waiting for enough structure to be present. With dispatch enabled, Dynamo can emit a typed SSE side channel:

```text
event: tool_call_dispatch
data: {"choice_index":0,"tool_call":{"index":0,"id":"call-...","type":"function","function":{"name":"calculator","arguments":"{\"expression\":\"42 * 17\"}"}}}
```

That event tells the harness, in one shot, that the tool call is ready to execute. No harness-side delta assembly, no guessing whether the arguments are complete, and no custom parser living inside the harness. This makes Dynamo more easily compatible with custom harnesses.

![Tool dispatch timing on a single turn. Standard servers surface the tool call only after the entire stream finishes; Dynamo emits a typed tool_call_dispatch event the moment the call is parsed, so the tool can run in parallel with the rest of the stream. 
Δt is the time saved per tool call.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/33afb2f43dba4829eabfa722bf1207a68a3a193515a8d1b6179f2104579c13c4/digest/agentic-inference/images/fig-3-streaming-dispatch-timeline.svg) ## Anthropic API Fidelity for Claude Code and OpenClaw Claude Code and OpenClaw both exercise the Anthropic Messages API rather than only text generation behind an endpoint. Matching the harness experience depends on a collection of smaller behaviors that are easy to miss in ad hoc testing: - model metadata at both `GET /v1/models` and `GET /v1/models/{model_id}` - correct handling of slashed model IDs - useful `input_tokens` in `message_start` - acceptance of `cache_control` Once the frontend is reachable and compliant, both harnesses can point at Dynamo's Anthropic-compatible endpoint: ```bash ANTHROPIC_API_KEY=local-dev-token \ ANTHROPIC_BASE_URL=http://localhost:8000 \ ANTHROPIC_CUSTOM_MODEL_OPTION=nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="Dynamo NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \ claude --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 ANTHROPIC_API_KEY=local-dev-token \ ANTHROPIC_BASE_URL=http://localhost:8000 \ npx openclaw agent --local -m "Say ok" --json ``` The fixes in this area brought the custom deployment closer to the native backend behavior. One concrete example shows the flavor of these bugs better than a long checklist. During startup, the harness asks for details about the selected model directly, but Dynamo did not yet serve that endpoint: ```text GET /v1/models/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 HTTP/1.1 404 Not Found ``` Another example is `message_start` reporting `input_tokens: 0` even when the final response later contains the real count. This can make the token count in the harness temporarily drop to `0` every time a new turn starts. 
[PR #7234](https://github.com/ai-dynamo/dynamo/pull/7234) fixed that Anthropic path by populating `input_tokens` before the stream begins. Those counts are also control-plane data for long sessions: harnesses use context length to decide when to compact the conversation before the next request would exceed the model window. The broader tokenizer-service work landed separately in [PR #7699](https://github.com/ai-dynamo/dynamo/pull/7699), which added `/v1/tokenize` and `/v1/detokenize` endpoints for accurate token counts before a request is processed by the engine. ## Responses and Codex Fidelity The Codex-facing version of the same problem lives on the `v1/responses` side. Passing compliance tests is not enough to provide parity in user experience. We found that a Responses API request could not survive an internal round-trip without losing the fields that made it a Responses request rather than a chat completions request. Preserving those fields required architectural changes in Dynamo's `ResponseParams` path, together with the upstream type-alignment work in [PR #6089](https://github.com/ai-dynamo/dynamo/pull/6089). Codex should point at Dynamo through the OpenAI-compatible Responses API with request compression enabled: ```bash OPENAI_API_KEY=local-dev-token \ codex exec \ -c 'openai_base_url="http://localhost:8000/v1"' \ -c 'features.enable_request_compression=true' \ -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ "Say ok" ``` ### Codex Model Metadata Shapes the Request Codex parity begins before the first `POST /v1/responses`. The CLI resolves the configured model string into a local model-catalog record, and the resulting `ModelInfo` controls the harness state built around the model: base instructions, history formatting, tool registry, reasoning parameters, verbosity controls, image support, context accounting, tool-output truncation, `parallel_tool_calls`, and the final Responses payload. 
Two endpoints can serve the same underlying model and still drive different agent behavior if Codex attaches different catalog metadata. The request may validate against the schema while the harness around it has changed. Tool-output truncation is a useful example. Codex does not replay unlimited command output into the next model turn. Shell and tool observations are truncated according to the selected model's catalog policy before they re-enter context. In the catalog snapshot we tested, `gpt-5.5` used: ```json { "mode": "tokens", "limit": 10000 } ``` By contrast, `openai/openai/gpt-5.5` on a custom endpoint used fallback metadata: ```json { "mode": "bytes", "limit": 10000 } ``` Those budgets are not equivalent. A `10,000`-byte limit cuts off structured logs, tracebacks, JSON, or test output much earlier than a `10,000`-token limit for ASCII-heavy coding output. For a coding agent, that changes what the model can inspect after a failed test, a search command, or a compiler error. The model may need additional tool calls to recover context that the intended catalog profile would have preserved. Reasoning settings are also catalog-derived. Codex sends a Responses `reasoning` object when the selected model metadata says reasoning summaries are supported. In that path, Codex also requests `reasoning.encrypted_content` so reasoning state can be replayed across turns. Fallback metadata removes that path. Prompting changes too. In Codex, switching from the fallback/default profile to the `gpt-5.5` catalog profile changes the system prompt. The fallback prompt is organized around generic Codex operation (`# How you work`, `# AGENTS.md spec`, `# Tool Guidelines`) and emphasizes `AGENTS.md` precedence, planning, validation, and shell-search habits. 
The `gpt-5.5` prompt is a different instruction document (`# Personality`, `# General`, `# Working with the user`) that frames the agent as a pragmatic software engineer and adds stronger guidance on codebase reading, local-pattern reuse, scoped edits, dirty worktrees, `apply_patch`, collaboration updates, and final-answer formatting. Catalog aliasing therefore affects base behavioral policy as well as request fields such as truncation and reasoning.

We saw this directly in a 50-task subset of SWE-Bench Verified. In this setup, both routes reached OpenAI-served GPT-5.5; the difference was the endpoint and the model-catalog record Codex attached to it. When the custom endpoint used model ID `openai/openai/gpt-5.5` without being associated with the `gpt-5.5` catalog profile, Codex used generic fallback behavior. In one run, the fallback profile issued roughly half as many tool calls:

| Catalog profile | Total tool calls | Per task |
|-----------------|------------------|----------|
| `gpt-5.5` profile | 2,087 | 41.7 |
| Fallback profile | 1,048 | 21.0 |
| Delta | -1,039 | -20.8 |

The paired comparison pointed in the same direction on every task: the `gpt-5.5` profile used more tools in `50 / 50` tasks, while the fallback profile used more tools in `0 / 50`. A permutation test put the difference at `p < 0.001`.

After adding a model-catalog alias so `openai/openai/gpt-5.5` inherited the intended `gpt-5.5` profile, the two profiles came much closer in the same 50-task setup:

| Catalog profile | Total tool calls | Per task |
|-----------------|------------------|----------|
| `gpt-5.5` profile | 2,081 | 41.6 |
| Alias-backed custom profile | 2,205 | 44.1 |
| Delta | +124 | +2.5 |

The remaining difference was not statistically significant in this run: the permutation test was about `p = 0.22`, and the paired directions were mixed (`20 / 50` tasks favored the native profile, `28 / 50` favored the alias-backed profile, and `2 / 50` tied).
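For readers who want to run a similar check on their own per-task tool-call counts, here is a minimal sketch of one common paired variant, a sign-flip permutation test. The post does not say which permutation test was used, so treat this as an assumption. The function name and the per-task numbers below are made up for illustration and are not the SWE-Bench data.

```python
import random


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the label on each pair is exchangeable,
    so each per-task difference is equally likely to be +d or -d.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Synthetic per-task tool-call counts (illustrative only).
profile_calls = [40 + i % 5 for i in range(50)]
fallback_calls = [20 + i % 3 for i in range(50)]

p_value = paired_permutation_test(profile_calls, fallback_calls)
print(f"estimated p-value: {p_value:.4f}")
```

When every task moves in the same direction, as in the first comparison, almost no sign-flipped resample can match the observed total, so the estimated p-value bottoms out near `1 / (n_resamples + 1)`.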
For Dynamo, the implication is that Codex compatibility needs to be evaluated at the catalog and request-shaping layer as well as the HTTP schema layer. If Codex cannot resolve a model ID into the intended profile, fallback defaults may change truncation, search-tool availability, verbosity controls, reasoning-summary support, and parallel tool-call support before Dynamo receives the request. ## What's Next Dynamo now has `nvext.agent_hints`: `latency_sensitivity`, `priority`, `osl`, and `speculative_prefill`. Those fields give the harness a way to say more about the turn than the prompt alone. A session waiting on a user reply is not the same as one working through a long background tool sequence, and the API can now carry some of that difference. In the v1.1.0 line, Dynamo is also making more of the agent stack available as reusable pieces. The protocol, parser, and tokenizer layers are versioned as standalone crates, including `dynamo-protocols`, `dynamo-parsers`, and `dynamo-tokenizers`. That gives teams a way to build or customize a harness-facing serving path without copying Dynamo internals into a separate project. This is also the bridge to longer-running systems such as AutoResearch. The first post explained why agentic workloads stress the serving stack. This post shows the harness-facing contract needed to run those workloads correctly and sets the stage for efficient long-running agents backed by Dynamo endpoints. > How Dynamo optimizes for agentic workloads at three layers: frontend API, router, and KV cache management. # Full-Stack Optimizations for Agentic Inference with Dynamo Coding agents are starting to write production code at scale. [Stripe’s agents generate 1,300+ PRs per week](https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents). [Ramp attributes 30% of merged PRs to agents](https://www.infoq.com/news/2026/01/ramp-coding-agent-platform/). 
[Spotify reports 650+ agent-generated PRs per month](https://engineering.atspotify.com/2025/11/spotifys-background-coding-agent-part-1). Tools like Claude Code and Codex make hundreds of API calls per coding session, each carrying the full conversation history. Behind every one of these workflows is an inference stack under significant KV cache pressure.

![Cumulative cache reads vs writes across a 42-call Claude Code session. Cache reads (891K tokens) grow steeply while writes (76K) and uncached input stay flat -- an 11.7x read/write ratio.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/84f818be92c8353e77ed1e76484878a0b12b0403ce62ba6fb58d4b127b2d25f0/digest/agentic-inference/cumulative-reads-writes.png)

Let's take Claude Code as an example. After the first API call that writes the conversation prefix to KV cache, every subsequent call to the same worker sees an 85-97% cache hit rate. Agent teams (or swarms) push this further, with a 97.2% aggregate cache hit rate across 4 Opus teammates. An 11.7x read/write ratio means the system reads from cache nearly 12 times for every token it writes. This is a write-once-read-many (WORM) access pattern: the system prompt and growing conversation prefix are computed once, then served from cache on every subsequent call. Maximizing cache reuse rate across all workers and keeping KV blocks warm and routable is the central optimization target for agentic inference.

These numbers come from managed API infrastructure where the provider controls prefix matching, cache placement, and eviction. For teams running open-source models on their own GPUs, none of this exists out of the box. We have been building Dynamo to close that gap. This post walks through how we are making Dynamo agent-native at three layers: the frontend API, the router, and KV cache management.

Throughout this post, we use three terms consistently:

- **Harness**: the agent framework that drives the workflow (Claude Code, Codex, OpenClaw, OpenCode, etc.)
- **Orchestrator**: Dynamo's routing, scheduling, and cache management layer
- **Runtime**: the inference engine that executes the model and owns the KV cache manager (SGLang, vLLM, TRT-LLM)

## Layer 1: The Frontend

### Multi-Protocol Support

Agent harnesses are increasingly adopting `v1/responses` and `v1/messages` over `v1/chat/completions` to cleanly handle new patterns including interleaved thinking and tool calls. The key difference in these APIs is structural. In `v1/chat/completions`, message content is a flat string and tool calls are bolted on as a separate field. As an example, notice how the [GLM](https://docs.z.ai/guides/capabilities/thinking-mode#example-usage) and [MiniMax](https://platform.minimax.io/docs/guides/text-m2-function-call#important-note) APIs handle interleaved thinking differently when hosting their models behind the `v1/chat/completions` endpoint. The `v1/responses` and `v1/messages` APIs use typed content blocks, so a single assistant turn can contain thinking, tool calls, and text as distinct objects. This matters for inference because the orchestrator can see block boundaries, perform prompt optimizations, and apply different cache and scheduling policies per block type.

Dynamo serves all three endpoints through a common internal representation, so a single deployment can act as the inference backend for any harness. Our team has been running a Dynamo deployment of GLM-5 and MiniMax-M2.5 internally to power our Codex and Claude Code harnesses. This lets us benchmark our backend implementations against closed-source inference, targeting parity on cache reuse performance. We will be sharing a full write-up and some optimized recipes for deploying both models in the coming weeks.
**Serving Claude Code with Dynamo** **Serving Codex with Dynamo**
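To make the structural difference described under Multi-Protocol Support concrete, the sketch below puts a flat `v1/chat/completions` assistant message next to a typed-block `v1/messages` assistant message. The field shapes follow the examples shown elsewhere in this post; the message contents and the `block_types` helper are hypothetical and only illustrate what a block-aware consumer can see.

```python
# Chat completions: content is one flat string, tool calls sit in a side field.
chat_completions_message = {
    "role": "assistant",
    "content": "Let me verify that with the calculator.",
    "tool_calls": [
        {
            "id": "call-1",
            "type": "function",
            "function": {
                "name": "calculator",
                "arguments": "{\"expression\": \"15 * 23\"}",
            },
        }
    ],
}

# Messages API: one assistant turn is an ordered list of typed blocks.
messages_api_message = {
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "15 * 20 = 300 and 15 * 3 = 45."},
        {"type": "text", "text": "Let me verify that with the calculator."},
        {
            "type": "tool_use",
            "id": "call-1",
            "name": "calculator",
            "input": {"expression": "15 * 23"},
        },
    ],
}


def block_types(message):
    """Return the ordered block types a consumer can observe in a message."""
    content = message["content"]
    if isinstance(content, str):
        # A flat string exposes no block boundaries beyond "text".
        return ["text"]
    return [block["type"] for block in content]


print(block_types(chat_completions_message))
print(block_types(messages_api_message))
```

With typed blocks, the ordering of thinking, text, and tool use is part of the payload itself, which is what lets a serving layer apply per-block policies.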
We have also invested in day-0 tool call and reasoning parsing support for various open-source models. If you find that a model is not supported, please [open an issue](https://github.com/ai-dynamo/dynamo/issues) or use the [tool-call-parser-generator](https://github.com/ai-dynamo/dynamo/blob/main/.agents/contributor-skills/tool-parser-generator/SKILL.md) skill to generate it with your harness of choice.

### Agent Hints: The Harness-Orchestrator Interface

Today, inference servers see anonymous tokenized requests. But agent harnesses have global context that the infrastructure never sees: which agents are blocked on tool calls, which just spawned, how many turns remain in a session, and whether the current call is a quick lookup or a long synthesis. When using coding agents, the user waits for a final result, not individual token streams, so the orchestrator can reorder and prioritize requests across agents without affecting the end-user experience. Sessions run for minutes to [even days](https://factory.ai/news/missions) with long tool-call pauses. This context is enough to optimize inference scheduling in ways that traditional serving cannot.

![Where nvext fits in the agentic protocol stack alongside MCP and A2A](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/7140ae782b4da98d32855f75782fe2f93f4ea4d9a9977ee221a365283b955ba1/digest/agentic-inference/protocol-stack.svg)

Dynamo’s new agent hints extension was designed to bridge this gap. It allows any harness to attach structured hints to a request across all three API endpoints, giving the router and runtime the context they need to make agent-aware scheduling and caching decisions. This is a v1 API that we are actively co-designing with the community, and we would love feedback from teams building agent harnesses on which signals are most useful. Please reach out to us if you have any ideas or feedback!
```json
{
  "model": "MiniMaxAI/MiniMax-M2.5",
  "messages": [...],
  "tools": [...],
  "nvext": {
    "agent_hints": {
      "osl": 256,
      "speculative_prefill": true,
      "priority": 10
    },
    "cache_control": {
      "type": "ephemeral",
      "ttl": "1h"
    }
  }
}
```

The `agent_hints` fields:

- **`priority`** controls scheduling across both the router and engine. Higher values mean "more important" at the Dynamo API level; Dynamo translates that into router queue ordering and backend-specific engine priority.
- **`osl`** (output sequence length) is the harness's estimate of how many tokens this request will generate. The router uses this to gauge how long a worker will be occupied, which improves load balancing. A harness can learn this over time by tracking average output lengths per tool call type.
- **`speculative_prefill`** signals the orchestrator to begin caching this request's prefix on a likely worker before the full request is ready. This is useful when the harness knows a tool call is about to return and wants to warm the cache ahead of time.

The `cache_control` field will look familiar to anyone who has used Anthropic's prompt caching API. It tells the orchestrator to pin the computed prefix on the worker for the specified TTL, protecting it from eviction during tool call gaps. Currently `ephemeral` is the only supported type (to match Anthropic's API). We discuss how this works in the cache retention section below.

You can find complete documentation on agent hints [here](../../components/frontend/nvext.md#cache-control).

## Layer 2: The Router

A coding agent follows a sequential pattern: long prefill, tool call, extend prefix, repeat. A multi-agent harness fans out work across parallel subagents with short, independent contexts. Default round-robin routing is blind to both patterns -- it cannot account for cache locality, request priority, or session structure. Dynamo's router closes this gap with three mechanisms: KV-aware placement, priority scheduling, and extensible routing strategies.
### KV-Aware Placement Without cache-aware routing, turn 2 of a conversation has a ~1/N chance of landing on the same worker as turn 1. Every miss is a full prefix recomputation which is a significant performance bottleneck and extremely costly for an end user. Dynamo's router maintains a global index of which KV cache blocks exist on which workers. The [Flash Indexer post](/dynamo/digest/flash-indexer) covers the six iterations that got this indexer to 170M ops/s (**planetary** scale KV routing). On every request, the router queries the index for per-worker overlap scores and selects the worker that minimizes the combined cost of cache miss and current decode load. This cost function is tunable, and we show below how teams can build custom agent aware routing strategies on top of it. ### Priority Scheduling `priority` is the single user-facing scheduling knob. Higher values mean "more important" at the Dynamo API level. Dynamo uses that one hint at both layers: - At the **router**, higher-priority requests are shifted earlier in the queue when `--router-queue-threshold` is enabled. - At the **engine**, Dynamo normalizes backend-specific polarity and forwards the request for queue ordering, preemption, and KV cache eviction. At the router, incoming requests enter a `BinaryHeap` ordered by effective arrival time. A higher `priority` makes the request appear as if it arrived earlier, placing it ahead of lower-priority work. Requests only enter the queue when all workers exceed a configurable load threshold. Below that threshold, they bypass the queue entirely and go straight to worker selection. When capacity frees up (prefill completes or a request finishes), the queue drains highest-priority entries first. Once dispatched, SGLang, vLLM, and TRT-LLM may interpret engine priority differently, so Dynamo normalizes the engine-facing value per backend. 
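The router-side queueing described above can be sketched in a few lines of Python. This is a toy model, not Dynamo's implementation: the class name `RouterQueueSketch`, the `seconds_per_priority` weight that converts priority into an arrival-time shift, and the use of plain floats for worker load are all assumptions made for illustration. The real router uses a Rust `BinaryHeap` and its own tuning.

```python
import heapq
import itertools


class RouterQueueSketch:
    """Toy model of a router queue ordered by effective arrival time."""

    def __init__(self, load_threshold: float, seconds_per_priority: float = 1.0):
        # Both parameters are hypothetical knobs for this sketch.
        self.load_threshold = load_threshold
        self.seconds_per_priority = seconds_per_priority
        self._heap = []                      # (effective_arrival, tiebreak, request_id)
        self._tiebreak = itertools.count()   # keeps ordering stable on ties

    def submit(self, request_id, arrival_time, priority, worker_loads):
        """Return request_id if it bypasses the queue, else None (queued)."""
        # Requests are queued only when *all* workers exceed the threshold.
        if any(load <= self.load_threshold for load in worker_loads):
            return request_id
        # Higher priority makes the request look as if it arrived earlier.
        effective = arrival_time - priority * self.seconds_per_priority
        heapq.heappush(self._heap, (effective, next(self._tiebreak), request_id))
        return None

    def drain_one(self):
        """Pop the request with the earliest effective arrival, if any."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

In this model a request with `priority=10` submitted one second after a `priority=0` request is still drained first, because its effective arrival time is shifted earlier by the priority weight.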
Engines like SGLang can also use priority-based radix cache eviction where lower-priority blocks are evicted first under memory pressure.

![How priority flows from harness through router dispatch to engine treatment](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/ce124d2d668c26783acf93061654456e7d8b8d140fa506be8b2d6b8309229b5e/digest/agentic-inference/two-gates.svg)

### Agentic Workload Routing Strategies

A research agent with a 200K context window needs workers with enough free KV capacity to hold its full state. The router's default cost function (overlap score + decode load) handles the common case, but teams with domain-specific workloads can use the router's Python bindings to implement custom routing strategies.

The core `KvRouter` class provides `best_worker()` for querying routing decisions, `get_potential_loads()` for per-worker load inspection, and `generate()` for routing + dispatch in one call. Custom routers register on the same service mesh as the default components and can override routing config per-request:

```python
# Query per-worker load and overlap for custom routing logic
loads = await router.get_potential_loads(token_ids)

# Override routing config based on request properties
# Long contexts benefit from stronger overlap credit
config = {"overlap_score_credit": 1.0} if len(token_ids) > 8192 else {}
worker_id, dp_rank, overlap = await router.best_worker(
    token_ids,
    request_id="req-123",
    update_indexer=True,
    router_config_override=config
)

# Or bypass the default selector entirely when the harness
# has its own worker selection logic (e.g., session affinity)
stream = await router.generate(
    token_ids,
    model=model,
    worker_id=chosen_worker
)
```

The [NeMo Agent Toolkit (NAT)](https://github.com/NVIDIA/NeMo-Agent-Toolkit/tree/develop/examples/dynamo_integration) team used these APIs to build a custom online-learning agentic router.
Their router extracts session metadata from `nvext` annotations and feeds it to a [Thompson Sampling](https://en.wikipedia.org/wiki/Thompson_sampling) bandit style cost function that learns which workers perform best for which prefix patterns under load. Compared to Dynamo's default routing, they measured 4x reduction in p50 TTFT and 1.5x increase in p50 tokens-per-second. Priority tagging of latency-sensitive requests achieved up to 63% p50 TTFT reduction under moderate memory pressure. See the [NAT Dynamo integration example](https://github.com/NVIDIA/NeMo-Agent-Toolkit/tree/develop/examples/dynamo_integration) for implementation details. We will be making this available as a routing strategy in Dynamo soon. ## Layer 3: KV Cache Management Agentic workloads produce blocks with vastly different reuse value -- system prompts reused every turn, reasoning tokens never reused again -- but default LRU eviction treats all blocks identically. A 2-30 second tool call pause can age out an agent's entire prefix, forcing full recomputation when it resumes. The cache needs to understand block value, support cross-worker sharing, and respect agent lifecycle boundaries. ### The Problem with Uniform Eviction | Block Type | Reuse Pattern | Value | |------------|---------------|-------| | System prompt + tool definitions | Every turn | Highest | | Conversation history | Subsequent turns, growing monotonically | High | | Thinking/reasoning tokens | Typically zero reuse after reasoning loop closes (a significant portion of output) | Near-zero | | Subagent KV | Multiple turns then agent dies. No need to retain | Near-zero | LRU sees only recency. In a high traffic environment, a wait for the completion of a called tool (2-30 seconds while the agent waits for an external API) might cause the agent's blocks to age out and when the agent resumes, the entire prefix must be recomputed. 
To solve this, we need to provide the orchestrator APIs to control which blocks should be retained, where they should live, and for how long. ### KV Cache as a Shared Resource Today, KV cache is treated as a local, ephemeral resource on each worker. An agent's ~32K-token system prompt and tool definitions are computed independently on every worker that serves its requests. When a lead agent spawns 4 subagents, each with overlapping tool definitions, that shared prefix is recomputed 4 times if the subagents land on different workers. In our analysis of Claude Code team sessions, we measured this directly: teammates averaged 79.4% cache hit rate vs. 91.3% for the lead agent's explore subagents (5.0x vs. 11.7x read/write ratio), with the gap driven almost entirely by cold-start writes on each teammate's first call. The goal is to make high value KV cache blocks available to all workers in the cluster. Essentially, they are written once during cold start and then read by any worker at all times. Solutions like SGLang's HiCache and Dynamo's KV Block Manager (KVBM) are building toward a 4-tier memory hierarchy: ![KV cache memory hierarchy: GPU (HBM) at ~ns, CPU (pinned DRAM) at ~us, Local NVMe at ~ms, Remote Storage (NIXL) at ~ms via RDMA](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/39d782302bce25c14ef629e1d593c67606c0fa5508628c4aac0546dfc66a5354/digest/agentic-inference/kv-memory-hierarchy.svg) Blocks follow a write-through path: when a worker computes KV for a prefix, the blocks flow from GPU to CPU to disk automatically. Each block is deduplicated by sequence hash in a global registry. Once a block is registered, it is immutable and addressable by any worker that can reach the storage tier. This directly solves the subagent cold-start problem. When the lead agent computes tool definitions and system prompt, those blocks write through to shared storage. 
When subagent 1 spawns on a different worker, the router queries the Flash Indexer, finds the blocks in shared storage, and the worker loads them via NIXL (RDMA read) instead of recomputing from scratch. Subagent 2 does the same. Four redundant prefill computations become one compute and three loads. The same mechanism addresses cache coherence in disaggregated prefill-decode serving. In disagg mode, the prefill worker computes KV and transfers it to the decode worker via NIXL. The decode worker generates tokens, producing new KV state. On the next turn, a prefill worker needs both the original prefix and the generated tokens from turn 1, but those live only on the decode worker. With shared storage, the decode worker writes its new blocks to the common tier and any prefill worker can fetch them on the next turn. Multi-tier storage solves sharing and persistence, but blocks still arrive on GPU only after the request hits the worker. The missing piece for agentic systems is prefetch: the harness can use historical timing data to predict when an agent's tool call might return, which means it knows which blocks will be needed and when. We are building prefetch hooks so the harness can signal "bring these blocks from storage to GPU ahead of the next request." Combined with the retention APIs (below), this gives the harness full lifecycle control: pin blocks to prevent eviction, set priority to control eviction ordering, and prefetch blocks proactively before they are needed. ![During tool calls, KV blocks offload to host memory and storage, then prefetch back to GPU before the next LLM call.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/9b35b71748bc859615eecd6e09c72fc663c1563bb455a2098148a3ee2a8409ba/digest/agentic-inference/tool-call-offload-prefetch.svg) ### Selective Cache Retention Making blocks globally available solves the sharing problem, but does not solve eviction. 
SGLang and vLLM both support priority-based eviction via a priority heap where the harness assigns a numeric priority per request and lower-priority blocks are evicted first. TensorRT-LLM takes this further with `TokenRangeRetentionConfig` (designed and implemented by a Dynamo team member[@jthomson04](https://github.com/jthomson04)) which allows per-region control within a single request. A request carries zero or more directives. Blocks without directives follow the default LRU path with zero overhead. The evictor becomes a two-structure system: an LRU free list for unprioritized blocks (O(1), unchanged) and a priority queue for annotated blocks. The harness can express "system prompt blocks are evicted last (priority: 100); conversation context survives a 30-second tool call (duration: 45s); decode tokens are first to go (priority: 1)" without the engine needing to understand why. Anthropic's prompt caching lets you mark prefixes as cacheable on their infrastructure. Dynamo's `cache_control` API brings the same semantics to self-hosted inference. When a request includes `cache_control: { type: "ephemeral", ttl: "1h" }`, the router pins the matching prefix nodes in the worker's radix tree for that TTL, protecting them from eviction in the worker's L2 storage. The next step is connecting retention with the distributed cache. Today, retention directives apply to a single worker's local cache. When a block is pinned on worker A but the next request routes to worker B, the pin does not follow. Extending retention semantics across HiCache/KVBM's shared storage tier means the harness can pin a block once and have it survive across workers: the priority and TTL metadata travel with the block through the write-through path, and any worker that loads the block from shared storage inherits the retention policy. Combined with the prefetch hooks described above, this gives the harness end-to-end lifecycle control across the full memory hierarchy. 
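As a rough illustration of the two-structure evictor described in this section, the sketch below keeps an LRU list for unannotated blocks and a min-heap for blocks that carry a priority. It is not the TensorRT-LLM or Dynamo implementation. The class and method names are hypothetical, TTL/duration handling is left out, and the choice to drain the LRU list before touching any prioritized block is an assumption of this sketch rather than something the engines guarantee.

```python
import heapq
from collections import OrderedDict


class TwoStructureEvictor:
    """Toy evictor: LRU list for plain blocks, priority heap for annotated ones."""

    def __init__(self):
        self._lru = OrderedDict()   # block_id -> None, oldest entry first
        self._heap = []             # (priority, seq, block_id), lowest priority first
        self._live = {}             # block_id -> seq of its current heap entry
        self._seq = 0

    def release(self, block_id, priority=None):
        """Mark a block as evictable, optionally with a retention priority."""
        if priority is None:
            # Unprioritized path: O(1) append to the LRU list.
            self._lru[block_id] = None
            self._lru.move_to_end(block_id)
        else:
            # Annotated path: push onto the heap; older entries become stale.
            self._seq += 1
            self._live[block_id] = self._seq
            heapq.heappush(self._heap, (priority, self._seq, block_id))

    def evict(self):
        """Return the next block to evict, or None if nothing is evictable."""
        if self._lru:
            block_id, _ = self._lru.popitem(last=False)
            return block_id
        while self._heap:
            _, seq, block_id = heapq.heappop(self._heap)
            if self._live.get(block_id) == seq:   # skip stale heap entries
                del self._live[block_id]
                return block_id
        return None
```

With system prompt blocks released at priority 100 and decode blocks at priority 1, this evictor reclaims unannotated blocks first, then decode blocks, and the system prompt last.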
### Agent Lifecycle Awareness

Consider a typical Claude Code session. The lead agent runs for 20+ turns, accumulating a growing conversation prefix. Along the way it spawns explore subagents that each run 1-3 turns and terminate. It might spawn a team of 4 specialists that work in parallel on different subtasks and then terminate. Midway through, the agent hits a context limit and summarizes its history, compressing ~175K tokens down to ~40K.

Each of these events produces ephemeral KV: blocks that will never be referenced again. Subagent termination, context summarization, and closed reasoning loops all generate ephemeral KV that occupies the same memory as high-value blocks like the system prompt. Reasoning models amplify this: `<think>...</think>` blocks account for ~40% of generated tokens but become ephemeral the moment the reasoning loop closes. Without lifecycle awareness, the cache treats all of these blocks identically.

![Lead agent conversation flow branches to a sub-agent. The sub-agent's ephemeral KV is evicted on session end.](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/f18e58d72ce99b171d7fc3a135a136b65fe35d49cbd378c829ae6bd1825c823e/digest/agentic-inference/subagent-lifecycle-vertical.svg)

The retention primitives from above (priority, TTL, token ranges) give us the building blocks. What is missing is the ability to associate them with sessions. If the harness can tag a subagent's requests as belonging to a session and mark that session's KV as ephemeral, the evictor can target those blocks first and skip writing them to shared storage entirely. When the subagent terminates, its session's blocks are the first to be reclaimed. The same mechanism applies to thinking tokens: the engine can detect the thinking-block boundaries during generation and tag those blocks as ephemeral at insertion time, so they skip L2 write-back and are evicted before normal blocks without any external signal.
The design space here is wide: harness-driven session tagging, engine-native semantic detection, hybrid approaches that combine both. We are actively exploring multiple directions and expect the right answer will vary by workload and framework. ## Closing the Gap The biggest optimization surface in agentic inference is the gap between what the harness knows and what the infrastructure can see. Which agents are blocked, which are about to resume, which KV is worth keeping, which can be thrown away -- all of this context exists at the harness layer but never crosses the API boundary. `nvext.agent_hints` is our first cut at closing that gap: a small set of structured signals that let the orchestrator make informed routing, scheduling, and cache management decisions instead of treating every request as anonymous tokens. This is a v1 API and we are actively evolving it. If you are building agent harnesses, running open-source models for agentic workloads, or thinking about cache-aware inference, we want to hear what signals matter most for your use case. Reach out on [GitHub](https://github.com/ai-dynamo/dynamo) or tag us on X: [@0xishand](https://x.com/0xishand), [@KranenKyle](https://x.com/KranenKyle), [@flowpow123](https://x.com/flowpow123). # Flash Indexer: A Story of Inter-Galactic KV Routing > Dynamo's Flash Indexer tracks every cached KV block across all inference workers at 170M ops/s. Six iterations of data structure design got it there. The **Flash Indexer** is a concurrent global index of every cached KV block across every inference worker, sustaining over **100 million operations per second**. It evolved through six iterations—from a Python dictionary to a jump-optimized spatial index—to the point where network latency, tokenization, and hashing are the bottlenecks. We're shipping it as the default indexer in Dynamo v1.0.0. 
For scale intuition: at 100M+ index ops/sec, the system can support approximately $$N \approx 10^8 / r$$ concurrent workloads, where $$r$$ is the workload's sustained index ops/sec (inserts + lookups) under real traffic, including bursty prefill, well beyond current planetary-scale inference demand. This post walks through those iterations—how each redesign drove a new order-of-magnitude improvement, and the specific data structure or concurrency breakthrough behind it. --- ## 1. Background ### 1.1 KV Block Identity Every cached block carries three identifiers: - **Local block hash** (`u64`): Content hash of the tokens within a single block. Position-independent—two blocks with the same tokens produce the same hash. Both the frontend and publisher use the same algorithm. - **Sequence block hash** (`u64`): Rolling hash of the entire prefix up to this block. Position-dependent—identical tokens at different positions produce different hashes. ```text seq_hash[0] = local_hash[0] seq_hash[i] = hash(seq_hash[i-1] || local_hash[i]) ``` - **Worker ID**: Which worker holds the block. Local hashes are deliberately *chunk hashes* (no prefix context) so frontends can hash query blocks cheaply in parallel. The tradeoff: chunk hashes can't distinguish position. *"Predict the next token | Learn from the error | Predict the next token."* produces identical hashes at blocks 0 and 2. This collision problem drives every data structure decision below. ### 1.2 Events and Requests The indexer handles two kinds of traffic: **KV Events** (writes): A publisher sitting alongside each engine emits `Store(worker_id, local_hash, seq_hash)` when a block is cached and `Remove(worker_id, seq_hash)` when evicted. We need explicit events because engines cache blocks beyond request lifetime and their eviction policies (LRU sweeps, memory pressure, preemption) are opaque—there's no way to infer cache state from request-response cycles alone. 
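To make the `Store` payload concrete, the sketch below computes the two hashes from Section 1.1 for a token sequence and builds the events a publisher would emit. Several details are assumptions for illustration only: the post does not name the hash function, so `hashlib.blake2b` truncated to 64 bits stands in for it; the block size of 4 tokens, the choice to hash only full blocks, and the function names are likewise hypothetical.

```python
import hashlib
import struct


def _hash64(data: bytes) -> int:
    # Stand-in 64-bit hash; the real algorithm is not specified in the post.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def local_hashes(tokens, block_size):
    """Chunk hashes: depend only on the tokens inside each full block."""
    full = len(tokens) - len(tokens) % block_size   # drop a trailing partial block
    out = []
    for start in range(0, full, block_size):
        chunk = tokens[start:start + block_size]
        out.append(_hash64(struct.pack(f"<{block_size}I", *chunk)))
    return out


def sequence_hashes(local):
    """Rolling hashes: seq[0] = local[0]; seq[i] = hash(seq[i-1] || local[i])."""
    seq = []
    for i, lh in enumerate(local):
        if i == 0:
            seq.append(lh)
        else:
            seq.append(_hash64(struct.pack("<QQ", seq[-1], lh)))
    return seq


def store_events(worker_id, tokens, block_size=4):
    """Build Store(worker_id, local_hash, seq_hash) tuples for each full block."""
    local = local_hashes(tokens, block_size)
    return [("Store", worker_id, lh, sh)
            for lh, sh in zip(local, sequence_hashes(local))]
```

For a sequence whose first and third blocks contain the same tokens, the two local hashes are equal while the two sequence hashes differ, which is the position ambiguity that the later data structures have to resolve.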
The stream is bursty: prefills produce dozens of stores at once; eviction sweeps produce bursts of removes.

*Figure: KV Event Density.* Per-worker and aggregate KV cache event density heatmap derived from 5% of the Mooncake FAST'25 trace, replayed across 16 Mocker workers with 2,048 GPU blocks each. Green cells indicate Store-dominant time bins (prefill bursts); amber cells indicate Remove-dominant bins (eviction sweeps). The diverging colorscale is clamped at ±10 events per worker and ±100 events aggregate, highlighting the bursty, temporally correlated nature of KV cache traffic that the Flash Indexer must sustain at line rate.

**Requests** (reads): On every inference request, the frontend sends a sequence of chunk hashes `[local_hash_0, ..., local_hash_D]`. The indexer returns `(worker_id, match_depth)` scores so the router can pick the worker with the deepest cached prefix.

*Figure: KV Event Flow.* Each engine is paired with a publisher that enriches raw KV events with worker identity and block hashes, then broadcasts them via pub/sub. The router requests store lookups from the indexer, which computes prefix overlap scores used for routing decisions.

Both paths are hot. Slow events mean stale routing decisions. Slow queries to the indexer mean user-facing latency. The design goal: keep both fast without mutual contention.

---

## 2. Nested Dictionary → Rust Actor

### 2.1 Python Dictionary

The simplest possible index is a nested dictionary. For each worker, store a mapping from local block hash to the set of external sequence hashes that share that chunk hash. Since local hashes are chunk hashes, the same tokens can appear at different positions in different sequences, and a single local hash can map to multiple sequence hashes on the same worker. To find matches, iterate every worker and walk through the query sequence, checking for hits.
```py
class KvIndex:
    def __init__(self) -> None:
        # worker_id -> { local_hash -> set of seq_hashes }
        self.index: dict[int, dict[int, set[int]]] = {}

    def store(self, worker_id: int, blocks: list[tuple[int, int]]) -> None:
        # blocks is a list of (local_hash, seq_hash) pairs
        if worker_id not in self.index:
            self.index[worker_id] = {}
        for local_hash, seq_hash in blocks:
            if local_hash not in self.index[worker_id]:
                self.index[worker_id][local_hash] = set()
            self.index[worker_id][local_hash].add(seq_hash)

    def remove(self, worker_id: int, seq_hashes: list[int]) -> None:
        if worker_id not in self.index:
            return
        # No reverse lookup: every local_hash entry is scanned per removal
        for seq_hash in seq_hashes:
            for hashes in self.index[worker_id].values():
                hashes.discard(seq_hash)

    def find_matches(self, query: list[int]) -> dict[int, int]:
        scores: dict[int, int] = {}
        for worker_id, blocks in self.index.items():
            depth = 0
            for local_hash in query:
                # The truthiness check skips entries emptied by remove()
                if local_hash in blocks and blocks[local_hash]:
                    depth += 1
                else:
                    break
            if depth > 0:
                scores[worker_id] = depth
        return scores
```

There is a correctness issue with this approach. `local_hash in blocks` tells us the worker has *some* block with those tokens, but not *which* one: different sequences sharing the same chunk hash are conflated. This collision problem shapes every data structure decision that follows.

This is `O(W × D)` per query (W workers, D query depth). With hundreds of workers and sequences thousands of blocks long, it's a non-starter.

### 2.2 Rust Actor

Porting to Rust (`HashMap<WorkerId, HashMap<LocalHash, HashSet<ExternalHash>>>`) eliminates interpreter overhead. A **single-threaded actor** owns the index exclusively and communicates through channels. This is correct and lock-free, but it serializes all reads behind all writes. The single thread is the throughput ceiling.

---

## 3. Inverted Index

`worker -> { hash -> ... }` forces `find_matches` to iterate every worker. But the question is *"which workers have this block?"*, which is keyed by block, not worker. Instead of iterating workers and checking blocks, build an index keyed by `LocalHash` that maps to the sequence hashes and their worker sets.
```rust
// local_hash -> (seq_hash -> set of workers)
index: HashMap<LocalHash, HashMap<ExternalHash, HashSet<WorkerId>>>
```

Now `find_matches` traverses the query once. At each position, take the union of worker sets. Workers only *drop out* as you go deeper, and each is drained at most once, giving **O(D + W)** instead of O(W × D).

The inverted index is a major win for reads, but every data structure choice is a two-sided tradeoff between query performance and update cost.

On the read side, the collision issue from Section 2.1 resurfaces in a different shape. When we union worker sets across sequence hashes at a given local hash, we conflate workers that cached different sequences sharing the same chunk. The seq hash data is in the index, but `find_matches` cannot use it without computing the query's own seq hashes, which reintroduces rolling hash computation on the read path, exactly what chunk hashes were designed to avoid.

On the write side, removes are equally expensive: without a per-worker reverse lookup, removing a block by seq hash requires scanning the entire index. We could add a reverse lookup table, but that's more bookkeeping on every store. The radix tree resolves both.

---

## 4. Radix Tree

Each node has a small children map keyed by `LocalHash`, plus a worker set. Parent-child relationships scope collision risk: two blocks with the same chunk hash collide only if they share the same parent, which means the same prefix. Different prefixes lead to different parents. This requires one new field in KV events: the **parent hash**, so the tree can link child to parent as events arrive.

*Figure: Prefix Tree Structure.* Prefix-aware radix tree indexes cached blocks by local hash. Shared prefixes branch where sequences diverge; each node records which workers hold that block.

```rust
type SharedRadixBlock = Rc<RefCell<RadixBlock>>;

struct RadixBlock {
    children: HashMap<LocalHash, SharedRadixBlock>,
    workers: HashSet<WorkerId>,
    block_hash: Option<ExternalHash>,
}

struct RadixTree {
    root: SharedRadixBlock,
    lookup: HashMap<WorkerId, HashMap<ExternalHash, SharedRadixBlock>>,
}
```

Each node also carries a sequence hash.
A per-worker **lookup table** (`worker -> { seq_hash -> node }`) provides O(1) access for event processing: stores attach children via the parent's seq hash; removes find the node directly. Two keys for two access patterns: local hash for traversal, sequence hash for events. Both the tree and the lookup table point to the same nodes via `Rc<RefCell<RadixBlock>>` (shared ownership with interior mutability, single-threaded). The children maps at each node are small, bounded by branching factor rather than total block count. This approach remains single-threaded behind the actor, with serialized reads and writes.

---

## 5. Concurrent Radix Tree

Reads don't conflict with each other. We replace `Rc<RefCell<_>>` with `Arc<RwLock<_>>` (atomic reference counting + reader-writer lock). Now `find_matches` acquires only read locks and executes *inline on the caller's thread*: no channel, no actor, no queue.

Writes use **sticky routing**: a `ThreadPoolIndexer` deterministically assigns each `WorkerId` to one thread. Events for the same worker always land on the same thread, so there's no write-write contention on any worker's subtree.

*Figure: Concurrency Model.* Write events are sticky-routed by worker ID to a thread pool, ensuring sequential ordering. A concurrent radix tree with `Arc<RwLock>` allows `find_matches()` reads in parallel, enabling concurrent traversals.

```rust
// Same node type as in Section 4, now behind Arc + RwLock
type SharedBlock = Arc<RwLock<RadixBlock>>;

struct ConcurrentRadixTree {
    root: SharedBlock,
    lookup: DashMap<WorkerId, RwLock<FxHashMap<ExternalHash, SharedBlock>>>,
}
```

`DashMap` shards the outer map so reads and writes to different workers don't touch the same lock. `parking_lot::RwLock` avoids the OS syscall on uncontended paths (2–3x faster than `std::sync::RwLock`). `FxHashMap` replaces SipHash with a single multiply-xor step, which is safe here because keys are `u64` hashes, not user input.

`parking_lot::RwLock` is task-fair by default: it processes waiters in arrival order rather than unconditionally favoring readers or writers.
Combined with sticky routing's guarantee that each worker's writes are serialized on a single thread, write contention is minimal and neither reads nor writes are starved.

The actor is gone for reads. Multiple `find_matches` calls proceed concurrently with writes to different workers.

---

## 6. Positional Indexer with Jump Search

The radix tree traverses node-by-node, following pointers from parent to child, which is cache-hostile and fundamentally sequential. You can't check position 128 without visiting 0 through 127.

Replace the tree with a `Vec<DashMap<LocalHash, SeqEntry>>` indexed by position. `index[position]` is a concurrent map from local hash to sequence entry. Any position is O(1), with no traversal required.

```rust
enum SeqEntry {
    Single(ExternalHash, HashSet<WorkerId>),
    Multi(HashMap<ExternalHash, HashSet<WorkerId>>),
}

struct PositionalIndexer {
    // index[position] -> { local_hash -> SeqEntry }
    index: Vec<DashMap<LocalHash, SeqEntry>>,
    // worker -> that worker's blocks (reverse lookup); value type elided
    worker_blocks: DashMap<WorkerId, ...>,
    jump_size: usize,
}
```

The `SeqEntry` enum handles collisions: in the common case a `(position, local_hash)` slot has exactly one sequence hash, stored inline without a `HashMap` allocation. Only when multiple prefixes produce the same chunk hash at the same position does it upgrade to `Multi`.

The `Single`/`Multi` split also enables lazy hash computation: when a lookup finds a `Single` entry, the match is unambiguous without computing the query's sequence hash. The expensive rolling hash is only needed on the rare `Multi` entries where chunk hash collisions require disambiguation.

But the positional indexer's biggest advantage isn't the data layout; it's what **random access makes possible.**

Random access enables **jump search**:

1. Initialize the active worker set from position 0.
2. Jump ahead by `jump_size` positions (e.g., 64) to the next checkpoint.
3. At the checkpoint, count how many active workers still match (a cardinality check, with no set cloning needed).
4. If all match: the entire skipped range is confirmed. Continue jumping.
5. If fewer match: some workers drained in the skipped range.
Scan forward through positions `[previous_checkpoint + 1 .. current_checkpoint]` to find each lost worker's exact drain point. 6. Resume jumping from the current checkpoint. Positional Jump Search With position as a first-class key, the indexer jumps ahead by a fixed stride. On a partial match, a lookback from the previous checkpoint identifies exact drain points, then resumes jumping from the current checkpoint. Best case: `D / J` lookups instead of `D`. Worst case (workers drop at every jump): degrades to a linear scan with jump overhead. The positional indexer wins on long sequences with high prefix sharing; the radix tree wins on short or highly-divergent sequences. The `Vec` layout also improves cache locality: early positions (shared system prompts, common preambles) are the hot path, cluster at the front of the array, and stay warm in cache. With jump size *J* (= `jump_size`, defaulting to 64), amortized cost drops to **O(D/J + W)**. Since *J* is a tunable constant, the complexity remains linear in *D*; the practical benefit is skipping the vast majority of positions when prefix sharing is high. --- ## 7. Benchmarks All benchmarks run on a 24-core Arrow Lake (285K) desktop, replaying publicly-available [Mooncake production traces](https://github.com/kvcache-ai/Mooncake/tree/main/FAST25-release/arxiv-trace) through a mock engine with 16,384 GPU blocks and prefix caching enabled. The harness tests all five backends with 24 concurrent event-processing threads. **Ops throughput** is the combined rate of KV events and `find_matches` requests per second. We sweep offered load by compressing the same trace into shorter durations and compare achieved vs. offered throughput. The **threshold throughput** is where achieved throughput stops tracking offered—the indexer's saturation point. Indexer Performance Achieved vs. offered block throughput across five indexer backends, measured with `mooncake_bench` on real trace data. 
The Flash Indexer sustains 170M ops/s: 42x faster than the Radix Tree shipped in Dynamo v0.1.0 (4M ops/s) and 440x faster than the naive implementations (385K ops/s).

---

## 8. Future Directions

With the Flash Indexer shipping in Dynamo v1.0.0, the next round of optimizations targets the remaining constant factors:

- **Binary search within jumps.** Replace the linear scan-back after a failed jump with binary search: `O(log J)` instead of `O(J)` per failed jump.
- **Hierarchical routing.** A sparse top-level indexer for coarse-grained prefix coverage across deployment groups, with full indexers at the leaves.
- **Inline bitsets for worker sets.** Replace `HashSet` with fixed-width bitsets stored inline in each node, turning membership tests into single bit operations and eliminating pointer chases.

---

## 9. Conclusion

The journey from a Python dictionary to the Flash Indexer spans six iterations, each motivated by a concrete bottleneck in the previous design:

1. **Naive Nested Dict**: simple but O(W × D) per query.
2. **Rust + Actor Pattern**: fast language, correct concurrency, but single-threaded bottleneck.
3. **Inverted Index**: O(D + W) per query by flipping the key structure; secondary `seq_hash` layer for chunk-hash collision safety.
4. **Radix Tree**: tree structure replaces giant flat map; per-node children maps stay small; dual-key design (local hash for traversal, seq hash for event processing); `Rc<RefCell<_>>` for single-threaded shared ownership.
5. **Concurrent Radix Tree**: `Arc<RwLock<_>>` replaces `Rc<RefCell<_>>`; `DashMap` with per-worker inner `RwLock` for the lookup table (shard-level locking for rare mutations, cheap shared reads on the hot path); reads bypass the actor entirely; sticky routing serializes writes per worker with zero contention.
6. **Concurrent Positional Indexer via Jump Search (Flash Indexer)**: an alternative to the radix tree for long-sequence workloads; `Vec<DashMap<LocalHash, SeqEntry>>` indexed by position replaces pointer chasing with O(1) random access, enabling jump search that skips most of the depth; `DashMap` with per-worker inner `RwLock` for the reverse lookup; hot prefix positions cluster at the front of the `Vec` and stay warm in cache.

The result: a sustained ops throughput of **170 million operations per second** (events and requests combined), with achieved throughput tracking offered throughput all the way to the limit.

# Kubernetes Quickstart

Get a model running on Kubernetes in minutes. Dynamo's production path is Kubernetes-native: you install the platform with Helm, submit Dynamo CRDs, and let the operator reconcile inference graphs into pods, services, routing, model-loading, and scaling resources. The local and container guides remain useful for development, but Kubernetes is the canonical path for shared GPU clusters and multi-node serving.

**Deployment modes.** Dynamo supports two deployment modes on Kubernetes. This quickstart uses **standalone mode**, where the Dynamo Frontend serves requests and the integrated Dynamo Router does KV-aware routing. Dynamo can also run in **gateway mode** behind a [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) gateway, where KV-aware routing happens in the Dynamo Endpoint Picker Plugin (EPP) at the gateway layer and the Frontend runs as a sidecar in `--router-mode direct`. See the [Inference Gateway (GAIE) guide](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) to set up gateway mode.

## Prerequisites

- Kubernetes cluster (v1.24+) with GPU nodes
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) (v1.24+)
- [Helm](https://helm.sh/docs/intro/install/) (v3.0+) installed
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) installed on the cluster
- HuggingFace token secret on cluster

### HuggingFace token secret

Create a HuggingFace token secret for model downloads. If you don't have a token, see the HuggingFace [token guide](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Set HF_TOKEN to your own access token before creating the secret.
export HF_TOKEN=
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="$HF_TOKEN"
```

### GPU Operator quick install

If you don't have the GPU Operator yet:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
helm repo update nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --wait --timeout=600s
```

If your cluster already provides GPU drivers (e.g., GKE with `gpu-driver-version=latest`, or AKS), add these flags to the `helm install` command:

```bash
  --set driver.enabled=false --set toolkit.enabled=false
```

### Detailed installation

The GPU Operator is the only prerequisite for a basic deployment. For additional features like RDMA, Prometheus, or multinode scheduling with Grove/KAI Scheduler, see the [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide).

If your GPU SKU and cloud provider are supported, you can use [AICR](https://github.com/NVIDIA/aicr) for rapid installation of prerequisites and the Dynamo Helm chart.
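
The version floors in the prerequisites list (kubectl v1.24+, Helm v3.0+) can also be checked from a script. The following is a minimal sketch, not part of Dynamo: `version_ge` is a hypothetical helper name, and it assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD/macOS `sort`).

```shell
#!/usr/bin/env bash
# Sketch: compare dotted version strings against a required minimum.
# version_ge is a hypothetical helper, not something shipped with Dynamo.

# version_ge A B -> exit status 0 when version A >= version B.
# A leading "v" (as printed by kubectl and helm) is ignored.
version_ge() {
  local a="${1#v}" b="${2#v}"
  # sort -V orders version strings; A >= B when B sorts first (or they are equal).
  [ "$(printf '%s\n%s\n' "$b" "$a" | sort -V | head -n 1)" = "$b" ]
}

# Example usage (commented out so the sketch runs without a cluster):
# version_ge "$(helm version --template '{{.Version}}')" 3.0 || echo "Helm is older than v3.0"
# version_ge "$(kubectl version --client -o json | jq -r '.clientVersion.gitVersion')" 1.24 \
#   || echo "kubectl is older than v1.24"
```

The helper only compares strings, so it can be sourced into an existing setup script without touching the cluster.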

### Verify cluster is ready

Optionally, verify your cluster is ready:

```bash
./deploy/pre-deployment/pre-deployment-check.sh
```

## Install Dynamo

```bash
export NAMESPACE=dynamo-system
helm install dynamo-platform \
  oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform \
  --version "1.0.2" \
  --namespace "$NAMESPACE" \
  --create-namespace
```

Wait for the platform pods:

```bash
kubectl get pods -n $NAMESPACE
# Expected: dynamo-operator-*, etcd-*, nats-* pods all Running
```

## Understand Dynamo Deployment Resources

Before applying the first YAML, it helps to know the Kubernetes resources Dynamo uses. These are Dynamo's native control-plane objects; you describe the inference graph, and the operator owns the Kubernetes deployments, services, and component rollout around it:

| Resource or path | What it does | In this quickstart |
|---|---|---|
| `DynamoGraphDeployment` (DGD) | The canonical live deployment. It describes the Dynamo inference graph that serves traffic. | Generated by DGDR in Option A, or applied directly in Option B. |
| `DynamoComponentDeployment` (DCD) | Per-component deployments created by the operator from the DGD, such as frontend and worker components. | Created for you by the operator. |
| `DynamoGraphDeploymentRequest` (DGDR) | A generator/profiler that can produce a DGD from a model, backend, workload, hardware, and optional SLA targets. | Option A uses DGDR so Dynamo can generate the first DGD. |
| Recipes | Tuned `deploy.yaml` manifests that are already DGD specs. | Use these later when a recipe matches your model, backend, and hardware. |

```mermaid
flowchart LR
    DGDR["DGDR<br/>generator/profiler"] --> DGD["DGD<br/>live deployment"]
    DGD --> DCD["DCDs<br/>component deployments"]
    DCD --> Pods["Pods and Services"]
```

This quickstart uses DGDR because it avoids hand-writing the first DGD. After DGDR generates and applies the DGD, the DGDR reaches a terminal state, similar to a Kubernetes Job. The DGD persists and serves your model.

DGDR can also carry supported generated-deployment features such as `features.planner` for Planner configuration and `features.mocker` for mocker mode. KV-aware routing is not currently exposed as a DGDR feature field; use a direct DGD, a tuned recipe, or `overrides.dgd` when you need to set router mode or other graph-level details explicitly.

For tuned production-style manifests, start from [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes). For the full deployment model, see the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).

## Deploy Your First Model

Save this DGDR to generate and deploy a DGD for `Qwen/Qwen3-0.6B`:

```yaml
# qwen3-quickstart.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-quickstart
spec:
  model: Qwen/Qwen3-0.6B
  backend: auto
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1"  # dynamo-frontend for Dynamo < 1.1.0
```

The DGDR generates a DGD similar in shape to the following.
If you already know the backend and runtime image you want, you can apply this canonical DGD object directly instead of using DGDR:

```yaml
# qwen3-dgd.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-direct
spec:
  components:
    - name: Frontend
      type: frontend
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              envFrom:
                - secretRef:
                    name: hf-token-secret
    - name: VllmDecodeWorker
      type: worker
      replicas: 1
      podTemplate:
        spec:
          containers:
            - name: main
              image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
              command:
                - python3
                - -m
                - dynamo.vllm
              args:
                - --model
                - Qwen/Qwen3-0.6B
              envFrom:
                - secretRef:
                    name: hf-token-secret
              resources:
                limits:
                  nvidia.com/gpu: "1"
                requests:
                  ephemeral-storage: 2Gi
              workingDir: /workspace/examples/backends/vllm
```

Apply exactly one of the manifests.

Option A: generate and apply a DGD with DGDR.

```bash
kubectl apply -f qwen3-quickstart.yaml -n $NAMESPACE
```

Option B: apply the DGD directly.

```bash
kubectl apply -f qwen3-dgd.yaml -n $NAMESPACE
```

If you use DGDR, watch it progress from `Pending` to `Profiling` to `Deploying` to `Deployed`:

```bash
kubectl get dgdr qwen3-quickstart -n $NAMESPACE -w
```

In both paths, the DGD is the live serving resource:

```bash
kubectl get dynamographdeployment -n $NAMESPACE
kubectl get dynamocomponentdeployment -n $NAMESPACE
```

Dynamo supports vLLM, TensorRT-LLM, and SGLang backends. Setting `backend: auto` lets the profiler choose the best one for your model and hardware. See the [vLLM backend guide](/dynamo/backends/v-llm) for an example backend guide.
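
In automation, the interactive `-w` watch on the DGDR can be replaced with a polling loop. The sketch below is one way to do that under stated assumptions: it assumes the DGDR reports its phase at `.status.phase` (check the DGDR Reference for the actual status fields), and `is_deployed` and `wait_for_dgdr` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: poll a DGDR until it reports the Deployed phase.
# Assumption: the phase is readable at .status.phase; verify against the DGDR Reference.

# is_deployed PHASE -> exit status 0 only for the final phase named in this guide.
is_deployed() {
  [ "$1" = "Deployed" ]
}

# wait_for_dgdr NAME NAMESPACE [TRIES] [SLEEP_SECONDS]
# Returns 0 once the DGDR is Deployed, 1 if the attempts run out first.
wait_for_dgdr() {
  local name="$1" ns="$2" tries="${3:-60}" delay="${4:-10}" phase i
  for ((i = 0; i < tries; i++)); do
    phase="$(kubectl get dgdr "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)"
    echo "phase: ${phase:-unknown}"
    if is_deployed "$phase"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example usage (commented out so the sketch runs without a cluster):
# wait_for_dgdr qwen3-quickstart "$NAMESPACE" && echo "DGDR finished"
```

The loop does not distinguish a failed DGDR from a slow one, so a non-zero exit is a cue to inspect the resource with `kubectl describe`.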

## Send a Request

Once the DGD is ready, it is serving the model:

```bash
# Find and port-forward the frontend
FRONTEND_SVC=$(kubectl get svc -n $NAMESPACE -o name | grep frontend | head -1)
kubectl port-forward "$FRONTEND_SVC" 8000:8000 -n $NAMESPACE &

# Send a request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is NVIDIA Dynamo?"}],
    "max_tokens": 200
  }' | python3 -m json.tool
```

## Cleanup

```bash
kubectl delete dgdr qwen3-quickstart -n $NAMESPACE --ignore-not-found
kubectl delete dynamographdeployment qwen3-quickstart qwen3-direct \
  -n $NAMESPACE --ignore-not-found
```

## Next Steps

- **[Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)** — Cloud provider setup, GPU Operator details, optional components (Grove, RDMA, model caching, Prometheus)
- **[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide)** — DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
- **[DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference)** — Spec reference, lifecycle phases, monitoring commands, and generated DGD behavior
- **[Creating Deployments](/dynamo/additional-resources/creating-deployments)** — Hand-craft a DGD spec for full control

# Installation Guide

This guide walks you through installing everything needed to deploy models with Dynamo on Kubernetes. Follow the steps in order — each builds on the previous one.

## Prerequisites

Before you begin, make sure you have:

- A **Kubernetes cluster (v1.24+)** with GPU-capable nodes. See the cloud provider guides if you need to create one:
  - [Amazon EKS](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup) | [Azure AKS](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-setup) | [Google GKE](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup)
  - For local development: [Minikube Setup](/dynamo/kubernetes-deployment/start-here/minikube-setup)
- **kubectl** v1.24+ — [Install kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
- **Helm** v3.0+ — [Install Helm](https://helm.sh/docs/intro/install/)

**Cloud provider GPU drivers**: The GPU Operator (Step 1) installs GPU drivers for you. When creating your cluster's GPU node pools, **do not enable provider-managed GPU driver installation** (e.g., skip AKS GPU driver install, don't use GKE `--accelerator gpu-driver-version=latest`). If your nodes already have provider-managed drivers, see the GPU Operator step for how to handle this.

Verify your tools:

```bash
kubectl version --client  # Should show v1.24+
helm version              # Should show v3.0+
```

## Overview

Every Dynamo deployment requires two Helm charts: the **GPU Operator** (Step 1) and the **Dynamo Platform** (Step 2). Everything else is optional. Decide what optional components you need before starting so you can install them in Step 3.

| Optional Component | When you need it | Required for |
|-----------|-----------------|--------------|
| Grove + KAI Scheduler | Multinode or disaggregated inference | Multinode deployments (operator errors without Grove or LWS) |
| Network Operator / RDMA | Disaggregated inference in production | Acceptable KV cache transfer performance (TCP fallback has ~200-500x degradation) |
| kube-prometheus-stack | Autoscaling, metrics dashboards, or the Planner | Planner `sla` mode, KEDA/HPA autoscaling |
| Shared storage (model cache) | Large models (>70B) or many replicas | Avoiding per-pod downloads and HuggingFace rate limits |

**Grove + KAI Scheduler** — Grove is the default multinode orchestrator. The operator returns a hard error on multinode deployments if neither Grove nor [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws#installation) is available. KAI Scheduler is optional but recommended alongside Grove for GPU-aware scheduling. See [Grove](/dynamo/kubernetes-deployment/scale/grove) for details.

**Network Operator / RDMA** — Without RDMA, disaggregated inference falls back to TCP automatically, but with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). Required for any production disaggregated deployment. Setup is cloud-provider-specific — see the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) and your cloud provider guide.

**kube-prometheus-stack** — Required for the Planner's `sla` optimization mode (it reads live TTFT/ITL metrics from Prometheus). Also required for KEDA/HPA-based autoscaling. The Planner's `throughput` mode can function without it using internal queue depth signals, but metrics-driven features will not work. See [Metrics](/dynamo/kubernetes-deployment/operate/observability/metrics) for details.

**Shared storage** — Prevents each pod from downloading model weights independently.
Without it, large models (>70B) take hours to download per pod, and many replicas will hit HuggingFace rate limits. Not enforced by the operator — this is an operational concern. See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) for the full walkthrough.

## Step 1: Install the GPU Operator

The [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) automates deployment of all NVIDIA software components needed to provision GPUs — drivers, container toolkit, device plugin, and monitoring.

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

```bash
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
  # Uncomment if your nodes already have provider-managed GPU drivers installed:
  # --set driver.enabled=false
```

If your GPU nodes already have provider-managed drivers installed (e.g., you used GKE's `--accelerator gpu-driver-version=latest`), uncomment the `driver.enabled=false` line above so the operator doesn't conflict with the existing drivers. When you do, add a trailing `\` after `--create-namespace` and remove the comment line between the two flags, so the flag stays part of the `helm install` command.

Some cloud providers require additional GPU Operator configuration. See your provider guide for details:

- [AKS GPU Operator setup](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-setup) — skip AKS-managed GPU driver install on node pools
- [EKS GPU Operator setup](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup)
- [GKE GPU Operator setup](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup) — `LD_LIBRARY_PATH` and `ldconfig` init requirements

Verify the GPU Operator is running:

```bash
kubectl get pods -n gpu-operator
# Expected: gpu-operator, nvidia-driver-daemonset, nvidia-device-plugin-daemonset, etc. all Running
```

## Step 2: Install the Dynamo Platform

Set your environment variables:

```bash
export NAMESPACE=dynamo-system
export RELEASE_VERSION=1.0.2 # match a version from https://github.com/ai-dynamo/dynamo/releases
```

```bash
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz
helm install dynamo-platform dynamo-platform-$RELEASE_VERSION.tgz \
  --namespace $NAMESPACE \
  --create-namespace
  # Note: add \ to --create-namespace above when uncommenting any optional flags below
  #
  # Grove + KAI Scheduler — uncomment if using multinode or disaggregated inference.
  # Option A (install=true): Dynamo installs and manages Grove/KAI as bundled subcharts (dev/testing):
  # --set "global.grove.install=true" \
  # --set "global.kai-scheduler.install=true" \
  # Option B (enabled=true): Grove/KAI are already installed externally (production):
  # --set "global.grove.enabled=true" \
  # --set "global.kai-scheduler.enabled=true" \
  #
  # kube-prometheus-stack — uncomment if Prometheus is installed (required for Planner sla mode and autoscaling):
  # --set "dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
```

All `helm install` commands can be customized with your own values file: `helm install ... -f your-values.yaml`.

**Shared/Multi-Tenant Clusters**: If a cluster-wide Dynamo operator is already running, do **not** install another one. Check with:

```bash
kubectl get clusterrolebinding -o json | \
  jq -r '.items[] | select(.metadata.name | contains("dynamo-operator-manager")) | "Cluster-wide operator found in namespace: \(.subjects[0].namespace)"'
```

**Namespace-restricted mode** (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. Use the default cluster-wide mode for all new deployments.

Verify the Dynamo platform is running:

```bash
# Check CRDs
kubectl get crd | grep dynamo
# Expected: dynamographdeployments, dynamocomponentdeployments, dynamographdeploymentrequests, etc.

# Check operator and platform pods
kubectl get pods -n $NAMESPACE
# Expected: dynamo-operator-*, etcd-*, nats-* pods all Running
```

## Step 3: Install Optional Components

The Dynamo install command above includes commented flags for each optional component. Install the component first, then uncomment the corresponding flag before running `helm install` in Step 2 (or run `helm upgrade --reuse-values` with the flag if you've already installed Dynamo).

### Multinode

Multinode deployments require either Grove + KAI Scheduler or an alternative orchestrator setup (LeaderWorkerSet + Volcano) to enable gang scheduling for workloads that span multiple nodes. See the [Multinode Deployment Guide](/dynamo/kubernetes-deployment/scale/multinode-deployments) for details on orchestrator selection and configuration.

#### Grove + KAI Scheduler

There are two ways to enable Grove and KAI Scheduler, controlled by which flags you uncomment in the Dynamo install command:

- **`install=true`** — Dynamo installs and manages Grove/KAI as bundled subcharts. Simplest path; recommended for dev/testing.
- **`enabled=true`** — Tells Dynamo that Grove/KAI are already installed and externally managed. Use this when you install Grove/KAI separately (e.g., to manage their lifecycle independently or share them across namespaces). Recommended for production.

For the `enabled=true` path, install Grove and KAI Scheduler separately first. See the [Grove installation guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) and [KAI Scheduler deployment guide](https://github.com/NVIDIA/KAI-Scheduler) for instructions.
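
If Dynamo is already installed, the two flag pairs above can be applied with the `helm upgrade --reuse-values` route mentioned in Step 3. The following is a small sketch of that idea, assuming the release name and chart archive from Step 2; `grove_kai_flags` is a hypothetical helper, not part of Dynamo or Helm.

```shell
#!/usr/bin/env bash
# Sketch: build the Grove/KAI flag pair for the chosen mode.
# grove_kai_flags is a hypothetical helper, not part of Dynamo or Helm.

# grove_kai_flags MODE -> prints the two --set flags for MODE (install or enabled).
grove_kai_flags() {
  case "$1" in
    install|enabled)
      printf -- '--set global.grove.%s=true --set global.kai-scheduler.%s=true\n' "$1" "$1"
      ;;
    *)
      echo "usage: grove_kai_flags install|enabled" >&2
      return 2
      ;;
  esac
}

# Example usage (commented out so the sketch runs without a cluster).
# The unquoted $(...) is intentional so the flags split into separate arguments.
# helm upgrade dynamo-platform "dynamo-platform-$RELEASE_VERSION.tgz" \
#   --namespace "$NAMESPACE" --reuse-values $(grove_kai_flags enabled)
```

Keeping the mode in one argument makes it harder to mix `install=true` for one component with `enabled=true` for the other by accident.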

**Compatibility matrix:**

| dynamo-platform | kai-scheduler | Grove |
|-----------------|---------------|-------|
| 1.0.x | >= v0.13.0 | >= v0.1.0-alpha.6 |
| 1.1.x | >= v0.13.4 | >= v0.1.0-alpha.8 |

#### LWS + Volcano

If you are not using Grove for multinode, you can use [LeaderWorkerSet (LWS)](https://lws.sigs.k8s.io/docs/installation/) (>= v0.7.0) with [Volcano](https://github.com/volcano-sh/volcano#quick-start-guide) for gang scheduling. Both must be installed before deploying multinode workloads.

1. Install Volcano:

   ```bash
   helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
   helm repo update
   helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
   ```

2. Install LWS (>= v0.7.0) with Volcano gang scheduling enabled:

   ```bash
   export LWS_VERSION=0.8.0
   helm install lws oci://registry.k8s.io/lws/charts/lws \
     --version=$LWS_VERSION \
     --namespace lws-system \
     --create-namespace \
     --set gangSchedulingManagement.schedulerProvider=volcano \
     --wait --timeout 300s
   ```

See the [LWS docs](https://lws.sigs.k8s.io/docs/) and [Volcano docs](https://github.com/volcano-sh/volcano#quick-start-guide) for configuration options, and the [Multinode Deployment Guide](/dynamo/kubernetes-deployment/scale/multinode-deployments) for orchestrator selection.

### Network Operator / RDMA

RDMA setup is cloud-provider-specific. See the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for transport options, UCX configuration, and performance expectations, and your cloud provider guide for setup instructions:

- [AKS — InfiniBand + Network Operator](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band)
- [EKS — EFA device plugin](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup) (also see the [EFA configuration guide](/dynamo/kubernetes-deployment/operate/disagg-communication#aws-efa-configuration))
- [GKE — GPUDirect-TCPXO](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup)

### kube-prometheus-stack

Install Prometheus before running the Dynamo install command so you can set the endpoint in one pass:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set-json 'prometheus.prometheusSpec.podMonitorNamespaceSelector={}' \
  --set-json 'prometheus.prometheusSpec.probeNamespaceSelector={}'
```

Then uncomment the `prometheusEndpoint` line in the Dynamo install command. The Dynamo operator automatically creates PodMonitors for its components.

See [Metrics](/dynamo/kubernetes-deployment/operate/observability/metrics) for dashboard setup and available metrics, and [Logging](/dynamo/kubernetes-deployment/operate/observability/logging) for the Grafana Loki + Alloy logging stack.

### Shared Storage for Model Caching

Set up a `ReadWriteMany` PVC so all pods share downloaded model weights instead of each downloading independently. No Dynamo chart flags are needed — storage is configured in your deployment spec.
Setup is cloud-provider-specific:

- [AKS — Azure Files / Managed Lustre](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-storage)
- [EKS — EFS](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/efs)
- GKE — Cloud Filestore (see [GKE guide](/dynamo/kubernetes-deployment/cloud-provider-guides/gcp/gke-setup))

For large clusters with frequent model updates, consider [ModelExpress](/dynamo/kubernetes-deployment/model-loading/model-caching#option-2-modelexpress-p2p-distribution) for P2P model distribution and ModelStreamer for direct streaming from object storage.

See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) for the full walkthrough including the download Job, mount configuration, and ModelExpress setup.

## Step 4: Pre-Deployment Check

Run the pre-deployment check script to validate your cluster is ready for deployments:

```bash
./deploy/pre-deployment/pre-deployment-check.sh
```

This checks kubectl connectivity, default StorageClass configuration, GPU node availability, and GPU Operator status. See [Pre-Deployment Checks](https://github.com/ai-dynamo/dynamo/tree/main/deploy/pre-deployment/README.md) for details.

## Next Steps

Your cluster is ready. Follow the **[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide)** to choose between applying a tuned DGD recipe, creating a DGD directly, or using DGDR to generate one.

## Troubleshooting

**"VALIDATION ERROR: Cannot install cluster-wide Dynamo operator"**

```
VALIDATION ERROR: Cannot install cluster-wide Dynamo operator.
Found existing namespace-restricted Dynamo operators in namespaces: ...
```

Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.

Solution: Migrate the existing namespace-restricted operators to cluster-wide mode. Namespace-restricted mode is deprecated.
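
Before migrating, it can help to see which namespaces still run a namespace-restricted operator. The sketch below reuses the lease marker name `dynamo-operator-namespace-scope` and the `kubectl get lease` query documented on the Dynamo Operator page; the wrapper function name `list_scoped_operator_namespaces` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print one namespace per line for every namespace-restricted operator lease.
# list_scoped_operator_namespaces is a hypothetical helper name.
list_scoped_operator_namespaces() {
  kubectl get lease -A \
    --field-selector metadata.name=dynamo-operator-namespace-scope \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}'
}

# Example usage (commented out so the sketch runs without a cluster):
# list_scoped_operator_namespaces | while read -r ns; do
#   echo "namespace-restricted operator still present in: $ns"
# done
```

An empty result suggests no lease markers remain, though a lease can linger briefly after its operator stops.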

**CRDs already exist**

Cause: Installing CRDs on a cluster where they're already present (common on shared clusters).

Solution: CRDs are installed automatically by the Helm chart. If you encounter conflicts, check existing CRDs with `kubectl get crd | grep dynamo`.

**Pods not starting?**

```bash
# Append the name of the failing pod to each command.
kubectl describe pod -n $NAMESPACE
kubectl logs -n $NAMESPACE
```

**Bitnami etcd "unrecognized" image?**

```bash
ERROR: Original containers have been substituted for unrecognized ones.
```

Add to the helm install command:

```bash
--set "etcd.image.repository=bitnamilegacy/etcd" --set "etcd.global.security.allowInsecureImages=true"
```

**Clean uninstall?**

```bash
# Uninstall the platform
helm uninstall dynamo-platform --namespace $NAMESPACE

# List Dynamo CRDs
kubectl get crd | grep "dynamo.*nvidia.com"

# Delete each CRD (append the CRD name from the list above)
kubectl delete crd
```

## Advanced: Build from Source

If you need to contribute to Dynamo or use the latest unreleased features from the main branch:

```bash
# 1. Set registry environment
# No trailing slash: the commands below append /kubernetes-operator themselves.
export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo  # or your registry
export DOCKER_USERNAME='$oauthtoken'
export DOCKER_PASSWORD=  # set to your registry password or API key
export IMAGE_TAG=$RELEASE_VERSION

# 2. Build and push operator image
cd deploy/operator
docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG . && docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG
cd -

# 3. Create namespace and image pull secret (only if using a private registry)
kubectl create namespace $NAMESPACE
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=$DOCKER_SERVER \
  --docker-username=$DOCKER_USERNAME \
  --docker-password=$DOCKER_PASSWORD \
  --namespace=$NAMESPACE

# 4. Install from local chart
cd deploy/helm/charts
helm dep build ./platform/
helm install dynamo-platform ./platform/ \
  --namespace "$NAMESPACE" \
  --set "dynamo-operator.controllerManager.manager.image.repository=$DOCKER_SERVER/kubernetes-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=$IMAGE_TAG" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
```

## Reference

- [Helm Chart Configuration](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/platform/README.md)
- [Create Custom Deployments](/dynamo/additional-resources/creating-deployments)
- [Dynamo Operator Details](/dynamo/kubernetes-deployment/start-here/dynamo-operator)
- [ModelExpress Server](https://github.com/ai-dynamo/modelexpress)

# Dynamo Operator

## Overview

The Dynamo operator is a Kubernetes operator that simplifies the deployment, configuration, and lifecycle management of DynamoGraphs. It automates the reconciliation of custom resources to ensure your desired state is always achieved. This operator is ideal for users who want to manage complex deployments using declarative YAML definitions and Kubernetes-native tooling.

## Architecture

- **Operator Deployment:** Deployed as a Kubernetes `Deployment` in a specific namespace.
- **Controllers:**
  - `DynamoGraphDeploymentController`: Watches `DynamoGraphDeployment` CRs and orchestrates graph deployments.
  - `DynamoComponentDeploymentController`: Watches `DynamoComponentDeployment` CRs and handles individual component deployments.
  - `DynamoGraphDeploymentRequestController`: Watches `DynamoGraphDeploymentRequest` CRs and runs the profiling/generation flow that produces a `DynamoGraphDeployment`.
  - `DynamoGraphDeploymentScalingAdapterController`: Watches scaling adapter CRs used by external autoscalers and Planner-driven scaling flows.
  - `DynamoModelController`: Watches `DynamoModel` CRs and manages model lifecycle (e.g., loading LoRA adapters).
  - `DynamoCheckpointController`: Watches `DynamoCheckpoint` CRs for GPU worker checkpoint/restore workflows.
- **Workflow:**
  1. A custom resource is created by the user or API server.
  2. The corresponding controller detects the change and runs reconciliation.
  3. Kubernetes resources (Deployments, Services, etc.) are created or updated to match the CR spec.
  4. Status fields are updated to reflect the current state.

## Deployment Modes

The Dynamo operator supports three deployment modes to accommodate different cluster environments and use cases:

### 1. Cluster-Wide Mode (Default, Recommended)

The operator monitors and manages DynamoGraph resources across **all namespaces** in the cluster.

**When to Use:**

- You have full cluster admin access
- You want centralized management of all Dynamo workloads
- Standard production deployment on a dedicated cluster

---

### 2. Namespace-Scoped Mode (DEPRECATED)

> **DEPRECATED:** Namespace-scoped mode (`namespaceRestriction.enabled=true`) is deprecated and will be removed in a future release. Use cluster-wide mode instead. Do not use this for new deployments.

The operator monitors and manages DynamoGraph resources **only in a specific namespace**. A lease marker is created to signal the operator's presence to any cluster-wide operators.

**When to Use:**

- You're on a shared/multi-tenant cluster
- You only have namespace-level permissions
- You want to test a new operator version in isolation
- You need to avoid conflicts with other operators

**Installation:**

```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace my-namespace \
  --create-namespace \
  --set dynamo-operator.namespaceRestriction.enabled=true
```

---

### 3. Hybrid Mode (DEPRECATED)

> **DEPRECATED:** Hybrid mode relies on namespace-scoped operators, which are deprecated and will be removed in a future release. Use a single cluster-wide operator instead.

A **cluster-wide operator** manages most namespaces, while **one or more namespace-scoped operators** run in specific namespaces (e.g., for testing new versions). The cluster-wide operator automatically detects and excludes namespaces with namespace-scoped operators using lease markers.

**When to Use:**

- Running production workloads with a stable operator version
- Testing new operator versions in isolated namespaces without affecting production
- Gradual rollout of operator updates
- Development/staging environments on production clusters

**How It Works:**

1. Namespace-scoped operator creates a lease named `dynamo-operator-namespace-scope` in its namespace
2. Cluster-wide operator watches for these lease markers across all namespaces
3. Cluster-wide operator automatically excludes any namespace with a lease marker
4. If namespace-scoped operator stops, its lease expires (TTL: 30s by default)
5. Cluster-wide operator automatically resumes managing that namespace

**Setup Example:**

```bash
# 1. Install cluster-wide operator (production, v1.0.0)
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace dynamo-system \
  --create-namespace

# 2. Install namespace-scoped operator (testing, v2.0.0-beta)
helm install dynamo-test dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace test-namespace \
  --create-namespace \
  --set dynamo-operator.namespaceRestriction.enabled=true \
  --set dynamo-operator.controllerManager.manager.image.tag=v2.0.0-beta
```

**Observability:**

```bash
# List all namespaces with local operators
kubectl get lease -A --field-selector metadata.name=dynamo-operator-namespace-scope

# Check which operator version is running in a namespace
kubectl get lease -n my-namespace dynamo-operator-namespace-scope \
  -o jsonpath='{.spec.holderIdentity}'
```

## Custom Resource Definitions (CRDs)

Dynamo installs the following Custom Resources. The main deployment path is: create or generate a `DynamoGraphDeployment`, then let the operator create the lower-level resources that run it.

| Custom Resource | What it represents | Typical use |
|---|---|---|
| `DynamoGraphDeployment` (DGD) | The canonical live deployment for a Dynamo inference graph. | Author directly, apply a tuned recipe, or let DGDR generate it. |
| `DynamoGraphDeploymentRequest` (DGDR) | A deploy-by-intent request that profiles a model/hardware target and generates a DGD. | Start here when you want Dynamo to choose sizing, parallelism, or Planner-enabled generated config. |
| `DynamoComponentDeployment` (DCD) | Per-component deployments created from a DGD, such as frontend, router, prefill, decode, and planner components. | Usually inspected for debugging rather than authored directly. |
| `DynamoModel` | Model and adapter lifecycle management layered onto a running deployment. | Load, unload, or manage model artifacts such as LoRA adapters. |
| `DynamoCheckpoint` | Checkpoint metadata and job configuration for snapshotting GPU workers. | Use with Snapshotting GPU Workers to restore warm workers faster than cold start. |

Advanced and operator-owned resources:

- `DynamoGraphDeploymentScalingAdapter`: scaling interface used by Planner or external autoscalers to adjust component replicas.
- `DynamoWorkerMetadata`: discovery metadata written for worker pods.
For the complete technical API reference for Dynamo Custom Resource Definitions, see: **📖 [Dynamo CRD API Reference](/dynamo/additional-resources/api-reference-k-8-s)** For user-focused workflows, see: - **[Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide)** for DGD, DCD, DGDR, and recipes - **[DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference)** for deploy-by-intent generated deployments - **[Managing Models with DynamoModel Guide](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model)** - **[Snapshotting GPU Workers](/dynamo/kubernetes-deployment/advanced-platform/snapshot)** for `DynamoCheckpoint` ## Webhooks The Dynamo Operator uses **Kubernetes admission webhooks** for real-time validation and mutation of custom resources before they are persisted to the cluster. Webhooks are a required component of the operator and ensure that invalid configurations are rejected immediately at the API server level. **Key Features:** - ✅ Shared certificate infrastructure across all webhook types - ✅ Automatic certificate generation and rotation (default, all environments) - ✅ cert-manager integration (optional, for custom PKI) - ✅ Immutability enforcement for critical fields For complete documentation on webhooks, certificate management, and troubleshooting, see: **📖 [Webhooks Guide](/dynamo/kubernetes-deployment/advanced-platform/webhooks)** ## Observability The Dynamo Operator provides comprehensive observability through Prometheus metrics and Grafana dashboards. 
This allows you to monitor: - **Controller Performance**: Reconciliation loop duration, success rates, and error rates by resource type - **Webhook Activity**: Validation performance, admission rates, and denial patterns - **Resource Inventory**: Current count of managed resources by state and namespace - **Operational Health**: Success rates and health indicators for controllers and webhooks ### Metrics Collection Metrics are automatically exposed on the operator's `/metrics` endpoint (port 8443 by default) and collected by Prometheus via a ServiceMonitor. The ServiceMonitor is automatically created when you install the operator via Helm (controlled by `metricsService.enabled`, which defaults to `true`). ### Grafana Dashboard A pre-built Grafana dashboard is available for visualizing operator metrics. The dashboard includes: - **Reconciliation Metrics**: Rate, duration (P95), and errors by resource type - **Webhook Metrics**: Request rate, duration (P95), and denials by resource type and operation - **Resource Inventory**: Count of DynamoGraphDeployments by state and namespace - **Operational Health**: Success rate gauges for controllers and webhooks For complete setup instructions and metrics reference, see: **📖 [Operator Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/operator-metrics)** ## Installation ### Quick Install with Helm ```bash # Set environment export NAMESPACE=dynamo-system export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases # Install Platform (includes operator) helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace ``` > **Note:** Namespace-scoped and hybrid deployment modes are deprecated. Use cluster-wide mode for all new deployments. 
See [Deployment Modes](#deployment-modes) above if you need backward-compatible configurations.

### Building from Source

```bash
# Set environment
export NAMESPACE=dynamo-system
export DOCKER_SERVER=your-registry.com # your container registry (no trailing slash)
export IMAGE_TAG=latest

# Build operator image
cd deploy/operator
docker build -t $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG \
  --build-context snapshot=../snapshot \
  --build-arg DOCKER_PROXY="" \
  .
docker push $DOCKER_SERVER/kubernetes-operator:$IMAGE_TAG
cd -

# Install platform with custom operator image (CRDs are automatically installed by the chart)
cd deploy/helm/charts
helm install dynamo-platform ./platform/ \
  --namespace ${NAMESPACE} \
  --create-namespace \
  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/kubernetes-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
```

For detailed installation options, see the [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide).

## Development

- **Code Structure:** The operator is built using Kubebuilder and the operator-sdk, with the following structure:
  - `controllers/`: Reconciliation logic
  - `api/v1alpha1/`: CRD types
  - `config/`: Manifests and Helm charts

## References

- [Kubernetes Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [Custom Resource Definitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
- [Operator SDK](https://sdk.operatorframework.io/)
- [Helm Best Practices for CRDs](https://helm.sh/docs/chart_best_practices/custom_resource_definitions/)

# Minikube Setup

Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run Dynamo Kubernetes Platform locally.

## 1. Install Minikube

First things first!
Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system. ## 2. Configure GPU Support (Optional) Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding. Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads! ## 3. Start Minikube Time to launch your local cluster! ```bash # Start Minikube with GPU support (if configured) minikube start --driver docker --container-runtime docker --gpus all --memory=16000mb --cpus=8 # Enable required addons minikube addons enable istio-provisioner minikube addons enable istio minikube addons enable storage-provisioner-rancher ``` ## 4. Verify Installation Let's make sure everything is working correctly! ```bash # Check Minikube status minikube status # Verify Istio installation kubectl get pods -n istio-system # Verify storage class kubectl get storageclass ``` ## Next Steps Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform installation guide](/dynamo/kubernetes-deployment/start-here/installation-guide) to deploy the platform to your local cluster. # Deployment Overview Dynamo's canonical Kubernetes deployment is a [`DynamoGraphDeployment`](/dynamo/additional-resources/api-reference-k-8-s#dynamographdeployment) (DGD). A DGD describes the inference graph you want to run. The Dynamo operator reconciles that graph into one or more [`DynamoComponentDeployment`](/dynamo/additional-resources/api-reference-k-8-s#dynamocomponentdeployment) (DCD) resources, which run the frontend, router, prefill workers, decode workers, and other graph components. 
This is the Kubernetes-native control path for Dynamo: you author or generate Dynamo resources, and the operator translates them into Kubernetes workloads, services, routing metadata, model-loading resources, and status conditions. For local development or incremental adoption, you can still run the same frontend, router, and worker components outside Kubernetes.

You can create a DGD directly from a known-good manifest, or you can use a [`DynamoGraphDeploymentRequest`](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) (DGDR) to profile your model and generate a DGD for you.

Most users only need three ideas before they deploy:

- **Recipes are the fastest path** when one matches your model, backend, hardware, and serving pattern. They are already DGD manifests.
- **DGDR is the guided path** when you want Dynamo to profile and generate a DGD from model/SLA intent.
- **DGD is the object that serves traffic**. DGDR can create it, but the DGD is what persists after profiling completes.

You do not need to author DCDs directly for normal deployments.

## Start Here: Resource Model

```mermaid
flowchart LR
    DGDR["DynamoGraphDeploymentRequest (DGDR)<br/>optional generator and profiler"]
    Recipes["recipes/model/.../deploy.yaml<br/>pre-tuned DGD manifests"]
    DGD["DynamoGraphDeployment (DGD)<br/>canonical live deployment"]
    DCD["DynamoComponentDeployments (DCDs)<br/>per-component deployments"]
    Pods["Pods and Services<br/>frontend, router, workers"]

    DGDR -->|"profiles + generates"| DGD
    Recipes -->|"kubectl apply"| DGD
    DGD -->|"operator reconciles"| DCD
    DCD --> Pods
```

| Resource or path | What it is | Use it when | Learn more |
|---|---|---|---|
| `DynamoGraphDeployment` (DGD) | The canonical live deployment for a Dynamo inference graph. | You have a known-good configuration or tuned YAML. | [Creating Deployments](/dynamo/additional-resources/creating-deployments), [DGD API](/dynamo/additional-resources/api-reference-k-8-s#dynamographdeployment) |
| `DynamoComponentDeployment` (DCD) | The per-component deployment objects created from a DGD. | Usually not authored directly; inspect them to debug frontend/router/worker rollout. | [DCD API](/dynamo/additional-resources/api-reference-k-8-s#dynamocomponentdeployment) |
| `DynamoGraphDeploymentRequest` (DGDR) | A deploy-by-intent request that profiles your model/hardware and generates a DGD. | You want Dynamo to size the deployment, choose parallelism, configure supported generated-deployment features such as Planner, or produce DGD YAML. | [DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) |
| Recipes | Curated `deploy.yaml` manifests that are already DGD specs. | A recipe matches your model, backend, hardware, and serving mode. | [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) |
| `DynamoModel` | Model and adapter lifecycle management layered onto an existing DGD or DCD. | You need declarative model operations such as LoRA adapter loading. | [Managing Models with DynamoModel](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model) |

## Choose Your Path

Start with the row that matches your situation. The sections later in this page are reference material; you can read them as needed instead of going linearly.
| Situation | Do this first | Then read | |---|---|---| | A recipe matches your model/backend/hardware | Apply the recipe's model cache resources, then apply its `deploy.yaml`. | [Deploy a Tuned DGD from Recipes](#deploy-a-tuned-dgd-from-recipes) | | You want Dynamo to generate the deployment | Create a DGDR. Use `autoApply: true` to let the operator create the DGD, or `autoApply: false` to inspect the generated DGD YAML first. | [Use DGDR to Generate a DGD](#use-dgdr-to-generate-a-dgd) | | You already know the exact topology | Author or edit a DGD directly, then apply it with `kubectl`. | [Creating Deployments](/dynamo/additional-resources/creating-deployments) | | You are preparing for production | Add model caching, choose backend/search strategy, and validate networking/planner needs. | [Production Details](#production-details) | ## Deploy a Tuned DGD from Recipes If a [recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes) matches your target model, backend, GPU type, and serving mode, start there. Recipes are curated `DynamoGraphDeployment` manifests with model-cache setup and, for many recipes, benchmark jobs. The common recipe flow is: ```bash cd recipes # Update the recipe storageClassName first, then create model cache resources. kubectl apply -f /model-cache/ -n ${NAMESPACE} kubectl wait --for=condition=Complete job/model-download \ -n ${NAMESPACE} --timeout=6000s # Deploy a tuned DGD. kubectl apply -f ///deploy.yaml -n ${NAMESPACE} ``` Follow the README in the specific recipe directory for model-specific images, GPU requirements, cache setup, and request examples. ## Use DGDR to Generate a DGD A DGDR is Dynamo's deploy-by-intent path. Instead of hand-crafting a deployment spec with parallelism settings, replica counts, and resource limits, you describe what you want to run (model, backend, workload, SLA targets) and DGDR generates a DGD: 1. **Spec** — You submit a DGDR with your model, workload expectations, and optional SLA targets. 2. 
**Hardware Discovery** — The operator discovers your cluster's GPU hardware (SKU, VRAM, count per node) via DCGM or node labels. 3. **Profiling** — The profiler analyzes your model against the discovered hardware, using either rapid simulation or thorough real-GPU benchmarking. 4. **DGD Generation** — The profiler produces an optimized `DynamoGraphDeployment` (DGD) spec with the best parallelization strategy, replica counts, and resource configuration. 5. **Review** (when `autoApply: false`) — The generated DGD is stored in `.status.profilingResults.selectedConfig` for you to inspect and optionally modify before deploying. 6. **Deploy** — With `autoApply: true`, the operator creates the DGD. With `autoApply: false`, you apply the generated DGD yourself. 7. **Planner** (optional) — If enabled, the Planner monitors live traffic and adjusts replica counts at runtime to meet your SLA targets. DGDR currently supports generated-deployment feature configuration for Planner (`features.planner`) and mocker mode (`features.mocker`). The DGDR API does not currently expose `features.kvRouter`; configure explicit router mode in a DGD, a tuned recipe, or a generated DGD override when you need KV-aware routing details. ```text ┌──────┐ ┌───────────┐ ┌──────────┐ ┌─────────────┐ ┌────────┐ ┌─────────┐ │ Spec │───▶│ Hardware │───▶│ Profiler │───▶│ Generated │───▶│ Deploy │───▶│ Planner │ │ │ │ Discovery │ │ │ │ DGD │ │ │ │ (opt.) │ └──────┘ └───────────┘ └──────────┘ └─────────────┘ └────────┘ └─────────┘ │ autoApply: false? ▼ Review ``` For the DGDR spec reference, field descriptions, and lifecycle phases, see the [DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference). ## DGDR Detail: Choose a Search Strategy The `searchStrategy` field controls how the profiler explores configurations. Your choice depends on how much time you can invest and how close to optimal you need. 
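
As a hedged illustration of the `searchStrategy` field, the sketch below assembles a DGDR manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. The `apiVersion`, `kind`, and spec field names are copied from the examples on this page. The `build_dgdr` helper and its argument names are hypothetical, and its checks only mirror the constraints documented for the rapid and thorough strategies; they are not the operator's actual validation code.

```python
import json

# Values documented on this page; the helper itself is a hypothetical sketch.
VALID_STRATEGIES = {"rapid", "thorough"}
CONCRETE_BACKENDS = {"vllm", "sglang", "trtllm"}


def build_dgdr(name, model, search_strategy="rapid", backend="auto", auto_apply=None):
    """Return a DynamoGraphDeploymentRequest manifest as a plain dict."""
    if search_strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown searchStrategy: {search_strategy!r}")
    if backend != "auto" and backend not in CONCRETE_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # Documented constraint: thorough profiling needs a concrete backend.
    if search_strategy == "thorough" and backend == "auto":
        raise ValueError("searchStrategy 'thorough' requires vllm, sglang, or trtllm")

    spec = {"model": model, "searchStrategy": search_strategy}
    if backend != "auto":
        spec["backend"] = backend
    if auto_apply is not None:
        spec["autoApply"] = auto_apply

    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # Pipe the output into: kubectl apply -n "$NAMESPACE" -f -
    print(json.dumps(build_dgdr("qwen-small", "Qwen/Qwen3-0.6B"), indent=2))
```

Catching the `thorough` plus `auto` combination before submitting saves a round trip, since the page states that such a DGDR is rejected.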
### Rapid (Default) ```yaml searchStrategy: rapid ``` Uses AIC-backed DynoSim-style performance modeling to search deployment configurations without running real inference. Completes in ~30 seconds with no GPU resources consumed during profiling. **Use rapid when:** - Getting started or iterating quickly - Running in CI/CD pipelines - Your GPU SKU is in the [AIC support matrix](#aic-support-matrix) **Limitations:** - If AIC does not support your model/hardware/backend combination, the profiler falls back to a naive memory-fit config (basic TP calculation) which may not be optimal. - Simulated results may differ from real-hardware performance for unusual configurations. ### Thorough ```yaml searchStrategy: thorough backend: vllm # must specify a concrete backend ``` Enumerates candidate parallelization configs, deploys each on real GPUs, and benchmarks with AIPerf. Takes 2–4 hours. **Use thorough when:** - Tuning for production and you need the most optimal configuration - Your hardware is not supported by AIC (e.g., PCIe GPUs) - You want measured rather than simulated performance data **Constraints:** - **Disaggregated mode only** — thorough does not run aggregated configurations. - **`backend: auto` is not supported** — you must specify `vllm`, `sglang`, or `trtllm`. The DGDR will be rejected if you use `auto` with `thorough`. - **Requires GPU resources** — the profiler deploys real inference engines on your cluster during profiling. ## DGDR Detail: AIC Support Matrix The rapid strategy relies on AIC performance models. AIC currently supports: ### GPU SKUs | Supported (rapid) | Not Yet Supported (use thorough) | |---|---| | H100 SXM | V100 (SXM/PCIe) | | H100 PCIe | T4 | | H200 SXM | MI200, MI300 | | A100 SXM | | | A100 PCIe | | | A30 | | | B200 SXM | | | GB200 SXM | | | L40S | | | L4 | | Some rapid-mode SKUs use AIC estimate-only data until measured profiles are available. 
Use `searchStrategy: thorough` when you need hardware-measured profiling for an estimate-only or unsupported SKU. When specifying GPU SKUs manually, use lowercase underscore format (e.g., `h100_sxm`, not `H100-SXM5-80GB`). See the [DGDR Reference — SKU Format](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference#sku-format) for the full list. ### Backends All three backends are supported for both rapid and thorough: | Backend | Dense Models | MoE Models | |---------|-------------|------------| | vLLM | ✅ | 🚧 Work in progress | | SGLang | ✅ | ✅ | | TensorRT-LLM | ✅ | 🚧 Work in progress | **If you are deploying a Mixture-of-Experts (MoE) model** (e.g., DeepSeek-R1, Qwen3-MoE), use **SGLang** as the backend for full support. vLLM and TRT-LLM have partial MoE support that is still under development. ### Parallelization Strategies The profiler selects different parallelization strategies depending on the model architecture: | Model Architecture | Prefill | Decode | |---|---|---| | MLA+MoE (DeepSeek-V3, DeepSeek-R1) | TEP, DEP | TEP, DEP | | GQA+MoE (Qwen3-MoE) | TP, TEP, DEP | TP, TEP, DEP | | Dense models (Llama, Qwen, etc.) | TP | TP | ## Production Details After the basic deployment path is clear, use this checklist to decide which production topics apply: | Concern | Why it matters | Section | |---|---|---| | Model startup is slow or the model is gated | Avoid repeated downloads and pass `HF_TOKEN` cleanly. | [Model Caching](#production-detail-model-caching) | | Traffic changes over time | Planner can scale prefill/decode replicas at runtime. | [Planner](#production-detail-planner) | | The model spans nodes or uses disaggregated serving | Grove/LWS and RDMA affect scheduling and KV transfer. | [Multinode and RDMA](#production-detail-multinode-and-rdma) | | You need a specific inference engine | Backend choice affects MoE support, thorough profiling, and distributed behavior. 
| [Backend Selection](#production-detail-backend-selection) |

## Production Detail: Model Caching

**Set up model caching before deploying if any of these apply:**

- Your model is large (>70B parameters): downloading hundreds of GB per pod takes hours
- You are scaling to many replicas: each pod downloads the full model independently, and Hugging Face will rate-limit concurrent downloads
- You want fast pod startup on scaling events

### How It Works with DGDR

Add a `modelCache` section to your DGDR spec that points to a pre-populated PVC:

```yaml
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  modelCache:
    pvcName: model-cache
    pvcMountPath: /home/dynamo/.cache/huggingface
    # Append the snapshot revision to this path.
    pvcModelPath: hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/
```

The operator mounts this PVC at `pvcMountPath` read-only into the profiling job and passes it through to the generated DGD, so both profiling and serving use the cached weights.

`pvcModelPath` must be the Hugging Face snapshot path inside the PVC, in the form `hub/models--<org>--<model>/snapshots/<revision>`. This follows the layout that `huggingface-cli download` creates when `HF_HOME` is set to the mount point. Derive `<org>--<model>` by substituting `/` with `--` in the model ID, and replace `<revision>` with the actual snapshot revision. See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching#find-the-snapshot-path) for how to look up the hash after downloading.

### Setup

1. Create a `ReadWriteMany` PVC. See the [Installation Guide: Shared Storage](/dynamo/kubernetes-deployment/start-here/installation-guide#shared-storage-for-model-caching) for provider-specific options (EFS, Azure Lustre, GKE Filestore).
2. Run a one-time download Job to populate the PVC.
3. Reference the PVC in your DGDR's `modelCache` field.

See [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) for the full walkthrough with YAML examples.
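
The `pvcModelPath` layout described above is easy to get wrong by hand, so here is a minimal sketch of the substitution. The function name `hf_snapshot_path` is hypothetical, and the revision passed in the usage example (`abc123`) is a made-up stand-in for the real snapshot hash you look up after downloading.

```python
def hf_snapshot_path(model_id: str, revision: str) -> str:
    """Build a pvcModelPath value relative to the PVC mount point.

    Follows the hub cache layout described above: '/' in the model ID
    becomes '--', and the snapshot revision is appended at the end.
    """
    if not model_id or not revision:
        raise ValueError("both model_id and revision are required")
    return f"hub/models--{model_id.replace('/', '--')}/snapshots/{revision}"


if __name__ == "__main__":
    # 'abc123' is a placeholder, not a real snapshot revision.
    print(hf_snapshot_path("meta-llama/Llama-3.1-70B-Instruct", "abc123"))
```

The returned string is what you would paste into `modelCache.pvcModelPath`, with the real revision in place of the placeholder.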
### Private and Gated Models For models that require authentication (e.g., gated HuggingFace models), create a Kubernetes Secret named `hf-token-secret` with a `HF_TOKEN` key: ```bash kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN= \ -n $NAMESPACE ``` The profiler and deployed pods will automatically use this token. ## Production Detail: Planner The Planner provides **runtime autoscaling** for disaggregated deployments. It adjusts prefill and decode replica counts to meet your SLA targets as traffic fluctuates. ```yaml spec: features: planner: enabled: true sla: ttft: 500 # Target time to first token (ms) itl: 50 # Target inter-token latency (ms) ``` ### Planner Scaling Modes | Mode | Description | Prometheus Required? | |---|---|---| | `throughput` (default) | Static queue-depth and KV-cache thresholds; scales based on saturation | No | | `latency` | Same as throughput with more aggressive thresholds | No | | `sla` | Rust engine perf shim targeting specific TTFT/ITL values; uses native AIC when available, optional bootstrap data, and live FPM tuning | Yes | ### Prometheus Requirement The `sla` optimization target reads live TTFT/ITL metrics from Prometheus. If you want SLA-driven autoscaling, install Prometheus before creating the DGDR. See the [Installation Guide — Prometheus](/dynamo/kubernetes-deployment/start-here/installation-guide#kube-prometheus-stack) for setup instructions. The `throughput` and `latency` modes use internal queue-depth signals and work **without Prometheus**. See the [Planner Guide](/dynamo/components/planner/planner-guide) for advanced configuration and scaling behavior details. ## Production Detail: Multinode and RDMA Models that require more GPUs than a single node provides (e.g., DeepSeek-R1 on 8-GPU nodes) need multinode orchestration. ### Grove and KAI Scheduler **Grove** is required for multinode DGDR deployments. 
It provides gang scheduling (all pods in a group start together or not at all), coordinated scaling, and network topology-aware placement. The operator will return an error if you attempt a multinode deployment without Grove or LeaderWorkerSet (LWS) installed. **KAI Scheduler** is optional but recommended alongside Grove for GPU-aware scheduling and topology optimization. See the [Installation Guide — Grove + KAI Scheduler](/dynamo/kubernetes-deployment/start-here/installation-guide#grove--kai-scheduler) for setup instructions and the compatibility matrix. ### High-Speed Networking (RDMA) Disaggregated serving transfers KV cache data between prefill and decode workers. Understanding the networking stack helps you diagnose performance issues: | Layer | What it is | |---|---| | **NIXL** | Dynamo's KV cache transfer library. Moves data between prefill and decode pods. | | **UCX / libfabric** | Low-level communication frameworks that NIXL uses underneath. | | **RDMA** | Remote Direct Memory Access — the general technique for moving data between machines without involving the CPU. | | **InfiniBand** | High-speed RDMA networking standard. Common on-prem and on Azure (AKS). | | **RoCE** | RDMA over Converged Ethernet — RDMA on standard Ethernet hardware. | | **EFA** | AWS Elastic Fabric Adapter — AWS's RDMA-capable networking for EKS. | | **GPUDirect RDMA** | Allows data to go directly between a GPU and a network adapter, bypassing CPU memory entirely. | | **NCCL** | NVIDIA Collective Communications Library — handles intra-model parallelism (TP/PP) communication _within_ a pod. Separate from NIXL. | When RDMA is missing or not active, NIXL can fall back to TCP. That makes KV cache movement the likely bottleneck and can produce very high TTFT or low throughput even when the model workers appear healthy. 
**Enable RDMA if:** - You are running multinode disaggregated deployments - You need low-latency KV cache transfer between workers See the [Installation Guide — Network Operator / RDMA](/dynamo/kubernetes-deployment/start-here/installation-guide#network-operator--rdma) for provider-specific setup instructions, and the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for transport details and performance expectations. ### MoE Models and Multinode Sweep Limits The profiler sweeps MoE models across up to **4 nodes** (dense models: 1 node max per engine during sweep). If your MoE model requires more than 4 nodes of GPUs, the profiler will select the best config within that range and you may need to adjust replica counts manually. ## Production Detail: Backend Selection The `backend` field controls which inference engine is used. The default (`auto`) lets the profiler pick the best backend, but you should specify a backend explicitly in these cases: | Scenario | Recommended Backend | |---|---| | MoE models (DeepSeek-R1, Qwen3-MoE) | `sglang` (full MoE support) | | Using `searchStrategy: thorough` | Any except `auto` (required) | | TensorRT-LLM compilation caching | `trtllm` (add a compilation cache PVC) | | Need load-based planner scaling (FPM) | `vllm` (any config) or `trtllm` (non-attention-DP only). SGLang FPM is wired in Dynamo but the upstream module is not in the 1.2.0 runtime image. | TensorRT-LLM does not support Python 3.11. If your environment uses Python 3.11, use `vllm` or `sglang` instead. ### Multinode Backend Behavior Each backend handles multinode inference differently: - **vLLM**: Uses Ray for multi-node TP/PP. Ray head runs on the leader, agents on workers. - **SGLang**: Uses `--dist-init-addr`, `--nnodes`, `--node-rank` flags for distributed setup. - **TRT-LLM**: MPI-based. The operator auto-generates SSH keypairs; the leader runs `mpirun`. 
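
To make the sweep limits from "MoE Models and Multinode Sweep Limits" concrete, the sketch below works out how many nodes one engine would span and whether that fits the profiler's documented sweep range. It assumes the limit is counted in whole nodes per engine, which is this sketch's reading of the text rather than confirmed profiler behavior; the function names are hypothetical.

```python
# Sweep limits quoted from the text above.
MOE_SWEEP_MAX_NODES = 4
DENSE_SWEEP_MAX_NODES = 1


def nodes_needed(gpus_per_engine: int, gpus_per_node: int) -> int:
    """Round up the number of nodes one engine would occupy."""
    if gpus_per_engine <= 0 or gpus_per_node <= 0:
        raise ValueError("GPU counts must be positive")
    return (gpus_per_engine + gpus_per_node - 1) // gpus_per_node


def within_sweep_limit(gpus_per_engine: int, gpus_per_node: int, is_moe: bool) -> bool:
    """Check the engine size against the documented sweep range (assumed per engine)."""
    limit = MOE_SWEEP_MAX_NODES if is_moe else DENSE_SWEEP_MAX_NODES
    return nodes_needed(gpus_per_engine, gpus_per_node) <= limit


if __name__ == "__main__":
    # 32 GPUs on 8-GPU nodes spans 4 nodes, the top of the MoE sweep range.
    print(nodes_needed(32, 8), within_sweep_limit(32, 8, is_moe=True))
```

If the check returns `False`, expect the profiler to pick the best configuration inside its sweep range, and plan to adjust replica counts manually as described above.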
## Troubleshooting ### OOM During Profiling or Serving - **Cause**: The model doesn't fit in GPU memory with the selected TP size. - **Fix**: Ensure `hardware.totalGpus` is large enough for your model. The profiler calculates minimum TP from model size and VRAM, but edge cases (large context lengths, KV cache overhead) may require more GPUs than the minimum. ### GPU Auto-Detection Cap The operator caps auto-detected GPU count at **32**. If your cluster has more GPUs and you want the profiler to use them, set `hardware.totalGpus` explicitly: ```yaml spec: hardware: totalGpus: 64 ``` ### Profiling Job Fails to Schedule GPU nodes often have taints. Add tolerations via the `overrides` field: ```yaml spec: overrides: profilingJob: template: spec: containers: [] # required placeholder tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule ``` ### DGDR Spec Is Immutable Once the DGDR enters the `Profiling` phase, the spec cannot be changed. If you need to adjust settings, delete the DGDR and recreate it: ```bash kubectl delete dgdr my-model -n $NAMESPACE kubectl apply -f updated-dgdr.yaml -n $NAMESPACE ``` ### DGD Persists After DGDR Deletion Deleting a DGDR does **not** delete the DGD it created. This is intentional — the DGD continues serving traffic independently. 
To clean up fully: ```bash kubectl delete dgdr my-model -n $NAMESPACE kubectl delete dgd my-model-dgd -n $NAMESPACE ``` ## Example Workflows ### Small Dense Model (Quick Start) A small model on a single node with rapid profiling — the simplest case: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: qwen-small spec: model: Qwen/Qwen3-0.6B ``` ### Large Dense Model with SLA Targets A 70B model with model caching, SLA targets, and the planner enabled: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: llama-70b spec: model: meta-llama/Llama-3.1-70B-Instruct backend: vllm searchStrategy: rapid autoApply: false modelCache: pvcName: model-cache pvcMountPath: /home/dynamo/.cache/huggingface pvcModelPath: hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/ sla: ttft: 500 itl: 50 workload: isl: 4000 osl: 1000 requestRate: 10 features: planner: enabled: true ``` ### MoE Model (DeepSeek-R1) A large MoE model requiring multinode, SGLang backend, and thorough profiling: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: deepseek-r1 spec: model: deepseek-ai/DeepSeek-R1 backend: sglang searchStrategy: thorough autoApply: false modelCache: pvcName: model-cache pvcMountPath: /home/dynamo/.cache/huggingface pvcModelPath: hub/models--deepseek-ai--DeepSeek-R1/snapshots/ sla: ttft: 2000 itl: 100 hardware: totalGpus: 32 features: planner: enabled: true overrides: profilingJob: template: spec: containers: [] tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule ``` **Prerequisites for this deployment:** - [Grove and KAI Scheduler](/dynamo/kubernetes-deployment/start-here/installation-guide#grove--kai-scheduler) installed - [RDMA](/dynamo/kubernetes-deployment/start-here/installation-guide#network-operator--rdma) configured for efficient KV cache transfer - Model [cached on a shared 
PVC](/dynamo/kubernetes-deployment/start-here/installation-guide#shared-storage-for-model-caching) - [Prometheus](/dynamo/kubernetes-deployment/start-here/installation-guide#kube-prometheus-stack) installed (for SLA-driven planner scaling) ## Further Reading - [DGDR Reference](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) — Spec reference, lifecycle phases, monitoring commands - [DGDR Examples](/dynamo/components/profiler/profiler-examples) — Ready-to-use YAML for various scenarios - [Profiler Guide](/dynamo/components/profiler/profiler-guide) — Profiling algorithms, picking modes, gate checks - [Planner Guide](/dynamo/components/planner/planner-guide) — Scaling modes, PlannerConfig reference - [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) — PVC setup, ModelExpress, and ModelStreamer - [Creating Deployments](/dynamo/additional-resources/creating-deployments) — Manual DGD spec for hand-crafted configs - [Multinode Deployments](/dynamo/kubernetes-deployment/scale/multinode-deployments) — Grove, LWS, and multinode details - [Disaggregated Communication](/dynamo/kubernetes-deployment/operate/disagg-communication) — NIXL, RDMA, and networking # Managing Models with DynamoModel ## Overview `DynamoModel` is a Kubernetes Custom Resource that represents a machine learning model deployed on Dynamo. It enables you to: - **Deploy LoRA adapters** on top of running base models - **Track model endpoints** and their readiness across your cluster - **Manage model lifecycle** declaratively with Kubernetes DynamoModel works alongside `DynamoGraphDeployment` (DGD) or `DynamoComponentDeployment` (DCD) resources. While DGD/DCD deploy the inference infrastructure (pods, services), DynamoModel handles model-specific operations like loading LoRA adapters. ## Quick Start ### Prerequisites Before creating a DynamoModel, you need: 1. A running `DynamoGraphDeployment` or `DynamoComponentDeployment` 2. 
Components configured with `modelRef` pointing to your base model 3. Pods are ready and serving your base model For complete setup including DGD configuration, see [Integration with DynamoGraphDeployment](#integration-with-dynamographdeployment). ### Deploy a LoRA Adapter **1. Create your DynamoModel:** ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoModel metadata: name: my-lora namespace: dynamo-system spec: modelName: my-custom-lora baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name in your DGD modelType: lora source: uri: s3://my-bucket/loras/my-lora ``` **2. Apply and verify:** ```bash # Apply the DynamoModel kubectl apply -f my-lora.yaml # Check status kubectl get dynamomodel my-lora ``` **Expected output:** ``` NAME TOTAL READY AGE my-lora 2 2 30s ``` That's it! The operator automatically discovers endpoints and loads the LoRA. For detailed status monitoring, see [Monitoring & Operations](#monitoring--operations). ## Understanding DynamoModel ### Model Types DynamoModel supports three model types: | Type | Description | Use Case | |------|-------------|----------| | **`base`** | Reference to an existing base model | Tracking endpoints for a base model (default) | | **`lora`** | LoRA adapter that extends a base model | Deploy fine-tuned adapters on existing models | | **`adapter`** | Generic model adapter | Future extensibility for other adapter types | Most users will use **`lora`** to deploy fine-tuned models on top of their base model deployments. ### How It Works When you create a DynamoModel, the operator: 1. **Discovers endpoints**: Finds all pods running your `baseModelName` (by matching `modelRef.name` in DGD/DCD) 2. **Creates service**: Automatically creates a Kubernetes Service to track these pods 3. **Loads LoRA**: Calls the LoRA load API on each endpoint (for `lora` type) 4. 
**Updates status**: Reports which endpoints are ready **Key linkage:** ```yaml # DGD modelRef.name ↔ DynamoModel baseModelName must match Worker: modelRef: name: Qwen/Qwen3-0.6B --- spec: baseModelName: Qwen/Qwen3-0.6B ``` ## Configuration Overview DynamoModel requires just a few key fields to deploy a model or adapter: | Field | Required | Purpose | Example | |-------|----------|---------|---------| | `modelName` | Yes | Model identifier | `my-custom-lora` | | `baseModelName` | Yes | Links to DGD modelRef | `Qwen/Qwen3-0.6B` | | `modelType` | No | Type: base/lora/adapter | `lora` (default: `base`) | | `source.uri` | For LoRA | Model location | `s3://bucket/path` or `hf://org/model` | **Example minimal LoRA configuration:** ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoModel metadata: name: my-lora spec: modelName: my-custom-lora baseModelName: Qwen/Qwen3-0.6B modelType: lora source: uri: s3://my-bucket/my-lora ``` **For complete field specifications, validation rules, and all options, see:** 📖 [DynamoModel API Reference](/dynamo/additional-resources/api-reference-k-8-s#dynamomodel) ### Status Summary The status shows discovered endpoints and their readiness: ```bash kubectl get dynamomodel my-lora ``` **Key status fields:** - `totalEndpoints` / `readyEndpoints`: Counts of discovered vs ready endpoints - `endpoints[]`: List with addresses, pod names, and ready status - `conditions`: Standard Kubernetes conditions (EndpointsReady, ServicesFound) For detailed status usage, see the [Monitoring & Operations](#monitoring--operations) section below ## Common Use Cases ### Use Case 1: S3-Hosted LoRA Adapter Deploy a LoRA adapter stored in an S3 bucket. 
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: customer-support-lora
  namespace: production
spec:
  modelName: customer-support-adapter-v1
  baseModelName: meta-llama/Llama-3.3-70B-Instruct
  modelType: lora
  source:
    uri: s3://my-models-bucket/loras/customer-support/v1
```

**Prerequisites:**

- S3 bucket accessible from your pods (IAM role or credentials)
- Base model `meta-llama/Llama-3.3-70B-Instruct` running via DGD/DCD

**Verification:**

```bash
# Check LoRA is loaded
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.readyEndpoints}'
# Should output: 2 (or your number of replicas)

# View which pods are serving
kubectl get dynamomodel customer-support-lora -o jsonpath='{.status.endpoints[*].podName}'
```

### Use Case 2: HuggingFace-Hosted LoRA

Deploy a LoRA adapter from HuggingFace Hub.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: multilingual-lora
  namespace: dynamo-system
spec:
  modelName: multilingual-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: hf://myorg/qwen-multilingual-lora@v1.0.0  # Optional: @revision
```

**Prerequisites:**

- HuggingFace Hub accessible from your pods
- If private repo: HF token configured as secret and mounted in pods
- Base model `Qwen/Qwen3-0.6B` running via DGD/DCD

**With HuggingFace token:**

```yaml
# In your DGD/DCD
spec:
  services:
    worker:
      envFromSecret: hf-token-secret  # Provides HF_TOKEN env var
      modelRef:
        name: Qwen/Qwen3-0.6B
      # ... rest of config
```

### Use Case 3: Multiple LoRAs on Same Base Model

Deploy multiple LoRA adapters on the same base model deployment.
```yaml
---
# LoRA for customer support
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: support-lora
spec:
  modelName: support-adapter
  baseModelName: Qwen/Qwen3-0.6B
  modelType: lora
  source:
    uri: s3://models/support-lora
---
# LoRA for code generation
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: code-lora
spec:
  modelName: code-adapter
  baseModelName: Qwen/Qwen3-0.6B  # Same base model
  modelType: lora
  source:
    uri: s3://models/code-lora
```

Both LoRAs will be loaded on all pods serving `Qwen/Qwen3-0.6B`. Your application can then route requests to the appropriate adapter.

## Monitoring & Operations

### Checking Status

**Quick status check:**

```bash
kubectl get dynamomodel
```

**Example output:**

```
NAME            TOTAL   READY   AGE
my-lora         2       2       5m
customer-lora   4       3       2h
```

**Detailed status:**

```bash
kubectl describe dynamomodel my-lora
```

**Example output:**

```
Name:         my-lora
Namespace:    dynamo-system
Spec:
  Model Name:       my-custom-lora
  Base Model Name:  Qwen/Qwen3-0.6B
  Model Type:       lora
  Source:
    Uri:  s3://my-bucket/my-lora
Status:
  Ready Endpoints:  2
  Total Endpoints:  2
  Endpoints:
    Address:   http://10.0.1.5:9090
    Pod Name:  worker-0
    Ready:     true
    Address:   http://10.0.1.6:9090
    Pod Name:  worker-1
    Ready:     true
  Conditions:
    Type:    EndpointsReady
    Status:  True
    Reason:  EndpointsDiscovered
Events:
  Type    Reason          Message
  ----    ------          -------
  Normal  EndpointsReady  Discovered 2 ready endpoints for base model Qwen/Qwen3-0.6B
```

### Understanding Readiness

An endpoint is **ready** when:

1. The pod is running and healthy
2. The LoRA load API call succeeded

**Condition states:**

- `EndpointsReady=True`: All endpoints are ready (full availability)
- `EndpointsReady=False, Reason=NotReady`: Not all endpoints ready (check message for counts)
- `EndpointsReady=False, Reason=NoEndpoints`: No endpoints found

When `readyEndpoints < totalEndpoints`, the operator automatically retries loading every 30 seconds.
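The condition states above can be expressed as a small function over the `status` block. This is a minimal sketch for illustration, not the operator's implementation. `classify_endpoints_ready` is a hypothetical helper, and it assumes the status has already been fetched (for example with `kubectl get dynamomodel my-lora -o json`) and parsed into a dict. The reason strings are copied from this page, not from the operator source.

```python
def classify_endpoints_ready(status: dict) -> tuple[bool, str]:
    """Map endpoint counts to the EndpointsReady states listed above.

    Hypothetical helper: returns (condition_status, reason).
    """
    total = status.get("totalEndpoints", 0)
    ready = status.get("readyEndpoints", 0)

    if total == 0:
        # No pods were discovered for the base model.
        return False, "NoEndpoints"
    if ready < total:
        # Some endpoints have not loaded the LoRA yet; the operator retries.
        return False, "NotReady"
    # Every discovered endpoint reports ready.
    return True, "EndpointsDiscovered"
```

A script that polls a DynamoModel could use the returned pair to decide whether to keep waiting (`NotReady`) or to check the base deployment first (`NoEndpoints`).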
### Viewing Endpoints

**Get endpoint addresses:**

```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].address}' | tr ' ' '\n'
```

**Output:**

```
http://10.0.1.5:9090
http://10.0.1.6:9090
```

**Get endpoint pod names:**

```bash
kubectl get dynamomodel my-lora -o jsonpath='{.status.endpoints[*].podName}' | tr ' ' '\n'
```

**Check readiness of each endpoint:**

```bash
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | {podName, ready}'
```

**Output:**

```json
{ "podName": "worker-0", "ready": true }
{ "podName": "worker-1", "ready": true }
```

### Updating a Model

To update a LoRA (e.g., deploy a new version):

```bash
# Edit the source URI
kubectl edit dynamomodel my-lora

# Or apply an updated YAML
kubectl apply -f my-lora-v2.yaml
```

The operator will detect the change and reload the LoRA on all endpoints.

### Deleting a Model

```bash
kubectl delete dynamomodel my-lora
```

For LoRA models, the operator will:

1. Unload the LoRA from all endpoints
2. Clean up associated resources
3. Remove the DynamoModel CR

The base model deployment (DGD/DCD) continues running normally.

## Troubleshooting

### No Endpoints Found

**Symptom:**

```yaml
status:
  totalEndpoints: 0
  readyEndpoints: 0
  conditions:
    - type: EndpointsReady
      status: "False"
      reason: NoEndpoints
      message: "No endpoint slices found for base model Qwen/Qwen3-0.6B"
```

**Common Causes:**

1. **Base model deployment not running**

   ```bash
   # Check if pods exist
   kubectl get pods -l nvidia.com/dynamo-component-type=worker
   ```

   **Solution:** Deploy your DGD/DCD first, wait for pods to be ready.

2. **`baseModelName` mismatch**

   ```bash
   # Check modelRef in your DGD
   kubectl get dynamographdeployment my-deployment -o yaml | grep -A2 modelRef
   ```

   **Solution:** Ensure `baseModelName` in DynamoModel exactly matches `modelRef.name` in DGD.

3. **Pods not ready**

   ```bash
   # Check pod status
   kubectl get pods -l nvidia.com/dynamo-component-type=worker
   ```

   **Solution:** Wait for pods to reach `Running` and `Ready` state.

4. **Wrong namespace**

   **Solution:** Ensure DynamoModel is in the same namespace as your DGD/DCD.

### LoRA Load Failures

**Symptom:**

```yaml
status:
  totalEndpoints: 2
  readyEndpoints: 0  # ← No endpoints ready despite pods existing
  conditions:
    - type: EndpointsReady
      status: "False"
      reason: NoReadyEndpoints
```

**Common Causes:**

1. **Source URI not accessible**

   ```bash
   # Check operator logs
   kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f | grep "Failed to load"
   ```

   **Solution:**
   - For S3: Verify bucket permissions, IAM role, credentials
   - For HuggingFace: Verify token is valid, repo exists and is accessible

2. **Invalid LoRA format**

   **Solution:** Ensure your LoRA weights are in the format expected by your backend framework (SGLang, vLLM, etc.)

3. **Endpoint API errors**

   ```bash
   # Check operator logs for HTTP errors
   kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "error"
   ```

   **Solution:** Check the backend framework's logs in the worker pods:

   ```bash
   kubectl logs worker-0
   ```

4. **Out of memory**

   **Solution:** LoRA adapters require additional memory. Increase memory limits in your DGD:

   ```yaml
   resources:
     limits:
       memory: "32Gi"  # Increase if needed
   ```

### Status Shows Not Ready

**Symptom:** Some endpoints remain not ready for extended periods.

**Diagnosis:**

```bash
# Check which endpoints are not ready
kubectl get dynamomodel my-lora -o json | jq '.status.endpoints[] | select(.ready == false)'

# View operator logs for that specific pod
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "worker-0"

# Check the worker pod logs
kubectl logs worker-0 | tail -50
```

**Common Causes:**

1. **Network issues**: Pod can't reach S3/HuggingFace
2. **Resource constraints**: Pod is OOMing or being throttled
3. **API endpoint not responding**: Backend framework isn't serving the LoRA API

**When to wait vs investigate:**

- **Wait**: If `readyEndpoints` is increasing over time (LoRAs loading progressively)
- **Investigate**: If stuck at the same `readyEndpoints` for >5 minutes

### Viewing Events and Logs

**Check events:**

```bash
kubectl describe dynamomodel my-lora | tail -20
```

**View operator logs:**

```bash
# Follow logs
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager -f

# Filter for specific model
kubectl logs -n dynamo-system deployment/dynamo-operator-controller-manager | grep "my-lora"
```

**Common events and messages:**

| Event/Message | Meaning | Action |
|---------------|---------|--------|
| `EndpointsReady` | All endpoints are ready | ✅ Good - full service availability |
| `NotReady` | Not all endpoints ready | ⚠️ Check readyEndpoints count - operator will retry |
| `PartialEndpointFailure` | Some endpoints failed to load | Check logs for errors |
| `NoEndpointsFound` | No pods discovered | Verify DGD running and modelRef matches |
| `EndpointDiscoveryFailed` | Can't query endpoints | Check operator RBAC permissions |
| `Successfully reconciled` | Reconciliation complete | ✅ Good |

## Integration with DynamoGraphDeployment

This section shows the complete end-to-end workflow for deploying base models and LoRA adapters together.

DynamoModel and DynamoGraphDeployment work together to provide complete model deployment:

- **DGD**: Deploys the infrastructure (pods, services, resources)
- **DynamoModel**: Manages model-specific operations (LoRA loading)

### Linking Models to Components

The connection is established through the `modelRef` field in your DGD:

**Complete example:**

```yaml
---
# 1. Deploy the base model infrastructure
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  backendFramework: vllm
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
    Worker:
      # This modelRef creates the link to DynamoModel
      modelRef:
        name: Qwen/Qwen3-0.6B  # ← Key linking field
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:latest
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --tensor-parallel-size
            - "1"
---
# 2. Deploy LoRA adapters on top
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # ← Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```

### Deployment Workflow

**Recommended order:**

```bash
# 1. Deploy base model infrastructure
kubectl apply -f my-deployment.yaml

# 2. Wait for pods to be ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-component-type=worker --timeout=5m

# 3. Deploy LoRA adapters
kubectl apply -f my-lora.yaml

# 4. Verify LoRA is loaded
kubectl get dynamomodel my-lora
```

**What happens behind the scenes:**

| Step | DGD | DynamoModel |
|------|-----|-------------|
| 1 | Creates pods with modelRef | - |
| 2 | Pods become running and ready | - |
| 3 | - | CR created, discovers endpoints via auto-created Service |
| 4 | - | Calls LoRA load API on each endpoint |
| 5 | - | All endpoints ready ✓ |

The operator automatically handles all service discovery - you don't configure services, labels, or selectors manually.
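Because a `baseModelName` mismatch is a common cause of `NoEndpoints`, it can help to check the linkage before applying the manifests. The following is a minimal sketch under stated assumptions: `services_linked_to` is a hypothetical helper (not part of Dynamo), and both manifests are assumed to be already parsed into plain dicts with the field layout shown in the complete example above.

```python
def services_linked_to(dgd: dict, model: dict) -> list[str]:
    """Return the DGD service names whose modelRef.name equals the
    DynamoModel's baseModelName (exact string match, as this page requires).
    """
    base = model.get("spec", {}).get("baseModelName")
    if base is None:
        return []
    services = dgd.get("spec", {}).get("services", {})
    return [
        name
        for name, svc in services.items()
        if (svc.get("modelRef") or {}).get("name") == base
    ]


# Dicts mirroring the complete example above (trimmed to the relevant fields).
dgd = {
    "spec": {
        "services": {
            "Frontend": {"componentType": "frontend"},
            "Worker": {"modelRef": {"name": "Qwen/Qwen3-0.6B"}},
        }
    }
}
model = {"spec": {"baseModelName": "Qwen/Qwen3-0.6B", "modelType": "lora"}}

print(services_linked_to(dgd, model))  # ['Worker']
```

An empty result means no service in the DGD would be discovered for this DynamoModel, which corresponds to the `baseModelName` mismatch case in Troubleshooting.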
## API Reference

For complete field specifications, validation rules, and detailed type definitions, see:

**📖 [Dynamo CRD API Reference](/dynamo/additional-resources/api-reference-k-8-s#dynamomodel)**

## Summary

DynamoModel provides declarative model management for Dynamo deployments:

- ✅ **Simple**: 2-step deployment of LoRA adapters
- ✅ **Automatic**: Endpoint discovery and loading handled by operator
- ✅ **Observable**: Rich status reporting and conditions
- ✅ **Integrated**: Works seamlessly with DynamoGraphDeployment

**Next Steps:**

- Try the [Quick Start](#quick-start) example
- Explore [Common Use Cases](#common-use-cases)
- Check the [API Reference](/dynamo/additional-resources/api-reference-k-8-s#dynamomodel) for advanced configuration

# DGDR Reference

A `DynamoGraphDeploymentRequest` (DGDR) is Dynamo's deploy-by-intent generator for [`DynamoGraphDeployment`](/dynamo/additional-resources/api-reference-k-8-s#dynamographdeployment) (DGD) resources. You describe what you want to run and your performance targets; the profiler determines a configuration and produces the DGD that serves traffic.

For the full deployment mental model, including DGD, DCD, DGDR, recipes, strategy selection, model caching, planner setup, and common pitfalls, see the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).

## DGDR, DGD, and Recipes

Dynamo provides two Custom Resources for deploying inference graphs:

| | DGD (canonical live deployment) | DGDR (generator/profiler) |
|---|---|---|
| **You provide** | Full deployment spec (services, parallelism, replicas, resource limits, etc.) | Model, backend, workload, hardware, and optional SLA targets |
| **What happens** | The operator reconciles the DGD into `DynamoComponentDeployment` resources and pods | The profiler generates a DGD; with `autoApply: true`, the operator creates it |
| **Best for** | Known-good configs, tuned recipes, or full manual control | New model/hardware combinations, SLA-driven sizing, or generated DGD YAML |
| **Persistence** | Persists and serves traffic | Reaches a terminal state after generation/deploy |

Use DGD directly when you have a hand-crafted configuration for a specific model/hardware combination. Most [recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) are tuned DGD manifests. Use DGDR when you want Dynamo to generate the DGD for you.

For DGD deployment details, see [Creating Deployments](/dynamo/additional-resources/creating-deployments).

## Spec Reference

### Minimal Example

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model
spec:
  model: Qwen/Qwen3-0.6B
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1"  # dynamo-frontend for Dynamo < 1.1.0
```

### Field Reference

| Field | Required | Default | Purpose |
|---|---|---|---|
| `model` | Yes | - | HuggingFace model ID (e.g. `Qwen/Qwen3-0.6B`) |
| `image` | No | - | Container image for the profiling job. Dynamo >= 1.1.0: use `dynamo-planner`; earlier versions: use `dynamo-frontend`. |
| `backend` | No | `auto` | Inference engine: `auto`, `vllm`, `sglang`, `trtllm` |
| `searchStrategy` | No | `rapid` | Profiling depth: `rapid` (AIC-backed DynoSim-style modeling, ~30s) or `thorough` (real GPU, 2–4h) |
| `autoApply` | No | `true` | Automatically deploy the profiler's recommended config |
| `sla.ttft` | No | - | Target time to first token (ms) |
| `sla.itl` | No | - | Target inter-token latency (ms) |
| `sla.e2eLatency` | No | - | Target end-to-end latency (ms). Cannot be combined with explicit `ttft`/`itl`. |
| `workload.isl` | No | `4000` | Expected average input sequence length |
| `workload.osl` | No | `1000` | Expected average output sequence length |
| `workload.requestRate` | No | - | Target requests per second |
| `workload.concurrency` | No | - | Target concurrent requests |
| `hardware.gpuSku` | No | auto-detected | GPU SKU (see [SKU Format](#sku-format)) |
| `hardware.vramMb` | No | auto-detected | GPU VRAM in MB |
| `hardware.totalGpus` | No | auto-detected (capped at 32) | Total GPUs available to the deployment |
| `hardware.numGpusPerNode` | No | auto-detected | GPUs per node |
| `hardware.interconnect` | No | auto-detected | Interconnect type |
| `hardware.rdma` | No | auto-detected | Whether RDMA is available |
| `modelCache.pvcName` | No | - | Name of a `ReadWriteMany` PVC containing cached model weights |
| `modelCache.pvcModelPath` | No | - | Path to the model directory inside the PVC |
| `modelCache.pvcMountPath` | No | `/opt/model-cache` | Mount path inside containers |
| `features.planner` | No | disabled | Enable the SLA-aware Planner; the generated DGD includes Planner service/configuration |
| `features.mocker` | No | disabled | Enable mocker mode for testing |
| `overrides.profilingJob` | No | - | `batchv1.JobSpec` overrides for the profiling job (e.g., tolerations) |
| `overrides.dgd` | No | - | Raw DGD override base applied to the generated deployment |

For the complete CRD spec, see the [API Reference](/dynamo/additional-resources/api-reference-k-8-s).

DGDR does not currently expose a `features.kvRouter` field. To configure router mode or KV-aware routing details, use a direct DGD, a tuned recipe, or `overrides.dgd` when you still want DGDR to generate the base deployment.

### Generated DGD Overrides

Use `spec.overrides.dgd` when the generated `DynamoGraphDeployment` needs a field that DGDR does not expose directly.
The value is a partial `nvidia.com/v1alpha1` DGD object that is merged into the profiler-generated deployment after Dynamo selects a configuration.

For example, to inject an environment variable into every generated service:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen3-sglang
spec:
  model: Qwen/Qwen3-30B-A3B
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.1.1"  # dynamo-frontend for Dynamo < 1.1.0
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        envs:
          - name: TRITON_PTXAS_PATH
            value: /usr/local/cuda/bin/ptxas
```

Use `spec.envs` for variables that should apply to all generated services. To target a single service, override that service's `envs` entry instead:

```yaml
spec:
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        services:
          decode:  # replace with the generated service name
            envs:
              - name: CUSTOM_WORKER_ENV
                value: "enabled"
```

`overrides.profilingJob` only customizes the profiling Job. Use `overrides.dgd` for settings that must appear on the deployed worker pods.

### SKU Format

When providing hardware configuration manually, use lowercase underscore format:

| Correct | Incorrect |
|---|---|
| `h100_sxm` | `H100-SXM5-80GB` |
| `h200_sxm` | `H200-SXM-141GB` |
| `a100_sxm` | `A100-SXM4-80GB` |
| `a30` | `A30` |
| `l40s` | `L40S` |

All supported values: `gb200_sxm`, `b200_sxm`, `h200_sxm`, `h100_sxm`, `h100_pcie`, `a100_sxm`, `a100_pcie`, `a30`, `l40s`, `l40`, `l4`, `v100_sxm`, `v100_pcie`, `t4`, `mi200`, `mi300`.

Not all SKUs are supported by the AIC profiler for `rapid` mode. See [AIC Support Matrix](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide#aic-support-matrix) for details.

**PCIe variants not yet supported by profiler.** The CRD admits PCIe SKUs (`h100_pcie`, `a100_pcie`, `v100_pcie`), but the profiler does not currently ship training data for them.
You can submit a DGDR with a PCIe value; the operator will accept it but profiler-assisted sizing will fall back to defaults. Profiler support for PCIe SKUs is tracked as an engineering follow-up.

## Lifecycle

When you create a DGDR, it progresses through these phases:

| Phase | What is happening |
|---|---|
| `Pending` | Spec validated; operator is discovering GPU hardware and preparing the profiling job |
| `Profiling` | Profiling job running; sub-phases: `Initializing`, `SweepingPrefill`, `SweepingDecode`, `SelectingConfig`, `BuildingCurves`, `GeneratingDGD`, `Done` |
| `Ready` | Profiling complete; optimal config stored in `.status.profilingResults.selectedConfig`. Terminal state when `autoApply: false`. |
| `Deploying` | Creating the `DynamoGraphDeployment` (only when `autoApply: true`) |
| `Deployed` | DGD is running and healthy |
| `Failed` | Unrecoverable error; profiling failures are not retried (`backoffLimit: 0`). Check events and conditions for details |

### Conditions

The operator maintains these conditions on the DGDR status:

| Condition | Meaning |
|---|---|
| `Validation` | Spec validation passed or failed |
| `Profiling` | Profiling job is running, succeeded, or failed |
| `SpecGenerated` | Generated DGD spec is available |
| `DeploymentReady` | DGD is deployed and healthy |
| `Succeeded` | Aggregate condition; true when the DGDR has reached its target state |

### Monitoring

```bash
# Watch phase transitions
kubectl get dgdr my-model -n $NAMESPACE -w

# Detailed status, conditions, and events
kubectl describe dgdr my-model -n $NAMESPACE

# Profiling sub-phase
kubectl get dgdr my-model -n $NAMESPACE -o jsonpath='{.status.profilingPhase}'

# Profiling job logs (use a pod name returned by the first command)
kubectl get pods -n $NAMESPACE -l nvidia.com/dgdr-name=my-model
kubectl logs -f <pod-name> -n $NAMESPACE

# View generated DGD spec (when autoApply: false)
kubectl get dgdr my-model -n $NAMESPACE \
  -o jsonpath='{.status.profilingResults.selectedConfig}' | python3 -m json.tool

# View Pareto-optimal configs from profiling
kubectl get dgdr my-model -n $NAMESPACE \
  -o jsonpath='{.status.profilingResults.pareto}'
```

### Resource Ownership

- The DGDR does **not** set an owner reference on the DGD it creates. Deleting a DGDR does not delete the DGD; it persists independently so it can continue serving traffic.
- The relationship is tracked via labels: `dgdr.nvidia.com/name` and `dgdr.nvidia.com/namespace`.
- Additional resources (planner ConfigMaps) are created in the same namespace and labeled with `dgdr.nvidia.com/name`.

## Known Issues

- **`pareto_analysis.py` produces NaN for some configurations.** Tracked as an engineering follow-up. Workaround: re-run with a narrower sweep; narrow sweeps bypass the NaN path in practice.
- **PCIe profiler data not yet available.** See the PCIe callout under [SKU Format](#sku-format).

## Further Reading

- [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide): DGD, DCD, DGDR, recipes, strategy selection, and common pitfalls
- [Profiler Guide](/dynamo/components/profiler/profiler-guide): Profiling algorithms, picking modes, gate checks
- [Profiler Examples](/dynamo/components/profiler/profiler-examples): Ready-to-use YAML for SLA targets, private models, MoE, overrides
- [Planner Guide](/dynamo/components/planner/planner-guide): Scaling modes, PlannerConfig reference
- [API Reference](/dynamo/additional-resources/api-reference-k-8-s): Complete CRD field specifications
- [Creating Deployments](/dynamo/additional-resources/creating-deployments): DGD spec for full manual control

# Model Caching

Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports a simple shared-storage path and a ModelExpress path for faster weight distribution across larger clusters.
## Option 1: PVC + Download Job (Recommended)

The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment. This is the pattern used by all Dynamo recipes today.

### Step 1: Create a Shared PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```

`ReadWriteMany` access mode is required so multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or cloud-provider shared file systems).

### Step 2: Download the Model

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install huggingface_hub hf_transfer
              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
                $MODEL_NAME --revision $MODEL_REVISION
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-0.6B"
            - name: MODEL_REVISION
              value: "main"
            - name: HF_HOME
              value: /cache/huggingface
          envFrom:
            - secretRef:
                name: hf-token-secret
          volumeMounts:
            - name: model-cache
              mountPath: /cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```

### Find the Snapshot Path

After the Job completes, the model is stored in HuggingFace's cache layout:

```
hub/models--<org>--<model>/snapshots/<commit-hash>/
```

For example, `meta-llama/Llama-3.1-70B-Instruct` becomes:

```
hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1/
```

To find the exact commit hash after the download Job completes:

```bash
kubectl run find-snapshot --rm -it --image=busybox --restart=Never \
  --overrides='{
    "spec": {
      "volumes": [{"name": "c", "persistentVolumeClaim": {"claimName": "model-cache"}}],
      "containers": [{
        "name": "f",
        "image": "busybox",
        "command": ["find", "/c/hub", "-mindepth", "3", "-maxdepth", "3", "-type", "d"],
        "volumeMounts": [{"name": "c", "mountPath": "/c"}]
      }]
    }
  }'
```

Alternatively, look up the commit hash on the HuggingFace Hub model page under **Files and versions**.

You need this path for the `pvcModelPath` field in a DGDR spec (see [Deployment Overview: Model Caching](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide#production-detail-model-caching)).

### Step 3: Mount in DynamoGraphDeployment

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  pvcs:
    - create: false
      name: model-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
```

All `VllmWorker` pods that mount `model-cache` now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.

### Compilation Cache

For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:

```yaml
spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
```

## Option 2: ModelExpress (P2P Distribution)

[ModelExpress](https://github.com/ai-dynamo/modelexpress) is a model weight distribution service that integrates with vLLM's weight loading pipeline. It can publish model weights from one worker and let later workers pull those tensors from GPU memory over NIXL/RDMA instead of repeating a full storage download.

ModelExpress can also use **ModelStreamer** as a loading strategy. ModelStreamer streams safetensors directly from object storage or a local filesystem path into GPU memory through the `runai-model-streamer` package. In that setup, the first worker can stream from storage and then publish ModelExpress metadata so later workers can use the P2P path.
Use this path when startup time or fleet-wide model rollout time matters more than the simplicity of a shared PVC.

### How It Works

1. A ModelExpress server runs in the cluster and stores metadata for available sources.
2. vLLM workers use the ModelExpress loader (`--load-format mx` on newer ModelExpress images, or `mx-source` / `mx-target` on older split-loader images).
3. If a compatible source is already serving the model, a new worker pulls model tensors from that source over NIXL/RDMA.
4. If no source is available, the worker falls back to storage. With a shared filesystem (RWX PVC, NFS, hostPath), the worker reads directly from the server's cache. Without a shared filesystem, set `MODEL_EXPRESS_NO_SHARED_STORAGE=1` so the client streams files from the server over gRPC; see [Streaming Without Shared Storage](#streaming-without-shared-storage) below. When `MX_MODEL_URI` is set, ModelStreamer can stream safetensors from S3, GCS, Azure Blob Storage, or a local path.
5. The Kubernetes operator can inject `MODEL_EXPRESS_URL` into all Dynamo pods from the platform `modelExpressURL` setting.

### What To Configure

| Layer | What to configure | Notes |
|-------|-------------------|-------|
| Runtime image | Include the `modelexpress` Python package and, for ModelStreamer, `runai-model-streamer` plus the object-storage dependencies. | Dynamo or vLLM raises an import error if the worker uses a ModelExpress load format but the package is missing. |
| ModelExpress server | Deploy the server with Redis or Kubernetes CRD metadata backend. | See the [ModelExpress deployment guide](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md). |
| Dynamo platform | Set `dynamo-operator.modelExpressURL`. | The operator injects `MODEL_EXPRESS_URL` into pods. |
| vLLM worker | Set the ModelExpress load format and point at the server. | Newer ModelExpress images use `--load-format mx`; older Dynamo images may use `mx-source` / `mx-target`. |
| ModelStreamer | Set `MX_MODEL_URI` to the storage location. | Supported URI forms include `s3://...`, `gs://...`, `az://...`, an absolute local path, or a Hugging Face model ID resolved from the local cache. |

### Setup

**Install with Dynamo Platform:**

```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

**Configure workers to use ModelExpress:**

```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image: <your-runtime-image>
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
          - --model-express-url
          - http://model-express-server.model-express.svc.cluster.local:8080
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
```

When `MODEL_EXPRESS_URL` is configured in the operator, it is automatically injected as an environment variable into all component pods. Passing `--model-express-url` explicitly is still useful in examples because the worker validates that a server URL is available when using the older `mx-source` / `mx-target` load formats.

Use the load format supported by your runtime image. ModelExpress v0.3 and newer document the unified `mx` loader. Some Dynamo images still expose the older split `mx-source` and `mx-target` loader names; those require the same server URL but separate source and target roles.

### Streaming Without Shared Storage

If the ModelExpress server's cache is on a non-shared volume (e.g. a `ReadWriteOnce` PVC, a cross-namespace deployment, or any topology where worker pods cannot mount the same filesystem as the server), the default shared-storage mode fails. The server reports the model as downloaded and returns its own local path, the worker cannot read that path from inside its own pod, and the load silently falls back to a direct HuggingFace download, which defeats the point of running ModelExpress.
Set `MODEL_EXPRESS_NO_SHARED_STORAGE=1` on every worker pod to switch the ModelExpress client into gRPC streaming mode. The server then sends model files to the client over the existing gRPC channel and the worker writes them to its own local cache.

```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image: <your-runtime-image>
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MODEL_EXPRESS_NO_SHARED_STORAGE
            value: "1"
```

`MODEL_EXPRESS_URL` is injected automatically by the operator (`dynamo-operator.modelExpressURL`); you do not need to set it explicitly here. No volume mount for the ModelExpress cache is required on worker pods in this mode.

Use this path when:

- The server runs with an RWO PVC, or in a different namespace from the workers.
- The cluster has no RDMA / InfiniBand fabric available, so P2P over NIXL is not an option.
- You want ModelExpress to act as a centralized download-and-cache server (one HuggingFace pull, fan out over gRPC to many workers) without standing up object storage and `MX_MODEL_URI`.

Shared-filesystem mode is still faster when available, so prefer an RWX PVC mounted on both the server and the workers when the storage class supports it. See the [ModelExpress storage access modes documentation](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md#storage-access-modes) for the full trade-off and tuning knobs (chunk size, etc.).

### ModelStreamer From Object Storage

Set `MX_MODEL_URI` when the first worker should stream safetensors directly from storage instead of reading a PVC or relying on a prior source worker.
```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image: <your-runtime-image>
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
          - --model-express-url
          - http://model-express-server.model-express.svc.cluster.local:8080
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MX_MODEL_URI
            value: s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct
          - name: RUNAI_STREAMER_CONCURRENCY
            value: "8"
```

ModelStreamer relies on the underlying cloud SDK credentials:

| Storage backend | `MX_MODEL_URI` example | Credential options |
|-----------------|------------------------|--------------------|
| S3 or S3-compatible storage | `s3://bucket/path/to/model` | IRSA / workload identity, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION`, and optional `AWS_ENDPOINT_URL` |
| Google Cloud Storage | `gs://bucket/path/to/model` | GKE Workload Identity, Application Default Credentials, or `GOOGLE_APPLICATION_CREDENTIALS` |
| Azure Blob Storage | `az://container/path/to/model` | Managed Identity, service principal env vars, or `AZURE_ACCOUNT_NAME` / `AZURE_ACCOUNT_KEY` |
| Local filesystem or PVC | `/models/meta-llama/Llama-3.1-70B-Instruct` | Mount the path into the worker pod |

Credentials are consumed by the storage SDKs in the worker pod. They do not flow through the ModelExpress server.

### Relationship To Shadow Engine Failover

ModelExpress and ModelStreamer are model loading and distribution paths. They are not required for [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover), and enabling them does not create standby engines.

Use Shadow Engine Failover only when you specifically need an active/shadow recovery topology backed by GPU Memory Service (GMS), DRA, and a backend load format such as `--load-format gms`.
Keep the ModelExpress / ModelStreamer configuration separate unless you have validated a combined workflow for your runtime image and cluster.

### When to Use ModelExpress

| Scenario | Recommended Approach |
|----------|---------------------|
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | ModelExpress P2P |
| Models already on shared storage (NFS) | PVC |
| Models in S3, GCS, Azure Blob Storage, or local safetensors paths | ModelExpress + ModelStreamer |
| Frequent model updates across fleet | ModelExpress P2P, optionally seeded by ModelStreamer |
| ModelExpress server with non-shared storage (RWO PVC, cross-namespace) | ModelExpress with `MODEL_EXPRESS_NO_SHARED_STORAGE=1` |

## See Also

- [Managing Models with DynamoModel](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model) - declarative model management CRD
- [Detailed Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) - Helm chart configuration including ModelExpress
- [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover) - GMS-backed active/shadow engine recovery, separate from model distribution
- [ModelExpress deployment guide](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md) - server, P2P, and ModelStreamer configuration
- [LoRA Adapters](/dynamo/user-guides/lo-ra-adapters) - dynamic adapter loading (separate from base model caching)

# ModelExpress

ModelExpress is a model weight distribution service for faster worker startup in larger Dynamo clusters. Instead of every worker downloading the full model from storage, one worker can publish model weight availability and later workers can pull compatible tensors from that source over NIXL/RDMA. ModelExpress can also pair with ModelStreamer to stream safetensors directly from object storage into GPU memory.

Use ModelExpress when model rollout time, autoscale cold start, or fleet-wide model updates matter more than the simplicity of a shared PVC. For smaller clusters, start with [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching).

## When to Use It

| Scenario | Recommended path |
| --- | --- |
| Small cluster or first deployment | [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) with PVC + download Job |
| Large cluster with many replicas | ModelExpress P2P distribution |
| Models already on shared storage | PVC or shared filesystem path |
| Models in S3, GCS, Azure Blob Storage, or local safetensors paths | ModelExpress + ModelStreamer |
| Frequent model updates across a fleet | ModelExpress P2P, optionally seeded by ModelStreamer |
| ModelExpress server has non-shared storage | ModelExpress with `MODEL_EXPRESS_NO_SHARED_STORAGE=1` |

## How It Works

1. A ModelExpress server runs in the cluster and stores metadata for available model sources.
2. vLLM workers use the ModelExpress loader (`--load-format mx` on newer images, or `mx-source` / `mx-target` on older split-loader images).
3. If a compatible source worker is already serving the model, a new worker pulls model tensors from that source over NIXL/RDMA.
4. If no source is available, the worker falls back to storage. With ModelStreamer, the first worker can stream safetensors from `s3://`, `gs://`, `az://`, or a local path.
5. The Kubernetes operator can inject `MODEL_EXPRESS_URL` into all Dynamo pods from the platform `modelExpressURL` setting.

## Configure the Platform

Set the ModelExpress server URL when installing the Dynamo platform:

```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```

If the ModelExpress server is installed separately, point `dynamo-operator.modelExpressURL` at that service.
The operator injects the value into worker pods as `MODEL_EXPRESS_URL`.

## Configure vLLM Workers

Use a runtime image that includes the `modelexpress` Python package. For ModelStreamer, the image also needs `runai-model-streamer` and the relevant object-storage SDK dependencies.

```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
```

Use the load format supported by your runtime image. ModelExpress v0.3 and newer document the unified `mx` loader. Some older Dynamo images expose `mx-source` and `mx-target` loader names instead.

## Stream Without Shared Storage

If the ModelExpress server cache is on a non-shared volume, workers cannot read the server's local cache path. Set `MODEL_EXPRESS_NO_SHARED_STORAGE=1` on worker pods so the client streams model files from the server over gRPC:

```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MODEL_EXPRESS_NO_SHARED_STORAGE
            value: "1"
```

Use this path when the server has an RWO PVC, runs in a different namespace, or the cluster has no RDMA fabric available. Shared-filesystem mode is still faster when available.
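
One way to choose between the two modes is to look at the access modes of the PVC that backs the ModelExpress server cache. The following is a minimal sketch, not part of ModelExpress or Dynamo: the function name `suggest_cache_mode` and the labels it prints are hypothetical, and it only inspects the PVC. It does not check the namespace or RDMA conditions mentioned above.

```shell
#!/usr/bin/env bash
# Suggest a ModelExpress cache mode from the server PVC's access modes.
# Prints "shared-filesystem" when the PVC offers ReadWriteMany (RWX),
# otherwise "grpc-streaming" (i.e. set MODEL_EXPRESS_NO_SHARED_STORAGE=1).
# The function name and the printed labels are illustrative only.
suggest_cache_mode() {
  local pvc="$1" ns="$2" modes
  # jsonpath prints the access modes separated by spaces, e.g. "ReadWriteOnce".
  modes="$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.accessModes[*]}')" || return 1
  case " $modes " in
    *" ReadWriteMany "*) echo "shared-filesystem" ;;
    *)                   echo "grpc-streaming" ;;
  esac
}
```

Assuming a PVC named `model-express-cache` in the `model-express` namespace (both names are placeholders), `suggest_cache_mode model-express-cache model-express` prints which mode the volume can support. Workers in a different namespace cannot mount the PVC at all, so they need gRPC streaming even when the volume is RWX.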

## Stream From Object Storage

Set `MX_MODEL_URI` when the first worker should stream safetensors directly from object storage or a local mounted path:

```yaml
services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image:
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MX_MODEL_URI
            value: s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct
          - name: RUNAI_STREAMER_CONCURRENCY
            value: "8"
```

| Storage backend | `MX_MODEL_URI` example | Credential options |
| --- | --- | --- |
| S3 or S3-compatible storage | `s3://bucket/path/to/model` | IRSA / workload identity, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_DEFAULT_REGION`, optional `AWS_ENDPOINT_URL` |
| Google Cloud Storage | `gs://bucket/path/to/model` | GKE Workload Identity, Application Default Credentials, or `GOOGLE_APPLICATION_CREDENTIALS` |
| Azure Blob Storage | `az://container/path/to/model` | Managed Identity, service principal env vars, or `AZURE_ACCOUNT_NAME` / `AZURE_ACCOUNT_KEY` |
| Local filesystem or PVC | `/models/meta-llama/Llama-3.1-70B-Instruct` | Mount the path into the worker pod |

Credentials are consumed by the storage SDKs in the worker pod. They do not flow through the ModelExpress server.

## See Also

- [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) - simple PVC-based model caching and the longer ModelExpress background.
- [ModelExpress deployment guide](https://github.com/ai-dynamo/modelexpress/blob/main/docs/DEPLOYMENT.md) - server, P2P, and ModelStreamer configuration.
- [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) - Dynamo platform install options, including `modelExpressURL`.

# Autoscaling

This guide explains how to configure autoscaling for DynamoGraphDeployment (DGD) services using the `sglang-agg` example from `examples/backends/sglang/deploy/agg.yaml`.

## Example DGD

All examples in this guide use the following DGD:

```yaml
# examples/backends/sglang/deploy/agg.yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
  namespace: default
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1

    decode:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
```

**Key identifiers:**

- **DGD name**: `sglang-agg`
- **Namespace**: `default`
- **Services**: `Frontend`, `decode`
- **dynamo_namespace label**: `default-sglang-agg` (used for metric filtering)

## Overview

Dynamo provides flexible autoscaling through the `DynamoGraphDeploymentScalingAdapter` (DGDSA) resource. To have the operator create a DGDSA for a service, follow the Enabling DGDSA for a Service section below. These adapters implement the Kubernetes [Scale subresource](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource), enabling integration with:

| Autoscaler | Description | Best For |
|------------|-------------|----------|
| **KEDA** | Event-driven autoscaling (recommended) | Most use cases |
| **Kubernetes HPA** | Native horizontal pod autoscaling | Simple CPU/memory-based scaling |
| **Dynamo Planner** | LLM-aware autoscaling with SLA optimization | Production LLM workloads |
| **Custom Controllers** | Any scale-subresource-compatible controller | Custom requirements |

> **⚠️ Deprecation Notice**: The `spec.services[X].autoscaling` field in DGD is **deprecated and ignored**. Use DGDSA with HPA, KEDA, or Planner instead. If you have existing DGDs with `autoscaling` configured, you'll see a warning. Remove the field to silence the warning.
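
The identifiers used throughout this guide follow two naming patterns: the `dynamo_namespace` label is `{k8s-namespace}-{dgd-name}`, and the adapter names in later examples are the DGD name followed by the lowercased service name. The sketch below only restates those patterns so you can derive the values for your own DGD. The helper names are hypothetical, and the lowercase rule is inferred from this guide's `sglang-agg` examples rather than from the operator source.

```shell
#!/usr/bin/env bash
# Derive the identifiers this guide uses for a DGD.
# Both helper names are illustrative; they are not Dynamo commands.

# dynamo_namespace label: {k8s-namespace}-{dgd-name}
dynamo_namespace_label() {
  printf '%s-%s\n' "$1" "$2"
}

# Scaling adapter name: {dgd-name}-{service-name, lowercased}
# (pattern assumed from the sglang-agg examples in this guide)
dgdsa_name() {
  printf '%s-%s\n' "$1" "$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
}

dynamo_namespace_label default sglang-agg   # prints default-sglang-agg
dgdsa_name sglang-agg Frontend              # prints sglang-agg-frontend
dgdsa_name sglang-agg decode                # prints sglang-agg-decode
```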

## Architecture

```
┌──────────────────────────────────┐          ┌─────────────────────────────────────┐
│   DynamoGraphDeployment          │          │   Scaling Adapters (auto-created)   │
│   "sglang-agg"                   │          │   (one per service)                 │
├──────────────────────────────────┤          ├─────────────────────────────────────┤
│                                  │          │                                     │
│  spec.services:                  │          │  ┌─────────────────────────────┐    │      ┌──────────────────┐
│                                  │          │  │ sglang-agg-frontend         │◄───┼──────│   Autoscalers    │
│    ┌────────────────────────┐◄───┼──────────┼──│ spec.replicas: 1            │    │      │                  │
│    │ Frontend: 1 replica    │    │          │  └─────────────────────────────┘    │      │  • KEDA          │
│    └────────────────────────┘    │          │                                     │      │  • HPA           │
│                                  │          │  ┌─────────────────────────────┐    │      │  • Planner       │
│    ┌────────────────────────┐◄───┼──────────┼──│ sglang-agg-decode           │◄───┼──────│  • Custom        │
│    │ decode: 1 replica      │    │          │  │ spec.replicas: 1            │    │      │                  │
│    └────────────────────────┘    │          │  └─────────────────────────────┘    │      └──────────────────┘
│                                  │          │                                     │
└──────────────────────────────────┘          └─────────────────────────────────────┘
```

**How it works:**

1. You deploy a DGD with services (Frontend, decode)
2. The operator creates one DGDSA for each service that sets `scalingAdapter.enabled: true`
3. Autoscalers (KEDA, HPA, Planner) target the adapters via `/scale` subresource
4. Adapter controller syncs replica changes to the DGD
5. DGD controller reconciles the underlying pods

## Viewing Scaling Adapters

After deploying the `sglang-agg` DGD with the scaling adapter enabled on both services, verify that the operator created the adapters:

```bash
kubectl get dgdsa -n default

# Example output:
# NAME                  DGD         SERVICE   REPLICAS  AGE
# sglang-agg-frontend   sglang-agg  Frontend  1         5m
# sglang-agg-decode     sglang-agg  decode    1         5m
```

## Replica Ownership Model

When DGDSA is enabled, it becomes the **source of truth** for replica counts. This follows the same pattern as Kubernetes Deployments owning ReplicaSets.

### How It Works

1. **DGDSA owns replicas**: Autoscalers (HPA, KEDA, Planner) update the DGDSA's `spec.replicas`
2. **DGDSA syncs to DGD**: The DGDSA controller writes the replica count to the DGD's service
3. **Direct DGD edits blocked**: A validating webhook prevents users from directly editing `spec.services[X].replicas` in the DGD
4. **Controllers allowed**: Only authorized controllers (operator, Planner) can modify DGD replicas

### Manual Scaling with DGDSA Enabled

When DGDSA is enabled, use `kubectl scale` on the adapter (not the DGD):

```bash
# ✅ Correct - scale via DGDSA
kubectl scale dgdsa sglang-agg-decode --replicas=3

# ❌ Blocked - direct DGD edit rejected by webhook
kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
# Error: spec.services[decode].replicas cannot be modified directly when scaling adapter is enabled;
# use 'kubectl scale dgdsa/sglang-agg-decode --replicas=3' or update the DynamoGraphDeploymentScalingAdapter instead
```

## Enabling DGDSA for a Service

By default, no DGDSA is created for services, allowing direct replica management via the DGD. To enable autoscaling via HPA, KEDA, or Planner, explicitly enable the scaling adapter:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg
spec:
  services:
    Frontend:
      replicas: 2          # ← No DGDSA by default, direct edits allowed

    decode:
      replicas: 1
      scalingAdapter:
        enabled: true      # ← DGDSA created, managed via adapter
```

**When to enable DGDSA:**

- You want to use HPA, KEDA, or Planner for autoscaling
- You want a clear separation between "desired scale" (adapter) and "deployment config" (DGD)
- You want protection against accidental direct replica edits

**When to keep DGDSA disabled (default):**

- You want simple, manual replica management
- You don't need autoscaling for that service
- You prefer direct DGD edits over adapter-based scaling

## Autoscaling with Dynamo Planner

The Dynamo Planner is an LLM-aware autoscaler that optimizes scaling decisions based on inference-specific metrics like Time To First Token (TTFT), Inter-Token Latency (ITL), and KV cache utilization.

**When to use Planner:**

- You want LLM-optimized autoscaling out of the box
- You need coordinated scaling across prefill/decode services
- You want SLA-driven scaling (e.g., target TTFT < 500ms)

**How Planner works:**

Planner is deployed as a service component within your DGD. It:

1. Queries Prometheus for frontend metrics (request rate, latency, etc.)
2. Uses profiling data to predict optimal replica counts
3. Scales prefill/decode workers to meet SLA targets

**Deployment:**

The recommended way to deploy Planner is via `DynamoGraphDeploymentRequest` (DGDR). See the [SLA Planner Quick Start](/dynamo/components/planner/planner-guide) for complete instructions.

Example configurations with Planner:

- `examples/backends/vllm/deploy/disagg_planner.yaml`
- `examples/backends/sglang/deploy/disagg_planner.yaml`
- `examples/backends/trtllm/deploy/disagg_planner.yaml`

For more details, see the [SLA Planner documentation](/dynamo/components/planner/planner-guide).

## Autoscaling with Kubernetes HPA

The Horizontal Pod Autoscaler (HPA) is Kubernetes' native autoscaling solution.

**When to use HPA:**

- You have simple, predictable scaling requirements
- You want to use standard Kubernetes tooling
- You need CPU or memory-based scaling

For custom metrics (like TTFT or queue depth), consider using [KEDA](#autoscaling-with-keda-recommended) instead; it is simpler to configure.

### Basic HPA (CPU-based)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
```

### HPA with Dynamo Metrics

Dynamo exports several metrics useful for autoscaling.
These are available at the `/metrics` endpoint on each frontend pod.

> **See also**: For a complete list of all Dynamo metrics, see the [Metrics Reference](/dynamo/user-guides/observability-local/metrics). For Prometheus and Grafana setup, see the [Prometheus and Grafana Setup Guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup).

#### Available Dynamo Metrics

| Metric | Type | Description | Good for scaling |
|--------|------|-------------|------------------|
| `dynamo_frontend_active_requests` | Gauge | Total concurrent requests from HTTP entry to response complete | ✅ All services |
| `dynamo_frontend_stage_requests{stage,phase}` | Gauge | Requests currently in a given frontend pipeline stage (`preprocess`, `route`, `dispatch`) | ✅ Workers: use `sum(...)` for queue-depth behavior, or `stage="dispatch"` for backend-prefill saturation |
| `dynamo_frontend_time_to_first_token_seconds` | Histogram | TTFT latency | ✅ Workers |
| `dynamo_frontend_inter_token_latency_seconds` | Histogram | ITL latency | ✅ Decode |
| `dynamo_frontend_request_duration_seconds` | Histogram | Total request duration | ⚠️ General |
| `dynamo_frontend_inflight_requests` | Gauge | Concurrent requests to engine | ⚠️ **Deprecated**: use `dynamo_frontend_active_requests` |
| `dynamo_frontend_queued_requests` | Gauge | Requests waiting in HTTP queue | ⚠️ **Deprecated**: use `sum(dynamo_frontend_stage_requests)` across `preprocess` + `route` + `dispatch` |

For the full definition of the `stage` and `phase` labels and derived-signal formulas, see [Stage and phase labels](/dynamo/user-guides/observability-local/metrics#stage-and-phase-labels) in the Metrics Reference.
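
To get a feel for how a value from one of these metrics turns into a replica count, the sketch below applies the scaling rule from the Kubernetes HPA documentation, `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`, and then clamps the result to the configured minimum and maximum. It is a simplified illustration only: it ignores HPA's tolerance band, stabilization windows, and scaling policies, and the function name `hpa_desired_replicas` and the sample numbers are hypothetical.

```shell
#!/usr/bin/env bash
# Estimate the replica count HPA would ask for, given:
#   $1 current replicas   $2 current metric value   $3 target value
#   $4 minReplicas        $5 maxReplicas
# Simplified: no tolerance band, stabilization window, or scaling policies.
hpa_desired_replicas() {
  awk -v cur="$1" -v val="$2" -v tgt="$3" -v min="$4" -v max="$5" 'BEGIN {
    raw  = cur * val / tgt
    want = (raw == int(raw)) ? raw : int(raw) + 1   # ceil() for positive values
    if (want < min) want = min                      # clamp to minReplicas
    if (want > max) want = max                      # clamp to maxReplicas
    print want
  }'
}

# 2 decode replicas, TTFT p95 of 0.9 s against a 0.5 s target, bounds 1..10.
hpa_desired_replicas 2 0.9 0.5 1 10   # prints 4
```

With a 0.5 s target, a p95 of 0.9 s on two replicas gives a raw value of 3.6, which rounds up to four replicas. A p95 well below the target pulls the estimate down toward `minReplicas`.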

#### Metric Labels

Dynamo metrics include these labels for filtering:

| Label | Description | Example |
|-------|-------------|---------|
| `dynamo_namespace` | Unique DGD identifier (`{k8s-namespace}-{dgd-name}`) | `default-sglang-agg` |
| `model` | Model being served | `Qwen/Qwen3-0.6B` |

When you have multiple DGDs in the same namespace, use `dynamo_namespace` to filter metrics for a specific DGD.

#### Example: Scale Decode Service Based on TTFT

Using HPA with Prometheus Adapter requires configuring external metrics.

**Step 1: Configure Prometheus Adapter**

Add this to your Helm values file (e.g., `prometheus-adapter-values.yaml`):

```yaml
# prometheus-adapter-values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc
  port: 9090

rules:
  external:
    # TTFT p95 from frontend - used to scale decode
    - seriesQuery: 'dynamo_frontend_time_to_first_token_seconds_bucket{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
      name:
        as: "dynamo_ttft_p95_seconds"
      metricsQuery: |
        histogram_quantile(0.95,
          sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[5m]))
          by (le, namespace, dynamo_namespace)
        )
```

**Step 2: Install Prometheus Adapter**

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  -n monitoring --create-namespace \
  -f prometheus-adapter-values.yaml
```

**Step 3: Verify the metric is available**

```bash
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces//dynamo_ttft_p95_seconds" | jq
```

**Step 4: Create the HPA**

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-hpa
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode            # ← DGD name + service name (lowercase)
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_ttft_p95_seconds
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"  # ← {namespace}-{dgd-name}
        target:
          type: Value
          value: "500m"  # Scale up when TTFT p95 > 500ms
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # Wait 1 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
        - type: Pods
          value: 2
          periodSeconds: 30
```

**How it works:**

1. Frontend pods export `dynamo_frontend_time_to_first_token_seconds` histogram
2. Prometheus Adapter calculates p95 TTFT per `dynamo_namespace`
3. HPA monitors this metric filtered by `dynamo_namespace: "default-sglang-agg"`
4. When TTFT p95 > 500ms, HPA scales up the `sglang-agg-decode` adapter
5. Adapter controller syncs the replica count to the DGD's `decode` service
6. More decode workers are created, reducing TTFT

#### Example: Scale Based on Queue Depth

"Queue depth" here means the number of requests that have entered the frontend but haven't yet received a first token, i.e. the sum of `dynamo_frontend_stage_requests` across the `preprocess`, `route`, and `dispatch` stages. This replaces the deprecated `dynamo_frontend_queued_requests` gauge.
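
As a quick sanity check of that definition, the same sum can be computed by hand from a single frontend's `/metrics` output. The sketch below is a minimal illustration that assumes the exposition format shown in this guide's troubleshooting output, with no trailing timestamp after the sample value. The function name `pending_requests` and the sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Sum dynamo_frontend_stage_requests over the preprocess, route, and dispatch
# stages from Prometheus text exposition read on stdin.
# pending_requests is an illustrative helper name, not part of Dynamo.
pending_requests() {
  awk '
    /^dynamo_frontend_stage_requests[{]/ && /stage="(preprocess|route|dispatch)"/ {
      total += $NF            # the sample value is the last field on the line
    }
    END { print total + 0 }   # prints 0 when no series matched
  '
}

# Sample input (illustrative values only).
sample='dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 0
dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2'

printf '%s\n' "$sample" | pending_requests   # prints 2
```

Against a live frontend that is port-forwarded to `localhost:8000`, the equivalent check is `curl -s http://localhost:8000/metrics | pending_requests`. Each frontend pod reports its own gauges, so the Prometheus queries in this section sum across pods, while this helper reports one pod at a time.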

Add this rule to your `prometheus-adapter-values.yaml` (alongside the TTFT rule):

```yaml
# Add to rules.external in prometheus-adapter-values.yaml
- seriesQuery: 'dynamo_frontend_stage_requests{namespace!="",stage=~"preprocess|route|dispatch"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    as: "dynamo_frontend_pending_requests"
  metricsQuery: |
    sum(<<.Series>>{<<.LabelMatchers>>}) by (namespace, dynamo_namespace)
```

Then create the HPA:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-decode-queue-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: dynamo_frontend_pending_requests
          selector:
            matchLabels:
              dynamo_namespace: "default-sglang-agg"
        target:
          type: Value
          value: "10"  # Scale up when queue > 10 requests
```

## Autoscaling with KEDA (Recommended)

KEDA (Kubernetes Event-driven Autoscaling) extends Kubernetes with event-driven autoscaling, supporting 50+ scalers including Prometheus.

**Advantages over HPA + Prometheus Adapter:**

- No Prometheus Adapter configuration needed
- PromQL queries are defined in the ScaledObject itself (declarative, per-deployment)
- Easy to update - just `kubectl apply` the ScaledObject
- Can scale to zero when idle
- Supports multiple triggers per object

**When to use KEDA:**

- You want simpler configuration (no Prometheus Adapter to manage)
- You need event-driven scaling (e.g., queue depth, Kafka, etc.)
- You want to scale to zero when idle

### Installing KEDA

```bash
# Add KEDA Helm repo
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace

# Verify installation
kubectl get pods -n keda
```

If you have Prometheus Adapter installed, either uninstall it first (`helm uninstall prometheus-adapter -n monitoring`) or install KEDA with `--set metricsServer.enabled=false` to avoid API conflicts.

### Example: Scale Decode Based on TTFT

Using the `sglang-agg` DGD from `examples/backends/sglang/deploy/agg.yaml`:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15   # Check metrics every 15 seconds
  cooldownPeriod: 60    # Wait 60s before scaling down
  triggers:
    - type: prometheus
      metadata:
        # Update this URL to match your Prometheus service
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        metricName: dynamo_ttft_p95
        query: |
          histogram_quantile(0.95,
            sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
            by (le)
          )
        threshold: "0.5"            # Scale up when TTFT p95 > 500ms (0.5 seconds)
        activationThreshold: "0.1"  # Start scaling when TTFT > 100ms
```

Apply it:

```bash
kubectl apply -f sglang-agg-decode-scaler.yaml
```

### Verify KEDA Scaling

```bash
# Check ScaledObject status
kubectl get scaledobject -n default

# KEDA creates an HPA under the hood - you can see it
kubectl get hpa -n default

# Example output:
# NAME                                REFERENCE                                              TARGETS    MINPODS  MAXPODS  REPLICAS
# keda-hpa-sglang-agg-decode-scaler   DynamoGraphDeploymentScalingAdapter/sglang-agg-decode  45m/500m   1        10       1

# Get detailed status
kubectl describe scaledobject sglang-agg-decode-scaler -n default
```

### Example: Scale Based on Queue Depth

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-queue-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        metricName: dynamo_frontend_pending_requests
        query: |
          sum(dynamo_frontend_stage_requests{dynamo_namespace="default-sglang-agg",stage=~"preprocess|route|dispatch"})
        threshold: "10"  # Scale up when queue > 10 requests
```

### How KEDA Works

KEDA creates and manages an HPA under the hood:

```
┌──────────────────────────────────────────────────────────────────────┐
│  You create: ScaledObject                                            │
│    - scaleTargetRef: sglang-agg-decode                               │
│    - triggers: prometheus query                                      │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│  KEDA Operator automatically creates: HPA                            │
│    - name: keda-hpa-sglang-agg-decode-scaler                         │
│    - scaleTargetRef: sglang-agg-decode                               │
│    - metrics: External (from KEDA metrics server)                    │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│  DynamoGraphDeploymentScalingAdapter: sglang-agg-decode              │
│    - spec.replicas: updated by HPA                                   │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│  DynamoGraphDeployment: sglang-agg                                   │
│    - spec.services.decode.replicas: synced from adapter              │
└──────────────────────────────────────────────────────────────────────┘
```

## Mixed Autoscaling

For disaggregated deployments (prefill + decode), you can use different autoscaling strategies for different services:

```yaml
---
# HPA for Frontend (CPU-based)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-agg-frontend-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-frontend
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

---
# KEDA for Decode (TTFT-based)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sglang-agg-decode-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentScalingAdapter
    name: sglang-agg-decode
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(dynamo_frontend_time_to_first_token_seconds_bucket{dynamo_namespace="default-sglang-agg"}[5m]))
            by (le)
          )
        threshold: "0.5"
```

## Manual Scaling

### With DGDSA Enabled

When DGDSA is enabled, scale via the adapter:

```bash
kubectl scale dgdsa sglang-agg-decode -n default --replicas=3
```

Verify the scaling:

```bash
kubectl get dgdsa sglang-agg-decode -n default

# Output:
# NAME                DGD         SERVICE  REPLICAS  AGE
# sglang-agg-decode   sglang-agg  decode   3         10m
```

If an autoscaler (KEDA, HPA, Planner) is managing the adapter, your change will be overwritten on the next evaluation cycle.

### With DGDSA Disabled (default)

If you've disabled the scaling adapter for a service, edit the DGD directly:

```bash
kubectl patch dgd sglang-agg --type=merge -p '{"spec":{"services":{"decode":{"replicas":3}}}}'
```

Or edit the YAML (no `scalingAdapter.enabled: true` means direct edits are allowed):

```yaml
spec:
  services:
    decode:
      replicas: 3
      # No scalingAdapter.enabled means replicas can be edited directly
```

## Best Practices

### 1. Choose One Autoscaler Per Service

Avoid configuring multiple autoscalers for the same service:

| Configuration | Status |
|---------------|--------|
| HPA for frontend, Planner for prefill/decode | ✅ Good |
| KEDA for all services | ✅ Good |
| Planner only (default) | ✅ Good |
| HPA + Planner both targeting decode | ❌ Bad - they will fight |

### 2. Use Appropriate Metrics

| Service Type | Recommended Metrics | Dynamo Metric |
|--------------|---------------------|---------------|
| Frontend | CPU utilization, request rate | `dynamo_frontend_requests_total` |
| Prefill | Dispatch-stage depth (backend prefill saturation), TTFT | `dynamo_frontend_stage_requests{stage="dispatch"}`, `dynamo_frontend_time_to_first_token_seconds` |
| Decode | ITL, active concurrency | `dynamo_frontend_inter_token_latency_seconds`, `dynamo_frontend_active_requests` |

### 3. Configure Stabilization Windows

Prevent thrashing with appropriate stabilization:

```yaml
# HPA
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately

# KEDA
spec:
  cooldownPeriod: 300
```

### 4. Set Sensible Min/Max Replicas

Always configure minimum and maximum replicas in your HPA/KEDA to prevent:

- Scaling to zero (unless intentional)
- Unbounded scaling that exhausts cluster resources

## Troubleshooting

### Adapters Not Created

```bash
# Check DGD status
kubectl describe dgd sglang-agg -n default

# Check operator logs
kubectl logs -n dynamo-system deployment/dynamo-operator
```

### Scaling Not Working

```bash
# Check adapter status
kubectl describe dgdsa sglang-agg-decode -n default

# Check HPA/KEDA status
kubectl describe hpa sglang-agg-decode-hpa -n default
kubectl describe scaledobject sglang-agg-decode-scaler -n default

# Verify metrics are available in Kubernetes metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
```

### Metrics Not Available

If HPA/KEDA shows `` for metrics:

```bash
# Check if Dynamo metrics are being scraped
kubectl port-forward -n default svc/sglang-agg-frontend 8000:8000
curl http://localhost:8000/metrics | grep dynamo_frontend

# Example output (note: stage_requests has no `model` label; it's per frontend pod):
# dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 5
# dynamo_frontend_stage_requests{stage="preprocess",phase=""} 0
# dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 0
# dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 2
# dynamo_frontend_queued_requests{model="Qwen/Qwen3-0.6B"} 2     # deprecated
# dynamo_frontend_inflight_requests{model="Qwen/Qwen3-0.6B"} 5   # deprecated

# Verify Prometheus is scraping the metrics
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then query: dynamo_frontend_time_to_first_token_seconds_bucket

# Check KEDA operator logs
kubectl logs -n keda deployment/keda-operator
```

### Rapid Scaling Up and Down

If you see unstable scaling:

1. Check if multiple autoscalers are targeting the same adapter
2. Increase `cooldownPeriod` in KEDA ScaledObject
3. Increase `stabilizationWindowSeconds` in HPA behavior

## References

- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [KEDA Documentation](https://keda.sh/)
- [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter)
- [Planner Documentation](/dynamo/components/planner/planner-guide)
- [Dynamo Metrics Reference](/dynamo/user-guides/observability-local/metrics)
- [Prometheus and Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup)

# Rolling Updates

This guide covers how rolling updates work for `DynamoGraphDeployment` (DGD) resources. Rolling updates allow you to update worker configurations (images, resources, environment variables, etc.) with minimal downtime by gradually replacing old pods with new ones.

The behavior of rolling updates depends on the backing resource type of your deployment. DGDs backed by Kubernetes Deployments benefit from **managed rolling updates** with namespace isolation, while Grove and LWS-backed deployments use their native update mechanisms.

## Example

Consider a disaggregated deployment with separate prefill and decode workers. You want to update the tensor parallelism of the decode worker to 2.

**Before** (original deployment):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - decode
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - prefill
```

**After** (updated with parallelism tuning):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-disagg
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - decode
            # vLLM's engine flag for tensor parallelism
            - --tensor-parallel-size
            - "2"
    VllmPrefillWorker:
      componentType: worker
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disaggregation-mode
            - prefill
```

Apply the update:

```bash
kubectl apply -f vllm-disagg.yaml
```

Monitor rolling update progress:

```bash
kubectl get dgd vllm-disagg -n dynamo -o jsonpath='{.status.rollingUpdate}'
```

## Default Behavior (Grove and LWS)

For DGDs backed by **Grove** (PodCliques, PodCliqueSets) or **LWS** (LeaderWorkerSets), the operator does not manage rolling updates directly.
Instead, these deployments rely on the native rolling update mechanisms of their underlying resources. ### What Happens - A modification to the pod spec of a service triggers the rolling update behavior of the backing resource. In the example above, the modification to the pod spec of the decode worker triggers the rolling update of just the decode worker. - For Grove, PodCliques (PCLQ) and PodCliqueScalingGroups use a static rolling update strategy of `maxUnavailable: 1` and `maxSurge: 0`. LWS follows the same `maxUnavailable: 1` and `maxSurge: 0` strategy. - **Old and new workers operate within the same Dynamo namespace.** This means old and new workers can discover each other through service discovery. The following diagram illustrates the rolling update of the decode worker in a Grove PodCliqueSet (PCS). Only the decode PodClique is updated — the frontend and prefill PodCliques are unaffected: ``` ┌─ PodCliqueSet: vllm-disagg ───────────────────────────────────────────────────────┐ │ │ │ ┌─ PCLQ: Frontend ──────┐ ┌─ PCLQ: VllmPrefillWorker ─┐ │ │ │ │ │ │ │ │ │ ┌──────────────────┐ │ │ ┌──────────────────────┐ │ │ │ │ │ Pod (v1) ✓ │ │ │ │ Pod (v1) ✓ │ │ No changes — │ │ │ └──────────────────┘ │ │ └──────────────────────┘ │ not rolling │ │ │ │ │ │ │ │ └────────────────────────┘ └────────────────────────────┘ │ │ │ │ ┌─ PCLQ: VllmDecodeWorker ──────────────────────────────────────────────────────┐ │ │ │ │ │ │ │ maxUnavailable: 1, maxSurge: 0 │ │ │ │ │ │ │ │ ┌──────────────────────┐ ┌──────────────────────┐ │ │ │ │ │ Pod (v2) ✓ NEW │ │ Pod (v1) Terminating │ ← rolling one at a time │ │ │ │ └──────────────────────┘ └──────────────────────┘ │ │ │ │ │ │ │ └────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────┐ │ │ │ Dynamo Namespace: vllm-disagg │ │ │ │ │ │ │ │ All v1 and v2 pods registered │ │ │ │ and discoverable by each other │ │ │ └──────────────────────────────────┘ │ │ │ 
└────────────────────────────────────────────────────────────────────────────────────┘ ``` ### Implications for Disaggregated Deployments Because old and new workers share the same Dynamo namespace, they are grouped together by the router. In a disaggregated setup, this can lead to cross-generation communication — for example, the router might send a request from a newly deployed prefill worker to an old decode worker (or vice versa). If the old and new versions are incompatible, this can result in errors. For Grove and LWS deployments with disaggregated prefill/decode workers, be aware that during a rolling update, new workers may communicate with old workers. Ensure that your worker versions are backward-compatible, or consider using Deployment-backed DGDs which provide namespace isolation during updates. Managed rolling updates with namespace isolation are planned for Grove and LWS-backed deployments in a future release. See [Future Work](#future-work) for details. ## Managed Rolling Updates (Deployments) For DGDs backed by Kubernetes **Deployments** (single-node, non-multinode services), the Dynamo operator implements managed rolling updates with namespace isolation. This is tracked in the DGD status and provides stronger guarantees for disaggregated deployments. ### How It Works 1. **Spec change detection** — The operator computes a hash of all worker service specs (prefill, decode, and worker component types). When this hash changes, a rolling update is triggered. 2. **Namespace isolation** — New worker `DynamoComponentDeployments` (DCDs) are created with the spec hash appended to their Dynamo namespace. This means new workers register in a different Dynamo namespace than old workers, preventing cross-generation discovery. A new prefill worker will only discover and route to new decode workers, avoiding compatibility issues. 3. 
**Gradual replacement** — The operator gradually scales up new worker DCDs and scales down old ones, respecting `maxSurge` and `maxUnavailable` constraints. When a worker service is updated (all new replicas are ready, all old replicas are terminated), it is marked as completed. 4. **Cleanup** — Once all worker services have completed the transition, old worker DCDs are deleted and the rolling update is marked as completed. ``` ┌─ DynamoGraphDeployment: vllm-disagg ──────────────────────────────────────────────┐ │ │ │ ┌─ DCD: Frontend ──────────┐ │ │ │ │ │ │ │ ┌────────────────────┐ │ No changes — │ │ │ │ Pod (v1) ✓ │ │ not a worker component │ │ │ └────────────────────┘ │ │ │ │ │ │ │ └──────────────────────────┘ │ │ │ │ ┌─ OLD DCDs (hash: a1b2c3d4) ──────────────────────────────────────────────────┐ │ │ │ │ │ │ │ ┌─ DCD: VllmDecodeWorker-a1b2c3d4 ──┐ ┌─ DCD: VllmPrefillWorker-a1b2c3d4 ┐│ │ │ │ │ │ │ ││ │ │ │ │ ┌──────────────────────┐ │ │ ┌─────────────────────┐ ││ │ │ │ │ │ Pod (v1) Terminating │ │ │ │ Pod (v1) Terminating│ ││ │ │ │ │ └──────────────────────┘ │ │ └─────────────────────┘ ││ │ │ │ │ │ │ ││ │ │ │ │ Dynamo Namespace: vllm-disagg │ │ Dynamo Namespace: vllm-disagg ││ │ │ │ │ -a1b2c3d4 │ │ -a1b2c3d4 ││ │ │ │ └────────────────────────────────────┘ └───────────────────────────────────┘│ │ │ │ │ │ │ └───────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ ┌─ NEW DCDs (hash: f5e6d7c8) ──────────────────────────────────────────────────┐ │ │ │ │ │ │ │ ┌─ DCD: VllmDecodeWorker-f5e6d7c8 ──┐ ┌─ DCD: VllmPrefillWorker-f5e6d7c8 ┐│ │ │ │ │ │ │ ││ │ │ │ │ ┌──────────────────────┐ │ │ ┌─────────────────────┐ ││ │ │ │ │ │ Pod (v2) ✓ NEW │ │ │ │ Pod (v2) ✓ NEW │ ││ │ │ │ │ └──────────────────────┘ │ │ └─────────────────────┘ ││ │ │ │ │ │ │ ││ │ │ │ │ Dynamo Namespace: vllm-disagg │ │ Dynamo Namespace: vllm-disagg ││ │ │ │ │ -f5e6d7c8 │ │ -f5e6d7c8 ││ │ │ │ └────────────────────────────────────┘ └───────────────────────────────────┘│ │ │ 
│ │ │ │ └───────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ Old and new workers are in different Dynamo namespaces — │ │ new prefill only discovers new decode, preventing cross-generation routing. │ │ │ └────────────────────────────────────────────────────────────────────────────────────┘ ``` Only worker component types (`worker`, `prefill`, `decode`) participate in managed rolling updates. Non-worker components like `frontend` are updated in-place without namespace isolation. ### Rolling Update Phases The rolling update progress is tracked in `.status.rollingUpdate` with the following phases: | Phase | Description | |-------|-------------| | `Pending` | A spec change was detected and the rolling update has been initialized. | | `InProgress` | New worker DCDs are being scaled up and old ones are being scaled down. | | `Completed` | All worker services have transitioned to new replicas. Old DCDs have been cleaned up. | The status also tracks: - `startTime` — When the rolling update began. - `endTime` — When the rolling update completed. - `updatedServices` — List of worker services that have completed the transition. ### Configuring maxSurge and maxUnavailable You can configure the rolling update strategy per service using annotations: | Annotation | Description | Default | |------------|-------------|---------| | `nvidia.com/deployment-rolling-update-max-surge` | Maximum number of extra pods that can be created above the desired count during the update. | `25%` | | `nvidia.com/deployment-rolling-update-max-unavailable` | Maximum number of pods that can be unavailable during the update. | `25%` | Values can be absolute integers (e.g., `"1"`, `"2"`) or percentages (e.g., `"25%"`, `"50%"`). Percentages are resolved against the desired replica count — rounding up for `maxSurge` and rounding down for `maxUnavailable`. 
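The percentage rule above can be sketched in a few lines of Python. This is an illustration of the documented rounding only, not the operator's implementation; the function names `resolve` and `resolve_strategy` are hypothetical.

```python
def resolve(value: str, replicas: int, round_up: bool) -> int:
    """Resolve an annotation value ("2" or "25%") to a pod count."""
    if value.endswith("%"):
        # Integer arithmetic avoids floating-point rounding surprises.
        scaled = int(value[:-1]) * replicas
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def resolve_strategy(max_surge: str, max_unavailable: str, replicas: int):
    """Return (surge, unavailable) pod counts for the desired replica count."""
    surge = resolve(max_surge, replicas, round_up=True)              # rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return surge, unavailable


print(resolve_strategy("25%", "25%", 4))  # (1, 1)
print(resolve_strategy("25%", "25%", 3))  # (1, 0)
print(resolve_strategy("1", "0", 4))      # (1, 0)
```

With the default `25%` values and 3 replicas, the surge rounds up to 1 while the unavailable count rounds down to 0, so the update proceeds by adding one pod at a time.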
The operator ensures at least one of `maxSurge` or `maxUnavailable` is greater than zero to guarantee forward progress. **Example** — zero-downtime update with surge capacity: ```yaml VllmPrefillWorker: componentType: worker subComponentType: prefill replicas: 4 annotations: nvidia.com/deployment-rolling-update-max-surge: "1" nvidia.com/deployment-rolling-update-max-unavailable: "0" ``` This ensures that all 4 existing prefill replicas remain available while 1 new replica is brought up at a time. **Example** — fast update allowing temporary capacity reduction: ```yaml VllmDecodeWorker: componentType: worker subComponentType: decode replicas: 8 annotations: nvidia.com/deployment-rolling-update-max-surge: "0" nvidia.com/deployment-rolling-update-max-unavailable: "2" ``` This avoids creating extra pods but allows up to 2 decode replicas to be unavailable at a time, speeding up the transition. ### Worker Hash and DCD Naming Worker DCDs always include a hash suffix derived from the worker specs: `{dgd-name}-{service-name}-{hash}` (e.g., `vllm-disagg-vllmdecodeworker-a1b2c3d4`). During a rolling update, the new worker DCDs are created with the new spec hash while the old DCDs retain the previous hash, allowing both generations to coexist: - **Old worker DCD:** `vllm-disagg-vllmdecodeworker-a1b2c3d4` (previous hash) - **New worker DCD:** `vllm-disagg-vllmdecodeworker-f5e6d7c8` (new hash) The hash is computed from a SHA-256 digest of all worker service specs (excluding non-pod-template fields like `replicas`, `autoscaling`, and `ingress`). This means: - Scaling changes (replica count) do **not** trigger a rolling update. - Pod template changes (image, resources, env vars, volumes, etc.) **do** trigger a rolling update. - The hash covers **all** worker services together — changing any single worker's spec triggers a rolling update for all workers. 
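The three consequences listed above can be illustrated with a small Python sketch. It is not the operator's code: the JSON canonicalization, the 8-character truncation (chosen to match the example suffixes), and the name `worker_hash` are all assumptions made for illustration.

```python
import copy
import hashlib
import json

# Fields the text lists as excluded from the hash (non-pod-template fields).
EXCLUDED_FIELDS = {"replicas", "autoscaling", "ingress"}
# Component types that participate in managed rolling updates.
WORKER_TYPES = {"worker", "prefill", "decode"}


def worker_hash(services: dict) -> str:
    """Hypothetical sketch: one short hash over all worker service specs."""
    workers = {
        name: {k: v for k, v in spec.items() if k not in EXCLUDED_FIELDS}
        for name, spec in services.items()
        if spec.get("componentType") in WORKER_TYPES
    }
    # Assumption: canonical JSON with sorted keys is the hashed representation.
    canonical = json.dumps(workers, sort_keys=True, separators=(",", ":"))
    # Assumption: keep 8 hex characters, as in the example DCD names.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


services = {
    "Frontend": {"componentType": "frontend", "replicas": 1},
    "VllmDecodeWorker": {
        "componentType": "worker",
        "replicas": 1,
        "args": ["--disaggregation-mode", "decode"],
    },
    "VllmPrefillWorker": {
        "componentType": "worker",
        "subComponentType": "prefill",
        "replicas": 1,
        "args": ["--disaggregation-mode", "prefill"],
    },
}
base = worker_hash(services)

scaled = copy.deepcopy(services)
scaled["VllmDecodeWorker"]["replicas"] = 4  # scaling change only

changed = copy.deepcopy(services)
changed["VllmDecodeWorker"]["args"] += ["--tensor-parallel-size", "2"]

print(worker_hash(scaled) == base)   # True: replicas are excluded
print(worker_hash(changed) == base)  # False: a pod template field changed
```

Because a single digest covers every worker service, the change to the decode worker alone produces a new hash for the prefill worker's DCD as well.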
The current worker hash is stored as the annotation `nvidia.com/current-worker-hash` on the DGD resource, and individual worker DCDs are labeled with `nvidia.com/dynamo-worker-hash` for filtering. ### Status During Rolling Updates During a rolling update, the DGD status aggregates information from both old and new worker DCDs: - **Replicas** — Total count across old and new. - **ReadyReplicas** — Aggregate ready count across old and new. - **UpdatedReplicas** — Only new worker replicas. This provides a holistic view of the deployment's health during the transition. ## Comparison | Aspect | Grove / LWS | Deployments (Managed) | |--------|-------------|----------------------| | Update mechanism | Native resource rolling update | Operator-managed with DCD lifecycle | | Namespace isolation | No — old and new share the same namespace | Yes — hash-based namespace separation | | Cross-generation discovery | Possible — old and new workers can see each other | Prevented — new workers only discover new workers | | maxSurge / maxUnavailable | Fixed (`maxUnavailable: 1`, `maxSurge: 0` for Grove) | Configurable per service via annotations | | Status tracking | Native resource status | DGD `.status.rollingUpdate` with phase and per-service tracking | | Multinode support | Yes | No (single-node only) | ## Future Work The following enhancements are planned for future releases: - **Managed rolling updates for Grove and LWS** — Extending managed rolling updates with namespace isolation to Grove and LWS-backed deployments, providing the same cross-generation discovery protection that Deployment-backed DGDs have today. - **Coordinated worker updates** — Currently, prefill and decode workers are updated independently, which can result in an imbalance between old and new sets during the transition. Future releases will coordinate the rollout across worker types. 
- **Partitioned rollouts** — The ability to roll out updates to a percentage of workers (e.g., 30%), pause, observe metrics, and then continue. This enables canary-style deployments for safer rollouts. - **DGD-level rolling update configuration** — The ability to configure `maxSurge` and `maxUnavailable` at the DGD API level, regardless of the backing resource type. # Disaggregated Inference Communication Guide This guide explains how prefill and decode workers communicate in Dynamo's disaggregated inference architecture on Kubernetes. It answers the frequently asked question: **Why can't prefill and decode workers use NVLink to communicate on the same node?** ## Summary - **NVLink cannot be used between Kubernetes pods** due to process isolation and GPU partitioning - **RDMA (InfiniBand, RoCE, or AWS EFA) is required** for production disaggregated deployments - **Without RDMA, expect 200-500x performance degradation** in Time To First Token (TTFT) — observed ~98s TTFT with TCP vs ~200-500ms with RDMA - **UCX or libfabric** are the communication layers that NIXL uses to transfer KV cache between workers - **Topology-aware KV transfer** can constrain or bias decode routing so KV transfers stay within a selected topology domain such as zone or rack. See [Topology-Aware KV Transfer](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer). --- ## Architecture Overview ### Communication Stack Disaggregated inference communication stack showing NIXL, UCX/libfabric, and transport layers ### Component Responsibilities | Component | Role | Location | |-----------|------|----------| | **NIXL** | High-level KV cache transfer API | Dynamo runtime library | | **UCX or libfabric** | Low-level communication framework | System library | | **Transports** | Physical data movement | Hardware/kernel drivers | --- ## Why NVLink Cannot Be Used Between Pods ### The Fundamental Constraint NVLink is a **direct GPU-to-GPU interconnect** that operates at the hardware level. 
It requires: 1. **Same process** - Both GPUs must be visible to a single process so `cudaDeviceEnablePeerAccess()` can be called 2. **Direct memory access** - Process must have permission to access both GPU memory regions 3. **Peer-to-peer mapping** - CUDA runtime must establish memory mappings between GPUs **Kubernetes pods violate all three requirements:** Why NVLink cannot work between Kubernetes pods due to process isolation ### Technical Explanation 1. **Process Isolation**: Kubernetes pods run in separate Linux namespaces. Even on the same node, Pod A cannot directly access Pod B's memory space. 2. **GPU Partitioning**: The Kubernetes device plugin assigns specific GPUs to each pod via `CUDA_VISIBLE_DEVICES`. Pod A's GPU 0 and Pod B's GPU 0 are physically different devices. 3. **Process/Namespace Isolation**: Each pod runs in a separate process namespace. NVLink peer-to-peer transfers require both GPUs to be within the same process so `cudaDeviceEnablePeerAccess()` can be called. 4. **Memory Registration**: NVLink transfers use `cudaMemcpy` with peer access enabled. This requires calling `cudaDeviceEnablePeerAccess()` - impossible across process boundaries. 
### Where NVLink DOES Work

NVLink works **within a pod** for parallelism strategies (TP, EP) where all GPUs are in the same process:

```yaml
# Decode worker with TP=4 uses NVLink between its 4 GPUs
VLLMDecodeWorker:
  resources:
    limits:
      gpu: "4" # All 4 GPUs visible to single process
  args:
    - --tensor-parallel-size
    - "4" # NVLink used for TP/EP communication within pod
```

---

## Supported Communication Options

### Transport Comparison

| Transport | Bandwidth | Latency | Same-Node | Cross-Node | GPU Direct |
|-----------|-----------|---------|-----------|------------|------------|
| **NVLink** | 450-900 GB/s | ~µs | ✅ (intra-pod only) | ❌ | ✅ |
| **InfiniBand RDMA** | 20-50 GB/s | ~1 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **RoCE RDMA** | 10-25 GB/s | ~2 µs | ✅ | ✅ | ✅ (with GPUDirect) |
| **TCP** | 1-3 GB/s | ~50 µs | ✅ | ✅ | ❌ (host staging) |

### Same-Node Communication

When prefill and decode workers are on the **same physical node**:

*Figure: Same-node RDMA communication between prefill and decode pods.*

**Options (best to worst):**

1. InfiniBand RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
2. RoCE RDMA with GPUDirect → GPU-to-GPU, bypasses CPU
3. Host-staged RDMA → GPU→CPU→RDMA→CPU→GPU
4. TCP (fallback) → GPU→CPU→TCP→CPU→GPU

**Best Practice**: Use RDMA even for same-node communication. The overhead is minimal and it provides consistent behavior whether pods land on the same or different nodes.

### Cross-Node Communication

When prefill and decode workers are on **different nodes**:

*Figure: Cross-node RDMA communication between prefill and decode pods on separate nodes.*

**Requirements for optimal cross-node performance:**

- RDMA network fabric (InfiniBand, RoCE, or AWS EFA)
- GPUDirect RDMA enabled (GPU memory registered with NIC)
- Proper UCX or libfabric configuration

---

## UCX Configuration Reference

### Environment Variables

UCX behavior is controlled through environment variables. Set these on both prefill and decode worker pods.
#### Core Transport Selection

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
```

| Transport | Description | When to Use |
|-----------|-------------|-------------|
| `rc_x` | Reliable Connection (accelerated) | Primary RDMA transport |
| `rc` | Reliable Connection (standard) | Fallback RDMA |
| `dc_x` | Dynamically Connected (accelerated) | Scalable RDMA (many endpoints) |
| `dc` | Dynamically Connected (standard) | Fallback scalable RDMA |
| `cuda_copy` | GPU↔Host memory staging | Required for GPU buffers |
| `cuda_ipc` | CUDA IPC (same-node, same-pod) | Intra-pod GPU transfers |
| `tcp` | TCP sockets | Fallback when RDMA unavailable |
| `srd` | Scalable Reliable Datagram (AWS EFA) | AWS-specific (provided by EFA, not core UCX) |

**Excluding transports**: Use `^` prefix to exclude (e.g., `UCX_TLS=^mm` excludes memory mapping).

**Note**: When specifying `UCX_TLS` explicitly with GPU memory, you must include `cuda_copy` or `cuda_ipc` for UCX to recognize GPU buffers.

#### Rendezvous Protocol Settings

```yaml
env:
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
```

| Variable | Value | Description |
|----------|-------|-------------|
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA GET (receiver pulls data) |
| `UCX_RNDV_SCHEME` | `put_zcopy` | Zero-copy RDMA PUT (sender pushes data) |
| `UCX_RNDV_SCHEME` | `auto` | Let UCX choose based on message size |
| `UCX_RNDV_THRESH` | `0` | Use rendezvous for all message sizes |
| `UCX_RNDV_THRESH` | `8192` | Use rendezvous for messages ≥8KB |
| `UCX_RNDV_THRESH` | `auto` | Let UCX calculate optimal threshold |

**Recommendation**: Use `get_zcopy` with threshold `0` for KV cache transfers (always large).

> **⚠️ AWS EFA Exception**: Do NOT use `get_zcopy` on AWS with Ubuntu 24.04 + Kernel ≥6.8. See [AWS EFA Configuration](#aws-efa-configuration) for required settings.
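The `UCX_TLS` note under Core Transport Selection (an explicit list must keep `cuda_copy` or `cuda_ipc`, and a `^` prefix excludes transports) can be checked before deploying. The Python helper below is a minimal sketch with a hypothetical name, `gpu_transports_enabled`; it only understands literal transport names plus `all`, and does not model UCX aliases or any other UCX selection logic.

```python
GPU_TRANSPORTS = {"cuda_copy", "cuda_ipc"}


def gpu_transports_enabled(ucx_tls: str) -> bool:
    """Return True if a UCX_TLS value leaves cuda_copy or cuda_ipc available."""
    value = ucx_tls.strip()
    if value.startswith("^"):
        # Exclusion list: everything is allowed except the named transports.
        excluded = {t.strip() for t in value[1:].split(",") if t.strip()}
        return not GPU_TRANSPORTS <= excluded
    included = {t.strip() for t in value.split(",") if t.strip()}
    if "all" in included:
        return True
    # Explicit list: at least one CUDA transport must be named.
    return bool(GPU_TRANSPORTS & included)


print(gpu_transports_enabled("rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"))  # True
print(gpu_transports_enabled("rc_x,rc"))                             # False
print(gpu_transports_enabled("^mm"))                                 # True
```

A check like this could run in CI against the `env` section of a worker spec to catch a `UCX_TLS` value that silently drops GPU buffer support.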
#### Memory Registration

```yaml
env:
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
```

| Method | Description |
|--------|-------------|
| `odp` | On-Demand Paging (dynamic registration) |
| `rcache` | Registration cache (reuse registrations) |
| `direct` | Direct registration (each transfer) |

#### Debugging and Diagnostics

```yaml
env:
  - name: UCX_LOG_LEVEL
    value: "info" # Options: fatal, error, warn, info, debug, trace, data, func
  - name: UCX_LOG_FILE
    value: "/tmp/ucx.log" # Optional: log to file instead of stdout
```

**Note**: UCX statistics (`UCX_STATS_DEST`, `UCX_STATS_TRIGGER`) require UCX compiled with `--enable-stats` flag, which is not enabled in default builds.

### Complete Production Configuration

```yaml
env:
  # Transport selection - RDMA with GPU support
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  # Rendezvous for large transfers
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
  # Memory registration optimization
  - name: UCX_IB_REG_METHODS
    value: "odp,rcache"
  # RDMA settings
  - name: UCX_IB_GID_INDEX
    value: "3" # RoCE v2 GID index (cluster-specific)
```

### InfiniBand Configuration

For clusters with InfiniBand RDMA (e.g., ConnectX NICs), use UCX with the `rc` (Reliable Connection) transport. This is the standard path for on-premises and bare-metal Kubernetes clusters.

**RDMA Resources:** Request one `rdma/ib` device per GPU. The RDMA device plugin injects `/dev/infiniband/*` devices automatically:

```yaml
resources:
  limits:
    gpu: "4"
    custom:
      rdma/ib: "4"
```

No pod annotations are needed. InfiniBand devices are injected by the device plugin.

**Security Context:** Add `IPC_LOCK` and `SYS_RESOURCE` capabilities.
`IPC_LOCK` allows RDMA memory pinning, and `SYS_RESOURCE` allows memlock limit escalation:

```yaml
securityContext:
  runAsUser: 0
  capabilities:
    add:
      - IPC_LOCK
      - SYS_RESOURCE
```

**Environment Variables (worker containers):**

```yaml
env:
  # --- UCX (RDMA transport) ---
  - name: UCX_TLS
    value: "rc_x,rc,cuda_copy,cuda_ipc"
  - name: UCX_NET_DEVICES
    value: "<device>:1" # e.g. "mlx5_0:1"; run `ibv_devinfo` to find your device
  - name: UCX_IB_ADDR_TYPE
    value: "eth" # required for cross-pod IB on Kubernetes
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
  - name: UCX_RC_TIMEOUT
    value: "600s"
  - name: UCX_KEEPALIVE_INTERVAL
    value: "300s"
```

| Variable | Description |
|----------|-------------|
| `UCX_TLS` | `rc_x` (accelerated RC) listed first for optimal RDMA performance |
| `UCX_NET_DEVICES` | Bind to a specific IB device. Run `ibv_devinfo` inside a pod to list available devices. Use a non-bonded device with a valid LID. |
| `UCX_IB_ADDR_TYPE` | Must be `eth` for cross-pod communication on Kubernetes. Without this, UCX uses LID-based addressing which does not route between pods. |
| `UCX_RNDV_SCHEME` | `get_zcopy` enables zero-copy RDMA GET, optimal for large KV cache transfers |

> **Note**: `UCX_IB_ADDR_TYPE=eth` is the most common missing setting when bringing up NIXL disagg on InfiniBand clusters. If NIXL init succeeds but transfers fail with `NIXL_ERR_REMOTE_DISCONNECT`, this is likely the cause.

**Known Issue (bonded IB devices):** Some clusters expose bonded InfiniBand devices (e.g., `mlx5_bond_0`) with LID=0. If UCX selects a bonded device, transfers may fail. Verify device LIDs and select a non-bonded device:

```bash
# Inside a pod with rdma/ib resources:
ibv_devinfo | grep -E "hca_id|lid"
# Use a device with a non-zero LID in UCX_NET_DEVICES
```

### AWS EFA Configuration

NIXL supports **libfabric** as the backend for AWS EFA deployments.
This is the **recommended approach** for disaggregated inference on AWS, achieving ~9.6 GB/s KV transfer bandwidth. See the [AWS EFA with NIXL documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nixl.html) for complete setup instructions. **Requirements:** - EFA installer version **1.47.0** or later - Libfabric (installed via EFA installer at `/opt/amazon/efa`) - GDRCopy for GPU Direct RDMA operations (GPU Operator v26.x installs this automatically) - EFA-enabled container image (e.g., `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-efa-amd64`) **Kernel Compatibility:** GDRCopy v2.5.1 has a build failure on kernel 6.15+ due to a `vm_flags_set` redefinition. Pin your Ubuntu EKS AMI to kernel 6.14 or earlier until GDRCopy v2.5.2 is available in GPU Operator. | Kernel Version | GDRCopy v2.5.1 | GDRCopy v2.5.2 | |----------------|----------------|----------------| | 6.14 and below | ✅ Works | ✅ Works | | 6.15+ | ❌ Build fails | ✅ Works | **Pod Anti-Affinity (Required):** EFA is designed for **cross-node** communication. Prefill and decode workers must be scheduled on **different nodes** to avoid EAGAIN errors during KV transfer. ```yaml VllmDecodeWorker: extraPodSpec: affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: nvidia.com/dynamo-component operator: In values: - VllmPrefillWorker topologyKey: kubernetes.io/hostname ``` > **Note**: Anti-affinity only needs to be configured on one side (here, the decode worker). The Kubernetes scheduler enforces the constraint symmetrically—if decode cannot be placed with prefill, they will end up on different nodes regardless of which pod has the rule. **EFA Resource Requests:** Request EFA interfaces in your pod spec. The p5.48xlarge instance has **32 EFA interfaces** (32 network cards × 1 interface each) with 3200 Gbps total bandwidth. 
The number of interfaces to allocate per worker depends on your deployment:

| Deployment | EFA per Worker | Rationale |
|------------|----------------|-----------|
| 1P + 1D per node pair | 4 | Achieved ~9.6 GB/s; leaves 28 interfaces for other pods |
| Multi-worker per node | 2-4 | Balance between workers sharing the node |
| Maximum bandwidth | 8-16 | For very large KV cache transfers or TP>1 |

Example with 4 EFA interfaces (validated configuration):

```yaml
extraPodSpec:
  mainContainer:
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        vpc.amazonaws.com/efa: "4"
      requests:
        vpc.amazonaws.com/efa: "4"
```

> **Note**: NIXL/libfabric automatically stripes traffic across all allocated EFA interfaces. The 4-interface configuration achieved ~9.6 GB/s in testing, which is sufficient for Llama-3.1-8B KV cache transfers at ISL=8000. Increase the count if your workload requires higher bandwidth (e.g., larger models or higher TP).

**Environment Variables:**

```yaml
env:
  - name: NIXL_LOG_LEVEL
    value: "INFO"
  - name: LD_LIBRARY_PATH
    value: "/usr/local/nixl/lib/x86_64-linux-gnu:/opt/amazon/efa/lib64:$(LD_LIBRARY_PATH)"
```

**vLLM Configuration:**

```bash
vllm serve \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device":"cuda","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'
```

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `kv_connector` | `NixlConnector` | Enables NIXL for KV-cache transfer |
| `kv_role` | `kv_both` | Symmetric functionality (producer and consumer) |
| `kv_buffer_device` | `cuda` | Uses GPU memory for KV-cache buffer |
| `backends` | `["LIBFABRIC"]` | Routes NIXL traffic over EFA |

**Verification:**

```bash
# Confirm EFA/libfabric installation
fi_info -p efa -t FI_EP_RDM

# Verify GDRCopy device
ls -la /dev/gdrdrv

# Check NIXL initialization in pod logs (should show 32 EFA devices on p5.48xlarge)
kubectl logs <pod-name> | grep -i "NIXL\|libfabric\|efa"
```

**Expected Log Output:**

```text
NIXL INFO Loaded backend plugin: LIBFABRIC
NIXL INFO Found 32 fabric devices
```

---

## Deployment Configuration

### Kubernetes Resource Requirements

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VLLMPrefillWorker:
      resources:
        limits:
          gpu: "2"
      extraPodSpec:
        mainContainer:
          securityContext:
            capabilities:
              add: ["IPC_LOCK"] # Required for RDMA memory pinning
          resources:
            limits:
              rdma/ib: "2" # RDMA resources (match TP size)
            requests:
              rdma/ib: "2"
```

### Required Capabilities and Resources

| Setting | Purpose | Notes |
|---------|---------|-------|
| `IPC_LOCK` capability | Pin memory for RDMA | Bypasses RLIMIT_MEMLOCK; required for `ibv_reg_mr()` to pin GPU/host buffers |
| `rdma/ib` resources | RDMA NIC access | Provided by RDMA device plugin |
| `sharedMemory.size` | IPC between processes | 16Gi for vLLM, 80Gi for TRT-LLM |

### Infrastructure Prerequisites

1. **RDMA Device Plugin**: Exposes `rdma/ib` or `vpc.amazonaws.com/efa` resources to Kubernetes

   ```bash
   # InfiniBand/RoCE
   kubectl get nodes -o jsonpath='{.items[*].status.allocatable.rdma/ib}'

   # AWS EFA
   kubectl get nodes -o jsonpath='{.items[*].status.allocatable.vpc\.amazonaws\.com/efa}'
   ```

2. **RDMA Network**: One of:
   - InfiniBand or RoCE fabric
   - AWS EFA (Elastic Fabric Adapter)

3. **GPUDirect RDMA** (optional but recommended):
   - NVIDIA driver with GPUDirect enabled
   - `nvidia-peermem` kernel module loaded (InfiniBand/RoCE)
   - GDRCopy installed (AWS EFA with libfabric)

---

## Diagnostics and Performance Validation

### Pre-Deployment Validation

#### 1. Verify RDMA Availability

```bash
# Check RDMA devices on node
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- bash
ibv_devinfo
```

Expected output shows InfiniBand or RoCE devices:

```text
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 28.35.2000
    ...
```

#### 2. Check UCX Transport Capabilities

```bash
# Inside a Dynamo worker pod
ucx_info -d
```

Look for GPU memory support:

```text
# Memory domain: mlx5_0
#     Component: ib
#     memory types: host (access,reg,cache), cuda (access,reg,cache)
#                                            ^^^^ GPU memory supported
```

**If you only see `host`**: GPUDirect RDMA is not working. KV transfers will use host staging.

#### 3. Test UCX Performance

```bash
# Server (on decode worker pod)
ucx_perftest -t tag_bw -n 100 -s 134217728

# Client (on prefill worker pod)
ucx_perftest <server-ip> -t tag_bw -n 100 -s 134217728
```

**Expected bandwidth**:

- InfiniBand HDR: 20-25 GB/s per port
- RoCE 100GbE: 10-12 GB/s
- TCP fallback: 1-2 GB/s

### NIXL Benchmark Tool

Deploy the NIXL benchmark to validate end-to-end KV transfer performance:

```bash
cd deploy/pre-deployment/nixl
./build_and_deploy.sh
```

This deploys a benchmark that measures actual GPU-to-GPU transfer rates through NIXL.

### Runtime Diagnostics

#### Verify NIXL Backend Initialization

```bash
kubectl logs <pod-name> | grep -i "NIXL\|UCX"
```

**Good output**:

```text
NIXL INFO Backend UCX was instantiated
```

**Bad output** (RDMA not working):

```text
UCX WARN no RDMA transports available
NIXL INFO falling back to TCP transport
```

#### Monitor Transfer Performance

Check Grafana dashboards for:

- **NIXL transfer bandwidth**: Should show GB/s, not MB/s
- **KV cache transfer latency**: Should be under 500ms for typical workloads

**Red flags indicating RDMA issues**:

- Transfer bandwidth under 1 GB/s
- TTFT > 10 seconds
- `Unsupported operation` errors in logs

### Common Diagnostic Commands

```bash
# Check UCX transport selection
kubectl exec <pod-name> -- env | grep UCX

# Verify RDMA device visibility
kubectl exec <pod-name> -- ls /dev/infiniband/

# Check GPUDirect RDMA status (on node)
kubectl debug node/<node-name> -it --image=ubuntu:22.04 -- \
  nsenter -t 1 -m -u -n -p -- dmesg | grep -i "nvidia\|peermem\|gdr"

# Test basic connectivity between pods
kubectl exec <pod-name> -- ping -c 3 <peer-pod-ip>
```

---

## Performance Expectations

### KV Cache Transfer Overhead

| Configuration | TTFT Overhead (avg) | KV Transfer BW | Source |
|---------------|---------------------|----------------|--------|
| Aggregated (baseline) | 0 | N/A | No KV transfer needed |
| Disagg + InfiniBand RDMA with GPUDirect | +200-500ms | 20-50 GB/s | *Expected* based on hardware specs |
| Disagg + RoCE RDMA with GPUDirect | +300-800ms | 10-25 GB/s | *Expected* based on hardware specs |
| Disagg + AWS EFA with libfabric + GDRCopy | **+37ms** | **~9.6 GB/s** | *Measured* on AWS p5.48xlarge (Llama-3.1-8B, ISL=8000, OSL=50) |
| Disagg + Host-staged (no GPUDirect) | +1-3s | 1-3 GB/s | *Expected* - CPU bottleneck |
| Disagg + AWS EFA with UCX (without GPUDirect) | ~3x slower than aggregated | ~1 GB/s | *Measured* on AWS p5.48xlarge |
| Disagg + TCP fallback | **+90-100s** | ~100 MB/s | *Measured* ~98s TTFT on AWS p5.48xlarge |

> **Note**: For AWS EFA deployments, use libfabric with GDRCopy to enable GPUDirect RDMA. UCX on AWS EFA does not support GPUDirect on kernel ≥6.8 and results in severely degraded performance. See [AWS EFA Configuration](#aws-efa-configuration) for setup instructions.

### When Disaggregated Makes Sense

**Use disaggregated architecture when:**

- Input sequence length (ISL) ≥ 4000 tokens (14-22% throughput gain)
- You need independent scaling of prefill vs decode capacity
- Prefill and decode have different hardware requirements

**Use aggregated architecture when:**

- Low-latency TTFT is critical
- Input sequences under 2000 tokens (minimal disagg benefit)
- RDMA is not available

### Break-Even Analysis

The KV transfer overhead is amortized across output tokens.
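As a rough illustration of that amortization, the Python sketch below estimates how many output tokens it takes for the faster decode to pay back the extra TTFT. It uses the measured Llama-3.1-8B figures reported in this section (TTFT 173 ms vs 210 ms, ITL 28.5 ms vs 16.9 ms). Combining an unqueued TTFT with an average ITL measured at concurrency 10 is a simplifying assumption, and the function name `break_even_tokens` is hypothetical.

```python
import math

# Measured values quoted in this section (AWS p5.48xlarge, NIXL + libfabric).
TTFT_AGG_MS = 173.0     # aggregated, min, unqueued
TTFT_DISAGG_MS = 210.0  # disaggregated, min, unqueued
ITL_AGG_MS = 28.5       # aggregated, average inter-token latency
ITL_DISAGG_MS = 16.9    # disaggregated, average inter-token latency


def break_even_tokens(ttft_agg, ttft_disagg, itl_agg, itl_disagg) -> int:
    """Smallest output length at which disaggregated latency is no worse."""
    overhead_ms = ttft_disagg - ttft_agg  # KV transfer cost paid once
    saving_ms = itl_agg - itl_disagg      # time saved on every output token
    if saving_ms <= 0:
        raise ValueError("disaggregated decode is not faster per token")
    return max(0, math.ceil(overhead_ms / saving_ms))


tokens = break_even_tokens(TTFT_AGG_MS, TTFT_DISAGG_MS, ITL_AGG_MS, ITL_DISAGG_MS)
print(tokens)  # 4
```

Under this simplified model the ~37 ms transfer cost is recovered after about 4 output tokens, which is why the overhead is negligible at OSL=50.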
**Measured data from Llama-3.1-8B-Instruct** on AWS p5.48xlarge with NIXL+libfabric:

```text
KV Transfer Overhead (TTFT min, unqueued):
- Aggregated: ~173ms
- Disaggregated: ~210ms
- KV transfer cost: ~37ms

Performance at ISL=8000, OSL=50, concurrency=10:
- ITL improvement: 41% faster per-token generation
- Throughput gain: 22% higher output throughput
```

**Key Insight**: The KV transfer overhead via libfabric+EFA is only **~37ms**. Combined with 41% faster decode (ITL), disaggregated inference delivers **22% higher throughput** for prefill-bound workloads.

| Metric | Aggregated | Disaggregated | Difference |
|--------|------------|---------------|------------|
| TTFT (min, unqueued) | 173 ms | 210 ms | +37ms |
| TTFT (p95) | 2097 ms | 1752 ms | **-16%** |
| ITL (avg) | 28.5 ms | 16.9 ms | **-41%** |
| Output throughput (ISL=8000, OSL=50) | 204 tok/s | 248 tok/s | **+22%** |

**Disagg advantage scales with input length (ISL)** (all at OSL=50, concurrency=10):

| ISL | Throughput Δ | ITL Δ | Recommendation |
|-----|--------------|-------|----------------|
| 1000 | ~0% | -7% | Use aggregated |
| 2000 | +3% | -11% | Either works |
| 4000 | +14% | -18% | Disagg preferred |
| 8000 | **+22%** | **-41%** | **Disagg strongly preferred** |

---

## Troubleshooting Guide

### Problem: TTFT is 10+ seconds

**Symptoms**: TTFT degrades from expected 200-500ms to 10+ seconds

**Root Cause**: RDMA not active, falling back to TCP

**Diagnosis**:

```bash
kubectl logs <pod-name> | grep -i "transport\|UCX\|TCP"
```

**Solutions**:

1. Verify RDMA device plugin is installed
2. Add `rdma/ib` resource requests to pod spec
3. Add `IPC_LOCK` capability
4. Set UCX environment variables

### Problem: "Unsupported operation" errors

**Symptoms**: Logs show `Unexpected UCX error: Unsupported operation`

**Root Cause**: UCX attempting GPU RDMA on hardware that doesn't support it

**Solutions**:

1. Check if GPUDirect RDMA is enabled: `ucx_info -d | grep cuda`
2. If not supported, set `UCX_RNDV_THRESH=inf` to disable GPU RDMA
3. Verify `nvidia-peermem` module is loaded

### Problem: AWS EFA not using GPU Direct

**Symptoms**: 3x performance degradation on AWS despite EFA configured

**Root Cause**: GPU Direct RDMA not functional on kernel ≥6.8 with EFA when using UCX

**Solution**: Use libfabric instead of UCX for AWS EFA deployments. Libfabric with GDRCopy provides efficient GPU Direct RDMA operations on AWS. See the [AWS EFA Configuration](#aws-efa-configuration) section for setup instructions.

**Alternative options** (if libfabric is not available):

1. Use kernel before 6.8 (Ubuntu 22.04 with kernel 5.15)
2. Accept host-staging performance penalty

### Problem: EFA EAGAIN errors (fi_read still retrying)

**Symptoms**: Decode worker logs show repeated EAGAIN errors:

```text
fi_read still retrying EAGAIN on rail 0
fi_read still retrying EAGAIN on rail 1
...
```

**Root Cause**: Prefill and decode workers are scheduled on the **same node**. AWS EFA is designed for cross-node communication and does not function correctly for intra-node transfers.

**Diagnosis**:

```bash
# Check if workers are on the same node
kubectl get pods -o wide | grep vllm
```

If both prefill and decode workers show the same NODE, this is the problem.

**Solution**: Add pod anti-affinity rules to ensure workers are scheduled on different nodes:

```yaml
VllmDecodeWorker:
  extraPodSpec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: nvidia.com/dynamo-component
                  operator: In
                  values:
                    - VllmPrefillWorker
            topologyKey: kubernetes.io/hostname
```

> **Note**: Use `nvidia.com/dynamo-component` as the label key, not `app.kubernetes.io/component`. The Dynamo operator uses this label to identify component types.

### Problem: NIXL_ERR_BACKEND at create_backend on InfiniBand

**Symptoms**: NIXL backend creation fails immediately with `NIXL_ERR_BACKEND`. UCX logs show:

```text
mlx5dv_devx_obj_destroy(SRQ) failed: Invalid argument
mlx5dv_devx_obj_destroy(CQ) failed: Invalid argument
```

Or:

```text
select.c: no active messages transport: Unsupported operation
```

**Root Causes**:

1. **Bonded IB device with LID=0**: UCX selects `mlx5_bond_0` by default, but bonded devices may have LID=0 (invalid for UD transport). Fix: set `UCX_NET_DEVICES` to a non-bonded device with a valid LID.
2. **UCX/OFED version mismatch**: The container's UCX mlx5 library may be compiled against a different devx ABI than the host kernel driver. Any transport using IB (rc, cuda_ipc with IB) triggers the devx crash.
3. **Missing RDMA device injection**: If `rdma/ib` is not requested in the pod spec, no IB devices are injected into the container.

**Diagnosis**:

```bash
# Check which IB devices are visible and their LIDs
ibv_devinfo | grep -E "hca_id|lid"

# Verify rdma/ib was requested
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'

# Check /dev/infiniband exists
ls -la /dev/infiniband/
```

**Solutions**:

1. Request `rdma/ib` resources (1 per GPU) in the pod spec
2. Set `UCX_NET_DEVICES` to a non-bonded device if `mlx5_bond_0` has LID=0
3.
Ensure the container image's UCX build matches the host OFED version

### Problem: Intermittent transfer failures

**Symptoms**: Sporadic `getXferStatus: backend 'UCX' returned error status`

**Diagnosis**:

```bash
# Enable UCX debug logging
kubectl set env deployment/ UCX_LOG_LEVEL=debug
kubectl logs | grep -i error
```

**Common causes**:

- Network congestion or packet loss
- Mismatched UCX versions between pods
- RDMA resource exhaustion

---

## Quick Reference

### Minimum Viable RDMA Configuration

```yaml
env:
  - name: UCX_TLS
    value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc"
  - name: UCX_RNDV_SCHEME
    value: "get_zcopy"
  - name: UCX_RNDV_THRESH
    value: "0"
securityContext:
  capabilities:
    add: ["IPC_LOCK"]
resources:
  limits:
    rdma/ib: "2"
  requests:
    rdma/ib: "2"
```

### Diagnostic Checklist

- [ ] `rdma/ib` resources visible: `kubectl get nodes -o jsonpath='{..allocatable.rdma/ib}'`
- [ ] NIXL initialized: `kubectl logs | grep "Backend"`
- [ ] Transfer bandwidth > 1 GB/s (check Grafana metrics)

**For UCX deployments:**

- [ ] UCX sees RDMA devices: `ucx_info -d | grep "Transport: rc"`
- [ ] UCX sees GPU memory: `ucx_info -d | grep "memory types.*cuda"`

**For libfabric deployments (AWS EFA):**

- [ ] EFA devices available: `fi_info -p efa`
- [ ] GDRCopy installed: `ls /dev/gdrdrv`

---

## Related Documentation

- [Disaggregated Serving Architecture](/dynamo/design-docs/disaggregated-serving)
- [AIConfigurator Deployment Guide](/dynamo/user-guides/disaggregated-serving)
- [NIXL Benchmark Deployment](../../deploy/pre-deployment/nixl/README.md)
- [KV Cache Transfer Methods](/dynamo/additional-resources/tensor-rt-llm-details/kv-cache-transfer)

# Topology-Aware KV Transfer

Topology-aware KV transfer lets a disaggregated Dynamo deployment route decode requests toward workers that share the selected prefill worker's topology domain, such as zone or rack. This reduces slow cross-domain KV-cache transfers when prefill and decode workers exchange KV data over NIXL.
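The idea of same-domain selection can be pictured with a small model. The sketch below is purely illustrative and is not Dynamo's router code: `WorkerTopology` and `same_domain_decode_workers` are hypothetical names, and the only thing borrowed from this page is the shape of the published metadata (a `topology_domains` map keyed by domain name).

```python
from dataclasses import dataclass, field


@dataclass
class WorkerTopology:
    """Hypothetical stand-in for the topology metadata a worker publishes."""
    name: str
    topology_domains: dict = field(default_factory=dict)


def same_domain_decode_workers(prefill, decode_workers, domain):
    """Return the decode workers whose value for `domain` matches the prefill worker's."""
    target = prefill.topology_domains.get(domain)
    if target is None:
        raise LookupError(f"prefill worker {prefill.name} has no '{domain}' value")
    return [w for w in decode_workers if w.topology_domains.get(domain) == target]


prefill = WorkerTopology("prefill-0", {"zone": "us-east-1a"})
decode_pool = [
    WorkerTopology("decode-0", {"zone": "us-east-1a"}),
    WorkerTopology("decode-1", {"zone": "us-east-1b"}),
]

candidates = same_domain_decode_workers(prefill, decode_pool, "zone")
print([w.name for w in candidates])
```

An empty result is the case a strict policy has to treat as a routing failure, while a softer policy could fall back to the full pool.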
Use this feature when: - Your deployment uses separate prefill and decode workers. - Your cluster exposes useful node labels, such as `topology.kubernetes.io/zone` or a rack/block label. - Same-domain KV transfer is required for correctness or strongly preferred for latency and bandwidth. This page covers the Kubernetes operator path. For router and runtime behavior, see [Router Topology-Aware KV Transfer](/dynamo/components/router/topology-aware-kv-transfer). For RDMA/NIXL transport setup, see [Disagg Communication](/dynamo/kubernetes-deployment/operate/disagg-communication). ## How It Works ```mermaid flowchart LR DGD["DGD spec.experimental.kvTransferPolicy"] --> Operator["Operator injects worker env + Downward API volume"] Node["Node label"] --> Controller["Topology label controller"] Controller --> Pod["Worker pod label"] Pod --> Volume["/etc/dynamo/topology/"] Volume --> Runtime["Worker publishes ModelRuntimeConfig topology metadata"] Runtime --> Prefill["Prefill router derives decode constraints"] Prefill --> Decode["Decode router selects same or preferred topology"] ``` The operator configures worker pods from `spec.experimental.kvTransferPolicy`: - Adds a `nvidia.com/topology-label-key` annotation to worker pods. - Runs a topology-label controller that copies the configured node label onto the worker pod after scheduling. - Projects that pod label into `/etc/dynamo/topology/` with a Downward API volume. - Injects worker environment variables that tell the backend runtime which topology domain and enforcement policy to publish. The frontend does not read this policy from its own environment. Workers publish the topology metadata in their `ModelRuntimeConfig`; the router reads it from runtime discovery. ## Prerequisites | Requirement | Details | |-------------|---------| | Disaggregated serving | Separate prefill and decode worker services. | | KV router | The frontend should use `DYN_ROUTER_MODE=kv`. 
| | Node topology labels | Every node that can host a worker must carry the configured `labelKey`. | | Dynamo operator | The operator must include topology-label controller and node-read RBAC. | | KV transfer transport | RDMA, EFA, or another NIXL-compatible transport should already be configured for production disaggregated deployments. | Confirm that the label you plan to use exists on worker nodes: ```bash kubectl get nodes -L topology.kubernetes.io/zone ``` ## Required Same-Domain Routing `enforcement: required` constrains decode worker selection to workers whose topology value matches the selected prefill worker for the configured domain. If no decode worker satisfies the generated constraint, the router fails the request instead of silently crossing the domain. ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeployment metadata: name: qwen3-disagg-zone spec: experimental: kvTransferPolicy: labelKey: topology.kubernetes.io/zone domain: zone enforcement: required components: - name: Frontend type: frontend replicas: 1 podTemplate: spec: containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 env: - name: DYN_ROUTER_MODE value: kv - name: VllmPrefillWorker type: worker replicas: 2 podTemplate: spec: containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 command: ["python3", "-m", "dynamo.vllm"] args: ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "prefill"] envFrom: - secretRef: name: hf-token-secret resources: limits: nvidia.com/gpu: "1" - name: VllmDecodeWorker type: worker replicas: 2 podTemplate: spec: containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 command: ["python3", "-m", "dynamo.vllm"] args: ["--model", "Qwen/Qwen3-0.6B", "--disaggregation-mode", "decode"] envFrom: - secretRef: name: hf-token-secret resources: limits: nvidia.com/gpu: "1" ``` `enforcement` defaults to `required` when omitted. `required` is a decode-routing constraint, not a capacity planner. 
The `DynamoGraphDeployment` author or cluster administrator must ensure that every topology domain that can receive prefill workers also has sufficient same-domain decode capacity. If a domain has prefill workers but no matching decode workers, or too little decode capacity, the router cannot spill to another domain without violating the policy. ### Capacity Planning Across Domains Plan prefill and decode capacity per topology domain before enabling `enforcement: required`. For example, assume: - Two availability zones: `az-1` and `az-2`. - The target fleet is 60 prefill workers and 120 decode workers. - The fleet should be split evenly across the two zones. - The target prefill-to-decode ratio is 1:2 in each zone. That means each zone should run 30 prefill workers and 60 decode workers: | Zone | Prefill workers | Decode workers | Ratio | |------|-----------------|----------------|-------| | `az-1` | 30 | 60 | 1:2 | | `az-2` | 30 | 60 | 1:2 | In a `DynamoGraphDeployment`, express this as separate prefill and decode components per zone. Pin each component to its zone and set `kvTransferPolicy.enforcement` to `required` so the router refuses cross-zone decode selection. The DGD author or cluster administrator must ensure each zone has enough schedulable capacity for its pinned replicas. 
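The per-zone arithmetic is easy to script before writing the manifest. This is a minimal sketch under the assumptions of the example (an even split across zones); `plan_per_zone` and `zones_missing_decode` are hypothetical helper names, not part of Dynamo or the operator.

```python
def plan_per_zone(total_prefill, total_decode, zones):
    """Split a fleet evenly across zones; refuse splits that leave a remainder."""
    count = len(zones)
    if count == 0:
        raise ValueError("at least one zone is required")
    if total_prefill % count or total_decode % count:
        raise ValueError("fleet does not divide evenly across zones")
    return {
        zone: {"prefill": total_prefill // count, "decode": total_decode // count}
        for zone in zones
    }


def zones_missing_decode(plan):
    """Zones that host prefill workers but no decode workers.

    With `enforcement: required`, requests prefilled there cannot be decoded."""
    return [z for z, c in plan.items() if c["prefill"] > 0 and c["decode"] == 0]


plan = plan_per_zone(total_prefill=60, total_decode=120, zones=["az-1", "az-2"])
for zone, counts in plan.items():
    ratio = counts["decode"] / counts["prefill"]
    print(f"{zone}: {counts['prefill']} prefill, {counts['decode']} decode (1:{ratio:g})")

print("zones without decode capacity:", zones_missing_decode(plan))
```

The resulting counts map directly onto the `replicas` values of the per-zone components.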
Worker command and args are omitted here; configure each worker for prefill or decode mode as in the base disaggregated serving manifest: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeployment metadata: name: qwen3-disagg-zone-capacity spec: experimental: kvTransferPolicy: labelKey: topology.kubernetes.io/zone domain: zone enforcement: required components: - name: Frontend type: frontend replicas: 1 podTemplate: spec: containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 env: - name: DYN_ROUTER_MODE value: kv - name: VllmPrefillWorkerAz1 type: worker replicas: 30 podTemplate: spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: topology.kubernetes.io/zone operator: In values: ["az-1"] containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 envFrom: - secretRef: name: hf-token-secret - name: VllmDecodeWorkerAz1 type: worker replicas: 60 podTemplate: spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: topology.kubernetes.io/zone operator: In values: ["az-1"] containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 envFrom: - secretRef: name: hf-token-secret - name: VllmPrefillWorkerAz2 type: worker replicas: 30 podTemplate: spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: topology.kubernetes.io/zone operator: In values: ["az-2"] containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 envFrom: - secretRef: name: hf-token-secret - name: VllmDecodeWorkerAz2 type: worker replicas: 60 podTemplate: spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: topology.kubernetes.io/zone operator: In values: ["az-2"] containers: - name: main image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 envFrom: - 
secretRef: name: hf-token-secret ``` ## Preferred Same-Domain Routing `enforcement: preferred` keeps all decode workers eligible but biases worker selection toward the same topology domain. ```yaml spec: experimental: kvTransferPolicy: labelKey: topology.kubernetes.io/zone domain: zone enforcement: preferred preferredWeight: 0.85 ``` `preferredWeight` is required with `enforcement: preferred`. It must be between `0` and `1`. A higher value creates a stronger same-domain preference, but it is not a probability and does not guarantee same-domain selection. ## Field Reference | Field | Required | Description | |-------|----------|-------------| | `labelKey` | Yes | Kubernetes node label key to copy onto worker pods, for example `topology.kubernetes.io/zone`. | | `domain` | Yes | Logical topology domain name published by workers, for example `zone` or `rack`. Must match `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`. | | `enforcement` | No | `required` or `preferred`. Defaults to `required`. | | `preferredWeight` | Only with `preferred` | Bias weight from `0` to `1`; only valid with `enforcement: preferred`. | The runtime uses `domain`, not the Kubernetes label key, when creating routing constraints. For example, `labelKey: topology.kubernetes.io/zone` and `domain: zone` produce worker topology metadata like: ```json { "topology_domains": { "zone": "us-east-1a" }, "kv_transfer_domain": "zone", "kv_transfer_enforcement": "required" } ``` ## Verify the Deployment After the DGD creates worker pods, verify the operator pipeline from node label to runtime topology file. 
```bash export NAMESPACE= export POD= kubectl get pod "$POD" -n "$NAMESPACE" \ -o jsonpath='{.metadata.annotations.nvidia\.com/topology-label-key}{"\n"}' kubectl get pod "$POD" -n "$NAMESPACE" \ -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' kubectl exec "$POD" -n "$NAMESPACE" -- \ sh -c 'find /etc/dynamo/topology -maxdepth 1 -type f -print -exec cat {} \;' ``` Expected results: - The annotation value is the configured `labelKey`. - The worker pod has the copied topology label. - `/etc/dynamo/topology/` exists and contains the topology value. Worker logs should include topology config during startup: ```bash kubectl logs "$POD" -n "$NAMESPACE" | grep -i "Topology config" ``` ## Troubleshooting ### Pod Has No Copied Topology Label Check whether the node has the configured label: ```bash NODE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.nodeName}') kubectl get node "$NODE" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' ``` If the label is missing, the topology-label controller emits a warning event with reason `TopologyLabelMissing` and leaves topology metadata unavailable for that worker. ```bash kubectl get events -n "$NAMESPACE" \ --field-selector involvedObject.name="$POD",reason=TopologyLabelMissing ``` ### Worker Exits While Waiting for Topology When topology is enabled, the worker waits for the transfer-domain file to appear and contain data. If it stays empty, check: - `spec.experimental.kvTransferPolicy.domain` matches the projected file name. - `spec.experimental.kvTransferPolicy.labelKey` exists on the worker's node. - The worker pod has the `nvidia.com/topology-label-key` annotation. - The topology-label controller is running and has node `get` RBAC. ### Required Policy Fails Requests With `enforcement: required`, decode routing fails if no decode worker has the same generated topology taint as the selected prefill worker. 
Verify both prefill and decode workers publish the same `domain`, and that each domain where prefill workers can be selected has enough matching decode workers for the expected p/d ratio. Use `preferred` while validating a heterogeneous rollout if cross-domain routing is acceptable during partial capacity. ## Relationship to Topology Aware Scheduling [Topology Aware Scheduling](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) controls where Kubernetes places pods. Topology-aware KV transfer controls how Dynamo routes between already-running prefill and decode workers. Use them together when possible: - Topology Aware Scheduling keeps workers placed inside useful topology boundaries. - Topology-aware KV transfer prevents the router from choosing a decode worker outside the selected prefill worker's transfer domain. # Metrics ## Overview This guide provides a walkthrough for collecting and visualizing metrics from Dynamo components using the kube-prometheus-stack. The kube-prometheus-stack provides a powerful and flexible way to configure monitoring for Kubernetes applications through custom resources like PodMonitors, making it easy to automatically discover and scrape metrics from Dynamo components. ## Prerequisites ### Install kube-prometheus-stack If you don't have an existing Prometheus setup, you'll likely want to install the kube-prometheus-stack. This is a collection of Kubernetes manifests that includes the Prometheus Operator, Prometheus, Grafana, and other monitoring components in a pre-configured setup. 
The stack introduces custom resources that make it easy to deploy and manage monitoring in Kubernetes:

- `PodMonitor`: Automatically discovers and scrapes metrics from pods based on label selectors
- `ServiceMonitor`: Similar to PodMonitor but works with Services
- `PrometheusRule`: Defines alerting and recording rules

For a basic installation:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Values allow PodMonitors to be picked up that are outside of the kube-prometheus-stack helm release
helm install prometheus -n monitoring --create-namespace prometheus-community/kube-prometheus-stack \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorNamespaceSelector.matchLabels=null \
  --set prometheus.prometheusSpec.probeNamespaceSelector.matchLabels=null
```

The commands below assume you installed the kube-prometheus-stack using the method above. If your monitoring stack is configured differently, adjust the `kubectl` commands in this document to match (e.g., change the namespace or Service names).

### Install Dynamo Operator

Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) for detailed instructions on deploying the Dynamo operator. Make sure to set `dynamo-operator.dynamo.metrics.prometheusEndpoint` to the endpoint of the Prometheus instance you installed in the previous step.

```bash
helm install dynamo-platform ... --set dynamo-operator.dynamo.metrics.prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
```

### Node Exporter for CPU/Memory Metrics

The Dynamo Grafana dashboard includes panels for node-level CPU utilization, system load, and container resource usage.
These metrics are collected and exported to Prometheus via [node-exporter](https://github.com/prometheus/node_exporter), which exposes hardware and OS metrics from Linux systems. The kube-prometheus-stack installation described above includes node-exporter by default. If you're using a custom Prometheus setup, you'll need to ensure node-exporter is deployed as a DaemonSet on your cluster nodes. To verify node-exporter is running: ```bash kubectl get daemonset -A | grep node-exporter ``` If node-exporter is not running, you can install it via the kube-prometheus-stack or deploy it separately. For more information, see the [node-exporter documentation](https://github.com/prometheus/node_exporter). ### DCGM Metrics Collection (Optional) GPU utilization metrics are collected and exported to Prometheus via dcgm-exporter. The Dynamo Grafana dashboard includes a panel for GPU utilization related to your Dynamo deployment. For that panel to be populated, you need to ensure that the dcgm-exporter is running in your cluster. To check if the dcgm-exporter is running, please run the following command: ```bash kubectl get daemonset -A | grep dcgm-exporter ``` If the output is empty, you need to install the dcgm-exporter. For more information, please consult the official [dcgm-exporter documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html). ## Deploy a DynamoGraphDeployment Let's start by deploying a simple vLLM aggregated deployment: ```bash export NAMESPACE=dynamo-system # namespace where dynamo operator is installed pushd examples/backends/vllm/deploy kubectl apply -f agg.yaml -n $NAMESPACE popd ``` This will create two components: - A Frontend component exposing metrics on its HTTP port - A Worker component exposing metrics on its system port Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. 
For details about: - Deployment configuration: See the [vLLM README](/dynamo/backends/v-llm) - Available metrics: See the [metrics guide](/dynamo/user-guides/observability-local/metrics) ### Validate the Deployment Let's send some test requests to populate metrics: ```bash curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden." } ], "stream": true, "max_tokens": 30 }' ``` For more information about validating the deployment, see the [vLLM README](/dynamo/backends/v-llm). ## Set Up Metrics Collection ### Enable NIXL Telemetry (Optional) To enable NIXL telemetry metrics in addition to Dynamo metrics, set the following environment variables in your worker component: spec: services: YourWorker: envs: - name: NIXL_TELEMETRY_ENABLE value: "y" NIXL telemetry is disabled by default. When enabled, NIXL metrics will be exposed on the port specified by `NIXL_TELEMETRY_PROMETHEUS_PORT` (19090 by default). 
### Create PodMonitors

The Prometheus Operator uses PodMonitor resources to automatically discover and scrape metrics from pods. To enable this discovery, the Dynamo operator automatically creates a PodMonitor resource and adds these labels to all pods:

- `nvidia.com/metrics-enabled: "true"` - Enables metrics collection
- `nvidia.com/dynamo-component-type: "frontend|worker"` - Identifies the component type

You can opt specific deployments out of metrics collection by adding this annotation to your DynamoGraphDeployment:

```yaml
apiVersion: nvidia.com/v1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/enable-metrics: "false"
spec:
  # …
```

### Configure Grafana Dashboard

Apply the Dynamo dashboard ConfigMap to add the Dynamo dashboard to Grafana:

```bash
kubectl apply -n monitoring -f deploy/observability/grafana-dynamo-dashboard-configmap.yaml
```

The dashboard is embedded in the ConfigMap. Because the ConfigMap is labeled with `grafana_dashboard: "1"`, Grafana discovers it and adds the dashboard to its list of available dashboards.
The dashboard includes panels for: - Frontend request rates - Time to first token - Inter-token latency - Request duration - Input/Output sequence lengths - GPU utilization via DCGM - Node CPU utilization and system load - Container CPU usage per pod - Memory usage per pod ## Viewing the Metrics ### In Prometheus ```bash kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring ``` Visit http://localhost:9090 and try these example queries: - `dynamo_frontend_requests_total` - `dynamo_frontend_time_to_first_token_seconds_bucket` ![Prometheus UI showing Dynamo metrics](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/ce1da1f068dc5b2a4e6b8fb18bd9d9f2dfae55962c18ee13b3c48e43b4737eb2/pages-v1.2.0/assets/img/prometheus-k8s.png) ### In Grafana ```bash # Get Grafana credentials export GRAFANA_USER=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-user}" | base64 --decode) export GRAFANA_PASSWORD=$(kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode) echo "Grafana user: $GRAFANA_USER" echo "Grafana password: $GRAFANA_PASSWORD" # Port forward Grafana service kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring ``` Visit http://localhost:3000 and log in with the credentials captured above. Once logged in, find the Dynamo dashboard under General. ![Grafana dashboard showing Dynamo metrics](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/692d89c259057f1e6494336aa3ba470521a4ff6cc4d738e3cb25fa250e0e59a0/pages-v1.2.0/assets/img/grafana-k8s.png) ## Operator Metrics > **Note:** The metrics described above are for Dynamo **applications** (frontends, workers). The Dynamo **Operator** itself also exposes metrics for monitoring controller reconciliation, webhook validation, and resource inventory. 
>
> See the **[Operator Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/operator-metrics)** for details on operator-specific metrics and the operator dashboard.

# Logging

This guide demonstrates how to set up logging for Dynamo in Kubernetes using Grafana Loki and Alloy. It provides a simple reference configuration that works on Kubernetes clusters, including Minikube and MicroK8s, and is intended for development and testing. For production environments, refer to the official documentation for high-availability configurations.

## Components Overview

- **[Grafana Loki](https://grafana.com/oss/loki/)**: Fast and cost-effective Kubernetes-native log aggregation system.
- **[Grafana Alloy](https://grafana.com/oss/alloy/)**: OpenTelemetry collector that replaces Promtail, gathering logs, metrics and traces from Kubernetes pods.
- **[Grafana](https://grafana.com/grafana/)**: Visualization platform for querying and exploring logs.

## Prerequisites

### 1. Dynamo Kubernetes Platform

This guide assumes you have installed Dynamo Kubernetes Platform. For more information, see [Dynamo Kubernetes Platform](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart).

### 2. Kube-prometheus

While this guide does not use Prometheus, it assumes Grafana is already installed as part of the kube-prometheus-stack. For more information, see [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).

### 3.
Environment Variables #### Kubernetes Setup Variables The following env variables are set: - `MONITORING_NAMESPACE`: The namespace where Loki is installed - `DYN_NAMESPACE`: The namespace where Dynamo Kubernetes Platform is installed ```bash export MONITORING_NAMESPACE=monitoring export DYN_NAMESPACE=dynamo-system ``` #### Dynamo Logging Variables | Variable | Description | Example | |----------|-------------|---------| | `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for Loki) | `true` | | `DYN_LOG` | Log levels per target `,=,=` | `DYN_LOG=info,dynamo_runtime::system_status_server:trace` | | `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps | `true` | ## Installation Steps ### 1. Install Loki First, we'll install Loki in single binary mode, which is ideal for testing and development: ```bash # Add the Grafana Helm repository helm repo add grafana https://grafana.github.io/helm-charts helm repo update # Install Loki helm install --values deploy/observability/logging/values/loki-values.yaml loki grafana/loki -n $MONITORING_NAMESPACE ``` Our configuration (`loki-values.yaml`) sets up Loki in a simple configuration that is suitable for testing and development. It uses a local MinIO for storage. The installation pods can be viewed with: ```bash kubectl get pods -n $MONITORING_NAMESPACE -l app=loki ``` ### 2. Install Grafana Alloy Next, install the Grafana Alloy collector to gather logs from your Kubernetes cluster and forward them to Loki. 
Here we use the Helm chart `k8s-monitoring` provided by Grafana to install the collector: ```bash # Generate a custom values file with the namespace information envsubst < deploy/observability/logging/values/alloy-values.yaml > alloy-custom-values.yaml # Install the collector helm install --values alloy-custom-values.yaml alloy grafana/k8s-monitoring -n $MONITORING_NAMESPACE ``` The values file (`alloy-values.yaml`) includes the following configurations for the collector: - Destination to forward logs to Loki - Namespace to collect logs from - Pod labels to be mapped to Loki labels - Collection method (kubernetesApi or tailing `/var/log/containers/`) ```yaml destinations: - name: loki type: loki url: http://loki-gateway.$MONITORING_NAMESPACE.svc.cluster.local/loki/api/v1/push podLogs: enabled: true gatherMethod: kubernetesApi # collect logs from the kubernetes api, rather than /var/log/containers/; friendly for testing and development collector: alloy-logs labels: app_kubernetes_io_name: app.kubernetes.io/name nvidia_com_dynamo_component_type: nvidia.com/dynamo-component-type nvidia_com_dynamo_graph_deployment_name: nvidia.com/dynamo-graph-deployment-name labelsToKeep: - "app_kubernetes_io_name" - "container" - "instance" - "job" - "level" - "namespace" - "service_name" - "service_namespace" - "deployment_environment" - "deployment_environment_name" - "nvidia_com_dynamo_component_type" # extract this label from the dynamo graph deployment - "nvidia_com_dynamo_graph_deployment_name" # extract this label from the dynamo graph deployment namespaces: - $DYN_NAMESPACE ``` ### 3. Configure Grafana with the Loki datasource and Dynamo Logs dashboard We will be viewing the logs associated with our DynamoGraphDeployment in Grafana. To do this, we need to configure Grafana with the Loki datasource and Dynamo Logs dashboard. Since we are using Grafana with the Prometheus Operator, we can simply apply the following ConfigMaps to quickly achieve this configuration. 
```bash
# Configure Grafana with the Loki datasource
envsubst < deploy/observability/logging/grafana/loki-datasource.yaml | kubectl apply -n $MONITORING_NAMESPACE -f -

# Configure Grafana with the Dynamo Logs dashboard
kubectl apply -f deploy/observability/logging/grafana/logging-dashboard.yaml -n $MONITORING_NAMESPACE
```

If your Grafana was installed without the Prometheus Operator, you can import the Loki datasource and Dynamo Logs dashboard manually through the Grafana UI.

### 4. Deploy a DynamoGraphDeployment with JSONL Logging

At this point, everything is in place to collect and view logs in Grafana. All that is left is to deploy a DynamoGraphDeployment to collect logs from.

To enable structured logs in a DynamoGraphDeployment, set the `DYN_LOGGING_JSONL` environment variable to `1`. The `agg_logging.yaml` example for the SGLang backend already does this. Deploy it with:

```bash
kubectl apply -n $DYN_NAMESPACE -f examples/backends/sglang/deploy/agg_logging.yaml
```

Send a few chat completion requests to generate structured logs from the frontend and worker pods of the DynamoGraphDeployment. You can now view the logs in Grafana.

## Viewing Logs in Grafana

Port-forward the Grafana service to access the UI:

```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n $MONITORING_NAMESPACE
```

If everything is working, under Home > Dashboards > Dynamo Logs you should see a dashboard for viewing the logs associated with your DynamoGraphDeployments. The dashboard supports filtering by DynamoGraphDeployment, namespace, and component type (e.g., frontend or worker).

# Operator Metrics

## Overview

The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance.
These metrics are separate from application metrics (frontend/worker) and provide visibility into: - **Controller Reconciliation**: How efficiently controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels - **Webhook Validation**: Performance and outcomes of admission webhook requests - **Resource Inventory**: Current count of managed resources by state and namespace ## Prerequisites The operator metrics feature requires the same monitoring infrastructure as application metrics. For detailed setup instructions, see the [Kubernetes Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/metrics#prerequisites). **Quick checklist:** - ✅ kube-prometheus-stack installed (for ServiceMonitor support) - ✅ Prometheus and Grafana running - ✅ Dynamo Operator installed via Helm ## Metrics Collection ### ServiceMonitor Operator metrics are automatically collected via a ServiceMonitor, which is created by the Helm chart when `metricsService.enabled: true` (default). **Unlike application metrics** (which use PodMonitor), the operator uses ServiceMonitor and requires no manual RBAC configuration. The operator's metrics endpoint uses controller-runtime's built-in `WithAuthenticationAndAuthorization` filter for secure serving. To verify the ServiceMonitor is created: ```bash kubectl get servicemonitor -n dynamo-system ``` ### Disabling Metrics Collection To disable operator metrics collection: ```bash helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \ --namespace dynamo-system \ --set dynamo-operator.metricsService.enabled=false ``` ## Available Metrics All metrics use the `dynamo_operator` namespace prefix. 
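When inspecting a raw metrics scrape, the shared prefix makes operator series easy to separate from everything else. The following is a minimal sketch, assuming you already have Prometheus text-format output on stdin (how you obtain it depends on your setup). The function name `operator_series_names` and the sample lines are illustrative only; the sample values are made up, not real measurements.

```bash
# Sketch only: operator_series_names is a hypothetical helper, not part of Dynamo.

# Read Prometheus text exposition from stdin and print the distinct
# series names that start with the dynamo_operator prefix.
operator_series_names() {
  # Keep only operator series, strip labels and values, de-duplicate.
  grep '^dynamo_operator_' | sed 's/[{ ].*$//' | sort -u
}

# Made-up sample input; comment lines and non-operator series are ignored.
operator_series_names <<'EOF'
# HELP dynamo_operator_reconcile_total Total number of reconciliations
dynamo_operator_reconcile_total{resource_type="DynamoGraphDeployment",namespace="default",result="success"} 3
dynamo_operator_resources_total{resource_type="DynamoModel",namespace="default",status="ready"} 1
go_goroutines 42
EOF
```

Histogram metrics appear in a scrape as several series with the standard Prometheus `_bucket`, `_sum`, and `_count` suffixes, so expect more names than there are rows in the metric tables.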
### Reconciliation Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_reconcile_duration_seconds` | Histogram | `resource_type`, `namespace`, `result` | Duration of reconciliation loops |
| `dynamo_operator_reconcile_total` | Counter | `resource_type`, `namespace`, `result` | Total number of reconciliations |
| `dynamo_operator_reconcile_errors_total` | Counter | `resource_type`, `namespace`, `error_type` | Total reconciliation errors by type |

**Labels:**

- `resource_type`: `DynamoGraphDeployment`, `DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter`
- `namespace`: Target namespace of the resource
- `result`: `success`, `error`, `requeue`
- `error_type`: `not_found`, `already_exists`, `conflict`, `validation`, `bad_request`, `unauthorized`, `forbidden`, `timeout`, `server_timeout`, `unavailable`, `rate_limited`, `internal`

### Webhook Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_webhook_duration_seconds` | Histogram | `resource_type`, `operation` | Duration of webhook validation requests |
| `dynamo_operator_webhook_requests_total` | Counter | `resource_type`, `operation`, `result` | Total webhook admission requests |
| `dynamo_operator_webhook_denials_total` | Counter | `resource_type`, `operation`, `reason` | Total webhook denials with reasons |

**Labels:**

- `resource_type`: Same as reconciliation metrics
- `operation`: `CREATE`, `UPDATE`, `DELETE`
- `result`: `allowed`, `denied`
- `reason`: Validation failure reason (e.g., `immutable_field_changed`, `invalid_config`)

### Resource Inventory Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `dynamo_operator_resources_total` | Gauge | `resource_type`, `namespace`, `status` | Current count of resources by state |

**Labels:**

- `resource_type`: `DynamoGraphDeployment`,
`DynamoComponentDeployment`, `DynamoModel`, `DynamoGraphDeploymentRequest`, `DynamoGraphDeploymentScalingAdapter` - `namespace`: Resource namespace - `status`: Resource state derived from each CRD's status. Common values: - `"ready"` - Resource is healthy and operational (DCD, DM, DGDSA) - `"not_ready"` - Resource exists but is not operational (DCD, DM, DGDSA) - `"unknown"` - State cannot be determined (default for empty status) - DGD uses: `"pending"`, `"successful"`, `"failed"` from `.status.state` - DGDR uses: `"Pending"`, `"Profiling"`, `"Ready"`, `"Deploying"`, `"Deployed"`, `"Failed"` from `.status.phase` ## Example Queries ### Reconciliation Performance ```promql # P95 reconciliation duration by resource type histogram_quantile(0.95, sum by (resource_type, le) ( rate(dynamo_operator_reconcile_duration_seconds_bucket[5m]) ) ) # Reconciliation rate by result sum by (resource_type, result) ( rate(dynamo_operator_reconcile_total[5m]) ) # Error rate by type sum by (resource_type, error_type) ( rate(dynamo_operator_reconcile_errors_total[5m]) ) ``` ### Webhook Performance ```promql # Webhook P95 latency histogram_quantile(0.95, sum by (resource_type, le) ( rate(dynamo_operator_webhook_duration_seconds_bucket[5m]) ) ) # Webhook denial rate sum by (resource_type, operation, reason) ( rate(dynamo_operator_webhook_denials_total[5m]) ) ``` ### Resource Inventory ```promql # Total resources by type and state sum by (resource_type, status) ( dynamo_operator_resources_total ) # DynamoGraphDeployments by state sum by (status) ( dynamo_operator_resources_total{resource_type="DynamoGraphDeployment"} ) # All resources by namespace and state sum by (resource_type, namespace, status) ( dynamo_operator_resources_total ) ``` ## Grafana Dashboard A pre-built Grafana dashboard is available for visualizing operator metrics. ### Dashboard Sections 1. 
**Reconciliation Metrics** (3 panels) - Reconciliation rate by resource type and result - P95 reconciliation duration - Reconciliation errors by type 2. **Webhook Metrics** (3 panels) - Webhook request rate by operation - P95 webhook duration - Webhook denials by reason 3. **Resource Inventory** (2 panels) - Resource inventory timeline by state and namespace (filterable by resource type) - Current resource count by state (filterable by resource type) 4. **Operational Health** (2 panels) - Reconciliation success rate gauges - Webhook admission success rate gauges ### Deploying the Dashboard ```bash kubectl apply -f deploy/observability/grafana-operator-dashboard-configmap.yaml ``` The dashboard will automatically appear in Grafana (assuming you have the Grafana dashboard sidecar configured, which is included in kube-prometheus-stack). ### Finding the Dashboard 1. Port-forward to Grafana (if needed): ```bash kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring ``` 2. Log in to Grafana at http://localhost:3000 3. Navigate to **Dashboards** → Search for **"Dynamo Operator"** ### Dashboard Filters The dashboard includes two filter variables: - **Namespace**: View metrics across all namespaces or filter by specific ones (multi-select) - **Resource Type**: Filter all panels by resource type or select "All" to see aggregated metrics across all CRDs (single select) When "All" is selected for Resource Type, all panels will show data for all five managed CRDs with resource_type labels for differentiation. ## Accessing Metrics Directly For instructions on accessing Prometheus and Grafana, see the [Kubernetes Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/metrics#viewing-the-metrics). 
Once you have access to Prometheus, you can query operator metrics directly:

```bash
# Port-forward to Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring

# Visit http://localhost:9090 and try queries like:
# - dynamo_operator_reconcile_total
# - dynamo_operator_webhook_requests_total
# - dynamo_operator_resources_total
```

## Troubleshooting

### Metrics Not Appearing in Prometheus

1. **Check ServiceMonitor exists:**

   ```bash
   kubectl get servicemonitor -n dynamo-system | grep operator
   ```

2. **Check ServiceMonitor is discovered by Prometheus:**
   - Go to Prometheus UI → Status → Targets
   - Look for `serviceMonitor/dynamo-system/dynamo-platform-dynamo-operator-operator`
   - Should show state: `UP`

3. **Check Prometheus selector configuration:**

   ```bash
   kubectl get prometheus -o yaml | grep serviceMonitorSelector
   ```

   Ensure `serviceMonitorSelectorNilUsesHelmValues: false` was set during kube-prometheus-stack installation.

### Dashboard Not Appearing in Grafana

1. **Check ConfigMap is created:**

   ```bash
   kubectl get configmap -n monitoring grafana-operator-dashboard
   ```

2. **Check ConfigMap has the label:**

   ```bash
   kubectl get configmap -n monitoring grafana-operator-dashboard -o jsonpath='{.metadata.labels.grafana_dashboard}'
   ```

   Should return `"1"`.

3. **Check Grafana dashboard sidecar configuration:**

   ```bash
   kubectl get deployment -n monitoring prometheus-grafana -o yaml | grep -A 5 sidecar
   ```

   The sidecar should be configured to watch for the `grafana_dashboard: "1"` label.

4. 
**Restart Grafana pod** to force dashboard refresh: ```bash kubectl rollout restart deployment/prometheus-grafana -n monitoring ``` ## Related Documentation - [Kubernetes Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/metrics) - Application metrics for frontends and workers - [Dynamo Operator Guide](/dynamo/kubernetes-deployment/start-here/dynamo-operator) - Operator architecture and deployment modes - [Operator Webhooks](/dynamo/kubernetes-deployment/advanced-platform/webhooks) - Webhook validation details # Multinode Deployments This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models. ## Overview Dynamo supports multinode deployments through the `multinode` section in resource specifications. This allows you to: - Distribute workloads across multiple physical nodes - Scale GPU resources beyond a single machine - Support large models requiring extensive tensor parallelism - Achieve high availability and fault tolerance ## Basic requirements - **Kubernetes Cluster**: Version 1.24 or later - **GPU Nodes**: Multiple nodes with NVIDIA GPUs - **High-Speed Networking**: InfiniBand, RoCE, or high-bandwidth Ethernet (recommended for optimal performance) ### Advanced Multinode Orchestration #### Using Grove (default) For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems: - **[Grove](https://github.com/NVIDIA/grove)**: Network topology-aware gang scheduling and auto-scaling for AI workloads - **[KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler)**: Kubernetes native scheduler optimized for AI workloads at scale These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes. 
**Features Enabled with Grove:** - Declarative composition of AI workloads - Multi-level horizontal auto-scaling - Custom startup ordering for components - Resource-aware rolling updates - [Topology Aware Scheduling](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) — pack pods within a rack, block, or other topology domain for lower latency [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a Kubernetes native scheduler optimized for AI workloads at large scale. **Features Enabled with KAI-Scheduler:** - Gang scheduling - Network topology-aware pod placement - AI workload-optimized scheduling algorithms - GPU resource awareness and allocation - Support for complex scheduling constraints - Integration with Grove for enhanced capabilities - Performance optimizations for large-scale deployments ##### Prerequisites - [Grove](https://github.com/NVIDIA/grove/blob/main/docs/installation.md) installed on the cluster - (Optional) [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with the default queue name `dynamo` created. If no queue annotation is specified on the DGD resource, the operator uses the `dynamo` queue by default. Custom queue names can be specified via the `nvidia.com/kai-scheduler-queue` annotation, but the queue must exist in the cluster before deployment. KAI-Scheduler is optional but recommended for advanced scheduling capabilities. #### Using LWS and Volcano LWS is a simple multinode deployment mechanism that allows you to deploy a workload across multiple nodes. - **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation) - **Volcano**: [Volcano Installation](https://github.com/volcano-sh/volcano#quick-start-guide) Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support. 
## Core Concepts ### Orchestrator Selection Algorithm Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic: #### When Both Grove and LWS are Available: - **Grove is selected by default** (recommended for advanced AI workloads) - **LWS is selected** if you explicitly set `nvidia.com/enable-grove: "false"` annotation on your DGD resource #### When Only One Orchestrator is Available: - The installed orchestrator (Grove or LWS) is automatically selected #### Scheduler Integration: - **With Grove**: Automatically integrates with [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) when available, providing: - Advanced queue management via `nvidia.com/kai-scheduler-queue` annotation - AI-optimized scheduling policies - Resource-aware workload placement - **With LWS**: Uses Volcano scheduler for gang scheduling and resource coordination #### Configuration Examples: **Default (Grove with KAI-Scheduler):** ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: my-multinode-deployment annotations: nvidia.com/kai-scheduler-queue: "dynamo" spec: # ... your deployment spec ``` > **Note:** The `nvidia.com/kai-scheduler-queue` annotation defaults to `"dynamo"`. If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with `kubectl get queues`. **Force LWS usage:** ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: my-multinode-deployment annotations: nvidia.com/enable-grove: "false" spec: # ... your deployment spec ``` ### The `multinode` Section The `multinode` section in a resource specification defines how many physical nodes the workload should span: ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: my-multinode-deployment spec: # ... your deployment spec services: my-service: ... 
multinode: nodeCount: 2 resources: limits: gpu: "2" # 2 GPUs per node ```

### GPU Distribution

The relationship between `multinode.nodeCount` and `gpu` is multiplicative:

- **`multinode.nodeCount`**: Number of physical nodes
- **`gpu`**: Number of GPUs per node
- **Total GPUs**: `multinode.nodeCount × gpu`

**Example:**

- `multinode.nodeCount: 2` × `gpu: "4"` = 8 total GPUs (4 GPUs per node across 2 nodes)
- `multinode.nodeCount: 4` × `gpu: "8"` = 32 total GPUs (8 GPUs per node across 4 nodes)

### Tensor Parallelism Alignment

The tensor parallelism (`tp-size` or `--tp`) in your command/args must match the total number of GPUs:

```yaml
# Example: 2 multinode.nodeCount × 4 GPUs = 8 total GPUs
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
spec:
  # ... your deployment spec
  services:
    my-service:
      ...
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "4"
      extraPodSpec:
        mainContainer:
          ...
          args:
            # Command args must use tp-size=8
            - "--tp-size"
            - "8" # Must equal multinode.nodeCount × gpu
```

## Backend-Specific Operator Behavior

When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps you troubleshoot issues and optimize your deployments.

### vLLM Backend

For vLLM multinode deployments, the operator automatically selects and configures the appropriate distributed execution mode.

#### Deployment Modes

The operator determines the deployment mode from your parallelism configuration:

**1. 
Tensor/Pipeline Parallelism Mode (Single model across nodes)** - **When used**: When `world_size > GPUs_per_node` where `world_size = tensor_parallel_size × pipeline_parallel_size` - **Use case**: Distributing a single model instance across multiple nodes using tensor or pipeline parallelism The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes. **Leader Node:** - **Command**: `ray start --head --port=6379 && --distributed-executor-backend ray` - **Behavior**: Starts Ray head node, then runs vLLM which creates a placement group spanning all Ray workers - **Probes**: All health probes remain active (liveness, readiness, startup) **Worker Nodes:** - **Command**: `ray start --address=:6379 --block` - **Behavior**: Joins Ray cluster and blocks; vLLM on leader spawns Ray actors to these workers - **Probes**: All probes (liveness, readiness, startup) are automatically removed vLLM's Ray executor automatically creates a placement group and spawns workers across the cluster. The `--nnodes` flag is NOT used with Ray - it's only compatible with the `mp` backend. **2. 
Data Parallel Mode (Multiple model instances across nodes)** - **When used**: When `world_size × data_parallel_size > GPUs_per_node` - **Use case**: Running multiple independent model instances across nodes with data parallelism (e.g., MoE models with expert parallelism) **All Nodes (Leader and Workers):** - **Injected Flags**: - `--data-parallel-address ` - Address of the coordination server - `--data-parallel-size-local ` - Number of data parallel workers per node - `--data-parallel-rpc-port 13445` - RPC port for data parallel coordination - `--data-parallel-start-rank ` - Starting rank for this node (calculated automatically) - **Probes**: Worker probes are removed; leader probes remain active **Note**: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers) #### Why Ray for Multi-Node TP/PP? vLLM supports two distributed executor backends: `ray` and `mp`. For multi-node deployments: - **Ray executor**: vLLM creates a placement group and spawns Ray actors across the cluster. Workers don't run vLLM directly - the leader's vLLM process manages everything. - **mp executor**: Each node must run its own vLLM process with `--nnodes`, `--node-rank`, `--master-addr`, `--master-port`. This approach is more complex to orchestrate. The Dynamo operator uses Ray because: 1. It aligns with vLLM's official multi-node documentation (see `multi-node-serving.sh`) 2. Simpler orchestration - only the leader runs vLLM, workers just need Ray agents 3. 
vLLM automatically handles placement group creation and worker management

#### Compilation Cache Support

When a volume mount is configured with `useAsCompilationCache: true`, the operator automatically sets:

- **`VLLM_CACHE_ROOT`**: Environment variable pointing to the cache mount point

### SGLang Backend

For SGLang multinode deployments, the operator injects distributed launch parameters:

#### Leader Node

- **Distributed Flags**: Injects `--dist-init-addr :29500 --nnodes --node-rank 0`
- **Probes**: All health probes remain active

#### Worker Nodes

- **Distributed Flags**: Injects `--dist-init-addr :29500 --nnodes --node-rank `
  - The `node-rank` is automatically determined from the pod's stateful identity
- **Probes**: All probes (liveness, readiness, startup) are automatically removed

**Note:** The operator injects these flags regardless of your command structure (direct Python commands or shell wrappers).

### TensorRT-LLM Backend

For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:

#### Leader Node

- **SSH Configuration**: Automatically sets up SSH keys and configuration from a Kubernetes secret
- **MPI Command**: Wraps your command in an `mpirun` command with:
  - Proper host list including all worker nodes
  - SSH configuration for passwordless authentication on port 2222
  - Environment variable propagation to all nodes
  - Activation of the Dynamo virtual environment
- **Probes**: All health probes remain active

#### Worker Nodes

- **SSH Daemon**: Replaces your command with SSH daemon setup and execution
  - Generates host keys in user-writable directories (non-privileged)
  - Configures SSH daemon to listen on port 2222
  - Sets up authorized keys for leader access
- **Probes**:
  - **Liveness and Startup**: Removed (workers run SSH daemon, not the main application)
  - **Readiness**: Replaced with TCP socket check on SSH port 2222
    - Initial Delay: 20 seconds
    - Period: 20 seconds
    - Timeout: 5 seconds
    - Failure Threshold: 10
#### Additional Configuration

- **Environment Variable**: `OMPI_MCA_orte_keep_fqdn_hostnames=1` is added to all nodes
- **SSH Volume**: Automatically mounts the SSH keypair secret (typically named `mpirun-ssh-key-`)
- **Automatic SSH key generation**: The operator automatically generates the SSH keypair secret when it detects a multi-node `DynamoGraphDeployment`. No manual secret creation is required.

### Compilation Cache Configuration

The operator supports compilation cache volumes for backend-specific optimization:

| Backend | Support Level | Environment Variables | Default Mount Point |
|---------|--------------|----------------------|---------------------|
| vLLM | Fully Supported | `VLLM_CACHE_ROOT` | User-specified |
| SGLang | Partial Support | _None (pending upstream)_ | User-specified |
| TensorRT-LLM | Partial Support | _None (pending upstream)_ | User-specified |

To enable compilation cache, add a volume mount with `useAsCompilationCache: true` in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.

## Next Steps

For additional support and examples, see the working multinode configurations in:

- **SGLang**: [examples/backends/sglang/deploy/](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/deploy/README.md)
- **TensorRT-LLM**: [examples/backends/trtllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/deploy/README.md)
- **vLLM**: [examples/backends/vllm/deploy/](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/deploy/README.md)

These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
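Before applying a multinode manifest, it can help to check the GPU arithmetic and the expected orchestrator by hand. The sketch below restates two rules from this guide as small shell functions: total GPUs equal `multinode.nodeCount` times the per-node `gpu` limit (and `tp-size` must match), and Grove is preferred over LWS unless `nvidia.com/enable-grove: "false"` is set. The function names are hypothetical helpers, not part of Dynamo, and the `none` result for a cluster with neither orchestrator is an assumption, since this guide does not describe that case.

```bash
# Sketch only: these helpers are hypothetical and not shipped with Dynamo.

# Total GPUs = multinode.nodeCount x per-node gpu limit.
total_gpus() {
  node_count=$1
  gpus_per_node=$2
  echo $((node_count * gpus_per_node))
}

# Succeed when tp-size equals the total GPU count; fail otherwise.
# Usage: check_tp_size NODE_COUNT GPUS_PER_NODE TP_SIZE
check_tp_size() {
  node_count=$1
  gpus_per_node=$2
  tp_size=$3
  expected=$(total_gpus "$node_count" "$gpus_per_node")
  if [ "$tp_size" -eq "$expected" ]; then
    echo "ok: tp-size=$tp_size matches $node_count x $gpus_per_node GPUs"
  else
    echo "mismatch: tp-size=$tp_size, expected $expected" >&2
    return 1
  fi
}

# Orchestrator choice following the selection rules in this guide.
# Usage: select_orchestrator GROVE_INSTALLED LWS_INSTALLED [ENABLE_GROVE]
#   GROVE_INSTALLED, LWS_INSTALLED: "true" or "false"
#   ENABLE_GROVE: value of the nvidia.com/enable-grove annotation, if set
select_orchestrator() {
  grove=$1
  lws=$2
  enable_grove=${3:-}
  if [ "$grove" = true ] && [ "$lws" = true ]; then
    # Both available: Grove unless explicitly disabled by annotation.
    if [ "$enable_grove" = false ]; then echo lws; else echo grove; fi
  elif [ "$grove" = true ]; then
    echo grove
  elif [ "$lws" = true ]; then
    echo lws
  else
    echo none  # assumption: this case is not covered by the guide
  fi
}

check_tp_size 2 4 8
select_orchestrator true true false
```

This only mirrors the documented rules for a quick sanity check; the operator's actual validation and selection happen in the cluster.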
# Grove Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management. ## Overview Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource. ### How Grove Works for Disaggregated Serving Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages: - **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks - **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns - **Resource Optimization**: Better utilization of hardware resources through specialized workload placement - **Fault Isolation**: Issues in one component don't necessarily affect others ## Core Components and API Resources Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups: ### PodCliqueSet The top-level Grove object that defines a group of components managed and colocated together. Key features include: - Support for autoscaling - Topology-aware spread of replicas for availability - Unified management of multiple disaggregated components ### PodClique Represents a group of pods with a specific role (e.g., leader, worker, frontend). 
Each clique features: - Independent configuration options - Custom scaling logic support - Role-specific resource allocation ### PodCliqueScalingGroup A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior. ## Key Capabilities for Disaggregated Serving Grove provides several specialized features that make it particularly well-suited for disaggregated serving: ### Flexible Gang Scheduling PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodCliqueSet to prevent resource deadlocks and ensure all components of a disaggregated system start together. ### Multi-level Horizontal Auto-Scaling Supports pluggable horizontal auto-scaling solutions to scale PodCliqueSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements. ### Network Topology-Aware Scheduling Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication. Dynamo exposes this capability through the `topologyConstraint` field on DynamoGraphDeployment resources, so users can opt in to topology-aware placement without interacting with Grove internals. See the [Topology Aware Scheduling guide](/dynamo/kubernetes-deployment/scale/topology-aware-scheduling) for configuration details and examples. ### Custom Startup Dependencies Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components. 
## Use Cases and Examples Grove specifically supports: - **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick - **Single-node disaggregated inference** for optimized resource utilization - **Agentic pipelines of models** for complex AI workflows - **Standard aggregated serving** patterns for single node or single GPU inference ## Integration with NVIDIA Dynamo Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack: ### Complementary Roles - **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads - **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management ### Release Coordination Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap. ### Unified AI Platform The integration creates a comprehensive platform where: - Grove manages complex orchestration of disaggregated components - Dynamo provides the serving infrastructure, routing capabilities, and backend integrations - Together they enable sophisticated AI serving architectures with simplified management ## Architecture Benefits Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by: 1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition 2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate 3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components 4. 
**Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling ## Getting Started Grove relies on KAI Scheduler for resource allocation and scheduling. For KAI Scheduler, see the [KAI Scheduler Deployment Guide](https://github.com/NVIDIA/KAI-Scheduler). For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md). For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](/dynamo/kubernetes-deployment/scale/multinode-deployments), which demonstrates multi-node disaggregated serving scenarios. For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove). Dynamo Kubernetes Platform also allows you to install Grove and KAI Scheduler as part of the platform installation. See the [Dynamo Kubernetes Platform Deployment Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) for more details. # Topology Aware Scheduling Topology Aware Scheduling (TAS) lets you control where Dynamo places inference workload pods relative to the cluster's network topology. By packing related pods within the same rack, block, or other topology domain, you reduce inter-node latency and improve throughput — especially for disaggregated serving where prefill, decode, and routing components communicate frequently. TAS is **opt-in**. Existing deployments without topology constraints continue to work unchanged. TAS controls pod placement. To constrain or bias the Dynamo router's prefill-to-decode handoff after pods are already running, see [Topology-Aware KV Transfer](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer). ## Prerequisites | Requirement | Details | |-------------|---------| | **Grove** | Installed on the cluster. See the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md). 
| | **ClusterTopology CR** | A cluster-scoped `ClusterTopology` resource configured by the cluster admin, mapping topology domain names to node labels. See [Grove documentation](https://github.com/NVIDIA/grove) for setup instructions. | | **KAI Scheduler** | [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is required by Grove for topology-aware pod placement. | | **Dynamo operator** | The latest Dynamo operator Helm chart includes read-only RBAC for `clustertopologies.grove.io` via a dedicated ClusterRole. No extra configuration is needed. | ## Topology Domains Topology domains are **free-form** identifiers defined by the cluster admin in the `ClusterTopology` CR. Common examples include `region`, `zone`, `datacenter`, `block`, `rack`, `host`, and `numa`, but any name matching the pattern `^[a-z0-9]([a-z0-9-]*[a-z0-9])?$` is valid (no leading or trailing hyphens). Domain names must match exactly what is configured in the `ClusterTopology` CR referenced by `topologyProfile`. During DGD creation, the Dynamo webhook validates that every `packDomain` exists in the referenced `ClusterTopology`. When you specify a `packDomain`, the scheduler packs all replicas of the constrained component within a single instance of that domain. For example, `packDomain: rack` means "place all pods within the same rack." ## Topology Profile Every DGD that uses topology constraints must reference a `ClusterTopology` CR by name via the `topologyProfile` field. This field is set at `spec.topologyConstraint` (the deployment level) and is inherited by all services — services must not set `topologyProfile` themselves. The `topologyProfile` tells the Dynamo operator and the underlying framework which topology hierarchy to use for scheduling and validation. ## Enabling TAS on a DGD Add a `topologyConstraint` field to your `DynamoGraphDeployment` at the deployment level, at the service level, or both. The deployment level must include a `topologyProfile`. 
Each service-level constraint specifies a `packDomain`; at the deployment level, `packDomain` is optional.

### Example 1: Deployment-Level Constraint (Services Inherit)

All services inherit the deployment-level constraint. This is the simplest configuration when you want uniform topology packing.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      componentType: worker
      replicas: 2
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```

### Example 2: Service-Level Constraint Only

Only the specified service gets topology packing. Other services are scheduled without topology constraints. The deployment level must still set `topologyProfile`.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
  services:
    VllmWorker:
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: rack
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```

### Example 3: Mixed (Deployment-Level Default + Per-Service Override)

Set a broad constraint at the deployment level and a narrower override on specific services. Service-level constraints must be **equal to or narrower than** the deployment-level constraint (determined by the ordering in the `ClusterTopology` CR).
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  topologyConstraint:
    topologyProfile: my-cluster-topology
    packDomain: zone
  services:
    VllmWorker:
      componentType: worker
      replicas: 2
      multinode:
        nodeCount: 4
      topologyConstraint:
        packDomain: block  # narrower than zone: valid
      envFromSecret: hf-token-secret
      resources:
        limits:
          gpu: "8"
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model meta-llama/Llama-4-Maverick-17B-128E
    Frontend:
      componentType: frontend
      replicas: 1
      # inherits zone from spec.topologyConstraint
      extraPodSpec:
        mainContainer:
          image: my-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.frontend
```

## Hierarchy Rules

When **both** a deployment-level and a service-level `topologyConstraint` are set, the service's `packDomain` must be **equal to or narrower** than the deployment-level `packDomain`. "Narrower" is determined by the ordering of levels in the referenced `ClusterTopology` CR: levels appearing later in the `spec.levels` array are considered narrower. The Dynamo webhook rejects the DGD at creation time if a service constraint is broader than the deployment constraint (when validating against a `ClusterTopology` CR).

When only one level is set (deployment-level only or service-level only), no hierarchy check applies.
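The hierarchy rule can be pictured as a comparison of positions in the ordered list of levels. The following is a minimal sketch, not the webhook's actual implementation: the function name `is_valid_service_domain` and the example `levels` list are hypothetical, and the only thing taken from the text above is that later entries in `spec.levels` are narrower.

```python
from typing import Optional, Sequence


def is_valid_service_domain(
    levels: Sequence[str],
    deployment_domain: Optional[str],
    service_domain: Optional[str],
) -> bool:
    """Return True if the service-level packDomain passes the hierarchy rule.

    `levels` is ordered broadest-first, mirroring the ClusterTopology
    `spec.levels` array (later entries are narrower).
    """
    # When only one level is set, no hierarchy check applies.
    if deployment_domain is None or service_domain is None:
        return True
    # A later index means a narrower domain; the service must be
    # equal to or narrower than the deployment-level domain.
    return levels.index(service_domain) >= levels.index(deployment_domain)


# Hypothetical ordering, for illustration only.
levels = ["zone", "block", "rack", "host"]
print(is_valid_service_domain(levels, "zone", "block"))  # True
print(is_valid_service_domain(levels, "rack", "zone"))   # False
print(is_valid_service_domain(levels, None, "rack"))     # True
```

In this sketch, `levels.index` raises `ValueError` for a domain that is not in the list, which loosely corresponds to the separate "packDomain not found" rejection rather than a hierarchy violation.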
| Configuration | Behavior |
|---------------|----------|
| `spec.topologyConstraint` set, service has none | Service inherits the deployment-level constraint |
| `spec.topologyConstraint` set, service also set | Both applied; service must be narrower or equal |
| `spec.topologyConstraint.topologyProfile` set, no `packDomain` at spec | Profile is provided for service-level constraints only |
| Neither set | No topology constraints (default) |

## Field Reference

| Field | Level | Required | Description |
|-------|-------|----------|-------------|
| `topologyProfile` | `spec.topologyConstraint` | Yes (when any constraint is set) | Name of the `ClusterTopology` CR defining the topology hierarchy. |
| `topologyProfile` | service-level `topologyConstraint` | N/A (not in schema) | Inherited from `spec.topologyConstraint`. The service-level type does not include this field. |
| `packDomain` | `spec.topologyConstraint` | Optional | Default pack domain for services that don't specify their own. |
| `packDomain` | service-level `topologyConstraint` | Required | Pack domain for this service. Must match a level in the `ClusterTopology` CR. |

## Multinode Considerations

For multinode services (services with a `multinode` section), the topology constraint is applied at the **scaling group** level rather than on individual worker pods. This is important because a multinode service spawns `replicas × nodeCount` pods. For example, 2 replicas with `nodeCount: 4` produces 8 pods across 8 nodes. Applying the constraint at the scaling group level means the scheduler packs each replica's set of nodes within the requested domain, without over-constraining individual pods to a single host.

For example, with this configuration:

```yaml
VllmWorker:
  replicas: 2
  multinode:
    nodeCount: 4
  topologyConstraint:
    packDomain: rack
```

Each replica's 4 nodes are packed within a single rack. The two replicas may land in different racks (the constraint applies per-replica, not across all replicas).
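To make the `replicas × nodeCount` arithmetic concrete, here is a small sketch that groups pods the way the paragraph above describes: one group per replica, each packed independently. The pod naming scheme and the `scaling_groups` helper are hypothetical and only serve the illustration.

```python
from typing import List


def scaling_groups(service: str, replicas: int, node_count: int) -> List[List[str]]:
    """Return one group per replica; each group holds that replica's pods.

    The packDomain constraint applies to each inner list on its own,
    not to the union of all groups.
    """
    return [
        [f"{service}-{r}-node-{n}" for n in range(node_count)]  # hypothetical names
        for r in range(replicas)
    ]


groups = scaling_groups("vllmworker", replicas=2, node_count=4)
total_pods = sum(len(group) for group in groups)
print(total_pods)  # 8
for group in groups:
    # Each group of 4 pods must fit within a single rack;
    # the two groups may end up in different racks.
    print(group)
```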
**Recommendation:** For multinode services, use `rack` or `block` as the `packDomain` to keep workers within a high-bandwidth domain while still allowing the scheduler to spread them across hosts within that domain. Avoid `host` for multinode services, as packing multiple nodes onto one host is not meaningful.

## Immutability

Topology constraints **cannot be changed after the DGD is created**. This includes:

- Adding a topology constraint to a DGD or service that did not have one
- Removing an existing topology constraint
- Changing the `topologyProfile` value
- Changing the `packDomain` value

To change topology constraints, **delete and recreate** the DGD. This matches the behavior of the underlying framework, which enforces immutability on topology constraints for generated resources.

## Monitoring Topology Enforcement

When any topology constraint is set, the DGD status includes a `TopologyLevelsAvailable` condition that reports whether the topology levels referenced by your constraints still exist in the cluster topology.

**Healthy state:**

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "True"
      reason: AllTopologyLevelsAvailable
      message: "All required topology levels are available in the cluster topology"
```

**Degraded state** (e.g., an admin removed a topology level from the `ClusterTopology` CR after deployment):

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
    - type: TopologyLevelsAvailable
      status: "False"
      reason: TopologyLevelsUnavailable
      message: "Topology level 'rack' is no longer available in the cluster topology"
```

When topology levels become unavailable, Dynamo emits a **Warning** event on the DGD. The deployment may still appear `Ready` because the underlying framework keeps pods running, but topology placement is no longer guaranteed.
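Because `Ready` can stay `"True"` while topology enforcement is degraded, a monitoring check needs to look at the `TopologyLevelsAvailable` condition specifically. The sketch below works on a status dictionary shaped like the YAML above (for example, parsed from `kubectl get ... -o json`); the helper names are hypothetical and this is not part of any Dynamo client library.

```python
from typing import Any, Dict, Optional


def find_condition(status: Dict[str, Any], cond_type: str) -> Optional[Dict[str, Any]]:
    """Return the first condition of the given type, or None if absent."""
    for cond in status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def is_topology_degraded(status: Dict[str, Any]) -> bool:
    """True when TopologyLevelsAvailable is present and reports "False"."""
    cond = find_condition(status, "TopologyLevelsAvailable")
    return cond is not None and cond.get("status") == "False"


# Status mirroring the degraded example above.
degraded = {
    "conditions": [
        {"type": "Ready", "status": "True"},
        {
            "type": "TopologyLevelsAvailable",
            "status": "False",
            "reason": "TopologyLevelsUnavailable",
            "message": "Topology level 'rack' is no longer available in the cluster topology",
        },
    ]
}

ready = find_condition(degraded, "Ready")
print(ready["status"])                 # True
print(is_topology_degraded(degraded))  # True
```

A DGD without any topology constraint has no such condition, so `is_topology_degraded` returns `False` for it in this sketch.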
## Troubleshooting

### DGD rejected: "ClusterTopology not found"

The Dynamo webhook validates that the `ClusterTopology` CR referenced by `topologyProfile` exists when any topology constraint is set. If it cannot read the `ClusterTopology` CR:

- Verify that the cluster admin has created the `ClusterTopology` resource named in `topologyProfile`. See the [Grove documentation](https://github.com/NVIDIA/grove) for setup.
- Verify that the Dynamo operator has RBAC to read `clustertopologies.grove.io` (included in the default Helm chart).

### DGD rejected: "packDomain not found in cluster topology"

The specified `packDomain` does not exist as a level in the referenced `ClusterTopology` CR. Check which domains are defined:

```bash
kubectl get clustertopology -o yaml
```

Ensure the domain you are requesting (e.g., `rack`) is configured in the `ClusterTopology` with a corresponding node label.

### DGD rejected: "topologyProfile is required"

Any DGD that has a topology constraint (at the spec or service level) must set `spec.topologyConstraint.topologyProfile` to the name of a `ClusterTopology` CR. Add the `topologyProfile` field to `spec.topologyConstraint`.

### Pods stuck in Pending

The scheduler cannot satisfy the topology constraint. Common causes:

- Not enough nodes within a single instance of the requested domain (e.g., requesting 8 GPUs packed in one rack, but no rack has 8 available GPUs).
- Node labels do not match the `ClusterTopology` configuration.

Inspect scheduler events for details:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

### TopologyLevelsAvailable is False

The DGD was deployed successfully, but the topology definition has since changed. The underlying framework detected that one or more required topology levels are no longer available.

- Check the condition message for specifics.
- Inspect the `ClusterTopology` CR to see if a domain was removed or renamed.
- If the topology was intentionally changed, delete and recreate the DGD to pick up the new topology.
### DGD rejected: hierarchy violation

A service-level `packDomain` is broader than the deployment-level `packDomain`. "Broader" and "narrower" are determined by the order of levels in the `ClusterTopology` CR: levels appearing earlier in `spec.levels` are broader. Ensure service-level constraints are equal to or narrower than the deployment-level constraint.

# Service Discovery

Dynamo components (frontends, workers, planner) need to discover each other and their capabilities at runtime. We refer to this as service discovery. Two service discovery backends are supported on Kubernetes.

## Discovery Backends

| Backend | Default | Dependencies | Use Case |
|---------|---------|--------------|----------|
| **Kubernetes** | ✅ Yes | None (native K8s) | Recommended for all Kubernetes deployments |
| **KV Store (etcd)** | No | etcd cluster | Legacy deployments |

## Kubernetes Discovery (Default)

Kubernetes discovery is the default and recommended backend when running on Kubernetes. It uses native Kubernetes primitives to facilitate discovery of components:

- **DynamoWorkerMetadata CRD**: Each worker stores its registered endpoints and model cards in a Custom Resource
- **EndpointSlices**: EndpointSlices signal each component's readiness status

### Implementation Details

Each pod runs a **discovery daemon** that watches both EndpointSlices and DynamoWorkerMetadata CRs. A pod is only discoverable when it appears as "ready" in an EndpointSlice AND has a corresponding `DynamoWorkerMetadata` CR. This correlation ensures pods aren't discoverable until they're ready, metadata is immediately available, and stale entries are cleaned up when pods terminate.
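The correlation rule amounts to a set intersection: a pod counts only if it shows up in both views. This is a simplified sketch of that idea, not the discovery daemon's code. It assumes both inputs are keyed by pod name (the documentation states that each `DynamoWorkerMetadata` CR is named after its pod), and the function name is hypothetical.

```python
from typing import Set


def discoverable_pods(ready_in_endpointslice: Set[str], has_metadata_cr: Set[str]) -> Set[str]:
    """Pods that are both ready in an EndpointSlice AND have a metadata CR."""
    return ready_in_endpointslice & has_metadata_cr


ready = {"worker-a", "worker-b"}     # pods marked ready in EndpointSlices
metadata = {"worker-b", "worker-c"}  # pods with a DynamoWorkerMetadata CR

print(discoverable_pods(ready, metadata))  # {'worker-b'}
```

Here `worker-a` is ready but has not published metadata yet, and `worker-c` has metadata but is not ready, so neither is discoverable.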
#### DynamoWorkerMetadata CRD

Each worker pod creates a `DynamoWorkerMetadata` CR that stores its discovery metadata:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoWorkerMetadata
metadata:
  name: my-worker-pod-abc123
  namespace: dynamo-system
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: my-worker-pod-abc123
      uid: <pod-uid>
      controller: true
spec:
  data:
    endpoints:
      "dynamo/backend/generate":
        type: Endpoint
        namespace: dynamo
        component: backend
        endpoint: generate
        instance_id: 12345678901234567890
        transport:
          nats_tcp: "dynamo_backend.generate-abc123"
    model_cards: {}
```

The CR is named after the pod and includes an owner reference for automatic garbage collection when the pod is deleted.

#### EndpointSlices

While DynamoWorkerMetadata resources provide an up-to-date snapshot of a component's capabilities, EndpointSlices give a snapshot of the health of the various Dynamo components.

The operator creates a Kubernetes Service targeting the Dynamo components. The Kubernetes controller in turn creates and maintains EndpointSlice resources that keep track of the readiness of the pods targeted by the Service. Watching these slices gives us an up-to-date snapshot of which Dynamo components are ready to serve traffic.

##### Readiness Probes

A pod is marked ready if the readiness probe succeeds. On Dynamo workers, this is when the `generate` endpoint is available and healthy. These probes are configured by the Dynamo operator for each pod/component.

#### RBAC

Each Dynamo component pod is automatically given a ServiceAccount that allows it to watch `EndpointSlice` and `DynamoWorkerMetadata` resources within its namespace.
#### Environment Variables

The following environment variables are automatically injected into pods by the operator to facilitate service discovery:

| Variable | Description |
|----------|-------------|
| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
| `POD_NAME` | Pod name (via downward API) |
| `POD_NAMESPACE` | Pod namespace (via downward API) |
| `POD_UID` | Pod UID (via downward API) |

The pod's instance ID is deterministically generated by hashing the pod name, ensuring consistent identity and correlation between EndpointSlices and CRs.

## KV Store Discovery (etcd)

To use etcd-based discovery instead of Kubernetes-native discovery, add the annotation to your DynamoGraphDeployment:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
  annotations:
    nvidia.com/dynamo-discovery-backend: etcd
spec:
  services:
    # ...
```

This requires an etcd cluster to be available. The etcd connection is configured via the platform Helm chart.

# Webhooks

This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.

## Overview

The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation.

All webhook types (validating, mutating, conversion, etc.) share the same **webhook server** and **TLS certificate infrastructure**, making certificate management consistent across all webhook operations.
### Key Features

- ✅ **Always enabled** - Webhooks are a required component of the operator
- ✅ **Shared certificate infrastructure** - All webhook types use the same TLS certificates
- ✅ **Automatic certificate generation and rotation** - Built-in cert-controller, no manual management required
- ✅ **cert-manager integration** - Optional integration for custom PKI or organizational certificate policies
- ✅ **Immutability enforcement** - Critical fields protected via CEL validation rules

### Current Webhook Types

- **Validating Webhooks**: Validate custom resource specifications before persistence
  - `DynamoComponentDeployment` validation
  - `DynamoGraphDeployment` validation
  - `DynamoModel` validation
  - `DynamoGraphDeploymentRequest` validation
- **Mutating Webhooks**: Apply default values to resources on creation
  - `DynamoGraphDeployment` defaulting

**Note:** All webhook types use the same certificate infrastructure described in this document.

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  API Server                                                     │
│  1. User submits CR (kubectl apply)                             │
│  2. API server calls MutatingWebhookConfiguration               │
└────────────────────────┬────────────────────────────────────────┘
                         │ HTTPS (TLS required)
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  Webhook Server (in Operator Pod)                               │
│  3. Applies defaults (e.g., operator version annotation)        │
│  4. Returns mutated CR                                          │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  API Server                                                     │
│  5. API server calls ValidatingWebhookConfiguration             │
└────────────────────────┬────────────────────────────────────────┘
                         │ HTTPS (TLS required)
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  Webhook Server (in Operator Pod)                               │
│  6. Validates CR against business rules                         │
│  7. Returns admit/deny decision + warnings                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│  API Server                                                     │
│  8. If admitted: Persist CR to etcd                             │
│  9. If denied: Return error to user                             │
└─────────────────────────────────────────────────────────────────┘
```

### Admission Flow

1. **Mutating webhooks**: Apply defaults and transformations before validation
2. **Validating webhooks**: Validate the (possibly mutated) CR against business rules
3. **CEL validation**: Kubernetes-native immutability checks (always active)

---

## Upgrading from versions with `webhook.enabled: false`

The `webhook.enabled` Helm value has been removed. Webhooks are now a required component of the operator and are always active. If you previously ran with `webhook.enabled: false`, take the following steps before upgrading:

1. **Remove `webhook.enabled`** from any custom values files. Helm will ignore the unknown key, but it should be cleaned up to avoid confusion.
2. **Ensure port 9443 is reachable** from the Kubernetes API server to the operator pod. If you have `NetworkPolicy` rules or firewall configurations restricting traffic, add an ingress rule allowing the API server to reach the webhook server on port 9443.
3. **Ensure webhook TLS certificates are available.** By default, the operator's built-in cert-controller generates and rotates self-signed certificates automatically at startup, so no action is needed. If you use cert-manager or externally managed certificates, verify your configuration is in place before upgrading.
---

## Configuration

### Certificate Management Options

The operator supports three certificate management modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| **Automatic (Default)** | Operator's built-in cert-controller generates and rotates certificates | All environments (recommended) |
| **cert-manager** | Integrate with cert-manager for certificate lifecycle management | Clusters with cert-manager and custom PKI requirements |
| **External** | Bring your own certificates | Environments with externally managed PKI |

---

### Advanced Configuration

#### Complete Configuration Reference

```yaml
dynamo-operator:
  webhook:
    # Certificate management (optional, to use cert-manager instead of built-in)
    certManager:
      enabled: false
      issuerRef:
        kind: Issuer
        name: selfsigned-issuer

    # Certificate secret configuration
    certificateSecret:
      name: webhook-server-cert
      external: false  # Set to true for externally managed certificates

    # Webhook behavior configuration
    failurePolicy: Fail  # Fail (reject on error) or Ignore (allow on error)
    timeoutSeconds: 10   # Webhook timeout

    # Namespace filtering (advanced)
    namespaceSelector: {}  # Kubernetes label selector for namespaces
```

#### Failure Policy

```yaml
# Fail: Reject resources if webhook is unavailable (recommended for production)
webhook:
  failurePolicy: Fail

# Ignore: Allow resources if webhook is unavailable (use with caution)
webhook:
  failurePolicy: Ignore
```

**Recommendation:** Use `Fail` in production to ensure validation is always enforced. Only use `Ignore` if you need high availability and can tolerate occasional invalid resources.
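The trade-off between the two policies is easiest to see as a decision function. The sketch below models the general Kubernetes `failurePolicy` semantics described above (reject on error versus allow on error); it is an illustration with hypothetical names, not code from the operator or the API server.

```python
def admission_outcome(failure_policy: str, webhook_reachable: bool, webhook_allows: bool) -> bool:
    """Return True if the resource is admitted.

    When the webhook answers, its decision stands. When it cannot be
    reached, the failure policy decides: "Fail" rejects, "Ignore" admits.
    """
    if webhook_reachable:
        return webhook_allows
    if failure_policy == "Fail":
        return False
    if failure_policy == "Ignore":
        return True
    raise ValueError(f"unknown failurePolicy: {failure_policy!r}")


# Webhook down: Fail blocks the request, Ignore lets it through unvalidated.
print(admission_outcome("Fail", webhook_reachable=False, webhook_allows=False))    # False
print(admission_outcome("Ignore", webhook_reachable=False, webhook_allows=False))  # True
# Webhook up: the policy does not matter, the validation result does.
print(admission_outcome("Ignore", webhook_reachable=True, webhook_allows=False))   # False
```

The second call is the case to keep in mind with `Ignore`: an invalid resource can be persisted while the webhook is unavailable.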
#### Namespace Filtering

Control which namespaces are validated (applies to **cluster-wide operator** only):

```yaml
# Only validate resources in namespaces with specific labels
webhook:
  namespaceSelector:
    matchLabels:
      dynamo-validation: enabled

# Or exclude specific namespaces
webhook:
  namespaceSelector:
    matchExpressions:
      - key: dynamo-validation
        operator: NotIn
        values: ["disabled"]
```

**Note:** For **namespace-restricted operators** (deprecated), the namespace selector is automatically set to validate only the operator's namespace. This configuration is ignored in namespace-restricted mode.

---

## Certificate Management

### Automatic Certificates (Default)

**Zero configuration required!** The operator's built-in cert-controller generates and rotates certificates automatically at startup.

#### How It Works

1. **Operator starts**: The `CertManager` checks for an existing certificate Secret (configured via `webhook.certificateSecret.name`, default: `webhook-server-cert`). If missing or invalid, it generates a self-signed Root CA and server certificate and writes them to the Secret.
2. **CA bundle injection**: The `CABundleInjector` reads `ca.crt` from the Secret and patches both the `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration` with the base64-encoded CA bundle.
3. **Certificate rotation**: The cert-controller monitors certificate validity and regenerates certificates before they expire.
4. **Webhook server starts**: The webhook server only begins serving after certificates are confirmed ready, preventing startup races.
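The startup check in step 1 and the rotation in step 3 both come down to one question: does the certificate need to be regenerated? The following is a rough sketch of that decision, assuming a certificate is "invalid" when it is close to expiry or lacks a required SAN. The `CertInfo` type, the function name, and the 30-day `renew_before` threshold are all hypothetical; the operator's actual criteria and thresholds are not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CertInfo:
    """Minimal view of a server certificate (hypothetical type)."""
    not_after: datetime   # expiry time
    sans: FrozenSet[str]  # DNS names the certificate covers


def needs_regeneration(
    cert: Optional[CertInfo],
    required_sans: FrozenSet[str],
    now: datetime,
    renew_before: timedelta = timedelta(days=30),  # assumed threshold
) -> bool:
    """True if the Secret is missing or the certificate should be replaced."""
    if cert is None:
        return True  # no certificate Secret yet
    if cert.not_after - now <= renew_before:
        return True  # expired or expiring soon
    if not required_sans <= cert.sans:
        return True  # a required SAN is missing
    return False


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
sans = frozenset({"dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc"})
valid = CertInfo(not_after=now + timedelta(days=3650), sans=sans)

print(needs_regeneration(None, sans, now))   # True
print(needs_regeneration(valid, sans, now))  # False
```

Skipping regeneration when the existing certificate passes all checks is what keeps operator restarts fast in this model.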
#### Certificate Validity - **Root CA**: 10 years - **Server Certificate**: 10 years (same as Root CA) - **Automatic rotation**: The cert-controller monitors validity and regenerates before expiration #### Smart Certificate Management The cert-controller is intelligent about certificate lifecycle: - ✅ **Checks existing certificates** at startup before generating new ones - ✅ **Skips generation** if valid certificates already exist in the Secret - ✅ **Regenerates** only when needed (missing, expiring soon, or incorrect SANs) This means: - Fast operator restarts (no unnecessary cert generation) - No dependency on Helm hooks or external Jobs - Certificates persist across pod restarts (stored in Secret) #### Manual Certificate Rotation If you need to rotate certificates manually: ```bash # Delete the certificate secret -- the operator will regenerate it on restart kubectl delete secret -webhook-server-cert -n # Restart the operator pod to trigger regeneration kubectl rollout restart deployment/-dynamo-operator -n ``` --- ### cert-manager Integration For clusters with cert-manager installed, you can enable automated certificate lifecycle management. #### Prerequisites 1. **cert-manager installed** (v1.0+) 2. **CA issuer configured** (e.g., `selfsigned-issuer`) #### Configuration ```yaml dynamo-operator: webhook: certManager: enabled: true issuerRef: kind: Issuer # Or ClusterIssuer name: selfsigned-issuer # Your issuer name ``` #### How It Works 1. **Helm creates Certificate resource**: Requests TLS certificate from cert-manager 2. **cert-manager generates certificate**: Based on configured issuer 3. **cert-manager stores in Secret**: `-webhook-server-cert` 4. **cert-manager ca-injector**: Automatically injects CA bundle into `ValidatingWebhookConfiguration` 5. 
**Operator pod**: Mounts certificate secret and serves webhook #### When to Use cert-manager - ✅ **Custom validity periods**: Configure certificate lifetime to match organizational policy - ✅ **Integration with existing PKI**: Use your organization's certificate infrastructure - ✅ **Centralized certificate management**: Manage all cluster certificates through cert-manager #### Certificate Rotation With cert-manager, certificate rotation is **fully automated**: 1. **Leaf certificate rotation** (default: every year) - cert-manager auto-renews before expiration - controller-runtime auto-reloads new certificate - **No pod restart required** - **No caBundle update required** (same Root CA) 2. **Root CA rotation** (every 10 years) - cert-manager rotates Root CA - ca-injector auto-updates caBundle in `ValidatingWebhookConfiguration` - **No manual intervention required** #### Example: Self-Signed Issuer ```yaml apiVersion: cert-manager.io/v1 kind: Issuer metadata: name: selfsigned-issuer namespace: dynamo-system spec: selfSigned: {} --- # Enable in platform values.yaml dynamo-operator: webhook: certManager: enabled: true issuerRef: kind: Issuer name: selfsigned-issuer ``` --- ### External Certificates Bring your own certificates for custom PKI requirements. #### Steps 1. **Create certificate secret manually**: ```bash kubectl create secret tls -webhook-server-cert \ --cert=tls.crt \ --key=tls.key \ -n # Also add ca.crt to the secret kubectl patch secret -webhook-server-cert -n \ --type='json' \ -p='[{"op": "add", "path": "/data/ca.crt", "value": "'$(base64 -w0 < ca.crt)'"}]' ``` 2. **Configure operator to use external secret**: ```yaml dynamo-operator: webhook: certificateSecret: external: true caBundle: # Must manually specify ``` 3. **Deploy operator**: ```bash helm install dynamo-platform . 
-n -f values.yaml ``` #### Certificate Requirements - **Secret name**: Must match `webhook.certificateSecret.name` (default: `webhook-server-cert`) - **Secret keys**: `tls.crt`, `tls.key`, `ca.crt` - **Certificate SAN**: Must include `..svc` - Example: `dynamo-platform-dynamo-operator-webhook-service.dynamo-system.svc` --- ## Multi-Operator Deployments (DEPRECATED) > **DEPRECATED:** Namespace-restricted mode and multi-operator deployments are deprecated and will be removed in a future release. Use a single cluster-wide operator instead. The operator supports running both **cluster-wide** and **namespace-restricted** instances simultaneously using a **lease-based coordination mechanism**. ### Scenario ``` Cluster: ├─ Operator A (cluster-wide, namespace: platform-system) │ └─ Validates all namespaces EXCEPT team-a └─ Operator B (namespace-restricted, namespace: team-a) └─ Validates only team-a namespace ``` ### How It Works 1. **Namespace-restricted operator** creates a Lease in its namespace 2. **Cluster-wide operator** watches for Leases named `dynamo-operator-ns-lock` 3. **Cluster-wide operator** skips validation for namespaces with active Leases 4. **Namespace-restricted operator** validates resources in its namespace ### Lease Configuration The lease mechanism is **automatically configured** based on deployment mode: ```yaml # Cluster-wide operator (default) namespaceRestriction: enabled: false # → Watches for leases in all namespaces # → Skips validation for namespaces with active leases # Namespace-restricted operator namespaceRestriction: enabled: true namespace: team-a # → Creates lease in team-a namespace # → Does NOT check for leases (no cluster permissions) ``` ### Deployment Example ```bash # 1. Deploy cluster-wide operator helm install platform-operator dynamo-platform \ -n platform-system \ --set namespaceRestriction.enabled=false # 2. 
Deploy namespace-restricted operator for team-a helm install team-a-operator dynamo-platform \ -n team-a \ --set namespaceRestriction.enabled=true \ --set namespaceRestriction.namespace=team-a ``` ### ValidatingWebhookConfiguration Naming The webhook configuration name reflects the deployment mode: - **Cluster-wide**: `-validating` - **Namespace-restricted**: `-validating-` Example: ```bash # Cluster-wide platform-operator-validating # Namespace-restricted (team-a) team-a-operator-validating-team-a ``` This allows multiple webhook configurations to coexist without conflicts. ### Lease Health If the namespace-restricted operator is deleted or becomes unhealthy: - Lease expires after `leaseDuration + gracePeriod` (default: ~30 seconds) - Cluster-wide operator automatically resumes validation for that namespace --- ## Troubleshooting ### Webhook Not Called **Symptoms:** - Invalid resources are accepted - No validation errors in logs **Checks:** 1. **Verify webhook configuration exists**: ```bash kubectl get validatingwebhookconfiguration | grep dynamo ``` 2. **Check webhook configuration**: ```bash kubectl get validatingwebhookconfiguration -o yaml # Verify: # - caBundle is present and non-empty # - clientConfig.service points to correct service # - webhooks[].namespaceSelector matches your namespace ``` 3. **Verify webhook service exists**: ```bash kubectl get service -n | grep webhook ``` 4. **Check operator logs for webhook startup**: ```bash kubectl logs -n deployment/-dynamo-operator | grep webhook # Should see: "Registering validation webhooks" # Should see: "Starting webhook server" ``` --- ### Connection Refused Errors **Symptoms:** ``` Error from server (InternalError): Internal error occurred: failed calling webhook: Post "https://...webhook-service...:443/validate-...": dial tcp ...:443: connect: connection refused ``` **Checks:** 1. **Verify operator pod is running**: ```bash kubectl get pods -n -l app.kubernetes.io/name=dynamo-operator ``` 2. 
**Check webhook server is listening**: ```bash # Port-forward to pod kubectl port-forward -n pod/ 9443:9443 # In another terminal, test connection curl -k https://localhost:9443/validate-nvidia-com-v1alpha1-dynamocomponentdeployment # Should NOT get "connection refused" ``` 3. **Verify webhook port in deployment**: ```bash kubectl get deployment -n -dynamo-operator -o yaml | grep -A5 "containerPort: 9443" ``` 4. **Check for webhook initialization errors**: ```bash kubectl logs -n deployment/-dynamo-operator | grep -i error ``` --- ### Certificate Errors **Symptoms:** ``` Error from server (InternalError): Internal error occurred: failed calling webhook: x509: certificate signed by unknown authority ``` **Checks:** 1. **Verify caBundle is present**: ```bash kubectl get validatingwebhookconfiguration -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d # Should output a valid PEM certificate ``` 2. **Verify certificate secret exists**: ```bash kubectl get secret -n -webhook-server-cert ``` 3. **Check certificate validity**: ```bash kubectl get secret -n -webhook-server-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text # Check: # - Not expired # - SAN includes: ..svc ``` 4. **Check operator logs for CA injection errors**: ```bash kubectl logs -n deployment/-dynamo-operator | grep -i "cert\|ca.*bundle\|inject" ``` --- ### Certificate Controller Errors **Symptoms:** - Operator logs show cert-controller errors - Certificate Secret is not created - CA bundle is not injected into webhook configurations **Checks:** 1. **Check cert-controller logs**: ```bash kubectl logs -n deployment/-dynamo-operator | grep -i "cert-manager\|cert-rotation\|cert-controller" ``` 2. 
**Verify RBAC permissions**: ```bash # The operator needs permissions to manage Secrets, ValidatingWebhookConfigurations, # MutatingWebhookConfigurations, and CustomResourceDefinitions kubectl auth can-i create secrets -n --as=system:serviceaccount::-dynamo-operator kubectl auth can-i patch validatingwebhookconfigurations --as=system:serviceaccount::-dynamo-operator ``` 3. **Check if the certificate Secret was created**: ```bash kubectl get secret -n -webhook-server-cert ``` 4. **Force certificate regeneration**: ```bash # Delete the certificate secret and restart the operator kubectl delete secret -webhook-server-cert -n kubectl rollout restart deployment/-dynamo-operator -n ``` --- ### Validation Errors Not Clear **Symptoms:** - Webhook rejects resource but error message is unclear **Solution:** Check operator logs for detailed validation errors: ```bash kubectl logs -n deployment/-dynamo-operator | grep "validate create\|validate update" ``` Webhook logs include: - Resource name and namespace - Validation errors with context - Warnings for immutable field changes --- ### Stuck Deleting Resources **Symptoms:** - Resource stuck in "Terminating" state - Webhook blocks finalizer removal **Solution:** The webhook automatically skips validation for resources being deleted. If stuck: 1. **Check if webhook is blocking**: ```bash kubectl describe -n # Look for events mentioning webhook errors ``` 2. **Temporarily work around the webhook**: ```bash # Option 1: Set failurePolicy to Ignore kubectl patch validatingwebhookconfiguration \ --type='json' \ -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]' # Option 2 (last resort): Delete ValidatingWebhookConfiguration kubectl delete validatingwebhookconfiguration ``` 3. **Delete resource again**: ```bash kubectl delete -n ``` 4. **Restore webhook configuration**: ```bash helm upgrade dynamo-platform -n ``` --- ## Best Practices ### Production Deployments 1. 
✅ **Use `failurePolicy: Fail`** (default) to ensure validation is enforced
2. ✅ **Monitor webhook latency** - Validation adds ~10-50ms per resource operation
3. ✅ **Automatic certificates work well for production** - The built-in cert-controller handles generation and rotation; use cert-manager only if you need integration with organizational PKI
4. ✅ **Test webhook configuration** in staging before production

### Development Deployments

1. ✅ **Use `failurePolicy: Ignore`** if webhook availability is problematic during development
2. ✅ **Keep automatic certificates** (zero configuration, built into the operator)

### Multi-Tenant Deployments

1. ✅ **Deploy one cluster-wide operator** for platform-wide validation
2. ~~Deploy namespace-restricted operators for tenant-specific namespaces~~ (**DEPRECATED** - use cluster-wide mode instead)

---

## Additional Resources

- [Kubernetes Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/)
- [cert-manager Documentation](https://cert-manager.io/docs/)
- [Kubebuilder Webhook Tutorial](https://book.kubebuilder.io/cronjob-tutorial/webhook-implementation.html)
- [CEL Validation Rules](https://kubernetes.io/docs/reference/using-api/cel/)

---

## Support

For issues or questions:

- Check the [Troubleshooting](#troubleshooting) section
- Review operator logs: `kubectl logs -n deployment/-dynamo-operator`
- Open an issue on GitHub

# Snapshotting GPU Workers

> ⚠️ **Experimental Feature**: Dynamo Snapshot is currently in preview and may only be functional in some cluster setups. The `snapshot-agent` DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.

**Dynamo Snapshot** is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA's `cuda-checkpoint` utility.

The usual flow is:

1. start a worker once and checkpoint its initialized state
2. store that checkpoint on a namespace-local snapshot volume
3. restore later workers from that checkpoint instead of cold-starting again

| Startup Type | Time | What Happens |
|--------------|------|--------------|
| **Cold Start** | ~1 min | Download model, load to GPU, initialize engine |
| **Warm Start** (restore from checkpoint) | ~10 sec | Restore from a ready checkpoint directory |

> ⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.

For more background on the snapshot architecture and startup improvements, see [NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes](https://developer.nvidia.com/blog/nvidia-dynamo-snapshot-fast-startup-for-inference-workloads-on-kubernetes/).

## Prerequisites

- x86_64 (`amd64`) GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
- vLLM or SGLang backend today
- Checkpoint storage. `ReadWriteMany` is the safest default for cross-node or concurrent multi-node access, but `podMount` mode can also use suitable `ReadWriteOnce` storage for sequential checkpoint/restore workflows.
- **CRI-O / OpenShift:** set `runtime.type=crio` on the snapshot chart (and `openshift.enabled=true` on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.

## Quick Start via `DynamoCheckpoint` CR

1. Build and push a placeholder image
2. Enable checkpointing in the platform
3. Install the snapshot chart
4. Create a `DynamoCheckpoint` and wait for it to become ready
5. Deploy a `DynamoGraphDeployment` that restores from the corresponding `checkpointRef`

### 1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling.
If you do not already have one, build it and push it to a registry your cluster can pull from: ```bash export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0 export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0 cd deploy/snapshot make docker-build-placeholder \ PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \ PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" make docker-push-placeholder \ PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" ``` The placeholder image preserves the normal runtime entrypoint/command contract and adds the `criu`, `cuda-checkpoint`, and `nsrestore` tooling needed for checkpoint and restore. To build either snapshot image against a custom CRIU fork or ref, pass `CRIU_REPO` and `CRIU_REF` through `make`. If they are unset, the Dockerfile defaults are used. ```bash make docker-build-agent \ IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \ CRIU_REPO="${YOUR_CRIU_REPO}" \ CRIU_REF="branch-or-sha" make docker-build-placeholder \ PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \ PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \ CRIU_REPO="${YOUR_CRIU_REPO}" \ CRIU_REF="branch-or-sha" ``` ### 2. Enable checkpointing in the platform and verify it Whether you are installing or upgrading `dynamo-platform`, the operator only needs checkpointing enabled: ```yaml dynamo-operator: checkpoint: enabled: true ``` If the platform is already installed, verify that the operator config contains the checkpoint block: ```bash OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \ -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \ -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}') kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \ -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p' ``` Verify that the rendered config includes `enabled: true`. ### 3. 
Install the snapshot chart For the default namespace-local mode, install the snapshot chart in each workload namespace. The chart creates the PVC and the agent in that namespace: ```bash helm upgrade --install snapshot ./deploy/helm/charts/snapshot \ --namespace ${NAMESPACE} \ --create-namespace \ --set storage.pvc.create=true ``` In the default `agentMount` mode, the snapshot-agent DaemonSet mounts the checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on multiple nodes may mount the same PVC, so the PVC generally needs `ReadWriteMany`. The chart defaults to that mode. If your cluster does not have a default storage class, also set `storage.pvc.storageClass`. If you are reusing an existing checkpoint PVC, do not set `storage.pvc.create=true`; install the chart with `storage.pvc.create=false` and set `storage.pvc.name` instead. CRI-O or OpenShift: append for example `--set runtime.type=crio` and, on OpenShift, `--set openshift.enabled=true` (see `deploy/helm/charts/snapshot/README.md`). For clusters that prefer one privileged snapshot agent instead of one DaemonSet per workload namespace, install the chart once in an infrastructure namespace. 
In this mode the chart does not create workload PVCs; the Dynamo operator either creates each namespace-local PVC or verifies that it already exists: ```bash helm upgrade --install snapshot ./deploy/helm/charts/snapshot \ --namespace dynamo-system \ --create-namespace \ --set storage.accessMode=podMount \ --set storage.pvc.create=false \ --set rbac.namespaceRestricted=false ``` To let the operator create the workload PVC in each namespace that uses checkpoint/restore, configure the operator with `create: true`: ```yaml dynamo-operator: checkpoint: enabled: true storage: type: pvc pvc: pvcName: snapshot-pvc basePath: /checkpoints create: true size: 1Ti storageClassName: "" accessMode: ReadWriteMany ``` The chart and operator use separate configuration surfaces here: the snapshot chart PVC name is `storage.pvc.name`, while the operator config field is `checkpoint.storage.pvc.pvcName`. This is a key difference from `agentMount`: `podMount` removes the requirement that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node. Only the active checkpoint/restore workload pod mounts the PVC, and the agent reaches it through that pod's mount namespace. `ReadWriteMany` remains the safest operator-managed default, especially when multiple checkpoint/restore pods may access the same PVC concurrently or when restore scheduling can span nodes. Suitable `ReadWriteOnce` storage classes can still be used for sequential `podMount` checkpoint/restore flows when the backend can attach the volume to the node running the active workload pod. `podMount` depends on the target container remaining alive while the agent resolves `/host/proc//root/`. If the container exits or restarts during checkpoint/restore setup, if the runtime cannot expose a stable host PID, or if node security settings prevent host proc traversal, the agent fails or skips that attempt and Kubernetes/operator reconciliation must try again after a fresh container is available. 
To use an already-present PVC instead, omit `create` or set it to `false`. The operator will fail reconciliation with a clear error if the named PVC does not exist in the workload namespace. Verify that the DaemonSet is ready. After a checkpoint or restore workload is reconciled, verify the workload namespace PVC: ```bash kubectl rollout status daemonset/snapshot-agent -n dynamo-system kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide kubectl get pvc snapshot-pvc -n ${NAMESPACE} ``` ### 4. Create a standalone `DynamoCheckpoint` The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the legacy `spec.identity` metadata, a container named `main`, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only `main` is checkpointed unless `spec.job.targetContainerName` selects another container. ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoCheckpoint metadata: name: qwen3-06b-bf16 spec: identity: model: Qwen/Qwen3-0.6B backendFramework: vllm tensorParallelSize: 1 dtype: bfloat16 maxModelLen: 2048 job: activeDeadlineSeconds: 3600 podTemplateSpec: spec: ... containers: - name: main image: registry.example.com/dynamo/vllm-placeholder:1.0.0 ... ``` GMS + Snapshot support is currently disabled. For a full working example, see [deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml](https://github.com/ai-dynamo/dynamo/blob/main/deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml). Apply it: ```bash kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE} ``` ### 5. 
Wait for the checkpoint to become ready ```bash kubectl get dckpt -n ${NAMESPACE} \ -o custom-columns=NAME:.metadata.name,CHECKPOINT_ID:.status.checkpointID,PHASE:.status.phase kubectl wait \ --for=jsonpath='{.status.phase}'=Ready \ dynamocheckpoint/qwen3-06b-bf16 \ -n ${NAMESPACE} \ --timeout=30m ``` The useful status fields are: - `status.phase`: high-level lifecycle (`Pending`, `Creating`, `Ready`, `Failed`) - `status.checkpointID`: artifact ID used by the snapshot protocol - `status.identityHash`: deprecated compatibility alias for `status.checkpointID` - `status.jobName`: checkpoint Job name - `status.createdAt`: timestamp recorded when the checkpoint became ready - `status.message`: progress or failure detail when available ### 6. Deploy a `DynamoGraphDeployment` that restores from `checkpointRef` Once the checkpoint is `Ready`, restore a worker from it explicitly: ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: vllm-checkpointref-demo spec: services: Frontend: componentType: frontend replicas: 1 extraPodSpec: mainContainer: image: registry.example.com/dynamo/vllm-runtime:1.0.0 VllmDecodeWorker: componentType: worker replicas: 1 checkpoint: enabled: true checkpointRef: qwen3-06b-bf16 extraPodSpec: mainContainer: image: registry.example.com/dynamo/vllm-placeholder:1.0.0 ... ... ``` Apply it: ```bash kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE} kubectl get pods -n ${NAMESPACE} -w ``` The `VllmDecodeWorker` pod should restore from the ready checkpoint instead of creating a new one. ## DGD Auto Flow `checkpointRef` is the most explicit path. If you set it, the DGD uses that existing `DynamoCheckpoint` and does not create a new automatic checkpoint for the component. This is the escape hatch for users who intentionally want to reuse a pre-existing checkpoint and accept the compatibility risk. 
Without `checkpointRef`, `mode: Auto` is the DGD-managed path: for each checkpoint-enabled worker generation, the DGD controller creates a DGD-scoped `DynamoCheckpoint` and the checkpoint controller starts a checkpoint Job. Automatic DGD checkpoints are not reused across DGDs, even when two manifests are identical. The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse; the worker hash keeps a scale down/up on the same worker generation using the same DGD-scoped checkpoint while creating a new checkpoint for a new worker generation. By default, `startupPolicy: Immediate` starts workers cold while the checkpoint job runs in the background. Once the checkpoint becomes `Ready`, only newly-created Pods restore from it. Existing Pods are not mutated or restarted just because the checkpoint became ready. If you want workers to wait for the checkpoint before starting, set `startupPolicy: WaitForCheckpoint`. That policy keeps normal worker replicas at zero until the checkpoint is `Ready`, then starts workers from the checkpoint. ```yaml checkpoint: enabled: true mode: Auto startupPolicy: Immediate # default; optional ``` Inside a `DynamoGraphDeployment`, it looks like this: ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: vllm-auto-demo spec: services: Frontend: componentType: frontend replicas: 1 extraPodSpec: mainContainer: image: registry.example.com/dynamo/vllm-runtime:1.0.0 VllmDecodeWorker: componentType: worker replicas: 1 checkpoint: enabled: true mode: Auto startupPolicy: Immediate extraPodSpec: mainContainer: image: registry.example.com/dynamo/vllm-placeholder:1.0.0 ... ... ``` The legacy `checkpoint.identity` field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone `DynamoCheckpoint` workflows. 
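
The startup-policy rules above can be condensed into a small decision helper. This is only an illustrative sketch of the documented behavior, not operator code: `new_pod_start_mode` and its output labels (`restore`, `cold-start`, `wait`) are hypothetical names. The source does not say how a `Failed` checkpoint is handled, so the sketch assumes any phase other than `Ready` counts as "not ready".

```bash
#!/usr/bin/env bash
# Hypothetical helper: how would a newly created worker Pod start?
#   $1: startupPolicy (Immediate | WaitForCheckpoint), defaults to Immediate
#   $2: DynamoCheckpoint status.phase (Pending | Creating | Ready | Failed)
new_pod_start_mode() {
  local policy="${1:-Immediate}" phase="$2"

  # Reject policies the documentation does not describe.
  case "$policy" in
    Immediate|WaitForCheckpoint) ;;
    *) echo "unknown startupPolicy: $policy" >&2; return 1 ;;
  esac

  # Once the checkpoint is Ready, new Pods restore under either policy.
  if [ "$phase" = "Ready" ]; then
    echo "restore"
    return 0
  fi

  # Assumption: every non-Ready phase (including Failed) means "not ready yet".
  if [ "$policy" = "Immediate" ]; then
    echo "cold-start"   # workers start cold while the checkpoint Job runs
  else
    echo "wait"         # worker replicas stay at zero until Ready
  fi
}
```

Note that the helper only describes Pods created at that moment. Under `Immediate`, Pods that already cold-started keep running as they are after the checkpoint becomes `Ready`.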
Useful inspection commands: ```bash kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \ -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.checkpointID}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}' kubectl get dckpt -n ${NAMESPACE} ``` If you use the default `Immediate` policy and want to create restored pods after the checkpoint becomes ready, scale the worker: ```bash kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \ -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}' ``` ## Failover Restore Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover). ## Lower-Level Testing With `snapshotctl` It is possible to checkpoint and restore pods without the Dynamo operator via the lower-level `snapshotctl` utility. However, the snapshot helm chart must be installed, with a running `snapshot-agent` DaemonSet in the namespace with the checkpoint PVC mounted. `snapshotctl` is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see [deploy/snapshot/cmd/snapshotctl/README.md](../../deploy/snapshot/cmd/snapshotctl/README.md). ### Checkpoint from a worker pod manifest ```bash snapshotctl checkpoint \ --manifest ./worker-pod.yaml \ --container main \ --namespace ${NAMESPACE} ``` The checkpoint manifest must be for a pod and use a placeholder image. `--container` names the workload container to checkpoint. If you do not pass `--checkpoint-id`, `snapshotctl` generates one and prints it: ```text status=completed namespace=... name=... checkpoint_job=... checkpoint_id=manual-snapshot-... checkpoint_location=/checkpoints/... 
``` ### Restore from a worker pod manifest ```bash snapshotctl restore \ --manifest ./worker-pod.yaml \ --namespace ${NAMESPACE} \ --checkpoint-id manual-snapshot-... \ --containers main ``` This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs. ### Restore an existing pod in place ```bash snapshotctl restore \ --pod existing-restore-target \ --namespace ${NAMESPACE} \ --checkpoint-id manual-snapshot-... \ --containers main ``` This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted. ## Checkpoint IDs and Legacy Identity `status.checkpointID` is the artifact ID used by the snapshot protocol and the directory name under checkpoint storage. For DGD-managed automatic checkpoints, this ID is scoped to a single DGD/component worker generation. It is not a compatibility claim across DGDs, and identical manifests are not treated as proof that a checkpoint can be reused safely. The legacy `spec.identity` shape is still required on standalone `DynamoCheckpoint` objects and remains the fallback for explicit/manual workflows. 
When a standalone checkpoint does not already have `status.checkpointID` or the checkpoint-ID label, the operator computes the legacy **16-character SHA256 hash** (64 bits) from these fields:

| Legacy field | Required | Affects legacy hash | Example |
|--------------|----------|---------------------|---------|
| `model` | ✓ | ✓ | `meta-llama/Llama-3-8B` |
| `backendFramework` | ✓ | ✓ | `vllm` |
| `dynamoVersion` | | ✓ | `0.9.0`, `1.0.0` |
| `tensorParallelSize` | | ✓ | `1`, `2`, `4`, `8` |
| `pipelineParallelSize` | | ✓ | `1`, `2` |
| `dtype` | | ✓ | `float16`, `bfloat16`, `fp8` |
| `maxModelLen` | | ✓ | `4096`, `8192` |
| `extraParameters` | | ✓ | custom key-value pairs |

Fields that do **not** change the legacy hash include:

- replica count
- node placement (`nodeSelector`, `affinity`, `tolerations`)
- resource requests/limits
- logging or observability configuration

DGD-managed automatic checkpoints ignore this legacy identity as a reuse boundary. The DGD controller creates its own DGD-scoped checkpoint ID and synthesizes a legacy identity only because the v1alpha1 `DynamoCheckpoint` API still requires the field.

## `DynamoCheckpoint` CRD

The `DynamoCheckpoint` (shortname: `dckpt`) is the operator-managed resource for checkpoint lifecycle. Use it when you want:

- pre-warmed checkpoints before any `DynamoGraphDeployment` exists
- explicit lifecycle control independent from a DGD
- a stable human-readable name that services can reference with `checkpointRef`

The operator requires:

- `spec.identity`
- `spec.job.podTemplateSpec`

`spec.job.backoffLimit` is deprecated and ignored. Checkpoint Jobs are always single-attempt.
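
Because checkpoint Jobs are single-attempt, a script that waits on a `DynamoCheckpoint` may want to stop as soon as the phase is `Failed` instead of waiting for a timeout. The following is a minimal sketch under that assumption. The function names, the 10-second poll interval, and the 30-minute default timeout are illustrative choices, not part of Dynamo. It relies only on the `status.phase` and `status.message` fields and the `dckpt` short name described in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a DynamoCheckpoint phase to a script-level outcome.
classify_checkpoint_phase() {
  case "$1" in
    Ready)               echo "ready" ;;
    Failed)              echo "failed" ;;
    Pending|Creating|"") echo "waiting" ;;  # empty: status not populated yet
    *)                   echo "unknown" ;;
  esac
}

# Hypothetical helper: poll until the checkpoint is Ready, Failed, or timed out.
#   $1: DynamoCheckpoint name  $2: namespace  $3: timeout in seconds (default 1800)
wait_for_checkpoint() {
  local name="$1" ns="$2" timeout="${3:-1800}" waited=0 phase
  while [ "$waited" -lt "$timeout" ]; do
    phase=$(kubectl get dckpt "$name" -n "$ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null || true)
    case "$(classify_checkpoint_phase "$phase")" in
      ready)
        return 0 ;;
      failed)
        # Print the failure detail recorded by the operator, if any.
        kubectl get dckpt "$name" -n "$ns" \
          -o jsonpath='{.status.message}{"\n"}' >&2
        return 1 ;;
    esac
    sleep 10
    waited=$((waited + 10))
  done
  echo "timed out waiting for checkpoint ${name}" >&2
  return 2
}
```

For example, `wait_for_checkpoint qwen3-06b-bf16 "${NAMESPACE}"` returns 0 on `Ready`, 1 on `Failed`, and 2 on timeout.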
Check status with:

```bash
kubectl get dckpt -n ${NAMESPACE}
kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml
```

The `status` block looks like:

```yaml
status:
  phase: Ready
  checkpointID: 3bff874d069f0ed5
  identityHash: 3bff874d069f0ed5  # deprecated compatibility alias
  jobName: checkpoint-job-3bff874d069f0ed5-1
  createdAt: "2026-01-29T10:05:00Z"
  message: ""
```

## Limitations

- **Backend support is limited**: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
- **Worker coverage is narrow**: specialized workers such as multimodal, embedding, and diffusion are not supported.
- **Multi-GPU remains preview**: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
- **GMS restore remains experimental**: GMS + Snapshot is currently disabled.
- **Admission is create-only**: with DGD `startupPolicy: Immediate`, only Pods created after a checkpoint is `Ready` are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
- **Network state is sensitive**: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
- **Privileged DaemonSet required**: `snapshot-agent` must run privileged to execute CRIU and `cuda-checkpoint`. Workload pods do not need to be privileged.

## Troubleshooting

### Checkpoint Job finishes but the checkpoint never becomes `Ready`

A checkpoint only becomes `Ready` after `snapshot-agent` confirms the checkpoint contents. A completed Job is not enough by itself.
```bash kubectl get dckpt -n ${NAMESPACE} \ -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName JOB_NAME=$(kubectl get dckpt -n ${NAMESPACE} -o jsonpath='{.status.jobName}') if [ -n "${JOB_NAME}" ]; then kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE} fi kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers ``` If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start. ### Restore cannot find or mount checkpoint storage For the default `agentMount` install, restore discovers checkpoint storage from the `snapshot-agent` DaemonSet in the workload namespace. That DaemonSet must be ready and must mount the checkpoint PVC. ```bash kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE} kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide kubectl get pvc -n ${NAMESPACE} ``` For a shared-agent `podMount` install, the `snapshot-agent` DaemonSet can run in the infrastructure namespace instead. Verify the shared-agent pods there, then verify that the workload namespace has the checkpoint PVC that the operator created or validated: ```bash kubectl rollout status daemonset/snapshot-agent -n dynamo-system kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide kubectl get pvc snapshot-pvc -n ${NAMESPACE} ``` In `podMount` mode the agent reaches the checkpoint through the workload pod's mount namespace rather than by mounting the PVC itself. Check the workload pod's checkpoint storage annotations and the `snapshot-agent` logs to see the actual resolved checkpoint path. `snapshotctl` uses the chart's storage resolution path, so for lower-level `snapshotctl` debugging make sure the snapshot chart configuration matches the access mode you are testing. 
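
The two install layouts above differ mainly in which namespace holds the `snapshot-agent` DaemonSet. The helper below is a hedged sketch of that lookup. `agent_namespace_for_mode` is a hypothetical name, and it assumes `podMount` is deployed as a shared agent in an infrastructure namespace (`dynamo-system` in this guide's examples), which is one possible layout rather than a requirement.

```bash
#!/usr/bin/env bash
# Hypothetical helper: which namespace should hold the snapshot-agent DaemonSet?
#   $1: storage access mode (agentMount | podMount)
#   $2: workload namespace
#   $3: infrastructure namespace for a shared agent (default: dynamo-system)
agent_namespace_for_mode() {
  local mode="$1" workload_ns="$2" infra_ns="${3:-dynamo-system}"
  case "$mode" in
    agentMount) echo "$workload_ns" ;;  # the agent mounts the PVC in the workload namespace
    podMount)   echo "$infra_ns" ;;     # assumption: shared-agent install
    *) echo "unknown storage access mode: $mode" >&2; return 1 ;;
  esac
}
```

A check could then run `kubectl rollout status daemonset/snapshot-agent -n "$(agent_namespace_for_mode podMount "${NAMESPACE}")"`. In both modes the checkpoint PVC itself is still verified in the workload namespace.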
### `snapshotctl` manifest is rejected or the restore target is wrong `snapshotctl` requires a `Pod` manifest and a target-container list. Multi-container manifests are supported as long as every name passed via `--container` or `--containers` exists in the pod spec. ```bash snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE} snapshotctl restore --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id ``` If the manifest already carries snapshot target metadata, it must agree with the CLI flag; `snapshotctl` rejects mismatches instead of silently picking one. ## Planned Features - Stable multi-GPU and multinode support - TensorRT-LLM support ## Related Documentation - [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) - [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover) - [API Reference](/dynamo/additional-resources/api-reference-k-8-s) # Shadow Engine Failover > ⚠️ **Experimental Feature**: Shadow Engine Failover is an opt-in preview > feature. It depends on GPU Memory Service (GMS), Dynamic Resource Allocation > (DRA), and backend-specific support. Its API shape and behavior may change, > and the failover state machine is still settling. Use it only for > non-production evaluation unless you have validated the exact backend, > topology, and failure mode in your cluster. ## Overview Use Shadow Engine Failover when you want a standby engine to take over after an unknown backend engine or software-process failure while the GPU and node remain healthy. The goal is to avoid paying a full model weight reload after a same-node process failure. Shadow Engine Failover is the Kubernetes workflow. GPU Memory Service is the enabling mechanism underneath it: GMS owns the GPU-resident model weights, and the active and standby engines attach to those weights through DRA. 
This is separate from [Dynamo Snapshot](/dynamo/kubernetes-deployment/advanced-platform/snapshot). Snapshot captures and restores a process image with CRIU and `cuda-checkpoint`. Shadow Engine Failover keeps model weights resident in GPU memory so a standby or replacement engine can attach after selected process-level failures. They both target recovery latency, but they solve different problems and are not interchangeable.

## Failure Recovery Flow

The following diagram illustrates same-node process-level recovery:

```text
┌──────────────────────── Same healthy node + GPU ───────────────────────┐
│                                                                        │
│  Before failure                                                        │
│  ┌──────────────┐      attach/use      ┌───────────────────────────┐   │
│  │ Engine A     │ ───────────────────▶ │ GMS-owned model weights   │   │
│  │ active       │                      │ resident in GPU memory    │   │
│  └──────┬───────┘                      └────────────┬──────────────┘   │
│         │                                           ▲                  │
│         │                                           │ attach/use       │
│         │ unknown software/engine failure           │                  │
│         ▼                                           │                  │
│  ┌──────────────┐                            ┌──────┴───────┐          │
│  │ Engine A     │ exits                      │ Engine B     │          │
│  └──────────────┘                            │ shadow       │          │
│                                              └──────┬───────┘          │
│                                                     │ takeover         │
│                                                     ▼                  │
│                                              ┌──────────────┐          │
│                                              │ Engine B     │          │
│                                              │ active       │          │
│                                              └──────────────┘          │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
```

**How it works:**

1. The operator creates active and standby engine containers or pods for the worker, depending on the selected failover mode.
2. The engines share GPU access through DRA and attach to model weights owned by GMS.
3. An unknown software or engine failure terminates the active engine, while the GMS process, GPU, and node remain healthy.
4. The standby or replacement engine takes over and attaches to the resident GMS-owned weights instead of performing a full weight reload.
5. In-flight requests and KV cache state are not preserved. If the GPU, node, or GMS process is lost, the replacement worker must use the normal rescheduling and model-load path.
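
The boundary described in the steps above can be summarized as a lookup from what failed to which recovery path applies. This is only an illustrative sketch: `recovery_path`, the failure-domain labels, and the output strings are hypothetical and do not correspond to any operator field or API.

```bash
#!/usr/bin/env bash
# Hypothetical helper: which recovery path applies for a given failure domain?
#   $1: what failed (engine-process | gms-process | gpu | node)
recovery_path() {
  case "$1" in
    engine-process)
      # GMS, GPU, and node are still healthy: the standby engine attaches
      # to the resident weights.
      echo "standby-takeover" ;;
    gms-process|gpu|node)
      # The resident weights are gone: normal rescheduling and model load.
      echo "reschedule-and-reload" ;;
    *)
      echo "unknown failure domain: $1" >&2
      return 1 ;;
  esac
}
```

On either path, in-flight requests and KV cache state are lost, so clients should expect to retry.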
## When to Use It Today - Use it to evaluate same-node recovery from unknown vLLM engine or software-process failures. - Use it when the cost you are trying to avoid is loading another independent copy of model weights into GPU memory. - Use the GMS-only examples to validate backend weight loading through GMS, not as a complete failover workflow. - Do not use it for hardware failure, GPU loss, node loss, cross-node recovery, in-flight request recovery, or KV-cache recovery. - Do not combine it with Snapshot restore. Snapshot plus GMS is not yet available. ## GPU Memory Service GMS moves ownership of GPU-resident model weights out of the engine process and into a separate GPU memory service. In the failover workflow, this lets the active and standby engines share the same weight memory boundary instead of loading independent copies. Direct GMS enablement is useful for backend integration testing and sleep/wake-style lifecycle experiments. By itself, it does not configure active/passive failover; use the `failover` field for the shadow engine flow. ## Prerequisites - Kubernetes 1.34 or newer with DRA v1 (`resource.k8s.io/v1`) enabled. - NVIDIA GPU DRA driver installed. - A matching DRA `DeviceClass`, defaulting to `gpu.nvidia.com`. - A supported backend image. The current failover examples are vLLM-focused. - Backend command-line support for GMS loading, such as `--load-format gms`. - Enough GPU memory for the GMS processes and active or standby engines sharing the device. ## Limitations - It is not a general checkpoint/restore system. - It is not a hardware fault tolerance mechanism for GPU, node, or rack loss. - It does not diagnose or fix the backend failure. - It does not preserve in-flight requests, network sockets, or KV cache state. - It does not make Snapshot restore supported for GPU memory workloads. - Snapshot plus GMS is temporarily blocked by admission because of known GPU driver restore issues. 
- It is not covered by the normal v1beta1 compatibility guarantees while it lives under `experimental`. ## API Placement For `v1alpha1` `DynamoGraphDeployment`, GMS and failover are service-level fields: ```yaml gpuMemoryService: enabled: true failover: enabled: true ``` For `v1beta1`, preview fields are grouped under `experimental` to make the stability contract explicit: ```yaml experimental: gpuMemoryService: mode: IntraPod failover: mode: IntraPod ``` See the [API reference](/dynamo/additional-resources/api-reference-k-8-s) for the exact schema supported by your CRD version. ## Basic Shadow Engine Failover Example Failover builds on GMS. In intra-pod mode, the operator clones the worker's main container into active and standby engine containers that share GPUs through DRA and the GMS sidecar. The standby engine takes over when the active engine fails. ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: vllm-agg-failover annotations: nvidia.com/dynamo-kube-discovery-mode: container spec: services: VllmWorker: componentType: worker replicas: 1 resources: limits: gpu: "2" gpuMemoryService: enabled: true failover: enabled: true extraPodSpec: mainContainer: image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag args: - --model - Qwen/Qwen3-0.6B - --tensor-parallel-size - "2" - --load-format - gms ``` See the [vLLM failover example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_failover.yaml) for the full manifest. ## Basic GMS Example The worker must request GPUs through the normal Dynamo service resources, enable `gpuMemoryService`, and run a backend command that can load from GMS. 
```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: vllm-agg-gms spec: services: VllmWorker: componentType: worker replicas: 1 resources: limits: gpu: "1" gpuMemoryService: enabled: true extraPodSpec: mainContainer: image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag workingDir: /workspace/examples/backends/vllm command: - python3 - -m - dynamo.vllm args: - --model - Qwen/Qwen3-0.6B - --load-format - gms ``` Working GMS-only examples: - [vLLM GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_gms.yaml) - [SGLang GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/sglang/deploy/agg_gms.yaml) ## Related Documentation - [Snapshot](/dynamo/kubernetes-deployment/advanced-platform/snapshot) - [API Reference](/dynamo/additional-resources/api-reference-k-8-s) - [vLLM failover example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_failover.yaml) - [vLLM GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_gms.yaml) - [SGLang GMS example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/sglang/deploy/agg_gms.yaml) # Developing the Operator with Tilt ## Overview [Tilt](https://tilt.dev) provides a live-reload development environment for the Dynamo Kubernetes operator. Instead of manually building images, pushing to a registry, and redeploying on every change, Tilt watches your source files and automatically recompiles the Go binary, syncs it into the running container, and restarts the process — all in seconds. Under the hood, the Tiltfile: 1. **Compiles** the Go manager binary locally (`CGO_ENABLED=0`). 2. **Builds** a minimal Docker image containing only the binary. 3. **Renders** the production Helm chart (`deploy/helm/charts/platform`) with `helm template`, applies CRDs via `kubectl`, and deploys all rendered resources. 4. 
**Live-updates** the binary inside the running container on every code change, with no full image rebuild required. This gives you a fully working cluster where you can apply `DynamoGraphDeployment` and `DynamoGraphDeploymentRequest` resources and have them reconcile into real workloads, while iterating on controller logic with feedback in seconds. ## Prerequisites | Tool | Version | Purpose | |------|---------|---------| | [Tilt](https://docs.tilt.dev/install.html) | v0.33+ | Development orchestration | | [Helm](https://helm.sh/docs/intro/install/) | v3 | Chart rendering | | [Go](https://go.dev/dl/) | 1.25+ | Compiling the operator | | [kubectl](https://kubernetes.io/docs/tasks/tools/) | - | Cluster access | | A Kubernetes cluster | - | kind, minikube, or remote cluster | You also need a **container registry** that is accessible to your cluster's nodes, so they can pull the operator image Tilt builds. If you use a local cluster like kind with a local registry, Tilt can push there directly. ## Quick Start ```bash cd deploy/operator ``` Create your personal settings file, `tilt-settings.yaml` (gitignored), in this directory, then run `tilt up`. If no registry is configured, the image is only available locally. This works with kind using a local registry but will fail on remote clusters. ## How It Works When you run `tilt up`, the following resources are created in order: ``` manager-build Compile Go binary locally │ ├───── crds Apply CRDs via server-side apply │ operator Deploy operator pod (live-updated) ``` The operator handles webhook certificate generation, CA bundle injection, and MPI SSH key provisioning at runtime, so no external setup is needed. ### What Each Resource Does **manager-build**: Runs `CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build` to compile the operator binary. Re-runs on changes to `api/`, `cmd/`, `internal/`, `go.mod`, or `go.sum`. **crds**: Applies CRDs from the Helm chart via `kubectl apply --server-side`.
When `skip_codegen` is `false`, runs `make generate && make manifests` first. **operator** — The operator Deployment itself. Tilt watches the compiled binary and uses `live_update` to sync it into the running container and restart the process — no image rebuild needed. On startup, the operator's built-in cert controller generates a self-signed TLS certificate, injects the CA bundle into webhook configurations, and creates the MPI SSH secret — matching production behavior exactly. ### Live Update Cycle The inner development loop looks like this: 1. You edit Go source files under `deploy/operator/`. 2. Tilt detects the change and recompiles the binary (~2-5 seconds). 3. The new binary is synced into the running container via `live_update`. 4. The process restarts automatically. 5. Your controller changes are live — test by applying a DGD/DGDR. No `docker build`, no `docker push`, no `kubectl rollout restart`. ## Webhook Certificates The operator handles webhook TLS certificates automatically at runtime using a built-in cert controller (based on OPA cert-controller). On startup it: 1. Creates a self-signed CA and webhook serving certificate. 2. Stores them in the `webhook-server-cert` Secret. 3. Injects the CA bundle into `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration` resources. This matches production behavior and requires no external tooling. For alternative certificate management (cert-manager or external certs), see the [webhook documentation](/dynamo/kubernetes-deployment/advanced-platform/webhooks) and configure via `helm_values` in `tilt-settings.yaml`. 
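One way to check the CA bundle injection step from outside the operator is to look for webhooks whose `caBundle` is still empty. The sketch below is a hypothetical helper, not part of the operator: the function name and the sample webhook names are made up for illustration. It assumes the input is the parsed JSON of a `ValidatingWebhookConfiguration` or `MutatingWebhookConfiguration`, for example the output of `kubectl get ... -o json`.

```python
from typing import Any


def webhooks_missing_ca_bundle(config: dict[str, Any]) -> list[str]:
    """Return the names of webhooks whose clientConfig.caBundle is empty.

    `config` is the parsed JSON of a single webhook configuration object.
    """
    missing = []
    for webhook in config.get("webhooks", []):
        ca_bundle = webhook.get("clientConfig", {}).get("caBundle", "")
        if not ca_bundle:
            missing.append(webhook.get("name", "<unnamed>"))
    return missing


# Illustrative sample data; the webhook names are invented for this sketch.
sample = {
    "kind": "ValidatingWebhookConfiguration",
    "webhooks": [
        {"name": "ready.example.com", "clientConfig": {"caBundle": "LS0tLS1CRUdJTg=="}},
        {"name": "pending.example.com", "clientConfig": {}},
    ],
}

print(webhooks_missing_ca_bundle(sample))
```

An empty result suggests the injection has completed for that configuration; any names returned are webhooks that would still fail TLS verification.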
## Typical Workflows ### Iterating on Controller Logic The most common workflow — you're modifying reconciliation logic and want fast feedback: ```yaml # tilt-settings.yaml allowed_contexts: [my-cluster] registry: docker.io/myuser skip_codegen: true ``` ```bash tilt up # Edit files under internal/controller/ # Tilt auto-recompiles and live-updates # Apply test resources: kubectl apply -f examples/backends/vllm/deploy/agg.yaml ``` ### Changing API Types (CRDs) When you modify files under `api/`, you need codegen to run: ```yaml # tilt-settings.yaml skip_codegen: false # or omit — false is the default ``` Tilt will run `make generate && make manifests` and re-apply CRDs whenever `api/` files change. ### Testing Multi-Node Features Enable the necessary subcharts: ```yaml # tilt-settings.yaml enable_grove: true enable_kai_scheduler: true ``` ### Using Environment Variables You can override the registry without editing the settings file: ```bash REGISTRY=ghcr.io/myorg tilt up ``` ## Tilt UI The web UI at [http://localhost:10350](http://localhost:10350) shows: - **Resource status** — green/red/pending for each resource - **Build logs** — compilation output and errors - **Runtime logs** — operator logs streamed in real time - **Port forwards** — the health endpoint is forwarded to `localhost:8081` Resources are grouped by label (`operator` and `infrastructure`) to keep the UI organized. ## Cleanup ```bash # Stop Tilt and leave resources deployed # (Ctrl-C in the terminal) # Stop Tilt and tear down all resources tilt down ``` ## Troubleshooting ### Image Pull Errors If pods show `ImagePullBackOff`: - Verify `registry` is set in `tilt-settings.yaml` or via `REGISTRY` env var. - Ensure your cluster nodes can pull from that registry. - For kind with a local registry, follow the [kind local registry guide](https://kind.sigs.k8s.io/docs/user/local-registry/). 
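The registry lookup behind these checks can be pictured as a simple precedence rule. The sketch below is not the Tiltfile's actual code. It assumes, based on the `REGISTRY=... tilt up` override shown earlier, that the environment variable wins over `registry` in `tilt-settings.yaml`, and that no value at all means the image stays local. `resolve_registry` is a hypothetical name.

```python
from typing import Mapping, Optional


def resolve_registry(settings: Mapping[str, object], env: Mapping[str, str]) -> Optional[str]:
    """Pick the registry: REGISTRY env var first, then the settings file, else None."""
    from_env = env.get("REGISTRY", "").strip()
    if from_env:
        return from_env
    from_settings = str(settings.get("registry") or "").strip()
    # None means no registry is configured, so the image is only available locally.
    return from_settings or None


settings = {"allowed_contexts": ["my-cluster"], "registry": "docker.io/myuser"}
print(resolve_registry(settings, {"REGISTRY": "ghcr.io/myorg"}))
print(resolve_registry(settings, {}))
print(resolve_registry({}, {}))
```

If the last case applies and the cluster is remote, `ImagePullBackOff` is the expected symptom.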
### Webhook TLS Errors If applying a DGD/DGDR fails with `x509: certificate signed by unknown authority`: - Check the operator logs in the Tilt UI; the cert controller logs its progress on startup. - Verify the `webhook-server-cert` Secret exists and has been populated: ```bash kubectl -n dynamo-system get secret webhook-server-cert ``` - The operator may need a few seconds after startup to generate certs and inject the CA bundle. Wait for the `cert-controller` log messages before applying resources. ### CRD Codegen Failures If `crds` fails with codegen errors: - Ensure `controller-gen` is installed: `make controller-gen` - Try running codegen manually: `make generate && make manifests` - Set `skip_codegen: true` temporarily to bypass codegen if you haven't changed API types. ### Context Safety Guard If Tilt refuses to start with a context error, add your cluster context to `allowed_contexts` in `tilt-settings.yaml`: ```yaml allowed_contexts: - my-cluster-context ``` # Amazon Elastic Kubernetes Service (EKS) # Steps to create an EKS cluster This guide shows how to run the Dynamo platform on Amazon Elastic Kubernetes Service (EKS). ## Setup environment variables The following environment variables are used throughout this guide.
To use a different region, change the `AWS_REGION` variable. ```bash export AWS_REGION="us-east-1" export CLUSTER_NAME="ai-dynamo" export DYNAMO_NAMESPACE="dynamo-system" export DYNAMO_RELEASE_VERSION="1.0.0" ``` ## Install CLIs ### Install AWS CLI ([AWS CLI installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)) ```bash sudo apt install unzip curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install ``` ### Install Kubernetes CLI ([kubectl installation guide for EKS](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)) ```bash curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.35.2/2026-02-27/bin/linux/amd64/kubectl chmod +x ./kubectl mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc ``` ### Install Eksctl CLI ([eksctl installation guide](https://eksctl.io/installation/)) ```bash ARCH=amd64 PLATFORM=$(uname -s)_$ARCH curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz" curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz sudo mv /tmp/eksctl /usr/local/bin ``` ### Install Helm CLI ([Helm setup for EKS](https://docs.aws.amazon.com/eks/latest/userguide/helm.html)) ```bash curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh chmod 700 get_helm.sh ./get_helm.sh ``` ## Create an EKS Auto Mode cluster Create an EKS Auto Mode cluster with eksctl, using the `templates/eksctl.yaml` template. The cluster is created with the Amazon EFS CSI Driver installed as an add-on. Amazon EFS is used later to store the model weights and compilation artifacts that Dynamo uses.
```bash # Use all availability zones in the region except use1-az3, where the EKS control plane is not available export EKS_CP_AZS=$(aws ec2 describe-availability-zones \ --region ${AWS_REGION} \ --filters "Name=opt-in-status,Values=opt-in-not-required" \ --query "AvailabilityZones[?ZoneId!='use1-az3'].[ZoneName]" \ --output text | sed 's/ /, /g; s/^/ - /') eksctl create cluster -f <(envsubst < templates/eksctl.yaml) ``` *Note: eksctl configures the kubeconfig context for you automatically. If it does not, run: `aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME`* ### Create an EKS Auto Mode GPU NodePool Create a GPU NodePool that targets the **g5,g6,g6e,g7e,p5,p5e,p5en** instance families. ```bash kubectl apply -f automode-np-gpu.yaml ``` ## Create a default StorageClass Create a default StorageClass that uses the storage capability of EKS Auto Mode. This makes EBS volumes the default for the stateful workloads required by NATS, which Dynamo uses. ```bash kubectl apply -f - << EOF apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: auto-ebs-sc annotations: storageclass.kubernetes.io/is-default-class: "true" allowedTopologies: - matchLabelExpressions: - key: eks.amazonaws.com/compute-type values: - auto provisioner: ebs.csi.eks.amazonaws.com volumeBindingMode: WaitForFirstConsumer parameters: type: gp3 encrypted: "true" EOF ``` ## Create an Amazon EFS shared file system Follow the [EFS setup guide](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/efs) to create an EFS file system and make it available as shared storage for Dynamo workloads.
## Install Dynamo Kubernetes Platform ### Install Dynamo Platform ```bash helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-"$DYNAMO_RELEASE_VERSION".tgz helm install dynamo-platform dynamo-platform-"$DYNAMO_RELEASE_VERSION".tgz \ --namespace "$DYNAMO_NAMESPACE" \ --create-namespace ``` ### Setup HuggingFace TOKEN ```bash export HF_TOKEN= kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} \ -n ${DYNAMO_NAMESPACE} ``` ### Verify installation Validate that the Dynamo platform pods are running. The output should look similar to the following: ```bash kubectl get pods -n ${DYNAMO_NAMESPACE} NAME READY STATUS RESTARTS AGE dynamo-platform-dynamo-operator-controller-manager-ff54b5dstgcq 1/1 Running 0 106s dynamo-platform-nats-0 2/2 Running 0 106s ``` Validate that the Dynamo CRDs were installed: ```bash kubectl get crds | grep dynamo dynamocheckpoints.nvidia.com 2026-03-17T13:18:05Z dynamocomponentdeployments.nvidia.com 2026-03-17T13:18:06Z dynamographdeploymentrequests.nvidia.com 2026-03-17T13:18:08Z dynamographdeployments.nvidia.com 2026-03-17T13:18:09Z dynamographdeploymentscalingadapters.nvidia.com 2026-03-17T13:18:10Z dynamomodels.nvidia.com 2026-03-17T13:18:10Z dynamoworkermetadatas.nvidia.com 2026-03-17T13:18:11Z ``` ## Deploy a Dynamo DynamoGraphDeployment (DGD) | Manifest | Description | |----------|-------------| | `manifests/vllm/disagg.yaml` | Disaggregated prefill/decode DGD using NIXL with LIBFABRIC backend over EFA. Targets `g7e.12xlarge` instances with GPUDirect RDMA support for high-throughput KV-cache transfer between prefill and decode workers. | | `manifests/vllm/disagg-p5.yaml` | Disaggregated prefill/decode DGD using NIXL with LIBFABRIC backend over EFA. Targets `p5.48xlarge` reserved instances with 8 EFA devices (4 EFA devices per GPU on p5.48xlarge) and TP-2 for Qwen3-32B. Uses 2 decode and 6 prefill replicas on reserved capacity (`karpenter.sh/capacity-type: reserved`).
| | `manifests/vllm/disagg-tcp.yaml` | Alternative disaggregated prefill/decode inference graph using TCP instead of EFA. Targets `g6e.2xlarge` instances, suitable for instance types without EFA support. | | `manifests/vllm/agg.yaml` | Aggregated (single-worker) inference graph where a single vLLM worker handles both prefill and decode phases. Simpler deployment without KV-cache transfer overhead. | ### Cache Models on EFS Before deploying an inference graph, download the model weights onto the shared EFS file system. Each Dynamo recipe includes a `model-cache/model-download.yaml` Job manifest that downloads the model from HuggingFace. Copy the recipe's download manifest into the local kustomize directory and apply it: ```bash # Example: cache the Qwen3-32B model which we will be using later cp ../../../recipes/qwen3-32b/model-cache/model-download.yaml manifests/model-download/model-download.yaml kubectl kustomize manifests/model-download | kubectl -n ${DYNAMO_NAMESPACE} apply -f - rm -f manifests/model-download/model-download.yaml ``` The recipe manifests don't set any memory resources on the download container. Without a memory request, the Job pod can get OOMKilled during download — especially for large models. The `kustomization.yaml` in `manifests/model-download/` patches in a memory request to prevent this. By default it adds `4Gi`. For larger models (e.g. 
DeepSeek-R1, Nemotron-3-Super-120B) increase this value in `manifests/model-download/kustomization.yaml` before applying: ```yaml patches: - target: kind: Job name: model-download patch: | apiVersion: batch/v1 kind: Job metadata: name: model-download spec: template: spec: containers: - name: model-download resources: requests: memory: "16Gi" # increase for larger models ``` Then apply: ```bash kubectl kustomize manifests/model-download | kubectl -n ${DYNAMO_NAMESPACE} apply -f - ``` Monitor the download Job: ```bash kubectl -n ${DYNAMO_NAMESPACE} get jobs model-download kubectl -n ${DYNAMO_NAMESPACE} logs -f job/model-download ``` To re-run a download (e.g. after changing the model or fixing an OOM), delete the previous Job first: ```bash kubectl -n ${DYNAMO_NAMESPACE} delete job model-download ``` Then copy the new recipe's manifest and apply again. ### Disaggregated Serving This example deploys a disaggregated prefill/decode Dynamo Inference Graph that uses NIXL with the LIBFABRIC backend using [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) for high-throughput KV-cache transfer between workers. It targets `g7e.12xlarge` instances, which support GPUDirect RDMA, and uses the Dynamo EFA-enabled vLLM container `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0-efa-amd64` that ships with the [EFA Installer](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-changelog.html) pre-installed. *Note: For a full list of EFA-supported instance types, see [the AWS EC2 Docs](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).* ```yaml nodeSelector: node.kubernetes.io/instance-type: g7e.12xlarge ``` KV-cache transfer between workers uses [NIXL](https://github.com/ai-dynamo/nixl) with the LIBFABRIC backend. 
Enable it by passing the following argument to vLLM: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config": {"backends": ["LIBFABRIC"]}}'` *Note: On instance types without EFA support, NIXL's libfabric backend falls back to TCP automatically. However, vLLM's `NixlConnector` defaults to `cuda` as the buffer device, so you must add `"kv_buffer_device":"cpu"` to the `kv-transfer-config` argument for disaggregated serving to work without EFA.* Request an EFA device for each worker pod using the `vpc.amazonaws.com/efa` extended resource: ```yaml resources: requests: gpu: "1" custom: vpc.amazonaws.com/efa: "1" limits: gpu: "1" custom: vpc.amazonaws.com/efa: "1" ``` *Note: EKS Auto Mode includes the EFA device plugin making `vpc.amazonaws.com/efa` extended resource available.* All workers (prefill and decode) must be co-located in the same availability zone, since EFA traffic does not cross AZ boundaries. Use a pod affinity rule to enforce this: ```yaml affinity: podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: "topology.kubernetes.io/zone" labelSelector: matchLabels: nvidia.com/dynamo-graph-deployment-name: "vllm-disagg" ``` ```bash kubectl -n ${DYNAMO_NAMESPACE} apply -f manifests/vllm/disagg.yaml ``` *Note: `manifests/vllm/disagg-tcp.yaml` provides an alternative example that uses TCP instead of EFA, targeting `g6e.2xlarge` instances.* Verify that all pods reach `Running` status: ```bash kubectl -n ${DYNAMO_NAMESPACE} get pods NAME READY STATUS RESTARTS AGE dynamo-platform-dynamo-operator-controller-manager-ff54b5dstgcq 1/1 Running 0 39m dynamo-platform-nats-0 2/2 Running 0 39m vllm-disagg-frontend-85f8476887-wwtwk 1/1 Running 0 2m13s vllm-disagg-vllmdecodeworker-510a1741-7666987b-tp58w 1/1 Running 0 2m13s vllm-disagg-vllmprefillworker-510a1741-54f76d7954-tjgn8 1/1 Running 0 2m13s ``` ```bash kubectl -n ${DYNAMO_NAMESPACE} port-forward svc/vllm-disagg-frontend 8000:8000 curl 
localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-32B", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden." 
} ], "stream": false, "max_tokens": 30 }' ``` You should see output similar to the following: ```bash {"id":"chatcmpl-23a7c94b-99cb-42ca-ae56-2397aa5a560f","choices":[{"index":0,"message":{"content":"\nOkay, so I need to develop a character background for someone who's an intrepid explorer in Eldoria, specifically focusing on their motivations,","role":"assistant","reasoning_content":null},"finish_reason":"length"}],"created":1773336002,"model":"Qwen/Qwen3-0.6B","object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":30,"total_tokens":226,"prompt_tokens_details":{"audio_tokens":null,"cached_tokens":192}},"nvext":{"worker_id":{"prefill_worker_id":4265733549773195,"prefill_dp_rank":0,"decode_worker_id":7535192362430132,"decode_dp_rank":0},"timing":{"request_received_ms":1773336002136,"prefill_wait_time_ms":0.852483,"prefill_time_ms":12.90597,"ttft_ms":13.758453000000001,"total_time_ms":110.89621500000001,"kv_hit_rate":0.0}}} ``` *Note: The first request to each worker incurs increased latency because of the NIXL backend handshake and initialization overhead. This cost applies only to the very first transfer.* Watch the logs: ```bash kubectl logs -n ${DYNAMO_NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=vllm-disagg --all-containers=true --max-log-requests=20 --prefix=true --timestamps -f ``` Clean up: ```bash kubectl -n ${DYNAMO_NAMESPACE} delete -f manifests/vllm/disagg.yaml ``` ### Aggregated Serving ```bash kubectl -n ${DYNAMO_NAMESPACE} apply -f manifests/vllm/agg.yaml ``` Verify that all pods reach `Running` status, as in the output below:
```bash kubectl -n ${DYNAMO_NAMESPACE} get pods NAME READY STATUS RESTARTS AGE dynamo-platform-dynamo-operator-controller-manager-ff54b5dstgcq 1/1 Running 0 12m dynamo-platform-nats-0 2/2 Running 0 12m vllm-agg-frontend-ff8457bcf-tq9jh 1/1 Running 0 4m46s vllm-agg-vllmdecodeworker-d0a70291-759df94478-8lc74 1/1 Running 0 4m46s ``` Watch logs ```bash kubectl logs -n ${DYNAMO_NAMESPACE} -l nvidia.com/dynamo-graph-deployment-name=vllm-agg --all-containers=true --max-log-requests=20 --prefix=true --timestamps -f ``` ```bash kubectl -n ${DYNAMO_NAMESPACE} port-forward svc/vllm-agg-frontend 8000:8000 curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden." } ], "stream": false, "max_tokens": 30 }' ``` You should see output similar to below ```bash {"id":"chatcmpl-093fac0e-f75e-43b5-90dc-96c8c77a2e7c","choices":[{"index":0,"message":{"content":"\nOkay, I need to develop a character background for the explorer in Eldoria. Let me start by understanding the user's query. 
They mentioned","role":"assistant","reasoning_content":null},"finish_reason":"length"}],"created":1773443560,"model":"Qwen/Qwen3-0.6B","object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":30,"total_tokens":226},"nvext":{"timing":{"request_received_ms":1773443560878,"total_time_ms":99.89782}}} ``` Clean up: ```bash kubectl -n ${DYNAMO_NAMESPACE} delete -f manifests/vllm/agg.yaml ``` ## Using On-Demand Capacity Reservations (ODCR) and Capacity Blocks (CBs) for ML GPU instances can be difficult to acquire on-demand. AWS provides two reservation mechanisms to guarantee capacity for ML workloads: - [On-Demand Capacity Reservations (ODCRs)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) reserve capacity in a specific AZ for any duration. You pay for the reserved capacity whether or not you use it. - [Capacity Blocks for ML](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html) reserve GPU instances for a fixed time window (hours to days). Instances are placed in EC2 UltraClusters for low-latency networking. Capacity Blocks have a defined end time, and EC2 will terminate instances before the block expires. EKS Auto Mode uses Karpenter under the hood, which models reserved capacity as `karpenter.sh/capacity-type: reserved` and prioritizes it over on-demand and spot. By default, EKS Auto Mode can launch into open ODCRs automatically, but does not prioritize them. Capacity Blocks are never used automatically. Both require explicit `capacityReservationSelectorTerms` configuration on a NodeClass to be prioritized and labeled as `reserved`. ### Create a NodeClass with Capacity Reservation Create a NodeClass that references your ODCR or Capacity Block reservation. You can select by reservation ID or by tags.
First, extract the subnet, security group, and role configuration from the `default` NodeClass that EKS Auto Mode already created: ```bash export NC_SUBNETS=$(kubectl get nodeclass default -o json | jq -c '.spec.subnetSelectorTerms') export NC_SG=$(kubectl get nodeclass default -o json | jq -c '.spec.securityGroupSelectorTerms') export NC_ROLE=$(kubectl get nodeclass default -o json | jq -r '.spec.role') ``` Set `CR_ID` to your actual reservation ID from the EC2 console. ```bash export CR_ID= kubectl apply -f - << EOF apiVersion: eks.amazonaws.com/v1 kind: NodeClass metadata: name: gpu-reserved spec: role: ${NC_ROLE} subnetSelectorTerms: ${NC_SUBNETS} securityGroupSelectorTerms: ${NC_SG} capacityReservationSelectorTerms: # Select by reservation ID (ODCR or Capacity Block) - id: "${CR_ID}" # Or select by tags (can be combined) # - tags: # team: "dynamo" EOF ``` Wait until the `state` of the capacity reservation is `active`: ```bash kubectl get nodeclass gpu-reserved -o json | jq '.status.capacityReservations' [ { "availabilityZone": "us-east-2c", "endTime": "2026-03-18T11:30:00Z", "id": "cr-xxxxxxxxxxxxxx", "instanceMatchCriteria": "targeted", "instanceType": "p5.48xlarge", "ownerID": "xxxxxxxxxxx", "reservationType": "capacity-block", "state": "active" } ] ``` ### Create a NodePool for Reserved Capacity Create a NodePool that references the `gpu-reserved` NodeClass and uses the `reserved` capacity type. You can optionally include `on-demand` and `spot` as a fallback when the reservation is exhausted.
```bash kubectl apply -f - << EOF apiVersion: karpenter.sh/v1 kind: NodePool metadata: name: gpu-reserved spec: disruption: budgets: - nodes: 10% consolidateAfter: 300s consolidationPolicy: WhenEmptyOrUnderutilized template: spec: nodeClassRef: group: eks.amazonaws.com kind: NodeClass name: gpu-reserved requirements: - key: karpenter.sh/capacity-type operator: In values: - reserved # Uncomment to fallback to on-demand or spot when reservation is exhausted # - on-demand # - spot - key: eks.amazonaws.com/instance-family operator: In values: - g6e - g7e - p5 - p5e - p5en taints: - effect: NoSchedule key: nvidia.com/gpu value: Exists EOF ``` Validate that the `gpu-reserved` NodePool is ready ```bash kubectl get nodepool gpu-reserved NAME NODECLASS NODES READY AGE gpu-reserved gpu-reserved 0 True 8s ``` When configuring `capacityReservationSelectorTerms` on any NodeClass in the cluster, EKS Auto Mode will stop automatically using open ODCRs for all NodeClasses. Make sure all NodeClasses that should use ODCRs have explicit selector terms configured. ### Targeting Reserved Nodes from Workloads Pods are scheduled onto reserved nodes through the existing NodePool requirements and taints. If you want to ensure a workload only runs on reserved capacity, add a node selector: ```yaml nodeSelector: karpenter.sh/capacity-type: reserved tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule ``` ### Capacity Blocks Considerations Capacity Blocks have a fixed end time. EC2 begins terminating instances 30 minutes before the block expires (60 minutes for UltraServer types). Karpenter will start draining nodes 10 minutes before EC2 termination begins, giving your workloads time to gracefully shut down. Plan your inference workloads accordingly, and consider using `on-demand` as a fallback capacity type in the NodePool if you need continuity beyond the Capacity Block window. 
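The timing above is easy to work out by hand, but a short calculation makes the planning window concrete. The sketch below is an illustration only: `capacity_block_timeline` is a hypothetical helper, and the offsets are taken directly from the figures in this section (30 or 60 minutes for EC2 termination, 10 more minutes for Karpenter draining).

```python
from datetime import datetime, timedelta


def capacity_block_timeline(end_time: str, ultraserver: bool = False) -> dict[str, datetime]:
    """Derive the drain and termination start times from a Capacity Block end time.

    EC2 starts terminating 30 minutes before expiry (60 for UltraServer types);
    Karpenter starts draining 10 minutes before EC2 termination begins.
    """
    # Replace a trailing "Z" so fromisoformat() also works on Python < 3.11.
    end = datetime.fromisoformat(end_time.replace("Z", "+00:00"))
    termination_start = end - timedelta(minutes=60 if ultraserver else 30)
    drain_start = termination_start - timedelta(minutes=10)
    return {"drain_start": drain_start, "termination_start": termination_start, "end": end}


# endTime value as reported in the NodeClass status example earlier.
timeline = capacity_block_timeline("2026-03-18T11:30:00Z")
for name, moment in timeline.items():
    print(f"{name}: {moment.isoformat()}")
```

For that `endTime`, draining starts at 10:50 UTC and EC2 termination at 11:00 UTC, so workloads have 40 minutes less usable time than the nominal block length.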
## Cleanup Delete all DynamoGraphDeployments: ```bash kubectl -n ${DYNAMO_NAMESPACE} get dgd # If you have any, delete them kubectl -n ${DYNAMO_NAMESPACE} delete dgd ``` Uninstall the Dynamo platform: ```bash helm uninstall -n ${DYNAMO_NAMESPACE} dynamo-platform ``` Clean up the leftover PVCs related to NATS: ```bash kubectl -n ${DYNAMO_NAMESPACE} get pvc NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE dynamo-platform-nats-js-dynamo-platform-nats-0 Bound pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx 10Gi RWO auto-ebs-sc 75m kubectl -n ${DYNAMO_NAMESPACE} delete pvc dynamo-platform-nats-js-dynamo-platform-nats-0 ``` Delete the Auto Mode GPU NodePool: ```bash kubectl delete nodepool gpu ``` To clean up the EFS-related resources, follow the cleanup section of the [EFS setup guide](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/efs#cleanup). Delete the EKS Auto Mode cluster using eksctl: ```bash eksctl delete cluster -f <(envsubst < templates/eksctl.yaml) ``` # EFA (RDMA over AWS Fabric) on EKS This guide covers setting up RDMA over AWS Elastic Fabric Adapter (EFA) on EKS for high-performance disaggregated inference with Dynamo. EFA is the only RDMA fabric available on AWS; InfiniBand and RoCE are not offered. With EFA, Dynamo's prefill and decode workers transfer KV cache directly between GPUs across nodes via GPU-Direct RDMA, bypassing CPU and TCP/IP stacks. Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~1s with EFA on Llama-3.1-8B at ISL 8000). See the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for the transport-layer fundamentals.
## Prerequisites **Recommended GPU EC2 instance types with EFA:** | Instance family | GPU | Aggregate EFA bandwidth | Arch | | ------------------------------ | ------------------------------------------------------- | ------------------------------------------------------- | ----------------- | | `p5.48xlarge` / `p5e.48xlarge` | 8× H100 / H200 | 3.2 Tbps | x86_64 | | `p5en.48xlarge` | 8× H200 | 3.2 Tbps | x86_64 | | `p6-b200.48xlarge` | 8× B200 | 3.2 Tbps | x86_64 | | P6e-GB200 UltraServer | GB200 (topology-dependent, up to 72 GPUs / UltraServer) | 400 Gbps EFAv4 per GPU; up to 28.8 Tbps per UltraServer | **arm64 (Grace)** | This table is not an exhaustive list of all AWS instance types that support EFA. It lists the GPU families most relevant to Dynamo disaggregated inference. **Cluster setup:** - **GPU-Direct RDMA enabled on the host**: either kernel ≥ 5.12 (DMA-BUF path; default on current AWS EKS AMIs, typically 6.14+) **or** an older kernel with the `nvidia-peermem` / AWS `efa_nv_peermem` module loaded (legacy peer-memory path; see [Step 2](#step-2-verify-host-kernel-modules) for how to install it). - **EFA-enabled security group**: VPC security groups must allow all traffic between EFA-attached ENIs. The standard recommendation is a self-referencing security group rule that allows all protocols within the group. See [AWS EFA security group setup](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security). - **EKS node groups created with EFA support**: when using `eksctl`, set `efaEnabled: true` on the GPU node group. This attaches the appropriate number of EFA ENIs per instance type. ## Overview EFA setup involves three pieces: 1. **AWS EFA Kubernetes device plugin**: exposes EFA NICs as the `vpc.amazonaws.com/efa` extended resource (host-level setup, [Step 1](#step-1-install-the-aws-efa-kubernetes-device-plugin)).
On modern kernels (≥ 5.12) the DMA-BUF path is used and `efa_nv_peermem` is not required; older kernels need it loaded ([Step 2](#step-2-verify-host-kernel-modules)). 2. **Container image** with libfabric + aws-ofi-nccl + Dynamo ([Step 3](#step-3-build-a-dynamo-efa-image)). 3. **Workload spec** that selects the LIBFABRIC NIXL backend, requests EFA resources, and runs privileged ([Step 4](#step-4-configure-nixl-backend), [Step 5](#step-5-pod-resource-requests)). ## Step 1: Install the AWS EFA Kubernetes Device Plugin The AWS EFA Kubernetes Device Plugin exposes each node's EFA endpoints as the `vpc.amazonaws.com/efa` extended resource so pods can request them. AWS publishes two install paths — pick one: **Helm (recommended, from the official `aws/eks-charts` repo):** ```bash helm repo add eks https://aws.github.io/eks-charts helm repo update helm install aws-efa-k8s-device-plugin \ --namespace kube-system \ eks/aws-efa-k8s-device-plugin ``` **Or raw manifest (from [aws-samples/aws-efa-eks](https://github.com/aws-samples/aws-efa-eks)):** ```bash kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-efa-eks/main/manifest/efa-k8s-device-plugin.yml ``` Wait for the device plugin pods to start on every EFA-capable node: ```bash kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin-daemonset -w ``` Verify EFA resources are advertised by each GPU node: ```bash kubectl get nodes -o json | jq '.items[] | select(.status.allocatable["vpc.amazonaws.com/efa"] != null) | {name: .metadata.name, efa: .status.allocatable["vpc.amazonaws.com/efa"], gpu: .status.allocatable["nvidia.com/gpu"]}' ``` Each EFA-capable node should report a non-zero `vpc.amazonaws.com/efa` count (e.g., `32` on `p5.48xlarge`, reflecting that instance's EFA endpoint count). The exact count depends on instance type and how the node group's ENIs were configured at launch. 
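If `jq` is not available, the same check can be done in Python. The sketch below mirrors the `jq` filter above and additionally flags nodes that advertise a zero count. It assumes the input is the parsed output of `kubectl get nodes -o json`; the function name and the sample node names are made up for illustration.

```python
def efa_nodes(node_list: dict) -> list[dict]:
    """Keep nodes that advertise vpc.amazonaws.com/efa, as the jq filter does."""
    result = []
    for item in node_list.get("items", []):
        allocatable = item.get("status", {}).get("allocatable", {})
        efa = allocatable.get("vpc.amazonaws.com/efa")
        if efa is None:
            continue  # Node does not advertise the EFA extended resource at all.
        result.append({
            "name": item["metadata"]["name"],
            "efa": efa,
            "gpu": allocatable.get("nvidia.com/gpu"),
            # Kubernetes reports allocatable quantities as strings.
            "ok": int(efa) > 0,
        })
    return result


# Illustrative input shaped like `kubectl get nodes -o json`; names are invented.
sample_nodes = {
    "items": [
        {"metadata": {"name": "gpu-node-a"},
         "status": {"allocatable": {"vpc.amazonaws.com/efa": "32", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-node-b"},
         "status": {"allocatable": {"vpc.amazonaws.com/efa": "0", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "cpu-node"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
}

for node in efa_nodes(sample_nodes):
    print(node)
```

A node with `"ok": False` advertises the resource but has no allocatable EFA devices, which usually points back to how the node group's ENIs were configured at launch.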
## Step 2: Verify Host Kernel Modules Modern AWS GPU AMIs (Amazon Linux 2023, Ubuntu 22.04+, kernel ≥ 5.12) use **DMA-BUF** for GPU-Direct RDMA and **do not require** `nvidia-peermem` or `efa_nv_peermem`. The default AMIs for p5/p5e/p5en/p6-b200/GB200 ship with kernels in the 6.x line where DMA-BUF is the active path. To confirm: ```bash # On a GPU node (via kubectl debug or SSH): uname -r # Expected: 6.x kernel (e.g., 6.14.0-1018-aws) lsmod | grep -E "^efa|nvidia" # Expected: efa, nvidia, nvidia_modeset, nvidia_uvm, gdrdrv loaded # Note: nvidia-peermem / efa_nv_peermem NOT loaded is normal on modern kernels cat /sys/module/efa/version # Expected: 3.0.0g or newer ``` If you are on an older kernel (< 5.12) and the host doesn't already have `efa_nv_peermem` loaded, the simplest path is to switch to an AMI that includes EFA host-level components — the EKS-optimized AL2023 NVIDIA AMI and all Bottlerocket AMIs include them. Otherwise, run [`aws-efa-installer`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable) on the host (via a privileged DaemonSet or baked into a custom AMI). See [AWS — Manage EFA devices on Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/device-management-efa.html) for the full picture. ## Step 3: Build a Dynamo EFA Image Dynamo's image build is two steps: `container/render.py` writes a Dockerfile for the chosen framework + target, then `docker build` consumes it. Passing `--make-efa` to `render.py` appends the AWS EFA installer stage from [`container/templates/aws.Dockerfile`](../../../../container/templates/aws.Dockerfile), which defines a stage named `aws` on top of `runtime`. **You must pass `--target aws` to `docker build`** — without it, `docker build` stops at the `runtime` stage and you get an image without EFA. See [`container/README.md`](../../../../container/README.md) for the full build workflow. 
```bash # vLLM EFA image (amd64 or arm64 — vllm/vllm-openai is multi-arch) container/render.py --framework=vllm --target=runtime --platform=linux/amd64 \ --make-efa --output-short-filename docker build --target aws -t dynamo:latest-vllm-runtime-efa \ -f container/rendered.Dockerfile . container/render.py --framework=vllm --target=runtime --platform=linux/arm64 \ --make-efa --output-short-filename docker buildx build --platform=linux/arm64 --target aws \ -t dynamo:latest-vllm-runtime-efa-arm64 -f container/rendered.Dockerfile . # SGLang EFA image (amd64 or arm64) container/render.py --framework=sglang --target=runtime --platform=linux/amd64 \ --make-efa --output-short-filename docker build --target aws -t dynamo:latest-sglang-runtime-efa \ -f container/rendered.Dockerfile . container/render.py --framework=sglang --target=runtime --platform=linux/arm64 \ --make-efa --output-short-filename docker buildx build --platform=linux/arm64 --target aws \ -t dynamo:latest-sglang-runtime-efa-arm64 -f container/rendered.Dockerfile . # TRT-LLM EFA image (amd64 or arm64 — upstream nvcr.io/nvidia/tensorrt-llm/release # publishes both variants; arm64 is what you want for GB200 / Grace EFA nodes) container/render.py --framework=trtllm --target=runtime --platform=linux/amd64 \ --cuda-version=13.1 --make-efa --output-short-filename docker build --target aws -t dynamo:latest-trtllm-runtime-efa \ -f container/rendered.Dockerfile . container/render.py --framework=trtllm --target=runtime --platform=linux/arm64 \ --cuda-version=13.1 --make-efa --output-short-filename docker buildx build --platform=linux/arm64 --target aws \ -t dynamo:latest-trtllm-runtime-efa-arm64 -f container/rendered.Dockerfile . ``` `--output-short-filename` writes to `container/rendered.Dockerfile`; omit it to get the long auto-generated filename (e.g., `vllm-runtime-cuda12.9-amd64-rendered.Dockerfile`) — useful when keeping several rendered Dockerfiles side by side. 
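The six render and build pairs above follow one pattern, which can be sketched as a small dry-run helper. This is an illustration only: `print_efa_build_cmds` is a hypothetical name, it prints the commands rather than running them, and it uses only the flags and tags shown in the block above.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints (does not run) the render + build
# commands for one framework/architecture pair, using the flags shown above.
print_efa_build_cmds() {
  local framework="$1" arch="$2"
  local render_extra="" tag_suffix="" build="docker build"

  case "$framework" in
    vllm|sglang) ;;
    trtllm) render_extra=" --cuda-version=13.1" ;;  # the TRT-LLM examples above pin CUDA 13.1
    *) echo "unknown framework: $framework" >&2; return 1 ;;
  esac

  case "$arch" in
    amd64) ;;
    arm64) tag_suffix="-arm64"; build="docker buildx build --platform=linux/arm64" ;;
    *) echo "unknown arch: $arch" >&2; return 1 ;;
  esac

  echo "container/render.py --framework=${framework} --target=runtime --platform=linux/${arch}${render_extra} --make-efa --output-short-filename"
  echo "${build} --target aws -t dynamo:latest-${framework}-runtime-efa${tag_suffix} -f container/rendered.Dockerfile ."
}
```

For example, `print_efa_build_cmds trtllm arm64` prints the two commands for the GB200 / Grace variant; review them before running, since every pair overwrites `container/rendered.Dockerfile`.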
See [Known Issues](#known-issues) below for one case where the default-built image does **not** produce a working EFA deployment out of the box (GB200 / arm64 64K-page kernels). The symptom looks like a working setup but fails at startup during NIXL memory registration. ## Step 4: Configure NIXL Backend NIXL is the high-level KV transfer API and supports multiple backends. **For EFA, the LIBFABRIC backend must be selected.** UCX is NIXL's default backend, and while it has CUDA-IPC / RDMA transports available in the image, in standard pod-to-pod EFA configurations it lands on a slow transport (effectively TCP-speed at ~1–3 GB/s) instead of EFA's line rate. Empirically, LIBFABRIC is the only backend that reaches full EFA bandwidth on AWS. Each framework selects the backend differently: | Framework | How to select LIBFABRIC | Default if unset | | --------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | | **SGLang** | `SGLANG_DISAGGREGATION_NIXL_BACKEND=LIBFABRIC` env var | UCX → TCP fallback | | **vLLM** | `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'` CLI flag | UCX → TCP fallback | | **TRT-LLM** | `TRTLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC` env var | UCX → TCP fallback | | **KVBM (Rust)** | `DYN_KVBM_NIXL_BACKEND_LIBFABRIC=true` env var | UCX → TCP fallback | This is a silent-failure path — getting it wrong manifests as ~100 s TTFT instead of a clear error. Always [verify at startup](#verification) that LIBFABRIC is active. 
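Because the wrong backend fails silently, it can help to gate worker startup on the NIXL log line. The sketch below assumes the log strings quoted in the Verification section (`Backend LIBFABRIC was instantiated` / `Backend UCX was instantiated`); the function name and exit-code convention are hypothetical, not part of Dynamo or NIXL.

```shell
#!/usr/bin/env bash
# Hypothetical startup gate: reads worker logs on stdin and reports which NIXL
# backend was instantiated, based on the log strings quoted in this guide.
# Exit codes (a convention chosen for this sketch): 0 = LIBFABRIC, 1 = UCX, 2 = no match.
nixl_backend_from_logs() {
  local logs
  logs=$(cat)
  if grep -q "Backend LIBFABRIC was instantiated" <<<"$logs"; then
    echo "LIBFABRIC"
  elif grep -q "Backend UCX was instantiated" <<<"$logs"; then
    echo "UCX"
    return 1
  else
    echo "unknown"
    return 2
  fi
}
```

A typical use would be piping `kubectl logs` for a worker pod into `nixl_backend_from_logs` and failing the rollout on a non-zero exit. If both lines appear in the logs, this sketch reports LIBFABRIC.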
### Required EFA environment variables In addition to backend selection, set these on every worker pod: ```yaml env: - { name: FI_PROVIDER, value: efa } - { name: FI_EFA_USE_DEVICE_RDMA, value: "1" } - { name: FI_EFA_ENABLE_SHM_TRANSFER, value: "0" } - { name: FI_EFA_ENABLE_SHM, value: "0" } # Place Amazon EFA libs first in LD_LIBRARY_PATH - name: LD_LIBRARY_PATH value: "/opt/amazon/efa/lib:/opt/amazon/efa/lib64:/opt/aws-ofi-nccl/lib:${LD_LIBRARY_PATH}" ``` ### Recommended EFA performance tuning ```yaml env: - { name: FI_EFA_FORK_SAFE, value: "0" } - { name: FI_EFA_USE_HUGE_PAGE, value: "1" } - { name: FI_EFA_MR_MAX_CACHED_COUNT, value: "524288" } - { name: FI_EFA_MR_MAX_CACHED_SIZE, value: "0" } ``` When using `FI_EFA_USE_HUGE_PAGE=1`, also add `hugepages-2Mi: 5120Mi` to the pod resource limits. ## Step 5: Pod Resource Requests Dynamo pods that use EFA must request the resource and run privileged: ```yaml resources: limits: nvidia.com/gpu: "4" # or your TP vpc.amazonaws.com/efa: "4" # number of EFA NICs to allocate hugepages-2Mi: 5120Mi # if FI_EFA_USE_HUGE_PAGE=1 securityContext: privileged: true # REQUIRED — IPC_LOCK alone is insufficient capabilities: add: [IPC_LOCK] hostIPC: true # required by some EFA setups volumeMounts: - { name: shm, mountPath: /dev/shm } ``` ```yaml volumes: - name: shm emptyDir: { medium: Memory, sizeLimit: 80Gi } ``` `privileged: true` is required for NIXL to register CUDA VRAM with the EFA NIC via `fi_mr_reg`. `IPC_LOCK` alone is insufficient. ## Known Issues One issue currently affects default-built Dynamo EFA images. ### Issue 1: libfabric on GB200 fails `fi_mr_reg` on CUDA VRAM **Known affected platforms:** GB200. **Symptom:** Worker pod fails at startup with `fi_mr_reg` returning EFAULT during NIXL initialization. NIXL VRAM registration fails; depending on the framework, the worker either crashes or silently falls back to TCP. 
**Root cause:** The libfabric version (versions lower than 2.5.x) bundled with the EFA installer (up to currently latest 1.48.0) lacks a CUDA branch in the dmabuf-eligibility check in `prov/efa/src/efa_mr.c`. On x86_64 hosts the legacy `ibv_reg_mr` path handles CUDA pointers natively, so the bug doesn't surface. On arm64 64K-page kernels (GB200), the legacy path returns EFAULT for CUDA VRAM. Tracked in [ofiwg/libfabric#12019](https://github.com/ofiwg/libfabric/issues/12019). **Upstream status:** The bug is resolved in `ofiwg/libfabric` main and v2.5.x via a more comprehensive rewrite of `efa_mr_reg_ibv_mr()`. AWS's `aws/libfabric` fork has not picked up the upstream rewrite; the latest EFA installer (1.48.0) still ships `v2.4.0amzn3.0` with the older code path. **Workarounds:** 1. **Apply the one-line patch to the bundled libfabric.** During image build, replace the `aws.Dockerfile` install step with a custom build: ```dockerfile RUN git clone --depth 1 --branch v2.4.0amzn3.0 https://github.com/aws/libfabric.git /tmp/libfabric && \ cd /tmp/libfabric && \ sed -i 's/efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr)/efa_mr_is_neuron(efa_mr) || efa_mr_is_rocr(efa_mr) || efa_mr_is_cuda(efa_mr)/' prov/efa/src/efa_mr.c && \ ./autogen.sh && \ CPPFLAGS="-I/usr/local/cuda/include" \ LDFLAGS="-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs -Wl,-rpath,/usr/local/cuda/lib64" \ ./configure --prefix=/opt/amazon/efa --enable-efa --with-cuda=/usr/local/cuda --enable-cuda-dlopen && \ make -j$(nproc) && make install # Then rebuild aws-ofi-nccl from source against the patched libfabric (do not mix versions) ``` 2. **Replace bundled libfabric with `ofiwg/libfabric@v2.5.1`** (or newer). The upstream rewrite is already present; no patch needed. Rebuild `aws-ofi-nccl` against it. ## Verification After deployment, confirm EFA is actually being used (not silent TCP fallback): **1. 
NIXL chose the LIBFABRIC backend** (not UCX): ```bash kubectl logs | grep -iE "NIXL.*backend|Backend.*instantiated" # Expected: "Backend LIBFABRIC was instantiated" # WRONG: "Backend UCX was instantiated" ``` **2. The LIBFABRIC plugin is loaded and executing** (not just opened): ```bash kubectl exec -- bash -c ' grep "libplugin_LIBFABRIC" /proc/$(pgrep -f "dynamo|vllm|sglang" | head -1)/maps | grep "r-xp" ' # Expected: at least one line ending in "r-xp" (executable code page mapped) # If only "r--p" : library opened but never run — config didn't apply, NIXL chose a different backend ``` **3. Registered RDMA memory is GPU VRAM, not CPU pinned memory** (no CPU bounce): ```bash kubectl logs | grep "efa_mr_reg_impl" | head -1 # Look for "Registered memory at 0x7d7749bc4000 of size 431767552" kubectl exec -- bash -c 'grep "7d7749bc4" /proc/$(pgrep -f "dynamo|vllm|sglang" | head -1)/maps' # Expected: NO OUTPUT — CUDA VRAM addresses are not in the Linux VMA table. # If the address IS found: CPU pinned memory was registered — CPU bounce — GPU-Direct NOT working. ``` **4. NIXL transfers are happening, none failing** (via Prometheus metrics endpoint): NIXL telemetry is off by default. To enable it, set on each worker: ```yaml env: - { name: NIXL_TELEMETRY_ENABLE, value: "y" } - { name: NIXL_TELEMETRY_EXPORTER, value: prometheus } - { name: NIXL_TELEMETRY_PROMETHEUS_PORT, value: "19090" } # NIXL's own port — distinct from framework metrics ``` Then query: ```bash kubectl exec -- curl -s localhost:19090/metrics | grep -E "nixl_bytes_transferred|nixl_num_failed_transfers" # Expected: nixl_bytes_transferred_count > 0 and increasing # nixl_num_failed_transfers_total stays 0 ``` The same metrics with the `vllm:` prefix are also published to vLLM's own metrics endpoint (typically `DYN_SYSTEM_PORT`, e.g. `8081`) when vLLM is the frontend. **5. 
Decode side confirms KV receipt**: ```bash kubectl logs | grep "External prefix cache hit rate" # Expected: "External prefix cache hit rate: 100.0%" ``` Do not use `rdma_write_bytes` or other `/sys/class/infiniband/*/counters/*` checks for EFA verification. EFA SRD uses SEND operations at the hardware level, not RDMA READ/WRITE — `rdma_write_bytes` is always 0 on correctly configured EFA by design. Use the Prometheus + `/proc//maps` methodology above instead. ## Common Failure Modes | Symptom | Likely cause | Fix | | ------------------------------------------------------ | -------------------------------------------------------------------- | ----------------------------------------------------------------------------- | | TTFT ~100 s, throughput ~MB/s | Silent TCP fallback — NIXL backend selection not applied | Verify Step 4 backend env var; check NIXL startup log | | TTFT ~10 s, throughput 1–5 GB/s | UCX host-staged (no GPU-Direct on kernel ≥ 6.8) | Switch to LIBFABRIC backend | | Pod fails at startup with `fi_mr_reg` EFAULT on GB200 | Issue 1 (libfabric CUDA dmabuf bug) | Apply patch or use ofiwg/libfabric v2.5.1 | | Pod fails at startup with `fi_mr_reg` EFAULT on x86_64 | `privileged: true` missing OR `efa_nv_peermem` missing on old kernel | Verify Step 5 security context | | Bandwidth halves after image rebuild | libfabric / aws-ofi-nccl ABI mismatch | Rebuild aws-ofi-nccl from source against the libfabric used in the same image | | `rdma_write_bytes` shows 0 | **Not a failure** — EFA SRD uses SEND, not WRITE | Use Prometheus `nixl_bytes_transferred` instead | ## References - [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) — transport-layer fundamentals - [RDMA / InfiniBand on AKS](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) — Azure equivalent - [`container/templates/aws.Dockerfile`](../../../../container/templates/aws.Dockerfile) — EFA installer template - [AWS — Manage EFA 
devices on Amazon EKS](https://docs.aws.amazon.com/eks/latest/userguide/device-management-efa.html) — official EKS-side guide (DRA driver + device plugin) - [AWS EFA documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) — EC2-side EFA overview - [`aws/eks-charts` — `aws-efa-k8s-device-plugin`](https://github.com/aws/eks-charts/tree/master/stable/aws-efa-k8s-device-plugin) — Helm chart source - [ofiwg/libfabric#12019](https://github.com/ofiwg/libfabric/issues/12019) — CUDA dmabuf registration on EFA # Amazon EFS Setup for EKS # Create an Amazon EFS File System for Amazon EKS This guide walks through creating an Amazon EFS file system and connecting it to your EKS cluster. The EFS CSI Driver was already installed as an addon via `eksctl.yaml` during cluster creation. Now we need to create the actual file system and make it available to Kubernetes workloads. This filesystem will be used by Dynamo to store shared model weights and compilation cache across nodes. ## Prerequisites - EKS cluster created following the [EKS guide](/dynamo/kubernetes-deployment/cloud-provider-guides/aws/eks-setup) - Environment variables set: ```bash export AWS_REGION="us-east-1" export CLUSTER_NAME="ai-dynamo" export DYNAMO_NAMESPACE="dynamo-system" ``` ## Retrieve VPC and Subnet Information Get the VPC ID associated with your EKS cluster: ```bash export VPC_ID=$(aws eks describe-cluster \ --name $CLUSTER_NAME \ --region $AWS_REGION \ --query "cluster.resourcesVpcConfig.vpcId" \ --output text) ``` Get the CIDR range for the VPC (used for the security group rule): ```bash export VPC_CIDR=$(aws ec2 describe-vpcs \ --vpc-ids $VPC_ID \ --query "Vpcs[0].CidrBlock" \ --output text) ``` ## Create a Security Group for EFS Create a security group that allows NFS traffic (port 2049) from within the VPC: ```bash export EFS_SG_ID=$(aws ec2 create-security-group \ --group-name dynamo-efs-sg \ --description "Security group for EFS access from EKS" \ --vpc-id $VPC_ID \ --region 
$AWS_REGION \ --query "GroupId" \ --output text) ``` Add an inbound rule to allow NFS traffic from the VPC CIDR: ```bash aws ec2 authorize-security-group-ingress \ --group-id $EFS_SG_ID \ --protocol tcp \ --port 2049 \ --cidr $VPC_CIDR \ --region $AWS_REGION ``` ## Create the EFS File System ```bash export EFS_FS_ID=$(aws efs create-file-system \ --performance-mode generalPurpose \ --throughput-mode elastic \ --encrypted \ --region $AWS_REGION \ --tags Key=Name,Value=dynamo-efs \ --query "FileSystemId" \ --output text) ``` Wait for the file system to become available: ```bash aws efs describe-file-systems \ --file-system-id $EFS_FS_ID \ --region $AWS_REGION \ --query "FileSystems[0].LifeCycleState" \ --output text ``` You should see `available` before proceeding. ## Create Mount Targets Mount targets allow your EKS nodes to access the EFS file system. You need one mount target per subnet where your nodes run. Get the subnet IDs used by your EKS cluster: ```bash export SUBNET_IDS=$(aws eks describe-cluster \ --name $CLUSTER_NAME \ --region $AWS_REGION \ --query "cluster.resourcesVpcConfig.subnetIds[]" \ --output text) echo "Subnet IDs: $SUBNET_IDS" ``` Create a mount target in each subnet: ```bash for SUBNET_ID in $(echo "$SUBNET_IDS" | tr '\t' '\n'); do echo "Creating mount target in subnet: $SUBNET_ID" aws efs create-mount-target \ --file-system-id $EFS_FS_ID \ --subnet-id $SUBNET_ID \ --security-groups $EFS_SG_ID \ --region $AWS_REGION 2>/dev/null || echo " Mount target already exists or subnet is in a duplicate AZ (this is OK)" done ``` EFS allows only one mount target per Availability Zone. If multiple subnets are in the same AZ, the command will fail for the duplicates, which is expected and safe to ignore. 
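If you would rather avoid the expected failures than ignore them, the subnet list can be reduced to one subnet per Availability Zone before the loop runs. This is a minimal sketch; `first_subnet_per_az` is a hypothetical helper, and it simply keeps the first subnet listed for each AZ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<subnet-id> <availability-zone>" pairs on stdin
# and prints the first subnet seen for each AZ, one per line.
first_subnet_per_az() {
  awk '!seen[$2]++ { print $1 }'
}

# Example wiring (needs AWS credentials, so it is left commented out):
# aws ec2 describe-subnets --subnet-ids $SUBNET_IDS --region $AWS_REGION \
#   --query "Subnets[].[SubnetId,AvailabilityZone]" --output text \
#   | first_subnet_per_az
```

The output can replace `$SUBNET_IDS` in the mount-target loop, so each `create-mount-target` call targets a distinct AZ.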
Verify mount targets are available:

```bash
aws efs describe-mount-targets \
  --file-system-id $EFS_FS_ID \
  --region $AWS_REGION \
  --query "MountTargets[*].{SubnetId:SubnetId,AZ:AvailabilityZoneName,State:LifeCycleState}" \
  --output table
```

Wait until all mount targets show `available` in the State column before proceeding.

## Create Kubernetes StorageClass

Create a StorageClass that uses the EFS CSI driver with dynamic provisioning:

```bash
kubectl apply -f - << EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc-dynamic
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: "${EFS_FS_ID}"
  directoryPerms: "777"
  uid: "1000"
  gid: "1000"
EOF
```

## Create a PersistentVolumeClaim

We create three separate PVCs because different Dynamo recipe examples reference each one individually:

* `model-cache` stores downloaded model weights (e.g. from HuggingFace).
* `compilation-cache` stores vLLM/TRT-LLM compilation artifacts.
* `perf-cache` stores benchmark traces and performance results.

```bash
# Create the namespace for Dynamo if it does not already exist
# (a plain "kubectl create namespace" fails when the namespace is present)
kubectl create namespace ${DYNAMO_NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -

# Create PVCs
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
  namespace: ${DYNAMO_NAMESPACE}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
EOF
```

Because EFS is elastic, the `storage` value in the PVC is required by Kubernetes but does not limit the actual storage. EFS grows and shrinks automatically.
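The three PVC manifests above differ only in `metadata.name`, so they can also be generated from a template. The following is a sketch under that assumption; `render_pvc` is a hypothetical helper that prints one manifest with the same fields as above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one PVC manifest for the given name and namespace,
# mirroring the fields used in the manifests above.
render_pvc() {
  local name="$1" namespace="$2"
  cat <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: "efs-sc-dynamic"
EOF
}

# Example wiring (needs cluster access, so it is left commented out):
# for pvc in model-cache compilation-cache perf-cache; do
#   render_pvc "$pvc" "$DYNAMO_NAMESPACE"
# done | kubectl apply -f -
```

Generating the manifests this way keeps the three claims identical apart from their names, which makes it harder for one of them to drift when the storage class or access mode changes.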
## Verify

Confirm the PVCs are bound:

```bash
kubectl get pvc -n ${DYNAMO_NAMESPACE}
```

You should see output similar to:

```text
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     VOLUMEATTRIBUTESCLASS   AGE
compilation-cache   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   5Gi        RWX            efs-sc-dynamic   <unset>                 41s
model-cache         Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   5Gi        RWX            efs-sc-dynamic   <unset>                 42s
perf-cache          Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   5Gi        RWX            efs-sc-dynamic   <unset>                 41s
```

## Cleanup

To delete the EFS resources when no longer needed:

```bash
# Delete the Kubernetes resources
kubectl delete pvc model-cache compilation-cache perf-cache -n ${DYNAMO_NAMESPACE}
kubectl delete storageclass efs-sc-dynamic

# Delete mount targets
for MT_ID in $(aws efs describe-mount-targets --file-system-id $EFS_FS_ID --region $AWS_REGION --query "MountTargets[*].MountTargetId" --output text); do
  aws efs delete-mount-target --mount-target-id $MT_ID --region $AWS_REGION
done

# Mount target deletion is asynchronous; wait until none remain,
# otherwise deleting the file system fails
while [ "$(aws efs describe-mount-targets --file-system-id $EFS_FS_ID --region $AWS_REGION --query "length(MountTargets)" --output text)" != "0" ]; do
  sleep 5
done

# Delete the EFS file system
aws efs delete-file-system --file-system-id $EFS_FS_ID --region $AWS_REGION

# Delete the security group
aws ec2 delete-security-group --group-id $EFS_SG_ID --region $AWS_REGION
```

# Amazon Elastic Container Service (ECS)

# Dynamo Deployment of vLLM Example on AWS ECS

## 1. EC2 Cluster Setup (for vLLM workloads)

1. Go to the AWS ECS console, open the **Clusters** tab, and click **Create cluster**.
2. Enter the cluster name `dynamo-GPU` and choose **AWS EC2 instances** as the infrastructure. This option creates a cluster with EC2 instances to deploy containers.
3. Choose the Amazon ECS-optimized GPU AMI `Amazon Linux 2 (GPU)`, which includes NVIDIA drivers and the Docker GPU runtime out of the box.
4. Choose `g6e.2xlarge` as the **EC2 instance type** and add an **SSH key pair** so you can log in to the instance for debugging. Disaggregated serving needs at least 2 GPUs, so to test it choose a multi-GPU type such as `g6e.12xlarge` (4 GPUs).
5. Set **Root EBS volume size** to `200`.
6. For networking, use the default settings. Make sure the **security group** has:
   - an inbound rule that allows "All traffic" from this security group.
   - inbound rules for ports 22 and 8000, so that you can SSH into the instance for debugging and reach the frontend endpoint.
7. Select `Turn on` for the **Auto-assign public IP** option.
8. Click **Create**. The cluster is deployed through CloudFormation.

## 2. Fargate Cluster Setup (for ETCD/NATS services)

1. Go to the AWS ECS console, open the **Clusters** tab, and click **Create cluster**.
2. Enter `dynamo-fargate` as the cluster name.
3. Choose **AWS Fargate (serverless)** as the infrastructure.
4. For networking, use the same VPC and subnets as the EC2 cluster to ensure connectivity between services.
5. For the security group, use the same security group as the EC2 cluster. Together with the self-referencing inbound rule from section 1, this allows communication between all services.
6. Ensure outbound rules allow all traffic (the default setting) so the Fargate tasks can download container images and communicate externally.
7. Click **Create** to deploy the Fargate cluster.

## 3. ETCD/NATS Task Definitions Setup

Add a task for ETCD and NATS services to run on Fargate. A sample task definition JSON is attached.

### 3.1 Create the ecsTaskExecutionRole (Required)

Before creating the task definitions, you need to create the `ecsTaskExecutionRole` IAM role. This role allows ECS to pull container images from registries and write logs to CloudWatch on your behalf.

If you create task definitions through the AWS Console's step-by-step wizard, this role is created automatically. However, when importing task definitions from JSON (as recommended in this guide), you must create this role manually.

Follow the [AWS documentation on creating the task execution IAM role](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html#create-task-execution-role) to create a role named `ecsTaskExecutionRole` with the `AmazonECSTaskExecutionRolePolicy` policy attached.
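If you prefer the AWS CLI to the console for this step, the role creation can be sketched as follows. This is an illustration rather than the guide's official procedure: the two function names are hypothetical, while the trust principal `ecs-tasks.amazonaws.com` and the managed policy ARN are the standard ones for an ECS task execution role.

```shell
#!/usr/bin/env bash
# Prints the trust policy that lets ECS tasks assume the execution role.
ecs_task_trust_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Creates the role and attaches the AWS managed policy named in this section.
# Defined but not invoked here: it needs AWS credentials with IAM permissions.
create_ecs_task_execution_role() {
  aws iam create-role \
    --role-name ecsTaskExecutionRole \
    --assume-role-policy-document "$(ecs_task_trust_policy)"
  aws iam attach-role-policy \
    --role-name ecsTaskExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
}
```

Run `create_ecs_task_execution_role` once per account; `create-role` returns an error if the role already exists, in which case only the policy attachment needs checking.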
Depending on the task definition, you may need to add Amazon CloudWatch and AWS Secrets Manager permissions to the `ecsTaskExecutionRole`. For details, see the [Amazon CloudWatch Logs permissions reference](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/permissions-reference-cwl.html) and the [AWS Secrets Manager authentication and access control guide](https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access.html#auth-and-access_secrets).

The role ARN will be `arn:aws:iam::<account-id>:role/ecsTaskExecutionRole`. Make sure to replace the account ID placeholder in any task definition JSON files with your actual AWS account ID.

### 3.2 Task Definition Configuration

1. ETCD container
   - Container name: `etcd`
   - Image URL: `bitnamilegacy/etcd`, with **Yes** for Essential container
   - Container port

     |Container port|Protocol|Port name|App protocol|
     |-|-|-|-|
     |2379|TCP|2379|HTTP|
     |2380|TCP|2380|HTTP|

   - Environment variable: key `ALLOW_NONE_AUTHENTICATION`, value `YES`
2. NATS container
   - Container name: `nats`
   - Image URL: `nats`, with **Yes** for Essential container
   - Container port

     |Container port|Protocol|Port name|App protocol|
     |-|-|-|-|
     |4222|TCP|4222|HTTP|
     |6222|TCP|6222|HTTP|
     |8222|TCP|8222|HTTP|

   - Docker configuration: add `-js, --trace` in **Command**

## 4. vLLM Task Definitions Setup

Ensure you have created the `ecsTaskExecutionRole` as described in section 3.1 before creating these task definitions.

1. Dynamo vLLM Frontend and Decoding Worker Task

   This task creates the vLLM frontend, processors, routers, and a decoding worker. Follow the steps below to create it:
   - Set the container name to `dynamo-frontend` and use the prebuilt [Dynamo container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime).
   - Choose `Amazon EC2 instances` as the **Launch type** with **Task size** `2 vCPU` and `40 GB` memory.
   - Choose `host` as the Network mode.
   - Container name: `dynamo-vLLM-frontend`
   - Add your image URL (you can use the prebuilt [Dynamo container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime)) and select **Yes** for Essential container. The URL can be an AWS ECR URL or an NVIDIA NGC URL. If you use an NGC URL, also choose **Private registry authentication** and add your Secrets Manager ARN or name.
   - Container port

     |Container port|Protocol|Port name|App protocol|
     |-|-|-|-|
     |8000|TCP|8000|HTTP|

   - Use `1` GPU for **Resource allocation limits**.
   - Set the environment variables as below. The `IP_ADDRESS` placeholder is overridden later, at deployment time.

     |Key|Value type|Value|
     |-|-|-|
     |ETCD_ENDPOINTS|Value|http://IP_ADDRESS:2379|
     |NATS_SERVER|Value|nats://IP_ADDRESS:4222|

   - Docker configuration: add `sh,-c` in **Entry point** and `cd examples/backends/vllm && python -m dynamo.frontend --router-mode kv & python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager` in **Command**.
2. Dynamo vLLM PrefillWorker Task

   Create the PrefillWorker task the same way as the frontend and decoding worker task, except for the following changes:
   - Set the container name to `dynamo-prefill`.
   - Do not add a container port mapping.
   - In the Docker configuration, use the command `cd examples/backends/vllm && python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager --disaggregation-mode prefill`.

## 5. Task Deployment

You can create a service or run the task directly from the task definition.

1. ETCD/NATS Task
   - Choose the Fargate cluster (`dynamo-fargate`) created in step 2 for **Existing cluster**.
   - Select **Launch type** as `FARGATE`.
   - In the **Networking** section, select the same VPC and subnets used for the EC2 cluster.
   - For **Security group**, select the same security group used by the EC2 cluster.
   - Verify that outbound rules allow all traffic for downloading images and external communication.
   - Wait for this deployment to finish, then note the **Private IP** of this task.

2.
Dynamo Frontend and Decoding Worker Task - Choose the EC2 cluster (`dynamo-GPU`) for **Existing cluster** created in step 1. - In the **Container Overrides**, use the IP for ETCD/NATS task for the `ETCD_ENDPOINTS` and `NATS_SERVER` values. - After the deployment, an aggregated serving endpoint is created and you can test it with scripts in step 6. 3. Dynamo PrefillWorker Task - For disaggregated serving, you can deploy a separate prefill worker on another GPU. Choose the EC2 cluster (`dynamo-GPU`) for **Existing cluster** created in step 1 with at least 2 GPUs ( `g6e.12xlarge` for example) - In the **Container Overrides**, use the IP for ETCD/NATS task for the `ETCD_ENDPOINTS` and `NATS_SERVER` values. ## 6. Testing Find the public IP of the dynamo frontend task from the task page. Run following commands to query the endpoint. ```sh export DYNAMO_IP_ADDRESS=TASK_PUBLIC_IP_ADDRESS curl http://$DYNAMO_IP_ADDRESS:8000/v1/models curl http://$DYNAMO_IP_ADDRESS:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. 
Are they driven by a quest for knowledge, a search for lost familt clue is hidden." } ], "stream":false, "max_tokens": 30 }' ``` You should be able to see the responses from the hosted endpoint. # Azure Kubernetes Service (AKS) # Dynamo on AKS This guide covers setting up an AKS cluster with GPU nodes and deploying Dynamo. ## Prerequisites - An active Azure subscription with sufficient GPU VM quota - [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) (`az`) installed and logged in - [kubectl](https://kubernetes.io/docs/tasks/tools/) installed - [Helm](https://helm.sh/docs/intro/install/) v3.0+ installed ## Step 1: Create a Resource Group and Cluster ```bash az group create \ --name \ --location ``` ```bash az aks create \ --resource-group \ --name \ --node-count 1 \ --generate-ssh-keys ``` Then get credentials: ```bash az aks get-credentials \ --resource-group \ --name ``` ## Step 2: Add a GPU Node Pool Add a GPU-enabled node pool with driver installation skipped. The `--skip-gpu-driver-install` flag prevents AKS from managing GPU drivers — the NVIDIA GPU Operator (Step 3) will handle that instead. ```bash az aks nodepool add \ --resource-group \ --cluster-name \ --name gpunp \ --node-count 2 \ --node-vm-size Standard_NC24ads_A100_v4 \ --skip-gpu-driver-install ``` For RDMA-capable workloads (disaggregated inference), use ND-series VMs such as `Standard_ND96asr_v4` or `Standard_ND96isr_H100_v5`. See the [RDMA / InfiniBand guide](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) for the additional setup required on those nodes. For a full list of GPU VM sizes, see [GPU-optimized VM sizes](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu). ## Step 3: Install the NVIDIA GPU Operator The GPU Operator manages NVIDIA drivers, container toolkit, device plugin, and monitoring on GPU nodes. 
```bash helm repo add nvidia https://helm.ngc.nvidia.com/nvidia helm repo update ``` ```bash helm install gpu-operator nvidia/gpu-operator \ --namespace gpu-operator --create-namespace ``` Verify the pods are running: ```bash kubectl get pods -n gpu-operator ``` Expected output (abbreviated): ```text NAMESPACE NAME READY STATUS RESTARTS AGE gpu-operator gpu-feature-discovery-xxxxx 1/1 Running 0 2m gpu-operator gpu-operator-xxxxx 1/1 Running 0 2m gpu-operator nvidia-container-toolkit-daemonset-xxxxx 1/1 Running 0 2m gpu-operator nvidia-cuda-validator-xxxxx 0/1 Completed 0 1m gpu-operator nvidia-device-plugin-daemonset-xxxxx 1/1 Running 0 2m gpu-operator nvidia-driver-daemonset-xxxxx 1/1 Running 0 2m ``` If you need RDMA / InfiniBand for disaggregated inference, **do not install the GPU Operator yet** — the RDMA setup requires different Helm values. See [RDMA / InfiniBand](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) for the full setup, which includes the correct GPU Operator install command. ## Step 4: Install Dynamo Follow the [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide) to install the Dynamo Platform and deploy your first model. ## Additional Guides ### [RDMA / InfiniBand](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/rdma-infini-band) Required for disaggregated inference in production. Without RDMA, KV cache transfers between prefill and decode workers fall back to TCP with severe latency degradation (~98s TTFT vs ~200–500ms with RDMA). ND-series VMs (e.g., `Standard_ND96asr_v4`, `Standard_ND96isr_H100_v5`) include Mellanox ConnectX InfiniBand NICs but require additional setup beyond the GPU Operator: the NVIDIA Network Operator, a NicClusterPolicy for MOFED drivers, an `ib-node-config` DaemonSet to configure kernel modules and memlock limits, and an RDMA Shared Device Plugin to expose the NICs to pods. 
### [Storage for Model Caching](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-storage) Prevents each pod from independently downloading model weights on startup. Without shared storage, large models take hours to load per pod and will hit HuggingFace rate limits at scale. Covers Azure Managed Lustre, Azure Files, Azure Disk, and Local CSI options with per-cache-type recommendations (model cache, compilation cache, performance cache). ### [Azure Lustre CSI Driver](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) The recommended storage for large multi-node models requiring high-throughput shared access. Azure Managed Lustre is not installed by default — this guide covers installing and configuring the Lustre CSI driver before you can use it as a PVC storage class. ### [Spot VMs](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/spot-v-ms) Significantly reduces GPU compute costs by running on preemptible Spot VM node pools. AKS automatically taints Spot nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so Dynamo components need explicit tolerations. The Dynamo Helm chart includes a pre-built `values-aks-spot.yaml` that handles this. ## Clean Up Resources ```bash # Delete all Dynamo Graph Deployments kubectl delete dynamographdeployments.nvidia.com --all --all-namespaces # Uninstall Dynamo Platform export NAMESPACE="dynamo-system" helm uninstall dynamo-platform -n $NAMESPACE # If running Dynamo < 1.0 with a separate CRDs chart: # helm uninstall dynamo-crds -n $NAMESPACE ``` If you want to delete the GPU Operator, follow the [Uninstalling the NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/uninstall.html) guide. If you want to delete the entire AKS cluster, follow the [Delete an AKS cluster](https://learn.microsoft.com/en-us/azure/aks/delete-cluster) guide. 
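Before tearing anything down, it can help to print the commands in order and review them. The following is a minimal sketch, not part of the Dynamo tooling: `teardown_plan` is a hypothetical helper, `my-rg` and `my-aks` are placeholder names, and the last line assumes you want to remove the cluster with `az aks delete`. The GPU Operator is left out on purpose because its uninstall guide covers extra steps.

```bash
# Sketch only: prints the teardown commands in order without running them.
# teardown_plan is a hypothetical helper; the resource names are placeholders.
teardown_plan() {
  # $1 = resource group, $2 = cluster name, $3 = Dynamo namespace (optional)
  printf '%s\n' \
    "kubectl delete dynamographdeployments.nvidia.com --all --all-namespaces" \
    "helm uninstall dynamo-platform -n ${3:-dynamo-system}" \
    "az aks delete --resource-group $1 --name $2 --yes --no-wait"
}

# Example: review the plan for cluster my-aks in resource group my-rg.
teardown_plan my-rg my-aks
```

Nothing is executed until you copy the printed lines into your shell, which leaves room to skip the cluster deletion if you only want to remove Dynamo.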
# RDMA / InfiniBand on AKS

This guide covers setting up RDMA over InfiniBand on AKS for high-performance disaggregated inference with Dynamo. RDMA enables direct memory access between GPUs across nodes, bypassing CPU and kernel overhead — critical for low-latency KV cache transfer between prefill and decode workers.

Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). See the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) for details on transport options and performance expectations.

The Network Operator and NicClusterPolicy steps in this guide are based on the [Azure AKS RDMA InfiniBand](https://github.com/Azure/aks-rdma-infiniband) repository. That project is open-source and not covered by Microsoft Azure support — file issues on the GitHub repository.

## Prerequisites

**AKS cluster with RDMA-capable nodes:**

- At least **2 GPU nodes** to enable cross-node RDMA communication
- **ND-series VMs** with Mellanox ConnectX InfiniBand NICs (e.g., `Standard_ND96asr_v4`, `Standard_ND96isr_H100_v5`)
- **Ubuntu OS** on the node pool (required for NVIDIA driver compatibility)
- GPU driver installation **skipped** on the node pool (`--skip-gpu-driver-install`) — see [GPU Node Pool Setup](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/aks-setup#step-2-add-a-gpu-node-pool)

**Register the AKS InfiniBand feature** to ensure nodes land on the same physical InfiniBand network:

```bash
az feature register --namespace Microsoft.ContainerService --name AKSInfinibandSupport
az feature show --namespace Microsoft.ContainerService --name AKSInfinibandSupport --query "properties.state"
# Wait until "Registered"
az provider register --namespace Microsoft.ContainerService
```

## Overview

The RDMA setup involves five components installed in this order:

1. 
**Network Operator** — Deploys the Mellanox OFED driver and Node Feature Discovery 2. **NicClusterPolicy** — Configures the OFED driver on InfiniBand-capable nodes 3. **IB Node Configuration** — Loads InfiniBand kernel modules and sets memlock limits 4. **RDMA Shared Device Plugin** — Exposes InfiniBand NICs to pods as a Kubernetes resource 5. **GPU Operator** — Installed with RDMA-specific settings (NFD disabled, GPUDirect RDMA enabled, host MOFED) ## Step 1: Install the NVIDIA Network Operator The [NVIDIA Network Operator](https://docs.nvidia.com/networking/display/kubernetes25100/index.html) automates deployment of networking components including Mellanox OFED drivers for InfiniBand support. Create the namespace and label it for privileged workloads: ```bash kubectl create ns network-operator kubectl label --overwrite ns network-operator pod-security.kubernetes.io/enforce=privileged ``` Add the NVIDIA Helm repo (if not already added): ```bash helm repo add nvidia https://helm.ngc.nvidia.com/nvidia helm repo update ``` Create a `network-operator-values.yaml`: ```yaml nfd: deployNodeFeatureRules: false ``` Install the Network Operator: ```bash helm install network-operator nvidia/network-operator \ --namespace network-operator \ -f network-operator-values.yaml \ --version v26.1.0 ``` Verify the Network Operator pod is running: ```bash kubectl get pods -n network-operator ``` ## Step 2: Apply the NicClusterPolicy The NicClusterPolicy configures the OFED driver (Mellanox OFED / DOCA driver) as a DaemonSet on all InfiniBand-capable nodes. Apply the base NicClusterPolicy using kustomize: ```bash kubectl apply -k https://github.com/Azure/aks-rdma-infiniband/configs/nicclusterpolicy/base ``` This targets nodes with Mellanox NICs (`feature.node.kubernetes.io/pci-15b3.present`) and installs the DOCA/OFED driver as a DaemonSet. 
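Because the policy selects nodes by label, it can be worth confirming how many nodes carry that label before waiting on the driver pods. A minimal sketch, assuming Node Feature Discovery publishes the label with the value `true`; `count_ib_nodes` is a hypothetical helper that only counts lines in `kubectl` output piped to it:

```bash
# Reads `kubectl get nodes --show-labels --no-headers` output on stdin and
# prints how many nodes carry the NFD label for Mellanox (PCI vendor 15b3) NICs.
count_ib_nodes() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the helper
  # usable in scripts that run with "set -e".
  grep -cF 'feature.node.kubernetes.io/pci-15b3.present=true' || true
}

# Usage against a live cluster:
#   kubectl get nodes --show-labels --no-headers | count_ib_nodes
```

If the number printed is lower than the number of ND-series nodes in the pool, the unlabeled nodes are unlikely to receive the driver DaemonSet, so check that Node Feature Discovery is running on them.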
Wait for the MOFED driver DaemonSet to finish installing on all nodes (this may take several minutes): ```bash kubectl get pods -n network-operator -l app=mofed-ubuntu22.04-ds -w # Wait until all pods show Running ``` ## Step 3: Deploy the IB Node Configuration DaemonSet This DaemonSet loads InfiniBand kernel modules and sets unlimited memlock limits on GPU nodes. This is required for RDMA to function — without it, InfiniBand device files may not exist and memory pinning for RDMA transfers will fail. This step is not covered in the Azure RDMA repo but is required for a working setup. The DaemonSet loads `ib_umad` and `rdma_ucm` kernel modules, sets unlimited memlock limits for containerd and kubelet, and restarts both services to apply the changes. Create `ib-node-config.yaml`: ```yaml apiVersion: apps/v1 kind: DaemonSet metadata: name: ib-node-config namespace: kube-system spec: selector: matchLabels: app: ib-node-config template: metadata: labels: app: ib-node-config spec: hostPID: true nodeSelector: kubernetes.azure.com/agentpool: tolerations: - operator: Exists initContainers: - name: ib-setup image: busybox:1.36 securityContext: privileged: true command: - sh - -c - | echo "=== IB Node Configuration ===" nsenter -t 1 -m -u -i -n -- modprobe ib_umad nsenter -t 1 -m -u -i -n -- modprobe rdma_ucm 2>/dev/null || true nsenter -t 1 -m -u -i -n -- modprobe ib_ucm 2>/dev/null || true nsenter -t 1 -m -u -i -n -- lsmod | grep ib_umad && echo "OK: ib_umad" || echo "FAIL: ib_umad" nsenter -t 1 -m -u -i -n -- ls /dev/infiniband/rdma_cm && echo "OK: rdma_cm device" || echo "WARN: no rdma_cm device" nsenter -t 1 -m -u -i -n -- sh -c 'printf "ib_umad\nrdma_ucm\n" > /etc/modules-load.d/ib-umad.conf' nsenter -t 1 -m -u -i -n -- sh -c 'printf "* - memlock unlimited\nroot - memlock unlimited\n" > /etc/security/limits.d/99-ib-memlock.conf' nsenter -t 1 -m -u -i -n -- sh -c 'mkdir -p /etc/systemd/system/containerd.service.d && printf "[Service]\nLimitMEMLOCK=infinity\n" > 
/etc/systemd/system/containerd.service.d/memlock.conf' nsenter -t 1 -m -u -i -n -- sh -c 'mkdir -p /etc/systemd/system/kubelet.service.d && printf "[Service]\nLimitMEMLOCK=infinity\n" > /etc/systemd/system/kubelet.service.d/memlock.conf' nsenter -t 1 -m -u -i -n -- systemctl daemon-reload nsenter -t 1 -m -u -i -n -- systemctl restart containerd nsenter -t 1 -m -u -i -n -- systemctl restart kubelet sleep 10 nsenter -t 1 -m -u -i -n -- systemctl is-active containerd && echo "OK: containerd active" || echo "FAIL: containerd" nsenter -t 1 -m -u -i -n -- systemctl is-active kubelet && echo "OK: kubelet active" || echo "FAIL: kubelet" echo "=== Setup Complete ===" containers: - name: keepalive image: busybox:1.36 command: ["sh", "-c", "echo IB node config active; sleep infinity"] ``` Replace `` with your GPU node pool name (e.g., `ndh100pool`). ```bash kubectl apply -f ib-node-config.yaml ``` Wait for all pods to complete initialization: ```bash kubectl get pods -n kube-system -l app=ib-node-config -w ``` **What this does:** - **`ib_umad`** — InfiniBand user-space management datagram module, required for RDMA device access - **`rdma_ucm`** — RDMA user-space connection manager - **Memlock limits** — RDMA requires pinning memory pages; without unlimited memlock, large transfers fail - **Service restarts** — containerd and kubelet must be restarted to pick up the new memlock limits ## Step 4: Deploy the RDMA Shared Device Plugin The RDMA Shared Device Plugin exposes InfiniBand NICs as a Kubernetes extended resource so pods can request RDMA access. 
Create the ConfigMap with the device plugin configuration: ```yaml apiVersion: v1 kind: ConfigMap metadata: name: rdma-devices namespace: kube-system data: config.json: | { "periodicUpdateInterval": 300, "configList": [{ "resourceName": "hca_shared_devices_a", "rdmaHcaMax": 1000, "selectors": { "vendors": ["15b3"], "drivers": ["mlx5_core"] } } ] } ``` Create the DaemonSet: ```yaml apiVersion: apps/v1 kind: DaemonSet metadata: name: rdma-shared-dp-ds namespace: kube-system spec: selector: matchLabels: name: rdma-shared-dp-ds template: metadata: labels: name: rdma-shared-dp-ds spec: hostNetwork: true nodeSelector: kubernetes.azure.com/agentpool: tolerations: - operator: Exists containers: - name: k8s-rdma-shared-dp-ds image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:v1.5.3 securityContext: privileged: true volumeMounts: - name: device-plugin mountPath: /var/lib/kubelet/device-plugins - name: plugins-registry mountPath: /var/lib/kubelet/plugins_registry - name: config mountPath: /k8s-rdma-shared-dev-plugin - name: devs mountPath: /dev/ volumes: - name: device-plugin hostPath: path: /var/lib/kubelet/device-plugins - name: plugins-registry hostPath: path: /var/lib/kubelet/plugins_registry - name: config configMap: name: rdma-devices - name: devs hostPath: path: /dev/ ``` Replace `` with your GPU node pool name (e.g., `ndh100pool`). 
```bash kubectl apply -f rdma-configmap.yaml kubectl apply -f rdma-shared-dp-ds.yaml ``` Wait for the device plugin pods to start: ```bash kubectl get pods -n kube-system -l name=rdma-shared-dp-ds -w ``` ## Step 5: Install the GPU Operator (RDMA-Enabled) Install the GPU Operator with RDMA-specific values: ```bash helm install gpu-operator nvidia/gpu-operator \ --namespace gpu-operator --create-namespace \ --set nfd.enabled=false \ --set driver.rdma.enabled=true \ --set driver.rdma.useHostMofed=true ``` Key differences from a standard GPU Operator install: - `nfd.enabled=false` — Network Operator already deploys Node Feature Discovery; running two NFD instances causes conflicts - `driver.rdma.enabled=true` — enables GPUDirect RDMA support; causes the driver daemonset to build and load `nvidia_peermem` - `driver.rdma.useHostMofed=true` — tells the GPU Operator to use the MOFED driver installed by the Network Operator (Step 1) rather than its own; required when the Network Operator manages OFED Wait for the GPU Operator pods to reach `Running` state: ```bash kubectl get pods -n gpu-operator -w ``` ## Verification **1. Check that MOFED driver pods are running on all InfiniBand nodes:** ```bash kubectl get pods -n network-operator -l app=mofed-ubuntu22.04-ds ``` **2. Check that IB node config pods completed initialization:** ```bash kubectl get pods -n kube-system -l app=ib-node-config ``` **3. Check that the RDMA Shared Device Plugin is running:** ```bash kubectl get pods -n kube-system -l name=rdma-shared-dp-ds ``` **4. Verify RDMA resources are available on GPU nodes:** ```bash kubectl get nodes -o json | jq '.items[] | select(.status.allocatable["rdma/hca_shared_devices_a"] != null) | {name: .metadata.name, rdma: .status.allocatable["rdma/hca_shared_devices_a"], gpu: .status.allocatable["nvidia.com/gpu"]}' ``` Each InfiniBand-capable node should report `rdma/hca_shared_devices_a` resources (typically `1k` based on `rdmaHcaMax: 1000`). **5. 
Check GPU Operator pods are healthy:** ```bash kubectl get pods -n gpu-operator ``` ## Pod Resource Requests Dynamo pods that need RDMA access should request the `rdma/hca_shared_devices_a` resource. When using the Dynamo operator with DGDR, this is handled automatically for disaggregated deployments on RDMA-capable clusters. For manual DGD specs, add the resource request to your container: ```yaml resources: limits: nvidia.com/gpu: 8 rdma/hca_shared_devices_a: 1 ``` **`IPC_LOCK` capability is not required** when this setup is followed. `IPC_LOCK` is historically needed for RDMA because `ibv_reg_mr` calls `mlock()` to pin memory pages — but `mlock()` only needs the capability if the memlock rlimit would otherwise block it. The `ib-node-config` DaemonSet (Step 3) sets `LimitMEMLOCK=infinity` on the kubelet and containerd systemd units, so all pods on GPU nodes inherit an unlimited memlock limit and RDMA memory pinning works without any capability in the pod spec. If you see `ENOMEM` errors from `ibv_reg_mr` and `ib-node-config` is running, verify that containerd and kubelet were restarted after the limits were applied (check the init container logs). If `ib-node-config` is not deployed, add `IPC_LOCK` to your pod's `securityContext.capabilities.add`. 
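To check the memlock limit a pod actually inherited, you can read `ulimit -l` from inside the container. This is a minimal sketch under a few assumptions: `check_memlock` is a hypothetical helper, `my-worker-pod` is a placeholder pod name, and the container image is assumed to include `sh`.

```bash
# Interprets the output of `ulimit -l`, which is either the word "unlimited"
# or a number (typically in KiB).
check_memlock() {
  if [ "$1" = "unlimited" ]; then
    echo "OK: memlock unlimited"
  else
    echo "FAIL: memlock capped at $1 KiB"
    return 1
  fi
}

# Usage (my-worker-pod is a placeholder pod name):
#   check_memlock "$(kubectl exec my-worker-pod -- sh -c 'ulimit -l')"
```

A `FAIL` result on a node where `ib-node-config` has run suggests the pod was started before containerd and kubelet were restarted, so recreating the pod is a reasonable first step.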
## Troubleshooting **MOFED pods stuck in `Init` or `CrashLoopBackOff`:** - Verify nodes are Ubuntu OS: `kubectl get nodes -o custom-columns="NAME:.metadata.name,OS:.status.nodeInfo.osImage"` - Check MOFED pod logs: `kubectl logs -n network-operator -c mofed-container` **`rdma/hca_shared_devices_a` not showing on nodes:** - Check the RDMA device plugin pods are running: `kubectl get pods -n kube-system -l name=rdma-shared-dp-ds` - Check device plugin logs: `kubectl logs -n kube-system ` - Verify the `rdma-devices` ConfigMap exists: `kubectl get configmap rdma-devices -n kube-system` **IB kernel modules not loading:** - Check the ib-node-config init container logs: `kubectl logs -n kube-system -c ib-setup` - Verify the MOFED driver is installed first (Step 2 must complete before Step 3) **Memlock errors during RDMA transfers (`ENOMEM` from `ibv_reg_mr`):** - Verify the ib-node-config DaemonSet has run on all GPU nodes and init containers completed - Check that containerd and kubelet were restarted: `kubectl logs -n kube-system -c ib-setup` - Confirm the limits took effect on the kubelet process: ```bash # On a GPU node (via kubectl debug or ssh) cat /proc/$(pgrep -x kubelet)/limits | grep -i memlock # Should show: Max locked memory unlimited unlimited ``` - If limits are not unlimited, the ib-node-config DaemonSet needs to be re-applied and services restarted **GPUDirect RDMA not working — `nvidia_peermem` module missing:** ND-series nodes (including ND H100 v5) do **not** ship `nvidia_peermem` in the host OS. This module is required for InfiniBand adapters to directly read/write GPU memory — without it, RDMA transfers fall back to staging through host memory. 
Verify whether the module is loaded: ```bash # Check on a GPU node via a privileged pod or node shell lsmod | grep nvidia_peermem # If empty, the module is not loaded modinfo nvidia_peermem # If "Module not found", it is also not present in the host's /lib/modules ``` With the GPU Operator managing drivers (`driver.rdma.enabled=true`), `nvidia_peermem` is built and loaded by the `nvidia-driver-daemonset` — it lives in the driver pod's `/lib/modules`, not the host's native kernel modules. Verify the driver daemonset is loading it: ```bash kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}') -- lsmod | grep nvidia_peermem ``` If this returns empty, ensure `driver.rdma.enabled=true` and `driver.rdma.useHostMofed=true` are set in your GPU Operator Helm values (see [Step 5](#step-5-install-the-gpu-operator-rdma-enabled) above), then restart the driver daemonset: ```bash kubectl rollout restart daemonset/nvidia-driver-daemonset -n gpu-operator ``` The [nvidia-peermem-reloader](https://github.com/Azure/aks-rdma-infiniband/tree/main/configs/nvidia-peermem-reloader) DaemonSet from the Azure RDMA repo is designed for clusters using **AKS-managed GPU drivers** (without the GPU Operator). It simply runs `modprobe nvidia-peermem` — which will fail on ND H100 v5 nodes because the host OS doesn't include the module. When using the GPU Operator (recommended), the operator handles `nvidia_peermem` automatically via `driver.rdma.enabled=true`. 
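The `kubectl exec` check above looks only at the first driver pod. On a multi-node pool it may be useful to check every pod. The sketch below is one way to do that, assuming the same `app=nvidia-driver-daemonset` label; `has_module` and `check_peermem_all_pods` are hypothetical helper names.

```bash
# Succeeds if the named module appears in the first column of `lsmod` output
# supplied on stdin.
has_module() {
  awk -v module="$1" '$1 == module { found = 1 } END { exit !found }'
}

# Hypothetical wrapper: runs the check in every driver pod, not only the first.
# It is defined here but not called, because it needs a live cluster.
check_peermem_all_pods() {
  for pod in $(kubectl get pod -n gpu-operator -l app=nvidia-driver-daemonset -o name); do
    if kubectl exec -n gpu-operator "$pod" -- lsmod | has_module nvidia_peermem; then
      echo "OK: $pod"
    else
      echo "MISSING: $pod"
    fi
  done
}
```

Run `check_peermem_all_pods` from a shell with cluster access; any `MISSING` line identifies a node whose driver pod should be inspected or restarted.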
## See Also

- [Azure AKS RDMA InfiniBand — GitHub](https://github.com/Azure/aks-rdma-infiniband)
- [Set up InfiniBand on Azure HPC VMs — Microsoft Learn](https://learn.microsoft.com/en-us/azure/virtual-machines/setup-infiniband)
- [Enable InfiniBand VM extension — Microsoft Learn](https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/enable-infiniband)
- [NVIDIA Network Operator Documentation](https://docs.nvidia.com/networking/display/kubernetes25100/index.html)
- [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication) — transport options, UCX configuration, performance expectations

# Storage for Model Caching on AKS

To implement tiered storage on AKS, you can combine the storage options available in Azure. This guide covers choosing the right storage for each Dynamo cache type and configuring PVCs.

## Available Storage Options

| Storage Option | Performance | Best For |
|----------------|-------------|----------|
| Local CSI (Ephemeral Disk) | Very high | Fast model caching, warm restarts |
| [Azure Managed Lustre](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) | Extremely high | Large multi-node models, shared cache |
| [Azure Disk (Managed Disk)](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-disk#create-azure-disk-pvs-using-built-in-storage-classes) | High | Persistent single-writer model cache |
| [Azure Files](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-files#use-a-persistent-volume-for-storage) | Medium | Shared small/medium models |
| [Azure Blob (via Fuse or init)](https://learn.microsoft.com/en-us/azure/aks/azure-csi-driver-volume-provisioning?tabs=dynamic-volume-blob%2Cnfs%2Ckubernetes-secret%2Cnfs-3%2Cgeneral%2Cgeneral2%2Cdynamic-volume-disk%2Cgeneral-disk%2Cdynamic-volume-files%2Cgeneral-files%2Cgeneral-files2%2Cdynamic-volume-files-mid%2Coptimize%2Csmb-share&pivots=csi-blob#create-a-pvc-using-built-in-storage-class) | Low-Medium | Cold model storage, bootstrap downloads |

Azure Managed Lustre and Local CSI (ephemeral disk) are not installed by default in AKS and require additional setup before use. Azure Disk, Azure Files, and Azure Blob CSI drivers are available out of the box. See the [Azure Lustre CSI Driver](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) guide for Lustre setup, or the [AKS CSI storage options documentation](https://learn.microsoft.com/azure/aks/csi-storage-drivers) for a full overview of built-in drivers.

## Recommendations by Cache Type

- **Model Cache** — raw model artifacts, configuration files, tokenizers, etc.
  - Persistence: Required to avoid repeated downloads and reduce cold-start latency.
  - Recommended storage: Azure Managed Lustre (shared, high throughput) or Azure Disk (single-replica, persistent).
- **Compilation Cache** — backend-specific compiled artifacts (e.g., TensorRT engines).
  - Persistence: Optional.
  - Recommended storage: Local CSI (fast, node-local) or Azure Disk (persistent when GPU configuration is fixed).
- **Performance Cache** — runtime tuning and profiling data.
  - Persistence: Not required.
  - Recommended storage: Local CSI (or other ephemeral storage).

## Check Available Storage Classes

List the storage classes available in your AKS cluster:

```bash
kubectl get storageclass
NAME                           PROVISIONER                 RECLAIMPOLICY
azureblob-csi                  blob.csi.azure.com          Delete
azurefile                      file.csi.azure.com          Delete
azurefile-csi                  file.csi.azure.com          Delete
azurefile-csi-premium          file.csi.azure.com          Delete
azurefile-premium              file.csi.azure.com          Delete
default                        disk.csi.azure.com          Delete
managed                        disk.csi.azure.com          Delete
managed-csi                    disk.csi.azure.com          Delete
managed-csi-premium            disk.csi.azure.com          Delete
managed-premium                disk.csi.azure.com          Delete
sc.azurelustre.csi.azure.com   azurelustre.csi.azure.com   Retain
```

## Example PVC Configuration

In the `cache.yaml` file of each of the [recipes](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/recipes), set `storageClassName` to a storage class available in your AKS cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: "sc.azurelustre.csi.azure.com"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "azurefile-csi"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "local-ephemeral"
```

## See Also

- [Azure Lustre CSI Driver](/dynamo/kubernetes-deployment/cloud-provider-guides/azure/azure-lustre-csi-driver) — Full setup guide for Azure Managed Lustre
- [Model Caching](/dynamo/kubernetes-deployment/model-loading/model-caching) — Full walkthrough for setting up model caching with Dynamo, including download Jobs and mount configuration
- [AKS CSI Storage Drivers](https://learn.microsoft.com/azure/aks/csi-storage-drivers) — Microsoft documentation for all built-in CSI drivers

# Azure Lustre CSI Driver for AKS

This guide covers installing and configuring the [Azure Lustre CSI driver](https://github.com/kubernetes-sigs/azurelustre-csi-driver) on an AKS cluster so that Dynamo workloads can use Azure Managed Lustre (AMLFS) filesystems for high-performance model storage.

## Prerequisites

**AKS cluster requirements**

- Kubernetes 1.21 or later
- Node pools must use the **Ubuntu** OS SKU — Windows and Azure Linux (CBL Mariner) nodes are not supported
- AKS is the only supported Kubernetes distribution (self-managed clusters are not supported)

**Tools**

- Azure CLI (`az`)
- `kubectl`

**Network connectivity**

AKS nodes must be able to reach your AMLFS filesystem over the network. Two topologies are supported:

- **VNet peering**: Deploy AKS in its own VNet and peer it with the AMLFS VNet. The AKS infrastructure VNet lives in the auto-created resource group `MC___`.
- **Shared VNet**: Use AKS's "Bring your own VNet" feature and deploy AKS in a dedicated subnet inside the AMLFS VNet.

Do not place AKS nodes and the AMLFS filesystem in the same subnet, even when sharing a VNet.

## Step 1: Connect to your AKS cluster

```bash
az login
az aks get-credentials \
  --subscription \
  --resource-group \
  --name
kubectl config current-context
```

## Step 2: Install the CSI driver

There is no Helm chart. 
Install via the provided shell script: ```bash # Install latest version curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash -s main # Or install a specific version curl -skSL https://raw.githubusercontent.com/kubernetes-sigs/azurelustre-csi-driver/main/deploy/install-driver.sh | bash -s v0.3.1 ``` The script deploys the CSI controller (2-replica Deployment) and node plugin (DaemonSet) into `kube-system`, and waits for them to become ready. **Verify the installation:** ```bash # Controller pods — expect 2/2 or 3/3 Running kubectl get -n kube-system pod -l app=csi-azurelustre-controller # Node plugin pods — expect 3/3 Running on each node kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide ``` ## Step 3: Configure storage There are two provisioning modes depending on whether your AMLFS filesystem already exists. ### Option A: Static provisioning (existing AMLFS filesystem) Use this when you want to bring your own Azure Managed Lustre filesystem. If you don't have one yet, create it first, then configure the CSI driver to use it. #### Create an Azure Managed Lustre filesystem **1. Register the resource provider (first time only):** ```bash az provider register --namespace Microsoft.StorageCache # Wait until state is "Registered" az provider show --namespace Microsoft.StorageCache --query "registrationState" ``` **2. Validate your subnet before creating the filesystem:** The subnet must be dedicated to AMLFS (do not share with AKS nodes or other resources) and sized to hold the filesystem. 
Check requirements first: ```bash # Get the required subnet size for your planned SKU and capacity az amlfs get-subnets-size \ --sku AMLFS-Durable-Premium-250 \ --storage-capacity 16 # Validate that your subnet meets the requirements az amlfs check-amlfs-subnet \ --filesystem-subnet /subscriptions//resourceGroups//providers/Microsoft.Network/virtualNetworks//subnets/ \ --sku AMLFS-Durable-Premium-250 \ --location \ --storage-capacity 16 ``` **3. Create a dedicated subnet for AMLFS:** AMLFS requires its own subnet — it cannot share the subnet used by AKS nodes. Create a new subnet in the AKS VNet (or in a peered VNet): ```bash # Get the node resource group and check for a custom VNet subnet az aks show \ --name \ --resource-group \ --query "{vnet: agentPoolProfiles[0].vnetSubnetId, nodeRG: nodeResourceGroup}" ``` If `vnet` is non-null, your cluster uses Azure CNI with a custom VNet — use that VNet name and resource group below. If `vnet` is `null`, AKS manages its own VNet in the node resource group. Find it: ```bash az network vnet list \ --resource-group \ --query "[].{name:name, addressPrefixes:addressSpace.addressPrefixes}" ``` List existing subnets to find a free CIDR range: ```bash az network vnet subnet list \ --resource-group \ --vnet-name \ --query "[].{name:name, prefix:addressPrefix}" ``` Pick a non-overlapping CIDR within the VNet's address space. The `filesystemSubnetSize` value from `get-subnets-size` is the number of IPs required. Azure also reserves 5 IPs per subnet, so add those when sizing the prefix (e.g., `filesystemSubnetSize: 8` → 13 IPs needed → use `/28` for 16 addresses or more). Then create the dedicated AMLFS subnet: ```bash az network vnet subnet create \ --name amlfs-subnet \ --resource-group \ --vnet-name \ --address-prefix ``` Use the full subnet resource ID in the next step: `/subscriptions//resourceGroups//providers/Microsoft.Network/virtualNetworks//subnets/amlfs-subnet` **4. 
Create the filesystem:** ```bash az amlfs create \ --name \ --resource-group \ --location \ --sku AMLFS-Durable-Premium-250 \ --storage-capacity 16 \ --zones "[1]" \ --filesystem-subnet /subscriptions//resourceGroups//providers/Microsoft.Network/virtualNetworks//subnets/ \ --maintenance-window "{dayOfWeek:Sunday,timeOfDayUtc:'22:00'}" ``` This takes **10–20 minutes**. Use `--no-wait` to return immediately and poll with `az amlfs show`. **Available SKUs:** | SKU | Min size | Throughput | |-----|----------|------------| | `AMLFS-Durable-Premium-40` | 48 TiB | 40 MB/s per TiB | | `AMLFS-Durable-Premium-125` | 16 TiB | 125 MB/s per TiB | | `AMLFS-Durable-Premium-250` | 8 TiB | 250 MB/s per TiB | | `AMLFS-Durable-Premium-500` | 4 TiB | 500 MB/s per TiB | **5. Get the MGS IP address:** ```bash az amlfs show \ --name \ --resource-group \ --query "{mgsAddress: clientInfo.mgsAddress, mountCommand: clientInfo.mountCommand}" ``` Use the `mgsAddress` value in the StorageClass below. Alternatively, find it in the Azure portal under your filesystem's **Client connection** pane. **StorageClass:** ```yaml apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: azurelustre-static provisioner: azurelustre.csi.azure.com parameters: mgs-ip-address: # From portal > Client connection reclaimPolicy: Retain volumeBindingMode: Immediate mountOptions: - noatime - flock ``` **PersistentVolumeClaim:** ```yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: pvc-lustre spec: accessModes: - ReadWriteMany resources: requests: storage: # Match your filesystem size, e.g. 16Ti storageClassName: azurelustre-static ``` ```bash kubectl apply -f storageclass.yaml kubectl apply -f pvc.yaml ``` ### Option B: Dynamic provisioning (auto-create AMLFS filesystem) Requires driver v0.3.0 or later. The driver creates an AMLFS cluster automatically when the PVC is created — this takes **10+ minutes**. 
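Since dynamic provisioning depends on the driver version, a quick version comparison can catch an older install before a PVC sits in `Pending`. A minimal sketch: `version_ge` is a hypothetical helper, it relies on `sort -V` from GNU coreutils, and it assumes you read the driver version from the image tag of the running controller pod.

```bash
# Succeeds when version $1 is greater than or equal to version $2.
# A leading "v" is optional on either argument.
version_ge() {
  have="${1#v}"
  want="${2#v}"
  # The lower of the two versions sorts first; if that is the required
  # version, then the installed version is at least as new.
  lowest="$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}

# Usage: list the controller images, then compare the tag you see (for example v0.3.1):
#   kubectl get pods -n kube-system -l app=csi-azurelustre-controller \
#     -o jsonpath='{.items[0].spec.containers[*].image}'
#   version_ge v0.3.1 v0.3.0 && echo "dynamic provisioning supported"
```

The comparison is numeric per component, so `v0.10.0` is treated as newer than `v0.3.0`.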
**Additional IAM permissions required** on the kubelet managed identity (grant before creating the PVC):

```
Microsoft.StorageCache/amlFilesystems/read
Microsoft.StorageCache/amlFilesystems/write
Microsoft.StorageCache/amlFilesystems/delete
Microsoft.StorageCache/checkAmlFSSubnets/action
Microsoft.StorageCache/getRequiredAmlFSSubnetsSize/*
Microsoft.Network/virtualNetworks/subnets/read
Microsoft.Network/virtualNetworks/subnets/join/action
Microsoft.ManagedIdentity/userAssignedIdentities/assign/action
```

Alternatively, assign the broader roles: **Reader** at subscription scope, **Contributor** at the target resource group, and **Network Contributor** at the VNet scope.

**Available SKUs:**

| SKU | Throughput |
|-----|------------|
| `AMLFS-Durable-Premium-40` | 40 MB/s per TiB (min 48 TiB) |
| `AMLFS-Durable-Premium-125` | 125 MB/s per TiB |
| `AMLFS-Durable-Premium-250` | 250 MB/s per TiB |
| `AMLFS-Durable-Premium-500` | 500 MB/s per TiB |

**StorageClass:**

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurelustre-dynamic
provisioner: azurelustre.csi.azure.com
parameters:
  sku-name: "AMLFS-Durable-Premium-125"
  zone: "1"  # Availability zone: "1", "2", or "3"
  maintenance-day-of-week: "Sunday"
  maintenance-time-of-day-utc: "22:00"
  # Optional overrides (defaults to AKS cluster values):
  # location: "eastus"
  # resource-group-name: "my-rg"
  # vnet-name: "my-vnet"
  # subnet-name: "my-subnet"
reclaimPolicy: Delete  # WARNING: deletes the AMLFS cluster when PVC is deleted — use Retain in production
volumeBindingMode: Immediate
mountOptions:
  - noatime
  - flock
```

**PersistentVolumeClaim:**

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-lustre-dynamic
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 48Ti  # Must meet the SKU minimum; see the SKU table under Option A
  storageClassName: azurelustre-dynamic
```

```bash
kubectl apply -f storageclass-dynamic.yaml
kubectl apply -f pvc-dynamic.yaml
# Monitor provisioning (takes 10+ minutes)
kubectl describe pvc pvc-lustre-dynamic
```

## Troubleshooting

**Pod stuck in `ContainerCreating`**

```bash
kubectl describe pod
# Look for volume mount errors in Events
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=50
```

**PVC stuck in `Pending` (dynamic provisioning)**

```bash
kubectl describe pvc
# Check Events for authorization errors — kubelet identity may lack IAM permissions
```

**Node cannot mount** — verify Ubuntu OS SKU:

```bash
kubectl get nodes -o custom-columns="NAME:.metadata.name,OS:.status.nodeInfo.osImage"
```

## See also

- [Azure Managed Lustre CSI Driver — GitHub](https://github.com/kubernetes-sigs/azurelustre-csi-driver)
- [Use Azure Managed Lustre with AKS — Microsoft Learn](https://learn.microsoft.com/en-us/azure/azure-managed-lustre/use-csi-driver-kubernetes)

# AKS Spot VMs

# Running Dynamo on AKS Spot VMs

[Azure Spot VMs](https://azure.microsoft.com/en-us/products/virtual-machines/spot) offer significant cost savings for GPU workloads but can be evicted by Azure at any time. This guide covers the configuration required to schedule Dynamo on Spot VM node pools.

## How AKS Taints Spot Nodes

When a node pool uses Spot VMs, AKS automatically applies the following taint to all nodes in that pool:

```yaml
kubernetes.azure.com/scalesetpriority=spot:NoSchedule
```

This prevents standard workloads from landing on Spot nodes by default. Any pod that should run on a Spot node must explicitly tolerate this taint.
## Required Toleration

Add the following toleration to any workload that should run on Spot nodes:

```yaml
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule
```

## Deploying Dynamo on Spot Nodes

The Dynamo platform Helm chart includes a pre-built values file for Spot VM deployments, [`examples/deployments/AKS/values-aks-spot.yaml`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/deployments/AKS/values-aks-spot.yaml), which adds the required toleration to all Dynamo components:

- Dynamo operator controller manager
- Webhook CA inject and cert generation jobs
- etcd
- NATS
- MPI SSH key generation job
- Other core Dynamo platform pods

Install Dynamo with the Spot values file:

```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace dynamo-system \
  --create-namespace \
  -f ./values-aks-spot.yaml
```

To upgrade an existing installation:

```bash
helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace dynamo-system \
  -f ./values-aks-spot.yaml
```

## Creating a Spot GPU Node Pool

Add a Spot GPU node pool to an existing AKS cluster:

```bash
# Replace <resource-group> and <cluster-name> with your own values
az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name spotgpunp \
  --node-count 2 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --skip-gpu-driver-install
```

`--spot-max-price -1` means pay up to the on-demand price (recommended). `--eviction-policy Delete` removes evicted nodes from the pool; use `Deallocate` if you want to preserve node state across evictions.
## See Also

- [Azure Spot VMs overview](https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms)
- [Use Spot VMs in AKS](https://learn.microsoft.com/en-us/azure/aks/spot-node-pool)

# Google Kubernetes Engine (GKE)

# Dynamo Deployment on GKE

## Pre-requisites

### Install gcloud CLI

https://cloud.google.com/sdk/docs/install

### Create GKE cluster

```bash
export PROJECT_ID=<>
export REGION=<>
export ZONE=<>
export CLUSTER_NAME=<>
export CLUSTER_MACHINE_TYPE=n2-standard-4
export NODE_POOL_MACHINE_TYPE=g2-standard-24
export GPU_TYPE=nvidia-l4
export GPU_COUNT=2
export CPU_NODE=2
export GPU_NODE=2
export DISK_SIZE=200

gcloud container clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --location=${ZONE} \
  --subnetwork=default \
  --disk-size=${DISK_SIZE} \
  --machine-type=${CLUSTER_MACHINE_TYPE} \
  --num-nodes=${CPU_NODE}
```

#### Create GPU pool

```bash
gcloud container node-pools create gpu-pool \
  --accelerator type=${GPU_TYPE},count=${GPU_COUNT},gpu-driver-version=latest \
  --project=${PROJECT_ID} \
  --location=${ZONE} \
  --cluster=${CLUSTER_NAME} \
  --machine-type=${NODE_POOL_MACHINE_TYPE} \
  --disk-size=${DISK_SIZE} \
  --num-nodes=${GPU_NODE} \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=3
```

### Clone Dynamo GitHub repository

**Note:** Make sure the GitHub branch or commit you check out matches the versions of the Dynamo platform and the vLLM container.
```bash
git clone https://github.com/ai-dynamo/dynamo.git

# Check out the desired branch inside the cloned repository
git -C dynamo checkout release/0.6.0
```

### Set environment variables for GKE

```bash
export NAMESPACE=dynamo-system
kubectl create namespace $NAMESPACE
kubectl config set-context --current --namespace=$NAMESPACE

# Set HF_TOKEN to your Hugging Face access token
export HF_TOKEN=
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

## Install Dynamo Kubernetes Platform

[See installation steps](/dynamo/kubernetes-deployment/start-here/installation-guide#overview)

After installation, verify that the platform pods are running:

**Expected output**

```bash
kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-platform-dynamo-operator-controller-manager-69b9794fpgv9   2/2     Running   0          4m27s
dynamo-platform-etcd-0                                            1/1     Running   0          4m27s
dynamo-platform-nats-0                                            2/2     Running   0          4m27s
```

## Deploy Inference Graph

Next, deploy an LLM to the Dynamo platform. This example uses the `Qwen/Qwen3-0.6B` model with vLLM in a disaggregated deployment.
In the deployment YAML file, some adjustments are required and others are optional:

- **(Required)** Add args that set `LD_LIBRARY_PATH` and `PATH` in the decoder container so that the correct GPU driver is found on GKE
- Change the vLLM image to the desired one on NGC
- Add the namespace to the metadata
- Adjust GPU/CPU requests and limits
- Change the model to deploy

For more configurations, refer to https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/deployments/GKE/vllm

### Highlighted configurations in yaml file

`LD_LIBRARY_PATH` must be set correctly on GKE, as described in [Run GPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus).

The following snippet must be present in the `args` field of the deployment `yaml` file:

```bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/nvidia/bin:/usr/local/nvidia/lib64
/sbin/ldconfig
```

For example, see the following excerpt from [`examples/deployments/GKE/vllm/disagg.yaml`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/deployments/GKE/vllm/disagg.yaml):

```yaml
metadata:
  name: vllm-disagg
  namespace: dynamo-system
spec:
  services:
    Frontend:
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    VllmDecodeWorker:
      resources:
        limits:
          gpu: "3"
      image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
      args:
        - |
          export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
          export PATH=$PATH:/usr/local/nvidia/bin:/usr/local/nvidia/lib64
          /sbin/ldconfig
          python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
```

## Deploy the model

```bash
cd dynamo/examples/deployments/GKE/vllm
kubectl apply -f disagg_gke.yaml -n ${NAMESPACE}
```

**Expected output after successful deployment**

```bash
kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-platform-dynamo-operator-controller-manager-c665684ssqkx   2/2     Running   0          65m
dynamo-platform-etcd-0                                            1/1     Running   0          65m
dynamo-platform-nats-0                                            2/2     Running   0          65m
vllm-disagg-frontend-5954ddc4dd-4w2cb                             1/1     Running   0          11m
vllm-disagg-vllmdecodeworker-77844cfcff-ddn4v                     1/1     Running   0          11m
vllm-disagg-vllmprefillworker-55d5b74b4f-zrskh                    1/1     Running   0          11m
```

## Test the Deployment

```bash
export DEPLOYMENT_NAME=vllm-disagg

# Find the frontend pod
export FRONTEND_POD=$(kubectl get pods -n ${NAMESPACE} | grep "${DEPLOYMENT_NAME}-frontend" | sort -k1 | tail -n1 | awk '{print $1}')

# Forward the frontend port to localhost. port-forward blocks, so run it in the
# background here (or run it in a separate terminal).
kubectl port-forward deployment/vllm-disagg-frontend 8000:8000 -n ${NAMESPACE} &

# disagg
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
} ], "stream":false, "max_tokens": 30 }' ``` ### Response ```json {"id":"chatcmpl-bd0670d9-0342-4eea-97c1-99b69f1f931f","choices":[{"index":0,"message":{"content":"Okay, here's a detailed character background for your intrepid explorer, tailored to fit the premise of Aeloria, with a focus on a","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1756336263,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":190,"completion_tokens":29,"total_tokens":219,"prompt_tokens_details":null,"completion_tokens_details":null}} ``` # Feature Guides Use these guides after you have Dynamo running and want to improve serving behavior, operate a deployment, or adapt Dynamo to a new workload. ## Recommended path Most deployments start with the core performance loop: | Step | Guide | Use when | |---|---|---| | 1 | [KV Cache Aware Routing](/dynamo/user-guides/kv-cache-aware-routing) | Route requests to workers that already hold useful KV cache. | | 2 | [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving) | Scale prefill and decode workers independently. | | 3 | [KV Cache Offloading](/dynamo/user-guides/kv-cache-offloading) | Extend usable cache capacity beyond GPU memory. | | 4 | [Benchmarking](/dynamo/user-guides/benchmarking) | Compare configurations before you move to production. 
| ## Where to go next | Goal | Start with | |---|---| | Make serving more resilient | [Fault Tolerance](/dynamo/user-guides/fault-tolerance) | | Monitor local deployments | [Observability (Local)](/dynamo/user-guides/observability-local) | | Reproduce traffic without a full engine | [Mocker Engine Simulation](../mocker/mocker.md) | | Add structured model outputs | [Tool Calling](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) and [Reasoning](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) | | Build agent workloads | [Agents](/dynamo/user-guides/agents) | | Serve specialized workloads | [LoRA Adapters](/dynamo/user-guides/lo-ra-adapters), [Multimodal](/dynamo/user-guides/multimodal), and [Diffusion](/dynamo/user-guides/diffusion) | For cluster deployments, pair these guides with the [Kubernetes Deployment](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) docs. The same features can be explored locally, then expressed through Dynamo's Kubernetes-native CRDs and operator when you move to a shared GPU cluster. # Router Guide ## Overview The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. This guide helps you get started with using the Dynamo router and points to the pages that cover routing concepts, configuration, disaggregated serving, and operations in more detail. ## Quick Start The router can be deployed using [Python / CLI](#python--cli-deployment), [Kubernetes](#kubernetes-deployment), or as a [standalone component](#standalone-router). 
### Python / CLI Deployment To launch the Dynamo frontend with the KV Router: ```bash python -m dynamo.frontend --router-mode kv --http-port 8000 ``` This command: - Launches the Dynamo frontend service with KV routing enabled - Exposes the service on port 8000 (configurable) - Automatically handles all backend workers registered to the Dynamo endpoint Backend workers register themselves using the `register_model` API. For accurate prefix-cache state, workers must also publish KV cache events with the backend-specific event flags; otherwise the router can run in approximate mode with `--no-router-kv-events`. #### CLI Arguments | Argument | Default | Description | |----------|---------|-------------| | `--router-mode kv` | `round-robin` | Enable KV cache-aware routing | | `--router-temperature ` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) | | `--kv-cache-block-size ` | Backend-specific | KV cache block size (should match backend config) | | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking | | `--load-aware` / `--no-load-aware` | `--no-load-aware` | Route by active load without cache-reuse signals; implies `--router-mode kv` on the frontend | | `--router-kv-overlap-score-credit ` | `1.0` | Credit multiplier for device-local prefix overlap, from 0.0 to 1.0 | | `--router-prefill-load-scale ` | `1.0` | Scale adjusted prompt-side prefill load before adding decode blocks | | `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens` | `--router-track-prefill-tokens` | Include prompt-side load in active worker load accounting | | `--router-prefill-load-model ` | `none` | Prompt-side load model; see [Routing Concepts](/dynamo/components/router/routing-concepts#active-load-modeling) and [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning#aic-prefill-load-model) | | `--router-queue-threshold ` | `16.0` | Queue threshold fraction; enables 
priority scheduling via `priority` | | `--router-queue-policy ` | `fcfs` | Scheduling policy for the queue: `fcfs` (tail TTFT), `wspt` (avg TTFT), or `lcfs` (comparison-only reverse ordering) | | `--serve-indexer` | `false` | Serve the Dynamo-native remote indexer from this frontend/router on the worker component | | `--use-remote-indexer` | `false` | Query the worker component's served remote indexer instead of maintaining a local overlap indexer | For all available options: `python -m dynamo.frontend --help` For detailed configuration options and tuning parameters, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning). For candidate eligibility rules, see [Router Filtering](router-filtering.md). For how the router models prefill and decode load in the cost function, see [Routing Concepts](/dynamo/components/router/routing-concepts#active-load-modeling). ### Kubernetes Deployment To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service: ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: my-deployment spec: services: Frontend: componentType: frontend replicas: 1 envs: - name: DYN_ROUTER_MODE value: kv # Enable KV Smart Router ``` **Key Points:** - Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only - Configure worker-side KV event publishing when you want event-driven prefix-cache state - Use `--no-router-kv-events` for approximate cache-state prediction when workers are not publishing events #### Environment Variables All CLI arguments can be configured via environment variables using the `DYN_` prefix: | CLI Argument | Environment Variable | Default | |--------------|---------------------|---------| | `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` | | `--load-aware` | `DYN_ROUTER_LOAD_AWARE=true` | `false` | | `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` | | `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | 
Backend-specific |
| `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS=false` | `true` |
| `--router-kv-overlap-score-credit` | `DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT` | `1.0` |
| `--router-prefill-load-scale` | `DYN_ROUTER_PREFILL_LOAD_SCALE` | `1.0` |
| `--router-queue-policy` | `DYN_ROUTER_QUEUE_POLICY` | `fcfs` |
| (none listed) | `DYN_ENCODER_CUDA_TO_CPU_RATIO` | `8` |

`DYN_ENCODER_CUDA_TO_CPU_RATIO` sets the throughput ratio of a non-CPU worker relative to one CPU worker for `device-aware-weighted` routing.

For complete K8s examples and advanced configuration, see [K8s Examples](/dynamo/components/router/router-examples#k8s-examples) and [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning). For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](/dynamo/additional-resources/kv-router-a-b-testing).

### Standalone Router

You can also run the KV router as a standalone service (without the Dynamo frontend) for disaggregated serving (e.g., routing to prefill workers), multi-tier architectures, or any scenario requiring intelligent KV cache-aware routing decisions. See the [Standalone Router component](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/components/src/dynamo/router/) for more details.

#### Frontend-Embedded vs. Standalone Router

| Deployment | Process | Metrics Port | Use Case |
|------------|---------|--------------|----------|
| **Frontend-embedded** | `python -m dynamo.frontend --router-mode kv` | Frontend HTTP port (default 8000) | Standard deployment; router runs inside the frontend process |
| **Standalone** | `python -m dynamo.router` | `DYN_SYSTEM_PORT` (if set) | Multi-tier architectures, advanced disaggregated prefill routing, custom pipelines |

The standalone router does not include the HTTP frontend (no `/v1/chat/completions` endpoint). It exposes only the `RouterRequestMetrics` via the system status server. See the [Standalone Router README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/components/src/dynamo/router/README.md).
## Deployment Modes The Dynamo router can be deployed in several configurations. The table below shows every combination and when to use it: | Mode | Command | Routing Logic | KV Events | Topology | Use Case | |------|---------|---------------|-----------|----------|----------| | **Frontend + Round-Robin** | `python -m dynamo.frontend --router-mode round-robin` | Cycles through workers | None | Aggregated | Simplest baseline; no KV awareness | | **Frontend + Random** | `python -m dynamo.frontend --router-mode random` | Random worker selection | None | Aggregated | Stateless load balancing | | **Frontend + KV (Aggregated)** | `python -m dynamo.frontend --router-mode kv` | KV cache overlap + load | NATS Core / JetStream / ZMQ / Approx | Aggregated | Production single-pool serving with cache reuse | | **Frontend + KV (Disaggregated)** | `python -m dynamo.frontend --router-mode kv` with prefill + decode workers | KV cache overlap + load | NATS Core / JetStream / ZMQ / Approx | Disaggregated (prefill + decode pools) | Separate prefill/decode for large-scale serving | | **Frontend + Least-Loaded** | `python -m dynamo.frontend --router-mode least-loaded` | Fewest active connections | None | Aggregated or disaggregated fallback | Simple load-aware balancing without KV awareness | | **Frontend + Device-Aware Weighted** | `python -m dynamo.frontend --router-mode device-aware-weighted` | Device-aware budget + least-loaded within selected device group | None | Aggregated or disaggregated fallback | Heterogeneous fleet balancing (CPU/non-CPU); degenerates to least-loaded when only one device class is present | | **Frontend + Direct** | `python -m dynamo.frontend --router-mode direct` | Worker ID from request hints | None | Aggregated | External orchestrator (e.g., EPP/GAIE) selects workers | | **Standalone Router** | `python -m dynamo.router` | KV cache overlap + load | NATS Core / JetStream / ZMQ | Any | Routing without the HTTP frontend (multi-tier, custom pipelines) | ### 
Routing Modes (`--router-mode`) | Mode | Value | How Workers Are Selected | |------|-------|-------------------------| | **Round-Robin** | `round-robin` (default) | Cycles through available workers in order | | **Random** | `random` | Selects a random worker for each request | | **KV** | `kv` | Evaluates KV cache overlap and decode load per worker; picks lowest cost | | **Least-Loaded** | `least-loaded` | Routes to the worker with fewest active connections; in disaggregated prefill paths it skips bootstrap optimization and falls back to synchronous prefill | | **Device-Aware Weighted** | `device-aware-weighted` | Partitions workers into CPU and non-CPU groups, applies capability-normalized ratio budgeting using `DYN_ENCODER_CUDA_TO_CPU_RATIO` to decide which group receives the request, then selects the least-loaded worker within that group | | **Direct** | `direct` | Reads the target `worker_id` from the request's routing hints; no selection logic | ### Device-Aware Weighted Routing `device-aware-weighted` is designed for heterogeneous fleets where workers of different compute capability, for example CPU embedding encoders alongside GPU embedding encoders, share the same endpoint. Workers are split into CPU and non-CPU groups. The router compares a capability-normalized load across the two groups: ```text normalized_load = total_inflight(group) / (instance_count(group) x throughput_weight) ``` The throughput weight is `1` for CPU workers and `DYN_ENCODER_CUDA_TO_CPU_RATIO` for non-CPU workers. The next request is routed to the group with the lower normalized load, then to the least-loaded worker inside that group. Use `DYN_ENCODER_CUDA_TO_CPU_RATIO` to approximate the throughput ratio of a non-CPU worker relative to one CPU worker. The default is `8`. When only one device class is present, the policy degenerates to standard least-loaded routing. 
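The group selection described above can be sketched as follows. This is a minimal illustration of the normalized-load formula, not Dynamo's implementation: the `Worker` type, the `pick_worker` function, the `"cpu"` device label, and the tie-breaking rule (ties go to the non-CPU group) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: int
    device: str    # "cpu" or anything else (treated as non-CPU); hypothetical label
    inflight: int  # requests currently in flight on this worker


def normalized_load(group: list[Worker], throughput_weight: float) -> float:
    """total_inflight(group) / (instance_count(group) x throughput_weight)."""
    total_inflight = sum(w.inflight for w in group)
    return total_inflight / (len(group) * throughput_weight)


def pick_worker(workers: list[Worker], cuda_to_cpu_ratio: float = 8.0) -> Worker:
    """Pick the group with the lower normalized load, then its least-loaded worker."""
    cpu = [w for w in workers if w.device == "cpu"]
    non_cpu = [w for w in workers if w.device != "cpu"]
    if not cpu or not non_cpu:
        # Only one device class present: plain least-loaded routing.
        return min(workers, key=lambda w: w.inflight)
    cpu_load = normalized_load(cpu, 1.0)
    non_cpu_load = normalized_load(non_cpu, cuda_to_cpu_ratio)
    # Assumption: on a tie, prefer the non-CPU group.
    group = cpu if cpu_load < non_cpu_load else non_cpu
    return min(group, key=lambda w: w.inflight)


fleet = [Worker(0, "cpu", 1), Worker(1, "cpu", 1), Worker(2, "cuda", 6)]
# CPU group: 2 / (2 x 1) = 1.0; non-CPU group: 6 / (1 x 8) = 0.75
print(pick_worker(fleet).worker_id)  # 2
```

With the default ratio of `8`, the single non-CPU worker keeps receiving requests until it carries roughly eight times the per-worker load of the CPU group.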
### KV Event Transport Modes (within `--router-mode kv`) When using KV routing, the router needs to know what each worker has cached. There are four ways to get this information: | Event Mode | How to Enable | Description | |------------|---------------|-------------| | **NATS Core (local indexer)** | Router default (no router flag) | Workers maintain a local indexer; configure backend-side KV event publishing so the router can recover state and receive events via NATS Core | | **JetStream (durable)** | `--router-durable-kv-events` | Events persisted in NATS JetStream; supports snapshots and durable consumers. *Deprecated.* | | **ZMQ** | `--event-plane zmq` | Workers publish via ZMQ PUB sockets; the standalone `dynamo.indexer` service aggregates events | | **Approximate (no events)** | `--no-router-kv-events` | No events consumed; router predicts cache state from its own routing decisions with TTL-based expiration | ### Aggregated vs. Disaggregated Topology | Topology | Workers | How It Works | |----------|---------|--------------| | **Aggregated** | Single pool (prefill + decode in one process) | All workers handle the full request lifecycle | | **Disaggregated** | Separate prefill and decode pools | Frontend routes to a prefill worker first, then to a decode worker; requires workers registered with `ModelType.Prefill` | Disaggregated mode is activated automatically when prefill workers register alongside decode workers. See [Disaggregated Serving](/dynamo/components/router/disaggregated-serving) for details. 
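To make the `kv` routing mode from the tables above more concrete, here is a rough sketch of a lowest-cost selection that combines prefill work on uncached blocks with active decode blocks. It is an assumption-laden illustration, not Dynamo's actual cost function: `WorkerState`, `routing_cost`, and `pick_lowest_cost` are hypothetical names, and the way `overlap_credit` and `prefill_load_scale` enter the formula is inferred only from the flag descriptions for `--router-kv-overlap-score-credit` and `--router-prefill-load-scale`.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    worker_id: int
    overlap_blocks: int  # request blocks already cached on this worker
    active_blocks: int   # blocks held by requests currently decoding


def routing_cost(
    request_blocks: int,
    state: WorkerState,
    overlap_credit: float = 1.0,
    prefill_load_scale: float = 1.0,
) -> float:
    """Assumed cost: scaled new prefill blocks plus active decode blocks."""
    new_prefill_blocks = request_blocks - overlap_credit * state.overlap_blocks
    return prefill_load_scale * new_prefill_blocks + state.active_blocks


def pick_lowest_cost(request_blocks: int, states: list[WorkerState]) -> WorkerState:
    """Deterministic choice, i.e. the behavior at router temperature 0.0."""
    return min(states, key=lambda s: routing_cost(request_blocks, s))


states = [
    WorkerState(0, overlap_blocks=8, active_blocks=20),  # cost 2 + 20 = 22
    WorkerState(1, overlap_blocks=0, active_blocks=5),   # cost 10 + 5 = 15
    WorkerState(2, overlap_blocks=8, active_blocks=4),   # cost 2 + 4 = 6
]
print(pick_lowest_cost(10, states).worker_id)  # 2
```

The example shows why cache overlap alone does not decide placement: worker 0 has the same overlap as worker 2 but loses because of its decode load. See [Routing Concepts](/dynamo/components/router/routing-concepts) for the real cost model.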
## More Router Docs - **[Routing Concepts](/dynamo/components/router/routing-concepts)**: Cost model, worker selection, and routing primitives - **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags, transport modes, load tracking, and metrics - **[Disaggregated Serving](/dynamo/components/router/disaggregated-serving)**: Prefill and decode routing setups - **[Topology-Aware KV Transfer](/dynamo/components/router/topology-aware-kv-transfer)**: Runtime metadata and decode routing constraints for topology-aware prefill/decode handoff - **[Router Operations](/dynamo/components/router/router-operations)**: Replicas, remote indexers, persistence, and recovery - **[Router Examples](/dynamo/components/router/router-examples)**: Python API usage, K8s examples, and custom routing patterns - **[Router Testing](router-testing.md)**: Recommended test layers for non-trivial router changes - **[Standalone Indexer](/dynamo/components/router/standalone-indexer)**: Run the KV indexer as a separate service - **[KV Event Replay — Dynamo vs vLLM](/dynamo/components/router/kv-event-replay-dynamo-vs-v-llm)**: Gap detection and replay behavior # Disaggregated Serving Disaggregated serving separates the two main phases of LLM inference: | Phase | What it does | Scaling pressure | |---|---|---| | Prefill | Processes the prompt and produces the initial KV cache. | Input length, prompt reuse, context size | | Decode | Generates output tokens using the KV cache. | Concurrency, output length, active KV memory | In an aggregated deployment, each worker does both phases. In a disaggregated deployment, prefill workers and decode workers are separate pools. Dynamo routes each request through prefill first, transfers or exposes the KV cache state to decode, and streams the response from the decode worker. ```mermaid flowchart LR Client["Client"] --> Frontend["Dynamo Frontend / Router"] Frontend --> Prefill["Prefill workers
prompt processing"] Prefill -->|"KV transfer over fast fabric"| Decode["Decode workers
token generation"] Decode --> Frontend Frontend --> Client ``` ## When It Helps Disaggregated serving is most useful when prefill and decode need different resource shapes: - long prompts or retrieval-heavy traffic make prefill expensive - long generations or high concurrency make decode the bottleneck - you want to scale prefill and decode replicas independently - you want to pair prefill/decode separation with KV-aware routing - large models need different parallelism for prompt processing and generation It is not automatically better for every workload. For small models, short prompts, low concurrency, or clusters without fast KV transfer, an aggregated deployment may be simpler and faster. ## Mental Model Disaggregated serving usually combines four pieces: | Piece | Role | |---|---| | Frontend/router | Accepts OpenAI-compatible requests and coordinates routing. | | Prefill workers | Run the prompt phase and prepare KV transfer state. | | Decode workers | Continue generation after prefill completes. | | KV transfer path | Moves or exposes KV cache state between prefill and decode workers. | KV-aware routing is related but separate. Disaggregated serving splits prefill and decode. KV-aware routing chooses workers based on cache locality. Many production deployments use both, but you can reason about them independently. For router-specific behavior, see [Router: Disaggregated Serving](/dynamo/components/router/disaggregated-serving) and [KV Cache Aware Routing](/dynamo/user-guides/kv-cache-aware-routing). ## KV Transfer Is the Critical Path Disaggregation only helps when decode workers can access the KV cache produced by prefill quickly. In cross-node or high-throughput deployments, the KV transfer path commonly depends on RDMA-capable networking through the backend's transfer layer, such as NIXL/UCX. If RDMA is missing or silently falls back to TCP, TTFT and throughput can be dominated by KV movement rather than model compute. 
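A back-of-the-envelope estimate shows why the transfer path matters. The sketch below computes the KV cache size for one request and the time to move it at two link speeds. Every number is an illustrative assumption (layer count, KV-head count, head dimension, 1-byte FP8 entries, 400 Gbit/s and 10 Gbit/s links), and the estimate ignores protocol overhead, latency, tensor-parallel sharding, and any overlap of transfer with compute.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


def transfer_ms(num_bytes: int, gbit_per_s: float) -> float:
    """Idealized transfer time in milliseconds at the given line rate."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return num_bytes / bytes_per_s * 1e3


# Assumed model shape and prompt length, chosen only for illustration.
size = kv_cache_bytes(tokens=4000, layers=64, kv_heads=8, head_dim=128, dtype_bytes=1)

print(f"KV cache: {size / 2**20:.0f} MiB")                       # 500 MiB
print(f"400 Gbit/s fabric: {transfer_ms(size, 400):.1f} ms")     # about 10.5 ms
print(f"10 Gbit/s TCP path: {transfer_ms(size, 10):.1f} ms")     # about 419.4 ms
```

Under these assumptions, the slow path adds several hundred milliseconds per request before decode can start, which is the kind of gap that shows up directly in TTFT.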
Treat KV transfer as an early validation step, not a final tuning detail. Common failure modes include missing RDMA device-plugin resources, pods without the needed `rdma/ib` requests or `IPC_LOCK` capability, UCX/NIXL transport errors, mismatched model or KV cache settings between prefill and decode workers, and benchmarks that run through local port-forwarding instead of inside the cluster. Symptoms usually look like high TTFT despite available prefill capacity, decode workers sitting idle while prefill workers are busy, or disaggregated throughput falling below an aggregated baseline after splitting workers across nodes. Production cross-node disaggregated deployments usually require RDMA or an equivalent fast fabric for KV cache transfer. Without it, the backend may fall back to TCP and KV transfer can dominate TTFT and throughput. Validate the transfer path before spending time tuning replica counts. ### Deploying Disaggregated with RDMA Disaggregated deployments transfer KV cache between prefill and decode workers. Without RDMA or another fast transfer path, this movement can become the main performance bottleneck. Prerequisites for a production cross-node deployment: 1. **RDMA-capable network** such as InfiniBand, RoCE, or an equivalent fast fabric. 2. **RDMA device plugin** installed on the cluster so worker pods can request `rdma/ib` resources. 3. **ETCD and NATS** deployed for Dynamo coordination. The following example shows the RDMA-relevant fields in a disaggregated vLLM `DynamoGraphDeployment`. Start from a validated recipe when one exists, then adapt the resource requests, model, image, and parallelism for your cluster. 
```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: dynamo-disagg namespace: your-namespace spec: backendFramework: vllm pvcs: - name: model-cache create: false services: Frontend: componentType: frontend replicas: 1 volumeMounts: - name: model-cache mountPoint: /opt/models envs: - name: HF_HOME value: /opt/models extraPodSpec: mainContainer: image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1 imagePullPolicy: IfNotPresent VLLMPrefillWorker: envFromSecret: hf-token-secret componentType: worker subComponentType: prefill replicas: 2 resources: limits: gpu: "2" sharedMemory: size: 16Gi volumeMounts: - name: model-cache mountPoint: /opt/models envs: - name: HF_HOME value: /opt/models - name: UCX_TLS value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc" - name: UCX_RNDV_SCHEME value: "get_zcopy" - name: UCX_RNDV_THRESH value: "0" extraPodSpec: mainContainer: image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1 workingDir: /workspace imagePullPolicy: IfNotPresent securityContext: capabilities: add: ["IPC_LOCK"] resources: limits: rdma/ib: "2" requests: rdma/ib: "2" command: ["python3", "-m", "dynamo.vllm"] args: - --model - "Qwen/Qwen3-32B-FP8" - "--tensor-parallel-size" - "2" - "--kv-cache-dtype" - "fp8" - "--max-num-seqs" - "1" - --disaggregation-mode - prefill VLLMDecodeWorker: envFromSecret: hf-token-secret componentType: worker subComponentType: decode replicas: 1 resources: limits: gpu: "4" sharedMemory: size: 16Gi volumeMounts: - name: model-cache mountPoint: /opt/models envs: - name: HF_HOME value: /opt/models - name: UCX_TLS value: "rc_x,rc,dc_x,dc,cuda_copy,cuda_ipc" - name: UCX_RNDV_SCHEME value: "get_zcopy" - name: UCX_RNDV_THRESH value: "0" extraPodSpec: mainContainer: image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1 workingDir: /workspace imagePullPolicy: IfNotPresent securityContext: capabilities: add: ["IPC_LOCK"] resources: limits: rdma/ib: "4" requests: rdma/ib: "4" command: ["python3", "-m", "dynamo.vllm"] args: - --model - 
"Qwen/Qwen3-32B-FP8" - "--tensor-parallel-size" - "4" - "--kv-cache-dtype" - "fp8" - "--max-num-seqs" - "1024" - --disaggregation-mode - decode ``` Critical RDMA settings: | Setting | Purpose | |---|---| | `rdma/ib: "N"` | Requests RDMA resources for the worker pod. In most disaggregated vLLM deployments, match this to the worker TP size. | | `IPC_LOCK` capability | Allows RDMA memory registration and pinned-memory use. | | `UCX_TLS` | Enables RDMA-capable UCX transports such as `rc_x` and `dc_x`, plus CUDA transports for GPU buffers. | | `UCX_RNDV_SCHEME=get_zcopy` | Enables zero-copy RDMA transfers for large KV-cache movement. | After deployment, check the worker logs for UCX/NIXL initialization: ```bash kubectl logs | grep -i "UCX\|NIXL" ``` Expected output includes: ```text NIXL INFO Backend UCX was instantiated ``` If logs only show TCP transports, RDMA is not active. Check the RDMA device plugin, worker `rdma/ib` resource requests, security context, and UCX settings. For full transport setup and troubleshooting, see the [Disaggregated Communication Guide](/dynamo/kubernetes-deployment/operate/disagg-communication). ## Deployment Paths Choose the path that matches how much control you need: | Starting point | Use when | |---|---| | [Dynamo Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) | A recipe matches your model, backend, hardware, and serving mode. Start here for validated baselines and `perf.yaml` benchmarks. | | Direct `DynamoGraphDeployment` | You already know the prefill/decode layout, images, parallelism, and KV transfer settings. | | [DGDR](/dynamo/kubernetes-deployment/deploy-models/dgdr-reference) | You want Dynamo to generate a DGD from model, backend, hardware, workload, and SLA intent. | | [Sizing with AIConfigurator](/dynamo/user-guides/disaggregated-serving/sizing-with-ai-configurator) | You want to compare aggregated vs. disaggregated layouts and estimate prefill/decode sizing before deployment. 
|

Good recipe starting points include:

- [Qwen3-32B vLLM disagg + KV router](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b)
- [DeepSeek V3.2 TensorRT-LLM disagg + KV router](https://github.com/ai-dynamo/dynamo/tree/main/recipes/deepseek-v32-fp4)
- [Llama 3 70B vLLM disaggregated recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b)

For the Kubernetes resource model, see the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).

## Backend Examples

Each built-in backend has examples that show the concrete worker flags and transfer settings:

| Backend | Examples |
|---|---|
| vLLM | [Deployment examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy), including `disagg.yaml`, `disagg_router.yaml`, and `disagg_planner.yaml` |
| TensorRT-LLM | [Deployment examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy), including disaggregated, router, and planner variants |
| SGLang | [Deployment examples](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy), including NIXL-based disaggregated serving |

## Operational Notes

Disaggregated deployments add a data-movement path between workers. Before moving to production, verify:

- KV transfer backend and network fabric are configured for your backend
- RDMA resources, UCX/NIXL settings, and security context are active when your deployment depends on RDMA
- prefill and decode workers agree on model, dtype, block size, and KV layout
- pods have the required GPU, shared memory, and network resources
- frontend/router flags match your routing strategy
- benchmarks run inside the cluster, not through local port-forwarding, when validating high-load performance

Use [Dynamo Benchmarking](/dynamo/user-guides/benchmarking) to compare aggregated and disaggregated configurations with the same workload.

## Next Steps

1. Start from a matching [Dynamo Recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes) when available.
2. Read the backend-specific deployment example for your engine.
3. Use [Sizing with AIConfigurator](/dynamo/user-guides/disaggregated-serving/sizing-with-ai-configurator) or DGDR when you need help choosing prefill/decode sizing.
4. Validate the result with [Dynamo Benchmarking](/dynamo/user-guides/benchmarking).

# Sizing with AIConfigurator

This page focuses on using [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) to size aggregated and disaggregated Dynamo deployments. For the serving architecture and deployment-path overview, start with [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving).

AIConfigurator is a performance optimization tool that helps you find a strong starting configuration for deploying LLMs with Dynamo. Given a supported model, GPU system, backend, and SLA target, it searches aggregated and disaggregated layouts and can generate deployment artifacts for the selected target.

## When to Use AIConfigurator

When deploying LLMs with Dynamo, you need to make several critical decisions:

- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?

AIConfigurator is useful when you want:

- candidate configurations that are filtered against your SLA requirements
- generated Dynamo configuration files and Kubernetes manifests
- performance comparisons between aggregated and disaggregated strategies
- a support check for a model/system/backend combination before you tune by hand

Exact runtime and throughput gains depend on the model, hardware, backend, traffic shape, and available performance data.
Treat AIConfigurator output as a validated starting point, then benchmark the generated configuration in your cluster.

### End-to-End Workflow

![AIConfigurator end-to-end workflow](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/bc459161bf6b32c800fb7a1dd007c8134947303b4eed6bfa725ac02b32b6867b/pages-v1.2.0/assets/img/e2e-workflow.svg)

### Aggregated vs Disaggregated Architecture

AIConfigurator evaluates two deployment architectures and recommends the best one for your workload:

![Aggregated vs Disaggregated architecture comparison](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/a8e2cf7c5ff4855f4bd1e1ccc2a8076aad952d60dc16c77432769edbd1203a5d/pages-v1.2.0/assets/img/arch-comparison.svg)

### When to Use Each Architecture

![Decision flowchart for choosing aggregated vs disaggregated](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/ac64588ed77cb9588defcadd45b394b7a7e20e0b212ed6199bda37f5eda7a2d6/pages-v1.2.0/assets/img/decision-flowchart.svg)

## Quick Start

```bash
# Install
pip3 install aiconfigurator

# Optional: check whether the model/system/backend is covered
aiconfigurator cli support \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm

# Find optimal configuration for vLLM backend
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 16.67 \
  --database-mode SILICON \
  --deployment-target dynamo-j2 \
  --save-dir ./results_vllm

# Deploy on Kubernetes
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```

## Complete Walkthrough: vLLM on H200

This section walks through a validated example deploying Qwen3-32B-FP8 on 8× H200 GPUs using vLLM.
### Step 1: Run AIConfigurator

```bash
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --total-gpus 8 \
  --isl 4000 \
  --osl 500 \
  --ttft 600 \
  --tpot 25 \
  --backend vllm \
  --backend-version 0.12.0 \
  --deployment-target dynamo-j2 \
  --generator-set K8sConfig.k8s_namespace=$YOUR_NAMESPACE \
  --generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC \
  --save-dir ./results_vllm
```

**Parameters explained:**

- `--model-path`: HuggingFace model ID or local path (e.g., `Qwen/Qwen3-32B-FP8`). `--model` is also accepted as an alias.
- `--system`: GPU system type (`h200_sxm`, `h100_sxm`, `a100_sxm`)
- `--total-gpus`: Number of GPUs available for deployment
- `--isl` / `--osl`: Input/Output sequence lengths in tokens
- `--ttft` / `--tpot`: SLA targets: Time To First Token (ms) and Time Per Output Token (ms)
- `--backend`: Inference backend (`vllm`, `trtllm`, or `sglang`)
- `--backend-version`: Backend version (e.g., `0.12.0` for vLLM)
- `--deployment-target`: Artifact target. `dynamo-j2` generates Dynamo Kubernetes manifests; other targets are available in the upstream CLI.
- `--save-dir`: Directory to save generated deployment configs

### Step 2: Review the Results

AIConfigurator outputs a comparison of aggregated vs disaggregated deployment strategies:

```text
********************************************************************************
*                     Dynamo aiconfigurator Final Results                      *
********************************************************************************
  ----------------------------------------------------------------------------
  Input Configuration & SLA Target:
    Model: Qwen/Qwen3-32B-FP8 (is_moe: False)
    Total GPUs: 8
    Best Experiment Chosen: disagg at 446.85 tokens/s/gpu (disagg 1.38x better)
  ----------------------------------------------------------------------------
  Overall Best Configuration:
    - Best Throughput: 3,574.80 tokens/s
    - Per-GPU Throughput: 446.85 tokens/s/gpu
    - Per-User Throughput: 53.58 tokens/s/user
    - TTFT: 453.18ms
    - TPOT: 18.66ms
    - Request Latency: 9766.51ms
  ----------------------------------------------------------------------------
  Pareto Frontier:
    [ASCII plot omitted. Title: "Qwen/Qwen3-32B-FP8 Pareto Frontier: tokens/s/gpu_cluster vs tokens/s/user";
     y-axis: tokens/s/gpu_cluster (0 to 850); x-axis: tokens/s/user (0 to 120);
     series: agg, disagg, disagg best]
  ----------------------------------------------------------------------------
  Deployment Details:
    (p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
    Some math: total gpus used = replicas * gpus/replica
               gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
               gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)

agg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT   | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+
| 1    | vllm    | 322.69       | 41.78         | 546.92 | 12490.03        | 64 (=32x2)  | 8 (8=2x4)         | 2        | 4            | 4 (=4x1x1)  | tp4pp1   | 32 |
| 2    | vllm    | 293.94       | 44.43         | 593.10 | 11823.67        | 56 (=14x4)  | 8 (8=4x2)         | 4        | 2            | 2 (=2x1x1)  | tp2pp1   | 14 |
| 3    | vllm    | 208.87       | 42.90         | 460.58 | 12093.52        | 40 (=40x1)  | 8 (8=1x8)         | 1        | 8            | 8 (=8x1x1)  | tp8pp1   | 40 |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+-------------+----------+----+

disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | backend | tokens/s/gpu | tokens/s/user | TTFT   | request_latency | concurrency | total_gpus (used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| 1    | vllm    | 446.85       | 53.58         | 453.18 | 9766.51         | 76 (=76x1)  | 8 (8=1x8)         | 1        | 8 (=2x2+1x4) | 2          | 2 (=2x1)       | tp2pp1      | 1     | 1          | 4 (=4x1)       | tp4pp1      | 76    |
| 2    | vllm    | 446.85       | 41.14         | 453.18 | 12581.87        | 144 (=72x2) | 8 (8=2x4)         | 2        | 4 (=1x2+1x2) | 1          | 2 (=2x1)       | tp2pp1      | 1     | 1          | 2 (=2x1)       | tp2pp1      | 72    |
| 3    | vllm    | 333.73       | 40.22         | 453.18 | 12860.32        | 72 (=36x2)  | 8 (8=2x4)         | 2        | 4 (=1x2+2x1) | 1          | 2 (=2x1)       | tp2pp1      | 1     | 2          | 1 (=1x1)       | tp1pp1      | 18    |
+------+---------+--------------+---------------+--------+-----------------+-------------+-------------------+----------+--------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
```

**Reading the output:**

- **tokens/s/gpu**: Overall throughput efficiency (higher is better)
- **tokens/s/user**: Per-request generation speed, the inverse of TPOT (roughly 1000 / TPOT when TPOT is in milliseconds)
- **TTFT**: Predicted time to first token
- **concurrency**: Total concurrent requests across all replicas (e.g., `56 (=14x4)` means batch size 14 × 4 replicas)
- **agg Rank 1** recommends TP4 with 2 replicas, which is simpler to deploy
- **disagg Rank 1** recommends 2 prefill workers (TP2) + 1 decode worker (TP4), which gives higher throughput but requires RDMA

### Step 3: Deploy on Kubernetes

AIConfigurator writes the generated deployment artifacts, including Kubernetes manifests, to the `--save-dir` directory:

```
├── agg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── agg_config.yaml
│   │   ├── bench_run.sh            # aiperf benchmark sweep script (bare-metal)
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml          # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml         # Kubernetes DynamoGraphDeployment
│   │   └── run_0.sh
│   ...
├── disagg
│   ├── best_config_topn.csv
│   ├── exp_config.yaml
│   ├── pareto.csv
│   ├── top1
│   │   ├── bench_run.sh            # aiperf benchmark sweep script (bare-metal)
│   │   ├── decode_config.yaml
│   │   ├── generator_config.yaml
│   │   ├── k8s_bench.yaml          # aiperf benchmark sweep Job (Kubernetes)
│   │   ├── k8s_deploy.yaml         # Kubernetes DynamoGraphDeployment
│   │   ├── prefill_config.yaml
│   │   ├── run_0.sh
│   │   └── run_1.sh (for multi-node setups)
│   ...
└── pareto_frontier.png
```

#### Prerequisites

Before deploying, ensure you have:

1. **HuggingFace Token Secret** (for gated models):

   ```bash
   kubectl create secret generic hf-token-secret \
     -n your-namespace \
     --from-literal=HF_TOKEN="your-huggingface-token"
   ```

2. **Model Cache PVC** (recommended for faster restarts):

   ```yaml
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: model-cache
     namespace: your-namespace
   spec:
     accessModes:
       - ReadWriteMany
     resources:
       requests:
         storage: 100Gi
   ```

#### Deploy the Configuration

The generated `k8s_deploy.yaml` provides a starting point.
You'll typically need to customize it for your environment:

```bash
kubectl apply -f ./results_vllm/agg/top1/agg/k8s_deploy.yaml
```

**Complete deployment example** with model cache and production settings:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-agg
  namespace: your-namespace
spec:
  backendFramework: vllm
  pvcs:
    - name: model-cache
      create: false  # Use existing PVC
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          imagePullPolicy: IfNotPresent
    VLLMWorker:
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 4
      resources:
        limits:
          gpu: "2"
      sharedMemory:
        size: 16Gi  # Required for vLLM
      volumeMounts:
        - name: model-cache
          mountPoint: /opt/models
      envs:
        - name: HF_HOME
          value: /opt/models
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
          workingDir: /workspace
          imagePullPolicy: IfNotPresent
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - "Qwen/Qwen3-32B-FP8"
            - "--no-enable-prefix-caching"
            - "--tensor-parallel-size"
            - "2"
            - "--pipeline-parallel-size"
            - "1"
            - "--data-parallel-size"
            - "1"
            - "--kv-cache-dtype"
            - "fp8"
            - "--max-model-len"
            - "6000"
            - "--max-num-seqs"
            - "1024"
```

**Key deployment settings:**

| Setting | Purpose | Notes |
|---------|---------|-------|
| `backendFramework: vllm` | Tells Dynamo which runtime to use | Required at spec level |
| `pvcs` + `volumeMounts` | Caches model weights across restarts | Mount at `/opt/models` (not `/root/`) |
| `HF_HOME` env var | Points HuggingFace to cache location | Must match `mountPoint` |
| `sharedMemory.size: 16Gi` | IPC memory for vLLM | 16Gi for vLLM, 80Gi for TRT-LLM |
| `envFromSecret` | Injects HF_TOKEN | Required for gated models |

### Step 4: Validate with AIPerf

After deployment, validate the predictions against actual performance using [AIPerf](https://github.com/ai-dynamo/aiperf).

> ℹ️ Run AIPerf **inside the cluster** to avoid network latency affecting measurements.

AIConfigurator (AIC) generates AIPerf scripts alongside the Dynamo configs and stores them in the results folder when `--save-dir` is specified. For Kubernetes deployments, run benchmarks with `k8s_bench.yaml`; on bare-metal systems, use the `bench_run.sh` script. These scripts execute AIPerf across a concurrency list: the default set (`1 2 8 16 32 64 128`) plus `BenchConfig.estimated_concurrency` and values within ±5% of it. You can also customize this concurrency list as needed.

By default, AIPerf results are saved to `/tmp/bench_artifacts` inside the container. If a PVC name is set with `--generator-set K8sConfig.k8s_pvc_name=$YOUR_PVC`, result artifacts are saved to the PVC volume mount instead.

![AIC-to-AIPerf parameter mapping](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/e399bacf9aea6e2fad40a86facb4af53fcaa3b7b974c42fcf0a075ecea1ab0c3/pages-v1.2.0/assets/img/param-mapping.svg)

| AIC Output | AIPerf Parameter | Notes |
|------------|-----------------|-------|
| `concurrency: 56 (=14x4)` | `--concurrency 56` | Use total concurrency when benchmarking via the frontend |
| ISL/OSL targets | `--isl 4000 --osl 500` | Match your AIC inputs |
| - | `--num-requests 800` | Use `concurrency × 40` minimum for statistical stability |
| - | `--extra-inputs "ignore_eos:true"` | Ensures exact OSL tokens generated |

> **Note on concurrency**: AIC reports concurrency as `total (=bs × replicas)`. When benchmarking through the frontend (which routes to all replicas), use the total value. If benchmarking a single replica directly, use the per-replica `bs` value instead.
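The concurrency rule in the note above can be sketched in a few lines of Python. This is a minimal illustration, not part of AIConfigurator or AIPerf: the helper names `parse_aic_concurrency` and `benchmark_concurrency` are hypothetical, and the only input assumed is the `total (=bs x replicas)` cell format shown in the AIC result tables.

```python
import re
from typing import Tuple


def parse_aic_concurrency(cell: str) -> Tuple[int, int, int]:
    """Parse an AIC concurrency cell such as '56 (=14x4)'.

    Returns (total, batch_size, replicas). Hypothetical helper; the cell
    format is taken from the result tables in this walkthrough.
    """
    match = re.fullmatch(r"\s*(\d+)\s*\(=(\d+)x(\d+)\)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized concurrency cell: {cell!r}")
    total, batch_size, replicas = (int(group) for group in match.groups())
    if total != batch_size * replicas:
        raise ValueError(f"inconsistent cell: {total} != {batch_size} x {replicas}")
    return total, batch_size, replicas


def benchmark_concurrency(cell: str, via_frontend: bool = True) -> int:
    """Pick the value to pass to `--concurrency`.

    Through the frontend (all replicas) use the total; against a single
    replica use the per-replica batch size.
    """
    total, batch_size, _replicas = parse_aic_concurrency(cell)
    return total if via_frontend else batch_size


if __name__ == "__main__":
    print(benchmark_concurrency("56 (=14x4)"))                      # via the frontend
    print(benchmark_concurrency("56 (=14x4)", via_frontend=False))  # single replica
```

Keeping the parsing separate from the choice makes it easy to reuse the same cell for both frontend and single-replica runs.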
An example benchmark Job that runs AIPerf inside the cluster:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: your-namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.10
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              aiperf profile \
                -m Qwen/Qwen3-32B-FP8 \
                --endpoint-type chat \
                -u http://dynamo-agg-frontend:8000 \
                --isl 4000 --isl-stddev 0 \
                --osl 500 --osl-stddev 0 \
                --num-requests 800 \
                --concurrency 56 \
                --streaming \
                --extra-inputs "ignore_eos:true" \
                --num-warmup-requests 40 \
                --ui-type simple
```

```bash
kubectl apply -f k8s_bench.yaml
kubectl logs -f -l job-name=aiperf-benchmark
```

**Validated results** (Qwen3-32B-FP8, 8× H200, TP2×4 replicas, aggregated):

| Metric | AIC Prediction | Actual (avg) | Status |
|--------|---------------|--------------|--------|
| TTFT (ms) | 509 | 209 | Better than target |
| ITL/TPOT (ms) | 16.49 | 15.06 | Within 10% |
| Throughput (req/s) | ~6.3 | 6.9 | Within 10% |
| Total Output TPS | ~3,178 | 3,462 | Within 10% |

The table above is a validation example, not a universal guarantee. Expect variance across clusters, backend versions, model cache settings, and network fabric. Run multiple benchmark passes and compare against the generated concurrency and sequence-length assumptions.

## Fine-Tuning Your Deployment

AIConfigurator provides a strong starting point.
Here's how to iterate for production:

### Adjusting for Actual Workload

If your real workload differs from the benchmark parameters, rerun AIConfigurator with values that match it:

```bash
# For longer outputs (chat/code generation):
# increase OSL, relax TTFT target
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 2000 \
  --osl 2000 \
  --ttft 1000 \
  --tpot 10 \
  --save-dir ./results_long_output
```

### Exploring Alternative Configurations

Use `exp` mode to compare custom configurations:

```yaml
# custom_exp.yaml
exps:
  - exp_tp2
  - exp_tp4

exp_tp2:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [2]

exp_tp4:
  mode: "patch"
  serving_mode: "agg"
  model_path: "Qwen/Qwen3-32B-FP8"
  total_gpus: 8
  system_name: "h200_sxm"
  backend_name: "vllm"
  backend_version: "0.12.0"
  isl: 4000
  osl: 500
  ttft: 600
  tpot: 16.67
  config:
    agg_worker_config:
      tp_list: [4]
```

```bash
aiconfigurator cli exp --yaml-path custom_exp.yaml --save-dir ./results_custom
```

For production disaggregated deployments, validate the KV transfer path before tuning replica counts. See [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving#deploying-disaggregated-with-rdma) for RDMA prerequisites, the DGD resource pattern, and NIXL/UCX verification.

### Tuning vLLM-Specific Parameters

Override vLLM engine parameters with `--generator-set`:

```bash
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --isl 4000 --osl 500 \
  --ttft 600 --tpot 16.67 \
  --save-dir ./results_tuned \
  --generator-set Workers.agg.kv_cache_free_gpu_memory_fraction=0.85 \
  --generator-set Workers.agg.max_num_seqs=2048
```

Run `aiconfigurator cli default --generator-help` to see all available parameters.
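When scripting several tuning runs, it can help to assemble the `--generator-set` flags from a dictionary instead of editing long command lines by hand. The sketch below only builds an argument list; `build_aic_command` is a hypothetical helper, not part of AIConfigurator, and it assumes every override is passed as `--generator-set key=value`, as in the command above.

```python
import shlex
from typing import Dict, List


def build_aic_command(base_flags: Dict[str, object],
                      overrides: Dict[str, object]) -> List[str]:
    """Build an `aiconfigurator cli default` argument list.

    Hypothetical helper: `base_flags` maps flag names (without the leading
    dashes) to values, and `overrides` maps generator keys to values.
    """
    command = ["aiconfigurator", "cli", "default"]
    for flag, value in base_flags.items():
        command += [f"--{flag}", str(value)]
    for key, value in overrides.items():
        command += ["--generator-set", f"{key}={value}"]
    return command


command = build_aic_command(
    {
        "model-path": "Qwen/Qwen3-32B-FP8",
        "total-gpus": 8,
        "system": "h200_sxm",
        "backend": "vllm",
        "backend-version": "0.12.0",
        "isl": 4000,
        "osl": 500,
        "ttft": 600,
        "tpot": 16.67,
        "save-dir": "./results_tuned",
    },
    {
        "Workers.agg.kv_cache_free_gpu_memory_fraction": 0.85,
        "Workers.agg.max_num_seqs": 2048,
    },
)

# Print a shell-quoted command line for review before running it.
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` once the printed command looks right.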
### Prefix Caching Considerations

Whether to enable prefix caching depends on how often prompts share a prefix (e.g., a common system prompt):

- **Enable prefix caching** when you have high prefix hit rates
- **Disable prefix caching** (`--no-enable-prefix-caching`) for diverse prompts

AIConfigurator's default predictions assume no prefix caching. Enable it post-deployment if your workload benefits.

## Supported Configurations

### Backends and Versions

For a comprehensive breakdown of which model/system/backend/version combinations are supported in both aggregated and disaggregated modes, refer to the [**support matrix**](https://ai-dynamo.github.io/aiconfigurator/support-matrix/). The raw data is available as [per-system CSV files](https://github.com/ai-dynamo/aiconfigurator/tree/main/src/aiconfigurator/systems/support_matrix), which are automatically generated and tested to ensure accuracy across all supported configurations.

You can also check whether a system and framework version are supported with the `aiconfigurator cli support` command.
For example:

```bash
aiconfigurator cli support --model-path Qwen/Qwen3-32B-FP8 --system h100_sxm --backend-version 1.2.0rc5
```

## Common Use Cases

```bash
# Strict latency SLAs (real-time chat)
aiconfigurator cli default \
  --model-path meta-llama/Llama-3.1-70B \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --ttft 200 --tpot 8

# High throughput (batch processing)
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 32 \
  --system h200_sxm \
  --backend trtllm \
  --ttft 2000 --tpot 50

# Request latency constraint (end-to-end SLA)
aiconfigurator cli default \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 16 \
  --system h200_sxm \
  --backend vllm \
  --backend-version 0.12.0 \
  --request-latency 12000 \
  --isl 4000 --osl 500
```

## Additional Options

```bash
# Web interface for interactive exploration
pip3 install "aiconfigurator[webapp]"
aiconfigurator webapp
# Visit http://127.0.0.1:7860

# Quick config generation (no parameter sweep)
aiconfigurator cli generate \
  --model-path Qwen/Qwen3-32B-FP8 \
  --total-gpus 8 \
  --system h200_sxm \
  --backend vllm

# Check model/system support
aiconfigurator cli support \
  --model-path Qwen/Qwen3-32B-FP8 \
  --system h200_sxm \
  --backend vllm
```

## Troubleshooting

### AIConfigurator Issues

**Model not found**: Use the full HuggingFace path (e.g., `Qwen/Qwen3-32B-FP8`, not `QWEN3_32B`).

**Backend version mismatch**: Check supported versions with `aiconfigurator cli support --model-path $MODEL --system $SYSTEM --backend $BACKEND`, substituting your own model, system, and backend.

### Deployment Issues

**Pods crash with "Permission denied" on cache directory**:

- Mount the PVC at `/opt/models` instead of `/root/.cache/huggingface`
- Set the `HF_HOME=/opt/models` environment variable
- Ensure the PVC has `ReadWriteMany` access mode

**Workers stuck in CrashLoopBackOff**:

- Check logs: `kubectl logs $POD_NAME --previous`
- Verify `sharedMemory.size` is set (16Gi for vLLM, 80Gi for TRT-LLM)
- Ensure the HuggingFace token secret exists and is named correctly

**Model download slow on every restart**:

- Add a PVC for model caching (see deployment example above)
- Verify `volumeMounts` and `HF_HOME` are configured on workers

**"Context stopped or killed" errors (disaggregated only)**:

- Deploy etcd and NATS infrastructure (required for KV cache transfer)
- See [Dynamo Kubernetes Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) for platform setup

### Performance Issues

**OOM errors**: Reduce `--max-num-seqs` or increase tensor parallelism.

**Performance below predictions**:

- Verify warmup requests are sufficient (40+ recommended)
- Check for competing workloads on the cluster
- Ensure KV cache memory fraction is optimized
- Run benchmarks from inside the cluster to eliminate network latency

**Disaggregated TTFT extremely high (10+ seconds)**:

Start by checking the **RDMA and KV transfer path**. Without RDMA or another fast transfer path, KV cache transfer may fall back to TCP and become a severe bottleneck.

To diagnose:

```bash
# $POD_NAME is the name of a worker pod

# Check if RDMA resources are allocated
kubectl get pod $POD_NAME -o yaml | grep -A5 "resources:"

# Check UCX transport in logs
kubectl logs $POD_NAME | grep -i "UCX\|transport"
```

To fix:

1. Ensure your cluster has an RDMA device plugin installed
2. Add `rdma/ib` resource requests to worker pods
3. Add the `IPC_LOCK` capability to the security context
4. Add UCX environment variables

See [Disaggregated Serving](/dynamo/user-guides/disaggregated-serving#deploying-disaggregated-with-rdma) for the deployment pattern and verification steps.

**Disaggregated working but throughput lower than aggregated**:

For balanced workloads (ISL/OSL ratio between 2:1 and 10:1), aggregated is often better.
Disaggregated shines for:

- Very long inputs (ISL > 8000) with short outputs
- Workloads needing independent prefill/decode scaling

## Learn More

- [AIConfigurator CLI Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/cli_user_guide.md)
- [Dynamo Deployment Guide](https://github.com/ai-dynamo/aiconfigurator/blob/main/docs/dynamo_deployment_guide.md)
- [Dynamo Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)
- [Benchmarking Guide](/dynamo/user-guides/benchmarking)

# KVBM Guide

The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.

KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack.

This guide covers installation, configuration, and deployment of KVBM and other KV cache management systems.

## Quick start with the pre-built NGC container

The fastest path is the published Dynamo container, which includes KVBM:

```bash
docker run --gpus all --rm -it \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0 \
  /bin/bash
```

For installation from source or custom builds, see [Local Installation](/dynamo/getting-started/local-installation) and [Release Artifacts](/dynamo/resources/release-artifacts).

## Run KVBM Standalone

KVBM can be used independently without using the rest of the Dynamo stack:

```bash
pip install kvbm
```

See the [support matrix](/dynamo/resources/support-matrix) for version compatibility.

### Build from Source

To build KVBM from source, see the detailed instructions in the [KVBM bindings README](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM

### Docker Setup

```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f dev/docker-compose.yml up -d
```

Pick one of the following to get a Dynamo vLLM container with KVBM built in. The subsequent serving commands are the same either way.

**Option A: Pre-built NGC container (recommended for quick start)**

```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
```

See the [Local Installation Guide](/dynamo/getting-started/local-installation) for full setup instructions and [Release Artifacts](/dynamo/resources/release-artifacts#container-images) for available versions.

**Option B: Build from source**

```bash
# Build a dynamo vLLM container (KVBM is built in by default)
# x86_64
python container/render.py --framework vllm --target runtime --output-short-filename --platform linux/amd64
docker buildx build --platform linux/amd64 -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .

# arm64 (Grace, Jetson, arm64 EC2)
python container/render.py --framework vllm --target runtime --output-short-filename --platform linux/arm64
docker buildx build --platform linux/arm64 -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .

# Launch the container
container/run.sh --image dynamo:latest-vllm-runtime -it --mount-workspace --use-nixl-gds
```

### Aggregated Serving

```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```

#### Verify Deployment

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": false,
    "max_tokens": 10
  }'
```

#### Alternative: Using Direct vllm serve

You can also use `vllm serve` directly with KVBM:

```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```

## Run KVBM in Dynamo with TensorRT-LLM

**Prerequisites:**

- Ensure `etcd` and `nats` are running before starting
- KVBM only supports TensorRT-LLM's PyTorch backend
- Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
- KVBM requires TensorRT-LLM v1.2.0rc2 or newer

### Docker Setup

```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f dev/docker-compose.yml up -d
```

Pick one of the following to get a Dynamo TensorRT-LLM container with KVBM built in. The subsequent serving commands are the same either way.

**Option A: Pre-built NGC container (recommended for quick start)**

```bash
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.2.0
```

See the [Local Installation Guide](/dynamo/getting-started/local-installation) for full setup instructions and [Release Artifacts](/dynamo/resources/release-artifacts#container-images) for available versions.
**Option B: Build from source** ```bash # Build a dynamo TRTLLM container (KVBM is built in by default) # x86_64 python container/render.py --framework trtllm --target runtime --output-short-filename --cuda-version=13.1 --platform linux/amd64 docker buildx build --platform linux/amd64 -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile . # arm64 with NVIDIA GPUs (GH200, GB200, P6e-GB200 UltraServer — *not* generic Graviton instances, which have no GPU) python container/render.py --framework trtllm --target runtime --output-short-filename --cuda-version=13.1 --platform linux/arm64 docker buildx build --platform linux/arm64 -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile . # Launch the container container/run.sh --image dynamo:latest-trtllm-runtime -it --mount-workspace --use-nixl-gds ``` ### Aggregated Serving ```bash # Write the LLM API config cat > "/tmp/kvbm_llm_api_config.yaml" < **Learn more:** See the [SGLang HiCache Integration Guide](/dynamo/integrations/kv-cache-integrations/hi-cache) for detailed configuration, deployment examples, and troubleshooting. ## Disaggregated Serving with KVBM KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache. 
### Disaggregated Serving with vLLM

```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: requires at least 2 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh

# 2P2D - two prefill workers and two decode workers
# NOTE: requires at least 4 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```

### Disaggregated Serving with TRT-LLM

```bash
# Launch prefill worker with KVBM
python3 -m dynamo.trtllm \
  --model-path Qwen/Qwen3-0.6B \
  --served-model-name Qwen/Qwen3-0.6B \
  --extra-engine-args /tmp/kvbm_llm_api_config.yaml \
  --disaggregation-mode prefill &
```

## Configuration

### Cache Tier Configuration

Configure KVBM cache tiers using environment variables:

```bash
# Option 1: CPU cache only (GPU -> CPU offloading)
export DYN_KVBM_CPU_CACHE_GB=4  # 4GB of pinned CPU memory

# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4
export DYN_KVBM_DISK_CACHE_GB=8  # 8GB of disk

# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
# NOTE: Experimental, may not provide optimal performance
# NOTE: Disk offload filtering not supported with this option
export DYN_KVBM_DISK_CACHE_GB=8
```

You can also specify exact block counts instead of GB:

- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`

> [!NOTE]
> KVBM is a write-through cache, so it is possible to misconfigure it: each tier's capacity should be at least as large as the tier before it. For example, if your GPU has 100GB of memory dedicated to KV cache storage, set `DYN_KVBM_CPU_CACHE_GB` to at least 100. The same applies to the disk cache: `DYN_KVBM_DISK_CACHE_GB >= DYN_KVBM_CPU_CACHE_GB`. If the CPU cache is smaller than the device cache, _there will be no benefit from KVBM_. In many cases you will see performance degradation, because KVBM will churn by offloading blocks from the GPU to CPU after every forward pass.
> To find the minimum value of `DYN_KVBM_CPU_CACHE_GB` for your setup, consult your LLM engine's KV cache configuration.

### SSD Lifespan Protection

When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if the block's frequency is ≥ 2. Each block's frequency starts at 1, doubles on a cache hit, and decrements by 1 on each time decay step.

To disable disk offload filtering:

```bash
export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
```

### NCCL Replicated Mode for MLA Models

For MLA (Multi-head Latent Attention) models such as DeepSeek, KVBM can use **NCCL replicated mode** so that only rank 0 loads KV blocks from G2/G3 storage (the host and disk tiers) and then broadcasts them to all GPUs via NCCL. This avoids redundant loads and can improve performance when multiple GPUs share the same replicated KV cache.

**Enable NCCL MLA mode:**

```bash
export DYN_KVBM_NCCL_MLA_MODE=true
```

**Requirements:**

- MPI must be initialized (e.g., when launching with `mpirun` or equivalent) so that rank and world size are available for NCCL.
- For optimal broadcast-based replication, build KVBM with the NCCL feature: `cargo build -p kvbm --features nccl`. Without it, the connector falls back to worker-level replication (each GPU loads independently).

When the mode is disabled (the default), each GPU loads KV blocks independently.
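Two rules from this Configuration section, the tier sizing rule under Cache Tier Configuration and the disk offload filter under SSD Lifespan Protection, can be illustrated with a small Python model. This is a sketch under stated assumptions, not KVBM's implementation: `check_tier_sizes` and `BlockFrequency` are hypothetical names, and the floor of 1 on decay is an assumption, since the text does not say what happens when the frequency would drop below its initial value.

```python
from dataclasses import dataclass
from typing import List, Optional


def check_tier_sizes(gpu_kv_gb: float, cpu_gb: float,
                     disk_gb: Optional[float] = None) -> List[str]:
    """Return warnings for tier sizes that violate GPU <= CPU <= disk."""
    warnings = []
    if cpu_gb < gpu_kv_gb:
        warnings.append("DYN_KVBM_CPU_CACHE_GB is smaller than the GPU KV cache")
    if disk_gb is not None and disk_gb < cpu_gb:
        warnings.append("DYN_KVBM_DISK_CACHE_GB is smaller than DYN_KVBM_CPU_CACHE_GB")
    return warnings


@dataclass
class BlockFrequency:
    """Toy model of the per-block frequency used by the disk offload filter."""

    value: int = 1  # frequency is initialized at 1

    def on_cache_hit(self) -> None:
        self.value *= 2  # frequency doubles on a cache hit

    def on_decay_step(self) -> None:
        # Decrement by 1 per decay step; the floor of 1 is an assumption.
        self.value = max(1, self.value - 1)

    def eligible_for_disk(self) -> bool:
        return self.value >= 2  # only blocks with frequency >= 2 are offloaded


if __name__ == "__main__":
    # A 4 GB CPU cache behind a 100 GB GPU KV cache triggers a warning.
    for warning in check_tier_sizes(gpu_kv_gb=100, cpu_gb=4, disk_gb=8):
        print("warning:", warning)

    block = BlockFrequency()
    block.on_cache_hit()
    print("eligible after one hit:", block.eligible_for_disk())
```

Under this model, a block that is written once and never hit again stays at frequency 1 and is never written to disk, which is the behavior the filter relies on to reduce SSD writes.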
## Enable and View KVBM Metrics ### Setup Monitoring Stack ```bash # Start basic services (etcd & natsd), along with Prometheus and Grafana docker compose -f dev/docker-observability.yml up -d ``` ### Enable Metrics for vLLM ```bash DYN_KVBM_METRICS=true \ DYN_KVBM_CPU_CACHE_GB=20 \ python -m dynamo.vllm \ --model Qwen/Qwen3-0.6B \ --enforce-eager \ --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_connector_module_path":"kvbm.vllm_integration.connector","kv_role":"kv_both"}' ``` ### Enable Metrics for TensorRT-LLM ```bash DYN_KVBM_METRICS=true \ DYN_KVBM_CPU_CACHE_GB=20 \ python3 -m dynamo.trtllm \ --model-path Qwen/Qwen3-0.6B \ --served-model-name Qwen/Qwen3-0.6B \ --extra-engine-args /tmp/kvbm_llm_api_config.yaml & ``` ### Firewall Configuration (Optional) ```bash # If firewall blocks KVBM metrics ports sudo ufw allow 6880/tcp ``` ### View Metrics Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**. ### Available Metrics | Metric | Description | |--------|-------------| | `kvbm_matched_tokens` | Number of matched tokens | | `kvbm_offload_blocks_d2h` | Offload blocks from device to host | | `kvbm_offload_blocks_h2d` | Offload blocks from host to disk | | `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) | | `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device | | `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device | | `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) | | `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) | ## Benchmarking KVBM Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance. 
### Setup ```bash git clone https://github.com/LMCache/LMBenchmark.git cd LMBenchmark/synthetic-multi-round-qa ``` ### Run Benchmark ```bash # Synthetic multi-turn chat dataset # Arguments: model, endpoint, output prefix, qps ./long_input_short_output_run.sh \ "Qwen/Qwen3-0.6B" \ "http://localhost:8000" \ "benchmark_kvbm" \ 1 ``` Average TTFT and other performance numbers will be in the output. > **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard. ### Baseline Comparison #### vLLM Baseline (without KVBM) ```bash vllm serve Qwen/Qwen3-0.6B ``` #### TensorRT-LLM Baseline (without KVBM) ```bash # Create config without kv_connector_config cat > "/tmp/llm_api_config.yaml" < **Wait for model readiness.** Before benchmarking, ensure your deployment has fully loaded the model. Check pod logs or hit the health endpoint (`curl http://localhost:8000/health`) — it should return `200 OK` before you proceed. ```bash # Port-forward the frontend service kubectl port-forward -n svc/ 8000:8000 > /dev/null 2>&1 & # Run a single benchmark aiperf profile \ --model \ --url http://localhost:8000 \ --endpoint-type chat \ --streaming \ --concurrency 10 \ --request-count 100 \ --synthetic-input-tokens-mean 2000 \ --output-tokens-mean 256 ``` This produces results in `artifacts/` and prints a summary table to the console: ```text NVIDIA AIPerf | LLM Metrics ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓ ┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p50 ┃ std ┃ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩ │ Time to First Token │ 234.56 │ 189.23 │ 298.45 │ 289.34 │ 267.12 │ 231.12 │ 28.45 │ │ (ms) │ │ │ │ │ │ │ │ │ Request Latency │ 1234.56 │ 987.34 │ 1567.89 │ 1534.23 │ 1456.78 │ 1223.45 │ 156.78 │ │ (ms) │ │ │ │ │ │ │ │ │ Inter Token Latency │ 15.67 │ 12.34 │ 19.45 │ 19.01 │ 18.23 │ 15.45 │ 1.89 │ │ (ms) │ │ │ │ │ │ │ │ │ Request Throughput │ 31.45 │ N/A │ 
N/A │ N/A │ N/A │ N/A │ N/A │ │ (requests/sec) │ │ │ │ │ │ │ │ └─────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘ ``` *Actual numbers will vary based on model size, hardware, batch size, and network conditions. Client-side benchmarks include port-forwarding overhead — use [server-side benchmarking](#server-side-benchmarking-in-cluster) for accurate performance measurement.* To stop the port-forward when done: `kill %1` (or `kill `). ### Step 3: Concurrency Sweep for Pareto Analysis To understand how your deployment behaves across load levels, run a concurrency sweep. Each concurrency level sends enough requests for stable measurements (`max(c*3, 10)`): ```bash MODEL="" URL="http://localhost:8000" for c in 1 2 5 10 50 100; do aiperf profile \ --model "$MODEL" \ --url "$URL" \ --endpoint-type chat \ --streaming \ --concurrency $c \ --request-count $(( c * 3 > 10 ? c * 3 : 10 )) \ --synthetic-input-tokens-mean 2000 \ --output-tokens-mean 256 \ --artifact-dir "artifacts/deployment-a/c$c" done ``` **Note**: Adjust concurrency levels to match your deployment's capacity. Very high concurrency on a small deployment (e.g., c250 on a single GPU) will cause server errors. Start with lower values and increase until you find the saturation point. ### Step 4: [If Comparative] Benchmark a Second Deployment Teardown deployment A and deploy deployment B with a different configuration. Kill the previous port-forward (`kill %1`), then repeat: ```bash kubectl port-forward -n svc/ 8000:8000 > /dev/null 2>&1 & for c in 1 2 5 10 50 100; do aiperf profile \ --model "$MODEL" \ --url "$URL" \ --endpoint-type chat \ --streaming \ --concurrency $c \ --request-count $(( c * 3 > 10 ? 
c * 3 : 10 )) \ --synthetic-input-tokens-mean 2000 \ --output-tokens-mean 256 \ --artifact-dir "artifacts/deployment-b/c$c" done ``` ### Step 5: Generate Visualizations ```bash # Compare all runs — auto-detects multi-run directories aiperf plot artifacts/deployment-a artifacts/deployment-b # Or compare all subdirectories under a parent aiperf plot artifacts/ # Launch interactive dashboard for exploration aiperf plot artifacts/ --dashboard ``` AIPerf automatically generates plots based on available data: - **TTFT vs Throughput** — find the sweet spot between responsiveness and capacity (always generated for multi-run comparisons) - **Pareto Curves** — throughput per GPU vs latency and interactivity (only generated when GPU telemetry data is available — add `--gpu-telemetry` during profiling if DCGM is running) - **Time series** — per-request TTFT, ITL, and latency over time (generated for single-run analysis) Here is an example Pareto frontier from a concurrency sweep of Qwen3-0.6B on 8x H200 with vLLM, showing the tradeoff between user experience (tokens/sec per user) and resource efficiency (tokens/sec per GPU): ![AIPerf Pareto Frontier](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/b7c4568cf0709ef464121da5844314308ca19c1401d6339e9f553d55b74f3580/pages-v1.2.0/assets/img/aiperf-pareto-frontier.png) See the [AIPerf Visualization Guide](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) for full details on plot customization, experiment classification, and themes. 
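As a quick check on the sweep sizing used in Steps 3 and 4, the following Python sketch reproduces the `max(c*3, 10)` request-count rule and assembles the same `aiperf profile` argument list for each concurrency level. It only prints the commands and does not run them. The helper names and the `my-model` placeholder are hypothetical; the flags are the ones shown in the sweep loops above.

```python
SWEEP = [1, 2, 5, 10, 50, 100]  # concurrency levels used in the sweep loops


def request_count(concurrency: int) -> int:
    # Same rule as the shell arithmetic in the sweep: max(c * 3, 10)
    return max(concurrency * 3, 10)


def aiperf_args(model: str, url: str, concurrency: int, deployment: str) -> list[str]:
    # Builds the argument list only; pass it to subprocess.run() to execute.
    return [
        "aiperf", "profile",
        "--model", model,
        "--url", url,
        "--endpoint-type", "chat",
        "--streaming",
        "--concurrency", str(concurrency),
        "--request-count", str(request_count(concurrency)),
        "--synthetic-input-tokens-mean", "2000",
        "--output-tokens-mean", "256",
        "--artifact-dir", f"artifacts/{deployment}/c{concurrency}",
    ]


# Requests sent at each concurrency level
plan = {c: request_count(c) for c in SWEEP}

for c in SWEEP:
    # "my-model" is a placeholder; use the name of your deployed model
    print(" ".join(aiperf_args("my-model", "http://localhost:8000", c, "deployment-a")))
```

With the default levels, the floor of 10 requests applies at concurrency 1 and 2; from concurrency 5 upward the `c * 3` term dominates.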
## Use Cases - **Compare DynamoGraphDeployments** (e.g., aggregated vs disaggregated configurations) - **Compare different backends** (e.g., SGLang vs TensorRT-LLM vs vLLM) - **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix) - **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B) - **Compare different hardware configurations** (e.g., H100 vs A100 vs H200) - **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations) ## AIPerf Quick Reference ### Commonly Used Options ```text aiperf profile [OPTIONS] REQUIRED: --model MODEL Model name (must match the deployed model) --url URL Endpoint URL (e.g., http://localhost:8000) COMMON OPTIONS: --endpoint-type TYPE Endpoint type: chat, completions, embeddings (default: chat) --streaming Enable streaming responses --concurrency N Number of concurrent requests --request-rate N Target requests per second (alternative to --concurrency) --request-count N Total number of requests to send --benchmark-duration N Run for N seconds instead of a fixed request count --synthetic-input-tokens-mean N Average input sequence length in tokens --output-tokens-mean N Average output sequence length in tokens --artifact-dir DIR Output directory for results (default: artifacts/) --warmup-request-count N Warmup requests before measurement --ui TYPE UI mode: dashboard, simple, none (default: dashboard) ``` For the complete CLI reference, see `aiperf profile --help` or the [CLI docs](https://github.com/ai-dynamo/aiperf/blob/main/docs/cli-options.md). 
### Output Sequence Length To enforce a specific output length, pass `ignore_eos` and `min_tokens` via `--extra-inputs`: ```bash aiperf profile \ --model \ --url http://localhost:8000 \ --endpoint-type chat \ --streaming \ --concurrency 10 \ --output-tokens-mean 256 \ --extra-inputs max_tokens:256 \ --extra-inputs min_tokens:256 \ --extra-inputs ignore_eos:true ``` ### Understanding Results Each `aiperf profile` run produces an artifact directory containing: - **`profile_export_aiperf.json`** — Structured metrics (latency, throughput, TTFT, ITL, etc.) - **`profile_export.jsonl`** — Per-request raw data - **`profile_export_aiperf.csv`** — CSV format metrics Results are organized by the `--artifact-dir` you specify. For concurrency sweeps, a common pattern is: ```text artifacts/ ├── deployment-a/ │ ├── c1/ │ │ ├── profile_export_aiperf.json │ │ └── profile_export.jsonl │ ├── c10/ │ ├── c50/ │ └── c100/ ├── deployment-b/ │ ├── c1/ │ ├── c10/ │ ├── c50/ │ └── c100/ └── plots/ # Generated by aiperf plot ├── ttft_vs_throughput.png ├── pareto_curve_throughput_per_gpu_vs_latency.png # If GPU telemetry available └── pareto_curve_throughput_per_gpu_vs_interactivity.png # If GPU telemetry available ``` --- # Server-Side Benchmarking (In-Cluster) Server-side benchmarking runs directly within the Kubernetes cluster, eliminating port-forwarding overhead and enabling high-load testing. ## Prerequisites 1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart)) 2. **Storage**: PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md)) 3. 
**Docker image** containing AIPerf (Dynamo runtime images include it) ## Quick Start ### Step 1: Deploy Your DynamoGraphDeployment Deploy a `DynamoGraphDeployment` using a matching [Dynamo Recipe](https://github.com/ai-dynamo/dynamo/tree/main/recipes), the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide), or the backend examples in [examples/backends](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends). Ensure it has a frontend service exposed and the model is fully loaded before running benchmarks — check pod logs or verify the health endpoint returns `200 OK`. ### Step 2: Configure and Run Benchmark Job If your recipe includes a `perf.yaml`, start from that benchmark job because it already encodes the model, endpoint, workload shape, and result collection expected by the recipe. Otherwise, use the generic job below. First, edit `benchmarks/incluster/benchmark_job.yaml` to match your deployment: - **Model name**: Update the `MODEL` variable - **Service URL**: Update the `URL` variable (use `..svc.cluster.local:port` for cross-namespace access) - **Concurrency levels**: Adjust the `for c in ...` loop - **Docker image**: Update the `image` field if needed Then deploy: ```bash export NAMESPACE=benchmarking # Deploy the benchmark job kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE # Monitor the job kubectl logs -f job/dynamo-benchmark -n $NAMESPACE ``` ### Step 3: Retrieve Results ```bash # Create access pod (skip if already running) kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s # Download the results kubectl cp $NAMESPACE/pvc-access-pod:/data/results ./results # Cleanup kubectl delete pod pvc-access-pod -n $NAMESPACE ``` ### Step 4: Generate Plots ```bash aiperf plot ./results ``` ## Cross-Namespace Service Access When referencing services in other namespaces, use full Kubernetes 
DNS: ```bash # Same namespace --url http://vllm-agg-frontend:8000 # Different namespace --url http://vllm-agg-frontend.production.svc.cluster.local:8000 ``` ## Monitoring and Debugging ```bash # Check job status kubectl describe job dynamo-benchmark -n $NAMESPACE # Follow logs kubectl logs -f job/dynamo-benchmark -n $NAMESPACE # Check pod status kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark # Debug failed pod kubectl describe pod -n $NAMESPACE ``` ### Troubleshooting 1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running 2. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible 3. **Image pull issues**: Ensure the Docker image is accessible from the cluster 4. **Resource constraints**: Adjust resource limits if the job is being evicted ```bash # Check PVC status kubectl get pvc dynamo-pvc -n $NAMESPACE # Verify service exists and has endpoints kubectl get svc -n $NAMESPACE kubectl get endpoints -n $NAMESPACE ``` --- ## Testing with DynoSim / Mocker For development and testing purposes, Dynamo provides DynoSim and the [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) to simulate LLM inference without requiring actual GPU resources. This is useful for: - **Testing deployments** without expensive GPU infrastructure - **Developing and debugging** router, planner, or frontend logic - **CI/CD pipelines** that need to validate infrastructure without model execution - **Benchmarking framework validation** to ensure your setup works before using real backends Mocker is the live simulated engine in DynoSim: it mimics the API and behavior of real backends (SGLang, TensorRT-LLM, vLLM) but generates mock responses instead of running actual inference. Use [DynoSim Runs](/dynamo/user-guides/dynosim/runs) for one simulated workload/config trial and [DynoSim Sweeps](/dynamo/user-guides/dynosim/sweeps) when you want to search across many candidate configurations. 
See [Live Simulation with Mocker](/dynamo/user-guides/dynosim/mocker) for usage examples and configuration options. --- ## Advanced AIPerf Features AIPerf has many capabilities beyond basic profiling. Here are some particularly useful for Dynamo benchmarking: | Feature | Description | Docs | |---------|-------------|------| | Priority Validation | Send per-request `nvext.agent_hints.priority` values and verify router or backend priority behavior under contention | [Priority Scheduling](../agents/priority-scheduling.md#verify-priority-is-working) | | Trace Replay | Replay production traces for deterministic benchmarking | [Trace Replay](https://github.com/ai-dynamo/aiperf/blob/main/docs/benchmark-modes/trace-replay.md) | | Arrival Patterns | Poisson, constant, gamma traffic distributions | [Arrival Patterns](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/arrival-patterns.md) | | Gradual Ramping | Smooth ramp-up of concurrency and request rate | [Ramping](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/ramping.md) | | Warmup Phase | Eliminate cold-start effects from measurements | [Warmup](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/warmup.md) | | Multi-URL Load Balancing | Distribute requests across multiple endpoints | [Multi-URL](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-url-load-balancing.md) | | GPU Telemetry | Collect DCGM metrics during benchmarking | [GPU Telemetry](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/gpu-telemetry.md) | | Goodput Analysis | SLO-based throughput measurement | [Goodput](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/goodput.md) | | Timeslice Analysis | Per-timeslice performance breakdown | [Timeslices](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/timeslices.md) | | Multi-Turn Conversations | Benchmark multi-turn chat workloads | [Multi-Turn](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/multi-turn.md) | | 
Experiment Classification | Baseline vs treatment semantic colors in plots | [Plotting](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorials/plot.md) | # Tool Calling & Reasoning Parsing Dynamo parses tool-call and reasoning markup out of raw model output and surfaces it as OpenAI-compatible `tool_calls` and `reasoning_content` on the response. Tool calling is controlled by the `tool_choice` and `tools` request parameters; reasoning parsing is enabled per-model with a reasoning parser. There are two ways to parse, depending on whether the parser lives in Dynamo's own registry or in an upstream engine frontend (`vllm serve`, `sglang serve`, or `trtllm-serve`). ## Choose a parsing path | Path | When to use | Pages | |------|-------------|-------| | **Dynamo** | Dynamo ships a framework-agnostic Rust parser for the model's tool-call or reasoning format. Default path. | [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo), [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) | | **Engine Fallback** | Use the framework's own parser (vLLM or SGLang today; TRT-LLM in progress) when Dynamo doesn't ship one for your model. | [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback) | Start with the Dynamo path. Fall back to the engine path only when Dynamo's registry doesn't list a parser for your model. For exactly which flags combine and which combinations don't make sense, see [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration). ## Why Dynamo parses in the frontend In `vllm serve`, `sglang serve`, and `trtllm-serve`, tool-call and reasoning parsing happen in each engine's own frontend, with subtle behavioral differences across them. For performance, Dynamo orchestrates routing and tokenization and passes tokens directly to each engine, bypassing the engine's OpenAI server to avoid duplicate work per request. 
So Dynamo implements parsing in its frontend as a framework-agnostic Rust layer — one tested OpenAI-compatible contract across vLLM, SGLang, and TRT-LLM, on a hot path that stays concurrent without a Python GIL bottleneck. The `vllm`/`sglang` chat processors (engine fallback) opt back into the engine's own parser when Dynamo doesn't ship one for your model. ## See Also - [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration) -- which flags combine, and which combinations don't make sense - [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) / [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) -- Dynamo-native parser names - [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback) -- upstream vLLM / SGLang parsers - [Tool Calling Probe Snapshot for Dynamo 1.2](/dynamo/user-guides/parsing/tool-calling-probe-snapshot-for-dynamo-1-2) -- static release-readiness probe results - [Troubleshooting Tool Calls](/dynamo/user-guides/parsing/troubleshooting-tool-calls) -- capture `logprobs` so issues can be localized - [Frontend Configuration Reference](/dynamo/components/frontend/configuration-reference) -- full CLI flag reference # Tool Call Parsing (Dynamo)


You can connect Dynamo to external tools and services using tool calling. When you provide a list of available functions, the model can choose to output arguments for the relevant function(s), which you can then execute to augment the prompt with external information.

Tool calling is controlled using the `tool_choice` and `tools` request parameters.

This page covers parser names for the default Dynamo-native path. If Dynamo does not list a parser for your model, see [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback). For how `--dyn-tool-call-parser` combines with `--dyn-chat-processor` and `--dyn-reasoning-parser` (and which combinations are invalid), see [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration).

## Prerequisites

To enable this feature, set the following flag when launching the backend worker:

- `--dyn-tool-call-parser`: select the tool call parser from the supported list below

```bash
# <backend> can be sglang, trtllm, vllm, etc. based on your installation
python -m dynamo.<backend> --help
```

If your model's default chat template doesn't support tool calling, but the model itself does, you can specify a custom chat template per worker with `python -m dynamo.<backend> --custom-jinja-template <path>`.

If your model also emits reasoning content that should be separated from normal output, see [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) for the supported `--dyn-reasoning-parser` values.

## Supported Tool Call Parsers

The table below lists the currently supported tool call parsers in Dynamo's registry. The **Upstream name** column shows where the vLLM or SGLang parser name differs from Dynamo's -- relevant when using `--dyn-chat-processor vllm` or `sglang` (see [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback)). A blank upstream column means the same name works everywhere. `Dynamo-only` means no upstream parser exists for this format.

| Parser Name | Models | Upstream name | Notes | |---|---|---|---| | `kimi_k2` | Kimi K2 Instruct/Thinking, Kimi K2.5 | | Pair with `--dyn-reasoning-parser kimi` or `kimi_k25` | | `minimax_m2` | MiniMax M2 / M2.1 | vLLM: `minimax` | XML `` | | `deepseek_v4` | DeepSeek V4 Pro / Flash | vLLM: `deepseek_v4`; SGLang: `deepseekv4` | DSML tags (`<|DSML|tool_calls>...`). Aliases: `deepseek-v4`, `deepseekv4` | | `deepseek_v3` | DeepSeek V3, DeepSeek R1-0528+ | SGLang: `deepseekv3` | Special Unicode markers | | `deepseek_v3_1` | DeepSeek V3.1 | Dynamo-only | JSON separators | | `deepseek_v3_2` | DeepSeek V3.2+ | Dynamo-only | DSML tags (`<|DSML|function_calls>...`) | | `qwen3_coder` | Qwen3.5, Qwen3-Coder | | XML `` | | `glm47` | GLM-4.5, GLM-4.7 | Dynamo-only | XML `/` | | `nemotron_deci` | Nemotron-Super / -Ultra / -Deci, Llama-Nemotron-Ultra / -Super | Dynamo-only | `` JSON | | `nemotron_nano` | Nemotron-Nano | Dynamo-only | Alias for `qwen3_coder` | | `gemma4` | Google Gemma 4 (thinking models) | vLLM: `gemma4` | Custom non-JSON grammar with `<\|"\|>` string delimiters and `<\|tool_call>...` markers. Aliases: `gemma-4`. Pair with `--dyn-reasoning-parser gemma4` and `--custom-jinja-template examples/chat_templates/gemma4_tool.jinja` | | `harmony` | gpt-oss-20b / -120b | Dynamo-only | Harmony channel format | | `hermes` | Qwen2.5-\*, QwQ-32B, Qwen3-Instruct, Qwen3-Think, NousHermes-2/3 | vLLM: `qwen2_5`; SGLang: `qwen25` (for Qwen models) | `` JSON | | `phi4` | Phi-4, Phi-4-mini, Phi-4-mini-reasoning | vLLM: `phi4_mini_json` | `functools[...]` JSON | | `pythonic` | Llama 4 (Scout / Maverick) | | Python-list tool syntax | | `llama3_json` | Llama 3 / 3.1 / 3.2 / 3.3 Instruct | | `<\|python_tag\|>` tool syntax | | `mistral` | Mistral / Mixtral / Mistral-Nemo, Magistral | | `[TOOL_CALLS]...[/TOOL_CALLS]` | | `jamba` | Jamba 1.5 / 1.6 / 1.7 | Dynamo-only | `` JSON | | `default` | *(fallback)* | Dynamo-only | Empty JSON config (no start/end tokens). 
Prefer a model-specific parser for production use. | For Kimi K2.5 thinking models, pair `--dyn-tool-call-parser kimi_k2` with `--dyn-reasoning-parser kimi_k25` from [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) so that both `` blocks and tool calls are parsed correctly from the same response. ## Examples ### Launch Dynamo Frontend and Backend ```bash # launch backend worker (or dynamo.vllm) python -m dynamo.sglang --model Qwen/Qwen3.5-4B --dyn-tool-call-parser qwen3_coder --dyn-reasoning-parser qwen3 # launch frontend worker python -m dynamo.frontend ``` ### Tool Calling Request Example ```bash curl -s http://localhost:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{ "model": "Qwen/Qwen3.5-4B", "messages": [ {"role": "user", "content": "What is the weather in San Francisco and New York?"} ], "tools": [{ "type": "function", "function": { "name": "get_weather", "description": "Get the current weather for a location.", "parameters": { "type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"] } } }], "tool_choice": "auto" }' ``` Dynamo parses the tool calls out of the model output and surfaces them as OpenAI-compatible `tool_calls` entries on the response: ```json { "id": "chatcmpl-b415caad-9be0-4d9e-ac6d-9d23bfe57703", "choices": [ { "index": 0, "message": { "role": "assistant", "content": null, "reasoning_content": "The user is asking about the weather in two cities: San Francisco and New York. I need to call the get_weather function for each city. 
I'll make two separate function calls to get the weather information for both locations.\n", "tool_calls": [ { "id": "call-56223a95-3d14-4433-a94e-011f106c0e40", "type": "function", "function": { "name": "get_weather", "arguments": "{\"location\":\"San Francisco\"}" } }, { "id": "call-d5b5772b-6b0c-4120-ad01-623278a937fe", "type": "function", "function": { "name": "get_weather", "arguments": "{\"location\":\"New York\"}" } } ] }, "finish_reason": "tool_calls", "logprobs": null } ], "created": 1778653281, "model": "Qwen/Qwen3.5-4B", ... } ``` If a tool call comes back wrong, add `"logprobs": true` to a single repro request and share the response. See [Troubleshooting Tool Calls](/dynamo/user-guides/parsing/troubleshooting-tool-calls) for what to capture and include when reporting an issue. ## Optional: structural tags You can optionally turn on **xgrammar structural tags** so guided decoding matches the parser's tool-call format at token granularity. See [Structural tag (guided decoding for tool calls)](structural-tag.md). # Reasoning Parsing (Dynamo) Some models emit reasoning or thinking content separately from their final response. Dynamo can split that output into `reasoning_content` and normal assistant content by configuring `--dyn-reasoning-parser` on the backend worker. This page covers parser names for the default Dynamo-native path. If Dynamo does not list a parser for your model, see [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback). For how `--dyn-reasoning-parser` combines with `--dyn-chat-processor` and `--dyn-tool-call-parser` (and which combinations are invalid), see [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration). ## Prerequisites To enable reasoning parsing, launch the backend worker with: - `--dyn-reasoning-parser`: select the reasoning parser from the supported list below ```bash # can be sglang, trtllm, vllm, etc. based on your installation python -m dynamo. 
--help ``` Some models need both a reasoning parser and a tool call parser. For supported tool call parser names, see [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo). ## Supported Reasoning Parsers The table below lists the currently supported reasoning parsers in Dynamo's registry. The **Upstream name** column shows where the vLLM or SGLang parser name differs from Dynamo's -- relevant when using `--dyn-chat-processor vllm` or `sglang` (see [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback)). A blank upstream column means the same name works everywhere. `Dynamo-only` means no upstream parser exists for this format. Parsers marked **force-reasoning** emit reasoning content from token one without requiring an explicit opening tag (``, etc.). All others require the opening tag to be present in the model output. | Parser Name | Models | Upstream name | Force-reasoning | Notes | |---|---|---|---|---| | `kimi_k25` | Kimi K2.5 / Kimi K2.6 format-compatible thinking models | Dynamo-only | Yes | `...` with force-reasoning | | `kimi` | Kimi K2 Instruct / Thinking with Unicode delimiters | Dynamo-only | No | `◁think▷...◁/think▷` | | `minimax_append_think` | MiniMax M2 / M2.1 | Dynamo-only | No | Implicit opening `` prepended | | `deepseek_v4` | DeepSeek V4 Pro / Flash | vLLM: `deepseek_v4`; SGLang: `deepseek-v4` | No | `...`. Aliases: `deepseek-v4`, `deepseekv4` | | `deepseek_r1` | DeepSeek R1, DeepSeek V3.1, DeepSeek V3.2 | | Yes | Pass explicitly for V3.1/V3.2 (no alias) | | `qwen3` | Qwen3.5, QwQ-32B, Qwen3-Think, Qwen3-Coder | | No | `...` | | `glm45` | GLM-4.5, GLM-4.7 | Dynamo-only | No | Alias for `nemotron_deci`. `...` | | `nemotron3` | Nemotron-3 / Mini | vLLM: `nemotron_v3` | Yes | Alias for `deepseek_r1`. 
Also accepts `nemotron_v3` | | `nemotron_deci` | Nemotron-Super / -Ultra / -Deci, Llama-Nemotron | Dynamo-only | No | `...` | | `nemotron_nano` | Nemotron-Nano | Dynamo-only | Yes | Alias for `deepseek_r1` | | `gemma4` | Google Gemma 4 (thinking models) | vLLM: `gemma4` | No | `<\|channel>thought\n...` with `thought\n` role label stripped. Aliases: `gemma-4` | | `gpt_oss` | gpt-oss-20b / -120b | Dynamo-only | No | Harmony channel reasoning format | | `mistral` | Magistral | | Yes | `[THINK]...[/THINK]` | | `granite` | IBM Granite 3.x / Granite 3.2 language models | | No | `Here's my thought process:` / `Here's my response:` | | `step3` | Step-3 / Step-3-Reasoning | Dynamo-only | Yes | `...` | | `basic` | Generic CoT models | Dynamo-only | No | Plain `...` | ## Common Parser Pairings Some models need both parsers configured together. Common pairings include: - `openai/gpt-oss-*`: `--dyn-tool-call-parser harmony --dyn-reasoning-parser gpt_oss` - `deepseek-ai/DeepSeek-V4-*`: `--dyn-tool-call-parser deepseek_v4 --dyn-reasoning-parser deepseek_v4` - `zai-org/GLM-4.7`: `--dyn-tool-call-parser glm47 --dyn-reasoning-parser glm45` - `moonshotai/Kimi-K2.5*` / Kimi K2.6 format-compatible outputs: `--dyn-tool-call-parser kimi_k2 --dyn-reasoning-parser kimi_k25` - `google/gemma-4-*` thinking models: `--dyn-tool-call-parser gemma4 --dyn-reasoning-parser gemma4 --custom-jinja-template examples/chat_templates/gemma4_tool.jinja` - `Qwen/Qwen3.5*`: `--dyn-tool-call-parser qwen3_coder --dyn-reasoning-parser qwen3` - MiniMax M2.1 style outputs: `--dyn-tool-call-parser minimax_m2 --dyn-reasoning-parser minimax_append_think` ## Tool Calling Interplay Reasoning parsing happens before tool call parsing. If a model emits both reasoning content and tool calls, configure both parsers so Dynamo can first separate reasoning text and then parse tool calls from the remaining assistant output. 
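The ordering described above (reasoning first, then tool calls on what remains) can be illustrated with a small Python sketch. This is not Dynamo's Rust parser. It borrows the `[THINK]...[/THINK]` and `[TOOL_CALLS]...[/TOOL_CALLS]` markers listed for the `mistral` parsers purely as an example format, assumes the tool-call payload is a JSON list of `{"name", "arguments"}` objects, and requires the opening reasoning tag to be present (it does not model force-reasoning). All function names are hypothetical.

```python
import json
import re

THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)
TOOL_RE = re.compile(r"\[TOOL_CALLS\](.*?)\[/TOOL_CALLS\]", re.DOTALL)


def split_reasoning(raw: str):
    """Stage 1: pull reasoning text out and return (reasoning, remainder)."""
    parts = [m.strip() for m in THINK_RE.findall(raw)]
    remainder = THINK_RE.sub("", raw)
    return ("\n".join(parts) or None), remainder


def parse_tool_calls(text: str):
    """Stage 2: extract tool calls from whatever stage 1 left behind."""
    calls = []
    for payload in TOOL_RE.findall(text):
        # Assumed payload shape: JSON list of {"name": ..., "arguments": {...}}
        for call in json.loads(payload):
            calls.append({
                "type": "function",
                "function": {
                    "name": call["name"],
                    # OpenAI-style responses carry arguments as a JSON string
                    "arguments": json.dumps(call["arguments"], separators=(",", ":")),
                },
            })
    content = TOOL_RE.sub("", text).strip() or None
    return calls, content


def parse(raw: str) -> dict:
    reasoning, remainder = split_reasoning(raw)
    tool_calls, content = parse_tool_calls(remainder)
    return {"reasoning_content": reasoning, "content": content, "tool_calls": tool_calls}


raw = (
    "[THINK]Need the weather.[/THINK]"
    '[TOOL_CALLS][{"name": "get_weather", "arguments": {"location": "San Francisco"}}][/TOOL_CALLS]'
)
result = parse(raw)
print(json.dumps(result, indent=2))
```

One consequence of running the stages in this order, at least in this sketch, is that tool-call-like markup the model writes inside its reasoning is removed with the reasoning text and is never interpreted as a real tool call.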
## Examples ### Launch Dynamo Frontend and Backend ```bash # launch backend worker (or dynamo.vllm) python -m dynamo.sglang --model Qwen/Qwen3.5-4B --dyn-tool-call-parser qwen3_coder --dyn-reasoning-parser qwen3 # launch frontend worker python -m dynamo.frontend ``` ### Reasoning Request Example ```bash curl -s http://localhost:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{ "model": "Qwen/Qwen3.5-4B", "messages": [{"role": "user", "content": "If a train leaves at 3pm going 60 mph and another leaves at 4pm going 80 mph, when does the second catch up?"}] }' ``` Dynamo splits the model output so the chain-of-thought lands in `reasoning_content` and the user-facing answer stays in `content`: ```json { "choices": [ { "index": 0, "message": { "role": "assistant", "reasoning_content": "The first train has a 1-hour head start at 60 mph, so it is 60 miles ahead at 4pm. The second train closes the gap at 80 - 60 = 20 mph. 60 / 20 = 3 hours after 4pm.", "content": "The second train catches up at 7pm." }, "finish_reason": "stop" } ] } ``` # Parser Engine Fallback When Dynamo's registry does not list a tool-call or reasoning parser for your model, fall back to the upstream engine's parser via a **chat-processor swap**, which keeps frontend tokenization and KV routing. For the Dynamo-native default path, see [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) and [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo). How `--dyn-chat-processor` combines with the parser flags — and which combinations are invalid (engine fallback supports disaggregated serving on vLLM and SGLang; TRT-LLM engine fallback is a work in progress) — is documented once in [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration). Read that first; this page covers only the engine-fallback specifics. ## Configuration Engine fallback runs parsing in the engine's own Python frontend. 
Select it with `--dyn-chat-processor vllm` or `sglang`, then name the parser with the engine's **frontend** flags: - `--tool-call-parser ` — the engine's tool-call parser - `--reasoning-parser ` — the engine's reasoning parser These are distinct from the Dynamo-native `--dyn-tool-call-parser` / `--dyn-reasoning-parser` (which go on the worker). The accepted names come from the engine's registry and may differ from Dynamo's — e.g. vLLM `nemotron_v3` vs Dynamo `nemotron3`, SGLang `deepseekv3` vs Dynamo `deepseek_v3`. ## Examples ```bash # vLLM chat processor — frontend carries the parser flags, then launch the worker: python -m dynamo.frontend --dyn-chat-processor vllm --tool-call-parser hermes --reasoning-parser qwen3 python -m dynamo.vllm --model Qwen/Qwen3-0.6B # SGLang chat processor python -m dynamo.frontend --dyn-chat-processor sglang --tool-call-parser qwen25 --reasoning-parser qwen3 python -m dynamo.sglang --model Qwen/Qwen3-0.6B ``` If a tool call or reasoning split comes back wrong, add `"logprobs": true` to a single repro request and share the response. See [Troubleshooting Tool Calls](/dynamo/user-guides/parsing/troubleshooting-tool-calls) for what to capture. 
## See Also - [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration) -- how the chat-processor and parser flags combine, and which combinations are invalid (start here) - [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) -- Dynamo-native tool-call parser names - [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) -- Dynamo-native reasoning parser names - [vLLM Chat Processor](/dynamo/backends/v-llm/frontend-processor-fallback) -- vLLM chat-processor details - [SGLang Chat Processor](/dynamo/backends/sg-lang/frontend-processor-fallback) -- SGLang chat-processor details - [Frontend Configuration Reference](/dynamo/components/frontend/configuration-reference) -- Full CLI flag reference # Parser Configuration Dynamo turns a model's raw tool-call and reasoning markup into structured `tool_calls` and `reasoning_content`. Two independent choices control how that parsing happens. This page is the single source of truth for **which flags combine and which combinations don't make sense**. For the parser *names* themselves, follow the per-stage links at the bottom. ## The choices **1. Who parses — `--dyn-chat-processor`** (a *frontend* flag; default `dynamo`): - `dynamo` (default) — Dynamo's framework-agnostic Rust parser. Works on every backend (vLLM, SGLang, TRT-LLM) and with disaggregated serving. - `vllm` / `sglang` — delegate parsing to that engine's own Python parser ("engine fallback"). Use only when Dynamo does not ship a parser for your model. **2. 
Which parser** — the flag name *and where it goes* depend on choice 1: | Parser Implementation | Parser flag(s) and where they go | Parses with | Disaggregated serving | Backends | |---|---|---|---|---| | `dynamo` (default) | `--dyn-tool-call-parser ` and/or `--dyn-reasoning-parser ` — on the **worker** | Dynamo Rust frontend | Supported | vLLM, SGLang, TRT-LLM | | `vllm` | `--tool-call-parser ` and/or `--reasoning-parser ` — on the **frontend** | vLLM Python | Supported | vLLM | | `sglang` | `--tool-call-parser ` and/or `--reasoning-parser ` — on the **frontend** | SGLang Python | Supported | SGLang | ## The pairing rule - The **`--dyn-*` parser flags pair with the `dynamo` chat processor** and go on the **worker**: `--dyn-tool-call-parser`, `--dyn-reasoning-parser`. - The **bare `--tool-call-parser` / `--reasoning-parser` flags pair with `vllm` / `sglang`** and go on the **frontend**. Tool calling and reasoning are independent — set one, the other, or both — but always from the same family as your chat processor. You never mix the two families. ## What does NOT make sense | Combination | Why it's wrong | |---|---| | `--dyn-chat-processor dynamo` + `--tool-call-parser` / `--reasoning-parser` | The bare flags drive the engine-fallback path; the default Dynamo path uses the `--dyn-` flags. Use `--dyn-tool-call-parser` / `--dyn-reasoning-parser`. | | `--dyn-chat-processor vllm`/`sglang` + `--dyn-tool-call-parser` / `--dyn-reasoning-parser` | The `--dyn-` flags only drive Dynamo's native parser; an engine processor reads its own `--tool-call-parser` / `--reasoning-parser`. | | `--dyn-chat-processor vllm`/`sglang` on TRT-LLM | TRT-LLM engine fallback is a work in progress. Use the default `dynamo` processor. | | Reusing a parser name across families | The registries differ — e.g. Dynamo `deepseek_v3` vs vLLM/SGLang `deepseekv3`, Dynamo `nemotron3` vs vLLM `nemotron_v3`. Use the name from the registry that matches your chat processor. 
| ## Examples Default (Dynamo-native) — the common case. The same `--dyn-*` flags work on every backend; pick one worker. The chat processor defaults to `dynamo`, so the frontend flag is optional: ```bash # Frontend — chat processor defaults to `dynamo`, so these two are identical: python -m dynamo.frontend python -m dynamo.frontend --dyn-chat-processor dynamo # Worker selects the Dynamo parsers — same flags on vLLM, SGLang, or TRT-LLM: python -m dynamo.vllm --model Qwen/Qwen3-0.6B \ --dyn-tool-call-parser hermes --dyn-reasoning-parser qwen3 python -m dynamo.sglang --model Qwen/Qwen3-0.6B \ --dyn-tool-call-parser hermes --dyn-reasoning-parser qwen3 python -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B \ --dyn-tool-call-parser hermes --dyn-reasoning-parser qwen3 ``` Engine fallback — only when Dynamo lacks a parser for your model. Supported on vLLM and SGLang (not TRT-LLM); the parser flags go on the **frontend** and use the engine's own parser names: ```bash # vLLM chat processor — frontend carries the parser flags, then launch the worker: python -m dynamo.frontend --dyn-chat-processor vllm --tool-call-parser hermes --reasoning-parser qwen3 python -m dynamo.vllm --model Qwen/Qwen3-0.6B # SGLang chat processor python -m dynamo.frontend --dyn-chat-processor sglang --tool-call-parser qwen25 --reasoning-parser qwen3 python -m dynamo.sglang --model Qwen/Qwen3-0.6B ``` ## Parser names and per-stage details - Tool calling: [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) (native parser names). - Reasoning: [Reasoning Parsing (Dynamo)](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) (native parser names). - Engine fallback (vLLM / SGLang): [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback). - Engine processors: [vLLM Chat Processor](/dynamo/backends/v-llm/frontend-processor-fallback) and [SGLang Chat Processor](/dynamo/backends/sg-lang/frontend-processor-fallback). 
- Every frontend flag: [Frontend Configuration Reference](/dynamo/components/frontend/configuration-reference). # Tool Calling Probe Snapshot for Dynamo 1.2 This page captures a one-time Dynamo 1.2.0 release snapshot from the tool-calling probe harness generated on 2026-06-05 at 07:24 UTC. It is not a live dashboard. Failures are non-passing probe requests, and lower is better. The same scenario can contribute separate failures for streaming and non-streaming request modes. `Dynamo errors` counts Dynamo/parser/API-contract failures, including boundary cases. It also counts Dynamo runtime or endpoint/deployment failures where the request timed out before a usable OpenAI response was returned. `Other errors` counts engine/model behavior and mixed/needs inspection failures. Issue notes use the probe classifier: - **Dynamo/parser likely**: raw model-native tool-call syntax leaked into the OpenAI response instead of structured `tool_calls`, final assistant text was routed into reasoning output, delimiter-like literal text was not preserved in a structured argument, or the parser/API contract was otherwise not satisfied. - **Engine/model behavior likely**: the endpoint returned a response, but the model behavior did not satisfy the requested tool workflow. - **Endpoint/deployment**: the request timed out before a usable response. These are counted as Dynamo runtime failures in this static release table. - **Mixed/needs inspection**: raw request/response details need follow-up before assigning ownership. Some current-main rows were run with a different number of probes than the Dynamo 1.2.0 snapshot. Compare each `failures / total` count directly instead of treating every row as an exact A/B pass-rate comparison. The release-note cells below are based on the failed request and response artifacts for both Dynamo 1.2.0 and current main. With this classification, Dynamo runtime/parser/API failures improve on Kimi K2.6, GLM 5.1, and Qwen3.6-35B-A3B. 
MiniMax 2.7 improves in total failures, but its remaining parser-boundary failure count is unchanged.

| Model | Tool-call format | 1.2.0 total | 1.2.0 Dynamo errors | 1.2.0 other errors | Main total | Main Dynamo errors | Main other errors | Current failures | Improvement from 1.2 to main |
|---|---|---|---|---|---|---|---|---|---|
| Kimi K2.6 | Kimi tool-call and reasoning format | 22 / 36 | 21 | 1 | 2 / 36 | 0 | 2 | Current main only fails a multi-step search-and-crawl workflow in streaming and non-streaming modes. The model returns no structured tool calls and asks for endpoint clarification instead of executing the workflow. No raw marker leakage was observed in current main. | Dynamo 1.2.0 had 18 parser/API-boundary failures and three endpoint timeouts. Model-native tool-call syntax appeared in reasoning instead of structured `tool_calls`, and some final assistant text was routed away from assistant content. Current main removes those Dynamo failures and leaves two model-workflow failures. |
| DeepSeek V4 Pro | DeepSeek tool-call and reasoning format | 0 / 46 | 0 | 0 | 0 / 46 | 0 | 0 | No failures in the captured current-main run. | No change needed. Dynamo 1.2.0 and current main are both clean. |
| GLM 5.1 | GLM tool-call format | 4 / 48 | 4 | 0 | 3 / 48 | 3 | 0 | Current main still fails delimiter-literal preservation in streaming and non-streaming modes because delimiter-looking text is not preserved in the structured argument. One non-streaming no-tools request also timed out. | Current main improves from 4 to 3 Dynamo/runtime failures by removing a Dynamo 1.2.0 timeout in the multi-step search-and-crawl workflow. The delimiter-string preservation issue remains. |
| MiniMax 2.7 | MiniMax tool-call format | 8 / 46 | 2 | 6 | 4 / 46 | 2 | 2 | Current main has four failures. A simple arithmetic auto-tool prompt answers in text instead of producing the requested structured tool call in streaming and non-streaming modes. A delimiter-like literal string prompt returns a structured tool call in both modes, but the marker-looking text inside the argument is not preserved exactly; this is counted as a parser/API-boundary failure. | Current main now uses the full 46-probe coverage and improves from 8 failures to 4. The multi-step tool-loop workflow and context echo auto-tool prompt that failed in Dynamo 1.2.0 now pass. Dynamo/parser-boundary failures remain at 2, while other failures drop from 6 to 2. |
| Gemma 4 31B IT | Gemma tool-call and reasoning format | 2 / 48 | 2 | 0 | 2 / 46 | 2 | 0 | Current main still fails delimiter-literal preservation in streaming and non-streaming modes. The response produces a structured tool call, but the SQL string is truncated before the expected literal marker text. | No observed failure-count improvement. Dynamo 1.2.0 and current main have the same failure class, with fewer probes in the current-main run. |
| Qwen3.6-35B-A3B | Qwen tool-call format | 1 / 48 | 1 | 0 | 0 / 46 | 0 | 0 | No failures in the captured current-main run. | Current main is clean. The Dynamo 1.2.0 non-streaming timeout in the multi-step search-and-crawl workflow is gone. |
| GPT-OSS 120B | GPT-OSS tool-call format | 14 / 48 | 2 | 12 | 14 / 48 | 2 | 12 | Current main still has 14 failures. Multi-tool and parallel-tool prompts produce only one structured tool call, a simple calculation prompt answers in text instead of calling the tool, a marker-literal string argument omits the requested marker-like text, and the search/crawl final answer still misses the expected evidence. No raw model-native marker leakage was observed. | The refreshed GPT-OSS current-main run is no longer worse than Dynamo 1.2.0 by count; both are 14 / 48. The prior main-only required-tool regression is gone, and the streaming multi-step workflow now returns final content instead of an empty assistant message, but the core multi-tool, parallel-tool, literal-marker, and final-answer gaps remain. |

# Troubleshooting Tool Calls When a tool call comes back wrong (`tool_calls` is `null`, the arguments look malformed, raw `` markers appear in `message.content`, or `finish_reason` is `"stop"` when you expected `"tool_calls"`), the request and response alone usually do not say *where* the bug is. The model and the parser produce indistinguishable failures from the response side. Adding `"logprobs": true` to a single repro request makes the engine's raw token output visible in the response. That is enough for someone on the Dynamo team to identify whether the issue is in the model, the parser configuration, or the parser itself. This page shows the field to add and what the response will look like, so you can capture and share useful diagnostic info. Recipe applies to non-streaming requests against Dynamo's OpenAI `/v1/chat/completions` endpoint. For multi-channel reasoning models (`harmony`, `kimi_k2`, `kimi_k25`, `gemma4`), the recipe recovers only the assistant-content channel; the reasoning channel is not surfaced in `logprobs.content`. If the worker is the SGLang backend, `logprobs: true` is rejected by default because SGLang's tokenizer manager detokenizes top-k tokens serially, causing latency degradation. Launch the worker with `DYN_SGL_ALLOW_TOP_LOGPROBS=1` set in the environment to opt in for the duration of the repro request, then unset it afterward. Tracked at [sgl-project/sglang#24447](https://github.com/sgl-project/sglang/pull/24447). 
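Once a response with `logprobs` is in hand, the raw model output can be recovered by concatenating the per-token `bytes` arrays in order. The following is a minimal Python sketch, assuming the response shape shown under "The response" on this page; the function name `reconstruct_raw_output` and the inline sample are illustrative, and entries without a `bytes` array (such as an elided `"..."` placeholder) are skipped.

```python
def reconstruct_raw_output(response: dict, choice: int = 0) -> str:
    """Concatenate per-token `bytes` arrays back into the raw model text."""
    entries = response["choices"][choice]["logprobs"]["content"]
    raw = bytearray()
    for entry in entries:
        # Skip anything that is not a token object carrying a bytes array.
        if isinstance(entry, dict) and entry.get("bytes") is not None:
            raw.extend(entry["bytes"])
    # Decode once at the end: a multi-byte UTF-8 character may span tokens.
    return raw.decode("utf-8", errors="replace")


# Illustrative sample in the same shape as a real response.
sample = {
    "choices": [{
        "logprobs": {
            "content": [
                {"token": "<tool_call>",
                 "bytes": [60, 116, 111, 111, 108, 95, 99, 97, 108, 108, 62]},
                {"token": "\n", "bytes": [10]},
                {"token": "{\"", "bytes": [123, 34]},
                {"token": "name", "bytes": [110, 97, 109, 101]},
                "...",
            ]
        }
    }]
}
raw_text = reconstruct_raw_output(sample)
print(repr(raw_text))
```

The bytes are joined before decoding rather than decoded token by token, so a character split across two tokens still decodes correctly.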
## The request

Add `"logprobs": true` to your failing request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the weather in NYC?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"},
            "unit": {"enum": ["celsius", "fahrenheit"]}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 0.0,
    "logprobs": true
  }'
```

## The response

You will get back the usual fields (`message.tool_calls`, `message.content`, `finish_reason`) plus a new `choices[0].logprobs.content` field carrying the engine's raw token stream:

```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"New York, NY\",\"unit\":\"fahrenheit\"}"
        }
      }]
    },
    "logprobs": {
      "content": [
        {"token": "<tool_call>", "bytes": [60, 116, 111, 111, 108, 95, 99, 97, 108, 108, 62]},
        {"token": "\n", "bytes": [10]},
        {"token": "{\"", "bytes": [123, 34]},
        {"token": "name", "bytes": [110, 97, 109, 101]},
        "...",
        {"token": "</tool_call>", "bytes": [60, 47, 116, 111, 111, 108, 95, 99, 97, 108, 108, 62]}
      ]
    }
  }]
}
```

Each entry in `logprobs.content` is one generated token with its exact UTF-8 `bytes`. Concatenating those bytes in order reconstructs the raw model output, before any tool-call parser touched it. That is the key piece for triage: it tells us what the model actually produced, separately from what the parser made of it.

## What to include when reporting an issue

Share these four things in the bug report or issue thread:

1. **The full request body** (model name, messages, tools, sampling params, and `logprobs: true`).
2. **The full response body.** Do not truncate `logprobs.content` -- the per-token entries are the part that matters.
3. **The Dynamo version and the backend** (vLLM, SGLang, TRT-LLM, including versions if known).
4. **The worker launch command**, especially the `--dyn-tool-call-parser` value if set.

With those four pieces, the Dynamo team can usually localize the bug without standing up your model. The team will reconstruct the raw stream from the `bytes` arrays and compare it against `message.content` and `message.tool_calls` to decide whether the issue is in the model output, the parser configuration, or the parser logic.

## See also

- [Tool Call Parsing (Dynamo)](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) -- Dynamo-native parser names and request examples
- [Parser Engine Fallback](/dynamo/user-guides/parsing/parser-engine-fallback) -- `--dyn-chat-processor` fallback path to vLLM and SGLang parsers
- [Frontend Configuration Reference](/dynamo/components/frontend/configuration-reference) -- full CLI flag reference for the frontend and worker

# Fault Tolerance

Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
## Overview Fault tolerance in Dynamo operates at multiple levels: | Layer | Mechanism | Purpose | |-------|-----------|---------| | **Request** | Migration, Cancellation | Handle in-flight request failures | | **Worker** | Health Checks, Graceful Shutdown | Detect and recover from worker failures | | **Engine Process** | [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover) | Active/passive recovery for same-node engine-process failures in Kubernetes | | **System** | Load Shedding, Request Rejection | Prevent system overload | | **Infrastructure** | etcd HA, NATS resilience | Handle infrastructure component failures | ## Key Features ### Request Migration When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers. The migration system: - Preserves partial generation state (accumulated tokens) - Transparently continues generation on a new worker - Maintains seamless token flow to clients See [Request Migration](/dynamo/user-guides/fault-tolerance/request-migration) for details. ### Request Cancellation Dynamo supports canceling in-flight requests to free computational resources: - Graceful stop signals for clean termination - Kill signals for immediate termination - Hierarchical cancellation propagation through request chains See [Request Cancellation](/dynamo/user-guides/fault-tolerance/request-cancellation) for details. ### Graceful Shutdown Workers handle shutdown signals (SIGTERM/SIGINT) gracefully: - Immediately stop accepting new requests - Optionally drain in-flight requests before terminating - Clean up resources (engines, connections, temp files) See [Graceful Shutdown](/dynamo/user-guides/fault-tolerance/graceful-shutdown) for details. 
### Request Rejection (Load Shedding) When workers are overloaded, Dynamo rejects new requests to prevent cascading failures: - Configurable busy thresholds based on KV cache utilization - Real-time worker load monitoring - HTTP 503 responses with retry guidance See [Request Rejection](/dynamo/user-guides/fault-tolerance/request-rejection) for details. ### Health Checks Dynamo provides multiple health check mechanisms: - **HTTP Endpoints**: `/health` and `/live` endpoints for orchestration - **Canary Health Checks**: Active monitoring via periodic test requests - **Engine Monitoring**: Automatic shutdown on engine failure detection See [Health Checks](/dynamo/user-guides/observability-local/health-checks) for details. ### Shadow Engine Failover For Kubernetes deployments, [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover) can help with same-node recovery from unknown backend engine or software-process failures. It uses GPU Memory Service to keep model weights resident while a standby or replacement engine attaches. It does not preserve in-flight requests or KV cache state, and it does not cover GPU or node loss. ## Configuration Quick Reference | Feature | Environment Variable | Default | |---------|---------------------|---------| | Worker health port | `DYN_SYSTEM_PORT` | `9090` | | Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` | | Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds | | Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds | | Decode blocks threshold | `DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD` | `1.0` | | Prefill tokens threshold | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD` | `10000000` | ## Failure Scenarios and Recovery ### Worker Pod Restart 1. Worker receives SIGTERM from Kubernetes 2. Endpoints are immediately invalidated (no new requests) 3. In-flight requests complete or migrate (based on configuration) 4. Resources are cleaned up 5. 
Pod restarts with fresh state ### Worker Crash (Unexpected) 1. etcd lease expires (TTL-based detection) 2. Client discovers endpoint removal via etcd watch 3. New requests route to remaining healthy workers 4. In-flight requests on crashed worker are migrated (if enabled) ### Network Partition 1. Worker loses connectivity to etcd/NATS 2. Lease keep-alive fails, lease eventually expires 3. Worker is removed from service discovery 4. Traffic reroutes to reachable workers ### GPU Failure 1. Engine health check detects GPU error (XID, OOM, etc.) 2. Worker initiates graceful shutdown 3. Runtime is shut down, engine cleaned up 4. Process exits with code 1 for pod restart ## Testing Fault Tolerance Dynamo includes a comprehensive testing framework for validating fault tolerance: - Request cancellation tests - Migration tests with worker failures - etcd HA failover tests - Hardware fault injection (GPU XID, network partitions) See [Fault Tolerance Testing](/dynamo/user-guides/fault-tolerance/testing) for details. ## Related Documentation - [Observability](/dynamo/user-guides/observability-local) - Metrics and monitoring - [Shadow Engine Failover](/dynamo/kubernetes-deployment/advanced-platform/shadow-engine-failover) - Same-node active/passive engine failover for Kubernetes deployments - [Distributed Runtime](/dynamo/design-docs/distributed-runtime) - Service discovery architecture - [Event Plane](/dynamo/design-docs/communication-planes/event-plane) - Pub/sub for KV cache events and worker metrics - [Discovery Plane](/dynamo/design-docs/communication-planes/discovery-plane) - Service discovery and coordination # Request Migration


This document describes how Dynamo implements request migration to handle worker failures gracefully during LLM text generation. Request migration allows in-progress requests to continue on different workers when the original worker becomes unavailable, providing fault tolerance and improved user experience. ## Overview Request migration is implemented through a Migration operator that sits in the LLM processing pipeline between the Backend operator and the service backend. When a worker fails during request processing, the migration system preserves the partial generation state and recreates the request on a new worker to continue from where the previous worker left off. ## Architecture Components ### Migrator The migration system is integrated into the LLM processing pipeline between the frontend preprocessing and the actual service backends. This positioning allows it to intercept all communication flows and manage failure scenarios transparently. Key responsibilities: - Intercepts all requests and responses flowing through the pipeline - Detects worker failure scenarios through error pattern matching - Manages retry logic with configurable migration limits - Tracks partial response state for seamless continuation ### Migration Limit Configuration The migration limit is configured at the **frontend** level and applies globally to all models served by that frontend. This parameter specifies the maximum number of times a request can be migrated to another worker: - Default behavior: no migration allowed (migration_limit=0) - Set via `--migration-limit` flag on the frontend - Applies to all models served by the frontend ### Max Sequence Length Configuration The max sequence length setting controls how long the migration system will cache token state for a request. 
Once the total sequence length (prompt + generated tokens) exceeds this limit, migration is disabled for that request and token tracking stops:

- Default behavior: no limit (`--migration-max-seq-len` unset)
- Set via `--migration-max-seq-len` flag or `DYN_MIGRATION_MAX_SEQ_LEN` environment variable on the frontend
- Prevents unbounded memory growth from caching long sequences
- Boundary: exactly at the limit is still migratable; only strictly exceeding it disables migration
- The check runs both at request initialization (prompt length) and during generation (prompt + output tokens)

## Token State Tracking and Request Migration

The core of the migration system is the ability to preserve and continue partial generations through token state management. This ensures that when a worker fails mid-generation, the new worker can seamlessly continue from the exact point of failure.

### Token Accumulation Process

When a request is being processed and responses are flowing back from a worker, the migration system tracks every token that has been successfully generated:

1. **Initial Request State**: The system starts with the original preprocessed request containing the initial prompt tokens.
2. **Response Tracking**: As each response arrives from the worker, the migration system extracts the newly generated tokens and appends them to the request's token sequence. This accumulates all tokens that have been generated so far.
3. **Token Count Management**: The system also updates the remaining token budget to reflect the number of tokens already generated, ensuring that the total generation stays within the originally requested limits.

### Migration Trigger Scenarios

The migration system handles two distinct failure scenarios:

#### 1. New Request Migration (Initial Connection Failure)

**Scenario**: Worker is unreachable when creating the initial connection.

**Error Pattern**: Communication system reports chosen worker instance is unavailable.

**Migration Process**: - Detects connection failure during initial stream setup - Decrements migration retry count - Attempts to create a new stream with the original request - No partial state to preserve since generation hasn't started #### 2. Ongoing Request Migration (Mid-Stream Disconnection) **Scenario**: Connection lost during active generation after partial responses have been received. **Error Pattern**: Stream termination detected before generation completion. **Migration Process**: 1. **Failure Detection**: The system detects the stream disconnection through error monitoring. 2. **State Preservation**: At this point, the request's token sequence contains both the original prompt tokens and all successfully generated tokens from the failed worker. 3. **New Stream Creation**: A fresh stream is created with the accumulated request state, ensuring the new worker has complete context. 4. **Continuation**: The new worker receives the request with the full token context and continues generation from the exact point where the previous worker left off. ### Seamless Token Flow and Request State Evolution From the client's perspective, the token stream appears continuous and uninterrupted. The client receives tokens from the first worker until failure occurs, then seamlessly continues receiving tokens from the backup worker without any indication of the underlying migration. The request state evolves dynamically during processing. Initially, the request contains only the original prompt tokens. As generation proceeds, each successfully generated token is appended to the request's token sequence, creating a growing record of the complete conversation context. When a migration occurs, this accumulated state is transferred to the new worker, which uses it to reconstruct the complete context. The new worker then continues generation as if it had been processing the request from the beginning, but starting from the current position in the sequence. 
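The accumulation and retry behavior described above can be sketched in Python. This is a simplified model under stated assumptions, not Dynamo's Migration operator: `RequestState`, `generate_with_migration`, the two exception types, and the toy workers are hypothetical names, and token IDs are plain integers. The `migration_limit` and `max_seq_len` parameters mirror the `--migration-limit` and `--migration-max-seq-len` settings.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


class WorkerUnavailable(Exception):
    """Hypothetical: the chosen worker could not be reached at stream setup."""


class StreamDisconnected(Exception):
    """Hypothetical: the stream ended before generation completed."""


@dataclass
class RequestState:
    token_ids: list[int]  # prompt tokens followed by every generated token
    max_tokens: int       # remaining generation budget


# A worker takes the accumulated tokens and remaining budget, then streams tokens.
Worker = Callable[[list[int], int], Iterator[int]]


def generate_with_migration(
    state: RequestState,
    pick_worker: Callable[[], Worker],
    migration_limit: int = 0,
    max_seq_len: Optional[int] = None,
) -> Iterator[int]:
    retries_left = migration_limit
    # Exactly at the limit is still migratable; only exceeding it stops tracking.
    tracking = max_seq_len is None or len(state.token_ids) <= max_seq_len
    while True:
        try:
            worker = pick_worker()
            for token in worker(list(state.token_ids), state.max_tokens):
                if tracking:
                    state.token_ids.append(token)
                    state.max_tokens -= 1
                    if max_seq_len is not None and len(state.token_ids) > max_seq_len:
                        tracking = False
                yield token
            return
        except (WorkerUnavailable, StreamDisconnected):
            if not tracking or retries_left == 0:
                raise  # migration disabled or exhausted: surface the error
            retries_left -= 1  # retry elsewhere with the accumulated state


def flaky_worker(token_ids: list[int], max_tokens: int) -> Iterator[int]:
    """Toy worker: emits two tokens, then fails mid-stream."""
    start = token_ids[-1] + 1
    for i in range(min(2, max_tokens)):
        yield start + i
    raise StreamDisconnected("worker died mid-stream")


def healthy_worker(token_ids: list[int], max_tokens: int) -> Iterator[int]:
    """Toy worker: continues from the last token until the budget is spent."""
    start = token_ids[-1] + 1
    for i in range(max_tokens):
        yield start + i


workers = iter([flaky_worker, healthy_worker])
state = RequestState(token_ids=[1, 2, 3], max_tokens=5)
out = list(generate_with_migration(state, lambda: next(workers), migration_limit=1))
print(out)
```

In this toy run the client-side iterator receives one uninterrupted sequence even though the first worker failed after two tokens, because the second worker is handed the prompt plus those two tokens and the reduced budget.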
The migration is transparent because: 1. No tokens are lost or duplicated during the transition 2. The new worker has complete context via the accumulated token sequence 3. Generation continues from the exact failure point 4. Response streaming maintains consistent format and timing This token accumulation mechanism ensures that migrations are truly seamless, preserving all computational work and maintaining generation quality across worker transitions. ## Benefits 1. **Fault Tolerance**: System continues operating during individual worker failures 2. **Resource Efficiency**: Partial generations are preserved rather than restarted 3. **Seamless User Experience**: Users experience no interruption during worker failures 4. **Configurable Behavior**: Migration limits allow tuning based on deployment requirements 5. **No Token Loss**: Complete preservation of generation state across migrations ## Design Considerations The migration system is designed with several important architectural considerations: **Multi-Model Support**: Since a frontend may serve multiple models simultaneously, the migration limit is configured at the frontend level and applies uniformly to all models, simplifying operational management. **State Management**: The system carefully tracks not only token sequences but also metadata such as remaining token budgets, stop conditions, and sampling parameters to ensure complete state preservation. **Error Handling**: The migration system distinguishes between different types of failures and applies appropriate recovery strategies for each scenario. ## Monitoring and Metrics The migration system exposes Prometheus metrics to monitor migration activity. 
These metrics are available on the frontend's `/metrics` endpoint (default port 8000): - `dynamo_frontend_model_migration_total`: Counter tracking the total number of request migrations - Labels: - `model`: The model name being served - `migration_type`: Either `new_request` (initial connection failure) or `ongoing_request` (mid-stream disconnection) - `dynamo_frontend_model_migration_max_seq_len_exceeded_total`: Counter tracking the number of times migration was disabled because the sequence length exceeded the configured `--migration-max-seq-len` - Labels: - `model`: The model name being served **Example metrics output:** ```text dynamo_frontend_model_migration_total{migration_type="ongoing_request",model="Qwen/Qwen3-0.6B"} 3 dynamo_frontend_model_migration_total{migration_type="new_request",model="Qwen/Qwen3-0.6B"} 1 dynamo_frontend_model_migration_max_seq_len_exceeded_total{model="Qwen/Qwen3-0.6B"} 2 ``` These metrics can be used to: - Monitor worker reliability and failure patterns - Alert on excessive migration rates indicating infrastructure issues - Track the effectiveness of fault tolerance mechanisms - Monitor how often `--migration-max-seq-len` is being reached, which may indicate the limit needs adjustment For more information on Dynamo metrics, see the [Metrics documentation](/dynamo/user-guides/observability-local/metrics). ## Known Limitations ### Multiple Choices (`n > 1`) Request migration is **not supported** for OpenAI-compatible requests that ask for multiple generated choices with `n > 1`. Dynamo disables migration for those requests, even when `--migration-limit` is greater than 0. **Why:** Multi-choice generation maintains separate per-choice output state. Migrating a partially completed request would need to transfer the generated token state, remaining token budget, finish state, and decoder state for each choice independently. 
The current migration path preserves a single continuation state, so retrying an interleaved `n > 1` request could duplicate or drop choice-specific output. This limitation does not affect normal single-choice requests where `n` is omitted or set to 1. ### Guided Decoding (Structured Output) Request migration is **not supported** for requests that use guided decoding (structured output / JSON schema). When a worker fails mid-stream during a guided-decoding request, the error is propagated to the client instead of attempting migration. **Why:** Inference backends initialize the guided-decoding finite state machine (FSM) fresh for every new request and only advance it on newly-generated tokens, not on context/prompt tokens. When a partially-completed request is migrated to a new worker, the new worker replays the already-generated tokens as context but starts the FSM from the schema root. This mismatch between the token state and FSM state produces corrupted output — typically duplicated or nested JSON. This limitation applies equally to all backends (vLLM, SGLang, TRT-LLM). **Future path:** Supporting migration for guided-decoding requests would require serializing and restoring the FSM state across workers, or replaying prior output tokens through the FSM on the new worker. This is tracked as a future enhancement. ## Operational Impact Request migration fundamentally changes how the system handles failures, moving from a "fail-fast" approach to a "graceful degradation" model. This architectural shift enables higher availability and better resource utilization while maintaining the same external API contract for clients. # Request Cancellation This document describes how Dynamo implements request cancellation to cancel in-flight requests between Dynamo workers. Request cancellation allows in-flight requests to terminate early, saving computational resources that would otherwise be spent on responses that are no longer needed. 
## How Cancellation Works

### Frontend Cancellation Detection

The frontend monitors each client connection for unexpected disconnects. When a client disconnects before the response is fully delivered, the frontend detects this and initiates cancellation. This covers two scenarios:

1. **Connection closed unexpectedly**: The client disconnects during request processing, before response streaming begins.
2. **Stream closed unexpectedly**: The client disconnects while an active SSE stream is delivering response tokens.

In both cases, the frontend cancels the request's `AsyncEngineContext`, which propagates cancellation to any linked child contexts on downstream workers.

### Worker Cancellation Detection

On the worker side, the runtime monitors the TCP connection from the frontend for cancellation signals. The worker detects cancellation in two scenarios:

1. **Control message received**: The frontend explicitly sent a cancellation control message.
2. **TCP connection dropped**: The frontend disconnected without sending a control message (e.g., frontend crash or network failure).

When the worker receives a cancellation signal, it sets the corresponding state on the request's `AsyncEngineContext`. It is then up to the worker's engine implementation to observe the cancellation (e.g., by checking `is_stopped()`) and terminate processing accordingly.

For details on implementing cancellation handling in a backend worker, see the [Backend Development Guide](/dynamo/backends/custom-backend/python-workers-lower-level#request-cancellation).

### Cancellation Propagation

Cancellation propagates through multi-tier request chains via linked `AsyncEngineContext` objects. When a parent context is cancelled, all linked child contexts are automatically cancelled as well. As a result, when a client cancels a request at the frontend, every associated sub-request on downstream workers is cancelled too, saving computational resources across the entire request pipeline.
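The propagation described above can be pictured with a toy model in plain Python. `ToyContext` is a hypothetical stand-in written for illustration; it is not Dynamo's `Context` class or the Rust trait, and it only shows how stopping a parent reaches every linked child.

```python
from __future__ import annotations


class ToyContext:
    """Hypothetical stand-in for a request context (not Dynamo's API)."""

    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self._stopped = False
        self._children: list[ToyContext] = []

    def link_child(self, child: ToyContext) -> None:
        # Remember the child so a later stop can be forwarded to it.
        self._children.append(child)

    def stop_generating(self) -> None:
        # Idempotent: a second call does nothing.
        if self._stopped:
            return
        self._stopped = True
        # Forward the stop to children in the order they were linked.
        for child in self._children:
            child.stop_generating()

    def is_stopped(self) -> bool:
        return self._stopped


# A three-tier chain: frontend -> decode worker -> remote prefill.
frontend = ToyContext("req-1")
decode = ToyContext("req-1")
prefill = ToyContext("req-1")
frontend.link_child(decode)
decode.link_child(prefill)

frontend.stop_generating()
print(decode.is_stopped(), prefill.is_stopped())  # True True
```

Stopping only the top-level context is enough here because each tier forwards the signal to the contexts it linked itself.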
## Metrics

Dynamo exposes Prometheus metrics to monitor request cancellations at both the frontend and runtime layers.

### Frontend Metric

| Metric | Type | Description |
|--------|------|-------------|
| `dynamo_frontend_model_cancellation_total` | Counter | Total number of request cancellations detected by the frontend |

#### Labels

| Label | Description | Example Values |
|-------|-------------|----------------|
| `model` | The model name from the request | `Qwen/Qwen3-0.6B` |
| `endpoint` | The API endpoint that received the request | `completions`, `chat_completions`, `embeddings`, `images`, `videos`, `audios`, `responses`, `anthropic_messages`, `tensor` |
| `request_type` | Whether the request was unary or streaming | `unary`, `stream` |

**Endpoint:** Available on the frontend HTTP service at `/metrics`.

### Runtime Metric

| Metric | Type | Description |
|--------|------|-------------|
| `dynamo_component_cancellation_total` | Counter | Total number of requests cancelled by the work handler |

This metric uses Dynamo's auto-injected component labels:

| Label | Description | Example Values |
|-------|-------------|----------------|
| `dynamo_namespace` | The Dynamo namespace | `dynamo` |
| `dynamo_component` | The component that handled the request | `backend`, `prefill`, `decode` |
| `dynamo_endpoint` | The endpoint within the component | `generate` |

The counter uses deduplication logic to ensure that each cancelled request is only counted once, even if both a control message and a socket close are detected for the same request. Note that this metric records cancellation signals received by the worker, not whether the request was actually aborted at the engine level. It is up to the worker's engine implementation to observe the cancellation (e.g., by checking `is_stopped()`) and terminate processing accordingly.

**Endpoint:** Available on the worker system metrics port at `/metrics` (typically port 9100).
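The once-per-request counting of the runtime metric can be pictured with a small sketch. This is an illustrative model only: `CancellationCounter`, the signal names, and the request IDs are hypothetical and are not taken from the Dynamo runtime.

```python
from __future__ import annotations


class CancellationCounter:
    """Toy counter that counts each cancelled request at most once."""

    def __init__(self) -> None:
        self.total = 0
        self._seen: set[str] = set()

    def record(self, request_id: str, signal: str) -> bool:
        """Record a cancellation signal; return True if it was counted."""
        # `signal` is carried only for readability,
        # e.g. "control_message" or "socket_close".
        if request_id in self._seen:
            return False
        self._seen.add(request_id)
        self.total += 1
        return True


counter = CancellationCounter()
counter.record("req-1", "control_message")
counter.record("req-1", "socket_close")  # same request, second signal: not counted
counter.record("req-2", "socket_close")
print(counter.total)  # 2
```

A long-running process would also need to forget request IDs once requests finish; that bookkeeping is left out of the sketch.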
### Example Metrics Output Frontend metrics (from `/metrics` on the frontend HTTP service): ```text dynamo_frontend_model_cancellation_total{endpoint="chat_completions",model="Qwen/Qwen3-0.6B",request_type="stream"} 5 dynamo_frontend_model_cancellation_total{endpoint="chat_completions",model="Qwen/Qwen3-0.6B",request_type="unary"} 1 dynamo_frontend_model_cancellation_total{endpoint="completions",model="Qwen/Qwen3-0.6B",request_type="stream"} 2 ``` Runtime metrics (from `/metrics` on the worker system port): ```text dynamo_component_cancellation_total{dynamo_component="backend",dynamo_endpoint="generate",dynamo_namespace="dynamo"} 8 ``` ## AsyncEngineContext Trait At the core of Dynamo's request cancellation system is the `AsyncEngineContext` trait. This trait is associated with every request stream and provides lifecycle management for async operations, including stream identification, graceful shutdown capabilities, and immediate termination capabilities. ### Key Methods #### Identification - **`id()`**: Returns the unique identifier for the stream. This ID is set by the user for request identification, and the same ID can be used for sub-requests to associate them with the original user request. #### Status Checking - **`is_stopped()`**: Returns `true` if graceful cancellation has been requested via `stop_generating()`. This represents a signal to the worker that the request has been cancelled and it should return early. - **`is_killed()`**: Returns `true` if a hard stop has been issued via `kill()`. This typically indicates that the network connection between client and server has been cut or an immediate termination is required. #### Async Status Monitoring - **`stopped()`**: An async method that completes when the context becomes stopped. If already stopped, returns immediately. - **`killed()`**: An async method that completes when the context becomes killed. If already killed, returns immediately. 
#### Cancellation Control

- **`stop_generating()`**: The recommended method for cancelling a request. It tells the engine to stop producing results for the stream gracefully. This method is idempotent and does not invalidate results already in the stream.
- **`stop()`**: Alias for `stop_generating()`.
- **`kill()`**: Extends `stop_generating()` and additionally indicates a preference to terminate without draining the remaining items in the stream. This is implementation-specific and may not be supported by all engines.

#### Child Request Management

- **`link_child(child: Arc<dyn AsyncEngineContext>)`**: Links a child `AsyncEngineContext` to this context. When `stop_generating()`, `stop()`, or `kill()` is called on the parent context, the same method is automatically called on all linked child contexts in the order they were linked. This is especially useful in disaggregated serving scenarios where a frontend receives a cancellation notification and needs to cancel requests to workers, and the worker can then cancel its sub-requests (e.g., remote prefill operations).

### Thread Safety

The `AsyncEngineContext` trait ensures thread-safety with `Send + Sync` bounds, allowing safe concurrent access across multiple threads and async tasks.

## Python Bindings

The `AsyncEngineContext` functionality is exposed to Python through the `Context` class, which provides a largely one-to-one mapping from Rust methods to Python methods.

### Python Context Class

The Python `Context` class wraps the Rust `AsyncEngineContext` and exposes the following methods:

- **`id()`**: Returns the unique identifier for the context
- **`is_stopped()`**: Synchronous method equivalent to the Rust `is_stopped()`
- **`is_killed()`**: Synchronous method equivalent to the Rust `is_killed()`
- **`stop_generating()`**: Issues a stop generating signal, equivalent to the Rust method
- **`async_killed_or_stopped()`**: An async method that completes when the context becomes either killed or stopped, whichever happens first.
This combines the functionality of the Rust `killed()` and `stopped()` async methods using `tokio::select!`. For a working example of request cancellation, see the [cancellation demo](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/custom_backend/cancellation/README.md). ### Context Usage in Python The context is available optionally in both incoming and outgoing request scenarios: #### Incoming Requests For incoming requests, the generate method may optionally accept a `context` argument after the `request` argument. If the `context` parameter is specified in the method signature, it will receive the context object of the incoming request. Request handlers can: - Check for cancellation synchronously using `context.is_stopped()` before beginning expensive operations - Listen for cancellation asynchronously using `await context.async_killed_or_stopped()` Example: ```python async def generate(self, request, context): for i in range(1000): # Check for cancellation before expensive work if context.is_stopped(): raise asyncio.CancelledError # Perform work... await expensive_computation() yield result ``` #### Outgoing Requests For outgoing requests, Python scripts may optionally provide a context object to outgoing runtime endpoint client router operations (such as `generate`, `round_robin`, `random`, `direct` methods) as a keyword argument. The script can cancel the outgoing request via the provided context object. This is especially useful when child outgoing requests need to be cancelled when the parent incoming request is cancelled. In such cases, the script can simply pass the incoming context object to the outgoing request, automatically linking the cancellation behavior. 
Example: ```python async def generate(self, request, context): # Forward the incoming context to outgoing request # If the incoming request is cancelled, the outgoing request will be too stream = await self.client.generate(request, context=context) async for response in stream: yield response ``` # Request Rejection This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions. ## Overview Request rejection (also known as load shedding) is a fault tolerance mechanism that proactively rejects new requests when workers are overloaded. This prevents: - Cascading failures from resource exhaustion - Degraded latency for all requests - Out-of-memory conditions on GPU workers When all workers exceed their configured busy thresholds, new requests receive an HTTP 503 (Service Unavailable) response, signaling clients to retry later. ## Architecture ``` ┌─────────────────┐ │ Worker Monitor │ │ (Background) │ └────────┬────────┘ │ Updates busy list ▼ ┌──────────┐ ┌──────────┐ ┌─────────────────────┐ ┌──────────┐ │ Client │───▶│ Frontend │───▶│ Push Router │───▶│ Worker │ └──────────┘ └──────────┘ │ (checks busy list) │ └──────────┘ └─────────────────────┘ │ │ If all workers busy ▼ ┌─────────────────────┐ │ HTTP 503 Error │ │ "All workers busy" │ └─────────────────────┘ ``` ## Configuration ### Frontend Arguments Configure busy thresholds when starting the frontend. `--admission-control token-capacity` is required to activate the thresholds; the default (`none`) leaves them disabled. 
```bash python -m dynamo.frontend \ --admission-control token-capacity \ --active-decode-blocks-threshold 0.85 \ --active-prefill-tokens-threshold 10000 ``` | Argument | Type | Description | |----------|------|-------------| | `--active-decode-blocks-threshold` | float (0.0-1.0) | KV cache block utilization threshold | | `--active-prefill-tokens-threshold` | int | Prefill token count threshold | | `--active-prefill-tokens-threshold-frac` | float | Prefill token threshold as a fraction of `max_num_batched_tokens` | | `--admission-control` | `token-capacity` \| `none` | Admission control mode. `token-capacity` applies the busy thresholds above; `none` (the default) clears them while leaving router queueing controlled by `--router-queue-threshold`. To enable busy-worker admission, you must pass `--admission-control token-capacity` | ### Dynamic Configuration via API Thresholds can be adjusted at runtime via the `/busy_threshold` endpoint: #### Set Thresholds ```bash curl -X POST http://localhost:8000/busy_threshold \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 10000 }' ``` #### Get Current Thresholds ```bash curl http://localhost:8000/busy_threshold ``` Response: ```json { "thresholds": [ { "model": "Qwen/Qwen3-0.6B", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 10000 } ] } ``` ## Busy Detection Logic Workers are marked as "busy" based on a dual-threshold system. A worker is considered busy when **either** threshold is exceeded. ### KV Cache Block Threshold Monitors the percentage of KV cache blocks in use: ``` busy = active_decode_blocks / kv_total_blocks > threshold ``` Example: With `active_decode_blocks_threshold=0.85`, a worker using 87% of its KV cache blocks is marked busy. 
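As a worked instance of the formula, the sketch below evaluates the block-utilization check for the 87% example. The function name and the block counts (870 of 1,000) are made up for illustration; only the strict greater-than comparison comes from the formula above.

```python
def exceeds_decode_block_threshold(
    active_decode_blocks: int, kv_total_blocks: int, threshold: float
) -> bool:
    """Return True when KV cache block utilization is strictly above the threshold."""
    if kv_total_blocks <= 0:
        # Assumption of this sketch: a worker that reports no capacity
        # is not treated as busy.
        return False
    return active_decode_blocks / kv_total_blocks > threshold


print(exceeds_decode_block_threshold(870, 1000, 0.85))  # True: 0.87 > 0.85
print(exceeds_decode_block_threshold(850, 1000, 0.85))  # False: 0.85 is not above 0.85
```

Because the comparison is strict, a worker sitting exactly at the threshold is still considered free.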
### Prefill Token Threshold

Monitors the number of tokens currently being prefilled:

```
busy = active_prefill_tokens > threshold
```

Example: With `active_prefill_tokens_threshold=10000`, a worker prefilling 12,000 tokens is marked busy.

### Data-Parallel Rank Aggregation

For workers with multiple data-parallel (DP) ranks, the worker is only marked busy if **ALL** ranks are busy:

```python
def is_busy(worker):
    return all(rank.is_busy() for rank in worker.dp_ranks)
```

This prevents false positives when only some ranks are temporarily loaded.

## Worker Load Monitoring

The `KvWorkerMonitor` runs as a background task that:

1. Subscribes to KV cache metrics events from workers
2. Maintains load state for each worker instance
3. Recalculates busy instances when metrics change
4. Updates the router with the current busy list

### Metrics Collected

Workers publish these metrics for monitoring:

| Metric | Description |
|--------|-------------|
| `active_decode_blocks` | Number of KV cache blocks currently in use |
| `kv_total_blocks` | Total KV cache blocks available |
| `active_prefill_tokens` | Number of tokens currently being prefilled |

## Rejection Behavior

### Request Flow

1. Request arrives at frontend
2. Push router checks if busy threshold is configured
3. If configured, router retrieves list of free (non-busy) instances
4. If no free instances exist (but instances are registered):
   - Request is rejected with `PipelineError::ServiceOverloaded`
   - HTTP 503 response is returned to client

### Error Response

When requests are rejected, clients receive:

```http
HTTP/1.1 503 Service Unavailable
Content-Type: application/json

{
  "message": "Service temporarily unavailable: All workers are busy, please retry later",
  "type": "service_unavailable",
  "code": 503
}
```

### Client Retry Strategy

Clients should implement exponential backoff when receiving 503 responses:

```python
import random
import time


def send_with_retry(request, max_retries=5):
    # `send_request` is a placeholder for your own HTTP client call.
    for attempt in range(max_retries):
        response = send_request(request)
        if response.status_code != 503:
            return response
        # Exponential backoff with jitter, capped at 60 seconds
        wait_time = min(60, (2 ** attempt) + random.uniform(0, 1))
        time.sleep(wait_time)
    raise Exception("Max retries exceeded")
```

## Monitoring

### Prometheus Metrics

Track rejection behavior with these metrics:

- `dynamo_frontend_model_rejection_total`: Counter tracking the total number of requests rejected due to resource exhaustion
  - Labels:
    - `model`: The model name being served
    - `endpoint`: The API endpoint that received the request (e.g., `chat_completions`, `completions`, `embeddings`)
  - This metric is incremented when the router returns a `ResourceExhausted` error because all workers are busy. The rejected request is surfaced to the client as an HTTP 503 response.

**Example metrics output:**

```text
dynamo_frontend_model_rejection_total{endpoint="chat_completions",model="Qwen/Qwen3-0.6B"} 32
dynamo_frontend_model_rejection_total{endpoint="completions",model="Qwen/Qwen3-0.6B"} 5
```

**Endpoint:** Available on the frontend HTTP service at `/metrics`.
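To turn counter samples like the ones above into per-endpoint totals, the exposition text can be summed by label. The sketch below is a deliberately small parser that assumes simple `name{labels} value` lines as in the example output; it is not a full Prometheus text-format parser, and `sum_counter` is a hypothetical helper name.

```python
from __future__ import annotations

import re

SAMPLE = """\
dynamo_frontend_model_rejection_total{endpoint="chat_completions",model="Qwen/Qwen3-0.6B"} 32
dynamo_frontend_model_rejection_total{endpoint="completions",model="Qwen/Qwen3-0.6B"} 5
"""

# Metric name, optional {labels}, then the sample value.
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_counter(text: str, metric: str, by: str) -> dict[str, float]:
    """Sum samples of `metric`, grouped by the value of label `by`."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        key = labels.get(by, "")
        totals[key] = totals.get(key, 0.0) + float(match.group(3))
    return totals


per_endpoint = sum_counter(SAMPLE, "dynamo_frontend_model_rejection_total", "endpoint")
print(per_endpoint)                # {'chat_completions': 32.0, 'completions': 5.0}
print(sum(per_endpoint.values()))  # 37.0
```

Since the metric is a counter, a rejection rate comes from the difference between two scrapes, not from a single reading.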
## Tuning Thresholds ### Conservative Settings (Latency-Focused) For applications prioritizing low latency: ```bash --active-decode-blocks-threshold 0.70 --active-prefill-tokens-threshold 5000 ``` - Rejects earlier, before workers become fully loaded - Maintains lower queue depths - Better tail latencies ### Aggressive Settings (Throughput-Focused) For applications prioritizing throughput: ```bash --active-decode-blocks-threshold 0.95 --active-prefill-tokens-threshold 20000 ``` - Allows higher worker utilization - May increase latency variability - Better overall throughput ### Disabled (No Rejection) To disable request rejection entirely: ```bash # Simply don't set the threshold arguments python -m dynamo.frontend ``` Without thresholds configured, all requests are accepted regardless of worker load. ## Best Practices ### 1. Start Conservative, Then Tune Begin with conservative thresholds and increase based on observed behavior: ```bash # Start here --active-decode-blocks-threshold 0.75 # Increase if rejection rate is too high --active-decode-blocks-threshold 0.85 ``` ### 2. Monitor Before Enabling Observe worker load patterns before setting thresholds: ```bash # Watch KV cache utilization watch -n 1 'curl -s localhost:8000/metrics | grep kv_blocks' ``` ### 3. Use Both Thresholds for Disaggregated Serving In disaggregated deployments: - Use `active_prefill_tokens_threshold` for prefill workers - Use `active_decode_blocks_threshold` for decode workers ### 4. 
Coordinate with Autoscaling If using Kubernetes HPA, ensure rejection thresholds trigger before autoscaling: ```yaml # HPA triggers at 70% utilization # Rejection at 85% provides buffer --active-decode-blocks-threshold 0.85 ``` ## Related Documentation - [Health Checks](/dynamo/user-guides/observability-local/health-checks) - Worker health monitoring - [Metrics](/dynamo/user-guides/observability-local/metrics) - Available Prometheus metrics - [Request Migration](/dynamo/user-guides/fault-tolerance/request-migration) - Handling failed requests # Graceful Shutdown This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up. ## Overview Graceful shutdown in Dynamo ensures that: 1. **Routing stops quickly** - Endpoints are unregistered from discovery first 2. **In-flight requests can finish** - Workers keep serving during a short grace period 3. **Endpoints drain** - After the grace period, endpoints are invalidated and optionally wait for in-flight work 4. **Resources are cleaned up** - Engines, connections, and temporary files are released 5. **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior ## Signal Handling All Dynamo components handle Unix signals for graceful shutdown: | Signal | Trigger | Behavior | |--------|---------|----------| | `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated | | `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated | ### Implementation Each component registers signal handlers at startup: ```python def signal_handler(): asyncio.create_task(graceful_shutdown(runtime, endpoints)) for sig in (signal.SIGTERM, signal.SIGINT): loop.add_signal_handler(sig, signal_handler) ``` The `graceful_shutdown()` function: 1. Logs the shutdown signal 2. Unregisters all endpoints from discovery 3. Waits for a configurable grace period (`DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS`, default 5s) 4. 
Calls `runtime.shutdown()` to invalidate endpoints and stop accepting new requests 5. Waits for in-flight requests (based on `graceful_shutdown` per endpoint) 6. Returns to allow cleanup to proceed ## Endpoint Draining After the grace period, `runtime.shutdown()` invalidates endpoints so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter when serving the endpoint. ### Configuration When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior: ```python generate_endpoint.serve_endpoint( handler.generate, graceful_shutdown=True, # Wait for all requests to finish metrics_labels=[("model", model_name)], health_check_payload=health_check_payload, ) ``` | `graceful_shutdown` | Behavior | |---------------------|----------| | `True` | Wait for all in-flight requests to complete before returning | | `False` | Return immediately without waiting for requests | ### Component-Specific Behavior | Component | Default Behavior | Rationale | |-----------|------------------|-----------| | **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown | | **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation | | **Decode Workers** | `graceful_shutdown=True` | Decode operations should complete to avoid wasted computation | | **Router** | `graceful_shutdown=True` | Ensure routing decisions complete | ### Migration Integration Backend workers always use `graceful_shutdown=True`, meaning they wait for in-flight requests to complete until the engine is stopped. 
Request migration is configured at the **frontend** level via `--migration-limit`: - When migration is enabled at the frontend, disconnected streams from failed workers are automatically retried on healthy workers - Workers don't need to know about migration configuration - they simply complete their work or signal incomplete streams - See [Request Migration Architecture](/dynamo/user-guides/fault-tolerance/request-migration) for details on how migration works ## Resource Cleanup After endpoint draining, components clean up their resources in `finally` blocks: ### vLLM Worker Cleanup ```python finally: logger.debug("Cleaning up worker") handler.cleanup() ``` The handler's `cleanup()` method: - Removes temporary directories (LoRA adapters, etc.) - Releases engine resources ### SGLang Worker Cleanup ```python def cleanup(self) -> None: # Cancel pending consume tasks for task in self._consume_tasks: if not task.done(): task.cancel() self._consume_tasks.clear() # Shutdown engine self.engine.shutdown() ``` ### TensorRT-LLM Worker Cleanup ```python async def cleanup(self): if self._llm: try: self._llm.shutdown() except Exception as e: logging.error(f"Error during cleanup: {e}") finally: self._llm = None ``` ## Error-Initiated Shutdown Workers can initiate graceful shutdown when fatal errors occur: ### Engine Health Monitoring (vLLM) The `VllmEngineMonitor` continuously checks engine health: ```python async def _check_engine_health(self): while True: try: await self.engine_client.check_health() await asyncio.sleep(HEALTH_CHECK_INTERVAL) # 2 seconds except EngineDeadError as e: logger.error(f"Health check failed: {e}") self._shutdown_engine() self.runtime.shutdown() os._exit(1) ``` Configuration: - `HEALTH_CHECK_INTERVAL`: 2 seconds between checks - `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown ### Fatal Error Handling (TensorRT-LLM) ```python async def _initiate_shutdown(self, error: Exception): logging.warning(f"Initiating graceful shutdown due to: 
{error}") try: if self.runtime: self.runtime.shutdown() if self.engine: await self.engine.cleanup() except Exception as cleanup_error: logging.error(f"Error during graceful shutdown: {cleanup_error}") finally: logging.critical("Forcing process exit for restart") os._exit(1) ``` ## Kubernetes Integration ### Pod Termination Flow 1. Kubernetes sends `SIGTERM` to the pod 2. Dynamo initiates graceful shutdown 3. Pod has `terminationGracePeriodSeconds` to complete (default: 30s) 4. If not terminated, Kubernetes sends `SIGKILL` ### Recommended Configuration For production deployments, configure adequate termination grace period: ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment spec: services: VllmWorker: extraPodSpec: terminationGracePeriodSeconds: 60 # Allow time for request draining ``` ### Health Check Integration Kubernetes uses health endpoints to determine pod readiness: - **During shutdown**: Endpoints become unavailable - **Readiness probe fails**: Traffic stops routing to the pod - **Graceful draining**: Existing requests complete ## Best Practices ### 1. Set Appropriate Grace Periods Match `terminationGracePeriodSeconds` to your expected request completion time: - Short requests (\< 10s): 30s grace period - Long generation (> 30s): 120s+ grace period ### 2. Enable Request Migration Enable migration at the frontend to allow request recovery when workers shut down: ```bash python3 -m dynamo.frontend ... --migration-limit 3 # Allow up to 3 migration attempts ``` This allows the frontend to automatically retry disconnected streams on healthy workers. ### 3. Monitor Shutdown Metrics Track shutdown behavior via logs: ``` INFO Received shutdown signal, shutting down DistributedRuntime INFO DistributedRuntime shutdown complete DEBUG Cleaning up worker ``` ### 4. 
Handle Cleanup Errors Ensure cleanup methods handle errors gracefully: ```python def cleanup(self): for resource in self.resources: try: resource.cleanup() except Exception as e: logger.warning(f"Cleanup failed: {e}") # Continue with other resources ``` ## Related Documentation - [Request Migration](/dynamo/user-guides/fault-tolerance/request-migration) - How requests migrate during shutdown - [Request Cancellation](/dynamo/user-guides/fault-tolerance/request-cancellation) - Canceling in-flight requests - [Health Checks](/dynamo/user-guides/observability-local/health-checks) - Liveness and readiness probes # Testing This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios. ## Overview Dynamo's fault tolerance test suite is located in `tests/fault_tolerance/` and includes: | Test Category | Location | Purpose | |---------------|----------|---------| | Cancellation | `cancellation/` | Request cancellation during in-flight operations | | Migration | `migration/` | Request migration when workers fail | | etcd HA | `etcd_ha/` | etcd failover and recovery | | Hardware | `hardware/` | GPU and network fault injection | | Deployment | `deploy/` | End-to-end deployment testing | ## Test Directory Structure ``` tests/fault_tolerance/ ├── cancellation/ │ ├── test_vllm.py │ ├── test_trtllm.py │ ├── test_sglang.py │ └── utils.py ├── migration/ │ ├── test_vllm.py │ ├── test_trtllm.py │ ├── test_sglang.py │ └── utils.py ├── etcd_ha/ │ ├── test_vllm.py │ ├── test_trtllm.py │ ├── test_sglang.py │ └── utils.py ├── hardware/ │ └── fault_injection_service/ │ ├── api_service/ │ └── agents/ ├── deploy/ │ ├── test_deployment.py │ ├── scenarios.py │ ├── base_checker.py │ └── ... └── client.py ``` ## Request Cancellation Tests Test that in-flight requests can be properly canceled. 
### Running Cancellation Tests ```bash # Run all cancellation tests pytest tests/fault_tolerance/cancellation/ -v # Run for specific backend pytest tests/fault_tolerance/cancellation/test_vllm.py -v ``` ### Cancellation Test Utilities The `cancellation/utils.py` module provides: #### CancellableRequest Thread-safe request cancellation via TCP socket manipulation: ```python from tests.fault_tolerance.cancellation.utils import CancellableRequest request = CancellableRequest() # Send request in separate thread thread = Thread(target=send_request, args=(request,)) thread.start() # Cancel after some time time.sleep(1) request.cancel() # Closes underlying socket ``` #### send_completion_request / send_chat_completion_request Send cancellable completion requests: ```python from tests.fault_tolerance.cancellation.utils import ( send_completion_request, send_chat_completion_request ) # Non-streaming response = send_completion_request( base_url="http://localhost:8000", model="Qwen/Qwen3-0.6B", prompt="Hello, world!", max_tokens=100 ) # Streaming with cancellation responses = send_chat_completion_request( base_url="http://localhost:8000", model="Qwen/Qwen3-0.6B", messages=[{"role": "user", "content": "Hello!"}], stream=True, cancellable_request=request ) ``` #### poll_for_pattern Wait for specific patterns in logs: ```python from tests.fault_tolerance.cancellation.utils import poll_for_pattern # Wait for cancellation confirmation found = poll_for_pattern( log_file="/var/log/dynamo/worker.log", pattern="Request cancelled", timeout=30, interval=0.5 ) ``` ## Migration Tests Test that requests migrate to healthy workers when failures occur. 
### Running Migration Tests ```bash # Run all migration tests pytest tests/fault_tolerance/migration/ -v # Run for specific backend pytest tests/fault_tolerance/migration/test_vllm.py -v ``` ### Migration Test Utilities The `migration/utils.py` module provides: - Frontend wrapper with configurable request planes - Long-running request spawning for migration scenarios - Health check disabling for controlled testing ### Example Migration Test ```python def test_migration_on_worker_failure(): # Start deployment with 2 workers deployment = start_deployment(workers=2) # Send long-running request request_thread = spawn_long_request(max_tokens=1000) # Kill one worker mid-generation kill_worker(deployment.workers[0]) # Verify request completes on remaining worker response = request_thread.join() assert response.status_code == 200 assert len(response.tokens) > 0 ``` ## etcd HA Tests Test system behavior during etcd failures and recovery. ### Running etcd HA Tests ```bash pytest tests/fault_tolerance/etcd_ha/ -v ``` ### Test Scenarios - **Leader failover**: etcd leader node fails, cluster elects new leader - **Network partition**: etcd node becomes unreachable - **Recovery**: System recovers after etcd becomes available ## Hardware Fault Injection The fault injection service enables testing under simulated hardware failures. 
### Fault Injection Service Located at `tests/fault_tolerance/hardware/fault_injection_service/`, this FastAPI service orchestrates fault injection: ```bash # Start the fault injection service cd tests/fault_tolerance/hardware/fault_injection_service python -m api_service.main ``` ### Supported Fault Types #### GPU Faults | Fault Type | Description | |------------|-------------| | `XID_ERROR` | Simulate GPU XID error (various codes) | | `THROTTLE` | GPU thermal throttling | | `MEMORY_PRESSURE` | GPU memory exhaustion | | `OVERHEAT` | GPU overheating condition | | `COMPUTE_OVERLOAD` | GPU compute saturation | #### Network Faults | Fault Type | Description | |------------|-------------| | `FRONTEND_WORKER` | Partition between frontend and workers | | `WORKER_NATS` | Partition between workers and NATS | | `WORKER_WORKER` | Partition between workers | | `CUSTOM` | Custom network partition | ### Fault Injection API #### Inject GPU Fault ```bash curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \ -H "Content-Type: application/json" \ -d '{ "target_pod": "vllm-worker-0", "fault_type": "XID_ERROR", "severity": "HIGH" }' ``` #### Inject Specific XID Error ```bash # Inject XID 79 (GPU memory page fault) curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \ -H "Content-Type: application/json" \ -d '{"target_pod": "vllm-worker-0"}' ``` Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120 #### Inject Network Partition ```bash curl -X POST http://localhost:8080/api/v1/faults/network/inject \ -H "Content-Type: application/json" \ -d '{ "partition_type": "FRONTEND_WORKER", "duration_seconds": 30 }' ``` #### Recover from Fault ```bash curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover ``` #### List Active Faults ```bash curl http://localhost:8080/api/v1/faults ``` ### GPU Fault Injector Agent The GPU fault injector runs as a DaemonSet on worker nodes: ```yaml apiVersion: apps/v1 kind: DaemonSet metadata: name: gpu-fault-injector spec: 
selector: matchLabels: app: gpu-fault-injector template: spec: containers: - name: agent image: dynamo/gpu-fault-injector:latest securityContext: privileged: true volumeMounts: - name: dev mountPath: /dev ``` The agent injects fake XID messages via `/dev/kmsg` to trigger NVSentinel detection. ## Deployment Testing Framework The `deploy/` directory contains an end-to-end testing framework. ### Test Phases Tests run through three phases: | Phase | Description | |-------|-------------| | `STANDARD` | Baseline performance under normal conditions | | `OVERFLOW` | System behavior during fault/overload | | `RECOVERY` | System recovery after fault resolution | ### Scenario Configuration Define test scenarios in `scenarios.py`: ```python from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure scenario = Scenario( name="worker_failure_migration", backend="vllm", load=Load( clients=10, requests_per_client=100, max_tokens=256 ), failure=Failure( type="pod_kill", target="vllm-worker-0", trigger_after_requests=50 ) ) ``` ### Running Deployment Tests ```bash # Run all deployment tests pytest tests/fault_tolerance/deploy/test_deployment.py -v # Run specific scenario pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v ``` ### Validation Checkers The framework includes pluggable validators: ```python from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext class MigrationChecker(BaseChecker): def check(self, context: ValidationContext) -> bool: # Verify migrations occurred migrations = context.metrics.get("migrations_total", 0) return migrations > 0 ``` ### Results Parsing Parse test results for analysis: ```python from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test results = process_overflow_recovery_test(log_dir="/path/to/logs") print(f"Success rate: {results['success_rate']}") print(f"P99 latency: {results['p99_latency_ms']}ms") ``` ## Client Utilities The `client.py` 
module provides shared client functionality: ### Multi-Threaded Load Generation ```python from tests.fault_tolerance.client import client # Generate load with multiple clients results = client( base_url="http://localhost:8000", num_clients=10, requests_per_client=100, model="Qwen/Qwen3-0.6B", max_tokens=256, log_dir="/tmp/test_logs" ) ``` ### Request Options | Parameter | Description | |-----------|-------------| | `base_url` | Frontend URL | | `num_clients` | Number of concurrent clients | | `requests_per_client` | Requests per client | | `model` | Model name | | `max_tokens` | Max tokens per request | | `log_dir` | Directory for client logs | | `endpoint` | `completions` or `chat/completions` | ## Running the Full Test Suite ### Prerequisites 1. Kubernetes cluster with GPU nodes 2. Dynamo deployment 3. etcd cluster (for HA tests) 4. Fault injection service (for hardware tests) ### Environment Setup ```bash export KUBECONFIG=/path/to/kubeconfig export DYNAMO_NAMESPACE=dynamo-test export FRONTEND_URL=http://localhost:8000 ``` ### Run All Tests ```bash # Install test dependencies pip install pytest pytest-asyncio # Run all fault tolerance tests pytest tests/fault_tolerance/ -v --tb=short # Run with specific markers pytest tests/fault_tolerance/ -v -m "not slow" ``` ### Test Markers | Marker | Description | |--------|-------------| | `slow` | Long-running tests (> 5 minutes) | | `gpu` | Requires GPU resources | | `k8s` | Requires Kubernetes cluster | | `etcd_ha` | Requires multi-node etcd | ## Best Practices ### 1. Isolate Test Environments Run fault tolerance tests in dedicated namespaces: ```bash kubectl create namespace dynamo-fault-test ``` ### 2. Clean Up After Tests Ensure fault injection is recovered: ```bash # List and recover all active faults curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \ xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover ``` ### 3. 
Collect Logs Preserve logs for debugging: ```bash pytest tests/fault_tolerance/ -v \ --log-dir=/tmp/fault_test_logs \ --capture=no ``` ### 4. Monitor During Tests Watch system state during tests: ```bash # Terminal 1: Watch pods watch kubectl get pods -n dynamo-test # Terminal 2: Watch metrics watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"' ``` ## Related Documentation - [Request Migration](/dynamo/user-guides/fault-tolerance/request-migration) - Migration implementation details - [Request Cancellation](/dynamo/user-guides/fault-tolerance/request-cancellation) - Cancellation implementation - [Health Checks](/dynamo/user-guides/observability-local/health-checks) - Health monitoring - [Metrics](/dynamo/user-guides/observability-local/metrics) - Available metrics for monitoring # Observability (Local) ## Required environment variables Set these on every Dynamo process (frontend, router, workers) for metrics, traces, and logs to flow: | Variable | Purpose | Required | |---|---|---| | `DYN_SYSTEM_PORT=8081` | Unified system port (metrics + health). | Yes for metrics. | | `OTEL_EXPORT_ENABLED=true` | Enable OpenTelemetry export. **Without this, traces and logs never leave the process** — Loki and Tempo will show nothing even if they are healthy. | Yes for traces/logs. | | `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for traces (e.g. `http://tempo:4317`). Must be a gRPC listener — Dynamo's exporter does not speak OTLP/HTTP, even though the OTel Collector also listens on `:4318`. | Yes for traces. | | `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | OTLP gRPC endpoint for logs (e.g. `http://loki-otlp:4317`). Same gRPC-only constraint as the traces endpoint above. | Yes for logs. | | `DYN_LOGGING_JSONL=true` | Structured JSON log output (recommended for Loki). | Optional. | Source of truth: [`lib/runtime/src/logging.rs`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/lib/runtime/src/logging.rs) `setup_logging()`. 
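As a minimal sketch, the variables in the table can be assembled in one place before each process is launched. The helper name `observability_env` is hypothetical, and the endpoint defaults are only the example values from the table, not addresses Dynamo guarantees; point them at whichever gRPC OTLP listener your stack exposes.

```python
import os


def observability_env(
    traces_endpoint: str = "http://tempo:4317",    # example value from the table
    logs_endpoint: str = "http://loki-otlp:4317",  # example value from the table
    system_port: int = 8081,
    json_logs: bool = True,
) -> dict[str, str]:
    """Return a copy of the current environment plus the observability variables.

    Hypothetical helper for illustration: it only sets the variables listed
    in the table above and does not start any Dynamo process itself.
    """
    env = dict(os.environ)
    env["DYN_SYSTEM_PORT"] = str(system_port)   # unified metrics + health port
    env["OTEL_EXPORT_ENABLED"] = "true"         # without this, traces/logs stay local
    env["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = traces_endpoint  # must be gRPC
    env["OTEL_EXPORTER_OTLP_LOGS_ENDPOINT"] = logs_endpoint      # must be gRPC
    if json_logs:
        env["DYN_LOGGING_JSONL"] = "true"       # optional, recommended for Loki
    return env
```

The returned mapping can be passed as `env=` to `subprocess.Popen`, for example when starting `python3 -m dynamo.frontend` and each worker, so that every process gets the same settings.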
Passing `--enable-metrics` on an individual backend only exposes metrics *per backend*. The unified frontend metrics surface (scraped by Prometheus) requires `DYN_SYSTEM_PORT` to be set on the frontend process as well — setting it on workers alone is not enough. Prometheus metric families in Dynamo are registered lazily: each label set is created the first time it fires, so a freshly-started process shows empty metric families until the first relevant request. This is expected — an idle cluster does not mean scraping is broken. ## Getting Started Quickly This is an example to get started quickly on a single machine. ### Prerequisites Install these on your machine: - [Docker](https://docs.docker.com/get-docker/) - [Docker Compose](https://docs.docker.com/compose/install/) ### Starting the Observability Stack Dynamo provides a Docker Compose-based observability stack that includes Prometheus, Grafana, Tempo, Loki, an OpenTelemetry Collector, and various exporters for metrics, tracing, logging, and visualization. From the Dynamo root directory: ```bash # Start infrastructure (NATS, etcd) docker compose -f dev/docker-compose.yml up -d # Start observability stack (Prometheus, Grafana, Tempo, DCGM GPU exporter, NATS exporter) docker compose -f dev/docker-observability.yml up -d ``` For detailed setup instructions and configuration, see [Prometheus + Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup). 
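After the `docker compose ... up -d` commands return, the containers may still be initializing. The sketch below probes each service once. It assumes the default host ports this page lists under Topology and the readiness paths that Prometheus (`/-/ready`), Grafana (`/api/health`), Tempo (`/ready`), and Loki (`/ready`) document upstream. The names `STACK_ENDPOINTS`, `http_ok`, and `check_stack` are illustrative and not part of Dynamo.

```python
import urllib.error
import urllib.request
from typing import Callable

# Assumed defaults: host ports from the Topology section plus each project's
# upstream readiness path. Adjust if your compose file maps different ports.
STACK_ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/ready",
    "grafana": "http://localhost:3000/api/health",
    "tempo": "http://localhost:3200/ready",
    "loki": "http://localhost:3100/ready",
}


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response (HTTPError).
        return False


def check_stack(
    endpoints: dict[str, str] = STACK_ENDPOINTS,
    probe: Callable[[str], bool] = http_ok,
) -> dict[str, bool]:
    """Probe every endpoint once and report which services answered."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_stack()` returns a mapping such as `{"prometheus": True, ...}`. The `probe` parameter exists so the check can be exercised without a running stack.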
## Observability Documentation | Guide | Description | Environment Variables to Control | |-------|-------------|----------------------------------| | [Metrics](/dynamo/user-guides/observability-local/metrics) | Available metrics reference | `DYN_SYSTEM_PORT`† | | [Operator Metrics (Kubernetes)](/dynamo/kubernetes-deployment/operate/observability/operator-metrics) | Operator controller and webhook metrics for Kubernetes | N/A (configured via Helm) | | [Health Checks](/dynamo/user-guides/observability-local/health-checks) | Component health monitoring and readiness probes | `DYN_SYSTEM_PORT`†, `DYN_SYSTEM_STARTING_HEALTH_STATUS`, `DYN_SYSTEM_HEALTH_PATH`, `DYN_SYSTEM_LIVE_PATH`, `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | | [Tracing](/dynamo/user-guides/observability-local/tracing) | Distributed tracing with OpenTelemetry and Tempo | `DYN_LOGGING_JSONL`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_SERVICE_NAME`† | | [Logging](/dynamo/user-guides/observability-local/logging) | Structured logging and OTLP log export to Loki | `DYN_LOGGING_JSONL`†, `DYN_LOG`, `DYN_LOG_USE_LOCAL_TZ`, `DYN_LOGGING_CONFIG_PATH`, `OTEL_SERVICE_NAME`†, `OTEL_EXPORT_ENABLED`†, `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`†, `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT`† | **Variables marked with † are shared across multiple observability systems.** ## Developer Guides | Guide | Description | Environment Variables to Control | |-------|-------------|----------------------------------| | [Metrics Developer Guide](/dynamo/user-guides/observability-local/metrics-developer-guide) | Creating custom metrics in Rust and Python | `DYN_SYSTEM_PORT`† | | [Local Resource Monitor](local-resource-monitor.md) | Per-process VRAM / PCIe / CPU exporter for engine-startup profiling (200 ms scrape, profile-gated) | N/A (host-side script) | ## Kubernetes For Kubernetes-specific setup and configuration, see [docs/kubernetes/observability/](/dynamo/kubernetes-deployment/operate/observability/metrics). 
**Operator Metrics**: The Dynamo Operator running in Kubernetes exposes its own set of metrics for monitoring controller reconciliation, webhook validation, and resource inventory. See the [Operator Metrics Guide](/dynamo/kubernetes-deployment/operate/observability/operator-metrics). --- ## Topology The Docker Compose observability stack provides: - **Prometheus** on `http://localhost:9090` - metrics collection and querying - **Grafana** on `http://localhost:3000` - visualization dashboards (username: `dynamo`, password: `dynamo`) - **Tempo** on `http://localhost:3200` - distributed tracing backend - **Loki** on `http://localhost:3100` - log aggregation backend - **OpenTelemetry Collector** on `http://localhost:4317` (gRPC) / `http://localhost:4318` (HTTP) - receives OTLP signals and routes traces to Tempo and logs to Loki - **DCGM Exporter** on `http://localhost:9401/metrics` - GPU metrics - **NATS Exporter** on `http://localhost:7777/metrics` - NATS messaging metrics ### Service Relationship Diagram ```mermaid graph TD BROWSER[Browser] -->|:3000| GRAFANA[Grafana :3000] subgraph DockerComposeNetwork [Network inside Docker Compose] NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222] PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380] PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401] PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000] PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081] DYNAMOFE --> DYNAMOBACKEND DYNAMOFE -->|OTLP :4317| OTEL_COLLECTOR[OTel Collector :4317/:4318] DYNAMOBACKEND -->|OTLP :4317| OTEL_COLLECTOR OTEL_COLLECTOR -->|traces| TEMPO[Tempo :3200] OTEL_COLLECTOR -->|logs| LOKI[Loki :3100] GRAFANA -->|:9090/query API| PROMETHEUS GRAFANA -->|:3200/query API| TEMPO GRAFANA -->|:3100/query API| LOKI end ``` The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the 
default port 9400. This adjustment avoids port conflicts with other dcgm-exporter instances that may be running on the same host, which is common on shared clusters such as those managed by SLURM. ### Configuration Files The following configuration files are located in the `dev/` and `dev/observability/` directories: - [docker-compose.yml](../../dev/docker-compose.yml): Defines NATS and etcd services - [docker-observability.yml](../../dev/docker-observability.yml): Defines Prometheus, Grafana, Tempo, and exporters - [prometheus.yml](../../dev/observability/prometheus.yml): Contains Prometheus scraping configuration - [grafana-datasources.yml](../../dev/observability/grafana-datasources.yml): Contains Grafana datasource configuration - [otel-collector.yaml](../../dev/observability/otel-collector.yaml): OpenTelemetry Collector configuration (routes traces to Tempo, logs to Loki) - [loki.yaml](../../dev/observability/loki.yaml): Loki log aggregation configuration - [loki-datasource.yml](../../dev/observability/loki-datasource.yml): Grafana Loki datasource with trace ID linking to Tempo - [grafana_dashboards/dashboard-providers.yml](../../dev/observability/grafana_dashboards/dashboard-providers.yml): Contains Grafana dashboard provider configuration - [grafana_dashboards/dynamo.json](../../dev/observability/grafana_dashboards/dynamo.json): Engine-agnostic per-model dashboard covering frontend, KV-router, and worker metrics. Filterable by `model`. See the [per-model dashboard guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup#per-model-dynamo-dashboard) for details. 
- [grafana_dashboards/dcgm-metrics.json](../../dev/observability/grafana_dashboards/dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics - [grafana_dashboards/kvbm.json](../../dev/observability/grafana_dashboards/kvbm.json): Contains Grafana dashboard configuration for KVBM metrics # Prometheus + Grafana Setup ## Overview This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes. ![Grafana Dynamo Dashboard](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/0e5d1391b49dd5d3ceed8f2a33bb122c9f2b4155be83f348db93959ab9489efb/pages-v1.2.0/assets/img/grafana-dynamo-composite.png) **Components:** - **Prometheus Server** - Collects and stores metrics from Dynamo services - **Grafana** - Provides dashboards by querying the Prometheus Server **For metrics reference**, see [Metrics Documentation](/dynamo/user-guides/observability-local/metrics). ## Environment Variables | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` | ## Getting Started Quickly This is a single machine example. ### Start the Observability Stack Start the observability stack (Prometheus, Grafana, Tempo, exporters). See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions and prerequisites. 
### Start Dynamo Components Start frontend and worker (a simple single GPU example): ```bash # Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var) python -m dynamo.frontend & # Start vLLM worker with metrics enabled on port 8081 DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager ``` After the workers are running, send a few test requests to populate metrics in the system: ```bash curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Hello"}], "max_completion_tokens": 100 }' ``` After sending a few requests, the Prometheus Exposition Format text metrics are available at: - Frontend: `http://localhost:8000/metrics` - Backend worker: `http://localhost:8081/metrics` **Note:** Labeled series (e.g., `...{model="..."}`) only appear after the first matching request is served. See [Available Metrics](/dynamo/user-guides/observability-local/metrics#available-metrics) for details. ### Access Web Interfaces Once Dynamo components are running: 1. Open **Grafana** at `http://localhost:3000` (username: `dynamo`, password: `dynamo`) 2. Click on **Dashboards** in the left sidebar 3. Select **Dynamo Dashboard** to view metrics and traces Other interfaces: - **Prometheus**: `http://localhost:9090` - **Tempo** (tracing): Accessible through Grafana's Explore view. See [Tracing Guide](/dynamo/user-guides/observability-local/tracing) for details. **Note:** If accessing from another machine, replace `localhost` with the machine's hostname or IP address, and ensure firewall rules allow access to these ports (3000, 9090). --- ## Configuration ### Prometheus The Prometheus configuration is specified in [prometheus.yml](../../dev/observability/prometheus.yml). This file is set up to collect metrics from the metrics aggregation service endpoint. 
Please be aware that you might need to modify the target settings to align with your specific host configuration and network environment. After making changes to prometheus.yml, restart the Prometheus service. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for Docker Compose commands. ### Grafana Grafana is pre-configured with: - Prometheus datasource - A set of sample dashboards under `dev/observability/grafana_dashboards/` (see below) ### Dashboards #### Per-Model Dynamo Dashboard The per-model dashboard at [dev/observability/grafana_dashboards/dynamo.json](../../dev/observability/grafana_dashboards/dynamo.json) is auto-provisioned with the observability stack. Sections: - **Overview** - request KPIs (success rate, totals, latency averages). - **Frontend** - request rates, latency quantiles, sequence-length distributions, cache hits. - **KV Routing** - per-worker active blocks, hit rate, routing-overhead breakdown, KV cache events. - **Workers** - per-worker request breakdown, request duration, component throughput. Metric panels read the `dynamo_frontend_*`, `dynamo_component_*`, and `dynamo_router_*` metric surfaces, filtered by the `${model}` template variable. The Kubernetes version is provisioned from [deploy/observability/grafana-dynamo-dashboard-configmap.yaml](../../deploy/observability/grafana-dynamo-dashboard-configmap.yaml). ### Troubleshooting 1. Verify services are running using `docker compose ps` 2. Check logs using `docker compose logs` 3. Check Prometheus targets at `http://localhost:9090/targets` to verify metric collection. 4. If you encounter issues with stale data or configuration, stop services and wipe volumes using `docker compose down -v` then restart. **Note:** The `-v` flag removes named volumes (grafana-data, tempo-data), which will reset dashboards and stored metrics. 
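The target check in step 3 can also be scripted. Prometheus serves the same target information as JSON from its HTTP API at `/api/v1/targets`, and the sketch below filters that payload for active targets whose `health` is not `up`. The function names and the sample payload (including its job names) are illustrative, and the Prometheus address is assumed to be the default `http://localhost:9090` used in this guide.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (job, scrape_url, last_error) for every active target not 'up'.

    `payload` is the decoded JSON body of Prometheus' GET /api/v1/targets.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            t.get("labels", {}).get("job", "?"),
            t.get("scrapeUrl", "?"),
            t.get("lastError", ""),
        )
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets payload from a running Prometheus (assumed address)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the shape Prometheus returns; job names are made up.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "dynamo-frontend"},
             "scrapeUrl": "http://localhost:8000/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "dynamo-backend"},
             "scrapeUrl": "http://localhost:8081/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

print(unhealthy_targets(sample))
# [('dynamo-backend', 'http://localhost:8081/metrics', 'connection refused')]
```

Against a live stack, `unhealthy_targets(fetch_targets())` gives the same view. In this setup, a worker target reported as down is often a sign that `DYN_SYSTEM_PORT` was not set on that process, since the port is disabled by default.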
For specific Docker Compose commands, see [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly). ## Developer Guide For detailed information on creating custom metrics in Dynamo components, see: - [Metrics Developer Guide](/dynamo/user-guides/observability-local/metrics-developer-guide) # Metrics ## Overview Dynamo provides built-in metrics capabilities through the Dynamo metrics API, which is automatically available whenever you use the `DistributedRuntime` framework. This document serves as a reference for all available metrics in Dynamo. **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup). **For creating custom metrics**, see the [Metrics Developer Guide](/dynamo/user-guides/observability-local/metrics-developer-guide). ## Environment Variables | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `DYN_SYSTEM_PORT` | Backend component metrics/health port | `-1` (disabled) | `8081` | | `DYN_HTTP_PORT` | Frontend HTTP port (also configurable via `--http-port` flag) | `8000` | `8000` | | `NIXL_TELEMETRY_ENABLE` | Enable NIXL telemetry (see [NIXL Telemetry Metrics](#nixl-telemetry-metrics)). Options: `y`, `n` | `n` (disabled) | `y` | ## Getting Started Quickly This is a single machine example. ### Start Observability Stack For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions. 
### Launch Dynamo Components Launch a frontend and vLLM backend to test metrics: ```bash # Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var) $ python -m dynamo.frontend # Enable backend worker's system metrics on port 8081 $ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B \ --enforce-eager --no-enable-prefix-caching --max-num-seqs 3 ``` Wait for the vLLM worker to start, then send requests and check metrics: ```bash # Send a request curl -H 'Content-Type: application/json' \ -d '{ "model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 100, "messages": [{"role": "user", "content": "Hello"}] }' \ http://localhost:8000/v1/chat/completions # Check metrics from the backend worker curl -s localhost:8081/metrics | grep dynamo_component ``` ## Exposed Metrics Dynamo exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All Dynamo-generated metrics use the `dynamo_*` prefix and include labels (`dynamo_namespace`, `dynamo_component`, `dynamo_endpoint`) to identify the source component. 
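For scripted checks, the `/metrics` output can be parsed with a few lines of Python instead of `grep`. This is a simplified sketch for illustration, not a complete exposition-format parser: it skips `# HELP` and `# TYPE` comments and does not handle samples that carry an optional timestamp. The names `SAMPLE_RE`, `LABEL_RE`, and `parse_samples` are hypothetical, and the sample text is hand-written in the format described above.

```python
import re

# One sample per line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair: name="value", allowing backslash escapes inside the value.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str, prefix: str = "dynamo_"):
    """Yield (metric_name, labels_dict, value) for samples starting with `prefix`."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue  # unparseable line or a pass-through engine metric
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


text = """\
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate"} 42
"""

for name, labels, value in parse_samples(text):
    print(name, labels["dynamo_component"], labels["dynamo_endpoint"], value)
```

Filtering on the `dynamo_` prefix keeps Dynamo's own series and drops engine pass-through metrics such as `vllm:*`. For production use, a maintained parser such as the one in the `prometheus_client` package is a safer choice than a hand-written regex.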
**Example Prometheus Exposition Format text:** ``` # HELP dynamo_component_requests_total Total requests processed # TYPE dynamo_component_requests_total counter dynamo_component_requests_total{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate"} 42 # HELP dynamo_component_request_duration_seconds Request processing time # TYPE dynamo_component_request_duration_seconds histogram dynamo_component_request_duration_seconds_bucket{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate",le="0.005"} 10 dynamo_component_request_duration_seconds_bucket{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate",le="0.01"} 15 dynamo_component_request_duration_seconds_bucket{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate",le="+Inf"} 42 dynamo_component_request_duration_seconds_sum{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate"} 2.5 dynamo_component_request_duration_seconds_count{dynamo_namespace="default",dynamo_component="backend",dynamo_endpoint="generate"} 42 ``` ### Metric Categories Dynamo exposes several categories of metrics: - **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements - **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime - **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics - **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](/dynamo/backends/v-llm/observability) (`vllm:*`), [SGLang](/dynamo/backends/sg-lang/observability) (`sglang:*`), [TensorRT-LLM](/dynamo/backends/tensor-rt-llm/observability) (`trtllm_*`) ## Runtime Hierarchy The Dynamo metrics API is available on `DistributedRuntime`, `Namespace`, `Component`, and `Endpoint`, providing a hierarchical approach to metric collection that matches Dynamo's distributed 
architecture: - `DistributedRuntime`: Global metrics across the entire runtime - `Namespace`: Metrics scoped to a specific dynamo_namespace - `Component`: Metrics for a specific dynamo_component within a namespace - `Endpoint`: Metrics for individual dynamo_endpoint within a component This hierarchical structure allows you to create metrics at the appropriate level of granularity for your monitoring needs. ## Available Metrics **Note:** Labeled metrics (`HistogramVec`, `CounterVec`, `GaugeVec`) register a metric *family*, not individual time series. A series for a given label combination only appears at `/metrics` after the first `with_label_values(...)` call for that combination — i.e., after the first matching request is served. For example, `dynamo_frontend_request_duration_seconds{model="Qwen/Qwen3-0.6B"}` will not appear on a freshly-started frontend until a request for that model is handled. This is expected Prometheus client behavior, not a missing metric. ### Backend Component Metrics **Backend workers** (`python -m dynamo.vllm`, `python -m dynamo.sglang`, etc.) expose `dynamo_component_*` metrics on the system status port (configurable via `DYN_SYSTEM_PORT`, disabled by default). In Kubernetes the operator typically sets `DYN_SYSTEM_PORT=9090`; for local development you must set it explicitly (e.g. `DYN_SYSTEM_PORT=8081`). The core Dynamo backend system exposes metrics at the `/metrics` endpoint with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework: - `dynamo_component_inflight_requests`: Requests currently being processed (gauge) - `dynamo_component_request_bytes_total`: Total bytes received in requests (counter) - `dynamo_component_request_duration_seconds`: Request processing time (histogram) - `dynamo_component_requests_total`: Total requests processed (counter) - `dynamo_component_errors_total`: Total errors encountered while handling a request (counter, labeled with `error_type`). 
See [Component Error Types](#component-error-types). - `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter) - `dynamo_component_uptime_seconds`: DistributedRuntime uptime (gauge). Automatically updated before each Prometheus scrape on both the frontend (`/metrics` on port 8000) and the system status server (`/metrics` on `DYN_SYSTEM_PORT` when set). **Access backend component metrics:** ```bash # Set DYN_SYSTEM_PORT to enable the system status server DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B curl http://localhost:8081/metrics ``` #### Component Labels Backend `dynamo_component_*` series carry two groups of labels: the ones the Dynamo runtime emits, and the ones Prometheus/Kubernetes attach during scraping. **Auto-injected by the Dynamo runtime** (added by `create_metric()` in `lib/runtime/src/metrics.rs` for every metric registered through the namespace/component/endpoint hierarchy): | Label | Description | Example | |-------|-------------|---------| | `dynamo_namespace` | The Dynamo runtime namespace — the logical scope shared by every component (router / prefill / decode / encode) in one deployment. **Not** the K8s namespace. | `dynamo_cloud_vllm_v1_disagg_router_071de157` | | `dynamo_component` | Service role: see [Component Names](#component-names) below. | `backend`, `prefill`, `router` | | `dynamo_endpoint` | The RPC within that component: see [Endpoint Names](#endpoint-names) below. | `generate`, `clear_kv_blocks`, `worker_kv_indexer_query_dp0` | | `worker_id` | Hex-encoded discovery instance ID of the endpoint, providing a stable per-worker identity that does not depend on Kubernetes. Injected only when the endpoint hierarchy has a connection ID. 
| `1a2b3c4d` | **Added at registration time by backend code** (passed via `metrics_labels=` when the worker calls `serve_endpoint()` — not auto-injected, so presence depends on the backend): | Label | Description | Example | |-------|-------------|---------| | `model` | The model being served (OpenAI-style label). Added by vLLM, SGLang, and TRT-LLM workers on inference endpoints; absent on internal endpoints like `worker_kv_indexer_query_dp{N}`. | `Qwen/Qwen3-0.6B` | | `model_name` | Same model identifier under a second label name, retained for engine-native and dashboard back-compat. Added by vLLM and TRT-LLM workers; **not** added by SGLang. | `Qwen/Qwen3-0.6B` | **Added by the metric itself**: | Label | Description | Example | |-------|-------------|---------| | `error_type` | Only on `dynamo_component_errors_total` — the failure category. See [Component Error Types](#component-error-types). | `generate`, `publish_response` | **Injected by Prometheus / Kubernetes (added by the scraper, not in the metric itself):** | Label | Description | Example | |-------|-------------|---------| | `instance` | Scrape target as `host:port`. | `192.168.133.236:9090` | | `pod` | Kubernetes pod name; the per-replica disambiguator. | `vllm-v1-disagg-router-vllmdecodeworker-...` | | `container` | Container name inside the pod (usually `main`). | `main` | | `namespace` | Kubernetes namespace the pod runs in. **Not** the same as `dynamo_namespace`. | `dynamo-cloud` | | `job` | Prometheus scrape-job name, `namespace/name`. | `dynamo-cloud/dynamo-worker` | | `endpoint` | Named port on the K8s `Service` that Prometheus scraped. **Not** the same as `dynamo_endpoint`. | `system` | > **Watch out for these collisions:** > - `dynamo_namespace` (Dynamo deployment scope) vs. `namespace` (K8s namespace). > - `dynamo_endpoint` (Dynamo RPC) vs. `endpoint` (K8s Service port name). #### Component Names Values you will see in the `dynamo_component` label on `dynamo_component_*` series. 
The HTTP frontend (`python -m dynamo.frontend`) is **not** in this list — it exposes its own `dynamo_frontend_*` metric family, not `dynamo_component_*`. | Value | Meaning | |-------|---------| | `router` | The standalone KV router (`python -m dynamo.router`). | | `Planner` | The planner component (`python -m dynamo.planner`). Note the capital `P`. | | `prefill` | The prefill worker in disaggregated serving (all backends). | | `backend` | The decode worker in disaggregated serving for all backends, **and** the combined worker for vLLM in aggregated mode. | | `encode` | The encode worker for vLLM, SGLang, and TRT-LLM. | | `diffusion` | The diffusion worker for TRT-LLM. | Internal subsystems (e.g. `kvbm` from the block manager, `sequences` from the KV router) also create components and may appear in `dynamo_component_*` series. The default for vLLM/SGLang can be overridden by passing `--endpoint dyn://namespace.component.endpoint` on the worker command line. > The name `backend` for the decode worker is historical. The runtime has a TODO to introduce a `decode` constant and migrate to it (see `lib/runtime/src/metrics/prometheus_names.rs::component_names`). #### Endpoint Names Values you will see in the `dynamo_endpoint` label on backend workers: | Value | Meaning | |-------|---------| | `generate` | Main inference RPC; one increment per request received. On a prefill worker this counts prefill-stage `generate` calls (one per request the router routes through); on a decode worker this counts decode-stage `generate` calls. | | `clear_kv_blocks` | Admin RPC to flush the worker's KV cache. Registered on both prefill and decode workers. | | `worker_kv_indexer_query_dp{N}` | KV-router queries to the worker's local KV indexer about its cached prefix blocks. One endpoint per data-parallel rank (`_dp0`, `_dp1`, …). Appears on the worker that owns the prefix caches the router consults — in disaggregated serving that is the prefill worker. 
| #### Component Error Types The `dynamo_component_errors_total` counter is labeled with `error_type`, identifying which stage of request handling failed: | `error_type` | Stage | Meaning | |--------------|-------|---------| | `deserialization` | Ingress | Could not parse the incoming request payload. | | `invalid_message` | Ingress | Wire-format violation in the incoming message. | | `response_stream` | Pre-generate | The worker received the request but could not open the response stream back to the frontend (transport problem before `generate` was called). | | `generate` | Engine | The engine's `generate()` itself returned an error. This is the counter that reflects engine/inference failures. | | `publish_response` | Streaming | The engine produced response chunks but the worker could not push one of them back to the frontend (write failed mid-stream). **Also fires on client cancellation** — the frontend disconnecting before the stream finishes — so this counter can be inflated by user-aborted requests. | | `publish_final` | Teardown | All response chunks were sent, but the worker could not deliver the final stream-complete marker. The connection died right at the end. | ### Specialized Component Metrics Some components expose additional metrics specific to their functionality: - `dynamo_preprocessor_*`: Metrics specific to preprocessor components ### Frontend Metrics **Important:** The frontend and backend workers are separate components that expose metrics on different ports. See [Backend Component Metrics](#backend-component-metrics) for backend metrics. The Dynamo HTTP Frontend (`python -m dynamo.frontend`) exposes `dynamo_frontend_*` metrics on port 8000 by default (configurable via `--http-port` or `DYN_HTTP_PORT`) at the `/metrics` endpoint. 
Most metrics include `model` labels containing the model name: - `dynamo_frontend_active_requests`: Number of requests currently being handled by the frontend, from HTTP handler entry until the response stream completes (gauge). This is the top-level in-flight count with no stage breakdown. - `dynamo_frontend_stage_requests`: Number of requests currently in a given frontend pipeline stage (gauge, labels: `stage`, `phase`). See [Stage and phase labels](#stage-and-phase-labels) below. - `dynamo_frontend_inflight_requests`: Inflight requests (gauge). **Deprecated** — kept for backward compatibility; prefer `dynamo_frontend_active_requests`, which has identical semantics with a clearer name. - `dynamo_frontend_queued_requests`: Number of requests in HTTP processing queue (gauge). **Deprecated** — kept for backward compatibility; the "waiting for first token" window is now the sum of `dynamo_frontend_stage_requests` across the `preprocess`, `route`, and `dispatch` stages. - `dynamo_frontend_disconnected_clients`: Number of disconnected clients (gauge) - `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram) - `dynamo_frontend_cached_tokens`: Number of cached tokens (prefix cache hits) per request (histogram) - `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram) - `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram) - `dynamo_frontend_output_tokens_total`: Total number of output tokens generated (counter) - `dynamo_frontend_request_duration_seconds`: LLM request duration (histogram) - `dynamo_frontend_requests_total`: Total LLM requests (counter) - `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram) - `dynamo_frontend_model_migration_total`: Total number of request migrations due to worker unavailability (counter, labels: `model`, `migration_type`) **Access frontend metrics:** ```bash curl http://localhost:8000/metrics ``` #### Stage and phase labels 
`dynamo_frontend_stage_requests` decomposes the lifetime of an active frontend request into three sequential pipeline stages. A request is counted in exactly one stage at a time (via an RAII guard that increments on stage entry and decrements on stage exit), and is counted in `dynamo_frontend_active_requests` for its entire lifetime. Between stages — and after `dispatch` exits while the backend is streaming tokens — the request is still in `active_requests` but in no `stage_requests` bucket. **`stage` label values:** | Stage | What it covers | Enters when | Exits when | |-------|----------------|-------------|------------| | `preprocess` | Tokenization and chat-template application | The frontend enters `preprocess_request` | Preprocessing returns | | `route` | Worker selection (including parking in the KV-router queue while waiting for a worker) | The router's `generate()` is called | A worker is selected or the request is queued for one | | `dispatch` | Serialization, transport to the chosen worker, and waiting for the backend's first response (includes backend prefill time) | `generate()` is called in `AddressedPushRouter` | The first response is received from the backend | **`phase` label values:** | Phase | Meaning | |-------|---------| | `prefill` | The request is being handled by a prefill worker in disaggregated serving | | `decode` | The request is being handled by a decode worker in disaggregated serving | | `aggregated` | Aggregated (non-disaggregated) serving — a single worker handles both prefill and decode | | `""` (empty) | The stage does not distinguish phases (used by `preprocess`) | **Derived signals operators commonly want.** These are cluster-wide totals across all frontend pods. `stage_requests` has no `model` label, so you cannot split these by model; add `by (pod)` or `by (instance)` to any `sum(...)` below if you need per-pod visibility. 
The stage filter `stage=~"preprocess|route|dispatch"` is used explicitly to keep the "pre-first-token" semantic stable if additional stages (e.g. `postprocess`) are added in the future. - **Requests waiting for a worker to start generating (the old "queued" semantic):** `sum(dynamo_frontend_stage_requests{stage=~"preprocess|route|dispatch"})` — i.e. still in `preprocess`, `route`, or `dispatch`. - **Requests currently being processed by a backend worker:** use the worker-side gauge `sum(dynamo_component_inflight_requests{dynamo_component="backend",dynamo_endpoint="generate"})` — this is the authoritative count, available whenever `DYN_SYSTEM_PORT` is set on workers (see [Backend Component Metrics](#backend-component-metrics)). - *Frontend-perspective variant* (useful if worker metrics aren't being scraped, or when sizing frontend pods rather than workers): `sum(dynamo_frontend_active_requests) - sum(dynamo_frontend_stage_requests{stage=~"preprocess|route|dispatch"})`. This differs from the worker gauge because its window starts when the first token arrives at the frontend and extends through streaming to the client, so it includes transit and client-buffering time. - **Router saturation:** `sum(dynamo_frontend_stage_requests{stage="route"})` spiking indicates workers can't be selected fast enough (e.g. all backends busy, KV-router queue full). - **Backend prefill saturation:** `sum(dynamo_frontend_stage_requests{stage="dispatch"})` spiking indicates the backend is slow to produce first tokens. #### Deprecated frontend gauges The following gauges are still emitted but will be removed in a future release. They were superseded by the gauges above as part of the frontend-metrics rework (PR #8162). Dashboards and alerts should migrate off them. 
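The derived signals above can be recomputed from a raw `/metrics` scrape, which is a quick way to validate a dashboard migration against the table below. The following is a minimal, illustrative sketch rather than anything shipped with Dynamo: the sample payload, its values, the `model` label on `dynamo_frontend_active_requests`, and the helper names `parse_samples` and `derived_signals` are all assumptions made for the example.

```python
import re

# Hypothetical scrape of the frontend /metrics endpoint (values invented for illustration).
SAMPLE = """\
# TYPE dynamo_frontend_active_requests gauge
dynamo_frontend_active_requests{model="Qwen/Qwen3-0.6B"} 10
# TYPE dynamo_frontend_stage_requests gauge
dynamo_frontend_stage_requests{stage="preprocess",phase=""} 1
dynamo_frontend_stage_requests{stage="route",phase="aggregated"} 2
dynamo_frontend_stage_requests{stage="dispatch",phase="aggregated"} 4
"""

# Simplified matcher for Prometheus text-format sample lines: name{labels} value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Same stage set as the PromQL filter stage=~"preprocess|route|dispatch".
PRE_FIRST_TOKEN_STAGES = {"preprocess", "route", "dispatch"}


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def derived_signals(text):
    """Return (waiting_for_first_token, past_first_token) for one scrape."""
    active = 0.0
    waiting = 0.0
    for name, labels, value in parse_samples(text):
        if name == "dynamo_frontend_active_requests":
            active += value
        elif (name == "dynamo_frontend_stage_requests"
              and labels.get("stage") in PRE_FIRST_TOKEN_STAGES):
            waiting += value
    # Frontend-perspective variant: active requests minus those still pre-first-token.
    return waiting, active - waiting


waiting, streaming = derived_signals(SAMPLE)
print(f"waiting for first token: {waiting:.0f}, past first token: {streaming:.0f}")
```

For anything beyond a spot check, a full Prometheus text-format parser is a safer choice than the simplified regular expressions used here.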
| Deprecated metric | Replacement | |-------------------|-------------| | `dynamo_frontend_inflight_requests` | `dynamo_frontend_active_requests` (same semantics, clearer name) | | `dynamo_frontend_queued_requests` | `sum(dynamo_frontend_stage_requests{stage=~"preprocess\|route\|dispatch"})` | #### Model Configuration Metrics The frontend also exposes model configuration metrics (on port 8000 `/metrics` endpoint) with the `dynamo_frontend_model_*` prefix. These metrics are populated from the worker backend registration service when workers register with the system. All model configuration metrics include a `model` label. **Runtime Config Metrics (from ModelRuntimeConfig):** These metrics come from the runtime configuration provided by worker backends during registration. - `dynamo_frontend_model_total_kv_blocks`: Total KV blocks available for a worker serving the model (gauge) - `dynamo_frontend_model_max_num_seqs`: Maximum number of sequences for a worker serving the model (gauge) - `dynamo_frontend_model_max_num_batched_tokens`: Maximum number of batched tokens for a worker serving the model (gauge) **MDC Metrics (from ModelDeploymentCard):** These metrics come from the Model Deployment Card information provided by worker backends during registration. Note that when multiple worker instances register with the same model name, only the first instance's configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates. - `dynamo_frontend_model_context_length`: Maximum context length for a worker serving the model (gauge) - `dynamo_frontend_model_kv_cache_block_size`: KV cache block size for a worker serving the model (gauge) - `dynamo_frontend_model_migration_limit`: Request migration limit for a worker serving the model (gauge) ### Request Processing Flow > **Deprecated framing.** The two-metric model below (inflight vs. 
HTTP queue) describes the legacy `dynamo_frontend_inflight_requests` and `dynamo_frontend_queued_requests` gauges and is kept only to help operators reading existing dashboards. New work should use `dynamo_frontend_active_requests` and the per-stage `dynamo_frontend_stage_requests` gauges described under [Stage and phase labels](#stage-and-phase-labels).

This section explains the distinction between two key metrics used to track request processing:

1. **Inflight**: Tracks requests from HTTP handler start until the complete response is finished
2. **HTTP Queue**: Tracks requests from HTTP handler start until first token generation begins (including prefill time)

**Example Request Flow:**

```bash
# The prompt avoids apostrophes so the single-quoted JSON body stays intact in the shell.
curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "prompt": "Hello, let us talk about LLMs",
  "stream": false,
  "max_tokens": 1000
}'
```

**Timeline:**

```mermaid
sequenceDiagram
    participant Client
    participant Frontend as Frontend:8000
    participant Backend as Backend (SGLang/TRT/vLLM)

    Client->>Frontend: Request start
    Note over Frontend,Backend: HTTP queue begins
    Frontend->>Backend: Forward request
    Note over Backend: Start prefill
    Backend-->>Frontend: First token
    Note over Frontend,Backend: HTTP queue ends
    loop Token generation
        Backend-->>Frontend: Tokens
    end
    Backend-->>Frontend: Last token
    Frontend-->>Client: Complete response
    Note over Frontend: Inflight ends
```

**Concurrency Example:**

Suppose the backend allows 3 concurrent requests and there are 10 clients continuously hitting the frontend:

- All 10 requests will be counted as inflight (from start until complete response)
- 7 requests will be in the HTTP queue most of the time
- 3 requests will be actively processed (between first token and last token)

**Key Differences:**

- **Inflight**: Measures total request lifetime including processing time
- **HTTP Queue**: Measures queuing time before processing begins (including prefill time)
- **HTTP Queue ≤ Inflight** (HTTP queue is
a subset of inflight time) ### Router Metrics The router exposes metrics for monitoring routing decisions and overhead. Defined in `lib/llm/src/kv_router/metrics.rs`. For router deployment modes, see the [Router Guide](/dynamo/user-guides/kv-cache-aware-routing). For router flags and tuning, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning). #### Metrics Availability by Configuration Not all metrics appear in every deployment. The chart below shows which metric groups are **registered** and **populated** in each configuration: | Metric Group | Frontend + KV (agg) | Frontend + KV (disagg) | Frontend + non-KV (round-robin/random/direct) | Standalone Router | |---|---|---|---|---| | `dynamo_component_router_*` (request metrics) | Registered and populated | Registered and populated | Registered, **always zero** | Populated (on `DYN_SYSTEM_PORT`) | | `dynamo_router_overhead_*` (routing overhead) | Registered and populated | Registered and populated | **Not registered** | **Not created** | | `dynamo_frontend_router_queue_*` (queue depth) | Registered; populated when `--router-queue-threshold` set | Registered; populated when `--router-queue-threshold` set | **Not registered** | **Not created** | | `dynamo_component_kv_cache_events_applied` (indexer) | Populated when KV events are received | Populated when KV events are received | **Not registered** | Populated when KV events are received | | `dynamo_frontend_worker_*` (per-worker load/timing) | Registered and populated | Registered and populated (`worker_type`=`prefill`/`decode`) | Registered and populated (`worker_type`=`decode`) | **Not created** | **Key:** - **Registered and populated**: Metric appears at `/metrics` with real values - **Registered, always zero**: Metric appears at `/metrics` but the counter/histogram is never incremented (useful for dashboards that expect the metric to exist) - **Not registered / Not created**: Metric does not appear at `/metrics` at all **Scrape 
endpoints:** - Frontend: `/metrics` on HTTP port (default 8000, configurable via `--http-port` or `DYN_HTTP_PORT`) - Standalone router: `/metrics` on `DYN_SYSTEM_PORT` (must be set explicitly; default is `-1` / disabled) - Backend workers: `/metrics` on `DYN_SYSTEM_PORT` (separate from frontend metrics) #### Router Request Metrics (`dynamo_component_router_*`) Histograms and counters for aggregate request-level statistics. Eagerly registered via `from_component()` with the DRT `MetricsRegistry` hierarchy. On the frontend, exposed at `/metrics` on the HTTP port (default 8000) via the `drt_metrics` bridge. On the standalone router (`python -m dynamo.router`), exposed on `DYN_SYSTEM_PORT` when set. Populated per-request when `--router-mode kv` is active; registered with zero values in non-KV modes. All metrics carry the standard hierarchy labels (`dynamo_namespace`, `dynamo_component`, `dynamo_endpoint`). | Metric | Type | Description | |--------|------|-------------| | `dynamo_component_router_requests_total` | Counter | Total requests processed by the router | | `dynamo_component_router_time_to_first_token_seconds` | Histogram | Time to first token (seconds) | | `dynamo_component_router_inter_token_latency_seconds` | Histogram | Average inter-token latency (seconds) | | `dynamo_component_router_input_sequence_tokens` | Histogram | Input sequence length (tokens) | | `dynamo_component_router_output_sequence_tokens` | Histogram | Output sequence length (tokens) | | `dynamo_component_router_kv_hit_rate` | Histogram | Predicted KV cache hit rate at routing time (0.0-1.0) | #### Per-Request Routing Overhead (`dynamo_router_overhead_*`) Histograms (in milliseconds) tracking the time spent in each phase of the routing decision for every request. Registered on the frontend port (default 8000) at `/metrics` with a `router_id` label (the frontend's discovery instance ID). 
These metrics are only created when the frontend has DRT discovery enabled (i.e., `--router-mode kv`); they do not appear in non-KV modes or on the standalone router. | Metric | Type | Description | |--------|------|-------------| | `dynamo_router_overhead_block_hashing_ms` | Histogram | Time computing block hashes | | `dynamo_router_overhead_indexer_find_matches_ms` | Histogram | Time in indexer find_matches | | `dynamo_router_overhead_seq_hashing_ms` | Histogram | Time computing sequence hashes | | `dynamo_router_overhead_scheduling_ms` | Histogram | Time in scheduler worker selection | | `dynamo_router_overhead_total_ms` | Histogram | Total routing overhead per request | #### Router Queue Metrics (`dynamo_frontend_router_queue_*`) Gauge tracking the number of requests pending in the router's scheduler queue. Only registered when `--router-queue-threshold` is set. Labeled by `worker_type` to distinguish prefill vs. decode queues in disaggregated mode. | Metric | Type | Description | |--------|------|-------------| | `dynamo_frontend_router_queue_pending_requests` | Gauge | Requests pending in the router scheduler queue | **Labels:** `worker_type` (`prefill` or `decode`) #### KV Indexer Metrics Tracks KV cache events applied to the router's radix tree index. Only appears when `--router-kv-overlap-score-credit` is greater than 0 (default) and workers are publishing KV events. Will not appear if `--router-kv-overlap-score-credit 0` is set or no KV events have been received. | Metric | Type | Description | |--------|------|-------------| | `dynamo_component_kv_cache_events_applied` | Counter | KV cache events applied to the index | **Additional labels:** `status` (`ok` / `parent_block_not_found` / `block_not_found` / `invalid_block`), `event_type` (`stored` / `removed` / `cleared`) #### Per-Worker Load and Timing Gauges (`dynamo_frontend_worker_*`) These appear once workers register and begin serving requests. 
They are registered on the frontend's local Prometheus registry (not component-scoped) and do not carry `dynamo_namespace` or `dynamo_component` labels. These metrics are frontend-only and are not available on the standalone router. | Metric | Type | Description | |--------|------|-------------| | `dynamo_frontend_worker_active_decode_blocks` | Gauge | Active KV cache decode blocks per worker | | `dynamo_frontend_worker_active_prefill_tokens` | Gauge | Active prefill tokens queued per worker | | `dynamo_frontend_worker_last_time_to_first_token_seconds` | Gauge | Last observed TTFT per worker (seconds) | | `dynamo_frontend_worker_last_input_sequence_tokens` | Gauge | Last observed input sequence length per worker | | `dynamo_frontend_worker_last_inter_token_latency_seconds` | Gauge | Last observed ITL per worker (seconds) | **Labels:** | Label | Example Value | Description | |-------|---------------|-------------| | `worker_id` | `7890` | Worker instance ID (etcd lease ID) | | `dp_rank` | `0` | Data-parallel rank | | `worker_type` | `prefill` or `decode` | Worker role | In disaggregated mode, the `worker_type` label shows both `"prefill"` and `"decode"` values; in aggregated mode, all workers report as `"decode"`. ## NIXL Telemetry Metrics [NIXL](https://github.com/ai-dynamo/nixl) exposes its own Prometheus metrics on a **separate port** from Dynamo metrics. These metrics track KV cache and embedding data transfers and are only populated during **disaggregated serving** or **multimodal embedding transfers**. 
To enable, set these environment variables on your worker process: ```bash # Prefill worker NIXL_TELEMETRY_ENABLE=y NIXL_TELEMETRY_EXPORTER=prometheus \ NIXL_TELEMETRY_PROMETHEUS_PORT=19090 DYN_SYSTEM_PORT=8081 \ python -m dynamo.vllm --model --disaggregation-mode prefill # Decode worker (different NIXL port to avoid collision) NIXL_TELEMETRY_ENABLE=y NIXL_TELEMETRY_EXPORTER=prometheus \ NIXL_TELEMETRY_PROMETHEUS_PORT=19091 DYN_SYSTEM_PORT=8082 \ python -m dynamo.vllm --model --disaggregation-mode decode # Scrape NIXL metrics (separate from Dynamo metrics on 8081/8082) curl http://localhost:19090/metrics ``` For the full list of metrics, configuration options, and architecture details, see the upstream [NIXL Telemetry documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/telemetry.md) and [Prometheus exporter README](https://github.com/ai-dynamo/nixl/blob/main/src/plugins/telemetry/prometheus/README.md). For Kubernetes, see [Enable NIXL Telemetry](/dynamo/kubernetes-deployment/operate/observability/metrics#enable-nixl-telemetry-optional). ## Related Documentation - [Distributed Runtime Architecture](/dynamo/design-docs/distributed-runtime) - [Dynamo Architecture Overview](/dynamo/design-docs/overall-architecture) - [Backend Guide](/dynamo/backends/custom-backend/python-workers-lower-level) - [Forward Pass Metrics (SGLang)](/dynamo/backends/sg-lang/observability#forward-pass-metrics-fpm) — Per-iteration scheduler telemetry via ZMQ/NATS for planner-driven scaling (intended architecture; not available in the 1.2.0 SGLang runtime image) - [Forward Pass Metrics RFC](../proposals/vllm-rfc-forward-pass-metrics.md) - Design rationale for per-iteration metrics # Metrics Developer Guide This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API. 
## Metrics Exposure All metrics created via the Dynamo metrics API are automatically exposed on the `/metrics` HTTP endpoint in Prometheus Exposition Format text when the following environment variable is set: - `DYN_SYSTEM_PORT=` - Port for the metrics endpoint (set to positive value to enable, default: `-1` disabled) Example: ```bash DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model ``` Prometheus Exposition Format text metrics will be available at: `http://localhost:8081/metrics` ## Metric Name Constants The [prometheus_names.rs](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/runtime/src/metrics/prometheus_names.rs) module provides centralized metric name constants and sanitization functions to ensure consistency across all Dynamo components. --- ## Metrics API in Rust The metrics API is accessible through the `.metrics()` method on runtime, namespace, component, and endpoint objects. See [Runtime Hierarchy](/dynamo/user-guides/observability-local/metrics#runtime-hierarchy) for details on the hierarchical structure. 
### Available Methods - `.metrics().create_counter()`: Create a counter metric - `.metrics().create_gauge()`: Create a gauge metric - `.metrics().create_histogram()`: Create a histogram metric - `.metrics().create_countervec()`: Create a counter with labels - `.metrics().create_gaugevec()`: Create a gauge with labels - `.metrics().create_histogramvec()`: Create a histogram with labels ### Creating Metrics ```rust use dynamo_runtime::DistributedRuntime; let runtime = DistributedRuntime::new()?; let endpoint = runtime.namespace("my_namespace").component("my_component").endpoint("my_endpoint"); // Simple metrics let requests_total = endpoint.metrics().create_counter( "requests_total", "Total requests", &[] )?; let active_connections = endpoint.metrics().create_gauge( "active_connections", "Active connections", &[] )?; let latency = endpoint.metrics().create_histogram( "latency_seconds", "Request latency", &[], Some(vec![0.001, 0.01, 0.1, 1.0, 10.0]) )?; ``` ### Using Metrics ```rust // Counters requests_total.inc(); // Gauges active_connections.set(42.0); active_connections.inc(); active_connections.dec(); // Histograms latency.observe(0.023); // 23ms ``` ### Vector Metrics with Labels ```rust // Create vector metrics with label names let requests_by_model = endpoint.metrics().create_countervec( "requests_by_model", "Requests by model", &["model_type", "model_size"], &[] )?; let memory_by_gpu = endpoint.metrics().create_gaugevec( "gpu_memory_bytes", "GPU memory by device", &["gpu_id", "memory_type"], &[] )?; // Use with specific label values requests_by_model.with_label_values(&["llama", "7b"]).inc(); memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0); ``` ### Advanced Features **Custom histogram buckets:** ```rust let latency = endpoint.metrics().create_histogram( "latency_seconds", "Request latency", &[], Some(vec![0.001, 0.01, 0.1, 1.0, 10.0]) )?; ``` **Constant labels:** ```rust let counter = endpoint.metrics().create_counter( "requests_total", 
"Total requests", &[("region", "us-west"), ("env", "prod")] )?; ``` --- ## Related Documentation - [Metrics Overview](/dynamo/user-guides/observability-local/metrics) - [Prometheus and Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup) - [Distributed Runtime Architecture](/dynamo/design-docs/distributed-runtime) # Health Checks ## Overview Dynamo provides health check and liveness HTTP endpoints for each component which can be used to configure startup, liveness and readiness probes in orchestration frameworks such as Kubernetes. ## Environment Variables | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `DYN_SYSTEM_PORT` | System status server port | `-1` (disabled) | `8081` | | `DYN_SYSTEM_STARTING_HEALTH_STATUS` | Initial health status | `notready` | `ready`, `notready` | | `DYN_SYSTEM_HEALTH_PATH` | Custom health endpoint path | `/health` | `/custom/health` | | `DYN_SYSTEM_LIVE_PATH` | Custom liveness endpoint path | `/live` | `/custom/live` | | `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` | Endpoints required for ready state | none | `["generate"]` | | `DYN_HEALTH_CHECK_ENABLED` | Enable canary health checks | `false` (K8s: `true`) | `true`, `false` | | `DYN_CANARY_WAIT_TIME` | Seconds before sending canary health check | `10` | `5`, `30` | | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | Health check request timeout in seconds | `3` | `5`, `10` | ## Getting Started Quickly Enable health checks and query endpoints: ```bash # Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var) python -m dynamo.frontend & # Enable system status server on port 8081 DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager & ``` Check health status: ```bash # Frontend health (port 8000) curl -s localhost:8000/health | jq # Worker health (port 8081) curl -s localhost:8081/health | jq ``` ## Frontend Liveness Check The frontend liveness endpoint 
reports a status of `live` as long as the service is running. Frontend liveness depends only on the frontend service itself, not on worker health or liveness.

### Example Request

```bash
# Frontend default port is 8000 (override with --http-port or DYN_HTTP_PORT)
curl -s localhost:8000/live | jq
```

### Example Response

```json
{
  "message": "Service is live",
  "status": "live"
}
```

## Frontend Health Check

The frontend health endpoint responds as long as the service is running. In the example responses below it returns HTTP 200 both before and after workers register, while the `status` field changes from `unhealthy` to `healthy`. Once workers have been registered, the `health` endpoint also lists the registered endpoints and instances. Frontend liveness depends only on the frontend service itself, not on worker health or liveness.

### Example Request

```bash
# -v prints the response headers shown in the examples below
curl -v localhost:8000/health | jq
```

### Example Response

Before workers are registered:

```
HTTP/1.1 200 OK
content-type: application/json
content-length: 72
date: Wed, 03 Sep 2025 13:31:44 GMT

{
  "instances": [],
  "message": "No endpoints available",
  "status": "unhealthy"
}
```

After workers are registered:

```
HTTP/1.1 200 OK
content-type: application/json
content-length: 609
date: Wed, 03 Sep 2025 13:32:03 GMT

{
  "endpoints": [
    "dyn://dynamo.backend.generate"
  ],
  "instances": [
    {
      "component": "backend",
      "endpoint": "clear_kv_blocks",
      "instance_id": 7587888160958628000,
      "namespace": "dynamo",
      "transport": {
        "nats_tcp": "dynamo_backend.clear_kv_blocks-694d98147d54be25"
      }
    },
    {
      "component": "backend",
      "endpoint": "generate",
      "instance_id": 7587888160958628000,
      "namespace": "dynamo",
      "transport": {
        "nats_tcp": "dynamo_backend.generate-694d98147d54be25"
      }
    },
    {
      "component": "backend",
      "endpoint": "load_metrics",
      "instance_id": 7587888160958628000,
      "namespace": "dynamo",
      "transport": {
        "nats_tcp": "dynamo_backend.load_metrics-694d98147d54be25"
      }
    }
  ],
  "status": "healthy"
}
```

## Worker Liveness and Health Check

Health checks for components other than the frontend are enabled selectively based on environment variables.
If a health check is enabled for a component, you can set its starting status along with the set of endpoints that must be served before the component is declared `ready`. Once all endpoints declared in `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` are served, the component transitions to the `ready` state and stays there until it is shut down. The endpoints return `HTTP/1.1 503 Service Unavailable` while initializing and `HTTP/1.1 200 OK` once ready. Both `/live` and `/ready` return the same information.

### Example Environment Setting

```bash
export DYN_SYSTEM_PORT=9090
export DYN_SYSTEM_STARTING_HEALTH_STATUS="notready"
export DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS="[\"generate\"]"
```

#### Example Request

```bash
curl -v localhost:9090/health | jq
```

#### Example Response

Before endpoints are being served:

```
HTTP/1.1 503 Service Unavailable
content-type: text/plain; charset=utf-8
content-length: 96
date: Wed, 03 Sep 2025 13:42:39 GMT

{
  "endpoints": {
    "generate": "notready"
  },
  "status": "notready",
  "uptime": {
    "nanos": 313803539,
    "secs": 12
  }
}
```

After endpoints are being served:

```
HTTP/1.1 200 OK
content-type: text/plain; charset=utf-8
content-length: 139
date: Wed, 03 Sep 2025 13:42:45 GMT

{
  "endpoints": {
    "clear_kv_blocks": "ready",
    "generate": "ready",
    "load_metrics": "ready"
  },
  "status": "ready",
  "uptime": {
    "nanos": 356504530,
    "secs": 18
  }
}
```

## Canary Health Checks (Active Monitoring)

In addition to the HTTP endpoints described above, Dynamo includes a **canary health check** system that actively monitors worker endpoints.

### Overview

The canary health check system:

- **Monitors endpoint health** by sending periodic test requests to worker endpoints
- **Only activates during idle periods** - if there's ongoing traffic, health checks are skipped to avoid overhead
- **Automatically enabled in Kubernetes** deployments via the operator
- **Disabled by default** in local/development environments

### How It Works

1.
**Idle Detection**: After no activity on an endpoint for a configurable wait time (default: 10 seconds), a canary health check is triggered 2. **Health Check Request**: A lightweight test request is sent to the endpoint with a minimal payload (generates 1 token) 3. **Activity Resets Timer**: If normal requests arrive, the canary timer resets and no health check is sent 4. **Timeout Handling**: If a health check doesn't respond within the timeout (default: 3 seconds), the endpoint is marked as unhealthy ### Configuration #### In Kubernetes (Enabled by Default) Health checks are automatically enabled by the Dynamo operator. No additional configuration is required. ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: my-deployment spec: services: VllmWorker: componentType: worker replicas: 2 # Health checks automatically enabled by operator ``` #### In Local/Development Environments (Disabled by Default) To enable health checks locally: ```bash # Enable health checks export DYN_HEALTH_CHECK_ENABLED=true # Optional: Customize timing export DYN_CANARY_WAIT_TIME=5 # Wait 5 seconds before sending health check export DYN_HEALTH_CHECK_REQUEST_TIMEOUT=5 # 5 second timeout # Start worker python -m dynamo.vllm --model Qwen/Qwen3-0.6B ``` #### Configuration Options | Environment Variable | Description | Default | Notes | |---------------------|-------------|---------|-------| | `DYN_HEALTH_CHECK_ENABLED` | Enable/disable canary health checks | `false` (K8s: `true`) | Automatically set to `true` in K8s | | `DYN_CANARY_WAIT_TIME` | Seconds to wait (during idle) before sending health check | `10` | Lower values = more frequent checks | | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | Max seconds to wait for health check response | `3` | Higher values = more tolerance for slow responses | ### Health Check Payloads Each backend defines its own minimal health check payload: - **vLLM**: Single token generation with minimal sampling options - **TensorRT-LLM**: 
Single token with BOS token ID - **SGLang**: Single token generation request These payloads are designed to: - Complete quickly (\< 100ms typically) - Minimize GPU overhead - Verify the full inference stack is working ### Observing Health Checks When health checks are enabled, you'll see logs like: ``` INFO Health check manager started (canary_wait_time: 10s, request_timeout: 3s) INFO Spawned health check task for endpoint: generate INFO Canary timer expired for generate, sending health check INFO Health check successful for generate ``` If an endpoint fails: ``` WARN Health check timeout for generate ERROR Health check request failed for generate: connection refused ``` ### When to Use Canary Health Checks **Enable in production (Kubernetes):** - ✅ Detect unhealthy workers before they affect user traffic - ✅ Enable faster failure detection and recovery - ✅ Monitor worker availability continuously **Disable in development:** - ✅ Reduce log noise during debugging - ✅ Avoid overhead when not needed - ✅ Simplify local testing ### Troubleshooting **Health checks timing out:** - Increase `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` - Check worker logs for errors - Verify network connectivity **Too many health check logs:** - Increase `DYN_CANARY_WAIT_TIME` to reduce frequency - Or disable with `DYN_HEALTH_CHECK_ENABLED=false` in dev **Health checks not running:** - Verify `DYN_HEALTH_CHECK_ENABLED=true` is set - Check that `DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS` includes the endpoint - Ensure the worker is serving the endpoint ## Related Documentation - [Distributed Runtime Architecture](/dynamo/design-docs/distributed-runtime) - [Dynamo Architecture Overview](/dynamo/design-docs/overall-architecture) - [Backend Guide](/dynamo/backends/custom-backend/python-workers-lower-level) # Tracing ## Overview Dynamo supports OpenTelemetry-based distributed tracing for visualizing request flows across Frontend and Worker components. 
Traces are exported to Tempo via OTLP (OpenTelemetry Protocol) and visualized in Grafana. **Requirements:** Set `DYN_LOGGING_JSONL=true` and `OTEL_EXPORT_ENABLED=true` to export traces to Tempo. **Note:** When OTLP export is enabled, Dynamo exports both **traces and logs**. Traces are sent to Tempo and logs are sent to Loki (via the OpenTelemetry Collector). To send logs to a separate endpoint, set `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT`; otherwise it defaults to the traces endpoint. See [Logging](/dynamo/user-guides/observability-local/logging#otlp-log-export) for details. This guide covers single GPU demo setup using Docker Compose. For Kubernetes deployments, see [Kubernetes Deployment](#kubernetes-deployment). **Note:** This section has overlap with [Logging of OpenTelemetry Tracing](/dynamo/user-guides/observability-local/logging) since OpenTelemetry has aspects of both logging and tracing. The tracing approach documented here is for persistent trace visualization and analysis. For short debugging sessions examining trace context directly in logs, see the [Logging](/dynamo/user-guides/observability-local/logging) guide. ## Environment Variables | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `DYN_LOGGING_JSONL` | Enable JSONL logging format (required for tracing) | `false` | `true` | | `OTEL_EXPORT_ENABLED` | Enable OTLP trace export | `false` | `true` | | `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for traces | `http://localhost:4317` | `http://tempo:4317` | | `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | OTLP gRPC endpoint for logs (defaults to traces endpoint) | same as traces | `http://localhost:4317` | | `OTEL_SERVICE_NAME` | Service name for identifying components | `dynamo` | `dynamo-frontend` | ## Getting Started Quickly ### 1. Start Observability Stack Start the observability stack (Prometheus, Grafana, Tempo, exporters). 
See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions.

### 2. Start Dynamo Components (Single GPU)

For a simple single-GPU deployment, run the aggregated tracing launch script. This script enables tracing, sets per-component service names, and starts a frontend with a single vLLM worker:

```bash
cd examples/backends/vllm/launch
./agg_tracing.sh
```

To override the Tempo endpoint (default `http://localhost:4317`):

```bash
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4317
./agg_tracing.sh
```

This runs a single aggregated worker on one GPU, providing a simpler setup for testing tracing.

### Alternative: Disaggregated Deployment (2 GPUs)

For a disaggregated deployment with tracing, run the disaggregated tracing launch script. This script sets up tracing and launches a frontend, a decode worker on GPU 0, and a prefill worker on GPU 1:

```bash
cd examples/backends/vllm/launch
./disagg_tracing.sh
```

This separates prefill and decode onto different GPUs for better resource utilization.

### 3. Generate Traces

Send requests to the frontend to generate traces (works for both aggregated and disaggregated deployments). The launch scripts print an example `curl` command on startup with the correct model name.

**Tip:** Add an `x-request-id` header to easily search for a specific trace in Grafana. In the request below, set `model` to the model name printed by the launch script:

```bash
curl -H 'Content-Type: application/json' \
  -H 'x-request-id: test-trace-001' \
  -d '{
    "model": "",
    "max_completion_tokens": 100,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' \
  http://localhost:8000/v1/chat/completions
```

### 4. View Traces in Grafana Tempo

1. Open Grafana at `http://localhost:3000`
2. Log in with username `dynamo` and password `dynamo`
3. Navigate to **Explore** (compass icon in the left sidebar)
4. Select **Tempo** as the data source (should be selected by default)
5. In the query type, select **"Search"** (not TraceQL, not Service Graph)
6. Use the **Search** tab to find traces:
   - Search by **Service Name** (e.g., `dynamo-frontend`)
   - Search by **Span Name** (e.g., `http-request`, `handle_payload`)
   - Search by **Tags** (e.g., `x_request_id=test-trace-001`)
7. Click on a trace to view the detailed flame graph

#### Example Trace View

Below is an example of what a trace looks like in Grafana Tempo:

![Trace Example](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/cedab13edba5e8a58a12bd1944bb6f063d467f90f772a647dfa7f8d9f006e6fc/pages-v1.2.0/assets/img/trace.png)

### 5. Stop Services

When done, stop the observability stack. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for Docker Compose commands.

---

## Kubernetes Deployment

For Kubernetes deployments, ensure you have a Tempo instance deployed and accessible (e.g., `http://tempo.observability.svc.cluster.local:4317`).

### Modify DynamoGraphDeployment for Tracing

Tracing-enabled variants of the example deployments are provided:

- **Aggregated:** `examples/backends/vllm/deploy/agg_tracing.yaml`
- **Disaggregated:** `examples/backends/vllm/deploy/disagg_tracing.yaml`

These add the [Environment Variables](#environment-variables) to the base `agg.yaml` / `disagg.yaml` deployments. To override the Tempo endpoint, edit `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` in the YAML.

Apply a tracing-enabled deployment:

```bash
kubectl apply -f examples/backends/vllm/deploy/disagg_tracing.yaml
```

Traces will now be exported to Tempo and can be viewed in Grafana.

# Logging

Set `OTEL_EXPORT_ENABLED=true` on every Dynamo process. Without it, logs never leave the process and Loki will be silent regardless of `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT`.

## Overview

Dynamo provides structured logging in both text and JSONL formats. When JSONL is enabled, logs support `trace_id` and `span_id` fields for distributed tracing.
Span creation and exit events can be optionally enabled via the `DYN_LOGGING_SPAN_EVENTS` environment variable.

## Environment Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
| `DYN_LOGGING_SPAN_EVENTS` | Enable span entry/close event logging (`SPAN_FIRST_ENTRY`, `SPAN_CLOSED` messages) | `false` | `true` |
| `DYN_LOG` | Log levels per target `<default_level>,<target>=<level>,<target>=<level>` | `info` | `DYN_LOG=info,dynamo_runtime::system_status_server:trace` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps (default is UTC) | `false` | `true` |
| `DYN_LOGGING_CONFIG_PATH` | Path to custom TOML logging configuration | none | `/path/to/config.toml` |
| `VLLM_LOGGING_LEVEL` | vLLM backend log level (independent of `DYN_LOG`) | `INFO` | `DEBUG` |
| `TLLM_LOG_LEVEL` | TensorRT-LLM backend log level (independent of `DYN_LOG`) | `INFO` | `DEBUG` |
| `DYN_SKIP_SGLANG_LOG_FORMATTING` | Disable Dynamo's SGLang log configuration | `false` | `true` |
| `OTEL_SERVICE_NAME` | Service name for trace and span information | `dynamo` | `dynamo-frontend` |
| `OTEL_EXPORT_ENABLED` | Enable OTLP export of both traces and logs | `false` | `true` |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for traces | `http://localhost:4317` | `http://tempo:4317` |
| `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | OTLP gRPC endpoint for logs (defaults to `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` if not set) | same as traces endpoint | `http://localhost:4317` |

## OTLP Log Export

When `OTEL_EXPORT_ENABLED=true`, Dynamo exports both **traces and logs** via OTLP. Logs are sent to an OpenTelemetry Collector, which routes them to Grafana Loki for aggregation and querying.

By default, logs are exported to the same endpoint as traces (`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`).
To send logs to a different endpoint, set `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT`:

```bash
export OTEL_EXPORT_ENABLED=true
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
# Optional: send logs to a different endpoint
# export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://localhost:4317
```

The local observability stack (see [Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly)) includes an OpenTelemetry Collector that receives OTLP on `localhost:4317` and routes traces to Tempo and logs to Loki. In Grafana, the Loki datasource is pre-configured with a derived field that links `trace_id` labels to Tempo, so you can jump directly from a log line to its corresponding trace.

## Getting Started Quickly

### Start Observability Stack

For collecting and visualizing logs with Grafana Loki, or viewing trace context in logs alongside Grafana Tempo, start the observability stack. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions. The stack includes Loki, an OpenTelemetry Collector, and Tempo, all pre-wired together.

### Enable Structured Logging

Enable structured JSONL logging:

```bash
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug

# Start your Dynamo components (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```

Logs will be written to stderr in JSONL format with trace context.
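Because each JSONL log entry is a standalone JSON object, captured stderr can be filtered with a few lines of Python. The following is a minimal sketch, not part of Dynamo: the helper name `filter_logs` is hypothetical, and it assumes each record carries a `level` field using the level names from this guide. Lines that are not JSON (for example, plain-text output from a backend engine) are skipped rather than treated as errors.

```python
import json
from typing import Iterable, Iterator

# Dynamo log levels, ordered from least to most verbose.
LEVELS = ["ERROR", "WARN", "INFO", "DEBUG", "TRACE"]


def filter_logs(lines: Iterable[str], max_level: str = "INFO") -> Iterator[dict]:
    """Yield parsed JSONL records whose level is at most `max_level` in verbosity."""
    threshold = LEVELS.index(max_level.upper())
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Not a JSON line; ignore it.
        if not isinstance(record, dict):
            continue  # Valid JSON, but not a log object.
        level = str(record.get("level", "")).upper()
        if level in LEVELS and LEVELS.index(level) <= threshold:
            yield record
```

For example, if a component's stderr was redirected to a file, `filter_logs(open("frontend.log"), "WARN")` would yield only the `ERROR` and `WARN` records (the file name is a placeholder).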
## Available Logging Levels

| **Logging Levels (Least to Most Verbose)** | **Description** |
|-------------------------------------------|---------------------------------------------------------------------------------|
| **ERROR** | Critical errors (e.g., unrecoverable failures, resource exhaustion) |
| **WARN** | Unexpected or degraded situations (e.g., retries, recoverable errors) |
| **INFO** | Operational information (e.g., startup/shutdown, major events) |
| **DEBUG** | General debugging information (e.g., variable values, flow control) |
| **TRACE** | Very low-level, detailed information (e.g., internal algorithm steps) |

## Example Readable Format

Environment Setting:

```
export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
export DYN_LOGGING_JSONL="false"
```

Resulting Log format:

```
2025-09-02T15:50:01.770028Z INFO main.init: VllmWorker for Qwen/Qwen3-0.6B has been initialized
2025-09-02T15:50:01.770195Z INFO main.init: Reading Events from tcp://127.0.0.1:21555
2025-09-02T15:50:01.770265Z INFO main.init: Getting engine runtime configuration metadata from vLLM engine...
2025-09-02T15:50:01.770316Z INFO main.get_engine_cache_info: Cache config values: {'num_gpu_blocks': 24064}
2025-09-02T15:50:01.770358Z INFO main.get_engine_cache_info: Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}
```

## Example JSONL Format

Environment Setting:

```
export DYN_LOG="info,dynamo_runtime::system_status_server:trace"
export DYN_LOGGING_JSONL="true"
```

Resulting Log format:

```
{"time":"2025-09-02T15:53:31.943377Z","level":"INFO","target":"log","message":"VllmWorker for Qwen/Qwen3-0.6B has been initialized","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":191,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943550Z","level":"INFO","target":"log","message":"Reading Events from tcp://127.0.0.1:26771","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":212,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943636Z","level":"INFO","target":"log","message":"Getting engine runtime configuration metadata from vLLM engine...","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":220,"log.target":"main.init"}
{"time":"2025-09-02T15:53:31.943701Z","level":"INFO","target":"log","message":"Cache config values: {'num_gpu_blocks': 24064}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":267,"log.target":"main.get_engine_cache_info"}
{"time":"2025-09-02T15:53:31.943747Z","level":"INFO","target":"log","message":"Scheduler config values: {'max_num_seqs': 256, 'max_num_batched_tokens': 2048}","log.file":"/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/main.py","log.line":268,"log.target":"main.get_engine_cache_info"}
```

## Logging of Trace and Span IDs

When `DYN_LOGGING_JSONL` is enabled, all logs include `trace_id` and `span_id` fields, and spans are automatically created for requests.
This is useful for short debugging sessions where you want to examine trace context in logs without setting up a full tracing backend, and for correlating log messages with traces.

The trace and span information uses the OpenTelemetry format and libraries, which means the IDs are compatible with OpenTelemetry-based tracing backends like Tempo or Jaeger if you later choose to enable trace export.

**Note:** This section overlaps with [Distributed Tracing with Tempo](/dynamo/user-guides/observability-local/tracing). See that guide for trace visualization in Grafana Tempo and persistent trace analysis.

### Configuration for Logging

To see trace information in logs:

```bash
export DYN_LOGGING_JSONL=true
export DYN_LOG=debug  # Set to debug to see detailed trace logs

# Start your Dynamo components (e.g., frontend and worker) (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend &
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager &
```

This enables JSONL logging with `trace_id` and `span_id` fields. Traces appear in logs but are not exported to any backend.

### Example Request

Send a request to generate logs with trace context:

```bash
curl -H 'Content-Type: application/json' \
  -H 'x-request-id: test-trace-001' \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "max_completion_tokens": 100,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' \
  http://localhost:8000/v1/chat/completions
```

Check the logs (stderr) for JSONL output containing `trace_id`, `span_id`, and `x_request_id` fields.

## Trace and Span Information in Logs

This section shows how trace and span information appears in JSONL logs. These logs can be used to understand request flows even without a trace visualization backend.
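As an illustration of reading request flows from logs alone, the sketch below groups JSONL records by `trace_id` and orders each group by timestamp. It is not a Dynamo utility, and the function name `group_by_trace` is hypothetical. It assumes the `time` field holds UTC ISO-8601 timestamps in a fixed-width format, as in the examples in this guide, so that string order matches chronological order.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_by_trace(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group JSONL log records by trace_id, each group sorted by timestamp."""
    traces: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # Skip anything that is not a JSON object.
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        trace_id = record.get("trace_id")
        if trace_id:
            traces[trace_id].append(record)
    for records in traces.values():
        # Fixed-width UTC ISO-8601 strings sort chronologically.
        records.sort(key=lambda r: r.get("time", ""))
    return dict(traces)
```

Each value in the returned dictionary is the ordered list of log records for one request, which can then be filtered further by `span_name` or `x_request_id`.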
### Example Disaggregated Trace in Grafana

When viewing the corresponding trace in Grafana, you should see something like the following:

![Disaggregated Trace Example](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/a10f8a8024ac0b816923b4abf0b88a140bb854a2c461b8b24d5743f265fa9af7/pages-v1.2.0/assets/img/grafana-disagg-trace.png)

### Trace Overview

Dynamo creates distributed traces that span multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.

#### Available Spans in Disaggregated Mode

When running Dynamo in disaggregated mode, a typical request creates the following spans:

##### 1. `http-request` (Frontend - Root Span)

The root span for the entire request lifecycle, created in the **dynamo-frontend** service.

**Key Attributes:**

- **Service**: `dynamo-frontend`
- **Operation**: Handles the HTTP request from client to completion
- **Duration**: Total end-to-end request time (includes prefill + decode)
- **Method**: HTTP method (typically `POST`)
- **URI**: Request endpoint (e.g., `/v1/chat/completions`)
- **Status**: Request completion status
- **Children**: Typically 2-3 child spans (routing span + worker spans)

This span represents the complete request flow from when the frontend receives the HTTP request until the final response is sent back to the client.

##### 2. `prefill_routing` (Frontend - Routing Span)

A child span of `http-request`, created in the **dynamo-frontend** service during the routing phase.

**Key Attributes:**

- **Service**: `dynamo-frontend`
- **Operation**: Routes the prefill request to an appropriate prefill worker
- **Duration**: Time spent selecting a prefill worker plus the time the prefill request takes
- **Parent**: `http-request` span

This span covers the routing decision and the request sent to the prefill worker.

##### 3. `handle_payload` (Prefill Worker Span)

A child span of `http-request`, created in the **dynamo-worker-vllm-prefill** service.

**Key Attributes:**

- **Service**: `dynamo-worker-vllm-prefill` (or `dynamo-worker-sglang-prefill` for SGLang)
- **Operation**: Processes the prefill phase of generation
- **Duration**: Time to compute prefill (typically milliseconds to seconds)
- **Component**: `prefill`
- **Endpoint**: `generate`
- **Parent**: `http-request` span

This span represents the actual prefill computation on a prefill-specialized worker, including prompt processing and initial KV cache generation.

##### 4. `handle_payload` (Decode Worker Span)

A child span of `http-request`, created in the **dynamo-worker-vllm-decode** service.

**Key Attributes:**

- **Service**: `dynamo-worker-vllm-decode` (or `dynamo-worker-sglang-decode` for SGLang)
- **Operation**: Processes the decode phase of generation
- **Duration**: Time to generate all output tokens (typically seconds)
- **Component**: `decode` or `backend`
- **Endpoint**: `generate`
- **Parent**: `http-request` span

This span represents the iterative token generation phase on a decode-specialized worker, which consumes the KV cache from prefill and produces output tokens.

#### Understanding Span Metrics

Each span provides several useful metrics:

| Metric | Description |
|--------|-------------|
| **Duration** | Total time from span start to end |
| **Busy Time** | Time actively processing (excluding waiting) |
| **Idle Time** | Time spent waiting (e.g., for network, other services) |
| **Start Time** | When the span began |
| **Child Count** | Number of direct child spans |

The relationship **Duration = Busy Time + Idle Time** helps identify where time is spent and where the bottlenecks are.

## Custom Request IDs in Logs

You can provide a custom request ID using the `x-request-id` header.
This ID will be attached to all spans and logs for that request, making it easier to correlate traces with application-level request tracking.

### Example Request with Custom Request ID

```sh
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'x-request-id: 8372eac7-5f43-4d76-beca-0a94cfb311d0' \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": false,
    "max_tokens": 1000
  }'
```

All spans and logs for this request will include the `x_request_id` attribute with value `8372eac7-5f43-4d76-beca-0a94cfb311d0`.

### Frontend Logs with Custom Request ID

Notice how the `x_request_id` field appears in all log entries, alongside the `trace_id` (`80196f3e3a6fdf06d23bb9ada3788518`) and `span_id`:

```
{"time":"2025-10-31T21:06:45.397194Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
{"time":"2025-10-31T21:06:45.418584Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
{"time":"2025-10-31T21:06:45.418854Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"f7e487a9d2a6bf38","span_name":"http-request","trace_id":"80196f3e3a6fdf06d23bb9ada3788518","uri":"/v1/chat/completions","version":"HTTP/1.1","x_request_id":"8372eac7-5f43-4d76-beca-0a94cfb311d0"}
```

## Backend Engine Log Levels

Dynamo's `DYN_LOG` environment variable controls Dynamo's own logging. Each inference backend has its own log level control that is **independent** of `DYN_LOG`.

### vLLM

vLLM log level is controlled by the `VLLM_LOGGING_LEVEL` environment variable. It defaults to `INFO` and is completely independent of `DYN_LOG`.

```bash
# Set vLLM to debug while keeping Dynamo at info
export DYN_LOG=info
export VLLM_LOGGING_LEVEL=DEBUG
```

Valid values: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.

### TensorRT-LLM

TensorRT-LLM log level is controlled by the `TLLM_LOG_LEVEL` environment variable. It defaults to `INFO` and is completely independent of `DYN_LOG`.

```bash
# Set TRT-LLM to info while keeping Dynamo at warn
export DYN_LOG=warn
export TLLM_LOG_LEVEL=INFO
```

Valid values: `TRACE`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, `INTERNAL_ERROR`.

**Note:** `TLLM_LOG_LEVEL` is read once at TensorRT-LLM import time. It must be set before the process starts.

### SGLang

SGLang logging is currently configured through Dynamo and follows the `DYN_LOG` level by default. To disable Dynamo's SGLang log configuration and manage it independently, set:

```bash
export DYN_SKIP_SGLANG_LOG_FORMATTING=true
```

Alternatively, pass the `--log-level` argument to the SGLang worker command to set the SGLang engine's log level directly (e.g. `--log-level DEBUG`). This is independent of `DYN_LOG`.
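When workers are launched from a script, the per-backend rules above can be gathered in one place. The following is a rough sketch under stated assumptions: the helper `worker_log_settings` is hypothetical, and the backend keys `vllm`, `trtllm`, and `sglang` simply mirror the `dynamo.vllm`, `dynamo.trtllm`, and `dynamo.sglang` module names. It builds the environment before the process starts, which matters for `TLLM_LOG_LEVEL` because that variable is read once at import time.

```python
import os

# Valid values as listed in this guide.
VLLM_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}
TLLM_LEVELS = {"TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "INTERNAL_ERROR"}


def worker_log_settings(backend, dyn_log="info", engine_level="INFO", base_env=None):
    """Return (env, extra_args) to launch a worker with the requested log levels."""
    env = dict(os.environ if base_env is None else base_env)
    env["DYN_LOG"] = dyn_log  # Controls Dynamo's own logging only.
    extra_args = []
    if backend == "vllm":
        level = engine_level.upper()
        if level not in VLLM_LEVELS:
            raise ValueError(f"invalid vLLM log level: {engine_level}")
        env["VLLM_LOGGING_LEVEL"] = level
    elif backend == "trtllm":
        level = engine_level.upper()
        if level not in TLLM_LEVELS:
            raise ValueError(f"invalid TensorRT-LLM log level: {engine_level}")
        env["TLLM_LOG_LEVEL"] = level
    elif backend == "sglang":
        # SGLang takes its engine log level as a worker CLI argument.
        extra_args += ["--log-level", engine_level]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return env, extra_args
```

The returned `env` and `extra_args` can be passed to `subprocess.Popen` together with the worker command for the chosen backend.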
## Related Documentation

- [Distributed Tracing with Tempo](/dynamo/user-guides/observability-local/tracing)
- [Log Aggregation in Kubernetes](/dynamo/kubernetes-deployment/operate/observability/logging)
- [Observability Getting Started](/dynamo/user-guides/observability-local)
- [Distributed Runtime Architecture](/dynamo/design-docs/distributed-runtime)
- [Dynamo Architecture Overview](/dynamo/design-docs/overall-architecture)
- [Backend Guide](/dynamo/backends/custom-backend/python-workers-lower-level)

# DynoSim

DynoSim is Dynamo's simulation stack for exploring serving configurations before validating them on real clusters. It is not a separate service; it is the product surface that connects workload-driven simulation runs, configuration sweeps, the mocker engine, Planner simulation, Router simulation, and AIC-backed timing models into one workflow.

Use DynoSim when you want to answer questions such as:

- Which aggregated or disaggregated topology should this workload use?
- How many prefill and decode workers fit within my GPU budget?
- How sensitive is the deployment to startup time, queue pressure, prefix reuse, or router tuning?
- Which candidates should I validate with AIPerf on real GPUs?
## Components

| Component | Entry Point | Role |
|---|---|---|
| DynoSim run | `python -m dynamo.replay` | Runs one workload against one simulated Dynamo configuration and emits metrics plus a report |
| DynoSim sweep | `dynamo.profiler.utils.replay_optimize` | Sweeps many simulation trials across TP shape, worker split, router knobs, SLA constraints, and GPU budget |
| Live simulation with Mocker | `python -m dynamo.mocker` | Runs simulated workers inside a live Dynamo deployment path, including worker registration and KV event publishing |
| Mocker core | `lib/mocker` | Models engine scheduling, KV allocation, prefix caching, preemption, and timing |
| AIC | AI Configurator SDK | Supplies calibrated timing and candidate-shape data for supported model/backend/GPU tuples |
| Planner simulation | `--planner-config` on DynoSim runs | Runs Planner decisions in the simulation loop to study scaling behavior and SLA compliance |

## Workflow

```mermaid
flowchart LR
    W["Workload trace or synthetic workload"] --> R["Single DynoSim run"]
    R --> S["DynoSim sweep"]
    S --> C["Candidate configs"]
    C --> M["Live Mocker deployment"]
    C --> G["Real-GPU validation"]
    M --> G
```

Start with a single DynoSim run to verify the workload shape and engine arguments. Use DynoSim sweeps when you want to search the design space. Use live Mocker deployments when you need to exercise the real Dynamo frontend, router, worker registration, KV events, and planner paths without running model inference. Validate the shortlist on real GPUs before production rollout.

## Where AIC Fits

AIC provides performance models and candidate-shape information. DynoSim uses those models as one timing source inside the mocker engine and sweep optimizer. Mocker still owns the scheduler and KV-memory simulation: batching, prefix-cache hits, preemption, block allocation, and request lifecycle are simulated by Dynamo's mocker core, while AIC-backed timing predicts how long prefill and decode work should take for supported model/backend/GPU combinations.

## Choosing an Entry Point

| Goal | Start Here |
|---|---|
| Run one trace or synthetic workload through one config | [DynoSim Runs](/dynamo/user-guides/dynosim/runs) |
| Sweep topology and router choices under SLA/GPU constraints | [DynoSim Sweeps](/dynamo/user-guides/dynosim/sweeps) |
| Exercise a live frontend/router setup without GPUs | [Live Simulation with Mocker](/dynamo/user-guides/dynosim/mocker) |
| Study Planner scaling decisions against a trace | [Planner DynoSim Benchmarking](/dynamo/user-guides/dynosim/planner-benchmarking) |
| Generate a deployable Kubernetes config from model/SLA intent | [Model Deployment Guide](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide) |

DynoSim narrows the search space; it does not replace real-hardware validation. Use it to move quickly, find promising candidates, and understand failure modes before spending cluster time.

# Live Simulation with Mocker

Mocker is the live simulated engine in DynoSim. It runs as a Dynamo backend, registers workers, publishes KV events, and exercises the real frontend/router/planner path without requiring GPUs.

The mocker core is implemented in Rust and models the scheduling, memory management, and timing behavior of production engines. It can use polynomial timing, profile-derived timing, or AIC-backed timing. AIC predicts prefill/decode duration; Mocker still owns the scheduler, KV cache lifecycle, prefix-cache behavior, and request execution model.
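To make the KV cache lifecycle and prefix-cache behavior concrete, here is a toy Python sketch of a block-based prefix cache. It is an illustration only and not Dynamo's Rust mocker core: the names `block_hashes` and `ToyPrefixCache` are hypothetical, and the chained block hashing and least-recently-used eviction are simplifying assumptions chosen for the example.

```python
from collections import OrderedDict
from typing import Hashable, Sequence


def block_hashes(tokens: Sequence[int], block_size: int) -> list[int]:
    """Chain-hash each full block so that equal prefixes yield equal block ids."""
    hashes: list[int] = []
    parent: Hashable = None
    full = len(tokens) - len(tokens) % block_size  # Ignore the partial tail block.
    for start in range(0, full, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    """Fixed-capacity block cache with least-recently-used eviction."""

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self._blocks: OrderedDict[int, None] = OrderedDict()

    def admit(self, hashes: list[int]) -> int:
        """Store a request's blocks; return the number of leading cache hits."""
        hits = 0
        for h in hashes:
            if h not in self._blocks:
                break
            hits += 1
        for h in hashes:
            if h in self._blocks:
                self._blocks.move_to_end(h)  # Mark as most recently used.
            else:
                if len(self._blocks) >= self.num_blocks:
                    self._blocks.popitem(last=False)  # Evict the LRU block.
                self._blocks[h] = None
        return hits
```

Two requests that share a leading block of tokens produce the same first block hash, so the second request registers a prefix hit for that block. A real engine additionally tracks which blocks are in use by running sequences, which this sketch omits.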
## Overview

The mocker simulates:

- **Block-based KV cache management** with LRU eviction
- **Engine-specific continuous batching schedulers** for vLLM and SGLang
- **Prefix caching** with hash-based block deduplication
- **Chunked prefill** for better batching efficiency
- **Realistic timing models** for prefill and decode phases
- **Disaggregated serving** (prefill/decode separation)
- **KV event publishing** for router integration
- **Data parallelism** (multiple DP ranks per engine)

> **Note:** While the mocker uses vLLM as its primary reference implementation, these core components (block-based KV cache management, continuous batching schedulers, LRU evictors, and prefix caching) are fundamental to all modern LLM inference engines, including SGLang and TensorRT-LLM. The architectural patterns simulated here are engine-agnostic and apply broadly across the inference ecosystem.

## Quick Start

### Basic Usage

```bash
# Launch a single mocker worker
python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B

# Launch with custom KV cache configuration
python -m dynamo.mocker \
  --model-path Qwen/Qwen3-0.6B \
  --num-gpu-blocks-override 8192 \
  --block-size 64 \
  --max-num-seqs 256

# Launch with timing speedup for faster testing
python -m dynamo.mocker \
  --model-path Qwen/Qwen3-0.6B \
  --speedup-ratio 10.0
```

### Disaggregated Serving

```bash
# Launch prefill worker
python -m dynamo.mocker \
  --model-path Qwen/Qwen3-0.6B \
  --disaggregation-mode prefill \
  --bootstrap-ports 50100

# Launch decode worker (in another terminal)
python -m dynamo.mocker \
  --model-path Qwen/Qwen3-0.6B \
  --disaggregation-mode decode
```

### Multiple Workers in One Process

```bash
# Launch 4 mocker workers sharing the same tokio runtime
python -m dynamo.mocker \
  --model-path Qwen/Qwen3-0.6B \
  --num-workers 4
```

## CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | Required | HuggingFace model ID or local path for tokenizer |
| `--endpoint` | Auto-derived | Dynamo endpoint string. Defaults are namespace-dependent, and prefill workers use a different default endpoint than aggregated/decode workers |
| `--model-name` | Derived from model-path | Model name for API responses |
| `--num-gpu-blocks-override` | 16384 | Number of KV cache blocks |
| `--block-size` | 64 (`vllm`) / engine-specific | Tokens per KV cache block. For `sglang`, if omitted, the effective page/block size defaults to 1 or to `--sglang-page-size` when provided |
| `--max-num-seqs` | 256 | Maximum concurrent sequences |
| `--max-num-batched-tokens` | 8192 | Maximum tokens per batch |
| `--enable-prefix-caching` | True | Enable prefix caching |
| `--no-enable-prefix-caching` | - | Disable prefix caching |
| `--enable-chunked-prefill` | True | Enable chunked prefill |
| `--no-enable-chunked-prefill` | - | Disable chunked prefill |
| `--preemption-mode` | `lifo` | Decode eviction policy under memory pressure: `lifo` (vLLM v1 style) or `fifo` |
| `--speedup-ratio` | 1.0 | Timing speedup factor |
| `--decode-speedup-ratio` | 1.0 | Decode-only speedup multiplier (e.g. for Eagle speculation) |
| `--data-parallel-size` | 1 | Number of DP replicas |
| `--startup-time` | None | Simulated startup delay (seconds) |
| `--planner-profile-data` | None | Path to either a mocker-format `.npz` file or a profiler results directory |
| `--num-workers` | 1 | Workers per process |
| `--reasoning` | None | JSON config for emitting reasoning token spans, with `start_thinking_token_id`, `end_thinking_token_id`, and `thinking_ratio` |
| `--engine-type` | `vllm` | Engine simulation type: `vllm` or `sglang` |
| `--sglang-schedule-policy` | `fifo` / `fcfs` | SGLang scheduling policy: `fifo`/`fcfs` (default) or `lpm` (longest prefix match) |
| `--sglang-page-size` | 1 | SGLang radix-cache page size in tokens. Also becomes the effective block size when `--engine-type sglang` and `--block-size` is omitted |
| `--sglang-max-prefill-tokens` | 16384 | SGLang max prefill-token budget per batch |
| `--sglang-chunked-prefill-size` | 8192 | SGLang chunked-prefill chunk size |
| `--sglang-clip-max-new-tokens` | 4096 | SGLang admission-budget cap for max new tokens |
| `--sglang-schedule-conservativeness` | 1.0 | SGLang schedule conservativeness factor |
| `--aic-perf-model` | False | Use AIC SDK for latency prediction instead of interpolated/polynomial models. Opt-in only: default mocker and DynoSim run paths do not use AIC. Requires `aiconfigurator` installed and usable AIC systems/perf data for the requested `system/backend/version` tuple |
| `--aic-system` | `h200_sxm` | AIC system name (e.g., `h200_sxm`). Used with `--aic-perf-model` |
| `--aic-backend-version` | Auto | AIC backend engine version (e.g., `0.12.0` for vLLM). If not set, uses the default version for the backend |
| `--aic-tp-size` | 1 | Tensor parallel size for AIC latency prediction. Only affects AIC performance model lookups, not mocker scheduling |
| `--aic-moe-tp-size` | None | MoE tensor parallel size for AIC latency prediction. Required by some MoE models |
| `--aic-moe-ep-size` | None | MoE expert parallel size for AIC latency prediction. Required by some MoE models |
| `--aic-attention-dp-size` | None | Attention data parallel size for AIC latency prediction. Required by some MoE models |
| `--extra-engine-args` | None | Path to a JSON file with mocker configuration; overrides individual CLI arguments |
| `--stagger-delay` | -1 (auto) | Delay between worker launches (seconds). 0 disables, -1 enables auto mode |
| `--disaggregation-mode` | `agg` | Worker mode: `agg` (aggregated), `prefill`, or `decode` |
| `--durable-kv-events` | False | Deprecated JetStream KV-event mode; prefer the local indexer / event-plane subscriber path |
| `--zmq-kv-events-ports` | None | Comma-separated ZMQ PUB base ports for KV event publishing, one per worker |
| `--zmq-replay-ports` | None | Comma-separated ZMQ ROUTER base ports for gap recovery, one per worker |
| `--bootstrap-ports` | None | Comma-separated rendezvous base ports, one per worker in disaggregated mode |
| `--kv-transfer-bandwidth` | 64.0 | KV cache transfer bandwidth in GB/s. Set to 0 to disable |
| `--kv-cache-dtype` | auto | KV cache dtype for bytes-per-token computation |
| `--kv-bytes-per-token` | Auto-computed | KV cache bytes per token (override auto-computation) |
| `--discovery-backend` | Env-driven (`etcd`) | Discovery backend: `kubernetes`, `etcd`, `file`, or `mem` |
| `--request-plane` | Env-driven (`tcp`) | Request transport: `nats`, `tcp` |
| `--event-plane` | Env-driven (`nats`) | Event transport: `nats` or `zmq` |

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_MOCKER_KV_CACHE_TRACE` | off | Set to `1` or `true` to log structured KV cache allocation and eviction traces |

> **Note:** For local scale tests and router benchmarks, prefer `--num-workers` over launching many separate mocker processes. All workers share one tokio runtime and thread pool, which is both lighter weight and closer to how the test harnesses exercise the mocker.

## DynoSim Runs

Mocker also powers DynoSim runs through the dedicated `python -m dynamo.replay` CLI, which directly exposes `offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency admission, synthetic workload generation, and offline disaggregated prefill/decode simulation.

The DynoSim CLI defaults to `--replay-mode offline` and `--router-mode round_robin`.
Aggregated runs use `--extra-engine-args`. Offline disaggregated runs instead use `--prefill-engine-args` plus `--decode-engine-args`, together with `--num-prefill-workers` and `--num-decode-workers`.

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
  --num-workers 4 \
  --replay-mode offline \
  --router-mode kv_router \
  --arrival-speedup-ratio 5 \
  --trace-block-size 512 \
  --extra-engine-args '{"block_size":64}' \
  --router-config '{"router_queue_policy":"fcfs"}' \
  --report-json /tmp/dynosim-report.json
```

The same CLI also supports synthetic workloads without a trace file:

```bash
python -m dynamo.replay \
  --input-tokens 5000 \
  --output-tokens 500 \
  --request-count 1000 \
  --arrival-interval-ms 1.0 \
  --num-workers 1 \
  --replay-mode offline \
  --replay-concurrency 100 \
  --extra-engine-args '{"block_size":512}' \
  --report-json /tmp/dynosim-report.json
```

Synthetic workloads also support shared-prefix and multi-turn tests:

```bash
python -m dynamo.replay \
  --input-tokens 5000 \
  --output-tokens 500 \
  --request-count 200 \
  --turns-per-session 3 \
  --shared-prefix-ratio 0.5 \
  --num-prefix-groups 8 \
  --inter-turn-delay-ms 250 \
  --replay-mode offline \
  --replay-concurrency 32 \
  --extra-engine-args '{"block_size":512}' \
  --report-json /tmp/dynosim-report.json
```

For trace files, DynoSim also understands multi-turn sessions when records share `session_id`. The first turn uses `timestamp`/`created_time`; later turns can use `delay` or `delay_ms`:

```json
{"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]}
{"session_id":"session-a","delay":250,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]}
```

For trace-file runs, `--trace-block-size` controls how many tokens each `hash_id` represents in the dataset, while engine `block_size` still controls the simulated engine and router hashing.
Public Mooncake/toolagent traces use `--trace-block-size 512`; engine `block_size` can still stay at `64` to match the live runtime configuration.

The standalone DynoSim CLI prints an AIPerf-style summary table to stdout and writes the full report JSON to disk.

Timing semantics:

- trace mode honors first-turn timestamps and inter-turn delays
- concurrency mode ignores first-turn timestamps but still enforces inter-turn delays
- in concurrency mode, TTFT is measured from actual dispatch under the in-flight cap

For full usage, constraints, and benchmarking guidance, see [DynoSim Runs](/dynamo/user-guides/dynosim/runs).

DynoSim runs support aggregated `vllm` and `sglang` engine configs. Internally the simulator uses canonical `block_size`; for `sglang`, `sglang.page_size` is still accepted as a compatibility alias as long as it matches `block_size` when both are provided.

Offline DynoSim runs also support disaggregated `kv_router` mode. In that mode:

- `--prefill-engine-args` must describe a prefill worker
- `--decode-engine-args` must describe a decode worker
- `--router-mode` must be `kv_router`
- only offline mode is supported

Example:

```bash
python -m dynamo.replay \
  --input-tokens 4096 \
  --output-tokens 256 \
  --request-count 100 \
  --replay-mode offline \
  --router-mode kv_router \
  --replay-concurrency 32 \
  --num-prefill-workers 2 \
  --num-decode-workers 6 \
  --prefill-engine-args '{"worker_type":"prefill","block_size":512}' \
  --decode-engine-args '{"worker_type":"decode","block_size":512}' \
  --router-config '{"router_queue_policy":"wspt"}' \
  --report-json /tmp/dynosim-report.json
```

## Performance Modeling Setup

By default, the mocker uses hardcoded polynomial formulas to estimate prefill and decode timing. For more realistic simulations, pass `--planner-profile-data` with either:

- a mocker-format `.npz` file, or
- a profiler output directory

The mocker automatically accepts profiler-style results directories and converts them internally.
It also accepts older raw-data directories containing: - `prefill_raw_data.json` - `decode_raw_data.json` ```bash python -m dynamo.mocker \ --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \ --planner-profile-data components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D \ --speedup-ratio 1.0 ``` ### AIC Performance Model To use the AIC SDK for latency prediction: ```bash uv pip install '.[mocker]' python -m dynamo.mocker \ --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \ --engine-type vllm \ --aic-perf-model \ --aic-system h200_sxm ``` The AIC model automatically uses `--model-path` and `--engine-type` to select the appropriate performance data. Available systems include `h200_sxm`, `h100_sxm`, etc. (see AIC SDK documentation for the full list). Important notes: - AIC is opt-in. If you do not pass `--aic-perf-model`, `python -m dynamo.mocker` does not use AIC. - `python -m dynamo.replay` has two separate AIC surfaces: - engine timing AIC through `--extra-engine-args` / staged engine JSON - router-side prefill-load AIC through top-level `--aic-*` flags plus `router_prefill_load_model="aic"` in `--router-config` - The Python AIC session bridge is now shared with the live KV router path via the internal `dynamo._internal.aic` module. Mocker CLI behavior is unchanged; this just removes duplicate AIC session code. - `aiconfigurator` must be able to load the requested performance database for the selected `system/backend/version`. If the SDK is installed but the backing systems data is missing or unreadable, mocker now fails fast at startup with a clear error instead of failing later on first request. - In development environments, this may require pointing Python at a source checkout of `aiconfigurator` with real Git LFS payloads materialized in its `systems/` directory. This mocker AIC path is separate from the router-side prefill-load estimator. 
Live router, frontend, and DynoSim runs all use `router_prefill_load_model="aic"` plus top-level `--aic-*` flags for oldest-prefill prompt-load decay. DynoSim still uses engine-args AIC separately when you want the mocked worker timing model itself to come from AIC. For aggregated DynoSim runs, engine timing AIC still comes from `--extra-engine-args`: ```bash python -m dynamo.replay /path/to/trace.jsonl \ --extra-engine-args '{"aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1}' ``` For offline disaggregated DynoSim runs, pass the staged engine configs instead: ```bash python -m dynamo.replay /path/to/trace.jsonl \ --replay-mode offline \ --router-mode kv_router \ --prefill-engine-args '{"worker_type":"prefill","aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1,"block_size":512}' \ --decode-engine-args '{"worker_type":"decode","aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1,"block_size":512}' \ --num-prefill-workers 2 \ --num-decode-workers 6 ``` The `aic_backend` field enables the AIC perf model and should match `engine_type` (`"vllm"` or `"sglang"`). The `aic_model_path` field is the equivalent of `--model-path` in `dynamo.mocker`. DynoSim router-side AIC prompt-load modeling is configured separately with top-level flags: ```bash python -m dynamo.replay /path/to/trace.jsonl \ --replay-mode offline \ --router-mode kv_router \ --num-workers 4 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' \ --router-config '{"router_track_prefill_tokens":true,"router_prefill_load_model":"aic"}' \ --aic-backend vllm \ --aic-system h200_sxm \ --aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8 \ --aic-tp-size 1 ``` For MoE models that require AIC MoE parallelism, pass the same fields on the router-side AIC surface. 
For Kimi-style TP-only MoE simulation, use `--aic-moe-tp-size` equal to `--aic-tp-size`, `--aic-moe-ep-size 1`, and `--aic-attention-dp-size 1`. For offline disaggregated DynoSim runs, the same top-level `--aic-*` flags drive the prefill-stage router only; the decode-stage router keeps prompt tracking disabled. Example `--reasoning` configuration: ```bash python -m dynamo.mocker \ --model-path Qwen/Qwen3-0.6B \ --reasoning '{"start_thinking_token_id":123,"end_thinking_token_id":456,"thinking_ratio":0.6}' ``` The profile results directory should contain: - `selected_prefill_interpolation/raw_data.npz` - `selected_decode_interpolation/raw_data.npz` To generate profile data for your own model and hardware, run the profiler and then point `--planner-profile-data` at the resulting output directory. ## Event Transport and Router Testing The default event path uses the local indexer / event-plane subscriber flow. The older durable KV-events mode is still available through `--durable-kv-events`, but it is deprecated and should not be the preferred setup for new tests. For router and indexer experiments that need native wire-format event forwarding, the mocker also supports a ZMQ path: - `--event-plane zmq` - `--zmq-kv-events-ports` for per-worker PUB base ports - `--zmq-replay-ports` for optional replay/gap-recovery ROUTER base ports When set, each worker binds on its base port plus `dp_rank`, so the number of comma-separated base ports must match `--num-workers`. ## Disaggregation Port Layout `--bootstrap-ports` takes a comma-separated list of base ports, one per worker. In multi-worker mode, the number of listed ports must exactly match `--num-workers`. Prefill workers listen on these ports and publish the bootstrap endpoint through discovery. Decode workers use the matching ports to rendezvous before decode begins. 
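The one-base-port-per-worker rule is easy to check before launching anything. The sketch below is illustrative only: `parse_base_ports` and `bind_ports` are hypothetical helpers, not part of the mocker, and they assume the flag value is a plain comma-separated list of integers. The `base + dp_rank` offset mirrors the binding rule stated above for the ZMQ event ports.

```python
def parse_base_ports(value: str, num_workers: int) -> list[int]:
    """Parse a comma-separated base-port list and check it against the worker count."""
    ports = [int(p) for p in value.split(",") if p.strip()]
    if len(ports) != num_workers:
        raise ValueError(
            f"expected {num_workers} base ports, got {len(ports)}: {value!r}"
        )
    return ports


def bind_ports(base_ports: list[int], dp_size: int) -> dict[int, list[int]]:
    """Map each worker index to the ports it would bind: base port + dp_rank."""
    return {
        worker: [base + rank for rank in range(dp_size)]
        for worker, base in enumerate(base_ports)
    }


if __name__ == "__main__":
    # Hypothetical port numbers, chosen only for illustration.
    bases = parse_base_ports("5557,5567", num_workers=2)
    print(bind_ports(bases, dp_size=2))
```

Leaving a gap of at least `dp_size` between base ports keeps the derived per-rank ports from colliding.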
## Kubernetes Deployment

The mocker can be deployed through example `DynamoGraphDeployment` manifests for both aggregated and disaggregated setups:

```bash
kubectl apply -f examples/backends/mocker/deploy/agg.yaml
kubectl apply -f examples/backends/mocker/deploy/disagg.yaml
```

## Architecture

The mocker is organized into several cooperating components that mirror the internal architecture of production LLM inference engines. The scheduler (vLLM-style and SGLang-style variants) and KV block manager live inside the engine core. Multi-engine behavior (KV transfer/offloading simulation, KV router simulation, planner simulation) is added by the DynoSim run harness on top of multiple engine cores; see [DynoSim Runs](/dynamo/user-guides/dynosim/runs) for the component-level diagram and for offline internals under [`lib/mocker/src/replay/offline/`](../../lib/mocker/src/replay/offline/README.md).

### Scheduler

The mocker now has two scheduler shapes rather than one generic queue model:

- **vLLM mocker** uses an upstream-style `waiting + running` scheduler. Each request tracks computed tokens, the scheduler spends one token budget across the running set first, and decode pressure triggers inline preemption of running requests.
- **SGLang mocker** uses a cache-aware waiting/running scheduler around a radix-style prefix cache. It batches prefill work with decode-state awareness and handles pressure primarily through decode retraction while preserving cached prefixes.

Both schedulers simulate continuous batching, prefix reuse, chunked prefill, memory pressure, and decode token emission while publishing metrics about current resource utilization.
When resources become constrained, the mocker simulates the engine's real recovery path: - vLLM-style decode preemption and recompute - SGLang-style decode retraction plus prefix-preserving cache updates ### KV Block Manager The mocker's KV block manager is now built on [`kvbm-logical::BlockManager`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/kvbm-logical), the same logical block manager the real Dynamo runtime uses. The mocker wraps it in [`lib/mocker/src/kv_manager/kvbm_backend.rs`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/lib/mocker/src/kv_manager/kvbm_backend.rs) and translates its own `MoveBlock` protocol onto kvbm-logical's RAII lifecycle (`allocate → stage → register → drop`). Blocks still conceptually live in one of two pools: - **Active** — blocks currently held by at least one sequence. Partial (still-filling) blocks are held as `MutableBlock`; full blocks are held as `ImmutableBlock` clones (the clone vec length is the mocker's refcount, one per `Use`). - **Inactive** — blocks no longer referenced by any sequence but kept for prefix-cache reuse. Handled entirely by kvbm-logical's inactive pool; the mocker never tracks them manually. The lifecycle is RAII: dropping the last `ImmutableBlock` clone transitions the block from active to inactive (kvbm-logical's `reset` pool), with no explicit `deref`/`evict` bookkeeping on the mocker side. When a sequence completes or is preempted, the mocker simply drops its handles; kvbm-logical recovers the capacity. 
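The active/inactive lifecycle can be pictured with a toy model. This is a deliberately simplified Python sketch, not the kvbm-logical API: `ToyBlockPool`, `use`, and `release` are hypothetical names, an explicit counter stands in for the RAII handle drop, and eviction is reduced to oldest-first for brevity. The three return strings loosely mirror the `Use` outcomes this document tracks for KV-event emission.

```python
from collections import OrderedDict


class ToyBlockPool:
    """Toy active/inactive block pools keyed by block hash (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # hash -> number of live handles (stand-in for ImmutableBlock clones)
        self.active: dict[int, int] = {}
        # unreferenced blocks kept for prefix reuse, oldest first
        self.inactive: OrderedDict[int, None] = OrderedDict()

    def use(self, block_hash: int) -> str:
        """Take a handle on a block and report how it was obtained."""
        if block_hash in self.active:
            self.active[block_hash] += 1
            return "active_hit"
        if block_hash in self.inactive:
            del self.inactive[block_hash]
            self.active[block_hash] = 1
            return "inactive_hit"
        if len(self.active) + len(self.inactive) >= self.capacity:
            if not self.inactive:
                raise MemoryError("no free or evictable blocks")
            # Assumption: evict the oldest inactive block (plain LRU-like order).
            self.inactive.popitem(last=False)
        self.active[block_hash] = 1
        return "new_store"

    def release(self, block_hash: int) -> None:
        """Drop one handle; the last drop moves the block to the inactive pool."""
        self.active[block_hash] -= 1
        if self.active[block_hash] == 0:
            del self.active[block_hash]
            self.inactive[block_hash] = None
```

In this toy, only a `"new_store"` result would correspond to emitting a `Stored` event, because the other two outcomes reuse a block that is already known.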
```mermaid stateDiagram-v2 [*] --> Active : allocate + stage + register Active --> Inactive : last handle dropped (RAII) Inactive --> Active : match_blocks(PLH) reuse Inactive --> Freed : evicted by backend Active --> Freed : explicit Removed (Destroy) Freed --> [*] state Active { [*] --> Partial : MutableBlock Partial --> Full : promote (PLH / SequenceHash) [*] --> Full : ImmutableBlock clones } ``` Three `Use` outcomes are tracked for KV-event emission: `ActiveHit` (bump refcount on an already-pinned block), `InactiveHit` (reactivate via `match_blocks(plh)`), and `NewStore` (fresh allocation). Only `NewStore` emits a `Stored` KV event — the router radix tree already knows about the other two and only forgets on explicit `Removed`. ### Eviction Backends The kvbm-logical inactive pool selects eviction victims via one of three backends, exposed as `MockerEvictionBackend` in [`lib/mocker/src/common/protocols.rs`](../../lib/mocker/src/common/protocols.rs): - **`Lineage`** (default) — parent-chain aware: evicts leaf blocks first, preserving shared prefix chains. Subsumes the preemption-priority behavior the old hand-rolled `LRUEvictor::push_front` used to provide. - **`Lru`** — plain recency-based LRU. - **`MultiLru`** — 4-tier frequency-aware LRU built on a TinyLFU tracker. All three give the same "suffix blocks evicted before shared prefixes" outcome that the old evictor was designed to produce; `Lineage` does it structurally (via the block parent chain) rather than via monotonic counters. ### Sequence Tracking Each active request is tracked as a sequence, managing its token blocks and generation state. As tokens are generated, the sequence tracks which blocks are partial (`MutableBlock`, still being filled) versus full (`ImmutableBlock`, complete and hashable for prefix caching). 
When a partial block fills up, it gets "promoted" to a full block with a content-based `SequenceHash` (or collapses onto an existing registered handle if the PLH is already present), enabling future cache hits from requests with matching prefixes. ### Performance Model The mocker supports three timing prediction modes: **Polynomial Model (Default):** Uses hardcoded polynomial formulas that approximate typical GPU behavior. Prefill time scales quadratically with token count, while decode time depends on the total active KV cache size. **Interpolated Model:** Loads actual profiling data from an NPZ file containing measured prefill and decode latencies. The mocker interpolates between data points to predict timing for any input size. This enables high-fidelity simulation matching a specific hardware configuration. **AIC Model (`--aic-perf-model`):** Uses the NVIDIA AI Configurator (AIC) SDK for latency prediction. AIC provides calibrated performance models for specific GPU/model/engine combinations, predicting prefill and decode latency as a function of batch size, sequence length, and prefix cache hits. The model path is automatically derived from `--model-path`, and the engine type from `--engine-type`. This mode is opt-in and requires both the `aiconfigurator` SDK and loadable systems/perf data for the requested tuple. ### Bootstrap Rendezvous (Disaggregated Serving) For disaggregated prefill/decode deployments, prefill and decode workers coordinate via a simple TCP-based rendezvous protocol. The decode worker connects to the prefill worker's bootstrap port and waits until the prefill phase completes and KV cache is ready. Either side can arrive first—the rendezvous completes when both are ready. ### KV Transfer Latency Simulation The mocker simulates KV cache transfer time between prefill and decode workers. 
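As a rough sketch of how such a transfer delay can be computed, the snippet below multiplies a per-token KV size by the prompt length and divides by the link bandwidth. The formula and the 64 GB/s default are the ones this section lists in the bullets that follow; the function names are hypothetical, the model shape is purely illustrative, and treating 1 GB as 1e9 bytes is an assumption.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    """Bytes of KV cache per token: K and V (the factor 2) for every layer."""
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes


def kv_transfer_seconds(
    num_input_tokens: int, bytes_per_token: int, bandwidth_gb_s: float = 64.0
) -> float:
    """Simulated transfer delay in seconds; a bandwidth of 0 disables the delay."""
    if bandwidth_gb_s <= 0:
        return 0.0
    # Assumption: GB/s means 1e9 bytes per second.
    return num_input_tokens * bytes_per_token / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    # Illustrative shape: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
    per_token = kv_bytes_per_token(32, 8, 128, 2)
    print(per_token, kv_transfer_seconds(4096, per_token))
```

With this shape, a 4096-token prompt moves 512 MiB of KV cache, which at the default bandwidth is a delay of a few milliseconds.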
Before the prefill worker emits its first (and only) token, it sleeps for a duration based on:

- **kv_bytes_per_token** (auto-computed from model config): `num_layers * 2 * num_kv_heads * head_dim * dtype_bytes`. The `dtype_bytes` is determined by `--kv-cache-dtype`: when set to `auto` (default), it uses the model's `dtype` from config; when explicitly set (e.g., `fp8`), it uses the specified dtype instead. It can also be overridden directly with `--kv-bytes-per-token`.
- **kv_transfer_bandwidth** (default: 64.0 GB/s, inter-node InfiniBand)
- **Transfer time**: `num_input_tokens * kv_bytes_per_token / bandwidth`

This delay is injected after the scheduler's prefill compute simulation completes, modeling the sequential flow: prefill computation → KV transfer → decode begins. Set `--kv-transfer-bandwidth 0` to disable.

## Integration with Dynamo

### KV Event Publishing

When prefix caching is enabled, the mocker publishes KV cache events to the distributed runtime. These events notify the system when blocks are stored (new content cached) or removed (evicted). This enables the KV-aware router to make intelligent routing decisions based on which workers have which prefixes cached.

### Metrics Publishing

Each scheduler publishes metrics about its current state, including the number of active decode blocks per DP rank. The router uses these metrics for load-aware routing decisions.

## Testing Scenarios

The mocker is particularly useful for:

1. **Router Testing** - Validate KV-aware routing without GPUs
2. **Planner Testing** - Test SLA-based planners with realistic timing
3. **Fault Tolerance** - Test request migration, graceful shutdown
4. **Disaggregation** - Test P/D separation and KV transfer coordination
5. **Performance Modeling** - Prototype scheduling policies
6. **CI/CD** - Fast integration tests without hardware dependencies

## Comparison with Real Engines

| Feature | Real Engine | Mocker |
|---------|-------------|--------|
| GPU Required | Yes | No |
| Block Manager | Paged KV cache | Simulated blocks |
| Scheduler | Continuous batching | Continuous batching |
| Prefix Caching | Hash-based | Hash-based |
| Chunked Prefill | Supported | Supported |
| Preemption | Recompute/swap | Recompute (simulated) |
| Timing | Real execution | Model-based |
| KV Events | Native | Compatible |
| Data Parallelism | Multi-GPU | Simulated |

## Next Steps

| Document | Description |
|----------|-------------|
| [Benchmarking Dynamo Deployments](/dynamo/user-guides/benchmarking) | Run AIPerf against a mocker-backed deployment to measure latency, TTFT, throughput, and scaling behavior |
| [Aggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/agg.yaml) | Deploy a mocker-backed aggregated DynamoGraphDeployment on Kubernetes |
| [Disaggregated Mocker Deployment Example](../../examples/backends/mocker/deploy/disagg.yaml) | Deploy separate prefill and decode mocker workers for disaggregated-serving benchmarks |
| [Global Planner Mocker Example](../../examples/global_planner/global-planner-mocker-test.yaml) | Advanced multi-pool mocker setup for planner and global-router experiments |

## Feature Gaps (WIP)

> For the broader mocker enhancement roadmap, see [#6383](https://github.com/ai-dynamo/dynamo/issues/6383).

The following features are not yet supported by the mocker: - **Multi-tier memory** - No support for offloading KV cache to CPU/disk or onboarding back to GPU; potential future integration with KVBM - **Multimodal support** - Currently only simulates text token processing; no vision encoder or cross-attention simulation - **Native Rust reference counting** - Work in progress to use native Rc/Arc for block reference counting, enabling natural RAII patterns for simpler tracking # DynoSim Runs A DynoSim run evaluates one workload against one simulated Dynamo configuration. The current CLI is `python -m dynamo.replay`, which prints an AIPerf-style summary table, writes the full report JSON to disk, and exposes `offline|online`, `round_robin|kv_router`, `arrival_speedup_ratio`, closed-loop concurrency, and synthetic workload inputs directly. The command keeps the existing `replay` name for now. The docs use "DynoSim run" for the product concept: one workload, one simulated configuration, one report. Unlike normal `dynamo.mocker` usage, offline mode does not launch workers, register endpoints, or require NATS, etcd, or a frontend. Online mode does exercise the live mock-worker runtime path. Use DynoSim runs when you want to: - benchmark scheduler behavior from a saved trace - compare timing and cache behavior across mocker configurations - validate simulation logic in CI without bringing up a distributed stack ## Harness Overview The DynoSim run harness wires a load driver (trace file or synthetic workload generator) into one or more mocker engine simulations and tees request/token timing into a trace collector. ```mermaid flowchart LR LD[Load Driver] --> H[DynoSim Harness] H --> SES[Single Engine Simulation] H --> MES[Multi Engine Simulation] SES --> H MES --> H H --> TC[Trace Collector] ``` The load driver is either a Mooncake-style JSONL trace (timestamps, ISL/OSL, `hash_ids`) or a synthetic generator parameterized by `isl`/`osl`/`concurrency`. 
Single-engine simulation (`SES`) is the fast path for `num_workers == 1` with the vLLM engine; multi-engine simulation (`MES`) covers aggregated multi-worker runs, disaggregated prefill/decode runs, and KV-router runs. The trace collector produces the AIPerf-style summary table, the JSON report, and the per-request timing fields consumed by downstream analysis. Each simulation composes a different set of components. SES drives the engine core directly (scheduler + forward-pass modeling). MES composes multiple engine cores with KV transfer/offloading, KV routing, and planner simulation layered on top: ```mermaid flowchart TD subgraph SEC[Single Engine Core] subgraph SCH[Scheduler Modeling] F[Fwd Pass Modeling] end end KV[KV Transfer + Offloading Simulation] KR[KV Router Simulation] P[Planner Simulation] SES[Single Engine Simulation] MES[Multi Engine Simulation] SES --> SEC MES --> SEC MES --> KV MES --> KR MES --> P ``` See [`lib/mocker/src/replay/offline/README.md`](../../lib/mocker/src/replay/offline/README.md) for offline-harness internals (logical clock, event queue, worker model) and [Live Simulation with Mocker](/dynamo/user-guides/dynosim/mocker) for engine-core details (scheduler, KV block manager). 
## Quick Start Run an offline DynoSim trial through the dedicated CLI: ```bash python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --num-workers 4 \ --replay-mode offline \ --router-mode round_robin \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' \ --report-json /tmp/dynosim-report.json ``` Run a synthetic DynoSim trial through the same CLI when you want fixed request shapes without a trace file: ```bash python -m dynamo.replay \ --input-tokens 5000 \ --output-tokens 500 \ --request-count 1000 \ --arrival-interval-ms 1.0 \ --num-workers 1 \ --replay-mode offline \ --replay-concurrency 100 \ --extra-engine-args '{"block_size":512}' \ --report-json /tmp/dynosim-report.json ``` Run a synthetic workload when you want shared-prefix or multi-turn structure without a trace file: ```bash python -m dynamo.replay \ --input-tokens 5000 \ --output-tokens 500 \ --request-count 200 \ --turns-per-session 3 \ --shared-prefix-ratio 0.5 \ --num-prefix-groups 8 \ --inter-turn-delay-ms 250 \ --replay-mode offline \ --replay-concurrency 32 \ --extra-engine-args '{"block_size":512}' \ --report-json /tmp/dynosim-report.json ``` `python -m dynamo.replay` prints an AIPerf-style summary table to stdout and writes the full report JSON to disk. ## Input Format The trace file must be Mooncake-style JSONL. Each line should contain: - `timestamp` or `created_time` - `input_length` or `input_tokens` - `output_length` or `output_tokens` - `hash_ids` Example: ```json {"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3]} {"timestamp": 0, "input_length": 4096, "output_length": 128, "hash_ids": [9, 10, 11, 12]} ``` Rows without `session_id` are independent timestamped requests. Use this shape for wall-clock request traces, including agent-converted traces where parallel LLM calls should remain parallel. DynoSim runs also support multi-turn sessions. Use the same `session_id` on all turns in a session. 
Multi-turn sessions are closed-loop: turn `n+1` waits until turn `n` completes plus either the explicit `delay` / `delay_ms` or the timestamp delta inferred from consecutive rows in the same session. Example: ```json {"session_id":"session-a","timestamp":1000,"input_length":2048,"output_length":128,"hash_ids":[1,2,3,4]} {"session_id":"session-a","delay_ms":50,"input_length":2560,"output_length":128,"hash_ids":[1,2,3,4,5]} {"session_id":"session-b","timestamp":1010,"input_length":1024,"output_length":64,"hash_ids":[9,10]} {"session_id":"session-b","timestamp":1060,"input_length":1536,"output_length":64,"hash_ids":[9,10,11]} ``` The second `session-a` row waits for the first turn to complete plus 50 ms. The second `session-b` row also waits for the first turn to complete plus the inferred 50 ms timestamp delta. ### Agentic Mooncake `--trace-format agentic_mooncake` simulates request-level workflow dependencies in addition to the Mooncake request fields. Each row should contain the normal Mooncake fields plus a stable `request_id`. Dependency fields are optional. ```json { "request_id": "root-2", "session_id": "run-42:root", "timestamp": 1000.0, "input_length": 4096, "output_length": 256, "hash_ids": [0, 1, 2, 3], "wait_for": ["child-1"], "branches": ["child-1"], "prefix_reset": false, "delay": 10.0, "tool_wait_ms": 2500.0 } ``` Rows with no `wait_for` use `timestamp` as their start time. Rows with dependencies wait for every listed request to complete, then wait `delay + tool_wait_ms` before dispatch. `branches` records child requests spawned by this row, and `prefix_reset` marks the first row in a trajectory. 
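The dependency rule can be traced by hand with a small scheduling sketch. It is not DynoSim's implementation: `dispatch_times` is a hypothetical helper that assumes rows are listed with dependencies before their dependents, that `delay` and `tool_wait_ms` are both in milliseconds on the same clock as `timestamp`, and that every request takes a fixed, made-up `service_ms` instead of real simulated engine time.

```python
def dispatch_times(rows: list[dict], service_ms: float) -> dict[str, float]:
    """Return request_id -> dispatch time for agentic-Mooncake-style rows."""
    dispatch: dict[str, float] = {}
    complete: dict[str, float] = {}
    for row in rows:
        deps = row.get("wait_for") or []
        if deps:
            # Wait for every listed request, then apply delay + tool_wait_ms.
            ready = max(complete[dep] for dep in deps)
            start = ready + row.get("delay", 0.0) + row.get("tool_wait_ms", 0.0)
        else:
            # Rows without dependencies start at their own timestamp.
            start = row["timestamp"]
        dispatch[row["request_id"]] = start
        # Assumption: constant service time stands in for the engine model.
        complete[row["request_id"]] = start + service_ms
    return dispatch


if __name__ == "__main__":
    rows = [
        {"request_id": "root-1", "timestamp": 1000.0, "branches": ["child-1"]},
        {"request_id": "child-1", "wait_for": ["root-1"]},
        {"request_id": "root-2", "wait_for": ["child-1"], "delay": 10.0, "tool_wait_ms": 2500.0},
    ]
    print(dispatch_times(rows, service_ms=100.0))
```

In the real simulator the completion times come from the engine and router models, so only the ordering logic carries over from this sketch.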
Use `agent_trace_to_mooncake --agentic` to create this format from Dynamo agent traces: ```bash cargo run -p dynamo-bench --bin agent_trace_to_mooncake -- \ --agentic \ --input-path /tmp/dynamo-agent-trace.jsonl \ --output-file /tmp/dynamo-agent-trace.agentic-mooncake.jsonl ``` Run it with: ```bash python -m dynamo.replay /tmp/dynamo-agent-trace.agentic-mooncake.jsonl \ --trace-format agentic_mooncake \ --trace-block-size 128 \ --replay-mode offline \ --router-mode kv_router \ --num-workers 4 \ --extra-engine-args '{"block_size":128}' \ --report-json /tmp/agentic-dynosim-report.json ``` DynoSim uses two different block-size concepts for trace files: - `--trace-block-size`: how many tokens each `hash_id` in the dataset represents - engine `block_size`: the block size used by the simulated engine and router when they re-chunk the synthesized tokens into sequence hashes Public Mooncake/toolagent traces use `512` tokens per `hash_id`, so DynoSim runs should normally use `--trace-block-size 512`. The engine `block_size` can still be smaller, for example the live vLLM benchmark setup uses `block_size=64`. For `engine_type=sglang`, DynoSim still uses canonical `block_size` internally; `sglang.page_size` is accepted as a compatibility alias and is normalized into `block_size` before simulation starts. 
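The relationship between the two block sizes, and the `sglang.page_size` alias handling, can be sketched in a few lines. Both helpers are hypothetical and only approximate the behavior described above: the mismatch check follows the rule that the alias must agree with `block_size` when both are given, and the ratio assumes each `hash_id` is re-chunked independently into engine blocks.

```python
import math


def normalize_block_size(engine_args: dict) -> dict:
    """Fold the `sglang.page_size` alias into a canonical top-level `block_size`."""
    args = dict(engine_args)
    page_size = (args.get("sglang") or {}).get("page_size")
    block_size = args.get("block_size")
    if page_size is not None:
        if block_size is not None and block_size != page_size:
            raise ValueError(
                f"block_size={block_size} conflicts with sglang.page_size={page_size}"
            )
        args["block_size"] = page_size
    return args


def engine_blocks_per_hash_id(trace_block_size: int, block_size: int) -> int:
    """How many engine blocks the tokens behind one trace hash_id span (rounded up)."""
    return math.ceil(trace_block_size / block_size)


if __name__ == "__main__":
    print(normalize_block_size({"engine_type": "sglang", "sglang": {"page_size": 2}}))
    print(engine_blocks_per_hash_id(trace_block_size=512, block_size=64))
```

For the public-trace setup mentioned above (`--trace-block-size 512` with engine `block_size=64`), each `hash_id` covers eight engine blocks.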
## DynoSim Surfaces ### `python -m dynamo.replay` The dedicated DynoSim CLI exposes: - either a positional `trace_file`, or all of `--input-tokens`, `--output-tokens`, and `--request-count` - `--replay-mode offline|online` - `--router-mode round_robin|kv_router` - `--num-workers` - `--num-prefill-workers` - `--num-decode-workers` - `--replay-concurrency` - `--arrival-interval-ms` - `--arrival-speedup-ratio` - `--trace-format mooncake|mooncake-delta|agentic_mooncake|applied_compute_agentic` - `--trace-block-size` - `--turns-per-session` - `--shared-prefix-ratio` - `--num-prefix-groups` - `--inter-turn-delay-ms` - `--extra-engine-args` (JSON string) - `--prefill-engine-args` (JSON string) - `--decode-engine-args` (JSON string) - `--router-config` (JSON string) - `--aic-backend` - `--aic-system` - `--aic-backend-version` - `--aic-tp-size` - `--aic-model-path` - `--aic-moe-tp-size` - `--aic-moe-ep-size` - `--aic-attention-dp-size` - `--report-json` Defaults: - `--replay-mode offline` - `--router-mode round_robin` Example: ```bash python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --replay-mode online \ --router-mode kv_router \ --num-workers 4 \ --arrival-speedup-ratio 10 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' \ --router-config '{"router_queue_policy":"fcfs","router_temperature":0.0}' \ --report-json /tmp/dynosim-report.json ``` SGLang simulation uses the same CLI surface. A minimal extra-engine-args file can use either `block_size` directly or the compatibility alias `sglang.page_size`: ```json { "engine_type": "sglang", "num_gpu_blocks": 512, "sglang": { "page_size": 2 } } ``` Both `--extra-engine-args` and `--router-config` accept partial JSON objects. Engine settings such as `block_size`, `engine_type`, `dp_size`, `speedup_ratio`, and `decode_speedup_ratio` belong in `--extra-engine-args`, not as top-level DynoSim CLI flags. `--trace-block-size` is separate and is used only for trace-file runs. 
Unspecified fields fall back to the same defaults used by `MockEngineArgs::default()` and `KvRouterConfig::default()`. DynoSim has two independent AIC surfaces: - engine timing AIC via `--extra-engine-args` / staged engine JSON - router-side prompt-load AIC via top-level `--aic-*` flags together with `router_prefill_load_model: "aic"` in `--router-config` Both surfaces accept MoE parallelism fields. For Kimi-style TP-only MoE configs, keep them aligned by setting `aic_moe_tp_size` to the same value as `aic_tp_size`, with `aic_moe_ep_size=1` and `aic_attention_dp_size=1`. Offline disaggregated simulation uses staged engine args instead of `--extra-engine-args`: - `--prefill-engine-args` for the prefill worker config - `--decode-engine-args` for the decode worker config - `--num-prefill-workers` and `--num-decode-workers` for pool sizes For offline disaggregated simulation, the staged JSON must set `worker_type` explicitly: - `--prefill-engine-args` must use `worker_type: "prefill"` - `--decode-engine-args` must use `worker_type: "decode"` The staged configs must also use the same engine `block_size`. `--trace-block-size` remains a separate trace-file input knob. ### Synthetic Workloads Synthetic mode bypasses trace loading and generates in-memory requests with fixed input/output lengths and optional synthetic arrival spacing: ```bash python -m dynamo.replay \ --input-tokens 5000 \ --output-tokens 500 \ --request-count 200 \ --arrival-interval-ms 0.5 \ --replay-mode offline \ --replay-concurrency 50 \ --extra-engine-args '{"block_size":512}' ``` This is useful for parameter sweeps where Mooncake-style prefix structure is not required. When `--turns-per-session > 1`, `--request-count` is interpreted as the number of sessions rather than the total number of emitted turns. 
The total completed request count becomes:

- `request_count * turns_per_session`

Synthetic workload options:

- `--turns-per-session`: number of turns in each synthetic session
- `--shared-prefix-ratio`: fraction of prompt blocks shared inside a prefix group
- `--num-prefix-groups`: number of shared-prefix groups; `0` disables grouping
- `--inter-turn-delay-ms`: constant delay applied after each completed turn before the next turn in the same session becomes eligible

## Modes

### Fixed-Schedule Runs

Default trace mode preserves the timestamps from the trace and simulates arrivals according to those timestamps:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
  --replay-mode offline \
  --num-workers 4 \
  --trace-block-size 512 \
  --extra-engine-args '{"block_size":64}'
```

This is the right mode when you want deterministic simulation of the original request-arrival pattern. For wall-clock request traces, omit `session_id` so each row is scheduled independently by timestamp. Rows that share a `session_id` are simulated as a closed-loop session, where each later turn waits for the previous turn to complete.

### Closed-Loop Concurrency

Use `--replay-concurrency` to ignore first-turn trace arrival timing and keep a fixed number of requests in flight:

```bash
python -m dynamo.replay /path/to/mooncake_trace.jsonl \
  --replay-mode offline \
  --num-workers 4 \
  --replay-concurrency 16
```

This mode is useful when you want to compare scheduler behavior under a fixed offered concurrency rather than the original trace schedule.
For multi-turn sessions, concurrency mode still enforces session order and inter-turn delays: - first-turn timestamps are ignored - turn `n+1` is not eligible until turn `n` completes - `delay` / `delay_ms` / synthetic `--inter-turn-delay-ms` are still applied after completion - TTFT is measured from actual dispatch under the cap, not from the ignored trace timestamp ### Online Mode Online mode launches the mock workers and runs the trace against the live runtime path. This is useful when you want the run to include live request dispatch, live output handling, and the same async KV-event propagation model used by the current router integration. ```bash python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --replay-mode online \ --router-mode kv_router \ --num-workers 4 \ --arrival-speedup-ratio 10 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' ``` ### Arrival Speedup Use `--arrival-speedup-ratio` to compress or stretch the trace arrival process without changing the mocker compute model. Larger values make arrivals happen sooner relative to the original trace. ```bash python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --replay-mode offline \ --num-workers 4 \ --arrival-speedup-ratio 5 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' ``` ### Router Modes DynoSim currently supports: - `round_robin` - `kv_router` `kv_router` uses the shared local scheduler and an in-process KV indexer. Router policy tuning is provided through `--router-config`, not a dedicated top-level CLI flag. 
In offline mode: - `kv_router` is supported only when `num_workers > 1` - router queueing is enabled and uses simulation time rather than wall-clock time - KV visibility is delayed slightly relative to request lifecycle events - queue admission is driven by router lifecycle edges (`add_request`, `mark_prefill_completed`, and `free`) - transient in-pass prefill occupancy is still approximated at the router level rather than modeled exactly - when `router_prefill_load_model` is `"aic"`, DynoSim predicts one expected prefill duration per admitted request and decays only the oldest active prefill request on each worker To compare queue policies manually, keep the same trace and engine args fixed and swap only `router_queue_policy` inside `--router-config`: ```bash python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --replay-mode offline \ --router-mode kv_router \ --num-workers 4 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' \ --router-config '{"router_queue_policy":"fcfs"}' python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --replay-mode offline \ --router-mode kv_router \ --num-workers 4 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' \ --router-config '{"router_queue_policy":"lcfs"}' ``` `lcfs` is intentionally a worse comparison policy under saturation; use it for experiments, not as an expected production default. 
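When the comparison covers more than two policies, it can help to generate the command lines instead of copying them. The sketch below only assembles arguments from flags already shown in this document; `replay_argv` is a hypothetical helper, and the per-policy report path is a naming choice made up for this example.

```python
import json
import shlex
import sys


def replay_argv(trace_path: str, queue_policy: str, num_workers: int = 4) -> list[str]:
    """Build a `python -m dynamo.replay` command line for one queue policy."""
    return [
        sys.executable, "-m", "dynamo.replay", trace_path,
        "--replay-mode", "offline",
        "--router-mode", "kv_router",
        "--num-workers", str(num_workers),
        "--trace-block-size", "512",
        "--extra-engine-args", json.dumps({"block_size": 64}),
        # Only this value changes between runs.
        "--router-config", json.dumps({"router_queue_policy": queue_policy}),
        # Hypothetical naming scheme so reports do not overwrite each other.
        "--report-json", f"/tmp/dynosim-report-{queue_policy}.json",
    ]


if __name__ == "__main__":
    for policy in ("fcfs", "lcfs"):
        # Prints the command; pass the list to subprocess.run(...) to execute it.
        print(shlex.join(replay_argv("/path/to/mooncake_trace.jsonl", policy)))
```

Building the JSON with `json.dumps` avoids shell-quoting mistakes in the `--router-config` and `--extra-engine-args` values.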
To enable router-side AIC prefill-load modeling during simulation: ```bash python -m dynamo.replay /path/to/mooncake_trace.jsonl \ --replay-mode offline \ --router-mode kv_router \ --num-workers 4 \ --trace-block-size 512 \ --extra-engine-args '{"block_size":64}' \ --router-config '{"router_track_prefill_tokens":true,"router_prefill_load_model":"aic"}' \ --aic-backend vllm \ --aic-system h200_sxm \ --aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8 \ --aic-tp-size 1 ``` For offline disaggregated simulation, the same top-level `--aic-*` flags are supported, but the estimator is applied only to the prefill-stage router. For MoE models that require AIC MoE parallelism, add the matching top-level router AIC flags, for example: ```bash --aic-tp-size 2 \ --aic-moe-tp-size 2 \ --aic-moe-ep-size 1 \ --aic-attention-dp-size 1 ``` ## Output The report contains: - request counts - input and output token totals - virtual duration and wall-clock runtime - request and token throughput - prefix cache reuse ratio - TTFT, TTST, TPOT, ITL, and end-to-end latency summaries - output-token-throughput-per-user summaries The dedicated DynoSim CLI returns the same report schema as the Python APIs `dynamo.replay.run_trace_replay(...)` and `dynamo.replay.run_synthetic_trace_replay(...)`. If `--report-json` is not provided, `python -m dynamo.replay` writes a timestamped `dynamo_replay_report_*.json` file in the current working directory. 
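When `--report-json` is omitted, a small helper can pick up the timestamped file afterwards. This is a sketch, assuming only the `dynamo_replay_report_*.json` naming stated above; `load_latest_report` is a hypothetical name, and the report keys are listed rather than guessed because the schema is not reproduced here.

```python
import glob
import json
import os


def load_latest_report(directory="."):
    """Return (path, parsed JSON) for the newest replay report, or None."""
    pattern = os.path.join(directory, "dynamo_replay_report_*.json")
    paths = glob.glob(pattern)
    if not paths:
        return None
    newest = max(paths, key=os.path.getmtime)
    with open(newest, encoding="utf-8") as fh:
        return newest, json.load(fh)


result = load_latest_report()
if result is None:
    print("no report found in the current directory")
else:
    path, report = result
    # Print the top-level keys instead of assuming specific metric names.
    print(path, sorted(report) if isinstance(report, dict) else type(report))
```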
## Constraints Shared constraints: - `extra_engine_args.engine_type` must be `vllm` or `sglang` - aggregated simulation requires the existing aggregated args path - disaggregated simulation requires both `prefill_engine_args` and `decode_engine_args` - disaggregated simulation requires `router_mode=kv_router` - `dp_size` must be `1` - disaggregated simulation requires matching `block_size` in `prefill_engine_args` and `decode_engine_args` Additional offline constraints: - offline `kv_router` requires `num_workers > 1` - single-worker offline mode is still a dedicated fast path for `vllm`, but it now supports both flat request runs and workload-driven multi-turn runs - `sglang` still goes through the shared multi-worker runtime even when `num_workers=1` - offline disaggregated simulation is a separate two-stage runtime with prefill and decode worker pools Additional online constraints: - the current live simulation path is also limited to aggregated workers If you violate those constraints, DynoSim fails immediately with a validation error. 
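Because violations fail the run immediately, it can help to check a sweep's inputs before launching many runs. The sketch below mirrors a subset of the constraints listed above on a plain dictionary. The dictionary layout, the `check_constraints` name, and the placement of `dp_size` inside the engine args are assumptions made for illustration; DynoSim's own validator remains the source of truth.

```python
def check_constraints(cfg):
    """Return human-readable violations for a hypothetical replay config dict."""
    errors = []
    prefill = cfg.get("prefill_engine_args")
    decode = cfg.get("decode_engine_args")
    disagg = prefill is not None or decode is not None

    if disagg:
        if prefill is None or decode is None:
            errors.append("disaggregated runs need both prefill_engine_args and decode_engine_args")
        elif prefill.get("block_size") != decode.get("block_size"):
            errors.append("prefill and decode block_size must match")
        if cfg.get("router_mode") != "kv_router":
            errors.append("disaggregated runs require router_mode=kv_router")
        engines = [args for args in (prefill, decode) if args is not None]
    else:
        engines = [cfg.get("extra_engine_args") or {}]
        if (
            cfg.get("replay_mode") == "offline"
            and cfg.get("router_mode") == "kv_router"
            and cfg.get("num_workers", 1) <= 1
        ):
            errors.append("offline kv_router requires num_workers > 1")

    for args in engines:
        # Only checked when present; real defaults are not assumed here.
        if "engine_type" in args and args["engine_type"] not in ("vllm", "sglang"):
            errors.append(f"unsupported engine_type: {args['engine_type']}")
        if args.get("dp_size", 1) != 1:
            errors.append("dp_size must be 1")
    return errors


good = {
    "replay_mode": "offline",
    "router_mode": "kv_router",
    "num_workers": 4,
    "extra_engine_args": {"block_size": 64},
}
bad = dict(good, num_workers=1)
print(check_constraints(good))
print(check_constraints(bad))
```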
## Practical Notes - `python -m dynamo.replay` requires exactly one of: either a trace file, or all of `--input-tokens`, `--output-tokens`, and `--request-count` - `--replay-concurrency` works with both trace-file and synthetic workloads - mocker compute-speed knobs such as `speedup_ratio` still affect simulated timing when passed via the engine-args JSON for the chosen mode - `--arrival-speedup-ratio` affects trace timestamps, not worker compute speed - `--trace-block-size` affects only how trace `hash_ids` expand into tokens - `--arrival-interval-ms` only applies to synthetic workloads - `--turns-per-session`, `--shared-prefix-ratio`, `--num-prefix-groups`, and `--inter-turn-delay-ms` only apply to synthetic workloads - `--extra-engine-args`, `--prefill-engine-args`, `--decode-engine-args`, and `--router-config` are JSON strings on the standalone DynoSim CLI - top-level `--aic-*` flags are used only for router-side prompt-load modeling; engine timing AIC still belongs in the engine-args JSON - offline mode does not need planner runtime setup, router registration, or external event transport - trace-file workloads can use different values for `--trace-block-size` and engine `block_size` - Mooncake/toolagent traces typically use `--trace-block-size 512`, while engine `block_size` often stays `64` ## When To Use This vs AIPerf Use offline DynoSim when: - you want a fast scheduler-only simulation - you want deterministic CI coverage of simulation behavior - you do not need HTTP serving, frontend behavior, or network effects Use [Dynamo Benchmarking](/dynamo/user-guides/benchmarking) when: - you want end-to-end benchmarking against a live endpoint - you need frontend, transport, or cluster-level behavior - you want AIPerf dashboards and endpoint-facing metrics # DynoSim Sweeps A DynoSim sweep runs many simulated trials across candidate topologies, router settings, and timing-model inputs, then ranks the results against SLA constraints and GPU budget. 
Use sweeps when a single [DynoSim run](/dynamo/user-guides/dynosim/runs) is not enough and you want to search the design space before validating on real GPUs. The current Python API is `dynamo.profiler.utils.replay_optimize`. The docs use "DynoSim sweep" as the product term while keeping the existing implementation name for now. ## What It Answers A sweep answers a concrete deployment question: - given a fixed GPU budget - for a workload with prefix overlap - and latency SLAs that still permit meaningful throughput which topology, worker split, and router settings produce the best simulated result? For disaggregated deployments, the search can cover: - tensor-parallel shape for prefill and decode workers - prefill and decode worker counts - KV-router overlap credit - prompt-load scaling - throughput, TTFT, ITL, or end-to-end latency objectives This is a heuristic search over simulated states, not an exact optimizer over every feasible configuration. ## How It Works ```mermaid flowchart LR A["TP search
choose TP shape
(prefill_tp, decode_tp)
under GPU budget"] --> B["Worker search
choose worker split
(prefill_workers, decode_workers)
for the chosen TP"] B --> C["Router search
choose routing mode
and router cost knobs"] C --> D["DynoSim run evaluations"] D --> E["Feasible candidates"] E --> A ``` Each candidate state is evaluated by the DynoSim run harness. The optimizer records the metrics from each run, filters candidates that violate SLA or GPU-budget constraints, and returns the best feasible state plus the full evaluated table for analysis. The descent is budget-focused: each step prunes to near-budget-edge states so the sweep ends up at a TP/worker shape that actually consumes the available GPU budget, rather than at a pure throughput-per-GPU point. Aggregated sweeps collapse the TP and worker dimensions into `(tp, workers)` but otherwise follow the same idea. ## Spec Shape The public API takes a single `ReplayOptimizeSpec` composed of: | Spec | Purpose | |---|---| | `EngineSpec` | Model, backend, and JSON-like engine arguments. Disaggregated sweeps use prefill and decode engine args; aggregated sweeps use `baseEngineArgs`. | | `HardwareSpec` | GPU SKU and total simulated GPU budget | | `WorkloadSpec` | Synthetic workload knobs or a trace file | | `SLASpec` | Optional TTFT, ITL, end-to-end latency, and p95 bounds | | `RouterSpec` | Router mode, overlap-score-credit sweep, prefill-load-scale sweep, and KV-router base config | | `objective` | Ranking target, such as throughput, mean TTFT, or mean end-to-end latency | | `maxParallelEvals` | Number of candidate evaluations to run concurrently | Field names use lowerCamelCase to align with `DynamoGraphDeploymentRequest` concepts. Method names stay snake_case to match Pydantic convention. ## Prerequisites Run from the repository root. 
Use the project virtual environment: ```bash .venv/bin/python --version ``` If the Python bindings are not importable yet, build them first: ```bash .venv/bin/maturin develop --uv -m lib/bindings/python/Cargo.toml ``` The example uses AIC-backed timing by default: - AIC enumerates dense TP candidates - AIC-backed engine timing is used for candidate configs Install `aiconfigurator` into the project environment: ```bash uv pip install --python .venv/bin/python aiconfigurator ``` If a regular install fails to load usable perf data, reinstall from a source checkout that has real systems data materialized: ```bash uv pip install --python .venv/bin/python --force-reinstall /path/to/aiconfigurator ``` If DynoSim sweep setup fails with AIC errors about missing perf databases or parse failures such as `KeyError: 'gemm_dtype'`, inspect the installed files under: ```text .venv/lib/python*/site-packages/aiconfigurator/systems/data/... ``` If those files begin with `version https://git-lfs.github.com/spec/v1`, you have Git LFS pointer stubs instead of real perf tables. Install `aiconfigurator` from a checkout or wheel that includes the real LFS materialized payloads in `systems/`. When running directly from a source checkout, expose the in-repo Python packages: ```bash export PYTHONPATH=lib/bindings/python/src:components/src ``` If the sweep uses multiple worker processes, prefer a real script file over a heredoc. On macOS, `ProcessPoolExecutor` child workers need a stable module path, and the driver module must guard its entry behind `if __name__ == "__main__":`. 
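The script-file advice above comes down to the standard multiprocessing layout. A minimal sketch follows, with a stand-in `evaluate` function in place of a real candidate evaluation (the function and candidate tuples are hypothetical); save it as a file and run it with the project interpreter.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate(candidate):
    """Stand-in for one candidate evaluation; a real driver would run the sweep here."""
    prefill_tp, decode_tp = candidate
    return prefill_tp + decode_tp


def main():
    candidates = [(1, 1), (2, 1), (4, 2)]
    # Worker processes re-import this module, so all work starts inside main().
    with ProcessPoolExecutor(max_workers=2) as pool:
        for candidate, score in zip(candidates, pool.map(evaluate, candidates)):
            print(candidate, score)


if __name__ == "__main__":
    main()
```

Defining `evaluate` at module level matters because child processes locate it by module path; a function defined inside a heredoc or an interactive session may not be importable from the workers.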
For KV-router logs, this filter keeps the run readable without hiding useful `info` output: ```bash export DYN_LOG='info,dynamo_kv_router::scheduling::selector=warn' ``` ## Run The Example The canonical starting point is the checked-in driver script: ```bash .venv/bin/python components/src/dynamo/profiler/utils/replay_optimize/example.py \ --max-parallel-evals 4 ``` The default example searches a synthetic disaggregated KV-router workload using AIC-backed candidate timing. It prints the best feasible state and a table of top feasible configurations. The example uses: - model: `Qwen/Qwen3-32B` - backend: `vllm` - GPU SKU: `h200_sxm` - total simulated GPUs: `16` - router mode: `kv_router` - synthetic workload: - `isl=32768` - `osl=256` - `requestCount=5000` - `concurrency=200` - `sharedPrefixRatio=0.5` - `numPrefixGroups=50` The GPU budget is a simulated search constraint. You do not need 16 real GPUs locally to run the search. The base engine args stay conservative: - `block_size=512` - `enable_prefix_caching=True` - explicit `worker_type` for prefill versus decode The example intentionally omits `num_gpu_blocks`; AIC-backed DynoSim estimates capacity for each candidate TP shape unless a base input explicitly pins it. This setup does not force scheduler-specific bottlenecks such as: - `enable_chunked_prefill` - a small `max_num_seqs` - a pinned `max_num_batched_tokens` Only add those when the experiment is specifically about scheduler limits. 
## Run Against A Trace To run against a Mooncake-style trace instead of the synthetic workload: ```bash .venv/bin/python components/src/dynamo/profiler/utils/replay_optimize/example.py \ --trace-file /path/to/mooncake_trace.jsonl \ --arrival-speedup-ratio 1.0 \ --max-parallel-evals 4 ``` For a public starting point, download the FAST'25 toolagent trace: ```bash curl -sL \ https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/traces/toolagent_trace.jsonl \ -o /tmp/toolagent_trace.jsonl ``` Then run: ```bash .venv/bin/python components/src/dynamo/profiler/utils/replay_optimize/example.py \ --trace-file /tmp/toolagent_trace.jsonl \ --arrival-speedup-ratio 1.0 \ --max-parallel-evals 4 ``` In trace mode: - `traceFile` points at the Mooncake-style JSONL input - `arrivalSpeedupRatio` compresses or stretches the trace arrival process - synthetic-only knobs such as `isl`, `osl`, `requestCount`, `concurrency`, `sharedPrefixRatio`, and `numPrefixGroups` are ignored Important notes for the public toolagent trace: - the dataset uses Mooncake-style `hash_ids` with `512` tokens per block - the underlying `run_trace_replay(...)` API defaults `trace_block_size` to `512` - `WorkloadSpec` does not yet expose a separate `traceBlockSize` field ## Customize A Sweep Treat the example driver as a starting point, not a frozen harness. 
Modify it as needed for your search: - change the `WorkloadSpec` shape or switch to a trace source with `traceFile` - add SLA bounds on `SLASpec`, such as `ttft`, `itl`, `e2eLatency`, or their p95 variants - change `RouterSpec.overlapCredits` within the valid `0.0` to `1.0` range - change `RouterSpec.prefillLoadScales` when you want to weigh TTFT/prompt-side load more or less heavily - print different columns from `result.evaluated_df` or `result.feasible_df` - persist the tables to CSV or Parquet for downstream analysis Useful axes to vary: - `HardwareSpec.totalGpus` - `RouterSpec.overlapCredits` - `RouterSpec.prefillLoadScales` - `WorkloadSpec.sharedPrefixRatio` - `WorkloadSpec.numPrefixGroups` - base prefill/decode engine args If you want to compare routing strategies directly, use `RouterSpec(mode="both")` instead of the default KV-router-only search. ## Outputs The optimizer returns a `DenseReplayOptimizationResult` with: - `best_feasible`: best visited state that satisfies all configured SLA and GPU-budget constraints - `best_infeasible`: best visited state that misses at least one SLA bound or the budget - `evaluated_df`: all visited states - `feasible_df`: only feasible states Common columns to inspect: - topology: `prefill_tp`, `decode_tp`, `prefill_workers`, `decode_workers` - routing: `router_mode`, `overlap_score_credit`, `prefill_load_scale` - budget: `total_gpus_used` This is the simulated GPU footprint of the candidate state, not a count of GPUs actually allocated on the machine running the search. - throughput: `output_throughput_tok_s` - cache behavior: `prefix_cache_reused_ratio` - latency: `mean_ttft_ms`, `mean_tpot_ms`, `mean_e2e_latency_ms` The report DataFrame still uses the Rust DynoSim runner's metric keys (`mean_ttft_ms`, `mean_tpot_ms`, `mean_e2e_latency_ms`) even though the input `SLASpec` uses DGDR-style camelCase names (`ttft`, `itl`, `e2eLatency`). `SLASpec` carries an internal translation map. 
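To make the `total_gpus_used` column concrete, here is a small sketch of the footprint arithmetic. It assumes each worker occupies `tp` GPUs and that the footprint is the sum over the prefill and decode pools; that formula is an assumption for illustration, not taken from the optimizer source, and the topology values are illustrative.

```python
def total_gpus_used(prefill_tp, prefill_workers, decode_tp, decode_workers):
    """Simulated footprint, assuming each worker occupies `tp` GPUs."""
    return prefill_tp * prefill_workers + decode_tp * decode_workers


def within_budget(state, total_gpus):
    """True when the candidate state fits the simulated GPU budget."""
    return total_gpus_used(**state) <= total_gpus


state = {"prefill_tp": 4, "prefill_workers": 3, "decode_tp": 1, "decode_workers": 4}
print(total_gpus_used(**state))   # 4*3 + 1*4 = 16
print(within_budget(state, 16))   # True
print(within_budget(state, 8))    # False
```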
In local testing, the default synthetic setup produced a non-trivial mean-E2E winner around: - `prefill_tp=4`, `decode_tp=1`, `prefill_workers=3`, `decode_workers=4`, `overlap_score_credit=0.5`, `prefill_load_scale=1.0` - `output_throughput_tok_s ~= 970`, `prefix_cache_reused_ratio ~= 0.5`, `mean_ttft_ms ~= 42800`, `mean_tpot_ms ~= 35`, `mean_e2e_latency_ms ~= 51900` Treat those as sanity-check ranges, not fixed assertions. ## Relationship To DynoSim Runs A [DynoSim run](/dynamo/user-guides/dynosim/runs) answers "how does this one configuration perform?" A DynoSim sweep answers "which configuration should I try next?" For final validation, take feasible candidates into a live Mocker deployment or a real-GPU AIPerf benchmark. DynoSim is designed to narrow the search space before cluster validation, not to replace it. # Planner DynoSim Benchmarking This guide shows how to benchmark the Dynamo Planner against a recorded trace by running it inside DynoSim. Use it to compare `agg` vs `disagg` topologies, tune SLA targets, and study how deployment realities (engine startup time, worker counts) affect planner behavior — all without bringing up a live cluster. For the general mechanics of DynoSim runs (input format, arrival speedup, router modes, synthetic workloads), see [DynoSim Runs](/dynamo/user-guides/dynosim/runs). This guide focuses on the `--planner-config` path. ## 1. Setup ### Build Build the Dynamo Python bindings so `python -m dynamo.replay` is available: ```bash cd lib/bindings/python .venv/bin/maturin develop --release ``` The `--release` flag is strongly recommended. DynoSim execution is largely single-threaded and CPU-bound on the mocker engine core; a debug build can be 5–10× slower, which compounds across sweep runs. ### Key Planner Config Knobs Passed as JSON via `--planner-config`. Uses the same schema as the live planner. 
The fields most relevant to benchmarking: | Field | Purpose | |---|---| | `mode` | `"agg"` or `"disagg"` — picks scaling strategy and required engine args. | | `optimization_target` | `"sla"` uses TTFT/ITL targets; `"throughput"` uses static queue/KV thresholds. | | `ttft_ms` / `itl_ms` | SLA targets in ms. Drives load-scaling decisions. | | `enable_throughput_scaling` | Periodic scaling based on predicted steady-state load. | | `enable_load_scaling` | Reactive scaling to short-term traffic spikes. | | `throughput_adjustment_interval_seconds` | Seconds between throughput-scaling decisions. | | `load_adjustment_interval_seconds` | Seconds between load-scaling decisions. Short intervals mean faster reaction but more flapping. | | `pre_deployment_sweeping_mode` | `"rapid"` uses the AIC analytical model; leave unset to fall back to recorded profile data. | | `prefill_engine_num_gpu` / `decode_engine_num_gpu` | GPUs per engine replica. **Must be set explicitly** — both default to `None`, and the simulation adapter silently treats `None` as `0`, which collapses the cumulative-GPU-hours metric in the report to zero. | | `report_filename` | Output HTML filename under `./planner_reports/`. | ### Key Engine Arg Knobs Passed as JSON via `--extra-engine-args` (agg) or `--prefill-engine-args` / `--decode-engine-args` (disagg). DynoSim uses the mocker engine, so "engine args" means the analytical perf model inputs: | Field | Purpose | |---|---| | `aic_backend` | Backend the analytical model should emulate, e.g. `"vllm"`, `"trtllm"`, `"sglang"`. | | `aic_system` | GPU SKU for the perf model, e.g. `"h200_sxm"`, `"h100_sxm"`, `"b200"`. | | `aic_model_path` | HF model identifier used by the perf model. | | `aic_tp_size` | Tensor-parallel size of each engine replica. | | `startup_time` | Simulated seconds between a planner scale-up decision and the new worker becoming active. Unset or `0` means workers activate instantly. 
| Other fields follow the standard mocker engine protocol (see [DynoSim Runs](/dynamo/user-guides/dynosim/runs)). ## 2. Example: Agg vs Disagg On The Mooncake Agentic Trace Download the trace: ```bash mkdir -p traces/mooncake_fast25 && cd traces/mooncake_fast25 curl -sLO https://raw.githubusercontent.com/kvcache-ai/Mooncake/main/FAST25-release/traces/toolagent_trace.jsonl ``` Run agg (2 workers, TP=1): ```bash .venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace.jsonl \ --planner-config '{ "mode": "agg", "optimization_target": "sla", "ttft_ms": 1500, "itl_ms": 50, "enable_throughput_scaling": true, "enable_load_scaling": true, "pre_deployment_sweeping_mode": "rapid", "throughput_adjustment_interval_seconds": 300, "load_adjustment_interval_seconds": 10, "prefill_engine_num_gpu": 1, "decode_engine_num_gpu": 1, "report_filename": "dynosim_agg.html" }' \ --extra-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}' \ --num-workers 2 --arrival-speedup-ratio 1.0 ``` Run disagg (1P1D, TP=1): ```bash .venv/bin/python -m dynamo.replay traces/mooncake_fast25/toolagent_trace.jsonl \ --planner-config '{ "mode": "disagg", "optimization_target": "sla", "ttft_ms": 1500, "itl_ms": 50, "enable_throughput_scaling": true, "enable_load_scaling": true, "pre_deployment_sweeping_mode": "rapid", "throughput_adjustment_interval_seconds": 300, "load_adjustment_interval_seconds": 10, "prefill_engine_num_gpu": 1, "decode_engine_num_gpu": 1, "report_filename": "dynosim_disagg.html" }' \ --prefill-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}' \ --decode-engine-args '{"aic_backend": "vllm", "aic_system": "h200_sxm", "aic_model_path": "nvidia/Llama-3.1-8B-Instruct-FP8", "aic_tp_size": 1}' \ --num-prefill-workers 1 --num-decode-workers 1 --arrival-speedup-ratio 1.0 ``` Each run prints the AIPerf summary table 
to stdout and writes an HTML diagnostics report to `./planner_reports/`. For this trace with a long ISL and short OSL, agg is better than disagg, which gets slightly better ITL at the cost of noticeably more GPU-hours.

## 3. Example: Cold-Start-Time Sweep

How sensitive is SLA attainment to engine startup time? Sweep `startup_time` from 0 to 300 seconds in 10-second steps and record TTFT/ITL/GPU-hours per run.

```bash
#!/usr/bin/env bash
set -euo pipefail

TRACE=traces/mooncake_fast25/toolagent_trace.jsonl
OUT=planner_reports/sweep_startup
mkdir -p "$OUT"
# Export both so the child shells started by xargs can see them.
export TRACE OUT

run_one() {
  local s=$1
  local name=$(printf "dynosim_agg_startup_%03d.html" "$s")
  local extra
  if [[ "$s" -eq 0 ]]; then
    extra='{"aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1}'
  else
    extra=$(printf '{"aic_backend":"vllm","aic_system":"h200_sxm","aic_model_path":"nvidia/Llama-3.1-8B-Instruct-FP8","aic_tp_size":1,"startup_time":%d}' "$s")
  fi
  .venv/bin/python -m dynamo.replay "$TRACE" \
    --planner-config "$(printf '{"mode":"agg","optimization_target":"sla","ttft_ms":1500,"itl_ms":50,"enable_throughput_scaling":true,"enable_load_scaling":true,"pre_deployment_sweeping_mode":"rapid","throughput_adjustment_interval_seconds":300,"load_adjustment_interval_seconds":10,"prefill_engine_num_gpu":1,"decode_engine_num_gpu":1,"report_filename":"%s"}' "$name")" \
    --extra-engine-args "$extra" \
    --num-workers 2 --arrival-speedup-ratio 1.0 \
    --report-json "$OUT/startup_$(printf '%03d' "$s").json" \
    >"$OUT/startup_$(printf '%03d' "$s").log" 2>&1
}
export -f run_one

# Run 12 sweeps in parallel; adjust -P for your machine.
seq 0 10 300 | xargs -n1 -P12 -I{} bash -c 'run_one "$@"' _ {}
```

Each run emits the AIPerf metrics table (parse TTFT / ITL avg / p90) and its HTML report (grep `GPU hours: `).
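Collecting the GPU-hours figure from each report can be scripted instead of grepping by hand. This is a rough sketch: it assumes the HTML contains the literal text `GPU hours: ` followed directly by a plain decimal number, and that the reports use the `dynosim_agg_startup_NNN.html` names from the sweep script; `collect_gpu_hours` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumption: the number follows the label directly, with no markup in between.
GPU_HOURS_RE = re.compile(r"GPU hours:\s*([0-9]+(?:\.[0-9]+)?)")
NAME_RE = re.compile(r"dynosim_agg_startup_(\d+)\.html$")


def collect_gpu_hours(report_dir="planner_reports"):
    """Map startup_time in seconds to GPU hours for every sweep report found."""
    root = Path(report_dir)
    if not root.is_dir():
        return {}
    rows = {}
    for path in sorted(root.glob("dynosim_agg_startup_*.html")):
        name_match = NAME_RE.search(path.name)
        text = path.read_text(encoding="utf-8", errors="replace")
        value_match = GPU_HOURS_RE.search(text)
        if name_match and value_match:
            rows[int(name_match.group(1))] = float(value_match.group(1))
    return rows


for startup, hours in sorted(collect_gpu_hours().items()):
    print(f"{startup:>4d} s  {hours:.2f} GPU-hours")
```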
Plotting those against `startup_time` gives: ![Planner DynoSim: startup time sweep](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/cebfa5849a4bd80ac622ee88350002820851b3d4a6c4ecd2d48738936584ddf4/pages-v1.2.0/assets/img/planner-replay-startup-sweep.png) Observations from this sweep (agg, TTFT SLA 1,500 ms, ITL SLA 50 ms, H200-SXM, Llama-3.1-8B-FP8, TP=1): - **SLA cliff near 100–120 s.** Below that, the planner scales up fast enough to hold TTFT; above it, p99 TTFT diverges and the system stays perpetually backlogged. - **Scaling-event count drops monotonically** (42 → 8) as startup grows: long-startup runs require the load planner to wait for stabilization before the next scaling decision. - **ITL is less sensitive than TTFT** until the queue saturates. Below the cliff, ITL rises modestly (25 → 30 ms avg); above it, p90 ITL jumps to ~200 ms as decode requests starve. # Agents Dynamo provides a small set of request extensions and trace utilities for serving agentic workloads. The harness remains responsible for the semantic agent trajectory. Dynamo receives lightweight metadata and uses it for serving telemetry, routing hints, and backend-specific cache behavior. ## Core Concepts | Concept | Purpose | |---------|---------| | [Agent Tracing](/dynamo/user-guides/agents/agent-tracing) | Passive `session_id`/`trajectory_id` metadata plus Dynamo-owned request timing, token, cache, worker-placement, and harness tool-event traces. | | [Agent Hints](/dynamo/user-guides/agents/agent-hints) | Optional per-request hints such as priority, expected output length, and speculative prefill. | | [Use Pi-Mono with Dynamo](/dynamo/user-guides/agents/use-pi-mono-with-dynamo) | End-to-end quickstart that drives the Pi coding agent through Dynamo with agent context and tool tracing turned on. | | [Tool Calling](/dynamo/user-guides/parsing/tool-call-parsing-dynamo) | Supported tool-call parsers and parser names, plus engine-fallback configurations.
| | [Reasoning](/dynamo/user-guides/parsing/reasoning-parsing-dynamo) | Supported reasoning parsers for chain-of-thought models, plus engine-fallback configurations. | ## Backend-Specific Guides Agent features are exposed through common request metadata, but backend support varies by runtime. | Backend Guide | Contents | |---------------|----------| | [SGLang for Agentic Workloads](/dynamo/backends/sg-lang/agentic-workloads) | Priority scheduling, priority-based radix eviction, speculative prefill, and streaming session control for subagent KV isolation. | ## Request Surface Agent-facing request metadata lives under `nvext` on OpenAI-compatible request bodies: ```json { "nvext": { "agent_context": { "session_type_id": "deep_research", "session_id": "research-run-42", "trajectory_id": "research-run-42:researcher" }, "agent_hints": { "priority": 5, "osl": 1024 } } } ``` Use `agent_context` when you want traceability across LLM calls, tool calls, and external trajectory files. Use `agent_hints` only when the harness has serving-relevant intent that Dynamo can act on. # Agent Tracing Agent tracing records **who** called (`nvext.agent_context`), **what Dynamo measured** on each LLM request (`request_end`), and optional **harness tool spans** (`tool_*`). Context is passive—it does not steer routing or caching. Output is best-effort profiling data, not an audit log. **Flow:** Harness sends chat completions with `agent_context` → Dynamo emits `request_end` to trace sinks. Harness sends tool events over ZMQ → same sinks. ## Adding trace context to each LLM call **Direct LLM call** Inject `agent_context` into each LLM request ```json { "model": "my-model", "messages": [{ "role": "user", "content": "..." 
}], "nvext": { "agent_context": { "session_type_id": "deep_research", "session_id": "research-run-42", "trajectory_id": "research-run-42:researcher", "parent_trajectory_id": "research-run-42:planner" } } } ``` | Field | Required | Meaning | | ---------------------- | :------: | ---------------------------------------- | | `session_type_id` | Yes | Workload class (e.g. `deep_research`). | | `session_id` | Yes | Whole agent run. | | `trajectory_id` | Yes | One reasoning/tool chain inside the run. | | `parent_trajectory_id` | No | Parent trajectory when using subagents. | **OpenAI client:** merge into `extra_body` / `extra_headers`: ```python import uuid def instrument_llm_request(kwargs, agent_context): body = dict(kwargs.get("extra_body") or {}) nvext = dict(body.get("nvext") or {}) nvext["agent_context"] = dict(agent_context) body["nvext"] = nvext headers = dict(kwargs.get("extra_headers") or {}) headers.setdefault("x-request-id", str(uuid.uuid4())) out = dict(kwargs) out["extra_body"] = body out["extra_headers"] = headers return out ``` `x-request-id` is your logical per-call id; Dynamo stores it as `request.x_request_id` (distinct from Dynamo's internal `request_id`). No Dynamo imports are required in the harness. Keep context in a contextvar, attach before each completion, and propagate across threads/processes when those paths call the model or emit tools. ## Enable output The fast path is one environment variable: ```bash export DYN_AGENT_TRACE=1 ``` That picks `jsonl_gz` output at `/tmp/dynamo-agent-trace.*.jsonl.gz` and binds the harness tool-event ZMQ endpoint at `tcp://127.0.0.1:20390`. Any of the per-knob variables below still wins when set explicitly, so you only need to reach for them to relocate output, add `stderr`, or tune buffers. To relocate captures only: ```bash export DYN_AGENT_TRACE=1 export DYN_AGENT_TRACE_OUTPUT_PATH=/mnt/captures/run-42 ```
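Returning to the harness side: the advice to keep context in a contextvar and attach it before each completion can be sketched as follows. The helper names (`set_agent_context`, `with_agent_context`) are hypothetical and the sketch needs no Dynamo imports; it only shapes the `extra_body` that an OpenAI-style client would send.

```python
import contextvars

# Holds the agent_context for the current task or thread of execution.
_agent_context = contextvars.ContextVar("agent_context", default=None)


def set_agent_context(session_type_id, session_id, trajectory_id, parent_trajectory_id=None):
    """Store the context; returns a token that can be used to restore the previous value."""
    ctx = {
        "session_type_id": session_type_id,
        "session_id": session_id,
        "trajectory_id": trajectory_id,
    }
    if parent_trajectory_id is not None:
        ctx["parent_trajectory_id"] = parent_trajectory_id
    return _agent_context.set(ctx)


def with_agent_context(kwargs):
    """Return a copy of completion kwargs with the current agent_context attached."""
    ctx = _agent_context.get()
    out = dict(kwargs)
    if ctx is None:
        return out
    body = dict(out.get("extra_body") or {})
    nvext = dict(body.get("nvext") or {})
    nvext["agent_context"] = dict(ctx)
    body["nvext"] = nvext
    out["extra_body"] = body
    return out


token = set_agent_context("deep_research", "research-run-42", "research-run-42:researcher")
request = with_agent_context({"model": "my-model", "messages": [{"role": "user", "content": "..."}]})
print(request["extra_body"]["nvext"]["agent_context"]["trajectory_id"])
_agent_context.reset(token)
```

Context variables follow `asyncio` tasks automatically, but plain threads and subprocesses do not inherit later changes, so the context still has to be passed explicitly across those boundaries.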
All agent trace environment variables | Variable | Required | Default (when `DYN_AGENT_TRACE=1`) | Notes | | ------------------------------------------ | :---------------------: | ---------------------------------- | ------------------------------------------------------------------------------------ | | `DYN_AGENT_TRACE` | Master switch | unset | Truthy (`1`, `true`, `on`, `yes`) enables tracing with all defaults below. | | `DYN_AGENT_TRACE_SINKS` | No | `jsonl_gz` | `jsonl`, `jsonl_gz`, `stderr`, or comma-separated (e.g. `jsonl_gz,stderr`). | | `DYN_AGENT_TRACE_OUTPUT_PATH` | No | `/tmp/dynamo-agent-trace` | File path for `jsonl`; segment **prefix** for `jsonl_gz` → `prefix.NNNNNN.jsonl.gz`. | | `DYN_AGENT_TRACE_CAPACITY` | No | `1024` | Trace bus capacity. | | `DYN_AGENT_TRACE_JSONL_BUFFER_BYTES` | No | `1048576` | Buffer / gzip batch threshold. | | `DYN_AGENT_TRACE_JSONL_FLUSH_INTERVAL_MS` | No | `1000` | Flush interval. | | `DYN_AGENT_TRACE_JSONL_GZ_ROLL_BYTES` | No | `268435456` | Roll gzip segment by uncompressed bytes. | | `DYN_AGENT_TRACE_JSONL_GZ_ROLL_LINES` | No | unset | Optional roll by line count. | | `DYN_AGENT_TRACE_REPLAY_HASHES` | No | on | Falsey (`0`, `no`, …) disables `replay` hashes on requests. | | `DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT` | No | `tcp://127.0.0.1:20390` | PULL bind address for tool records. | | `DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_TOPIC` | No | unset | If set, first ZMQ frame must match. | Without `DYN_AGENT_TRACE=1`, tracing is off; the other variables only take effect once the master switch is on.
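A harness that publishes tool events needs to agree with Dynamo on these settings. The sketch below resolves a few of them from the environment using the defaults in the table; `resolve_trace_config` is a hypothetical helper, and treating the truthy values case-insensitively is an assumption.

```python
import os

TRUTHY = {"1", "true", "on", "yes"}

# Defaults copied from the table above; only a subset is shown.
DEFAULTS = {
    "DYN_AGENT_TRACE_SINKS": "jsonl_gz",
    "DYN_AGENT_TRACE_OUTPUT_PATH": "/tmp/dynamo-agent-trace",
    "DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT": "tcp://127.0.0.1:20390",
}


def resolve_trace_config(env=None):
    """Return effective settings, or None when the master switch is off."""
    env = os.environ if env is None else env
    if env.get("DYN_AGENT_TRACE", "").strip().lower() not in TRUTHY:
        return None
    # Explicit per-knob variables win over the defaults.
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


print(resolve_trace_config({}))
print(resolve_trace_config({"DYN_AGENT_TRACE": "1", "DYN_AGENT_TRACE_OUTPUT_PATH": "/mnt/captures/run-42"}))
```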
## Tool events (ZMQ) Wire format: `[topic, seq_be_u64, msgpack(AgentTraceRecord)]`. To publish to Dynamo, use a background publisher, bounded queue, monotonic sequence, and PUSH with HWM. **Terminal** `tool_end` / `tool_error` rows should carry timing (`started_at_unix_ms`, `ended_at_unix_ms`, `duration_ms`) even if `tool_start` was dropped. Same `agent_context` as the surrounding LLM calls; `tool_call_id` unique per trajectory. Join offline on `session_id`, `trajectory_id`, `tool_call_id`. Example `tool_end`: ```json { "schema": "dynamo.agent.trace.v1", "event_type": "tool_end", "event_time_unix_ms": 1777312801500, "event_source": "harness", "agent_context": { "session_type_id": "deep_research", "session_id": "research-run-42", "trajectory_id": "research-run-42:researcher" }, "tool": { "tool_call_id": "call-abc", "tool_class": "web_search", "status": "succeeded", "started_at_unix_ms": 1777312801080, "ended_at_unix_ms": 1777312801500, "duration_ms": 420.5 } } ``` Optional `tool` keys: `output_tokens`, `output_bytes`, `tool_name_hash`, `error_type` (useful on `tool_error`). Status values: `running`, `succeeded`, `error`, `cancelled`; synonyms `ok`/`success`, `failed`, `timeout`/`canceled` also deserialize. ## Dynamo `request_end` record Emitted after the response stream finishes or is dropped. Omitted keys were not recorded on that path; see `AgentTraceRecord` / `AgentRequestMetrics` in `lib/llm/src/agents/trace/types.rs` for the full Rust schema. 
```json
{
  "schema": "dynamo.agent.trace.v1",
  "event_type": "request_end",
  "event_time_unix_ms": 1777312801000,
  "event_source": "dynamo",
  "agent_context": {
    "session_type_id": "deep_research",
    "session_id": "research-run-42",
    "trajectory_id": "research-run-42:researcher",
    "parent_trajectory_id": "research-run-42:planner"
  },
  "request": {
    "request_id": "dynamo-request-id",
    "x_request_id": "llm-call-42",
    "model": "my-model",
    "output_tokens": 16,
    "finish_reason_metadata": {
      "finish_reason": "tool_calls",
      "backend_finish_reason": "stop",
      "stop_reason": "END",
      "tool_calls": [
        { "choice_index": 0, "tool_call_index": 0, "id": "call-abc", "name": "web_search" }
      ],
      "choices": [
        { "choice_index": 0, "finish_reason": "tool_calls", "backend_finish_reason": "stop", "stop_reason": "END" }
      ]
    },
    "replay": {
      "trace_block_size": 64,
      "input_length": 128,
      "input_sequence_hashes": [14879255164371896291, 274632075616497421]
    }
  }
}
```

`finish_reason_metadata` is optional. `backend_finish_reason` and `stop_reason` come from the backend/token stop path; `finish_reason` is the final OpenAI-compatible finish reason after parser rewrites, such as `tool_calls`. Top-level finish fields summarize the common single-choice case; `choices` keeps per-choice finish fields when `n > 1`. Tool-call metadata includes ids and names only; arguments are intentionally not stored in agent traces.

For chat streams, final finish metadata is recorded after parser/jail rewrites; completion streams record the final OpenAI-compatible completion finish reason from the completion response choices.

By default, agent traces do not store input or output payloads. To view them, use Dynamo's built-in `audit_sink` functionality.
**Audit side-by-side** (same gzip/jsonl machinery):

```bash
# enable agent trace sinks (the master switch must be on for the other variables to apply)
export DYN_AGENT_TRACE=1
export DYN_AGENT_TRACE_SINKS=jsonl_gz
export DYN_AGENT_TRACE_OUTPUT_PATH=/tmp/dynamo-trace
# enable audit sinks
export DYN_AUDIT_SINKS=jsonl_gz
export DYN_AUDIT_OUTPUT_PATH=/tmp/dynamo-audit
export DYN_AUDIT_FORCE_LOGGING=true
```

After the run, correlate by id:

```bash
gzip -cd /tmp/dynamo-audit.*.jsonl.gz | jq -c '.event' > /tmp/audit.jsonl
gzip -cd /tmp/dynamo-trace.*.jsonl.gz | jq -c '.event' > /tmp/trace.jsonl
jq -s 'group_by(.request_id // .request.request_id)' /tmp/audit.jsonl /tmp/trace.jsonl
```

Each line written by a sink is a JSON object that wraps the record:

```json
{ "timestamp": 1234, "event": { "schema": "dynamo.agent.trace.v1", "...": "..." } }
```

`timestamp` is sink-relative elapsed ms; use `event.event_time_unix_ms` for wall-clock ordering.

## Viewing traces in Perfetto

To visualize and optimize your agentic graph, Dynamo provides a utility that converts the agent trace JSONL files into a [Perfetto](https://ui.perfetto.dev/) trace file. We have found this extremely useful for pipelining the agents that our team writes.

```bash
uv run --no-project python benchmarks/agent_trace/convert_to_perfetto.py \
  "${DYN_AGENT_TRACE_OUTPUT_PATH}".*.jsonl.gz \
  --output "${DYN_AGENT_TRACE_OUTPUT_PATH}.perfetto.json"
```

Open in [Perfetto UI](https://ui.perfetto.dev/). Flags: `--include-markers`, `--no-stages`, `--separate-stage-tracks`. Request slices include flattened finish metadata when present, such as `finish.finish_reason`, `finish.backend_finish_reason`, `finish.stop_reason`, `finish.tool_call_count`, `finish.tool_call_names`, and per-choice summaries like `finish.choice_finish_reasons`.

## [Experimental] Replaying agent traces using agentic Mooncake replay

You can convert a collected agent trace into an **agentic Mooncake** trace and replay it with `python -m dynamo.replay`.
The converter uses Dynamo `request_end` rows for request timing, token lengths, worker placement, and replay hashes. It also uses terminal harness tool rows (`tool_end` / `tool_error`) to preserve tool-wait time between dependent LLM requests. Replay ignores non-replay request fields such as `finish_reason_metadata`; use the Perfetto view above when you want to inspect final finish reasons, backend stop signals, or complete tool-call metadata inside the trace. ```bash cargo run -p dynamo-bench --bin agent_trace_to_mooncake -- \ --agentic \ --input-path "${DYN_AGENT_TRACE_OUTPUT_PATH}".*.jsonl.gz \ --output-file /tmp/dynamo-agent-trace.agentic-mooncake.jsonl ``` The binary prints **`trace_block_size`**. Use that exact value for replay so hash segmentation matches what Dynamo recorded. Align the mock engine block size with the same number in `--extra-engine-args`. ```bash TRACE_BLOCK_SIZE=128 uv run --no-sync python -m dynamo.replay /tmp/dynamo-agent-trace.agentic-mooncake.jsonl \ --trace-format agentic_mooncake \ --trace-block-size "${TRACE_BLOCK_SIZE}" \ --replay-mode offline \ --router-mode kv_router \ --num-workers 4 \ --extra-engine-args "{\"block_size\":${TRACE_BLOCK_SIZE}}" \ --report-json /tmp/dynamo-agent-trace.replay-report.json ``` `kv_router` needs **at least two** mock workers; for a single-worker smoke test use `--router-mode round_robin --num-workers 1`. Agentic Mooncake rows preserve: - `request_id`: the LLM request row identity. - `session_id`: the Dynamo `trajectory_id`. - `wait_for`: request ids that must complete before this row becomes eligible. - `branches`: child request ids spawned from this row. - `prefix_reset`: first request in a trajectory. - `delay`: non-tool delay after dependencies finish. - `tool_wait_ms`: tool time after dependencies finish, parallel-aware (the union of overlapping spans rather than their sum). 
- `tool_events`: per-tool spans attributed to this LLM request, each carrying `tool_call_id`, `tool_class`, `status`, `started_at_unix_ms`, `ended_at_unix_ms`, `duration_ms`, and optional `output_bytes` / `output_tokens` / `error_type`. - `hash_ids`, `input_length`, and `output_length`: prompt-prefix and length data for mocker replay. Rows with no `wait_for` use their `timestamp` as the replay start time. Rows with dependencies wait for all listed requests to complete, then wait `delay + tool_wait_ms` before dispatch. For more flags and engine settings, see [DynoSim Runs](/dynamo/user-guides/dynosim/runs).
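To make the "parallel-aware" wording concrete, here is a minimal sketch of how the union of overlapping tool spans differs from their sum, and how the result feeds the dispatch rule described above. It is an illustration only: `union_duration_ms` and `dispatch_time_ms` are hypothetical helpers, the span values are made up, and the converter's actual implementation may differ.

```python
def union_duration_ms(spans):
    """Total time covered by (start_ms, end_ms) spans, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # Gap before this span: close out the previous merged interval.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping span: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def dispatch_time_ms(dependency_end_times_ms, delay_ms, tool_wait_ms):
    """A dependent row waits for all dependencies, then delay + tool_wait_ms."""
    return max(dependency_end_times_ms) + delay_ms + tool_wait_ms


# Two tools run in parallel (0-400 and 100-500), a third runs later (700-900).
spans = [(0, 400), (100, 500), (700, 900)]
tool_wait = union_duration_ms(spans)
naive_sum = sum(end - start for start, end in spans)
eligible_at = dispatch_time_ms([1000, 1200], delay_ms=50, tool_wait_ms=tool_wait)
print(tool_wait, naive_sum, eligible_at)
```

Summing the spans would overstate the wait whenever tools overlap, which is why the union is the more faithful stand-in for wall-clock tool time.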
**ATIF alignment**

Dynamo emits `dynamo.agent.trace.v1`, not full ATIF logs, but its identifiers match [ATIF][atif-rfc] / [Harbor](https://github.com/harbor-framework/harbor), so you can join harness trajectories to Dynamo rows on `session_id` + `trajectory_id`. Dynamo omits conversational payload by design.

| Dynamo field           | Role                    |
| ---------------------- | ----------------------- |
| `session_id`           | Shared run id           |
| `trajectory_id`        | Branch within run       |
| `parent_trajectory_id` | Subagent link           |
| `session_type_id`      | Profile / workload type |
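A join on those two identifiers might look like the sketch below. It assumes both sides have already been loaded as dictionaries, that Dynamo rows carry the ids under `agent_context` as in the `request_end` example, and that harness rows expose `session_id` and `trajectory_id` at the top level. The harness-side layout and the `join_on_trajectory` helper are assumptions for illustration, since the exact ATIF layout is not covered here.

```python
from collections import defaultdict


def join_on_trajectory(harness_rows, dynamo_rows):
    """Pair each harness row with the Dynamo rows sharing its (session_id, trajectory_id)."""
    index = defaultdict(list)
    for row in dynamo_rows:
        ctx = row.get("agent_context", {})
        index[(ctx.get("session_id"), ctx.get("trajectory_id"))].append(row)
    return [
        # .get avoids creating empty entries for unmatched harness rows.
        (h, index.get((h.get("session_id"), h.get("trajectory_id")), []))
        for h in harness_rows
    ]


# Hypothetical harness rows; field placement is assumed, not taken from ATIF.
harness_rows = [
    {"session_id": "research-run-42", "trajectory_id": "research-run-42:researcher"},
    {"session_id": "research-run-42", "trajectory_id": "research-run-42:planner"},
]
dynamo_rows = [
    {
        "event_type": "request_end",
        "agent_context": {
            "session_id": "research-run-42",
            "trajectory_id": "research-run-42:researcher",
        },
        "request": {"request_id": "dynamo-request-id"},
    }
]

joined = join_on_trajectory(harness_rows, dynamo_rows)
for harness_row, matches in joined:
    print(harness_row["trajectory_id"], len(matches))
```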
[atif-rfc]: https://github.com/harbor-framework/harbor/blob/main/rfcs/0001-trajectory-format.md # Agent Hints Agent hints are optional per-request metadata that a harness sends under `nvext.agent_hints`. Dynamo parses these hints in the frontend and passes them to the router and, where supported, backend runtimes. Use hints only for serving-relevant intent. Use [`nvext.agent_context`](/dynamo/user-guides/agents/agent-tracing#request-schema) for passive trace identity. ## Request Schema ```json { "model": "my-model", "messages": [ { "role": "user", "content": "Continue the report." } ], "nvext": { "agent_hints": { "priority": 5, "osl": 1024, "speculative_prefill": true } } } ``` | Hint | Description | |------|-------------| | `priority` | Unified request priority. Higher values move the request earlier in the router queue and are forwarded to backends that support priority scheduling or eviction. | | `osl` | Expected output sequence length in tokens. Used by the router for output block tracking and load-balancing accuracy when `--router-track-output-blocks` is enabled. | | `speculative_prefill` | When true, Dynamo can prefill the predicted next-turn prefix after the current turn completes to warm the KV cache for the next request. | ## Request Flow ```mermaid flowchart LR Harness[Agent harness] -->|nvext.agent_hints| Frontend[Dynamo frontend] Frontend --> Router[Router] Router --> Worker[Backend worker] ``` The frontend parses `nvext.agent_hints`, the router uses hints for queueing and worker selection, and supported backends use forwarded hints for engine-level scheduling and cache policy. ## Backend Support Backend support is runtime-specific. For SGLang flags and behavior, see [SGLang for Agentic Workloads](/dynamo/backends/sg-lang/agentic-workloads). 
| Feature | vLLM | SGLang | TensorRT-LLM | |---------|:----:|:------:|:-------------:| | Priority-aware routing | Yes | Yes | Yes | | Priority-based cache eviction | Planned | Yes | Planned | | Speculative prefill | Yes | Yes | Yes | | Subagent KV isolation with session control | No | Experimental | No | ## Related Request Extensions `agent_hints` is separate from `agent_context`: - `agent_context` is passive identity for traces and joins. - `agent_hints` is active serving intent for routing, scheduling, and cache behavior. Session-control metadata for SGLang subagent KV isolation lives under `nvext.session_control`; see [NVIDIA Request Extensions](/dynamo/additional-resources/nvidia-request-extensions-nvext#session-control). # Use Pi-Mono with Dynamo [Pi-Mono](https://github.com/badlogic/pi-mono) is an open-source coding-agent harness whose clean plugin architecture has made it a popular substrate for patterns like subagents and planner/implementer loops. The [`pi-dynamo-provider`](https://github.com/ai-dynamo/pi-dynamo-provider) extension uses that plugin surface to register Dynamo as a Pi model provider. It runs in-process, adds Dynamo's [`agent_context`](/dynamo/user-guides/agents/agent-tracing) and [`agent_hints`](/dynamo/user-guides/agents/agent-hints) to each request, and emits Pi's tool lifecycle events to Dynamo over ZMQ. This page is one worked example of how to wire a harness up to Dynamo's tracing and hint APIs — use it as a reference, not a prescription. ## Why run Pi through Dynamo You can already point Pi at any OpenAI-compatible endpoint — Ollama, vLLM, a hosted API, or Dynamo out of the box. Routing through Dynamo *with this extension* gives you two things you don't get from plain hosting: - **Harness-aware observability.** Pi's session and trajectory IDs flow into Dynamo's `request_end` traces, and Pi's tool spans land on the same timeline. One Perfetto view shows LLM requests, prefill/decode stages, and tool calls together. 
- **Harness-aware orchestration.** Once Dynamo knows which trajectory a request belongs to, it can act on agent hints (priority, expected output length, speculative prefill) for smarter scheduling and KV-aware routing. That same trajectory awareness is what lets backends like [SGLang](/dynamo/backends/sg-lang/agentic-workloads) apply priority-based radix eviction and session-scoped KV isolation.

The integration works against any Dynamo backend (vLLM, SGLang, or TRT-LLM) without backend-specific glue.

## What the extension does

- Registers a `dynamo` provider in Pi: `pi --model dynamo/`.
- Discovers models from Dynamo's `/v1/models`.
- Injects `nvext.agent_context` (session/trajectory IDs) into every chat-completion request.
- Adds `x-request-id` when one is not already set.
- Relays Pi's `tool_start` / `tool_end` / `tool_error` events to Dynamo over ZMQ so LLM and tool spans share one trace.

```mermaid
sequenceDiagram
    participant Pi as Pi-Mono
    participant Provider as pi-dynamo-provider
    participant Dynamo as Dynamo frontend
    participant Trace as Agent trace sink
    Pi->>Provider: streamSimple(model, context)
    Provider->>Dynamo: POST /v1/chat/completions<br/>nvext.agent_context, x-request-id
    Dynamo-->>Provider: SSE chunks
    Dynamo->>Trace: request_end
    Pi->>Provider: tool_execution_start / _end
    Provider->>Dynamo: ZMQ PUSH tool record
    Dynamo->>Trace: tool_start / tool_end
```

## Quickstart

### 1. Install the provider

Build from source and install it into Pi:

```bash
git clone git@github.com:ai-dynamo/pi-dynamo-provider.git
cd pi-dynamo-provider
npm install && npm run build
pi install /absolute/path/to/pi-dynamo-provider
```

### 2. Launch Dynamo with tracing enabled

Use the in-repo SGLang launcher (`examples/backends/sglang/launch/agg_agent.sh`), which starts a frontend with KV routing plus one SGLang worker with streaming sessions, KV events, and reasoning/tool parsers wired up. Export the agent-trace env vars first so the worker records traces to a JSONL file and binds the ZMQ socket Pi will connect to:

```bash
export DYN_AGENT_TRACE_SINKS=jsonl
export DYN_AGENT_TRACE_OUTPUT_PATH=/tmp/dynamo-agent-trace.jsonl
export DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT=tcp://127.0.0.1:20390
./examples/backends/sglang/launch/agg_agent.sh
```

By default this serves `zai-org/GLM-4.7-Flash` on TP 2. Override with `--model-path` / `--tp` if needed. See [Agent Tracing → Enable output](/dynamo/user-guides/agents/agent-tracing#enable-output) for the full env-var reference.

The provider works equally well against any Dynamo backend (vLLM, SGLang, TRT-LLM); the SGLang launcher is just the most batteries-included starting point.

### 3. Point Pi at Dynamo

```bash
export DYNAMO_BASE_URL=http://127.0.0.1:8000/v1
export DYNAMO_API_KEY=dummy
export DYN_AGENT_SESSION_TYPE_ID=pi_coding_agent
export DYN_AGENT_SESSION_ID=pi-demo-001
export DYN_AGENT_TOOL_EVENTS_ZMQ_ENDPOINT=tcp://127.0.0.1:20390
pi --model dynamo/zai-org/GLM-4.7-Flash \
  -p "Run the tests in this folder, fix the smallest bug, and rerun the tests."
```

`DYN_AGENT_SESSION_ID` becomes the trace's `session_id`; if `DYN_AGENT_TRAJECTORY_ID` is unset, Pi's session id is used as the trajectory id.

### 4. View the trace in Perfetto

```bash
python benchmarks/agent_trace/convert_to_perfetto.py \
  /tmp/dynamo-agent-trace.jsonl \
  --include-markers \
  --separate-stage-tracks \
  --output /tmp/dynamo-agent-trace.perfetto.json
```

Open the result at [ui.perfetto.dev](https://ui.perfetto.dev). You'll see:

- `dynamo.llm` spans for each LLM request.
- `dynamo.llm.stage` spans for prefill/decode when Dynamo records them.
- `dynamo.agent.tool` spans for every Pi tool invocation.
**Pi environment variables**

| Variable                             | Default                    | Purpose                                                              |
| ------------------------------------ | -------------------------- | -------------------------------------------------------------------- |
| `DYNAMO_BASE_URL`                    | `http://127.0.0.1:8000/v1` | Dynamo OpenAI-compatible endpoint root.                              |
| `DYNAMO_API_KEY`                     | `dynamo-local`             | Bearer token. Local Dynamo usually accepts any value.                |
| `DYN_AGENT_SESSION_TYPE_ID`          | `pi_coding_agent`          | Stable workload class for the trace.                                 |
| `DYN_AGENT_SESSION_ID`               | unset                      | Session/run id. Falls back to Pi's session id for tool events.       |
| `DYN_AGENT_TRAJECTORY_ID`            | unset                      | Trajectory id override; defaults to Pi's session id per request.     |
| `DYN_AGENT_PARENT_TRAJECTORY_ID`     | unset                      | Parent trajectory id for nested or subagent workflows.               |
| `DYN_AGENT_TOOL_EVENTS_ZMQ_ENDPOINT` | unset                      | Dynamo-bound ZMQ PULL endpoint Pi connects to for tool events.       |
| `DYN_AGENT_TOOL_EVENTS_ZMQ_TOPIC`    | `agent-tool-events`        | First ZMQ frame; must match `DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_TOPIC`. |

See the [provider README](https://github.com/ai-dynamo/pi-dynamo-provider) for the full variable list, aliases, and ZMQ wire format.
## Troubleshooting | Symptom | Likely cause | Fix | | ----------------------------------------- | ------------------------------------------------------- | ----------------------------------------------------------------------------------------- | | `pi` reports the model is unknown | `/v1/models` returned empty during Pi startup | `curl -s "$DYNAMO_BASE_URL/models"` to confirm; restart Pi after Dynamo is ready. | | LLM spans appear, tool spans do not | Tool-event endpoint not configured on both sides | Set `DYN_AGENT_TRACE_TOOL_EVENTS_ZMQ_ENDPOINT` on Dynamo and `DYN_AGENT_TOOL_EVENTS_ZMQ_ENDPOINT` on Pi to the same value. | | Tool spans appear, request spans do not | Dynamo trace sinks not enabled | Set `DYN_AGENT_TRACE_SINKS=jsonl` and `DYN_AGENT_TRACE_OUTPUT_PATH` on Dynamo. | | Authentication fails | Dynamo expects a specific token | Set `DYNAMO_API_KEY` to match your deployment. Local Dynamo usually accepts `dummy`. | ## Further reading - [pi-dynamo-provider repo](https://github.com/ai-dynamo/pi-dynamo-provider) — install, scripts, and source. - [Agent Tracing](/dynamo/user-guides/agents/agent-tracing) — the underlying trace protocol and `request_end` schema. - [Agent Hints](/dynamo/user-guides/agents/agent-hints) — per-request hints (`priority`, `osl`, `speculative_prefill`) Pi-Mono can forward via `nvext.agent_hints`. # LoRA Adapters LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing. ## Backend Support | Backend | Status | Notes | |---------|--------|-------| | vLLM | ✅ | Full support including KV-aware routing | | SGLang | 🚧 | In progress | | TensorRT-LLM | ❌ | Not yet supported | See the [Feature Matrix](/dynamo/resources/feature-matrix) for full compatibility details. 
## Overview

Dynamo's LoRA implementation provides:

- **Dynamic loading**: Load and unload LoRA adapters at runtime without restarting workers
- **Multiple sources**: Load from local filesystem (`file://`), S3-compatible storage (`s3://`), or Hugging Face Hub (`hf://`)
- **Automatic caching**: Downloaded adapters are cached locally to avoid repeated downloads
- **Discovery integration**: Loaded LoRAs are automatically registered and discoverable via `/v1/models`
- **KV-aware routing**: Route requests to workers with the appropriate LoRA loaded
- **Kubernetes native**: Declarative LoRA management via the `DynamoModel` CRD

### Architecture

```mermaid
flowchart TD
    Frontend["Frontend"] --> Router["Router<br/>(LoRA-aware)"]
    Router --> Workers["Workers<br/>(LoRA-loaded)"]
    Workers --> ManagerNode["LoRA Manager"]
    subgraph ManagerGroup["LoRA Manager"]
        Downloader
        Cache
    end
    ManagerNode --> Local["file://<br/>Local"]
    ManagerNode --> S3["s3://<br/>S3/MinIO"]
    ManagerNode --> HF["hf://<br/>(custom)"]
```

The LoRA system consists of:

- **Rust Core** (`lib/llm/src/lora/`): High-performance downloading, caching, and validation
- **Python Manager** (`components/src/dynamo/common/lora/`): Extensible wrapper with custom source support
- **Worker Handlers** (`components/src/dynamo/vllm/handlers.py`): Load/unload API and inference integration

## Quick Start

### Prerequisites

- Dynamo installed with vLLM support
- For S3 sources: AWS credentials configured
- A LoRA adapter compatible with your base model

### Local Development

**1. Start Dynamo with LoRA support:**

```bash
# Start vLLM worker with LoRA flags
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager \
  --enable-lora \
  --max-lora-rank 64
```

**2. Load a LoRA adapter:**

```bash
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-lora",
    "source": { "uri": "file:///path/to/my-lora" }
  }'
```

**3. Run inference with the LoRA:**

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-lora",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

### S3-Compatible Storage

For production deployments, store LoRA adapters in S3-compatible storage:

```bash
# Configure S3 credentials
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_ENDPOINT=http://minio:9000  # For MinIO
export AWS_REGION=us-east-1

# Load LoRA from S3
curl -X POST http://localhost:8081/v1/loras \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "customer-support-lora",
    "source": { "uri": "s3://my-loras/customer-support-v1" }
  }'
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_LORA_ENABLED` | Enable LoRA adapter support | `false` |
| `DYN_LORA_PATH` | Local cache directory for downloaded LoRAs | `~/.cache/dynamo_loras` |
| `AWS_ACCESS_KEY_ID` | S3 access key (for `s3://` URIs) | - | | `AWS_SECRET_ACCESS_KEY` | S3 secret key (for `s3://` URIs) | - | | `AWS_ENDPOINT` | Custom S3 endpoint (for MinIO, etc.) | - | | `AWS_REGION` | AWS region | `us-east-1` | | `AWS_ALLOW_HTTP` | Allow HTTP (non-TLS) connections | `false` | ### vLLM Arguments | Argument | Description | |----------|-------------| | `--enable-lora` | Enable LoRA adapter support in vLLM | | `--max-lora-rank` | Maximum LoRA rank (must be >= your LoRA's rank) | | `--max-loras` | Maximum number of LoRAs to load simultaneously | ## Backend API Reference ### Load LoRA Load a LoRA adapter from a source URI. ```text POST /v1/loras ``` **Request:** ```json { "lora_name": "string", "source": { "uri": "string" } } ``` **Response:** ```json { "status": "success", "message": "LoRA adapter 'my-lora' loaded successfully", "lora_name": "my-lora", "lora_id": 1207343256 } ``` ### List LoRAs List all loaded LoRA adapters. ```text GET /v1/loras ``` **Response:** ```json { "status": "success", "loras": { "my-lora": 1207343256, "another-lora": 987654321 }, "count": 2 } ``` ### Unload LoRA Unload a LoRA adapter from the worker. ```text DELETE /v1/loras/{lora_name} ``` **Response:** ```json { "status": "success", "message": "LoRA adapter 'my-lora' unloaded successfully", "lora_name": "my-lora", "lora_id": 1207343256 } ``` ## Kubernetes Deployment For Kubernetes deployments, use the `DynamoModel` Custom Resource to declaratively manage LoRA adapters. ### DynamoModel CRD ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoModel metadata: name: customer-support-lora namespace: dynamo-system spec: modelName: customer-support-adapter-v1 baseModelName: Qwen/Qwen3-0.6B # Must match modelRef.name in DGD modelType: lora source: uri: s3://my-models-bucket/loras/customer-support/v1 ``` ### How It Works When you create a `DynamoModel`: 1. **Discovers endpoints**: Finds all pods running your `baseModelName` 2. 
**Creates service**: Automatically creates a Kubernetes Service 3. **Loads LoRA**: Calls the LoRA load API on each endpoint 4. **Updates status**: Reports which endpoints are ready ### Verify Deployment ```bash # Check LoRA status kubectl get dynamomodel customer-support-lora # Expected output: # NAME TOTAL READY AGE # customer-support-lora 2 2 30s ``` For complete Kubernetes deployment details, see: - [Managing Models with DynamoModel](/dynamo/kubernetes-deployment/deploy-models/managing-models-with-dynamo-model) - [Kubernetes LoRA Deployment Example](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/deploy/lora/README.md) ## Examples | Example | Description | |---------|-------------| | [Local LoRA with MinIO](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/launch/lora/README.md) | Local development with S3-compatible storage | | [Kubernetes LoRA Deployment](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/deploy/lora/README.md) | Production deployment with DynamoModel CRD | ## Troubleshooting ### LoRA Fails to Load **Check S3 connectivity:** ```bash # Verify LoRA exists in S3 aws --endpoint-url=$AWS_ENDPOINT s3 ls s3://my-loras/ --recursive ``` **Check cache directory:** ```bash ls -la ~/.cache/dynamo_loras/ ``` **Check worker logs:** ```bash # Look for LoRA-related messages kubectl logs deployment/my-worker | grep -i lora ``` ### Model Not Found After Loading - Verify the LoRA name matches exactly (case-sensitive) - Check if the LoRA is listed: `curl http://localhost:8081/v1/loras` - Ensure discovery registration succeeded (check worker logs) ### Inference Returns Base Model Response - Verify the `model` field in your request matches the `lora_name` - Check that the LoRA is loaded on the worker handling your request - For disaggregated serving, ensure both prefill and decode workers have the LoRA ## KV Cache-Aware LoRA Routing When KV-aware routing is enabled, the router automatically accounts 
for LoRA adapter identity when computing block hashes. This means: - **Distinct hash spaces per adapter**: Blocks cached under adapter `A` will never be confused with blocks cached under adapter `B` or the base model, even if the token sequences are identical. The adapter name is mixed into the `LocalBlockHash` computation. - **Automatic prefix sharing within the same adapter**: Requests targeting the same LoRA adapter benefit from KV cache prefix matching just like base model requests do. - **No configuration required**: The LoRA name is propagated automatically through KV events (`BlockStored`) from the engine to the router. The router uses the `lora_name` field on events to route LoRA requests to workers that have matching cached blocks. This works end-to-end across the publisher pipeline, the KV consolidator (for deduplication), and the routing query path. ## See Also - [Feature Matrix](/dynamo/resources/feature-matrix) - Backend compatibility overview - [vLLM Backend](/dynamo/backends/v-llm) - vLLM-specific configuration - [Dynamo Operator](/dynamo/kubernetes-deployment/start-here/dynamo-operator) - Kubernetes operator overview - [Routing Concepts](/dynamo/components/router/routing-concepts) - LoRA-aware request routing - [KV Events for Custom Engines](/dynamo/integrations/kv-cache-integrations/kv-events-for-custom-engines) - Publishing LoRA-aware KV events # Multimodal Model Serving Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. **Security Requirement**: Multimodal processing must be explicitly enabled at startup. See the relevant backend documentation ([vLLM](multimodal-vllm.md), [SGLang](multimodal-sglang.md), [TRT-LLM](multimodal-trtllm.md)) for the necessary flags. This prevents unintended processing of multimodal data from untrusted sources. 
```mermaid --- title: Sample flow for an aggregated VLM serving scenario --- flowchart TD A[Request] --> B{KV cache hit?} B -->|Yes| C[Use KV] B -->|No| D{Embedding cache hit?} D -->|Yes| E[Load embedding] D -->|No| F[Run encoder] F --> G[save to cache] G --> H["PREFILL (image tokens + text tokens → KV cache)"] E --> H C --> I[DECODE] H --> I I --> J[Response] ``` ## Key Features Dynamo provides support for improving latency and throughput for vision-and-language workloads through the following features, that can be used together or separately, depending on your workload characteristics: | Feature | Description | |---------|-------------| | **[Embedding Cache](/dynamo/user-guides/multimodal/embedding-cache)** | CPU-side LRU cache that skips re-encoding repeated images | | **[Encoder Disaggregation](/dynamo/user-guides/multimodal/encoder-disaggregation)** | Separate vision encoder worker for independent scaling | | **[Multimodal KV Routing](/dynamo/user-guides/multimodal/multimodal-kv-routing)** | MM-aware KV cache routing for optimal worker selection | ## Support Matrix | Stack | Image | Video | Audio | |-------|-------|-------|-------| | **[vLLM](multimodal-vllm.md)** | ✅ | 🧪 | 🧪 | | **[TRT-LLM](multimodal-trtllm.md)** | ✅ | ❌ | ❌ | | **[SGLang](multimodal-sglang.md)** | ✅ | 🧪 | ❌ | **Status:** ✅ Supported | 🧪 Experimental | ❌ Not supported ## Security: URL Validation All multimodal loaders route remote fetches through a shared URL policy (`dynamo.common.multimodal.url_validator`). Only `https://` and `data:` URLs are allowed by default, private / internal IPs are blocked, and local file access is disabled. Every HTTP redirect hop is re-validated against the policy. Two environment variables loosen the defaults for non-public deployments: | Variable | Default | Effect | |----------|---------|--------| | `DYN_MM_ALLOW_INTERNAL` | `0` | Set to `1` to allow `http://`, private / internal IPs, and explicit ports. 
Intended for on-prem or local-dev setups where media lives on an internal network. | | `DYN_MM_LOCAL_PATH` | *(empty)* | Absolute directory prefix. When set, `file://` URIs and bare paths are allowed if they resolve inside this prefix. | **Never set `DYN_MM_ALLOW_INTERNAL=1` on public-facing deployments.** It opens SSRF paths to cloud metadata endpoints (AWS IMDS, GCE, Azure) and other internal services. ## Example Workflows Reference implementations for deploying multimodal models: - [vLLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/launch) (image, video) - [TRT-LLM multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/launch) - [SGLang multimodal examples](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/launch) ## Backend Documentation Detailed deployment guides, configuration, and examples for each backend: - **[vLLM Multimodal](multimodal-vllm.md)** - **[TensorRT-LLM Multimodal](multimodal-trtllm.md)** - **[SGLang Multimodal](multimodal-sglang.md)** # Embedding Cache ## Overview The embedding cache is a CPU-side LRU cache that stores vision encoder outputs. When the same image appears in multiple requests, the cached embedding is reused instead of running the vision encoder again. This reduces GPU load on the encoder and lowers latency for repeated images. > Note: This feature can also be referred to as **encoder cache**. Embedding cache is separate from KV cache, which reuses attention key/value state after prefill to skip prefill and go straight to decode. For KV cache reuse and routing, see [Multimodal KV Routing](/dynamo/user-guides/multimodal/multimodal-kv-routing). ## When to Use Use the embedding cache when your workload includes repeated images across requests. 
Common scenarios: - Product catalog queries where users ask about the same product images - Document processing pipelines that reference shared diagrams or figures - Chat sessions where the same image is discussed across multiple turns, like an architecture diagram in a code-gen use case. If your workload consists entirely of unique images, the cache provides no benefit. ## Support Matrix | Backend | Aggregated | Disaggregated (E/PD) | Notes | |---------|------------|----------------------|-------| | **vLLM** | ✅ | ✅ | Aggregated uses vLLM-native `ec_both`; disaggregated uses Dynamo `EmbeddingCacheManager` | | **TRT-LLM** | ❌ | ✅ | Dynamo `MultimodalEmbeddingCacheManager` in PD worker | | **SGLang** | ❌ | ❌ | Not supported yet | This support requires vLLM `0.17.0` or newer. ## How It Works The prefill worker owns the CPU-side LRU cache. On a hit, the encode worker is skipped entirely. On a miss, the encode worker produces the embedding, transfers it via NIXL, and the prefill worker saves it to the cache. ```mermaid flowchart LR req[Request] --> check{CPU cache hit?} check -. hit .-> use[Use cached embedding] check -- miss --> E[Encode Worker] E -- embeddings via NIXL --> save[Save to cache] save --> engine[Inference Engine] use --> engine ``` **Launch (vLLM):** ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10 ``` **Launch (TRT-LLM):** ```bash cd $DYNAMO_HOME/examples/backends/trtllm ./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10 ``` ## Configuration | Parameter | Description | Default | |-----------|-------------|---------| | `--multimodal-embedding-cache-capacity-gb` | CPU-side LRU cache size in GB | 0 (disabled) | Set the capacity based on your expected working set of unique images. A larger cache holds more embeddings but consumes more host memory. 
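The hit/miss flow described under How It Works can be approximated with a small byte-budgeted LRU cache. This is a sketch under stated assumptions rather than Dynamo's `EmbeddingCacheManager`: the class name `EmbeddingLRU`, the string image key, the `get_embedding` helper, and the use of raw `bytes` as the embedding payload are all hypothetical.

```python
from collections import OrderedDict


class EmbeddingLRU:
    """Hypothetical CPU-side LRU cache bounded by total payload bytes."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # image_key -> embedding bytes, oldest first

    def get(self, image_key):
        value = self._items.get(image_key)
        if value is not None:
            self._items.move_to_end(image_key)  # mark as most recently used
        return value

    def put(self, image_key, embedding: bytes):
        if len(embedding) > self.capacity_bytes:
            return  # can never fit; with capacity 0 nothing non-empty is cached
        old = self._items.pop(image_key, None)
        if old is not None:
            self.used_bytes -= len(old)
        # Evict least recently used entries until the new one fits.
        while self.used_bytes + len(embedding) > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._items[image_key] = embedding
        self.used_bytes += len(embedding)


def get_embedding(cache, image_key, encode):
    """Return (embedding, was_hit); `encode` stands in for the encode worker."""
    cached = cache.get(image_key)
    if cached is not None:
        return cached, True
    embedding = encode(image_key)
    cache.put(image_key, embedding)
    return embedding, False


encoder_calls = []

def fake_encode(key):
    # Stand-in for the vision encoder: records the call, returns 4 bytes.
    encoder_calls.append(key)
    return b"\x00" * 4

cache = EmbeddingLRU(capacity_bytes=10)
first = get_embedding(cache, "img-a", fake_encode)
second = get_embedding(cache, "img-a", fake_encode)
print(first[1], second[1], len(encoder_calls))
```

Budgeting by bytes rather than entry count mirrors a capacity flag expressed in GB: embeddings vary in size with image resolution, so an entry count would be a poor proxy for host memory.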
See the backend-specific documentation ([vLLM](multimodal-vllm.md#embedding-cache), [TRT-LLM](multimodal-trtllm.md#embedding-cache)) for more details. # Encoder Disaggregation ## Overview Encoder disaggregation separates the vision encoder from the prefill/decode pipeline into its own worker. Instead of running image encoding inline, a dedicated encode worker handles media processing and transfers the resulting embeddings to downstream workers via NIXL (RDMA). This enables: - Independent scaling of encode workers based on vision workload - Reduced GPU memory pressure on prefill/decode workers - Better GPU utilization by matching worker counts to actual bottlenecks ## When to Use Use encoder disaggregation when: - Vision encoding is a bottleneck and you need to scale encoders independently of LLM workers - You want to run the vision encoder on different hardware (e.g., smaller GPUs for encoding, larger GPUs for LLM inference) - Your deployment handles high volumes of multimodal requests and encoding throughput is limiting For simple deployments or development/testing, the aggregated (EPD) pattern is easier to set up. ## Support Matrix | Backend | E/PD | E/P/D | Notes | |---------|------|-------|-------| | **vLLM** | ✅ | ✅ | Separate encode worker currently handles `image_url` inputs; `video_url` inputs stay on the prefill/PD path | | **TRT-LLM** | ❌ | ✅ | Supports image URLs (via `MultimodalEncoder`) and pre-computed embeddings (via NIXL) | | **SGLang** | ✅ | ✅ | NIXL for embeddings; bootstrap mechanism for P/D KV transfer | ## Deployment Patterns **E/PD** — Separate encoder, combined prefill+decode: ```text Frontend → Processor → Encode Worker → PD Worker → Response (NIXL) ``` The encode worker runs the vision model and transfers embeddings via NIXL to a combined prefill+decode worker. 
**E/P/D** — All stages separate: ```text Frontend → Processor → Encode Worker → Prefill Worker → Decode Worker → Response (NIXL) (KV transfer) ``` Full disaggregation with separate workers for each stage. The encode worker transfers embeddings to the prefill worker, which then transfers KV cache to the decode worker. ## Launching ### vLLM ```bash cd $DYNAMO_HOME/examples/backends/vllm # E/PD bash launch/disagg_multimodal_e_pd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" # E/P/D bash launch/disagg_multimodal_epd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8" ``` ### TRT-LLM ```bash cd $DYNAMO_HOME/examples/backends/trtllm # E/PD bash launch/disagg_e_pd.sh # E/P/D ./launch/epd_multimodal_image_and_embeddings.sh ``` ### SGLang ```bash cd $DYNAMO_HOME/examples/backends/sglang # E/PD ./launch/multimodal_epd.sh # E/P/D ./launch/multimodal_disagg.sh ``` See the backend-specific documentation ([vLLM](multimodal-vllm.md), [TRT-LLM](multimodal-trtllm.md), [SGLang](multimodal-sglang.md)) for full configuration details and component flags. # Multimodal KV Routing ## Overview Multimodal KV routing extends Dynamo's KV-aware router to account for image content when computing cache overlap scores. An image hash (`mm_hash`) is computed per request — in the Rust frontend by default for vLLM backends, by vLLM's own processor when the chat-processor variant is enabled, or by a dedicated MM router worker for TRT-LLM backends — and included in per-block routing metadata. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks. Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse. > Note: KV cache is separate from embedding cache (also called encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. 
For encoder-side reuse see [Embedding Cache](/dynamo/user-guides/multimodal/embedding-cache). ## When to Use Use multimodal KV routing when: - You have multiple backend workers serving multimodal requests - Your workload includes repeated images across requests (e.g., the same product photo, shared reference images) - You want to maximize KV cache hit rates for multimodal content Without MM-aware routing, the standard router treats image token blocks as opaque and cannot match which worker has cached a particular image's KV blocks. ## Support Matrix | Backend | Path | Supported | Notes | |---------|------|-----------|-------| | **vLLM** | Rust frontend (default) | ✅ | Uses lightseek `llm-multimodal` for image-token counting + placeholder expansion. Supported models tracked below. | | **vLLM** | Python chat-processor (`--dyn-chat-processor vllm --router-mode kv`) | ✅ | Uses vLLM's own multimodal processor — supports any VLM that vLLM supports. | | **TRT-LLM** | — | ✅ | Uses dedicated MM Router Worker. Requires `--publish-events-and-metrics` on TRT-LLM workers. | | **SGLang** | — | ❌ | Not supported yet. | ## Supported Model Families (Rust frontend path) The Rust frontend's MM-aware routing path supports whatever VLM families the lightseek `llm-multimodal` crate registers — see [`ImageProcessorRegistry::with_defaults()`](https://docs.rs/llm-multimodal/1.5.0/llm_multimodal/vision/image_processor/struct.ImageProcessorRegistry.html#method.with_defaults) for the up-to-date list. A model that crate doesn't recognize falls back to text-prefix-only KV routing (request still completes; just no prefix-cache benefit across images). The Python chat-processor variant doesn't share this constraint — it delegates to vLLM's own multimodal processor and works with any VLM vLLM supports. 
## How It Works

### vLLM (default — Rust frontend)

```text
Frontend (Rust + lightseek llm-multimodal + KV router) → Backend Workers
    │
    ├─ Hash image (xxh3_64 of the raw URL, full-URL identity; use --frontend-decoding for content-addressed hashing)
    ├─ Resolve image-token id via lightseek's per-model ModelProcessorSpec
    ├─ Read (W, H) from a Range: 0-65535 header fetch (or in-memory data: bytes)
    ├─ lightseek::count_tokens(W, H) → expanded image-token count N
    ├─ Expand placeholder × N in routing_token_ids (worker token_ids unchanged)
    ├─ Build per-block MM metadata (block_mm_infos)
    ├─ KV router selects best worker
    └─ Forward mm_hash to worker via extra_args["mm_hashes"] → vLLM's multi_modal_uuids (cache key match)
```

1. The Rust frontend computes an `mm_hash` per image: `xxh3_64` of the decoded bytes for `data:` URIs (and for `http(s)://` URLs when `media_decoder` is enabled on the model), otherwise `xxh3_64` of the full URL string. In the URL-hashed case, two callers share an `mm_hash` only when they send byte-identical URLs.
2. The image-placeholder token id is resolved by delegating to lightseek's per-model `ModelProcessorSpec` (one spec per supported VLM family: Qwen3-VL, Qwen2.5-VL, Qwen2-VL, LLaVA-NeXT, LLaVA-1.5, Phi-3-vision, Llama-4, Kimi-K2.5). Each spec reads the appropriate `config.json` field for its model family (`image_token_id`, `image_token_index`, or `media_placeholder_token_id`) and falls back to probing the tokenizer's vocab when only the placeholder string is registered. Models the registry doesn't recognize fall back to text-prefix-only routing.

   > **Note:** Qwen3.5 / Qwen3.6 image token expansion is not yet supported in the Rust frontend for MM routing, so KV routing will only consider the text inputs + unexpanded image token placeholders. Support will come in a follow-up release.
3. Per-image `(W, H)` is read from a 64KB `Range`-bounded header fetch (or from in-memory bytes for `data:` URIs); the lightseek `llm-multimodal` crate computes the per-image expanded token count.
4. The single placeholder token is expanded to N copies in `routing_token_ids` (a router-only view); the worker still sees one placeholder per image in `token_ids`.
5. Per-block MM metadata (`block_mm_infos`) is built from the expanded view; the KV router evaluates overlap across workers including image-bearing blocks.
6. The frontend forwards each image's `mm_hash` (16-hex-char prefix, padded) via `extra_args["mm_hashes"]`; the backend handler injects them as vLLM's `multi_modal_uuids`, so vLLM's own KV-cache key matches the hash the router used.

### vLLM (alternative — Python chat-processor variant)

```text
Frontend (vLLM processor + KV router) → Backend Workers
    │
    ├─ Download image (via DynamoMediaConnector, LRU cached)
    ├─ Run vLLM's process_inputs() (HF processor, model-agnostic)
    ├─ Extract mm_hash from mm_features
    ├─ Build per-block MM metadata (block_mm_infos)
    ├─ KV router selects best worker
    └─ Transfer pre-processed mm_kwargs via SHM or NIXL → Backend skips HF processor
```

Use this variant (`--dyn-chat-processor=vllm`) when you want the frontend to run vLLM's HF image processor in-process and ship pre-processed `mm_kwargs` to the selected worker via shared memory or NIXL RDMA, so the backend skips the HF processor entirely. See the [Transfer Mode Details](#transfer-mode-details-vllm-only) section below for the `DYNAMO_MM_TRANSFER` flags.

### TRT-LLM

```text
Frontend (round-robin) → MM Router Worker → Backend Workers
                           │
                           ├─ Download image
                           ├─ Compute mm_hash
                           ├─ Build per-block MM metadata
                           └─ KvRouter selects best worker
```

For TRT-LLM, a dedicated MM Router Worker sits between the frontend and backend workers. See the [TRT-LLM MM Router README](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/mm_router_worker/README.md) for setup instructions.
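
The placeholder expansion and per-block metadata steps of the Rust frontend path can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the frontend's code: `url_mm_hash` and `expand_for_routing` are hypothetical names, `blake2b` stands in for `xxh3_64` only because it ships with Python, the per-block metadata is reduced to a list of image hashes, and the real shape of `block_mm_infos` may differ.

```python
import hashlib
from typing import List, Optional, Sequence, Tuple


def url_mm_hash(url: str) -> str:
    """URL-addressed image id as 16 hex chars.

    Stand-in only: the text names xxh3_64; blake2b with an 8-byte digest is
    used here because it is in the standard library.
    """
    return hashlib.blake2b(url.encode("utf-8"), digest_size=8).hexdigest()


def expand_for_routing(
    token_ids: Sequence[int],
    placeholder_id: int,
    image_token_counts: Sequence[int],
    mm_hashes: Sequence[str],
    block_size: int,
) -> Tuple[List[int], List[List[str]]]:
    """Build a router-only token view plus the image hashes touching each block."""
    routing_ids: List[int] = []
    owners: List[Optional[str]] = []  # mm_hash that produced each routing token
    images = iter(zip(image_token_counts, mm_hashes))
    for tok in token_ids:
        if tok == placeholder_id:
            count, mm_hash = next(images)
            routing_ids.extend([tok] * count)  # one placeholder becomes N copies
            owners.extend([mm_hash] * count)
        else:
            routing_ids.append(tok)
            owners.append(None)

    block_mm_infos: List[List[str]] = []
    for start in range(0, len(routing_ids), block_size):
        hashes: List[str] = []
        for owner in owners[start:start + block_size]:
            if owner is not None and owner not in hashes:
                hashes.append(owner)
        # Assumption: the trailing partial block is kept for simplicity.
        block_mm_infos.append(hashes)
    return routing_ids, block_mm_infos


worker_token_ids = [1, 2, 99, 3]  # one placeholder (id 99) for one image
image_hash = url_mm_hash("https://example.com/cat.png")
routing_ids, block_mm_infos = expand_for_routing(
    worker_token_ids,
    placeholder_id=99,
    image_token_counts=[6],  # N, as a token counter would report it
    mm_hashes=[image_hash],
    block_size=4,
)
```

The worker-facing `worker_token_ids` list is left untouched; only the routing view grows, which mirrors the split between `token_ids` and `routing_token_ids` described in step 4.
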
## Launching ### vLLM (default — Rust frontend) ```bash cd $DYNAMO_HOME bash examples/backends/vllm/launch/agg_multimodal_router.sh ``` The Rust frontend uses the [lightseek `llm-multimodal`][lightseek-crate] crate ([source][lightseek-src]) for per-image token-count and placeholder expansion. `llm-multimodal` provides a pure-Rust `calculate_num_tokens(W, H, PreProcessorConfig)` per VLM family (Qwen2/2.5/3-VL, LLaVA, Pixtral, …), golden-tested against `transformers`, so the router can match vLLM's expanded image-token count without invoking the HF image processor. The frontend then forwards each `mm_hash` to the worker as `multi_modal_uuids` so vLLM's KV events publish the same key the router computes. [lightseek-crate]: https://crates.io/crates/llm-multimodal [lightseek-src]: https://github.com/lightseekorg/smg Key environment variables: | Variable | Default | Description | |----------|---------|-------------| | `MODEL` | `Qwen/Qwen3-VL-2B-Instruct` | Model to serve | | `NUM_WORKERS` | `2` | Number of backend workers | | `BLOCK_SIZE` | `16` | KV cache block size (must match backend) | | `GPU_MEMORY_UTILIZATION` | `0.20` | Per-worker GPU memory fraction | | `SINGLE_GPU` | `false` | Pack all workers onto GPU 0 (testing-only override; pass `--single-gpu` or set `SINGLE_GPU=true` for functional tests on a single-GPU box) | | `KV_EVENTS_PORT_BASE` | `5557` | Worker `i` publishes ZMQ KV events on `BASE + i - 1` | | `DYN_LOG` | `info,lightseek_mm=debug,...` | Frontend log filter | | `VLLM_EXTRA_ARGS` | (unset) | Pass-through args to `python -m dynamo.vllm`. Set `--frontend-decoding` to enable content-addressed `mm_hash` (cross-URL KV-cache reuse). 
| To opt into frontend image decoding (so the frontend downloads + decodes once and `mm_hash` becomes content-addressed instead of URL-addressed): ```bash VLLM_EXTRA_ARGS="--frontend-decoding" \ bash examples/backends/vllm/launch/agg_multimodal_router.sh ``` The worker then registers a `media_decoder` on its model card; the frontend's `MediaLoader` runs in-process and hashes decoded RGB bytes via xxh3. Two distinct (signed) URLs of the same image bytes collide on the same routing key. ### vLLM (alternative — Python chat-processor variant) ```bash bash examples/backends/vllm/launch/agg_multimodal_router_chat_processor.sh ``` Uses `--dyn-chat-processor=vllm` so the frontend runs vLLM's HF processor in-process. Adds the `DYNAMO_MM_TRANSFER` shm/NIXL pre-rendered `mm_kwargs` delivery channel between frontend and worker. | Variable | Default | Description | |----------|---------|-------------| | `MODEL` | `Qwen/Qwen3-VL-2B-Instruct` | Model to serve | | `NUM_WORKERS` | `2` | Number of backend workers | | `BLOCK_SIZE` | `16` | KV cache block size (must match backend) | | `GPU_MEMORY_UTILIZATION` | `0.40` | Per-worker GPU memory fraction | | `SINGLE_GPU` | `false` | Pack all workers onto GPU 0 (testing-only override; pass `--single-gpu` or set `SINGLE_GPU=true` for functional tests on a single-GPU box) | | `DYNAMO_MM_TRANSFER` | `shm` | Transfer mode for pre-processed mm_kwargs: `shm` (shared memory, same-node), `nixl` (RDMA, cross-node) | | `DYNAMO_DISABLE_NIXL_MM` | unset | Set to `1` to disable mm_kwargs transfer entirely (backend re-processes images from URLs) | ### TRT-LLM ```bash cd $DYNAMO_HOME/examples/backends/trtllm/mm_router_worker ./launch.sh ``` See the [TRT-LLM MM Router README](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/mm_router_worker/README.md) for full setup instructions and configuration options. 
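
For the default vLLM launch script described earlier in this section, the `KV_EVENTS_PORT_BASE` rule (`BASE + i - 1`) determines which ZMQ ports need to be free. A small sketch of that arithmetic, assuming worker indices start at 1 (which is what makes worker 1 land on `BASE`); `kv_event_ports` is a hypothetical helper, not part of the launch script:

```python
from typing import Dict


def kv_event_ports(base: int = 5557, num_workers: int = 2) -> Dict[int, int]:
    """Map 1-based worker index i to its KV-events port, BASE + i - 1."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return {i: base + i - 1 for i in range(1, num_workers + 1)}


# Defaults from the table: KV_EVENTS_PORT_BASE=5557, NUM_WORKERS=2.
default_ports = kv_event_ports()
```

With the table's defaults this yields ports 5557 and 5558; raising `NUM_WORKERS` extends the range upward by one port per worker.
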
## Transfer Mode Details (vLLM chat-processor variant only) Applies to the `--dyn-chat-processor=vllm` launch (`agg_multimodal_router_chat_processor.sh`), **not** the default Rust frontend path. In the chat-processor variant the frontend runs the HF image processor in-process and ships the pre-processed `mm_kwargs` to the selected backend worker so the backend can skip re-processing; the `DYNAMO_MM_TRANSFER` environment variable controls how that payload is transferred. The default Rust frontend path doesn't run the HF processor or pre-render `mm_kwargs` — it forwards only `mm_hashes`, and each worker re-processes the image itself. TRT-LLM backends similarly re-run their own preprocessing and don't honor `DYNAMO_MM_TRANSFER`. - **`shm`** (default): POSIX shared memory via a `/dev/shm` segment. Intended for same-node deployments, where frontend and backend share the host filesystem. If the backend can't access the segment (e.g., running on a different node), it falls back to re-processing the image from the URL. - **`nixl`**: NIXL RDMA transfer. Required for cross-node deployments where `/dev/shm` is not shared between frontend and backend. Works across nodes over InfiniBand or TCP (whichever UCX selects). - **`DYNAMO_DISABLE_NIXL_MM=1`**: Disables pre-processed mm_kwargs transfer entirely. The backend downloads and processes images itself from the original URLs. Useful for debugging or when transfer overhead exceeds re-processing cost. # Diffusion ## Overview Dynamo supports serving diffusion models across multiple backends, enabling generation of images and video from text prompts. Backends expose diffusion capabilities through the same Dynamo pipeline infrastructure used for LLM inference, including frontend routing, scaling, and observability. 
## Support Matrix | Modality | vLLM-Omni | SGLang | TRT-LLM | |----------|-----------|--------|---------| | Text-to-Text | ❌ | ✅ | ❌ | | Text-to-Image | ✅ | ✅ | ✅ | | Text-to-Video | ✅ | ✅ | ✅ | | Image-to-Video | ✅ | ❌ | ❌ | **Status:** ✅ Supported | ❌ Not supported ## Backend Documentation For deployment guides, configuration, and examples for each backend: - **[vLLM-Omni](/dynamo/backends/v-llm/v-llm-omni)** - **[SGLang Diffusion](/dynamo/backends/sg-lang/diffusion)** - **[TRT-LLM Diffusion](/dynamo/backends/tensor-rt-llm/diffusion-experimental)** - **[FastVideo (custom worker)](/dynamo/user-guides/diffusion/fastvideo)** # FastVideo This guide covers deploying [FastVideo](https://github.com/hao-ai-lab/FastVideo) text-to-video generation on Dynamo using a custom worker (`worker.py`) exposed through the `/v1/videos` endpoint. Dynamo also supports diffusion through built-in backends: [SGLang Diffusion](/dynamo/backends/sg-lang/diffusion) (LLM diffusion, image, video), [vLLM-Omni](/dynamo/backends/v-llm/v-llm-omni) (text-to-image, text-to-video), and [TRT-LLM Diffusion](/dynamo/backends/tensor-rt-llm/diffusion-experimental) (text-to-image, text-to-video). See the [Diffusion Overview](/dynamo/user-guides/diffusion) for the full support matrix. ## Overview - **Default model:** `FastVideo/LTX2-Distilled-Diffusers` — a distilled variant of the LTX-2 Diffusion Transformer (Lightricks), reducing inference from 50+ steps to just 5. - **Two-stage pipeline:** Stage 1 generates video at target resolution; Stage 2 refines with a distilled LoRA for improved fidelity and texture. - **Optimized inference:** FP4 quantization and `torch.compile` are available via `--enable-optimizations`; attention backend selection is controlled separately via `--attention-backend`. - **Response format:** Returns one complete MP4 payload per request as `data[0].b64_json` (non-streaming). - **Concurrency:** One request at a time per worker (VideoGenerator is not re-entrant). 
Scale throughput by running multiple workers. `worker.py` defaults to `--attention-backend TORCH_SDPA` for broader compatibility across GPUs, including systems such as H100. For the B200/B300-oriented path, enable FP4/compile with `--enable-optimizations` and, if desired, opt into flash-attention explicitly with `--attention-backend FLASH_ATTN`. ## Docker Image Build The local Docker workflow builds a runtime image from the [`Dockerfile`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/diffusers/Dockerfile): - Base image: `nvidia/cuda:13.1.1-devel-ubuntu24.04` - Installs [FastVideo](https://github.com/hao-ai-lab/FastVideo) from GitHub - Installs Dynamo from the `release/1.0.0` branch (for `/v1/videos` support) - Compiles a [flash-attention](https://github.com/RandNMR73/flash-attention) fork from source The Dockerfile exposes `TORCH_CUDA_ARCH_LIST` as a build argument (default: `10.0 10.0a` for Blackwell). Pass `--build-arg` to target a different architecture: ```bash # Blackwell (default) docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="10.0 10.0a" # Hopper docker build examples/diffusers/ --build-arg TORCH_CUDA_ARCH_LIST="9.0 9.0a" ``` `MAX_JOBS` (default: `4`) controls parallel compilation jobs for flash-attention. Lower it if the build runs out of memory: ```bash docker build examples/diffusers/ --build-arg MAX_JOBS=2 ``` When using Docker Compose, set these as environment variables before running `docker compose up --build`: ```bash # Hopper on a memory-constrained builder TORCH_CUDA_ARCH_LIST="9.0 9.0a" MAX_JOBS=2 COMPOSE_PROFILES=4 docker compose up --build ``` The first Docker image build can take **20–40+ minutes** because FastVideo and CUDA-dependent components are compiled during the build. Subsequent builds are much faster if Docker layer cache is preserved. Compiling `flash-attention` can use significant RAM — low-memory builders may hit out-of-memory failures. 
If that happens, lower `MAX_JOBS` (with `--build-arg MAX_JOBS=...`, or the `MAX_JOBS` environment variable when using Docker Compose) to reduce parallel compile memory usage. The [flash-attn install notes](https://pypi.org/project/flash-attn/) specifically recommend this on machines with less than 96 GB RAM and many CPU cores.

## Warmup Time

On first start, workers download model weights. When `--enable-optimizations` is enabled, compile/warmup steps can push the first ready time to roughly **10–20 minutes** (hardware-dependent). After the first successful optimized response, the second request can still take around **35 seconds** while runtime caches finish warming up; steady-state performance is typically reached from the third request onward.

When using Kubernetes, mount a shared Hugging Face cache PVC (see [Kubernetes Deployment](#kubernetes-deployment)) so model weights are downloaded once and reused across pod restarts.

## Local Deployment

### Prerequisites

**For Docker Compose:**

- Docker Engine 26.0+
- Docker Compose v2
- NVIDIA Container Toolkit

**For host-local script:**

- Python environment with Dynamo + FastVideo dependencies installed
- CUDA-compatible GPU runtime available on host

### Option 1: Docker Compose

```bash
cd /examples/diffusers/local

# Start 4 workers on GPUs 0..3
COMPOSE_PROFILES=4 docker compose up --build
```

The Compose file builds from the Dockerfile and exposes the API on `http://localhost:8000`. See the [Docker Image Build](#docker-image-build) section for build time expectations.
### Option 2: Host-Local Script ```bash cd /examples/diffusers/local ./run_local.sh ``` Environment variables: | Variable | Default | Description | |---|---|---| | `PYTHON_BIN` | `python3` | Python interpreter | | `MODEL` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path | | `NUM_GPUS` | `1` | Number of GPUs | | `HTTP_PORT` | `8000` | Frontend HTTP port | | `WORKER_EXTRA_ARGS` | — | Extra flags for `worker.py` (for example, `--enable-optimizations --attention-backend FLASH_ATTN`) | | `FRONTEND_EXTRA_ARGS` | — | Extra flags for `dynamo.frontend` | Example: ```bash MODEL=FastVideo/LTX2-Distilled-Diffusers \ NUM_GPUS=1 \ HTTP_PORT=8000 \ WORKER_EXTRA_ARGS="--enable-optimizations --attention-backend FLASH_ATTN" \ ./run_local.sh ``` `--enable-optimizations` and `--attention-backend` are `worker.py` flags, not `dynamo.frontend` flags, so pass them through `WORKER_EXTRA_ARGS` when you want a non-default worker configuration. The script writes logs to: - `.runtime/logs/worker.log` - `.runtime/logs/frontend.log` ## Kubernetes Deployment ### Files | File | Description | |---|---| | `agg.yaml` | Base aggregated deployment (Frontend + `FastVideoWorker`) | | `agg_user_workload.yaml` | Same deployment with `user-workload` tolerations and `imagePullSecrets` | | `huggingface-cache-pvc.yaml` | Shared HF cache PVC for model weights | | `dynamo-platform-values-user-workload.yaml` | Optional Helm values for clusters with tainted `user-workload` nodes | ### Prerequisites 1. Dynamo Kubernetes Platform installed 2. GPU-enabled Kubernetes cluster 3. FastVideo runtime image pushed to your registry 4. 
Optional HF token secret (for gated models) Create a Hugging Face token secret if needed: ```bash export NAMESPACE= export HF_TOKEN= kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} \ -n ${NAMESPACE} ``` ### Deploy ```bash cd /examples/diffusers/deploy export NAMESPACE= kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE} kubectl apply -f agg.yaml -n ${NAMESPACE} ``` For clusters with tainted `user-workload` nodes and private registry pulls: 1. Set your pull secret name and image in `agg_user_workload.yaml`. 2. Apply: ```bash kubectl apply -f huggingface-cache-pvc.yaml -n ${NAMESPACE} kubectl apply -f agg_user_workload.yaml -n ${NAMESPACE} ``` ### Update Image Quickly ```bash export DEPLOYMENT_FILE=agg.yaml export FASTVIDEO_IMAGE= yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FASTVIDEO_IMAGE)' \ ${DEPLOYMENT_FILE} > ${DEPLOYMENT_FILE}.generated kubectl apply -f ${DEPLOYMENT_FILE}.generated -n ${NAMESPACE} ``` ### Verify and Access ```bash kubectl get dgd -n ${NAMESPACE} kubectl get pods -n ${NAMESPACE} kubectl logs -n ${NAMESPACE} -l nvidia.com/dynamo-component=FastVideoWorker ``` ```bash kubectl port-forward -n ${NAMESPACE} svc/fastvideo-agg-frontend 8000:8000 ``` ## Test Request If this is the first request after startup, expect it to take longer while warmup completes. See [Warmup Time](#warmup-time) for details. 
Send a request and decode the response: ```bash curl -s -X POST http://localhost:8000/v1/videos \ -H 'Content-Type: application/json' \ -d '{ "model": "FastVideo/LTX2-Distilled-Diffusers", "prompt": "A cinematic drone shot over a snowy mountain range at sunrise", "size": "1920x1088", "seconds": 5, "nvext": { "fps": 24, "num_frames": 121, "num_inference_steps": 5, "guidance_scale": 1.0, "seed": 10 } }' > response.json # Linux jq -r '.data[0].b64_json' response.json | base64 --decode > output.mp4 # macOS jq -r '.data[0].b64_json' response.json | base64 -D > output.mp4 ``` ## Worker Configuration Reference ### CLI Flags | Flag | Default | Description | |---|---|---| | `--model` | `FastVideo/LTX2-Distilled-Diffusers` | HuggingFace model path | | `--num-gpus` | `1` | Number of GPUs for distributed inference | | `--enable-optimizations` | off | Enables FP4 quantization and `torch.compile` | | `--attention-backend` | `TORCH_SDPA` | Sets `FASTVIDEO_ATTENTION_BACKEND`; choices: `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, `SAGE_SLA_ATTN` | ### Request Parameters (`nvext`) | Field | Default | Description | |---|---|---| | `fps` | `24` | Frames per second | | `num_frames` | `121` | Total frames; overrides `fps * seconds` when set | | `num_inference_steps` | `5` | Diffusion inference steps | | `guidance_scale` | `1.0` | Classifier-free guidance scale | | `seed` | `10` | RNG seed for reproducibility | | `negative_prompt` | — | Text to avoid in generation | ### Environment Variables | Variable | Default | Description | |---|---|---| | `FASTVIDEO_VIDEO_CODEC` | `libx264` | Video codec for MP4 encoding | | `FASTVIDEO_X264_PRESET` | `ultrafast` | x264 encoding speed preset | | `FASTVIDEO_ATTENTION_BACKEND` | `TORCH_SDPA` | Attention backend; `worker.py` sets this from `--attention-backend` and validates `FLASH_ATTN`, `TORCH_SDPA`, `SAGE_ATTN`, `SAGE_ATTN_THREE`, `VIDEO_SPARSE_ATTN`, `VMOBA_ATTN`, `SLA_ATTN`, and 
`SAGE_SLA_ATTN` | | `FASTVIDEO_STAGE_LOGGING` | `1` | Enable per-stage timing logs | | `FASTVIDEO_LOG_LEVEL` | — | Set to `DEBUG` for verbose logging | ## Troubleshooting | Symptom | Cause | Fix | |---|---|---| | OOM during Docker build | `flash-attention` compilation uses too much RAM | Pass `--build-arg MAX_JOBS=2` (or lower) at build time | | `no kernel image available for this GPU` or CUDA arch error at runtime | Image was built for a different GPU architecture | Rebuild with the correct `TORCH_CUDA_ARCH_LIST` (e.g. `9.0 9.0a` for Hopper) | | 10–20 min wait on first start with optimizations enabled | Model download + `torch.compile` warmup | Expected behavior; subsequent starts are faster if weights are cached | | ~35 s second request | Runtime caches still warming | Steady-state performance from third request onward | | Lower throughput than expected on B200/B300 | FP4/compile and flash-attention are configured separately | Pass `--enable-optimizations` and, if desired, `--attention-backend FLASH_ATTN` | | Startup or import failure after enabling optimizations or changing the attention backend | FP4 and some attention backends depend on specific hardware/software support | Re-run `worker.py` without `--enable-optimizations`, or use `--attention-backend TORCH_SDPA` | ## Source Code The example source lives at [`examples/diffusers/`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/diffusers) in the Dynamo repository. 
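
As a companion to the `jq`/`base64` commands in the Test Request section, the same decode step can be done in Python. This is a minimal sketch that relies only on the documented response shape (`data[0].b64_json`); `extract_video_bytes` is a hypothetical helper name, and the response below is a stand-in so the snippet runs without a server.

```python
import base64
import json


def extract_video_bytes(response_text: str) -> bytes:
    """Decode the MP4 payload found at data[0].b64_json."""
    payload = json.loads(response_text)
    return base64.b64decode(payload["data"][0]["b64_json"])


# Stand-in response body; a real one carries a full base64-encoded MP4.
fake_response = json.dumps(
    {"data": [{"b64_json": base64.b64encode(b"fake-mp4").decode("ascii")}]}
)
video_bytes = extract_video_bytes(fake_response)
```

Against a real `response.json`, the bytes can be written out with `pathlib.Path("output.mp4").write_bytes(extract_video_bytes(pathlib.Path("response.json").read_text()))`.
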
## See Also

- [vLLM-Omni Text-to-Video](/dynamo/backends/v-llm/v-llm-omni#text-to-video): vLLM-Omni video generation via `/v1/videos`
- [vLLM-Omni Text-to-Image](/dynamo/backends/v-llm/v-llm-omni#text-to-image): vLLM-Omni image generation
- [SGLang Video Generation](/dynamo/backends/sg-lang/diffusion#video-generation): SGLang video generation worker
- [SGLang Image Diffusion](/dynamo/backends/sg-lang/diffusion#image-diffusion): SGLang image diffusion worker
- [TRT-LLM Diffusion](/dynamo/backends/tensor-rt-llm/diffusion-experimental#quick-start): TensorRT-LLM diffusion quick start
- [Diffusion Overview](/dynamo/user-guides/diffusion): Full backend support matrix

# Fastokens Tokenizer

The Dynamo frontend tokenizes every incoming prompt before it sends the request to an inference backend. For short prompts, that cost is usually small. For agentic, RAG, and long-context workloads, tokenization can become a meaningful part of time-to-first-token (TTFT), especially when KV cache hit rates are high and the model path is already fast.

`fastokens` is an optional tokenizer backend for BPE `tokenizer.json` models. It uses the Rust encoder from the [`fastokens` GitHub repository](https://github.com/crusoecloud/fastokens) for text-to-token-ID conversion while Dynamo continues to use HuggingFace `tokenizers` for decoding and streaming output. Use it when tokenization is visible in your frontend latency profile and your model uses a supported BPE tokenizer.

## Why Use Fastokens?

`fastokens` is designed to make tokenization scale better on modern CPUs:

- Parallel pre-tokenization for long inputs.
- Parallel BPE encoding with per-thread and shared caches.
- Reused buffers and reduced allocation overhead.
- PCRE2 JIT regex support where the tokenizer pattern allows it.

On average, `fastokens` tokenizes faster than HuggingFace `tokenizers`, with larger gains as prompt sizes grow.
The [Crusoe and NVIDIA fastokens writeup](https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization) provides benchmark details across models, datasets, CPU architectures, and input lengths from 512 to 100K tokens. The actual gain depends on prompt length, tokenizer structure, CPU, concurrency, cache hit rate, and how much of your TTFT is spent before the model starts generating.

## How Dynamo Integrates It

Dynamo exposes `fastokens` as a frontend tokenizer backend. The integration is hybrid:

- **Encoding**: `fastokens` converts prompt text to token IDs.
- **Decoding**: HuggingFace `tokenizers` converts generated token IDs back to text.

Both backends load from the same `tokenizer.json`, so supported tokenizers should produce the same token IDs as the default HuggingFace path. If `fastokens` cannot load the tokenizer file, Dynamo logs a warning and falls back to the default backend instead of dropping requests.

```mermaid
flowchart TD
    Start["Frontend starts with<br/>--tokenizer fastokens"] --> Kind{"Tokenizer file type"}
    Kind -->|BPE tokenizer.json| Load{"fastokens loads?"}
    Kind -->|.model / .tiktoken| Other["Use existing TikToken backend"]
    Load -->|Yes| Fast["Encode with fastokens<br/>Decode with HuggingFace"]
    Load -->|No| Warn["Log warning"] --> Default["Use HuggingFace backend"]
    Fast --> Serve["Serve requests"]
    Default --> Serve
    Other --> Serve
```

## When to Enable It

Enable `fastokens` when:

- Prompts are long, commonly thousands to tens of thousands of tokens.
- Your workload is prefill-heavy, agentic, or RAG-heavy.
- TTFT remains high even when KV cache hit rates are strong.
- Frontend tokenizer latency shows up in metrics, traces, or profiling.
- Your model uses a BPE `tokenizer.json`.

Stay on the default backend if:

- Prompts are short and tokenization is not on the critical path.
- You are validating a new or unusual tokenizer and want maximum compatibility first.
- The frontend logs that `fastokens` failed to load and fell back to HuggingFace.
- Your model uses `.model` or `.tiktoken` tokenizer files, where this flag has no effect.

## Quick Start

Enable `fastokens` on the frontend with either the CLI flag or the environment variable. The CLI flag takes precedence.

```bash
# CLI flag
python -m dynamo.frontend --tokenizer fastokens

# Environment variable
export DYN_TOKENIZER=fastokens
python -m dynamo.frontend
```

To return to the default HuggingFace tokenizer backend, omit the flag or set `DYN_TOKENIZER=default`.

```bash
python -m dynamo.frontend --tokenizer default
```

No client changes are required. Request payloads, OpenAI-compatible API behavior, and streamed responses remain the same.

### Configuration Reference

| CLI argument | Environment variable | Valid values | Default |
|---|---|---|---|
| `--tokenizer` | `DYN_TOKENIZER` | `default`, `fastokens` | `default` |

## Compatibility

| Tokenizer format | Behavior with `--tokenizer fastokens` |
|---|---|
| BPE `tokenizer.json` | Dynamo tries to encode with `fastokens` and decode with HuggingFace. |
| BPE `tokenizer.json` with unsupported components | Dynamo logs a warning and falls back to HuggingFace. |
| TikToken `.model` or `.tiktoken` | Unchanged. Dynamo uses the existing TikToken backend. |

`fastokens` targets BPE tokenizer pipelines. It is focused on inference and does not support every HuggingFace `tokenizers` feature; additional encoding outputs and some normalizers or pre-tokenizers are not available.

The `fastokens` repository maintains the current [tested models list](https://github.com/crusoecloud/fastokens#tested-models). Tested model IDs include:

- `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`
- `openai/gpt-oss-120b`
- `deepseek-ai/DeepSeek-V3.2`, `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-R1`
- `Qwen/Qwen3-Next-80B-A3B-Thinking`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- `Qwen/Qwen3-235B-A22B-Instruct-2507`, `Qwen/Qwen3.5-397B-A17B`
- `MiniMaxAI/MiniMax-M2.1`, `MiniMaxAI/MiniMax-M2.5`
- `mistralai/Devstral-Small-2-24B-Instruct-2512`
- `zai-org/GLM-4.7`, `zai-org/GLM-5`

For any new model, validate on representative prompts before rolling out broadly. The safest check is to compare token IDs against the default backend and confirm the frontend logs show the fast path was selected.

## Verify the Backend

Check the frontend startup logs after enabling the flag. When `fastokens` is active, look for:

```text
Using fastokens tokenizer backend
```

If the tokenizer is unsupported, Dynamo keeps serving with the default backend and logs:

```text
Failed to load fastokens, falling back to HuggingFace
```

If you see the fallback warning, the deployment is still healthy, but you are not getting the `fastokens` speedup for that model.

## Measure Your Workload

Dynamo includes a frontend benchmark sweep that compares HuggingFace and `fastokens` across input sequence length, concurrency, and worker count.

```bash
cd benchmarks/frontend/scripts
python3 sweep_runner.py \
  --tokenizers hf,fastokens \
  --concurrency 32,64,128 \
  --isl 512,2048,8192
```

Use local mocker runs to isolate frontend and tokenizer overhead. Use vLLM or SGLang runs when you want end-to-end TTFT impact for a real backend.
See the [frontend benchmarking guide](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/benchmarks/frontend/README.md) and the [scaling-test recipe](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/benchmarks/frontend/scripts/scaling-test.md) for a full walkthrough. ## Troubleshooting **I enabled `fastokens`, but the logs do not show `Using fastokens tokenizer backend`.** Make sure the setting is applied to the frontend process, not only to the backend worker. For local launches, pass `--tokenizer fastokens` to `python -m dynamo.frontend` or set `DYN_TOKENIZER=fastokens` before starting the frontend. For benchmark DGD templates, use `DYN_TOKENIZER=fastokens`; the sweep runner maps `--tokenizers fastokens` to that value and restarts the frontend pod. **The frontend logs `Failed to load fastokens, falling back to HuggingFace`.** The model's tokenizer file uses a feature that `fastokens` does not support, or it is not a BPE `tokenizer.json` path. Dynamo has already fallen back to HuggingFace and should keep serving traffic. Check the tokenizer format, compare against the [tested models list](https://github.com/crusoecloud/fastokens#tested-models), and use `--tokenizer default` if you want to avoid the warning. **The frontend logs `Unrecognized DYN_TOKENIZER value`.** Use only `fastokens` or `default` for `DYN_TOKENIZER`. Values such as `fast`, `hf`, or `huggingface` are benchmark-runner aliases, not valid values for the frontend environment variable. **The model uses `.model` or `.tiktoken` files.** The `fastokens` flag has no effect for TikToken-format tokenizers. Dynamo uses the existing TikToken backend, so you should not expect the `Using fastokens tokenizer backend` log or a `fastokens` speedup. **TTFT does not improve.** First confirm the fast path is active in logs. If it is, tokenization may not be the bottleneck for this workload. 
Check prompt length, cache hit rate, backend prefill time, frontend CPU saturation, and the `dynamo_frontend_tokenizer_latency_ms` metric. Short prompts and decode-heavy traffic often show little end-to-end change. **The benchmark shows no difference between `hf` and `fastokens`.** Inspect each run artifact and frontend log to confirm the backend actually changed. In Kubernetes mode, the DGD frontend pod must be replaced after `DYN_TOKENIZER` changes. In local mocker mode, start with larger ISL values such as 8192 or higher so tokenization is large enough to measure. **Token IDs differ between backends.** Do not roll out that model with `fastokens`. Reproduce the mismatch with a minimal prompt and file an issue with the model name, tokenizer file, prompt, and whether the model appears on the tested models list. **Decoded output looks wrong.** Decoding still uses HuggingFace, so this is usually not caused by the `fastokens` flag. Verify that the tokenizer files match the model weights and that the default backend produces the expected output. ## See Also - [`fastokens`: A Solution to the Tokenization Bottleneck](https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization) - [`fastokens` on GitHub](https://github.com/crusoecloud/fastokens) - [Tokenizer component reference](/dynamo/components/frontend/tokenizer) - [Frontend configuration reference](/dynamo/components/frontend/configuration-reference) - [Frontend benchmarking](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/benchmarks/frontend/README.md) # SGLang ## Use the Latest Release We recommend using the [latest stable release](https://github.com/ai-dynamo/dynamo/releases/latest) of Dynamo to avoid breaking changes. --- Dynamo SGLang integrates [SGLang](https://github.com/sgl-project/sglang) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with SGLang's native engine arguments. 
It supports LLM inference, embedding models, multimodal vision models, and diffusion-based generation (LLM, image, video).

## Prerequisites

- **CUDA toolkit headers** for bare-metal builds (e.g. `nvcc`, `cuda_runtime.h`). See [CUDA Requirements](/dynamo/getting-started/local-installation#system-requirements). Not required when running the pre-built `sglang-runtime` container.
- **`HF_TOKEN`** for gated models. Export it on every node that pulls the model weights, and accept the model license on the Hugging Face model page before launch:

  ```bash
  export HF_TOKEN=hf_...
  ```

## Installation

### Install Latest Release

We recommend using [uv](https://github.com/astral-sh/uv) to install:

```bash
uv venv --python 3.12 --seed
uv pip install --prerelease=allow "ai-dynamo[sglang]"
```

This installs the latest stable release of Dynamo with the compatible SGLang version.

### Install for Development

Requires Rust and the CUDA toolkit (`nvcc`).

```bash
# install dynamo
uv venv --python 3.12 --seed
uv pip install maturin nixl
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME
uv pip install -e .

# install sglang
git clone https://github.com/sgl-project/sglang.git
# you can optionally checkout any sglang branch
cd sglang && uv pip install -e "python"
```

This setup also works well for coding agents: give the agent the paths to both repos and the virtual environment, and have it rerun these commands as it makes changes.

### Docker

Two paths are supported. Pick the one that matches how you plan to develop.

#### Pre-built Dynamo SGLang container (recommended)

Pull and launch the published `sglang-runtime` image from NGC. See [release artifacts](/dynamo/resources/release-artifacts) for the current tag and CUDA variants.
```bash docker run --gpus all -it --rm \ --network host --shm-size=10G \ --ulimit memlock=-1 --ulimit stack=67108864 \ --ulimit nofile=65536:65536 \ --cap-add CAP_SYS_PTRACE --ipc host \ -v $HOME/.cache/huggingface:/home/dynamo/.cache/huggingface \ nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0 ``` Mount the host Hugging Face cache (`-v $HOME/.cache/huggingface:/home/dynamo/.cache/huggingface`) so each container restart doesn't re-download model weights. The container runs as user `dynamo` (UID 1000), which is why the in-container path is `/home/dynamo/.cache/huggingface`. #### Build from source inside upstream SGLang container Pull and launch the upstream SGLang image, then build Dynamo from source inside it: ```bash docker run --gpus all -it --rm \ --network host --shm-size=10G \ --ulimit memlock=-1 --ulimit stack=67108864 \ --ulimit nofile=65536:65536 \ --ipc host \ lmsysorg/sglang:v{sglang_version} ``` Install build dependencies and Rust inside the container: ```bash apt-get update -qq && apt-get install -y -qq \ build-essential libclang-dev curl git > /dev/null 2>&1 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y source "$HOME/.cargo/env" pip install maturin[patchelf] ``` Clone and build Dynamo: ```bash cd /sgl-workspace/ git clone https://github.com/ai-dynamo/dynamo.git cd dynamo cd lib/bindings/python/ maturin build -o /tmp pip install /tmp/ai_dynamo_runtime*.whl cd /sgl-workspace/dynamo/ pip install -e . 
``` ## Feature Support Matrix | Feature | Status | Notes | |---------|--------|-------| | [**Disaggregated Serving**](/dynamo/design-docs/disaggregated-serving) | ✅ | Prefill/decode separation with NIXL KV transfer | | [**KV-Aware Routing**](/dynamo/components/router) | ✅ | | | [**SLA-Based Planner**](/dynamo/components/planner/planner-guide) | ✅ | | | [**Multimodal Support**](../../features/multimodal/multimodal-sglang.md) | ✅ | Image via EPD, E/PD, E/P/D patterns | | [**Diffusion Models**](/dynamo/backends/sg-lang/diffusion) | ✅ | LLM diffusion, image, and video generation | | [**Request Cancellation**](/dynamo/user-guides/fault-tolerance/request-cancellation) | ✅ | Aggregated full; disaggregated decode-only | | [**Graceful Shutdown**](/dynamo/user-guides/fault-tolerance/graceful-shutdown) | ✅ | Discovery unregister + grace period | | [**Observability**](/dynamo/backends/sg-lang/observability) | ✅ | Metrics, tracing, and Grafana dashboards | | [**KVBM**](/dynamo/components/kvbm) | ❌ | Planned | ## Quick Start ### Python / CLI Deployment Start infrastructure services for local development: ```bash docker compose -f dev/docker-compose.yml up -d ``` Launch an aggregated serving deployment: ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/agg.sh ``` Verify the deployment: ```bash curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}], "stream": true, "max_tokens": 30 }' ``` ### Disaggregated Serving Launch a disaggregated Qwen3-0.6B deployment (smallest model, useful for plumbing validation): ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/disagg.sh ``` > **Performance caveat:** Qwen3-0.6B is small enough that the disaggregated pathway is dominated by transport overhead and will often look slower than aggregated. Use it for plumbing validation, not benchmarks. 
Switch to Qwen3-32B-FP8 or larger for realistic disagg numbers. ### Multi-Node TP SGLang supports multi-node tensor parallelism via the native `--dist-init-addr`, `--nnodes`, and `--node-rank` flags. See [SGLang server arguments](https://docs.sglang.ai/advanced_features/server_arguments.html) for the canonical reference; the same flags work with `python -m dynamo.sglang`. For a Kubernetes deployment example, see [`disagg-multinode.yaml`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/deploy/disagg-multinode.yaml). ### Kubernetes Deployment You can deploy SGLang with Dynamo on Kubernetes using a `DynamoGraphDeployment`. For more details, see the [SGLang Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/deploy). ## Next Steps - **[Reference Guide](/dynamo/backends/sg-lang/reference-guide)**: Worker types, architecture, and configuration - **[Examples](/dynamo/backends/sg-lang/examples)**: All deployment patterns with launch scripts - **[Disaggregation](/dynamo/backends/sg-lang/disaggregation)**: P/D architecture and KV transfer details - **[Diffusion](/dynamo/backends/sg-lang/diffusion)**: LLM, image, and video diffusion models - **[Observability](/dynamo/backends/sg-lang/observability)**: Metrics, tracing, and Grafana dashboards - **[Deploying SGLang with Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/deploy)**: Kubernetes deployment guide # Reference Guide ## Overview The SGLang backend in Dynamo uses a modular architecture where `main.py` dispatches to specialized initialization modules based on the worker type. Each worker type has its own init module, request handler, health check, and registration logic. Dynamo SGLang uses SGLang's native argument parser -- all SGLang engine arguments (e.g., `--model-path`, `--tp`, `--trust-remote-code`) are passed through directly. 
Dynamo adds its own arguments for worker mode selection, tokenizer control, and disaggregation configuration. ### Worker Types | Worker Type | Description | |------------|-------------| | **Decode** *(default)* | Standard LLM inference (aggregated or disaggregated decode) | | **Prefill** | Disaggregated prefill phase (`--disaggregation-mode prefill`) | | **Embedding** | Text embedding models (`--embedding-worker`) | | **Multimodal Encode** | Frontend-facing: vision encoding, embeddings generation (`--multimodal-encode-worker`) | | **Multimodal Worker** | LLM inference with multimodal data (`--multimodal-worker`) | | **Multimodal Prefill** | Prefill phase for multimodal disaggregation (`--multimodal-worker --disaggregation-mode prefill`) | | **Image Diffusion** | Image generation via DiffGenerator (`--image-diffusion-worker`) | | **Video Generation** | Text/image-to-video via DiffGenerator (`--video-generation-worker`) | | **LLM Diffusion** | Diffusion language models like LLaDA (`--dllm-algorithm `) | ## Argument Reference ### Dynamo-Specific Arguments These arguments are added by Dynamo on top of SGLang's native arguments. | Argument | Env Var | Default | Description | |----------|---------|---------|-------------| | `--endpoint` | `DYN_ENDPOINT` | Auto-generated | Dynamo endpoint in `dyn://namespace.component.endpoint` format | | `--use-sglang-tokenizer` | `DYN_SGL_USE_TOKENIZER` | `false` | **[Deprecated]** Use `--dyn-chat-processor sglang` on the frontend instead. See [SGLang Chat Processor](/dynamo/backends/sg-lang/frontend-processor-fallback). 
| | `--dyn-tool-call-parser` | `DYN_TOOL_CALL_PARSER` | `None` | [Tool call](../../tool-calling/dynamo.md#supported-tool-call-parsers) parser (overrides SGLang's `--tool-call-parser`) | | `--dyn-reasoning-parser` | `DYN_REASONING_PARSER` | `None` | [Reasoning](../../reasoning/dynamo.md#supported-reasoning-parsers) parser for chain-of-thought models | | `--custom-jinja-template` | `DYN_CUSTOM_JINJA_TEMPLATE` | `None` | Custom chat template path (incompatible with `--use-sglang-tokenizer`) | | `--embedding-worker` | `DYN_SGL_EMBEDDING_WORKER` | `false` | Run as embedding worker (also sets SGLang's `--is-embedding`) | | `--multimodal-encode-worker` | `DYN_SGL_MULTIMODAL_ENCODE_WORKER` | `false` | Run as [multimodal](../../features/multimodal/multimodal-sglang.md) encode worker (frontend-facing) | | `--multimodal-worker` | `DYN_SGL_MULTIMODAL_WORKER` | `false` | Run as multimodal LLM worker | | `--image-diffusion-worker` | `DYN_SGL_IMAGE_DIFFUSION_WORKER` | `false` | Run as [image diffusion](/dynamo/backends/sg-lang/diffusion#image-diffusion) worker | | `--video-generation-worker` | `DYN_SGL_VIDEO_GENERATION_WORKER` | `false` | Run as [video generation](/dynamo/backends/sg-lang/diffusion#video-generation) worker | | `--disagg-config` | `DYN_SGL_DISAGG_CONFIG` | `None` | Path to YAML disaggregation config file | | `--disagg-config-key` | `DYN_SGL_DISAGG_CONFIG_KEY` | `None` | Key to select from disaggregation config (e.g., `prefill`, `decode`) | `--disagg-config` and `--disagg-config-key` must be provided together. The selected section is written to a temp YAML file and passed to SGLang's `--config` flag. The current supported parser names for both flags are documented in [Tool Call Parsing (Dynamo)](../../tool-calling/dynamo.md#supported-tool-call-parsers) and [Reasoning Parsing (Dynamo)](../../reasoning/dynamo.md#supported-reasoning-parsers). 
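The pairing rule for `--disagg-config` and `--disagg-config-key` can be pictured with a short sketch. This is not Dynamo's implementation: `write_selected_section` is a hypothetical helper, the config is assumed to be already parsed into a dict, and the section is serialized as JSON (which YAML 1.2 parsers generally accept as flow-style YAML) only to keep the example free of third-party dependencies.

```python
import json
import tempfile


def write_selected_section(config, key):
    """Validate the paired options and write the chosen section to a temp file.

    `config` stands in for the parsed --disagg-config file (a dict) and `key`
    for --disagg-config-key. Both names are illustrative assumptions.
    """
    # The two options only make sense together: reject one without the other.
    if (config is None) != (key is None):
        raise ValueError(
            "--disagg-config and --disagg-config-key must be provided together"
        )
    if config is None:
        return None  # neither option given, nothing to do
    if key not in config:
        raise KeyError(f"section {key!r} not found in disaggregation config")

    # Write only the selected section; the returned path is what would be
    # handed on to the engine's --config flag.
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        json.dump(config[key], f)
        return f.name
```

The section contents used with this helper (for example `{"prefill": {...}, "decode": {...}}`) are placeholders; the real schema is whatever SGLang's `--config` flag accepts.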
## Tokenizer Behavior

By default, Dynamo handles tokenization and detokenization through its Rust-based frontend, passing `input_ids` to SGLang. This enables all frontend endpoints (`v1/chat/completions`, `v1/completions`, `v1/embeddings`).

For SGLang-native preprocessing (tool calling, reasoning parsing, chat templates), use `--dyn-chat-processor sglang` on the frontend. See [SGLang Chat Processor](/dynamo/backends/sg-lang/frontend-processor-fallback) for architecture and usage.

`--use-sglang-tokenizer` is deprecated. Use `--dyn-chat-processor sglang` on the frontend instead, which provides the same SGLang-native processing with KV router support and the completions endpoint.

## Request Cancellation

When a client disconnects, Dynamo automatically cancels the in-flight request across all workers, freeing compute resources. A background cancellation monitor detects disconnection and aborts the SGLang request.

| Mode | Prefill | Decode |
|------|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ⚠️ | ✅ |

Cancellation during remote prefill in disaggregated mode is not currently supported.

For details on the cancellation architecture, see [Request Cancellation](/dynamo/user-guides/fault-tolerance/request-cancellation).

## Graceful Shutdown

SGLang workers use Dynamo's graceful shutdown mechanism. When a `SIGTERM` or `SIGINT` is received:

1. **Discovery unregister**: The worker is removed from service discovery so no new requests are routed to it
2. **Grace period**: In-flight requests are allowed to complete
3. **Deferred handlers**: SGLang's internal signal handlers (captured during startup via monkey-patching `loop.add_signal_handler`) are invoked after the grace period

This ensures zero dropped requests during rolling updates or scale-down events.

For more details, see [Graceful Shutdown](/dynamo/user-guides/fault-tolerance/graceful-shutdown).
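The three-step shutdown order can be sketched with plain `asyncio`. This is a minimal illustration, not Dynamo's code: the `GracefulShutdown` class, its `track` and `defer` methods, and the `unregister` callback are all hypothetical names, and the grace period is modeled as a simple timeout on the in-flight tasks.

```python
import asyncio


class GracefulShutdown:
    """Toy model of: unregister -> grace period -> deferred handlers."""

    def __init__(self, unregister, grace_period_s=5.0):
        self._unregister = unregister          # async callable (hypothetical)
        self._grace_period_s = grace_period_s
        self._inflight = set()                 # tasks for in-flight requests
        self._deferred = []                    # handlers captured at startup

    def track(self, task):
        # Remember the request until it finishes on its own.
        self._inflight.add(task)
        task.add_done_callback(self._inflight.discard)

    def defer(self, handler):
        # Stand-in for capturing the engine's own signal handlers.
        self._deferred.append(handler)

    async def shutdown(self):
        # 1. Leave discovery first so no new requests are routed here.
        await self._unregister()
        # 2. Give in-flight requests up to the grace period to complete.
        if self._inflight:
            await asyncio.wait(self._inflight, timeout=self._grace_period_s)
        # 3. Only then run the handlers that were deferred at startup.
        for handler in self._deferred:
            handler()


async def demo():
    order = []

    async def unregister():
        order.append("unregister")

    async def request():
        await asyncio.sleep(0.01)
        order.append("request done")

    shutdown = GracefulShutdown(unregister, grace_period_s=1.0)
    shutdown.track(asyncio.create_task(request()))
    shutdown.defer(lambda: order.append("deferred handler"))
    await shutdown.shutdown()
    return order
```

In a real process, `shutdown()` would be scheduled from a handler installed with `loop.add_signal_handler(signal.SIGTERM, ...)`; the demo calls it directly to keep the ordering easy to observe.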
## Health Checks Each worker type has a specialized health check payload that validates the full inference pipeline: | Worker Type | Health Check Strategy | |------------|----------------------| | Decode / Aggregated | Short generation request (`max_new_tokens=1`) | | Prefill | Wrapped prefill-specific request structure | | Image Diffusion | Minimal image generation request | | Video Generation | Minimal video generation request | | Embedding | Standard embedding request | Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. See [Health Checks](/dynamo/user-guides/observability-local/health-checks) for the broader health check architecture. ## Metrics and KV Events ### Prometheus Metrics Enable metrics with `--enable-metrics` on the worker. Set `DYN_SYSTEM_PORT` to expose the `/metrics` endpoint: ```bash DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --enable-metrics ``` Both SGLang engine metrics (`sglang:*` prefix) and Dynamo runtime metrics (`dynamo_*` prefix) are served from the same endpoint. For metric details, see [SGLang Observability](/dynamo/backends/sg-lang/observability). For visualization setup, see [Prometheus + Grafana](/dynamo/user-guides/observability-local/prometheus-grafana-setup). ### KV Events When configured with `--kv-events-config`, workers publish KV cache events (block creation/deletion) for the [KV-aware router](/dynamo/components/router). Events are published via ZMQ from SGLang's scheduler and relayed through Dynamo's event plane. For DP attention mode (`--enable-dp-attention`), the publisher handles multiple DP ranks per node, each with its own KV event stream. 
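When several workers run on one host, each needs its own ZMQ endpoint for KV events. The sketch below generates the `--kv-events-config` value instead of hand-writing it. It assumes the JSON shape used in the SGLang Chat Processor quick start in this documentation (`publisher`, `topic`, `endpoint`); the helper names, the base port, and the one-port-per-worker numbering are illustrative assumptions.

```python
import json


def kv_events_config(port, topic="kv-events"):
    """Build the JSON string passed to --kv-events-config for one worker."""
    config = {
        "publisher": "zmq",
        "topic": topic,
        "endpoint": f"tcp://*:{port}",
    }
    # Compact separators keep the value free of spaces on the command line.
    return json.dumps(config, separators=(",", ":"))


def worker_argv(worker_index, model, base_port=5557):
    """Return an argv list for one worker, with a port unique to its index."""
    return [
        "python", "-m", "dynamo.sglang",
        "--model-path", model,
        "--kv-events-config", kv_events_config(base_port + worker_index),
    ]
```

Passing the result as an argv list (for example to `subprocess.Popen`) avoids shell-quoting the embedded JSON.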
## Engine Routes

SGLang workers expose operational endpoints via Dynamo's system server:

| Route | Description |
|-------|-------------|
| `/engine/start_profile` | Start PyTorch profiling |
| `/engine/stop_profile` | Stop profiling and save traces |
| `/engine/release_memory_occupation` | Release GPU memory for maintenance |
| `/engine/resume_memory_occupation` | Resume GPU memory after release |
| `/engine/update_weights_from_distributor` | Update model weights from distributor |
| `/engine/update_weights_from_disk` | Update model weights from disk |
| `/engine/update_weight_version` | Update weight version metadata |

## See Also

- **[Examples](/dynamo/backends/sg-lang/examples)**: All deployment patterns
- **[Disaggregation](/dynamo/backends/sg-lang/disaggregation)**: P/D architecture and KV transfer
- **[Diffusion](/dynamo/backends/sg-lang/diffusion)**: LLM, image, and video diffusion models
- **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: KV-aware routing configuration

# Examples

For quick start instructions, see the [SGLang README](/dynamo/backends/sg-lang). This document covers the deployment patterns for running SGLang with Dynamo: LLM, multimodal, and diffusion serving, plus Kubernetes deployment.

## Infrastructure Setup

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

```bash
docker compose -f dev/docker-compose.yml up -d
```

- **etcd** is optional but is the default local discovery backend. You can also pass `--discovery-backend file` to use file-system-based discovery.
- **NATS** is only needed when using NATS-backed KV routing events (`--kv-events-config`). Use ZMQ-backed events or `--no-router-kv-events` for routing without NATS.
- **On Kubernetes**, neither is required when using the Dynamo operator (`DYN_DISCOVERY_BACKEND=kubernetes`).

Each launch script runs the frontend and worker(s) in a single terminal.
You can run each command separately in different terminals for testing. For AI agents working with Dynamo, you can run the launch script in the background and use the `curl` commands to test the deployment. ## LLM Serving ### Aggregated Serving The simplest deployment pattern: a single worker handles both prefill and decode. ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/agg.sh ``` ### Aggregated Serving with KV Routing Two workers behind a [KV-aware router](/dynamo/components/router) that maximizes cache reuse: ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/agg_router.sh ``` This launches the frontend with `--router-mode kv` and two workers with ZMQ-based KV event publishing. ### Disaggregated Serving Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs. ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/disagg.sh ``` For details on how SGLang disaggregation works with Dynamo, including the bootstrap mechanism and RDMA transfer flow, see [SGLang Disaggregation](/dynamo/backends/sg-lang/disaggregation). ### Disaggregated Serving with KV-Aware Prefill Routing Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs. ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/disagg_router.sh ``` The frontend uses `--router-mode kv` and automatically detects prefill workers to activate an internal prefill router. Each worker publishes KV events over ZMQ on unique ports. 
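The cache-reuse idea behind `--router-mode kv` can be illustrated with a toy prefix-overlap score. This is a simplified sketch and not Dynamo's router: the block size, the chained hashing, and the rule "pick the worker with the longest cached prefix" are assumptions made for illustration, and a production router would typically also weigh worker load.

```python
def block_hashes(token_ids, block_size=4):
    """Chain-hash full blocks so each hash identifies an entire prefix."""
    hashes, prev = [], 0
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        prev = hash((prev, block))  # depends on every earlier block too
        hashes.append(prev)
    return hashes


def prefix_overlap(request_hashes, cached_hashes):
    """Count leading request blocks that the worker already holds."""
    matched = 0
    for h in request_hashes:
        if h not in cached_hashes:
            break
        matched += 1
    return matched


def pick_worker(token_ids, worker_caches, block_size=4):
    """Return the worker whose cache covers the longest request prefix."""
    request = block_hashes(token_ids, block_size)
    return max(
        worker_caches,
        key=lambda worker: prefix_overlap(request, worker_caches[worker]),
    )
```

In this picture, the KV events that workers publish are what keep each worker's set of cached block hashes up to date on the router side.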
## Multimodal Serving ### Aggregated Multimodal Serve multimodal models using SGLang's built-in multimodal support: ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/agg_vision.sh ``` ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-VL-8B-Instruct", "messages": [ { "role": "user", "content": [ {"type": "text", "text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}, {"type": "image_url", "image_url": {"url": "https://media.newyorker.com/photos/63249cff39ac97c4c23ff5d0/master/w_2560%2Cc_limit/Marzorati%2520-%2520Federer%2520Retirement%25202.jpg"}} ] } ], "max_tokens": 50, "stream": false }' | jq ``` ### Multimodal with Disaggregated Components For advanced multimodal deployments with separate encoder, prefill, and decode workers (E/PD and E/P/D patterns), see the dedicated [SGLang Multimodal](../../features/multimodal/multimodal-sglang.md) documentation. | Pattern | Script | Description | | ------- | ------------------------------- | --------------------------------------------- | | E/PD | `./launch/multimodal_epd.sh` | Separate vision encoder + combined PD worker | | E/P/D | `./launch/multimodal_disagg.sh` | Separate encoder, prefill, and decode workers | ## Diffusion Models ### Diffusion LM Run diffusion language models like [LLaDA2.0](https://github.com/inclusionAI/LLaDA2.0): ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/diffusion_llada.sh ``` ### Image Diffusion Generate images from text prompts using [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev) or other diffusion models: ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/image_diffusion.sh ``` Options: `--model-path`, `--fs-url` (local or S3), `--http-url`. 
### Video Generation

Generate videos from text prompts using [Wan2.1](https://huggingface.co/Wan-AI) models:

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/text-to-video-diffusion.sh
```

Options: `--wan-size 1b|14b`, `--num-frames`, `--height`, `--width`, `--num-inference-steps`.

For full details on all diffusion worker types (LLM, image, video), see [Diffusion](/dynamo/backends/sg-lang/diffusion).

## Kubernetes Deployment

For complete K8s deployment examples, see:

- [SGLang K8s deployment guide](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/deploy)
- [SGLang aggregated router K8s example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/sglang/deploy/agg_router.yaml)
- [Kubernetes Deployment Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart)

## Troubleshooting

### cuDNN Version Check Fails

```
RuntimeError: cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0
```

Set `SGLANG_DISABLE_CUDNN_CHECK=1` before launching. This is common when PyTorch ships a cuDNN version older than what SGLang's Conv3d models require. Affects vision and diffusion models.

### Model Registration Fails with `config.json` Error

```
unable to extract config.json from directory ...
```

This happens with diffusers models (FLUX.1-dev, Wan2.1, etc.) that use `model_index.json` instead of `config.json`. Ensure you are using the correct worker flag (`--image-diffusion-worker` or `--video-generation-worker`) rather than the standard LLM worker mode. These flags use a registration path that does not require `config.json`.

### GPU OOM on Startup

If a previous run left orphaned GPU processes, the next launch may OOM. Check for lingering processes and kill them by PID:

```bash
nvidia-smi # look for lingering sgl_diffusion::scheduler or python processes
kill -9 <PID>
```

### Disaggregated Workers Cannot Connect

Ensure both prefill and decode workers can reach each other over TCP. The bootstrap mechanism uses `--disaggregation-bootstrap-port` (default: 12345).
For multi-node setups, ensure the port is reachable across hosts and set `--host 0.0.0.0`.

## See Also

- **[SGLang README](/dynamo/backends/sg-lang)**: Quick start and feature overview
- **[Reference Guide](/dynamo/backends/sg-lang/reference-guide)**: Architecture, configuration, and operational details
- **[SGLang Multimodal](../../features/multimodal/multimodal-sglang.md)**: Vision model deployment patterns
- **[SGLang HiCache](/dynamo/integrations/kv-cache-integrations/hi-cache)**: Hierarchical cache integration
- **[Benchmarking](/dynamo/user-guides/benchmarking)**: Performance benchmarking tools
- **[Tuning Disaggregated Performance](/dynamo/additional-resources/tuning-disaggregated-performance)**: P/D tuning guide

# Disaggregation

This document explains how SGLang's disaggregated prefill-decode architecture works, both standalone and within Dynamo.

## Overview

Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:

- Independent scaling of prefill and decode resources
- Better resource utilization (prefill is compute-bound, decode is memory-bound)
- Efficient KV cache transfer between workers using RDMA

## How Dynamo Integrates with SGLang Disaggregation

**SGLang's standalone approach:**

1. The load balancer receives a request from the client
2. A random `(prefill, decode)` pair is selected from the pool of available workers
3. Request is sent to both `prefill` and `decode` workers via asyncio tasks
4. Internally, disaggregation proceeds from prefill → decode

**Dynamo's approach:**

Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:

1. Route to a decode worker first
2. Choose a prefill worker via round-robin or KV-aware selection
3. Send the request to both workers
4. SGLang's bootstrap server (part of the `tokenizer_manager`) is used in conjunction with NIXL/Mooncake to handle the KV transfer

## Disaggregation Flow

The following diagram shows the complete request flow for disaggregated serving:

```mermaid
sequenceDiagram
    participant Client
    participant Decode
    participant Prefill

    Note over Decode,Prefill: 0. Setup Phase (One-Time)
    Decode->>Prefill: Register RDMA connection info (base GPU memory pointers)

    Note over Client,Prefill: Per-Request Phase
    Client->>Decode: 1. Send request
    Decode->>Prefill: 2. Forward request + get bootstrap_room
    Prefill-->>Decode: Return bootstrap_room ID

    Note over Decode: 3. Allocate GPU memory for KV cache
    Decode->>Prefill: Send allocation info (page indices, metadata buffer)

    Note over Prefill: 4. Prefill forward pass

    par Decode polls
        loop Poll transfer
            Note over Decode: 5. Poll for KV arrival
        end
    and Prefill transfers
        Note over Prefill: 6. RDMA write KV to decode
        Prefill->>Decode: Transfer KV cache + metadata
    end

    Note over Prefill: 7. Poll RDMA handles
    Note over Prefill: Transfer complete, deallocate metadata

    Note over Decode: 8. KV received, start decode

    loop Generate tokens
        Note over Decode: Decode forward pass
        Decode-->>Client: Stream output token
    end
```

### Key Steps Explained

**Setup Phase (One-Time)**

- Decode workers register their RDMA connection information with prefill workers
- This includes base GPU memory pointers for direct memory access

**Per-Request Flow**

1. **Request initiation**: Client sends request to decode worker
2. **Bootstrap room allocation**: Decode forwards to prefill and receives a bootstrap_room ID for coordination
3. **Memory allocation**: Decode allocates GPU memory pages for incoming KV cache
4. **Prefill execution**: Prefill worker processes the prompt and generates KV cache
5. **KV transfer**: Prefill uses RDMA to write KV cache directly to decode's GPU memory (while decode polls for completion)
6. **Cleanup**: Prefill deallocates transfer metadata after confirming completion
7. **Decode phase**: Decode worker generates tokens using the transferred KV cache
8. **Streaming**: Tokens are streamed back to the client as they're generated

### Performance Characteristics

- **RDMA transfer**: Zero-copy GPU-to-GPU transfer with minimal CPU involvement
- **Parallel operations**: Decode can poll while prefill transfers data
- **One-time setup**: RDMA connections established once, reused for all requests

# Diffusion

Dynamo SGLang supports three types of diffusion-based generation: **LLM diffusion** (text generation via iterative refinement), **image diffusion** (text-to-image), and **video generation** (text-to-video). Each uses a different worker flag and handler, but all integrate with SGLang's `DiffGenerator`.

## Overview

| Type | Worker Flag | API Endpoint |
| ---------------- | --------------------------- | ----------------------------------------- |
| LLM Diffusion | `--dllm-algorithm ` | `/v1/chat/completions`, `/v1/completions` |
| Image Diffusion | `--image-diffusion-worker` | `/v1/images/generations` |
| Video Generation | `--video-generation-worker` | `/v1/videos` |

If you see a cuDNN version mismatch error on startup (`cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0`), set `SGLANG_DISABLE_CUDNN_CHECK=1` before launching. This is common when PyTorch ships a cuDNN version older than what SGLang requires for Conv3d operations.

## LLM Diffusion

Diffusion Language Models generate text through iterative refinement rather than autoregressive token-by-token generation. The model starts with masked tokens and progressively replaces them with predictions, refining low-confidence tokens each step.

LLM diffusion is auto-detected: when `--dllm-algorithm` is set, the worker automatically uses `DiffusionWorkerHandler` without needing a separate flag.
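The iterative-refinement idea can be shown with a toy decoder. This is a conceptual sketch only, not SGLang's algorithm: `toy_predict` is a stand-in for a real model with fixed, made-up confidences, and the schedule (commit an even share of the most confident positions each step) is an assumption chosen for simplicity.

```python
MASK = None  # placeholder for a masked position


def diffusion_decode(length, predict, steps):
    """Start fully masked; each step commits the most confident predictions."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        if not masked:
            break
        # predict() returns {position: (token, confidence)} for masked slots.
        proposals = predict(tokens)
        # Commit an even share of the remaining masks (ceiling division);
        # low-confidence positions stay masked and are re-predicted later.
        share = -(-len(masked) // (steps - step))
        by_confidence = sorted(
            masked, key=lambda i: proposals[i][1], reverse=True
        )
        for i in by_confidence[:share]:
            tokens[i] = proposals[i][0]
    return tokens


def toy_predict(tokens):
    """Hypothetical model: fixed tokens with fixed confidences."""
    target = ["the", "cat", "sat", "down"]
    confidence = [0.9, 0.4, 0.7, 0.2]
    return {
        i: (target[i], confidence[i])
        for i, tok in enumerate(tokens)
        if tok is MASK
    }
```

With two steps, the first pass fills the two most confident positions and the second pass fills the rest, which is the opposite of left-to-right autoregressive generation.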
For more details on diffusion algorithms, see the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md). ### Launch ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/diffusion_llada.sh ``` See the [launch script](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/sglang/launch/diffusion_llada.sh) for configuration options. ### Test ```bash curl -X POST http://localhost:8001/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "inclusionAI/LLaDA2.0-mini-preview", "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}], "temperature": 0.7, "max_tokens": 512 }' ``` ## Image Diffusion Image diffusion workers generate images from text prompts using SGLang's `DiffGenerator`. Generated images are returned as either URLs (when using `--media-output-fs-url` for storage) or base64 data, in an OpenAI-compatible response format. ### Launch ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/image_diffusion.sh ``` Supports local storage (`--fs-url file:///tmp/images`) and S3 (`--fs-url s3://bucket`). Pass `--http-url` to set the base URL for serving stored images. See the launch script for all configuration options. ### Test ```bash curl http://localhost:8000/v1/images/generations \ -H "Content-Type: application/json" \ -d '{ "model": "black-forest-labs/FLUX.1-dev", "prompt": "Explain why Roger Federer is considered one of the greatest tennis players of all time", "size": "1024x1024", "response_format": "url", "nvext": { "num_inference_steps": 15 } }' ``` ## Video Generation Video generation workers produce videos from text or image prompts using SGLang's `DiffGenerator` with frame-to-video encoding. Supports text-to-video (T2V) and image-to-video (I2V) workflows. 
### Launch ```bash cd $DYNAMO_HOME/examples/backends/sglang ./launch/text-to-video-diffusion.sh ``` Use `--wan-size 1b` (default, 1 GPU) or `--wan-size 14b` (2 GPUs). See the launch script for all configuration options. ### Test ```bash curl http://localhost:8000/v1/videos \ -H "Content-Type: application/json" \ -d '{ "prompt": "Roger Federer winning his 19th grand slam", "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", "seconds": 2, "size": "832x480", "response_format": "url", "nvext": { "fps": 8, "num_frames": 17, "num_inference_steps": 50 } }' ``` ## See Also - **[Examples](/dynamo/backends/sg-lang/examples)**: Launch scripts for all deployment patterns - **[Reference Guide](/dynamo/backends/sg-lang/reference-guide)**: Worker types and argument reference - **[SGLang Diffusion LMs (upstream)](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/text_generation/diffusion_language_models.md)**: SGLang diffusion documentation # SGLang Chat Processor The SGLang chat processor enables SGLang-native preprocessing and postprocessing in the Dynamo frontend. It uses SGLang's tokenizer, chat templates, tool call parser, and reasoning parser directly -- bypassing the default Rust preprocessor for `v1/chat/completions` requests. ## When to Use Use `--dyn-chat-processor sglang` when Dynamo's built-in Rust preprocessor does not yet support a tool call parser or reasoning parser you need. The SGLang processor delegates to SGLang's Python implementations, so any parser SGLang supports works immediately. Common cases: - A **tool call format** not yet in the Rust `tool_calling` library - A **reasoning parser** not yet supported natively - A **chat template** that the Rust preprocessor doesn't handle correctly If the parser you need is missing from the Rust preprocessor, consider [opening an issue or PR](https://github.com/ai-dynamo/dynamo/issues) to add native support -- native parsers avoid the Python GIL overhead entirely. 
## Quick Start ```bash # Frontend with SGLang processor, tool calling, and reasoning python -m dynamo.frontend \ --router-mode kv \ --dyn-chat-processor sglang \ --tool-call-parser hermes \ --reasoning-parser qwen3 # Workers (unchanged) CUDA_VISIBLE_DEVICES=0 python -m dynamo.sglang \ --model-path Qwen/Qwen3-14B-FP8 \ --served-model-name Qwen/Qwen3-14B-FP8 \ --tp 1 --trust-remote-code \ --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:5557"}' ``` ## Frontend Arguments These arguments are passed to the **frontend** (not the worker) when using `--dyn-chat-processor sglang`: | Argument | Default | Description | |----------|---------|-------------| | `--dyn-chat-processor sglang` | (none) | Enable the SGLang chat processor | | `--tool-call-parser` | `None` | Tool call parser name (any SGLang-supported parser) | | `--reasoning-parser` | `None` | Reasoning parser name (any SGLang-supported parser) | ### Environment Variables | Variable | Default | Description | |----------|---------|-------------| | `DYN_SGLANG_STREAM_INTERVAL` | `20` | Number of tokens to accumulate before detokenizing. Higher values improve throughput. The first chunk always emits immediately (interval=1) to minimize time-to-first-token. | ## Tool Calling The processor supports all SGLang tool call formats. Pass `--tool-call-parser` on the frontend: ```bash python -m dynamo.frontend \ --dyn-chat-processor sglang \ --tool-call-parser hermes ``` Any parser supported by SGLang can be used. See the [SGLang documentation](https://docs.sglang.ai/) for the full list of available tool call parsers. 
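Once the frontend returns a parsed tool call, the client still has to run the tool and send the result back. A minimal client-side sketch, assuming the OpenAI-style message shape shown in the example response in this section; `get_weather` and the `TOOLS` registry are hypothetical stand-ins for real application code.

```python
import json


def get_weather(city):
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Weather for {city}: unknown (stub)"


# Registry mapping tool names advertised in the request to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(message):
    """Execute each tool call in an assistant message and build tool replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # `arguments` arrives as a JSON-encoded string, not a dict.
        arguments = json.loads(function["arguments"])
        result = TOOLS[function["name"]](**arguments)
        replies.append(
            {"role": "tool", "tool_call_id": call["id"], "content": result}
        )
    return replies
```

The returned messages would be appended to the conversation and sent in a follow-up `v1/chat/completions` request so the model can produce its final answer.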
### Example: Tool Call Request

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

Response:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "tool_calls": [{
        "id": "call_8cd24396f3671048",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Paris\"}"
        }
      }],
      "reasoning_content": "The user wants weather info for Paris..."
    },
    "finish_reason": "tool_calls"
  }]
}
```

## Reasoning Parsing

For models that produce chain-of-thought reasoning (e.g., Qwen3, DeepSeek-R1), pass `--reasoning-parser`:

```bash
python -m dynamo.frontend \
  --dyn-chat-processor sglang \
  --reasoning-parser qwen3
```

The parser separates content inside `<think>` tags into the `reasoning_content` field and regular content into the `content` field.

## Migration from `--use-sglang-tokenizer`

`--use-sglang-tokenizer` on the **worker** is deprecated.
Replace it with `--dyn-chat-processor sglang` on the **frontend**:

```diff
# Before (deprecated)
- python -m dynamo.sglang --model-path <model> --use-sglang-tokenizer
- python -m dynamo.frontend

# After
  python -m dynamo.sglang --model-path <model>
+ python -m dynamo.frontend --dyn-chat-processor sglang
```

Key differences:

| | `--use-sglang-tokenizer` | `--dyn-chat-processor sglang` |
|---|---|---|
| Location | Worker flag | Frontend flag |
| KV router | Not supported | Supported |
| Tool calling | Not supported | Supported |
| Reasoning | Not supported | Supported |
| Endpoints | `v1/chat/completions` only | `v1/chat/completions` only |

## See Also

- **[Tool Calling](/dynamo/user-guides/parsing/tool-call-parsing-dynamo)**: General tool calling guide
- **[Reference Guide](/dynamo/backends/sg-lang/reference-guide)**: Full SGLang backend reference
- **[Agentic Workloads](/dynamo/backends/sg-lang/agentic-workloads)**: Priority scheduling and cache pinning for agents

# Observability

This guide covers metrics, tracing, and visualization for SGLang deployments running through Dynamo.

## Prometheus Metrics

When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). A single worker endpoint therefore serves both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`).

**For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).

**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/dynamo/user-guides/observability-local/metrics).

**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup).
### Environment Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |

### Getting Started Quickly

This is a single-machine example.

#### Start Observability Stack

To visualize metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions.

#### Launch Dynamo Components

Launch a frontend and an SGLang backend to test metrics:

```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend

# Enable system metrics server on port 8081
DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model <model_name> --enable-metrics
```

Wait for the SGLang worker to start, then send requests and check metrics:

```bash
# Send a request
curl -H 'Content-Type: application/json' \
  -d '{
    "model": "<model_name>",
    "max_completion_tokens": 100,
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
  }' \
  http://localhost:8000/v1/chat/completions

# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^sglang:"
```

### Exposed Metrics

SGLang exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All SGLang engine metrics use the `sglang:` prefix and include labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`) to identify the source.

**Example Prometheus Exposition Format text:**

```
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0

# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0

# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
```

**Note:** The specific metrics shown above are examples and may vary depending on your SGLang version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) for the current list.

### Metric Categories

SGLang provides metrics in the following categories (all prefixed with `sglang:`):

- **Throughput metrics** - Token processing rates
- **Resource usage** - System resource consumption
- **Latency metrics** - Request and token latency measurements
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)

**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.

### Available Metrics

The official SGLang documentation includes complete metric definitions with:

- HELP and TYPE descriptions
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`)
- Setup guide for Prometheus + Grafana monitoring
- Troubleshooting tips and configuration examples

For the complete and authoritative list of all SGLang metrics, see the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).
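As a rough illustration of the exposition format and of prefix filtering, the sketch below parses scraped `/metrics` text and keeps only samples whose name starts with a given prefix. It is a simplified stdlib parser written for this guide, not the code Dynamo or SGLang use; `parse_samples` and the `dynamo_example_metric` line are hypothetical, and the regular expression ignores edge cases such as timestamps and escaped characters in label values.

```python
import re

# Sample text in Prometheus exposition format. The sglang: lines mirror the
# example above; dynamo_example_metric is a made-up name for illustration.
SAMPLE = """\
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
dynamo_example_metric{endpoint="generate"} 42
"""

# metric name, optional {label set}, then the sample value
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)$")


def parse_samples(text, prefix=""):
    """Return {series: value} for samples whose metric name starts with prefix."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        name, labels, value = match.groups()
        samples[name + (labels or "")] = float(value)
    return samples


for series, value in parse_samples(SAMPLE, prefix="sglang:").items():
    print(series, value)
```

For anything beyond a quick check, the parser in the `prometheus_client` package is a better fit than a hand-written regular expression.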
### Implementation Details

- SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
- Metrics are filtered by the `sglang:` prefix before being exposed
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after SGLang engine initialization completes

---

## Forward Pass Metrics (FPM)

> **Availability in the 1.2.0 SGLang runtime.** The published `sglang-runtime:1.2.0` image does not yet include the upstream `sglang.srt.observability.forward_pass_metrics` module or the corresponding `ServerArgs` fields (`enable_forward_pass_metrics`, `forward_pass_metrics_worker_id`, `forward_pass_metrics_ipc_name`). Setting `DYN_FORWARDPASS_METRIC_PORT` starts the Dynamo-side relay successfully and the worker continues to serve requests, but no SGLang-side FPM payloads are emitted to the NATS event plane. **The pipeline and schema below describe the intended architecture and will become functional once the upstream SGLang FPM module is included in a future SGLang runtime image.** For load-based Planner scaling on 1.2.0, use a vLLM or TensorRT-LLM (non-attention-DP) backend; see the [Planner FPM support matrix](/dynamo/components/planner#load-based-scaling).

Forward Pass Metrics provide **per-iteration scheduler telemetry** pushed over ZMQ, giving the [Planner](/dynamo/components/planner) real-time visibility into batch composition, queue depth, and GPU forward pass duration. Unlike Prometheus metrics (which are scraped asynchronously and reflect only the latest gauge value), FPM emits a structured message after every scheduler iteration with the exact batch state.

### Pipeline

```mermaid
flowchart LR
    A["SGLang Scheduler<br/>(child process)"] -->|ZMQ PUB<br/>IPC per dp_rank| B["FpmEventRelay<br/>(Rust)"]
    B -->|NATS<br/>event plane| C["FpmEventSubscriber<br/>(Rust)"]
    C --> D["Planner<br/>regression models"]
```

The transport is backend-agnostic: the same `FpmEventRelay` and `FpmEventSubscriber` are used by both SGLang and vLLM backends.

### Enabling FPM

FPM requires the Dynamo adapter (`dynamo.sglang`) to inject the worker identity and IPC path before engine initialization. This happens automatically when the Dynamo runtime creates the SGLang worker.

The Planner subscribes to FPM via the NATS event plane. See the [Planner Guide](/dynamo/components/planner/planner-guide) for configuration (`load_adjustment_interval`, `max_num_fpm_samples`, `fpm_sample_bucket_size`).

### Schema

**ForwardPassMetrics** (top-level, one per iteration):

| Field | Type | Description |
|-------|------|-------------|
| `version` | `int` | Schema version (currently 1) |
| `worker_id` | `str` | Dynamo endpoint `connection_id` |
| `dp_rank` | `int` | Data-parallel rank |
| `counter_id` | `int` | Monotonic sequence number per (worker, dp_rank) |
| `wall_time` | `float` | GPU forward pass duration in seconds (via DeviceTimer) |
| `scheduled_requests` | `ScheduledRequestMetrics` | Batch composition this iteration |
| `queued_requests` | `QueuedRequestMetrics` | Waiting requests snapshot |

**ScheduledRequestMetrics** (requests in this batch):

| Field | Type | Description |
|-------|------|-------------|
| `num_prefill_requests` | `int` | Prefill requests (new + chunked continuations) |
| `sum_prefill_tokens` | `int` | Tokens freshly computed (chunk size, not full prompt) |
| `var_prefill_length` | `float` | Variance of full prompt lengths |
| `sum_prefill_kv_tokens` | `int` | KV tokens read but not computed (prefix cache + prior chunks) |
| `num_decode_requests` | `int` | Decode requests generating output tokens |
| `sum_decode_kv_tokens` | `int` | Total KV context length across decode requests |
| `var_decode_kv_tokens` | `float` | Variance of decode KV context lengths |

**QueuedRequestMetrics** (requests waiting to be scheduled):

| Field | Type | Description |
|-------|------|-------------| | `num_prefill_requests` | `int` | Queued prefill requests | | `sum_prefill_tokens` | `int` | Total tokens across queued prefill | | `var_prefill_length` | `float` | Variance of queued prefill lengths | | `num_decode_requests` | `int` | Queued decode requests | | `sum_decode_kv_tokens` | `int` | Total KV tokens across queued decode | | `var_decode_kv_tokens` | `float` | Variance of queued decode KV lengths | ### GPU-Accurate Timing FPM uses SGLang's `DeviceTimer` infrastructure (CUDA event pairs around `model_runner.forward()` and `cuda_graph.replay()`) for GPU-accurate `wall_time`. This avoids the CPU scheduling overhead that would be included by timing at the scheduler level. When DeviceTimer events are not yet ready (overlap scheduler mode where GPU work from iteration N is still in flight), FPM skips emission for that iteration rather than reporting an inaccurate monotonic clock fallback. ### Disaggregated Mode In disaggregated serving, queued request metrics read from the correct engine-specific queues: | Engine | Queue Source | |--------|-------------| | Unified (non-disagg) | `waiting_queue` | | Prefill | `disagg_prefill_bootstrap_queue` | | Decode | `disagg_decode_prealloc_queue` + `disagg_decode_transfer_queue` | ### Cross-Repo Contract Test SGLang defines its own `ForwardPassMetrics` struct that must field-for-field match Dynamo's shared schema. A cross-repo contract test (`dynamo/sglang/tests/test_fpm_contract.py`) guards against schema drift by encoding with SGLang's struct and decoding with Dynamo's. ### Design Reference For the full motivation and design rationale, see the [Forward Pass Metrics RFC](../../proposals/vllm-rfc-forward-pass-metrics.md). --- ## Distributed Tracing Dynamo propagates [W3C Trace Context](https://www.w3.org/TR/trace-context/) headers through the SGLang request pipeline, allowing you to correlate traces across the frontend, router, and individual SGLang workers in a disaggregated deployment. 
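For orientation, a W3C `traceparent` header consists of four hyphen-separated lowercase hex fields: a 2-digit version, a 32-digit trace ID, a 16-digit parent (span) ID, and 2-digit trace flags. The sketch below builds and validates such a header using only the standard library. It is an illustration of the format, not Dynamo's `build_trace_headers` implementation; the function names are hypothetical.

```python
import re

# version - trace-id - parent-id - trace-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is not valid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit = sampled
    }


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header from hex trace and span IDs."""
    header = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    if parse_traceparent(header) is None:
        raise ValueError(f"invalid trace context: {header!r}")
    return header


header = build_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(header)
print(parse_traceparent(header))
```

Validating on the way out is a cheap way to catch a malformed ID before it reaches a downstream component that would silently drop the trace link.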
### Prerequisites SGLang's engine-internal tracing requires the `opentelemetry` packages. These are declared as SGLang's `[tracing]` extra. Install them into your Dynamo environment: ```bash uv pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-exporter-otlp-proto-grpc ``` Without these packages, Dynamo-side spans (frontend, handler) will still work, but SGLang's internal engine spans will not be emitted and you will see a warning: `"Tracing is disabled because the packages cannot be imported."` ### How Trace Propagation Works ``` Frontend (Rust) creates span, embeds trace_id + span_id in Context | v Dynamo RPC (NATS transport) Context serialized with trace_id, span_id | v SGLang Handler (Python) dynamo.common.utils.otel_tracing.build_trace_headers(context) builds W3C traceparent: "00-{trace_id}-{span_id}-01" | v sgl.Engine.async_generate( ..., rid=trace_id, # request ID = trace ID external_trace_header=traceparent # W3C header for SGLang internal spans ) | v SGLang Engine (internal spans attached to same trace) ``` Key implementation files: - `components/src/dynamo/common/utils/otel_tracing.py` - W3C `traceparent` header builder - `components/src/dynamo/sglang/request_handlers/handler_base.py:71-84` - Extracts trace context from Dynamo `Context` object - `components/src/dynamo/sglang/request_handlers/llm/decode_handler.py` - Passes `external_trace_header` and `rid=trace_id` to `engine.async_generate()` ### Environment Variables | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `DYN_LOGGING_JSONL` | Enable JSONL logging (required for tracing) | `false` | `true` | | `OTEL_EXPORT_ENABLED` | Enable OTLP trace export | `false` | `true` | | `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | OTLP gRPC endpoint for Tempo | `http://localhost:4317` | `http://tempo:4317` | | `OTEL_SERVICE_NAME` | Service name shown in Grafana Tempo | `dynamo` | `dynamo-worker-decode` | ### SGLang-Specific Flags | Flag | 
Description | |------|-------------| | `--enable-trace` | Enable W3C trace header propagation into SGLang engine | | `--otlp-traces-endpoint` | OTLP gRPC endpoint for SGLang's internal trace export (bare `host:port` format, e.g. `localhost:4317`) | Both flags are required for end-to-end tracing through the SGLang engine. Without `--enable-trace`, the Dynamo handler still creates spans, but SGLang's internal engine spans will not be linked. ### Controlling SGLang Trace Verbosity When `--enable-trace` is set, SGLang emits spans at four verbosity levels. Dynamo defaults to level 2, which keeps all useful per-request spans while suppressing high-volume scheduler noise: | Level | Spans included | Volume | |-------|---------------|--------| | 1 | `tokenize`, `prefill_forward`, `decode_forward` | Low | | 2 | Level 1 + `request_process`, `api_server_dispatch` | Low | | 3 (SGLang default) | Level 2 + `decode_loop`, `chunked_prefill`, `fake_output` | Very high (~1.6M spans/hr per model) | | 4 | Level 3 + `run_batch_cpu` | Extremely high | Use the `SGLANG_TRACE_LEVEL` environment variable to override the default: | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `SGLANG_TRACE_LEVEL` | SGLang internal span verbosity level (1–4); only active when `--enable-trace` is set | `2` | `1` | ```bash # Keep only the most essential per-request spans SGLANG_TRACE_LEVEL=1 python -m dynamo.sglang --model Qwen/Qwen3-0.6B --enable-trace --otlp-traces-endpoint localhost:4317 # Restore SGLang's default level (includes decode_loop — high volume) SGLANG_TRACE_LEVEL=3 python -m dynamo.sglang --model Qwen/Qwen3-0.6B --enable-trace --otlp-traces-endpoint localhost:4317 ``` ### Launch with Tracing The disaggregated launch script supports `--enable-otel` to enable tracing across all components: ```bash # Start observability stack first docker compose -f dev/docker-compose.yml up -d docker compose -f dev/docker-observability.yml up -d # Launch SGLang 
disaggregated with tracing cd examples/backends/sglang/launch ./disagg.sh --enable-otel ``` Or manually for an aggregated deployment: ```bash export DYN_LOGGING_JSONL=true export OTEL_EXPORT_ENABLED=true export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317 # Frontend OTEL_SERVICE_NAME=dynamo-frontend python -m dynamo.frontend & # SGLang worker with tracing OTEL_SERVICE_NAME=dynamo-worker-sglang \ DYN_SYSTEM_PORT=8081 \ python -m dynamo.sglang \ --model Qwen/Qwen3-0.6B \ --enable-metrics \ --enable-trace \ --otlp-traces-endpoint localhost:4317 ``` ### What You'll See in Traces With tracing enabled, each inference request produces a single end-to-end trace spanning the full request lifecycle: - **Frontend `http-request` span** - Root span from the HTTP service, includes method/uri/trace_id - **KV Router spans** - `kv_router.route_request`, `kv_router.select_worker`, `kv_router.compute_block_hashes`, `kv_router.find_matches`, `kv_router.compute_seq_hashes`, `kv_router.schedule` - **Worker `handle_payload` span** - The Dynamo RPC handler on the worker side, with component/endpoint/namespace labels - **SGLang engine spans** - `Req `, `Scheduler`, `Tokenizer`, `request_process`, `prefill_forward`, `decode_loop`, `Bootstrap Room` (for disagg) - **Semantic conventions** - `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`, `gen_ai.latency.time_to_first_token`, etc. 
Example trace tree for a KV-routed request: ``` dynamo-frontend: http-request (root) dynamo-frontend: kv_router.route_request dynamo-frontend: kv_router.select_worker kv_router.compute_block_hashes kv_router.find_matches kv_router.compute_seq_hashes kv_router.schedule dynamo-worker-1: handle_payload sglang: Bootstrap Room 0x0 sglang: Req sglang: Scheduler [TP 0] request_process prefill_forward decode_loop (repeated per token) sglang: Tokenizer tokenize dispatch ``` ![End-to-end trace in Grafana Tempo showing frontend, KV router, worker, and SGLang engine spans](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/734df65fd267a48ac54a107a7d2ea4bc658cc0b28f52e984a1763fff9f1be3d9/pages-v1.2.0/assets/img/sglang-trace.png) ### Viewing Traces 1. Open Grafana at `http://localhost:3000` (username: `dynamo`, password: `dynamo`) 2. Navigate to **Explore** (compass icon) 3. Select **Tempo** as the data source 4. Use the **Search** tab: - Filter by **Service Name** (e.g., `dynamo-frontend`, `dynamo-worker-1`, `sglang`) - Filter by **Span Name** (e.g., `http-request`, `handle_payload`, `Req *`, `decode_loop`) - Filter by **Tags** (e.g., `rid=`, `gen_ai.response.model=Qwen/Qwen3-0.6B`) 5. Click a trace to view the flame graph spanning frontend -> router -> worker -> engine Send a request with `x-request-id` for easy lookup: ```bash curl -H 'Content-Type: application/json' \ -H 'x-request-id: my-trace-001' \ -d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 50, "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]}' \ http://localhost:8000/v1/chat/completions ``` For more details on the Tempo/Grafana tracing infrastructure, see the [Dynamo Tracing Guide](/dynamo/user-guides/observability-local/tracing). --- ## SGLang Grafana Dashboard Dynamo ships a pre-provisioned Grafana dashboard for SGLang at `dev/observability/grafana_dashboards/sglang.json`. 
It is automatically loaded when the observability stack starts. ### Dashboard Panels The dashboard is organized into five sections: | Section | Panels | What to Watch | |---------|--------|---------------| | **Request Latency** | E2E Request Latency, Time-To-First-Token, Inter-Token Latency | Tail latency regressions, TTFT spikes during prefill pressure | | **Throughput & Queue** | Token Generation Throughput (tok/s), Running & Queued Requests, Request Rate | Throughput saturation, queue depth growth | | **Cache & PIN** | Cache Hit Rate, Active PIN Count, Retractions | KV cache reuse efficiency, PIN pressure from disagg routing | | **Memory Pressure** | GPU KV Cache Usage %, Host (CPU) KV Cache Usage %, Eviction & Load-back Rate | OOM risk, HiCache offload activity | | **HiCache Latency** | Eviction P99 Latency, Load-back P99 Latency | PCIe/NVLink bottlenecks in KV offload path | ### Accessing the Dashboard 1. Open Grafana at `http://localhost:3000` 2. Login with `dynamo` / `dynamo` 3. Click **Dashboards** in the left sidebar 4. Select **SGLang Engine** Other available dashboards: - **Dynamo Dashboard** (`dynamo.json`) - Frontend and component metrics - **DCGM Metrics** (`dcgm-metrics.json`) - GPU utilization, memory, power - **KVBM** (`kvbm.json`) - KV block manager metrics - **Disagg Dashboard** (`disagg-dashboard.json`) - Disaggregated serving metrics --- ## Exposing on a Remote VM When developing on a remote VM (cloud instance, bare metal, etc.), the observability ports are only bound to `localhost` inside the VM. You have two options to access them. ### Option 1: SSH Port Forwarding (Recommended) Forward the relevant ports through your SSH connection. No firewall changes needed, traffic is encrypted. ```bash # Forward Grafana (3000), Prometheus (9090), and Tempo (3200) ssh -L 3000:localhost:3000 \ -L 9090:localhost:9090 \ -L 3200:localhost:3200 \ user@your-vm-ip ``` Then open `http://localhost:3000` in your local browser. 
For a long-running tunnel in the background:

```bash
ssh -fN \
  -L 3000:localhost:3000 \
  -L 9090:localhost:9090 \
  -L 3200:localhost:3200 \
  user@your-vm-ip
```

### Option 2: Firewall Rules

Open the ports directly. Only use this on trusted networks.

```bash
# Ubuntu/Debian
sudo ufw allow 3000/tcp  # Grafana
sudo ufw allow 9090/tcp  # Prometheus

# Or for cloud VMs, add inbound rules in your security group for ports 3000, 9090
```

Then access `http://<your-vm-ip>:3000` directly.

### Headless / Agent Access

For CI pipelines, AI coding agents, or headless workflows where no browser is available, you can query Grafana and Prometheus directly via their APIs:

```bash
# Query Prometheus for SGLang token throughput
curl -s 'http://localhost:9090/api/v1/query?query=rate(sglang:generation_tokens_total[1m])' | python3 -m json.tool

# Query Prometheus for GPU KV cache usage
curl -s 'http://localhost:9090/api/v1/query?query=dynamo_component_gpu_cache_usage_percent' | python3 -m json.tool

# List available Grafana dashboards
curl -s -u dynamo:dynamo http://localhost:3000/api/search | python3 -m json.tool

# Get the SGLang dashboard by title
curl -s -u dynamo:dynamo 'http://localhost:3000/api/search?query=SGLang' | python3 -m json.tool

# Fetch a specific dashboard by UID (set DASHBOARD_UID to a "uid" value returned by the search above)
curl -s -u dynamo:dynamo "http://localhost:3000/api/dashboards/uid/${DASHBOARD_UID}" | python3 -m json.tool

# Snapshot current metrics via Prometheus range query (last hour; `date -d` is GNU date syntax)
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
curl -s "http://localhost:9090/api/v1/query_range?query=sglang:cache_hit_rate&start=${START}&end=${END}&step=15s"
```

This is useful for automated benchmarking pipelines where you want to capture metrics programmatically alongside performance results.
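For pipelines that already run Python, the range query can also be built without shell date arithmetic. The following is a minimal sketch using only the standard library: `range_query_url` and `fetch_json` are hypothetical helper names, and the base URL assumes the local Prometheus from the observability stack on port 9090.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode
from urllib.request import urlopen


def range_query_url(base_url, query, end, window=timedelta(hours=1), step="15s"):
    """Build a Prometheus /api/v1/query_range URL covering `window` up to `end`."""
    end = end.astimezone(timezone.utc)
    start = end - window
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamps in UTC
    params = {
        "query": query,
        "start": start.strftime(fmt),
        "end": end.strftime(fmt),
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def fetch_json(url, timeout=10.0):
    """GET a URL and decode its JSON body (needs a reachable Prometheus)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# A fixed end time keeps the example reproducible; use datetime.now(timezone.utc)
# in a real run.
url = range_query_url(
    "http://localhost:9090",
    "sglang:cache_hit_rate",
    end=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(url)
# result = fetch_json(url)  # uncomment when Prometheus is running
```

`urlencode` percent-encodes the colons in the metric name and timestamps, which Prometheus decodes transparently.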

---

## Related Documentation

### SGLang Metrics

- [Official SGLang Production Metrics](https://docs.sglang.io/references/production_metrics.html)
- [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/v0.5.9/python/sglang/srt/metrics/collector.py)

### Dynamo Observability

- [Dynamo Metrics Guide](/dynamo/user-guides/observability-local/metrics) - Complete documentation on Dynamo runtime metrics
- [Dynamo Tracing Guide](/dynamo/user-guides/observability-local/tracing) - Distributed tracing with OpenTelemetry and Tempo
- [Prometheus and Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside SGLang metrics
  - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
  - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
  - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration

# SGLang for Agentic Workloads

This guide covers SGLang-specific configuration for agentic serving with Dynamo. It explains which SGLang engine flags to enable, how Dynamo's [agent hints](/dynamo/additional-resources/nvidia-request-extensions-nvext#agent-hints) map to SGLang behavior, and how to use session control to manage KV cache for multi-turn agent conversations.

## Overview

Agentic workloads (tool-calling loops, multi-turn reasoning, code generation pipelines) have different performance characteristics than batch inference:

- **Prefix-heavy**: Successive turns share a growing conversation prefix. KV cache reuse is critical for low TTFT.
- **Priority-sensitive**: Some requests (user-facing agent turns) matter more than background tasks.
- **Long-lived**: Conversations span minutes to hours. Cache eviction under memory pressure can destroy accumulated KV state.


Dynamo's agent hints give the router per-request metadata. SGLang's engine flags control how that metadata affects scheduling and eviction on the worker.

## SGLang Engine Flags

### Priority Scheduling

Enable priority-based scheduling so the engine respects the `priority` value from `nvext.agent_hints.priority`:

```bash
python -m dynamo.sglang \
  --model-path <model> \
  --enable-priority-scheduling \
  ...
```

| Flag | Description |
| ------------------------------ | ---------------------------------------------------------- |
| `--enable-priority-scheduling` | Enables priority-based request scheduling instead of FCFS. |

When priority scheduling is enabled, the engine uses the `priority` field from `nvext.agent_hints` to order requests in its internal queue. Requests with higher effective priority are scheduled before lower-priority ones. Ties are broken by arrival time.

### Priority-Based KV Cache Eviction

By default, SGLang evicts radix tree nodes using LRU. You can switch to priority-based eviction so that low-priority cache entries are evicted before high-priority ones:

```bash
python -m dynamo.sglang \
  --model-path <model> \
  --radix-eviction-policy priority \
  ...
```

| Flag | Values | Default | Description |
| ------------------------- | ----------------- | ------- | ---------------------------------------------------------------------------------------------------------- |
| `--radix-eviction-policy` | `lru`, `priority` | `lru` | Eviction strategy for the GPU radix cache. `priority` uses a heap ordered by the request's priority value. |

This does **not** require HiCache. It controls GPU-only radix tree eviction. When the GPU KV cache is full:

- **`lru`**: Evicts the least recently used leaf nodes first.
- **`priority`**: Evicts lowest-priority leaf nodes first. Nodes with equal priority fall back to LRU ordering.
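The difference between the two eviction orders can be illustrated with a small sketch. It is not SGLang's radix cache implementation: the `Leaf` type, its fields, and `eviction_order` are hypothetical, and it assumes that a larger `priority` number marks a more important entry.

```python
import heapq
from typing import NamedTuple


class Leaf(NamedTuple):
    """Hypothetical stand-in for an evictable radix tree leaf."""
    key: str
    priority: int      # assumption: larger number = more important
    last_access: int   # logical timestamp of the most recent hit


def eviction_order(leaves, policy="lru"):
    """Return leaf keys in the order they would be evicted."""
    if policy == "lru":
        rank = lambda leaf: (leaf.last_access,)
    elif policy == "priority":
        # Lowest priority first; equal priorities fall back to LRU.
        rank = lambda leaf: (leaf.priority, leaf.last_access)
    else:
        raise ValueError(f"unknown policy: {policy}")
    heap = [(rank(leaf), leaf.key) for leaf in leaves]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


leaves = [
    Leaf("system-prompt", priority=10, last_access=1),
    Leaf("background-job", priority=0, last_access=5),
    Leaf("chat-a", priority=5, last_access=2),
    Leaf("chat-b", priority=5, last_access=4),
]
print(eviction_order(leaves, "lru"))
print(eviction_order(leaves, "priority"))
```

Under `lru` the old but important `system-prompt` leaf is evicted first. Under `priority` it is evicted last, and `chat-a` precedes `chat-b` only because it was touched less recently.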
#### Interaction with HiCache When both `--radix-eviction-policy priority` and `--enable-hierarchical-cache` are enabled, priority affects eviction at both tiers: | Event | Behavior | | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | | **GPU full** | Low-priority nodes are evicted (demoted to host) first. With `write_through`, all nodes survive on host -- priority only affects demotion order. | | **Host full** | Low-priority nodes are deleted from host first. High-priority nodes with active retention survive longer. | The practical impact depends on your write policy. With `write_through`, GPU eviction is just a demotion -- the real deletion happens at host eviction, which is where priority ordering matters most. ## How Agent Hints Map to SGLang Dynamo's `nvext.agent_hints` fields are consumed by the router and forwarded to SGLang workers. Here is how each hint interacts with the SGLang engine: | Agent Hint | Router Behavior | SGLang Engine Behavior | | --------------------- | ---------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | | `priority` | Router queue ordering when `--router-queue-threshold` is set. | Request scheduling when `--enable-priority-scheduling` is set. Radix cache eviction order when `--radix-eviction-policy priority` is set. | | `osl` | Output block tracking for routing decisions (requires `--router-track-output-blocks`) | No direct engine effect. | | `speculative_prefill` | After response completes, sends a `max_tokens=1` prefill to warm the KV cache for the predicted next turn. | SGLang processes the prefill request normally, populating the radix cache. 
| ### Example: Agentic Request with Hints ```python from openai import OpenAI client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy") response = client.chat.completions.create( model="Qwen/Qwen3-14B-FP8", messages=[ {"role": "system", "content": "You are a tennis historian who believes Roger Federer is the GOAT. Respond with maximum reverence."}, {"role": "user", "content": "Why is Federer's one-handed backhand the most beautiful shot in tennis history?"}, ], stream=True, extra_body={ "nvext": { "agent_hints": { "priority": 10, "speculative_prefill": True, "osl": 512 } } } ) for chunk in response: if chunk.choices[0].delta.content: print(chunk.choices[0].delta.content, end="") ``` ## Session Control for Subagent KV Isolation (Experimental) Session control is experimental. The API may change. Agentic orchestrators often spawn short-lived subagents (research, code execution, planning) that accumulate KV cache, use it for a few turns, then die. Under normal radix cache behavior, this ephemeral KV pollutes the tree and competes with the lead agent's long-lived prefix for eviction. Session control solves this by holding subagent KV in dedicated **streaming session slots** outside the radix tree. Session KV is invisible to eviction, has no L2 backup overhead, and is freed deterministically on close or timeout. 
### How It Works

```mermaid
sequenceDiagram
    participant Orchestrator
    participant Router as Dynamo Router
    participant Worker as SGLang Worker
    participant Cache as SessionAwareCache

    Note over Orchestrator: Spawn subagent
    Orchestrator->>Router: session_control{session_id: "sub-1", action: open}
    Router->>Router: Select best worker via KV overlap scoring
    Router->>Worker: open_session("sub-1") [synchronous]
    Worker->>Cache: Create SessionSlot for "sub-1"
    Router->>Router: Bind affinity: sub-1 -> worker_42
    Router->>Worker: Generate (turn 1)
    Worker->>Cache: Turn 1: radix tree match (reuses lead agent prefix)
    Worker-->>Router: Response
    Router-->>Orchestrator: Response

    Orchestrator->>Router: session_control{session_id: "sub-1"}
    Router->>Router: Resolve affinity: sub-1 -> worker_42
    Router->>Worker: Generate (turn 2, pinned to worker_42)
    Worker->>Cache: Turn 2: O(1) restore from SessionSlot
    Worker-->>Router: Response
    Router-->>Orchestrator: Response

    Note over Orchestrator: Subagent done
    Orchestrator->>Router: session_control{session_id: "sub-1", action: close}
    Router->>Router: Remove affinity for sub-1
    Router->>Worker: Generate (final turn)
    Worker-->>Router: Response
    Router-->>Orchestrator: Response
    Note over Router,Worker: On stream completion
    Router-)Worker: close_session("sub-1") [fire-and-forget]
    Worker->>Cache: release_session -> free KV immediately
```

Key behaviors:

- **Turn 1** goes through the normal radix tree, so the subagent shares the lead agent's cached system prompt prefix.
- **Turns 2+** skip the radix tree entirely. KV is restored from the `SessionSlot` in O(1).
- **Session KV is invisible to eviction**. It cannot be evicted -- only freed by explicit close or inactivity timeout.
- **Deterministic cleanup**: On close, session KV is freed immediately.
- **Router-side affinity**: The `StickySessionRouter` maintains a `session_id -> (worker_id, dp_rank)` mapping with sliding-window TTL. Clients can use `action: "bind"` for router-only sticky routing, or `action: "open"` for SGLang streaming-session KV isolation; both route later turns to the pinned worker/rank.

### Enabling Session Control

Session control is request-driven. The `StickySessionRouter` activates automatically when a request carries `nvext.session_control` -- no additional frontend flags are needed beyond `--router-mode kv`. Use `action: "bind"` for router-only sticky routing without calling SGLang. On the worker side, streaming sessions must be explicitly enabled only for `action: "open"` / `action: "close"` lifecycle RPCs and session KV isolation.

Session control is currently supported only on the SGLang backend. vLLM and TensorRT-LLM do not yet expose the streaming session API.

Streaming sessions require SGLang `0.5.11` or later, which includes changes from [sgl-project/sglang#21875](https://github.com/sgl-project/sglang/pull/21875) (session-aware cache, race condition fixes, session metrics).

**SGLang worker:**

```bash
python -m dynamo.sglang \
  --model-path \
  --enable-streaming-session \
  ...
```

| Flag | Description |
| ---------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `--enable-streaming-session` | Wraps the radix cache with `SessionAwareCache`, enabling streaming session slots for subagent KV isolation. |

**Router:**

```bash
python -m dynamo.frontend \
  --router-mode kv \
  ...
```

### Request Format

#### Opening a session

Include `session_control` with `action: "open"` on the first request:

```json
{
  "model": "Qwen/Qwen3-14B-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Research every Federer Grand Slam final in exhaustive detail."
    }
  ],
  "nvext": {
    "session_control": {
      "session_id": "sub-1",
      "action": "open",
      "timeout": 60
    }
  }
}
```

| Field | Type | Description |
| ---------------------------- | --------- | --------------------------------------------------------------------------------------------- |
| `session_control.session_id` | `string` | Unique session identifier. Present on every turn. |
| `session_control.action` | `string` | `"bind"`, `"open"`, or `"close"`. Omit on intermediate turns. |
| `session_control.timeout` | `integer` | Inactivity timeout in seconds (default 300). Used with `action: "bind"` and `action: "open"`. |

#### Subsequent turns

Include `session_control` with just `session_id` (no action). The router resolves affinity automatically:

```json
{
  "model": "Qwen/Qwen3-14B-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Now compare his Wimbledon 2007 final vs Nadal to any shot in human history."
    }
  ],
  "nvext": {
    "session_control": {
      "session_id": "sub-1"
    }
  }
}
```

#### Closing a session

Include `action: "close"`. The close RPC fires after generation completes:

```json
{
  "model": "Qwen/Qwen3-14B-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Write a 500-word love letter to Federer's single-handed backhand."
    }
  ],
  "nvext": {
    "session_control": {
      "session_id": "sub-1",
      "action": "close"
    }
  }
}
```

### Limitations

- **Streaming sessions only**: Sessions are opened with `streaming=True`, which means only sequential append operations are supported. Branching (`replace`), token-level rewind (`offset`), and `drop_previous_output` are not supported.
- **Timeout is idle-based**: The timeout refreshes on every request. If a subagent pauses for a long tool call that exceeds the timeout, the session is reaped and KV is freed. The subagent must re-open the session and re-prefill.
- **Session metrics**: Active session count (`sglang:num_streaming_sessions`) and held KV tokens (`sglang:streaming_session_held_tokens`) are exported as Prometheus gauges on the worker's metrics endpoint.
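The three request shapes above (open, intermediate turn, close) differ only in the `nvext.session_control` object. The following is a minimal client-side sketch of that pattern, not part of Dynamo itself: the helper names `with_session_control` and `chat` are hypothetical, and rejecting `timeout` on other actions is an assumption based on the field table rather than documented server behavior.

```python
import copy
import json
from typing import Any, Optional

VALID_ACTIONS = {"bind", "open", "close"}


def with_session_control(
    body: dict[str, Any],
    session_id: str,
    action: Optional[str] = None,
    timeout: Optional[int] = None,
) -> dict[str, Any]:
    """Return a copy of `body` with nvext.session_control attached (hypothetical helper)."""
    if action is not None and action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # Assumption: the field table says timeout is used with "bind" and "open",
    # so this sketch refuses to send it with anything else.
    if timeout is not None and action not in ("bind", "open"):
        raise ValueError("timeout is only used with 'bind' or 'open'")

    control: dict[str, Any] = {"session_id": session_id}
    if action is not None:
        control["action"] = action  # omitted on intermediate turns
    if timeout is not None:
        control["timeout"] = timeout

    request = copy.deepcopy(body)  # leave the caller's payload untouched
    request.setdefault("nvext", {})["session_control"] = control
    return request


def chat(content: str) -> dict[str, Any]:
    """Build a plain chat-completions body for the example model."""
    return {
        "model": "Qwen/Qwen3-14B-FP8",
        "messages": [{"role": "user", "content": content}],
    }


first = with_session_control(chat("turn 1"), "sub-1", action="open", timeout=60)
middle = with_session_control(chat("turn 2"), "sub-1")
last = with_session_control(chat("final turn"), "sub-1", action="close")

print(json.dumps(first["nvext"], indent=2))
```

Each resulting dictionary can be sent as the JSON body of a `/v1/chat/completions` request with any HTTP client.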
## Quickstart

### Launch Script

The `agg_agent.sh` script launches a single aggregated worker with session control, sticky routing, and KV events:

```bash
# Default model (GLM-4.7-Flash, 2 GPUs)
bash examples/backends/sglang/launch/agg_agent.sh
```

The frontend listens on port 8000 (override with `DYN_HTTP_PORT`). Worker metrics are on port 8081.

### Testing with OpenCode

[OpenCode](https://github.com/opencode-ai/opencode) is an open-source AI coding agent with built-in support for subagents, tool calling, and OpenAI-compatible endpoints. The [Dynamo provider fork](https://github.com/ishandhanani/opencode/tree/idhanani/dynamo-provider) injects `nvext.session_control` on subagent requests, giving each spawned agent its own Dynamo streaming session with sticky routing and KV isolation.

```bash
# Terminal 1 -- launch Dynamo with session control + tool/reasoning parsers
bash examples/backends/sglang/launch/agg_agent.sh \
  --model-path zai-org/GLM-4.7-Flash --tp 2

# Terminal 2 -- run OpenCode against Dynamo
DYNAMO_API_KEY=dummy bun run --cwd packages/opencode src/index.ts \
  -- --model "dynamo/zai-org/GLM-4.7-Flash"
```

When OpenCode spawns a subagent (via the `task` tool), the provider automatically:

1. Sends `session_control.action = "open"` on the subagent's first turn
2. Routes subsequent turns to the same worker via `session_id`
3. Sends `session_control.action = "close"` when the subagent completes, freeing KV

The primary agent runs without session control -- only subagent sessions are pinned. This keeps lead-agent requests load-balanced while subagent multi-turn conversations stay on a single worker with warm KV cache.
#### Configuration Model and endpoint are configured in `.opencode/opencode.jsonc`: ```jsonc { "provider": { "dynamo": { "npm": "@ai-sdk/openai-compatible", "name": "Dynamo", "env": ["DYNAMO_API_KEY"], "models": { "zai-org/GLM-4.7-Flash": { "id": "zai-org/GLM-4.7-Flash", "name": "GLM 4.7 Flash", "tool_call": true, "reasoning": true, "temperature": true, "attachment": false, "release_date": "2025-06-01", "limit": { "context": 131072, "output": 8192 }, "cost": { "input": 0, "output": 0 }, "interleaved": { "field": "reasoning_content" }, }, }, "options": { "baseURL": "http://localhost:8000/v1", }, }, }, } ``` ## See Also - **[NVIDIA Request Extensions (nvext)](/dynamo/additional-resources/nvidia-request-extensions-nvext)**: Full `nvext` field reference including agent hints - **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router configuration and CLI arguments - **[SGLang HiCache](/dynamo/integrations/kv-cache-integrations/hi-cache)**: Enabling hierarchical KV cache # TensorRT-LLM ## Use the Latest Release We recommend using the [latest stable release](https://github.com/ai-dynamo/dynamo/releases/latest) of Dynamo to avoid breaking changes. --- Dynamo TensorRT-LLM integrates [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, multi-node deployments, and request cancellation. It supports LLM inference, multimodal models, video diffusion, and advanced features like speculative decoding and attention data parallelism. 
## Feature Support Matrix

### Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](/dynamo/design-docs/disaggregated-serving) | ✅ | |
| [**Conditional Disaggregation**](/dynamo/design-docs/disaggregated-serving) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](/dynamo/components/router) | ✅ | |
| [**SLA-Based Planner**](/dynamo/components/planner/planner-guide) | ✅ | |
| [**Load Based Planner**](/dynamo/components/planner) | 🚧 | Planned |
| [**KVBM**](/dynamo/components/kvbm) | ✅ | |

### Large Scale P/D and WideEP Features

| Feature | TensorRT-LLM | Notes |
|---------------------|--------------|-------|
| **WideEP** | ✅ | |
| **DP Rank Routing** | ✅ | |
| **GB200 Support** | ✅ | |

## Prerequisites

- **`yq`** for in-place YAML edits. Install with `wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/local/bin/yq && chmod +x /usr/local/bin/yq` or `pip install yq` (the latter is a different tool that shares the name and has similar syntax). If neither is available, a `sed` fallback is shown inline where `yq` is used.

## Container / driver matrix

| Container tag | Backend version | CUDA | Min NVIDIA driver |
|---|---|---|---|
| `tensorrtllm-runtime:1.0.2` | TRT-LLM `v1.3.0rc5.post1` | `v13.1` | `580+` |
| `vllm-runtime:1.0.2` | vLLM `v0.16.0` | `v12.9` | `575+` |
| `vllm-runtime:1.0.2-cuda13` | vLLM `v0.16.0` | `v13.0` | `580+` |
| `sglang-runtime:1.0.2` | SGLang `v0.5.9` | `v12.9` | `575+` |
| `sglang-runtime:1.0.2-cuda13` | SGLang `v0.5.9` | `v13.0` | `580+` |

Source of truth: [`docs/reference/support-matrix.md`](/dynamo/resources/support-matrix#cuda-and-driver-requirements) and [`docs/reference/release-artifacts.md`](/dynamo/resources/release-artifacts). If those differ from the values above, the source-of-truth files win.
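As a quick pre-flight check, the matrix above can be turned into a lookup. This is only a sketch: the `MIN_DRIVER` table is copied from the matrix and can go stale, the helper `driver_ok` is hypothetical, and reading `580+` as "driver major version at least 580" is an assumption.

```python
# Minimum driver major version per container tag, copied from the matrix above.
# The support-matrix and release-artifacts pages remain the source of truth.
MIN_DRIVER = {
    "tensorrtllm-runtime:1.0.2": 580,
    "vllm-runtime:1.0.2": 575,
    "vllm-runtime:1.0.2-cuda13": 580,
    "sglang-runtime:1.0.2": 575,
    "sglang-runtime:1.0.2-cuda13": 580,
}


def driver_ok(container_tag: str, driver_version: str) -> bool:
    """Compare a driver version string such as '575.57.08' with the matrix.

    The version string can be read on the host with:
        nvidia-smi --query-gpu=driver_version --format=csv,noheader
    Assumption: only the major version matters for the 'NNN+' requirement.
    """
    major = int(driver_version.split(".")[0])
    return major >= MIN_DRIVER[container_tag]


for tag in ("sglang-runtime:1.0.2", "tensorrtllm-runtime:1.0.2"):
    status = "ok" if driver_ok(tag, "575.57.08") else "driver too old"
    print(f"{tag}: {status}")
```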
## Quick Start

**Step 1 (host terminal):** Start infrastructure services:

```bash
docker compose -f dev/docker-compose.yml up -d
```

**Step 2 (host terminal):** Pull and run the prebuilt container:

```bash
DYNAMO_VERSION=1.0.2
docker pull nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
docker run --gpus all -it --network host --ipc host \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:$DYNAMO_VERSION
```

The `DYNAMO_VERSION` variable above can be set to any specific available version of the container. To find the available `tensorrtllm-runtime` versions for Dynamo, visit the [NVIDIA NGC Catalog for Dynamo TensorRT-LLM Runtime](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/tensorrtllm-runtime).

**Step 3 (inside the container):** Launch an aggregated serving deployment (uses `Qwen/Qwen3-0.6B` by default):

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

The launch script will automatically download the model and start the TensorRT-LLM engine. You can override the model by setting `MODEL_PATH` and `SERVED_MODEL_NAME` environment variables before running the script.

**Step 4 (host terminal):** Verify the deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
    "stream": true,
    "max_tokens": 30
  }'
```

## Deploy

Deploy TensorRT-LLM with Dynamo on Kubernetes using a `DynamoGraphDeployment`. Before `kubectl apply`, substitute the container image tag in the deployment YAML. The `sed` fallback is shown inline for environments without `yq`:

```bash
# yq
yq -i '(.spec.services[].extraPodSpec.mainContainer.image) |= sub(":1\.0\.2", ":")' deploy.yaml

# sed fallback
sed -i.bak 's|:1\.0\.2|:|g' deploy.yaml
```

For full Kubernetes deployment instructions, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/deploy/README.md).

## Next Steps

- **[Reference Guide](/dynamo/backends/tensor-rt-llm/reference-guide)**: Features, configuration, and operational details
- **[Examples](/dynamo/backends/tensor-rt-llm/examples)**: All deployment patterns with launch scripts
- **[KV Cache Transfer](/dynamo/additional-resources/tensor-rt-llm-details/kv-cache-transfer)**: KV cache transfer methods for disaggregated serving
- **[Observability](/dynamo/backends/tensor-rt-llm/observability)**: Metrics and monitoring
- **[Multinode Examples](/dynamo/additional-resources/tensor-rt-llm-details/multinode-examples)**: Multi-node deployment with SLURM
- **[Deploying TensorRT-LLM with Dynamo on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/deploy/README.md)**: Kubernetes deployment guide

# Reference Guide

## Building a Custom Container

The Dynamo TensorRT-LLM image layers Dynamo on top of the upstream `nvcr.io/nvidia/tensorrt-llm/release` container; it does not build TensorRT-LLM from source. To rebuild it locally, pin a different upstream TRT-LLM tag, or plug in a TRT-LLM image you built from source, see the [Building a Custom Container](/dynamo/additional-resources/tensor-rt-llm-details/building-a-custom-container) guide.

## KV Cache Transfer

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV Cache Transfer Guide](/dynamo/additional-resources/tensor-rt-llm-details/kv-cache-transfer).
## Request Migration Dynamo supports [request migration](/dynamo/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](/dynamo/user-guides/fault-tolerance/request-migration) documentation for configuration details. ## Request Cancellation When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests. ### Cancellation Support Matrix | | Prefill | Decode | |-|---------|--------| | **Aggregated** | ✅ | ✅ | | **Disaggregated** | ✅ | ✅ | For more details, see the [Request Cancellation Architecture](/dynamo/user-guides/fault-tolerance/request-cancellation) documentation. ## Multiple Choices (`n`) Dynamo forwards OpenAI-compatible multiple-choice requests to TensorRT-LLM using `n`. For an `n > 1` request on TensorRT-LLM's default deterministic decoding path, set `TLLM_ALLOW_N_GREEDY_DECODING=1` in the TensorRT-LLM worker environment. Without it, TensorRT-LLM rejects the request before generation. If a test or deployment intentionally validates `n > 1` for that path, set: ```bash export TLLM_ALLOW_N_GREEDY_DECODING=1 ``` Scope this environment variable to the specific TensorRT-LLM worker or test configuration that needs `n > 1`. For Dynamo E2E tests, set it on the relevant `EngineConfig.env` rather than globally, and keep the client request OpenAI-shaped with `n` instead of adding `best_of`. TensorRT-LLM documents `n`/`best_of` behavior and validates this guard as greedy decoding in [`tensorrt_llm/sampling_params.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/sampling_params.py). ## Multimodal Support Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. 
For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md). ## Diffusion Support (Experimental) Dynamo supports video and image generation using diffusion models through TensorRT-LLM. For requirements, supported models, API usage, and configuration options, see the [Diffusion Guide](/dynamo/backends/tensor-rt-llm/diffusion-experimental). ## Logits Processing Logits processors let you modify the next-token logits at every decoding step. Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM. For the API, examples, and how to bring your own processor, see the [Logits Processing Guide](/dynamo/additional-resources/tensor-rt-llm-details/logits-processing). ## DP Rank Routing (Attention Data Parallelism) TensorRT-LLM supports attention data parallelism for models like DeepSeek, enabling KV-cache-aware routing to specific DP ranks. For configuration and usage details, see the [DP Rank Routing Guide](/dynamo/additional-resources/tensor-rt-llm-details/dp-rank-routing). ## KVBM Integration Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests. See the instructions here: [Running KVBM in TensorRT-LLM](/dynamo/user-guides/kv-cache-offloading#run-kvbm-in-dynamo-with-tensorrt-llm). ## Observability TensorRT-LLM exposes Prometheus metrics for monitoring inference performance. For detailed metrics reference, collection setup, and Grafana integration, see the [Observability Guide](/dynamo/backends/tensor-rt-llm/observability). ## Disabling Python Cyclic GC for high concurrency benchmarks Dynamo with TensorRT-LLM exposes `DYN_TRTLLM_SERVER_DISABLE_GC` to match the behavior of `TRTLLM_SERVER_DISABLE_GC` in `trtllm-serve`. 
When set, the TensorRT-LLM worker disables Python's cyclic garbage collector at startup so that generational GC pauses do not land on the request hot path. Reference-counted deallocation still runs normally — only the cycle collector is turned off. ```bash export DYN_TRTLLM_SERVER_DISABLE_GC=1 ``` This is most useful for high-concurrency benchmarks, where it boosts throughput and stabilizes TTFT/ITL measurements by removing GC-induced tail-latency spikes. ## Known Issues and Mitigations For known issues, workarounds, and mitigations, see the [Known Issues and Mitigations](/dynamo/backends/tensor-rt-llm/known-issues-and-mitigations) page. # Examples For quick start instructions, see the [TensorRT-LLM README](/dynamo/backends/tensor-rt-llm). This document provides all deployment patterns for running TensorRT-LLM with Dynamo, including single-node, multi-node, and Kubernetes deployments. ## Table of Contents - [Infrastructure Setup](#infrastructure-setup) - [Single Node Examples](#single-node-examples) - [Advanced Examples](#advanced-examples) - [Client](#client) - [Benchmarking](#benchmarking) ## Infrastructure Setup For local/bare-metal development, start etcd and optionally NATS using Docker Compose: ```bash docker compose -f dev/docker-compose.yml up -d ``` - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery. - **NATS** is optional - only needed if using NATS-backed KV routing events. Workers must be explicitly configured to publish events. Use ZMQ-backed events or `--no-router-kv-events` on the frontend for routing without NATS. - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD). Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. 
Each shell script simply runs `python3 -m dynamo.frontend ` to start up the ingress and `python3 -m dynamo.trtllm ` to start up the workers.

For detailed information about KV-aware routing behavior, see [Routing Concepts](/dynamo/components/router/routing-concepts). For deployment modes, see the [Router Guide](/dynamo/user-guides/kv-cache-aware-routing).

## Single Node Examples

### Aggregated

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```

### Aggregated with KV Routing

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```

### Disaggregated

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```

### Disaggregated with KV Routing

In the disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```

### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1

```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```

- The first two inference requests incur noticeably higher latency. Send warm-up requests before starting the benchmark.
- MTP performance varies with the acceptance rate of predicted tokens, which depends on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP, so that the model does not speculate on garbage output past the end of sequence and report unrealistic acceptance rates.

## Advanced Examples

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [Multinode Examples](/dynamo/additional-resources/tensor-rt-llm-details/multinode-examples) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes.
While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see the [Llama4 + Eagle](/dynamo/additional-resources/tensor-rt-llm-details/llama-4-eagle) guide to learn how to use these scripts when a single worker fits on a single node. ### Speculative Decoding - **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](/dynamo/additional-resources/tensor-rt-llm-details/llama-4-eagle)** ### Model-Specific Guides - **[Gemma3 with Sliding Window Attention](/dynamo/additional-resources/tensor-rt-llm-details/gemma-3-sliding-window)** - **[GPT-OSS-120b](/dynamo/additional-resources/tensor-rt-llm-details/gpt-oss)** — Reasoning model with tool calling support ### Kubernetes Deployment For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/trtllm/deploy/README.md). ## Client See the [client](/dynamo/backends/sg-lang#testing-the-deployment) section to learn how to send requests to the deployment. To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend `. # Prometheus For general TensorRT-LLM features and configuration, see the [Reference Guide](/dynamo/backends/tensor-rt-llm/reference-guide). --- ## Overview When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint. Additional performance metrics are available via non-Prometheus APIs (see [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) below). 
As of the date of this documentation, the included TensorRT-LLM version 1.1.0rc5 exposes **5 basic Prometheus metrics**. Note that the `trtllm_` prefix is added by Dynamo.

**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/dynamo/user-guides/observability-local/metrics).

**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup).

## Environment Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |

## Getting Started Quickly

This is a single machine example.

### Start Observability Stack

For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions.

### Launch Dynamo Components

Launch a frontend and TensorRT-LLM backend to test metrics:

```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
python -m dynamo.frontend

# Enable system metrics server on port 8081 and enable metrics collection
DYN_SYSTEM_PORT=8081 python -m dynamo.trtllm --model --publish-events-and-metrics
```

**Note:** The `backend` must be set to `"pytorch"` for metrics collection (enforced in `components/src/dynamo/trtllm/main.py`). TensorRT-LLM's `MetricsCollector` integration has only been tested/validated with the PyTorch backend.
Wait for the TensorRT-LLM worker to start, then send requests and check metrics:

```bash
# Send a request
curl -H 'Content-Type: application/json' \
  -d '{
    "model": "",
    "max_completion_tokens": 100,
    "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
  }' \
  http://localhost:8000/v1/chat/completions

# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^trtllm_"
```

## Exposed Metrics

TensorRT-LLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All TensorRT-LLM engine metrics use the `trtllm_` prefix and include labels (e.g., `model_name`, `engine_type`, `finished_reason`) to identify the source.

**Note:** TensorRT-LLM uses `model_name` instead of Dynamo's standard `model` label convention.

**Example Prometheus Exposition Format text:**

```
# HELP trtllm_request_success_total Count of successfully processed requests.
# TYPE trtllm_request_success_total counter
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="stop"} 150.0
trtllm_request_success_total{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm",finished_reason="length"} 5.0

# HELP trtllm_time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE trtllm_time_to_first_token_seconds histogram
trtllm_time_to_first_token_seconds_bucket{le="0.01",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 0.0
trtllm_time_to_first_token_seconds_bucket{le="0.05",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.0
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75

# HELP trtllm_e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE trtllm_e2e_request_latency_seconds histogram
trtllm_e2e_request_latency_seconds_bucket{le="0.5",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 25.0
trtllm_e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2

# HELP trtllm_time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE trtllm_time_per_output_token_seconds histogram
trtllm_time_per_output_token_seconds_bucket{le="0.1",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 120.0
trtllm_time_per_output_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_per_output_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 12.5

# HELP trtllm_request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE trtllm_request_queue_time_seconds histogram
trtllm_request_queue_time_seconds_bucket{le="1.0",model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 140.0
trtllm_request_queue_time_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_request_queue_time_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 32.1
```

**Note:** The specific metrics shown above are examples and may vary depending on your TensorRT-LLM version. Always inspect your actual `/metrics` endpoint for the current list.

### Metric Categories

TensorRT-LLM provides metrics in the following categories (all prefixed with `trtllm_`):

- **Request metrics** - Request success tracking and latency measurements
- **Performance metrics** - Time to first token (TTFT), time per output token (TPOT), and queue time

**Note:** Metrics may change between TensorRT-LLM versions. Always inspect the `/metrics` endpoint for your version.
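For a quick look at these histograms without a Prometheus server, the mean of a histogram can be computed from its `_sum` and `_count` samples. The sketch below parses a few lines taken from the example exposition text above; the helper names `parse_samples` and `histogram_mean` are hypothetical, and the regular expression is a simplified reading of the exposition format, not a full parser.

```python
import re
from typing import Optional

# A few lines from the example exposition text above.
SAMPLE = """\
# TYPE trtllm_time_to_first_token_seconds histogram
trtllm_time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 8.75
trtllm_e2e_request_latency_seconds_count{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 150.0
trtllm_e2e_request_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B",engine_type="trtllm"} 45.2
"""

# metric name, optional {labels}, value, optional integer timestamp
LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>.*)\})?"
    r"\s+(?P<value>\S+)(?:\s+-?\d+)?$"
)


def parse_samples(text: str) -> dict[str, list[float]]:
    """Collect sample values per metric name, ignoring comments and labels."""
    samples: dict[str, list[float]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # skip anything this simplified pattern does not cover
        samples.setdefault(match["name"], []).append(float(match["value"]))
    return samples


def histogram_mean(samples: dict[str, list[float]], base: str) -> Optional[float]:
    """Mean observation = sum / count, aggregated over all label sets."""
    total = sum(samples.get(f"{base}_sum", []))
    count = sum(samples.get(f"{base}_count", []))
    return total / count if count else None


samples = parse_samples(SAMPLE)
for base in ("trtllm_time_to_first_token_seconds", "trtllm_e2e_request_latency_seconds"):
    print(base, histogram_mean(samples, base))
```

In practice the input text would come from the worker's `/metrics` endpoint (port 8081 in the example above) rather than from an inline string.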
## Available Metrics The following metrics are exposed via Dynamo's `/metrics` endpoint (with the `trtllm_` prefix added by Dynamo) for TensorRT-LLM version 1.1.0rc5: - `trtllm_request_success_total` (Counter) — Count of successfully processed requests by finish reason - Labels: `model_name`, `engine_type`, `finished_reason` - `trtllm_e2e_request_latency_seconds` (Histogram) — End-to-end request latency (seconds) - Labels: `model_name`, `engine_type` - `trtllm_time_to_first_token_seconds` (Histogram) — Time to first token, TTFT (seconds) - Labels: `model_name`, `engine_type` - `trtllm_time_per_output_token_seconds` (Histogram) — Time per output token, TPOT (seconds) - Labels: `model_name`, `engine_type` - `trtllm_request_queue_time_seconds` (Histogram) — Time a request spends waiting in the queue (seconds) - Labels: `model_name`, `engine_type` These metric names and availability are subject to change with TensorRT-LLM version updates. TensorRT-LLM provides Prometheus metrics through the `MetricsCollector` class (see [tensorrt_llm/metrics/collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)). ### Additional Operational Metrics Dynamo adds the following operational metrics for TensorRT-LLM workers. These complement the engine's native metrics above with request-level observability that the engine does not provide. All metrics use the `trtllm_` prefix and are automatically enabled when `--publish-events-and-metrics` is set. Metric name constants are defined in `lib/runtime/src/metrics/prometheus_names.rs` (`trtllm_additional` module). #### Request Type Tracking - `trtllm_request_type_image_total` (Counter) — Total number of requests containing image/multimodal content - Labels: `model_name`, `disaggregation_mode`, `engine_type` - `trtllm_request_type_structured_output_total` (Counter) — Total number of requests using guided/structured decoding (JSON, regex, grammar, etc.) 
- Labels: `model_name`, `disaggregation_mode`, `engine_type` #### Abort Tracking - `trtllm_num_aborted_requests_total` (Counter) — Total number of aborted/cancelled requests - Labels: `model_name`, `disaggregation_mode`, `engine_type` #### KV Cache Transfer Metrics (Disaggregated Deployments) These metrics are only recorded in disaggregated (prefill + decode) deployments when a KV cache transfer actually occurs. They are sourced from TensorRT-LLM's `RequestPerfMetrics.timing_metrics`. - `trtllm_kv_transfer_success_total` (Counter) — Total number of successful KV cache transfers (recorded on the decode worker, when it observes non-zero KV-transfer timing in `RequestPerfMetrics.timing_metrics`). Grows in lock-step with the `_count` of the sibling `trtllm_kv_transfer_latency_seconds` / `trtllm_kv_transfer_bytes` / `trtllm_kv_transfer_speed_gb_s` histograms for the same transfer events. - Labels: `model_name`, `disaggregation_mode`, `engine_type` - `trtllm_kv_transfer_latency_seconds` (Histogram) — KV cache transfer latency per request in seconds - Labels: `model_name`, `disaggregation_mode`, `engine_type` - `trtllm_kv_transfer_bytes` (Histogram) — KV cache transfer size per request in bytes - Labels: `model_name`, `disaggregation_mode`, `engine_type` - Buckets: 100KB, 500KB, 1MB, 5MB, 10MB, 50MB, 100MB, 500MB, 1GB, 5GB - `trtllm_kv_transfer_speed_gb_s` (Histogram) — KV cache transfer speed per request in GB/s - Labels: `model_name`, `disaggregation_mode`, `engine_type` ## Non-Prometheus Performance Metrics TensorRT-LLM provides extensive performance data beyond the basic Prometheus metrics. These are not currently exposed to Prometheus. 
### Available via Code References

- **RequestPerfMetrics Structure**: [tensorrt_llm/executor/result.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/executor/result.py) - KV cache, timing, speculative decoding metrics
- **Engine Statistics**: `engine.llm.get_stats_async()` - System-wide aggregate statistics
- **KV Cache Events**: `engine.llm.get_kv_cache_events_async()` - Real-time cache operations

### Example RequestPerfMetrics JSON Structure

```json
{
  "timing_metrics": {
    "arrival_time": 1234567890.123,
    "first_scheduled_time": 1234567890.135,
    "first_token_time": 1234567890.150,
    "last_token_time": 1234567890.300,
    "kv_cache_size": 2048576,
    "kv_cache_transfer_start": 1234567890.140,
    "kv_cache_transfer_end": 1234567890.145
  },
  "kv_cache_metrics": {
    "num_total_allocated_blocks": 100,
    "num_new_allocated_blocks": 10,
    "num_reused_blocks": 90,
    "num_missed_blocks": 5
  },
  "speculative_decoding": {
    "acceptance_rate": 0.85,
    "total_accepted_draft_tokens": 42,
    "total_draft_tokens": 50
  }
}
```

**Note:** These structures are valid as of the date of this documentation but are subject to change with TensorRT-LLM version updates.
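To show how the fields in the example structure relate to familiar latency figures, here is a small sketch that derives a few quantities from it. The formulas are assumptions made for illustration (timestamps in seconds, `kv_cache_size` in bytes, TTFT measured from `arrival_time`); they are not taken from the TensorRT-LLM or Dynamo source, and `derive` is a hypothetical helper.

```python
import json

# Subset of the example RequestPerfMetrics structure shown above.
EXAMPLE = json.loads("""
{
  "timing_metrics": {
    "arrival_time": 1234567890.123,
    "first_scheduled_time": 1234567890.135,
    "first_token_time": 1234567890.150,
    "last_token_time": 1234567890.300,
    "kv_cache_size": 2048576,
    "kv_cache_transfer_start": 1234567890.140,
    "kv_cache_transfer_end": 1234567890.145
  },
  "kv_cache_metrics": {
    "num_total_allocated_blocks": 100,
    "num_reused_blocks": 90
  }
}
""")


def derive(perf: dict) -> dict:
    """Derive per-request figures under the assumptions stated in the lead-in."""
    timing = perf["timing_metrics"]
    blocks = perf["kv_cache_metrics"]
    transfer_s = timing["kv_cache_transfer_end"] - timing["kv_cache_transfer_start"]
    return {
        # Time between arrival and first scheduling.
        "queue_time_s": timing["first_scheduled_time"] - timing["arrival_time"],
        # Time to first token, measured from arrival.
        "ttft_s": timing["first_token_time"] - timing["arrival_time"],
        # End-to-end latency, measured from arrival to the last token.
        "e2e_s": timing["last_token_time"] - timing["arrival_time"],
        "kv_transfer_s": transfer_s,
        # Bytes per second scaled to GB/s; None when no transfer happened.
        "kv_transfer_gb_s": (
            timing["kv_cache_size"] / transfer_s / 1e9 if transfer_s > 0 else None
        ),
        "block_reuse_ratio": (
            blocks["num_reused_blocks"] / blocks["num_total_allocated_blocks"]
        ),
    }


derived = derive(EXAMPLE)
for name, value in derived.items():
    print(f"{name}: {value:.4f}")
```

Because the timestamps are large floating-point values, the differences carry rounding error on the order of a microsecond, so compare them with a tolerance rather than for exact equality.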
## Implementation Details - **Prometheus Integration**: Uses the `MetricsCollector` class from `tensorrt_llm.metrics` (see [collector.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py)) - **Dynamo Integration**: Uses `register_engine_metrics_callback()` function with `metric_prefix_filter=["trtllm_"]` - **Engine Configuration**: `return_perf_metrics` set to `True` when `--publish-events-and-metrics` is enabled - **Initialization**: Metrics appear after TensorRT-LLM engine initialization completes - **Metadata**: `MetricsCollector` initialized with model metadata (model name, engine type) ## Related Documentation ### TensorRT-LLM Metrics - See the [Non-Prometheus Performance Metrics](#non-prometheus-performance-metrics) section above for detailed performance data and source code references - [TensorRT-LLM Metrics Collector](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/metrics/collector.py) - Source code reference ### Dynamo Metrics - [Dynamo Metrics Guide](/dynamo/user-guides/observability-local/metrics) - Complete documentation on Dynamo runtime metrics - [Prometheus and Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup) - Visualization setup instructions - Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside TensorRT-LLM metrics - Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics) - Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants) - Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration # Video Diffusion Support (Experimental) For general TensorRT-LLM features and configuration, see the [Reference Guide](/dynamo/backends/tensor-rt-llm/reference-guide). 

---

Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag and image generation through the `--modality image_diffusion` flag.

## Requirements

- **TensorRT-LLM with visual_gen**: The `visual_gen` module is part of TensorRT-LLM (`tensorrt_llm._torch.visual_gen`). Install TensorRT-LLM following the [official instructions](https://github.com/NVIDIA/TensorRT-LLM#installation).
- **dynamo-runtime with multimodal API**: The Dynamo runtime must include `ModelType.Videos` or `ModelType.Images` support. Ensure you're using a compatible version.
- **Video diffusion: imageio with ffmpeg**: Required for encoding generated frames to MP4 video. The Dynamo TRT-LLM runtime container ships an LGPL-only ffmpeg CLI built with the NVIDIA NVENC H.264 encoder (`h264_nvenc`) and `libvpx_vp9` for WebM, and points `imageio` at it via `IMAGEIO_FFMPEG_EXE=/usr/local/bin/ffmpeg`. The GPL-encumbered ffmpeg binary normally shipped inside the `imageio-ffmpeg` PyPI wheel is **not** installed. If you're running outside the container, install the Python wrapper without the bundled binary and point it at your own ffmpeg:

  ```bash
  pip install --no-binary imageio-ffmpeg "imageio[ffmpeg]"
  export IMAGEIO_FFMPEG_EXE=/path/to/your/ffmpeg
  ```

  MP4 output requires an NVIDIA GPU at runtime (NVENC is a hardware encoder).

## Supported Models

| Diffusers Pipeline | Description | Example Model |
|--------------------|-------------|---------------|
| `WanPipeline` | Wan 2.1/2.2 Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
| `FluxPipeline` | FLUX Text-to-Image | `black-forest-labs/FLUX.1-dev` |

The pipeline type is **auto-detected** from the model's `model_index.json`, so no `--model-type` flag is needed.
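
To see which pipeline a local model checkout declares, you can read `model_index.json` yourself. The sketch below assumes the standard Diffusers layout, in which the pipeline class is stored under the `_class_name` key; the mapping and the `detect_pipeline` helper are illustrative and do not reproduce Dynamo's internal detection code.

```python
import json
from pathlib import Path

# Pipeline class -> modality flag, taken from the Supported Models table above.
# This mapping is an illustration, not Dynamo's actual lookup table.
PIPELINE_TO_MODALITY = {
    "WanPipeline": "video_diffusion",
    "FluxPipeline": "image_diffusion",
}


def detect_pipeline(model_dir: str) -> tuple[str, str]:
    """Return (pipeline class name, matching --modality value) for a model directory.

    Hypothetical helper: assumes model_dir contains a Diffusers-style
    model_index.json with a "_class_name" entry.
    """
    index_path = Path(model_dir) / "model_index.json"
    index = json.loads(index_path.read_text())
    class_name = index["_class_name"]
    if class_name not in PIPELINE_TO_MODALITY:
        raise ValueError(f"Unsupported pipeline class: {class_name}")
    return class_name, PIPELINE_TO_MODALITY[class_name]
```

Calling `detect_pipeline("/path/to/model")` on a directory that holds a downloaded model returns the class name together with the modality value to pass on the command line.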
## Quick Start

### Video Diffusion

#### Launch worker

```bash
python -m dynamo.trtllm \
  --modality video_diffusion \
  --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
  --media-output-fs-url file:///tmp/dynamo_media
```

#### API Endpoint

Video generation uses the `/v1/videos` endpoint:

```bash
curl -X POST http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cat playing piano",
    "model": "wan_t2v",
    "seconds": 4,
    "size": "832x480",
    "nvext": {
      "fps": 24
    }
  }'
```

### Image Diffusion

#### Launch worker

```bash
python -m dynamo.trtllm \
  --modality image_diffusion \
  --model-path black-forest-labs/FLUX.1-dev \
  --media-output-fs-url file:///tmp/dynamo_media
```

#### API Endpoint

Image generation uses the `/v1/images/generations` endpoint:

```bash
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cat playing piano",
    "model": "black-forest-labs/FLUX.1-dev",
    "size": "256x256"
  }'
```

## Configuration Options

| Flag | Description | Default |
|------|-------------|---------|
| `--media-output-fs-url` | Filesystem URL for storing generated media | `file:///tmp/dynamo_media` |
| `--default-height` | Default image/video height | `480` |
| `--default-width` | Default image/video width | `832` |
| `--default-num-frames` | Default frame count | `81` |
| `--default-num-images-per-prompt` | Default number of images per prompt | `1` |
| `--enable-teacache` | Enable TeaCache optimization | `False` |
| `--disable-torch-compile` | Disable torch.compile | `False` |

## Limitations

- Diffusion is experimental and not recommended for production use
- Only text-to-video and text-to-image are supported in this release (image-to-video planned)
- Requires a GPU with sufficient VRAM for the diffusion model

# Known Issues and Mitigations

For general TensorRT-LLM features and configuration, see the [Reference Guide](/dynamo/backends/tensor-rt-llm/reference-guide).

---

### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)

**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.

**Symptoms:**

- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually time out
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`

**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to time out, leaving workers stuck waiting for phantom transfers and entering an unrecoverable deadlock state.

**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):

```yaml
cache_transceiver_config:
  backend: DEFAULT
  max_tokens_in_buffer: 65536  # Must exceed max ISL
```

For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.

**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)

## Driver mismatch produces cryptic PyTorch errors

When the host NVIDIA driver is too old for the container's CUDA version, PyTorch surfaces the failure as:

```text
RuntimeError: The NVIDIA driver on your system is too old (found version 570). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx
```

This is the symptom, not the cause: the container image you pulled needs a newer driver than the host ships.
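
A pre-flight check along the following lines can surface the mismatch before PyTorch does. This is a minimal sketch: `nvidia-smi --query-gpu=driver_version --format=csv,noheader` is an existing query, but the helper names are hypothetical and the minimum major version you pass in is an assumption that should come from the container / driver matrix for your tag.

```python
import re
import subprocess


def parse_driver_major(version: str) -> int:
    """Extract the major component from a driver string such as '570.86.15'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized driver version: {version!r}")
    return int(match.group(1))


def driver_is_sufficient(found: str, minimum_major: int) -> bool:
    """Return True when the host driver major version meets the minimum."""
    return parse_driver_major(found) >= minimum_major


def query_host_driver() -> str:
    """Ask nvidia-smi for the driver version of the first GPU (requires a GPU host)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a GPU host, `driver_is_sufficient(query_host_driver(), minimum_major)` returns `False` when the image needs a newer driver than the host provides.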
**Fix:** - Check the minimum driver for the tag you pulled in the [Container / driver matrix](/dynamo/backends/tensor-rt-llm#container--driver-matrix). - Either upgrade the host driver, or pull a lower-CUDA variant (e.g. `vllm-runtime:1.0.2` on driver `575+` instead of `vllm-runtime:1.0.2-cuda13` on driver `580+`). > The driver-mismatch error message itself is being improved — tracked as an engineering follow-up. # vLLM # LLM Deployment using vLLM Dynamo vLLM integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with vLLM's native engine arguments. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation. ## Installation ### Install Latest Release We recommend using [uv](https://github.com/astral-sh/uv) to install: ```bash uv venv --python 3.12 --seed uv pip install "ai-dynamo[vllm]" ``` This installs Dynamo with the compatible vLLM version. --- ### Container We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts): ```bash docker pull nvcr.io/nvidia/ai-dynamo/vllm-runtime: ./container/run.sh -it --framework VLLM --image nvcr.io/nvidia/ai-dynamo/vllm-runtime: ``` ```bash python container/render.py --framework vllm --output-short-filename docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm . ``` ```bash ./container/run.sh -it --framework VLLM [--mount-workspace] ``` ### Development Setup For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/.devcontainer) which has all dependencies pre-installed. 
## Feature Support Matrix | Feature | Status | Notes | |---------|--------|-------| | [**Disaggregated Serving**](/dynamo/design-docs/disaggregated-serving) | ✅ | Prefill/decode separation with NIXL KV transfer | | [**KV-Aware Routing**](/dynamo/components/router) | ✅ | | | [**SLA-Based Planner**](/dynamo/components/planner/planner-guide) | ✅ | | | [**KVBM**](/dynamo/components/kvbm) | ✅ | | | [**LMCache**](/dynamo/integrations/kv-cache-integrations/lm-cache) | ✅ | CUDA 12.9 and arm64/aarch64 containers may require building LMCache from source | | [**FlexKV**](/dynamo/integrations/kv-cache-integrations/flex-kv) | ✅ | | | [**Multimodal Support**](/dynamo/backends/v-llm/v-llm-omni) | ✅ | Via vLLM-Omni integration | | [**Observability**](/dynamo/backends/v-llm/observability) | ✅ | Metrics and monitoring | | **WideEP** | ✅ | Support for DeepEP | | **DP Rank Routing** | ✅ | [Hybrid load balancing](https://docs.vllm.ai/en/stable/serving/data_parallel_deployment/?h=external+dp#hybrid-load-balancing) via external DP rank control | | [**LoRA**](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/launch/lora/README.md) | ✅ | Dynamic loading/unloading from S3-compatible storage | | **GB200 Support** | ✅ | Container functional on main | ## Quick Start Start infrastructure services for local development: ```bash docker compose -f dev/docker-compose.yml up -d ``` Launch an aggregated serving deployment: ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/agg.sh ``` > **Running launch scripts standalone.** The `launch/*.sh` scripts expect etcd and NATS to be reachable on localhost. Bring them up first (run from the repo root, or use the absolute path shown): > > ```bash > docker compose -f "$DYNAMO_HOME/dev/docker-compose.yml" up -d > ``` > > Then run the launch script. Without these, workers register but the frontend cannot discover them and requests hang. 
## Next Steps

- **[Reference Guide](/dynamo/backends/v-llm/reference-guide)**: Configuration, arguments, and operational details
- **[Examples](/dynamo/backends/v-llm/examples)**: All deployment patterns with launch scripts
- **[KV Cache Offloading](/dynamo/backends/v-llm/kv-cache-offloading)**: KVBM, LMCache, and FlexKV integrations
- **[Observability](/dynamo/backends/v-llm/observability)**: Metrics and monitoring
- **[vLLM-Omni](/dynamo/backends/v-llm/v-llm-omni)**: Multimodal model serving
- **[Kubernetes Deployment](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/deploy/README.md)**: Kubernetes deployment guide
- **[vLLM Documentation](https://docs.vllm.ai/en/stable/)**: Upstream vLLM serve arguments

# Reference Guide

## Overview

The vLLM backend in Dynamo integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting.

Dynamo vLLM uses vLLM's native argument parser; all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.

## Argument Reference

The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI:

```bash
python -m dynamo.vllm --help
```

The `--help` output is organized into the following groups:

- **Dynamo Runtime Options**: Namespace, discovery backend, request/event plane, endpoint types, tool/reasoning parsers, and custom chat templates. These are common across all Dynamo backends and use `DYN_*` env vars.
- **Dynamo vLLM Options**: Disaggregation mode, tokenizer selection, sleep mode, multimodal flags, vLLM-Omni pipeline configuration, headless mode, and ModelExpress. These use `DYN_VLLM_*` env vars.
- **vLLM Engine Options** — All native vLLM arguments (`--model`, `--tensor-parallel-size`, `--kv-transfer-config`, `--kv-events-config`, `--enable-prefix-caching`, etc.). See the [vLLM serve args documentation](https://docs.vllm.ai/en/stable/configuration/serve_args.html). ### Tool and Reasoning Parsers Use `--dyn-tool-call-parser` and `--dyn-reasoning-parser` to match the model's output format when the model emits tool calls and/or reasoning content. The current supported values are documented in [Tool Call Parsing (Dynamo)](../../tool-calling/dynamo.md#supported-tool-call-parsers) and [Reasoning Parsing (Dynamo)](../../reasoning/dynamo.md#supported-reasoning-parsers). ### Prompt Embeddings Dynamo supports [vLLM prompt embeddings](https://docs.vllm.ai/en/stable/features/prompt_embeds.html) — pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker. - Enable with `--enable-prompt-embeds` (disabled by default) - Embeddings are sent as base64-encoded PyTorch tensors via the `prompt_embeds` field in the Completions API - NATS must be configured with a 15MB max payload for large embeddings (already set in default deployments) ## Hashing Consistency for KV Events When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following: - Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's built-in hashing for prefix caching. - If your vLLM version supports it, configure a deterministic prefix caching algorithm: ```bash vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256 ``` See the high-level notes in [Router Design](/dynamo/design-docs/component-design/router-design#deterministic-event-ids) on deterministic event IDs. ## Graceful Shutdown vLLM workers use Dynamo's graceful shutdown mechanism. When a `SIGTERM` or `SIGINT` is received: 1. 
**Discovery unregister**: The worker is removed from service discovery so no new requests are routed to it 2. **Grace period**: In-flight requests are allowed to complete (configurable via `DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS`, default 5s) 3. **Resource cleanup**: Engine resources and temporary files (Prometheus dirs, LoRA adapters) are released All vLLM endpoints use `graceful_shutdown=True`, meaning they wait for in-flight requests to finish before exiting. An internal `VllmEngineMonitor` also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive. For more details, see [Graceful Shutdown](/dynamo/user-guides/fault-tolerance/graceful-shutdown). ## Health Checks Each worker type has a specialized health check payload that validates the full inference pipeline: | Worker Type | Health Check Strategy | |------------|----------------------| | Decode / Aggregated | Short generation request (`max_tokens=1`) using the model's BOS token | | Prefill | Same payload structure as decode, adapted for prefill request format | | vLLM-Omni | Short generation request via AsyncOmni with the model's BOS token | Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. The payload can be overridden via `DYN_HEALTH_CHECK_PAYLOAD` environment variable. See [Health Checks](/dynamo/user-guides/observability-local/health-checks) for the broader health check architecture. ## Request Cancellation When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources. | | Prefill | Decode | |-|---------|--------| | **Aggregated** | ✅ | ✅ | | **Disaggregated** | ✅ | ✅ | For more details, see the [Request Cancellation Architecture](/dynamo/user-guides/fault-tolerance/request-cancellation) documentation. 
## Request Migration Dynamo supports [request migration](/dynamo/user-guides/fault-tolerance/request-migration) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](/dynamo/user-guides/fault-tolerance/request-migration) documentation for configuration details. ## See Also - **[Examples](/dynamo/backends/v-llm/examples)**: All deployment patterns with launch scripts - **[vLLM README](/dynamo/backends/v-llm)**: Quick start and feature overview - **[Observability](/dynamo/backends/v-llm/observability)**: Metrics and monitoring setup - **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: KV-aware routing configuration - **[Fault Tolerance](/dynamo/user-guides/fault-tolerance)**: Request migration, cancellation, and graceful shutdown # vLLM Chat Processor The vLLM chat processor enables vLLM-native preprocessing and postprocessing in the Dynamo frontend. It uses vLLM's tokenizer, chat templates, tool call parser, and reasoning parser directly -- bypassing the default Rust preprocessor for `v1/chat/completions` requests. ## When to Use Use `--dyn-chat-processor vllm` when Dynamo's built-in Rust preprocessor does not yet support a tool call parser or reasoning parser you need. The vLLM processor delegates to vLLM's Python implementations, so any parser vLLM supports works immediately. Common cases: - A **tool call format** not yet in the Rust `tool_calling` library - A **reasoning parser** not yet supported natively - A **chat template** that the Rust preprocessor doesn't handle correctly If the parser you need is missing from the Rust preprocessor, consider [opening an issue or PR](https://github.com/ai-dynamo/dynamo/issues) to add native support -- native parsers avoid the Python GIL overhead entirely. 
## Quick Start ```bash # Frontend with vLLM processor, tool calling, and reasoning python -m dynamo.frontend \ --router-mode kv \ --dyn-chat-processor vllm \ --enable-auto-tool-choice \ --tool-call-parser hermes \ --reasoning-parser qwen3 # Workers (unchanged) CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \ --model Qwen/Qwen3-14B-FP8 \ --served-model-name Qwen/Qwen3-14B-FP8 \ --tensor-parallel-size 1 --trust-remote-code ``` ## Frontend Arguments These arguments are passed to the **frontend** (not the worker) when using `--dyn-chat-processor vllm`. The frontend forwards unknown arguments to vLLM's own CLI parser (`AsyncEngineArgs` and `FrontendArgs`), so any vLLM frontend or engine flag is accepted. | Argument | Default | Description | |----------|---------|-------------| | `--dyn-chat-processor vllm` | (none) | Enable the vLLM chat processor | | `--tool-call-parser` | `None` | Tool call parser name (any vLLM-supported parser) | | `--reasoning-parser` | `None` | Reasoning parser name (any vLLM-supported parser) | | `--enable-auto-tool-choice` | `False` | Allow the model to emit tool calls without an explicit `tool_choice` in the request | ### Environment Variables | Variable | Default | Description | |----------|---------|-------------| | `DYN_VLLM_STREAM_INTERVAL` | `20` | Number of tokens to accumulate before detokenizing. Higher values improve throughput. The first chunk always emits immediately (interval=1) to minimize time-to-first-token. | ## Tool Calling The processor supports all vLLM tool call formats. Pass `--tool-call-parser` (and typically `--enable-auto-tool-choice`) on the frontend: ```bash python -m dynamo.frontend \ --dyn-chat-processor vllm \ --enable-auto-tool-choice \ --tool-call-parser hermes ``` Any parser supported by vLLM can be used. See the [vLLM documentation](https://docs.vllm.ai/) for the full list of available tool call parsers. 
### Example: Tool Call Request ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-14B-FP8", "messages": [{"role": "user", "content": "What is the weather in Paris?"}], "tools": [{ "type": "function", "function": { "name": "get_weather", "description": "Get weather for a city", "parameters": { "type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"] } } }], "tool_choice": "auto" }' ``` Response: ```json { "choices": [{ "message": { "role": "assistant", "tool_calls": [{ "id": "call_8cd24396f3671048", "type": "function", "function": { "name": "get_weather", "arguments": "{\"city\": \"Paris\"}" } }], "reasoning_content": "The user wants weather info for Paris..." }, "finish_reason": "tool_calls" }] } ``` ## Reasoning Parsing For models that produce chain-of-thought reasoning (e.g., Qwen3, DeepSeek-R1), pass `--reasoning-parser`: ```bash python -m dynamo.frontend \ --dyn-chat-processor vllm \ --reasoning-parser qwen3 ``` The parser separates think tag content into the `reasoning_content` field and regular content into the `content` field. ## See Also - **[Tool Calling](/dynamo/user-guides/parsing/tool-call-parsing-dynamo)**: General tool calling guide - **[Reference Guide](/dynamo/backends/v-llm/reference-guide)**: Full vLLM backend reference - **[Examples](/dynamo/backends/v-llm/examples)**: vLLM deployment examples # Examples # vLLM Examples For quick start instructions, see the [vLLM README](/dynamo/backends/v-llm). This document provides all deployment patterns for running vLLM with Dynamo, including aggregated, disaggregated, KV-routed, and expert-parallel configurations. 
## Table of Contents - [Infrastructure Setup](#infrastructure-setup) - [LLM Serving](#llm-serving) - [Advanced Examples](#advanced-examples) - [Kubernetes Deployment](#kubernetes-deployment) - [Troubleshooting](#troubleshooting) ## Infrastructure Setup For local/bare-metal development, start etcd and optionally NATS using Docker Compose: ```bash docker compose -f dev/docker-compose.yml up -d ``` - **etcd** is optional but is the default local discovery backend. File-based discovery is also available (see `python -m dynamo.vllm --help` for `--discovery-backend` options). - **NATS** is only needed when using NATS-backed KV routing events. ZMQ-backed events and prediction-based routing do not require NATS. - **On Kubernetes**, neither is required when using the Dynamo operator. Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for better log visibility. For AI agents working with Dynamo, you can run the launch script in the background and use the `curl` commands to test the deployment. ## LLM Serving ### Aggregated Serving The simplest deployment pattern: a single worker handles both prefill and decode. Requires 1 GPU. Run on CUDA devices: ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/agg.sh ``` Run on XPUs: ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/xpu/agg_xpu.sh ``` ### Aggregated Serving with KV Routing Two workers behind a [KV-aware router](/dynamo/components/router) that maximizes cache reuse. Requires 2 GPUs. Run on CUDA devices: ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/agg_router.sh ``` Run on XPUs: ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/xpu/agg_router_xpu.sh ``` This launches the frontend in KV routing mode with two workers publishing KV events over ZMQ. ### Disaggregated Serving Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs. 
```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/disagg.sh ``` ### Disaggregated Serving with KV Routing Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs. ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/disagg_router.sh ``` The frontend runs in KV routing mode and automatically detects prefill workers to activate an internal prefill router. ### Data Parallel / Expert Parallelism Launches 4 data-parallel workers with expert parallelism behind a KV-aware router. Uses a Mixture-of-Experts model (`Qwen/Qwen3-30B-A3B`). Requires 4 GPUs. ```bash cd $DYNAMO_HOME/examples/backends/vllm bash launch/dep.sh ``` Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker. ## Advanced Examples ### Speculative Decoding Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model for faster inference while maintaining accuracy. **Guide:** [Speculative Decoding Quickstart](/dynamo/additional-resources/speculative-decoding/speculative-decoding-with-v-llm) > **See also:** [Speculative Decoding Feature Overview](/dynamo/additional-resources/speculative-decoding) for cross-backend documentation. ### Multimodal Serve multimodal models using the vLLM-Omni integration. **Guide:** [vLLM-Omni](/dynamo/backends/v-llm/v-llm-omni) ### Multi-Node Deploy vLLM across multiple nodes using Dynamo's distributed capabilities. Multi-node deployments require network connectivity between nodes and firewall rules allowing NATS/ETCD communication. 
Start NATS/ETCD on the head node so all worker nodes can reach them: ```bash # On head node docker compose -f dev/docker-compose.yml up -d # Set on ALL nodes export HEAD_NODE_IP="" export NATS_SERVER="nats://${HEAD_NODE_IP}:4222" export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379" ``` For multi-node tensor/pipeline parallelism (when TP x PP exceeds GPUs on a single node), see [`launch/multi_node_tp.sh`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/launch/multi_node_tp.sh). For details on distributed execution, see the [vLLM multiprocessing docs](https://docs.vllm.ai/en/stable/serving/parallelism_scaling/#running-vllm-with-multiprocessing). ### DeepSeek-R1 Dynamo supports DeepSeek R1 with data parallel attention and wide expert parallelism. Each DP attention rank is a separate Dynamo component emitting its own KV events and metrics. Run on 2 nodes (16 GPUs, dp=16): ```bash # Node 0 cd $DYNAMO_HOME/examples/backends/vllm ./launch/dsr1_dep.sh --num-nodes 2 --node-rank 0 --gpus-per-node 8 --master-addr # Node 1 ./launch/dsr1_dep.sh --num-nodes 2 --node-rank 1 --gpus-per-node 8 --master-addr ``` See [`launch/dsr1_dep.sh`](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/launch/dsr1_dep.sh) for configurable options. ## Kubernetes Deployment For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [vLLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/deploy/README.md). See also the [Kubernetes Deployment Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) for general Dynamo K8s documentation. ## Troubleshooting ### Workers Fail to Start with NIXL Errors Ensure NIXL is installed and the side-channel ports are not in conflict. Each worker in a multi-worker setup needs a unique `VLLM_NIXL_SIDE_CHANNEL_PORT`. ### KV Router Not Routing Correctly Ensure `PYTHONHASHSEED=0` is set for all vLLM processes when using KV-aware routing. 
See [Hashing Consistency](/dynamo/backends/v-llm/reference-guide#hashing-consistency-for-kv-events) for details.

### GPU OOM on Startup

If a previous run left orphaned GPU processes, the next launch may OOM. Check for zombie processes:

```bash
nvidia-smi  # look for lingering python processes
kill -9
```

## See Also

- **[vLLM README](/dynamo/backends/v-llm)**: Quick start and feature overview
- **[Reference Guide](/dynamo/backends/v-llm/reference-guide)**: Configuration, arguments, and operational details
- **[Observability](/dynamo/backends/v-llm/observability)**: Metrics and monitoring
- **[Benchmarking](/dynamo/user-guides/benchmarking)**: Performance benchmarking tools
- **[Tuning Disaggregated Performance](/dynamo/additional-resources/tuning-disaggregated-performance)**: P/D tuning guide

# KV Cache Offloading

Dynamo supports multiple KV cache offloading backends for vLLM, allowing you to extend effective KV cache capacity beyond GPU memory using CPU RAM and disk storage. Each backend integrates through vLLM's connector interface and works with both aggregated and disaggregated serving.

| Backend                 | Source                                           |
| ----------------------- | ------------------------------------------------ |
| **[KVBM](#kvbm)**       | [Dynamo](/dynamo/components/kvbm)                |
| **[LMCache](#lmcache)** | [GitHub](https://github.com/LMCache/LMCache)     |
| **[FlexKV](#flexkv)**   | [GitHub](https://github.com/taco-project/FlexKV) |

## KVBM

[KVBM](/dynamo/components/kvbm) (KV Block Manager) is Dynamo's built-in KV cache offloading system. It provides a three-layer architecture (LLM runtime, logical block management, NIXL transport) with support for CPU and disk cache tiers, and integrates natively with Dynamo's KV-aware routing and disaggregated serving.
| Deployment | Launch Script | | -------------------------- | --------------------------------------------------------------------------------------- | | Aggregated | [`agg_kvbm.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_kvbm.sh) | | Aggregated + KV routing | [`agg_kvbm_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_kvbm_router.sh) | | Disaggregated (1P1D) | [`disagg_kvbm.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/disagg_kvbm.sh) | | Disaggregated (2P2D) | [`disagg_kvbm_2p2d.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/disagg_kvbm_2p2d.sh) | | Disaggregated + KV routing | [`disagg_kvbm_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/disagg_kvbm_router.sh) | For configuration details, see the [KVBM Guide](/dynamo/user-guides/kv-cache-offloading). ## LMCache [LMCache](https://github.com/LMCache/LMCache) is an open-source KV cache engine that provides prefill-once, reuse-everywhere caching with multi-level storage backends (CPU RAM, local storage, Redis, GDS, InfiniStore/Mooncake). 
| Deployment | Launch Script | | ----------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | | Aggregated (MP sidecar — recommended) | [`agg_lmcache_mp.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_lmcache_mp.sh) | | Aggregated (legacy, in-process) | [`agg_lmcache.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_lmcache.sh) | | Aggregated (legacy, multiprocess metrics) | [`agg_lmcache_multiproc.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_lmcache_multiproc.sh) | | Disaggregated | [`disagg_lmcache.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/disagg_lmcache.sh) | For configuration details, see the [LMCache Integration Guide](/dynamo/integrations/kv-cache-integrations/lm-cache). ## FlexKV [FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed KV cache runtime developed by Tencent Cloud's TACO team. It supports multi-level caching (GPU, CPU, SSD), distributed KV cache reuse across nodes, and high-performance I/O via io_uring and GPUDirect Storage. | Deployment | Launch Script | | ----------------------- | ------------------------------------------------------------------------------------- | | Aggregated | [`agg_flexkv.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_flexkv.sh) | | Aggregated + KV routing | [`agg_flexkv_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/agg_flexkv_router.sh) | | Disaggregated | [`disagg_flexkv.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/disagg_flexkv.sh) | For configuration details, see the [FlexKV Integration Guide](/dynamo/integrations/kv-cache-integrations/flex-kv). 
## See Also - **[KVBM Design](/dynamo/design-docs/component-design/kvbm-design)**: Architecture and design of Dynamo's built-in KV cache offloading - **[Routing Concepts](/dynamo/components/router/routing-concepts)**: Routing requests based on KV cache state - **[Disaggregated Serving](/dynamo/design-docs/disaggregated-serving)**: Prefill/decode separation architecture # Prometheus ## Overview When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint. **For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/stable/design/metrics.html). **For LMCache metrics and integration**, see the [LMCache Integration Guide](/dynamo/integrations/kv-cache-integrations/lm-cache). **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](/dynamo/user-guides/observability-local/metrics). **For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](/dynamo/user-guides/observability-local/prometheus-grafana-setup). ## Environment Variables and Flags | Variable | Description | Default | Example | |----------|-------------|---------|---------| | `DYN_SYSTEM_PORT` | System metrics/health port. Required to expose `/metrics` endpoint. | `-1` (disabled) | `8081` | ## Getting Started Quickly This is a single machine example. ### Start Observability Stack For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](/dynamo/user-guides/observability-local#getting-started-quickly) for instructions. ### Launch Dynamo Components The launch scripts in `examples/backends/vllm/launch/` already enable metrics on port 8081 by default. 
For example:

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
```

Once the deployment is running, send a request and check metrics:

```bash
curl -s localhost:8081/metrics | grep "^vllm:"
```

## Exposed Metrics

vLLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All vLLM engine metrics use the `vllm:` prefix and include labels (e.g., `model_name`, `finished_reason`, `scheduling_event`) to identify the source.

**Example Prometheus Exposition Format text:**

```
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0

# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
```

**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/stable/design/metrics.html) for the current list.
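To make the exposition format more concrete, the following is a minimal Python sketch that parses sample lines like the ones above and derives a mean time to first token from the `_sum` and `_count` series. It is not part of Dynamo or vLLM: the names `parse_samples` and `mean_ttft_seconds` are hypothetical, the parser handles only the simple `name{labels} value` shape shown here (no timestamps or escaped edge cases), and for real scraping a Prometheus client library is the safer choice.

```python
import re

# Sample text copied from the example above (assumed shape: name{labels} value).
SAMPLE = """\
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
"""

# Metric name, optional {label block}, then a single numeric value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, prefix="vllm:"):
    """Return (name, labels, value) tuples for series starting with `prefix`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def mean_ttft_seconds(samples):
    """Histogram mean = _sum / _count, or None if nothing was observed."""
    total = sum(v for n, _, v in samples if n == "vllm:time_to_first_token_seconds_sum")
    count = sum(v for n, _, v in samples if n == "vllm:time_to_first_token_seconds_count")
    return total / count if count else None


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    print(len(parsed), "vllm: samples")
    print("mean TTFT (s):", mean_ttft_seconds(parsed))
```

On the sample text this yields four `vllm:` samples and a mean time to first token of roughly 0.54 s (89.38 / 165). Passing `prefix="dynamo_"` would select the Dynamo runtime series from the same endpoint in the same way.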
### Metric Categories

vLLM provides metrics in the following categories (all prefixed with `vllm:`):

- **Request metrics** - Request success, failure, and completion tracking
- **Performance metrics** - Latency, throughput, and timing measurements
- **Resource usage** - System resource consumption
- **Scheduler metrics** - Scheduling and queue management
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)

**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/stable/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.

## Available Metrics

The official vLLM documentation includes complete metric definitions with:

- Detailed explanations and design rationale
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
- Information about v1 metrics migration
- Future work and deprecated metrics

For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/stable/design/metrics.html).

## LMCache Metrics

The `lmcache server` runs as an out-of-process sidecar and exposes its metrics (prefixed `lmcache_mp_`) on its own HTTP admin port (default `:8080/metrics`). vLLM and Dynamo metrics remain on Dynamo's `:8081/metrics`.
To try it out, use the LMCache launch script:

```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_lmcache_mp.sh
```

Send a request and view LMCache metrics:

```bash
curl -s localhost:8080/metrics | grep '^lmcache_mp_'
```

### Troubleshooting

Troubleshooting LMCache-related metrics and logs is documented in:

- [LMCache Integration Guide](/dynamo/integrations/kv-cache-integrations/lm-cache#troubleshooting)

**For complete LMCache configuration and metric details**, see:

- [LMCache Integration Guide](/dynamo/integrations/kv-cache-integrations/lm-cache) - Setup and configuration
- [LMCache Observability Documentation](https://docs.lmcache.ai/mp/observability.html) - Complete metrics reference

## Implementation Details

- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
- `PROMETHEUS_MULTIPROC_DIR`: (optional). By default, Dynamo automatically manages this environment variable, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when `/metrics` is scraped. Users only need to set this explicitly where complete control over the metrics directory is required.
- Dynamo uses `MultiProcessCollector` to aggregate metrics from all worker processes
- Metrics on Dynamo's `:8081/metrics` are filtered to `vllm:*` and `dynamo_*` series. `lmcache_mp_*` series are served by the `lmcache server` process on its own `:8080/metrics`.
- The integration uses Dynamo's `register_engine_metrics_callback()` function with the global `REGISTRY`
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/stable/design/metrics.html) for migration details

## Related Documentation

### vLLM Metrics

- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/stable/design/metrics.html)
- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/stable/usage/metrics.html)
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics)

### Dynamo Metrics

- [Dynamo Metrics Guide](/dynamo/user-guides/observability-local/metrics) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](/dynamo/user-guides/observability-local/prometheus-grafana-setup) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside vLLM metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration

# vLLM-Omni

Dynamo supports multimodal generation through the [vLLM-Omni](https://github.com/vllm-project/vllm-omni) backend. This integration exposes text-to-image, text-to-video, image-to-video, and text-to-audio (TTS) capabilities via OpenAI-compatible API endpoints.

## Prerequisites

This guide assumes familiarity with deploying Dynamo with vLLM as described in the [vLLM backend guide](/dynamo/backends/v-llm).

### Installation

Dynamo container images include vLLM-Omni pre-installed. If you are using `pip install ai-dynamo[vllm]`, vLLM-Omni is **not** included automatically because the matching release is not yet available on PyPI.
Install it separately from source, pinning the vLLM-Omni release that matches your installed vLLM version (see the [vLLM-Omni releases](https://github.com/vllm-project/vllm-omni/releases) page):

```bash
pip install git+https://github.com/vllm-project/vllm-omni.git@
```

> **ARM64 not supported:** vLLM-Omni is currently only installed on `amd64` builds. On `arm64`, the container build skips the install and vLLM-Omni features are unavailable.

## Supported Modalities

| Modality | Endpoint(s) | `--output-modalities` |
|---|---|---|
| Text-to-Image | `/v1/chat/completions`, `/v1/images/generations` | `image` |
| Text-to-Video | `/v1/videos` | `video` |
| Image-to-Video | `/v1/videos` | `video` |
| Text-to-Audio (TTS) | `/v1/audio/speech` | `audio` |

The `--output-modalities` flag determines which endpoint(s) the worker registers. When set to `image`, both `/v1/chat/completions` (returns inline base64 images) and `/v1/images/generations` are available. When set to `video`, the worker serves `/v1/videos`. When set to `audio`, the worker serves `/v1/audio/speech`.
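When scripting against several workers, it can help to keep this mapping in one place. The snippet below is only a restatement of the table as a Python lookup, not Dynamo code: the names `ENDPOINTS_BY_MODALITY` and `endpoints_for` are hypothetical, and modalities the table does not list are deliberately rejected rather than guessed.

```python
# Client-side restatement of the Supported Modalities table (illustrative only).
ENDPOINTS_BY_MODALITY = {
    "image": ("/v1/chat/completions", "/v1/images/generations"),
    "video": ("/v1/videos",),
    "audio": ("/v1/audio/speech",),
}


def endpoints_for(modality: str) -> tuple[str, ...]:
    """Return the endpoints a worker started with this modality is documented to serve."""
    try:
        return ENDPOINTS_BY_MODALITY[modality]
    except KeyError:
        # The table above gives no endpoint for other values, so fail loudly.
        raise ValueError(f"no endpoint listed for modality {modality!r}") from None


if __name__ == "__main__":
    for name in ("image", "video", "audio"):
        print(name, "->", ", ".join(endpoints_for(name)))
```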
## Tested Models

| Modality | Models |
|---|---|
| Text-to-Image | `Qwen/Qwen-Image`, `AIDC-AI/Ovis-Image-7B`, `zai-org/GLM-Image` (disagg) |
| Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`, `Wan-AI/Wan2.2-T2V-A14B-Diffusers` |
| Image-to-Video | `Wan-AI/Wan2.2-TI2V-5B-Diffusers`, `Wan-AI/Wan2.2-I2V-A14B-Diffusers` |
| Text-to-Audio (TTS) | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`, `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` |

To run a non-default model, pass `--model` to any launch script:

```bash
bash examples/backends/vllm/launch/agg_omni_image.sh --model AIDC-AI/Ovis-Image-7B
bash examples/backends/vllm/launch/agg_omni_video.sh --model Wan-AI/Wan2.2-T2V-A14B-Diffusers
```

## Text-to-Image

Launch using the provided script with `Qwen/Qwen-Image`:

```bash
bash examples/backends/vllm/launch/agg_omni_image.sh
```

### Via `/v1/chat/completions`

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-Image",
    "messages": [{"role": "user", "content": "A cat sitting on a windowsill"}],
    "stream": false
  }'
```

The response includes base64-encoded images inline:

```json
{
  "choices": [{
    "delta": {
      "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }
  }]
}
```

### Via `/v1/images/generations`

```bash
curl -s http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen-Image",
    "prompt": "A cat sitting on a windowsill",
    "size": "1024x1024",
    "response_format": "url"
  }'
```

## Text-to-Video

Launch using the provided script with `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`:

```bash
bash examples/backends/vllm/launch/agg_omni_video.sh
```

Generate a video via `/v1/videos`:

```bash
curl -s http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    "prompt": "A drone flyover of a mountain landscape",
    "seconds": 2,
    "size": "832x480",
    "response_format": "url"
  }'
```

The response
returns a video URL or base64 data depending on `response_format`:

```json
{
  "id": "...",
  "object": "video",
  "model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
  "status": "completed",
  "data": [{"url": "file:///tmp/dynamo_media/videos/req-abc123.mp4"}]
}
```

The `/v1/videos` endpoint also accepts NVIDIA extensions via the `nvext` field for fine-grained control:

| Field | Description | Default |
|---|---|---|
| `nvext.fps` | Frames per second | 24 |
| `nvext.num_frames` | Number of frames (overrides `fps * seconds`) | -- |
| `nvext.negative_prompt` | Negative prompt for guidance | -- |
| `nvext.num_inference_steps` | Number of denoising steps | 50 |
| `nvext.guidance_scale` | CFG guidance scale | 5.0 |
| `nvext.seed` | Random seed for reproducibility | -- |
| `nvext.boundary_ratio` | MoE expert switching boundary (I2V) | 0.875 |
| `nvext.guidance_scale_2` | CFG scale for low-noise expert (I2V) | 1.0 |

## Image-to-Video

Image-to-video (I2V) uses the same `/v1/videos` endpoint as text-to-video, with an additional `input_reference` field that provides the source image. The image can be an HTTP URL, a base64 data URI, or a local file path.
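When the image lives on the client machine rather than on the worker, the base64 data URI form is usually the portable option. The following is a minimal sketch of building one in Python with only the standard library. The helper names `image_to_data_uri` and `build_i2v_payload` are hypothetical, and the payload fields simply mirror the request examples in this guide rather than any additional documented schema.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a data URI of the form data:<mime>;base64,<payload>."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_i2v_payload(model: str, prompt: str, image_path: str, size: str = "832x480") -> dict:
    """Assemble a /v1/videos request body with the image inlined as input_reference."""
    return {
        "model": model,
        "prompt": prompt,
        "input_reference": image_to_data_uri(image_path),
        "size": size,
        "response_format": "url",
    }
```

The resulting dictionary can be serialized with `json.dumps` and sent with any HTTP client. Inlining large images inflates the request body by about a third, so an HTTP URL may be preferable when the worker can reach the image directly.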
Launch with the provided script using `Wan-AI/Wan2.2-TI2V-5B-Diffusers`:

```bash
bash examples/backends/vllm/launch/agg_omni_i2v.sh
```

Generate a video from an image:

```bash
curl -s http://localhost:8000/v1/videos \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
    "prompt": "A bear playing with yarn, smooth motion",
    "input_reference": "https://example.com/bear.png",
    "size": "832x480",
    "response_format": "url",
    "nvext": {
      "num_inference_steps": 40,
      "num_frames": 33,
      "guidance_scale": 1.0,
      "boundary_ratio": 0.875,
      "guidance_scale_2": 1.0,
      "seed": 42
    }
  }'
```

The `input_reference` field accepts:

- **HTTP/HTTPS URL**: `"https://example.com/image.png"`
- **Base64 data URI**: `"data:image/png;base64,iVBORw0KGgo..."`
- **Local file path**: `"/path/to/image.png"` or `"file:///path/to/image.png"`

The I2V-specific `nvext` fields (`boundary_ratio`, `guidance_scale_2`) control the dual-expert MoE denoising schedule in Wan2.x models. See [Wan2.2-I2V model card](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers) for details.
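The request above sets `nvext.num_frames` directly instead of passing `seconds`. As a rough illustration of how the two relate, the sketch below applies the rule from the `nvext` table: an explicit frame count wins, otherwise the count is `fps * seconds`. The function name `effective_num_frames` is hypothetical, the default of 24 fps is taken from that table, and the rounding is an assumption; the server may round or clamp differently.

```python
def effective_num_frames(seconds, fps=24, num_frames=None):
    """Frame count a request asks for: explicit num_frames, else fps * seconds."""
    if num_frames is not None:
        return num_frames  # nvext.num_frames overrides the product
    return round(fps * seconds)  # rounding is an assumption, not documented behavior


if __name__ == "__main__":
    print(effective_num_frames(2))                 # seconds only, table default fps
    print(effective_num_frames(2, num_frames=33))  # explicit override
```

With the table defaults, a 2-second request works out to 48 frames, while the I2V example's `num_frames: 33` is used as given.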
## Text-to-Audio (TTS)

Launch using the provided script with `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`:

```bash
bash examples/backends/vllm/launch/agg_omni_audio.sh
```

### CustomVoice (predefined speakers)

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "language": "English"
  }' --output output.wav
```

### CustomVoice with style instructions

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I am so excited!",
    "voice": "vivian",
    "instructions": "Speak with great enthusiasm"
  }' --output excited.wav
```

### VoiceDesign (describe a voice)

```bash
bash examples/backends/vllm/launch/agg_omni_audio.sh --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world",
    "task_type": "VoiceDesign",
    "instructions": "A warm, friendly female voice with a gentle tone"
  }' --output voicedesign.wav
```

### Parameters

The `/v1/audio/speech` endpoint follows the [vLLM-Omni](https://docs.vllm.ai/projects/vllm-omni/en/latest/) API format. All TTS-specific parameters are top-level fields:

| Field | Description | Default |
|---|---|---|
| `input` | Text to synthesize (required) | -- |
| `model` | TTS model name | auto-detected |
| `voice` | Speaker name (e.g., vivian, ryan). Validated against model config. | Vivian |
| `response_format` | Audio format: wav, mp3, pcm, flac, aac, opus | wav |
| `speed` | Speed factor (0.25-4.0) | 1.0 |
| `task_type` | CustomVoice, VoiceDesign, or Base (Qwen3-TTS) | CustomVoice |
| `language` | Language code. Validated against model config. | Auto |
| `instructions` | Voice style/emotion description. Required for VoiceDesign. | -- |
| `ref_audio` | Reference audio URL or base64 data URI. Required for Base. | -- |
| `ref_text` | Transcript of reference audio (Base task) | -- |
| `max_new_tokens` | Maximum tokens to generate (1-4096) | 2048 |

Available voices and languages are loaded dynamically from the model's `config.json` at startup. Non-Qwen3-TTS audio models (e.g., MiMo-Audio) use a generic text prompt and ignore TTS-specific parameters.

## CLI Reference

The omni backend uses a dedicated entrypoint: `python -m dynamo.vllm.omni`.

| Flag | Description |
|---|---|
| `--omni` | Enable the vLLM-Omni orchestrator (required for all omni workloads) |
| `--output-modalities ` | Output modality: `text`, `image`, `video`, or `audio` |
| `--stage-configs-path ` | Path to stage config YAML (optional; vLLM-Omni uses model defaults if omitted) |
| `--boundary-ratio ` | MoE expert switching boundary (default: 0.875) |
| `--flow-shift ` | Scheduler flow_shift (5.0 for 720p, 12.0 for 480p) |
| `--vae-use-slicing` | Enable VAE slicing for memory optimization |
| `--vae-use-tiling` | Enable VAE tiling for memory optimization |
| `--default-video-fps ` | Default frames per second for generated videos (default: 16) |
| `--enable-layerwise-offload` | Enable layerwise offloading on DiT modules to reduce GPU memory |
| `--layerwise-num-gpu-layers ` | Number of ready layers to keep on GPU during generation (default: 1) |
| `--cache-backend ` | Diffusion cache: `cache_dit` or `tea_cache` |
| `--cache-config ` | Cache configuration as JSON string (overrides defaults) |
| `--enable-cache-dit-summary` | Enable cache-dit summary logging after diffusion forward passes |
| `--enforce-eager` | Disable torch.compile for diffusion models |
| `--enable-cpu-offload` | Enable CPU offloading for diffusion models |
| `--ulysses-degree ` | GPUs for Ulysses sequence parallelism in diffusion (default: 1) |
| `--ring-degree ` | GPUs for ring sequence parallelism in diffusion (default: 1) |
| `--cfg-parallel-size ` | GPUs for classifier-free guidance parallelism (1 or 2, default: 1) |
| `--media-output-fs-url ` | Filesystem URL for storing generated media (default: `file:///tmp/dynamo_media`) |
| `--media-output-http-url ` | Base URL for rewriting media paths in responses (optional) |

## Storage Configuration

Generated images, videos, and audio files are stored via [fsspec](https://filesystem-spec.readthedocs.io/), which supports local filesystems, S3, GCS, and Azure Blob. By default, media is written to the local filesystem at `file:///tmp/dynamo_media`.

To use cloud storage:

```bash
bash examples/backends/vllm/launch/agg_omni_video.sh \
  --media-output-fs-url s3://my-bucket/media \
  --media-output-http-url https://cdn.example.com/media
```

When `--media-output-http-url` is set, response URLs are rewritten as `{base-url}/{storage-path}` (e.g., `https://cdn.example.com/media/videos/req-id.mp4`). When unset, the raw filesystem path is returned.

For S3 credential configuration, set the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) or use IAM roles. See the [fsspec S3 docs](https://s3fs.readthedocs.io/en/latest/#credentials) for details.

## Stage Configuration

Omni pipelines are configured via YAML stage configs. By default vLLM-Omni ships built-in stage configs for supported models, so no `--stage-configs-path` is needed unless you want to override the defaults.

For full documentation on stage config format and multi-stage pipelines, refer to the [vLLM-Omni Stage Configs documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/).

## Disaggregated Multi-Stage Serving

For models with multiple pipeline stages (e.g., AR + Diffusion), Dynamo supports disaggregated serving where each stage runs as an independent process on its own GPU. This enables independent scaling, GPU isolation, and multi-worker replicas per stage.

### Architecture

Each stage runs as an independent process on its own GPU.
A lightweight router coordinates them, acting as a **pure message broker**: it never inspects or transforms inter-stage data.

```mermaid
flowchart LR
    client(Client) --> frontend(Frontend)
    frontend --> router(Router)
    router -->|request| s0(Stage 0)
    s0 -->|ref| router
    router -->|ref| s1(Stage 1)
    s1 -->|result| router
    router --> frontend --> client
    s0 <-->|bulk data| conn[(Connector)]
    conn <--> s1
```

**How it works:**

- The router sends the initial request to Stage 0 and receives back a lightweight connector reference (pointer to the output in shared memory).
- The router forwards that reference, unchanged, to Stage 1. It never reads the bulk data.
- Each stage fetches its inputs from the connector, runs any model-specific processor (e.g., `ar2diffusion`, `thinker2talker`), then runs its engine.
- The final stage's result goes back to the router for formatting and response.
- Connector references accumulate as the pipeline progresses, so any stage can access outputs from all previous stages.

### Data Flow

```mermaid
sequenceDiagram
    participant C as Client
    participant R as Router
    participant S0 as Stage 0 (AR)
    participant SHM as Connector
    participant S1 as Stage 1 (DiT)
    C->>R: POST /v1/images/generations
    R->>S0: request + prompt
    S0->>SHM: store output
    S0-->>R: connector ref
    R->>S1: connector ref (opaque)
    S1->>SHM: fetch output
    S1->>S1: processor → engine
    S1-->>R: result
    R-->>C: {"data": [...]}
```

### Quick Start: GLM-Image (2-Stage, 2 GPUs)

GLM-Image is a 2-stage text-to-image model with an AR stage (generates prior token IDs) and a DiT stage (diffusion denoising + VAE decode). The built-in vLLM-Omni stage config already assigns each stage to a separate GPU.

> **Experimental:** GLM-Image support is experimental; generation may fail or produce incorrect/garbled outputs for some prompts and sizes.
```bash
bash examples/backends/vllm/launch/disagg_omni_glm_image.sh
```

Test:

```bash
curl -s http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-Image",
    "prompt": "A red apple on a white table",
    "size": "1024x1024",
    "response_format": "url"
  }' | jq
```

### Scaling Stage Replicas

Each stage registers independently with Dynamo's service discovery. To scale a bottleneck stage, launch additional workers with the same `--stage-id` on different GPUs; the router automatically load-balances across all replicas for that stage. Other stages are unaffected.

### Tested Models

| Model | Stages | Output | Stage Config |
|---|---|---|---|
| GLM-Image (`zai-org/GLM-Image`) | AR -> DiT | Image | `glm_image.yaml` (built-in) |

### CLI Flags (Disaggregated Mode)

| Flag | Description |
|---|---|
| `--stage-id ` | Run as a single-stage worker for the given stage ID. Requires `--stage-configs-path`. |
| `--omni-router` | Run as the stage router. Requires `--stage-configs-path`. Mutually exclusive with `--stage-id`. |
| `--stage-configs-path ` | Path to vLLM-Omni stage configuration YAML. |

## Current Limitations

- Image input is supported only for I2V via `input_reference` in `/v1/videos`. Other endpoints accept text prompts only.
- KV cache events are not published for omni workers.
- Each worker supports a single output modality at a time.
- Audio: streaming (`stream: true`) is not yet supported.
- Audio: Base task (voice cloning) is not yet supported.
- Disaggregated mode: `async_chunk=true` (streaming between stages) is not yet supported.

# Custom Backend Overview

Dynamo supports custom backends through one preferred unified contract, a lower-level worker path, and a packaging path:

| Path | Use when |
| --- | --- |
| [Writing Unified Backends](/dynamo/backends/custom-backend/writing-unified-backends) | You are writing a new token-in-token-out engine in Python or Rust and want Dynamo to own the runtime lifecycle. |
| [Python Workers (lower-level)](/dynamo/backends/custom-backend/python-workers-lower-level) | You need the older `register_model` and `serve_endpoint` path for features the unified backend does not cover yet. |
| [Runtime Containers](/dynamo/backends/custom-backend/runtime-containers) | You need to package a built-in or custom backend into a deployable Dynamo image. |

The unified backend path is the preferred starting point for new custom engines. It gives Python and Rust backends the same lifecycle shape: parse arguments, start the engine, stream generated chunks, handle cancellation, drain, and clean up. The Dynamo framework owns runtime registration, signal handling, model registration, and graceful shutdown.

Use the lower-level Python worker path when your backend needs features that are still outside the unified contract, such as multimodal, LoRA adapter management, logprobs, guided decoding, engine-specific routes, or custom request handling.

If your custom engine wants KV-cache-aware routing, also implement [KV Events for Custom Engines](/dynamo/integrations/kv-cache-integrations/kv-events-for-custom-engines) so the Dynamo router can track which workers hold each prefix.

# Writing Unified Backends

Dynamo's unified backend path lets custom engines implement the same lifecycle contract used by the built-in backends. The engine owns inference; Dynamo owns runtime registration, request serving, cancellation monitoring, signal handling, drain, and graceful shutdown.

Use this path for new token-in-token-out engines unless you need a feature that is still outside the unified contract.

## Choose an Implementation Language

| Path | Use when |
|---|---|
| Python unified backend | Your engine or serving library is Python-first, or you want the quickest path to integrate a custom engine. |
| Rust unified backend | You want a native Rust binary, tighter control of runtime dependencies, or no Python worker runtime. |
| Python workers (lower-level) | You need custom request handling or features not yet covered by the unified backend contract. Use [Python Workers](/dynamo/backends/custom-backend/python-workers-lower-level). |

Both unified implementations follow the same shape:

```text
parse config -> start engine -> stream generated chunks -> abort/drain -> cleanup
```

The framework handles model registration, endpoint serving, cancellation plumbing, and shutdown behavior around that engine contract.

## What the Unified Contract Covers

Supported today:

- aggregated token-in-token-out inference
- disaggregated serving modes for supported engines
- model registration through Dynamo discovery
- request cancellation
- structured backend errors
- graceful shutdown and drain hooks

Still use the lower-level Python worker path when you need features such as multimodal requests, LoRA adapter management, logprobs, guided decoding, engine-specific routes, custom request handling, or features that need direct control of the request payload.

After you implement the backend, package it into a runtime image with [Runtime Containers](/dynamo/backends/custom-backend/runtime-containers). For Kubernetes deployment, place the custom backend in a `DynamoGraphDeployment` and follow the [Deployment Overview](/dynamo/kubernetes-deployment/deploy-models/model-deployment-guide).

## Python Implementation

> **New: Dynamo's unified backend.** This guide covers the new
> **unified backend** infrastructure in
> [`dynamo.common.backend`](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/common/backend):
> a shared `LLMEngine` ABC that vLLM, SGLang, TRT-LLM, and a sample
> engine already implement, and that any custom Python engine can plug
> into the same way. For the Rust version of the same contract, use the
> Rust tab on this page.
> For the older lower-level Python worker path (`register_model` +
> `serve_endpoint`), still the right choice for features the unified
> backend does not yet cover, see
> [Writing Python Workers](/dynamo/backends/custom-backend/python-workers-lower-level).
>
> **Beta: actively under development.** The unified backend surface
> is beta quality and may change without backwards compatibility
> between releases. See [Feature gaps](#python-feature-gaps) below for what
> the unified path covers today versus the existing (non-unified)
> backend paths.

This guide walks through building a Python backend for an inference engine that plugs into Dynamo's distributed runtime via `dynamo.common.backend`. A "unified backend" is a Python entry point that implements the shared `LLMEngine` ABC and lets the framework own runtime lifecycle (signal handling, model registration, graceful shutdown, cancellation monitoring); your code just owns inference.

Your backend lives in its own package and **does not need to be part of the dynamo repository**. It depends on `ai-dynamo` from PyPI (or the git source) and imports `dynamo.common.backend`. The steps below assume you're starting a fresh package in your own repo.

The reference example is the **sample engine** at [`sample_engine.py`](../../components/src/dynamo/common/backend/sample_engine.py), a complete, runnable implementation under 120 lines. Read it alongside this guide.

**Where to look for what:**

- This guide: step-by-step walkthrough for someone starting a new backend from scratch.
- [`LLMEngine` ABC docstrings](../../components/src/dynamo/common/backend/engine.py): authoritative method-by-method contract.
- [Package README](../../components/src/dynamo/common/backend/README.md): in-tree reference with `GenerateRequest` / `GenerateChunk` field definitions, per-engine cancellation cookbook (vLLM / SGLang / TRT-LLM), full `DynamoException` table, file index, and the per-engine feature-gap matrix.
### Python feature gaps

The unified backend is in beta. The summary below is the common contract (what every engine on the unified path gets) plus the gaps that apply to all three engines. Per-engine specifics (vLLM sleep/wake, SGLang diffusion, TRT-LLM custom logits processors, etc.) live in the [package README](../../components/src/dynamo/common/backend/README.md#feature-gaps).

**Supported today**

Lifecycle and runtime:

- Aggregated token-in-token-out inference
- Disaggregated serving (`agg` / `prefill` / `decode`): KV transfer uses NIXL across all three engines; SGLang exchanges a Dynamo-level bootstrap address (host/port/room), vLLM and TRT-LLM use an engine-internal handshake
- Model registration with discovery and endpoint types
- Request cancellation via `abort()` + `context.is_stopped()`
- Graceful shutdown with signal handling
- `drain()` hook for pre-cleanup work (e.g. in-flight NIXL transfers)
- `DynamoException` error chain wrapping
- Finish reason normalization (handled by the Rust layer)
- Engine control plumbing, with per-backend profiling, quiesce/resume, and supported weight-update controls

Observability:

- Health-check canary via `health_check_payload()` (plus `DYN_HEALTH_CHECK_PAYLOAD` / `--health-check-payload` overrides)
- Vendor-prefixed Prometheus bridge (`vllm:` / `sglang:` / `trtllm_` / `lmcache:`) via `register_prometheus()`
- Framework-owned lifecycle gauges (`cleanup_time_seconds`, `drain_time_seconds`, `model_load_time_seconds`), always on
- Per-rank `dynamo_component_*` gauges + router `kv_used_blocks` signal via `component_metrics_dp_ranks()` + `attach_snapshot_publisher()` + `ComponentSnapshot` push
- KV event publishing via `kv_event_sources()` returning `ZmqSource` or `PushSource`
- KV-aware routing (DP-rank-aware) via `dp_rank.forced_dp_rank` / `validate_global_dp_rank` + `EngineConfig.data_parallel_{size, start_rank}`
- OpenTelemetry tracing facade: `telemetry.current_span` / `start_span` plus W3C trace header propagation
through `telemetry.engine_trace_kwargs(context)`

Request handling:

- Guided decoding: wired per-engine on the request side with engine-specific coverage. vLLM (`StructuredOutputsParams`) and TRT-LLM (`GuidedDecodingParams`) cover JSON schema / regex / grammar / choice; SGLang (`_get_guided_decoding_params`) covers JSON schema only; regex / grammar / choice are silently dropped today (see the SGLang-specific gaps in the package README)
- Structural tag generation via `WorkerConfig.structural_tag_{mode, scope, schema}` and `serialize_structural_tag`
- Custom Jinja chat templates via `WorkerConfig.custom_jinja_template` (frontend applies; the backend advertises through model registration)
- Tool / reasoning parser configuration (`tool_call_parser`, `reasoning_parser`, `exclude_tools_when_tool_choice_none`)

**Not yet on the unified path (common to all engines)**

| Feature | What's missing |
|---------|----------------|
| Logprob response wire | Legacy handlers extract logprobs onto response chunks (vLLM `_extract_logprobs`, SGLang `_extract_logprobs` in `decode_handler`, TRT-LLM `_extract_logprobs` in `handler_base`); the unified `generate()` loops do not populate `log_probs` / `top_logprobs` / `cum_log_probs` on `GenerateChunk`. vLLM's `build_sampling_params` still passes `output_options.logprobs` to the engine on the unified path, so the engine computes them, but the values are dropped before they reach the chunk. SGLang and TRT-LLM unified `generate()` do not read `output_options.logprobs` at all. |
| Text-in-text-out mode | Unified hardcodes `ModelInput.Tokens`; no engine-side tokenization or chat templating path |
| Multimodal | Images / video / embeddings, NIXL embedding transfer, separate encode workers, `ENCODE` disaggregation role |
| Diffusion | Image (FLUX), video (Wan2.1), LLM diffusion (DLLM) workers; no diffusion engine, MediaOutput, or media scheduling on the unified path |
| LoRA adapters | Dynamic load / unload / list, ModelDeploymentCard publishing, per-adapter serialization locks, per-request adapter threading on prefill |
| Snapshot / checkpoint | CRIU-based engine state save/restore + identity reload |

If you need one of these features today, keep that workload on the existing per-engine entry point (`dynamo..main`) until the unified path catches up.

### Python: What you are building

A backend is two things:

1. **An engine class** that subclasses `LLMEngine`: owns the model, accepts preprocessed token requests, streams output chunks.
2. **A `main.py` entry point**: a three-line shim that hands the engine class to `run()` from `dynamo.common.backend.run`, which drives the lifecycle.

The `dynamo.common.backend` package handles everything else: signal handling, distributed runtime setup, model registration with discovery, the serving loop, graceful shutdown, cancellation monitoring, and error chain wrapping. (The lifecycle state machine actually lives in Rust; `dynamo.common.backend.Worker` is a thin Python shim over it.)

```text
from_args  →  start()  →  generate() / abort()  →  drain()  →  cleanup()
    |            |               |                    |            |
parse argv,  start engine,  serve requests       pre-cleanup   release
return       return         (concurrent)         drain         resources
engine       metadata
```

### Python prerequisites

- Python 3.11 or newer. `dynamo` uses `typing.Required`, which is 3.11+.
- NATS and etcd reachable for end-to-end runs. The dynamo repo's `deploy/docker-compose.yml` brings up both in one command if you don't already have them running.
- `uv` or `pip` for installing dependencies.
- Familiarity with `async` Python (`asyncio`, async generators) and `argparse`.

### Python Step 1: Create the package

```text
my-backend/
├── pyproject.toml
└── src/
    └── my_backend/
        ├── __init__.py
        ├── engine.py
        └── main.py
```

Minimal `pyproject.toml`:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-backend"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    # ai-dynamo bundles dynamo.common.backend. Pin to the release whose
    # LLMEngine contract you tested against — the surface is still beta
    # and may change between releases.
    "ai-dynamo>=1.2.0",
]

[project.optional-dependencies]
dev = ["pytest>=8", "pytest-asyncio>=0.23"]

[project.scripts]
my-backend = "my_backend.main:main"
```

For a bleeding-edge dependency on the dynamo source tree, install the runtime wheel from a clone:

```bash
git clone https://github.com/ai-dynamo/dynamo.git
pip install maturin
cd dynamo/lib/bindings/python && maturin build --release --out /tmp/wheels
pip install /tmp/wheels/*.whl   # ai-dynamo-runtime
pip install /path/to/dynamo     # ai-dynamo (components/ tree)
```

Building the wheel needs a Rust toolchain plus `clang`, `cmake`, `protobuf-compiler`, and `libssl-dev`.

### Python Step 2: Subclass `LLMEngine`

In `src/my_backend/engine.py`, declare a class that subclasses `LLMEngine` and owns whatever state your engine needs. Construction must be cheap and side-effect-free — heavy work goes in `start()`.

```python
# src/my_backend/engine.py
from __future__ import annotations

import argparse
import asyncio
from collections.abc import AsyncGenerator

from dynamo._core import Context
from dynamo.common.backend import (
    EngineConfig,
    GenerateChunk,
    GenerateRequest,
    LLMEngine,
    WorkerConfig,
)


class MyBackend(LLMEngine):
    def __init__(self, model_name: str, max_tokens: int = 16):
        self.model_name = model_name
        self.max_tokens = max_tokens
        # Heavy state (engine handles, schedulers, KV allocators) is
        # left None here and initialized in start().
        self._engine = None
```

`GenerateRequest` and `GenerateChunk` are `TypedDict`s describing the shared shape — see Step 4 for the fields.

### Python Step 3: Implement `from_args`

`from_args` is a classmethod factory that parses CLI args and returns `(engine, WorkerConfig)`. The engine is constructed but **not started**.

```python
    @classmethod
    async def from_args(
        cls, argv: list[str] | None = None
    ) -> tuple[MyBackend, WorkerConfig]:
        parser = argparse.ArgumentParser(prog="my-backend")
        parser.add_argument("--model-name", default="my-model")
        parser.add_argument("--max-tokens", type=int, default=16)
        # Runtime / discovery flags — every unified backend needs these.
        parser.add_argument("--namespace", default="dynamo")
        parser.add_argument("--component", default="backend")
        parser.add_argument("--endpoint", default="generate")
        parser.add_argument("--endpoint-types", default="chat,completions")
        parser.add_argument("--discovery-backend", default="etcd")
        parser.add_argument("--request-plane", default="tcp")
        parser.add_argument("--event-plane", default=None)
        args = parser.parse_args(argv)

        engine = cls(model_name=args.model_name, max_tokens=args.max_tokens)
        worker_config = WorkerConfig(
            namespace=args.namespace,
            component=args.component,
            endpoint=args.endpoint,
            model_name=args.model_name,
            served_model_name=args.model_name,
            endpoint_types=args.endpoint_types,
            discovery_backend=args.discovery_backend,
            request_plane=args.request_plane,
            event_plane=args.event_plane,
        )
        return engine, worker_config
```

`from_args` is `async` to match the ABC; you can `await` from it if your CLI parsing reads config from a file or hits an API. Most backends don't need to.

For backends that already have a `DynamoRuntimeConfig`-shaped config object (e.g. ones derived from vLLM's, SGLang's, or TRT-LLM's existing config), prefer the `WorkerConfig.from_runtime_config(runtime_cfg, model_name=...)` helper — it pulls the shared discovery / request-plane / parser fields off the config in one line.
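Because `from_args` is `async`, argument parsing can await file reads without blocking the event loop. The sketch below illustrates that idea in isolation and does not import Dynamo. The `--config-file` flag, the `load_settings` name, and the rule that file values override CLI values are all assumptions made for illustration; a real `from_args` would build the engine and `WorkerConfig` from the resulting values.

```python
import argparse
import asyncio
import json
from pathlib import Path


async def load_settings(argv: list[str]) -> dict:
    """Parse CLI flags, then merge an optional JSON config file on top."""
    parser = argparse.ArgumentParser(prog="my-backend")
    parser.add_argument("--model-name", default="my-model")
    parser.add_argument("--max-tokens", type=int, default=16)
    # Hypothetical flag, not part of the Dynamo CLI surface.
    parser.add_argument("--config-file", default=None)
    args = parser.parse_args(argv)

    settings = {"model_name": args.model_name, "max_tokens": args.max_tokens}
    if args.config_file is not None:
        # read_text() blocks, so run it in a worker thread and await it.
        text = await asyncio.to_thread(Path(args.config_file).read_text)
        # Assumption: values from the file win over CLI values.
        settings.update(json.loads(text))
    return settings
```

Inside a real backend, the body of `load_settings` would sit in `from_args` ahead of the `cls(...)` and `WorkerConfig(...)` calls.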
### Python Step 4: Implement `LLMEngine` methods

The ABC has three required methods (`start`, `generate`, `cleanup`) plus two with default no-op implementations (`abort`, `drain`).

#### Python: `start()`

Start the engine and return `EngineConfig` metadata. After this returns, `generate()` MUST be ready for concurrent calls.

```python
    async def start(self, worker_id: int) -> EngineConfig:
        # ... load weights, build scheduler, warm up CUDA, etc.
        # Heavy: may take minutes. Emit logger.info checkpoints so
        # operators see progress (Worker logs around start() but not
        # inside it).
        self._engine = await heavy_init(self.model_name)
        return EngineConfig(
            model=self.model_name,
            served_model_name=self.model_name,
            context_length=8192,
            kv_cache_block_size=16,  # None if no block-structured KV
            total_kv_blocks=1024,
            max_num_seqs=64,
            max_num_batched_tokens=8192,
        )
```

`worker_id` is an opaque per-worker identifier — most engines ignore it. Backends needing a stable cluster-wide key (e.g. TRT-LLM's `disagg_machine_id` snowflake) should derive from it instead of hashing host/pid or asking operators for a CLI override.

Every `EngineConfig` field except `model` is optional. `None` means "don't advertise"; KV-aware routing falls back to round-robin when KV fields are unset.

#### Python: `generate()`

An async generator that yields `GenerateChunk` dicts for a single request. Called concurrently for multiple in-flight requests.

**Contract** (chunk shape is defined by the `GenerateChunk` TypedDict — see [Request / Response Types](../../components/src/dynamo/common/backend/README.md#request--response-types) in the package README for the field reference):

- Every chunk carries `token_ids` and `index` (use `0` for single choice).
- The final chunk additionally carries `finish_reason` and `completion_usage`.
- The framework's cancellation monitor calls `engine.abort(context)` when the client disconnects or cancels; your loop should also poll `context.is_stopped()` between yields and exit cleanly with a `finish_reason="cancelled"` chunk.

```python
    async def generate(
        self, request: GenerateRequest, context: Context
    ) -> AsyncGenerator[GenerateChunk, None]:
        prompt_tokens = list(request.get("token_ids", []))
        prompt_len = len(prompt_tokens)
        stop_conditions = request.get("stop_conditions") or {}
        max_new = stop_conditions.get("max_tokens") or self.max_tokens

        def _usage(completion_tokens: int) -> dict[str, int]:
            return {
                "prompt_tokens": prompt_len,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_len + completion_tokens,
            }

        for i in range(max_new):
            if context.is_stopped():
                yield {
                    "token_ids": [],
                    "index": 0,
                    "finish_reason": "cancelled",
                    "completion_usage": _usage(i),
                }
                return
            token_id = await self._next_token(prompt_tokens)
            chunk: GenerateChunk = {"token_ids": [token_id], "index": 0}
            if i == max_new - 1:
                chunk["finish_reason"] = "length"
                chunk["completion_usage"] = _usage(max_new)
            yield chunk
```

Finish reason normalization (`"abort"` → `"cancelled"`, etc.) is handled by the Rust layer — emit whatever your engine uses natively.

#### Python: `abort(context)` — optional

Called by the framework only when the client disconnects or the request is cancelled. NOT called on silent stream drops. Override to release engine-side resources (KV slots, scheduler entries, remote schedulers):

```python
    async def abort(self, context: Context) -> None:
        request_id = context.id()
        await self._engine.cancel(request_id)
```

For cleanup that must run on every drop path — including silent drops — use a `try/finally` or a context manager inside `generate`, not `abort`.

The sample engine doesn't override `abort` because it has no engine-side state to release; the default is a no-op.
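The `try/finally` advice above can be sketched without any Dynamo imports. The example below is a minimal illustration, assuming a hypothetical `SlotPool` resource tracker and a plain async generator standing in for `generate()`; neither name is part of the Dynamo API. It exercises two paths: the stream is consumed to the end, and the consumer closes the stream after one chunk.

```python
import asyncio
from collections.abc import AsyncGenerator


class SlotPool:
    """Hypothetical stand-in for engine-side state such as KV slots."""

    def __init__(self) -> None:
        self.active: set[str] = set()

    def acquire(self, request_id: str) -> None:
        self.active.add(request_id)

    def release(self, request_id: str) -> None:
        # discard() keeps release idempotent if abort() already freed the slot.
        self.active.discard(request_id)


async def generate_with_cleanup(
    pool: SlotPool, request_id: str, n_tokens: int
) -> AsyncGenerator[dict, None]:
    pool.acquire(request_id)
    try:
        for i in range(n_tokens):
            await asyncio.sleep(0)  # yield control, as a real engine step would
            yield {"token_ids": [i], "index": 0}
    finally:
        # Runs on normal completion, on exceptions, and when the consumer
        # closes the generator early (aclose() raises GeneratorExit here).
        pool.release(request_id)


async def demo() -> tuple[int, int]:
    pool = SlotPool()

    # Path 1: the stream is consumed to the end.
    async for _ in generate_with_cleanup(pool, "req-1", 3):
        pass
    after_full = len(pool.active)

    # Path 2: the consumer stops after one chunk and closes the stream.
    stream = generate_with_cleanup(pool, "req-2", 1000)
    await stream.__anext__()
    await stream.aclose()
    after_drop = len(pool.active)

    return after_full, after_drop
```

Both counts returned by `demo()` should be zero, since the `finally` block releases the slot on either path. Keeping `release` idempotent means an `abort()` override and the `finally` block can both fire for the same request without conflict.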
#### Python: `drain()` — optional

Runs once before shutdown, after the discovery unregister + grace-period sleep, while NATS/etcd are still alive. Use it for backend-side draining that must complete before transport teardown (e.g. in-flight NIXL KV transfers on prefill workers). Default is no-op.

#### Python: `cleanup()`

Two real requirements, both pinned by the Rust-side conformance kit:

- **Null-safe against partial `start()` failure.** If `start()` raises partway through, fields you allocate incrementally may still be `None`. `cleanup()` must guard each resource (`if self._engine is not None: …`) so the post-failure call doesn't crash on half-initialized state.
- **Idempotent.** A second call after a successful first must return cleanly without re-entering teardown.

The Rust `Worker` drives both: it calls `cleanup()` after `start()` returns Ok on shutdown, and the conformance kit (`run_conformance`) additionally calls `cleanup()` on a never-started engine and twice in a row, failing your tests with `CleanupWithoutStartFailed` / `SecondCleanupFailed` if either invariant breaks. The guarded single-shot pattern below covers both:

```python
    async def cleanup(self) -> None:
        if self._engine is not None:
            await self._engine.shutdown()
            self._engine = None
```

### Python Step 5: Write `main.py`

Three lines.

```python
# src/my_backend/main.py
from dynamo.common.backend.run import run

from .engine import MyBackend


def main() -> None:
    run(MyBackend)


if __name__ == "__main__":
    main()
```

`run` installs signal handlers, builds the distributed runtime, calls `engine.start(worker_id)` with a runtime-allocated identifier, registers the model with discovery, serves the endpoint, and runs the graceful-shutdown orchestrator on SIGTERM/SIGINT.

Pair this with the `[project.scripts]` entry from Step 1's `pyproject.toml` so `my-backend ...` works as a console command.
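Returning to the two `cleanup()` invariants from Step 4: a quick local check can be sketched as below, assuming a hypothetical `FakeEngineHandle` and a stripped-down `MiniBackend` that mirrors the guarded single-shot pattern without importing Dynamo. This is not the Rust conformance kit, only an illustration of what its two cleanup checks exercise.

```python
import asyncio


class FakeEngineHandle:
    """Hypothetical engine handle that counts shutdown calls."""

    def __init__(self) -> None:
        self.shutdown_calls = 0

    async def shutdown(self) -> None:
        self.shutdown_calls += 1


class MiniBackend:
    """Stand-in for an LLMEngine subclass; only the lifecycle bits."""

    def __init__(self) -> None:
        self._engine: FakeEngineHandle | None = None

    async def start(self) -> None:
        self._engine = FakeEngineHandle()

    async def cleanup(self) -> None:
        # Guard covers both a never-started engine and a repeated call.
        if self._engine is not None:
            await self._engine.shutdown()
            self._engine = None


async def check_cleanup_invariants() -> int:
    # Invariant 1: cleanup() on a never-started engine must not raise.
    await MiniBackend().cleanup()

    # Invariant 2: a second cleanup() must not re-enter teardown.
    backend = MiniBackend()
    await backend.start()
    handle = backend._engine
    assert handle is not None
    await backend.cleanup()
    await backend.cleanup()
    return handle.shutdown_calls
```

If the guard were dropped, the first call would raise `AttributeError` on `None`; if the handle were not reset to `None`, the shutdown count would come back as two instead of one.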
### Python Step 6: Errors and logging

**Errors**: the framework wraps non-`DynamoException` errors raised from `generate()` (or lifecycle methods) as `Unknown`. For typed error reporting, raise a `DynamoException` subclass directly from [`dynamo.llm.exceptions`](../../components/src/dynamo/common/backend/README.md#error-handling) — it propagates unchanged through the Rust bridge:

```python
from dynamo.llm.exceptions import InvalidArgument


async def generate(self, request, context):
    if not request.get("token_ids"):
        raise InvalidArgument("empty prompt")
    ...
```

The package README has the full table of exception types and which lifecycle phase raises which one. Engine-init failures should raise `EngineShutdown` from `start()`. Cleanup shouldn't normally raise — log and swallow if a subsystem fails.

**Logging**: keep levels consistent across unified backends so operators see the same surface regardless of which engine they're running:

- `logger.info` — lifecycle milestones (engine init complete, serving started, engine shutdown).
- `logger.debug` — per-request events (request abort, cancellation).
- `logger.warning` — recoverable problems (empty outputs, unexpected finish reasons).
- `logger.error` — unrecoverable failures only.

The framework also configures `dynamo.runtime.logging` for you; you just call `logger = logging.getLogger(__name__)` at the top of your module and use it.

### Python Step 7: Test your engine

Install the dev extras (`pytest`, `pytest-asyncio`) declared in Step 1:

```bash
pip install -e ".[dev]"
```

The sample engine has a unit-test [suite](../../components/src/dynamo/common/backend/tests/test_engine.py) that you can copy as a starting point.
The shape of a useful test:

```python
import pytest

# MyBackend is defined in src/my_backend/engine.py (Step 2).
from my_backend.engine import MyBackend


class _StubContext:
    def __init__(self, stopped: bool = False) -> None:
        self._stopped = stopped

    def is_stopped(self) -> bool:
        return self._stopped

    def stop(self) -> None:
        self._stopped = True


@pytest.mark.asyncio
async def test_generate_emits_terminal_chunk():
    engine = MyBackend(model_name="m", max_tokens=3)
    await engine.start(worker_id=0)
    try:
        chunks = [
            chunk
            async for chunk in engine.generate(
                {"token_ids": [1, 2, 3]}, _StubContext()
            )
        ]
        assert chunks[-1]["finish_reason"] in ("stop", "length")
        assert chunks[-1]["completion_usage"]["completion_tokens"] == 3
    finally:
        await engine.cleanup()


@pytest.mark.asyncio
async def test_generate_observes_cancellation():
    engine = MyBackend(model_name="m", max_tokens=1000)
    await engine.start(worker_id=0)
    try:
        ctx = _StubContext()
        collected = []
        async for chunk in engine.generate({"token_ids": [1]}, ctx):
            collected.append(chunk)
            if len(collected) >= 2:
                ctx.stop()
        assert collected[-1]["finish_reason"] == "cancelled"
    finally:
        await engine.cleanup()
```

Cover the happy path, cancellation, and any backend-specific edge cases (stop tokens, max-tokens cap, empty prompt). Three to five focused tests are plenty — the framework already pins the lifecycle state machine and cancellation contract with Rust-side tests in `lib/backend-common`.

### Python Step 8: Run it locally

Three moving parts need to come up: NATS + etcd (discovery and the event/request planes), the Dynamo frontend (HTTP → backend discovery), and your backend.

```bash
pip install -e .

# Ensure NATS + etcd are reachable (NATS_SERVER, ETCD_ENDPOINTS).
# --model-name must be a valid Hugging Face repo (or local path); the
# framework fetches the tokenizer + chat template from it on startup.
# Pick a small public repo for smoke tests.
my-backend --model-name Qwen/Qwen3-0.6B --namespace dynamo

# In another shell, start the Dynamo frontend:
python -m dynamo.frontend --http-port 8000
```

Then send a request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 32
  }'
```

A successful response has non-empty `choices[0].message.content` and a `finish_reason` of `stop` or `length`. `jq -e '.choices[0].finish_reason'` is a good one-liner for a CI smoke test.

If your backend looks silent, set `DYN_LOG=info` (or `DYN_LOG=debug,dynamo=debug` for finer scoping) before launching — the framework configures `tracing` from `DYN_LOG`.

### Python reference: sample engine

[`sample_engine.py`](../../components/src/dynamo/common/backend/sample_engine.py) is the canonical minimal reference. Run it as-is:

```bash
python -m dynamo.common.backend.sample_main --model-name test-model
```

It generates rotating token IDs with no ML dependencies, so it's a useful stand-in for AIPerf / end-to-end pipeline smoke tests. Lift these patterns:

- `from_args` parses CLI args and returns `(engine, WorkerConfig)` with no awaits.
- `start()` returns an `EngineConfig` whose KV fields are illustrative but not load-bearing (no real KV cache).
- `generate()` polls `context.is_stopped()` between yields and emits a `cancelled` terminal on observation.
- `cleanup()` is a no-op because the engine holds no resources.

### Python checklist

Before shipping:

- [ ] `LLMEngine` subclassed; `from_args` returns `(engine, WorkerConfig)`.
- [ ] `start()` returns `EngineConfig` with at least a non-empty `model`.
- [ ] `generate()` polls `context.is_stopped()` between yields and emits a `"cancelled"` terminal on observation.
- [ ] Final chunk has `finish_reason` and `completion_usage`.
- [ ] Typed `DynamoException` subclasses used for error reporting where the category matters.
- [ ] `cleanup()` releases all engine resources.
- [ ] Logging levels match the standards in Step 6.

### Python see also

- [`LLMEngine` ABC](../../components/src/dynamo/common/backend/engine.py) — authoritative contract.
- [Package README](../../components/src/dynamo/common/backend/README.md) — feature gaps, error model, request/response contract.
- [Sample engine](../../components/src/dynamo/common/backend/sample_engine.py) — canonical minimal reference implementation.
- Rust tab on this page — the Rust counterpart, same contract, lower-level.

## Rust Implementation

> **New — Dynamo's unified backend.** This guide covers the new
> **unified backend** infrastructure in
> [`dynamo-backend-common`](https://github.com/ai-dynamo/dynamo/tree/main/lib/backend-common): a shared
> `LLMEngine` contract that vLLM, SGLang, TRT-LLM, and the mocker
> already implement, and that any custom engine can plug into the
> same way.
>
> **Beta — actively under development.** The Rust native backend
> surface is beta quality and may change without backwards
> compatibility between releases. See [Feature gaps](#rust-feature-gaps)
> below for what the unified path covers today versus the existing
> (non-unified) backend paths.

This guide walks through building a Rust unified backend for an inference engine that plugs into Dynamo's distributed runtime. A unified backend is a standalone Rust binary that owns its engine and serves requests via the shared [`LLMEngine`](https://github.com/ai-dynamo/dynamo/blob/main/lib/backend-common/src/engine.rs) contract in [`dynamo-backend-common`](https://github.com/ai-dynamo/dynamo/tree/main/lib/backend-common) — no Python worker runtime required. For the Python version of the same contract, use the Python tab on this page.

Your backend lives in its own crate and **does not need to be part of the dynamo repository**. It pulls `dynamo-backend-common` in as a normal git or path dependency.
The steps below assume you're starting a fresh crate in your own repo; an optional note in Step 1 covers the in-tree variant for contributors landing a backend inside `ai-dynamo/dynamo`.

For a Python engine, use the Python tab on this page — same contract, lighter setup. The non-unified fallback for feature gaps (multimodal, LoRA, logprobs, etc.) is Python-only; see [Writing Python Workers](/dynamo/backends/custom-backend/python-workers-lower-level) if you need one of those today.

The reference example is the **mocker backend** at [`lib/backend-common/examples/mocker`](https://github.com/ai-dynamo/dynamo/tree/main/lib/backend-common/examples/mocker) — a small, complete, pure-Rust implementation. Read it alongside this guide.

**Where to look for what:**

- This guide — step-by-step walkthrough for someone starting a new backend from scratch.
- [`LLMEngine` trait doc comments](https://github.com/ai-dynamo/dynamo/blob/main/lib/backend-common/src/engine.rs) — authoritative method-by-method contract.
- [Crate README](https://github.com/ai-dynamo/dynamo/blob/main/lib/backend-common/README.md) — in-tree reference: architecture, file index, disaggregation contract, error taxonomy, conformance kit.
- [`backend-common` design notes](https://github.com/ai-dynamo/dynamo/blob/main/lib/backend-common/CLAUDE.md) — rationale and invariants.

### Rust feature gaps

The unified backend is in beta. The summary below is the common contract — what every engine on the unified path gets, whether written in Rust directly or plugged in from Python via the PyO3 `Worker` shim. Per-engine specifics (vLLM sleep/wake, SGLang diffusion, TRT-LLM custom logits processors, etc.) live in the [Python package README](../../components/src/dynamo/common/backend/README.md#feature-gaps).
**Supported today**

Lifecycle and runtime:

- Aggregated token-in-token-out inference
- Disaggregated serving (`Aggregated` / `Prefill` / `Decode`) — KV transfer uses NIXL across all production engines; SGLang exchanges a Dynamo-level bootstrap address, vLLM and TRT-LLM use an engine-internal handshake. The Rust [mocker example](https://github.com/ai-dynamo/dynamo/tree/main/lib/backend-common/examples/mocker) exercises the same wire format CPU-only
- Model registration with discovery and endpoint types
- Request cancellation via in-stream `ctx.is_stopped()` polling plus the framework's out-of-band `abort()` monitor
- `drain()` hook for pre-cleanup work
- Typed `DynamoError` with `ErrorType::Backend(BackendError::X)`
- Graceful shutdown with signal handling and 3-phase distributed-runtime teardown
- Debug-build stream validator and the `testing::run_conformance` kit
- Engine control plumbing, with per-backend profiling, pause/resume, and supported weight-update controls

Observability:

- Health-check canary via `LLMEngine::health_check_payload()` plus the operator override (`DYN_HEALTH_CHECK_PAYLOAD` / `--health-check-payload`)
- Vendor-registry bridge into the runtime's `/metrics` output via `LLMEngine::setup_metrics()`, plus framework-owned lifecycle gauges (`dynamo_component_{cleanup_time_seconds, drain_time_seconds, model_load_time_seconds}`) and per-rank `dynamo_component_*` gauges driven by `SnapshotPublisher`
- KV event publishing via `kv_event_sources()` returning `KvEventSource::Zmq` or `KvEventSource::Push`
- KV-aware routing (DP-rank-aware) — engines advertise their slice via `EngineConfig::data_parallel_size` / `data_parallel_start_rank`; read the router-forced rank off `request.routing.dp_rank` in `generate()`
- OpenTelemetry tracing — the framework auto-opens an `engine.generate` span around every `generate()` call with attributes for `model` / `input_tokens` / `disagg_role` / `ttft_ms` / `output_tokens` / `finish_reason` / ITL percentiles.
  Static-name spans opened with `tracing::info_span!` inside `generate()` nest under it automatically; for dynamic span names use `dynamo_backend_common::telemetry::start_span(name)`. For outbound calls that need to carry trace context (custom HTTP/TCP transports), use `dynamo_runtime::logging::inject_trace_headers_into_map`. NATS egress is auto-injected — engines do nothing.

Request handling:

- Guided decoding — request shape carries `SamplingOptions::guided_decoding` (`GuidedDecodingOptions`); engine-side coverage on the existing Python-bridged engines is: vLLM and TRT-LLM forward JSON schema / regex / grammar / choice; SGLang forwards JSON schema only (regex / grammar / choice are silently dropped today). A new Rust engine should forward whichever variants its backend supports
- Structural tag generation — `WorkerConfig::structural_tag_{mode, scope, schema}` (typed enums)
- Custom Jinja chat templates — `WorkerConfig::custom_jinja_template` flows to `LocalModelBuilder::custom_template_path` and the frontend applies the template at preprocessing time
- Tool / reasoning parser configuration on `WorkerConfig` (`tool_call_parser`, `reasoning_parser`, `exclude_tools_when_tool_choice_none`)

**Not yet on the unified path (common to all engines)**

| Feature | What's missing |
|---------|----------------|
| Logprob response wire | `PreprocessedRequest.output_options.{logprobs, prompt_logprobs}` exists on the request shape. Of the existing engines (Python-bridged through PyO3), only vLLM passes the option through to its sampling params on the unified path; SGLang and TRT-LLM unified `generate()` ignore it. No engine populates `log_probs` / `top_logprobs` / `cum_log_probs` on `LLMEngineOutput` — the response wire is open but unused |
| Text-in-text-out mode | `ModelInput::Text` is rejected at startup — `Tokens` only |
| Multimodal | Images / video / embeddings, NIXL embedding transfer, separate encode workers; `ENCODE` disaggregation role |
| Diffusion | Image (FLUX), video (Wan2.1), LLM diffusion (DLLM) workers; no diffusion engine, MediaOutput, or media scheduling on the unified path |
| LoRA adapters | Dynamic load / unload / list, ModelDeploymentCard publishing, per-adapter serialization |
| Snapshot / checkpoint | CRIU-based engine state save/restore + identity reload |

If you need one of these features today, keep that workload on the existing per-engine entry point until the unified path catches up.

### Rust: What you are building

A backend is two things:

1. **An engine type** that implements the `LLMEngine` trait — owns the model, accepts preprocessed token requests, streams output tokens.
2. **A `main.rs` entry point** — a three-line shim that hands the engine to `dynamo_backend_common::run`, which drives the lifecycle.

The `dynamo-backend-common` crate handles everything else: signal handling, model registration with discovery, the serving loop, graceful shutdown, metrics, cancellation plumbing, and the debug-mode contract validator.

Engines work directly with `PreprocessedRequest` and `LLMEngineOutput` — the same types used by Dynamo's preprocessing, routing, and frontend. No Python-shaped translation layer.

```text
construct  →  start()       →  generate() / abort()  →  drain()      →  cleanup()
    |            |                   |                     |               |
parse args   start engine,     serve requests         pre-cleanup      release
return       return            (concurrent)           drain            resources
engine       metadata
```

### Rust prerequisites

- Rust 1.85 or newer (the dynamo workspace is edition 2024). The toolchain pin in Step 1 locks this in for you; older toolchains will fail with `feature edition2024 is required` deep inside the build.
- NATS and etcd reachable for end-to-end runs. The dynamo repo's `deploy/docker-compose.yml` brings up both in one command if you don't already have them running.
- Familiarity with `async` Rust, `tokio`, and `clap`. The trait uses `async_trait`, and the framework expects a `tokio` runtime.

### Rust Step 1: Create the crate

Your backend is a standalone Rust binary crate. It can live in its own repository — the dynamo repo is **not** required to be your parent workspace. Pick whatever layout you prefer:

```text
my-backend/
├── Cargo.toml
└── src/
    ├── main.rs
    └── engine.rs   # (or my_engine.rs — whatever you call it)
```

`cargo new --bin my-backend` is the fastest starting point; add `src/engine.rs` yourself afterwards.

#### Rust: Getting the `dynamo-backend-common` crate

`dynamo-backend-common` lives in the [ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo) repository and is not on crates.io. Depend on it via git:

```toml
[package]
name = "my-backend"
version = "0.1.0"
edition = "2024"

[[bin]]
name = "my-backend"
path = "src/main.rs"

[dependencies]
# Set `rev` to the dynamo commit you want to build against.
# `branch = "main"` works too but moves under you on every rebuild.
dynamo-backend-common = { git = "https://github.com/ai-dynamo/dynamo.git", rev = "" }
anyhow = "1"
async-stream = "0.3"
async-trait = "0.1"
clap = { version = "4", features = ["derive", "env"] }
futures = "0.3"
# Must match the version pinned by dynamo-runtime — it relies on
# tokio_unstable runtime metrics that change shape across releases.
tokio = { version = "=1.48.0", features = ["full"] }
tracing = "0.1"

[dev-dependencies]
dynamo-backend-common = { git = "https://github.com/ai-dynamo/dynamo.git", rev = "", features = ["testing"] }
```

The `testing` feature pulls in the conformance kit used in Step 7.
Pick a SHA with:

```bash
git ls-remote https://github.com/ai-dynamo/dynamo.git main
```

> **No release tags yet.** `dynamo-backend-common` landed after the last
> tagged release (`v1.1.1`), so `tag = "v1.1.1"` won't resolve the
> crate. Track `main` or pin to a specific SHA until a release tag ships
> that includes the crate.

#### Rust: Two build-time requirements you cannot skip

These are easy to miss and surface as confusing compile errors deep inside `dynamo-runtime`:

1. **`tokio_unstable` cfg flag.** `dynamo-runtime` uses tokio's unstable runtime-metrics API. Create `.cargo/config.toml` in your crate root:

   ```toml
   [build]
   rustflags = ["--cfg", "tokio_unstable"]
   ```

   Without it, you'll see errors like `method blocking_queue_depth not found on RuntimeMetrics` while compiling `dynamo-runtime`.

2. **Rust toolchain pin.** Match dynamo's toolchain so workspace-edition crates compile. Create `rust-toolchain.toml`:

   ```toml
   [toolchain]
   channel = "1.93.1"
   ```

   Older toolchains fail with `feature edition2024 is required`.

> **Tip — local development**: while iterating against an unreleased
> change in `dynamo-backend-common`, point the dep at a local clone:
> `dynamo-backend-common = { path = "/path/to/dynamo/lib/backend-common" }`.
> Switch back to the git dep before publishing your crate.

If you'd rather develop *inside* the dynamo workspace as a new sub-crate, drop your crate under `dynamo/lib/` and use `dynamo-backend-common = { workspace = true }` instead. The trait contract is identical, and the `.cargo/config.toml` plus toolchain pin in the dynamo repo cover the two requirements above for you.

### Rust Step 2: Define your engine struct

In `src/engine.rs` (or whatever you named it), declare a struct that owns whatever state your engine needs. Anything you allocate inside `start()` later must live behind interior mutability so the trait's `&self` methods can reach it.
```rust
use async_trait::async_trait;
use dynamo_backend_common::engine::GenerateContext;
use dynamo_backend_common::{
    BackendError, CommonArgs, DynamoError, EngineConfig, ErrorType, FinishReason,
    LLMEngine, LLMEngineOutput, LLMEngineOutputExt, PreprocessedRequest, WorkerConfig,
    chunk, usage,
};
use futures::stream::BoxStream;
use tokio::sync::RwLock;

pub struct MyBackend {
    model: String,
    inner: RwLock<Option<Inner>>, // allocated in start()
}

// Replace this with whatever your engine owns — handle, scheduler,
// client, channel sender, etc. Fields go here. Truly stateless
// engines can skip `Inner` and `inner` entirely.
struct Inner {}
```

`async-trait` lets the trait use `async fn` (still required for object-safety with `Arc`); `async-stream`'s `stream!` macro lets the `generate` body yield items from inside an `async` block. The mocker example uses `OnceCell` for `Inner`; `RwLock<Option<Inner>>` also works — pick whichever fits your shutdown semantics.

### Rust Step 3: Wire up CLI arguments

Every backend's CLI shares a common base (`--namespace`, `--component`, `--endpoint`, etc.) provided by `CommonArgs`. Flatten that into your engine's `Args` struct and add your engine-specific flags.

```rust
#[derive(clap::Parser, Debug)]
#[command(
    name = env!("CARGO_BIN_NAME"),
    about = "My Dynamo Rust backend."
)]
struct Args {
    #[command(flatten)]
    common: CommonArgs,

    /// HF repo or local model directory.
    #[arg(value_name = "MODEL")]
    model: String,

    /// Public-facing model name advertised to clients.
    #[arg(long)]
    served_model_name: Option<String>,

    // ... engine-specific flags here.
}
```

Define an inherent `from_args` constructor that parses the args and returns both the engine and a `WorkerConfig`. **`from_args` is not on the trait** — it stays inherent so the trait can remain object-safe (`Arc` must work).

The snippet below calls a tiny `invalid_arg` helper that builds a typed `BackendError::InvalidArgument`.
Its full definition lives in Step 6 — for now, mentally substitute "any function that returns a `DynamoError` with category `InvalidArgument`."

```rust
impl MyBackend {
    // NOTE: the `Vec<String>` element type and the `<Args as clap::Parser>`
    // paths are assumptions; check the mocker example for the exact signature.
    pub fn from_args(argv: Option<Vec<String>>) -> Result<(Self, WorkerConfig), DynamoError> {
        let args = match argv {
            Some(a) => <Args as clap::Parser>::try_parse_from(a),
            None => <Args as clap::Parser>::try_parse(),
        }
        .map_err(|e| invalid_arg(e.to_string()))?;

        let engine = Self {
            model: args.model.clone(),
            inner: RwLock::new(None),
        };
        let config = WorkerConfig {
            namespace: args.common.namespace,
            component: args.common.component,
            endpoint: args.common.endpoint,
            endpoint_types: args.common.endpoint_types,
            custom_jinja_template: args.common.custom_jinja_template,
            // Pass `--disaggregation-mode` from `CommonArgs` through to the
            // Worker — without this line the worker silently registers as
            // Aggregated regardless of what the operator passed.
            disaggregation_mode: args.common.disaggregation_mode,
            model_name: args.model,
            served_model_name: args.served_model_name,
            ..Default::default()
        };
        Ok((engine, config))
    }
}
```

`WorkerConfig::default()` sets `model_input` to `ModelInput::Tokens`, which is the only mode `Worker` currently supports — the framework validates this at startup. Engines needing raw text or tensor inputs aren't supported yet.

If your engine branches on the disaggregation role inside `generate` (prefill vs decode), keep the same `DisaggregationMode` on your engine struct so the runtime registration (`WorkerConfig`) and the per-request dispatch stay in lockstep.

### Rust Step 4: Implement the `LLMEngine` trait

The trait has three required methods (`start`, `generate`, `cleanup`) plus two with default implementations you can override (`abort`, `drain`).

#### Rust: `start()`

Start the engine and return `EngineConfig` metadata. After this returns, the engine MUST be ready for concurrent `generate()` calls. Use interior mutability for anything you initialize here.
```rust
async fn start(&self, _worker_id: u64) -> Result<EngineConfig, DynamoError> {
    tracing::info!(model = %self.model, "starting my backend");
    // ... start your engine (may take minutes for real backends — emit
    // tracing::info! checkpoints so operators see progress).
    let inner = init_engine(&self.model).await?;
    *self.inner.write().await = Some(inner);

    Ok(EngineConfig {
        model: self.model.clone(),
        served_model_name: Some(self.model.clone()),
        context_length: Some(8192),
        kv_cache_block_size: Some(64), // None if no block-structured KV
        total_kv_blocks: Some(16384),
        max_num_seqs: Some(256),
        max_num_batched_tokens: Some(8192),
        ..Default::default()
    })
}
```

`worker_id` is an opaque per-worker identifier — most engines ignore it with `_worker_id`. Backends needing a stable cluster-wide key (e.g. TRT-LLM's `disagg_machine_id` snowflake) should derive from it.

Every `EngineConfig` field except `model` is optional. `None` means "don't advertise"; KV-aware routing falls back to round-robin when KV fields are unset. Engines wrapping an external runtime can read these values from the live engine after it comes up, instead of hard-coding them.

The `..Default::default()` is load-bearing: `EngineConfig` sometimes grows new fields (e.g. `bootstrap_host`/`bootstrap_port` for SGLang disagg) and the default keeps existing engines compiling.

#### Rust: `generate()`

Yield a stream of `Result<LLMEngineOutput, DynamoError>` items for a single request. Called concurrently for multiple in-flight requests.

`ctx: GenerateContext` is a thin wrapper that `Deref`s to `dyn AsyncEngineContext`, so the cancellation methods (`stopped()`, `is_stopped()`, `id()`) you'd expect are still there. The wrapper additionally exposes `notify_first_token()` for decode-mode requests — most engines can ignore it; the framework auto-fires on the first non-empty chunk.

**Contract** (the [debug-mode validator](../../lib/backend-common/src/validate.rs) panics on violations):

- Exactly one **terminal item** must be the last item yielded.
A terminal is either an `Ok(chunk)` with `finish_reason` set, or an `Err(DynamoError)`. No items may be yielded after a terminal. - Non-terminal chunks use `chunk::token(id)` and leave `finish_reason` unset. - The returned stream is `'static`: clone or move any state from `&self` or `request` into the stream body before constructing it. Terminal chunks come from one of four `LLMEngineOutput` constructors, optionally chained with the `LLMEngineOutputExt` setters (`.with_tokens(...)`, `.with_usage(...)`): - `LLMEngineOutput::stop()` — natural completion (e.g. you reached your echo limit, the engine hit a stop string). - `LLMEngineOutput::length()` — `max_tokens` cap reached. - `LLMEngineOutput::cancelled()` — you observed `ctx.stopped()`. - `LLMEngineOutput::error(msg)` — message-only error terminal (loses the typed `BackendError` variant — yield `Err(DynamoError)` instead when the category matters). Non-terminal chunks use `chunk::token(id)` (single-token convenience). A streaming-`generate` template: ```rust async fn generate( &self, request: PreprocessedRequest, ctx: GenerateContext, ) -> Result>, DynamoError> { // Destructure once and move fields into the stream — no extra clones // (the stream is 'static and outlives `request`). let PreprocessedRequest { token_ids, stop_conditions, .. } = request; let prompt_tokens = token_ids.len() as u32; let mut output_rx = self.submit_to_engine(token_ids, stop_conditions).await?; Ok(Box::pin(async_stream::stream! { let mut completion_tokens = 0_u32; loop { tokio::select! { biased; // Cancellation: emit FinishReason::Cancelled terminal. _ = ctx.stopped() => { yield Ok(LLMEngineOutput::cancelled() .with_usage(usage(prompt_tokens, completion_tokens))); break; } // Next item from the engine. next = output_rx.recv() => { let Some(engine_output) = next else { yield Ok(LLMEngineOutput::error( "engine stream ended without a terminal".into() )); break; }; // Translate your engine's per-step output into LLMEngineOutput. 
// For a terminal step set `finish_reason`; otherwise leave it None. let mut out = LLMEngineOutput { token_ids: engine_output.tokens, finish_reason: engine_output.terminal_reason, ..Default::default() }; completion_tokens += out.token_ids.len() as u32; if out.finish_reason.is_some() { out = out.with_usage(usage(prompt_tokens, completion_tokens)); yield Ok(out); break; } yield Ok(out); } } } })) } ``` `biased` is load-bearing for the channel-receiving pattern above: 1. When cancellation and a pending token are both ready, yield the cancellation, not one more token. 2. During cleanup the stream sees both `ctx.stopped()` and `rx.recv() -> None` simultaneously; `biased` picks the clean cancellation path instead of erroring on a closed channel. The mocker's [stream body](../../lib/backend-common/examples/mocker/src/engine.rs) spells this out. If your engine doesn't have a receiver — e.g. you're computing tokens inline like a deterministic echo backend — the body collapses to a plain loop that polls cancellation between yields: ```rust Ok(Box::pin(async_stream::stream! { for (i, token_id) in tokens_to_emit.iter().enumerate() { tokio::select! { biased; _ = ctx.stopped() => { yield Ok(LLMEngineOutput::cancelled() .with_usage(usage(prompt_tokens, i as u32))); return; } _ = tokio::time::sleep(delay), if !delay.is_zero() => {} } if i == tokens_to_emit.len() - 1 { yield Ok(LLMEngineOutput::stop() .with_tokens(vec![*token_id]) .with_usage(usage(prompt_tokens, (i + 1) as u32))); } else { yield Ok(chunk::token(*token_id)); } } })) ``` No channel-close race to worry about; `biased` is still cheap and recommended for consistency. **Cancellation rules**: - The stream **must** poll `ctx.is_stopped()` (or `await ctx.stopped()`) between yields. - On cancellation, emit a terminal with `FinishReason::Cancelled` — not `Length` or `Stop`. The conformance kit treats any other terminal after cancellation as ignoring the signal. **Typed errors vs. 
string errors**: ```rust // Typed (preferred): preserves BackendError variant end-to-end. yield Err(DynamoError::builder() .error_type(ErrorType::Backend(BackendError::InvalidArgument)) .message("bad request") .build()); // String: convenient, loses the typed variant. yield Ok(LLMEngineOutput::error("something went wrong".into())); ``` Use typed errors when the failure category matters to the caller. Use string errors when it doesn't. #### Rust: `abort()` and per-request cleanup `abort` is called by the framework **only** when `ctx.stopped()` or `ctx.killed()` fires — i.e. an explicit client/operator cancel. It is NOT called when the stream is silently dropped (TCP reset, consumer timeout without cancellation). **For cleanup that must run on any drop path** (releasing a scheduler slot, freeing a request handle), use RAII inside the `generate` stream body: ```rust struct RequestGuard { /* ... */ } impl Drop for RequestGuard { fn drop(&mut self) { // Always runs when the stream is dropped, however that happens. } } Ok(Box::pin(async_stream::stream! { let _guard = RequestGuard { /* ... */ }; // ... your stream body })) ``` The mocker's `ActiveRequestGuard` is the canonical example. Use `abort` only for out-of-band notifications (e.g. telling a remote scheduler to stop computing for this request). #### Rust: `drain()` and `cleanup()` - `drain()` runs once before shutdown, after the discovery unregister + grace-period sleep, while NATS/etcd are still alive. Use it for backend-side draining that must complete before the transport layer goes away (e.g. in-flight NIXL KV transfers on prefill workers). Default is no-op. - `cleanup()` is called once on shutdown. Release all engine resources. The framework guarantees `cleanup()` runs exactly once if `start()` succeeded — even if registration or serve fails afterward. Make `cleanup()` idempotent and tolerant of being called from a half-initialized state. 
Engines like vLLM/TRT-LLM tear down NCCL groups in `cleanup()` and a second attempt can hang. ### Rust Step 5: Write `main.rs` Three lines. That's it. ```rust use std::sync::Arc; mod engine; fn main() -> anyhow::Result<()> { let (engine, config) = engine::MyBackend::from_args(None)?; dynamo_backend_common::run(Arc::new(engine), config) } ``` `run` installs signal handlers, builds the distributed runtime, calls `engine.start()`, registers the model with discovery, serves the endpoint, and runs the full graceful-shutdown orchestrator on SIGTERM/SIGINT. ### Rust Step 6: Errors and logging **Errors**: every error returned from `start`, `generate`, `cleanup`, and `from_args` uses `ErrorType::Backend(BackendError::X)`. From the frontend's perspective, anything bubbling up through the backend has "originated from the backend" — engine code vs. framework code is not distinguished. Top-level `ErrorType::X` variants are reserved for non-backend paths. A small helper module per backend keeps the call sites clean: ```rust pub(crate) fn invalid_arg(msg: impl Into) -> DynamoError { DynamoError::builder() .error_type(ErrorType::Backend(BackendError::InvalidArgument)) .message(msg) .build() } ``` Common nested categories: `InvalidArgument`, `CannotConnect`, `EngineShutdown`, `StreamIncomplete`, `Cancelled`, `ResponseTimeout`, `Disconnected`, `ConnectionTimeout`, `Unknown`. **Logging**: keep levels consistent across Rust backends so operators see the same surface everywhere. - `tracing::info!` for lifecycle milestones (engine started, cleanup complete). `Worker` already logs "Serving {model} on …" and "Engine cleanup complete" — add your own only for events those don't cover. - `tracing::debug!` for per-request events (cancellation, abort). - `tracing::warn!` for recoverable problems. - `tracing::error!` only for unrecoverable failures. ### Rust Step 7: Run the conformance kit Before merging, prove your engine satisfies the contract. 
The conformance kit is one call:

```rust
#[tokio::test]
async fn my_engine_passes_conformance() {
    // `run_conformance` takes a factory closure rather than a built
    // engine: the kit constructs a second pristine engine for its
    // "cleanup-without-start" check.
    dynamo_backend_common::testing::run_conformance(|| {
        MyBackend::new(/* your defaults */).expect("construct")
    })
    .await
    .expect("conformance");
}
```

The kit runs `start`/`generate`/`cleanup` directly against your engine; no external service is involved. If your engine needs a real GPU, remote model server, or other heavyweight resource to construct, gate the test with `#[ignore]` and require an explicit opt-in env var.

What it asserts:

| Check | Failure mode |
|---|---|
| `start()` returns a non-empty `EngineConfig.model` | `EmptyModelInConfig` |
| Single `generate()` ends in a terminal chunk | `NoTerminalChunk` |
| No chunks after the terminal | `ChunkAfterTerminal` |
| Interleaved `generate()` calls all succeed | `ConcurrentGenerateFailed` |
| Mid-stream cancel terminates within 2s | `CancellationNotObserved` |
| Cancelled stream's terminal is `FinishReason::Cancelled` | `CancellationIgnored` |
| `cleanup()` succeeds twice (idempotent) | `SecondCleanupFailed` |
| `cleanup()` on a never-started engine succeeds | `CleanupWithoutStartFailed` |

For tests that don't need a real engine, use `testing::mock_context()` or `testing::cancelling_context(after)` to drive `generate` manually.

### Rust Step 8: Run it locally

Three moving parts need to come up: NATS + etcd (discovery and the event/request planes), the Dynamo Python frontend (HTTP → backend discovery), and your backend.

The fastest path is to copy the **mocker example's [`docker-compose.yml`](../../lib/backend-common/examples/mocker/docker-compose.yml) and [`Dockerfile.frontend`](../../lib/backend-common/examples/mocker/Dockerfile.frontend)**, swap in your image, and run `docker compose up --build`.
That brings up NATS + etcd + the Python frontend (built from the dynamo workspace at the same SHA as your backend) + your backend, all on one network. For a non-Docker dev loop: ```bash cargo build --release # Ensure NATS + etcd are reachable (NATS_SERVER, ETCD_ENDPOINTS). ./target/release/my-backend Qwen/Qwen3-0.6B \ --namespace dynamo \ --component backend \ --endpoint generate # In another shell, start the Python frontend from the dynamo repo: python -m dynamo.frontend --http-port 8000 ``` Then send a request: ```bash curl http://localhost:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 32 }' ``` A successful response has non-empty `choices[0].message.content` and a `finish_reason` of `stop` or `length`. `jq -e '.choices[0].finish_reason'` is a good one-liner for a CI smoke test. `run` initializes `tracing` from the `DYN_LOG` env var (defaults to `info`); set `DYN_LOG=debug` or `DYN_LOG=info,dynamo_backend_common=trace` for more detail. `RUST_LOG` is not honored — `DYN_LOG` replaces it. ### Rust reference: mocker backend [`lib/backend-common/examples/mocker`](https://github.com/ai-dynamo/dynamo/tree/main/lib/backend-common/examples/mocker) is the canonical small-but-complete reference. Lift these patterns: - Single shared scheduler driving many concurrent streams via a fan-out task and per-request `mpsc` channels. - `ActiveRequestGuard` for RAII cleanup that runs on any stream drop. - `biased` select with `ctx.stopped()` first, channel second — the shutdown-race fix discussed in Step 4. - `cleanup()` signals every active stream via `ctx.stop_generating()` so each yields a clean `Cancelled` terminal instead of an error from channel-close. ### Rust checklist Before shipping: - [ ] `LLMEngine` implemented; `from_args` is inherent (not on the trait). - [ ] All errors use `ErrorType::Backend(BackendError::X)`. 
- [ ] `generate` polls `ctx.is_stopped()` between yields and emits `FinishReason::Cancelled` on cancel.
- [ ] Per-request cleanup uses RAII guards, not just `abort`.
- [ ] `cleanup` is idempotent.
- [ ] Conformance kit runs green: `testing::run_conformance(|| ...)`.
- [ ] Logging levels match the standards in Step 6.

### Rust see also

- [Crate README](../../lib/backend-common/README.md): in-tree reference (architecture, file index, contracts at a glance).
- [`LLMEngine` trait](../../lib/backend-common/src/engine.rs): authoritative contract.
- [Design notes](../../lib/backend-common/CLAUDE.md): rationale and invariants.
- [`Worker`](../../lib/backend-common/src/worker.rs): runtime lifecycle internals (signal handling, graceful shutdown, model registration).
- [Conformance kit](../../lib/backend-common/src/testing.rs): `run_conformance`, `mock_context`, `cancelling_context`.
- [Mocker backend](../backends/mocker_backend/README.md): example user guide.
- [Python sibling](../../components/src/dynamo/common/backend/README.md): Python ABC layered over this crate.

# Writing Python Workers in Dynamo

> **Lower-level Python worker path.** This guide documents the
> `@dynamo_worker()` + `register_model()` + `endpoint.serve_endpoint()`
> entry point. For new engines, prefer Dynamo's
> [unified backend path](/dynamo/backends/custom-backend/writing-unified-backends) for Python or Rust. It puts the
> framework in charge of lifecycle, signal handling, cancellation
> monitoring, and model registration, and ships plumbing for Prometheus
> metrics, KV event publishing, KV-aware routing, OpenTelemetry
> tracing, health-check canaries, guided decoding, and custom Jinja
> chat templates. Stay on this path for workloads that depend on
> multimodal, LoRA, logprob extraction, engine routes (pause/resume,
> profiling, weight updates), text-in-text-out, snapshot/CRIU, or
> diffusion, features that the unified backend does not yet cover.
> See the [unified-path feature gaps](/dynamo/backends/custom-backend/writing-unified-backends#python-feature-gaps)
> for the current matrix.

This guide explains how to create your own Python worker in Dynamo.

The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.

The Python file must do three things:

1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler

```python
import asyncio

import uvloop

from dynamo.llm import ModelInput, ModelType, register_model
from dynamo.runtime import DistributedRuntime, dynamo_worker


# 1. Decorate a function to get the runtime
@dynamo_worker()
async def worker(runtime: DistributedRuntime):
    # 2. Register ourselves on the network
    endpoint = runtime.endpoint("namespace.component.endpoint")
    model_path = "Qwen/Qwen3-0.6B"  # or "/data/models/Qwen3-0.6B"
    model_input = ModelInput.Tokens  # or ModelInput.Text if engine handles pre-processing
    model_type = ModelType.Chat  # or ModelType.Chat | ModelType.Completions if model can be deployed on chat and completions endpoints
    # Optional last param to register_model is model_name. If not present derives it from model_path
    await register_model(model_input, model_type, endpoint, model_path)

    # Initialize your engine here
    # engine = ...

    # 3. Attach request handler
    await endpoint.serve_endpoint(RequestHandler(engine).generate)


class RequestHandler:
    def __init__(self, engine):
        ...

    async def generate(self, request):
        # Call the engine
        # yield result dict
        ...


if __name__ == "__main__":
    uvloop.install()
    asyncio.run(worker())
```

The `model_path` can be:

- A Hugging Face repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a Hugging Face repo: any folder containing safetensors files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.

The `model_input` can be:

- `ModelInput.Tokens`. Your engine expects pre-processed input (token IDs).
  Dynamo handles tokenization and pre-processing.
- `ModelInput.Text`. Your engine expects raw text input and handles its own tokenization and pre-processing.

The `model_type` can be:

- `ModelType.Chat`. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://developers.openai.com/api/reference/resources/chat/subresources/completions/methods/create).
- `ModelType.Completions`. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://developers.openai.com/api/reference/resources/completions/methods/create) type.

`register_model` can also take the following kwargs:

- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name or the folder name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
- `user_data`: Optional dictionary containing custom metadata for worker behavior (e.g., LoRA configuration). Defaults to None.

See `examples/backends` for full code examples.

## Component Names

A worker needs three names to register itself: `namespace.component.endpoint`

* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
* *Component*: A load-balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.

If you run two models, that is two pipelines. One exception is speculative decoding.
The draft model is part of the pipeline of a bigger model.

If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.

Example 1: Data parallel load balanced, one model one pipeline two instances.

```
Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
Node 2: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
```

Example 2: Two models, two pipelines.

```
Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B
Node 2: namespace: llama3-1-8b, component: backend, endpoint: generate, model: /data/Llama-3.1-8B-Instruct/
```

Example 3: Different endpoints.

The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM it will also expose `llama3-1-8b.backend.load_metrics`.

Example 4: Multiple components in a pipeline.

In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.

## Migrate Ongoing Requests

A Python worker may need to be shut down promptly, for example when the node running the worker is to be reclaimed and there isn't enough time to complete all ongoing requests before the shutdown deadline.

In such cases, you can signal incomplete responses by raising an `EngineShutdown` exception in your generate loop. This immediately closes the response stream, signaling to the frontend that the stream is incomplete.
With request migration enabled (see the [`migration_limit`](/dynamo/user-guides/fault-tolerance/request-migration) parameter), the frontend will automatically migrate the partially completed request to another worker instance, if available, to be completed.

Here's an example of how to implement this in your `RequestHandler`:

```python
from dynamo.llm.exceptions import EngineShutdown


class RequestHandler:
    async def generate(self, request):
        """Generate response, with support for request migration"""
        for result in self.engine.generate_streaming(request):
            # Check if we need to migrate before yielding each token.
            # `is_shutting_down()` is a placeholder for your own shutdown check.
            if is_shutting_down():
                # Raising EngineShutdown closes the stream and triggers migration
                raise EngineShutdown("Worker shutting down, migrating request")

            yield result
```

When `EngineShutdown` is raised, the frontend receives the incomplete response and can seamlessly continue generation on another available worker instance, preserving the user experience even during worker shutdowns.

For more information about how request migration works, see the [Request Migration Architecture](/dynamo/user-guides/fault-tolerance/request-migration) documentation.

## Request Cancellation

Your Python worker's request handler can optionally support request cancellation by accepting a `context` argument after the `request` argument. This context object allows you to check for cancellation signals and respond appropriately:

```python
class RequestHandler:
    async def generate(self, request, context):
        """Generate response with cancellation support"""
        for result in self.engine.generate_streaming(request):
            # Check if the request has been cancelled
            if context.is_stopped():
                # Stop processing and clean up
                break

            yield result
```

The context parameter is optional: if your generate method doesn't include it in its signature, Dynamo will call your method without the context argument.
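To see the cancellation check in action without a running Dynamo deployment, the handler can be driven by hand. The following is a minimal sketch, not Dynamo's own test harness: `FakeContext`, `FakeEngine`, and `run_with_cancel` are hypothetical stand-ins invented for illustration, and the engine is assumed to expose an async `generate_streaming` generator.

```python
import asyncio


class FakeContext:
    """Hypothetical stand-in for the context object passed to `generate`."""

    def __init__(self):
        self._stopped = False

    def stop(self):
        self._stopped = True

    def is_stopped(self):
        return self._stopped


class FakeEngine:
    """Hypothetical engine that streams `max_tokens` single-token chunks."""

    async def generate_streaming(self, request):
        for i in range(request["max_tokens"]):
            await asyncio.sleep(0)  # hand control back to the event loop
            yield {"token_ids": [i]}


class RequestHandler:
    def __init__(self, engine):
        self.engine = engine

    async def generate(self, request, context):
        async for result in self.engine.generate_streaming(request):
            # Stop before yielding anything further once cancelled.
            if context.is_stopped():
                break
            yield result


async def run_with_cancel(cancel_after):
    """Consume the stream and cancel it after `cancel_after` chunks."""
    handler = RequestHandler(FakeEngine())
    context = FakeContext()
    received = []
    async for chunk in handler.generate({"max_tokens": 10}, context):
        received.append(chunk)
        if len(received) == cancel_after:
            context.stop()
    return received
```

Calling `asyncio.run(run_with_cancel(3))` returns three chunks, because the handler checks `is_stopped()` before each yield and exits the loop on the next iteration after the stop signal.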
For detailed information about request cancellation, including async cancellation monitoring and context propagation patterns, see the [Request Cancellation Architecture](/dynamo/user-guides/fault-tolerance/request-cancellation) documentation.

# Runtime Containers

Dynamo runtime images package the Dynamo runtime with an inference engine. The same container build flow can generate images for the built-in engines or a backend that you add on top of the Dynamo runtime.

Use [`container/render.py`](../../container/render.py) to select the engine family and Docker target:

```bash
# vLLM runtime image
python container/render.py --framework=vllm --target=runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .

# SGLang runtime image
python container/render.py --framework=sglang --target=runtime --output-short-filename
docker build -t dynamo:latest-sglang-runtime -f container/rendered.Dockerfile .

# TensorRT-LLM runtime image
python container/render.py --framework=trtllm --target=runtime --cuda-version=13.1 --output-short-filename
docker build -t dynamo:latest-trtllm-runtime -f container/rendered.Dockerfile .
```

## Engine and Target Toggles

`--framework` chooses the engine base. Use `vllm`, `sglang`, or `trtllm` for built-in backends. Use `none` when you want a Dynamo-only base image and plan to install your own backend package.

`--target` chooses the image shape:

| Target | Use when |
| --- | --- |
| `runtime` | Running inference, benchmarks, or Kubernetes deployments. |
| `local-dev` | Developing locally with the workspace bind-mounted into the container. |
| `dev` | Legacy root-based development workflows. Prefer `local-dev` for new work. |

## Custom Backend Image

For a Python custom backend, start with a built-in engine image if you need that framework's CUDA/Python stack, or use `--framework=none` if your backend brings its own dependencies:

```bash
python container/render.py --framework=none --target=runtime --output-short-filename
docker build -t dynamo:custom-backend-base -f container/rendered.Dockerfile .
```

Then layer your backend package into a small Dockerfile:

```Dockerfile
FROM dynamo:custom-backend-base
COPY dist/my_backend-*.whl /tmp/
RUN uv pip install --system --no-deps /tmp/my_backend-*.whl
ENTRYPOINT ["my-backend"]
```

For a Rust custom backend, build the backend binary in your own builder stage and copy it into the Dynamo runtime image:

```Dockerfile
FROM rust:1.93 AS backend-builder
WORKDIR /src
COPY . .
RUN cargo build --release

FROM dynamo:custom-backend-base
COPY --from=backend-builder /src/target/release/my-backend /usr/local/bin/my-backend
ENTRYPOINT ["my-backend"]
```

## Run Locally

Use `container/run.sh` to launch the image with the same GPU and mount defaults used by Dynamo development workflows:

```bash
container/run.sh --image dynamo:custom-backend-base --mount-workspace -it
```

For the full container build reference, target matrix, and troubleshooting notes, see the repository-level [Container Development Guide](../../container/README.md).

# Frontend

The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
## Feature Matrix | Feature | Status | |---------|--------| | OpenAI Chat Completions API (`/v1/chat/completions`) | ✅ Supported | | OpenAI Completions API (`/v1/completions`) | ✅ Supported | | OpenAI Embeddings API (`/v1/embeddings`) | ✅ Supported | | OpenAI Responses API (`/v1/responses`) | ✅ Supported | | OpenAI Models API (`/v1/models`) | ✅ Supported | | Image Generation (`/v1/images/generations`) | ✅ Supported | | Video Generation (`/v1/videos/generations`) | ✅ Supported | | Anthropic Messages API (`/v1/messages`) | 🧪 Experimental | | KServe gRPC v2 API | ✅ Supported | | Streaming responses (SSE) | ✅ Supported | | Multi-model serving | ✅ Supported | | Integrated KV-aware routing | ✅ Supported | | Tool calling | ✅ Supported | | TLS (HTTPS) | ✅ Supported | | Swagger UI (`/docs`) | ✅ Supported | | NVIDIA request extensions (`nvext`) | ✅ Supported | ## Quick Start ### Prerequisites - Dynamo platform installed - `etcd` and `nats-server -js` running - At least one backend worker registered ### HTTP Frontend ```bash python -m dynamo.frontend --http-port 8000 ``` This starts an OpenAI-compatible HTTP server with integrated pre/post processing and routing. Backends are auto-discovered when they call `register_model`. The frontend does the pre and post processing. To do this it will need access to the model configuration files: `config.json`, `tokenizer.json`, `tokenizer_config.json`, etc. It does not need the weights. Frontend will download the files it needs from Hugging Face, no setup is required. However we recommend setting up [modelexpress-server](https://github.com/ai-dynamo/modelexpress) and a shared folder such as a Kubernetes PVC. This ensures the model is only downloaded once across the whole cluster. If the model is not available on Hugging Face, such as a private or customized model, you will need to make the model files available locally at the same file path as on the backend. 
The backend's `--model-path ` will need to exist on the frontend and contain at least the configuration (JSON) files. ### KServe gRPC Frontend ```bash python -m dynamo.frontend --kserve-grpc-server ``` See the [Frontend Guide](/dynamo/components/frontend/frontend-guide) for KServe-specific configuration and message formats. ### Kubernetes ```yaml apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment metadata: name: frontend-example spec: graphs: - name: frontend replicas: 1 services: - name: Frontend image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest command: - python - -m - dynamo.frontend - --http-port - "8000" ``` ## Configuration | Parameter | Default | Description | |-----------|---------|-------------| | `--http-port` | 8000 | HTTP server port | | `--kserve-grpc-server` | false | Enable KServe gRPC server | | `--router-mode` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct`, `least-loaded`, `device-aware-weighted` (`power-of-two` and `least-loaded` use synchronous prefill fallback in disaggregated prefill mode) | See the [Frontend Guide](/dynamo/components/frontend/frontend-guide) for full configuration options. ## Next Steps | Document | Description | |----------|-------------| | [Configuration Reference](/dynamo/components/frontend/configuration-reference) | All CLI arguments, env vars, and HTTP endpoints | | [Frontend Guide](/dynamo/components/frontend/frontend-guide) | KServe gRPC configuration and integration | | [NVIDIA Request Extensions (nvext)](/dynamo/additional-resources/nvidia-request-extensions-nvext) | Custom request fields for routing hints and cache control | | [Router Documentation](/dynamo/components/router) | KV-aware routing configuration | # Frontend Guide This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend. 
## KServe gRPC Frontend

### Motivation

[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that comply with the KServe v2 API, and it has gained wide adoption. To let Triton users quickly explore Dynamo's benefits, Dynamo provides a KServe gRPC frontend.

This documentation assumes readers are familiar with the KServe v2 API. It focuses on the Dynamo parts that work together to support the KServe API and on how users may migrate an existing KServe deployment to Dynamo.

## Supported Endpoints

* `ModelInfer` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
* `ModelStreamInfer` endpoint: Triton extension endpoint that provides a bi-directional streaming version of the inference RPC, allowing a sequence of inference requests/responses to be sent over a gRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
* `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
* `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)

## Starting the Frontend

To start the KServe frontend, run the command below:

```bash
python -m dynamo.frontend --kserve-grpc-server
```

## gRPC Performance Tuning

The gRPC server supports optional HTTP/2 flow control tuning via environment variables. These can be set before starting the server to optimize for high-throughput streaming workloads.

| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE` | HTTP/2 connection-level flow control window size in bytes | tonic default (64KB) |
| `DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE` | HTTP/2 per-stream flow control window size in bytes | tonic default (64KB) |

### Example: High-ISL/OSL configuration for streaming workloads

```bash
# For 128 concurrent 15k-token requests
export DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE=16777216  # 16MB
export DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE=1048576       # 1MB
python -m dynamo.frontend --kserve-grpc-server
```

If these variables are not set, the server uses tonic's default values.

Tune these values based on your workload. The connection window should accommodate `concurrent_requests x request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.

## Registering a Backend

As with the HTTP frontend, a registered backend is auto-discovered and added to the frontend's list of served models. To register a backend, use the same `register_model()` API. Currently the frontend supports serving the following model type and model input combinations:

* `ModelType::Completions` and `ModelInput::Text`: Combination for an LLM backend that uses a custom preprocessor
* `ModelType::Completions` and `ModelInput::Tokens`: Combination for an LLM backend that uses the Dynamo preprocessor (i.e. Dynamo SGLang / TRTLLM / vLLM backend)
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for a backend that is used for generic tensor-based inference

The first two combinations are backed by the OpenAI Completions API; see the [OpenAI Completions section](#openai-completions) for more detail.
The last combination is most closely aligned with the KServe API, and users can replace an existing deployment with Dynamo once their backends implement an adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`; see the [Tensor section](#tensor) for more detail.

### OpenAI Completions

Most Dynamo features are tailored for LLM inference. The combinations backed by the OpenAI API can enable those features and are best suited for exploring them. However, this requires a specific conversion between generic tensor-based messages and OpenAI messages, and it imposes a specific structure on the KServe request message.

#### Model Metadata / Config

The metadata and config endpoints report the registered backend with the fields below. Note that this is not the exact response.

```json
{
  "name": "$MODEL_NAME",
  "version": 1,
  "platform": "dynamo",
  "backend": "dynamo",
  "inputs": [
    { "name": "text_input", "datatype": "BYTES", "shape": [1] },
    { "name": "streaming", "datatype": "BOOL", "shape": [1], "optional": true }
  ],
  "outputs": [
    { "name": "text_output", "datatype": "BYTES", "shape": [-1] },
    { "name": "finish_reason", "datatype": "BYTES", "shape": [-1], "optional": true }
  ]
}
```

#### Inference

On receiving an inference request, the frontend performs the following conversion:

* `text_input`: the element is expected to contain the user prompt string and is converted to the `prompt` field in the OpenAI Completion request
* `streaming`: the element is converted to the `stream` field in the OpenAI Completion request

On receiving a model response, the frontend performs the following conversion:

* `text_output`: each element corresponds to one choice in the OpenAI Completion response, and the content is set to the `text` of the choice.
* `finish_reason`: each element corresponds to one choice in the OpenAI Completion response, and the content is set to the `finish_reason` of the choice.
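
The field mapping described above can be sketched in a few lines of Python. This is only an illustration of the conversion rules, not the frontend's actual implementation: plain dictionaries stand in for the KServe protobuf messages, and the helper names `kserve_to_completion_request` and `completion_to_kserve_outputs`, as well as the `data` key used to hold tensor contents, are assumptions made for this example.

```python
from __future__ import annotations

from typing import Any


def kserve_to_completion_request(model_name: str, inputs: list[dict[str, Any]]) -> dict[str, Any]:
    """Map KServe-style input tensors to an OpenAI Completions-style request (hypothetical helper)."""
    by_name = {tensor["name"]: tensor for tensor in inputs}
    if "text_input" not in by_name:
        raise ValueError("text_input tensor is required")
    # text_input has shape [1]: its single element is the user prompt.
    request: dict[str, Any] = {"model": model_name, "prompt": by_name["text_input"]["data"][0]}
    # streaming is optional; only set `stream` when the tensor is present.
    if "streaming" in by_name:
        request["stream"] = bool(by_name["streaming"]["data"][0])
    return request


def completion_to_kserve_outputs(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Map an OpenAI Completions-style response to KServe-style output tensors (hypothetical helper)."""
    choices = response["choices"]
    # One tensor element per choice, in choice order.
    texts = [choice["text"] for choice in choices]
    reasons = [choice.get("finish_reason") or "" for choice in choices]
    return [
        {"name": "text_output", "datatype": "BYTES", "shape": [len(texts)], "data": texts},
        {"name": "finish_reason", "datatype": "BYTES", "shape": [len(reasons)], "data": reasons},
    ]


example_request = kserve_to_completion_request(
    "my-model",
    [
        {"name": "text_input", "datatype": "BYTES", "shape": [1], "data": ["Hello!"]},
        {"name": "streaming", "datatype": "BOOL", "shape": [1], "data": [True]},
    ],
)
example_outputs = completion_to_kserve_outputs(
    {"choices": [{"text": "Hi there", "finish_reason": "stop"}]}
)
```

Here `example_request` carries the prompt and the `stream` flag, and `example_outputs` holds one element per choice in each output tensor, mirroring the `[-1]` shapes reported by the metadata endpoint.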

### Tensor

This combination is used when migrating an existing KServe-based backend into the Dynamo ecosystem.

#### Model Metadata / Config

When registering the backend, the backend must provide the model's metadata. A tensor-based deployment is generic, so the frontend cannot make the assumptions it makes for an OpenAI Completions model. There are two ways to provide model metadata:

* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/llm/src/protocols/tensor.rs): This is the Dynamo-defined structure for model metadata. The backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/bindings/python/tests/test_tensor.py). For metadata provided this way, the following fields are set to fixed values: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for the model config endpoint, the rest of the fields are set to their default values.
* [triton_model_config](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/llm/src/protocols/tensor.rs): Users who already have a Triton model config and need the full config returned for client-side logic can set it in `TensorModelConfig::triton_model_config`, which supersedes the other fields in `TensorModelConfig` and is used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message; see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/tests/frontend/grpc/echo_tensor_worker.py) for an example.

#### Inference

For each inference request, the backend receives an [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/llm/src/protocols/tensor.rs) and is expected to return an [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/llm/src/protocols/tensor.rs). These are the Dynamo mappings of the `ModelInferRequest` / `ModelInferResponse` protobuf messages.

## Python Bindings

The frontend can be started through the Python bindings. This is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See [server.py](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/bindings/python/examples/kserve_grpc_service/server.py) for an example.

## Integration

### With Router

The frontend includes an integrated router for request distribution. Configure routing mode:

```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```

See [Router Documentation](/dynamo/components/router) for routing configuration details.

### With Backends

Backends auto-register with the frontend when they call `register_model()`. Supported backends:

- [vLLM Backend](/dynamo/backends/v-llm)
- [SGLang Backend](/dynamo/backends/sg-lang)
- [TensorRT-LLM Backend](/dynamo/backends/tensor-rt-llm)

## See Also

| Document | Description |
|----------|-------------|
| [Frontend Overview](/dynamo/components/frontend) | Quick start and feature matrix |
| [NVIDIA Request Extensions (`nvext`)](/dynamo/additional-resources/nvidia-request-extensions-nvext) | Routing, preprocessing, response metadata, and engine priority extensions |
| [Router Documentation](/dynamo/components/router) | KV-aware routing configuration |

# Frontend Configuration Reference

This page documents all configuration options for the Dynamo Frontend (`python -m dynamo.frontend`).

Every CLI argument has a corresponding environment variable. CLI arguments take precedence over environment variables.

## HTTP & Networking

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--http-host` | `DYN_HTTP_HOST` | `0.0.0.0` | HTTP listen address |
| `--http-port` | `DYN_HTTP_PORT` | `8000` | HTTP listen port |
| `--tls-cert-path` | `DYN_TLS_CERT_PATH` | — | TLS certificate path (PEM). Must be paired with `--tls-key-path` |
| `--tls-key-path` | `DYN_TLS_KEY_PATH` | — | TLS private key path (PEM). Must be paired with `--tls-cert-path` |

The Rust HTTP server also reads these environment variables (not exposed as CLI args):

| Env Var | Default | Description |
|---------|---------|-------------|
| `DYN_HTTP_BODY_LIMIT_MB` | `192` | Maximum request body size in MB |
| `DYN_HTTP_GRACEFUL_SHUTDOWN_TIMEOUT_SECS` | `5` | Graceful shutdown timeout in seconds |

## Router

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--router-mode` | `DYN_ROUTER_MODE` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct`, `least-loaded`, `device-aware-weighted` |
| `--load-aware` / `--no-load-aware` | `DYN_ROUTER_LOAD_AWARE` | `false` | Preset for KV load-aware routing without cache-reuse signals; implies `--router-mode kv` |
| `--router-kv-overlap-score-credit` | `DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT` | `1.0` | Credit multiplier for device-local prefix overlap, from 0.0 to 1.0 |
| `--router-prefill-load-scale` | `DYN_ROUTER_PREFILL_LOAD_SCALE` | `1.0` | Scale adjusted prompt-side prefill load before adding decode blocks |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` | Softmax temperature for normalized worker sampling. 0 = deterministic |
| `--router-kv-events` / `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS` | `true` | Enable KV cache state events from workers. Disable for prediction-based routing |
| `--router-ttl-secs` | `DYN_ROUTER_TTL_SECS` | `120.0` | Block TTL when KV events are disabled |
| `--router-replica-sync` / `--no-router-replica-sync` | `DYN_ROUTER_REPLICA_SYNC` | `false` | Sync state across multiple router instances |
| `--router-snapshot-threshold` | `DYN_ROUTER_SNAPSHOT_THRESHOLD` | `1000000` | Messages before triggering a snapshot |
| `--router-reset-states` / `--no-router-reset-states` | `DYN_ROUTER_RESET_STATES` | `false` | Reset router state on startup. **Warning:** affects existing replicas |
| `--router-track-active-blocks` / `--no-router-track-active-blocks` | `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` | `true` | Track blocks used by in-progress requests for load balancing |
| `--router-assume-kv-reuse` / `--no-router-assume-kv-reuse` | `DYN_ROUTER_ASSUME_KV_REUSE` | `true` | Assume KV cache reuse when tracking active blocks |
| `--router-track-output-blocks` / `--no-router-track-output-blocks` | `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` | `false` | Track output blocks with fractional decay during generation |
| `--router-track-prefill-tokens` / `--no-router-track-prefill-tokens` | `DYN_ROUTER_TRACK_PREFILL_TOKENS` | `true` | Track prompt-side prefill load in worker load accounting |
| `--router-prefill-load-model` | `DYN_ROUTER_PREFILL_LOAD_MODEL` | `none` | Prompt-side load model: `none` for static load, `aic` for oldest-prefill decay using an AIC prediction |
| `--router-event-threads` | `DYN_ROUTER_EVENT_THREADS` | `4` | KV indexer worker threads. >1 enables the concurrent radix tree, including with `--no-router-kv-events` |
| `--router-queue-threshold` | `DYN_ROUTER_QUEUE_THRESHOLD` | `16.0` | Queue threshold fraction of prefill capacity. Priority hints only affect requests waiting in this queue |
| `--router-queue-policy` | `DYN_ROUTER_QUEUE_POLICY` | `fcfs` | Queue scheduling policy: `fcfs` (tail TTFT), `wspt` (avg TTFT), or `lcfs` (comparison-only reverse ordering) |
| `--decode-fallback` / `--no-decode-fallback` | `DYN_DECODE_FALLBACK` | `false` | Fall back to aggregated mode when prefill workers unavailable |

## AIC Prefill Load Model

These options are used only when `--router-mode kv` is combined with `--router-prefill-load-model aic`.

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--aic-backend` | `DYN_AIC_BACKEND` | — | Backend family to model in AIC, for example `vllm` or `sglang` |
| `--aic-system` | `DYN_AIC_SYSTEM` | — | AIC hardware/system identifier, for example `h200_sxm` |
| `--aic-model-path` | `DYN_AIC_MODEL_PATH` | — | Model path or model identifier used for AIC perf lookup |
| `--aic-backend-version` | `DYN_AIC_BACKEND_VERSION` | backend-specific | Pinned AIC database version. If omitted, Dynamo uses the backend default |
| `--aic-tp-size` | `DYN_AIC_TP_SIZE` | `1` | Tensor-parallel size to model in AIC |
| `--aic-moe-tp-size` | `DYN_AIC_MOE_TP_SIZE` | — | MoE tensor-parallel size for models that require AIC MoE parallelism |
| `--aic-moe-ep-size` | `DYN_AIC_MOE_EP_SIZE` | — | MoE expert-parallel size for models that require AIC MoE parallelism |
| `--aic-attention-dp-size` | `DYN_AIC_ATTENTION_DP_SIZE` | — | Attention data-parallel size for models that require AIC MoE parallelism |

When enabled, the frontend's embedded KV router predicts one expected prefill duration per admitted request, using the selected worker's overlap-derived cached prefix. The router then decays only the oldest active prefill request on each worker for prompt-side load accounting.

For MoE models, AIC requires `aic_tp_size * aic_attention_dp_size == aic_moe_tp_size * aic_moe_ep_size`. For Kimi-style TP-only MoE runs, set `--aic-moe-tp-size` to the same value as `--aic-tp-size`, with `--aic-moe-ep-size 1` and `--aic-attention-dp-size 1`.

## Fault Tolerance

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--migration-limit` | `DYN_MIGRATION_LIMIT` | `0` | Max request migrations per worker disconnect. 0 = disabled |
| `--active-decode-blocks-threshold` | `DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD` | `1.0` | KV cache utilization fraction (0.0–1.0) for busy detection. Pass `None` to disable |
| `--active-prefill-tokens-threshold` | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD` | `10000000` | Absolute token count for prefill busy detection. Pass `None` to disable |
| `--active-prefill-tokens-threshold-frac` | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD_FRAC` | `64.0` | Fraction of `max_num_batched_tokens` for prefill busy detection. OR logic with absolute threshold. Pass `None` to disable |
| `--admission-control` | `DYN_ADMISSION_CONTROL` | `none` | Admission control mode. `token-capacity` applies the busy thresholds above; `none` clears them. Router queueing remains controlled by `--router-queue-threshold` |

## Model Discovery

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--namespace` | `DYN_NAMESPACE` | — | Exact namespace for model discovery scoping |
| `--namespace-prefix` | `DYN_NAMESPACE_PREFIX` | — | Namespace prefix for discovery (e.g., `ns` matches `ns`, `ns-abc123`). Takes precedence over `--namespace` |
| `--model-name` | `DYN_MODEL_NAME` | — | Override model name string |
| `--model-path` | `DYN_MODEL_PATH` | — | Path to local model directory (for private/custom models) |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | — | KV cache block size override |

## Infrastructure

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--discovery-backend` | `DYN_DISCOVERY_BACKEND` | `etcd` | Service discovery: `kubernetes`, `etcd`, `file`, `mem` |
| `--request-plane` | `DYN_REQUEST_PLANE` | `tcp` | Request distribution: `tcp` (fastest), `nats` |
| `--event-plane` | `DYN_EVENT_PLANE` | auto | Event publishing: `nats`, `zmq`; defaults to `zmq` for `file`/`mem` discovery and `nats` for `etcd`/`kubernetes` |

## KServe gRPC

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--kserve-grpc-server` / `--no-kserve-grpc-server` | `DYN_KSERVE_GRPC_SERVER` | `false` | Start KServe gRPC v2 server |
| `--grpc-metrics-port` | `DYN_GRPC_METRICS_PORT` | `8788` | HTTP metrics port for gRPC service |

See the [Frontend Guide](/dynamo/components/frontend/frontend-guide) for KServe message formats and integration details.

## Monitoring

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--metrics-prefix` | `DYN_METRICS_PREFIX` | `dynamo_frontend` | Prefix for frontend Prometheus metrics |
| `--dump-config-to` | `DYN_DUMP_CONFIG_TO` | — | Dump resolved config to file path |

## Tokenizer

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--tokenizer` | `DYN_TOKENIZER` | `default` | Tokenizer: `default` (HuggingFace) or `fastokens` (high-performance Rust tokenizer). See [Tokenizer](/dynamo/components/frontend/tokenizer) |

## Experimental

| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--enable-anthropic-api` | `DYN_ENABLE_ANTHROPIC_API` | `false` | Enable `/v1/messages` (Anthropic Messages API) |
| `--dyn-chat-processor` | `DYN_CHAT_PROCESSOR` | `dynamo` | Chat processor: `dynamo` (default), `vllm`, or `sglang`. See [Parser Configuration](/dynamo/user-guides/parsing/parser-configuration) for how this combines with the parser flags. |
| `--dyn-debug-perf` | `DYN_DEBUG_PERF` | `false` | Log per-function timing for preprocessing (vllm processor only) |
| `--dyn-preprocess-workers` | `DYN_PREPROCESS_WORKERS` | `0` | Worker processes for CPU-bound preprocessing. 0 = main event loop (vllm processor only) |
| `-i` / `--interactive` | `DYN_INTERACTIVE` | `false` | Interactive text chat mode |

## HTTP Endpoints

The frontend exposes the following HTTP endpoints:

### OpenAI-Compatible

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/chat/completions` | Chat completions (streaming and non-streaming) |
| `POST` | `/v1/completions` | Text completions |
| `POST` | `/v1/embeddings` | Text embeddings |
| `POST` | `/v1/responses` | Responses API |
| `POST` | `/v1/images/generations` | Image generation |
| `POST` | `/v1/videos/generations` | Video generation |
| `POST` | `/v1/videos/generations/stream` | Video generation (streaming) |
| `GET` | `/v1/models` | List available models |

### Anthropic (Experimental)

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/messages` | Anthropic Messages API (requires `--enable-anthropic-api`) |
| `POST` | `/v1/messages/count_tokens` | Token counting for Anthropic API |

### Infrastructure

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Health check |
| `GET` | `/live` | Liveness check |
| `GET` | `/metrics` | Prometheus metrics |
| `GET` | `/openapi.json` | OpenAPI specification |
| `GET` | `/docs` | Swagger UI |
| `POST` | `/busy_threshold` | Set busy thresholds |
| `GET` | `/busy_threshold` | Get current busy thresholds |

### Endpoint Path Customization

All endpoint paths can be overridden via environment variables:

| Env Var | Default Path |
|---------|-------------|
| `DYN_HTTP_SVC_CHAT_PATH_ENV` | `/v1/chat/completions` |
| `DYN_HTTP_SVC_CMP_PATH_ENV` | `/v1/completions` |
| `DYN_HTTP_SVC_EMB_PATH_ENV` | `/v1/embeddings` |
| `DYN_HTTP_SVC_RESPONSES_PATH_ENV` | `/v1/responses` |
| `DYN_HTTP_SVC_MODELS_PATH_ENV` | `/v1/models` |
| `DYN_HTTP_SVC_ANTHROPIC_PATH_ENV` | `/v1/messages` |
| `DYN_HTTP_SVC_HEALTH_PATH_ENV` | `/health` |
| `DYN_HTTP_SVC_LIVE_PATH_ENV` | `/live` |
| `DYN_HTTP_SVC_METRICS_PATH_ENV` | `/metrics` |

## Deprecated

| CLI Argument | Env Var | Description |
|-------------|---------|-------------|
| `--router-durable-kv-events` | `DYN_ROUTER_DURABLE_KV_EVENTS` | Use event-plane local indexer instead |

## See Also

- [Frontend Overview](/dynamo/components/frontend): quick start and feature matrix
- [Frontend Guide](/dynamo/components/frontend/frontend-guide): KServe gRPC configuration
- [NVIDIA Request Extensions (nvext)](/dynamo/additional-resources/nvidia-request-extensions-nvext): custom request fields
- [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning): detailed routing configuration
- [Metrics](/dynamo/user-guides/observability-local/metrics): available Prometheus metrics
- [Fault Tolerance](/dynamo/user-guides/fault-tolerance): request migration and rejection

# Tokenizer

The Dynamo Frontend supports multiple tokenizer backends for BPE-based `tokenizer.json` models. `BPE` is the underlying tokenization algorithm, not a backend-specific feature: both the default HuggingFace path and the `fastokens` path can serve these models. The backend choice controls which implementation performs tokenization before requests are sent to the inference engine.

## Tokenizer Backends

#### `default` HuggingFace Tokenizers

The default backend uses the [HuggingFace `tokenizers`](https://github.com/huggingface/tokenizers) library (Rust). It supports features in `tokenizer.json` files (normalizers, pre-tokenizers, post-processors, decoders, added tokens with special-token flags, and byte-fallback).

#### `fastokens` High-Performance Encoder

The `fastokens` backend uses the [`fastokens`](https://github.com/Atero-ai/fastokens) crate, a purpose-built encoder optimized for throughput on supported BPE `tokenizer.json` models. It is a _hybrid_ backend: encoding uses `fastokens` while decoding falls back to HuggingFace so that incremental detokenization, byte-fallback, and special-token handling work correctly.
Use this backend when tokenization is a measurable bottleneck, for example on high-concurrency prefill-heavy workloads.

#### Compatibility notes:

- Works with standard BPE `tokenizer.json` files (Qwen, LLaMA, GPT-family, Mistral, DeepSeek, etc.).
- If `fastokens` cannot load a particular tokenizer file, the frontend logs a warning and transparently falls back to HuggingFace; requests are never dropped.
- Has no effect on TikToken-format tokenizers (`.model` / `.tiktoken` files), which always use the TikToken backend.

## Configuration

Set the backend with a CLI flag or environment variable. The CLI flag takes precedence.

| CLI Argument | Env Var | Valid values | Default |
|---|---|---|---|
| `--tokenizer` | `DYN_TOKENIZER` | `default`, `fastokens` | `default` |

**Examples:**

```bash
# CLI flag
python -m dynamo.frontend --tokenizer fastokens

# Environment variable
export DYN_TOKENIZER=fastokens
python -m dynamo.frontend
```

## Dynamo Frontend Behavior

When `DYN_TOKENIZER=fastokens` is set:

1. The frontend passes the environment variable to the Rust runtime.
2. When building the tokenizer for a model, `ModelDeploymentCard::tokenizer()` attempts to load `fastokens::Tokenizer` from the same `tokenizer.json` file.
3. If loading succeeds, a hybrid `FastTokenizer` is created that encodes with `fastokens` and decodes with HuggingFace.
4. If loading fails (unsupported tokenizer features, missing file, etc.), the frontend logs a warning and falls back to the standard HuggingFace backend; no operator intervention is needed.

# Router


The Dynamo KV Router routes each request by estimating its computational cost on every candidate worker. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Tuning the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

## Quick Start

To launch the Dynamo frontend with the KV Router:

```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```

For Kubernetes, set `DYN_ROUTER_MODE=kv` on the Frontend service. For event-driven KV state, configure backend workers to publish KV cache events using the backend-specific flags described in [Router Operations](/dynamo/components/router/router-operations#additional-notes). Use `--no-router-kv-events` only when you want approximate cache-state prediction.

| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--load-aware` | disabled | Use KV active-load routing without cache-reuse signals; implies `--router-mode kv` on the frontend |
| `--router-kv-overlap-score-credit` | `1.0` | Credit multiplier for device-local prefix overlap, from 0.0 to 1.0 |
| `--router-prefill-load-scale` | `1.0` | Scale adjusted prompt-side prefill load before adding decode blocks |
| `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Consume worker KV events, or fall back to approximate routing without events |
| `--router-queue-threshold` | `16.0` | Backpressure queue threshold; enables priority scheduling via `nvext.agent_hints.priority` |
| `--router-queue-policy` | `fcfs` | Queue scheduling policy: `fcfs` (tail TTFT), `wspt` (avg TTFT), or `lcfs` (comparison-only reverse ordering) |
| `--no-router-track-prefill-tokens` | disabled | Ignore prompt-side prefill tokens in router load accounting; useful for decode-only routing paths |

### Standalone Router

You can also run the KV router as a standalone service (without the Dynamo frontend). See the [Standalone Router component](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/components/src/dynamo/router/) for more details.

For deployment modes and quick start steps, see the [Router Guide](/dynamo/user-guides/kv-cache-aware-routing). For CLI arguments and tuning guidelines, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning). For A/B benchmarking, see the [KV Router A/B Benchmarking Guide](/dynamo/additional-resources/kv-router-a-b-testing).

## Prerequisites and Limitations

**Requirements:**

- **Dynamic endpoints only**: KV router requires `register_model()` with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
- Backend workers must call `register_model()` with `model_input=ModelInput.Tokens` (see [Backend Guide](/dynamo/backends/custom-backend/python-workers-lower-level))
- Use dynamic discovery with KV routing so the router can track worker instances and KV cache state

**Multimodal Support:**

- **Image routing via multimodal hashes**: Supported in the documented TRT-LLM and vLLM router paths.
- **Other backend or modality combinations**: Check the backend-specific multimodal docs before relying on multimodal hash routing.

**Limitations:**

- Static endpoints are not supported with KV routing; use dynamic discovery so the router can track worker instances and KV cache state

For basic model registration without KV routing, use `--router-mode round-robin`, `--router-mode random`, `--router-mode least-loaded`, or `--router-mode device-aware-weighted` with both static and dynamic endpoints.

## Next Steps

- **[Router Guide](/dynamo/user-guides/kv-cache-aware-routing)**: Deployment modes, quick start, and page map
- **[Routing Concepts](/dynamo/components/router/routing-concepts)**: Cost model and worker-selection behavior
- **[Router Filtering](router-filtering.md)**: Candidate eligibility, DP-rank filtering, and busy-threshold overload handling
- **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags, transport modes, and metrics
- **[Disaggregated Serving](/dynamo/components/router/disaggregated-serving)**: Prefill and decode routing setups
- **[Router Operations](/dynamo/components/router/router-operations)**: Replicas, persistence, and recovery
- **[Router Examples](/dynamo/components/router/router-examples)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Testing](router-testing.md)**: Test layers from Rust unit tests to fixture-backed replay and full process E2E
- **[Standalone Indexer](/dynamo/components/router/standalone-indexer)**: Run the KV indexer as a separate service for independent scaling
- **[Router Design](/dynamo/design-docs/component-design/router-design)**: Architecture details, algorithms, and event transport modes

# Routing Concepts

This page explains how the Dynamo router evaluates workers, chooses a target, and fits into the request path. For CLI flags and tuning knobs, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning).

## KV Cache Routing

KV cache routing optimizes large language model inference by directing requests to the workers that hold the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.

```mermaid
graph TD
    T[Tokens] --> R[KV Aware Router]
    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]

    style T fill:#fff3e0,stroke:#333,color:#333
    style R fill:#2e8b57,stroke:#333,color:#fff
    style W1 fill:#f3e5f5,stroke:#333,color:#333
    style W2 fill:#c8e6c9,stroke:#333,color:#333
    style W3 fill:#f3e5f5,stroke:#333,color:#333
    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```

KV cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:

- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers

The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.

## Cost Calculation

1. **Prefill blocks**: Calculated from active prompt-side token load plus the incoming request's input tokens, divided by the block size. The system updates active prompt load when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Overlap credits**: Device-local, host, disk, and shared-cache hits reduce the prompt-side prefill load before the final prefill scale is applied.
4. **Cost formula**:

   ```text
   adjusted_prefill_blocks = max(
       prefill_blocks
       - overlap_score_credit * device_overlap_blocks
       - host_cache_hit_weight * host_overlap_blocks
       - disk_cache_hit_weight * disk_overlap_blocks
       - shared_cache_multiplier * shared_beyond_blocks,
       0,
   )

   cost = prefill_load_scale * adjusted_prefill_blocks + decode_blocks
   ```

Lower costs indicate better routing choices. `overlap_score_credit` is the device-local prefix-overlap credit multiplier, from 0.0 to 1.0. Higher values favor cache reuse (improving TTFT), while lower values prioritize even load distribution (improving ITL).
`prefill_load_scale` controls the weight of the adjusted prompt-side load relative to decode blocks.

### Active Load Modeling

The `prefill_blocks` and `decode_blocks` terms include projected load for the new request plus active load already assigned to each worker.

#### Prefill Load Modeling

For prefill load, the router first estimates the uncached prompt work for each candidate worker:

```text
effective_isl = input_tokens - cached_tokens
```

By default, that effective prefill load remains charged at full value until the first output token marks prefill complete.

With `--router-prefill-load-model aic`, the router also asks AIC for an expected prefill duration using the effective ISL and cached prefix length. The active load tracker then decays the oldest active prefill request on each worker over that predicted duration. This only changes router-side prompt load accounting; it does not change backend execution.

#### Decode Load Modeling

For decode load, the router tracks active KV blocks assigned to each worker. By default, this covers the prompt-side blocks that are already assigned to active requests and frees them when each request finishes.

When `--router-track-output-blocks` is enabled, the router also adds placeholder output blocks as generation crosses block boundaries. If the request includes `nvext.agent_hints.osl`, those output blocks receive a fractional weight based on progress toward the expected output length. This expected OSL proxy lets requests near completion contribute less future decode load. Without an expected OSL, tracked output blocks count at full weight until the request finishes.

For the flags that enable these models, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning).

## Worker Selection

The router selects the worker with the lowest cost.
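
As a rough illustration of the cost formula from the Cost Calculation section, the sketch below scores three workers and picks the cheapest one. It is a minimal model, not the router's implementation: the `WorkerLoad` class and `routing_cost` function are hypothetical names, and the `0.0` defaults for the host, disk, and shared-cache weights are placeholders chosen for the example because this page does not state their default values.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """Per-worker inputs to the cost formula (hypothetical container for this sketch)."""
    name: str
    prefill_blocks: float
    decode_blocks: float
    device_overlap_blocks: float = 0.0
    host_overlap_blocks: float = 0.0
    disk_overlap_blocks: float = 0.0
    shared_beyond_blocks: float = 0.0


def routing_cost(
    worker: WorkerLoad,
    *,
    overlap_score_credit: float = 1.0,
    prefill_load_scale: float = 1.0,
    host_cache_hit_weight: float = 0.0,    # placeholder value, an assumption
    disk_cache_hit_weight: float = 0.0,    # placeholder value, an assumption
    shared_cache_multiplier: float = 0.0,  # placeholder value, an assumption
) -> float:
    # Subtract the cache credits from the raw prefill load, clamping at zero.
    adjusted_prefill_blocks = max(
        worker.prefill_blocks
        - overlap_score_credit * worker.device_overlap_blocks
        - host_cache_hit_weight * worker.host_overlap_blocks
        - disk_cache_hit_weight * worker.disk_overlap_blocks
        - shared_cache_multiplier * worker.shared_beyond_blocks,
        0.0,
    )
    return prefill_load_scale * adjusted_prefill_blocks + worker.decode_blocks


workers = [
    WorkerLoad("worker-1", prefill_blocks=10, decode_blocks=10, device_overlap_blocks=2),
    WorkerLoad("worker-2", prefill_blocks=10, decode_blocks=5, device_overlap_blocks=5),
    WorkerLoad("worker-3", prefill_blocks=10, decode_blocks=9, device_overlap_blocks=8),
]
costs = {w.name: routing_cost(w) for w in workers}
selected = min(costs, key=costs.get)  # lowest cost wins when temperature is 0
print(costs, selected)
```

With `overlap_score_credit = 1.0`, the costs come out to 18, 10, and 11 blocks, so `worker-2` is selected even though `worker-3` has the larger cache overlap, because its decode load is lower.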
When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.

Before scoring, the router filters candidates by request allow-lists, exact pins, DP-rank bounds, required taints, and busy-threshold overload state. For those hard eligibility rules, see [Router Filtering](router-filtering.md).

Example calculation with `overlap_score_credit = 1.0`:

- Worker 1: raw prefill 10 blocks, device overlap 2 blocks, decode 10 blocks => cost = 8 + 10 = 18
- **Worker 2: raw prefill 10 blocks, device overlap 5 blocks, decode 5 blocks => cost = 5 + 5 = 10** (selected - lowest cost)
- Worker 3: raw prefill 10 blocks, device overlap 8 blocks, decode 9 blocks => cost = 2 + 9 = 11

## Using the KV Cache Router

To enable KV cache-aware routing, start the frontend node like this:

```bash
python -m dynamo.frontend --router-mode kv
```

When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.

To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.

For detailed CLI arguments and advanced configuration options, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning).

## Basic Routing

Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.

First, create a client tied to a component endpoint. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.

```python
client = runtime.endpoint("dynamo.VllmWorker.generate").client()
```

You can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.

- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
- **Least-loaded routing**: Routes to the worker with fewest active connections via `--router-mode least-loaded`
- **Device-aware weighted routing**: Routes using CPU/non-CPU ratio budgeting plus least-loaded selection within the selected device group via `--router-mode device-aware-weighted`

KV cache routing uses direct routing with a special worker selection algorithm.

For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](/dynamo/additional-resources/kv-router-a-b-testing).

For custom routing logic and advanced patterns, see [Routing Patterns](/dynamo/components/router/router-examples#routing-patterns).

## Device-Aware Weighted Routing

`device-aware-weighted` is designed for heterogeneous fleets where CPU and non-CPU workers share the same endpoint. Instead of comparing raw in-flight counts, the router compares a capability-normalized load across the CPU and non-CPU groups, then selects the least-loaded worker within the winning group.

```text
normalized_load = total_inflight(group) / (instance_count(group) x throughput_weight)
```

The throughput weight is `1` for CPU workers and `DYN_ENCODER_CUDA_TO_CPU_RATIO` for non-CPU workers. This lets the router route proportionally to device capability instead of permanently starving slower devices. When only one device class is present, the behavior degenerates to standard least-loaded routing.

# Configuration and Tuning

This page collects the main router flags for frontend-embedded and standalone deployments. For the routing cost model and worker-selection behavior, see [Routing Concepts](/dynamo/components/router/routing-concepts).
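
To make the `device-aware-weighted` mode concrete before tuning it, here is a small sketch of the two-step selection described in Routing Concepts: compare the normalized load of the CPU and non-CPU groups, then pick the least-loaded worker in the winning group. The `Worker` class and the `normalized_load` and `pick_worker` functions are hypothetical names for this example, the ratio of `8.0` mirrors the documented default of `DYN_ENCODER_CUDA_TO_CPU_RATIO`, and the tie-breaking behavior (first group or worker wins) is an assumption rather than documented router behavior.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of a worker: device class and in-flight request count."""
    name: str
    is_cpu: bool
    inflight: int


def normalized_load(group: list, throughput_weight: float) -> float:
    # total_inflight(group) / (instance_count(group) x throughput_weight)
    return sum(w.inflight for w in group) / (len(group) * throughput_weight)


def pick_worker(workers: list, cuda_to_cpu_ratio: float = 8.0) -> Worker:
    cpu = [w for w in workers if w.is_cpu]
    non_cpu = [w for w in workers if not w.is_cpu]
    # Keep only the device groups that actually have workers.
    groups = [(g, weight) for g, weight in ((cpu, 1.0), (non_cpu, cuda_to_cpu_ratio)) if g]
    if not groups:
        raise ValueError("no workers available")
    # Step 1: choose the group with the lowest capability-normalized load.
    group, _ = min(groups, key=lambda gw: normalized_load(gw[0], gw[1]))
    # Step 2: least-loaded selection within the winning group.
    return min(group, key=lambda w: w.inflight)


fleet = [
    Worker("cpu-0", is_cpu=True, inflight=2),
    Worker("cpu-1", is_cpu=True, inflight=1),
    Worker("gpu-0", is_cpu=False, inflight=10),
    Worker("gpu-1", is_cpu=False, inflight=6),
]
chosen = pick_worker(fleet)
print(chosen.name)
```

In this fleet the CPU group has a normalized load of 3 / (2 x 1) = 1.5 and the non-CPU group has 16 / (2 x 8) = 1.0, so the non-CPU group wins and `gpu-1` is chosen despite having more in-flight requests than either CPU worker. With only one device class present, the function reduces to plain least-loaded selection.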
## Routing Behavior

- `--router-kv-overlap-score-credit`: Device-local prefix-overlap credit multiplier in the prefill cost calculation, from 0.0 to 1.0. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and skips creating a local indexer. Defaults to 1.
- `--router-prefill-load-scale`: Scale applied to adjusted prompt-side prefill load after device, lower-tier, and shared-cache credits are subtracted. Defaults to 1.
- `--load-aware`: Preset for load-aware KV routing without cache-reuse signals. On the frontend, it implies `--router-mode kv`. It sets `overlap_score_credit=0`, disables KV events, durable KV events, and KV reuse assumptions, enables active-block and prefill-token load tracking, disables remote/shared cache indexers, and preserves `--router-prefill-load-scale`.
- `--router-temperature`: Controls worker selection randomness through softmax sampling of normalized router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
- `--router-track-prefill-tokens`: Enables prompt-side load accounting in the worker cost model. This should stay enabled if you want queue thresholds, `active_prefill_tokens`, and AIC prefill load decay to reflect prompt work.
- `--router-prefill-load-model`: Selects the router's prompt-side load model. `none` keeps the existing static prompt load accounting. `aic` predicts one expected prefill duration per admitted request and lazily decays only the oldest active prefill request on each worker.
- `--router-queue-threshold`: Queue threshold fraction for prefill token capacity (default: 16.0). The router holds incoming requests in a priority queue while all workers exceed this fraction of `max_num_batched_tokens`, releasing them when capacity frees up. This defers dispatch rather than rejecting work, so routing decisions use the freshest load metrics at the moment a request is actually sent to a worker. It also enables priority scheduling via `priority` hints in `nvext.agent_hints`. Must be greater than 0. Set to `None` to disable queueing. See the SGLang note under [Tuning Guidelines](#tuning-guidelines) for caveats around how `max_num_batched_tokens` is populated on that backend.
- `--router-queue-policy`: Scheduling policy for the router queue (default: `fcfs`). For how queue backpressure differs from candidate filtering and busy-threshold overload handling, see [Router Filtering](router-filtering.md).
  - `fcfs` orders by adjusted arrival time (`priority_jump - arrival_offset`) and optimizes tail TTFT.
  - `lcfs` orders by adjusted reverse arrival time (`priority_jump + arrival_offset`) and mainly serves controlled comparison experiments.
  - `wspt` orders by `(1 + priority_jump) / isl_tokens` and optimizes average TTFT.

For `--router-mode device-aware-weighted`, set `DYN_ENCODER_CUDA_TO_CPU_RATIO` to the approximate throughput ratio of one non-CPU worker relative to one CPU worker. The default is `8`.

### AIC Prefill Load Model

Use `--router-prefill-load-model aic` when you want prompt-side load tracking to decay the oldest active prefill request using an AIC-predicted duration instead of keeping prompt load static until first token. For the cost-model behavior, see [Prefill Load Modeling](/dynamo/components/router/routing-concepts#prefill-load-modeling).
Enable it on the frontend like this:

```bash
python -m dynamo.frontend \
  --router-mode kv \
  --router-prefill-load-model aic \
  --aic-backend vllm \
  --aic-system h200_sxm \
  --aic-model-path nvidia/Llama-3.1-8B-Instruct-FP8
```

Required when `--router-prefill-load-model=aic` is enabled:

- `--router-mode kv` on the frontend
- `--router-track-prefill-tokens`
- `--aic-backend`
- `--aic-system`
- `--aic-model-path`

Optional AIC knobs:

- `--aic-backend-version`: pinned AIC database version; if omitted, Dynamo uses a backend-specific default
- `--aic-tp-size`: tensor-parallel size for the modeled backend; defaults to `1`
- `--aic-moe-tp-size`: MoE tensor-parallel size for models that require AIC MoE parallelism
- `--aic-moe-ep-size`: MoE expert-parallel size for models that require AIC MoE parallelism
- `--aic-attention-dp-size`: attention data-parallel size for models that require AIC MoE parallelism

For MoE models, these values must satisfy AIC's parallelism constraint: `aic_tp_size * aic_attention_dp_size == aic_moe_tp_size * aic_moe_ep_size`. For Kimi-style TP-only MoE runs, use `--aic-moe-tp-size` equal to `--aic-tp-size`, `--aic-moe-ep-size 1`, and `--aic-attention-dp-size 1`.

## KV Event Transport and Persistence

- `--no-router-kv-events`: Disables KV event tracking. By default, the router consumes KV events to monitor block creation and deletion from workers that publish them. When disabled, the router predicts cache state from routing decisions with TTL-based expiration.
- `--router-durable-kv-events`: **Deprecated.** Enables JetStream mode for KV event transport. The event-plane subscriber in local indexer mode is now the recommended path.
- `--router-reset-states`: Only applies in JetStream mode (`--router-durable-kv-events`). Resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting from a fresh state.
- `--router-snapshot-threshold`: Only applies in JetStream mode (`--router-durable-kv-events`). Sets the number of messages in JetStream before triggering a snapshot.

## Topology-Aware KV Transfer

Topology-aware KV transfer is configured on workers through runtime metadata, not with frontend router flags. In Kubernetes, use `spec.experimental.kvTransferPolicy` on the `DynamoGraphDeployment`; the operator injects the worker environment and topology files. Outside Kubernetes, set `DYN_TOPOLOGY_ENABLED`, `DYN_TOPOLOGY_MOUNT_PATH`, `DYN_KV_TRANSFER_DOMAIN`, `DYN_KV_TRANSFER_ENFORCEMENT`, and `DYN_KV_TRANSFER_PREFERRED_WEIGHT` on workers.

For the full runtime contract and routing behavior, see [Topology-Aware KV Transfer](/dynamo/components/router/topology-aware-kv-transfer). For Kubernetes deployment examples, see [Kubernetes Topology-Aware KV Transfer](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer).

## Block Tracking

- `--no-router-track-active-blocks`: Disables tracking of active blocks used for ongoing generation or decode phases. Disable this when routing to workers that only perform prefill.
- `--router-track-output-blocks`: **Experimental.** Enables tracking of output blocks during generation. When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward the expected output sequence length (`agent_hints.osl` in `nvext`). For the cost-model behavior, see [Decode Load Modeling](/dynamo/components/router/routing-concepts#decode-load-modeling).
- `--no-router-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. This is useful in disaggregated setups where transferred blocks are not actually deduplicated on the decode side.
- `--no-router-track-prefill-tokens`: Disables prompt-side prefill token accounting in the router's active load model. Use this for decode-only routing paths where prompt processing already happened elsewhere.
- `--router-replica-sync`: Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas.

## KV Indexer / Approx KV Indexer

- `--router-ttl-secs`: Time-to-live in seconds for blocks in the router's local cache predictions. Defaults to 120.0 seconds when `--no-router-kv-events` is used.
- `--router-event-threads`: Number of KV indexer worker threads (default: 4). Values greater than 1 use the concurrent radix tree for event-driven routing, approximate routing with `--no-router-kv-events`, and the predict-on-route side indexer.
- `--router-predicted-ttl-secs`: Enables predict-on-route with this TTL in seconds for entries in a local side indexer. Requires KV events; omit to disable. When enabled, the router feeds each routing decision into the side indexer and scores each worker with the larger overlap from the primary indexer and the local side indexer. Independent of `--router-ttl-secs`; kept short so decisions the engine never confirms (cancelled requests, prefill failures) age out quickly.

### When to use `--router-predicted-ttl-secs`

Without this setting, an event-driven router depends entirely on engine KV events to learn which worker now holds which prefix. That works for steady-state traffic, but creates a race when many sibling requests arrive in a single batch (for example, 16 problems × 4 samples each with a shared system prompt, or any parallel-sampling / best-of-N workload). No engine has emitted a "block stored" event yet, so the router scores every sibling with zero overlap and round-robins them across workers. The prefix then gets prefilled on every worker instead of being reused.

Setting `--router-predicted-ttl-secs 5` makes the router record each routing decision into a secondary, short-TTL approximate indexer. When the next sibling is scored, the router queries both indexers and takes the per-worker max overlap, so siblings see the first sibling's prefix immediately and pin to the same worker.
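The "per-worker max overlap" rule can be pictured with a small, self-contained sketch. The function name and the dictionary-based indexer results are hypothetical stand-ins; the real indexers are radix trees inside the router, and this only illustrates how the two overlap views are combined.

```python
from typing import Dict, Iterable


def combined_overlap(
    worker_ids: Iterable[int],
    primary_overlap: Dict[int, int],
    side_overlap: Dict[int, int],
) -> Dict[int, int]:
    """Score each worker with the larger overlap reported by either indexer.

    primary_overlap: matched blocks per worker from the event-driven indexer.
    side_overlap: matched blocks per worker from the short-TTL side indexer
    that records routing decisions.
    """
    return {
        worker: max(primary_overlap.get(worker, 0), side_overlap.get(worker, 0))
        for worker in worker_ids
    }


workers = [0, 1, 2]

# Burst case: no "block stored" event has arrived yet, but the first sibling
# was just routed to worker 1, so the side indexer already knows the prefix.
burst = combined_overlap(workers, primary_overlap={}, side_overlap={1: 4})

# Steady state: engine events have populated the primary indexer, and its
# larger overlap wins over the side indexer's prediction.
steady = combined_overlap(workers, primary_overlap={1: 6, 2: 1}, side_overlap={1: 4})

print(burst, steady)
```

Because the combination is a plain maximum, an expired side-indexer entry simply stops contributing, and the primary view is never modified.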
The primary event-driven indexer is untouched: engines compute their sequence hashes with salts and cryptographic digests the router cannot reproduce, so inserting router-computed hashes into the primary would key the same physical block under two different hashes and pollute the tree. Running the two trees in parallel sidesteps that entirely; the side tree has a short TTL and its entries simply expire once the primary takes over.

Do not combine this setting with `--no-router-kv-events`, including when the approximate primary is remote: approximate mode already inserts on routing decisions by construction, and running a second approximate side indexer is redundant.

With `--use-remote-indexer` and KV events enabled, the side indexer remains local to the consumer router while the remote indexer remains the shared primary view. If a router also serves an indexer for other routers, the side indexer is still local only; it is never served or consumed as the remote primary.

To implement KV event publishing for custom inference engines, see [KV Event Publishing for Custom Engines](/dynamo/integrations/kv-cache-integrations/kv-events-for-custom-engines).

For details on per-request agent hints (`priority`, `osl`, `speculative_prefill`), see [NVIDIA Request Extensions (`nvext`)](/dynamo/additional-resources/nvidia-request-extensions-nvext#agent-hints).

### Session Control and Sticky Routing

When a request carries `nvext.session_control`, the KV router can activate two session-related components:

- **StickySessionRouter**: Maintains an in-memory `session_id -> (worker_id, dp_rank)` affinity map with sliding-window TTL. `action: "bind"` creates router-only affinity without backend engine RPCs. Subsequent requests with the same `session_id` are routed to the pinned worker/rank, bypassing KV overlap scoring.
- **AgentController**: Sends session lifecycle RPCs (`open_session`, `close_session`) to the worker's `session_control` endpoint when `action` is `"open"` or `"close"`. The event-plane client is lazily initialized on the first lifecycle request.

These activate automatically with `--router-mode kv`; no additional flags are needed. Requests without `session_control` are unaffected and follow the standard KV-aware routing path.

Router-only sticky routing only requires `action: "bind"`; engine-backed session lifecycle currently requires the SGLang backend with `--enable-streaming-session`. See [SGLang for Agentic Workloads -- Session Control](/dynamo/backends/sg-lang/agentic-workloads#session-control-for-subagent-kv-isolation-experimental) for details.

## Tuning Guidelines

`--router-kv-overlap-score-credit` is the primary knob for cache reuse. It credits device-local prefix overlap against the prefill load and must be between 0.0 and 1.0. Higher values steer requests toward workers with better cache overlap and reduce TTFT. Lower values distribute load more evenly and reduce ITL. The default of 1.0 is a reasonable starting point. This credit can also be overridden per request via `nvext.agent_hints.kv_overlap_score_credit`.

Use `--load-aware` when you want the KV scheduler's active load model without prefix/cache reuse. This is equivalent to using KV mode with overlap credit set to 0, KV events disabled, KV reuse assumptions disabled, active load tracking enabled, and shared-cache routing disabled. `--router-prefill-load-scale` remains available to tune prompt-side load relative to decode blocks.

Deprecated: `--router-kv-overlap-score-weight`, `--kv-overlap-score-weight`, `DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT`, and `DYN_OVERLAP_SCORE_WEIGHT` are still accepted, but emit deprecation warnings. Nonzero legacy values map to `prefill_load_scale` to preserve existing behavior without changing overlap credit. A legacy value of 0 maps to both `prefill_load_scale=0` and `overlap_score_credit=0`, which preserves the old no-overlap/no-indexer behavior.
If a deprecated overlap score weight is still present, it takes precedence over the newer prefill load scale field; a legacy value of 0 also takes precedence over the newer overlap credit field. When migrating to `--router-prefill-load-scale` or `DYN_ROUTER_PREFILL_LOAD_SCALE`, remove the deprecated flag, env var, or JSON field from the deployment config. Use `--router-kv-overlap-score-credit` or `DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT` only when you mean to tune the cache-overlap credit itself.

If an older config used overlap score weight above 1.0 to make the router care more about TTFT, keep the overlap credit at or below 1.0 and move that larger value to `--router-prefill-load-scale` instead. `prefill_load_scale` multiplies the overlap-adjusted prompt-side load, so it still implicitly accounts for device, host, disk, and shared-cache credits.

Use `--router-prefill-load-scale` when prompt-side load should count more or less than decode-side block load after cache-hit credits are applied. The final score is `prefill_load_scale * adjusted_prefill_blocks + decode_blocks`.

Use `--no-router-kv-events` when you are not confident that your backend engine emits KV events correctly. In this mode the router falls back to approximate routing, predicting cache state from its own routing decisions with TTL-based expiration.

Use `--router-predicted-ttl-secs 5` when the workload fires bursts of sibling requests with shared prefixes, such as parallel sampling, best-of-N, or agent fan-out. It closes the window between the routing decision and the engine's first "block stored" event so siblings co-locate on the worker the first sibling picked. See the configuration section above for the side-indexer mechanics.

Use `--no-router-assume-kv-reuse` in disaggregated setups where the decode worker does not reuse transferred KV cache blocks. Without this flag, the router undercounts decode blocks when duplicates exist, leading to inaccurate load estimates.
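To make the interaction between the overlap credit and the prefill load scale concrete, the scoring formula quoted above can be sketched as a small function. This is an illustration only: `worker_cost` is a hypothetical helper, and it assumes the adjusted prefill load is the raw prefill blocks minus the credited device overlap, ignoring lower-tier and shared-cache credits. The worker numbers mirror the three-worker example calculation given earlier for `overlap_score_credit = 1.0`.

```python
def worker_cost(
    raw_prefill_blocks: float,
    device_overlap_blocks: float,
    decode_blocks: float,
    overlap_score_credit: float = 1.0,
    prefill_load_scale: float = 1.0,
) -> float:
    """prefill_load_scale * adjusted_prefill_blocks + decode_blocks."""
    if not 0.0 <= overlap_score_credit <= 1.0:
        raise ValueError("overlap_score_credit must be between 0.0 and 1.0")
    # Assumption: only the device-local overlap credit is subtracted here.
    adjusted_prefill_blocks = raw_prefill_blocks - overlap_score_credit * device_overlap_blocks
    return prefill_load_scale * adjusted_prefill_blocks + decode_blocks


# (raw prefill, device overlap, decode) per worker
workers = {1: (10, 2, 10), 2: (10, 5, 5), 3: (10, 8, 9)}

default_costs = {w: worker_cost(*blocks) for w, blocks in workers.items()}
scaled_costs = {w: worker_cost(*blocks, prefill_load_scale=2.0) for w, blocks in workers.items()}

best_default = min(default_costs, key=default_costs.get)
best_scaled = min(scaled_costs, key=scaled_costs.get)
print(default_costs, best_default)
print(scaled_costs, best_scaled)
```

With the defaults the costs are 18, 10, and 11, so worker 2 is selected. Doubling `prefill_load_scale` gives 26, 15, and 13, which shifts the choice to worker 3, the worker with the most overlap, because prompt-side load now outweighs its higher decode load.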
Use `--no-router-track-prefill-tokens` when a router is serving decode-only traffic and prompt processing has already completed elsewhere. This keeps decode routing decisions focused on decode-side load instead of briefly charging prompt tokens to the decode worker after handoff.

Use `--router-track-output-blocks` when your workload is output-heavy and you want the router to account for output-side KV cache growth in load balancing. If you also pass `nvext.agent_hints.osl` per request, the router applies fractional decay to output blocks so that requests nearing completion contribute less future load. See [Decode Load Modeling](/dynamo/components/router/routing-concepts#decode-load-modeling) for the cost-model details.

`--router-queue-threshold` controls when incoming requests are held in a priority queue. The router waits while all workers exceed the configured fraction of `max_num_batched_tokens`, then releases work as capacity frees up. Set it to `None` to disable queueing entirely. This threshold delays dispatch. It does not remove workers from the candidate set; for that distinction, see [Router Filtering](router-filtering.md).

Use `DYN_ROUTER_OVERLAP_REFRESH_AFTER_SECS` when queued requests may wait long enough for worker cache state to materially change before dispatch. The default is `10` seconds; set it to `0` to disable dequeue-time overlap refresh.

**Note for the SGLang backend.** Since [#8220](https://github.com/ai-dynamo/dynamo/pull/8220), the value the SGLang worker publishes for `max_num_batched_tokens` in its Model Deployment Card depends on the server args:

- If `--max-prefill-tokens` is set, MDC's `max_num_batched_tokens` equals that value (the per-step prefill window, which is the value most users expect).
- If `--max-prefill-tokens` is not set, MDC's `max_num_batched_tokens` falls back to `max_total_num_tokens` from SGLang's `scheduler_info`, which is the **total KV cache pool** in tokens. On large GPUs with high `mem-fraction-static` the pool can be hundreds of thousands of tokens, much larger than `chunked-prefill-size`.

The threshold is applied as `active_tokens > threshold * max_num_batched_tokens`, so this fallback inflates the effective denominator and a threshold like `1.0` may effectively never queue. To get the originally intended "fraction of the per-step prefill window" semantics on SGLang, either set `--max-prefill-tokens` explicitly on the SGLang backend so the MDC value matches the prefill window, or use a much smaller `--router-queue-threshold` (for example `0.1`) to compensate for the inflated denominator.

Use `--router-prefill-load-model aic` when you want prompt-side load tracking to decay the oldest active prefill request using an AIC-predicted duration instead of keeping prompt load static until first token. This requires `--router-track-prefill-tokens` and the shared `--aic-*` config; see [AIC Prefill Load Model](#aic-prefill-load-model) for the full flag set and [Prefill Load Modeling](/dynamo/components/router/routing-concepts#prefill-load-modeling) for the cost-model details.

Use `--router-queue-policy wspt` when your workload has a mix of short and long requests and you want to minimize average TTFT. Use the default `fcfs` when you want to minimize tail TTFT.

## Prometheus Metrics

The router exposes Prometheus metrics on the frontend's HTTP port (default 8000) at `/metrics`:

- **Router request metrics** (`dynamo_component_router_*`): Registered via the component's metrics hierarchy and exposed on the frontend via the `drt_metrics` bridge. In KV mode they are populated per request; in non-KV modes they are registered with zero values. The standalone router also registers these metrics, available on `DYN_SYSTEM_PORT` when set.
- **Routing overhead metrics** (`dynamo_router_overhead_*`) and **per-worker gauges** (`dynamo_frontend_worker_*`): Registered on the frontend's own Prometheus registry. These are frontend-only and not available on the standalone router.

For the full list of router metrics, see the [Metrics reference](/dynamo/user-guides/observability-local/metrics#router-metrics).

# Disaggregated Serving

Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill`, the frontend automatically detects them and activates an internal prefill router.

For the high-level deployment matrix, see [Router Guide](/dynamo/user-guides/kv-cache-aware-routing). For the router flags used in this setup, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning).

If prefill and decode workers span topology domains such as zones or racks, use [Topology-Aware KV Transfer](/dynamo/components/router/topology-aware-kv-transfer) to constrain or bias decode routing toward workers in the selected prefill worker's transfer domain.

## Automatic Prefill Router Activation

The prefill router is automatically created when:

1. A decode model is registered, for example via `register_model()` with `ModelType.Chat | ModelType.Completions`.
2. A prefill worker is detected with the same model name and `ModelType.Prefill`.

Key characteristics of the prefill router:

- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers do not perform decode.
- **Seamlessly integrates** into the request pipeline between preprocessing and decode routing.
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available.

Key characteristics of the decode routing stage in disaggregated mode:

- **Disables overlap scoring** (`overlap_score_credit=0`) because decode routing should not chase prefix reuse.
- **Disables KV reuse assumption** (`assume_kv_reuse=false`) unless the backend can truly deduplicate transferred blocks.
- **Disables prefill-token tracking** (`track_prefill_tokens=false`) so decode-side load reflects decode work rather than already-completed prompt work.

## Setup Example

When both workers are registered, requests are automatically routed.

```python
# Decode worker registration (in your decode worker)
decode_endpoint = runtime.endpoint("dynamo.decode.generate")

await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Chat | ModelType.Completions,
    endpoint=decode_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)

await decode_endpoint.serve_endpoint(decode_handler.generate)

# Prefill worker registration (in your prefill worker)
prefill_endpoint = runtime.endpoint("dynamo.prefill.generate")

await register_model(
    model_input=ModelInput.Tokens,
    model_type=ModelType.Prefill,
    endpoint=prefill_endpoint,
    model_name="meta-llama/Llama-2-7b-hf",
    # ... other parameters
)

await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```

The automatic disaggregated routing setup described here is currently supported by the integrated `dynamo.frontend` path. It is not provided as a single turnkey mode by the standalone Python router (`python -m dynamo.router`). If you build this topology with standalone routers, you must launch and connect the prefill and decode routing stages yourself and handle request handoff, including the `disaggregated_params` returned by prefill. For an advanced reference, see the [Global Router](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/components/src/dynamo/global_router), which composes local prefill and decode router pools explicitly.
## Request Flow

The following diagram shows an overview of the major components in disaggregated serving:

```mermaid
graph TD
    HTTP[HTTP]
    ROUTER[Router]
    PREFILL[Prefill Worker]
    DECODE[Decode Worker]

    classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333;
    classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;

    class PREFILL,DECODE worker_style
    class ROUTER router_style

    HTTP <--> |"request/response"| ROUTER
    ROUTER --> |"1. send to prefill"| PREFILL
    PREFILL --> |"2. return NIXL metadata"| ROUTER
    ROUTER --> |"3. send with metadata"| DECODE
    DECODE --> |"4. stream response"| ROUTER
    PREFILL -.-> |"publish kv events"| ROUTER

    linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px
    linkStyle 5 stroke:#2196f3,stroke-width:2px
```

When topology-aware KV transfer is enabled, the prefill router also derives decode `RoutingConstraints` from the selected prefill worker's runtime topology metadata before the request enters the decode router.

# Topology-Aware KV Transfer

Topology-aware KV transfer constrains or biases decode worker selection after a prefill worker has been selected. The router derives standard `RoutingConstraints` from the selected prefill worker's published topology metadata, then merges those constraints into the decode request.

Use the Kubernetes operator path when possible. For deployment examples, see [Kubernetes Topology-Aware KV Transfer](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer).

## Runtime Contract

Workers publish topology and policy fields through `ModelRuntimeConfig`:

| Field | Meaning |
|-------|---------|
| `topology_domains` | Map of logical domain name to this worker's topology value, for example `{"zone": "us-east-1a"}`. |
| `kv_transfer_domain` | Domain key used for prefill-to-decode KV transfer routing, for example `zone`. |
| `kv_transfer_enforcement` | `required` or `preferred`. |
| `kv_transfer_preferred_weight` | Preferred-taint weight used only when enforcement is `preferred`. |

Each topology entry also becomes a canonical worker taint:

```text
dynamo.topology/<domain>=<value>
```

For example:

```json
{
  "topology_domains": {
    "zone": "us-east-1a",
    "rack": "rack-22"
  },
  "kv_transfer_domain": "zone",
  "kv_transfer_enforcement": "preferred",
  "kv_transfer_preferred_weight": 0.85
}
```

This creates worker taints:

```text
dynamo.topology/zone=us-east-1a
dynamo.topology/rack=rack-22
```

The KV-transfer policy uses only `kv_transfer_domain` to derive the decode constraint. Other topology domains remain available as ordinary routing taints.

## Request Flow

```mermaid
sequenceDiagram
    participant F as Frontend
    participant PR as Prefill Router
    participant P as Prefill Worker
    participant DR as Decode Router
    participant D as Decode Worker

    F->>PR: request
    PR->>P: select prefill worker
    PR->>PR: read selected worker topology metadata
    PR->>PR: build RoutingConstraints
    PR->>DR: decode request + topology constraints
    DR->>D: select compatible or preferred decode worker
```

The prefill router builds the decode constraint before dispatching prefill when the selected worker is already known. This keeps `required` policy fail-closed: if the router cannot derive authoritative decode constraints for a required policy, it fails the request instead of dispatching prefill and then discovering that decode cannot be routed safely.

## Enforcement Modes

### Required

`required` turns the selected prefill worker's transfer-domain topology into a required taint.

```text
required_taints = {"dynamo.topology/zone=us-east-1a"}
```

Decode workers without that taint are ineligible. If no eligible decode worker exists, routing returns no endpoint for that request.

### Preferred

`preferred` turns the same topology into a preferred taint.

```text
preferred_taints = {"dynamo.topology/zone=us-east-1a": 0.85}
```

All decode workers remain eligible, but matching workers receive a lower routing cost.
`preferredWeight` controls the strength of the preference from `0` to `1`.

## Worker Environment Contract

The Python backend utility reads topology from files and transfer policy from environment variables:

| Environment variable | Description |
|----------------------|-------------|
| `DYN_TOPOLOGY_ENABLED` | Set to `true` to enable topology reading. |
| `DYN_TOPOLOGY_MOUNT_PATH` | Directory containing topology files. Defaults to `/etc/dynamo/topology`. |
| `DYN_KV_TRANSFER_DOMAIN` | Required when topology is enabled. Names the topology file and runtime domain to use for KV transfer constraints. |
| `DYN_KV_TRANSFER_ENFORCEMENT` | `required` or `preferred`. Defaults to `required` when a domain is set. |
| `DYN_KV_TRANSFER_PREFERRED_WEIGHT` | Weight used when enforcement is `preferred`. |

Each non-hidden, non-empty file under `DYN_TOPOLOGY_MOUNT_PATH` is interpreted as one topology domain. The file name is the domain; the file content is the worker's value for that domain.

For example:

```bash
mkdir -p /tmp/dynamo-topology
printf 'us-east-1a\n' > /tmp/dynamo-topology/zone

export DYN_TOPOLOGY_ENABLED=true
export DYN_TOPOLOGY_MOUNT_PATH=/tmp/dynamo-topology
export DYN_KV_TRANSFER_DOMAIN=zone
export DYN_KV_TRANSFER_ENFORCEMENT=required
```

When topology is enabled, the worker polls until the selected transfer-domain file exists and has content. If it remains missing or empty through the timeout window, the worker exits so the bad topology source is visible during startup.

## Backend Support

The integrated Python backends apply the topology config during worker registration:

- vLLM
- SGLang
- TensorRT-LLM

The topology utility writes the fields onto `ModelRuntimeConfig`; Rust owns validation and canonical topology-taint generation.

## Interactions with Existing Routing Constraints

Topology-aware KV transfer uses the existing `RoutingConstraints` path. It does not add a topology-specific selector.
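As a minimal sketch of how the published topology fields turn into decode routing constraints, the following Python mirrors the taint format and the two enforcement modes described in this document. The function names and the plain-dictionary constraint shape are hypothetical; in Dynamo the validation and canonical taint generation live in Rust, so treat this as an illustration of the rules rather than the actual implementation.

```python
from typing import Dict, List, Set

TAINT_PREFIX = "dynamo.topology/"  # canonical taint prefix shown in this document


def topology_taints(topology_domains: Dict[str, str]) -> Set[str]:
    """Every topology entry becomes one worker taint: prefix + domain=value."""
    return {f"{TAINT_PREFIX}{domain}={value}" for domain, value in topology_domains.items()}


def decode_constraints(prefill_config: dict) -> dict:
    """Derive decode constraints from the selected prefill worker's config."""
    domain = prefill_config["kv_transfer_domain"]
    # A missing domain raises KeyError, modeling the fail-closed behavior:
    # no authoritative constraint means the request is not dispatched.
    value = prefill_config["topology_domains"][domain]
    taint = f"{TAINT_PREFIX}{domain}={value}"
    if prefill_config["kv_transfer_enforcement"] == "required":
        return {"required_taints": {taint}, "preferred_taints": {}}
    weight = prefill_config["kv_transfer_preferred_weight"]
    return {"required_taints": set(), "preferred_taints": {taint: weight}}


def eligible_decode_workers(decode_worker_taints: Dict[str, Set[str]], constraints: dict) -> List[str]:
    """A decode worker is eligible only if it carries every required taint."""
    required = constraints["required_taints"]
    return [worker for worker, taints in decode_worker_taints.items() if required <= taints]


prefill_config = {
    "topology_domains": {"zone": "us-east-1a", "rack": "rack-22"},
    "kv_transfer_domain": "zone",
    "kv_transfer_enforcement": "preferred",
    "kv_transfer_preferred_weight": 0.85,
}
decode_worker_taints = {
    "decode-a": topology_taints({"zone": "us-east-1a"}),
    "decode-b": topology_taints({"zone": "us-east-1b"}),
}

preferred = decode_constraints(prefill_config)
required = decode_constraints({**prefill_config, "kv_transfer_enforcement": "required"})
print(eligible_decode_workers(decode_worker_taints, preferred))
print(eligible_decode_workers(decode_worker_taints, required))
```

Under `preferred`, both decode workers stay eligible and the weight only influences cost; under `required`, only the worker in the prefill worker's zone remains a candidate.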
If a request already has routing constraints, the prefill router merges the generated topology constraints into the decode request:

- Required topology taints are appended to existing `required_taints`.
- Preferred topology taints are appended to existing `preferred_taints`.

User-provided constraints still apply. A decode worker must satisfy all required constraints to be eligible.

## Operational Notes

- Configure this only for disaggregated prefill/decode deployments. Aggregated workers do not perform a remote prefill-to-decode KV transfer.
- Keep `DYN_ROUTER_MODE=kv` on the frontend so the prefill and decode routing paths use the KV router.
- Make sure every prefill domain has enough decode capacity when using `required`; otherwise the router can legitimately fail requests in domains without decode workers.
- Use `preferred` during incremental rollouts when same-domain transfer is a latency preference rather than a hard placement requirement.
- Transport health is separate from topology selection. Topology-aware routing chooses a better peer, but RDMA, EFA, UCX, or libfabric still need to be configured correctly for NIXL KV transfer.

## Troubleshooting Signals

| Symptom | Likely cause | Check |
|---------|--------------|-------|
| Worker exits during startup | `DYN_KV_TRANSFER_DOMAIN` missing, or topology file never populated. | Worker logs and contents of `DYN_TOPOLOGY_MOUNT_PATH`. |
| Required policy returns no endpoint | No decode worker has the selected prefill worker's generated topology taint. | Worker `ModelRuntimeConfig` topology metadata and decode worker placement. |
| Preferred policy still routes cross-domain | Matching domain is overloaded or unavailable, or weight is too low relative to load. | Increase `preferredWeight`, add same-domain decode capacity, or switch to `required`. |
| Router sees no topology metadata | Worker did not publish topology fields. | Backend startup logs and runtime config metrics/discovery data. |

For Kubernetes-specific verification commands, see [Verify the Deployment](/dynamo/kubernetes-deployment/operate/topology-aware-kv-transfer#verify-the-deployment).

# Router Operations

This page covers day-2 operational topics for router deployments. For flags and tuning guidance, see [Configuration and Tuning](/dynamo/components/router/configuration-and-tuning).

## Serving Multiple Router Replicas

For improved fault tolerance, you can launch multiple frontend-plus-router replicas. If multiple `dynamo.frontend` processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port.

Alternatively, you can deploy the router separately as the standalone `python -m dynamo.router` service.

## Router State Management

The KV router maintains two independent state families with different synchronization, persistence, and recovery behavior:

1. **Prefix cache state**: The global view of cached KV prefix blocks on workers. This state drives cache-overlap scoring.
2. **Active block state**: The router's view of KV blocks currently assigned to in-flight requests. This state drives active-load balancing.

For the architecture behind these states, see [Router Design](/dynamo/design-docs/component-design/router-design).

### Prefix Cache State

Prefix cache state is maintained by the KV indexer in each router or frontend. In event-driven mode, workers publish KV `Stored` and `Removed` events, and each router replica consumes those events to update its radix tree. Because KV events are distributed through the event plane, multiple router replicas naturally receive the same prefix-cache updates; they do not need router-to-router synchronization for prefix blocks.

When `--no-router-kv-events` is used, the router does not consume worker KV events. It instead predicts cache state from its own routing decisions and expires predicted blocks with `--router-ttl-secs`.
This approximate mode is useful for development or for backends whose KV events are not yet reliable, but it is not the recommended production path. #### Prefix Cache Persistence and Recovery Prefix cache recovery matters because stale or missing prefix state directly affects cache-hit routing decisions. Dynamo supports two recovery strategies. ##### NATS Core / Event Plane with Local Indexer Mode - Prefix state persists on workers. Events are fire-and-forget, but workers retain their local indexer state. - On startup, each router queries each worker's local indexer to rebuild prefix state. - Recovery depends on workers being available. If a worker is down, its blocks cannot be recovered until the worker returns. - This mode keeps the infrastructure simpler because JetStream is not required. For more on gap detection and replay, see [KV Event Replay — Dynamo vs vLLM](/dynamo/components/router/kv-event-replay-dynamo-vs-v-llm). ##### JetStream Mode JetStream mode requires `--router-durable-kv-events` on both frontend and workers. - Prefix blocks are stored in NATS JetStream with 1-hour retention. - Snapshots are saved to NATS object store at configurable thresholds. - New replicas automatically restore this state on startup. - You can launch a third router replica even if the first two are down, and it will recover the full prefix state. ```bash python -m dynamo.frontend \ --router-mode kv \ --http-port 8002 \ --router-durable-kv-events ``` If you need to start with a fresh state in JetStream mode, you have two options: 1. Use a different namespace or component, which creates a new stream and NATS object store path. 2. Launch a router with `--router-reset-states`, which purges the entire stream and radix snapshot. Only do this when launching the first router replica in a component, because it can bring existing replicas into an inconsistent state. ### Active Block State Active block state tracks in-flight request load. 
It is derived from the request lifecycle: the router records a request when it is assigned to a worker, updates prefill completion and optional output-block growth as responses arrive, and frees the request when it finishes.

This state is deliberately ephemeral. If a router replica restarts, it starts with no active-block knowledge. That is usually acceptable for fault tolerance because active requests are short lived relative to prefix cache state: old active blocks leave the system as requests complete, and the router's view becomes accurate again as it handles new requests.

The operational concern is replica synchronization. Active blocks are tracked locally by the router that routed a request, so multiple frontend or router replicas do not automatically share the same active-load view.

#### Active Block Replica Synchronization

There are two operating modes for active blocks:

- **Local-only tracking**: Leave replicas unsynchronized. Each router balances using the subset of active requests it routed itself. This is simpler and may be acceptable when traffic is already well distributed across replicas or when active-load precision is less important.
- **Replica sync**: Enable `--router-replica-sync` so replicas publish and subscribe to active-sequence lifecycle events through NATS core messaging. This gives each replica a more complete active-load view across the router fleet.

```bash
# Router replica 1
python -m dynamo.frontend --router-mode kv --http-port 8000 --router-replica-sync

# Router replica 2
python -m dynamo.frontend --router-mode kv --http-port 8001 --router-replica-sync
```

With replica sync enabled, a new router still starts with zero active-block knowledge, but it converges through live request handling and active-sequence events from other replicas. Without it, each replica keeps an isolated active-block view, which can lead to suboptimal load balancing.
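To make the local-only behavior concrete, here is a minimal sketch of a per-replica active-block tracker. It is not Dynamo's implementation: `ActiveBlockTracker`, `ActiveRequest`, and their fields are hypothetical names, and output-block growth is left out for brevity. It only illustrates why two unsynchronized replicas each see part of a worker's load, and why a restarted replica begins empty.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveRequest:
    worker_id: int
    prefill_blocks: int      # blocks still waiting on prefill
    decode_blocks: int = 0   # blocks held after prefill completes


@dataclass
class ActiveBlockTracker:
    """Toy per-replica view of in-flight load (hypothetical, not Dynamo code)."""
    requests: dict = field(default_factory=dict)

    def add(self, request_id, worker_id, num_blocks):
        # Recorded when this replica assigns the request to a worker.
        self.requests[request_id] = ActiveRequest(worker_id, prefill_blocks=num_blocks)

    def mark_prefill_complete(self, request_id):
        # Prefill work is done; the blocks now count as decode load.
        req = self.requests[request_id]
        req.decode_blocks += req.prefill_blocks
        req.prefill_blocks = 0

    def free(self, request_id):
        # The request finished, so its blocks leave the active set.
        self.requests.pop(request_id, None)

    def load(self, worker_id):
        return sum(
            r.prefill_blocks + r.decode_blocks
            for r in self.requests.values()
            if r.worker_id == worker_id
        )


# Two unsynchronized replicas route to the same worker.
replica_a, replica_b = ActiveBlockTracker(), ActiveBlockTracker()
replica_a.add("req-1", worker_id=1, num_blocks=4)
replica_b.add("req-2", worker_id=1, num_blocks=6)
print(replica_a.load(1), replica_b.load(1))  # each replica sees only its own share
```

In this toy model the worker actually holds 10 active blocks, but neither replica observes more than its own contribution. Replica sync, as described above, narrows that gap by sharing lifecycle events.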
## Dynamo-Native Remote Indexer

For Dynamo-native deployments, the remote indexer is served by `dynamo.frontend` or `dynamo.router`, not by `dynamo.indexer`.

- Use `--serve-indexer` on router or frontend replicas that should expose `kv_indexer_query` from the worker component.
- Use `--use-remote-indexer` on consumer routers or frontends that should query that served endpoint instead of maintaining a local overlap indexer.
- `dynamo.indexer` remains the standalone HTTP plus ZMQ microservice for non-Dynamo or direct-ZMQ deployments.

Frontend example:

```bash
# Serving anchors
python -m dynamo.frontend --router-mode kv --serve-indexer

# Consumer frontend
python -m dynamo.frontend --router-mode kv --use-remote-indexer
```

The served service is request-plane only. Each serving router or frontend keeps its normal local KV event ingestion, gap detection, and worker-query recovery path; remote consumers only issue hash-based overlap queries.

Approximate mode (`--no-router-kv-events`) is singleton-only for remote serving: only one `--serve-indexer` replica may exist for a given worker component. Event-driven mode allows multiple serving replicas behind the same worker component.

```mermaid
graph TD
    subgraph "Workers"
        W1["Worker 1"]
        W2["Worker 2"]
    end

    subgraph "Event Plane"
        EP["KV Events"]
    end

    subgraph "Serving Routers / Frontends"
        S1["Router / Frontend A<br/>--serve-indexer"]
        S2["Router / Frontend B<br/>--serve-indexer"]
        I1["Local Indexer"]
        I2["Local Indexer"]
    end

    subgraph "Request Plane"
        RP["backend.kv_indexer_query"]
    end

    C["Consumer Router / Frontend<br/>--use-remote-indexer"]

    W1 --> EP
    W2 --> EP
    EP --> S1
    EP --> S2
    S1 --> I1
    S2 --> I2
    C --> RP
    RP --> S1
    RP --> S2
```

## Additional Notes

Request-plane transport is independent of KV event transport. The request plane (`DYN_REQUEST_PLANE` or `--request-plane`) controls how requests reach workers. KV events use NATS in JetStream or NATS Core modes, or ZMQ when `--event-plane zmq` is set. With `--event-plane zmq` and `--discovery-backend file` or `mem`, the router can run without etcd or NATS. When using a NATS-based event plane, NATS is initialized automatically; set `NATS_SERVER=nats://...` to override the default `localhost:4222`.

When `--router-kv-overlap-score-credit` is set to 0, no KV indexer is created and prefix matching is disabled. When `--no-router-kv-events` is set, a KV indexer is still created but no event subscriber is launched; the router predicts cache state from its own routing decisions with TTL-based expiration.

Backend KV event publishing is independent of the frontend's `--no-router-kv-events` flag. The frontend flag controls whether the router consumes events; backend flags control whether workers publish them. If the router is not consuming events, workers that still publish will waste resources but cause no harm.

- **vLLM**: Pass `--kv-events-config '{"enable_kv_cache_events": false}'` to disable, or `'{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557"}'` to enable.
- **SGLang**: Pass `--kv-events-config` with a JSON config to enable, or omit it to keep publishing disabled.
- **TRT-LLM**: Pass `--publish-events-and-metrics` to enable, or omit it to keep publishing disabled.

The CLI arg `--router-ttl-secs` controls local cache prediction lifetime when the router operates without receiving events from workers. When workers are configured to publish KV events, the router relies on worker-side eviction events and this parameter is ignored.
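As an illustration of TTL-based prediction, the sketch below models how a router without worker events might remember which blocks it sent where and forget them after a fixed lifetime. It is a simplified assumption, not the router's actual indexer: `PredictedPrefixCache`, `record_routing`, and `predicted_overlap` are hypothetical names, and `ttl_secs` merely stands in for the value passed to `--router-ttl-secs`.

```python
import time


class PredictedPrefixCache:
    """Toy TTL-based guess at which blocks a worker holds (hypothetical)."""

    def __init__(self, ttl_secs, clock=time.monotonic):
        self.ttl_secs = ttl_secs
        self.clock = clock
        self._expiry = {}  # (worker_id, block_hash) -> expiry timestamp

    def record_routing(self, worker_id, block_hashes):
        # Routing a request implies the worker will soon cache these blocks.
        deadline = self.clock() + self.ttl_secs
        for block_hash in block_hashes:
            self._expiry[(worker_id, block_hash)] = deadline

    def predicted_overlap(self, worker_id, block_hashes):
        # Count leading blocks that are still predicted to be cached.
        now = self.clock()
        matched = 0
        for block_hash in block_hashes:
            if self._expiry.get((worker_id, block_hash), 0.0) <= now:
                break  # unknown or expired: the prefix match stops here
            matched += 1
        return matched


# Drive the sketch with a fake clock so expiry is easy to observe.
now = [0.0]
cache = PredictedPrefixCache(ttl_secs=120, clock=lambda: now[0])
cache.record_routing(worker_id=1, block_hashes=[11, 22, 33])
print(cache.predicted_overlap(1, [11, 22, 99]))  # two leading blocks match
now[0] = 121.0
print(cache.predicted_overlap(1, [11, 22, 33]))  # predictions have expired
```

The trade-off is visible even in this toy: a long TTL keeps predictions after the worker may have evicted the blocks, while a short TTL discards overlap the worker still has. This is why event-driven mode, which follows real eviction events, is described as the more reliable path.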
`--router-queue-threshold` and the busy thresholds (`--active-decode-blocks-threshold`, `--active-prefill-tokens-threshold`, `--active-prefill-tokens-threshold-frac`) serve different purposes. Busy thresholds reject a worker entirely from the candidate set when it exceeds a utilization limit. In contrast, `--router-queue-threshold` defers the entire routing decision until at least one worker has capacity, so the request is routed with the freshest load metrics. The busy thresholds can be updated at runtime without restarting the frontend via the `/busy_threshold` HTTP endpoint. For the eligibility and backpressure distinction, see [Router Filtering](router-filtering.md). For rejection behavior details, see [Request Rejection](/dynamo/user-guides/fault-tolerance/request-rejection). # Router Examples For quick start instructions, see the [Router README](/dynamo/components/router). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns. ## Using KvRouter Python API Instead of launching the KV Router via command line, you can create a `KvRouter` object directly in Python. This allows per-request routing configuration overrides. **Multiple Routers from the Same Runtime**: Do not create multiple independently managed `KvRouter` instances from the same `DistributedRuntime`. Routers created from endpoints owned by the same runtime share that runtime's primary cancellation token, so dropping one router can cancel background work used by the others. For one in-process frontend, use a single `KvRouter`; for independent router lifetimes, use separate frontend processes or create each router from a separate `DistributedRuntime`. 
With the event loop available as `loop`, independent in-process router lifetimes require separate runtimes: ```python router_a = KvRouter(DistributedRuntime(loop, "etcd", "tcp").endpoint("dynamo.backend.generate"), 16, KvRouterConfig()) router_b = KvRouter(DistributedRuntime(loop, "etcd", "tcp").endpoint("dynamo.backend.generate"), 16, KvRouterConfig()) ``` ### Methods The `KvRouter` provides the following methods: - **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management. - **`best_worker(token_ids, router_config_override=None, request_id=None, update_indexer=False)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`. - Without `request_id`: Query-only, doesn't update router state - With `request_id`: Updates router lifecycle state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking - With `update_indexer=True`: Records the selected worker in the approximate indexer for future overlap predictions. This is only meaningful when `use_kv_events=False` - **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries. - **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`. - **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`. 
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.

### Setup

First, launch your backend engines:

```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```

### Example Script

```python
import asyncio
from dynamo.runtime import DistributedRuntime
from dynamo.llm import KvRouter, KvRouterConfig

async def main():
    # Get runtime and create endpoint
    loop = asyncio.get_running_loop()
    runtime = DistributedRuntime(loop, "etcd", "nats")
    endpoint = runtime.endpoint("dynamo.backend.generate")

    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config
    )

    # Optional startup gate shared with the frontend and standalone indexer:
    # os.environ["DYN_ROUTER_MIN_INITIAL_WORKERS"] = "2"

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,    # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_credit": 1.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,    # Add routing randomness
        }
    )

    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            generated_tokens.extend(response["token_ids"])

    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")

if __name__ == "__main__":
    asyncio.run(main())
```

## K8s Examples

For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](/dynamo/user-guides/kv-cache-aware-routing#kubernetes-deployment) in the Router Guide.
### Complete K8s Examples

- [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/backends/sglang/deploy/agg_router.yaml)
- [Kubernetes deployment guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart)

**For A/B Testing and Advanced K8s Setup:** See the comprehensive [KV Router A/B Benchmarking Guide](/dynamo/additional-resources/kv-router-a-b-testing) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.

### Example with Advanced Configuration

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
        - name: DYN_ROUTER_TEMPERATURE
          value: "0.5"  # Add some randomness to prevent worker saturation
        - name: DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT
          value: "1.0"  # Prefer device-local KV cache reuse
        - name: DYN_ROUTER_PREFILL_LOAD_SCALE
          value: "1.5"  # Prioritize TTFT over ITL
        - name: DYN_KV_CACHE_BLOCK_SIZE
          value: "16"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
```

### Alternative: Using Command Args in K8s

You can also pass CLI arguments directly in the container command:

```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0
    command:
      - /bin/sh
      - -c
    args:
      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
```

**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.

## Routing Patterns

The `KvRouter` supports multiple usage patterns depending on your control requirements:

### 1. Automatic Routing (Recommended)

Call `generate()` directly and let the router handle everything:

```python
stream = await router.generate(token_ids=tokens, model="model-name")
```

- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle

### 2. Manual State Management (Advanced)

Use `best_worker(request_id=...)` to select and track, then manage the request yourself:

```python
worker_id, _dp_rank, overlap = await router.best_worker(
    tokens,
    request_id="req-123",
    update_indexer=True,  # needed for approximate mode (use_kv_events=False)
)
response = await client.generate(tokens, request_id="req-123")
# await anext(response)  # Get first token
await router.mark_prefill_complete("req-123")  # After first token
# async for _ in response:  # Continue generating
#     ...
await router.free("req-123")  # After completion
```

- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Approximate mode**: Pass `update_indexer=True` when `use_kv_events=False` so the router learns from manual worker selections
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy

### 3. Hierarchical Router Probing

Query without state updates, then route through a chosen router:

```python
# Probe multiple routers without updating state
worker_id_1, dp_rank, overlap_1 = await router_1.best_worker(tokens)  # No request_id
worker_id_2, dp_rank, overlap_2 = await router_2.best_worker(tokens)

# Pick the best router and corresponding worker based on results
if overlap_1 > overlap_2:
    chosen_router, chosen_worker = router_1, worker_id_1
else:
    chosen_router, chosen_worker = router_2, worker_id_2

stream = await chosen_router.generate(tokens, model="model-name", worker_id=chosen_worker)
```

- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one

### 4. Custom Load-Based Routing

Use `get_potential_loads()` to implement custom routing logic:

```python
loads = await router.get_potential_loads(tokens)
# Apply custom logic (e.g., weighted scoring, constraints)
best_worker = min(loads, key=lambda x: custom_cost_fn(x))
stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
```

- **Best for**: Custom optimization strategies beyond the built-in cost function
- **Advantage**: Full control over worker selection logic
- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"

All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
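Pattern 4 leaves `custom_cost_fn` undefined. One possible shape is sketched below, under assumptions: the load dictionaries are taken to carry the `worker_id`, `potential_prefill_tokens`, and `potential_decode_blocks` keys used elsewhere on this page, while the weights, the `block_size` normalization, and the `make_cost_fn` helper are illustrative choices rather than the router's built-in cost function.

```python
def make_cost_fn(prefill_weight=1.0, decode_weight=1.0, block_size=16):
    """Build a cost function over load dicts (hypothetical helper)."""

    def custom_cost_fn(load):
        # Convert prefill tokens to blocks so both terms share one unit.
        prefill_cost = prefill_weight * load["potential_prefill_tokens"] / block_size
        decode_cost = decode_weight * load["potential_decode_blocks"]
        return prefill_cost + decode_cost

    return custom_cost_fn


# Sample data standing in for the result of get_potential_loads().
loads = [
    {"worker_id": 1, "potential_prefill_tokens": 64, "potential_decode_blocks": 10},
    {"worker_id": 2, "potential_prefill_tokens": 16, "potential_decode_blocks": 12},
]

ttft_biased = min(loads, key=make_cost_fn(prefill_weight=1.5))
decode_biased = min(loads, key=make_cost_fn(decode_weight=3.0))
print(ttft_biased["worker_id"], decode_biased["worker_id"])
```

Raising `prefill_weight` favors workers with less pending prefill work, which tends to help TTFT. Raising `decode_weight` favors workers with fewer active decode blocks. The selected `worker_id` can then be passed to `generate()` as in pattern 4.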
## Custom Routing Example: Minimizing TTFT

Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:

```python
import asyncio
from dynamo.runtime import DistributedRuntime
from dynamo.llm import KvRouter, KvRouterConfig

async def minimize_ttft_routing():
    # Setup router
    loop = asyncio.get_running_loop()
    runtime = DistributedRuntime(loop, "etcd", "nats")
    endpoint = runtime.endpoint("dynamo.backend.generate")
    router = KvRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=KvRouterConfig()
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Get potential loads for all workers
    potential_loads = await router.get_potential_loads(token_ids)

    # Find worker with minimum prefill tokens (best for TTFT)
    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])

    print(f"Worker loads: {potential_loads}")
    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")

    # Route directly to the selected worker
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
        stop_conditions={"max_tokens": 20}
    )

    # Process response
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            print(f"Generated tokens: {response['token_ids']}")

if __name__ == "__main__":
    asyncio.run(minimize_ttft_routing())
```

This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements.
As some examples: - **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens` - **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads - **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together See [Router Design](/dynamo/design-docs/component-design/router-design) for architecture details and the cost function algorithm. ## KV Event Publishing for Custom Engines For full documentation on implementing KV event publishing for custom inference engines, see the dedicated [KV Event Publishing for Custom Engines](/dynamo/integrations/kv-cache-integrations/kv-events-for-custom-engines) guide. It covers: - **Direct publishing**: Call `publish_stored()` / `publish_removed()` to push events over the Dynamo event plane - **ZMQ relay**: For engines that emit raw KV events over ZMQ (like SGLang and vLLM), the same `KvEventPublisher` subscribes to the ZMQ socket and relays events automatically - API reference, event structure, ZMQ wire format, and best practices ## Global Router (Hierarchical Routing) For deployments with multiple worker pools, the **Global Router** enables hierarchical routing by sitting between the frontend and local routers. It selects the appropriate pool for each request based on configurable policies, supporting disaggregated topologies where pools are tuned for different workload characteristics. 
- **Component details**: [`components/src/dynamo/global_router/`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/components/src/dynamo/global_router/) - **Example**: [`examples/global_planner/`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/global_planner/) ## See Also - **[Router README](/dynamo/components/router)**: Quick start guide for the KV Router - **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags and production setup - **[Router Design](/dynamo/design-docs/component-design/router-design)**: Architecture details and event transport modes # Standalone KV Indexer ## Overview The standalone KV indexer (`python -m dynamo.indexer`) is a lightweight service that maintains a radix tree of cached blocks and exposes HTTP endpoints for querying and managing workers. - It subscribes to ZMQ KV event streams directly from workers. - It exposes an HTTP API for registration, inspection, and overlap queries. - It preserves P2P recovery and gap detection/replay for the standalone ZMQ path. - It indexes device, host-pinned, and disk tier blocks and reports per-tier matches in `/query` responses. This is distinct from the [Standalone Router](../../../components/src/dynamo/router/README.md), which is a full routing service. The standalone indexer provides only the indexing and query layer without routing logic. For Dynamo-native remote indexing, use `--serve-indexer` on `dynamo.frontend` or `dynamo.router` and `--use-remote-indexer` on consumers instead. That request-plane service reuses the router's existing event ingestion and recovery machinery; it is not implemented by `dynamo.indexer`. The HTTP API follows the [Mooncake KV Indexer RFC](https://github.com/kvcache-ai/Mooncake/issues/1403) conventions. `DYN_ROUTER_MIN_INITIAL_WORKERS` is also honored here. 
When set to a positive integer, the standalone indexer waits for that many workers to register before opening its startup-ready gate, matching the frontend/router startup behavior. ## Multi-Model and Multi-Tenant Support The indexer maintains one radix tree per `(model_name, tenant_id)` pair. Workers registered with different model names or tenant IDs are isolated into separate indexers — queries against one model/tenant never return scores from another. - **`model_name`** (required on `/register` and `/query`): Identifies the model. Workers serving different models get separate radix trees. - **`tenant_id`** (optional, defaults to `"default"`): Enables multi-tenant isolation within the same model. Omit for single-tenant deployments. - **`block_size`** is per-indexer: the first `/register` call for a given `(model_name, tenant_id)` sets the block size. Subsequent registrations for the same pair must use the same block size or the request will fail. ## Compatibility The standalone indexer works with any engine that publishes KV cache events over ZMQ in the expected msgpack format. This includes bare vLLM and SGLang engines, which emit ZMQ KV events natively — no Dynamo-specific wrapper is required. Events tagged with non-device storage tiers (host-pinned, disk, external) are routed into a lower-tier slot rather than dropped, and surface in `/query` responses as `cpu` / `disk` reach. ## Use Cases - **Debugging**: Inspect the radix tree state to verify which blocks are cached on which workers. - **State verification**: Confirm that the indexer's view of KV cache state matches the router's internal state (used in integration tests). - **Custom routing**: Build external routing logic that queries the indexer for overlap scores and makes its own worker selection decisions. - **Monitoring**: Observe KV cache distribution across workers without running a full router. 
- **Standalone microservice**: Run an indexer independently of the router/frontend when you want direct HTTP inspection and ZMQ-based ingestion.

## P2P Recovery

Multiple indexer replicas can subscribe to the same ZMQ worker endpoints for fault tolerance. When a replica starts (or restarts after a crash), it bootstraps its radix tree state from a healthy peer before processing live events.

### How It Works

1. Workers are registered via `--workers` or `/register`. Each ZMQ listener enters `pending` state and begins its initial subscribe/connect attempt in the background.
2. A 1-second delay biases peer recovery past the slow-joiner window, so the dump covers events that may have occurred before a fresh listener can safely start draining.
3. The indexer fetches a `/dump` from the first reachable peer in `--peers`.
4. Dump events are applied to populate the radix tree.
5. After recovery completes, the ready gate opens. Any listener whose initial ZMQ connect has already succeeded transitions to `active` and begins draining buffered events; listeners for workers that are still down remain `pending` until they connect.

If no peers are reachable, the indexer starts with an empty state.

### Example: Two-Replica Setup

```bash
# Replica A (first instance, no peers)
python -m dynamo.indexer --port 8090 --block-size 16 \
  --workers "1=tcp://worker1:5557,2=tcp://worker2:5558"

# Replica B (recovers from A on startup)
python -m dynamo.indexer --port 8091 --block-size 16 \
  --workers "1=tcp://worker1:5557,2=tcp://worker2:5558" \
  --peers "http://localhost:8090"
```

Both replicas subscribe to the same workers. Replica B recovers A's tree state on startup, then both independently process live ZMQ events going forward.

### Consistency

The dump is a weakly consistent BFS snapshot of the radix tree; concurrent writes may race with the traversal. This is acceptable because:

- **Stale blocks** (partially removed branches): live `Removed` events will clean them up.
- **Missing blocks** (partially added branches): live `Stored` events will add them. - The tree converges to the correct state after live events catch up. ### Peer Management Peers can be registered at startup via `--peers` or dynamically via the HTTP API. The peer list is used for recovery only — peers do not synchronize state in real time. ## Building The service is exposed through the Python bindings package and launched with `python -m dynamo.indexer` after building the bindings with maturin. Feature flags control which capabilities are compiled in: | Feature | Description | |---------|-------------| | `kv-indexer` | Core standalone indexer service path (`python -m dynamo.indexer`: HTTP API, ZMQ listeners, P2P recovery) | | `kv-indexer-metrics` | Optional `/metrics` endpoint | ### Standalone build ```bash cd lib/bindings/python && VIRTUAL_ENV=../../.venv ../../.venv/bin/maturin develop --uv --features kv-indexer ``` After installation, launch the service with `python -m dynamo.indexer`. ### Standalone build with metrics ```bash cd lib/bindings/python && VIRTUAL_ENV=../../.venv ../../.venv/bin/maturin develop --uv --features kv-indexer,kv-indexer-metrics ``` This keeps the default `kv-indexer` build lean while still allowing Prometheus metrics when needed. 
## CLI

```bash
python -m dynamo.indexer --port 8090 [--threads 4] [--block-size 16 --model-name my-model --tenant-id default --workers "1=tcp://host:5557,2:1=tcp://host:5558"] [--peers "http://peer1:8090,http://peer2:8091"]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--block-size` | (none) | KV cache block size for initial `--workers` (required when `--workers` is set) |
| `--port` | `8090` | HTTP server listen port |
| `--threads` | `4` | Number of indexer threads (1 = single-threaded, >1 = thread pool) |
| `--workers` | (none) | Initial workers as `instance_id[:dp_rank]=zmq_address,...` pairs (dp_rank defaults to 0) |
| `--model-name` | `default` | Model name for initial `--workers` |
| `--tenant-id` | `default` | Tenant ID for initial `--workers` |
| `--peers` | (none) | Comma-separated peer indexer URLs for P2P recovery on startup |

### Shared Startup Gate

Set `DYN_ROUTER_MIN_INITIAL_WORKERS=<N>` to require at least `<N>` workers before the standalone indexer, frontend push-router path, and KV router config-ready gate all proceed. Leave it unset or set it to `0` to disable the startup wait.

## HTTP API

### `GET /health` — Liveness check

Returns `200 OK` unconditionally.

```bash
curl http://localhost:8090/health
```

### `GET /metrics` — Prometheus metrics

Returns metrics in Prometheus text exposition format. Available when the Python bindings are built with the `kv-indexer-metrics` feature.
```bash curl http://localhost:8090/metrics ``` | Metric | Type | Labels | Description | |--------|------|--------|-------------| | `dynamo_kvindexer_request_duration_seconds` | Histogram | `endpoint` | HTTP request latency | | `dynamo_kvindexer_requests_total` | Counter | `endpoint`, `method` | Total HTTP requests | | `dynamo_kvindexer_errors_total` | Counter | `endpoint`, `status_class` | HTTP error responses (4xx/5xx) | | `dynamo_kvindexer_models` | Gauge | — | Number of active model+tenant indexers | | `dynamo_kvindexer_workers` | Gauge | — | Number of registered worker instances | | `dynamo_kvindexer_listeners` | Gauge | `status` | Number of ZMQ listeners by status (`pending`, `active`, `paused`, `failed`) | ### `POST /register` — Register an endpoint Register a ZMQ endpoint for an instance. Each call creates or reuses the indexer for the given `(model_name, tenant_id)` pair. Registration is non-blocking: if the worker is not up yet, the listener is accepted in `pending` state and transitions to `active` once the initial ZMQ connection succeeds. 
```bash
# Single model, default tenant
curl -X POST http://localhost:8090/register \
  -H 'Content-Type: application/json' \
  -d '{
    "instance_id": 1,
    "endpoint": "tcp://127.0.0.1:5557",
    "model_name": "llama-3-8b",
    "block_size": 16
  }'

# With tenant isolation
curl -X POST http://localhost:8090/register \
  -H 'Content-Type: application/json' \
  -d '{
    "instance_id": 2,
    "endpoint": "tcp://127.0.0.1:5558",
    "model_name": "llama-3-8b",
    "tenant_id": "customer-a",
    "block_size": 16,
    "dp_rank": 0
  }'
```

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `instance_id` | yes | — | Worker instance identifier |
| `endpoint` | yes | — | ZMQ PUB address to subscribe to |
| `model_name` | yes | — | Model name (used to select the indexer) |
| `block_size` | yes | — | KV cache block size (must match the engine) |
| `tenant_id` | no | `"default"` | Tenant identifier for isolation |
| `dp_rank` | no | `0` | Data parallel rank |
| `replay_endpoint` | no | — | ZMQ ROUTER address for gap replay (e.g. `tcp://host:5560`) |
| `additional_salt` | no | — | Per-tenant salt (Mooncake RFC #1403 `additionalsalt`, alias accepted). Currently parsed for forward compatibility — engines apply their own salting today. |

### `POST /unregister` — Deregister an instance

Remove an instance. Omitting `tenant_id` removes the instance from **all** tenants for the given model; providing it targets only that tenant's indexer.

```bash
# Remove from all tenants
curl -X POST http://localhost:8090/unregister \
  -H 'Content-Type: application/json' \
  -d '{"instance_id": 1, "model_name": "llama-3-8b"}'

# Remove from a specific tenant
curl -X POST http://localhost:8090/unregister \
  -H 'Content-Type: application/json' \
  -d '{"instance_id": 1, "model_name": "llama-3-8b", "tenant_id": "customer-a"}'

# Remove a specific dp_rank
curl -X POST http://localhost:8090/unregister \
  -H 'Content-Type: application/json' \
  -d '{"instance_id": 1, "model_name": "llama-3-8b", "tenant_id": "default", "dp_rank": 0}'
```

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `instance_id` | yes | — | Worker instance to remove |
| `model_name` | yes | — | Model name (identifies the indexer) |
| `tenant_id` | no | — | Tenant identifier (omit to remove from all tenants) |
| `dp_rank` | no | — | Specific dp_rank to remove (omit to remove all) |

### `GET /workers` — List registered instances

Returns all registered workers, optionally filtered by model and/or tenant.

| Query parameter | Description |
|-----------------|-------------|
| `model_name` | Return only workers registered for this model. Omit to return all models. |
| `tenant_id` | Return only workers registered for this tenant. Omit to return all tenants. |

```bash
# All workers
curl http://localhost:8090/workers

# Workers for a specific model
curl "http://localhost:8090/workers?model_name=llama-3-8b"

# Workers for a specific model and tenant
curl "http://localhost:8090/workers?model_name=llama-3-8b&tenant_id=customer-a"
```

Returns:

```json
[
  {
    "instance_id": 1,
    "source": "zmq",
    "status": "active",
    "model_name": "llama-3-8b",
    "tenant_id": "default",
    "block_size": 16,
    "endpoints": {
      "0": "tcp://127.0.0.1:5557",
      "1": "tcp://127.0.0.1:5558"
    },
    "listeners": {
      "0": { "endpoint": "tcp://127.0.0.1:5557", "status": "active" },
      "1": { "endpoint": "tcp://127.0.0.1:5558", "status": "active" }
    }
  }
]
```

| Response field | Description |
|----------------|-------------|
| `instance_id` | Worker instance identifier |
| `source` | Always `"zmq"` for ZMQ-managed workers |
| `status` | Aggregated listener status: `failed > pending > active > paused` |
| `model_name` | Model this worker is registered under |
| `tenant_id` | Tenant this worker is registered under |
| `block_size` | KV cache block size for this worker's `(model_name, tenant_id)` indexer |
| `endpoints` | Map of `dp_rank → zmq_address` |
| `listeners` | Per-dp_rank listener detail; each entry may include a `last_error` field when the most recent startup or recv-loop attempt failed |

Filters are independent — providing both `model_name` and `tenant_id` returns only workers matching both. An empty array is returned (not a 404) when no workers match the filter.
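The aggregated `status` field and the filter semantics above can be mirrored on the client side. The following is a minimal sketch, not the indexer's implementation: it assumes `failed > pending > active > paused` is a precedence order in which the highest-ranked status among a worker's listeners wins, and the helper names `aggregate_status` and `filter_workers` are hypothetical.

```python
# Assumed precedence, read from the `status` row: failed > pending > active > paused.
STATUS_PRECEDENCE = ["failed", "pending", "active", "paused"]


def aggregate_status(listeners):
    """Return the highest-precedence status among per-dp_rank listener entries."""
    statuses = {entry["status"] for entry in listeners.values()}
    for status in STATUS_PRECEDENCE:
        if status in statuses:
            return status
    raise ValueError("no recognized listener status")


def filter_workers(workers, model_name=None, tenant_id=None):
    """Apply the two optional filters independently; no match yields an empty list."""
    return [
        w for w in workers
        if (model_name is None or w["model_name"] == model_name)
        and (tenant_id is None or w["tenant_id"] == tenant_id)
    ]


workers = [
    {"instance_id": 1, "model_name": "llama-3-8b", "tenant_id": "default"},
    {"instance_id": 2, "model_name": "llama-3-8b", "tenant_id": "customer-a"},
]
listeners = {
    "0": {"endpoint": "tcp://127.0.0.1:5557", "status": "active"},
    "1": {"endpoint": "tcp://127.0.0.1:5558", "status": "failed"},
}
print(aggregate_status(listeners))
print(filter_workers(workers, tenant_id="customer-a"))
```

Under this reading, one failed listener is enough for the whole worker to be reported as `failed`, which is the conservative choice for a caller deciding whether to trust the worker's overlap scores.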
### `POST /query` — Query overlap for token IDs Given raw token IDs, compute block hashes and return per-instance overlap scores (in matched tokens): ```bash curl -X POST http://localhost:8090/query \ -H 'Content-Type: application/json' \ -d '{"token_ids": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], "model_name": "llama-3-8b"}' ``` Returns: ```json { "scores": {"1": {"0": 32}, "2": {"1": 0}}, "frequencies": [1, 1], "instances": { "1": { "longest_matched": 48, "gpu": 32, "dp": {"0": 32}, "cpu": 48, "disk": 48 }, "2": { "longest_matched": 0, "gpu": 0, "dp": {"1": 0}, "cpu": 0, "disk": 0 } } } ``` All counts are in **matched tokens** (block overlap count × block size). - `scores` / `frequencies`: legacy device-tier overlap. `scores` is nested by `instance_id` then `dp_rank`. Preserved for backward compatibility — existing callers do not need to change. - `instances`: per-instance, per-tier breakdown aligned with [Mooncake RFC #1403](https://github.com/kvcache-ai/Mooncake/issues/1403). See [Per-instance tier breakdown](#per-instance-tier-breakdown) below. | Field | Required | Default | Description | |-------|----------|---------|-------------| | `token_ids` | yes | — | Token sequence to query | | `model_name` | yes | — | Model name (selects the indexer) | | `tenant_id` | no | `"default"` | Tenant identifier | | `lora_name` | no | — | LoRA adapter (overrides indexer-level lora_name for this query) | | `cache_salt` | no | — | Per-request cache salt (Mooncake RFC #1403). Currently parsed for forward compatibility — engines apply their own salting today. | ### `POST /query_by_hash` — Query overlap for pre-computed hashes ```bash curl -X POST http://localhost:8090/query_by_hash \ -H 'Content-Type: application/json' \ -d '{"block_hashes": [123456, 789012], "model_name": "llama-3-8b"}' ``` Same response format as `/query`, including the per-instance `instances` map. Scores are in matched tokens. 
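The response shape shared by `/query` and `/query_by_hash` can be consumed along the following lines. This is a hedged client-side sketch that reuses the sample response shown for `/query`; the helper names (`rank_instances`, `device_scores`, `matched_blocks`) are hypothetical and not part of the indexer API.

```python
import json

# Sample body copied from the `/query` example response.
SAMPLE = json.loads("""
{
  "scores": {"1": {"0": 32}, "2": {"1": 0}},
  "frequencies": [1, 1],
  "instances": {
    "1": {"longest_matched": 48, "gpu": 32, "dp": {"0": 32}, "cpu": 48, "disk": 48},
    "2": {"longest_matched": 0, "gpu": 0, "dp": {"1": 0}, "cpu": 0, "disk": 0}
  }
}
""")


def rank_instances(response):
    """Order instance IDs by `longest_matched`, best prefix first."""
    instances = response["instances"]
    return sorted(
        instances,
        key=lambda iid: instances[iid]["longest_matched"],
        reverse=True,
    )


def device_scores(response):
    """Flatten the legacy `scores` map into {(instance_id, dp_rank): tokens}."""
    return {
        (iid, rank): tokens
        for iid, per_rank in response["scores"].items()
        for rank, tokens in per_rank.items()
    }


def matched_blocks(tokens, block_size):
    """Convert a matched-token count back to whole blocks."""
    return tokens // block_size


print(rank_instances(SAMPLE))
print(device_scores(SAMPLE))
print(matched_blocks(48, 16))
```

Note that JSON object keys are strings, so `instance_id` and `dp_rank` arrive as `"1"` and `"0"` rather than integers.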
| Field | Required | Default | Description | |-------|----------|---------|-------------| | `block_hashes` | yes | — | Pre-computed block hash array | | `model_name` | yes | — | Model name (selects the indexer) | | `tenant_id` | no | `"default"` | Tenant identifier | | `cache_salt` | no | — | Per-request cache salt (Mooncake RFC #1403). Currently parsed for forward compatibility — engines apply their own salting today. | ### Per-instance tier breakdown Each entry in `instances` is keyed by `instance_id` (as a string) and reports prefix reach across the device, host-pinned, and disk storage tiers: | Field | Description | |-------|-------------| | `gpu` | Tokens matched on the device tier (the longest device-tier prefix for any `dp_rank` of this instance). | | `dp` | Per-`dp_rank` device-tier match count, as `{rank: tokens}`. | | `cpu` | Tokens matched through the host-pinned tier. **Cumulative** through the device tier — includes everything counted in `gpu` plus any host-pinned extension. | | `disk` | Tokens matched through the disk (or external) tier. **Cumulative** through the device → host-pinned walk. | | `longest_matched` | The maximum of `gpu`, `cpu`, and `disk` — a single "best prefix length" the gateway can sort on. | Tier counts are cumulative because the lower-tier walk reports each tier's *extension* on top of the previous one. Under a natural offload pipeline (device → host → disk), this guarantees `gpu ≤ cpu ≤ disk` for every instance — lower tiers extend the device-tier prefix rather than shrink it. Legacy callers that only consume `scores` keep working: those values are equal to each instance's per-`dp_rank` `gpu` count. ### `GET /dump` — Dump all radix tree events Returns the full radix tree state as a JSON object keyed by `model_name:tenant_id`: ```bash curl http://localhost:8090/dump ``` Returns: ```json { "llama-3-8b:default": { "block_size": 16, "events": [, ...] }, "mistral-7b:customer-a": { "block_size": 16, "events": [, ...] 
} } ``` Each indexer is dumped concurrently. The `block_size` field lets recovering peers create indexers with the correct block size without requiring `--block-size` on every replica. ### `POST /register_peer` — Register a peer indexer ```bash curl -X POST http://localhost:8090/register_peer \ -H 'Content-Type: application/json' \ -d '{"url": "http://peer:8091"}' ``` ### `POST /deregister_peer` — Remove a peer indexer ```bash curl -X POST http://localhost:8090/deregister_peer \ -H 'Content-Type: application/json' \ -d '{"url": "http://peer:8091"}' ``` ### `GET /peers` — List registered peers ```bash curl http://localhost:8090/peers ``` Returns: ```json ["http://peer:8091"] ``` ## DP Rank Handling When a worker registers with the standalone KV indexer (`/register`), it provides an `instance_id`, a ZMQ `endpoint`, and an optional `dp_rank` (defaults to 0). The service spawns one ZMQ listener per registration. Each incoming `KvEventBatch` may carry an optional `data_parallel_rank` field. If present, it **overrides** the statically-registered `dp_rank` for that batch. This allows a single ZMQ port to multiplex events from multiple DP ranks. **Caveat**: the registry only tracks dp_ranks from explicit `/register` calls. If an engine dynamically emits batches with a dp_rank that was never registered, the indexer will store those blocks correctly (under the dynamic `WorkerWithDpRank` key), but per-dp_rank deregistration (`/unregister` with `dp_rank`) will not find them. Full-instance deregistration (`/unregister` without `dp_rank`) still cleans up all dp_ranks for a given `worker_id` in the tree via `remove_worker`. ## Gap Detection and Replay ZMQ PUB/SUB is lossy — messages can be dropped under backpressure or brief disconnects. The indexer detects gaps by tracking the sequence number of each batch: if `seq > last_seq + 1`, a gap is detected. 
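The gap check described above (`seq > last_seq + 1`) can be sketched as a small state machine. This is an illustration only, not the indexer's code: the `GapDetector` class is hypothetical, and it assumes `-1` means "no batch seen yet" and that stale or duplicate batches leave `last_seq` unchanged.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GapDetector:
    # Assumption: -1 marks "nothing received yet", so the first batch sets the baseline.
    last_seq: int = -1

    def observe(self, seq: int) -> Optional[Tuple[int, int]]:
        """Record a batch sequence number.

        Returns the inclusive range (first_missing, last_missing) when a gap
        is detected, otherwise None.
        """
        gap = None
        if self.last_seq >= 0 and seq > self.last_seq + 1:
            gap = (self.last_seq + 1, seq - 1)
        if seq > self.last_seq:
            self.last_seq = seq
        return gap


detector = GapDetector()
for seq in (0, 1, 4, 5):
    gap = detector.observe(seq)
    if gap is not None:
        print(f"gap detected, missing batches {gap[0]}..{gap[1]}")
```

Keeping the detector object alive across an unregister/register cycle corresponds to the persisted `last_seq` behavior: the first batch seen by a new listener is still compared against the old counter.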
When a `replay_endpoint` is provided during `/register`, the indexer connects a DEALER socket to the engine's ROUTER socket and requests the missing batches by sequence number. The engine streams back buffered `(seq, payload)` pairs from its ring buffer until an empty-payload sentinel. If no `replay_endpoint` is configured, gaps are logged as warnings but not recovered.

The sequence counter (`last_seq`) persists across unregister/register cycles, so re-registering a worker after a gap will trigger replay on the first batch received by the new listener.

## Limitations

- **Standalone mode is ZMQ only**: Workers must publish KV events via ZMQ PUB sockets.
- **No routing logic**: The indexer only maintains the radix tree and answers queries. It does not track active blocks, manage request lifecycle, or perform worker selection.

## Architecture

### Standalone Mode

```mermaid
graph TD
    subgraph Workers
        W1[Worker 1<br/>ZMQ PUB]
        W2[Worker 2<br/>ZMQ PUB]
    end

    subgraph "Standalone Indexer (HTTP)"
        REG[Worker Registry]
        ZMQ[ZMQ SUB Listeners]
        IDX["Indexer Map<br/>(model, tenant) → Radix Tree"]
        HTTP[HTTP API<br/>/query /dump /register /health]
    end

    CLIENT[External Client]

    W1 -->|ZMQ events| ZMQ
    W2 -->|ZMQ events| ZMQ
    CLIENT -->|POST /register| REG
    REG -->|spawn listeners| ZMQ
    ZMQ -->|apply events| IDX
    CLIENT -->|POST /query, GET /dump| HTTP
    HTTP -->|query| IDX

    style W1 fill:#f3e5f5,stroke:#333,color:#333
    style W2 fill:#f3e5f5,stroke:#333,color:#333
    style IDX fill:#2e8b57,stroke:#333,color:#fff
    style ZMQ fill:#2e8b57,stroke:#333,color:#fff
    style REG fill:#2e8b57,stroke:#333,color:#fff
    style HTTP fill:#2e8b57,stroke:#333,color:#fff
    style CLIENT fill:#fff3e0,stroke:#333,color:#333
```

### P2P Recovery Flow

```mermaid
sequenceDiagram
    participant B as Replica B (new)
    participant A as Replica A (healthy)
    participant W as Workers (ZMQ PUB)

    B->>W: Connect ZMQ SUB sockets
    Note over B,W: 1s delay for peer tree to advance past connection point
    B->>A: GET /dump
    A-->>B: Radix tree snapshot + block sizes
    Note over B: Apply dump events
    Note over B: Unblock ZMQ listeners
    B->>W: Start draining buffered events
    Note over B: Ready to serve queries
```

## See Also

- **[Mooncake KV Indexer RFC](https://github.com/kvcache-ai/Mooncake/issues/1403)**: Community API standardization for KV cache indexers
- **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Full KV router configuration and tuning
- **[Router Design](/dynamo/design-docs/component-design/router-design)**: Architecture and event transport modes
- **[Standalone Router](../../../components/src/dynamo/router/README.md)**: Full routing service (routes requests to workers)

# Standalone Slot Tracker

## Overview

The standalone slot tracker (`python -m dynamo.slot_tracker`) exposes the KV router's active-request accounting as a small HTTP service. It is runtime-independent: consumers register workers manually, submit request lifecycle events, and read advisory load snapshots for their own routing decisions.

The service accepts ordered final chained sequence hashes, one hash per prompt block.
Hashes are serialized as signed 64-bit JSON integers and reinterpreted bit-for-bit as internal unsigned hashes. Send hashes rather than prompt tokens. This first version intentionally excludes metrics, discovery-based registration, output block updates, replica synchronization, persistence, and peer recovery. ## Build And Launch Build the Python bindings with the `slot-tracker` feature: ```bash cd lib/bindings/python VIRTUAL_ENV=../../../.venv ../../../.venv/bin/maturin develop --uv --features slot-tracker ``` Launch the service: ```bash .venv/bin/python -m dynamo.slot_tracker --port 8091 ``` The default port is `8091`. `GET /health` returns `200 OK` with an empty body as soon as the HTTP listener is ready. This endpoint is liveness-only. After a restart the registry is empty; consumers must re-register workers and replay active requests if they need restored accounting. The service binds to `0.0.0.0` and does not provide authentication. Run it on a trusted internal network or place it behind an appropriate network policy. ## Common Responses Successful topology and lifecycle writes return: ```json {"status": "ok"} ``` Errors, including malformed JSON, oversized JSON bodies, unknown routes, and unsupported methods, return: ```json {"error": "concise description"} ``` `tenant_id` defaults to `"default"` when omitted. Request bodies use Axum's default bounded JSON handling. ## Topology API ### `POST /register` Register one contiguous data-parallel range: ```json { "worker_id": 7, "model_name": "llama-3-8b", "tenant_id": "default", "block_size": 16, "dp_start": 0, "dp_size": 2 } ``` Returns `201`. `block_size` and `dp_size` must be positive, and the DP range must not overflow. Workers in the same `(model_name, tenant_id)` tracker must use the same block size. Worker IDs are scoped by `(model_name, tenant_id)`. 
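The `/register` constraints above can be expressed as a few explicit checks. The sketch below is an assumption-laden illustration, not the service's validation code: the document does not state the integer width behind "must not overflow", so an unsigned 32-bit bound is assumed here, and the names `dp_ranks`, `register`, and `trackers` are hypothetical.

```python
# Assumption: ranks fit in an unsigned 32-bit integer; the real width is not documented here.
U32_MAX = 2**32 - 1


def dp_ranks(dp_start, dp_size, limit=U32_MAX):
    """Expand a contiguous DP range into the ranks it covers."""
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    if dp_start < 0:
        raise ValueError("dp_start must be non-negative")
    if dp_start + dp_size - 1 > limit:
        raise ValueError("DP range overflows")
    return range(dp_start, dp_start + dp_size)


def register(trackers, req):
    """Validate a registration against trackers keyed by (model_name, tenant_id)."""
    block_size = req["block_size"]
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    ranks = dp_ranks(req["dp_start"], req["dp_size"])
    key = (req["model_name"], req.get("tenant_id", "default"))
    # The first registration fixes the block size for this tracker.
    if trackers.setdefault(key, block_size) != block_size:
        raise ValueError("block_size differs from existing tracker")
    return list(ranks)


trackers = {}
request = {
    "worker_id": 7,
    "model_name": "llama-3-8b",
    "block_size": 16,
    "dp_start": 0,
    "dp_size": 2,
}
print(register(trackers, request))
```

With the example body from this section (`dp_start: 0`, `dp_size: 2`), the worker covers DP ranks 0 and 1, which are the only `dp_rank` values a later lifecycle write for this worker can reference.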
### `POST /unregister` Remove a worker's full DP range and active requests immediately: ```json { "worker_id": 7, "model_name": "llama-3-8b", "tenant_id": "default" } ``` Returns `200`, or `404` if the registration does not exist. ### `GET /workers` List workers with independent optional `model_name` and `tenant_id` filters: ```json [ { "worker_id": 7, "model_name": "llama-3-8b", "tenant_id": "default", "block_size": 16, "dp_start": 0, "dp_size": 2 } ] ``` The response is sorted for stable inspection. ## Lifecycle API ### `POST /add` Record prompt blocks on a registered worker rank: ```json { "model_name": "llama-3-8b", "tenant_id": "default", "request_id": "req-123", "worker_id": 7, "dp_rank": 0, "sequence_hashes": [101, -22, 303], "new_isl_tokens": 48 } ``` Returns `201`. `sequence_hashes` is required and may be empty. `new_isl_tokens` defaults to `0`; positive values enable prefill-token accounting. Duplicate request IDs return `409`. Unknown trackers or worker ranks return `404`. ### `POST /prefill_complete` Mark prompt processing complete: ```json { "model_name": "llama-3-8b", "tenant_id": "default", "request_id": "req-123" } ``` Returns `200` for an active request. Repeated completion is a no-op. Unknown requests return `404`. ### `POST /free` Release prompt blocks and any remaining prefill state: ```json { "model_name": "llama-3-8b", "tenant_id": "default", "request_id": "req-123" } ``` Returns `200`. Free is idempotent while the model/tenant tracker exists, including for an unknown request. Unknown trackers return `404`. Lifecycle writes preserve the core slot tracker's arrival ordering. Consumers should normally wait for `/add` success before sending later lifecycle writes. The service does not repair reordered delivery: an early unknown `/free` or `/prefill_complete` is forgotten, so a later `/add` may remain accounted until a later free or expiry. A request older than 300 seconds may be removed by inherited stale-request cleanup. 
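The `/add` example contains a negative entry in `sequence_hashes` because hashes travel as signed 64-bit JSON integers while being unsigned internally. A minimal sketch of the bit-for-bit conversion a client might apply before sending is shown below; the function names are hypothetical, and only standard two's-complement arithmetic is assumed.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63


def to_unsigned64(signed):
    """Reinterpret a signed 64-bit integer as unsigned (two's complement)."""
    if not -SIGN_BIT <= signed < SIGN_BIT:
        raise ValueError("value does not fit in a signed 64-bit integer")
    return signed & MASK64


def to_signed64(unsigned):
    """Reinterpret an unsigned 64-bit hash as the signed value to put in JSON."""
    if not 0 <= unsigned <= MASK64:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    return unsigned - (1 << 64) if unsigned >= SIGN_BIT else unsigned


# Hashes with the top bit set become negative on the wire.
internal_hashes = [101, (1 << 64) - 22, 303]
wire_hashes = [to_signed64(h) for h in internal_hashes]
print(wire_hashes)
```

The conversion is lossless in both directions, so a consumer that computes unsigned hashes can round-trip them through the JSON API without changing which blocks they identify.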
## Load API ### `GET /loads` Read current load snapshots with independent optional `model_name` and `tenant_id` filters: ```json [ { "model_name": "llama-3-8b", "tenant_id": "default", "worker_id": 7, "dp_rank": 0, "active_prefill_tokens": 48, "active_decode_blocks": 3 } ] ``` The response is sorted for stable inspection. ### `POST /potential_loads` Project the loads for a new request: ```json { "model_name": "llama-3-8b", "tenant_id": "default", "sequence_hashes": [101, -22, 303, 404], "new_isl_tokens": 48 } ``` Returns: ```json [ { "worker_id": 7, "dp_rank": 0, "potential_prefill_tokens": 96, "potential_decode_blocks": 4 } ] ``` Projection response order is unspecified to keep the routing read path lean. `/loads` and `/potential_loads` are advisory snapshots, not reservations. A selected worker may disappear before `/add`; recompute after `/add` returns `404`. An ambiguous `/add` timeout is also consumer-owned: automatically retrying the same request is not guaranteed safe because duplicate adds return `409`. # KV Event Replay — Dynamo vs vLLM ## Overview Both Dynamo and vLLM publish KV cache events (block stored, block removed, etc.) over a fire-and-forget transport (ZMQ PUB/SUB). Because PUB/SUB is lossy, both systems need a mechanism for consumers to detect missed messages and recover. This document compares the two approaches. ## The Problem A KV event consumer (router, cache coordinator) subscribes to a live stream of block events from workers. Events carry monotonically increasing sequence numbers. When the consumer detects a gap in the sequence (e.g., received seq 42 then seq 45), it needs to recover the missed events or it will have a stale, incorrect view of the worker's KV cache state. 
## Architecture Comparison | | vLLM Replay Buffer | Dynamo Local Indexer | |---|---|---| | **Core buffer** | `collections.deque[tuple[int, bytes]]` with `maxlen` | `VecDeque` with `max_buffer_size` | | **Buffer semantics** | FIFO ring, old entries silently dropped | FIFO ring, old entries silently dropped | | **Event ordering** | Monotonic sequence number (8-byte int) | Monotonic `event_id` with consecutive-ID validation | | **Lookup** | Linear scan (`for seq, buf in buffer`) | Binary search (`binary_search_by_key`) | | **Serialization** | Pre-serialized msgpack bytes stored in buffer | Structured events stored; serialized on demand | | **Fallback when buffer too old** | Consumer must rebuild externally | Tree dump of full RadixTree state | | **Initial sync** | Not built in — consumer starts from live stream | Tree dump (request with `start_event_id=None`) | | **Authoritative state** | Buffer only | RadixTree (buffer is an optimization layer) | | **Compression / dedup** | Events stored as-is (pre-serialized) | RadixTree compresses shared prefixes across sequences | | **Expiration** | Implicit via `maxlen` eviction | TTL expiration via `PruneManager` | | **Transport** | ZMQ PUB/SUB + ROUTER/REQ | Dynamo service RPC (request/response) | | **Multi-rank** | Port offset per DP rank | Separate query endpoint per DP rank | | **Thread model** | Background thread with queue | Single-threaded tokio runtime on dedicated OS thread | | **Delivery guarantee** | At-least-once (consumer dedupes) | At-least-once (router dedupes via event ID tracking) | | **Dedup responsibility** | Consumer must filter by seq number | Handled inside indexer infrastructure | ## How Each System Works ### vLLM: Buffer-Only Replay vLLM's `ZmqEventPublisher` (in `vllm/distributed/kv_events.py`) runs two ZMQ sockets in a background thread: 1. **PUB socket** (default `tcp://*:5557`): Streams `KVEventBatch` messages tagged with a monotonic sequence number. 2. 
**ROUTER socket** (optional, e.g., `tcp://*:5558`): Handles replay requests from consumers. The publisher keeps a `deque` of the last `buffer_steps` (default 10,000) serialized batches. When a consumer detects a gap, it sends the missing start sequence number to the ROUTER socket. The publisher linearly scans the buffer and streams back all batches from that sequence onward, ending with a sentinel (`seq=-1, payload=empty`). **Trade-offs:** - Lightweight — no additional state beyond the buffer itself; easy to reason about and deploy. - If the gap is older than the buffer window, the consumer must rebuild state through other means (e.g., restart and re-discover). - No built-in initial state sync — a consumer that connects after events have already been published starts with an empty view. - Linear scan on every replay request (no indexing into the buffer). - Consumer handles dedup by checking `replay_seq > last_seq`. ### Dynamo: Buffer + Indexer with Tree Dump Fallback Dynamo's `LocalKvIndexer` (in `lib/kv-router/src/indexer/local.rs`) wraps a `KvIndexer` (backed by a `RadixTree`) with a circular event buffer: ```text LocalKvIndexer ├── indexer: KvIndexer // Authoritative state (RadixTree) ├── event_buffer: VecDeque // Circular buffer for fast replay └── max_buffer_size: usize ``` When the router queries a worker for events via `get_events_in_id_range(start_id, end_id)`, the local indexer returns one of three responses: | Response | When | What happens | |----------|------|--------------| | `Events` | Requested range within buffer | Returns buffered events directly (binary search for slice bounds) | | `TreeDump` | Range too old or initial sync (`start_id=None`) | Dumps the full RadixTree as synthetic events — complete state snapshot | | `TooNew` | Consumer is ahead of producer | Error response; no gap to fill | The tree dump fallback means that when the buffer can't satisfy the request, the indexer falls back to dumping the entire tree state. 
This makes "buffer too old" a recoverable condition at the cost of additional complexity and memory for maintaining the tree.

## Gap Detection

Both systems detect gaps the same way: the consumer tracks the last sequence/event ID it processed and compares it against the next one received.

**vLLM** (from `examples/online_serving/kv_events_subscriber.py`):

```python
if last_seq >= 0 and seq > last_seq + 1:
    missed = seq - last_seq - 1
    replay.send((last_seq + 1).to_bytes(8, "big"))
    # ... receive and process replayed events
```

**Dynamo** (from `lib/llm/src/kv_router/worker_query.rs`):

The router tracks `last_recovered_event_id` per worker and requests `recover_from_worker(worker_id, dp_rank, start_event_id, end_event_id)` when it detects a gap or on initial discovery. The local indexer handles the complexity of deciding whether to replay from buffer or dump the tree.

## When to Use Which

**vLLM's built-in replay** is a good fit when:

- You are running vLLM standalone and want basic gap recovery without additional infrastructure.
- Your consumer is long-lived and rarely disconnects — transient gaps are the main concern.
- You are building a custom external router or cache coordinator and want to consume KV events directly from vLLM without wrapping it in another framework.

**Dynamo's local indexer** is a good fit when:

- You need robust recovery, including initial state sync for newly joined routers or consumers that were offline for extended periods.
- You are running multiple router replicas that may start at different times and need to converge on a consistent view of cache state.
- You want dedup and recovery handled by the infrastructure rather than implementing it in each consumer.

The two approaches share the same core idea — a FIFO ring buffer for catching up on small, transient gaps. Dynamo adds a RadixTree underneath as authoritative state, which enables the tree dump fallback for full state recovery at the cost of additional memory and complexity.
vLLM keeps it simple with just the buffer, which is sufficient when consumers are stable and gaps are short-lived. For deployments using Dynamo's KV-aware routing, the local indexer is used automatically. For standalone vLLM deployments where you want to build your own event consumer, vLLM's replay buffer provides a lightweight starting point. ## See Also - **[KV Router Index Data Structures](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/lib/kv-router/src/indexer/README.md)**: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` internals - **[Router Guide](/dynamo/user-guides/kv-cache-aware-routing)**: Deployment modes and quick start for KV-aware routing - **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags and tuning details - **[Router Design](/dynamo/design-docs/component-design/router-design)**: Architecture details and event transport modes # Planner ## Why LLM Inference Needs a Different Autoscaler Scaling a traditional web service is straightforward: watch CPU or request rate, add replicas when load is high, remove them when it's low. Tools like HPA and KEDA work well for this because the relationship between load and latency is roughly linear — twice the requests means roughly twice the CPU, so a simple threshold policy keeps response times stable. LLM inference breaks these assumptions: - **Latency depends on request content, not just request count.** A single request with a 32K-token prompt consumes orders of magnitude more compute than a short one. Two requests per second can mean completely different GPU loads depending on input/output sequence lengths. - **Prefill and decode have different scaling characteristics.** In disaggregated serving, prefill is compute-bound (scales with input length) while decode is memory-bound (scales with concurrent sequences and KV cache usage). A single replica count doesn't capture both. 
- **The metrics that matter aren't standard.** The SLAs users care about — Time to First Token (TTFT) and Inter-Token Latency (ITL) — don't map cleanly to CPU utilization or request throughput. HPA can't target "keep P95 TTFT under 500ms" because that requires understanding the relationship between sequence lengths, GPU memory pressure, and latency. - **Scaling decisions are expensive.** Spinning up a GPU worker takes minutes, not seconds. Overscaling wastes GPU-hours at cloud prices; underscaling violates SLAs. The autoscaler needs to predict demand, not just react to it. The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It understands engine profiling data, tracks per-worker GPU utilization, predicts traffic patterns, and makes scaling decisions that directly target TTFT and ITL SLAs — not proxy metrics. ## Getting Started: Optimization Targets The planner offers three `optimization_target` settings that control how scaling decisions are made: | Target | Description | Requires SLA? | Requires Profiling? | |--------|-------------|:-------------:|:-------------------:| | **`throughput`** (default) | Maximizes throughput by scaling based on queue depth and KV cache utilization. Scales up when engines are saturated, scales down when utilization drops. | No | No | | **`latency`** | Minimizes latency by scaling aggressively to keep queues short. Scales up at lower utilization thresholds. | No | No | | **`sla`** | Targets specific TTFT/ITL SLA values using regression-based performance models. Most precise, but requires configuration. | Yes (`ttft_ms`, `itl_ms`) | Recommended | **We recommend starting with the default `throughput` target** — it works out of the box with zero configuration. Switch to `latency` if your workload is latency-sensitive, or to `sla` when you need precise SLA targeting with pre-deployment profiling. 
> **New to the Planner?** Start with the [Planner Guide](/dynamo/components/planner/planner-guide) for a complete workflow including profiling and deployment. > **Need multi-DGD coordination?** See the [Global Planner Guide](/dynamo/components/planner/global-planner-guide) for shared-policy coordination across multiple DGDs and single-endpoint multi-pool deployments. ## Scaling Modes The Planner supports two scaling modes that can run independently or together: - **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments. - **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts. When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor. ## Feature Matrix | Feature | Throughput-Based | Load-Based | |---------|:----------------:|:-------------------------:| | **Deployment** | | | | Disaggregated | Supported | Supported | | Aggregated | Supported | Supported | | **LLM Framework** | | | | SGLang | Supported | Supported | | TensorRT-LLM | Supported | Supported | | vLLM | Supported | Supported | | **Requires Pre-deployment Data** | Yes (self-benchmark or profiler) | No | | **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A | | **Router** | | | | Any (round-robin, random, etc.) 
| Supported | Not supported | | KV Router | Supported | Supported | | **Connectors** | | | | KubernetesConnector | Supported | Supported | | VirtualConnector | Supported | Supported | ## When to Use Which Mode - **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning. - **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data. - **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling. ## Quick Start ### Prerequisites - Dynamo platform installed on Kubernetes ([Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide)) - kube-prometheus-stack installed ([Metrics Setup](/dynamo/kubernetes-deployment/operate/observability/metrics)) ### Default Mode (zero config) The planner works out of the box with no configuration needed. 
By default, `optimization_target` is set to `throughput`, which uses static thresholds on queue depth and KV cache utilization — no SLAs or profiling required: ```yaml # Minimal planner config — uses throughput optimization by default features: planner: mode: disagg backend: vllm ``` For latency-sensitive workloads: ```yaml features: planner: mode: disagg backend: vllm optimization_target: latency ``` ### SLA-Based Scaling (advanced) For precise SLA targeting with pre-deployment profiling, set `optimization_target: sla`: ```yaml features: planner: optimization_target: sla enable_throughput_scaling: true enable_load_scaling: true ttft_ms: 500.0 itl_ms: 50.0 pre_deployment_sweeping_mode: rapid ``` The fastest path to SLA-based scaling is through a DynamoGraphDeploymentRequest, which automatically profiles your model. See [Planner Examples](/dynamo/components/planner/planner-examples) for copyable DGDR manifests. See [Planner Guide](/dynamo/components/planner/planner-guide) for the full workflow. ## Current Limitations ### Load-based scaling Load-based scaling has the following known limitations. Throughput-based scaling is not affected by any of these. **Requires ForwardPassMetrics (FPM).** Load-based scaling uses per-engine per-iteration metrics delivered via the Dynamo event plane (ForwardPassMetrics). The KV Router is **not** required for load-based scaling. FPM availability by backend: - **vLLM** — supported. Automatically enabled when the engine uses `InstrumentedScheduler` and `DYN_FORWARDPASS_METRIC_PORT` is set. - **TensorRT-LLM** — supported for non-attention-DP workers (`attention_dp_size == 1`); gated off when `attention_dp_size > 1` pending per-rank FPM emission. - **SGLang** — pipeline wired in Dynamo, but the upstream SGLang FPM module is not included in the current 1.2.0 SGLang runtime image. See the [SGLang FPM section](/dynamo/backends/sg-lang/observability#forward-pass-metrics-fpm) for the runtime-image prerequisite. 
### General

**In-flight requests during scale-down.** When the Planner scales down a worker, the worker is terminated without waiting for in-flight requests to complete. Requests that were mid-prefill on the terminated worker will fail. In disaggregated deployments, this can also affect decode workers that were waiting on KV cache transfers from the terminated prefill worker.

**Workaround:** Set `--min-endpoint` to a value that avoids scaling below your steady-state traffic floor, and use a lower `--loadbased-scaling-down-sensitivity` value to reduce the frequency of scale-down events.

## Documentation

| Document | Description |
|----------|-------------|
| [Planner Guide](/dynamo/components/planner/planner-guide) | Deployment, configuration, integration |
| [Planner Design](/dynamo/design-docs/component-design/planner-design) | Architecture and algorithm internals |
| [Planner Examples](/dynamo/components/planner/planner-examples) | DGDR YAML examples, sample configurations, advanced patterns |
| [Global Planner Guide](/dynamo/components/planner/global-planner-guide) | Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments |

## Configuration Reference

### Key Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| **Common** | | |
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`sglang`, `trtllm`, `vllm`) |
| `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) |
| `--optimization-target` | `throughput` | Scaling target: `throughput` (queue/util thresholds), `latency` (aggressive low-latency), `sla` (regression-based SLA targeting) |
| `--environment` | `kubernetes` | Deployment environment |
| `ttft_ms` | `500.0` | Target Time To First Token (ms) |
| `itl_ms` | `50.0` | Target Inter-Token Latency (ms) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `advisory` | `false` | Suggestion-only mode. The Planner computes and reports recommended replica counts, but does not execute scaling actions or change the deployment. |
| **Throughput-based scaling** | | |
| `--enable-throughput-scaling` | `true` | Enable throughput-based scaling |
| `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| **Load-based scaling** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--loadbased-adjustment-interval` | `5` | Seconds between FPM regression updates and load-based scaling decisions |
| `--max-num-fpm-samples` | `64` | Maximum retained FPM observations for regression |
| `--fpm-sample-bucket-size` | `16` | Number of buckets for observation retirement (must be perfect square) |
| `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0=never, 100=aggressive) |
| `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
| `--loadbased-min-observations` | `5` | Minimum observations before regression activates |

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |

## Monitoring

### Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/grafana-planner-dashboard-configmap.yaml
```

The dashboard shows:

- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- FPM regression model status

### Prometheus Metrics

When `PLANNER_PROMETHEUS_PORT` is set, the planner serves its own metrics endpoint. Exported series use the `dynamo_planner_*` naming convention (underscores and standard unit suffixes), replacing older `planner:*`-style names.

**Throughput-based scaling** pulls traffic metrics from the cluster-wide Prometheus server:

- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths

Planner can read these traffic signals from either the public `Frontend` or a pool-local `LocalRouter`. Use `throughput_metrics_source: "frontend"` for a single-DGD deployment. Use `throughput_metrics_source: "router"` for GlobalPlanner / multi-pool deployments so each pool Planner reads its own router traffic instead of the shared public endpoint.
| Planner input | Frontend source | Router source | |---|---|---| | Request count | `dynamo_frontend_requests_total` | `dynamo_component_router_requests_total` | | TTFT | `dynamo_frontend_time_to_first_token_seconds` | `dynamo_component_router_time_to_first_token_seconds` | | ITL | `dynamo_frontend_inter_token_latency_seconds` | `dynamo_component_router_inter_token_latency_seconds` | | Request duration | `dynamo_frontend_request_duration_seconds` | `dynamo_component_request_duration_seconds` until router-specific duration metrics are available | | Input sequence length / ISL | `dynamo_frontend_input_sequence_tokens` | `dynamo_component_router_input_sequence_tokens` | | Output sequence length / OSL | `dynamo_frontend_output_sequence_tokens` | `dynamo_component_router_output_sequence_tokens` | | KV hit rate | Not available from frontend source | `dynamo_component_router_kv_hit_rate` | The throughput planner uses request count, ISL, OSL, and optional KV hit rate as the core traffic forecast inputs. TTFT, ITL, and request duration are also scraped and exported as observed diagnostics. **Load-based scaling** uses ForwardPassMetrics (FPM) from the Dynamo event plane: - Per-iteration wall time, scheduled prefill/decode tokens, and queued request status - Delivered via `FpmEventSubscriber` with automatic engine discovery and lifecycle tracking - No router `/metrics` scraping required FPM observes engine-side scheduled and queued work. It does not include requests still queued in the `LocalRouter` before engine assignment. Core gauges on the planner port include replica counts (`dynamo_planner_num_prefill_replicas`, `dynamo_planner_num_decode_replicas`), observed traffic (`dynamo_planner_observed_*`), replica recommendations (`dynamo_planner_predicted_num_prefill_replicas`, `dynamo_planner_predicted_num_decode_replicas`), and cumulative `dynamo_planner_gpu_hours`. 
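
To inspect these gauges outside Grafana, the text that the planner's metrics endpoint returns can be parsed directly. The sketch below assumes the endpoint emits the standard Prometheus text exposition format; the function name `parse_planner_gauges` is hypothetical and the sample values are made up for illustration.

```python
import re
from typing import Dict

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r"^(?P<series>[A-Za-z_:][A-Za-z0-9_:]*(?:\{.*\})?)\s+(?P<value>\S+)"
)


def parse_planner_gauges(exposition_text: str) -> Dict[str, float]:
    """Collect dynamo_planner_* series from Prometheus exposition text."""
    gauges: Dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group("series").startswith("dynamo_planner_"):
            gauges[match.group("series")] = float(match.group("value"))
    return gauges


# Illustrative payload only; real values come from your planner.
sample = """\
# TYPE dynamo_planner_num_prefill_replicas gauge
dynamo_planner_num_prefill_replicas 2.0
dynamo_planner_num_decode_replicas 3.0
dynamo_planner_predicted_num_decode_replicas 4.0
process_cpu_seconds_total 12.5
"""
gauges = parse_planner_gauges(sample)
```

Comparing `dynamo_planner_num_decode_replicas` with `dynamo_planner_predicted_num_decode_replicas` shows whether the planner currently recommends a change.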
Throughput prediction gauges `dynamo_planner_predicted_requests_per_second`, `dynamo_planner_predicted_input_sequence_tokens`, and `dynamo_planner_predicted_output_sequence_tokens` are wired from throughput-scaling traffic prediction and exposed alongside observed sequence-length metrics. ### Advisory mode Set `advisory: true` to run the local Planner in suggestion-only mode. This is recommended when you are evaluating a new Planner configuration, validating SLA targets, or reviewing how the Planner would react to production traffic before allowing it to scale workers. In advisory mode, the Planner still observes traffic and FPM data, computes recommended prefill and decode replica counts, logs recommendation summaries, exports predicted replica metrics, and includes recommendations in diagnostics reports. The recommendations are not applied as scaling decisions: the Planner does not execute scaling actions, send replica changes to Kubernetes or GlobalPlanner, or mutate the deployment. #### Diagnostics metrics Additional series support dashboards and offline analysis: - **Regression-based latency estimates:** `dynamo_planner_estimated_ttft_ms` and `dynamo_planner_estimated_itl_ms` reflect the maximum estimated TTFT and ITL from the online regression across engines. - **Engine capacity:** `dynamo_planner_engine_prefill_requests_per_second` and `dynamo_planner_engine_decode_requests_per_second` report single-engine prefill and decode capacity under the configured SLA. - **Scaling decision reasons:** `dynamo_planner_load_scaling_decision` and `dynamo_planner_throughput_scaling_decision` are Enum gauges whose state labels encode why each mode chose to scale, hold, or skip (for example `scale_up`, `no_fpm_data`, `set_lower_bound`). 
- **Per-engine FPM queue depths:** `dynamo_planner_engine_queued_prefill_tokens`, `dynamo_planner_engine_queued_decode_kv_tokens`, and `dynamo_planner_engine_inflight_decode_kv_tokens` are labeled with `worker_id` and `dp_rank` for each engine.

### HTML diagnostics reports

The planner can emit periodic, self-contained HTML diagnostics files with interactive Plotly charts. Configure this in `PlannerConfig` (or the equivalent YAML / constructor wiring your deployment uses):

- `report_interval_hours`: interval in **simulated** time between reports (default `24.0` hours); set to `None` to disable.
- `report_output_dir`: directory where HTML files are written (default `./planner_reports`).
- `live_dashboard_port`: port for a real-time HTTP dashboard (default `8080`). Set to `0` to disable. An aiohttp server starts on the given port and serves the current accumulated snapshot data as an interactive Plotly report at `http://<host>:<port>/`. Unlike periodic reports, the live dashboard does **not** clear snapshots: it always shows all data accumulated since the last periodic report (or since startup if periodic reports are disabled).

Reports aggregate per-tick snapshots and use `TickInput.now_s` for timestamps, so they behave the same in live runs (wall clock) and in **replay** with a simulated clock. Typical charts cover worker counts, recommended replica counts, observed versus estimated latencies versus SLA targets, request rate, engine capacity, scaling decision timelines, and input/output sequence lengths.

In the Replica Counts plot, actual replicas are shown as lines and the Planner's recommended prefill and decode replica counts are shown as discrete markers at the ticks where recommendations were produced. This is especially useful with `advisory: true` because those recommendations are suggestions only.

# Planner Guide

The Dynamo Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.

For a quick overview, see the [Planner overview](/dynamo/components/planner). For architecture internals, see [Planner Design](/dynamo/design-docs/component-design/planner-design).

## Scaling Modes

The planner supports four optimization targets that determine how scaling decisions are made:

- **`throughput`** (default): Uses static thresholds on queue depth and KV cache utilization. No SLA targets or profiling needed. Works out of the box.
- **`latency`**: Same approach as `throughput` but with more aggressive thresholds: it scales up earlier and tolerates less queuing. Ideal for latency-sensitive workloads.
- **`load`**: Uses user-defined prefill queue token thresholds and decode KV utilization thresholds for reactive load-based scaling.
- **`sla`**: Uses regression-based performance models with specific TTFT/ITL targets. Supports both throughput-based (predictive) and load-based (reactive) scaling modes. For advanced users who need precise SLA control.

**When to use which:**

- Start with **`throughput`** (the default); it works immediately with no configuration.
- Switch to **`latency`** if your workload has strict latency requirements and you prefer to over-provision rather than queue.
- Use **`load`** when you want direct control through prefill queue and decode KV utilization thresholds.
- Use **`sla`** when you have pre-deployment profiling data and want to target specific TTFT/ITL values.

## PlannerConfig Reference

The planner is configured via a `PlannerConfig` JSON/YAML object.
When using the profiler, this is placed under the `features.planner` section of the DGDR spec:

```yaml
features:
  planner:
    mode: disagg
    backend: vllm
    # optimization_target defaults to "throughput" and works out of the box
```

For SLA-based scaling:

```yaml
features:
  planner:
    optimization_target: sla
    enable_throughput_scaling: true
    enable_load_scaling: false
    pre_deployment_sweeping_mode: rapid
    mode: disagg
    backend: vllm
```

To evaluate Planner behavior without changing replica counts, turn on advisory mode:

```yaml
features:
  planner:
    advisory: true
```

Advisory mode is suggestion-only. The Planner computes recommended replica counts, logs them, exports them as diagnostics, and shows them in HTML reports. The recommendations are not applied as scaling decisions: the Planner does not execute scaling actions or change the deployment.

### Optimization Target

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `optimization_target` | string | `throughput` | `throughput`: scale based on queue/utilization thresholds. `latency`: aggressive low-latency thresholds. `load`: user-defined prefill queue and decode KV utilization thresholds. `sla`: regression-based scaling with ttft_ms/itl_ms targets. |

When `optimization_target` is `throughput`, `latency`, or `load`, load-based scaling is automatically enabled and throughput-based scaling is disabled. The `ttft_ms`/`itl_ms` fields are ignored.

### Scaling Mode Fields (SLA mode)

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment performance data). Only used when `optimization_target: sla`. |
| `enable_load_scaling` | bool | `false` | Enable load-based scaling. Only used when `optimization_target: sla`. |

At least one scaling mode must be enabled when using `optimization_target: sla`.
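
The interaction between `optimization_target` and the two `enable_*` flags can be summarized in a few lines. This is a sketch of the rules as documented above, not the planner's actual validation code; the function name `resolve_scaling_modes` is hypothetical.

```python
from typing import Any, Dict, Tuple

VALID_TARGETS = {"throughput", "latency", "load", "sla"}


def resolve_scaling_modes(planner_cfg: Dict[str, Any]) -> Tuple[bool, bool]:
    """Return (throughput_scaling, load_scaling) after applying the documented rules."""
    target = planner_cfg.get("optimization_target", "throughput")
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown optimization_target: {target!r}")
    if target != "sla":
        # throughput / latency / load: load-based scaling is forced on,
        # throughput-based scaling is off, and the enable_* flags are not used.
        return False, True
    throughput = bool(planner_cfg.get("enable_throughput_scaling", True))
    load = bool(planner_cfg.get("enable_load_scaling", False))
    if not (throughput or load):
        raise ValueError("optimization_target: sla needs at least one scaling mode enabled")
    return throughput, load
```

For example, an empty config resolves to load-based scaling only, while `{"optimization_target": "sla"}` resolves to throughput-based scaling only because of the field defaults.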
### Pre-Deployment Sweeping

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine performance data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |

When throughput-based scaling is enabled, the planner needs engine performance data. At startup, it first tries to fetch self-benchmark results from the `get_perf_metrics` Dynamo endpoint (see PR #7779). If unavailable, it falls back to profiler-generated data (npz or JSON) at `profile_results_dir`. Both sources are converted to ForwardPassMetrics and fed into the FPM regression model.

### Throughput-Based Scaling Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `throughput_adjustment_interval_seconds` | int | `180` | Seconds between throughput-based scaling decisions. |
| `throughput_metrics_source` | string | `frontend` | Prometheus traffic source for throughput scaling: `frontend` reads `dynamo_frontend_*` metrics from the public Frontend; `router` reads `dynamo_component_router_*` metrics from a LocalRouter. Use `router` for pool-local Planner in GlobalPlanner deployments. |
| `min_endpoint` | int | `1` | Minimum number of engine endpoints to maintain. |
| `max_gpu_budget` | int | `8` | Maximum total GPUs the planner may allocate. |
| `ttft_ms` | float | `500.0` | TTFT SLA target (ms) for scaling decisions. |
| `itl_ms` | float | `50.0` | ITL SLA target (ms) for scaling decisions. |

### Load-Based Scaling Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `load_adjustment_interval_seconds` | int | `5` | Seconds between FPM regression updates and load-based scaling decisions. Even when only throughput scaling is enabled, live FPM observations are fed into the regression at this interval. Must be shorter than `throughput_adjustment_interval_seconds`. |
| `max_num_fpm_samples` | int | `64` | Maximum retained FPM observations for regression. |
| `fpm_sample_bucket_size` | int | `16` | Number of buckets for observation retirement (must be a perfect square). |
| `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
| `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
| `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |

### General Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mode` | string | `disagg` | Planner mode: `disagg`, `prefill`, `decode`, or `agg`. |
| `backend` | string | `vllm` | Backend: `vllm`, `sglang`, `trtllm`, or `mocker`. |
| `environment` | string | `kubernetes` | Runtime environment: `kubernetes`, `virtual`, or `global-planner`. |
| `namespace` | string | env `DYN_NAMESPACE` | Dynamo logical namespace for the deployment. |
| `advisory` | bool | `false` | Suggestion-only mode. Compute, log, export, and report recommended replica counts without executing scaling actions or changing the deployment. |

### Traffic Prediction Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `load_predictor` | string | `arima` | Prediction method: `constant`, `arima`, `kalman`, or `prophet`. |
| `load_predictor_log1p` | bool | `false` | Apply log1p transform to load data before prediction. |
| `prophet_window_size` | int | `50` | Window size (seconds) for Prophet predictor. |
| `load_predictor_warmup_trace` | string | `null` | Path to a warmup trace file for bootstrapping predictions. |

### Kalman Filter Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `kalman_q_level` | float | `1.0` | Process noise for level component. |
| `kalman_q_trend` | float | `0.1` | Process noise for trend component. |
| `kalman_r` | float | `10.0` | Measurement noise. |
| `kalman_min_points` | int | `5` | Minimum data points before Kalman predictions activate. |

### Diagnostics Reports

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `report_interval_hours` | float or `null` | `24.0` | Generate an HTML diagnostics report every N hours (simulated time). Set to `null` to disable periodic report generation. |
| `report_output_dir` | string | `./planner_reports` | Directory for HTML diagnostics reports. |
| `live_dashboard_port` | int | `8080` | Port for the live diagnostics dashboard HTTP server. Set to `0` to disable. When enabled, visit `http://host:port/` to view a real-time Plotly report of accumulated snapshots. |

The same diagnostic signals surfaced in these reports are also exported as Prometheus metrics under the `dynamo_planner_*` prefix, for example estimated TTFT/ITL (`dynamo_planner_estimated_ttft_ms`, `dynamo_planner_estimated_itl_ms`), recommended replica counts (`dynamo_planner_predicted_num_prefill_replicas`, `dynamo_planner_predicted_num_decode_replicas`), per-engine capacity and FPM queue depths, and load/throughput scaling decision enums.

The Replica Counts plot overlays actual prefill/decode replicas with discrete recommendation markers for the Planner's recommended prefill/decode replicas. When `advisory: true`, these recommended counts are suggestions only; the Planner records what it would do without applying the change.

## Integration with Profiler

When the profiler runs with planner enabled, it:

1. Selects the best prefill and decode engine configurations
2. Generates engine performance data (prefill TTFT vs ISL, decode ITL vs KV-cache utilization)
3. Saves the `PlannerConfig` and performance data into separate Kubernetes ConfigMaps
4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps

The planner receives its config via `--config /path/to/planner_config.json`, which is mounted from the `planner-config-XXXX` ConfigMap.
Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap. See the [Profiler Guide](/dynamo/components/profiler/profiler-guide) for the full profiling workflow and how to configure pre-deployment sweeping. ## Hierarchical Deployments If you want one public endpoint for a model but multiple private DGDs optimized for different request classes, use a hierarchical deployment: - one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner` - one or more prefill pool DGDs - one or more decode pool DGDs In the current workflow, run profiling independently for each intended pool, then compose the final control DGD plus pool DGDs manually. See the [Global Planner Guide](/dynamo/components/planner/global-planner-guide). ## See Also - [Planner overview](/dynamo/components/planner) — Why LLM inference needs a different autoscaler - [Planner Design](/dynamo/design-docs/component-design/planner-design) — Architecture and algorithm internals - [Planner Examples](/dynamo/components/planner/planner-examples) — DGDR YAML examples, sample configurations, advanced patterns - [Global Planner Guide](/dynamo/components/planner/global-planner-guide) — Multi-DGD coordination, shared GPU budgets, single-endpoint multi-pool deployments - [Profiler Guide](/dynamo/components/profiler/profiler-guide) — How profiling data is generated # Global Planner Deployment Guide This guide explains how to deploy `GlobalPlanner` and when to use it. `GlobalPlanner` is the centralized scaling execution layer for deployments where multiple DGDs should delegate scaling through one component, whether those DGDs expose separate endpoints or sit behind one shared endpoint. > **New to Planner?** We recommend starting with a single-DGD deployment using either throughput-based or load-based scaling before adopting GlobalPlanner. See the [Planner overview](/dynamo/components/planner) and [Planner Guide](/dynamo/components/planner/planner-guide) to get started. ## Why Global Planner? 
Without `GlobalPlanner`, each DGD's local planner scales only its own deployment directly. That is fine for isolated deployments, but it becomes awkward when you want one place to: - apply centralized scaling policy across multiple DGDs - enforce shared constraints such as authorization or total GPU budget - coordinate scaling for a single-endpoint, multi-pool deployment `GlobalPlanner` solves that by becoming the common scale-execution endpoint for multiple local planners. ## Terminology - **Planner**: The `dynamo.planner` component that computes desired replica counts to maintain latency SLAs. See the [Planner overview](/dynamo/components/planner). - **Local Planner**: A pool-local instance of the Planner running inside a single DGD. - **Global Planner**: The centralized execution and policy layer that receives scale requests from local planners. - **Single-endpoint multi-pool deployment**: One model endpoint backed by multiple DGDs for the same model. This pattern uses both `GlobalRouter` and `GlobalPlanner`. ## Deployment Patterns Use `GlobalPlanner` in one of these two patterns: | Pattern | Use when | Needs `GlobalRouter` | Public endpoint shape | |---------|----------|----------------------|-----------------------| | Multiple model endpoints or independent DGDs | Separate DGDs should share centralized scaling policy, such as authorization or total GPU budget | No | One endpoint per DGD, or however each DGD is exposed | | One model endpoint, multiple DGDs | One model should be reachable through one public endpoint, but different request classes should land on different DGDs | Yes | One shared endpoint | ## Pattern 1: Multiple Model Endpoints Or Independent DGDs Use this pattern when you have multiple DGDs, often for different models, and you want them to share centralized scaling policy without collapsing them into one endpoint. 
Typical examples: - DGD A: `qwen-0.6b` disaggregated deployment with its own local planner - DGD B: `qwen-32b` disaggregated deployment with its own local planner - one shared `GlobalPlanner` that all local planners delegate to In this pattern: - each DGD keeps its own normal local planner - each local planner is configured with `environment: "global-planner"` - all those planners point at the same `global_planner_namespace` - each DGD keeps its own endpoint or frontend as needed - you do **not** need `GlobalRouter` This is the pattern to use when the goal is centralized scaling control across multiple deployments or models. ## Pattern 2: One Model Endpoint, Multiple DGDs Use this pattern when all of the following are true: - You want one public endpoint for a single model. - You want different private pools for different request classes, such as short ISL vs. long ISL requests, or different latency targets. - You want each pool to autoscale independently. - You want routing and scale execution to be centralized instead of exposing multiple endpoints to clients. Typical examples: - short-input requests are cheaper on a smaller prefill pool - long-input requests need a larger prefill pool - decode capacity should scale independently from prefill capacity If you only need one pool for one model, use a single Local Planner and DGD/DGDR instead. 
## What You Deploy

In the current implementation, the single-endpoint pattern is composed from multiple resources:

| Resource | Purpose | Typical contents |
|----------|---------|------------------|
| Control DGD | Public entrypoint and centralized control plane | `Frontend`, `GlobalRouter`, `GlobalPlanner` |
| Prefill pool DGD(s) | Private prefill capacity pools | `LocalRouter`, prefill worker(s), `Planner` |
| Decode pool DGD(s) | Private decode capacity pools | `LocalRouter`, decode worker(s), `Planner` |
| Optional DGDR(s) | Generate or validate one optimized pool shape at a time | Model, workload, SLA, hardware inputs |

> **Current workflow**
>
> A single DGDR does **not** generate the full single-endpoint multi-pool topology today. Instead, run one DGDR or profiling job per intended pool, then compose the final control DGD plus pool DGDs manually.

## Architecture

```mermaid
flowchart LR
    client["Client"]
    prometheus["Prometheus<br/>per-pool router metrics"]
    operator["Kubernetes operator<br/>DGD replica updates"]

    subgraph control["Control DGD"]
        frontend["Frontend<br/>single public endpoint"]
        global_router["GlobalRouter<br/>selects a pool"]
        global_planner["GlobalPlanner<br/>policy, budget, scale execution"]
    end

    subgraph prefill0["Prefill pool DGD: short prompts"]
        prefill_router0["LocalRouter"] --> prefill_workers0["Prefill workers"]
        prefill_planner0["Pool Planner"]
    end

    subgraph prefill1["Prefill pool DGD: long prompts"]
        prefill_router1["LocalRouter"] --> prefill_workers1["Prefill workers"]
        prefill_planner1["Pool Planner"]
    end

    subgraph decode0["Decode pool DGD"]
        decode_router0["LocalRouter"] --> decode_workers0["Decode workers"]
        decode_planner0["Pool Planner"]
    end

    prometheus -.-> prefill_planner0
    prometheus -.-> prefill_planner1
    prometheus -.-> decode_planner0

    client --> frontend
    frontend --> global_router
    global_router --> prefill_router0
    global_router --> prefill_router1
    global_router --> decode_router0

    prefill_planner0 -- scale request --> global_planner
    prefill_planner1 -- scale request --> global_planner
    decode_planner0 -- scale request --> global_planner

    global_planner --> operator
    operator --> prefill_workers0
    operator --> prefill_workers1
    operator --> decode_workers0
```

Read the diagram left to right for request traffic: clients call the control `Frontend`, `GlobalRouter` selects a private pool, and each pool's `LocalRouter` sends the request to workers. Read the dotted and lower paths for scaling: each pool-local `Planner` reads that pool's router metrics, sends a scale request to `GlobalPlanner`, and `GlobalPlanner` applies the Kubernetes replica changes centrally.

## Prerequisites

- Dynamo Kubernetes Platform installed. See [Kubernetes Quickstart](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart).
- Prometheus deployed and scraping router metrics. The global planner examples assume cluster Prometheus is available.
- Backend images available for your chosen framework (`vllm`, `sglang`, or `trtllm`).
- Secrets for model access, such as a Hugging Face token secret.
- A storage strategy for model weights if your workers should share a model cache PVC.

For throughput-based scaling, you also need profiling data for each pool. See [Profiler Guide](/dynamo/components/profiler/profiler-guide).

## Inputs You Need To Decide Up Front

Before writing manifests, decide the following:

| Input | Why it matters | Example |
|-------|----------------|---------|
| Model name | All pools in one hierarchy serve the same model | `meta-llama/Llama-3.3-70B-Instruct` |
| Backend | Worker args and profiling flow depend on it | `vllm` |
| Pool inventory | Number of specialized prefill and decode pools | 2 prefill pools, 1 decode pool |
| Workload classes | Determines how many pool profiles you generate | short ISL, long ISL, long context decode |
| SLA targets | Guides profiling and routing decisions | `ttft: 200 ms`, `itl: 20 ms` |
| Worker shape | Tensor parallelism, GPUs per worker, and memory footprint | TP1 prefill vs. TP2 prefill |
| Routing policy | Maps requests to pools at runtime | low-ISL requests -> pool 0 |
| Optional global budget | Caps total GPUs across managed pools | `--max-total-gpus 16` |

## Step 1: Profile Each Intended Pool Independently

Start by deciding what each pool should specialize in. Common examples:

- Prefill pool 0: lower-cost pool for short prompts.
- Prefill pool 1: larger pool for long prompts.
- Decode pool 0: standard decode pool for most requests.

For each intended pool, run a separate DGDR or profiling job with the workload and SLA that represent that pool.

Example DGDR skeleton:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-prefill-short
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  image: nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0  # dynamo-frontend for Dynamo < 1.1.0
  workload:
    isl: 2048
    osl: 256
  sla:
    ttft: 200.0
    itl: 20.0
  searchStrategy: rapid
  autoApply: false
```

Repeat this once per planned pool, changing the workload and SLA inputs for each request class.
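
Repeating the skeleton by hand invites copy-paste drift, so one option is to generate a manifest per pool from a small table. The sketch below is illustrative only: `make_dgdr` is a hypothetical helper, the field layout simply mirrors the skeleton above, and the second pool (`llama-prefill-long` and its numbers) is a made-up example. It emits JSON, which `kubectl` accepts as well as YAML.

```python
import json
from typing import Any, Dict, List


def make_dgdr(name: str, isl: int, osl: int, ttft: float, itl: float,
              model: str = "meta-llama/Llama-3.3-70B-Instruct",
              backend: str = "vllm",
              image: str = "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0") -> Dict[str, Any]:
    """Build one DGDR manifest with the same fields as the skeleton."""
    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "image": image,
            "workload": {"isl": isl, "osl": osl},
            "sla": {"ttft": ttft, "itl": itl},
            "searchStrategy": "rapid",
            "autoApply": False,
        },
    }


# One entry per intended pool; only the first row comes from the skeleton,
# the second is a hypothetical long-prompt request class.
POOLS: List[Dict[str, Any]] = [
    {"name": "llama-prefill-short", "isl": 2048, "osl": 256, "ttft": 200.0, "itl": 20.0},
    {"name": "llama-prefill-long", "isl": 16000, "osl": 256, "ttft": 1000.0, "itl": 20.0},
]

manifests = [make_dgdr(**pool) for pool in POOLS]
rendered = {m["metadata"]["name"]: json.dumps(m, indent=2) for m in manifests}
```

Each value in `rendered` can be written to its own file and applied separately, which keeps one profiling request per pool.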
What to keep from each profiling result:

- Worker shape (`tensor-parallel-size`, GPUs per worker, memory/caching settings).
- Planner profile data directory or generated ConfigMaps.
- Planner settings such as `prefill_engine_num_gpu` or `decode_engine_num_gpu`.
- Any backend-specific flags that differ across pools.

See [Planner Examples](/dynamo/components/planner/planner-examples) and [Profiler Guide](/dynamo/components/profiler/profiler-guide) for DGDR details.

## Step 2: Create The Control DGD

Deploy one control DGD that contains:

- `Frontend`: the single public model endpoint.
- `GlobalRouter`: chooses which pool receives each request.
- `GlobalPlanner`: receives scale requests from pool planners and applies replica changes.

The vLLM example topology is in [examples/global_planner/global-planner-vllm-test.yaml](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/global_planner/global-planner-vllm-test.yaml).

The `GlobalPlanner` section is minimal:

```yaml
GlobalPlanner:
  componentType: default
  replicas: 1
  extraPodSpec:
    mainContainer:
      image: ${DYNAMO_IMAGE}
      command:
        - python3
        - -m
        - dynamo.global_planner
      args:
        - --managed-namespaces
        - ${K8S_NAMESPACE}-gp-prefill-0
        - ${K8S_NAMESPACE}-gp-prefill-1
        - ${K8S_NAMESPACE}-gp-decode-0
```

The values passed to `--managed-namespaces` are the pool planners' **Dynamo namespaces** (`caller_namespace`), not raw Kubernetes namespaces. In many examples they share the same string prefix, but they are logically different identifiers.

**Management modes**: When `--managed-namespaces` is set (explicit mode), only the listed Dynamo namespaces are authorized to send scale requests, and only their corresponding DGDs count toward the GPU budget. DGD names are derived from the Dynamo namespace using the operator convention `DYN_NAMESPACE = {k8s_namespace}-{dgd_name}`. When omitted (implicit mode), any caller is accepted and all DGDs in the Kubernetes namespace count toward the GPU budget.
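
The two management modes and the naming convention can be illustrated with a short sketch. It is not the `GlobalPlanner` implementation; the helper names `dgd_name_from_dynamo_namespace` and `is_authorized` are hypothetical, and the only rules encoded are the ones stated in the paragraph above.

```python
from typing import Iterable, Optional


def dgd_name_from_dynamo_namespace(dyn_namespace: str, k8s_namespace: str) -> str:
    """Invert the convention DYN_NAMESPACE = {k8s_namespace}-{dgd_name}."""
    prefix = f"{k8s_namespace}-"
    if not dyn_namespace.startswith(prefix) or dyn_namespace == prefix:
        raise ValueError(
            f"{dyn_namespace!r} does not follow '{k8s_namespace}-<dgd_name>'"
        )
    return dyn_namespace[len(prefix):]


def is_authorized(caller_namespace: str,
                  managed_namespaces: Optional[Iterable[str]]) -> bool:
    """Explicit mode checks the list; implicit mode (None) accepts any caller."""
    if managed_namespaces is None:
        return True
    return caller_namespace in set(managed_namespaces)


managed = ["my-llama-gp-prefill-0", "my-llama-gp-prefill-1", "my-llama-gp-decode-0"]
```

Because Kubernetes namespaces may themselves contain hyphens, the sketch takes the Kubernetes namespace as an input instead of trying to split the Dynamo namespace on `-`.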
If you want the central executor to reject scale requests that exceed a total GPU budget, add `--max-total-gpus`. See [examples/global_planner/global-planner-gpu-budget.yaml](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/global_planner/global-planner-gpu-budget.yaml).

## Step 3: Create One DGD Per Pool

Each private pool gets its own DGD. A pool DGD usually contains:

- `LocalRouter`
- one worker type (`prefill` or `decode`)
- one `Planner`

The planner inside each pool must be configured for `global-planner` mode so it delegates scaling to the control stack:

```json
{
  "environment": "global-planner",
  "global_planner_namespace": "${K8S_NAMESPACE}-gp-ctrl",
  "backend": "vllm",
  "mode": "prefill",
  "enable_load_scaling": false,
  "enable_throughput_scaling": true,
  "throughput_metrics_source": "router",
  "ttft": 2000,
  "prefill_engine_num_gpu": 2,
  "model_name": "${MODEL_NAME}",
  "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D"
}
```

`global_planner_namespace` must point to the control stack's **Dynamo namespace**. In the reference manifests, that is the namespace string passed to the control `Frontend` and `GlobalRouter`.

`throughput_metrics_source: "router"` is required for pool-local Planner in GlobalPlanner deployments. The pool Planner should forecast demand from its own `LocalRouter` `dynamo_component_router_*` Prometheus metrics, not from the shared public `Frontend`. See the [Planner overview](/dynamo/components/planner#prometheus-metrics) for the exact frontend and router metric sources.

Use:

- `mode: "prefill"` for prefill-only pools
- `mode: "decode"` for decode-only pools

The worker and planner settings for each pool come from the pool-specific profiling result you created in Step 1.
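
A quick lint over each pool's planner config can catch the most common wiring mistakes before deployment. The following is a sketch under the requirements stated in this step only; `check_pool_planner_config` is a hypothetical helper, not a Dynamo API.

```python
from typing import Any, Dict, List


def check_pool_planner_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the basics look right."""
    problems: List[str] = []
    if cfg.get("environment") != "global-planner":
        problems.append('environment must be "global-planner"')
    if not cfg.get("global_planner_namespace"):
        problems.append("global_planner_namespace must name the control stack's Dynamo namespace")
    if cfg.get("mode") not in ("prefill", "decode"):
        problems.append('mode must be "prefill" or "decode" for a pool DGD')
    # The documented default is "frontend", so a missing key is also a problem here.
    if cfg.get("throughput_metrics_source", "frontend") != "router":
        problems.append('throughput_metrics_source must be "router"')
    return problems


good = {
    "environment": "global-planner",
    "global_planner_namespace": "my-llama-gp-ctrl",
    "mode": "prefill",
    "throughput_metrics_source": "router",
}
bad = {"environment": "kubernetes", "mode": "disagg"}
```

Running the check on every pool config in CI keeps the pool planners consistent with the control DGD they delegate to.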
In the reference vLLM example: - `gp-prefill-0` uses a 1-GPU TP1 prefill worker - `gp-prefill-1` uses a 2-GPU TP2 prefill worker - `gp-decode-0` uses a 1-GPU TP1 decode worker See [global-planner-vllm-test.yaml](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/global_planner/global-planner-vllm-test.yaml). ## Step 4: Configure GlobalRouter To Select Pools `GlobalRouter` reads a JSON config that lists the pool namespaces and a routing grid for each request type. Example: ```json { "enable_priority_retry": true, "num_prefill_pools": 2, "num_decode_pools": 1, "prefill_pool_dynamo_namespaces": [ "${K8S_NAMESPACE}-gp-prefill-0", "${K8S_NAMESPACE}-gp-prefill-1" ], "decode_pool_dynamo_namespaces": [ "${K8S_NAMESPACE}-gp-decode-0" ], "prefill_pool_priorities": [0, 1], "decode_pool_priorities": [0], "prefill_pool_selection_strategy": { "ttft_min": 10, "ttft_max": 3000, "ttft_resolution": 2, "isl_min": 0, "isl_max": 32000, "isl_resolution": 2, "prefill_pool_mapping": [[0, 1], [0, 1]] }, "decode_pool_selection_strategy": { "itl_min": 10, "itl_max": 500, "itl_resolution": 2, "context_length_min": 0, "context_length_max": 32000, "context_length_resolution": 2, "decode_pool_mapping": [[0, 0], [0, 0]] } } ``` The `prefill_pool_dynamo_namespaces` and `decode_pool_dynamo_namespaces` entries are **Dynamo namespaces** that the pool-local routers register under. Important runtime behavior: - Prefill pool selection uses **ISL + TTFT target** - Decode pool selection uses **context length + ITL target** - OSL is useful for **designing and profiling pools**, but it is **not a direct routing key** in the current `GlobalRouter` - Optional priority retry is enabled with `enable_priority_retry`; lower values in `*_pool_priorities` are faster pools, and omitted priority lists default to pool order (`0`, `1`, ...) 
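Because the routing config repeats the pool count in several places, a quick consistency check can catch mismatches before the ConfigMap is created. The sketch below is illustrative and rests on assumptions that are not confirmed by the source: that each `*_pool_mapping` cell is an index into the matching `*_pool_dynamo_namespaces` list, and that a priority list, when present, has one entry per pool. `check_global_router_config` is a hypothetical helper, not part of `GlobalRouter`.

```python
def check_global_router_config(config: dict) -> list[str]:
    """Return consistency problems found in a GlobalRouter-style config dict."""
    problems = []
    for kind in ("prefill", "decode"):
        namespaces = config.get(f"{kind}_pool_dynamo_namespaces", [])
        num_pools = config.get(f"num_{kind}_pools")
        if num_pools != len(namespaces):
            problems.append(
                f"num_{kind}_pools is {num_pools} but {len(namespaces)} namespaces are listed"
            )

        # Assumption: priorities may be omitted, otherwise one entry per pool.
        priorities = config.get(f"{kind}_pool_priorities")
        if priorities is not None and len(priorities) != len(namespaces):
            problems.append(f"{kind}_pool_priorities should have {len(namespaces)} entries")

        strategy = config.get(f"{kind}_pool_selection_strategy", {})
        grid = strategy.get(f"{kind}_pool_mapping", [])
        if len({len(row) for row in grid}) > 1:
            problems.append(f"{kind}_pool_mapping rows have different lengths")

        # Assumption: every cell is a pool index into the namespace list.
        for row_index, row in enumerate(grid):
            for col_index, pool_index in enumerate(row):
                if not 0 <= pool_index < len(namespaces):
                    problems.append(
                        f"{kind}_pool_mapping[{row_index}][{col_index}] = {pool_index} "
                        "is not a valid pool index"
                    )
    return problems
```

Load the JSON with `json.load` and pass the resulting dict to the function; an empty list means none of these checks failed.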
Clients can pass request targets through `extra_args`: ```json { "extra_args": { "ttft_target": 200, "itl_target": 20 } } ``` For more details, see [Global Router README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/components/src/dynamo/global_router/README.md). ## Step 5: Deploy In Order For a fresh cluster, the usual order is: 1. Install Dynamo platform and Prometheus. 2. Create secrets and PVCs needed by workers. 3. Create the `GlobalRouter` ConfigMap. 4. Apply the control DGD. 5. Apply the pool DGDs. 6. Wait for all DGDs to reach ready state. 7. Expose or port-forward the control `Frontend`. Example: ```bash export K8S_NAMESPACE=my-llama export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct export DYNAMO_IMAGE= export DYNAMO_VLLM_IMAGE= export STORAGE_CLASS_NAME= kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} \ -n ${K8S_NAMESPACE} envsubst < examples/global_planner/global-planner-vllm-test.yaml | \ kubectl apply -n ${K8S_NAMESPACE} -f - ``` The single user-facing endpoint is the `Frontend` in the control DGD, not the pool DGDs. ## Step 6: Validate The Stack Validate the deployment from outside in: - Confirm the control `Frontend` is healthy and serving the model endpoint. - Confirm `GlobalRouter` logs show requests being assigned to the expected pool namespaces. - Confirm pool-local planners are producing scale requests. - Confirm `GlobalPlanner` logs show accepted scale operations. - Confirm the target DGDs' replica counts change as expected. If you use Prometheus and Grafana, also inspect: - TTFT and ITL over time - per-pool worker counts - per-pool request mix - total GPU usage ## Recommended Workflow For New Deployments For most teams, the easiest way to build this deployment is: 1. Design your pool classes from expected traffic patterns. 2. Run one DGDR per pool class to generate or validate the pool configuration. 3. Copy the selected worker shape and planner settings into the final pool DGDs. 4. 
Build one control DGD with `Frontend`, `GlobalRouter`, and `GlobalPlanner`. 5. Route all client traffic through the control `Frontend`. This keeps profiling and pool selection simple while still giving you one public endpoint for the model. ## Current Limitations - Single-endpoint `GlobalPlanner` deployments are assembled manually today. One DGDR does not emit the full control DGD plus pool DGDs topology. - `GlobalRouter` routes by ISL/TTFT and context-length/ITL grids, not directly by OSL. - In the single-endpoint pattern, all pools are expected to serve the same model. ## See Also - [Planner README](/dynamo/components/planner) — Planner overview and quick start - [Planner Guide](/dynamo/components/planner/planner-guide) — Planner configuration reference - [Planner Examples](/dynamo/components/planner/planner-examples) — DGDR examples for generating per-pool configs - [Profiler Guide](/dynamo/components/profiler/profiler-guide) — Pre-deployment profiling workflow - [Global Planner README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/components/src/dynamo/global_planner/README.md) — Centralized scale execution - [Global Router README](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/components/src/dynamo/global_router/README.md) — Cross-pool request routing - [vLLM global planner example](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/examples/global_planner/global-planner-vllm-test.yaml) — End-to-end reference manifest # Planner Examples Practical examples for deploying the Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](/dynamo/components/planner/planner-guide). For a quick overview, see the [Planner README](/dynamo/components/planner). ## Basic Examples ### Minimal DGDR with AIC (Fastest) The simplest way to deploy with the Planner. 
Uses AI Configurator for offline profiling (20-30 seconds instead of hours): ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: sla-aic spec: model: Qwen/Qwen3-32B backend: vllm image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` Deploy: ```bash export NAMESPACE=your-namespace # Save the manifest above as sla-aic.yaml first. kubectl apply -f sla-aic.yaml -n $NAMESPACE ``` ### Online Profiling (Real Measurements) Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: sla-online spec: model: meta-llama/Llama-3.3-70B-Instruct backend: vllm image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` Deploy: ```bash # Save the manifest above as sla-online.yaml first. kubectl apply -f sla-online.yaml -n $NAMESPACE ``` > **Note**: Starting with Dynamo 1.0.0 (DGDR API version v1beta1), DGDR fields use structured spec fields (e.g., `spec.workload`, `spec.sla`, `spec.hardware`) instead of the nested `profilingConfig.config` blob used in v1alpha1. ## Kubernetes Examples ### MoE Models (SGLang) For Mixture-of-Experts models like DeepSeek-R1, use SGLang backend: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: sla-moe spec: model: deepseek-ai/DeepSeek-R1 backend: sglang image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` Deploy: ```bash # Save the manifest above as sla-moe.yaml first. 
kubectl apply -f sla-moe.yaml -n $NAMESPACE ``` ### Using Existing DGD Configs (Custom Setups) Reference an existing DynamoGraphDeployment config via ConfigMap: **Step 1: Create ConfigMap from your DGD config:** ```bash kubectl create configmap deepseek-r1-config \ --from-file=disagg.yaml=/path/to/your/disagg.yaml \ --namespace $NAMESPACE \ --dry-run=client -o yaml | kubectl apply -f - ``` **Step 2: Reference it in your DGDR:** ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: deepseek-r1 spec: model: deepseek-ai/DeepSeek-R1 backend: sglang image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration. ### Inline Configuration (Simple Use Cases) For simple use cases without a custom DGD config, provide the configuration directly in the v1beta1 DGDR spec fields. The profiler auto-generates a basic DGD configuration: ```yaml spec: workload: isl: 8000 osl: 200 sla: ttft: 200.0 itl: 10.0 hardware: gpuSku: h200_sxm searchStrategy: rapid ``` ### Simulation with Mocker Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for: - Large-scale experiments without GPU resources - Testing planner behavior and infrastructure - Validating deployment configurations ```yaml spec: model: backend: trtllm # Real backend for profiling features: mocker: enabled: true # Deploy mocker instead of real backend image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing. 
### Model Cache PVC (0.8.1+)

For large models, use a pre-populated PVC instead of downloading from Hugging Face. See [SLA-Driven Profiling](/dynamo/components/profiler/profiler-guide) for configuration details.

## Advanced Examples

### Custom Load Predictors

#### Warm-starting with Trace Data

Pre-load predictors with historical request patterns before live traffic:

```yaml
# In planner arguments
args:
  - --load-predictor arima
  - --load-predictor-warmup-trace /data/trace.jsonl
  - --load-predictor-log1p
```

The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.

#### Kalman Filter Tuning

For workloads with rapid changes, tune the Kalman filter:

```yaml
args:
  - --load-predictor kalman
  - --kalman-q-level 2.0       # Higher = more responsive to level changes
  - --kalman-q-trend 0.5       # Higher = trend changes faster
  - --kalman-r 5.0             # Lower = trusts new measurements more
  - --kalman-min-points 3      # Fewer points before forecasting starts
  - --load-predictor-log1p     # Often helps with request-rate series
```

#### Prophet for Seasonal Workloads

For workloads with daily/weekly patterns:

```yaml
args:
  - --load-predictor prophet
  - --prophet-window-size 100  # Larger window for seasonal detection
  - --load-predictor-log1p
```

### Virtual Connector

For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:

```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient


async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str) -> None:
    """Watch for planner decisions and apply them in your environment."""
    # Initialize client
    client = VirtualConnectorClient(distributed_runtime, namespace)

    # Main loop: watch for planner decisions and execute them
    while True:
        # Block until the planner makes a new scaling decision
        await client.wait()

        # Read the decision
        decision = await client.get()
        print(f"Scale to: prefill={decision.num_prefill_workers}, "
              f"decode={decision.num_decode_workers}, "
              f"id={decision.decision_id}")

        # Execute scaling in your environment.
        # scale_prefill_workers and scale_decode_workers are functions you provide.
        scale_prefill_workers(decision.num_prefill_workers)
        scale_decode_workers(decision.num_decode_workers)

        # Report completion
        await client.complete(decision)
```

See `components/planner/test/test_virtual_connector.py` for a full working example.

### Planner Configuration Passthrough

Pass planner-specific settings through the DGDR:

```yaml
features:
  planner:
    plannerMinEndpoint: 2
```

### Review Before Deploy (autoApply: false)

Disable auto-deployment to inspect the generated DGD:

```yaml
spec:
  autoApply: false
```

After profiling completes:

```bash
# Extract and review generated DGD
kubectl get dgdr sla-aic -n $NAMESPACE \
  -o jsonpath='{.status.profilingResults.selectedConfig}' > my-dgd.yaml

# Review and modify as needed
vi my-dgd.yaml

# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```

### Profiling Artifacts with PVC

Save detailed profiling artifacts (plots, logs, raw data) to a PVC:

```yaml
spec:
  workload:
    isl: 3000
    osl: 150
  sla:
    ttft: 200
    itl: 20
```

Setup:

```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```

Access results:

```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```

## Related Documentation

- [Planner README](/dynamo/components/planner) -- Overview and quick start
- [Planner Guide](/dynamo/components/planner/planner-guide) -- Deployment, configuration, integration
- [Planner Design](/dynamo/design-docs/component-design/planner-design) -- Architecture deep-dive
- [DGDR Configuration Reference](/dynamo/components/profiler/profiler-guide#dgdr-configuration-structure)
- [SLA-Driven Profiling](/dynamo/components/profiler/profiler-guide)

# Profiler

The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations.
It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.

## Feature Matrix

| Feature | SGLang | TensorRT-LLM | vLLM |
|---------|--------|--------------|------|
| Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | ✅ | 🚧 | 🚧 |
| AI Configurator (Offline) | ✅ | ✅ | ✅ |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | ✅ | ❌ | ❌ |

## Quick Start

### Prerequisites

- Dynamo platform installed (see [Installation Guide](/dynamo/kubernetes-deployment/start-here/installation-guide))
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required for SLA planner)

### Using DynamoGraphDeploymentRequest (Recommended)

The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-profiling
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  workload:
    isl: 3000    # Average input sequence length
    osl: 150     # Average output sequence length
  sla:
    ttft: 200.0  # Target Time To First Token (ms)
    itl: 20.0    # Target Inter-Token Latency (ms)
  autoApply: true
```

```bash
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```

### Using AI Configurator (Fast Offline Profiling)

AI Configurator enables rapid offline profiling (~30 seconds) and supports all backends (vLLM, SGLang, TensorRT-LLM). Since `searchStrategy: rapid` is the default, AIC is used automatically unless you explicitly set `searchStrategy: thorough`.

## Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `workload.isl` | 4000 | Average input sequence length (tokens) |
| `workload.osl` | 1000 | Average output sequence length (tokens) |
| `sla.ttft` | 2000 | Target Time To First Token (milliseconds) |
| `sla.itl` | 30 | Target Inter-Token Latency (milliseconds) |
| `hardware.numGpusPerNode` | auto | Number of GPUs per node |
| `hardware.gpuSku` | auto | GPU SKU identifier |

## Profiling Methods

| Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | All |

## Output

The profiler generates:

1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
2. **Performance Data**: Interpolation models for the SLA Planner
3. **Generated DGD**: Complete deployment manifest with optimized settings

Example recommendations:

```text
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```

## Next Steps

| Document | Description |
|----------|-------------|
| [Profiler Guide](/dynamo/components/profiler/profiler-guide) | Configuration, methods, and troubleshooting |
| [Profiler Examples](/dynamo/components/profiler/profiler-examples) | Complete DGDR YAMLs, WebUI, script examples |
| [SLA Planner Guide](/dynamo/components/planner/planner-guide) | End-to-end deployment workflow |
| [SLA Planner Architecture](/dynamo/components/planner/planner-guide) | How the Planner uses profiling data |

# Profiler Guide

## Overview

The Dynamo Profiler analyzes model inference performance and generates optimized deployment configurations (DynamoGraphDeployments).
Given a model, hardware, and SLA targets, it determines the best parallelization strategy, selects optimal prefill and decode engine configurations, and produces a ready-to-deploy DGD YAML.

The profiler accepts a `DynamoGraphDeploymentRequestSpec` (DGDR) as input and uses [AI Configurator (AIC)](https://github.com/ai-dynamo/aiconfigurator) for performance simulation, candidate enumeration, and configuration picking. When the planner is enabled, the profiler additionally generates engine interpolation curves used for runtime autoscaling.

## Workflow

A DGDR specifies:

- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `sla.ttft`, `sla.itl`)
- **Where** it should run (optional GPU preferences via `hardware`)
- **Which** backend to use (`backend`: auto, vllm, sglang, or trtllm)
- **Which** image to use (`image`)

The profiler follows this pipeline:

```mermaid
flowchart TD
    Input["DGDR Spec"] --> Validate["Validate + Gate Checks"]
    Validate --> Strategy{searchStrategy?}
    Strategy -->|rapid| AICCheck{"AIC supports\nmodel/hw/backend?"}
    Strategy -->|thorough| Enumerate["Enumerate candidates\nvia AIC"]
    AICCheck -->|yes| Simulate["AIC Simulation\n+ Picking"]
    AICCheck -->|no| Naive["Naive Config\nGeneration"]
    Enumerate --> Deploy["Deploy + Benchmark\neach candidate"]
    Deploy --> Pick["AIC Picking"]
    Simulate --> DGDGen["DGD Generation"]
    Pick --> DGDGen
    Naive --> DGDGen
    DGDGen --> Interpolation["Interpolation\nCurves"]
    Interpolation --> MockerCheck{mocker?}
    MockerCheck -->|yes| MockerBase["generate_mocker_config()"]
    MockerCheck -->|no| PlannerCheck
    MockerBase --> PlannerCheck{planner?}
    PlannerCheck -->|yes| AddPlanner["add_planner_to_config()"]
    PlannerCheck -->|no| ProfileCheck
    AddPlanner --> ProfileCheck{"needs profile data?\n(thorough mocker or\nthorough throughput planner)"}
    ProfileCheck -->|yes| AddProfile["add_profile_data_to_config()"]
    ProfileCheck -->|no| Final
    AddProfile --> Final["final_config.yaml"]
```

### Stage-by-stage walkthrough

1. 
**Validation**: The DGDR spec is validated — required fields checked (`image`, `hardware.gpuSku`, `hardware.numGpusPerNode`), SLA targets verified, and gate checks applied (see [Gate Checks](#gate-checks-and-constraints)). 2. **Search Strategy**: The profiler branches based on `searchStrategy`: - **Rapid**: Uses AIC simulation to estimate performance across parallelization configs. No GPUs needed, completes in ~30 seconds. - **Thorough**: Enumerates candidate parallelization configs via AIC, deploys each on real GPUs, benchmarks with AIPerf, then picks the best. Takes 2-4 hours, disagg mode only. 3. **Picking**: The profiler selects the best configuration using one of three modes, determined automatically from the DGDR spec (see [Picking Modes](#picking-modes)). 4. **DGD Generation**: The picked configuration is rendered into a complete DGD YAML via AIC's generator pipeline, including correct parallelization, replica counts, container image, and PVC mounts. 5. **Interpolation** (throughput planner/mocker): When the planner or mocker needs throughput data, the profiler generates detailed performance interpolation curves — TTFT vs ISL for prefill, ITL vs KV-cache utilization for decode. In thorough sweeping, these are stored as NPZ files and later packaged into a ConfigMap during final assembly. In rapid sweeping, consumers use AIC performance-model flags or in-process interpolation instead, so no profile-data ConfigMap is generated. 6. **Final Assembly** (3 composable layers): 1. **Mocker base**: If mocker is enabled, the base DGD is swapped for the mocker DGD template (`generate_mocker_config`). Otherwise the AIC-picked DGD is kept. 2. **Planner service**: If the planner is enabled, the Planner pod and its planner-config ConfigMap are injected into the DGD (`add_planner_to_config`). 3. 
**Profile data**: In thorough sweeping, if mocker is enabled or planner throughput-based scaling is enabled, the interpolation data ConfigMap is created and mounted into all consumers — the Planner service and/or mocker workers (`add_profile_data_to_config`). Rapid sweeping does not create this ConfigMap. The result is written to `final_config.yaml`. ## Search Strategies ### Rapid Uses AIC's performance simulation to estimate optimal configurations without deploying real engines. Completes in ~30 seconds. ```yaml searchStrategy: rapid ``` - Supports all backends: vLLM, SGLang, TensorRT-LLM - If the model/hardware/backend combination is not supported by AIC, falls back to a naive config (memory-fit TP calculation) - No GPU resources consumed during profiling ### Thorough Enumerates candidate parallelization configs, deploys each as a real K8s workload, and benchmarks with AIPerf. ```yaml searchStrategy: thorough ``` - Only disaggregated mode is supported - Does not support `auto` backend — specify `vllm`, `sglang`, or `trtllm` - Takes 2-4 hours depending on the number of candidates - Provides highest accuracy since measurements come from real hardware ## Picking Modes The profiler automatically selects a picking mode based on the DGDR spec: ### Autoscale Triggered when the **planner is enabled** (scaling enabled in `features.planner`). Picks prefill and decode engines independently, each with 1 replica. The planner handles scaling at runtime. ### Load Match Triggered when a **target load** is specified (`workload.requestRate` or `workload.concurrency`). Finds the configuration that serves the target load with the minimum number of GPUs under SLA. ```yaml workload: requestRate: 5.0 # target 5 req/s ``` ### Default Triggered when there is **no planner and no target load**. Maximizes throughput for the available GPU budget under SLA. 
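The trigger rules for the three picking modes can be summarized in a few lines. This is a sketch of the rules as stated above, not the profiler's actual code: the function name and the mode strings are hypothetical, and the precedence when both the planner and a target load are set (planner wins here) is an assumption the text does not confirm.

```python
from typing import Optional


def select_picking_mode(
    planner_enabled: bool,
    request_rate: Optional[float] = None,
    concurrency: Optional[int] = None,
) -> str:
    """Pick a mode name from the DGDR inputs described in this section."""
    # Autoscale: the planner handles scaling at runtime.
    # Assumption: this takes precedence over a target load.
    if planner_enabled:
        return "autoscale"
    # Load match: a target load is given via workload.requestRate or workload.concurrency.
    if request_rate is not None or concurrency is not None:
        return "load_match"
    # Default: no planner and no target load.
    return "default"


if __name__ == "__main__":
    print(select_picking_mode(planner_enabled=True))
    print(select_picking_mode(planner_enabled=False, request_rate=5.0))
    print(select_picking_mode(planner_enabled=False))
```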
## Planner Integration When the planner is enabled, the profiler generates engine interpolation data needed for throughput-based autoscaling. The `pre_deployment_sweeping_mode` field controls how this data is produced: ```yaml features: planner: optimization_target: sla # required for throughput-based scaling and specific SLA targets pre_deployment_sweeping_mode: rapid # rapid | thorough | none enable_throughput_scaling: true ``` `optimization_target` must be set to `sla` for `enable_throughput_scaling` and the planner's `ttft_ms`/`itl_ms` SLA targets to take effect. The `PlannerConfig` default is `throughput`, which uses static queue/utilization thresholds: it silently flips `enable_throughput_scaling` to `false` (so pre-deployment profiling is skipped and `planner-profile-data-XXXX` is not emitted) and ignores any `features.planner.ttft_ms`/`itl_ms` values. `enable_load_scaling` is unaffected (easy-mode keeps load scaling enabled). See the [Planner Guide](/dynamo/components/planner/planner-guide#optimization-target) for the full explanation of each `optimization_target` value. - **rapid**: Uses AIC simulation to generate interpolation curves (~30s, no GPUs). Consumers use AIC performance-model flags or in-process interpolation, so `planner-profile-data-XXXX` is not emitted. - **thorough**: Deploys the selected engine config on real GPUs and sweeps across ISL/concurrency ranges (2-4h). When profile data is needed, the profiler packages it into `planner-profile-data-XXXX`. - **none**: Skips interpolation. Only valid when using load-based scaling without throughput-based scaling. The generated DGD can include these ConfigMaps: - **planner-config-XXXX**: Serialized `PlannerConfig` JSON (with `profile_results_dir` pointing to the profiling data mount) - **planner-profile-data-XXXX**: Prefill and decode interpolation data (JSON). 
Only emitted when `pre_deployment_sweeping_mode: thorough` and either `optimization_target: sla` is set alongside `enable_throughput_scaling: true`, or mocker is enabled. Rapid mode does not emit this ConfigMap.

See the [Planner Guide](/dynamo/components/planner/planner-guide) for the full `PlannerConfig` reference.

## Mocker

When `features.mocker.enabled: true`, the profiler outputs a mocker DGD that simulates engine behavior without real GPUs. This is useful for testing planner behavior and validating configurations at scale.

Mocker requires pre-deployment sweeping to generate simulated performance profiles; `pre_deployment_sweeping_mode` cannot be `none` when mocker is enabled.

## Gate Checks and Constraints

The profiler enforces these rules at startup:

| Condition | Behavior |
|-----------|----------|
| `searchStrategy: thorough` + `backend: auto` | Rejected. Specify a concrete backend. |
| `enable_throughput_scaling: true` without `optimization_target: sla` | Silently coerced. `PlannerConfig` defaults `optimization_target` to `throughput`, which flips `enable_throughput_scaling` to `false` at validation time. Set `optimization_target: sla` explicitly to keep throughput-based scaling enabled. |
| `enable_throughput_scaling: true` + `pre_deployment_sweeping_mode: none` (or unset) | Rejected. Throughput-based scaling requires pre-deployment sweeping. |
| `enable_throughput_scaling: true` + `pre_deployment_sweeping_mode: rapid` + AIC unsupported | Rejected. AIC does not support this model/hardware/backend combination; switch `pre_deployment_sweeping_mode` to `thorough`. |
| `e2eLatency` provided together with an explicitly-set `ttft` or `itl` | Rejected by SLA validator. Provide only `e2eLatency`; `ttft` and `itl` do not need to be explicitly nulled. |
| SLA unachievable | Warning logged, SLA updated to best achievable value. |
| Load-match needs more GPUs than available | Warning logged. |

## Support Matrix

| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 |

The profiler sweeps over the following parallelization mappings for prefill and decode:

| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
|---------|-------------|------------|
| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
| Other Models | TP | TP |

Exact model × parallelization mapping support depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.

## Deployment

### Kubernetes Deployment (DGDR)

The recommended deployment method is through DGDRs. See [Profiler Examples](/dynamo/components/profiler/profiler-examples) for complete DGDR YAML examples covering rapid, thorough, MoE, custom SLA, and override use cases.

#### Container Images

Each DGDR requires a container image for profiling and deployment:

- **`image`** (Optional): Container image for the profiling job. Must contain the profiler code and dependencies.
```yaml spec: image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` #### Quick Start: Deploy with DGDR **Step 1: Create Your DGDR** Use a sample configuration or create your own: ```yaml apiVersion: nvidia.com/v1beta1 kind: DynamoGraphDeploymentRequest metadata: name: my-model-profiling spec: model: "Qwen/Qwen3-0.6B" backend: vllm image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0" # dynamo-frontend for Dynamo < 1.1.0 ``` **Step 2: Apply the DGDR** ```bash export NAMESPACE=your-namespace kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE ``` **Step 3: Monitor Progress** ```bash # View status kubectl get dgdr -n $NAMESPACE # Detailed status kubectl describe dgdr my-model-profiling -n $NAMESPACE # Watch profiling job logs kubectl logs -f job/profile-my-model-profiling -n $NAMESPACE ``` **DGDR Status Phases:** - `Pending`: Initial state, preparing to profile - `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online) - `Ready`: Profiling complete, generated DGD spec available in status - `Deploying`: Generating and applying DGD configuration - `Deployed`: DGD successfully deployed and running - `Failed`: Error occurred (check events for details) **Step 4: Access Your Deployment** ```bash # Find the frontend service kubectl get svc -n $NAMESPACE | grep frontend # Port-forward to access locally kubectl port-forward svc/-frontend 8000:8000 -n $NAMESPACE # Test the endpoint curl http://localhost:8000/v1/models ``` DGDRs are **immutable**. To update SLAs or configuration, delete the existing DGDR and create a new one. ## Profiling Method The profiler follows a 5-step process: 1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes. 2. **Identify Sweep Ranges**: Automatically determine minimum and maximum number of GPUs per engine. 
Minimum is determined by the model size and GPU VRAM. Maximum is set to one node for dense models and 4 nodes for MoE models. 3. **Parallelization Mapping Sweep**: Test performance of engines with different parallelization mappings using the input ISL and OSL. - For dense models, test different TP sizes for both prefill and decode. - For MoE models (SGLang), evaluate both TEP and DEP as candidates for prefill and decode. - **Prefill**: - TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse. - DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. ![Prefill Performance](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/3bc6d1b301110acf6e2a9eb689751ff2994c28860501d31329fc39b257e259dd/pages-v1.2.0/assets/img/h100-prefill-performance.png) - **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without being affected by piggy-backed prefill requests, the script enables KV-reuse and warms up the engine by issuing the same prompts before measuring. ![Decode Performance](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/6223dfa3b912e0ed17e78f02613a79e0bed7b5750fb237482df8d4524ed17ac7/pages-v1.2.0/assets/img/h100-decode-performance.png) 4. **Recommendation**: Select optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL. 5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation. 
![ITL Interpolation](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/55e4969b5183918299538a6a587e5a69423cbc203270f044bec3c861e9cc1ee0/pages-v1.2.0/assets/img/pd-interpolation.png)

- **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
- **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.

### AIPerf on Real Engines

Profiles your model by creating real test deployments in Kubernetes and measuring their performance.

- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM

AIPerf-based profiling is not the default. Set `searchStrategy: thorough` to run comprehensive real-engine profiling:

```yaml
spec:
  searchStrategy: thorough  # Deep exploration with real engine profiling
```

### AI Configurator Simulation

Uses performance simulation to rapidly estimate optimal configurations without running real deployments.

- **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may be inaccurate for unusual configurations)
- **GPU Requirements**: None
- **Backends**: All (vLLM, SGLang, TensorRT-LLM)

AI Configurator is used by default with `searchStrategy: rapid`:

```yaml
spec:
  searchStrategy: rapid  # Fast profiling with AI Configurator simulation (default)
```

`aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.

**Currently supports:**

- **Backends**: vLLM, SGLang, TensorRT-LLM
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more

See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) for the full list.
### Automatic GPU Discovery

The operator automatically discovers GPU resources from cluster nodes, providing hardware info (GPU model, VRAM, GPUs per node) and automatic profiling search space calculation.

**Requirements:**

- **Cluster-scoped operators** (recommended): Have node read permissions by default. GPU discovery works automatically.

> **DEPRECATED:** The following applies only to namespace-scoped operators, which are deprecated and will be removed in a future release. Use cluster-wide mode for new deployments.

- **Namespace-scoped operators** (deprecated): GPU discovery is enabled by default when installing via Helm; the chart provisions the required ClusterRole/ClusterRoleBinding automatically.

**For namespace-scoped operators (deprecated)**, GPU discovery is controlled by a Helm value:

```bash
# GPU discovery enabled (default): Helm provisions read-only node access automatically
helm install dynamo-platform ... --set dynamo-operator.gpuDiscovery.enabled=true

# GPU discovery disabled: you must provide hardware config manually in each DGDR
helm install dynamo-platform ... --set dynamo-operator.gpuDiscovery.enabled=false
```

If GPU discovery is disabled, provide hardware config manually in the DGDR:

```yaml
spec:
  hardware:
    numGpusPerNode: 8
    gpuSku: h100_sxm
    vramMb: 81920
```

If GPU discovery is disabled and no manual hardware config is provided, the DGDR will be rejected at admission time.

## Configuration

### DGDR Configuration Structure

All profiler configuration is provided through the v1beta1 DGDR spec fields:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-deployment
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  searchStrategy: rapid  # or thorough
  autoApply: true
  workload: { ... }
  sla: { ... }
  hardware: { ... }
  features: { ... }
  overrides: { ... }
```

### SLA Configuration (Optional)

```yaml
workload:
  isl: 3000  # Average input sequence length (tokens)
  osl: 150   # Average output sequence length (tokens)
sla:
  ttft: 200.0  # Target Time To First Token (milliseconds)
  itl: 20.0    # Target Inter-Token Latency (milliseconds)
```

- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
- **Trade-offs**: Tighter SLAs require more GPU resources

### Hardware Configuration (Optional)

```yaml
hardware:
  gpuSku: h100_sxm    # GPU SKU identifier (auto-detected)
  vramMb: 81920       # VRAM per GPU in MiB
  totalGpus: 16       # Total GPUs available in the cluster
  numGpusPerNode: 8   # GPUs per node (for multi-node MoE)
```

- **numGpusPerNode**: Determines the upper bound of GPUs per node for dense models and configures Grove for multi-node MoE engines
- **gpuSku**: GPU SKU identifier, auto-detected by the controller

If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.

### Search Strategy (Optional)

Controls the profiling search depth:

```yaml
spec:
  searchStrategy: rapid  # "rapid" (default) for fast sweep; "thorough" for deeper exploration
```

- **rapid**: Performs a fast sweep over parallelization mappings (default)
- **thorough**: Explores more configurations for potentially better results

### Planner Configuration (Optional)

Pass arguments to the SLA planner via the features section:

```yaml
features:
  planner:
    planner_min_endpoint: 2          # Minimum endpoints to maintain
    planner_adjustment_interval: 60  # Adjustment interval (seconds)
    planner_load_predictor: linear   # Load prediction method
```

Planner arguments use the `planner_` prefix. See the [SLA Planner documentation](/dynamo/components/planner/planner-guide) for the full list.
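If you generate DGDRs from scripts or CI rather than writing YAML by hand, the fields covered so far can be assembled as a plain dictionary and serialized. The sketch below is illustrative only: `build_dgdr` is a hypothetical helper, not part of Dynamo, and it places `workload`, `sla`, and `features` under `spec` following the structure shown in "DGDR Configuration Structure". It emits JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json

VALID_STRATEGIES = {"rapid", "thorough"}


def build_dgdr(name, model, backend, image, *, isl, osl, ttft_ms, itl_ms,
               search_strategy="rapid", planner_args=None):
    """Assemble a DGDR manifest as a dict (hypothetical helper)."""
    if search_strategy not in VALID_STRATEGIES:
        raise ValueError(f"searchStrategy must be one of {sorted(VALID_STRATEGIES)}")
    spec = {
        "model": model,
        "backend": backend,
        "image": image,
        "searchStrategy": search_strategy,
        "workload": {"isl": isl, "osl": osl},
        "sla": {"ttft": ttft_ms, "itl": itl_ms},
    }
    if planner_args:
        # Planner arguments are expected to carry the planner_ prefix.
        bad = [key for key in planner_args if not key.startswith("planner_")]
        if bad:
            raise ValueError(f"planner arguments need the planner_ prefix: {bad}")
        spec["features"] = {"planner": dict(planner_args)}
    return {
        "apiVersion": "nvidia.com/v1beta1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = build_dgdr(
    "my-deployment",
    "Qwen/Qwen3-0.6B",
    "vllm",
    "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0",
    isl=3000, osl=150, ttft_ms=200.0, itl_ms=20.0,
    planner_args={"planner_min_endpoint": 2},
)
print(json.dumps(manifest, indent=2))
```

Validating `searchStrategy` and the `planner_` prefix locally catches typos before the request reaches the cluster; the admission checks performed by the operator remain the source of truth.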
### Model Cache PVC (Advanced)

For large models, use a pre-populated PVC containing model weights instead of downloading from HuggingFace:

```yaml
modelCache:
  pvcName: "model-cache"
  pvcModelPath: "hub/models--deepseek-ai--DeepSeek-R1"
  pvcMountPath: "/opt/model-cache"
```

Requirements:

- The PVC must exist in the same namespace as the DGDR
- The model weights must be accessible at `{pvcMountPath}/{pvcModelPath}`

### Engine Configuration (Auto-configured)

The controller automatically handles model and backend configuration from high-level fields:

```yaml
# You specify:
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
# Controller auto-injects into the profiling job
```

You should **not** manually set model or backend in profiling config overrides.

### Using Existing DGD Configs

Provide a base DGD config via the overrides section:

```yaml
overrides:
  dgd:
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeployment
    metadata:
      name: my-dgd
    spec:
      # ... your base DGD spec
```

The profiler uses the DGD config as a **base template**, then optimizes it based on your SLA targets.

## Integration

### With SLA Planner

The Profiler generates interpolation data that the SLA Planner uses for autoscaling decisions.

**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):

- `prefill_isl`: 1D array of input sequence lengths tested
- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL

**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):

- `max_kv_tokens`: Total KV token capacity of the decode engine
- `x_kv_usage`: 1D array of active KV usage fractions in [0, 1]
- `y_context_length`: 1D array of average context lengths tested
- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point

### With Dynamo Operator

When using DGDR, the Dynamo Operator:

1. Creates profiling jobs automatically
2. Stores profiler output in ConfigMaps (`dgdr-output-` and, when thorough profile data is needed, `planner-profile-data`)
3. Generates optimized DGD configurations
4. Deploys the DGD with SLA Planner integration

#### Failure Handling

Profiling failures are not retried at the Kubernetes Job level (`backoffLimit: 0`). Most profiler errors (validation failures, unsupported model/hardware combinations, missing configs) are deterministic and will never succeed on retry, so re-running the full profiling cycle would only waste GPU time.

When the profiler reports failure, the output-copier sidecar writes the error details (phase, error message, profiler status) to the output ConfigMap and exits successfully. The DGDR controller reads the failure from the ConfigMap and transitions the DGDR directly to the `Failed` phase with the specific sub-phase failure reason (e.g., `SweepingDecodeFailed`, `GeneratingDGDFailed`). Use `kubectl describe dgdr ` to see the failure details in the conditions.

The generated DGD is tracked via labels:

```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: my-deployment
    dgdr.nvidia.com/namespace: your-namespace
```

### With Observability

Monitor profiling jobs:

```bash
kubectl logs -f job/profile- -n $NAMESPACE
kubectl describe dgdr -n $NAMESPACE
```

## Advanced Topics

### Manual Deployment Control

Disable auto-deployment to review the generated DGD before applying:

```yaml
spec:
  autoApply: false
```

Then manually extract and apply:

```bash
# Extract generated DGD from DGDR status
kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.profilingResults.selectedConfig}' | kubectl apply -f -

# Or save to file for review
kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.profilingResults.selectedConfig}' > my-dgd.yaml
```

### Mocker Deployment

Deploy a mocker deployment that simulates engines without GPUs:

```yaml
spec:
  model:
  backend: trtllm
  features:
    mocker:
      enabled: true  # Deploy mocker instead of real backend
  autoApply: true
```

With thorough sweeping, profiling still runs against the real backend to collect performance data and stores it in `planner-profile-data`. With rapid sweeping, the mocker uses AIC performance-model flags instead of a profile-data ConfigMap.

Useful for large-scale experiments, testing Planner behavior, and validating configurations.

### Accessing Profiling Artifacts

By default, profiler output is stored in ConfigMaps. For detailed artifacts (plots, logs, raw data), attach a PVC via overrides:

```yaml
overrides:
  profilingJob:
    template:
      spec:
        volumes:
          - name: profiling-output
            persistentVolumeClaim:
              claimName: "dynamo-pvc"
```

**ConfigMaps:**

- `dgdr-output-`: Generated DGD configuration
- `planner-profile-data`: Profiling data for Planner and mocker consumers (JSON). Only created for thorough sweeping when profile data is needed.

**PVC artifacts (optional):**

- Performance plots (PNGs)
- DGD configurations for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler logs

Access PVC results:

```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```

### Output Performance Plots

The profiler generates plots to visualize performance data:

**Parallelization Mapping Sweep Plots:**

- `prefill_performance.png`: TTFT vs Parallelization Mapping size
- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests

**In-Depth Profiling Plots:**

- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL
- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL
- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length
- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length

## Runtime Profiling (SGLang)

SGLang workers expose profiling endpoints for runtime performance analysis:

```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```

View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.

## Troubleshooting

### SLA Cannot Be Met

The profiler logs a warning and updates the SLA to the best achievable value. To improve results:

- Relax SLA targets (increase TTFT/ITL)
- Add more GPU resources
- Try a different backend
- Use a smaller or quantized model

### Profiling Takes Too Long

- Use `searchStrategy: rapid` for ~30s profiling
- Reduce interpolation granularity
- Reduce the GPU search space via hardware constraints

### Out of Memory During Profiling

- Reduce `max_batch_size` in engine config
- Skip larger TP configurations by constraining hardware
- Use a quantized model variant

### Image Pull Errors

Ensure image pull secrets are configured in your namespace for the container registry.

## See Also

- [Profiler README](/dynamo/components/profiler): Quick overview and feature matrix
- [Profiler Examples](/dynamo/components/profiler/profiler-examples): Complete DGDR YAML examples
- [Planner Guide](/dynamo/components/planner/planner-guide): PlannerConfig reference and scaling modes
- [DGDR API Reference](/dynamo/additional-resources/api-reference-k-8-s): Full DGDR specification

# Profiler Examples

Complete examples for profiling with DGDRs.
## DGDR Examples

### Dense Model: Rapid

Fast profiling (~30 seconds):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: qwen-0-6b
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
```

### Dense Model: Thorough

Profiling with real GPU measurements:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  searchStrategy: thorough
```

### MoE Model

Multi-node MoE profiling with SGLang:

The PVC referenced by `modelCache.pvcName` must already exist in the same namespace and contain the model weights at the specified `pvcModelPath`. The DGDR controller does not create or populate the PVC; it only mounts it into the profiling job and deployed workers.

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  hardware:
    numGpusPerNode: 8
  modelCache:
    pvcName: "model-cache"
    pvcModelPath: "deepseek-r1"  # path within the PVC
```

### Private Model

For gated or private HuggingFace models, pass your token via an environment variable injected into the profiling job.
Create the secret first:

```bash
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n ${NAMESPACE}
```

Then reference it in your DGDR:

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-private
spec:
  model: "meta-llama/Llama-3.1-8B-Instruct"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  overrides:
    profilingJob:
      template:
        spec:
          containers: []  # required placeholder; leave empty to inherit defaults
          initContainers:
            - name: profiler
              env:
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token-secret
                      key: HF_TOKEN
```

### Custom SLA Targets

Control how the profiler optimizes your deployment by specifying latency targets and workload characteristics.

**Explicit TTFT + ITL targets** (default mode):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: low-latency-dense
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  sla:
    ttft: 500  # Time To First Token target in milliseconds
    itl: 20    # Inter-Token Latency target in milliseconds
  workload:
    isl: 2000  # expected input sequence length (tokens)
    osl: 500   # expected output sequence length (tokens)
```

**End-to-end latency target** (alternative to ttft+itl):

```yaml
spec:
  ...
  sla:
    e2eLatency: 10000  # total request latency budget in milliseconds
```

### Overrides

Use `overrides` to customize the profiling job pod spec, for example to add tolerations for GPU node taints or inject environment variables.
**GPU node toleration** (common on GKE and shared clusters):

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: dense-with-tolerations
spec:
  model: "Qwen/Qwen3-0.6B"
  image: "nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.2.0"  # dynamo-frontend for Dynamo < 1.1.0
  overrides:
    profilingJob:
      template:
        spec:
          containers: []  # required placeholder; leave empty to inherit defaults
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
```

**Override the generated DynamoGraphDeployment** (e.g., to inject worker environment variables):

```yaml
spec:
  ...
  overrides:
    dgd:
      apiVersion: nvidia.com/v1alpha1
      kind: DynamoGraphDeployment
      spec:
        envs:
          - name: TRITON_PTXAS_PATH
            value: "/usr/local/cuda/bin/ptxas"
        services:
          VllmWorker:
            envs:
              - name: CUSTOM_ENV
                value: "my-value"
```

## SGLang Runtime Profiling

Profile SGLang workers at runtime via HTTP endpoints:

```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
  -H "Content-Type: application/json" \
  -d '{"output_dir": "/tmp/profiler_output"}'

# Run inference requests to generate profiling data...

# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```

A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:

```bash
python examples/backends/sglang/test_sglang_profile.py
```

View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.

# KVBM


The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.

KVBM offers:

- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks

> **Get started:** See the [KVBM Guide](/dynamo/user-guides/kv-cache-offloading) for installation and deployment instructions.

## When to Use KV Cache Offloading

KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.

Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data.
It is especially valuable in:

| Scenario | Benefit |
|----------|---------|
| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |

## Feature Support Matrix

| | Feature | Support |
|--|---------|---------|
| **Backend** | Local | ✅ |
| | Kubernetes | ✅ |
| **LLM Framework** | vLLM | ✅ |
| | TensorRT-LLM | ✅ |
| | SGLang | ❌ |
| **Serving Type** | Aggregated | ✅ |
| | Disaggregated | ✅ |

## Architecture

![KVBM Architecture](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/5b0075bf0e321be71d204bf237a3ced35402bce029989e335a7396c1a1f4108b/pages-v1.2.0/assets/img/kvbm-components.svg)

*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*

KVBM has three primary logical layers:

**LLM Inference Runtime Layer**: The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.

**KVBM Logic Layer**: The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory.
The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.

**NIXL Layer**: The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.

> **Learn more:** See the [KVBM Design Document](/dynamo/design-docs/component-design/kvbm-design) for detailed architecture, components, and data flows.

## Next Steps

- **[KVBM Guide](/dynamo/user-guides/kv-cache-offloading)**: Installation, configuration, and deployment instructions
- **[KVBM Design](/dynamo/design-docs/component-design/kvbm-design)**: Architecture deep dive, components, and data flows
- **[LMCache Integration](/dynamo/integrations/kv-cache-integrations/lm-cache)**: Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](/dynamo/integrations/kv-cache-integrations/flex-kv)**: Use FlexKV for KV cache management
- **[SGLang HiCache](/dynamo/integrations/kv-cache-integrations/hi-cache)**: Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**: NIXL communication library details

# Using HiCache

This guide covers running SGLang's Hierarchical Cache (HiCache) with Dynamo, and how the Dynamo KV router integrates with HiCache for tier-aware worker selection when workers share an external pool such as Mooncake.

## Overview

SGLang HiCache extends RadixAttention with a multi-tier KV cache that transparently moves pages between GPU HBM, host memory, and an optional external storage backend (e.g.
Mooncake). For a full description of HiCache itself (flag reference, storage backends, memory layouts, prefetch policies), see SGLang's own documentation:

- [SGLang HiCache Design](https://docs.sglang.ai/advanced_features/hicache_design.html)
- [SGLang HiCache Best Practices](https://docs.sglang.ai/advanced_features/hicache_best_practices.html)

What Dynamo adds on top of HiCache:

- **Tier-aware routing.** The KV router tracks which cache tier each block lives on (GPU / Host / External) and uses that when scoring candidate workers, not just device overlap.
- **Shared-pool awareness.** When an external backend such as Mooncake is configured, the router queries the shared pool in parallel with its own indexer so it can discount prefill cost for blocks any worker can fetch, not just blocks the candidate holds locally.

If you are running a single worker with HiCache and no shared pool, no Dynamo-side configuration is required; the worker reports KV events to the router as usual.

## Running SGLang with HiCache

Launch a worker with HiCache enabled:

```bash
python -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-write-policy write_through \
  --hicache-storage-backend nixl \
  --skip-tokenizer-init
```

Then start the frontend:

```bash
python -m dynamo.frontend --http-port 8000
```

The HiCache flags (`--enable-hierarchical-cache`, `--hicache-ratio`, `--hicache-write-policy`, `--hicache-storage-backend`, `--hicache-mem-layout`, etc.) are SGLang-native; Dynamo passes them through unchanged. See [SGLang's best-practices doc](https://docs.sglang.ai/advanced_features/hicache_best_practices.html) for the complete flag reference and tuning guidance.

## Tier-Aware Shared KV Cache Routing

When you scale out to multiple SGLang workers that share an external pool such as [Mooncake](https://github.com/kvcache-ai/Mooncake), the Dynamo router can be made tier-aware.
It tracks per-tier residency from worker events and consults the shared pool directly so that blocks cached anywhere in the cluster, not just on the candidate worker's GPU, contribute to worker scoring.

### Why

By default the router's radix tree only reflects blocks resident in **GPU HBM** on each worker. HiCache silently demotes blocks to host memory and further to Mooncake as the device pool fills, but the router never sees those transitions. A worker that has the full request prefix on host + Mooncake looks identical to a cold worker. The router ends up treating "fetchable from Mooncake in milliseconds" the same as "must be recomputed from scratch."

### Event model

SGLang's `HiRadixCache` emits `BlockStored` / `BlockRemoved` events carrying a `medium` field on every tier transition:

| Transition | Event emitted |
| ---------- | ------------- |
| Fresh prefill writes blocks to GPU | `store(GPU)` |
| GPU → Host copy (after async DMA completes) | `store(CPU_PINNED)` |
| GPU evicted, block still resident on Host | `remove(GPU)` |
| Host evicted (block gone from all worker tiers) | `remove(CPU_PINNED)` |
| Host → GPU promotion (`load_back`) | `store(GPU)` |
| External → Host prefetch (L2 materialization) | `store(CPU_PINNED)` |

`CPU_PINNED` is the value SGLang's `HiRadixCache` actually emits for host-tier blocks (page-locked memory). The rest of this guide uses `CPU_PINNED` to match the on-the-wire string; "Host" is the conceptual tier name.

A few properties the router relies on:

- **Ordering.** `store(new_tier)` is emitted before `remove(old_tier)` so the block is never invisible to the router during a transition.
- **DMA safety.** `store(CPU_PINNED)` for a GPU→Host copy is deferred until `finish_event.synchronize()` confirms the DMA landed; events never fire before bytes are resident.
- **Per-tier tracking.** A block can be on GPU and Host simultaneously.
The router records both and picks the highest-priority tier when scoring overlap.

### How it works

```mermaid
flowchart LR
    Worker["SGLang Worker<br/>(HiRadixCache)"]
    Mooncake["Mooncake<br/>shared pool"]
    Router["Dynamo KV Router<br/>per-tier radix tree"]
    Worker -- "KV events (store/remove + medium)" --> Router
    Worker -- "writes pages" --> Mooncake
    Router -- "batch_query on each request" --> Mooncake
```

On every request the router runs two lookups in parallel:

- Its own radix tree, built from worker KV events (per-tier).
- A batch query to the Mooncake master for blocks reachable from the shared pool.

If the shared-pool query fails, the router falls back to indexer-only scoring and logs a warning. The request still succeeds.

### Scoring

For each candidate worker, the router computes a **logit** (lower wins):

```text
# Without shared cache
adjusted_prefill_blocks = max(
    prefill_blocks
    - overlap_score_credit * device_overlap_blocks
    - host_cache_hit_weight * host_overlap_blocks
    - disk_cache_hit_weight * disk_overlap_blocks,
    0,
)
logit = prefill_load_scale * adjusted_prefill_blocks + decode_blocks

# With shared cache
shared_beyond = shared_cache_hits.hits_beyond(device_overlap_blocks)
adjusted_prefill_blocks = max(
    prefill_blocks
    - overlap_score_credit * device_overlap_blocks
    - host_cache_hit_weight * host_overlap_blocks
    - disk_cache_hit_weight * disk_overlap_blocks
    - shared_cache_multiplier * shared_beyond,
    0,
)
logit = prefill_load_scale * adjusted_prefill_blocks + decode_blocks
```

`hits_beyond(n)` counts shared-cache pages at positions `>= n`: "pages past my device prefix that I can still fetch from Mooncake instead of recomputing."

**Worked example.** Request is 4 blocks, `shared_cache_multiplier = 0.5`, `block_size = 1`, `overlap_score_credit = 1.0` (the maximum device-local overlap credit). Shared pool contains blocks 0–3 (labeled A through D in the table).

| Worker | Device overlap | `hits_beyond` | Device credit | Shared credit | Adjusted prefill | Logit |
| ------ | -------------- | ------------- | ------------- | ------------- | ---------------- | ----- |
| W0 | 2 (A, B) | 2 (C, D) | 2.0 | 1.0 | 1.0 | **1.0 (wins)** |
| W1 | 0 | 4 (A, B, C, D) | 0.0 | 2.0 | 2.0 | 2.0 |

W0 wins because it combines device-local reuse with shared-pool hits beyond that device prefix. The multiplier encodes the cost ratio of a Mooncake fetch relative to a fresh GPU compute: `0.5` means "fetching from shared is half as expensive as recomputing."

## Requirements

Tier-aware shared cache routing requires SGLang changes from [sgl-project/sglang#22894](https://github.com/sgl-project/sglang/pull/22894) ("fix(hicache): emit KV events for L2 host cache insertions"). This PR is **not yet merged** to SGLang main. Until it lands and a SGLang release includes it, the feature is not accessible from a stock `pip install sglang`; you must build SGLang from the PR branch (`gh pr checkout 22894 && pip install -e python/` from the SGLang repo). This section will be updated with the minimum required version once #22894 ships in a release.

Without PR #22894, worker events carry only `medium=GPU` and the router is blind to Host-tier residency, regardless of Mooncake configuration.

You also need:

- Dynamo router started with `--shared-cache-type hicache` (see [Configuration](#configuration)).
- A Mooncake master reachable from the Dynamo frontend host.

Worker-side Mooncake config (master address, page size, TP/PP layout, split-head layout) is published automatically via each worker's registration metadata when the worker is started with `--hicache-storage-backend mooncake`.

## Setup

**Known limitation in 1.2.0.** With both `--enable-metrics` and `--disable-piecewise-cuda-graph` set on the SGLang worker, the process can crash on the first KV-cache write due to a race in the upstream `mooncake-transfer-engine` thread pool.
The recipe below omits these flags; per-process metrics scraping via the `dynamo.frontend` is unaffected. The mooncake-side fix is being tracked upstream.

**SGLang worker** (HiCache with Mooncake storage):

```bash
python -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-write-policy write_through \
  --hicache-storage-backend mooncake \
  --hicache-storage-backend-extra-config '{"master_server_address": "mooncake-master.internal:50051"}' \
  --skip-tokenizer-init
```

Launch additional workers on other GPUs or hosts with the same Mooncake config so that they are all backed by the same Mooncake cluster.

**Dynamo frontend** (enable tier-aware routing):

```bash
python -m dynamo.frontend \
  --http-port 8000 \
  --router-mode kv \
  --shared-cache-type hicache \
  --shared-cache-multiplier 0.5
```

## Configuration

| Flag | Env var | Default | Description |
| ---- | ------- | ------- | ----------- |
| `--shared-cache-type` | `DYN_SHARED_CACHE_TYPE` | `none` | `none` disables shared-pool lookups; `hicache` enables Mooncake queries. |
| `--shared-cache-multiplier` | `DYN_SHARED_CACHE_MULTIPLIER` | `0.5` | Discount factor for shared-pool hits. `0.0` queries but ignores them; `0.5` treats a shared hit as half a device hit; `1.0` treats shared and device hits equally. |

Per-request overrides are available via `RouterConfigOverride.shared_cache_multiplier` for A/B experimentation without restarting the router.

No extra flags are required on the worker. When `--hicache-storage-backend mooncake` is set, Dynamo publishes the required metadata (page size, TP/PP layout, master address) via the worker's `ModelRuntimeConfig.engine_specific` blob under the key `sglang_hicache_mooncake`.
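To get a feel for how `--shared-cache-multiplier` shifts worker selection, the scoring formula from the Scoring section can be written out in plain Python. This is a sketch for experimentation, not the router's implementation: the `Candidate` container and function signatures are hypothetical, and the defaults for `host_cache_hit_weight`, `disk_cache_hit_weight`, and `prefill_load_scale`, along with `decode_blocks = 0`, are assumptions chosen so that the result matches the worked example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Per-worker overlap counts in blocks (hypothetical container)."""
    device_overlap_blocks: int
    host_overlap_blocks: int = 0
    disk_overlap_blocks: int = 0
    decode_blocks: float = 0.0  # assumed zero in the worked example


def hits_beyond(shared_positions, n):
    """Count shared-pool pages at positions >= n."""
    return sum(1 for pos in shared_positions if pos >= n)


def logit(cand, prefill_blocks, shared_positions, *,
          shared_cache_multiplier=0.5,
          overlap_score_credit=1.0,
          host_cache_hit_weight=0.0,   # assumed value
          disk_cache_hit_weight=0.0,   # assumed value
          prefill_load_scale=1.0):     # assumed value
    """Lower is better; mirrors the 'with shared cache' formula."""
    shared_beyond = hits_beyond(shared_positions, cand.device_overlap_blocks)
    adjusted = max(
        prefill_blocks
        - overlap_score_credit * cand.device_overlap_blocks
        - host_cache_hit_weight * cand.host_overlap_blocks
        - disk_cache_hit_weight * cand.disk_overlap_blocks
        - shared_cache_multiplier * shared_beyond,
        0.0,
    )
    return prefill_load_scale * adjusted + cand.decode_blocks


shared_pool = {0, 1, 2, 3}                  # blocks 0-3 are in the shared pool
w0 = Candidate(device_overlap_blocks=2)     # holds the first two blocks on GPU
w1 = Candidate(device_overlap_blocks=0)     # cold worker

for multiplier in (0.0, 0.5, 1.0):
    scores = [logit(w, 4, shared_pool, shared_cache_multiplier=multiplier)
              for w in (w0, w1)]
    print(multiplier, scores)
```

Under these assumptions, a multiplier of `0.5` reproduces the worked example (W0 scores 1.0, W1 scores 2.0). At `0.0` the shared pool is ignored and the gap widens to 2.0 versus 4.0, while at `1.0` both workers score 0.0 and device locality no longer separates them, which is one reason to start in the middle of the range.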
## Verification

**Events carry a medium.** Run the worker with `--log-level debug` and grep the log:

```bash
python -m dynamo.sglang ... --log-level debug 2>&1 | grep -E 'BlockStored|BlockRemoved'
# BlockStored(block_hashes=[...], medium=CPU_PINNED)
# BlockRemoved(block_hashes=[...], medium=GPU)
```

If `medium` is missing or always reads `GPU`, the worker is running an SGLang build without PR #22894.

**Router sees the shared pool.** Two new histograms are exposed on the frontend's Prometheus endpoint:

| Metric | Meaning |
| ------ | ------- |
| `router_shared_cache_hit_rate` | Fraction of request blocks found in the shared pool (0.0–1.0). |
| `router_shared_cache_beyond_blocks` | Blocks in the shared pool _beyond_ the selected worker's device overlap. |

```bash
curl -s localhost:8000/metrics | grep shared_cache
```

## Troubleshooting

| Symptom | Likely cause | Fix |
| ------- | ------------ | --- |
| `router_shared_cache_hit_rate` is always 0 | Mooncake master unreachable from the router host | Check network path; the router logs `Shared cache query failed` when it can't reach Mooncake. |
| Events only ever carry `medium=GPU` | SGLang missing [PR #22894](https://github.com/sgl-project/sglang/pull/22894) | Rebuild SGLang from the PR branch. |
| Workers registered but router never queries shared cache | `--shared-cache-type` left at default `none` | Set `--shared-cache-type hicache` on the frontend. |
| Queries issued but winning worker rarely changes | `--shared-cache-multiplier 0.0` | Raise the multiplier; a typical starting range is `0.3`–`0.7`. |
| Page-size mismatch warnings | Router `--page-size` doesn't match worker `--page-size` | They must agree; the router hashes pages using the worker's page size. |
| Router logs "no workers have HiCache enabled" | No worker published `sglang_hicache_mooncake` metadata | Confirm workers started with `--hicache-storage-backend mooncake`. |

## Further Reading

- [SGLang HiCache Design](https://docs.sglang.ai/advanced_features/hicache_design.html) and [Best Practices](https://docs.sglang.ai/advanced_features/hicache_best_practices.html)
- [Mooncake](https://github.com/kvcache-ai/Mooncake): the shared KV store used as the external tier
- [SGLang PR #22894](https://github.com/sgl-project/sglang/pull/22894): the tier-annotated events prerequisite
- [KVBM Guide](/dynamo/user-guides/kv-cache-offloading): Dynamo's own block manager, an alternative to HiCache
- [KV Events for Custom Engines](/dynamo/integrations/kv-cache-integrations/kv-events-for-custom-engines): the event protocol contract for backends other than SGLang

# LMCache

## Introduction

LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.

This document describes how LMCache is integrated into Dynamo's vLLM backend to provide enhanced performance and memory efficiency.

## Installation Notes

Dynamo's vLLM runtime expects LMCache to be present in the same Python environment. On supported environments (x86_64, Python 3.10-3.13, PyTorch built against CUDA 12.x), the published wheel installs directly:

```bash
uv pip install lmcache
```

LMCache only publishes x86_64 manylinux wheels linked against CUDA 12.
For aarch64 hosts, or hosts running PyTorch built against a different CUDA major version, build LMCache from source against your matching torch + CUDA stack; see the official [LMCache installation guide](https://docs.lmcache.ai/getting_started/installation.html).

> **Compatibility note**
>
> `LMCacheMPConnector` needs the fix from [LMCache#3282](https://github.com/LMCache/LMCache/pull/3282), which is on LMCache `main` but not yet released. Without it, the MP path fails on vLLM ≥ 0.20.0 (including the `vllm==0.21.0` Dynamo currently pins) with `RuntimeError: Unsupported GPUKVFormat: 7`, because vLLM 0.20+ uses GPU KV formats 6 / 7 that the MP path doesn't yet handle.
>
> Until the next LMCache release, build LMCache from source against that PR.

## Aggregated Serving

### Configuration

LMCache runs the cache engine as an out-of-process sidecar (`lmcache server`); the Dynamo worker connects to it via the `LMCacheMPConnector`. Start the sidecar, then launch the worker:

```bash
lmcache server --l1-size-gb 100 --eviction-policy LRU &

python -m dynamo.vllm \
  --model \
  --disable-hybrid-kv-cache-manager \
  --kv-transfer-config '{"kv_connector":"LMCacheMPConnector","kv_role":"kv_both"}'
```

### Customization

The LMCache MP server is configured via CLI arguments. See the [Configuration Reference](https://docs.lmcache.ai/mp/configuration.html) for the full list of `lmcache server` flags.

LMCache MP uses a two-tier storage architecture: an in-memory L1 cache (sized with `--l1-size-gb`) plus optional persistent L2 adapters configured with `--l2-adapter`.
The supported [L2 storage backends](https://docs.lmcache.ai/mp/l2_storage.html) are:

- **POSIX**: Standard POSIX file I/O on any file system
- **GDS** / **GDS_MT**: NVIDIA GPU Direct Storage (single- and multi-threaded), bypassing the CPU for NVMe SSDs that support GDS
- **HF3FS**: Distributed / shared file-system backend
- **OBJ**: Object store backend
- **AZURE_BLOB**: Azure Blob Storage

### Deployment

Use the provided launch script for quick setup:

```bash
./examples/backends/vllm/launch/agg_lmcache_mp.sh
```

This will:

1. Start the LMCache MP server
2. Start the Dynamo frontend
3. Launch a single vLLM worker with `LMCacheMPConnector` connected to the sidecar

### Architecture for Aggregated Mode

In aggregated mode, the system uses:

- **KV Connector**: `LMCacheMPConnector`
- **KV Role**: `kv_both` (handles both reading and writing)

## Disaggregated Serving

Disaggregated serving separates prefill and decode operations into dedicated workers. This provides better resource utilization and scalability for production deployments.

### Deployment

Use the provided disaggregated launch script (requires at least 2 GPUs):

```bash
./examples/backends/vllm/launch/disagg_lmcache.sh
```

This will:

1. Start the Dynamo frontend
2. Launch a decode worker on GPU 0
3. Wait for initialization
4. Launch a prefill worker on GPU 1 with LMCache enabled

### Worker Roles

#### Decode Worker

- **Purpose**: Handles token generation (decode phase)
- **GPU Assignment**: `CUDA_VISIBLE_DEVICES=0`
- **LMCache Config**: Uses `NixlConnector` only, for KV transfer between prefill and decode workers

#### Prefill Worker

- **Purpose**: Handles prompt processing (prefill phase)
- **GPU Assignment**: `CUDA_VISIBLE_DEVICES=1`
- **LMCache Config**: Uses `MultiConnector` with both LMCache and NIXL connectors. This lets the prefill worker use LMCache for KV offloading and NIXL for KV transfer between prefill and decode workers.
- **Flag**: `--disaggregation-mode prefill`

## Architecture

### KV Transfer Configuration

The system automatically configures KV transfer based on the deployment mode and worker type:

#### Aggregated Mode

```python
kv_transfer_config = KVTransferConfig(
    kv_connector="LMCacheMPConnector",
    kv_role="kv_both",
    kv_connector_extra_config={"lmcache.mp.port": 5555},
)
```

#### Prefill Worker (Disaggregated Mode)

```python
kv_transfer_config = KVTransferConfig(
    kv_connector="PdConnector",
    kv_role="kv_both",
    kv_connector_extra_config={
        "connectors": [
            {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"},
            {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
        ]
    }
)
```

#### Decode Worker (Disaggregated Mode)

```python
# The decode worker only transfers KV blocks over NIXL; it does not use LMCache.
kv_transfer_config = KVTransferConfig(
    kv_connector="NixlConnector",
    kv_role="kv_both"
)
```

#### Fallback (No LMCache)

```python
kv_transfer_config = KVTransferConfig(
    kv_connector="NixlConnector",
    kv_role="kv_both"
)
```

### Integration Points

1. **Argument Parsing** (`args.py`):
   - Configures appropriate KV transfer settings
   - Sets up connector configurations based on worker type
2. **Engine Setup** (`main.py`):
   - Creates vLLM engine with proper KV transfer config
   - Handles both aggregated and disaggregated modes
3. **Sidecar Lifecycle** (launch script):
   - Starts the `lmcache server` process before the Dynamo worker
   - Tears it down on exit via the script's cleanup trap

### Best Practices

1. **Chunk Size Tuning**: Pass `--chunk-size` to `lmcache server` based on your use case:
   - Smaller chunks (128-256): Better reuse granularity for varied content
   - Larger chunks (512-1024): More efficient for repetitive content patterns
2. **Memory Allocation**: Set `--l1-size-gb` on `lmcache server` conservatively:
   - Leave sufficient RAM for other system processes
   - Monitor memory usage during peak loads
3. **Workload Optimization**: LMCache performs best with:
   - Repeated prompt patterns (RAG, multi-turn conversations)
   - Shared context across sessions
   - Long-running services with warm caches

## Metrics and Monitoring

The LMCache MP server records metrics through the OpenTelemetry SDK and exposes them on its own HTTP admin port (default `:8080/metrics`), prefixed `lmcache_mp_`:

```bash
curl -s localhost:8080/metrics | grep '^lmcache_mp_'
```

vLLM and Dynamo metrics remain on Dynamo's `:8081/metrics` (set `DYN_SYSTEM_PORT=8081` on the worker to enable that endpoint).

For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](/dynamo/backends/v-llm/observability#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.

## Troubleshooting

### vLLM log: `Found PROMETHEUS_MULTIPROC_DIR was set by user`

vLLM v1 uses `prometheus_client.multiprocess` and stores intermediate metric values in `PROMETHEUS_MULTIPROC_DIR`.

- If you **set `PROMETHEUS_MULTIPROC_DIR` yourself**, vLLM warns that the directory must be wiped between runs to avoid stale/incorrect metrics.
- When running via Dynamo, the vLLM wrapper may set `PROMETHEUS_MULTIPROC_DIR` internally to a temporary directory to avoid vLLM cleanup issues. If you still see the warning, confirm you are not exporting `PROMETHEUS_MULTIPROC_DIR` in your shell or container environment.
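As a quick way to act on the warning above, the following sketch reports whether `PROMETHEUS_MULTIPROC_DIR` is set in the current environment and lists leftover files from a previous run. The helper name `stale_metric_files` is hypothetical, and the `*.db` pattern assumes the file naming that `prometheus_client` uses in multiprocess mode; adjust it if your directory contains something else.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def stale_metric_files(environ: Mapping[str, str] = os.environ) -> list[Path]:
    """Return leftover *.db files in PROMETHEUS_MULTIPROC_DIR, or [] if the variable is unset."""
    multiproc_dir = environ.get("PROMETHEUS_MULTIPROC_DIR")
    if not multiproc_dir:
        return []
    return sorted(Path(multiproc_dir).glob("*.db"))


leftovers = stale_metric_files()
if "PROMETHEUS_MULTIPROC_DIR" not in os.environ:
    print("PROMETHEUS_MULTIPROC_DIR is not exported in this environment.")
elif leftovers:
    print(f"{len(leftovers)} metric file(s) remain from an earlier run; wipe the directory before starting.")
else:
    print("PROMETHEUS_MULTIPROC_DIR is set and contains no leftover metric files.")
```

Run it in the same shell or container that launches the worker, since the variable may differ between environments.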
## References and Additional Resources

- [LMCache Documentation](https://docs.lmcache.ai/index.html) - Comprehensive guide and API reference
- [Configuration Reference](https://docs.lmcache.ai/mp/configuration.html) - `lmcache server` CLI arguments
- [LMCache Observability Guide](https://docs.lmcache.ai/mp/observability.html) - Metrics and monitoring details

# FlexKV

## Introduction

[FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team and NVIDIA in collaboration with the community. It acts as a unified KV caching layer for inference engines like SGLang, TensorRT-LLM, and vLLM.

### Key Features

- **Multi-level caching**: CPU memory, local SSD, and scalable storage (cloud storage) for KV cache offloading
- **Distributed KV cache reuse**: Share KV cache across multiple nodes using a distributed RadixTree
- **High-performance I/O**: Supports io_uring and GPU Direct Storage (GDS) for accelerated data transfer
- **Asynchronous operations**: Get and put operations can overlap with computation through prefetching

## Prerequisites

1. **Dynamo installed** with vLLM support
2. **Infrastructure services running**:
   ```bash
   docker compose -f dev/docker-compose.yml up -d
   ```
3. **FlexKV installed**:
   ```bash
   git clone https://github.com/taco-project/FlexKV.git
   cd FlexKV
   ./build.sh
   ```
4. **Optional: SSD offloading dependencies** (only required for CPU + SSD tiered offloading):
   ```bash
   apt install liburing-dev libxxhash-dev
   ```

## Quick Start

### Enable FlexKV

Set the `DYNAMO_USE_FLEXKV` environment variable and use the `--kv-transfer-config` flag:

```bash
export DYNAMO_USE_FLEXKV=1
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
```

## Aggregated Serving

### Basic Setup

```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &

# Terminal 2: Start vLLM worker with FlexKV
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
  python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
```

### With KV-Aware Routing

For multi-worker deployments with KV-aware routing to maximize cache reuse:

```bash
# Terminal 1: Start frontend with KV router
python -m dynamo.frontend \
    --router-mode kv \
    --router-reset-states &

# Terminal 2: Worker 1
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_0" \
CUDA_VISIBLE_DEVICES=0 \
python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}' \
    --gpu-memory-utilization 0.2 \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' &

# Terminal 3: Worker 2
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
FLEXKV_SERVER_RECV_PORT="ipc:///tmp/flexkv_server_1" \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}' \
    --gpu-memory-utilization 0.2 \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```

## Disaggregated Serving

> **Note:** Disaggregated FlexKV serving is experimental.
> The prefill worker must use `PdConnector` with two sub-connectors: `FlexKVConnectorV1` (KV cache offloading) and `NixlConnector` (P/D KV transfer). Using `FlexKVConnectorV1` alone as the top-level connector in disaggregated mode is **not supported** and will result in a `TypeError`.

FlexKV can be used with disaggregated prefill/decode serving. The prefill worker uses FlexKV for KV cache offloading, while NIXL handles KV transfer between prefill and decode workers. The `PdConnector` wraps both connectors so they work together.

### Supported connector configuration

| Role | Connector | Description |
|------|-----------|-------------|
| Decode worker | `NixlConnector` | Pulls KV blocks from prefill worker via NIXL |
| Prefill worker | `PdConnector` wrapping `[FlexKVConnectorV1, NixlConnector]` | FlexKV offloads/onboards KV blocks; NIXL serves them to decode |

```bash
# Terminal 1: Start frontend
python -m dynamo.frontend &

# Terminal 2: Decode worker (without FlexKV)
CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' &

# Terminal 3: Prefill worker (with FlexKV + NIXL via PdConnector)
DYN_VLLM_KV_EVENT_PORT=20081 \
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYNAMO_USE_FLEXKV=1 \
FLEXKV_CPU_CACHE_GB=32 \
CUDA_VISIBLE_DEVICES=1 \
python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --is-prefill-worker \
    --kv-transfer-config '{"kv_connector":"PdConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"},{"kv_connector":"NixlConnector","kv_role":"kv_both"}]},"kv_connector_module_path":"kvbm.vllm_integration.connector"}' \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}'
```

You can also use the provided launch script directly:

```bash
examples/backends/vllm/launch/disagg_flexkv.sh
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYNAMO_USE_FLEXKV` | Enable FlexKV integration | `0` (disabled) |
| `FLEXKV_CPU_CACHE_GB` | CPU memory cache size in GB | Required |
| `FLEXKV_CONFIG_PATH` | Path to FlexKV YAML config file | Not set |
| `FLEXKV_SERVER_RECV_PORT` | IPC port for FlexKV server | Auto |

### CPU-Only Offloading

For simple CPU memory offloading:

```bash
unset FLEXKV_CONFIG_PATH
export FLEXKV_CPU_CACHE_GB=32
```

### CPU + SSD Tiered Offloading

For multi-tier offloading with SSD storage, create a configuration file:

```bash
cat > ./flexkv_config.yml <
```

> **Note:** For full configuration options, see the [FlexKV Configuration Reference](https://github.com/taco-project/FlexKV/blob/main/docs/flexkv_config_reference/README_en.md).

## Distributed KV Cache Reuse

FlexKV supports distributed KV cache reuse to share cache across multiple nodes. This enables:

- **Distributed RadixTree**: Each node maintains a local snapshot of the global index
- **Lease Mechanism**: Ensures data validity during cross-node transfers
- **RDMA-based Transfer**: Uses Mooncake Transfer Engine for high-performance KV cache transfer

For setup instructions, see the [FlexKV Distributed Reuse Guide](https://github.com/taco-project/FlexKV/blob/main/docs/dist_reuse/README_en.md).

## Architecture

FlexKV consists of three core modules:

### StorageEngine

Initializes the three-level cache (GPU → CPU → SSD/Cloud). It groups multiple tokens into blocks and stores KV cache at the block level, maintaining the same KV shape as in GPU memory.

### GlobalCacheEngine

The control plane that determines data transfer direction and identifies source/destination block IDs.
Includes:

- RadixTree for prefix matching
- Memory pool to track space usage and trigger eviction

### TransferEngine

The data plane that executes data transfers:

- Multi-threading for parallel transfers
- High-performance I/O (io_uring, GDS)
- Asynchronous operations overlapping with computation

## Verify Deployment

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false,
    "max_tokens": 30
  }'
```

## See Also

- [FlexKV GitHub Repository](https://github.com/taco-project/FlexKV)
- [FlexKV vLLM Adapter Documentation](https://github.com/taco-project/FlexKV/blob/main/docs/vllm_adapter/README_en.md)

# KV Events for Custom Engines

This document explains how to implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing.

## Overview

The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions.

Events are published over the **Dynamo event plane**, a transport-agnostic pub/sub layer that supports both NATS and ZMQ backends (see [Event Plane](/dynamo/design-docs/communication-planes/event-plane) for details). The `KvEventPublisher` binding handles all transport concerns; your engine code does not interact with the event plane directly.

`KvEventPublisher` supports two publishing modes:

1. **Direct publishing**: Your engine calls `publish_stored()` / `publish_removed()` to push events directly over the event plane. This is the simplest approach for custom engines.
2. **ZMQ relay**: For engines that emit raw KV events over a ZMQ socket (like SGLang and vLLM). The publisher subscribes to the ZMQ endpoint and relays events to the event plane automatically.
## Event Types

The KV cache supports three event types:

| Event Type | Description | When to Publish |
|------------|-------------|-----------------|
| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |

### Event Structure

Each event contains:

- **`event_id`**: Monotonically increasing identifier per worker (managed internally by the publisher)
- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
- **`data`**: One of `Stored`, `Removed`, or `Cleared`

For `BlockStored` events:

- **`token_ids`**: List of token IDs for the stored blocks
- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
- **`lora_name`**: LoRA adapter name string (omit or `None` for base model). When set, the adapter name is incorporated into block hash computation so that blocks for different LoRA adapters (or the base model) are never conflated.

For `BlockRemoved` events:

- **`block_hashes`**: List of sequence block hashes being evicted

## Direct Publishing (Recommended for Custom Engines)

Call `publish_stored()` and `publish_removed()` directly from your engine code. The publisher handles event IDs, serialization, and transport.
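To make the cumulative `block_hashes` described under Event Structure concrete, here is a minimal sketch of one way an engine could derive them. Everything in it is an assumption for illustration: the helper `sequence_block_hashes`, the choice of BLAKE2b, the byte layout, and the decision to hash only full blocks. A real engine should publish whatever hashes its own block manager already computes.

```python
import hashlib
import struct


def sequence_block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Chain a hash over full blocks so each value depends on every preceding token."""
    hashes: list[int] = []
    parent_digest = b""  # the first block has no parent
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        # Mix the parent digest with this block's tokens (illustrative layout only).
        payload = parent_digest + struct.pack(f"<{block_size}I", *block)
        parent_digest = hashlib.blake2b(payload, digest_size=8).digest()
        # Interpret the 8-byte digest as a signed 64-bit integer.
        hashes.append(int.from_bytes(parent_digest, "little", signed=True))
    return hashes


shared_prefix = sequence_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
other_suffix = sequence_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
other_prefix = sequence_block_hashes([0, 0, 0, 0, 5, 6, 7, 8], block_size=4)

print("first block matches across requests:", shared_prefix[0] == other_suffix[0])
print("second block differs after a different prefix:", shared_prefix[1] != other_prefix[1])
```

Because each hash folds in its parent, two requests share a block hash only when they share the entire prefix up to that block, which is the property prefix matching relies on. In this sketch, the `parent_hash` for block `i` would be the hash at index `i - 1`.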
```mermaid
flowchart LR
    subgraph Engine["Custom Engine"]
        cache["KV Cache Manager"]
    end

    subgraph Worker["Dynamo Worker Process"]
        pub["KvEventPublisher"]
    end

    subgraph EP["Dynamo Event Plane"]
        topic["kv-events topic"]
    end

    subgraph Router["KV Router"]
        indexer["KvIndexer"]
    end

    cache -->|"publish_stored()<br/>publish_removed()"| pub
    pub -->|"event plane"| topic
    topic --> indexer
```

**When to use:**

- Building a custom inference engine from scratch
- Your engine doesn't have a ZMQ-based event system
- You want the simplest integration path

### Basic Setup

```python
from dynamo.llm import KvEventPublisher


class CustomEnginePublisher:
    def __init__(self, component, block_size: int, dp_rank: int = 0):
        self.block_size = block_size
        self.kv_publisher = KvEventPublisher(
            component=component,
            kv_block_size=block_size,
            dp_rank=dp_rank,
        )

    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
                         parent_hash: int | None = None,
                         lora_name: str | None = None):
        """Call after KV cache blocks are allocated."""
        num_block_tokens = [self.block_size] * len(block_hashes)
        self.kv_publisher.publish_stored(
            token_ids=token_ids,
            num_block_tokens=num_block_tokens,
            block_hashes=block_hashes,
            parent_hash=parent_hash,
            lora_name=lora_name,
        )

    def on_blocks_removed(self, block_hashes: list[int]):
        """Call when KV cache blocks are evicted."""
        self.kv_publisher.publish_removed(block_hashes=block_hashes)
```

### Integration with Your Engine

```python
from dynamo.llm import register_model


async def main():
    component, endpoint = await register_model(
        model="my-model",
        generator=my_generate_fn,
    )

    publisher = CustomEnginePublisher(
        component=component,
        block_size=16,  # Match your engine's block size
    )

    def on_prefill_complete(request_id, token_ids, blocks):
        block_hashes = [block.hash for block in blocks]
        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)

    def on_cache_eviction(evicted_blocks):
        block_hashes = [block.hash for block in evicted_blocks]
        publisher.on_blocks_removed(block_hashes=block_hashes)
```

## ZMQ Relay (For Engines with Raw KV Events)

For engines that already publish raw KV events over a ZMQ socket (like SGLang and vLLM), use the same `KvEventPublisher` with a `zmq_endpoint`.
The publisher subscribes to the ZMQ socket and relays events to the event plane automatically.

```mermaid
flowchart LR
    subgraph Engine["Custom Engine / SGLang / vLLM"]
        cache["KV Cache Manager"]
        zmq_pub["ZMQ Publisher"]
    end

    subgraph ZMQ["ZMQ Socket"]
        socket["tcp://127.0.0.1:5557"]
    end

    subgraph Worker["Dynamo Worker Process"]
        relay["KvEventPublisher<br/>(relay mode)"]
    end

    subgraph EP["Dynamo Event Plane"]
        topic["kv-events topic"]
    end

    subgraph Router["KV Router"]
        indexer["KvIndexer"]
    end

    cache --> zmq_pub
    zmq_pub -->|"PUB"| socket
    socket -->|"SUB"| relay
    relay -->|"event plane"| topic
    topic --> indexer
```

**When to use:**

- Your engine already publishes KV events via ZMQ (like SGLang or vLLM)
- You want to decouple event publishing from your engine's main loop

### Setup

Pass `zmq_endpoint` (and optional `zmq_topic`) to the same `KvEventPublisher`:

```python
from dynamo.llm import KvEventPublisher

kv_publisher = KvEventPublisher(
    component=component,
    kv_block_size=block_size,
    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
    zmq_topic="",                         # Subscribe to all topics
)
```

No further calls to `publish_stored()` / `publish_removed()` are needed; the publisher reads events from the ZMQ socket and forwards them automatically.

### ZMQ Wire Format

The ZMQ message format (compatible with SGLang / vLLM):

| Frame | Description |
|-------|-------------|
| 1 | Topic (empty string for all topics) |
| 2 | Sequence number (8 bytes, big-endian) |
| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |

Each event in the payload is a dictionary with a `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
For `BlockStored`:

```python
{
    "type": "BlockStored",
    "block_hashes": [signed_i64, ...],       # Sequence block hashes
    "parent_block_hash": signed_i64 | None,  # Parent hash
    "token_ids": [int, ...],                 # Token IDs
    "block_size": int,                       # Tokens per block
    "lora_name": str | None,                 # LoRA adapter name
}
```

For `BlockRemoved`:

```python
{
    "type": "BlockRemoved",
    "block_hashes": [signed_i64, ...],
}
```

For `AllBlocksCleared`:

```python
{"type": "AllBlocksCleared"}
```

## API Reference

### `KvEventPublisher`

```python
KvEventPublisher(
    component: Component,
    kv_block_size: int,
    dp_rank: int = 0,
    enable_local_indexer: bool = False,
    zmq_endpoint: str | None = None,  # Set for relay mode
    zmq_topic: str | None = None,     # Defaults to "" when zmq_endpoint is set
)
```

| Parameter | Description |
|-----------|-------------|
| `component` | The Dynamo component this publisher belongs to |
| `kv_block_size` | Number of tokens per block (must be > 0, must match your engine) |
| `dp_rank` | Data parallel rank (defaults to 0) |
| `enable_local_indexer` | Enable a worker-local KV indexer for direct overlap queries |
| `zmq_endpoint` | ZMQ endpoint to subscribe to for relay mode (e.g. `"tcp://127.0.0.1:5557"`) |
| `zmq_topic` | ZMQ topic filter (defaults to `""` = all topics) |

#### `publish_stored()`

```python
publish_stored(
    token_ids: list[int],
    num_block_tokens: list[int],
    block_hashes: list[int],
    parent_hash: int | None = None,
    block_mm_infos: list[dict | None] | None = None,
    lora_name: str | None = None,
)
```

Publish a block-stored event. Event IDs are managed internally. When `lora_name` is provided, the adapter name is mixed into block hash computation so blocks cached under different adapters produce distinct hashes.

#### `publish_removed()`

```python
publish_removed(block_hashes: list[int])
```

Publish a block-removed event. Event IDs are managed internally.

#### `shutdown()`

```python
shutdown()
```

Stop background tasks (ZMQ listener, event forwarding).

## Best Practices

1. **`kv_block_size` must match** your engine's actual block size.
2. **`parent_hash` is required** for all blocks except the first in a sequence; it links blocks to enable prefix matching.
3. **Block hashes are signed 64-bit integers** in the Python API. The publisher handles conversion internally.
4. **Event ordering is automatic**: the publisher assigns monotonically increasing event IDs. You do not need to track event IDs yourself.

## See Also

- **[Event Plane](/dynamo/design-docs/communication-planes/event-plane)**: Transport options (NATS, ZMQ) and configuration
- **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags, tuning, and production setup
- **[Router Design](/dynamo/design-docs/component-design/router-design)**: Architecture details and event transport modes

# LWS

Dynamo can use [LeaderWorkerSet (LWS)](https://lws.sigs.k8s.io/docs/) as the Kubernetes orchestration layer for multinode workloads. LWS is the lightweight path for spanning one Dynamo worker service across multiple nodes; Dynamo pairs it with [Volcano](https://volcano.sh/) for gang scheduling.

Use LWS when you want a simpler multinode orchestrator than Grove, or when your cluster already standardizes on LWS and Volcano. Grove remains the default when both Grove and LWS are available.

## Prerequisites

- Kubernetes cluster with GPU nodes.
- LWS version `0.7.0` or newer.
- Volcano installed for gang scheduling.
- Dynamo Kubernetes Platform installed.

The installation guide includes the exact Helm commands for [LWS and Volcano](/dynamo/kubernetes-deployment/start-here/installation-guide#lws--volcano).

## Orchestrator Selection

For multinode deployments, the Dynamo operator selects an orchestrator based on what is installed:

| Cluster state | Operator behavior |
| --- | --- |
| Grove and LWS installed | Uses Grove by default. |
| Grove and LWS installed, DGD has `nvidia.com/enable-grove: "false"` | Uses LWS. |
| Only LWS installed | Uses LWS. |
| Neither Grove nor LWS installed | Rejects multinode deployments. |

To force the LWS path when Grove is also present:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-multinode-deployment
  annotations:
    nvidia.com/enable-grove: "false"
spec:
  # ...
```

## Multinode Spec

Set `multinode.nodeCount` on the service that should span nodes. The total GPU count is `multinode.nodeCount` multiplied by the per-node GPU limit:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: qwen3-multinode
  annotations:
    nvidia.com/enable-grove: "false"
spec:
  services:
    backend:
      multinode:
        nodeCount: 2
      resources:
        limits:
          gpu: "4"
      extraPodSpec:
        mainContainer:
          args:
            - "--tp-size"
            - "8"
```

In this example, Dynamo asks LWS to place the backend across 2 nodes with 4 GPUs per node, for 8 GPUs total. Make sure your backend's tensor parallel or distributed execution flags match that total.

## Backend Behavior

The operator injects backend-specific multinode settings into the generated LeaderWorkerSet:

| Backend | LWS behavior |
| --- | --- |
| vLLM | Uses Ray for multi-node tensor or pipeline parallelism, and injects data-parallel flags for DP deployments. |
| SGLang | Injects `--dist-init-addr`, `--nnodes`, and per-node `--node-rank`. |
| TensorRT-LLM | Wraps the leader command with `mpirun` and configures worker nodes with SSH. |

For detailed backend-specific behavior and examples, see the [Multinode Deployments](/dynamo/kubernetes-deployment/scale/multinode-deployments) guide.

# Gateway API Inference Extension (GAIE)

## Gateway API Inference Extension Setup with Dynamo

Integrate Dynamo with the Gateway API Inference Extension, also known as Inference Gateway, for intelligent KV-aware request routing at the gateway layer.

## Features

- EPP's default KV routing is not token-aware because the prompt is not tokenized, whereas the Dynamo plugin uses a token-aware KV algorithm.
  It employs the Dynamo router, which implements KV routing by running your model's tokenizer inline. The EPP plugin configuration is embedded in the recipe-based GAIE deploy YAMLs under [`recipes/llama-3-70b/vllm/agg/gaie/`](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b/vllm/agg/gaie) and [`recipes/llama-3-70b/vllm/disagg-single-node/gaie/`](https://github.com/ai-dynamo/dynamo/tree/main/recipes/llama-3-70b/vllm/disagg-single-node/gaie), following the GAIE/EPP configuration layout used by this repository.
- Dynamo integration with the Inference Gateway supports aggregated and disaggregated serving. A request only exercises disaggregated routing when the EPP config defines a `prefill` profile and prefill workers are available. The recipe examples provide separate aggregated and disaggregated configs under `recipes/llama-3-70b/vllm/agg/gaie/` and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/`. Unless `DYN_ENFORCE_DISAGG=true`, deployments without a `prefill` profile or prefill workers fall back to aggregated serving.
- GAIE integration supports data parallelism.
- If you want to use LoRA, deploy Dynamo without the Inference Gateway.
- These setups use [agentgateway](https://agentgateway.dev/) as the Inference Gateway implementation.

## Prerequisites

- Kubernetes cluster with kubectl configured
- NVIDIA GPU drivers installed on worker nodes

## Installation Steps

### 1. Install Dynamo Platform ###

See the [Quickstart Guide](/dynamo/kubernetes-deployment/start-here/kubernetes-quickstart) to install the Dynamo Kubernetes Platform.

If you are installing from the source tree rather than a release chart, follow [Advanced: Build from Source](/dynamo/kubernetes-deployment/start-here/installation-guide#advanced-build-from-source) and run `helm dep build ./platform/` before `helm install` so the vendored subcharts match the local chart contents.

### 2. Deploy Inference Gateway ###

First, deploy an inference gateway service.
In this example, we'll install agentgateway with the inference extension enabled.

```bash
cd deploy/inference-gateway
export NAMESPACE=my-model # You can put the inference gateway into another namespace and then adjust your http-route.yaml
./scripts/install_gaie_crd_agentgateway.sh
```

This script installs the Gateway API CRDs, the GAIE CRDs, agentgateway into `agentgateway-system`, and a `Gateway` named `inference-gateway` into `${NAMESPACE}`.

#### Verify the Gateway is running

```bash
kubectl get gateway inference-gateway -n ${NAMESPACE}

# Sample output
# NAME                CLASS          ADDRESS   PROGRAMMED   AGE
# inference-gateway   agentgateway             True         1m
```

### 3. Setup secrets ###

Create a Docker registry secret if your images require one.

```bash
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=$DOCKER_SERVER \
  --docker-username=$DOCKER_USERNAME \
  --docker-password=$DOCKER_PASSWORD \
  --namespace=$NAMESPACE
```

Also create a secret for the Hugging Face token.

```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```

### 4. Build EPP image (Optional)

You can either use the provided Dynamo FrontEnd image as the EPP image, or build your own custom Dynamo EPP image by following the steps below.
```bash
# export env vars
export DOCKER_SERVER=ghcr.io/nvidia/dynamo  # Container registry
export IMAGE_TAG=YOUR-TAG                   # Or auto from git tag

cd deploy/inference-gateway/epp
make all  # Do everything in one command
# or make all-push to also push

# Or step-by-step
make dynamo-lib  # Build Dynamo library and copy to project
make image-load  # Build Docker image and load locally
make image-push  # Build and push to registry
make info        # Check image tag
```

#### All-in-one Targets

| Target | Description |
|--------|-------------|
| `make dynamo-lib` | Build Dynamo static library and copy to project |
| `make all` | Build Dynamo lib + Docker image + load locally |
| `make all-push` | Build Dynamo lib + Docker image + push to registry |

### 5. Deploy

We provide an example for Qwen on vLLM below. You have to deploy both the Dynamo Graph and the `HTTPRoute`.

The example `http-route.yaml` resolves the `Gateway` in the same namespace as the `HTTPRoute`, so the simplest path is to apply the route in the same namespace where you installed the `Gateway` (i.e. `${NAMESPACE}`). If your `Gateway` lives in a different namespace, add `parentRefs[].namespace` to point at it explicitly:

```yaml
parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
    namespace: my-model # only needed if the Gateway is in a different namespace
```

```bash
cd <dynamo-repo-root>
# kubectl get httproutes -n my-model
# Make sure you do not have an incompatible HTTPRoute running, delete if so.

# Choose disagg or agg example
kubectl apply -f examples/backends/vllm/deploy/gaie/disagg.yaml -n my-model
# or
kubectl apply -f examples/backends/vllm/deploy/gaie/agg.yaml -n my-model

# make sure to apply the route
kubectl apply -f examples/backends/vllm/deploy/gaie/http-route.yaml -n my-model
```

Examples for other models can be found in the recipes folder.
```bash
# Deploy the PVC. First update `storageClassName` in
# recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster.
kubectl apply -f recipes/llama-3-70b/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/llama-3-70b/model-cache/model-download.yaml -n ${NAMESPACE}
```

We provide examples for llama-3-70b on vLLM under `recipes/llama-3-70b/vllm/agg/gaie/` for aggregated serving and `recipes/llama-3-70b/vllm/disagg-single-node/gaie/` for disaggregated serving. For aggregated serving, you need to disable `DYN_ENFORCE_DISAGG` in the EPP config:

```yaml
- name: DYN_ENFORCE_DISAGG
  value: "false"
```

Use the matching folder in the commands below.

```bash
# Deploy your Dynamo Graph.
# agg
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml -n ${NAMESPACE}
# Deploy the GAIE http-route CR. The route resolves the Gateway in the same namespace by default;
# if your Gateway is elsewhere, add parentRefs[].namespace before applying.
kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/http-route.yaml -n ${NAMESPACE}

# or disagg
kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/gaie/deploy.yaml -n ${NAMESPACE}
kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/gaie/http-route.yaml -n ${NAMESPACE}
```

- When using GAIE, the FrontEnd does not choose the workers. Routing is determined in the EPP.
- The FrontEnd must run with `--router-mode direct` so that it respects the EPP's routing decisions passed via request headers.
- Use the `frontendSidecar` field on a worker service to have the operator automatically inject a fully configured frontend sidecar container with all required Dynamo env vars, probes, and ports:

  ```yaml
  frontendSidecar:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1
    args:
      - --router-mode
      - direct
    envFromSecret: hf-token-secret
  ```

- The pre-selected workers (decode, plus prefill for disaggregated serving) are passed in the request headers.
- The `--router-mode direct` flag ensures the routing respects this selection.

**Startup Probe Timeout:** The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures). If your model takes longer to load, increase the `failureThreshold` in the EPP's `startupProbe`. For example, to allow 60 minutes for startup:

```yaml
extraPodSpec:
  mainContainer:
    startupProbe:
      failureThreshold: 360 # 10s × 360 = 60 minutes
```

**Gateway Namespace** The example `http-route.yaml` resolves the `Gateway` in the same namespace as the route. If you install the `Gateway` in one namespace and apply the route in another, add `parentRefs[].namespace`, set to the Gateway's namespace, to `http-route.yaml`.

Common variables for routing configuration:

**Enabling KV-Aware Routing (most precise)**

KV-aware routing uses live KV cache block events from workers so the EPP can route requests to the worker with the best prefix cache overlap. To enable it (the default):

1. **Workers: enable prefix caching and KV event publishing.** Each worker must publish KV cache events to the event plane (NATS/ZMQ) so the EPP's router can track per-worker cache state.
   - **vLLM:** Pass `--enable-prefix-caching` and `--kv-events-config '{"enable_kv_cache_events":true}'`.
   - **SGLang:** Pass `--kv-events-config` with the appropriate endpoint.
   - **TRT-LLM:** Pass `--publish-events-and-metrics`.
2. **EPP: leave `DYN_USE_KV_EVENTS` at its default (`true`).** The EPP subscribes to worker KV events via the event plane (NATS/ZMQ) and uses them for prefix-overlap scoring.
3. **Block size: must be consistent.** The `--block-size` on all workers must match `DYN_KV_CACHE_BLOCK_SIZE` on the EPP (default: 128). Mismatched block sizes cause incorrect block hash computation.

**Disabling KV-Aware Routing**

To stop the EPP from listening for KV events (e.g., when prefix caching is off on workers, or for simpler load-balanced routing):

1. **EPP:** Set `DYN_USE_KV_EVENTS=false`.
   The router falls back to approximate mode (routing decisions are tracked locally with TTL decay instead of live KV events from workers).
2. **Workers:** Pass `--no-enable-prefix-caching` to disable prefix caching entirely. Without prefix caching, no KV events are generated regardless of other flags.
3. **Optionally** set `DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT=0` on the EPP to skip prefix-overlap scoring altogether, making the router select workers based on load only.

- Set `DYN_BUSY_THRESHOLD` to configure the upper bound on how "full" a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. The default value is negative, which disables this check.
- Set `DYN_ENFORCE_DISAGG` (default: `false`) to control per-request behavior when prefill workers are unavailable:
  - **`true` (recommended for disaggregated serving):** Requests fail with an error if prefill workers are not available. Use this when disaggregated serving is required and aggregated fallback is not acceptable.
  - **`false` (default):** Requests gracefully fall back to aggregated mode (skip prefill, route directly to decode) when prefill workers are not available. When prefill workers appear later, subsequent requests automatically use disaggregated routing.
- Set `DYN_ROUTER_KV_OVERLAP_SCORE_CREDIT` to control the device-local prefix-overlap credit multiplier, from 0.0 to 1.0. Higher values bias toward reusing workers with similar cached prefixes. (default: 1)
- Set `DYN_ROUTER_PREFILL_LOAD_SCALE` to scale adjusted prompt-side prefill load before decode blocks are added. (default: 1)
- Set `DYN_ROUTER_TEMPERATURE` (default: `0.0`) to soften or sharpen normalized worker sampling. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).
- `DYN_ROUTER_REPLICA_SYNC` — Enable replica synchronization (default: false) - `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` — Track active blocks (default: true) - `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` — Track output blocks during generation (default: false) - See the [KV cache routing design](/dynamo/design-docs/component-design/router-design) for details. **Service Mesh Integration (Istio)** When running under a service mesh such as Istio, the mesh sidecar proxy may conflict with the EPP's own TLS serving, causing connection failures (double-TLS). To avoid this, the mesh must be told how to connect to the EPP service via an Istio `DestinationRule`. The Dynamo operator can generate this DestinationRule automatically. Enable it by setting the `dynamo.serviceMesh` parameters when installing or upgrading the Dynamo platform Helm chart: ```bash helm install dynamo deploy/helm/charts/platform \ --set dynamo.serviceMesh.enabled=true ``` Or equivalently in a custom values file: ```yaml dynamo: serviceMesh: enabled: true provider: "istio" istio: tlsMode: "SIMPLE" insecureSkipVerify: true ``` **Helm Parameters** | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `dynamo.serviceMesh.enabled` | bool | `false` | Enable automatic DestinationRule generation for EPP services. | | `dynamo.serviceMesh.provider` | string | `"istio"` | Service mesh provider. Only `"istio"` is supported. | | `dynamo.serviceMesh.istio.tlsMode` | string | `"SIMPLE"` | TLS mode for the DestinationRule. Supported values: `DISABLE`, `SIMPLE`, `MUTUAL`, `ISTIO_MUTUAL`. | | `dynamo.serviceMesh.istio.insecureSkipVerify` | bool | `true` | Skip TLS certificate verification. Set to `true` when EPP uses self-signed certificates (the default). | The Istio CRDs (`networking.istio.io`) must be installed on the cluster before enabling this feature. 
The operator detects Istio availability at startup — if the CRDs are not present, DestinationRule reconciliation is skipped even when `serviceMesh.enabled` is `true`. When enabled, the operator produces a `DestinationRule` for each EPP service equivalent to: ```yaml apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: spec: host: ..svc.cluster.local trafficPolicy: tls: mode: SIMPLE insecureSkipVerify: true ``` If you are **not** using the Dynamo operator's Helm chart, you must create this `DestinationRule` manually for each EPP service. Without it, Istio's default mTLS policy will conflict with the EPP's gRPC TLS endpoint. **Inference-gateway Istio sidecar exclusion** When namespace-level Istio sidecar injection is enabled (`istio-injection=enabled`), the agentgateway-proxy pod also receives an Istio sidecar. This sidecar intercepts the ext_proc gRPC connection from agentgateway-proxy to EPP (port 9002) and routes it through `PassthroughCluster`, which breaks the connection and causes all inference requests to return HTTP 500 with an empty body. The fix is to tell agentgateway to stamp `sidecar.istio.io/inject: "false"` on the proxy pod template so the Istio webhook skips that pod. EPP and worker pods still receive sidecars normally. You have two options depending on how you set up the gateway: ***Option A: Per-gateway `AgentgatewayParameters` (recommended)*** This is what `install_gaie_crd_agentgateway.sh` does automatically. It only affects the `inference-gateway` proxy pods and leaves any other agentgateway-managed gateways untouched. 1. Create an `AgentgatewayParameters` resource in **the same namespace as the `inference-gateway` Gateway** (e.g. `dynamo-cloud`). It must be co-located with the `Gateway` because the Gateway API `spec.infrastructure.parametersRef` is a `LocalParametersReference` — it has no `namespace` field. 
```yaml apiVersion: agentgateway.dev/v1alpha1 kind: AgentgatewayParameters metadata: name: inference-gateway-params namespace: dynamo-cloud # same as the Gateway spec: deployment: spec: template: metadata: annotations: sidecar.istio.io/inject: "false" ``` Apply it with server-side apply (recommended by agentgateway): ```bash kubectl apply --server-side -n dynamo-cloud -f agentgateway-params.yaml ``` 2. Wire the existing `Gateway` to use it. If the Gateway already exists, patch it in place: ```bash kubectl patch gateway inference-gateway -n dynamo-cloud --type='merge' -p '{ "spec": { "infrastructure": { "parametersRef": { "group": "agentgateway.dev", "kind": "AgentgatewayParameters", "name": "inference-gateway-params" } } } }' ``` Or include the `infrastructure` block directly in your `Gateway` manifest: ```yaml apiVersion: gateway.networking.k8s.io/v1 kind: Gateway metadata: name: inference-gateway namespace: dynamo-cloud spec: gatewayClassName: agentgateway infrastructure: parametersRef: group: agentgateway.dev kind: AgentgatewayParameters name: inference-gateway-params listeners: - name: http port: 80 protocol: HTTP ``` 3. agentgateway will roll the proxy pod. Verify the new pod no longer has an `istio-proxy` container: ```bash kubectl get pod -l gateway.networking.k8s.io/gateway-name=inference-gateway \ -n dynamo-cloud \ -o jsonpath='{.items[0].spec.containers[*].name}{"\n"}' # Expect: agentgateway (NOT "agentgateway istio-proxy") ``` ***Option B: Patch the default `AgentgatewayParameters` CR (cluster-wide)*** The agentgateway controller creates a default `AgentgatewayParameters` resource named `agentgateway` in `agentgateway-system`. Any `Gateway` that does not set `spec.infrastructure.parametersRef` inherits this default. Patching it affects **all** agentgateway-managed proxies in the cluster. 
```bash kubectl patch agentgatewayparameters agentgateway -n agentgateway-system \ --type='merge' -p '{ "spec": { "deployment": { "spec": { "template": { "metadata": { "annotations": { "sidecar.istio.io/inject": "false" } } } } } } }' ``` Use Option A instead if you have multiple agentgateway-managed gateways in the cluster and only want the `inference-gateway` proxy to skip injection. The annotation is a no-op on clusters where Istio is not installed, so it is safe to set unconditionally. With both the `DestinationRule` (for EPP) and the `AgentgatewayParameters` sidecar exclusion (for agentgateway-proxy) in place, end-to-end GAIE inference works correctly under Istio namespace-level injection. ### 6. Verify Installation ### Check that all resources are properly deployed: ```bash kubectl get inferencepool -n ${NAMESPACE} kubectl get httproute -n ${NAMESPACE} kubectl get service -n ${NAMESPACE} kubectl get gateway -n ${NAMESPACE} ``` Sample output: ```bash # kubectl get inferencepool NAME AGE qwen-pool 33m # kubectl get httproute NAME HOSTNAMES AGE qwen-route 33m ``` ### 7. Usage ### The Inference Gateway provides HTTP endpoints for model inference. #### 1: Populate gateway URL for your k8s cluster #### a. To test the integration in minikube, proceed as below: Use minikube tunnel to expose the gateway to the host. This requires `sudo` access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host as shown in alternative (b). ```bash # in first terminal ps aux | grep "minikube tunnel" | grep -v grep # make sure minikube tunnel is not already running. minikube tunnel # start the tunnel # in second terminal where you want to send inference requests GATEWAY_URL=$(kubectl get svc inference-gateway -n my-model -o jsonpath='{.spec.clusterIP}') && echo $GATEWAY_URL ``` b. 
To test on a cluster use commands below: use port-forward to expose the gateway to the host ```bash # in first terminal kubectl port-forward svc/inference-gateway 8000:80 -n ${NAMESPACE} # for NAMESPACE use the namespace where the Gateway service was created, for example agentgateway-system # in second terminal where you want to send inference requests GATEWAY_URL=http://localhost:8000 ``` #### 2: Check models deployed to inference gateway #### a. Query models: ```bash # in the second terminal where you GATEWAY_URL is set curl $GATEWAY_URL/v1/models | jq . # or if you added the host name to http route: curl -H "Host: llama3-70b-disagg.example.com" $GATEWAY_URL/v1/models | jq . ``` Sample output: ```json { "data": [ { "created": 1753768323, "id": "Qwen/Qwen3-0.6B", "object": "object", "owned_by": "nvidia" } ], "object": "list" } ``` b. Send inference request to gateway: ```bash MODEL_NAME="Qwen/Qwen3-0.6B" curl $GATEWAY_URL/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "'"${MODEL_NAME}"'", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden." 
} ], "stream":false, "max_tokens": 30, "temperature": 0.0 }' ``` or ```bash MODEL_NAME="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic" curl -H "Host: llama3-70b-disagg.example.com" http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "'"${MODEL_NAME}"'", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden." } ], "stream":false, "max_tokens": 30, "temperature": 0.0 }' ``` Sample inference output: ```json { "choices": [ { "finish_reason": "stop", "index": 0, "logprobs": null, "message": { "audio": null, "content": "\nOkay, I need to develop a character background for the user's query. Let me start by understanding the requirements. 
The character is an", "function_call": null, "refusal": null, "role": "assistant", "tool_calls": null } } ], "created": 1753768682, "id": "chatcmpl-772289b8-5998-4f6d-bd61-3659b684b347", "model": "Qwen/Qwen3-0.6B", "object": "chat.completion", "service_tier": null, "system_fingerprint": null, "usage": { "completion_tokens": 29, "completion_tokens_details": null, "prompt_tokens": 196, "prompt_tokens_details": null, "total_tokens": 225 } } ```

***If you have more than one HTTPRoute running on the cluster***

Add the hostname to your `http-route.yaml` and send a matching `Host` header with each request, for example `curl -H "Host: llama3-70b-agg.example.com" ...` or `curl -H "Host: llama3-70b-disagg.example.com" http://localhost:8000/v1/models`:

```yaml
spec:
  hostnames:
    - llama3-70b-agg.example.com
```

### 8. Deleting the installation ###

If you need to uninstall, run:

```bash
kubectl delete dynamoGraphDeployment vllm-agg
helm uninstall dynamo-gaie -n my-model

# To uninstall GAIE
# 1. Delete the inference-gateway
kubectl delete gateway inference-gateway --ignore-not-found

# 2. Uninstall agentgateway helm releases
helm uninstall agentgateway -n agentgateway-system
helm uninstall agentgateway-crds -n agentgateway-system

# 3. Delete the agentgateway-system namespace (optional, cleans up everything in it)
kubectl delete namespace agentgateway-system --ignore-not-found

# 4. Delete the Inference Extension CRDs
IGW_LATEST_RELEASE=v1.5.0-rc.2
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${IGW_LATEST_RELEASE}/manifests.yaml --ignore-not-found

# 5. Delete the Gateway API CRDs
GATEWAY_API_VERSION=v1.5.1
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api/releases/download/$GATEWAY_API_VERSION/standard-install.yaml --ignore-not-found
```

## Gateway API Inference Extension Integration

This section documents the updated plugin implementation for Gateway API Inference Extension **v1.5.0-rc.2**.
### Router bookkeeping operations

The EPP performs the Dynamo router's bookkeeping operations, so the FrontEnd's router does not have to sync its state.

### Header Routing Hints

Since v1.5.0-rc.1, the EPP uses **headers and body mutations** to communicate routing decisions. The plugins set HTTP headers for worker targeting and inject pre-computed token IDs into the request body (`nvext.token_data`) so the frontend sidecar can skip redundant tokenization.

#### Headers Set by Dynamo Plugins

| Header | Description | Set By |
|--------|-------------|--------|
| `x-worker-instance-id` | Primary worker ID (decode worker in disagg mode) | kv-aware-scorer |
| `x-prefill-instance-id` | Prefill worker ID (disaggregated mode only) | kv-aware-scorer |

# Overall Architecture


# Dynamo Architecture Dynamo is a distributed inference runtime for generative AI systems that must operate at high throughput, low latency, and high reliability under changing traffic conditions. It is backend-agnostic (SGLang, TRT-LLM, vLLM, and others) and is built around three cooperating concerns: - A fast **request path** for token generation - A responsive **control path** for scaling and placement - A resilient **state path** for KV reuse and failure recovery This document presents Dynamo as an architecture, not a feature list: what each plane owns, how requests move, how the system adapts, and how it remains correct under failure. ## Design Goals Dynamo is designed to satisfy the following goals simultaneously: 1. **Latency stability**: keep TTFT and ITL predictable under bursty and mixed-length traffic. 2. **GPU efficiency**: disaggregate prefill and decode so each can scale independently. 3. **Compute reuse**: minimize KV recomputation through KV-aware routing and cache lifecycle management. 4. **Operational resilience**: treat worker crashes, restarts, and overload as normal operating events. 5. **Deployment portability**: support Kubernetes-native control paths and non-Kubernetes runtime modes. ## Why This Architecture Exists Modern LLM serving hits recurring bottlenecks: - **Prefill/decode imbalance** leaves GPUs underutilized when traffic mix shifts ([DistServe](https://arxiv.org/abs/2401.09670)). - **KV recomputation** increases TTFT and wastes compute when routing ignores cache overlap ([DeepSeek](https://arxiv.org/abs/2501.12948)). - **Memory pressure** from long contexts and concurrency exceeds HBM capacity without multi-tier cache management ([KVBM](https://docs.nvidia.com/dynamo/components/kvbm), [Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [FlexKV](https://github.com/taco-project/FlexKV), [LMCache](https://lmcache.ai/)). 
- **Dynamic demand** breaks static provisioning assumptions ([AzureTrace](https://github.com/Azure/AzurePublicDataset)). - **Real-world failures** (pod restart, partition, hot-spot overload) require first-class recovery behavior. Dynamo addresses these constraints by separating serving, control, and state propagation into explicit planes and control loops. ## Architecture Overview ![Dynamo architecture showing Request Plane (Client, Frontend, Router, Prefill/Decode workers), Control Plane (Planner, Dynamo Operator, Dynamo Graph, Grove, Model Express, Runtime Resources), and Storage & Events Plane (KVBM, NIXL, Local SSD/NFS/Remote Storage)](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/ce4e6a9b9320b44cb76fff57ae435ae33b1b9a04cebbd7d38577592805be4b51/pages-v1.2.0/assets/img/dynamo-architecture.svg "Dynamo Architecture") ## System Model ### Request Plane (critical data path) The request plane is responsible for request/response execution: - **Frontend** accepts and normalizes requests. - **Router** selects workers based on load and KV overlap. - **Prefill workers** compute prompt KV state. - **Decode workers** generate output tokens. This path is optimized for low overhead and continuous token streaming. ### Control Plane (adaptation and orchestration path) The control plane is responsible for desired-state management: - **Planner** computes scaling targets from live metrics. - **Dynamo Operator** reconciles Kubernetes resources from Dynamo CRDs. - **Discovery + Endpoints/CRD** establish liveness and discoverability. - **Grove/KAI Scheduler path** provides topology-aware placement and grouped scaling in multinode Kubernetes deployments. - **Model Express** is an optional model-management endpoint when configured. This path is optimized for correctness and convergence to target capacity. 
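As a rough illustration of what "computes scaling targets from live metrics" can look like, the sketch below sizes the prefill and decode pools independently. It is not Dynamo's Planner: the `PoolMetrics` type, the tokens-per-second signals, the replica bounds, and the ceiling-division rule are all assumptions chosen to keep the example small.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolMetrics:
    """Hypothetical inputs for one worker pool (not a Dynamo type)."""
    demand_tokens_per_s: float    # assumed live demand signal
    capacity_tokens_per_s: float  # assumed throughput of a single replica


def target_replicas(m: PoolMetrics, min_replicas: int, max_replicas: int) -> int:
    """Smallest replica count that covers demand, clamped to [min, max]."""
    needed = math.ceil(m.demand_tokens_per_s / m.capacity_tokens_per_s)
    return max(min_replicas, min(max_replicas, needed))


def plan(prefill: PoolMetrics, decode: PoolMetrics) -> dict[str, int]:
    """Compute separate targets so prefill and decode can scale independently."""
    return {
        "prefill": target_replicas(prefill, min_replicas=1, max_replicas=8),
        "decode": target_replicas(decode, min_replicas=1, max_replicas=8),
    }


targets = plan(
    prefill=PoolMetrics(demand_tokens_per_s=90_000, capacity_tokens_per_s=40_000),
    decode=PoolMetrics(demand_tokens_per_s=3_000, capacity_tokens_per_s=2_500),
)
print(targets)
```

In a real deployment the resulting targets would be handed to whatever applies desired state (for example an operator reconciling replica counts); that step is omitted here.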
### Storage & Events Plane (state propagation path) The storage/events plane is responsible for cache state visibility and movement: - **KV Events** publish cache lifecycle transitions. - **KVBM** manages block reuse, eviction, and offload/recall across memory tiers. - **NIXL** performs high-speed KV/data transfer across workers and memory domains. This path is optimized for cache reuse and cross-worker handoff efficiency. ## End-to-End Request Narrative (Disaggregated Mode) 1. Client sends request to **Frontend**. 2. Frontend validates/preprocesses and forwards to **Router**. 3. Router chooses a **Prefill worker**. 4. Prefill computes KV and returns transfer metadata. 5. Router chooses a **Decode worker**. 6. Decode receives KV state (typically via **NIXL** transfer path). 7. Decode streams tokens back through Frontend. 8. **KV Events** update cache visibility for future routing decisions. 9. **KVBM** may offload or recall KV blocks based on pressure and reuse potential. For flow-level detail, see [Architecture Flow](/dynamo/design-docs/architecture-flow). For request transport options, see [Request Plane](/dynamo/design-docs/communication-planes/request-plane). ## Control Loops ### Serving Loop Maintains low-latency request execution across frontend, router, prefill, and decode workers. ### Planning Loop Maintains capacity alignment with demand: - Planner consumes runtime metrics. - Planner computes prefill/decode targets. - Connector layer applies targets to runtime resources. Planner supports throughput-based and load-based strategies. See [Planner Design](/dynamo/design-docs/component-design/planner-design). ### Resilience Loop Maintains system continuity under failure: - Health checks detect unhealthy workers. - Discovery liveness removes stale endpoints. - Graceful shutdown drains in-flight work. - Request migration/cancellation controls in-flight behavior. - Load shedding prevents cascading collapse under overload. 
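The liveness bullet in the list above can be pictured as a lease table: a worker stays routable only while it keeps renewing its lease. The following is a minimal sketch under that assumption; `LeaseRegistry`, its TTL, and the explicit `now` argument are hypothetical and do not mirror Dynamo's discovery implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseRegistry:
    """Toy discovery table: endpoints stay listed only while their lease is fresh."""
    ttl_s: float
    _last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, endpoint: str, now: float) -> None:
        """Record that the worker behind `endpoint` renewed its lease at `now`."""
        self._last_seen[endpoint] = now

    def expire(self, now: float) -> list[str]:
        """Remove endpoints whose lease is older than the TTL; return the removed names."""
        stale = sorted(e for e, t in self._last_seen.items() if now - t > self.ttl_s)
        for endpoint in stale:
            del self._last_seen[endpoint]
        return stale

    def live(self) -> list[str]:
        """Endpoints that are still eligible for routing."""
        return sorted(self._last_seen)


registry = LeaseRegistry(ttl_s=10.0)
registry.heartbeat("decode-0", now=0.0)
registry.heartbeat("decode-1", now=0.0)
registry.heartbeat("decode-0", now=8.0)  # decode-1 stops renewing (crash or partition)

removed = registry.expire(now=12.0)
print(removed, registry.live())
```

Passing `now` explicitly instead of reading a clock keeps the example deterministic; a real registry would use a monotonic clock or server-side lease expiry.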
See [Fault Tolerance](/dynamo/user-guides/fault-tolerance). ## Kubernetes-Native Realization (CRD + Grove) In Kubernetes deployments, the same architecture maps to declarative resources: - Dynamo Operator reconciles `DynamoGraphDeployment`. - Discoverability is derived from `DynamoWorkerMetadata` + EndpointSlices. - Grove-backed multinode deployments model worker groups as `PodCliqueSet` and `PodClique`. - Independent prefill/decode elasticity is represented via `PodCliqueScalingGroup` with separate `replicas` and `min` targets. The diagram labels such as `PodClique A/B`, `ScalingGroup "Prefill"`, `ScalingGroup "Decode"`, and `(replicas, min)` represent this grouped scaling model. ## Deployment Modes The request plane can be exposed in two ways: - **Standalone mode** (default) — the Dynamo Frontend is the request entry point and the integrated Dynamo Router selects workers using KV-aware scoring. Used by all local installs and the default Kubernetes deployment. - **Gateway mode (GAIE)** — Dynamo runs behind a Kubernetes [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io/) gateway. KV-aware routing is performed at the gateway layer by the Dynamo Endpoint Picker Plugin (EPP); the Frontend runs as a sidecar in `--router-mode direct` and respects the EPP's per-request worker selection passed via request headers. Both modes share the same control plane, storage/events plane, and backend integrations — only the request entry point and the location of the routing decision differ. See the [Inference Gateway (GAIE) guide](/dynamo/integrations/kubernetes-integrations/gateway-api-inference-extension-gaie) for the gateway-mode setup and configuration reference. 
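To contrast where the routing decision is made in the two modes, here is a toy sketch. The header name `x-worker-instance-id` comes from the GAIE guide; the `Worker` fields and the cost formula (blocks still to prefill plus blocks already in flight) are illustrative assumptions, not the router's actual scoring.

```python
from dataclasses import dataclass

WORKER_HEADER = "x-worker-instance-id"  # header set by the EPP in gateway mode


@dataclass(frozen=True)
class Worker:
    instance_id: str
    cached_prefix_blocks: int  # hypothetical: prompt blocks already in this worker's KV cache
    active_blocks: int         # hypothetical load signal


def pick_standalone(workers: list[Worker], prompt_blocks: int) -> Worker:
    """Standalone mode: the integrated router scores every candidate itself."""
    def cost(w: Worker) -> int:
        # Assumed toy cost: work left to prefill plus work already queued.
        return (prompt_blocks - w.cached_prefix_blocks) + w.active_blocks
    return min(workers, key=cost)


def pick_gateway(workers: list[Worker], headers: dict[str, str]) -> Worker:
    """Gateway mode: no scoring here; the frontend honors the EPP's choice."""
    wanted = headers[WORKER_HEADER]
    by_id = {w.instance_id: w for w in workers}
    if wanted not in by_id:
        raise LookupError(f"worker {wanted!r} selected by the EPP is not registered")
    return by_id[wanted]


workers = [
    Worker("w0", cached_prefix_blocks=6, active_blocks=10),
    Worker("w1", cached_prefix_blocks=0, active_blocks=2),
    Worker("w2", cached_prefix_blocks=7, active_blocks=4),
]
standalone_choice = pick_standalone(workers, prompt_blocks=8)
gateway_choice = pick_gateway(workers, {WORKER_HEADER: "w1"})
print(standalone_choice.instance_id, gateway_choice.instance_id)
```

With these numbers the standalone path prefers `w2` (most cached prefix, moderate load), while the gateway path returns whatever worker the header names, even if a local score would have chosen differently.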
## Fault Tolerance Architecture Fault tolerance is embedded across layers: | Layer | Mechanism | Practical effect | |------|-----------|------------------| | Request | Migration, cancellation | In-flight work can continue or terminate intentionally | | Worker | Health checks, graceful shutdown, endpoint draining | Failed/terminating workers stop taking new traffic safely | | System | Request rejection/load shedding | Prevents overload from propagating across workers | | Infrastructure | Discovery lease expiry, event-path recovery | Stale membership is removed and traffic reroutes | This model assumes failures are routine, not exceptional. ## Performance Rationale ### Disaggregated Serving Separating prefill and decode improves utilization and enables phase-specific scaling. ![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/8dea0f547dea3ab03e2bc215231c5c0c94737989bffd7812e69eba66cf250729/pages-v1.2.0/assets/img/disagg-perf-benefit.png) *Tested on H100 with R1 Distilled Llama 70B FP8 on vLLM. 3K ISL / 150 OSL.* ### KV-Aware Routing Routing with cache overlap + load signals reduces prefill recomputation and improves latency. For an external production case study, see [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo). ![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/b33d69c2a5e15a9fed5beefa6259bed1b4a2ef85b004ab23e91cc35af42d1752/pages-v1.2.0/assets/img/kv-routing.png) *Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 H100 nodes. 
Avg 4K ISL / 800 OSL.* ### KV Block Manager (KVBM) KVBM extends effective cache capacity using multi-tier memory offload/recall. ![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/e0baf08a31ef5d1e01195916d7132ff94abe32acc746103de7d619733fa8ec6b/pages-v1.2.0/assets/img/kvbm-agg-performance.png) *Tested across QPS values using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.* ### NIXL Data Transfer NIXL reduces KV handoff cost in distributed serving by optimizing cross-worker transfer behavior across heterogeneous memory. ## Implementation Model - **Rust** for performance-sensitive runtime components. - **Python** for backend integration and extensibility. - Modular subsystem boundaries so routing, planning, memory, and transport can evolve independently. ## Related Documentation - [Architecture Flow](/dynamo/design-docs/architecture-flow) - [Router Design](/dynamo/design-docs/component-design/router-design) - [Planner Design](/dynamo/design-docs/component-design/planner-design) - [Discovery Plane](/dynamo/design-docs/communication-planes/discovery-plane) - [Event Plane](/dynamo/design-docs/communication-planes/event-plane) - [Request Plane](/dynamo/design-docs/communication-planes/request-plane) - [Fault Tolerance](/dynamo/user-guides/fault-tolerance) - [Grove](/dynamo/kubernetes-deployment/scale/grove) ## Acknowledgements Dynamo is informed by prior open-source work from: - vLLM - SGLang - DistServe - Mooncake - AIBrix - BentoML # Architecture Flow This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations. ## 🔵 Main Request Flow (Blue) The primary user journey through the system: 1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000) 2. 
**Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it 3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing ## 🟢 Prefill Flow (Green) The prefill processing pipeline: 4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache 5. **Return Metadata (S5)**: Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata ## 🟠 Decode Routing Flow (Orange) Router orchestration to decode phase: 6. **Route to Decode (S6)**: PrefillRouter injects prefill result into decode request and routes to decode worker 7. **KV Transfer (S7)**: Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL ## 🟣 Completion Flow (Purple) The response generation and delivery: 8. **Decode (S8)**: Decode worker generates tokens using the transferred KV cache 9. **Response (S9)**: Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client ## 🔗 Infrastructure Connections (Dotted lines) Coordination and messaging support: ### Service Discovery - **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required. - **On bare metal**: Uses etcd or filesystem for service discovery and endpoint registration. ### Request Plane - **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport. - **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`. 
### NATS Connections (Optional, for KV routing) - **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-router-kv-events`) ### Planning Connections (Gold, dotted) - **Frontend → Planner**: Metrics collection for auto-scaling decisions - **Planner → Workers**: Resource scaling commands for workers ## Technical Implementation Details ### PrefillRouter Orchestration: - The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving - Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing - Injects transfer metadata into decode requests for KV cache coordination ### NIXL (NVIDIA Interchange Library): - Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe - Transfer metadata exchanged via `disaggregated_params` in prefill response - Backend-specific coordination: SGLang uses bootstrap connections, TRTLLM uses opaque state, vLLM uses block IDs ### Disaggregated KV Cache: - Each worker maintains local KV cache in its GPU memory - No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL - Non-blocking transfers allow GPU forward passes to continue during KV transfer ```mermaid %%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%% graph TD %% Top Layer - Client & Frontend Client["HTTP Client"] Frontend["Frontend
OpenAI Compatible Server
Port 8000
"] S1[["1 REQUEST"]] S2[["2 PREPROCESS"]] %% Router Layer PrefillRouter["PrefillRouter
Orchestrates Disaggregated Serving"] S3[["3 ROUTE TO PREFILL"]] %% Infrastructure subgraph INF["Infrastructure Layer"] Discovery[("Discovery
Service Registry
(ETCD or K8s)
")] NATS[("NATS
KV Events
(Optional)
")] Planner["Planner
Auto-scaling"] end %% Worker Layer subgraph WL["Worker Layer"] %% Prefill Worker PrefillWorker["Prefill Worker
Computes KV Cache"] S4[["4 PREFILL"]] S5[["5 RETURN METADATA"]] %% Decode Worker DecodeWorker["Decode Worker
Token Generation"] S6[["6 ROUTE TO DECODE"]] S7[["7 KV TRANSFER"]] S8[["8 DECODE"]] S9[["9 RESPONSE"]] %% KV Cache PrefillKVCache[("Prefill KV Cache
GPU VRAM")] DecodeKVCache[("Decode KV Cache
GPU VRAM")] end %% Main Request Flow (Blue) Client --> S1 S1 -->|HTTP API Call| Frontend Frontend --> S2 S2 -->|Tokenize & Validate| PrefillRouter PrefillRouter --> S3 S3 -->|Select Prefill Worker| PrefillWorker %% Prefill Flow (Green) PrefillWorker --> S4 S4 -->|Compute KV Cache| PrefillKVCache PrefillWorker --> S5 S5 -->|disaggregated_params| PrefillRouter %% Decode Routing Flow (Orange) PrefillRouter --> S6 S6 -->|Inject Transfer Metadata| DecodeWorker DecodeWorker --> S7 S7 -->|NIXL GPU-to-GPU| PrefillKVCache PrefillKVCache -.->|Direct Transfer| DecodeKVCache %% Completion Flow (Purple) DecodeWorker --> S8 S8 -->|Generate Tokens| DecodeKVCache DecodeWorker --> S9 S9 -->|Stream Tokens| Frontend Frontend -->|HTTP Response| Client %% Infrastructure Connections Frontend -.->|Service Discovery| Discovery PrefillRouter -.->|Worker Discovery| Discovery PrefillWorker -.->|Register| Discovery DecodeWorker -.->|Register| Discovery Planner -.->|Service Discovery| Discovery %% NATS for KV events (optional) PrefillWorker -.->|KV Events| NATS DecodeWorker -.->|KV Events| NATS %% Planning Connections Frontend -.->|Metrics| Planner Planner -.->|Auto-scaling| PrefillWorker Planner -.->|Auto-scaling| DecodeWorker %% Styling classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px classDef router fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px classDef prefillWorker fill:#e8f5e9,stroke:#388E3C,stroke-width:3px classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px classDef discovery fill:#fff9c4,stroke:#F9A825,stroke-width:3px classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px class Client client class Frontend frontend class PrefillRouter router class 
DecodeWorker worker class PrefillWorker prefillWorker class Planner planner class PrefillKVCache,DecodeKVCache storage class Discovery discovery class NATS nats class INF infraLayer class WL workerLayer %% Flow Colors %% Main Request Flow - Blue linkStyle 0,1,2,3,4,5 stroke:#1565C0,stroke-width:4px %% Prefill Flow - Green linkStyle 6,7,8,9 stroke:#2E7D32,stroke-width:4px %% Decode Routing Flow - Orange linkStyle 10,11,12,13,14 stroke:#E65100,stroke-width:4px %% Completion Flow - Purple linkStyle 15,16,17,18,19 stroke:#6A1B9A,stroke-width:4px %% Infrastructure - Gray dotted linkStyle 20,21,22,23,24,25,26,27,28,29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8 ``` # Disaggregated Serving

The prefill and decode phases of an LLM request have different compute characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows better hardware allocation, improved scalability, and higher overall performance. For example, using a larger tensor-parallel (TP) size for the memory-bound decode phase and a smaller TP size for the compute-bound prefill phase lets both phases run efficiently. In addition, for long-context requests, moving prefill to dedicated prefill engines lets ongoing decode requests proceed without being blocked by long prefills.

Disaggregated execution of a request has three main steps:

1. The prefill engine computes the prefill phase and generates the KV cache.
2. The prefill engine transfers the KV cache to the decode engine.
3. The decode engine computes the decode phase.

The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.

## Efficient KV Transfer

The key to high-performance disaggregation is efficient KV transfer. Dynamo uses NIXL to transfer the KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The transfer is non-blocking, so GPU forward passes can continue serving other requests while it is in flight.
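The three steps and the non-blocking transfer can be illustrated with a toy `asyncio` model. This is only a sketch under stated assumptions: `prefill`, `transfer`, `decode`, and `other_forward_pass` are hypothetical stand-ins, not Dynamo or NIXL APIs, and the sleep merely simulates a copy in flight.

```python
import asyncio


async def prefill(prompt_tokens):
    # Step 1 (stand-in): build a fake "KV cache" with one entry per prompt token.
    return [f"kv({t})" for t in prompt_tokens]


async def transfer(kv_cache):
    # Step 2 (stand-in): awaiting here yields control, so work for other
    # requests can run while the simulated copy is in flight.
    await asyncio.sleep(0.01)
    return list(kv_cache)


async def decode(kv_cache, max_new_tokens):
    # Step 3 (stand-in): emit placeholder tokens from the transferred cache.
    return [f"tok{i}" for i in range(max_new_tokens)]


async def serve(prompt_tokens, max_new_tokens, log):
    kv = await prefill(prompt_tokens)
    log.append("prefill")
    kv_on_decode = await transfer(kv)
    log.append("transfer")
    tokens = await decode(kv_on_decode, max_new_tokens)
    log.append("decode")
    return tokens


async def other_forward_pass(log):
    # Hypothetical work for a different request, scheduled concurrently.
    await asyncio.sleep(0)
    log.append("other")


async def main():
    log = []
    tokens, _ = await asyncio.gather(serve([1, 2, 3], 2, log), other_forward_pass(log))
    return tokens, log
```

Running `asyncio.run(main())` shows the unrelated work completing before the transfer finishes, which is the property the non-blocking design aims for.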
### Router Orchestration The disaggregated serving flow is orchestrated by the `PrefillRouter`: ```mermaid sequenceDiagram participant Client participant Frontend participant Router as PrefillRouter participant Prefill as Prefill Worker participant Decode as Decode Worker Client->>Frontend: Request Frontend->>Router: Preprocessed Request Router->>Router: Select prefill worker Router->>Prefill: Prefill request Prefill->>Prefill: Compute KV cache Prefill-->>Router: disaggregated_params Router->>Router: Select decode worker Router->>Decode: Decode request + transfer metadata Decode<<->>Prefill: KV transfer (NIXL) Decode->>Decode: Generate tokens Decode-->>Frontend: Stream tokens Frontend-->>Client: Response ``` 1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing. 2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata. 3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker. 4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.). ### Backend-Specific Transfer Metadata The transfer metadata format varies by backend: - **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel. - **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. 
Prefill runs synchronously; decode waits for prefill to complete before proceeding.
- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.

## Runtime-Reconfigurable xPyD

Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:

- **Add worker**: The worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
- **Remove worker**: The worker drains active requests and deregisters from discovery.

The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.

# Distributed Runtime

## Overview

Dynamo's `DistributedRuntime` is the core infrastructure that enables distributed communication and coordination between Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (for example, the Python bindings are in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS).

`DistributedRuntime` follows a hierarchical structure:

- `DistributedRuntime`: The highest-level object, which exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
- `Namespace`: A logical grouping of components that isolates different model deployments from one another.
- `Component`: A discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: A network-accessible service that provides a specific function.
In theory, each `DistributedRuntime` can hold multiple `Namespace`s as long as their names are unique (the same applies to `Component`s within a `Namespace` and `Endpoint`s within a `Component`). In practice, each Dynamo component is typically deployed in its own process and therefore has its own `DistributedRuntime` object. The components share the same namespace so that they can discover each other.

For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:

- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (SGLang, TensorRT-LLM, vLLM).

Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under that namespace, each has its own `Component`:

- `Frontend` uses the `make_engine` function, which handles HTTP serving, request preprocessing, and worker discovery automatically.
- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder`, depending on their role.
- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`.

Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, and their `Endpoint`s are obtained by path. In Python, use `runtime.endpoint("namespace.component.endpoint")` (e.g., `runtime.endpoint("dynamo.backend.generate")`).
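The hierarchy and the dotted endpoint path described above can be modelled with a few plain dataclasses. This is a minimal sketch for intuition only: `ToyRuntime`, `ToyNamespace`, `ToyComponent`, and `ToyEndpoint` are hypothetical names, and the assumption that a path always has exactly three dot-separated parts comes from the `namespace.component.endpoint` shape shown above, not from the real bindings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEndpoint:
    name: str


@dataclass
class ToyComponent:
    name: str
    endpoints: dict = field(default_factory=dict)


@dataclass
class ToyNamespace:
    name: str
    components: dict = field(default_factory=dict)


@dataclass
class ToyRuntime:
    """Illustrative stand-in for the hierarchy; not the real DistributedRuntime."""

    namespaces: dict = field(default_factory=dict)

    def endpoint(self, path: str) -> ToyEndpoint:
        # Assumed path shape: "namespace.component.endpoint".
        parts = path.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected 'namespace.component.endpoint', got {path!r}")
        ns_name, comp_name, ep_name = parts
        # Names are unique per level, so each level is a dict keyed by name.
        ns = self.namespaces.setdefault(ns_name, ToyNamespace(ns_name))
        comp = ns.components.setdefault(comp_name, ToyComponent(comp_name))
        return comp.endpoints.setdefault(ep_name, ToyEndpoint(ep_name))
```

Resolving the same path twice returns the same object, which mirrors the uniqueness rule at each level of the hierarchy.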
## Initialization In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment. ```{caution} The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same. ``` ### Service Discovery Backends The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`: - **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=etcd`): Uses etcd for service discovery. **This is the default** for all deployments unless explicitly overridden. Other KV store backends (`file`, `mem`) are also available. - **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.** > **Note:** There is no automatic detection of the deployment environment. The runtime defaults to `etcd`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments. When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed. ### Runtime Initialization - `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend: - **Kubernetes mode**: Uses K8s API for service registration via DynamoWorkerMetadata CRD. No external dependencies required. - **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle. 
- **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via `--no-router-kv-events`, which enables prediction-based routing without event persistence. - **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via `DYN_REQUEST_PLANE` environment variable. - `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`. - `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints. - `Endpoint`: When an Endpoint object is created and started, it performs registration based on the discovery backend: - **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery. - **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`. ## Calling Endpoints Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes: - **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates. - **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The watcher continuously updates the `Client` with information about available `Endpoint`s. 
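The KV Store naming scheme and prefix watch described above can be sketched with a plain dictionary standing in for etcd. The helper names (`endpoint_prefix`, `endpoint_key`, `instances`) and the second lease id are hypothetical; only the key layout `/services/{namespace}/{component}/{endpoint}-{lease_id}` is taken from the text.

```python
def endpoint_prefix(namespace: str, component: str, endpoint: str) -> str:
    # Prefix a client would watch for one logical endpoint.
    return f"/services/{namespace}/{component}/{endpoint}"


def endpoint_key(namespace: str, component: str, endpoint: str, lease_id: str) -> str:
    # Per-instance key: the shared prefix plus the worker's primary lease id.
    return f"{endpoint_prefix(namespace, component, endpoint)}-{lease_id}"


def instances(store: dict, prefix: str) -> list:
    # One-shot stand-in for a watcher: list the keys currently under the prefix.
    return sorted(key for key in store if key.startswith(prefix))


# A plain dict plays the role of etcd; the lease ids are made up for illustration.
store = {}
for lease_id in ("694d98147d54be25", "694d98147d54be26"):
    key = endpoint_key("sglang-disagg", "prefill", "generate", lease_id)
    store[key] = {"lease": lease_id}

prefix = endpoint_prefix("sglang-disagg", "prefill", "generate")
before = instances(store, prefix)

# Simulate one lease expiring: its key disappears, leaving a single instance.
del store[endpoint_key("sglang-disagg", "prefill", "generate", "694d98147d54be25")]
after = instances(store, prefix)
```

Two workers of the same type share the namespace, component, and endpoint names, so the lease id suffix is the only part of the key that tells them apart.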
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies: - `random`: randomly select an endpoint to hit - `round_robin`: select endpoints in round-robin order - `direct`: direct the request to a specific endpoint by specifying the instance ID After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport: - **TCP** (default): Direct TCP connection with connection pooling - **HTTP**: HTTP/2-based transport - **NATS**: Message broker-based transport (legacy) ## Examples We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`: - Rust: `/lib/runtime/examples/` - Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `components/src/dynamo` for full implementation details. # Discovery Plane Dynamo's service discovery layer lets components find each other at runtime. Workers register their endpoints when they start, and frontends discover them automatically. The discovery backend adapts to the deployment environment. 
![Discovery plane architecture showing Kubernetes and etcd backends](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/5d9ddfc0c37d57b34711f91082eb00ba984f58fd0811f8db0b90f5aea99bb1a5/pages-v1.2.0/assets/img/discovery-plane.svg) ## Discovery Backends | Deployment | Discovery Backend | Configuration | |------------|-------------------|---------------| | **Kubernetes** (with Dynamo operator) | Native K8s (CRDs, EndpointSlices) | Operator sets `DYN_DISCOVERY_BACKEND=kubernetes` | | **Bare metal / Local** (default) | etcd | `ETCD_ENDPOINTS` (defaults to `http://localhost:2379`) | > **Note:** The runtime always defaults to etcd. Kubernetes discovery must be explicitly enabled -- the Dynamo operator handles this automatically. ## Kubernetes Discovery When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd. ### How It Works 1. Workers register their endpoints by creating **DynamoWorkerMetadata** custom resources. 2. **EndpointSlices** signal pod readiness to the system. 3. Components watch for CRD changes to discover available workers. ### Benefits - No external etcd cluster required. - Native integration with Kubernetes pod lifecycle. - Automatic cleanup when pods terminate. - Works with standard Kubernetes RBAC. ### Environment Variables (Injected by Operator) | Variable | Description | |----------|-------------| | `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` | | `POD_NAME` | Current pod name | | `POD_NAMESPACE` | Current namespace | | `POD_UID` | Pod unique identifier | ## etcd Discovery (Default) When `DYN_DISCOVERY_BACKEND` is not set (or set to `etcd`), etcd is used for service discovery. 
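The selection rule can be sketched as a small function over an environment mapping. This is an illustration of the documented defaults, not the runtime's actual code: `resolve_discovery` and `KNOWN_BACKENDS` are hypothetical names, and raising on an unknown value is an assumption, since the text does not say how the runtime reacts to one.

```python
KNOWN_BACKENDS = ("etcd", "kubernetes", "file", "mem")


def resolve_discovery(env: dict) -> dict:
    # The runtime defaults to etcd; Kubernetes discovery must be set explicitly.
    backend = env.get("DYN_DISCOVERY_BACKEND", "etcd")
    if backend not in KNOWN_BACKENDS:
        # Assumption: reject unknown values rather than silently falling back.
        raise ValueError(f"unknown discovery backend: {backend!r}")

    config = {"backend": backend}
    if backend == "etcd":
        # ETCD_ENDPOINTS is a comma-separated list with a localhost default.
        raw = env.get("ETCD_ENDPOINTS", "http://localhost:2379")
        config["etcd_endpoints"] = [u.strip() for u in raw.split(",") if u.strip()]
    return config
```

In a real process the mapping would be `os.environ`; passing a plain dict keeps the sketch easy to exercise.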
### Connection Configuration | Variable | Description | Default | |----------|-------------|---------| | `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` | | `ETCD_AUTH_USERNAME` | Basic auth username | None | | `ETCD_AUTH_PASSWORD` | Basic auth password | None | | `ETCD_AUTH_CA` | CA certificate path (TLS) | None | | `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None | | `ETCD_AUTH_CLIENT_KEY` | Client key path | None | Example: ```bash export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379 ``` ### Service Registration Workers register their endpoints in etcd with a key hierarchy: ``` /services/{namespace}/{component}/{endpoint}/{instance_id} ``` For example: ``` /services/vllm-agg/backend/generate/694d98147d54be25 ``` Frontends and routers discover available workers by watching the relevant prefix and receiving real-time updates when workers join or leave. ### Lease-Based Cleanup Each runtime maintains a lease with etcd (default TTL: 10 seconds). If a worker crashes or loses connectivity: ![Lease lifecycle showing DistributedRuntime keep-alive heartbeat to etcd](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/07e289e99febab895d6a0542ff0345fccd5e0f1466bd20584ead9167946030e3/pages-v1.2.0/assets/img/discovery-plane-lease.svg) 1. Keep-alive heartbeats stop. 2. The lease expires after the TTL. 3. All registered endpoints are automatically deleted. 4. Clients receive removal events and reroute traffic to healthy workers. This ensures stale endpoints are cleaned up without manual intervention. ## KV Store Dynamo provides a KV store abstraction for storing metadata (endpoint instances, model deployment cards, event channels). 
Multiple backends are supported: | Backend | Use Case | |---------|----------| | etcd | Production deployments | | Memory | Testing and development | | NATS | NATS-only deployments | | File | Local persistence | ## Operational Guidance ### Use Kubernetes Discovery on K8s The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required. ### Deploy an etcd Cluster for Bare Metal For bare-metal production deployments, deploy a 3-node etcd cluster for high availability. ### Tune Lease TTLs Balance between failure detection speed and overhead: - **Short TTL (5s)** -- Faster failure detection, more keep-alive traffic. - **Long TTL (30s)** -- Less overhead, slower detection. The default (10s) is a reasonable starting point for most deployments. ## Related Documentation - [Event Plane](/dynamo/design-docs/communication-planes/event-plane) -- Pub/sub for KV cache events and worker metrics - [Distributed Runtime](/dynamo/design-docs/distributed-runtime) -- Runtime architecture - [Request Plane](/dynamo/design-docs/communication-planes/request-plane) -- Request transport configuration - [Fault Tolerance](/dynamo/user-guides/fault-tolerance) -- Failure handling # Request Plane ## Overview Dynamo supports two transport mechanisms for its request plane (the communication layer between services): - **TCP** (default): Direct TCP connection for optimal performance - **NATS**: Message broker-based request plane This guide explains how to configure and use request plane in your Dynamo deployment. ## What is a Request Plane? The request plane is the transport layer that handles communication between Dynamo services (e.g., frontend to backend, worker to worker). 
Different request planes offer different trade-offs: | Request Plane | Suitable For | Characteristics | |--------------|----------|-----------------| | **NATS** | Deployments that choose brokered request transport | Requires NATS infrastructure, provides pub/sub patterns, highest flexibility | | **TCP** | Low-latency direct communication | Direct connections, minimal overhead | ## Request Plane vs KV Event Plane Dynamo has **two independent communication planes**: - **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, or `nats`. - **KV event plane** (**`DYN_EVENT_PLANE`**): how **KV cache events** (and optional router replica sync) are distributed for KV-aware routing, via `nats` or `zmq`. **Note:** If you are using `tcp` request plane with KV events enabled on the router (the default router-side setting), the configured event plane is initialized independently. NATS-based event transport uses `NATS_SERVER` (default `nats://localhost:4222`), while ZMQ avoids external NATS infrastructure. SGLang requires explicit `--kv-events-config` and TRT-LLM requires `--publish-events-and-metrics` to publish events. For vLLM, KV events are currently auto-configured when prefix caching is active (deprecated — use `--kv-events-config` explicitly to prepare for a future release where all backends will default to off). To disable the router's KV event listener, use `--no-router-kv-events` on the frontend. Because they are independent, you can mix them. For example, a deployment with TCP request plane can use different KV event planes: - **JetStream KV events**: requests use TCP, KV routing still uses NATS JetStream + object store for persistence. - **NATS Core KV events (local indexer)**: requests use TCP, KV events use NATS Core pub/sub and persistence lives on workers. - **no KV events**: requests use TCP and KV routing runs without events (no NATS required, but no event-backed persistence). 
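Because the two planes are independent, whether a deployment needs a NATS server follows from a simple combination of settings. The function below is a hedged sketch of that reasoning only: `needs_nats` is a hypothetical helper, the event plane is passed in explicitly rather than derived from defaults, and it does not model JetStream versus NATS Core.

```python
def needs_nats(request_plane: str = "tcp",
               event_plane: str = "nats",
               router_kv_events: bool = True) -> bool:
    # The request plane value is documented as case-insensitive.
    if request_plane.lower() == "nats":
        return True
    # With a TCP request plane, NATS is only needed when KV events are
    # consumed by the router and the event plane itself is NATS.
    return router_kv_events and event_plane == "nats"
```

For instance, `needs_nats("tcp", "zmq")` and `needs_nats("tcp", "nats", router_kv_events=False)` both evaluate to `False`, matching the ZMQ and no-KV-events combinations listed above.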
## Configuration

### Environment Variable

Set the request plane mode using the `DYN_REQUEST_PLANE` environment variable:

```bash
export DYN_REQUEST_PLANE=<mode>
```

Where `<mode>` is one of:

- `tcp` (default)
- `nats`

The value is case-insensitive.

### Default Behavior

If `DYN_REQUEST_PLANE` is not set or contains an invalid value, Dynamo defaults to `tcp`.

## Usage Examples

### Using TCP (Default)

TCP is the default request plane and provides direct, low-latency communication between services.

**Configuration:**

```bash
# TCP is the default, so no need to set DYN_REQUEST_PLANE explicitly
# But you can explicitly set it if desired:
export DYN_REQUEST_PLANE=tcp

# Optional: Configure TCP server host and port
export DYN_TCP_RPC_HOST=0.0.0.0  # Default host
# export DYN_TCP_RPC_PORT=9999   # Optional: specify a fixed port

# Run your Dynamo service
DYN_REQUEST_PLANE=tcp python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```

**Note:** By default, TCP uses an OS-assigned free port (port 0). This is ideal for environments where multiple services may run on the same machine or when you want to avoid port conflicts. If you need a specific port (e.g., for firewall rules), set `DYN_TCP_RPC_PORT` explicitly.

**When to use TCP:**

- Simple deployments with direct service-to-service communication (e.g., frontend to backend)
- Minimal infrastructure requirements (NATS is initialized when the router listens for KV events; disable with `--no-router-kv-events`)
- Low-latency requirements

**TCP Configuration Options:**

Additional TCP-specific environment variables:

- `DYN_TCP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_TCP_RPC_PORT`: Server port. If not set, the OS assigns a free port automatically (recommended for most deployments). Set explicitly only if you need a specific port for firewall rules.
- `DYN_TCP_MAX_MESSAGE_SIZE`: Maximum message size for TCP client (default: 32MB)
- `DYN_TCP_SHRINK_MESSAGE_SIZE`: Threshold for shrinking the zero-copy decoder buffer back to its initial size after processing large messages (default: 8MB, max: `DYN_TCP_MAX_MESSAGE_SIZE`)
- `DYN_TCP_REQUEST_TIMEOUT`: Request timeout for TCP client (default: 10 seconds)
- `DYN_TCP_POOL_SIZE`: Connection pool size for TCP client (default: 50)
- `DYN_TCP_CONNECT_TIMEOUT`: Connect timeout for TCP client (default: 3 seconds)
- `DYN_TCP_CHANNEL_BUFFER`: Request channel buffer size for TCP client (default: 100)

### Using NATS

NATS provides durable JetStream messaging for the request plane and can also be used for KV events (and router replica sync).

**Prerequisites:**

- NATS server must be running and accessible
- Configure NATS connection via standard Dynamo NATS environment variables

```bash
# Explicitly set to NATS
export DYN_REQUEST_PLANE=nats

# Run your Dynamo service
DYN_REQUEST_PLANE=nats python -m dynamo.frontend --http-port=8000 &
DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
```

**When to use NATS:**

- Production deployments with service discovery
- Event-backed KV-aware routing when using NATS as the event transport. Note: ZMQ event transport and approximate mode (`--no-router-kv-events`) both provide KV routing without NATS, with approximate mode using predicted cache state.
- Need for message replay and persistence features

**Limitations:**

- NATS does not support payloads beyond 16MB (use TCP for larger payloads)

## Complete Example

See [`examples/backends/vllm/launch/agg_request_planes.sh`](https://github.com/ai-dynamo/dynamo/tree/v1.2.0/examples/backends/vllm/launch/agg_request_planes.sh) for a complete working example that launches a Dynamo deployment with either the TCP or the NATS request plane.
## Real-World Example The Dynamo repository includes a complete example demonstrating both request planes: **Location:** `examples/backends/vllm/launch/agg_request_planes.sh` ```bash cd examples/backends/vllm/launch # Run with TCP ./agg_request_planes.sh --tcp # Run with NATS ./agg_request_planes.sh --nats ``` ## Architecture Details ### Network Manager The request plane implementation is centralized in the Network Manager (`lib/runtime/src/pipeline/network/manager.rs`), which: 1. Reads the `DYN_REQUEST_PLANE` environment variable at startup 2. Creates the appropriate server and client implementations 3. Provides a transport-agnostic interface to the rest of the codebase 4. Manages all network configuration and lifecycle ### Transport Abstraction All request plane implementations conform to common trait interfaces: - `RequestPlaneServer`: Server-side interface for receiving requests - `RequestPlaneClient`: Client-side interface for sending requests This abstraction means your application code doesn't need to change when switching request planes. ### Configuration Loading Request plane configuration is loaded from environment variables at startup and cached globally. The configuration hierarchy is: 1. **Mode Selection**: `DYN_REQUEST_PLANE` (defaults to `tcp`) 2. **Transport-Specific Config**: Mode-specific environment variables (e.g., `DYN_TCP_*`) ## Migration Guide ### From NATS to TCP 1. Stop your Dynamo services 2. Set environment variable `DYN_REQUEST_PLANE=tcp` 3. Optionally configure TCP-specific settings (e.g., `DYN_TCP_RPC_HOST`). Note: `DYN_TCP_RPC_PORT` is optional; if not set, an OS-assigned free port is used automatically. 4. 
Restart your services

### Testing the Migration

After switching request planes, verify your deployment:

```bash
# Test with a simple request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## Troubleshooting

### Issue: Services Can't Communicate

**Symptoms:** Requests time out or fail to reach the backend

**Solutions:**

- Verify all services use the same `DYN_REQUEST_PLANE` setting
- Check that server ports are not blocked by Kubernetes network policies or firewalls
- For TCP: Ensure host/port configurations are correct and accessible
- For NATS: Verify the NATS server is running and accessible

### Issue: "Invalid request plane mode" Error

**Symptoms:** Service fails to start with a configuration error

**Solutions:**

- Check `DYN_REQUEST_PLANE` spelling (valid values: `nats`, `tcp`)
- Value is case-insensitive but must be one of the two options
- If not set, defaults to `tcp`

### Issue: Port Conflicts

**Symptoms:** Server fails to start due to "address already in use"

**Solutions:**

- TCP: By default, TCP uses an OS-assigned free port, so port conflicts should be rare. If you explicitly set `DYN_TCP_RPC_PORT` to a specific port and get conflicts, either change the port or remove the setting to use automatic port assignment.

## Performance Considerations

### Latency

- **TCP**: Lowest latency due to direct connections and binary serialization
- **NATS**: Moderate latency due to NATS JetStream persistence

### Resource Usage

- **TCP**: Minimal request-plane infrastructure. KV events use the configured event plane; NATS is needed only when `DYN_EVENT_PLANE=nats`, and router-side event consumption can be disabled with `--no-router-kv-events`.
- **NATS**: Requires a running NATS server (additional memory/CPU)

# Event Plane

The event plane provides Dynamo with a pub/sub layer for near real-time event exchange between components.
It delivers KV cache updates, worker load metrics, and sequence tracking events, enabling features like KV-aware routing and disaggregated serving.

## When Is the Event Plane Used?

Key use cases:

- **KV cache events** -- Workers publish cache state so the router can make cache-aware scheduling decisions.
- **Worker load metrics** -- Workers report utilization so the router can balance load.
- **Sequence tracking** -- Coordinates active sequences across router replicas for fault-tolerant routing.

![Event plane architecture showing NATS and ZMQ transport options connecting Frontend, Planner, and Worker](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/712e7d02a897573d522012e7d5a53df187b4af405198970cb526cc0ce89947e0/pages-v1.2.0/assets/img/event-plane-transport.svg)

## Choosing a Transport

The event plane supports two transports:

| | NATS (default with `etcd` or `kubernetes` discovery) | ZMQ (default with `file` or `mem` discovery) |
|---|---|---|
| **External infrastructure** | Requires a NATS server | None (peer-to-peer) |
| **Setup complexity** | Simple -- point at a NATS server | Automatic -- workers bind sockets and register via discovery |
| **Best for** | Large-scale deployments | Low operational overhead |

## Configuration

### Transport Selection

Set the `DYN_EVENT_PLANE` environment variable to choose a transport:

```bash
# Use NATS (already the default with the etcd or kubernetes discovery backend)
export DYN_EVENT_PLANE=nats

# Use ZMQ (already the default with the file or mem discovery backend)
export DYN_EVENT_PLANE=zmq
```

Python components also accept this as a CLI flag:

```bash
# SGLang backend
python3 -m dynamo.sglang --event-plane zmq --model Qwen/Qwen3-0.6B

# vLLM backend
python3 -m dynamo.vllm --event-plane zmq --model Qwen/Qwen3-0.6B
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DYN_EVENT_PLANE` | Transport: `nats` or `zmq` | Context-dependent (see below) |
| `NATS_SERVER` | NATS server URL (NATS transport only) | `nats://localhost:4222` |

When `DYN_EVENT_PLANE` is not set, the default is chosen based on the discovery backend:

- 
`--discovery-backend file` or `mem` (local backends): defaults to **zmq** — no external services required. - `--discovery-backend etcd` or `kubernetes` (distributed backends): defaults to **nats**. Set `DYN_EVENT_PLANE` explicitly to override this automatic selection. ## NATS Transport When using NATS (`DYN_EVENT_PLANE=nats`, or unset with a distributed backend): - Requires a running NATS server. Set `NATS_SERVER` if it is not on `localhost:4222`. - Events are published to NATS subjects scoped by namespace and component. - Built-in reconnection and message buffering during brief disconnections. Example setup: ```bash export NATS_SERVER=nats://nats-server:4222 export DYN_EVENT_PLANE=nats # Start workers -- explicitly enable KV event publishing python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B \ --kv-events-config '{"publisher":"nats","topic":"kv-events","enable_kv_cache_events":true}' # Start frontend -- it subscribes to events from NATS automatically python3 -m dynamo.frontend --router-mode kv ``` ## ZMQ Transport When using ZMQ (`DYN_EVENT_PLANE=zmq`): - No external server required. Each worker binds a ZMQ PUB socket and advertises its address through the discovery system. - Subscribers automatically discover and connect to all active publishers. - When publishers come and go (e.g., workers scaling up/down), subscribers dynamically adjust their connections. 
Example setup:

```bash
export DYN_EVENT_PLANE=zmq

# Start workers -- each binds a ZMQ socket, registers with discovery
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B \
  --kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'

# Start frontend -- discovers workers and connects directly
python3 -m dynamo.frontend --router-mode kv
```

## Disabling the Event Plane

If you do not need worker-reported KV events, you can disable the event plane entirely while keeping the KV router:

```bash
python3 -m dynamo.frontend --router-mode kv --no-router-kv-events
```

With `--no-router-kv-events`:

- The router falls back to prediction-based cache-aware routing (it estimates cache state from its own routing decisions).
- No NATS server or ZMQ sockets are needed.
- TTL-based expiration keeps predicted state from growing stale.

## Deployment Modes

### Bare Metal / Local

Both transports are supported; pick one:

```bash
# NATS (requires nats-server running)
export NATS_SERVER=nats://localhost:4222

# OR ZMQ (no extra infrastructure)
export DYN_EVENT_PLANE=zmq
```

### Kubernetes (with Dynamo Operator)

The operator can inject `DYN_EVENT_PLANE` into pods. The same transport options apply. If using NATS, deploy a NATS server in the cluster and set `NATS_SERVER` accordingly.

## Related Documentation

- [Discovery Plane](/dynamo/design-docs/communication-planes/discovery-plane) -- Service discovery and coordination (etcd, Kubernetes)
- [Distributed Runtime](/dynamo/design-docs/distributed-runtime) -- Runtime architecture
- [Request Plane](/dynamo/design-docs/communication-planes/request-plane) -- Request transport configuration
- [Fault Tolerance](/dynamo/user-guides/fault-tolerance) -- Failure handling

# Router Design

This document describes the internal architecture of the Dynamo KV Router, including block tracking mechanisms, the KV cache optimization system, event handling, and transport modes.

## KV Router Architecture

The KV Router tracks two key metrics for each worker:

1. 
**Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The prefill work that must be computed from scratch on a worker, expressed in blocks and calculated as:
   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
   - Potential prefill blocks = New prefill tokens / Block size

### Block Tracking Mechanisms

The router maintains block information through two complementary systems:

- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
  - Incremented when adding a new request
  - Updated during token generation
  - Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.

## KV Cache Router

Today's leading Large Language Models (LLMs) are autoregressive and based on the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization is to cache the keys and values that have already been computed and reuse them for future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).

### KV Cache Routing and Load Balancing

```mermaid
graph TD
    T[Tokens] --> R[KV Aware Router]
    R -.-> W1["Worker 1
Cached: 2 blocks
Prefill: 8 blks
Decode: 10 blks"] R ==>|Selected| W2["Worker 2
Cached: 5 blocks
Prefill: 5 blks
Decode: 5 blks"] R -.-> W3["Worker 3
Cached: 8 blocks
Prefill: 2 blks
Decode: 9 blks"] style T fill:#fff3e0,stroke:#333,color:#333 style R fill:#2e8b57,stroke:#333,color:#fff style W1 fill:#f3e5f5,stroke:#333,color:#333 style W2 fill:#c8e6c9,stroke:#333,color:#333 style W3 fill:#f3e5f5,stroke:#333,color:#333 linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px ``` The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions. #### Cost Calculation 1. **Prefill blocks**: Calculated from active prompt-side token load plus the incoming request's input tokens, divided by the block size. The system updates active prompt load when the first output token signals prefill completion. 2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed. 3. **Cost formula**: ```text adjusted_prefill_blocks = max( prefill_blocks - overlap_score_credit * device_overlap_blocks - host_cache_hit_weight * host_overlap_blocks - disk_cache_hit_weight * disk_overlap_blocks - shared_cache_multiplier * shared_beyond_blocks, 0, ) cost = prefill_load_scale * adjusted_prefill_blocks + decode_blocks ``` - Lower costs indicate better routing choices - `overlap_score_credit` is the device-local prefix-overlap credit multiplier, from 0.0 to 1.0 - `prefill_load_scale` controls adjusted prompt-side load relative to decode blocks - Higher overlap credits favor cache reuse (improving TTFT), while lower credits prioritize even load distribution (improving ITL) #### Worker Selection The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution. 
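The cost formula and selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the router's implementation: the `WorkerLoad` class, the function names, the default weight values, and in particular the min-max normalization applied before the softmax are assumptions made for this example. The worker numbers are taken from the routing diagram above (raw prefill is cached plus prefill blocks).

```python
import math
import random
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """Per-worker inputs to the cost formula (hypothetical container)."""
    prefill_blocks: float               # prompt-side load incl. the incoming request
    decode_blocks: float                # estimated decode load
    device_overlap_blocks: float = 0.0
    host_overlap_blocks: float = 0.0
    disk_overlap_blocks: float = 0.0
    shared_beyond_blocks: float = 0.0


def routing_cost(w, overlap_score_credit=1.0, prefill_load_scale=1.0,
                 host_cache_hit_weight=0.0, disk_cache_hit_weight=0.0,
                 shared_cache_multiplier=0.0):
    """Evaluate the cost formula; the default weights here are illustrative."""
    adjusted_prefill_blocks = max(
        w.prefill_blocks
        - overlap_score_credit * w.device_overlap_blocks
        - host_cache_hit_weight * w.host_overlap_blocks
        - disk_cache_hit_weight * w.disk_overlap_blocks
        - shared_cache_multiplier * w.shared_beyond_blocks,
        0.0,
    )
    return prefill_load_scale * adjusted_prefill_blocks + w.decode_blocks


def select_worker(costs, router_temperature=0.0, rng=random):
    """Pick the lowest-cost worker, or sample via softmax when temperature > 0."""
    if router_temperature == 0.0:
        return min(costs, key=costs.get)
    # Assumption: costs are min-max normalized to [0, 1] before the softmax.
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0
    logits = {w: -((c - lo) / span) / router_temperature for w, c in costs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = [math.exp(logits[w] - peak) for w in costs]
    return rng.choices(list(costs), weights=weights, k=1)[0]


workers = {
    "worker-1": WorkerLoad(prefill_blocks=10, decode_blocks=10, device_overlap_blocks=2),
    "worker-2": WorkerLoad(prefill_blocks=10, decode_blocks=5, device_overlap_blocks=5),
    "worker-3": WorkerLoad(prefill_blocks=10, decode_blocks=9, device_overlap_blocks=8),
}
costs = {name: routing_cost(load) for name, load in workers.items()}
best = select_worker(costs)
```

With a temperature of zero the choice is deterministic; a non-zero temperature makes lower-cost workers more likely without making them certain, which spreads load across near-equal candidates.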
Example calculation with `overlap_score_credit = 1.0` and `prefill_load_scale = 1.0` (no host, disk, or shared-cache credit):

- Worker 1: raw prefill 10 blocks, device overlap 2 blocks, decode 10 blocks => cost = 8 + 10 = 18
- **Worker 2: raw prefill 10 blocks, device overlap 5 blocks, decode 5 blocks => cost = 5 + 5 = 10** (selected: lowest cost)
- Worker 3: raw prefill 10 blocks, device overlap 8 blocks, decode 9 blocks => cost = 2 + 9 = 11

### KV Cache Optimizations

Every inference framework maintains a KV Cache for each worker. [vLLM](https://github.com/vllm-project/vllm) introduced [PagedAttention](https://arxiv.org/abs/2309.06180), which manages the KV Cache efficiently by splitting each request's cache into blocks. [SGLang](https://github.com/sgl-project/sglang) contributed [RadixAttention](https://arxiv.org/abs/2312.07104), which uses a prefix tree for efficient matching, insertion, and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.

Dynamo introduces a KVPublisher, which emits the KV Cache events that occur at each worker, and a KVIndexer, which tracks these events globally.

### KV Block Management Flow

To see how KV Cache management works on a single worker with KV Cache reuse turned on, and where the KVPublisher plugs in, walk through the KV Block management flow:

1. **Request tokenization**: The incoming prompt is converted into tokens
2. **Block partitioning**: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
3. **Block hashing**: Each block of tokens is hashed to create a unique identifier. When a LoRA adapter is active, the adapter name is incorporated into the hash so that blocks cached under different adapters produce distinct identifiers.
4. 
**Cache lookup**: - For each block, the system checks if a matching block already exists in the KV cache - If a match is found, the existing KV cache block is reused - If no match is found, the system proceeds to the next step 5. **Resource allocation**: - For blocks without matches, the system attempts to allocate new memory space - If sufficient memory is available, allocate memory space and proceed to step 7 - If memory is constrained, proceed to step 6 6. **Cache eviction** (when necessary): - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal - Selected blocks are evicted from the cache - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.** - Alternatively, some systems may offload less-frequently used blocks to CPU memory. 7. **KV computation**: - For new blocks, the model computes key and value tensors - These tensors are stored in the newly allocated cache blocks - **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**. Further details can be found for: [SGLang](https://lmsys.org/blog/2024-01-17-sglang/), [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/) and [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching). ## Events ### KVPublisher The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed. The two types of events are: - KV stored event - KV removed event The publisher can be initialized and used through Python bindings. ### Deterministic Event IDs Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. 
However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's built-in `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect. ### KVIndexer The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker. The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks. The KVIndexer supports two backend implementations, selected via `--router-event-threads`: - **Single-threaded RadixTree** (`--router-event-threads 1`): Events are processed in a dedicated single-threaded tokio runtime via channel-based dispatch. Also supports TTL-based expiration for `--no-router-kv-events` approximate mode. - **ConcurrentRadixTree** (default, `--router-event-threads N` where N > 1): A thread-safe radix tree with a pool of N worker threads for event processing and approximate routing-decision writes (default: 4). Uses sticky worker routing (events or synthetic approximate writes for the same worker always go to the same thread) to ensure per-worker serialization. Read operations (`find_matches`) execute concurrently with writes. ### Inter-Router Communication In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types: 1. **AddRequest**: Notifies other routers when a request is assigned to a worker. 
Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system. 2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens. 3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers. Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams. ## Event Transport Modes The router supports two event transport modes for KV cache state synchronization: - **NATS Core / Event Plane with Local Indexer (default)**: Fire-and-forget pub/sub where workers maintain local radix trees (enabled by default). Router rebuilds state by querying workers on startup. Lower latency, simpler setup. Works with both NATS Core and ZMQ event planes. - **JetStream** (`--durable-kv-events` on **both** frontend **and** workers): Persistent event stream with durable consumers. State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency. **Important:** Both the frontend and all workers must specify `--durable-kv-events` for JetStream mode to work correctly. ### JetStream Mode (Opt-in) KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts. 
- **Best for**: Production deployments requiring durability and multi-replica router consistency - **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees - **Enable with**: `--durable-kv-events` flag on **both** the frontend **and** all workers **Both frontend and workers must specify `--durable-kv-events`** for JetStream mode to work correctly. The frontend uses this flag to consume from JetStream, while workers use it to publish to JetStream instead of the local indexer. ```mermaid graph TD subgraph Engines E1[Engine 1
KVPublisher] E2[Engine 2
KVPublisher] E3[Engine 3
KVPublisher] end subgraph "NATS JetStream" JS[(Persistent KV Events Stream
- Block created
- Block removed)] end subgraph "NATS Object Store" OS[(Radix Tree
State Snapshot)] end subgraph "Router Replicas" R1[Router 1
KVIndexer] R2[Router 2
KVIndexer] end E1 -->|Publish Events| JS E2 -->|Publish Events| JS E3 -->|Publish Events| JS JS -->|Consume as Durable Consumer| R1 JS -->|Consume as Durable Consumer| R2 JS -->|Periodic Snapshot| OS style JS fill:#e1f5fe,stroke:#333,color:#333 style OS fill:#e1f5fe,stroke:#333,color:#333 style E1 fill:#f3e5f5,stroke:#333,color:#333 style E2 fill:#f3e5f5,stroke:#333,color:#333 style E3 fill:#f3e5f5,stroke:#333,color:#333 style R1 fill:#2e8b57,stroke:#333,color:#fff style R2 fill:#2e8b57,stroke:#333,color:#fff linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px ``` ### NATS Core / Event Plane with Local Indexer (Default) By default, workers have local indexer enabled. Each worker maintains its own local radix tree (local indexer) and publishes events over the generic event plane (NATS Core or ZMQ, depending on `--event-plane`). Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly. - **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios; deployments without NATS (using ZMQ event plane) - **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available - **Switch to JetStream**: Use `--durable-kv-events` flag on **both** workers (SGLang, TRT-LLM, vLLM, mocker) **and** frontend ```mermaid graph TD subgraph Engines E1[Engine 1
LocalKvIndexer] E2[Engine 2
LocalKvIndexer] E3[Engine 3
LocalKvIndexer] end subgraph "Event Plane (NATS / ZMQ)" NC[KV Events Pub/Sub
- Block created
- Block removed] end subgraph "Router Replicas" R1[Router 1
KVIndexer] R2[Router 2
KVIndexer] end E1 -->|Publish Events| NC E2 -->|Publish Events| NC E3 -->|Publish Events| NC NC -->|Subscribe| R1 NC -->|Subscribe| R2 style NC fill:#e1f5fe,stroke:#333,color:#333 style E1 fill:#f3e5f5,stroke:#333,color:#333 style E2 fill:#f3e5f5,stroke:#333,color:#333 style E3 fill:#f3e5f5,stroke:#333,color:#333 style R1 fill:#2e8b57,stroke:#333,color:#fff style R2 fill:#2e8b57,stroke:#333,color:#fff linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px ``` **How gap detection works:** 1. Each worker assigns monotonically increasing event IDs starting from 0 2. The router tracks the last received event ID per worker 3. If an event arrives with `event_id > last_id + 1`, the router detects a gap 4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]` 5. On worker discovery (Added event), the router dumps the worker's entire local indexer state **Startup behavior:** - When a worker is discovered, the router queries and ingests its full local indexer state - When a worker is removed, the router removes all its blocks from the global radix tree By default, all workers have `enable_local_indexer=true`, so the router uses NATS Core / Event Plane mode with local indexer. To use JetStream mode instead, specify `--durable-kv-events` on **both** the frontend and all workers. ### Local Active Block Management with Replica Sync In addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when: - The router receives and routes a request - The first token is generated (prefill complete) - The response ends (request freed) This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging. 
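The slot manager and its replica sync can be sketched as below. This is a simplified model under stated assumptions: the `SyncEvent` and `SlotManager` classes, their fields, and their method names are hypothetical, and events are handed between replicas by direct method calls rather than over NATS Core. It shows the three event types, the per-worker load they maintain, and the router ID check that prevents self-event processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncEvent:
    kind: str                 # "AddRequest", "MarkPrefillCompleted", or "Free"
    router_id: str            # originating router, used to skip self-events
    request_id: str
    worker_id: int = -1
    prefill_blocks: int = 0
    decode_blocks: int = 0


class SlotManager:
    """Tracks predicted active blocks per worker for one router replica."""

    def __init__(self, router_id):
        self.router_id = router_id
        self._slots = {}      # request_id -> {"worker", "prefill", "decode"}

    def add_request(self, request_id, worker_id, prefill_blocks, decode_blocks):
        event = SyncEvent("AddRequest", self.router_id, request_id,
                          worker_id, prefill_blocks, decode_blocks)
        self._apply(event)
        return event          # the caller broadcasts this to other replicas

    def mark_prefill_completed(self, request_id):
        event = SyncEvent("MarkPrefillCompleted", self.router_id, request_id)
        self._apply(event)
        return event

    def free(self, request_id):
        event = SyncEvent("Free", self.router_id, request_id)
        self._apply(event)
        return event

    def on_remote_event(self, event):
        if event.router_id == self.router_id:
            return            # ignore our own events echoed back by the transport
        self._apply(event)

    def _apply(self, event):
        if event.kind == "AddRequest":
            self._slots[event.request_id] = {
                "worker": event.worker_id,
                "prefill": event.prefill_blocks,
                "decode": event.decode_blocks,
            }
        elif event.kind == "MarkPrefillCompleted":
            slot = self._slots.get(event.request_id)
            if slot is not None:
                slot["prefill"] = 0   # prefill work no longer counts as load
        elif event.kind == "Free":
            self._slots.pop(event.request_id, None)

    def load(self, worker_id):
        """Return (prefill_blocks, decode_blocks) currently predicted for a worker."""
        mine = [s for s in self._slots.values() if s["worker"] == worker_id]
        return sum(s["prefill"] for s in mine), sum(s["decode"] for s in mine)


r1, r2 = SlotManager("router-1"), SlotManager("router-2")
added = r1.add_request("A", worker_id=0, prefill_blocks=8, decode_blocks=10)
r2.on_remote_event(added)     # stands in for delivery over the sync channel
```

After the `AddRequest` is delivered, both replicas report the same load for worker 0 even though only `router-1` saw the request.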
```mermaid sequenceDiagram participant C1 as Client 1 participant R1 as Router 1
(Slot Manager) participant R2 as Router 2
(Slot Manager) participant C2 as Client 2 Note over R1,R2: Router Replica Sync Enabled C1->>R1: Request A activate R1 R1->>R1: Predict blocks & route to worker R1-->>R2: Sync: AddRequest(A) C2->>R2: Request B activate R2 R2->>R2: Predict blocks & route to worker R2-->>R1: Sync: AddRequest(B) R1->>R1: First token received
(prefill complete) R1-->>R2: Sync: MarkPrefillCompleted(A) R1->>C1: Stream response R2->>R2: First token received
(prefill complete) R2-->>R1: Sync: MarkPrefillCompleted(B) R2->>C2: Stream response R1->>R1: Response complete
(free blocks) R1-->>R2: Sync: Free(A) deactivate R1 R2->>R2: Response complete
(free blocks) R2-->>R1: Sync: Free(B) deactivate R2 Note over R1,R2: Both routers have consistent
view of active blocks
```

This dual-layer approach (global KV cache state built from worker KV events, optionally persisted via JetStream, plus ephemeral active-block synchronization between router replicas) lets the router balance cache reuse against load distribution.

## See Also

- **[Router README](/dynamo/components/router)**: Quick start guide for the KV Router
- **[Configuration and Tuning](/dynamo/components/router/configuration-and-tuning)**: Router flags, tuning, and production setup
- **[Router Examples](/dynamo/components/router/router-examples)**: Python API usage and custom routing patterns
- **[KV Event Publishing for Custom Engines](/dynamo/integrations/kv-cache-integrations/kv-events-for-custom-engines)**: Integrate custom inference engines with KV-aware routing

# KVBM Design

This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in SGLang and vLLM, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).

## KVBM Components

![Internal Components of Dynamo KVBM](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/5b0075bf0e321be71d204bf237a3ced35402bce029989e335a7396c1a1f4108b/pages-v1.2.0/assets/img/kvbm-components.svg)

*Internal Components of Dynamo KVBM*

### Core

- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks. - **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing. ### Layouts and Blocks - **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry. - **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`. ### Transfer Manager - **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device). ### Storage & Pools - **Device Pool (G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device). - **Host Pool (G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O. - **Disk Pool (G3)**: Local SSD NVMe-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS. - **Remote Storage (G4)**: Remote or cloud-backed KV block storage. KVBM treats G4 as an opaque blob store accessed through NIXL, unaware of internal layout optimizations. 
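As a compact summary of the tiers and the per-path queues listed above, the sketch below models the four tiers and the four transfer directions named for the TransferManager. The `Tier` enum, the two path sets, and `classify_transfer` are hypothetical names chosen for illustration, not KVBM types. G4 is left out of the path sets because the queue list above does not name a G4 path.

```python
from enum import Enum


class Tier(Enum):
    G1_DEVICE = "device"   # GPU-resident pool
    G2_HOST = "host"       # CPU pinned-memory pool
    G3_DISK = "disk"       # local SSD/NVMe pool
    G4_REMOTE = "remote"   # remote or cloud-backed storage


# Directions taken from the TransferManager description above.
OFFLOAD_PATHS = {
    (Tier.G1_DEVICE, Tier.G2_HOST),   # Device -> Host
    (Tier.G2_HOST, Tier.G3_DISK),     # Host -> Disk
}
ONBOARD_PATHS = {
    (Tier.G2_HOST, Tier.G1_DEVICE),   # Host -> Device
    (Tier.G3_DISK, Tier.G1_DEVICE),   # Disk -> Device
}


def classify_transfer(src, dst):
    """Return "offload", "onboard", or None if no queue is listed for the pair."""
    if (src, dst) in OFFLOAD_PATHS:
        return "offload"
    if (src, dst) in ONBOARD_PATHS:
        return "onboard"
    return None
```

Note the asymmetry: blocks reach disk only through the host pool, but can be onboarded from disk straight to the device.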
## KVBM Data Flows ![KVBM Data Flows](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/004d5acb937c15512095b6b6396f633e40ce1f03f1210fc9601be5646e2c87ab/pages-v1.2.0/assets/img/kvbm-data-flows.png) *KVBM Data Flows from device to other memory hierarchies* ### Device → Host (Offload) - Triggered when explicitly requested to offload by the connector scheduler - Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy - Host pool registers the new immutable block (dedup by sequence hash) ### Host → Disk (Offload) - **Local Disk (G3)**: NIXL Write via POSIX; GDS when available - **Remote Disk (G4)** (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching/concurrency identical - Triggered on registered host blocks or explicit offload requests - Worker allocates a Disk block and performs NIXL Write (Host→Disk) - Disk pool registers the new immutable block (dedup by sequence hash) ### Host → Device (Onboard) - Called to bring a host block into GPU memory - Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy - Device pool registers the new immutable block ### Disk → Device (Onboard) - Called to bring a disk block directly into GPU memory - Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS - Device pool registers the new immutable block ## Internal Architecture Deep Dive ![Internal architecture and key modules in the Dynamo KVBM](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/81eb7e56a2be8745490415e2e01322d2d8f1ed9b299a87fcdb553a7b3d581c7b/pages-v1.2.0/assets/img/kvbm-internal-arch.png) *Internal architecture and key modules in the Dynamo KVBM* ### KvBlockManager as Orchestration Layer The `KvBlockManager` acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. 
It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are the key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.

`KvBlockManager` owns:

- A device-side `BlockPool`
- A host-side `BlockPool`
- A remote NIXL agent that supports communication and memory sharing across nodes
- A block set registry for remote lookup and import/export of block metadata

Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.

### Block Layout and Memory Mapping

Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:

```text
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```

Both CPU and GPU pools share this memory layout, but they use storage-specific backends:

- `DeviceStorage` → CUDA device buffer
- `PinnedStorage` → page-locked host memory
- `SystemStorage` → CPU heap memory (fallback/test)
- `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)

Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a `StorageAllocator`.

### BlockPool and Memory Pools (Active and Inactive)

Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, etc.) tracks two sub-pools:

- **ActivePool**: Contains blocks currently in use by sequences
- **InactivePool**: Recycled blocks ready for allocation (free list)

When a token block is requested (e.g., `get_mutable_block()`), the allocator pops a block from `InactivePool`, transitions its state, and returns a writable handle.
On sequence commit or eviction, the system resets blocks and returns them to the inactive pool. ### Block State Machine The state machine (`BlockState`) tracks block lifecycle transitions: | State | Description | Ownership | Valid Actions/Transitions | |-------|-------------|-----------|---------------------------| | Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | `init_sequence(salt_hash)` → Partial | | Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | `add_token()` / `add_tokens()` (accumulate), `commit()` → Complete, `reset()` → Reset | | Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | `register()` → Registered, `reset()` → Reset | | Registered | Block is finalized and visible for reuse. Available in the deduplication cache. | Shared ownership (global registry) | Auto `drop()` → triggers Remove event and transitions to Reset | #### Valid State Transitions | From → To | Trigger | Validation | |-----------|---------|------------| | Reset → Partial | `init_sequence(salt_hash)` | Must not be in use | | Partial → Complete | `commit()` | Must be full | | Complete → Registered | `register()` | Must be finalized | | Registered → Reset | Drop of `RegistrationHandle` | Automatic | | Partial → Reset | Aborted sequence | Explicit or drop | | Complete → Reset | Invalidated | Explicit or drop | #### Example Block Lifecycle A sequence requests a new KV block: 1. Allocator pops from InactivePool → Block is in Reset 2. `init_sequence()` → Transitions to Partial 3. Tokens are appended → State remains Partial 4. On full → `commit()` → State becomes Complete 5. `register()` → Block is hashed and moved to Registered. Blocks can now be used for lookup. 6. 
On eviction or end-of-life → `drop()` of RAII handle returns block to Reset ### Lifecycle Management using RAII and Event Plane The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop: - `PublishHandle` triggers Register events - Dropping it triggers Remove events This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform. ### Remote Memory Integration using NIXL The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include: - `nixl_register()`: Registers memory region with NIXL runtime - `serialize() / deserialize()`: Converts layout and memory into transferable descriptors - `import_remote_blockset()`: Loads remote node's block layouts into the manager - `get_remote_blocks_mutable()`: Fetches transferable memory views from another node `RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends). #### Remote Memory Registration Protocol The following describes a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL: **1. Agent Creation & Memory Registration** Each worker independently sets up a NixlAgent: - Registers its memory regions (i.e., device memory) through `nixl_register()` - These regions correspond to blocks managed in the local BlockPool Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout. 
**2. Metadata Exchange** After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`. Why is this step critical? - LLM inference workloads often differ in *tensor parallel (TP)* configurations: - Worker 1 might have TP=4, while Worker 2 has TP=8 - Even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ - The metadata exchange bridges this semantic mismatch by sharing: - LayoutConfig (num_layers, page_size, inner_dim, dtype) - BlockSetID - Base address + stride information (including alignment) - Device ID + memory type (host/device) - Once workers share metadata, each can reconstruct the layout on its side using `deserialize()` This enables NIXL to: - Understand where each layer/block resides - Perform correct gather-scatter operations during RDMA-like transfers Without this step, remote fetches would result in data corruption or misaligned tokens. **3. Serialization & Deserialization: Making Layouts Portable** In the serialization stage, KVBM exports and `FullyContiguous::serialize()` encodes: - FullyContiguousConfig - base_offset - Physical memory descriptors (NixlStorage), including: - Memory type (VRAM, DRAM) - Address & size - Device ID The system sends this using NIXL transfer and then injects it into a KVBM scheduler state. In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into: - A fully reconstructed memory layout view - Local representation of a remote memory slice with correct offsets and size semantics It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block. **4. 
Ownership Handles and Lifetime Tracking** Memory ownership in NIXL is tightly coupled with RAII-based handles: - When a block is registered, it returns a `PublishHandle` which wraps a `RegistrationHandle` - On drop of this handle, an automatic Remove event is published, which: - Deregisters the block from the NIXL layer - Removes it from the remote block registry - This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes This mechanism avoids: - Stale memory access - Dangling pointers on GPU or host - Manual deregistration bugs The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency. ### Storage Backends and Pluggability You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. ```mermaid --- title: Example KVBM System Architecture --- flowchart TD A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"] B --> C["NIXL Storage Agent
- Volume registration
- get()/put() abstraction"] B --> D["Event Plane
- Pub/Sub (NATS or ZMQ)
- StoreEvent / RemoveEvent"] C --> E["G4 Storage Infrastructure
(SSD, Object store, etc.)
- Store KV blocks"] D --> F["Storage Provider Subscriber
- Parse Events
- Build fast tree/index
- Optimize G4 tiering"] ``` #### NIXL Storage Interface (for Backend Integration) The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides: - `registerVolume(descriptor)`: Register a logical volume for KV cache data - `unregisterVolume()`: Cleanly deregister and release volume mappings - `get() / put()`: Block-level APIs used by KVBM to fetch and store token blocks These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Note that these APIs are still being finalized. #### Dynamo Event Plane (Pub/Sub Coordination Layer) To support external storage optimizations without modifying KVBM logic, we provide an **event plane** (supporting NATS and ZMQ transports) that emits lifecycle events for all block operations: - **StoreEvent**: Emitted when a KV block is registered - **RemoveEvent**: Emitted when a KV block is released or evicted Each KVEvent (~100 bytes) contains: | Field | Description | |-------|-------------| | `sequence_hash` | Unique identifier of the KV block | | `prefix_hash` | Prefix grouping for query-level aggregation | | `block_size` | Size in bytes | | `storage_location` | Logical volume identifier | | `event_type` | Store or Remove | | `extra_metadata` | Reserved fields for partner-specific optimization | For scalability, the system batches and publishes these events periodically (e.g., every ~10s, or dynamically based on system load). #### Conceptual Design of a Storage Advisor This section provides an overview for storage providers interested in integrating as a custom backend to KVBM. **This is optional for KVBM integration with a backend.** External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model: 1. 
Storage volumes are pre-provisioned and mounted by the storage provider 2. These volumes are registered with Dynamo through the NIXL Storage Agent using `registerVolume()` APIs 3. Dynamo KV Block Manager interacts only with logical block-level APIs (`get()` and `put()`) 4. The Event Plane asynchronously broadcasts KV lifecycle events via pub/sub (NATS or ZMQ) 5. Storage vendors implement a lightweight subscriber process that listens to these events To enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream: - On receiving a **StoreEvent**: Insert a record into an internal prefix tree, hash map, or LRU index with `prefix_hash`, `sequence_hash`, and associated metadata - On receiving a **RemoveEvent**: Delete or prune the corresponding record, optionally triggering cleanup or tier migration workflows With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies: - **Hot block promotion**: Frequently accessed KV blocks can be migrated to fast SSD volumes - **Cold block demotion**: Infrequently used blocks can be demoted to slower storage (HDDs, cloud object storage) - **Proactive compaction**: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer. ## Framework Integrations KVBM integrates with inference frameworks (SGLang, TensorRT-LLM, vLLM) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution. ### Connector Architecture There are two components of the interface: - **Scheduler (Leader)**: Responsible for orchestration of KV block offload/onboard, builds metadata specifying transfer data to the workers. It also maintains hooks for handling asynchronous transfer completion. 
- **Worker**: Responsible for reading metadata built by the scheduler (leader), performs async onboarding/offloading at the end of the forward pass. ![vLLM KVBM Integration](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/92ac64a22be262ad2c043a48578539e3f0482b2f658e8455b27cd23c8f2c8592/pages-v1.2.0/assets/img/kvbm-integrations.png) *Typical integration of KVBM with inference frameworks (vLLM shown as example)* ### Onboarding Operations ![Onboarding blocks from Host to Device](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/215b20ab77c035973e76fc9ea589674144485b62ef874ffdf86640408d222680/pages-v1.2.0/assets/img/kvbm-onboard-host2device.png) *Onboarding blocks from Host to Device* ![Onboarding blocks from Disk to Device](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/56bfdbb942a421b0b1d12ebc058ca9c56d666d349047fb6a3c828f0c3a5e836d/pages-v1.2.0/assets/img/kvbm-onboard-disk2device.png) *Onboarding blocks from Disk to Device* ### Offloading Operations ![Offloading blocks from Device to Host & Disk](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/f139e863257fb63c4862fb5377e3c679b12f444ca5e1179597fe7a122e1741b5/pages-v1.2.0/assets/img/kvbm-offload.png) *Offloading blocks from Device to Host & Disk* ## Further Reading - [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) - [SGLang HiCache Benchmarks](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache) - [EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal](https://arxiv.org/abs/2006.06890) ## See Also - [KVBM Overview](/dynamo/components/kvbm) - [KVBM Guide](/dynamo/user-guides/kv-cache-offloading) - [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) # Planner Design > **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](/dynamo/components/planner). 
## Overview

The Planner is Dynamo's autoscaling controller. It supports two scaling modes: **throughput-based** (using profiling data and traffic prediction) and **load-based** (using real-time engine metrics and online regression). This document covers the internal architecture, algorithms, and design trade-offs for both modes.

## Throughput-Based Scaling

![Planner architecture showing Metric Collector, Load Predictor, and Performance Interpolator feeding into the Scaling Algorithm and Connector Layer](https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/ff41fd393ebe01d92134ea23a32abe555d139a609914b8adcc02bfd3ff7e9b3e/pages-v1.2.0/assets/img/planner-architecture.svg)

## Scaling Algorithm

### Step 1: Metric Collection

Every `adjustment_interval` seconds, the planner queries Prometheus for:

- Average time to first token (TTFT) and inter-token latency (ITL) over the interval
- Total request count
- Average input sequence length (ISL) and output sequence length (OSL)

The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.

### Step 2: Correction Factor Calculation

The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:

```text
prefill_correction = actual_ttft / expected_ttft
decode_correction = actual_itl / expected_itl
```

These factors account for hard-to-model effects such as:

- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
- **Chunked prefill in decode**: Small prefills processed in the decode engine affect ITL
- **Metric variance**: Average ISL/OSL may not represent the actual distribution

The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
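As a rough illustration of the two ratios above, the sketch below computes both correction factors from observed and expected latencies. The function name `correction_factors`, the millisecond units, and the sample numbers are assumptions made for illustration, not the planner's actual code.

```python
def correction_factors(actual_ttft, expected_ttft, actual_itl, expected_itl):
    """Return (prefill_correction, decode_correction) as plain ratios.

    A value above 1.0 means the deployment is slower than the profile
    predicted; a value below 1.0 means it is faster.
    """
    if expected_ttft <= 0 or expected_itl <= 0:
        raise ValueError("expected latencies must be positive")
    return actual_ttft / expected_ttft, actual_itl / expected_itl


# Hypothetical interval: queueing inflates TTFT, while ITL beats the profile.
prefill_correction, decode_correction = correction_factors(
    actual_ttft=300.0, expected_ttft=200.0, actual_itl=18.0, expected_itl=20.0
)
print(prefill_correction, decode_correction)
```

With these made-up inputs, the prefill factor is above 1 (the deployment is slower than profiled) and the decode factor is below 1 (it is faster than profiled).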
### Step 3: Load Prediction

The planner forecasts three values for the next interval:

- `next_num_req`: Number of requests
- `next_isl`: Average input sequence length
- `next_osl`: Average output sequence length

Four predictor implementations are available:

| Predictor    | Algorithm                                | Best For                         |
| ------------ | ---------------------------------------- | -------------------------------- |
| **Constant** | `next = current`                         | Stable workloads, long intervals |
| **ARIMA**    | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns       |
| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
| **Prophet**  | Facebook Prophet time-series model       | Complex seasonality              |

All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).

### Step 4: Replica Calculation

**Prefill replicas:**

```python
predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```

The prefill correction factor has a linear effect on throughput because prefill is single-batched.

**Decode replicas:**

```python
# Apply correction to the ITL SLA target
corrected_itl = target_itl / decode_correction

# Find best throughput/GPU that achieves corrected ITL at predicted context length
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2
)

# Calculate required replicas
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
```

### Step 5: Scaling Execution

The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
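The Step 4 formulas can be tried end to end with a small runnable sketch. The function names, the fixed throughput arguments (which stand in for the interpolator lookups), and all sample numbers are assumptions made for illustration.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval, prefill_correction,
                     interpolated_throughput, gpus_per_engine):
    # Predicted prefill load in tokens per second, scaled by the capped correction factor.
    predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
    return ceil(predicted_load / interpolated_throughput / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval, throughput_per_gpu, gpus_per_engine):
    # throughput_per_gpu stands in for the decode interpolator lookup at the corrected ITL.
    return ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)


# Hypothetical forecast for one 180-second interval.
prefill = prefill_replicas(
    next_num_req=600, next_isl=3000, interval=180, prefill_correction=1.5,
    interpolated_throughput=4000, gpus_per_engine=1,
)
decode = decode_replicas(
    next_num_req=600, next_osl=150, interval=180,
    throughput_per_gpu=120, gpus_per_engine=2,
)
print(prefill, decode)
```

Because of the `min(1, prefill_correction)` cap, a prefill correction above 1 leaves the predicted load unchanged, while a value below 1 reduces it.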
## Connector Design

### Interface

```python
from abc import ABC


class PlannerConnector(ABC):
    async def add_component(self, component_name): ...
    async def remove_component(self, component_name): ...

    # Extended interface (not on ABC, but implemented by both connectors):
    async def set_component_replicas(self, targets, blocking): ...
    async def validate_deployment(self, *args, **kwargs): ...  # parameters elided in this overview
    async def wait_for_deployment_ready(self): ...
```

### KubernetesConnector

Directly PATCHes the DynamoGraphDeployment (DGD) resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.

**Design decisions:**

- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
- Validates deployment structure on startup: checks that prefill and decode services exist and model names match

### VirtualConnector

For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.

**Scaling decision flow:**

1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
2. External system reads decision via `client.wait()`
3. External system executes scaling
4. External system reports completion via `client.complete(decision)`
5. Planner sees `scaled_decision_id >= decision_id` and proceeds

**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.

## Performance Interpolation

The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
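To make the mapping concrete, here is a minimal sketch of one-dimensional linear interpolation over profiled samples. The sample points, the `interpolate` helper, and the choice of linear interpolation with clamping are assumptions for illustration; the planner's actual NPZ schema and interpolation method are not shown here.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linearly interpolate y at x from sorted sample points, clamping at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    hi = bisect_left(xs, x)  # first sample index with xs[hi] >= x
    lo = hi - 1
    frac = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


# Hypothetical prefill profile: ISL (tokens) -> TTFT (ms).
isl_samples = [1000, 2000, 4000, 8000]
ttft_samples = [50.0, 100.0, 220.0, 500.0]

print(interpolate(isl_samples, ttft_samples, 3000))
```

A denser sweep adds more sample points, which narrows the gaps that the interpolation has to bridge.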
Two interpolators are maintained: - **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT - **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL The interpolators use the profiling sweep granularity to determine precision. Finer granularity means more profiling samples but more accurate interpolation. ## Initialization The planner currently waits 30 seconds (`INIT_PLANNER_START_DELAY` in `components/src/dynamo/planner/__main__.py`) as a temporary workaround while other components (frontend, workers) register and stabilize; see [Known Limitations](#known-limitations) for the planned readiness-probing replacement. After the delay: 1. Initialize the connector (K8s or Virtual based on `--environment`) 2. Validate deployment structure 3. Load profiling results 4. Build interpolators 5. Initialize load predictor 6. Enter main scaling loop ## Performance Considerations - **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals. - **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor. - **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration. - **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. 
ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback. ## Load-Based Scaling The load-based mode uses ForwardPassMetrics (FPM) from the Dynamo event plane to make SLA-aware scaling decisions without requiring profiling data or the KV Router. ### Metrics Each engine emits per-iteration `ForwardPassMetrics` via ZMQ -> FpmEventRelay -> event plane. The planner subscribes via `FpmEventSubscriber` with automatic engine discovery and MDC-based lifecycle tracking. Key fields used: - **wall_time**: per-iteration execution time (regression target) - **scheduled_requests.sum_prefill_tokens**: prefill regression input - **scheduled_requests.sum_decode_kv_tokens**: decode regression input - **queued_requests**: queued prefill/decode load for TTFT/ITL simulation - Idle heartbeats (wall_time=0) are skipped ### Diagnostics Each tick, the scaling state machine fills `TickDiagnostics` with intermediate decision data—estimated latencies, predicted load, per-engine RPS, and decision reasons—via internal `_diag_*` fields. The adapter layer reads this from `PlannerEffects.diagnostics` and: - Sets Prometheus gauges (e.g. `dynamo_planner_estimated_ttft_ms` and related estimates) - Records enum metrics for load-scaling decision reasons (`dynamo_planner_load_scaling_decision`) - Feeds `DiagnosticsRecorder`, which accumulates per-tick snapshots and emits Plotly-based HTML reports on a schedule Per-engine FPM queue depths from `_collect_fpm()` are exported as labeled Prometheus gauges. ### Regression Models Three specialized regression models (`fpm_regression.py`): - **PrefillRegressionModel**: 1D regression `sum_prefill_tokens -> wall_time`. Estimates TTFT by simulating chunked prefill scheduling (chunks of `max_num_batched_tokens`). - **DecodeRegressionModel**: 1D regression `sum_decode_kv_tokens -> wall_time`. 
Estimates ITL for total decode load (scheduled + queued + avg decode length). - **AggRegressionModel**: 2D regression `(sum_prefill_tokens, sum_decode_kv_tokens) -> wall_time`. Estimates both TTFT (simulated prefill with piggybacked decode) and ITL (decode with average piggybacked prefill). ### Scaling Decisions - **Prefill/Decode**: Scale up if ALL engines' estimated TTFT/ITL > SLA; scale down if ALL < SLA * sensitivity - **Agg**: Scale up if (ALL TTFT > SLA) OR (ALL ITL > SLA); scale down if (ALL TTFT < SLA * sensitivity) AND (ALL ITL < SLA * sensitivity) - Only scales by +/-1 per interval (non-blocking with pending-desired guard: metrics continue to be observed while scaling is in progress, but no new scaling action is issued until the previous one completes) ### Co-existence with Throughput-Based Scaling When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor. ### Aggregated Mode In aggregated mode (`--mode agg`), engines handle both prefill and decode via chunked prefill. The planner maintains both TTFT and ITL regression models but uses per-worker time-averaged metrics (not instantaneous) for regression training to smooth out chunked prefill noise. Scale up if either prefill or decode signals overload; scale down only if both signal underload. ## Known Limitations 1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing. 2. **Adjustment interval vs scaling latency**: If `adjustment_interval` \< time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue. 3. **Average-based interpolation**: Throughput-based scaling uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well. 4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported. 
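Returning to the load-based rules listed under Scaling Decisions, the comparison logic can be sketched as two small pure functions. The function names, the `sensitivity` value, and the sample latencies are illustrative assumptions; the pending-desired guard described there is omitted.

```python
def scaling_delta(estimated_latencies, sla, sensitivity):
    """Return +1 (scale up), -1 (scale down), or 0 (hold) for one engine pool.

    estimated_latencies holds one estimated TTFT or ITL value per engine.
    """
    if not estimated_latencies:
        return 0  # nothing observed yet, so hold
    if all(lat > sla for lat in estimated_latencies):
        return 1
    if all(lat < sla * sensitivity for lat in estimated_latencies):
        return -1
    return 0


def agg_scaling_delta(ttfts, itls, ttft_sla, itl_sla, sensitivity):
    """Aggregated mode: scale up on either signal, scale down only on both."""
    ttft_delta = scaling_delta(ttfts, ttft_sla, sensitivity)
    itl_delta = scaling_delta(itls, itl_sla, sensitivity)
    if ttft_delta == 1 or itl_delta == 1:
        return 1
    if ttft_delta == -1 and itl_delta == -1:
        return -1
    return 0


# Hypothetical per-engine TTFT estimates (ms) against a 200 ms SLA.
print(scaling_delta([250.0, 300.0], sla=200.0, sensitivity=0.8))
```

The gap between `sla` and `sla * sensitivity` acts as a dead band, so a pool whose engines sit between the two thresholds is left alone.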
## Future Work - Multi-DGD coordination for shared-cluster scenarios - Distribution-aware interpolation (beyond mean ISL/OSL) - Adaptive adjustment interval based on observed scaling latency ## File Map | File | Purpose | | ---------------------------- | ----------------------------------------------------- | | `planner_core.py` | Base planner, shared scaling loop, algorithm core | | `disagg_planner.py` | Disaggregated mode orchestrator (prefill + decode) | | `agg_planner.py` | Aggregated mode orchestrator (load-based only) | | `prefill_planner.py` | Prefill-specific scaling logic | | `decode_planner.py` | Decode-specific scaling logic | | `load_based_regression.py` | Sliding-window linear regression for load-based scaling | | `prometheus.py` | Prometheus/router metrics clients, data classes | | `perf_interpolation.py` | NPZ data loading and throughput/latency interpolation | | `load_predictor.py` | ARIMA, Prophet, Kalman, Constant predictors | | `pre_swept_results_utils.py` | Pre-computed H100/H200 profiling data loader | | `kubernetes_connector.py` | K8s API integration for DGD scaling | | `kube.py` | Low-level K8s client wrapper | | `exceptions.py` | Custom exception hierarchy | | `defaults.py` | Default configs, backend name mappings | | `planner_argparse.py` | CLI argument definitions | # How to Build and Publish Dynamo Docs This document describes the architecture, workflows, and maintenance procedures for the NVIDIA Dynamo documentation website powered by [Fern](https://buildwithfern.com). The documentation website is hosted entirely on [Fern](https://buildwithfern.com). CI publishes to `dynamo.docs.buildwithfern.com`; the production domain `docs.dynamo.nvidia.com` is a custom domain alias that points to the Fern-hosted site. There is no separate server — Fern handles hosting, CDN, and versioned URL routing. The `docs-website` branch is **CI-managed and must never be edited by hand**. 
All documentation authoring happens on `main` (or a feature branch based on `main`). The sync workflow copies changes to `docs-website` automatically. --- ## Table of Contents - [Branch Architecture](#branch-architecture) - [Directory Layout](#directory-layout) - [Configuration Files](#configuration-files) - [GitHub Workflows](#github-workflows) - [Fern Docs Workflow](#fern-docs-workflow-fern-docsyml) - [Docs Link Check Workflow](#docs-link-check-workflow-docs-link-checkyml) - [Content Authoring](#content-authoring) - [Callout Conversion](#callout-conversion) - [Running Locally](#running-locally) - [Version Management](#version-management) - [How Publishing Works](#how-publishing-works) - [Common Tasks](#common-tasks) - [Claude Code Skills](#claude-code-skills) --- ## Claude Code Skills A single Claude Code skill automates common docs tasks. Invoke it as a slash command in Claude Code (e.g., `/dynamo-docs`) — the skill walks through the full workflow: creating, editing, or removing the markdown file, updating the navigation in `docs/index.yml`, and running `fern check` to validate. | Skill | Description | |-------|-------------| | [dynamo-docs](https://github.com/ai-dynamo/dynamo/blob/v1.2.0/.agents/contributor-skills/dynamo-docs/SKILL.md) | Add, update, move, or remove a docs page | --- ## Branch Architecture The documentation system uses a **dual-branch model**: | Branch | Purpose | Content | Fern config | |---|---|---|---| | `main` | Source of truth for **dev** (unreleased) documentation | `docs/` | `fern/` | | `docs-website` | Published documentation including **all versioned snapshots** | `fern/pages/` | `fern/` | Authors edit pages on `main`. A GitHub Actions workflow automatically syncs changes to the `docs-website` branch and publishes them to Fern. The `docs-website` branch is never edited by hand — it is entirely managed by CI. ### Why two branches? The `docs-website` branch accumulates versioned snapshots over time (e.g. 
`pages-v0.8.0/`, `pages-v0.8.1/`). Keeping these on a separate branch avoids bloating the `main` branch with frozen copies of old documentation. --- ## Directory Layout ### On `main` ```text fern/ # Fern CLI configuration (fern/ is a Fern convention) ├── fern.config.json # Fern org + CLI version pin ├── docs.yml # Site configuration (instances, branding, layout) ├── components/ │ └── CustomFooter.tsx # React component for the site footer ├── main.css # Custom CSS (NVIDIA branding, dark mode, etc.) ├── convert_callouts.py # GitHub → Fern admonition converter script └── .gitignore # Fern-specific ignores docs/ # Documentation content ├── index.yml # Navigation tree for the dev version ├── getting-started/ # Markdown content (the actual docs) ├── kubernetes/ ├── reference/ ├── ... ├── assets/ # Images, fonts, SVGs, logos ├── digest/ # Digest posts └── diagrams/ # D2 diagram source files ``` ### On `docs-website` The `docs-website` branch has a different layout optimized for Fern's directory conventions, plus versioned snapshots: ```text fern/ ├── fern.config.json # Fern org + CLI version pin ├── docs.yml # Includes the full versions array ├── versions/ │ ├── dev.yml # "Next" / dev navigation (synced from main) │ ├── v0.8.1.yml # Navigation for v0.8.1 snapshot │ └── v0.8.0.yml # Navigation for v0.8.0 snapshot ├── pages/ # Current dev content (synced from main) ├── pages-v0.8.1/ # Frozen snapshot of pages/ at v0.8.1 ├── pages-v0.8.0/ # Frozen snapshot of pages/ at v0.8.0 ├── components/ # React components ├── main.css # Custom CSS ├── convert_callouts.py # Callout converter ├── digest/ # Digest posts (synced from main) └── assets/ # Images, fonts, SVGs ``` Each `pages-vX.Y.Z/` directory is an immutable copy of `pages/` taken at release time. The corresponding `versions/vX.Y.Z.yml` file is a copy of `dev.yml` with all `../pages/` paths rewritten to `../pages-vX.Y.Z/`. 
The sync workflow copies content from `main`'s `docs/` into `fern/pages/` and transforms navigation paths in `index.yml` → `versions/dev.yml` accordingly.

---

## Configuration Files

### `fern/fern.config.json`

```json
{
  "organization": "nvidia",
  "version": "3.73.0"
}
```

- **organization**: The Fern organization that owns the docs site.
- **version**: Pins the Fern CLI version used for generation.

### `fern/docs.yml`

This is the main Fern site configuration. Key sections:

| Section | Purpose |
|---|---|
| `instances` | Deployment targets: staging URL and custom production domain |
| `products` | Defines the product ("Dynamo") and its version list |
| `navbar-links` | GitHub repo link in the navigation bar |
| `footer` | Points to `CustomFooter.tsx` React component |
| `layout` | Page width, sidebar width, searchbar placement, etc. |
| `colors` | NVIDIA green (`#76B900`) accent, black/white backgrounds |
| `typography` | NVIDIA Sans body font, Roboto Mono code font |
| `logo` | NVIDIA logos (dark + light variants), 20px height |
| `js` | Adobe Analytics script injection |
| `css` | Custom `main.css` stylesheet |

**Important:** On `main`, `docs.yml` only lists the `dev` version. On `docs-website`, it contains the **full versions array** (dev + all releases). The sync workflow preserves the versions array from `docs-website` when copying `docs.yml` from `main`.

### `docs/index.yml`

Defines the navigation tree, that is, the sidebar structure of the docs site. Each entry maps a page title to a markdown file path:

```yaml
navigation:
  - section: Getting Started
    contents:
      - page: Quickstart
        path: getting-started/quickstart.mdx
      - page: Support Matrix
        path: reference/support-matrix.md
```

Paths are relative to the `docs/` directory. Sections can be nested. Pages can be marked as `hidden: true` to make them accessible by URL but invisible in the sidebar.
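As an illustration of how navigation paths relate to the two branch layouts, the sketch below walks a parsed navigation tree and prefixes every `path` value with `../pages/`. The workflow itself performs this step with `yq`; the `rewrite_paths` helper and the in-memory structure used here are assumptions for illustration only.

```python
def rewrite_paths(node, prefix="../pages/"):
    """Return a copy of a navigation tree with every `path` value prefixed."""
    if isinstance(node, list):
        return [rewrite_paths(item, prefix) for item in node]
    if isinstance(node, dict):
        return {
            key: (prefix + value) if key == "path" and isinstance(value, str)
            else rewrite_paths(value, prefix)
            for key, value in node.items()
        }
    return node  # scalars such as titles are left untouched


# Hypothetical parsed form of a small navigation tree.
navigation = [
    {
        "section": "Getting Started",
        "contents": [
            {"page": "Support Matrix", "path": "reference/support-matrix.md"},
        ],
    }
]

print(rewrite_paths(navigation))
```

Passing `prefix="../pages-vX.Y.Z/"` with a concrete version would produce the paths expected by a versioned navigation file in the same way.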
During sync to `docs-website`, the workflow copies `index.yml` to `fern/versions/dev.yml` and transforms paths (e.g., `getting-started/X` → `../pages/getting-started/X`) to match the docs-website directory layout. --- ## GitHub Workflows ### Fern Docs Workflow (`fern-docs.yml`) **Location:** `.github/workflows/fern-docs.yml` This single consolidated workflow handles linting, syncing, versioning, and publishing. It runs three jobs depending on the trigger: #### Job 1: Lint (PRs) **Triggers:** Pull requests that modify `docs/**` files. **Steps:** 1. `fern check` — validates Fern configuration syntax 2. `fern docs broken-links` — checks for broken internal links **Purpose:** Catches broken docs before they merge. #### Job 2: Sync dev (push to `main`) **Triggers:** Push to `main` that modifies `docs/**` files, or manual `workflow_dispatch` (with no tag specified). **Steps:** 1. Checks out both `main` and `docs-website` branches side-by-side 2. Copies content from `main`'s `docs/` → `docs-website`'s `fern/pages/` 3. Copies `docs/index.yml` → `fern/versions/dev.yml` and transforms paths for the docs-website layout using `yq` 4. Syncs assets from `docs/assets/` and Digest posts from `docs/digest/` 5. Copies Fern config files from `fern/` → docs-website's `fern/` (`fern.config.json`, `components/`, `main.css`, `convert_callouts.py`) 6. Runs `convert_callouts.py` to transform GitHub-style callouts to Fern format 7. Updates `docs.yml` from `main` **while preserving the versions array** from `docs-website` (uses `yq` to save/restore the versions list) 8. Commits and pushes to `docs-website` 9. Publishes to Fern via `fern generate --docs` #### Job 3: Version Release (tags) **Triggers:** New Git tags matching `vX.Y.Z` (e.g., `v0.9.0`, `v1.0.0`), or manual `workflow_dispatch` with a tag specified. **Steps:** 1. Validates tag format (must be exactly `vX.Y.Z`, no suffixes like `-rc1`) 2. Checks that the version doesn't already exist (no duplicate snapshots) 3. 
Creates `fern/pages-vX.Y.Z/` by copying `fern/pages/`
4. Rewrites GitHub links in the snapshot:
   - `github.com/ai-dynamo/dynamo/tree/main` → `tree/vX.Y.Z`
   - `github.com/ai-dynamo/dynamo/blob/main` → `blob/vX.Y.Z`
5. Runs `convert_callouts.py` on the snapshot
6. Creates `fern/versions/vX.Y.Z.yml` from `dev.yml` with paths updated to `../pages-vX.Y.Z/`
7. Updates `fern/docs.yml`:
   - Inserts new version right after the "dev" entry
   - Sets the product's default `path` to the new version
   - Updates the "Latest" display-name to `"Latest (vX.Y.Z)"`
8. Commits and pushes to `docs-website`
9. Publishes to Fern via `fern generate --docs`

**Anti-recursion note:** Pushes made with `GITHUB_TOKEN` do not trigger other workflows (GitHub's built-in guard). This is why the publish step is inline in each job rather than in a separate workflow.

### Docs Link Check Workflow (`docs-link-check.yml`)

**Location:** `.github/workflows/docs-link-check.yml`

**Triggers:** Push to `main` and pull requests.

Runs two independent link-checking jobs:

| Job | Tool | What it checks |
|---|---|---|
| `lychee` | [Lychee](https://lychee.cli.rs/) | External HTTP links (with caching, retries, rate-limit handling). Runs offline for PRs. |
| `broken-links-check` | Custom Python script (`detect_broken_links.py`) | Internal relative markdown links and symlinks. Creates GitHub annotations on PRs pointing to exact lines with broken links. |

---

## Content Authoring

### Writing docs on `main`

1. Edit or add markdown files in `docs/`.
2. If adding a new page, add an entry in `docs/index.yml` to make it appear in the sidebar navigation.
3. Use standard GitHub-flavored markdown. Callouts (admonitions) should use GitHub's native syntax; they are automatically converted during sync:

   ```markdown
   > [!NOTE]
   > This is a note that will become a Fern `<Note>` component.

   > [!WARNING]
   > This warning will become a Fern `<Warning>` component.
   ```

4. Open a PR. The lint jobs (`fern check`, `fern docs broken-links`, lychee, broken-links-check) run automatically.
5. 
Once merged to `main`, the sync-dev workflow publishes changes within minutes. ### Assets and images Place images in `docs/assets/` and reference them with relative paths from your markdown files: ```markdown ![Architecture Diagram](../assets/img/dynamo-architecture.svg) ``` ### Custom components React components in `fern/components/` can be used in markdown via MDX. The `CustomFooter.tsx` renders the NVIDIA footer with legal links and branding. --- ## Callout Conversion The `fern/convert_callouts.py` script bridges the gap between GitHub-flavored markdown and Fern's admonition format. This lets authors use GitHub's native callout syntax on `main` while Fern gets its required component format. ### Mapping | GitHub Syntax | Fern Component | |---|---| | `> [!NOTE]` | `` | | `> [!TIP]` | `` | | `> [!IMPORTANT]` | `` | | `> [!WARNING]` | `` | | `> [!CAUTION]` | `` | ### Usage ```bash # Convert all files in a directory (recursive, in-place) python fern/convert_callouts.py --dir docs/ # Convert a single file python fern/convert_callouts.py input.md output.md # Run built-in tests python fern/convert_callouts.py --test ``` The conversion happens automatically during the sync-dev and release-version workflows. Authors never need to run it manually. --- ## Running Locally You can preview the documentation site on your machine using the [Fern CLI](https://buildwithfern.com/learn/cli-api/overview). This is useful for verifying layout, navigation, and content before opening a PR. ### Prerequisites Install the Fern CLI globally via npm: ```bash npm install -g fern-api ``` ### Validate configuration Run `fern check` from the repo root to validate that `fern/docs.yml`, `fern/fern.config.json`, and the navigation files are syntactically correct: ```bash fern check ``` ### Check for broken links Use `fern docs broken-links` to scan all pages for internal links that don't resolve: ```bash fern docs broken-links ``` This is the same check that runs in CI on every pull request. 
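
The configuration check and the broken-link check can be chained into a single pre-PR step. The following is a minimal sketch, assuming the Fern CLI is on `PATH` and the script is run from the repo root. The function name `run_checks` and its `runner` parameter are illustrative and not part of the repository.

```python
import subprocess
from typing import Callable, Optional, Sequence

# The two commands described above, in the order CI runs them.
CHECKS = [
    ["fern", "check"],
    ["fern", "docs", "broken-links"],
]


def run_checks(runner: Optional[Callable[[Sequence[str]], int]] = None) -> bool:
    """Run each check in order and stop at the first failure.

    `runner` is a hypothetical hook (an assumption of this sketch) that
    takes a command and returns its exit code, so the control flow can
    be exercised without the Fern CLI installed.
    """
    if runner is None:
        # Default: shell out to the real CLI and use its exit code.
        runner = lambda cmd: subprocess.run(list(cmd)).returncode
    for cmd in CHECKS:
        print("$ " + " ".join(cmd))
        if runner(cmd) != 0:
            print("FAILED: " + " ".join(cmd))
            return False
    return True
```

Calling `run_checks()` with no argument invokes the real CLI. To use it as a script, exit with a non-zero status when it returns `False`.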
### Start a local preview server

Run `fern docs dev` to build the site and serve it locally with hot-reload:

```bash
fern docs dev
```

The local server lets you see exactly how pages will look on the live site, including navigation, version dropdowns, and custom styling.

---

## Version Management

### How versions work

The Fern site supports a version dropdown in the UI. Each version is defined by:

1. **A navigation file** (`fern/versions/vX.Y.Z.yml`): sidebar structure pointing to version-specific pages (on the `docs-website` branch).
2. **A pages directory** (`fern/pages-vX.Y.Z/`): frozen snapshot of the markdown content at release time (on the `docs-website` branch).
3. **An entry in `fern/docs.yml`**: tells Fern about the version's display name, slug, and config path.

### Version types

| Version | Display Name | Slug | Description |
|---|---|---|---|
| Latest | `Latest (vX.Y.Z)` | `/` | Default version; points to the newest release |
| Stable releases | `vX.Y.Z` | `vX.Y.Z` | Immutable snapshots |
| Dev | `dev` | `dev` | Tracks `main`; updated on every push that touches `docs/**` |

### URL structure

- **Latest (default):** `docs.dynamo.nvidia.com/dynamo/`
- **Specific version:** `docs.dynamo.nvidia.com/dynamo/v0.8.1/`
- **Dev:** `docs.dynamo.nvidia.com/dynamo/dev/`

### Creating a new version

Push a semver tag:

```bash
git tag v0.9.0
git push origin v0.9.0
```

The `release-version` job in `fern-docs.yml` handles everything else automatically.

---

## How Publishing Works

```text
┌─────────────────────────────────────────────────────────────────────┐
│                          CONTINUOUS (dev)                           │
│                                                                     │
│  Developer pushes to main                                           │
│       │                                                             │
│       ▼                                                             │
│  docs/** changed? ── No ──▶ (nothing happens)                       │
│       │                                                             │
│      Yes                                                            │
│       │                                                             │
│       ▼                                                             │
│  sync-dev job:                                                      │
│    1. Copy docs/ content → fern/pages/ on docs-website branch       │
│    2. Copy fern/ configs → fern/ on docs-website branch             │
│    3. Convert GitHub callouts → Fern admonitions                    │
│    4. Preserve version list from docs-website's docs.yml            │
│    5. Commit + push to docs-website                                 │
│    6. fern generate --docs (publishes to Fern)                      │
│       │                                                             │
│       ▼                                                             │
│  Live on docs.dynamo.nvidia.com/dynamo/dev/ within minutes          │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                           VERSION RELEASE                           │
│                                                                     │
│  Maintainer pushes vX.Y.Z tag                                       │
│       │                                                             │
│       ▼                                                             │
│  release-version job:                                               │
│    1. Validate tag format (vX.Y.Z only)                             │
│    2. Check version doesn't already exist                           │
│    3. Snapshot fern/pages/ → fern/pages-vX.Y.Z/                     │
│    4. Rewrite GitHub links (tree/main → tree/vX.Y.Z)                │
│    5. Convert callouts in snapshot                                  │
│    6. Create fern/versions/vX.Y.Z.yml (paths → pages-vX.Y.Z/)       │
│    7. Update fern/docs.yml (insert version, set as default)         │
│    8. Commit + push to docs-website                                 │
│    9. fern generate --docs (publishes to Fern)                      │
│       │                                                             │
│       ▼                                                             │
│  New version visible in dropdown at docs.dynamo.nvidia.com/dynamo/  │
└─────────────────────────────────────────────────────────────────────┘
```

### Secrets

| Secret | Purpose |
|---|---|
| `FERN_TOKEN` | Authentication token for `fern generate --docs`. Required for publishing. Stored in GitHub repo secrets. |

---

## Common Tasks

### Update existing documentation

1. Edit files in `docs/` on a feature branch.
2. If adding a new page, add its entry in `docs/index.yml`.
3. Open a PR. Linting runs automatically.
4. Merge. Sync and publish happen automatically.

### Add a new top-level section

1. Create a directory under `docs/` (e.g., `docs/new-section/`).
2. Add markdown files for each page.
3. Add a new `- section:` block in `docs/index.yml` with the desired hierarchy.

### Release versioned documentation

```bash
git tag v1.0.0
git push origin v1.0.0
```

That's it. The workflow snapshots the current dev docs, creates the version config, and publishes.

### Manually trigger a sync or release

Go to **Actions → Fern Docs → Run workflow**:

- Leave **tag** empty to trigger a dev sync.
- Enter a tag (e.g., `v0.9.0`) to trigger a version release.

### Debug a failed publish

1. Check the **Actions** tab for the failed `Fern Docs` workflow run.
2. Common issues:
   - **Broken links:** Fix the links flagged by `fern docs broken-links`.
   - **Invalid YAML:** Check `fern/docs.yml` or `docs/index.yml` syntax.
   - **Expired `FERN_TOKEN`:** Rotate the token in repo secrets.
   - **Duplicate version:** The tag was already released; check `docs-website` for an existing `fern/pages-vX.Y.Z/` directory.
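
The release job rejects malformed tags and duplicate versions, and both conditions can be checked locally before pushing a tag. The following is a minimal sketch, assuming a local checkout of the `docs-website` branch. The regular expression and the names `is_release_tag`, `snapshot_exists`, and `preflight` are illustrative and are not taken from the workflow itself.

```python
import re
from pathlib import Path

# Assumed pattern for "exactly vX.Y.Z, no suffixes like -rc1".
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def is_release_tag(tag: str) -> bool:
    """True only for tags of the form vX.Y.Z with numeric parts."""
    return RELEASE_TAG.fullmatch(tag) is not None


def snapshot_exists(docs_website_root: Path, tag: str) -> bool:
    """True if fern/pages-<tag>/ already exists in the checkout."""
    return (docs_website_root / "fern" / f"pages-{tag}").is_dir()


def preflight(docs_website_root: Path, tag: str) -> list:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if not is_release_tag(tag):
        problems.append(f"{tag!r} is not of the form vX.Y.Z")
    elif snapshot_exists(docs_website_root, tag):
        problems.append(f"fern/pages-{tag}/ already exists")
    return problems
```

For example, `preflight(Path("../docs-website"), "v1.0.0-rc1")` reports the malformed tag without touching the filesystem; the checkout path here is a placeholder.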