Introduction to Dynamo
Dynamo is an open-source, high-throughput, low-latency inference framework, designed to serve generative AI workloads in distributed environments. It is Kubernetes-native for production deployments, with an operator, CRDs, Helm charts, service discovery, Gateway API integration, and topology-aware scheduling, while still supporting local containers, Python workers, and standalone components for development or incremental adoption.
This page gives an overview of Dynamo’s design principles, performance benefits, and production-grade features.
Looking to get started right away? See the Quickstart to install and run Dynamo in minutes.
Why Dynamo?
Inference engines optimize the GPU; Dynamo optimizes the system around them.
- System-level optimization on top of any engine — Inference engines optimize the single-GPU forward pass. Dynamo adds the distributed layer: disaggregated serving, smart routing, KV cache management across memory tiers, and auto-scaling.
- Composable performance improvement techniques — The techniques, disaggregated serving, KV cache-aware routing, and KV cache offloading, each improve performance on their own; using them together yields compounding gains.
- Engine-agnostic — Works with vLLM, SGLang, and TensorRT-LLM. Swap engines without changing your serving infrastructure. Extending support for Intel XPU and AMD hardware.
- Kubernetes-native production path — Dynamo exposes inference graphs as Kubernetes resources (
DynamoGraphDeployment,DynamoComponentDeployment,DynamoGraphDeploymentRequest) and reconciles them with an operator, while integrating with Kubernetes service discovery, Gateway API Inference Extension, scheduling, observability, and model loading workflows. - Production-ready at scale — Dynamo covers the full deployment lifecycle: automatic configuration (AIConfigurator), runtime auto-scaling (Planner), topology-aware gang scheduling (Grove), fault tolerance, and observability.
- Modular adoption — Start with one component (e.g., just the Router for KV-aware routing on top of your existing engine). Adopt more as needed. Each component is independently installable via pip.
Design Principles
Strong Foundations for AI Inference
Dynamo adds system-level optimizations on top of inference engines. To provide such optimizations, Dynamo takes an operating systems approach by laying down the foundations for scheduling, memory management, and data transfer. These foundations allow Dynamo to evolve as new system-level performance techniques emerge.
One of the motivations for Dynamo’s system-level design was to support disaggregated serving: running prefill and decode on different devices so each can be scaled and parallelized independently. Disaggregated serving required three capabilities: (1) scheduling to assign prefill and decode phases without interference, (2) memory management for KV cache offloading and onboarding, and (3) low-latency data transfer to move KV cache between nodes and across the memory hierarchy.
Dynamo’s foundations first addressed disaggregated serving, then extended to EPD disaggregation for multimodal, and now support workloads such as diffusion, RL, and agents.
Modular but Well-Integrated Ecosystem
Dynamo is designed to reduce the burden of replacing an existing stack in production. It offers modular, standalone components as Rust crates and pip wheels. For example, the three foundations of Dynamo for scheduling (Dynamo), memory management (KV Block Manager), and data transfer (NIXL) are each independently installable:
Pre-built containers with all dependencies are also available. See Release Artifacts for container images.
The Dynamo ecosystem includes these additional modular components, and will continue to grow over time:
These components are modular but are designed to work together as a unified family. New components will follow the same design principle.
Vendor-Agnostic Ecosystem Enablement
Dynamo is not designed for vendor lock-in. Dynamo aims to enable the broader AI ecosystem and to provide the functionality developers need, such as integrations with third-party components.
From the beginning, Dynamo is designed to support all LLM inference engines (vLLM, SGLang, and TensorRT-LLM). Support for additional engines is planned to enable more developer use cases.
Support for non-NVIDIA hardware is also available: Dynamo is working with HW vendors such as Intel and AMD to extend hardware support.
The full list of supported ecosystem components:
Deployment Posture
Dynamo’s production path is Kubernetes-native, not Kubernetes-only. The same core runtime concepts can be used from a local process, a container, or a Kubernetes cluster:
Request Routing Modes
Dynamo supports two request-routing modes. Both expose the same OpenAI-compatible API and the same backends; they differ in where request routing happens.
-
Standalone mode (default) — The Dynamo Frontend serves HTTP requests directly, and the integrated Dynamo Router makes KV-aware routing decisions before dispatching to workers. No external gateway is required. This is the mode used by all local installs and the default Kubernetes deployment. Request flow:
client -> Frontend -> Router -> workers. -
Gateway mode (GAIE) — Dynamo runs behind a Kubernetes Gateway API Inference Extension gateway. KV-aware routing is performed at the gateway layer by the Dynamo Endpoint Picker Plugin (EPP); the Frontend runs as a sidecar in
--router-mode directand forwards requests to the worker the EPP selected. Use this mode when your platform standardizes on the Inference Gateway, or when you want gateway-level policy (auth, rate limiting, observability) co-located with KV-aware routing. Request flow:client -> Inference Gateway -> EPP (KV-aware) -> Frontend sidecar (direct) -> workers.
Both modes support disaggregated serving, multimodal, and the same set of backends (vLLM, SGLang, TensorRT-LLM). For full setup, supported features, and configuration of gateway mode, see the Inference Gateway (GAIE) guide.
Performance
Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache-Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
-
KV cache-aware routing Smartly routes requests based on worker load and existing cache hits. By reusing precomputed KV pairs, it bypasses the prefill compute, starting the decode phase immediately. Baseten applied Dynamo KV cache-aware routing and saw 2x faster TTFT and 1.6x throughput on Qwen3 Coder 480B A35B.
-
KV cache offloading Expands the available context window by moving KV cache from HBM to cheaper storage tiers such as host memory, local disk, or remote storage. Reusing precomputed state improves TTFT, reduces Total Cost of Ownership (TCO), and allows for longer context processing.
-
Disaggregated serving In the Design Principles section, we introduced the concept of disaggregated serving. Its performance has been showcased by InferenceX. DeepSeek V3 can be served with ~7x throughput/GPU, with disaggregated serving and large-scale expert parallelism. Furthermore, when these three techniques are composed together, they yield compounding benefits as shown in the following diagram.
- Disaggregated Serving + KV Cache-Aware Routing — KV cache-aware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
- Disaggregated Serving + KV Cache Offloading — KV cache offloading results in faster TTFT, and the number of prefill workers can be reduced to reduce TCO.
- KV Cache-Aware Routing + KV Cache Offloading — Offloading increases the total addressable cache size, increasing the KV cache hit rate, which in turn accelerates the TTFT.
Ready to try these techniques? See Dynamo recipes for step-by-step deployment examples that compose disaggregated serving, routing, and offloading.
From Configuration to Production-Grade Deployment
Finding Best Configurations Under 30 Seconds with AIConfigurator
Manually finding the optimal parallelism for disaggregated serving can take days of exhaustive configuration sweeps—a challenge that only intensifies at scale.
Dynamo uses AIC-backed DynoSim-style modeling to identify strong configurations in under 30 seconds, providing clear projections of the performance gains over standard aggregated serving. This logic is natively integrated into Kubernetes Custom Resource Definition (CRD), Dynamo Graph Deployment Request (DGDR), allowing users to deploy using automatically generated optimized configs.
Auto-Adjusting Deployment Based on SLA with Planner
Once the offline configuration is found with AIConfigurator or DGDR, developers can deploy their desired model into production. However, the production traffic can vary greatly online, and static configuration determined offline will not be able to adequately handle spikes in traffic.
Dynamo offers Planner to circumvent this problem. Developers can simply set their SLA in terms of TTFT and Time Per Output Token (TPOT). Planner examines online traffic and automatically makes decisions to scale prefill and decode workers to effectively deal with traffic spikes while maintaining the specified SLA.
Recently, Planner was expanded to deal with even more sophisticated scenarios such as drastically varying Input Sequence Length (ISL) given the same SLA. See the Planner documentation for more details.
Applying Topology-Aware Hierarchical Gang Scheduling with Grove
When Planner decides to autoscale, developers need a way to effectively scale workers independently and hierarchically. Especially for prefill/decode disaggregation, prefill and decode workers need to be scaled independently to meet the specified SLA, and they need to be scheduled in physical proximity to each other for best performance.
Dynamo offers Grove which is a Kubernetes operator that provides a single declarative API for orchestrating any AI inference workload from simple single-pod deployments to complex multi-node, disaggregated systems.
Grove enables:
- Hierarchical gang scheduling
- Topology-aware placement
- Multi-level horizontal autoscaling
- Explicit startup ordering
- Rolling updates with configurable replacement strategies
These features are crucial for deploying and scaling inference at data center scale for optimal performance.
Ensuring Fault Tolerance for LLMs
Kubernetes comes with some fault tolerance functionalities, but LLM deployment requires specialized fault tolerance and resiliency. Dynamo provides comprehensive fault tolerance mechanisms across multiple layers to ensure reliable LLM inference in production deployments:
- Router and Frontend — Dynamo supports launching multiple frontend + router replicas for improved fault tolerance by sharing router states.
- Request Migration — When a worker fails during request processing, Dynamo can migrate in-progress requests to healthy workers while preserving partial generation state and maintaining seamless token flow to clients.
- Request Cancellation — Dynamo supports canceling in-flight requests through the AsyncEngineContext trait, which provides graceful stop signals and hierarchical cancellation propagation through request chains.
- Request Rejection (Load Shedding) — When workers are overloaded, Dynamo rejects new requests with HTTP 503 responses based on configurable thresholds for KV cache utilization and prefill tokens.
Observability
Dynamo provides built-in metrics, distributed tracing, and logging for monitoring inference deployments. See the Observability Guide for setup details.
What’s Next?
Explore the following resources to go deeper:
- Recipes — Compose disaggregated serving, routing, and offloading
- KV Cache-Aware Routing — Configure smart request routing
- KV Cache Offloading — Set up multi-tier memory management
- Planner — Configure SLA-based autoscaling
- Kubernetes Deployment — Deploy at scale with Grove
- Inference Gateway (GAIE) — Run Dynamo in gateway mode behind the K8s Inference Gateway
- Overall Architecture — Full technical design
- Support Matrix — Check hardware and engine compatibility
Further reading: Dynamo Digest.