Dynamo is an open-source, high-throughput, low-latency inference framework, designed to serve generative AI workloads in distributed environments. This page gives an overview of Dynamo’s design principles, performance benefits, and production-grade features.
Looking to get started right away? See the Quickstart to install and run Dynamo in minutes.
Inference engines optimize the GPU; Dynamo optimizes the system around them.
Dynamo adds system-level optimizations on top of inference engines. To provide such optimizations, Dynamo takes an operating systems approach by laying down the foundations for scheduling, memory management, and data transfer. These foundations allow Dynamo to evolve as new system-level performance techniques emerge.
One of the motivations for Dynamo’s system-level design was to support disaggregated serving: running prefill and decode on different devices so each can be scaled and parallelized independently. Disaggregated serving required three capabilities: (1) scheduling to assign prefill and decode phases without interference, (2) memory management for KV cache offloading and onboarding, and (3) low-latency data transfer to move KV cache between nodes and across the memory hierarchy.
Dynamo’s foundations first addressed disaggregated serving, then extended to EPD disaggregation for multimodal, and now support workloads such as diffusion, RL, and agents.
Dynamo is designed to reduce the burden of replacing an existing stack in production. It offers modular, standalone components as Rust crates and pip wheels. For example, the three foundations of Dynamo for scheduling (Dynamo), memory management (KV Block Manager), and data transfer (NIXL) are each independently installable:
Pre-built containers with all dependencies are also available. See Release Artifacts for container images.
The Dynamo ecosystem includes these additional modular components, and will continue to grow over time:
These components are modular but are designed to work together as a unified family. New components will follow the same design principle.
Dynamo is not designed for vendor lock-in. Dynamo aims to enable the broader AI ecosystem and to provide the functionality developers need, such as integrations with third-party components.
From the beginning, Dynamo is designed to support all LLM inference engines (vLLM, SGLang, and TensorRT-LLM). Support for additional engines is planned to enable more developer use cases.
Support for non-NVIDIA hardware is also available: Dynamo is working with HW vendors such as Intel and AMD to extend hardware support.
The full list of supported ecosystem components:
Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache-Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
KV cache-aware routing Smartly routes requests based on worker load and existing cache hits. By reusing precomputed KV pairs, it bypasses the prefill compute, starting the decode phase immediately. Baseten applied Dynamo KV cache-aware routing and saw 2x faster TTFT and 1.6x throughput on Qwen3 Coder 480B A35B.
KV cache offloading Expands the available context window by moving KV cache from HBM to cheaper storage tiers such as host memory, local disk, or remote storage. Reusing precomputed state improves TTFT, reduces Total Cost of Ownership (TCO), and allows for longer context processing.
Disaggregated serving In the Design Principles section, we introduced the concept of disaggregated serving. Its performance has been showcased by InferenceX. DeepSeek V3 can be served with ~7x throughput/GPU, with disaggregated serving and large-scale expert parallelism. Furthermore, when these three techniques are composed together, they yield compounding benefits as shown in the following diagram.
Ready to try these techniques? See Dynamo recipes for step-by-step deployment examples that compose disaggregated serving, routing, and offloading.
Manually finding the optimal parallelism for disaggregated serving can take days of exhaustive configuration sweeps—a challenge that only intensifies at scale.
Dynamo’s AIConfigurator solves this by identifying the best-performing configurations in under 30 seconds, providing clear projections of the performance gains over standard aggregated serving. This logic is natively integrated into Kubernetes Custom Resource Definition (CRD), Dynamo Graph Deployment Request (DGDR), allowing users to deploy using automatically generated optimized configs.
Once the offline configuration is found with AIConfigurator or DGDR, developers can deploy their desired model into production. However, the production traffic can vary greatly online, and static configuration determined offline will not be able to adequately handle spikes in traffic.
Dynamo offers Planner to circumvent this problem. Developers can simply set their SLA in terms of TTFT and Time Per Output Token (TPOT). Planner examines online traffic and automatically makes decisions to scale prefill and decode workers to effectively deal with traffic spikes while maintaining the specified SLA.
Recently, Planner was expanded to deal with even more sophisticated scenarios such as drastically varying Input Sequence Length (ISL) given the same SLA. See the Planner documentation for more details.
When Planner decides to autoscale, developers need a way to effectively scale workers independently and hierarchically. Especially for prefill/decode disaggregation, prefill and decode workers need to be scaled independently to meet the specified SLA, and they need to be scheduled in physical proximity to each other for best performance.
Dynamo offers Grove which is a Kubernetes operator that provides a single declarative API for orchestrating any AI inference workload from simple single-pod deployments to complex multi-node, disaggregated systems.
Grove enables:
These features are crucial for deploying and scaling inference at data center scale for optimal performance.
Kubernetes comes with some fault tolerance functionalities, but LLM deployment requires specialized fault tolerance and resiliency. Dynamo provides comprehensive fault tolerance mechanisms across multiple layers to ensure reliable LLM inference in production deployments:
Dynamo provides built-in metrics, distributed tracing, and logging for monitoring inference deployments. See the Observability Guide for setup details.
Explore the following resources to go deeper: