Dynamo is a distributed inference runtime for generative AI systems that must sustain high throughput, low latency, and high reliability under changing traffic conditions. It is backend-agnostic (SGLang, TRT-LLM, vLLM, and others) and is built around three cooperating concerns:
A fast request path for token generation
A responsive control path for scaling and placement
A resilient state path for KV reuse and failure recovery
This document presents Dynamo as an architecture, not a feature list: what each plane owns, how requests move, how the system adapts, and how it remains correct under failure.
Design Goals
Dynamo is designed to satisfy the following goals simultaneously:
Latency stability: keep TTFT and ITL predictable under bursty and mixed-length traffic.
GPU efficiency: disaggregate prefill and decode so each can scale independently.
Compute reuse: minimize KV recomputation through KV-aware routing and cache lifecycle management.
Operational resilience: treat worker crashes, restarts, and overload as normal operating events.
Deployment portability: support Kubernetes-native control paths and non-Kubernetes runtime modes.
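To make the compute-reuse goal concrete, the following is a minimal sketch of what KV-aware routing can look like: score each worker by how many KV blocks it would still have to compute for a prompt, plus a load penalty, and pick the cheapest. This is an illustration under stated assumptions, not Dynamo's router or API. The names `Worker`, `block_hashes`, `overlap`, `pick_worker`, the `load_weight` parameter, and the 4-token block size are all hypothetical.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; hypothetical, chosen small for readability


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes = []
    prev = 0
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    active_blocks: int = 0                     # rough proxy for current load
    cached: set = field(default_factory=set)   # block hashes held in this worker's KV cache


def overlap(worker, hashes):
    """Count leading blocks of the prompt that are already cached on the worker."""
    n = 0
    for h in hashes:
        if h not in worker.cached:
            break
        n += 1
    return n


def pick_worker(workers, tokens, load_weight=1.0):
    """Choose the worker with the lowest (blocks to prefill + weighted load) cost."""
    hashes = block_hashes(tokens)

    def cost(w):
        new_blocks = len(hashes) - overlap(w, hashes)
        return new_blocks + load_weight * w.active_blocks

    return min(workers, key=cost)


if __name__ == "__main__":
    prompt = list(range(16))
    a = Worker("a", active_blocks=1)
    b = Worker("b", active_blocks=0)
    a.cached.update(block_hashes(prompt[:12]))  # worker a has seen the first 12 tokens
    print(pick_worker([a, b], prompt).name)
```

In this sketch, a worker with a warm prefix wins until its load outweighs the prefill it saves. That is the trade-off between cache overlap and load balance that a routing policy of this kind has to tune.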
Why This Architecture Exists
Modern LLM serving hits recurring bottlenecks:
Prefill/decode imbalance leaves GPUs underutilized when traffic mix shifts (DistServe).
KV recomputation increases TTFT and wastes compute when routing ignores cache overlap (DeepSeek).
Memory pressure from long contexts and high concurrency exceeds HBM capacity unless the KV cache is managed across multiple memory tiers (KVBM, Mooncake, AIBrix, FlexKV, LMCache).