Dynamo is a distributed inference runtime for generative AI systems that must operate at high throughput, low latency, and high reliability under changing traffic conditions. It is backend-agnostic (SGLang, TRT-LLM, vLLM, and others) and is built around three cooperating concerns:
A fast request path for token generation
A responsive control path for scaling and placement
A resilient state path for KV reuse and failure recovery
This document presents Dynamo as an architecture, not a feature list: what each plane owns, how requests move, how the system adapts, and how it remains correct under failure.
Design Goals
Dynamo is designed to satisfy the following goals simultaneously:
Latency stability: keep time to first token (TTFT) and inter-token latency (ITL) predictable under bursty and mixed-length traffic.
GPU efficiency: disaggregate prefill and decode so each can scale independently.
Compute reuse: minimize KV recomputation through KV-aware routing and cache lifecycle management.
Operational resilience: treat worker crashes, restarts, and overload as normal operating events.
Deployment portability: support Kubernetes-native control paths and non-Kubernetes runtime modes.
Why This Architecture Exists
Modern LLM serving hits recurring bottlenecks:
Prefill/decode imbalance leaves GPUs underutilized when traffic mix shifts (DistServe).
KV recomputation increases TTFT and wastes compute when routing ignores cache overlap (DeepSeek).
Memory pressure from long contexts and concurrency exceeds HBM capacity without multi-tier cache management (KVBM, Mooncake, AIBrix, FlexKV, LMCache).
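To make the KV recomputation bottleneck concrete, the following is a minimal sketch of overlap-aware worker selection. It is not Dynamo's router: the `Worker` class, the block-hashing scheme, the cost formula, and the `load_weight` parameter are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of a worker as seen by a router."""
    name: str
    active_blocks: int  # KV blocks currently in use, used as a load proxy
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prefix blocks


def block_hashes(tokens, block_size=4):
    """Hash the token prefix ending at each full block boundary."""
    hashes = []
    for end in range(block_size, len(tokens) + 1, block_size):
        hashes.append(hash(tuple(tokens[:end])))
    return hashes


def overlap_blocks(worker, hashes):
    """Count the leading request blocks that the worker already has cached."""
    count = 0
    for h in hashes:
        if h not in worker.cached_prefixes:
            break
        count += 1
    return count


def pick_worker(workers, tokens, block_size=4, load_weight=1.0):
    """Return the worker with the lowest estimated cost.

    Assumed cost: blocks that still need prefill + load_weight * active blocks.
    """
    hashes = block_hashes(tokens, block_size)

    def cost(worker):
        new_blocks = len(hashes) - overlap_blocks(worker, hashes)
        return new_blocks + load_weight * worker.active_blocks

    return min(workers, key=cost)
```

In this sketch a worker holding most of a request's prefix wins over an idle worker with an empty cache, but a heavily loaded worker loses even with full overlap. That trade-off between reuse and load is the point of the example; the specific weights are arbitrary.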
In Kubernetes deployments, the same architecture maps to declarative resources:
Dynamo Operator reconciles DynamoGraphDeployment.
Worker discovery is derived from DynamoWorkerMetadata and EndpointSlices.
Grove-backed multinode deployments model worker groups as PodCliqueSet and PodClique.
Independent prefill/decode elasticity is represented via PodCliqueScalingGroup with separate replicas and min targets.
The diagram labels such as PodClique A/B, ScalingGroup "Prefill", ScalingGroup "Decode", and (replicas, min) represent this grouped scaling model.
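The grouped scaling model can be pictured with a small sketch in which each group carries its own (replicas, min) pair. The `ScalingGroup` class and its `scale_to` method are hypothetical stand-ins written for this example; they are not the PodCliqueScalingGroup API or its schema.

```python
from dataclasses import dataclass


@dataclass
class ScalingGroup:
    """Hypothetical stand-in for one (replicas, min) scaling group."""
    name: str
    replicas: int
    min_replicas: int

    def scale_to(self, desired):
        """Set replicas to the desired count, never below the group's minimum."""
        self.replicas = max(self.min_replicas, desired)
        return self.replicas


prefill = ScalingGroup("Prefill", replicas=2, min_replicas=1)
decode = ScalingGroup("Decode", replicas=4, min_replicas=2)

# A shift toward long prompts grows the prefill group without touching decode.
prefill.scale_to(5)

# A scale-down request below the floor is clamped to the group's minimum.
decode.scale_to(1)
```

Because each group owns its replica count and floor, a change to one phase leaves the other untouched, which is the independence the resource model is meant to express.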
Deployment Modes
The request plane can be exposed in two ways:
Standalone mode (default): the Dynamo Frontend is the request entry point, and the integrated Dynamo Router selects workers using KV-aware scoring. This mode is used by all local installs and by the default Kubernetes deployment.
Gateway mode (GAIE): Dynamo runs behind a Kubernetes Gateway API Inference Extension gateway. KV-aware routing is performed at the gateway layer by the Dynamo Endpoint Picker Plugin (EPP). The Frontend runs as a sidecar in --router-mode direct and honors the worker selection that the EPP passes for each request in request headers.
Both modes share the same control plane, storage/events plane, and backend integrations — only the request entry point and the location of the routing decision differ. See the Inference Gateway (GAIE) guide for the gateway-mode setup and configuration reference.
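The difference between the two modes comes down to where the worker is chosen. The sketch below illustrates that split under stated assumptions: the header name, the mode strings, and the `resolve_worker` function are hypothetical and do not reflect the actual header or interface Dynamo uses.

```python
# Assumed header name for illustration only; not taken from Dynamo or GAIE.
HYPOTHETICAL_WORKER_HEADER = "x-selected-worker"


def resolve_worker(headers, mode, select_worker):
    """Return the worker that should serve a request.

    mode == "direct": trust the selection made upstream and carried in a header.
    any other mode:   make the routing decision locally via select_worker().
    """
    if mode == "direct":
        worker = headers.get(HYPOTHETICAL_WORKER_HEADER)
        if worker is None:
            raise ValueError("direct mode requires an upstream worker selection")
        return worker
    return select_worker()
```

For example, `resolve_worker({"x-selected-worker": "decode-3"}, "direct", ...)` returns the upstream choice unchanged, while a standalone call ignores the header and invokes the local selection function.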
Fault Tolerance Architecture
Fault tolerance is embedded across layers:
| Layer | Mechanism | Practical effect |
| --- | --- | --- |
| Request | Migration, cancellation | In-flight work can continue or terminate intentionally |
| Worker | Health checks, graceful shutdown, endpoint draining | Failed/terminating workers stop taking new traffic safely |
| System | Request rejection/load shedding | Prevents overload from propagating across workers |
| Infrastructure | Discovery lease expiry, event-path recovery | Stale membership is removed and traffic reroutes |
This model assumes failures are routine, not exceptional.
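As one illustration of the infrastructure layer, discovery lease expiry can be sketched as a registry in which a worker stays listed only while it keeps renewing its lease. The `LeaseRegistry` class, its method names, and the explicit `now` argument are assumptions made for this example, not Dynamo's discovery implementation.

```python
class LeaseRegistry:
    """Toy discovery registry: a worker is live only while its lease is fresh."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._expires_at = {}  # worker name -> time at which its lease lapses

    def heartbeat(self, worker, now):
        """Register the worker or renew its lease for another TTL."""
        self._expires_at[worker] = now + self.ttl

    def live_workers(self, now):
        """Drop workers whose lease has lapsed and return the rest, sorted."""
        expired = [w for w, t in self._expires_at.items() if t <= now]
        for worker in expired:
            del self._expires_at[worker]
        return sorted(self._expires_at)


registry = LeaseRegistry(ttl_seconds=10)
registry.heartbeat("worker-a", now=0)
registry.heartbeat("worker-b", now=0)
registry.heartbeat("worker-a", now=8)  # worker-b crashes and stops renewing
```

A crashed worker needs no explicit deregistration in this scheme: once its lease lapses it disappears from `live_workers`, and a router that consults the registry stops sending it traffic.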
Performance Rationale
Disaggregated Serving
Separating prefill and decode improves utilization and enables phase-specific scaling.
Tested on H100 with R1 Distilled Llama 70B FP8 on vLLM, with a 3K input sequence length (ISL) and a 150 output sequence length (OSL).
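Phase-specific scaling can be illustrated with back-of-the-envelope arithmetic. The sketch below takes 3K as 3000 tokens for round numbers; the request rate and the per-worker token rates are invented for the example and are not measured Dynamo or vLLM figures.

```python
import math


def replicas_needed(requests_per_s, tokens_per_request, tokens_per_s_per_worker):
    """Smallest worker count whose combined token rate covers the demand."""
    demand = requests_per_s * tokens_per_request
    return math.ceil(demand / tokens_per_s_per_worker)


# Hypothetical per-worker capacities, NOT benchmark results.
PREFILL_TOKENS_PER_S = 20_000
DECODE_TOKENS_PER_S = 1_000


def plan(requests_per_s, isl, osl):
    """Size the prefill and decode pools independently of each other."""
    return {
        "prefill": replicas_needed(requests_per_s, isl, PREFILL_TOKENS_PER_S),
        "decode": replicas_needed(requests_per_s, osl, DECODE_TOKENS_PER_S),
    }


baseline = plan(10, isl=3000, osl=150)
longer_prompts = plan(10, isl=6000, osl=150)
```

Under these assumed rates, doubling the prompt length raises only the prefill count while the decode count stays the same. An aggregated deployment would have to scale whole workers to absorb the same shift.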