For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Installation
    • Support Matrix
    • Feature Matrix
    • Examples
  • Kubernetes Deployment
  • User Guides
    • Tool Calling
    • Multimodality Support
    • Finding Best Initial Configs
    • Dynamo Benchmarking Guide
    • Tuning Disaggregated Performance
    • Writing Python Workers in Dynamo
    • Glossary
  • Components
    • Router
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • 🔵 Main Request Flow (Blue)
  • 🟠 Decision and Allocation Flow (Orange)
  • 🟢 Prefill Worker Flow (Green)
  • 🟣 Completion Flow (Purple)
  • 🔗 Infrastructure Connections (Dotted lines)
  • Service Discovery
  • Request Plane
  • NATS Connections (Optional, for KV routing)
  • Planning Connections (Gold, dotted)
  • Technical Implementation Details
  • NIXL (NVIDIA Interchange Library):
  • Disaggregated KV Cache:
Design Docs

Dynamo Architecture Flow

||View as Markdown|
Edit this page
Previous

High Level Architecture

Next

Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance

This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in examples/backends/vllm. Color-coded flows indicate different types of operations.

Note: The “Processor” shown in the diagram represents the request processing logic (tokenization, chat template application, routing) that runs within the Frontend component. It is not a separate deployment—the Frontend handles both HTTP serving and request preprocessing via the make_engine function.

🔵 Main Request Flow (Blue)

The primary user journey through the system:

  1. Discovery (S1): Client discovers the service endpoint
  2. Request (S2): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
  3. Validate (S3): Frontend preprocesses the request (applies chat template, tokenizes) and validates it
  4. Route (S3): Frontend routes the validated request to appropriate Decode Worker

🟠 Decision and Allocation Flow (Orange)

The system’s intelligent routing and resource allocation:

  1. Query (S4): Decode Worker queries for prefix cache hits to optimize processing
  2. Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill 5a. Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory
  3. Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue

🟢 Prefill Worker Flow (Green)

The dedicated prefill processing pipeline:

  1. NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
  2. Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
  3. Prefill (S9): Worker executes the prefill computation on the input tokens
  4. NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker’s pre-allocated blocks

🟣 Completion Flow (Purple)

The response generation and delivery:

  1. Notify (S11): PrefillWorker sends completion notification to Decode Worker
  2. Decode (S12): Decode Worker decodes from its local KV cache containing prefilled data
  3. Response (S13): The generated response flows back through the Frontend for post-processing (detokenization) and delivery to the Client

🔗 Infrastructure Connections (Dotted lines)

Coordination and messaging support:

Service Discovery

  • On Kubernetes (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
  • On bare metal: Uses etcd for service discovery and endpoint registration.

Request Plane

  • TCP (default): Direct TCP connections between Frontend and Workers for request/response transport.
  • HTTP/NATS: Alternative transports configurable via DYN_REQUEST_PLANE.

NATS Connections (Optional, for KV routing)

  • PrefillQueue: JetStream consumer group for reliable work distribution in disaggregated serving
  • KV Events: Cache state events for KV-aware routing (can be disabled with --no-kv-events)

Planning Connections (Gold, dotted)

  • Frontend → Planner: Metrics collection for auto-scaling decisions
  • Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker

Technical Implementation Details

NIXL (NVIDIA Interchange Library):

  • Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
  • Decode Worker publishes GPU metadata to ETCD for coordination
  • PrefillWorker loads metadata to establish direct communication channels
  • Block-based transfers (64–128 tokens per block) for efficient batching

Disaggregated KV Cache:

  • Each Decode Worker maintains local KV cache in its GPU memory
  • No shared storage bottlenecks—all transfers are direct worker-to-worker
  • Pre-allocated blocks ensure deterministic memory layout and performance