For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • 🔵 Main Request Flow (Blue)
  • 🟢 Prefill Flow (Green)
  • 🟠 Decode Routing Flow (Orange)
  • 🟣 Completion Flow (Purple)
  • 🔗 Infrastructure Connections (Dotted lines)
  • Service Discovery
  • Request Plane
  • NATS Connections (Optional, for KV routing)
  • Planning Connections (Gold, dotted)
  • Technical Implementation Details
  • PrefillRouter Orchestration:
  • NIXL (NVIDIA Interchange Library):
  • Disaggregated KV Cache:
Design Docs

Architecture Flow

||View as Markdown|
Edit this page
Previous

Overall Architecture

Next

Disaggregated Serving

This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.

🔵 Main Request Flow (Blue)

The primary user journey through the system:

  1. Request (S1): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
  2. Preprocess (S2): Frontend preprocesses the request (applies chat template, tokenizes) and validates it
  3. Route to Prefill (S3): PrefillRouter selects a prefill worker using KV-aware routing or load balancing

🟢 Prefill Flow (Green)

The prefill processing pipeline:

  1. Prefill (S4): Prefill worker executes the prefill computation on the input tokens and generates KV cache
  2. Return Metadata (S5): Prefill worker returns disaggregated_params containing backend-specific transfer metadata

🟠 Decode Routing Flow (Orange)

Router orchestration to decode phase:

  1. Route to Decode (S6): PrefillRouter injects prefill result into decode request and routes to decode worker
  2. KV Transfer (S7): Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL

🟣 Completion Flow (Purple)

The response generation and delivery:

  1. Decode (S8): Decode worker generates tokens using the transferred KV cache
  2. Response (S9): Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client

🔗 Infrastructure Connections (Dotted lines)

Coordination and messaging support:

Service Discovery

  • On Kubernetes (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
  • On bare metal: Uses etcd or filesystem for service discovery and endpoint registration.

Request Plane

  • TCP (default): Direct TCP connections between Frontend and Workers for request/response transport.
  • HTTP/NATS: Alternative transports configurable via DYN_REQUEST_PLANE.

NATS Connections (Optional, for KV routing)

  • KV Events: Cache state events for KV-aware routing (can be disabled with --no-kv-events)

Planning Connections (Gold, dotted)

  • Frontend → Planner: Metrics collection for auto-scaling decisions
  • Planner → Workers: Resource scaling commands for workers

Technical Implementation Details

PrefillRouter Orchestration:

  • The PrefillRouter sits between the Frontend and workers, orchestrating disaggregated serving
  • Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
  • Injects transfer metadata into decode requests for KV cache coordination

NIXL (NVIDIA Interchange Library):

  • Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
  • Transfer metadata exchanged via disaggregated_params in prefill response
  • Backend-specific coordination: SGLang uses bootstrap connections, TRTLLM uses opaque state, vLLM uses block IDs

Disaggregated KV Cache:

  • Each worker maintains local KV cache in its GPU memory
  • No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL
  • Non-blocking transfers allow GPU forward passes to continue during KV transfer