Dynamo Architecture Flow#
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in components/backends/vllm. Color-coded flows indicate different types of operations:
🔵 Main Request Flow (Blue)#
The primary user journey through the system:
Discovery (S1): Client discovers the service endpoint
Request (S2): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
Validate (S3): Frontend forwards request to Processor for validation and routing
Route (S3): Processor routes the validated request to appropriate Decode Worker
🟠Decision and Allocation Flow (Orange)#
The system’s intelligent routing and resource allocation:
Query (S4): Decode Worker queries for prefix cache hits to optimize processing
Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill 5a. Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory
Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
🟢 Prefill Worker Flow (Green)#
The dedicated prefill processing pipeline:
NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
Prefill (S9): Worker executes the prefill computation on the input tokens
NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker’s pre-allocated blocks
🟣 Completion Flow (Purple)#
The response generation and delivery:
Notify (S11): PrefillWorker sends completion notification to Decode Worker
Decode (S12): Decode Worker decodes from its local KV cache containing prefilled data
Response (S13): The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
🔗 Infrastructure Connections (Dotted lines)#
Coordination and messaging support:
ETCD Connections (Gray, dotted)#
Frontend, Processor, Planner: Service discovery and registration
Decode Worker, PrefillWorker: NIXL metadata storage for GPU communication setup
NATS Connections (Teal, dotted)#
PrefillQueue: JetStream consumer group for reliable work distribution
Processor: Load balancing across workers
Planning Connections (Gold, dotted)#
Frontend → Planner: Metrics collection for auto-scaling decisions
Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker
Technical Implementation Details#
NIXL (NVIDIA Interchange Library):#
Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
Decode Worker publishes GPU metadata to ETCD for coordination
PrefillWorker loads metadata to establish direct communication channels
Block-based transfers (64–128 tokens per block) for efficient batching
Disaggregated KV Cache:#
Each Decode Worker maintains local KV cache in its GPU memory
No shared storage bottlenecks—all transfers are direct worker-to-worker
Pre-allocated blocks ensure deterministic memory layout and performance
%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
graph TD
%% Top Layer - Client & Frontend
Client["<b>HTTP Client</b>"]
S1[["<b>1 DISCOVERY</b>"]]
Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
S2[["<b>2 REQUEST</b>"]]
%% Processing Layer
Processor["<b>Processor</b><br/><i>Request Handler & Router</i>"]
S3[["<b>3 VALIDATE</b>"]]
%% Infrastructure - Positioned strategically to minimize crossings
subgraph INF["<b>Infrastructure Layer</b>"]
ETCD[("<b>ETCD</b><br/><i>Service Discovery &<br/>NIXL Metadata</i>")]
NATS[("<b>NATS</b><br/><i>Message Broker</i>")]
Planner["<b>Planner</b><br/><i>Resource Management<br/>Auto-scaling</i>"]
end
%% Worker Layer - Main processing
subgraph WL["<b>Worker Layer</b>"]
%% VllmWorker section
VllmWorker["<b>Decode Worker</b><br/><i>Handles Decoding & Disagg Decisions</i>"]
S4[["<b>4 QUERY</b>"]]
S5[["<b>5 DISAGG DECISION</b>"]]
S5a[["<b>5a ALLOCATE</b>"]]
S12[["<b>12 DECODE</b>"]]
S6[["<b>6 QUEUE</b>"]]
S13[["<b>13 RESPONSE</b>"]]
%% Storage positioned near workers
LocalKVCache[("<b>Local KV Cache</b><br/><i>Pre-allocated Blocks</i>")]
%% Prefill System - Right side to minimize crossings
subgraph PS["<b>Prefill System</b>"]
PrefillQueue["<b>Prefill Queue</b><br/><i>NATS JetStream<br/>Consumer Group</i>"]
PrefillWorker["<b>Prefill Worker</b><br/><i>Dedicated Prefill Processing<br/>(Multiple Instances)</i>"]
S7[["<b>7 NATS PULL</b>"]]
S8[["<b>8 LOAD METADATA</b>"]]
S9[["<b>9 PREFILL</b>"]]
S10[["<b>10 NIXL TRANSFER</b>"]]
S11[["<b>11 NOTIFY</b>"]]
end
end
%% Main Request Flow (Blue) - Clean vertical flow
Client -.-> S1
S1 -->|HTTP API Call| Frontend
Frontend -.-> S2
S2 -->|Process & Validate| Processor
Processor -.-> S3
S3 -->|Route to Worker| VllmWorker
%% VllmWorker Internal Flow (Orange)
VllmWorker -.-> S4
S4 -->|Query Prefix Cache Hit| S5
S5 -->|Prefill Length & Queue Check| S5a
S5a -->|Continue to Decode| S12
%% Allocation & Queuing (Orange) - Minimize crossings
S5a -->|Allocate KV Cache Blocks| LocalKVCache
VllmWorker --> S6
S6 -->|Put RemotePrefillRequest| PrefillQueue
%% Prefill Worker Flow (Green) - Self-contained within PS
PrefillQueue -.-> S7
S7 -->|Consumer Group Pull| PrefillWorker
PrefillWorker -.-> S8
PrefillWorker -.-> S9
S9 -->|Execute Prefill| S10
S10 -->|Direct GPU Transfer| LocalKVCache
PrefillWorker --> S11
%% Return Flow (Purple) - Clean return path
S11 -->|Completion Notification| S12
S12 -->|Decode from KV Cache| S13
S13 -->|Post-process Response| Processor
Processor -->|HTTP Response| Frontend
Frontend -->|Final Response| Client
%% Infrastructure Connections - Organized to avoid crossings
%% ETCD Connections - Grouped by proximity
Frontend -.->|Service Discovery| ETCD
Processor -.->|Service Discovery| ETCD
VllmWorker -.->|NIXL Metadata| ETCD
PrefillWorker -.->|NIXL Metadata| ETCD
S8 -.->|Load NIXL Metadata| ETCD
Planner -.->|Service Discovery| ETCD
%% NATS Connections - Direct to queue system
PrefillQueue -.->|JetStream| NATS
Processor -.->|Load Balancing| NATS
%% Planning Connections - Strategic positioning
Frontend -.->|Metrics| Planner
Planner -.->|Auto-scaling| VllmWorker
Planner -.->|Auto-scaling| PrefillWorker
%% Styling - Each component with unique colors
classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
classDef processor fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
classDef prefillQueue fill:#fff8e1,stroke:#E65100,stroke-width:3px
classDef prefillWorker fill:#fce4ec,stroke:#C2185B,stroke-width:3px
classDef prefillBox fill:#eceff1,stroke:#455A64,stroke-width:3px
classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
classDef etcd fill:#fff9c4,stroke:#F9A825,stroke-width:3px
classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
class Client client
class Frontend frontend
class Processor processor
class VllmWorker worker
class PrefillQueue prefillQueue
class PrefillWorker prefillWorker
class Planner planner
class LocalKVCache storage
class ETCD etcd
class NATS nats
class PS prefillBox
class INF infraLayer
class WL workerLayer
%% Flow Colors - Different line styles to reduce visual clutter
%% Main Request Flow - Blue (solid)
linkStyle 0 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 1 stroke:#1565C0,stroke-width:4px
linkStyle 2 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 3 stroke:#1565C0,stroke-width:4px
linkStyle 4 stroke:#1565C0,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 5 stroke:#1565C0,stroke-width:4px
%% Decision & Allocation Flow - Orange (mixed)
linkStyle 6 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 7 stroke:#E65100,stroke-width:4px
linkStyle 8 stroke:#E65100,stroke-width:4px
linkStyle 9 stroke:#E65100,stroke-width:3px,stroke-dasharray: 3 3
%% KV Cache & Queue - Orange (solid)
linkStyle 10 stroke:#E65100,stroke-width:4px
linkStyle 11 stroke:#E65100,stroke-width:4px
linkStyle 12 stroke:#E65100,stroke-width:4px
%% Prefill Worker Flow - Green (mixed)
linkStyle 13 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 14 stroke:#2E7D32,stroke-width:4px
linkStyle 15 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 16 stroke:#2E7D32,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 17 stroke:#2E7D32,stroke-width:4px
linkStyle 18 stroke:#2E7D32,stroke-width:4px
linkStyle 19 stroke:#2E7D32,stroke-width:4px
%% Completion Flow - Purple (mixed)
linkStyle 20 stroke:#6A1B9A,stroke-width:4px
linkStyle 21 stroke:#6A1B9A,stroke-width:3px,stroke-dasharray: 3 3
linkStyle 22 stroke:#6A1B9A,stroke-width:4px
linkStyle 23 stroke:#6A1B9A,stroke-width:4px
linkStyle 24 stroke:#6A1B9A,stroke-width:4px
%% Infrastructure Flows - Lighter and dotted to reduce visual noise
%% ETCD Connections - Gray (dotted, thinner)
linkStyle 25 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 26 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 27 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 28 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 30 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
%% NATS Connections - Teal (dotted, thinner)
linkStyle 31 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 32 stroke:#26A69A,stroke-width:2px,stroke-dasharray: 8 8
%% Planning Connections - Gold (dotted, thinner)
linkStyle 33 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 34 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8
linkStyle 35 stroke:#FFA726,stroke-width:2px,stroke-dasharray: 8 8