SGLang Disaggregated Serving#
This document explains how SGLang’s disaggregated prefill-decode architecture works, both standalone and within Dynamo.
Overview#
Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:
Independent scaling of prefill and decode resources
Better resource utilization (prefill is compute-bound, decode is memory-bound)
Efficient KV cache transfer between workers using RDMA
How Dynamo Integrates with SGLang Disaggregation#
SGLang’s standalone approach:
The load balancer receives a request from the client
A random (prefill, decode) pair is selected from the pool of available workers
The request is sent to both the prefill and decode workers via asyncio tasks (a minimal sketch of this flow follows this list)
Internally, disaggregation is done from prefill → decode
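A minimal sketch of this standalone flow, assuming illustrative worker URLs and an aiohttp-based load balancer (the endpoint name and payload fields are assumptions, not SGLang's actual router API):

```python
import asyncio
import random

import aiohttp

# Hypothetical worker pools; in practice these are registered with the load balancer.
PREFILL_WORKERS = ["http://prefill-0:30000", "http://prefill-1:30000"]
DECODE_WORKERS = ["http://decode-0:30001", "http://decode-1:30001"]


async def handle_request(payload: dict) -> dict:
    # Pick a random (prefill, decode) pair from the available pools.
    prefill_url = random.choice(PREFILL_WORKERS)
    decode_url = random.choice(DECODE_WORKERS)

    async with aiohttp.ClientSession() as session:

        async def call(url: str) -> dict:
            # "/generate" is a placeholder endpoint for illustration.
            async with session.post(f"{url}/generate", json=payload) as resp:
                return await resp.json()

        # Dispatch to both workers concurrently via asyncio tasks; the
        # prefill -> decode KV handoff happens inside the workers.
        _, decode_result = await asyncio.gather(call(prefill_url), call(decode_url))
        # The decode worker's response carries the generated tokens.
        return decode_result
```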
Dynamo’s approach:
Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:
Route to a decode worker first
Choose a prefill worker via round-robin or KV-aware selection
Send the request to both workers (see the sketch after this list)
SGLang’s bootstrap server (part of the tokenizer_manager) is used in conjunction with NIXL/Mooncake to handle the KV transfer
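A rough sketch of this selection logic, assuming hypothetical worker-client and KV-router interfaces (the class and method names are illustrative, not Dynamo's actual API):

```python
import asyncio
import itertools


class DisaggregatedRouter:
    """Illustrative router: decode-first routing, round-robin or KV-aware prefill selection."""

    def __init__(self, decode_workers, prefill_workers, kv_router=None):
        self._decode_pool = itertools.cycle(decode_workers)
        self._prefill_pool = itertools.cycle(prefill_workers)
        self._kv_router = kv_router  # optional KV-aware selector (assumed interface)

    async def route(self, request: dict) -> dict:
        # 1. Route to a decode worker first.
        decode = next(self._decode_pool)

        # 2. Choose a prefill worker: KV-aware if a selector is configured,
        #    otherwise plain round-robin.
        if self._kv_router is not None:
            prefill = self._kv_router.best_match(request["token_ids"])
        else:
            prefill = next(self._prefill_pool)

        # 3. Send the request to both workers; the bootstrap server plus
        #    NIXL/Mooncake handle the actual KV-cache transfer between them.
        _, decode_output = await asyncio.gather(
            prefill.generate(request), decode.generate(request)
        )
        return decode_output
```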
Disaggregation Flow#
The following diagram shows the complete request flow for disaggregated serving:
sequenceDiagram
participant Client
participant Decode
participant Prefill
Note over Decode,Prefill: 0. Setup Phase (One-Time)
Decode->>Prefill: Register RDMA connection info (base GPU memory pointers)
Note over Client,Prefill: Per-Request Phase
Client->>Decode: 1. Send request
Decode->>Prefill: 2. Forward request + get bootstrap_room
Prefill-->>Decode: Return bootstrap_room ID
Note over Decode: 3. Allocate GPU memory for KV cache
Decode->>Prefill: Send allocation info (page indices, metadata buffer)
Note over Prefill: 4. Prefill forward pass
par Decode polls
loop Poll transfer
Note over Decode: 5. Poll for KV arrival
end
and Prefill transfers
Note over Prefill: 6. RDMA write KV to decode
Prefill->>Decode: Transfer KV cache + metadata
end
Note over Prefill: 7. Poll RDMA handles
Note over Prefill: Transfer complete, deallocate metadata
Note over Decode: 8. KV received, start decode
loop Generate tokens
Note over Decode: Decode forward pass
Decode-->>Client: Stream output token
end
Key Steps Explained#
Setup Phase (One-Time)
Decode workers register their RDMA connection information with prefill workers
This includes base GPU memory pointers for direct memory access
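As a rough illustration of what that one-time registration might carry (the field names here are assumptions, not the actual SGLang/NIXL message format):

```python
from dataclasses import dataclass, field


@dataclass
class KVRegistration:
    """Connection info a decode worker sends to each prefill worker once at startup."""
    decode_rank: int                                        # identity of the registering decode worker
    rdma_endpoint: str                                      # NIC address / transfer-agent metadata
    kv_base_ptrs: list[int] = field(default_factory=list)   # base GPU pointers for the KV pool
    metadata_base_ptr: int = 0                              # buffer for per-request transfer metadata


# Prefill side: remember each decode peer so later requests can be written
# directly into that worker's GPU memory via RDMA.
registered_peers: dict[int, KVRegistration] = {}


def register_decode_worker(reg: KVRegistration) -> None:
    registered_peers[reg.decode_rank] = reg
```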
Per-Request Flow
Request initiation: Client sends request to decode worker
Bootstrap room allocation: Decode forwards the request to prefill and receives a bootstrap_room ID used to coordinate the transfer
Memory allocation: Decode allocates GPU memory pages for incoming KV cache
Prefill execution: Prefill worker processes the prompt and generates KV cache
KV transfer: Prefill uses RDMA to write KV cache directly to decode’s GPU memory (while decode polls for completion)
Cleanup: Prefill deallocates transfer metadata after confirming completion
Decode phase: Decode worker generates tokens using the transferred KV cache
Streaming: Tokens are streamed back to the client as they’re generated
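A pseudocode-level sketch of the decode side of this per-request flow; prefill_peer and kv_receiver stand in for the real bootstrap and NIXL/Mooncake transfer hooks and are not actual SGLang interfaces:

```python
import asyncio


async def decode_request(request: dict, prefill_peer, kv_receiver):
    # Step 2: forward the request to prefill and obtain the bootstrap_room
    # that ties the two sides of this request together.
    bootstrap_room = await prefill_peer.open_bootstrap_room(request)

    # Step 3: reserve KV cache pages for the incoming prefill output and
    # tell the prefill worker where to write them.
    page_indices, metadata_buf = kv_receiver.allocate(request["num_prompt_tokens"])
    await prefill_peer.send_allocation_info(bootstrap_room, page_indices, metadata_buf)

    # Steps 5-6: poll until the prefill worker's RDMA write has landed.
    while not kv_receiver.poll(bootstrap_room):
        await asyncio.sleep(0)  # yield so other requests keep making progress

    # Step 8: the KV cache is now in local GPU memory; decode and stream tokens.
    async for token in kv_receiver.run_decode(page_indices):
        yield token
```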
Performance Characteristics#
RDMA transfer: Zero-copy GPU-to-GPU transfer with minimal CPU involvement
Parallel operations: Decode can poll while prefill transfers data
One-time setup: RDMA connections established once, reused for all requests