SGLang Disaggregated Serving#

This document explains how SGLang’s disaggregated prefill-decode architecture works, both standalone and within Dynamo.

Overview#

Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:

  • Independent scaling of prefill and decode resources

  • Better resource utilization (prefill is compute-bound, decode is memory-bound)

  • Efficient KV cache transfer between workers using RDMA

How Dynamo Integrates with SGLang Disaggregation#

SGLang’s standalone approach:

  1. The load balancer receives a request from the client

  2. A random (prefill, decode) pair is selected from the pool of available workers

  3. The request is sent to both the prefill and decode workers as concurrent asyncio tasks

  4. Internally, the KV cache is handed off from prefill → decode (see the sketch below)
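
A minimal, self-contained Python sketch of this standalone flow (the worker pools and the forward_request helper are illustrative stand-ins, not SGLang's actual load-balancer API):

```python
import asyncio
import random

async def forward_request(worker_url: str, request: dict) -> dict:
    """Stand-in for an HTTP call to a worker (e.g. POST /generate)."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"worker": worker_url, "echo": request}

async def handle_request(request: dict,
                         prefill_workers: list[str],
                         decode_workers: list[str]) -> dict:
    # 2. Pick a random (prefill, decode) pair from the available pools.
    prefill = random.choice(prefill_workers)
    decode = random.choice(decode_workers)

    # 3. Dispatch the request to both workers as concurrent asyncio tasks.
    prefill_task = asyncio.create_task(forward_request(prefill, request))
    decode_task = asyncio.create_task(forward_request(decode, request))

    # 4. The KV cache moves prefill -> decode internally; the decode
    #    worker's response carries the generated tokens.
    _, decode_response = await asyncio.gather(prefill_task, decode_task)
    return decode_response

# Example usage:
# asyncio.run(handle_request({"prompt": "Hi"}, ["prefill-0"], ["decode-0"]))
```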

Dynamo’s approach:

Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:

  1. The request is routed to a decode worker first

  2. A prefill worker is chosen via round-robin or KV-aware selection

  3. The request is sent to both workers (see the sketch below)

  4. SGLang’s bootstrap server (part of the tokenizer_manager) is used in conjunction with NIXL or Mooncake to handle the KV transfer
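
In rough terms, the decode-first routing can be pictured as follows (the DisaggRouter class and its method names are illustrative only, not Dynamo's actual components):

```python
import itertools
import random

class DisaggRouter:
    """Toy decode-first router: pick a decode worker, then a prefill worker
    via round-robin (a KV-aware policy could be plugged in instead)."""

    def __init__(self, prefill_workers: list[str], decode_workers: list[str]):
        self.decode_workers = decode_workers
        self._prefill_cycle = itertools.cycle(prefill_workers)

    def route(self, request: dict) -> tuple[str, str]:
        # 1. Route to a decode worker first (here: uniformly at random).
        decode = random.choice(self.decode_workers)
        # 2. Choose a prefill worker via round-robin.
        prefill = next(self._prefill_cycle)
        # 3. Both workers then receive the request; SGLang's bootstrap
        #    server plus NIXL/Mooncake coordinate the actual KV transfer.
        return prefill, decode

# Example usage:
# router = DisaggRouter(["prefill-0", "prefill-1"], ["decode-0"])
# prefill, decode = router.route({"prompt": "Hi"})
```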

Disaggregation Flow#

The following diagram shows the complete request flow for disaggregated serving:

```mermaid
sequenceDiagram
    participant Client
    participant Decode
    participant Prefill

    Note over Decode,Prefill: 0. Setup Phase (One-Time)
    Decode->>Prefill: Register RDMA connection info (base GPU memory pointers)
    Note over Client,Prefill: Per-Request Phase
    Client->>Decode: 1. Send request
    Decode->>Prefill: 2. Forward request + get bootstrap_room
    Prefill-->>Decode: Return bootstrap_room ID
    Note over Decode: 3. Allocate GPU memory for KV cache
    Decode->>Prefill: Send allocation info (page indices, metadata buffer)
    Note over Prefill: 4. Prefill forward pass
    par Decode polls
        loop Poll transfer
            Note over Decode: 5. Poll for KV arrival
        end
    and Prefill transfers
        Note over Prefill: 6. RDMA write KV to decode
        Prefill->>Decode: Transfer KV cache + metadata
    end
    Note over Prefill: 7. Poll RDMA handles
    Note over Prefill: Transfer complete, deallocate metadata
    Note over Decode: 8. KV received, start decode
    loop Generate tokens
        Note over Decode: Decode forward pass
        Decode-->>Client: Stream output token
    end
```

Key Steps Explained#

Setup Phase (One-Time)

  • Decode workers register their RDMA connection information with prefill workers

  • This includes base GPU memory pointers for direct memory access (a rough sketch of the registration payload follows)
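
The exact registration payload is internal to SGLang's KV transfer backend, but conceptually it carries something like the following (field names here are assumptions for illustration, not SGLang's actual structures):

```python
from dataclasses import dataclass

@dataclass
class KVRegistration:
    """Illustrative one-time registration a decode worker sends to prefill workers."""
    worker_id: str            # identity of the decode worker
    rdma_endpoint: str        # connection info for the RDMA transport (NIXL/Mooncake)
    kv_base_ptrs: list[int]   # base GPU memory pointers of the KV cache pool
    metadata_base_ptr: int    # buffer used for per-request transfer metadata
    kv_page_size: int         # tokens per KV page, needed to resolve page indices
```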

Per-Request Flow

  1. Request initiation: Client sends request to decode worker

  2. Bootstrap room allocation: Decode forwards to prefill and receives a bootstrap_room ID for coordination

  3. Memory allocation: Decode allocates GPU memory pages for incoming KV cache

  4. Prefill execution: Prefill worker processes the prompt and generates KV cache

  5. KV transfer: Prefill uses RDMA to write KV cache directly to decode’s GPU memory (while decode polls for completion)

  6. Cleanup: Prefill deallocates transfer metadata after confirming completion

  7. Decode phase: Decode worker generates tokens using the transferred KV cache

  8. Streaming: Tokens are streamed back to the client as they’re generated (see the end-to-end sketch below)
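
Putting the per-request steps together, a self-contained toy sketch of the decode-side coordination might look like this (FakePrefillConn and its methods are stand-ins; in SGLang the bootstrap server and NIXL/Mooncake play these roles):

```python
import time
from dataclasses import dataclass, field

@dataclass
class FakePrefillConn:
    """Stand-in for the prefill side of the KV transfer."""
    _done: dict = field(default_factory=dict)

    def forward(self, req: dict) -> int:
        room = id(req)              # 2. hand back a bootstrap_room ID
        self._done[room] = False
        return room

    def send_allocation(self, room: int, pages: list[int]) -> None:
        # In reality prefill runs its forward pass (4) and RDMA-writes the
        # KV cache into these pages (5); here we just mark the transfer done.
        self._done[room] = True

    def transfer_done(self, room: int) -> bool:
        return self._done[room]

def decode_side(req: dict, conn: FakePrefillConn):
    room = conn.forward(req)                               # 2. bootstrap_room ID
    pages = list(range((len(req["prompt"]) + 15) // 16))   # 3. allocate KV pages (toy: 16 tokens/page)
    conn.send_allocation(room, pages)                      # tell prefill where to write
    while not conn.transfer_done(room):                    # 5. poll until the KV cache has landed
        time.sleep(0.001)
    for token in ["Hello", ",", " world"]:                 # 7./8. decode and stream tokens (toy output)
        yield token

print("".join(decode_side({"prompt": "Hi"}, FakePrefillConn())))
```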

Performance Characteristics#

  • RDMA transfer: Zero-copy GPU-to-GPU transfer with minimal CPU involvement

  • Parallel operations: Decode can poll while prefill transfers data

  • One-time setup: RDMA connections established once, reused for all requests