SGLang Disaggregated Serving#

This document explains how SGLang’s disaggregated prefill-decode architecture works, both standalone and within Dynamo.

Overview#

Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:

  • Independent scaling of prefill and decode resources

  • Better resource utilization (prefill is compute-bound, decode is memory-bound)

  • Efficient KV cache transfer between workers using RDMA

How Dynamo Integrates with SGLang Disaggregation#

SGLang’s standalone approach:

  1. The load balancer receives a request from the client

  2. A random (prefill, decode) pair is selected from the pool of available workers

  3. The request is sent to both the prefill and decode workers as concurrent asyncio tasks

  4. Internally, the KV cache is handed off from prefill → decode (see the sketch below)
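
A minimal, self-contained Python sketch of this standalone flow (the worker pools and the forward_request helper are illustrative stand-ins, not SGLang's actual load-balancer API):

```python
import asyncio
import random

async def forward_request(worker_url: str, request: dict) -> dict:
    """Stand-in for an HTTP call to a worker (e.g. POST /generate)."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"worker": worker_url, "echo": request}

async def handle_request(request: dict,
                         prefill_workers: list[str],
                         decode_workers: list[str]) -> dict:
    # 2. Pick a random (prefill, decode) pair from the available pools.
    prefill = random.choice(prefill_workers)
    decode = random.choice(decode_workers)

    # 3. Dispatch the request to both workers as concurrent asyncio tasks.
    prefill_task = asyncio.create_task(forward_request(prefill, request))
    decode_task = asyncio.create_task(forward_request(decode, request))

    # 4. The KV cache moves prefill -> decode internally; the decode
    #    worker's response carries the generated tokens.
    _, decode_response = await asyncio.gather(prefill_task, decode_task)
    return decode_response

# Example usage:
# asyncio.run(handle_request({"prompt": "Hi"}, ["prefill-0"], ["decode-0"]))
```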

Dynamo’s approach:

Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:

  1. The request is routed to a decode worker first

  2. A prefill worker is chosen via round-robin or KV-aware selection

  3. The request is sent to both workers (see the sketch below)

  4. SGLang’s bootstrap server (part of the tokenizer_manager) is used in conjunction with NIXL or Mooncake to handle the KV transfer
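
In rough terms, the decode-first routing can be pictured as follows (the DisaggRouter class and its method names are illustrative only, not Dynamo's actual components):

```python
import itertools
import random

class DisaggRouter:
    """Toy decode-first router: pick a decode worker, then a prefill worker
    via round-robin (a KV-aware policy could be plugged in instead)."""

    def __init__(self, prefill_workers: list[str], decode_workers: list[str]):
        self.decode_workers = decode_workers
        self._prefill_cycle = itertools.cycle(prefill_workers)

    def route(self, request: dict) -> tuple[str, str]:
        # 1. Route to a decode worker first (here: uniformly at random).
        decode = random.choice(self.decode_workers)
        # 2. Choose a prefill worker via round-robin.
        prefill = next(self._prefill_cycle)
        # 3. Both workers then receive the request; SGLang's bootstrap
        #    server plus NIXL/Mooncake coordinate the actual KV transfer.
        return prefill, decode

# Example usage:
# router = DisaggRouter(["prefill-0", "prefill-1"], ["decode-0"])
# prefill, decode = router.route({"prompt": "Hi"})
```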

Disaggregation Flow#

The following diagram shows the complete request flow for disaggregated serving:

```mermaid
sequenceDiagram
    participant Client
    participant Decode
    participant Prefill

    Note over Decode,Prefill: 0. Setup Phase (One-Time)
    Decode->>Prefill: Register RDMA connection info (base GPU memory pointers)
    Note over Client,Prefill: Per-Request Phase
    Client->>Decode: 1. Send request
    Decode->>Prefill: 2. Forward request + get bootstrap_room
    Prefill-->>Decode: Return bootstrap_room ID
    Note over Decode: 3. Allocate GPU memory for KV cache
    Decode->>Prefill: Send allocation info (page indices, metadata buffer)
    Note over Prefill: 4. Prefill forward pass
    par Decode polls
        loop Poll transfer
            Note over Decode: 5. Poll for KV arrival
        end
    and Prefill transfers
        Note over Prefill: 6. RDMA write KV to decode
        Prefill->>Decode: Transfer KV cache + metadata
    end
    Note over Prefill: 7. Poll RDMA handles
    Note over Prefill: Transfer complete, deallocate metadata
    Note over Decode: 8. KV received, start decode
    loop Generate tokens
        Note over Decode: Decode forward pass
        Decode-->>Client: Stream output token
    end
```

Key Steps Explained#

Setup Phase (One-Time)

  • Decode workers register their RDMA connection information with prefill workers

  • This includes base GPU memory pointers for direct memory access (a rough sketch of the registration payload follows)
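
The exact registration payload is internal to SGLang's KV transfer backend, but conceptually it carries something like the following (field names here are assumptions for illustration, not SGLang's actual structures):

```python
from dataclasses import dataclass

@dataclass
class KVRegistration:
    """Illustrative one-time registration a decode worker sends to prefill workers."""
    worker_id: str            # identity of the decode worker
    rdma_endpoint: str        # connection info for the RDMA transport (NIXL/Mooncake)
    kv_base_ptrs: list[int]   # base GPU memory pointers of the KV cache pool
    metadata_base_ptr: int    # buffer used for per-request transfer metadata
    kv_page_size: int         # tokens per KV page, needed to resolve page indices
```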

Per-Request Flow

  1. Request initiation: Client sends request to decode worker

  2. Bootstrap room allocation: Decode forwards to prefill and receives a bootstrap_room ID for coordination

  3. Memory allocation: Decode allocates GPU memory pages for incoming KV cache

  4. Prefill execution: Prefill worker processes the prompt and generates KV cache

  5. KV transfer: Prefill uses RDMA to write KV cache directly to decode’s GPU memory (while decode polls for completion)

  6. Cleanup: Prefill deallocates transfer metadata after confirming completion

  7. Decode phase: Decode worker generates tokens using the transferred KV cache

  8. Streaming: Tokens are streamed back to the client as they’re generated (see the end-to-end sketch below)
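
Putting the per-request steps together, a self-contained toy sketch of the decode-side coordination might look like this (FakePrefillConn and its methods are stand-ins; in SGLang the bootstrap server and NIXL/Mooncake play these roles):

```python
import time
from dataclasses import dataclass, field

@dataclass
class FakePrefillConn:
    """Stand-in for the prefill side of the KV transfer."""
    _done: dict = field(default_factory=dict)

    def forward(self, req: dict) -> int:
        room = id(req)              # 2. hand back a bootstrap_room ID
        self._done[room] = False
        return room

    def send_allocation(self, room: int, pages: list[int]) -> None:
        # In reality prefill runs its forward pass (4) and RDMA-writes the
        # KV cache into these pages (5); here we just mark the transfer done.
        self._done[room] = True

    def transfer_done(self, room: int) -> bool:
        return self._done[room]

def decode_side(req: dict, conn: FakePrefillConn):
    room = conn.forward(req)                               # 2. bootstrap_room ID
    pages = list(range((len(req["prompt"]) + 15) // 16))   # 3. allocate KV pages (toy: 16 tokens/page)
    conn.send_allocation(room, pages)                      # tell prefill where to write
    while not conn.transfer_done(room):                    # 5. poll until the KV cache has landed
        time.sleep(0.001)
    for token in ["Hello", ",", " world"]:                 # 7./8. decode and stream tokens (toy output)
        yield token

print("".join(decode_side({"prompt": "Hi"}, FakePrefillConn())))
```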

Performance Characteristics#

  • RDMA transfer: Zero-copy GPU-to-GPU transfer with minimal CPU involvement

  • Parallel operations: Decode can poll while prefill transfers data

  • One-time setup: RDMA connections established once, reused for all requests