
Copyright © 2026, NVIDIA Corporation.


Disaggregation


This document explains how SGLang’s disaggregated prefill-decode architecture works, both standalone and within Dynamo.

Overview

Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:

  • Independent scaling of prefill and decode resources
  • Better resource utilization, since prefill is compute-bound while decode is memory-bandwidth-bound
  • Efficient KV cache transfer between workers using RDMA

How Dynamo Integrates with SGLang Disaggregation

SGLang’s standalone approach:

  1. The load balancer receives a request from the client
  2. A random (prefill, decode) pair is selected from the pool of available workers
  3. The request is sent to both the prefill and decode workers via asyncio tasks
  4. Internally, disaggregation proceeds from prefill → decode: the prefill worker hands its KV cache off to the decode worker
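
The selection and fan-out in steps 2 and 3 can be sketched roughly as follows. This is an illustrative stand-in, not SGLang's load balancer code: the worker names, `send_to_worker`, `dispatch`, and `run_dispatch` are all hypothetical, and the network call is replaced by a no-op.

```python
import asyncio
import random

# Hypothetical worker pools; a real deployment would hold URLs or clients.
PREFILL_WORKERS = ["prefill-0", "prefill-1"]
DECODE_WORKERS = ["decode-0", "decode-1", "decode-2"]


async def send_to_worker(worker: str, request: dict) -> str:
    """Stand-in for a network call to a worker."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{worker} handled {request['id']}"


async def dispatch(request: dict, rng: random.Random) -> list[str]:
    # Step 2: pick a random (prefill, decode) pair from the pools.
    prefill = rng.choice(PREFILL_WORKERS)
    decode = rng.choice(DECODE_WORKERS)
    # Step 3: send the same request to both workers concurrently.
    # asyncio.gather wraps each coroutine in a task and keeps result order.
    return await asyncio.gather(
        send_to_worker(prefill, request),
        send_to_worker(decode, request),
    )


def run_dispatch(request_id: str, seed: int = 0) -> list[str]:
    """Synchronous entry point for trying the sketch out."""
    return asyncio.run(dispatch({"id": request_id}, random.Random(seed)))
```

Calling `run_dispatch("req-1")` returns two replies, the prefill worker's first and the decode worker's second.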

Dynamo’s approach:

Because Dynamo has its own discovery mechanism, it does not need a load balancer. Instead, Dynamo:

  1. Routes the request to a decode worker first
  2. Chooses a prefill worker via round-robin or KV-aware selection
  3. Sends the request to both workers
  4. SGLang’s bootstrap server (part of the tokenizer_manager) is used in conjunction with NIXL/Mooncake to handle the KV transfer
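
A minimal sketch of the round-robin variant of step 2, assuming the list of prefill workers has already been obtained from discovery. `RoundRobinSelector` and `plan_request` are hypothetical names, not part of Dynamo's API, and KV-aware selection is not modeled here.

```python
import itertools
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class RoundRobinSelector:
    """Hypothetical helper that cycles through discovered prefill workers."""
    workers: list[str]
    _cycle: Iterator[str] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        if not self.workers:
            raise ValueError("at least one prefill worker is required")
        self._cycle = itertools.cycle(self.workers)

    def next_worker(self) -> str:
        return next(self._cycle)


def plan_request(decode_worker: str, selector: RoundRobinSelector) -> dict:
    # Step 1 has already routed the request to a decode worker.
    # Step 2: choose a prefill worker (round-robin in this sketch).
    prefill_worker = selector.next_worker()
    # Step 3: the request is destined for both workers.
    return {"decode": decode_worker, "prefill": prefill_worker}
```

With two prefill workers, successive calls to `plan_request` alternate between them regardless of which decode worker received the request.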

Disaggregation Flow

The complete request flow for disaggregated serving is shown in a diagram (not reproduced here) and summarized in the steps below.

Key Steps Explained

Setup Phase (One-Time)

  • Decode workers register their RDMA connection information with prefill workers
  • This includes base GPU memory pointers for direct memory access
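
To illustrate why registering base pointers up front is useful, here is a small sketch of a registry a prefill worker might keep. Every name in it (`RdmaInfo`, `PrefillRegistry`, `endpoint`, `kv_base_ptrs`, `buffer_index`) is an assumption made for illustration; the actual connection records exchanged by SGLang and NIXL/Mooncake are not described in this document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RdmaInfo:
    """Hypothetical connection record; field names are illustrative."""
    endpoint: str                   # address the prefill worker connects to
    kv_base_ptrs: tuple[int, ...]   # base GPU memory pointers of KV buffers


@dataclass
class PrefillRegistry:
    """Hypothetical per-prefill-worker table of registered decode peers."""
    peers: dict[str, RdmaInfo] = field(default_factory=dict)

    def register(self, decode_id: str, info: RdmaInfo) -> bool:
        """Store info on first registration; return False if already known."""
        if decode_id in self.peers:
            return False
        self.peers[decode_id] = info
        return True

    def remote_address(self, decode_id: str, buffer_index: int, offset: int) -> int:
        # With the base pointer known from setup, a per-request transfer
        # only needs an offset to compute the remote write address.
        return self.peers[decode_id].kv_base_ptrs[buffer_index] + offset
```

Registration happens once per decode worker; later requests only look up the stored record.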

Per-Request Flow

  1. Request initiation: The client sends a request to the decode worker
  2. Bootstrap room allocation: The decode worker forwards the request to a prefill worker and receives a bootstrap_room ID for coordination
  3. Memory allocation: The decode worker allocates GPU memory pages for the incoming KV cache
  4. Prefill execution: The prefill worker processes the prompt and generates the KV cache
  5. KV transfer: The prefill worker uses RDMA to write the KV cache directly into the decode worker’s GPU memory, while the decode worker polls for completion
  6. Cleanup: The prefill worker deallocates transfer metadata after confirming completion
  7. Decode phase: The decode worker generates tokens using the transferred KV cache
  8. Streaming: Tokens are streamed back to the client as they’re generated
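
The eight steps above can be traced with a small in-process simulation. This is a sketch under heavy simplifications: a `bytearray` stands in for GPU memory, a slice assignment stands in for the RDMA write, the prompt bytes stand in for the KV cache, and the transfer is synchronous, so polling succeeds immediately. The class and method names are hypothetical; only `bootstrap_room` comes from the text.

```python
import itertools
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class DecodeWorker:
    buffers: dict[int, bytearray] = field(default_factory=dict)  # keyed by bootstrap_room
    completed: set[int] = field(default_factory=set)

    def allocate(self, room: int, size: int) -> bytearray:
        # Step 3: reserve memory for the incoming KV cache.
        self.buffers[room] = bytearray(size)
        return self.buffers[room]

    def poll(self, room: int) -> bool:
        # Step 5 (decode side): has the transfer for this room completed?
        return room in self.completed

    def generate(self, room: int, max_tokens: int) -> Iterator[str]:
        # Steps 7-8: produce tokens one at a time so they can be streamed.
        if not self.poll(room):
            raise RuntimeError("KV cache has not arrived yet")
        for i in range(max_tokens):
            yield f"token-{i}"


@dataclass
class PrefillWorker:
    pending: dict[int, str] = field(default_factory=dict)  # transfer metadata
    _room_ids: Iterator[int] = field(default_factory=lambda: itertools.count(1))

    def open_room(self, prompt: str) -> int:
        # Step 2: hand back a bootstrap_room ID for coordination.
        room = next(self._room_ids)
        self.pending[room] = prompt
        return room

    def prefill_and_transfer(self, room: int, target: bytearray, decode: DecodeWorker) -> None:
        # Step 4: "compute" the KV cache (here, just the prompt bytes).
        kv = self.pending[room].encode()[: len(target)]
        # Step 5: write directly into the decode worker's buffer.
        target[: len(kv)] = kv
        decode.completed.add(room)
        # Step 6: drop transfer metadata once completion is confirmed.
        del self.pending[room]


def serve(prompt: str, prefill: PrefillWorker, decode: DecodeWorker,
          max_tokens: int = 3) -> Iterator[str]:
    room = prefill.open_room(prompt)                         # steps 1-2
    buf = decode.allocate(room, size=len(prompt.encode()))   # step 3
    prefill.prefill_and_transfer(room, buf, decode)          # steps 4-6
    yield from decode.generate(room, max_tokens)             # steps 7-8
```

Iterating over `serve("hello", PrefillWorker(), DecodeWorker())` yields tokens one by one, mirroring how a client would receive a stream.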

Performance Characteristics

  • RDMA transfer: Zero-copy GPU-to-GPU transfer with minimal CPU involvement
  • Parallel operations: Decode can poll while prefill transfers data
  • One-time setup: RDMA connections established once, reused for all requests