Copyright © 2026, NVIDIA Corporation.


Examples


For quick start instructions, see the SGLang README. This document covers the deployment patterns for running SGLang with Dynamo: LLM serving, multimodal serving, diffusion models, and Kubernetes deployment.

Infrastructure Setup

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

$ docker compose -f deploy/docker-compose.yml up -d
  • etcd is optional, but it is the default local discovery backend. To use file-system-based discovery instead, pass --discovery-backend file.
  • NATS is only needed when using KV routing with events (--kv-events-config). Use --no-router-kv-events on the frontend for prediction-based routing without NATS.
  • On Kubernetes, neither is required when using the Dynamo operator (DYN_DISCOVERY_BACKEND=kubernetes).

Each launch script runs the frontend and worker(s) in a single terminal. For testing, you can instead run each command in its own terminal. AI agents working with Dynamo can run the launch script in the background and test the deployment with the curl commands in this document.
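
The background workflow can be sketched as follows. This is a minimal sketch, not part of Dynamo: it assumes the frontend listens on localhost:8000 (the port used by the curl example in this document) and that it answers on an OpenAI-style /v1/models endpoint. The helper name wait_for_http and the log path are hypothetical.

```shell
# Poll a URL until it answers with a successful HTTP status, or give up after
# a number of attempts (roughly one attempt per second).
# wait_for_http is a hypothetical helper, not part of Dynamo.
wait_for_http() {
    url=$1
    attempts=${2:-60}
    while [ "$attempts" -gt 0 ]; do
        # -s: silent, -f: treat HTTP errors as failures, --max-time: cap each try
        if curl -sf --max-time 2 -o /dev/null "$url"; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Usage sketch (log path and endpoint are assumptions):
#   ./launch/agg.sh > /tmp/dynamo-agg.log 2>&1 &
#   wait_for_http http://localhost:8000/v1/models 120 && echo "frontend is up"
```

Polling with a bounded number of attempts keeps a scripted test from hanging forever if a worker fails during startup.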

LLM Serving

Aggregated Serving

The simplest deployment pattern: a single worker handles both prefill and decode.

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/agg.sh
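
Once the worker is up, a text-only request is a quick way to verify the deployment. The sketch below only builds the request body; it assumes the same OpenAI-style /v1/chat/completions endpoint on localhost:8000 that the multimodal example in this document uses. The helper name chat_payload is hypothetical, and the model name is left as a placeholder because it depends on what agg.sh serves.

```shell
# Build a minimal OpenAI-style chat completion body.
# Note: the prompt is inserted verbatim, so it must not contain double quotes
# or backslashes; use a JSON tool such as jq for arbitrary text.
chat_payload() {
    model=$1
    prompt=$2
    max_tokens=${3:-50}
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d, "stream": false}\n' \
        "$model" "$prompt" "$max_tokens"
}

# Usage sketch; MODEL is a placeholder for whatever model agg.sh serves:
#   MODEL="<model served by agg.sh>"
#   chat_payload "$MODEL" "Hello" 32 |
#       curl -s http://localhost:8000/v1/chat/completions \
#            -H "Content-Type: application/json" -d @-
```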

Aggregated Serving with KV Routing

Two workers behind a KV-aware router that maximizes cache reuse:

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/agg_router.sh

This launches the frontend with --router-mode kv and two workers with ZMQ-based KV event publishing.

Disaggregated Serving

Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs.

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/disagg.sh

For details on how SGLang disaggregation works with Dynamo, including the bootstrap mechanism and RDMA transfer flow, see SGLang Disaggregation.
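
Because the disaggregated scripts need a fixed number of GPUs, a preflight check can fail early instead of partway through startup. This is a sketch with hypothetical helper names (gpu_count, have_enough_gpus); it counts the GPUs that nvidia-smi reports, which may differ from what a process sees if CUDA_VISIBLE_DEVICES is set.

```shell
# Count the GPUs reported by nvidia-smi; prints 0 when nvidia-smi is unavailable.
gpu_count() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index --format=csv,noheader 2>/dev/null | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Succeed only if at least $1 GPUs are available out of $2.
have_enough_gpus() {
    needed=$1
    available=$2
    [ "$available" -ge "$needed" ]
}

# Usage sketch:
#   have_enough_gpus 2 "$(gpu_count)" && ./launch/disagg.sh
```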

Disaggregated Serving with KV-Aware Prefill Routing

Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs.

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/disagg_router.sh

The frontend uses --router-mode kv and automatically detects prefill workers to activate an internal prefill router. Each worker publishes KV events over ZMQ on unique ports.
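
To illustrate the unique-port requirement, the sketch below derives one ZMQ endpoint per worker from a base port. The helper name kv_events_config and the base port 5557 are hypothetical. The JSON fields (publisher, topic, endpoint) are an assumption about the shape of the value passed to --kv-events-config, so check the launch script for the exact form it uses.

```shell
# Print a KV-events config for worker number $1 (0-based), giving each worker
# its own ZMQ port. The base port and the JSON field values are assumptions.
kv_events_config() {
    index=$1
    base_port=${2:-5557}
    port=$((base_port + index))
    printf '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:%d"}\n' "$port"
}

# Usage sketch: one distinct config per worker on the same host.
#   for i in 0 1; do
#       echo "worker $i: --kv-events-config '$(kv_events_config "$i")'"
#   done
```

Deriving the port from the worker index avoids two workers on one host binding the same endpoint.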

Multimodal Serving

Aggregated Multimodal

Serve multimodal models using SGLang’s built-in multimodal support:

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/agg_vision.sh

Verify the deployment:

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-VL-8B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"},
            {"type": "image_url", "image_url": {"url": "https://media.newyorker.com/photos/63249cff39ac97c4c23ff5d0/master/w_2560%2Cc_limit/Marzorati%2520-%2520Federer%2520Retirement%25202.jpg"}}
          ]
        }
      ],
      "max_tokens": 50,
      "stream": false
    }' | jq

Multimodal with Disaggregated Components

For advanced multimodal deployments with separate encoder, prefill, and decode workers (E/PD and E/P/D patterns), see the dedicated SGLang Multimodal documentation.

| Pattern | Script | Description |
|---------|--------|-------------|
| E/PD | ./launch/multimodal_epd.sh | Separate vision encoder + combined PD worker |
| E/P/D | ./launch/multimodal_disagg.sh | Separate encoder, prefill, and decode workers |

Diffusion Models

Diffusion LM

Run diffusion language models like LLaDA2.0:

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/diffusion_llada.sh

Image Diffusion

Generate images from text prompts using FLUX or other diffusion models:

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/image_diffusion.sh

Options: --model-path, --fs-url (local or S3), --http-url.

Video Generation

Generate videos from text prompts using Wan2.1 models:

$ cd $DYNAMO_HOME/examples/backends/sglang
$ ./launch/text-to-video-diffusion.sh

Options: --wan-size 1b|14b, --num-frames, --height, --width, --num-inference-steps.
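
The options above can be combined on one command line. The sketch below prints such a command without running it. The flag names come from the options listed above, while the helper name video_launch_cmd and every numeric value (frames, resolution, steps) are illustrative assumptions, not recommended settings.

```shell
# Print the launch command for a text-to-video run without executing it.
# Flag names come from the options listed above; the values are illustrative.
video_launch_cmd() {
    wan_size=${1:-1b}
    num_frames=${2:-17}
    # Only the two sizes named in the options are accepted.
    case "$wan_size" in
        1b|14b) ;;
        *) echo "unsupported --wan-size: $wan_size" >&2; return 1 ;;
    esac
    printf './launch/text-to-video-diffusion.sh --wan-size %s --num-frames %s --height 480 --width 832 --num-inference-steps 20\n' \
        "$wan_size" "$num_frames"
}

# Usage sketch: inspect the command first, then run it from the sglang
# examples directory.
#   video_launch_cmd 14b 33
```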

For full details on all diffusion worker types (LLM, image, video), see Diffusion.

Kubernetes Deployment

For complete K8s deployment examples, see:

  • SGLang K8s deployment guide
  • SGLang aggregated router K8s example
  • Kubernetes Deployment Guide

Troubleshooting

cuDNN Version Check Fails

RuntimeError: cuDNN frontend 1.8.1 requires cuDNN lib >= 9.5.0

Set SGLANG_DISABLE_CUDNN_CHECK=1 before launching. This error is common when PyTorch ships a cuDNN version older than the one SGLang’s Conv3d models require, and it affects vision and diffusion models.
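
One way to apply the variable is to scope it to a single launch rather than exporting it for the whole shell session. This is a sketch, and the wrapper name without_cudnn_check is hypothetical.

```shell
# Run any command with the cuDNN version check disabled for that command only.
# The variable is not left behind in the calling shell.
without_cudnn_check() {
    env SGLANG_DISABLE_CUDNN_CHECK=1 "$@"
}

# Usage sketch:
#   without_cudnn_check ./launch/agg_vision.sh
# Equivalent for the whole session:
#   export SGLANG_DISABLE_CUDNN_CHECK=1
```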

Model Registration Fails with config.json Error

unable to extract config.json from directory ...

This happens with diffusers models (FLUX.1-dev, Wan2.1, etc.) that use model_index.json instead of config.json. Ensure you are using the correct worker flag (--image-diffusion-worker or --video-generation-worker) rather than the standard LLM worker mode. These flags use a registration path that does not require config.json.
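
For a model stored in a local directory, the layout can be checked before launching. This is a sketch with a hypothetical helper name (model_layout). It only inspects local files and does not apply to models referenced by a remote repository ID.

```shell
# Inspect a local model directory and print which worker mode its layout suggests.
model_layout() {
    dir=$1
    if [ -f "$dir/config.json" ]; then
        echo "llm"          # the standard LLM worker mode can read config.json
    elif [ -f "$dir/model_index.json" ]; then
        echo "diffusers"    # use --image-diffusion-worker or --video-generation-worker
    else
        echo "unknown"
        return 1
    fi
}

# Usage sketch:
#   model_layout /path/to/local/model
```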

GPU OOM on Startup

If a previous run left orphaned GPU processes, the next launch may run out of GPU memory. Check for leftover processes and terminate them:

$ nvidia-smi   # look for lingering sgl_diffusion::scheduler or python processes
$ kill -9 <PID>
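
The same check can be scripted. The sketch below filters the CSV output of nvidia-smi's compute-process query down to candidate PIDs. The helper name leftover_gpu_pids is hypothetical, and the pattern matches every python process on the GPU, so review the list before killing anything.

```shell
# Filter "pid, process_name, used_memory" CSV rows, as printed by
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
# down to the PIDs of likely leftovers.
leftover_gpu_pids() {
    awk -F', *' '$2 ~ /sgl_diffusion|python/ { print $1 }'
}

# Usage sketch: try a graceful SIGTERM first and keep kill -9 as a last resort.
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader |
#       leftover_gpu_pids | while read -r pid; do kill "$pid"; done
```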

Disaggregated Workers Cannot Connect

Ensure both prefill and decode workers can reach each other over TCP. The bootstrap mechanism uses --disaggregation-bootstrap-port (default: 12345). For multi-node setups, ensure the port is reachable across hosts and set --host 0.0.0.0.
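
A quick reachability test from one host to another can rule out firewall or binding problems. This sketch uses bash's /dev/tcp pseudo-device, so it needs bash and the coreutils timeout command. The helper name port_open is hypothetical, and the default of 12345 mirrors the bootstrap port default stated above.

```shell
# Return success if a TCP connection to host:port can be opened within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
port_open() {
    host=$1
    port=${2:-12345}
    # Inside bash -c, $0 is the host and $1 is the port.
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage sketch, run from the other worker's host:
#   port_open <worker-host> 12345 && echo "bootstrap port reachable"
```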

See Also

  • SGLang README: Quick start and feature overview
  • Reference Guide: Architecture, configuration, and operational details
  • SGLang Multimodal: Vision model deployment patterns
  • SGLang HiCache: Hierarchical cache integration
  • Benchmarking: Performance benchmarking tools
  • Tuning Disaggregated Performance: P/D tuning guide