For quick start instructions, see the SGLang README. This document provides all deployment patterns for running SGLang with Dynamo, including LLMs, multimodal, and diffusion models, and Kubernetes deployment.
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
--discovery-backend file to use file system based discovery.--kv-events-config). Use --no-router-kv-events on the frontend for prediction-based routing without NATS.DYN_DISCOVERY_BACKEND=kubernetes).Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for testing. For AI agents working with Dynamo, you can run the launch script in the background and use the curl commands to test the deployment.
The simplest deployment pattern: a single worker handles both prefill and decode.
Two workers behind a KV-aware router that maximizes cache reuse:
This launches the frontend with --router-mode kv and two workers with ZMQ-based KV event publishing.
Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs.
For details on how SGLang disaggregation works with Dynamo, including the bootstrap mechanism and RDMA transfer flow, see SGLang Disaggregation.
Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs.
The frontend uses --router-mode kv and automatically detects prefill workers to activate an internal prefill router. Each worker publishes KV events over ZMQ on unique ports.
Serve multimodal models using SGLang’s built-in multimodal support:
For advanced multimodal deployments with separate encoder, prefill, and decode workers (E/PD and E/P/D patterns), see the dedicated SGLang Multimodal documentation.
Run diffusion language models like LLaDA2.0:
Generate images from text prompts using FLUX or other diffusion models:
Options: --model-path, --fs-url (local or S3), --http-url.
Generate videos from text prompts using Wan2.1 models:
Options: --wan-size 1b|14b, --num-frames, --height, --width, --num-inference-steps.
For full details on all diffusion worker types (LLM, image, video), see Diffusion.
For complete K8s deployment examples, see:
Set SGLANG_DISABLE_CUDNN_CHECK=1 before launching. This is common when PyTorch ships a CuDNN version older than what SGLang’s Conv3d models require. Affects vision and diffusion models.
config.json ErrorThis happens with diffusers models (FLUX.1-dev, Wan2.1, etc.) that use model_index.json instead of config.json. Ensure you are using the correct worker flag (--image-diffusion-worker or --video-generation-worker) rather than the standard LLM worker mode. These flags use a registration path that does not require config.json.
If a previous run left orphaned GPU processes, the next launch may OOM. Check for zombie processes:
Ensure both prefill and decode workers can reach each other over TCP. The bootstrap mechanism uses --disaggregation-bootstrap-port (default: 12345). For multi-node setups, ensure the port is reachable across hosts and set --host 0.0.0.0.