Running SGLang with Dynamo#

Use the Latest Release#

We recommend using the latest stable release of Dynamo to avoid breaking changes.

You can find the latest release here and check out the corresponding tag with:

git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Table of Contents#

Feature Support Matrix
Dynamo SGLang Integration
Installation
Quick Start
Single Node Examples
Multi-Node and Advanced Examples
Deploy on SLURM or Kubernetes

Feature Support Matrix#

Core Dynamo Features#

| Feature | SGLang | Notes |
| --- | --- | --- |
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | WIP PR |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Multimodal Support | ✅ | |
| KVBM | ❌ | Planned |

Dynamo SGLang Integration#

Dynamo SGLang integrates SGLang engines into Dynamo’s distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang’s engine arguments.

Argument Handling#

Dynamo SGLang uses SGLang’s native argument parser, so most SGLang engine arguments work identically. You can pass any SGLang argument (like --model-path, --tp, --trust-remote-code) directly to dynamo.sglang.
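
As a rough illustration of this pass-through, the sketch below assembles a worker command line from the SGLang flags named above. The `python3 -m dynamo.sglang` invocation form and the model name (borrowed from the test request later on this page) are assumptions, and the script only prints the command rather than running it.

```shell
# Sketch: SGLang engine flags are handed to the Dynamo worker unchanged.
# Assumption: the worker is started as a Python module (python3 -m dynamo.sglang).
MODEL="Qwen/Qwen3-0.6B"   # illustrative model name

# --model-path, --tp and --trust-remote-code are plain SGLang arguments.
WORKER_CMD="python3 -m dynamo.sglang --model-path $MODEL --tp 1 --trust-remote-code"

# Print the command for review; run it yourself once it looks right.
echo "$WORKER_CMD"
```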

Dynamo-Specific Arguments#

| Argument | Description | Default | SGLang Equivalent |
| --- | --- | --- | --- |
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See Request Migration Architecture. | `0` (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang’s tokenizer instead of Dynamo’s | `False` | N/A |
| `--custom-jinja-template` | Use a custom chat template for the model (takes precedence over the default chat template in the model repo) | `None` | `--chat-template` |
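
The endpoint format and the migration limit can be combined on one command line. The sketch below is only illustrative: the `dyn_endpoint` helper, the namespace/component/endpoint values, the limit of 3, and the `python3 -m dynamo.sglang` invocation form are all assumptions rather than values taken from this page.

```shell
# Hypothetical helper: build an endpoint string in the
# dyn://namespace.component.endpoint format described in the table.
dyn_endpoint() {
  printf 'dyn://%s.%s.%s\n' "$1" "$2" "$3"
}

# Illustrative values only.
ENDPOINT="$(dyn_endpoint dynamo backend generate)"

# Let each request migrate up to 3 times (the default, 0, disables migration).
WORKER_CMD="python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --endpoint $ENDPOINT --migration-limit 3"

# Print the command for review instead of running it.
echo "$WORKER_CMD"
```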

Tokenizer Behavior#

Default (--use-sglang-tokenizer not set): Dynamo’s frontend handles tokenization and detokenization and passes input_ids to SGLang.
With --use-sglang-tokenizer: SGLang handles tokenization and detokenization, and Dynamo passes raw prompts.

Note

When using --use-sglang-tokenizer, only v1/chat/completions is available through Dynamo’s frontend.

Request Cancellation#

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
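
One way to observe this is to start a streaming request and let the client give up early. The sketch below assumes a frontend on `localhost:8000` serving `Qwen/Qwen3-0.6B` (as in the test request on this page); the `stream_then_disconnect` name, the prompt, and the timeout are hypothetical. curl's `--max-time` closes the connection after the given number of seconds, which stands in for a user disconnecting.

```shell
# Hypothetical helper: open a streaming chat completion, then drop the
# connection after N seconds to mimic a client disconnect.
# Usage: stream_then_disconnect SECONDS [BASE_URL]
stream_then_disconnect() {
  seconds="${1:-2}"
  base_url="${2:-http://localhost:8000}"   # assumed frontend address
  # -N disables output buffering so streamed chunks appear as they arrive.
  curl -sN --max-time "$seconds" "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-0.6B",
      "messages": [{"role": "user", "content": "Write a long story."}],
      "stream": true,
      "max_tokens": 2000
    }'
}

# Example: stream_then_disconnect 2
```

When the time limit cuts the stream short, curl exits with status 28, so a non-zero exit here is expected.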

Cancellation Support Matrix#

| | Prefill | Decode |
| --- | --- | --- |
| Aggregated | ✅ | ✅ |
| Disaggregated | ⚠️ | ✅ |

Warning

⚠️ The SGLang backend does not currently support cancellation during the remote prefill phase in disaggregated mode.

For more details, see the Request Cancellation Architecture documentation.

Installation#

Install latest release#

We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install uv with curl -LsSf https://astral.sh/uv/install.sh | sh

# create a virtual env
uv venv --python 3.12 --seed
# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"

Install editable version for development#

This requires Rust to be installed. We also recommend a working installation of the CUDA toolkit, because SGLang requires nvcc to be available.

# create a virtual env
uv venv --python 3.12 --seed
# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME
# install dynamo along with its supported sglang version
# (add --prerelease=allow if you need flashinfer rc versions)
uv pip install -e .
# install an sglang version >= 0.5.3.post2 (pinned here to 0.5.3.post2)
uv pip install "sglang[all]==0.5.3.post2"

Using docker containers#

We are in the process of shipping pre-built Docker containers that include DeepEP, DeepGEMM, and NVSHMEM to support WideEP and prefill/decode (P/D) disaggregation. For now, you can build the container from source with the following command.

cd $DYNAMO_HOME
./container/build.sh \
  --framework SGLANG \
  --tag dynamo-sglang:latest

Then run it with:

docker run \
    --gpus all \
    -it \
    --rm \
    --network host \
    --shm-size=10G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --ulimit nofile=65536:65536 \
    --cap-add CAP_SYS_PTRACE \
    --ipc host \
    dynamo-sglang:latest

Quick Start#

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

Start Infrastructure Services (Local Development Only)#

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

docker compose -f deploy/docker-compose.yml up -d

Note

etcd is optional but is the default local discovery backend. To use file-system-based discovery instead, pass --kv_store file.
NATS is optional. It is only needed for KV routing with events (the default). Pass the --no-kv-events flag to disable events and use prediction-based routing instead.
On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD).
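
Before launching workers, it can help to confirm that the local services are reachable. This is a minimal sketch, assuming the compose file exposes etcd's client port on 2379 and NATS's HTTP monitoring port on 8222 (common defaults, not confirmed by this page); `check_infra` is a hypothetical helper name.

```shell
# Hypothetical helper: report whether etcd and NATS answer their health routes.
# Usage: check_infra [ETCD_URL] [NATS_MONITOR_URL]
check_infra() {
  etcd_url="${1:-http://localhost:2379}"   # assumed etcd client address
  nats_url="${2:-http://localhost:8222}"   # assumed NATS monitoring address
  status=0
  # etcd serves /health on its client port.
  if ! curl -sf --max-time 2 "$etcd_url/health" >/dev/null 2>&1; then
    echo "etcd not reachable at $etcd_url" >&2
    status=1
  fi
  # NATS serves /healthz on its monitoring port when monitoring is enabled.
  if ! curl -sf --max-time 2 "$nats_url/healthz" >/dev/null 2>&1; then
    echo "NATS not reachable at $nats_url" >&2
    status=1
  fi
  return "$status"
}

# Example: check_infra && echo "infrastructure is up"
```

Because both services are optional, a failure here only matters for the backend you actually rely on.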

Tip

Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can also take each command and run it in its own terminal.

Because Dynamo uses SGLang’s argument parser, you can also pass any argument that SGLang supports to the worker.

Aggregated Serving#

cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh

Aggregated Serving with KV Routing#

cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_router.sh

Aggregated Serving for Embedding Models#

Here’s an example that uses the Qwen/Qwen3-Embedding-4B model.

cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_embed.sh

Send the following request to verify your deployment:

curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "Hello, world!"
  }'

Disaggregated Serving#

See SGLang Disaggregation to learn more about how SGLang and Dynamo handle disaggregated serving.

cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg.sh

Disaggregated Serving with KV-Aware Prefill Routing#

cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_router.sh

Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention#

You can use this configuration to test disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

# note this will require 4 GPUs
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_dp_attn.sh

Testing the Deployment#

Send a test request to verify your deployment:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": true,
    "max_tokens": 30
  }'
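
The engine may need some time to load the model before the request above succeeds. A minimal sketch of a readiness check follows, assuming the frontend listens on `localhost:8000` and answers the OpenAI-style `/v1/models` route (an assumption, not stated on this page); `wait_for_frontend` is a hypothetical helper name.

```shell
# Hypothetical helper: poll the frontend until it answers or attempts run out.
# Usage: wait_for_frontend [BASE_URL] [ATTEMPTS] [DELAY_SECONDS]
wait_for_frontend() {
  base_url="${1:-http://localhost:8000}"   # assumed frontend address
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each probe.
    if curl -sf --max-time 2 "$base_url/v1/models" >/dev/null 2>&1; then
      echo "frontend ready after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "frontend not ready after $attempts attempt(s)" >&2
  return 1
}

# Example: wait_for_frontend && echo "safe to send the test request"
```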

Deployment#

We currently provide deployment examples for Kubernetes and SLURM.

Kubernetes#

Deploying Dynamo with SGLang on Kubernetes

SLURM#

Deploying Dynamo with SGLang on SLURM