SGLang Multimodal


This document provides a comprehensive guide to multimodal inference with the SGLang backend in Dynamo. SGLang multimodal supports the EPD, E/PD, and E/P/D flows, using NIXL (RDMA) for zero-copy tensor transfer in the disaggregated modes.

Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|---|---|---|---|---|
| Image | HTTP/HTTPS URL | Yes | Yes | Vision encoder generates embeddings |
| Image | Data URL (Base64) | No | No | |
| Video | HTTP/HTTPS URL | No | No | |
| Audio | HTTP/HTTPS URL | No | No | |

Supported URL Formats

| Format | Example | Description |
|---|---|---|
| HTTP/HTTPS | `http://example.com/image.jpg` | Remote media files |

Deployment Patterns

SGLang supports EPD, E/PD, and E/P/D patterns. See Multimodal Architecture Patterns for detailed explanations.

| Pattern | Supported | Launch Script | Notes |
|---|---|---|---|
| EPD (Simple Aggregated) | Yes | `agg.sh` | Internal encoding |
| E/PD (Encode Separate) | Yes | `multimodal_epd.sh` | Vision encoder separate |
| E/P/D (Full Disaggregation) | Yes | `multimodal_disagg.sh` | KV cache via bootstrap |
| EP/D (Traditional Disaggregated) | No | N/A | Not supported |

Component Flags

| Component | Flag | Purpose |
|---|---|---|
| Processor | `--multimodal-processor` | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | Prefill + Decode with embeddings |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | Entry point for disaggregation |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | Called by Decode, bootstrap coordination |

SGLang-Specific Characteristics

  • Vision Encoder in Python: The encode worker uses SGLang's `MMEncoder` for model-agnostic vision encoding
  • Token Expansion: A single `<|image_pad|>` token is replaced with N tokens based on the embedding shape
  • NIXL Transfer: Embeddings are transferred from the Encode Worker to the PD Worker using NIXL
  • No Rust Processing: All tokenization and image handling happens in Python

Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes:

GitHub Release

You can find the latest release and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

EPD Serving (Simple Aggregated)

Components

Workflow

The `DecodeWorkerHandler` receives multimodal requests with image URLs and passes them directly to SGLang's engine. SGLang's internal `mm_data_processor` handles image fetching, loading, encoding, and token expansion.

Launch

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh --model Qwen/Qwen2.5-VL-7B-Instruct --chat-template qwen2-vl
```

Client:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```

E/PD Serving (Encode Separate)

Components

Workflow

The `MultimodalEncodeWorker` downloads and encodes the image and passes the embeddings to the `MultimodalWorker`. The work-complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The `MultimodalWorker` then prefills and decodes the prompt in the same engine, as in the LLM aggregated serving example. Only the Processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register; they are internal components and communicate via NATS.

Launch

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_epd.sh
```

Client:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```

E/P/D Serving (Full Disaggregation)

Components

Workflow

In models like Qwen2.5-VL, image embeddings are only required during the prefill stage. The Encode Worker sends the request, carrying the NIXL metadata for the embeddings, to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker reads the embeddings, processes them during prefill, and hands the KV cache back to the Decode Worker for token generation.

Launch

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/multimodal_disagg.sh
```

Client:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 50,
    "stream": false
  }' | jq
```

Bootstrap Coordination

SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

Request Flow (Important)

```
Client → Frontend → Processor → Encode Worker → Decode Worker → Prefill Worker
                                                └─ entry point for disaggregation!
```

Bootstrap Process

  1. Decode Worker receives request from Encode Worker
  2. Decode Worker calls Prefill Worker via NATS to request bootstrap info
  3. Prefill Worker generates {host, port, room} and returns immediately
  4. Both workers connect to same “room” using bootstrap coordinates
  5. SGLang internally transfers KV cache state via bootstrap connection (not NIXL)
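The steps above can be sketched in miniature. This is illustrative only — the class and method names (`PrefillWorker`, `request_bootstrap`, etc.) are hypothetical stand-ins for Dynamo's actual handlers; the real exchange happens over NATS, and the KV cache transfer itself is internal to SGLang.

```python
import random

class PrefillWorker:
    """Step 3: generate bootstrap coordinates and return immediately."""
    def __init__(self, host: str, port: int):
        self.host, self.port = host, port

    def request_bootstrap(self) -> dict:
        # A fresh room id per request lets both sides meet on the same channel.
        return {
            "bootstrap_host": self.host,
            "bootstrap_port": self.port,
            "bootstrap_room": random.getrandbits(63),
        }

class DecodeWorker:
    """Steps 1-2: the disaggregation entry point asks Prefill for coordinates."""
    def __init__(self, prefill: PrefillWorker):
        self.prefill = prefill

    def handle_request(self) -> dict:
        info = self.prefill.request_bootstrap()  # a NATS call in the real system
        # Step 4: both workers now connect to the same "room";
        # step 5 (KV cache transfer) happens inside SGLang.
        return info

info = DecodeWorker(PrefillWorker("10.0.0.2", 9001)).handle_request()
```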

Key Difference from vLLM

  • vLLM: Frontend → Prefill → Decode (Prefill is entry point)
  • SGLang: Frontend → Processor → Encode → Decode → Prefill (Decode is entry point)

Inter-Component Communication

Control Flow (NATS)

All component-to-component communication happens via NATS:

E/PD Mode (Encode Separate)

```
Processor → Encode Worker → PD Worker
   (NATS)     (NATS + NIXL embeddings)
```

E/P/D Mode (Full Disaggregation)

```
Processor → Encode Worker → Decode Worker → Prefill Worker
   (NATS)        (NATS)          (NATS)

1. Decode requests bootstrap
2. Prefill returns {host, port, room}
3. Both connect via bootstrap
4. SGLang internal KV cache transfer
```

Detailed Message Flow

Processor → Encode Worker:
- NATS `round_robin` with `SglangMultimodalRequest`
- Contains: tokenized `input_ids`, image URL, sampling params

Encode Worker → Decode/PD Worker:
- NATS `round_robin` to the "backend" component
- Contains: expanded `token_ids`, NIXL metadata, embeddings shape
- NIXL transfer: embeddings tensor

Decode Worker → Prefill Worker (disagg only):
- NATS call to the "prefill" component
- Decode requests bootstrap coordinates
- Prefill returns: `{bootstrap_host, bootstrap_port, bootstrap_room}`

Prefill ↔ Decode (via bootstrap):
- SGLang internal connection (not NATS)
- KV cache state shared via the bootstrap mechanism
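As a rough sketch of the payloads above — the field names are approximations, and the authoritative definitions live in `components/src/dynamo/sglang/protocol.py`:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SglangMultimodalRequest:
    # Processor → Encode Worker
    input_ids: list[int]            # tokenized prompt with a single image token
    image_url: str
    sampling_params: dict = field(default_factory=dict)
    # Encode Worker → Decode/PD Worker (filled in by the encoder)
    serialized_request: Optional[bytes] = None  # NIXL readable metadata
    embeddings_shape: Optional[tuple] = None

@dataclass
class BootstrapInfo:
    # Prefill → Decode (disagg only)
    bootstrap_host: str
    bootstrap_port: int
    bootstrap_room: int

# Processor-side payload; 42 is a stand-in for the model's image pad token id.
req = SglangMultimodalRequest(
    input_ids=[1, 42, 2],
    image_url="http://images.cocodataset.org/test2017/000000155781.jpg",
)
```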

Data Transfer (NIXL)

NIXL is used only for embedding transfer:

```python
# Encode Worker
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    await pd_worker_client.round_robin(request)
    await readable.wait_for_completion()

# PD Worker
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
```

Vision Encoding Details

Encode Worker Components

The encode worker uses SGLang's `MMEncoder` for model-agnostic vision encoding. `MMEncoder` handles vision model loading, image preprocessing, and feature extraction internally:

```python
from sglang.srt.disaggregation.encode_server import MMEncoder

self.encoder = MMEncoder(
    server_args=config.server_args,
    dist_init_method="tcp://127.0.0.1:0",
    rank=0,
)

# At request time:
image_grid_dim, mm_embedding = await self.encoder._encode([image_url])
```

Token Expansion Process

  1. The Processor inserts a single image token (e.g., `<|image_pad|>`)
  2. The encode worker generates embeddings of shape `(batch, num_patches, hidden_dim)`
  3. The encode worker replaces the single token with `num_patches` tokens
  4. The downstream worker receives the expanded token sequence

Example:

```python
# Before: ["Hello", "<|image_pad|>", "world"]
# After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
```
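A minimal sketch of the expansion step, assuming the image pad token id is known up front; the real logic is model-specific, as noted under Known Limitations:

```python
def expand_image_tokens(token_ids, image_pad_id, embeddings_shape):
    """Replace the single image pad token with one copy per patch."""
    _batch, num_patches, _hidden = embeddings_shape
    expanded = []
    for tok in token_ids:
        expanded.extend([image_pad_id] * num_patches if tok == image_pad_id else [tok])
    return expanded

# 101/102/9999 are stand-in token ids, not real vocabulary entries.
expanded = expand_image_tokens([101, 9999, 102], image_pad_id=9999,
                               embeddings_shape=(1, 576, 3584))
print(len(expanded))  # 578 = 1 + 576 + 1
```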

Chat Template Processing

SGLang uses its own chat template system:

```python
from sglang.srt.parser.conversation import chat_templates

conv = chat_templates["qwen2-vl"].copy()
conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
```

Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.

NIXL Usage

| Use Case | NIXL Used? | Data Transfer | Notes |
|---|---|---|---|
| EPD (Simple Aggregated) | No | N/A | All processing internal to SGLang |
| E/PD (Encode Separate) | Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E/P/D (Full Disaggregation) | Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |

Key Difference: SGLang P/D uses the bootstrap mechanism for KV cache transfer, not NIXL as vLLM does.

Environment Variables

SGLANG_ENCODER_MM_LOAD_WORKERS

Controls how many threads the encoder uses to fetch and load images concurrently. When a request contains multiple images (URLs, file paths, or base64 data), each image is loaded in a separate thread. Default is 4. Increase if image loading (network fetch or disk I/O) is the bottleneck rather than GPU compute. Has no effect if the vision encoder itself is the bottleneck, since encoding is sequential on GPU after all images are loaded.

```bash
# Example: allow up to 16 concurrent image loads per encoder
export SGLANG_ENCODER_MM_LOAD_WORKERS=16
```

This setting only applies to the dedicated encode worker (which uses SGLang's `MMEncoder` internally).
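Conceptually, the variable bounds a thread pool like the following sketch. This is illustrative only — `fetch_image` is a stand-in for SGLang's internal image loading, not its actual API:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def fetch_image(url: str) -> bytes:
    # Placeholder for an HTTP fetch, disk read, or base64 decode.
    return url.encode()

def load_images(urls):
    # Pool size comes from the environment variable, defaulting to 4.
    workers = int(os.environ.get("SGLANG_ENCODER_MM_LOAD_WORKERS", "4"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order even though loads run concurrently
        return list(pool.map(fetch_image, urls))

images = load_images(["a.jpg", "b.jpg", "c.jpg"])
```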

Profiling

Dynamo’s SGLang multimodal workers include NVTX markers for nsys profiling. They are disabled by default (zero overhead) and enabled by setting DYN_NVTX=1.

```bash
cd $DYNAMO_HOME/examples/backends/sglang
DYN_NVTX=1 nsys profile --trace=cuda,nvtx -o profile.nsys-rep \
  bash launch/multimodal_epd.sh ...
```

| Env Variable | Default | Description |
|---|---|---|
| `DYN_NVTX` | `0` | Set to `1` to enable NVTX range/mark annotations in the multimodal encode/prefill/decode worker paths for nsys profiling |

Key NVTX ranges emitted:

| Range | Worker | Description |
|---|---|---|
| `mm:enc:generate` | Encode | Full encode request lifetime |
| `mm:enc:vision_encode` | Encode | Vision encode call (`MMEncoder._encode`) |
| `mm:enc:embedding_transfer` | Encode | Embedding handoff to downstream worker |
| `mm:nixl:begin_read` | PD (agg) / Prefill | Begin NIXL read operation for embeddings |
| `mm:nixl:wait_completion` | PD (agg) / Prefill | Wait for NIXL embedding transfer completion |
| `mm:pd:generate` | Aggregated worker / Decode worker (`MultimodalWorkerHandler`) | Full worker-side request lifetime |
| `mm:pd:generate_agg` | PD (agg) | Aggregated generation path |
| `mm:pd:load_multimodal` | PD (agg) | Build multimodal items from transferred embeddings |
| `mm:pd:generate_disagg` | Decode worker (disagg entry point) | Disaggregated generation path |
| `mm:prefill:bootstrap` | Prefill (disagg) | Bootstrap coordination before returning `{bootstrap_host, bootstrap_port, bootstrap_room}` |
| `mm:prefill:load_multimodal` | Prefill (disagg) | Build multimodal items from transferred embeddings in the prefill worker |
| `mm:prefill:engine_async_generate` | Prefill (disagg) | SGLang prefill engine invocation (`engine.async_generate`) |
| `mm:pd:ttft` | Aggregated worker / Decode worker (`MultimodalWorkerHandler`) | Worker-entry TTFT: from request arrival at this worker to first output token (excludes client→frontend→worker network transit) |
| `mm:dec:first_token` | Aggregated worker / Decode worker (`MultimodalWorkerHandler`) | Decode-stage first-token range (starts when the decode stream is launched; not worker-entry TTFT) |

Known Limitations

  • No Data URL support - Only HTTP/HTTPS URLs are supported; `data:image/...` base64 URLs are not
  • No pre-computed embeddings - Cannot use `.pt`, `.pth`, or `.bin` embedding files; the vision encoder runs for every request
  • No video support - No video encoder implementation
  • No audio support - No audio encoder implementation
  • Only the Processor registers with Dynamo - Workers are internal components; the frontend routes to the Processor only
  • Disaggregated routing - The Decode Worker is the entry point (it calls Prefill); requests cannot be routed directly to Prefill workers
  • Limited model generalization - Token expansion logic is model-specific; adding new models may require implementation updates

Supported Models

SGLang multimodal only supports image-based vision-language models:

  • Qwen2-VL / Qwen2.5-VL - `Qwen/Qwen2.5-VL-7B-Instruct`
  • Qwen3-VL - `Qwen/Qwen3-VL-30B-A3B-Instruct`
  • Models supported by SGLang's `MMEncoder`

Key Files

| File | Description |
|---|---|
| `components/src/dynamo/sglang/main.py` | Component initialization; only the Processor registers |
| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang conversion |
| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
| `components/src/dynamo/sglang/register.py` | Registration logic (only called for the Processor) |