For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal uses specialized E/PD or E/P/D flows with NIXL (RDMA) for zero-copy tensor transfer.
Support Matrix
Modality
Input Format
Aggregated
Disaggregated
Notes
Image
HTTP/HTTPS URL
Yes
Yes
Vision encoder generates embeddings
Image
Data URL (Base64)
No
No
Video
HTTP/HTTPS URL
No
No
Audio
HTTP/HTTPS URL
No
No
Supported URL Formats
Format
Example
Description
HTTP/HTTPS
http://example.com/image.jpg
Remote media files
Deployment Patterns
SGLang supports E/PD and E/P/D patterns only (always has a separate encode worker). See Multimodal Architecture Patterns for detailed explanations.
Pattern
Supported
Launch Script
Notes
EPD (Simple Aggregated)
❌
N/A
Not supported
E/PD (Encode Separate)
✅
multimodal_agg.sh
Vision encoder separate
E/P/D (Full Disaggregation)
✅
multimodal_disagg.sh
KV cache via bootstrap
EP/D (Traditional Disaggregated)
❌
N/A
Not supported
Component Flags
Component
Flag
Purpose
Processor
--multimodal-processor
HTTP entry, OpenAI→SGLang conversion
Encode Worker
--multimodal-encode-worker
Vision encoder, embeddings generation
PD Worker
--multimodal-worker
Prefill + Decode with embeddings
Decode Worker
--multimodal-worker --serving-mode=decode
Entry point for disaggregation
Prefill Worker
--multimodal-worker --serving-mode=prefill
Called by Decode, bootstrap coordination
SGLang-Specific Characteristics
Vision Encoder in Python: Encode worker loads vision model (AutoModel) and image processor (AutoImageProcessor)
Token Expansion: Single <|image_pad|> token replaced with N tokens based on embedding shape
NIXL Transfer: Embeddings transferred from Encoder → PD Worker using NIXL
No Rust Processing: All tokenization and image handling happens in Python
Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
You can find the latest release and check out the corresponding branch with:
passes the text and image url to the MultimodalEncodeWorker.
Workflow
The MultimodalEncodeWorker downloads and encodes the image and passes the embeddings to the MultimodalWorker. The work complete event is sent via NATS, while the embeddings tensor is transferred via RDMA through the NIXL interface. The MultimodalWorker then prefills and decodes the prompt in the same engine, as in the LLM aggregated serving example. Only the processor is registered to the Dynamo frontend as an available endpoint. Workers do NOT register - they are internal components and communicate via NATS.
In models like Qwen2.5-VL, embeddings are only required during the prefill stage. The image embeddings are transferred via NIXL from the Encode Worker to the Decode Worker (the entry point for disaggregation), which then coordinates with the Prefill Worker. The Prefill Worker processes the embeddings and forwards the KV cache back to the Decode Worker for token generation.