For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Kubernetes Deployment
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
    • Glossary
  • Digest
    • Dynamo Day 0 support for TokenSpeed
    • Multi-Turn Agentic Harnesses
    • Full-Stack Optimizations for Agentic Inference
    • Flash Indexer: Inter-Galactic KV Routing
  • Kubernetes Deployment
  • User Guides
    • Disaggregated Serving
    • KV Cache Aware Routing
    • KV Cache Offloading
    • Tool Calling
    • Reasoning
    • Agents
    • Multimodal
      • Embedding Cache
      • Encoder Disaggregation
      • Multimodal KV Routing
    • Diffusion
    • LoRA Adapters
    • Fastokens Tokenizer
    • Observability (Local)
    • Fault Tolerance
    • Benchmarking
    • Writing Python Workers
    • Writing Python Unified Backends
    • Writing Rust Unified Backends
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Overview
  • When to Use
  • Support Matrix
  • Deployment Patterns
  • Launching
  • vLLM
  • TRT-LLM
  • SGLang
User GuidesMultimodal

Encoder Disaggregation

Separate vision encoding into a dedicated worker for independent scaling
||View as Markdown|
Previous

Embedding Cache

Next

Multimodal KV Routing

Overview

Encoder disaggregation separates the vision encoder from the prefill/decode pipeline into its own worker. Instead of running image encoding inline, a dedicated encode worker handles media processing and transfers the resulting embeddings to downstream workers via NIXL (RDMA).

This enables:

  • Independent scaling of encode workers based on vision workload
  • Reduced GPU memory pressure on prefill/decode workers
  • Better GPU utilization by matching worker counts to actual bottlenecks

When to Use

Use encoder disaggregation when:

  • Vision encoding is a bottleneck and you need to scale encoders independently of LLM workers
  • You want to run the vision encoder on different hardware (e.g., smaller GPUs for encoding, larger GPUs for LLM inference)
  • Your deployment handles high volumes of multimodal requests and encoding throughput is limiting

For simple deployments or development/testing, the aggregated (EPD) pattern is easier to set up.

Support Matrix

BackendE/PDE/P/DNotes
vLLM✅✅Separate encode worker currently handles image_url inputs; video_url inputs stay on the prefill/PD path
TRT-LLM❌✅Supports image URLs (via MultimodalEncoder) and pre-computed embeddings (via NIXL)
SGLang✅✅NIXL for embeddings; bootstrap mechanism for P/D KV transfer

Deployment Patterns

E/PD — Separate encoder, combined prefill+decode:

Frontend → Processor → Encode Worker → PD Worker → Response
(NIXL)

The encode worker runs the vision model and transfers embeddings via NIXL to a combined prefill+decode worker.

E/P/D — All stages separate:

Frontend → Processor → Encode Worker → Prefill Worker → Decode Worker → Response
(NIXL) (KV transfer)

Full disaggregation with separate workers for each stage. The encode worker transfers embeddings to the prefill worker, which then transfers KV cache to the decode worker.

Launching

vLLM

$cd $DYNAMO_HOME/examples/backends/vllm
$
$# E/PD
$bash launch/disagg_multimodal_e_pd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
$
$# E/P/D
$bash launch/disagg_multimodal_epd.sh --model "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"

TRT-LLM

$cd $DYNAMO_HOME/examples/backends/trtllm
$
$# E/PD
$bash launch/disagg_e_pd.sh
$
$# E/P/D
$./launch/epd_multimodal_image_and_embeddings.sh

SGLang

$cd $DYNAMO_HOME/examples/backends/sglang
$
$# E/PD
$./launch/multimodal_epd.sh
$
$# E/P/D
$./launch/multimodal_disagg.sh

See the backend-specific documentation (vLLM, TRT-LLM, SGLang) for full configuration details and component flags.