For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
  • Kubernetes Deployment
    • Deployment Guide
  • User Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Dynamo Benchmarking
    • Multimodal
      • Embedding Cache
      • Encoder Disaggregation
      • Multimodal KV Routing
    • Diffusion (Preview)
    • Tool Calling
    • LoRA Adapters
    • Agents
    • Observability (Local)
    • Fault Tolerance
    • Writing Python Workers
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
    • LMCache
    • SGLang HiCache
    • FlexKV
    • KV Events for Custom Engines
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
    • Blog
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Overview
  • When to Use
  • Support Matrix
  • How It Works
  • Launching
  • vLLM
  • TRT-LLM
  • Known Limitations
User GuidesMultimodal

Multimodal KV Routing

Route multimodal requests to workers with the best KV cache overlap
||View as Markdown|
Edit this page
Previous

Encoder Disaggregation

Next

Diffusion

Overview

Multimodal KV routing extends Dynamo’s KV-aware router to account for image content when computing cache overlap scores. A dedicated MM router worker sits between the frontend and backend workers. It downloads images, computes a hash of each image (mm_hash), and includes this hash in per-block routing metadata. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks.

Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse.

Note: KV cache is separate from embedding cache (also called encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. For encoder-side reuse see Embedding Cache.

When to Use

Use multimodal KV routing when:

  • You have multiple backend workers serving multimodal requests
  • Your workload includes repeated images across requests (e.g., the same product photo, shared reference images)
  • You want to maximize KV cache hit rates for multimodal content

Without MM-aware routing, the standard router treats image token blocks as opaque and cannot match which worker has cached a particular image’s KV blocks.

Support Matrix

BackendSupportedNotes
vLLM✅*Requires vLLM with KV events extra_keys support (PR #33304)
TRT-LLM✅Requires --publish-events-and-metrics on TRT-LLM workers
SGLang❌Not supported yet

*Requires an upcoming version of vLLM that has not yet been released. Support will be available once the new vLLM release is published.

How It Works

Frontend (round-robin) → MM Router Worker → Backend Workers
│
├─ Download image
├─ Compute mm_hash
├─ Build per-block MM metadata
└─ KvRouter selects best worker
  1. The frontend routes to the MM router worker via round-robin
  2. The MM router downloads each image and computes an mm_hash
  3. Per-block routing metadata (block_mm_infos) is built, tagging blocks that contain image tokens
  4. The KV router evaluates overlap across all backend workers, accounting for image-bearing blocks
  5. The request is forwarded to the worker with the highest overlap

On repeated requests with the same image, the selected worker shows higher cached block counts, reducing prefill latency.

Launching

vLLM

$cd $DYNAMO_HOME/examples/backends/vllm/mm_router_worker
$MODEL=Qwen/Qwen3-VL-2B-Instruct ./launch.sh

TRT-LLM

$cd $DYNAMO_HOME/examples/backends/trtllm/mm_router_worker
$./launch.sh

See the vLLM MM Router README and TRT-LLM MM Router README for full setup instructions and configuration options.

Known Limitations

  • Currently supports Qwen-family multimodal processors (Qwen2-VL, Qwen2.5-VL, Qwen3-VL) for per-image visual token counting
  • Images are downloaded twice: once in the MM router (for hash computation) and once in the backend worker (for processing)