Multimodal KV routing extends Dynamo’s KV-aware router to account for image content when computing cache overlap scores. An image hash (mm_hash) is computed for each image in a request and included in per-block routing metadata. Where the hash is computed depends on the deployment: in the Rust frontend by default for vLLM backends, in vLLM’s own processor when the chat-processor variant is enabled, or in a dedicated MM router worker for TRT-LLM backends. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks.
Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse.
Note: KV cache is separate from embedding cache (also called encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. For encoder-side reuse see Embedding Cache.
Use multimodal KV routing when:
Without MM-aware routing, the standard router treats image token blocks as opaque and cannot match which worker has cached a particular image’s KV blocks.
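To make the difference concrete, here is a minimal sketch of overlap scoring with and without image hashes in the block keys. It is not Dynamo’s implementation: block_keys, overlap, pick_worker, the block size, the placeholder token id and the hash strings are all hypothetical, and blake2b stands in for whatever block hash the router really uses.

```python
from hashlib import blake2b

BLOCK_SIZE = 4   # tokens per KV block (tiny, for illustration only)
IMG = 9999       # hypothetical image placeholder token id


def block_keys(token_ids, block_mm_hashes):
    """Chained per-block keys: each key covers the previous key, the block's
    tokens, and the image hash attached to that block (if any)."""
    keys, prev = [], b""
    for i in range(len(token_ids) // BLOCK_SIZE):
        chunk = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        h = blake2b(prev, digest_size=8)
        h.update(repr(chunk).encode())
        h.update(block_mm_hashes.get(i, "").encode())
        prev = h.digest()
        keys.append(prev.hex())
    return keys


def overlap(request_keys, cached_keys):
    """Number of leading request blocks already present in a worker's cache."""
    n = 0
    for key in request_keys:
        if key not in cached_keys:
            break
        n += 1
    return n


def pick_worker(request_keys, workers):
    """Select the worker with the highest prefix overlap."""
    return max(workers, key=lambda name: overlap(request_keys, workers[name]))


# Same prompt layout, different images: block 1 holds the image tokens.
tokens = [1, 2, 3, 4, IMG, IMG, IMG, IMG, 5, 6, 7, 8]
cat_keys = block_keys(tokens, {1: "cat-hash"})
dog_keys = block_keys(tokens, {1: "dog-hash"})
workers = {"worker-a": set(cat_keys), "worker-b": set(dog_keys)}

# Without image hashes both requests produce identical keys, so the router
# cannot tell which worker holds which image.
opaque_cat = block_keys(tokens, {})
opaque_dog = block_keys(tokens, {})
```

With image hashes in the keys, a repeated cat request overlaps fully with worker-a and only on the leading text block with worker-b. With opaque placeholders the two requests are indistinguishable.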
The Rust frontend’s MM-aware routing path supports whichever VLM families the
lightseek llm-multimodal crate registers; see
ImageProcessorRegistry::with_defaults()
for the up-to-date list. A model that the crate doesn’t recognize falls back to
text-prefix-only KV routing: the request still completes, but gets no
prefix-cache benefit across images.
The Python chat-processor variant doesn’t share this constraint — it delegates to vLLM’s own multimodal processor and works with any VLM vLLM supports.
mm_hash per image: xxh3_64 of the decoded bytes for data: URIs (and for http(s):// when media_decoder is enabled on the model), otherwise xxh3_64 of the full URL string. Two callers will share an mm_hash only when they send byte-identical URLs.
ModelProcessorSpec (one spec per supported VLM family: Qwen3-VL, Qwen2.5-VL, Qwen2-VL, LLaVA-NeXT, LLaVA-1.5, Phi-3-vision, Llama-4, Kimi-K2.5). Each spec reads the appropriate config.json field for its model family (image_token_id, image_token_index, or media_placeholder_token_id) and falls back to probing the tokenizer’s vocab when only the placeholder string is registered. Models the registry doesn’t recognise fall back to text-prefix-only routing.
Note: Qwen3.5 / Qwen3.6 image token expansion is not yet supported in the Rust frontend for MM routing, so KV routing will only consider the text inputs + unexpanded image token placeholders. Support will come in a follow-up release.
(W, H) is read from a 64KB Range-bounded header fetch (or from in-memory bytes for data: URIs); the lightseek llm-multimodal crate computes the per-image expanded token count.
routing_token_ids (a router-only view); the worker still sees one placeholder per image in token_ids.
block_mm_infos) is built from the expanded view; the KV router evaluates overlap across workers including image-bearing blocks.
mm_hash (16-hex-char prefix, padded) via extra_args["mm_hashes"]; the backend handler injects them as vLLM’s multi_modal_uuids, so vLLM’s own KV-cache key matches the hash the router used.
Use this variant (--dyn-chat-processor=vllm) when you want the frontend to run vLLM’s HF image processor in-process and ship pre-processed mm_kwargs to the selected worker via shared memory or NIXL RDMA, so the backend skips the HF processor entirely. See the Transfer Mode Details section below for the DYNAMO_MM_TRANSFER flags.
For TRT-LLM, a dedicated MM Router Worker sits between the frontend and backend workers. See the TRT-LLM MM Router README for setup instructions.
The Rust frontend uses the lightseek llm-multimodal crate for per-image
token counting and placeholder expansion. llm-multimodal provides a
pure-Rust calculate_num_tokens(W, H, PreProcessorConfig) per VLM family (Qwen2/2.5/3-VL, LLaVA, Pixtral, …),
golden-tested against transformers, so the router can match vLLM’s
expanded image-token count without invoking the HF image processor. The
frontend then forwards each mm_hash to the worker as multi_modal_uuids
so that vLLM’s KV events publish the same key the router computes.
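The expansion step can be sketched roughly as follows, assuming the per-image token counts have already been computed. This is an illustration in Python, not the frontend’s Rust code: expand_placeholders and blocks_touched are hypothetical helpers, the token ids and hash string are made up, and the returned dictionary only approximates the role of block_mm_infos, whose real layout is not described here.

```python
def expand_placeholders(token_ids, image_token_id, tokens_per_image):
    """Build a router-only view in which each image placeholder is repeated
    once per expanded image token. The input list is left untouched."""
    if token_ids.count(image_token_id) != len(tokens_per_image):
        raise ValueError("one token count is required per image placeholder")
    routing, spans = [], []
    counts = iter(tokens_per_image)
    for tok in token_ids:
        if tok == image_token_id:
            n = next(counts)
            spans.append((len(routing), len(routing) + n))  # [start, end)
            routing.extend([image_token_id] * n)
        else:
            routing.append(tok)
    return routing, spans


def blocks_touched(spans, mm_hashes, block_size):
    """Map block index -> hashes of the images whose tokens fall in that block."""
    info = {}
    for (start, end), mm_hash in zip(spans, mm_hashes):
        for block in range(start // block_size, (end - 1) // block_size + 1):
            info.setdefault(block, []).append(mm_hash)
    return info


token_ids = [11, 12, 900, 13]            # what the worker sees: one placeholder
routing_token_ids, spans = expand_placeholders(token_ids, 900, [6])
block_info = blocks_touched(spans, ["a1"], block_size=4)
```

Here the single placeholder becomes six routing tokens spanning positions 2 to 7, so with a block size of 4 the image touches blocks 0 and 1, while token_ids still contains exactly one placeholder.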
Key environment variables:
To opt into frontend image decoding (so the frontend downloads + decodes once and mm_hash becomes content-addressed instead of URL-addressed):
The worker then registers a media_decoder on its model card; the frontend’s MediaLoader runs in-process and hashes decoded RGB bytes via xxh3. Two distinct (signed) URLs of the same image bytes collide on the same routing key.
This variant uses --dyn-chat-processor=vllm so that the frontend runs vLLM’s
HF processor in-process. It adds the DYNAMO_MM_TRANSFER channel (shm or NIXL),
which delivers pre-rendered mm_kwargs from the frontend to the worker.
See the TRT-LLM MM Router README for full setup instructions and configuration options.
Applies to the --dyn-chat-processor=vllm launch (agg_multimodal_router_chat_processor.sh), not the default Rust frontend path. In the chat-processor variant the frontend runs the HF image processor in-process and ships the pre-processed mm_kwargs to the selected backend worker so the backend can skip re-processing; the DYNAMO_MM_TRANSFER environment variable controls how that payload is transferred.
The default Rust frontend path doesn’t run the HF processor or pre-render mm_kwargs — it forwards only mm_hashes, and each worker re-processes the image itself. TRT-LLM backends similarly re-run their own preprocessing and don’t honor DYNAMO_MM_TRANSFER.
shm (default): POSIX shared memory via a /dev/shm segment. Intended for same-node deployments, where frontend and backend share the host filesystem. If the backend can’t access the segment (e.g., running on a different node), it falls back to re-processing the image from the URL.
nixl: NIXL RDMA transfer. Required for cross-node deployments where /dev/shm is not shared between frontend and backend. Works across nodes over InfiniBand or TCP (whichever UCX selects).
DYNAMO_DISABLE_NIXL_MM=1: Disables pre-processed mm_kwargs transfer entirely. The backend downloads and processes images itself from the original URLs. Useful for debugging or when transfer overhead exceeds re-processing cost.
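One way to read the options above is as a small decision function. The sketch below is an interpretation, not Dynamo’s code: resolve_mm_transfer and backend_image_source are hypothetical names, the literal values "shm" and "nixl" for DYNAMO_MM_TRANSFER are assumed from the option names, and it is an assumption that DYNAMO_DISABLE_NIXL_MM=1 takes precedence over DYNAMO_MM_TRANSFER and that an unreachable payload always falls back to the URL (the text states this only for shm).

```python
def resolve_mm_transfer(env):
    """Pick the mm_kwargs transfer mode from an environment mapping."""
    # Assumption: the disable flag overrides any DYNAMO_MM_TRANSFER setting.
    if env.get("DYNAMO_DISABLE_NIXL_MM") == "1":
        return "disabled"
    mode = env.get("DYNAMO_MM_TRANSFER", "shm")  # shm is the documented default
    if mode not in ("shm", "nixl"):
        raise ValueError(f"unsupported DYNAMO_MM_TRANSFER value: {mode!r}")
    return mode


def backend_image_source(mode, payload_reachable):
    """Decide whether the backend uses the pre-processed payload or
    re-processes the image from its original URL."""
    if mode == "disabled" or not payload_reachable:
        return "reprocess-from-url"
    return "pre-processed-mm_kwargs"


default_mode = resolve_mm_transfer({})
cross_node_mode = resolve_mm_transfer({"DYNAMO_MM_TRANSFER": "nixl"})
debug_mode = resolve_mm_transfer(
    {"DYNAMO_MM_TRANSFER": "nixl", "DYNAMO_DISABLE_NIXL_MM": "1"}
)
```

In a real process the mapping would be os.environ. The second helper mirrors the documented shm behaviour: a backend on another node cannot open the /dev/shm segment, so it re-processes the image from the URL.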