Multimodal KV routing extends Dynamo’s KV-aware router to account for image content when computing cache overlap scores. A dedicated MM router worker sits between the frontend and backend workers. It downloads images, computes a hash of each image (mm_hash), and includes this hash in per-block routing metadata. The KV router then selects the backend worker with the highest cache overlap, including overlap on image embedding blocks.
Repeated requests containing the same image are routed to the worker that already has the corresponding KV cache blocks, maximizing prefix cache reuse.
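The routing idea above can be illustrated with a minimal sketch. It is not Dynamo's implementation: the function names, the block size, the use of SHA-256, and the chained block-key scheme are all assumptions chosen for illustration. The sketch only shows how mixing an image hash into per-block keys lets a router tell apart two requests whose image placeholder tokens are identical.

```python
import hashlib
from typing import Dict, List, Sequence, Set, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens per KV block

# (start, end, mm_hash): token positions [start, end) covered by one image
ImageSpan = Tuple[int, int, str]


def mm_hash(image_bytes: bytes) -> str:
    """Content hash of an image (SHA-256 is an assumption, not Dynamo's choice)."""
    return hashlib.sha256(image_bytes).hexdigest()


def block_keys(tokens: Sequence[int], image_spans: Sequence[ImageSpan]) -> List[str]:
    """Build one key per full block, chained on the previous block's key.

    Blocks that overlap an image span also mix in that image's hash, so
    identical placeholder tokens for different images yield different keys.
    """
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        end = start + BLOCK_SIZE
        hashes = [h for (s, e, h) in image_spans if s < end and e > start]
        payload = repr((prev, list(tokens[start:end]), hashes)).encode()
        prev = hashlib.sha256(payload).hexdigest()
        keys.append(prev)
    return keys


def overlap(keys: Sequence[str], cached: Set[str]) -> int:
    """Count leading blocks of the request that the worker already holds."""
    count = 0
    for key in keys:
        if key not in cached:
            break
        count += 1
    return count


def pick_worker(keys: Sequence[str], workers: Dict[str, Set[str]]) -> str:
    """Select the worker with the highest prefix overlap."""
    return max(workers, key=lambda name: overlap(keys, workers[name]))


# Same token IDs, different images: 9 stands in for an image placeholder token.
tokens = [1, 2, 3, 4, 9, 9, 9, 9, 5, 6, 7, 8]
keys_cat = block_keys(tokens, [(4, 8, mm_hash(b"cat-image-bytes"))])
keys_dog = block_keys(tokens, [(4, 8, mm_hash(b"dog-image-bytes"))])

workers = {"worker-a": set(keys_cat), "worker-b": set(keys_dog)}
print(pick_worker(keys_cat, workers))  # worker-a
```

In this toy setup both requests share the first text-only block, so each worker overlaps the other's request by one block. From the image block onward the keys diverge, so only the worker that cached the same image scores a full overlap.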
Note: KV cache is separate from embedding cache (also called encoder cache), which reuses vision encoder outputs (image→embeddings) to avoid re-running the encoder. For encoder-side reuse see Embedding Cache.
Use multimodal KV routing when:
Without MM-aware routing, the standard router treats image token blocks as opaque and cannot match which worker has cached a particular image’s KV blocks.
*Requires an upcoming version of vLLM that has not yet been released. Support will be available once the new vLLM release is published.
Per-block routing metadata (block_mm_infos) that includes each image's mm_hash is built, tagging blocks that contain image tokens.
On repeated requests with the same image, the selected worker shows higher cached block counts, reducing prefill latency.
See the vLLM MM Router README and TRT-LLM MM Router README for full setup instructions and configuration options.