Copyright © 2026, NVIDIA Corporation.

Embedding Cache

Cache vision encoder embeddings to skip re-encoding repeated images

Overview

The embedding cache is a CPU-side LRU cache that stores vision encoder outputs. When the same image appears in multiple requests, the cached embedding is reused instead of running the vision encoder again. This reduces GPU load on the encoder and lowers latency for repeated images.

Note: This feature is also called the encoder cache. It is separate from the KV cache, which reuses attention key/value state from an earlier prefill so that a request can skip prefill and go straight to decode. For KV cache reuse and routing, see Multimodal KV Routing.

When to Use

Use the embedding cache when your workload includes repeated images across requests. Common scenarios:

  • Product catalog queries where users ask about the same product images
  • Document processing pipelines that reference shared diagrams or figures
  • Chat sessions where the same image, such as an architecture diagram in a code-generation workflow, is discussed across multiple turns

If your workload consists entirely of unique images, the cache provides no benefit.
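One way to check whether a workload has enough repetition is to measure it on a sample of past requests. The sketch below is illustrative only: `repeat_fraction` is a hypothetical helper, not part of Dynamo, and it assumes you can obtain the raw image bytes for each request. It hashes each image and reports the fraction of image references that were already seen, which is an upper bound on the hit rate an unbounded cache could reach.

```python
import hashlib


def repeat_fraction(images):
    """Fraction of image references whose content appeared earlier in the sample.

    `images` is an iterable of raw image bytes, in request order.
    Hypothetical helper for workload analysis; not a Dynamo API.
    """
    seen = set()
    repeats = 0
    total = 0
    for image_bytes in images:
        total += 1
        digest = hashlib.sha256(image_bytes).digest()
        if digest in seen:
            repeats += 1  # an unbounded cache would have served this one
        else:
            seen.add(digest)
    return repeats / total if total else 0.0


if __name__ == "__main__":
    sample = [b"img-a", b"img-b", b"img-a", b"img-a"]
    print(repeat_fraction(sample))
```

A value near zero suggests the cache would add memory cost without saving encoder work. A real cache has finite capacity, so the observed hit rate can be lower than this estimate.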

Support Matrix

| Backend | Aggregated | Disaggregated (E/PD) | Notes |
| --- | --- | --- | --- |
| vLLM | ✅* | ✅ | Aggregated uses vLLM-native ec_both; disaggregated uses Dynamo EmbeddingCacheManager |
| TRT-LLM | ❌ | ✅ | Dynamo MultimodalEmbeddingCacheManager in PD worker |
| SGLang | ❌ | ❌ | Not supported yet |

*Requires an upcoming vLLM release. Support will be available once that release is published.

How It Works

In the disaggregated (E/PD) setup, the prefill worker owns the CPU-side LRU cache. On a hit, the encode worker is skipped entirely. On a miss, the encode worker produces the embedding and transfers it via NIXL, and the prefill worker stores it in the cache.
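The hit and miss paths described above can be sketched as a small byte-budgeted LRU cache. This is a minimal illustration, not Dynamo's EmbeddingCacheManager: the class name, the content-hash key, and the `encode_fn` callback (standing in for the encode worker plus the NIXL transfer) are all assumptions made for the example.

```python
from collections import OrderedDict
import hashlib


class EmbeddingLRUCache:
    """Hypothetical byte-budgeted LRU cache; not Dynamo's implementation."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> embedding bytes, oldest first

    @staticmethod
    def key_for(image_bytes: bytes) -> str:
        # Assumption: entries are keyed by a content hash of the image.
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, key: str):
        value = self._entries.get(key)
        if value is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, embedding: bytes) -> None:
        if len(embedding) > self.capacity_bytes:
            return  # can never fit; skip caching
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        # Evict least recently used entries until the new one fits.
        while self.used_bytes + len(embedding) > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[key] = embedding
        self.used_bytes += len(embedding)


def get_embedding(cache, image_bytes, encode_fn):
    """Return (embedding, hit). encode_fn is only called on a miss."""
    key = cache.key_for(image_bytes)
    cached = cache.get(key)
    if cached is not None:
        return cached, True  # hit: encoder skipped
    embedding = encode_fn(image_bytes)  # miss: run the encoder
    cache.put(key, embedding)
    return embedding, False
```

Budgeting by bytes rather than entry count mirrors the capacity flag, which is expressed in GB. A capacity of 0 in this sketch stores nothing, so every request takes the miss path.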

Launch (vLLM):

$ cd $DYNAMO_HOME/examples/backends/vllm
$ bash launch/disagg_multimodal_e_pd.sh --multimodal-embedding-cache-capacity-gb 10

Launch (TRT-LLM):

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ ./launch/disagg_e_pd.sh --multimodal-embedding-cache-capacity-gb 10

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| --multimodal-embedding-cache-capacity-gb | CPU-side LRU cache size in GB | 0 (disabled) |

Set the capacity based on your expected working set of unique images. A larger cache holds more embeddings but consumes more host memory.
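A rough estimate can help translate a capacity in GB into a number of cached images. The sketch below is a back-of-the-envelope calculation under stated assumptions: each embedding is a dense tensor of `tokens_per_image × hidden_size` values, there is no per-entry overhead, and 1 GB is treated as 2^30 bytes. The function name and the example figures (576 tokens, hidden size 4096, 2 bytes per value) are illustrative and not taken from any particular model or from Dynamo.

```python
def embeddings_that_fit(capacity_gb, tokens_per_image, hidden_size, bytes_per_value=2):
    """Estimate how many image embeddings fit in the cache.

    Hypothetical sizing helper. Assumes dense embeddings, no per-entry
    overhead, and 1 GB == 2**30 bytes.
    """
    bytes_per_embedding = tokens_per_image * hidden_size * bytes_per_value
    capacity_bytes = int(capacity_gb * 1024**3)
    return capacity_bytes // bytes_per_embedding


if __name__ == "__main__":
    # Illustrative figures only: 576 tokens per image, hidden size 4096, 16-bit values.
    print(embeddings_that_fit(10, tokens_per_image=576, hidden_size=4096))
```

With these example figures, each embedding takes about 4.5 MiB and a 10 GB cache holds roughly 2,275 of them. Substitute the token count and hidden size of your own vision encoder to get a meaningful estimate.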

See the backend-specific documentation (vLLM, TRT-LLM) for more details.