Embedding Cache
Cache vision encoder embeddings to skip re-encoding repeated images
The embedding cache is a CPU-side LRU cache that stores vision encoder outputs. When the same image appears in multiple requests, the cached embedding is reused instead of running the vision encoder again. This reduces GPU load on the encoder and lowers latency for repeated images.
Note: This feature is also called the encoder cache. The embedding cache is separate from the KV cache, which reuses the attention key/value state produced by prefill so that a request can skip prefill and go straight to decode. For KV cache reuse and routing, see Multimodal KV Routing.
Use the embedding cache when your workload includes repeated images across requests. Common scenarios:
If your workload consists entirely of unique images, the cache provides no benefit.
*Requires an upcoming vLLM release. Support will be available once that release is published.
The prefill worker owns the CPU-side LRU cache. On a hit, the encode worker is skipped entirely. On a miss, the encode worker produces the embedding, transfers it via NIXL, and the prefill worker saves it to the cache.
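The hit and miss paths described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual implementation: the class name `EmbeddingLRUCache`, the `get_or_encode` method, keying by a SHA-256 hash of the image bytes, and counting capacity in entries are all assumptions made for the example. The `encode` callback stands in for the encode worker plus the NIXL transfer.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List

Embedding = List[float]  # stand-in for a real tensor type


class EmbeddingLRUCache:
    """Hypothetical CPU-side LRU cache for vision encoder outputs."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity  # assumed here to be a number of entries
        self._entries: "OrderedDict[str, Embedding]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_encode(
        self, image_bytes: bytes, encode: Callable[[bytes], Embedding]
    ) -> Embedding:
        # Assumption: identical images are detected by hashing their bytes.
        key = hashlib.sha256(image_bytes).hexdigest()

        if key in self._entries:
            # Hit: mark as most recently used and skip the encoder entirely.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]

        # Miss: run the encoder (stand-in for encode worker + transfer),
        # then save the result in the cache.
        self.misses += 1
        embedding = encode(image_bytes)
        self._entries[key] = embedding

        # Evict the least recently used entry once over capacity.
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return embedding
```

In this sketch the cache lives next to the consumer of the embedding, mirroring how the prefill worker owns the cache, so a hit never contacts the encoder at all.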
Launch (vLLM):
Launch (TRT-LLM):
Set the capacity based on your expected working set of unique images. A larger cache holds more embeddings but consumes more host memory.
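To get a rough feel for the memory trade-off, the host memory used per cached image can be estimated from the embedding shape. The sketch below is a back-of-the-envelope calculation under stated assumptions: the helper names are hypothetical, and the example values (576 vision tokens, hidden size 4096, 2 bytes per element) are illustrative, not taken from any particular model or from the cache's actual accounting. Check your model's encoder output shape and the unit your backend uses for the capacity setting.

```python
def embedding_bytes(num_tokens: int, hidden_size: int, bytes_per_element: int = 2) -> int:
    """Approximate size of one cached embedding in bytes (payload only)."""
    return num_tokens * hidden_size * bytes_per_element


def entries_for_budget(
    budget_gib: float, num_tokens: int, hidden_size: int, bytes_per_element: int = 2
) -> int:
    """How many embeddings of this shape fit in a host memory budget."""
    budget_bytes = int(budget_gib * 1024**3)
    return budget_bytes // embedding_bytes(num_tokens, hidden_size, bytes_per_element)


# Illustrative numbers only: 576 tokens x 4096 hidden x 2 bytes = 4.5 MiB per image.
per_image = embedding_bytes(num_tokens=576, hidden_size=4096)
print(per_image / 1024**2)                      # 4.5
print(entries_for_budget(8, 576, 4096))         # 1820
```

If the expected working set of unique images is larger than the number of entries that fit, older embeddings are evicted and re-encoded on their next appearance, so the hit rate drops.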
See the backend-specific documentation (vLLM, TRT-LLM) for more details.