Qwen3-VL Embedding Cache A/B
How much does enabling the vLLM multimodal embedding cache improve repeated-image traffic on one GB200 worker?
Two configurations run the same deploy.yaml on a single aggregated GB200 worker; the only difference is the DYN_MULTIMODAL_EMBEDDING_CACHE_GB environment variable (10 GB for cache ON, 0 for cache OFF). The dataset spreads 1,000 requests over a pool of 200 images, so the first 200 requests see unique images and the remaining 800 reuse images the engine has already encoded. On a cache hit, the prefill path skips the vision encoder. Enabling the cache delivers +16.4% output throughput and −27.7% average TTFT.
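The 800-of-1,000 figure follows from simple arithmetic, sketched below in Python. The function name ideal_hit_rate is hypothetical and not part of the recipe; it assumes every pool image is requested at least once and that nothing is evicted, so the result is an idealized rate rather than a measured one.

```python
def ideal_hit_rate(num_requests: int, pool_size: int) -> float:
    """Idealized cache hit rate for a repeated-image workload.

    Assumes each image in the pool is requested at least once and that
    no cached embedding is ever evicted (both are assumptions, not
    measurements from the benchmark).
    """
    if num_requests <= 0 or pool_size <= 0:
        raise ValueError("num_requests and pool_size must be positive")
    # Each distinct image misses exactly once (its first appearance).
    first_time_requests = min(pool_size, num_requests)
    return (num_requests - first_time_requests) / num_requests


# Workload from this benchmark: 1,000 requests over a 200-image pool.
print(ideal_hit_rate(1000, 200))  # 0.8
```

Under these assumptions, 80% of requests can skip the vision encoder, which is consistent with the 200/800 split described above.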
Benchmark setup
Results
Enabling the embedding cache on a single aggregated GB200 replica with the vLLM backend delivers +16% throughput, −28% TTFT, and −13% request latency (single representative run, reproduced from the recipe README).
Compared Configurations
Reproduce
A dataset-generation job creates the synthetic multimodal dataset (qwen3_vl_1000req_1img_pool200.jsonl: 1,000 requests, 1 image per request, 200-image pool, 400 text tokens per request) on the perf-cache PVC. The perf.yaml then wraps this AIPerf command:
Run each configuration in sequence, redeploying with the toggled cache setting between runs:
The helper script recipes/qwen3-vl-30b/vllm/agg-embedding-cache/run-benchmark.sh automates each run: pass on or off to run one cache mode per invocation. AIPerf artifacts land under /perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg/<cache_mode>.
Notes
- Exact cache hit rates cannot be pinned via the dataset because of LRU eviction; shrinking the image pool relative to request count (or growing the cache) raises the hit probability.
- The aggregated embedding cache uses vLLM’s native ec_both ECConnector role, supported in vLLM 0.17+ with no patches; see the multimodal vLLM docs.
- Replace the storageClassName and image: placeholders in the YAML files before running.
- Source: recipes/qwen3-vl-30b
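The LRU eviction note above can be explored with a small simulation. This is a toy sketch under stated assumptions, not the vLLM implementation: the function simulate_lru_hit_rate is hypothetical, requests pick images uniformly at random, and cache capacity is counted in entries, whereas the real cache is sized in GB via DYN_MULTIMODAL_EMBEDDING_CACHE_GB.

```python
import random
from collections import OrderedDict


def simulate_lru_hit_rate(num_requests: int, pool_size: int,
                          capacity: int, seed: int = 0) -> float:
    """Estimate the hit rate of an LRU cache on a random image workload.

    Toy model: capacity is a number of cached images (an assumption; the
    real cache is sized in GB) and images are drawn uniformly at random.
    """
    rng = random.Random(seed)
    cache: OrderedDict[int, bool] = OrderedDict()
    hits = 0
    for _ in range(num_requests):
        image_id = rng.randrange(pool_size)
        if image_id in cache:
            hits += 1
            cache.move_to_end(image_id)    # mark as most recently used
        else:
            cache[image_id] = True         # "encode" and store the image
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / num_requests


# Same request trace (same seed), different cache sizes.
for capacity in (0, 50, 200):
    rate = simulate_lru_hit_rate(1000, 200, capacity)
    print(f"capacity={capacity:>3} images -> hit rate {rate:.2f}")
```

In this model, a cache that holds the whole pool never evicts, while a smaller cache loses entries before they are reused. Shrinking the pool or growing the cache moves the simulated hit rate toward the no-eviction case, which matches the direction described in the note.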
Winning Configuration
The cache-ON configuration is the winning configuration and is deployable from its assets above; a recommended Recipe may be promoted from this benchmark in a future release. The cache-OFF configuration is the same manifest with the cache disabled, kept as the benchmark control.