Qwen3-VL Embedding Cache A/B

How much does enabling the vLLM multimodal embedding cache improve repeated-image traffic on one GB200 worker?

View as Markdown

Two configurations run the same deploy.yaml on a single aggregated GB200 worker — the only delta is the DYN_MULTIMODAL_EMBEDDING_CACHE_GB environment variable (10 GB for cache ON, 0 for cache OFF). With an image pool of 200 across 1,000 requests, the first 200 requests see unique images and the remaining 800 hit images the engine has already encoded, so a cache hit skips the vision encoder on the prefill path. Enabling the cache delivers +16.4% output throughput and −27.7% average TTFT.

Benchmark setup

Model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8GPUs 1x GB200 (one aggregated replica)Runtime vLLMWorkload 1,000 single-turn multimodal requests, 1 image each from a 200-image pool (80% image reuse), 400 text tokens, concurrency 64Metrics Output TPS, TTFT, ITL, and request latencyHeld constant Model, vLLM runtime, one aggregated GB200 replica, generated dataset, request count, concurrency, and forced 150-token outputs

Results

Enabling the embedding cache on a single aggregated GB200 replica with the vLLM backend delivers +16% throughput, -28% TTFT, and -13% request latency (single representative run, reproduced from the recipe README):

MetricCache ONCache OFFDelta
Output TPS (tok/s)3,575.63,072.3+16.4%
TTFT avg (ms)526.0727.5-27.7%
TTFT p50 (ms)356.8510.8-30.1%
ITL avg (ms)14.115.5-8.8%
Request latency avg (ms)2,630.03,035.7-13.4%

Compared Configurations

RoleConfigurationDeployBenchmark
ComparisonEmbedding cache ONDYN_MULTIMODAL_EMBEDDING_CACHE_GB=10 (deploy.yaml default)deploy.yamlperf.yaml
BaselineEmbedding cache OFFSet DYN_MULTIMODAL_EMBEDDING_CACHE_GB=0 in deploy.yaml and CACHE_MODE=cache_off in perf.yamldeploy.yamlperf.yaml

Reproduce

A dataset-generation job creates the synthetic multimodal dataset (qwen3_vl_1000req_1img_pool200.jsonl: 1,000 requests, 1 image per request, 200-image pool, 400 text tokens per request) on the perf-cache PVC. The perf.yaml then wraps this AIPerf command:

$aiperf profile --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
> --input-file /perf-cache/datasets/qwen3_vl_1000req_1img_pool200.jsonl \
> --custom-dataset-type single_turn \
> --url http://qwen3-vl-agg-frontend:8000 --streaming \
> --request-count 1000 --concurrency 64 --warmup-request-count 3 \
> --extra-inputs max_tokens:150 --extra-inputs min_tokens:150 \
> --extra-inputs ignore_eos:true

Run each configuration in sequence — redeploy with the toggled cache setting between runs:

$export NAMESPACE=your-namespace
$
$# One-time prep: storage, model download, dataset generation
$kubectl apply -f recipes/qwen3-vl-30b/model-cache/model-cache.yaml -n ${NAMESPACE}
$kubectl apply -f recipes/qwen3-vl-30b/model-cache/model-download.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=3600s
$kubectl apply -f recipes/qwen3-vl-30b/data-gen/generate-datasets-job.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Complete job/qwen3-vl-30b-generate-datasets -n ${NAMESPACE} --timeout=3600s
$
$# Cache ON (deploy.yaml default), then benchmark
$kubectl apply -f recipes/qwen3-vl-30b/vllm/agg-embedding-cache/deploy.yaml -n ${NAMESPACE}
$kubectl wait --for=condition=Ready dynamographdeployment/qwen3-vl-agg -n ${NAMESPACE} --timeout=900s
$kubectl apply -f recipes/qwen3-vl-30b/vllm/agg-embedding-cache/perf.yaml -n ${NAMESPACE}
$
$# Cache OFF: set DYN_MULTIMODAL_EMBEDDING_CACHE_GB=0 in deploy.yaml and
$# CACHE_MODE=cache_off in perf.yaml, then re-apply both.

The helper script recipes/qwen3-vl-30b/vllm/agg-embedding-cache/run-benchmark.sh automates each run — pass on or off to run one cache mode per invocation. AIPerf artifacts land under /perf-cache/artifacts/qwen3_vl_30b_embedding_cache/agg/<cache_mode>.

Notes

  • Exact cache hit rates cannot be pinned via the dataset because of LRU eviction; shrinking the image pool relative to request count (or growing the cache) raises the hit probability.
  • The aggregated embedding cache uses vLLM’s native ec_both ECConnector role, supported in vLLM 0.17+ with no patches — see multimodal vLLM docs.
  • Replace the storageClassName and image: placeholders in the YAML files before running.
  • Source: recipes/qwen3-vl-30b

Winning Configuration

The cache-ON configuration is the winning configuration and is deployable from its assets above; a recommended Recipe may be promoted from this benchmark in a future release. The cache-OFF configuration is the same manifest with the cache disabled, kept as the benchmark control.