Qwen3.6 Frontend and Cache Benchmark

How do Dynamo frontend-decoding and the embedding cache change a single-GPU multimodal sliding-window benchmark versus vanilla vLLM serve?

Three configurations run on the same single-GPU node (H100 or GB200, selected at deploy time), so the only thing that varies is the Dynamo feature set. Each turn shares 4-of-5 images with the previous turn of the same user, so repeated images dominate — exactly the shape the embedding cache is designed for. The full Dynamo stack delivers roughly +30% RPS and large TTFT reductions versus vanilla vllm serve (H100: +29.8% RPS / −66.4% TTFT avg; GB200: +31.5% / −43.3%).

Benchmark setup

  • Model: Qwen/Qwen3.6-35B-A3B-FP8
  • GPUs: 1x H100 or GB200 (selected at deploy time)
  • Runtime: vLLM
  • Workload: Sliding-window multimodal dataset: 30 users x 8 turns (240 requests), 5-image window, 8,000 text tokens, max_tokens 1024, concurrency 30
  • Metrics: RPS, TTFT (avg/p50/p90/p99), and ITL
  • Held constant: Model, one-GPU hardware target, generated sliding-window dataset, 240 requests at concurrency 30, and one shared perf.yaml benchmark template
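To make the workload shape concrete, here is a minimal sketch of the sliding window. It assumes the window advances by exactly one image per turn (which is what "shares 4-of-5 images with the previous turn" implies) and that users do not share images with each other; the helper names `window_images` and `repeated_pct` are illustrative and not part of the recipe.

```shell
#!/bin/sh
# Sketch only: image indices are per-user and purely illustrative.
# Assumption: turn t (1-based) sees images t .. t+window-1.

# Print the image indices visible in a given turn.
window_images() (
  turn=$1; window=$2; i=$turn; out=""
  while [ "$i" -lt $((turn + window)) ]; do
    out="${out:+$out }$i"
    i=$((i + 1))
  done
  echo "$out"
)

# Percentage of image slots that repeat an image already seen by the same user.
# Assumption: no images are shared between users.
repeated_pct() (
  users=$1; turns=$2; window=$3
  total=$((users * turns * window))            # image slots across all requests
  unique=$((users * (turns + window - 1)))     # distinct images across all users
  echo $(( (total - unique) * 100 / total ))
)

t=1
while [ "$t" -le 8 ]; do
  echo "turn $t: $(window_images "$t" 5)"
  t=$((t + 1))
done
echo "repeated image slots: $(repeated_pct 30 8 5)%"
```

Under these assumptions each user touches 12 distinct images over 8 turns, so 70% of image slots are repeats. That figure is only an upper bound on what an embedding cache could reuse, not a measured hit rate.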

Results

Full result tables are reproduced below from the source study, with the headline deltas summarized under each table.

H100

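| Config | RPS | ITL (ms) | TTFT avg (ms) | TTFT p50 (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|
| vllm-serve | 0.719 | 21.28 | 18,173 | 18,830 | 26,992 | 48,589 |
| dynamo-fd | 0.811 | 28.42 | 7,193 | 3,567 | 10,991 | 34,901 |
| dynamo-fd-ec | 0.933 | 24.56 | 6,101 | 2,369 | 22,869 | 34,583 |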

Δ vs vllm-serve: dynamo-fd +12.8% RPS / −60.4% TTFT avg; dynamo-fd-ec +29.8% RPS / −66.4% TTFT avg (−87.4% TTFT p50).

GB200

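| Config | RPS | ITL (ms) | TTFT avg (ms) | TTFT p50 (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---|---|---|---|---|---|
| vllm-serve | 0.940 | 15.37 | 14,954 | 15,391 | 21,660 | 24,221 |
| dynamo-fd | 1.117 | 16.82 | 9,478 | 9,061 | 15,399 | 17,326 |
| dynamo-fd-ec | 1.236 | 15.22 | 8,478 | 8,324 | 13,992 | 16,075 |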

Δ vs vllm-serve: dynamo-fd +18.8% RPS / −36.6% TTFT avg; dynamo-fd-ec +31.5% RPS / −43.3% TTFT avg (−45.9% TTFT p50).

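Frontend-decoding alone captures most of the TTFT win; the embedding cache adds throughput and a lower median TTFT. On H100 the cache's TTFT gain is concentrated at and below the median: TTFT p90 is higher with the cache than with frontend-decoding alone (22,869 ms vs 10,991 ms). ITL is roughly flat on GB200 (15.2 to 16.8 ms) but rises on H100, from 21.28 ms to 28.42 ms with dynamo-fd and 24.56 ms with dynamo-fd-ec. The cache shortens the prefill path by skipping the vision encoder for repeated images; it does not speed up decode.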
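The deltas quoted for both hardware targets follow directly from the table values. As a quick check, the sketch below recomputes a few of them as a percentage change relative to the vllm-serve baseline; `pct_delta` is a hypothetical helper name, and the inputs are copied from the tables.

```shell
#!/bin/sh
# Percentage change of a metric relative to its baseline, to one decimal place.
# usage: pct_delta <baseline> <new>
pct_delta() (
  LC_ALL=C awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%+.1f\n", (new - base) / base * 100 }'
)

echo "H100  dynamo-fd-ec RPS:      $(pct_delta 0.719 0.933)%"
echo "H100  dynamo-fd-ec TTFT avg: $(pct_delta 18173 6101)%"
echo "H100  dynamo-fd-ec TTFT p50: $(pct_delta 18830 2369)%"
echo "GB200 dynamo-fd-ec RPS:      $(pct_delta 0.940 1.236)%"
echo "GB200 dynamo-fd-ec TTFT avg: $(pct_delta 14954 8478)%"
```

The same function applies to any other cell pair, for example the dynamo-fd rows or the p90/p99 columns.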

Compared Configurations

| Role | Configuration | Deploy | Benchmark |
|---|---|---|---|
| Winner | Dynamo FD + embedding cache: frontend-decoding ON, 8 GiB embedding cache (promoted recipe target) | dynamo-fd-ec.yaml | perf.yaml |
| Comparison | Dynamo frontend-decoding: frontend-decoding ON, embedding cache OFF (isolates the frontend effect) | dynamo-fd.yaml | perf.yaml |
| Baseline | Vanilla vLLM serve: plain vLLM Deployment + Service, no Dynamo | vllm-serve.yaml | perf.yaml |

Reproduce

A dataset-generation job writes the sliding-window dataset (30u_8t_5w_8000word_base64.jsonl: 30 users x 8 turns, window 5, 2400x1080 base64 images, 8,000 text tokens, one session_id=user_<N> row per turn) to the shared PVC. The shared perf.yaml wraps this AIPerf command for every configuration:

aiperf profile --model Qwen/Qwen3.6-35B-A3B-FP8 \
  --input-file /perf-cache/datasets/30u_8t_5w_8000word_base64.jsonl \
  --custom-dataset-type single_turn \
  --url http://<frontend>:8000 --streaming \
  --request-count 240 --concurrency 30 --warmup-request-count 2 \
  --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 \
  --extra-inputs ignore_eos:true
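Before launching a run, it can be worth confirming that the generated dataset really has the expected shape (30 users x 8 turns, 240 rows). The following is a rough sketch, not part of the recipe: `check_dataset` is a hypothetical helper, and it assumes each JSONL row spells the field as `"session_id": "user_<N>"`. The demo runs against a small mock file rather than the real dataset.

```shell
#!/bin/sh
# Sketch: verify row count and per-user turn count of a sliding-window JSONL file.
# Assumption: every row contains a field written as "session_id": "user_<N>".
# usage: check_dataset <file> <expected_users> <expected_turns>
check_dataset() (
  file=$1; want_users=$2; want_turns=$3
  rows=$(wc -l < "$file" | tr -d ' ')
  # Extract one user id per row, then count rows per user.
  per_user=$(grep -oE '"session_id"[[:space:]]*:[[:space:]]*"user_[0-9]+"' "$file" \
               | grep -oE 'user_[0-9]+' | sort | uniq -c)
  n_users=$(printf '%s\n' "$per_user" | awk 'NF' | wc -l | tr -d ' ')
  bad=$(printf '%s\n' "$per_user" | awk -v t="$want_turns" 'NF && $1 != t' | wc -l | tr -d ' ')
  echo "rows=$rows users=$n_users users_with_wrong_turn_count=$bad"
  [ "$rows" -eq $((want_users * want_turns)) ] &&
    [ "$n_users" -eq "$want_users" ] &&
    [ "$bad" -eq 0 ]
)

# Demo on a mock file: 3 users x 2 turns.
mock=$(mktemp)
for u in 1 2 3; do
  for t in 1 2; do
    printf '{"session_id": "user_%d", "text": "turn %d"}\n' "$u" "$t"
  done
done > "$mock"
check_dataset "$mock" 3 2 && echo "dataset shape OK"
rm -f "$mock"
```

Against the real file the call would be `check_dataset /perf-cache/datasets/30u_8t_5w_8000word_base64.jsonl 30 8`. If the actual field layout differs from the assumed one, the grep pattern needs adjusting.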

The recipe is driven end-to-end by scripts that template the YAML via envsubst (the per-configuration pod name and frontend service are injected at apply time):

cd recipes/qwen3.6-35b
export NAMESPACE=your-namespace
export HW=gb200 # or h100; fill in your hostname in hw/${HW}.env first

# Run all three configs sequentially (prep + deploy + bench + retrieve + clean per config)
./run-all-benchmarks.sh -n ${NAMESPACE} --hw ${HW}

# Or one config at a time
./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config vllm-serve
./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config dynamo-fd
./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config dynamo-fd-ec

Each config’s profile_export_aiperf.json is retrieved locally and holds the headline metrics.

Notes

  • AIPerf is installed from source pinned to a main SHA that includes PR 824, which makes single_turn mode honor session_id ordering — that causal ordering is what lets prefix-cache hits across a user’s turns actually land.
  • The recipe expects a shared-model-cache RWX PVC in the namespace; Qwen/Qwen3.6-35B-A3B-FP8 is public, so no HuggingFace token is required.
  • The vLLM command uses --mm-processor-cache-gb 30 and --max-model-len 32768 to handle the 5-image multimodal context.
  • Source: recipes/qwen3.6-35b

Winning Configuration

The dynamo-fd-ec configuration is the winning configuration; its deploy assets are above, and a recommended Recipe may be promoted from this benchmark in a future release. The vllm-serve baseline and dynamo-fd step exist as benchmark controls.