Qwen3.6 Frontend and Cache Benchmark | NVIDIA Dynamo Documentation

Three configurations run on the same single-GPU node (H100 or GB200, selected at deploy time), so the only thing that varies is the Dynamo feature set. Each turn shares 4-of-5 images with the previous turn of the same user, so repeated images dominate — exactly the shape the embedding cache is designed for. The full Dynamo stack delivers roughly +30% RPS and large TTFT reductions versus vanilla vllm serve (H100: +29.8% RPS / −66.4% TTFT avg; GB200: +31.5% / −43.3%).

Benchmark setup

Model	Qwen/Qwen3.6-35B-A3B-FP8
GPUs	1x H100 or GB200 (selected at deploy time)
Runtime	vLLM
Workload	Sliding-window multimodal dataset: 30 users x 8 turns (240 requests), 5-image window, 8,000 text tokens, max_tokens 1024, concurrency 30
Metrics	RPS, TTFT (avg/p50/p90/p99), and ITL
Held constant	Model, one-GPU hardware target, generated sliding-window dataset, 240 requests at concurrency 30, and one shared perf.yaml benchmark template
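To make the workload shape concrete, here is a minimal sketch of the image indices one user sees across 8 turns with a 5-image window. It assumes the first turn already carries a full window and that each later turn drops the oldest image and adds one new one; the function name and the indexing scheme are illustrative and are not taken from the dataset generator.

```python
from typing import List


def sliding_windows(num_turns: int = 8, window: int = 5) -> List[List[int]]:
    """Return the image indices visible at each turn for a single user.

    Assumption: turn t sees images t .. t + window - 1, so consecutive
    turns share window - 1 images.
    """
    return [list(range(t, t + window)) for t in range(num_turns)]


turns = sliding_windows()

# Images shared between each turn and the one before it.
overlaps = [len(set(a) & set(b)) for a, b in zip(turns, turns[1:])]

# Distinct images versus total image references for one user.
unique_images = len({i for t in turns for i in t})
total_refs = sum(len(t) for t in turns)
repeat_fraction = 1 - unique_images / total_refs

print(overlaps)                      # [4, 4, 4, 4, 4, 4, 4]
print(unique_images, total_refs)     # 12 40
print(round(repeat_fraction, 2))     # 0.7
```

Under these assumptions, 28 of the 40 image references per user are repeats, which is why an embedding cache has so much to work with on this dataset.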

Results

The tables below reproduce the full results from the source study; the deltas against vllm-serve follow each table.

H100

Config	RPS	ITL (ms)	TTFT avg (ms)	TTFT p50 (ms)	TTFT p90 (ms)	TTFT p99 (ms)
`vllm-serve`	0.719	21.28	18,173	18,830	26,992	48,589
`dynamo-fd`	0.811	28.42	7,193	3,567	10,991	34,901
`dynamo-fd-ec`	0.933	24.56	6,101	2,369	22,869	34,583

Δ vs vllm-serve: dynamo-fd +12.8% RPS / −60.4% TTFT avg; dynamo-fd-ec +29.8% RPS / −66.4% TTFT avg (−87.4% TTFT p50).

GB200

Config	RPS	ITL (ms)	TTFT avg (ms)	TTFT p50 (ms)	TTFT p90 (ms)	TTFT p99 (ms)
`vllm-serve`	0.940	15.37	14,954	15,391	21,660	24,221
`dynamo-fd`	1.117	16.82	9,478	9,061	15,399	17,326
`dynamo-fd-ec`	1.236	15.22	8,478	8,324	13,992	16,075

Δ vs vllm-serve: dynamo-fd +18.8% RPS / −36.6% TTFT avg; dynamo-fd-ec +31.5% RPS / −43.3% TTFT avg (−45.9% TTFT p50).
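The deltas quoted for both platforms can be recomputed from the table values. The sketch below does that arithmetic; the dictionary layout and the helper names are illustrative, and the numbers are copied from the H100 and GB200 tables.

```python
def pct_change(new: float, base: float) -> float:
    """Percent change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# RPS, TTFT avg (ms), TTFT p50 (ms), copied from the result tables.
RESULTS = {
    "H100": {
        "vllm-serve":   {"rps": 0.719, "ttft_avg": 18173, "ttft_p50": 18830},
        "dynamo-fd":    {"rps": 0.811, "ttft_avg": 7193,  "ttft_p50": 3567},
        "dynamo-fd-ec": {"rps": 0.933, "ttft_avg": 6101,  "ttft_p50": 2369},
    },
    "GB200": {
        "vllm-serve":   {"rps": 0.940, "ttft_avg": 14954, "ttft_p50": 15391},
        "dynamo-fd":    {"rps": 1.117, "ttft_avg": 9478,  "ttft_p50": 9061},
        "dynamo-fd-ec": {"rps": 1.236, "ttft_avg": 8478,  "ttft_p50": 8324},
    },
}


def deltas(hw: str, config: str) -> dict:
    """Percent change of each metric for `config` versus the vllm-serve baseline."""
    base = RESULTS[hw]["vllm-serve"]
    row = RESULTS[hw][config]
    return {metric: pct_change(row[metric], base[metric]) for metric in row}


for hw in RESULTS:
    for config in ("dynamo-fd", "dynamo-fd-ec"):
        d = deltas(hw, config)
        print(f"{hw} {config}: {d['rps']:+.1f}% RPS, "
              f"{d['ttft_avg']:+.1f}% TTFT avg, {d['ttft_p50']:+.1f}% TTFT p50")
```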

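Frontend-decoding alone captures most of the TTFT win; the embedding cache adds further throughput and a lower median TTFT on both platforms, although on H100 its TTFT p90 (22,869 ms) is higher than with frontend-decoding alone (10,991 ms). ITL does not improve, because the cache shortens the prefill path (skipping the vision encoder for repeated images), not decode: with the full stack it is essentially unchanged on GB200 (15.37 ms to 15.22 ms) and rises on H100 (21.28 ms to 24.56 ms).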
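The idea behind skipping the vision encoder for repeated images can be sketched as a byte-budgeted LRU cache keyed by a hash of the image bytes. This is an illustration of the general technique only, not Dynamo's implementation; the class name, the hashing choice, and the stand-in encoder are all assumptions.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class EmbeddingCache:
    """Illustrative byte-budgeted LRU cache for image embeddings."""

    def __init__(self, capacity_bytes: int, encode: Callable[[bytes], bytes]):
        self.capacity_bytes = capacity_bytes
        self.encode = encode              # the expensive vision-encoder call
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.hits = 0
        self.misses = 0

    def get(self, image: bytes) -> bytes:
        key = hashlib.sha256(image).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        embedding = self.encode(image)        # encoder runs only on a miss
        self._entries[key] = embedding
        self._used += len(embedding)
        # Evict least recently used entries until back under budget.
        while self._used > self.capacity_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self._used -= len(evicted)
        return embedding


def fake_encoder(image: bytes) -> bytes:
    """Stand-in for a vision encoder: returns a 128-byte pseudo-embedding."""
    return hashlib.sha256(image).digest() * 4


# Replay one user's 8 turns, assuming turn t carries images t .. t + 4.
cache = EmbeddingCache(capacity_bytes=8 * 1024**3, encode=fake_encoder)
for turn in range(8):
    for idx in range(turn, turn + 5):
        cache.get(f"image-{idx}".encode())

print(cache.hits, cache.misses)   # 28 12
```

In this replay the encoder runs 12 times instead of 40, which only affects work done before the first token; decode is untouched.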

Compared Configurations

Role	Configuration	Deploy	Benchmark
Winner	Dynamo FD + embedding cache: frontend-decoding ON, 8 GiB embedding cache; promoted recipe target	dynamo-fd-ec.yaml	perf.yaml
Comparison	Dynamo frontend-decoding: frontend-decoding ON, embedding cache OFF; isolates the frontend effect	dynamo-fd.yaml	perf.yaml
Baseline	Vanilla vLLM serve: plain vLLM Deployment + Service, no Dynamo	vllm-serve.yaml	perf.yaml

Reproduce

A dataset-generation job writes the sliding-window dataset (30u_8t_5w_8000word_base64.jsonl: 30 users x 8 turns, window 5, 2400x1080 base64 images, 8,000 text tokens, one session_id=user_<N> row per turn) to the shared PVC. The shared perf.yaml wraps this AIPerf command for every configuration:

$ aiperf profile --model Qwen/Qwen3.6-35B-A3B-FP8 \
>   --input-file /perf-cache/datasets/30u_8t_5w_8000word_base64.jsonl \
>   --custom-dataset-type single_turn \
>   --url http://<frontend>:8000 --streaming \
>   --request-count 240 --concurrency 30 --warmup-request-count 2 \
>   --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 \
>   --extra-inputs ignore_eos:true

The recipe is driven end-to-end by scripts that template the YAML via envsubst (the per-configuration pod name and frontend service are injected at apply time):

$ cd recipes/qwen3.6-35b
$ export NAMESPACE=your-namespace
$ export HW=gb200   # or h100; fill in your hostname in hw/${HW}.env first
$ 
$ # Run all three configs sequentially (prep + deploy + bench + retrieve + clean per config)
$ ./run-all-benchmarks.sh -n ${NAMESPACE} --hw ${HW}
$ 
$ # Or one config at a time
$ ./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config vllm-serve
$ ./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config dynamo-fd
$ ./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config dynamo-fd-ec
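For readers unfamiliar with envsubst-style templating, the substitution step the scripts perform can be approximated in Python with `string.Template`, which accepts the same `${VAR}` syntax. The manifest fragment and the `POD_NAME` and `FRONTEND_SVC` variable names below are hypothetical; the recipe's actual templates and variable names may differ.

```python
from string import Template

# Hypothetical manifest fragment; not copied from the recipe's perf.yaml.
MANIFEST = Template("""\
metadata:
  name: ${POD_NAME}
  namespace: ${NAMESPACE}
args: ["--url", "http://${FRONTEND_SVC}:8000"]
""")


def render(values: dict) -> str:
    """Fill in the placeholders, failing loudly if one is missing.

    Unlike envsubst, which replaces unset variables with an empty string,
    Template.substitute raises KeyError for a missing name.
    """
    return MANIFEST.substitute(values)


rendered = render({
    "POD_NAME": "perf-dynamo-fd-ec",           # hypothetical value
    "NAMESPACE": "your-namespace",
    "FRONTEND_SVC": "dynamo-fd-ec-frontend",   # hypothetical value
})
print(rendered)
```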

Each config’s profile_export_aiperf.json is retrieved locally and holds the headline metrics.

Notes

AIPerf is installed from source pinned to a main SHA that includes PR 824, which makes single_turn mode honor session_id ordering — that causal ordering is what lets prefix-cache hits across a user’s turns actually land.
The recipe expects a shared-model-cache RWX PVC in the namespace; Qwen/Qwen3.6-35B-A3B-FP8 is public, so no HuggingFace token is required.
The vLLM command uses --mm-processor-cache-gb 30 and --max-model-len 32768 to handle the 5-image multimodal context.
Source: recipes/qwen3.6-35b
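The session ordering mentioned in the AIPerf note can be illustrated with a simplified wave-based dispatcher: a user's next turn becomes eligible only after that user's previous turn has been issued, and no wave contains two turns from the same session. This is a sketch of the ordering constraint only, not AIPerf's scheduler; the `turn` field and the wave model are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List


def dispatch_waves(rows: Iterable[dict], concurrency: int) -> List[List[dict]]:
    """Group rows into waves that preserve per-session turn order.

    Each wave holds at most `concurrency` rows and at most one row per
    session_id, so turn N + 1 of a session is never issued alongside or
    before turn N.
    """
    queues: Dict[str, Deque[dict]] = defaultdict(deque)
    for row in rows:
        queues[row["session_id"]].append(row)   # file order is turn order

    waves: List[List[dict]] = []
    while any(queues.values()):
        wave = []
        for queue in queues.values():
            if queue and len(wave) < concurrency:
                wave.append(queue.popleft())
        waves.append(wave)
    return waves


# 30 users x 8 turns, one row per turn, as in the benchmark dataset.
rows = [{"session_id": f"user_{u}", "turn": t}
        for t in range(8) for u in range(30)]
waves = dispatch_waves(rows, concurrency=30)
print(len(waves), [len(w) for w in waves])
```

With 30 sessions and concurrency 30, every wave carries exactly one turn per user, so each turn arrives after the previous turn for that user has already populated the caches.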

Winning Configuration

The dynamo-fd-ec configuration is the winner; its deploy assets are listed under Compared Configurations, and a recommended recipe may be promoted from this benchmark in a future release. The vllm-serve baseline and the dynamo-fd step serve as benchmark controls.