# Qwen3.6 Frontend and Cache Benchmark

Three configurations run on the same single-GPU node (H100 or GB200, selected at deploy time), so the only thing that varies is the Dynamo feature set. Each turn shares 4 of its 5 images with the previous turn of the same user, so repeated images dominate, which is exactly the shape the embedding cache is designed for. The full Dynamo stack delivers roughly **+30% RPS and large TTFT reductions** versus vanilla `vllm serve` (H100: +29.8% RPS / −66.4% TTFT avg; GB200: +31.5% / −43.3%).
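To make the reuse pattern concrete, the following is a minimal sketch of the arithmetic behind a sliding-window workload. `image_reuse` is a hypothetical helper, not part of the recipe. It assumes each user's first turn carries a full window of fresh images, each later turn adds exactly one new image, and users never share images with each other.

```bash
# Hypothetical helper, not part of the recipe: counts image references in a
# sliding-window workload. Assumes the first turn of each user carries
# WINDOW fresh images, every later turn adds exactly one new image, and
# images are never shared between users.
image_reuse() {
  local users=$1 turns=$2 window=$3
  local total=$(( users * turns * window ))          # image references sent
  local unique=$(( users * (window + turns - 1) ))   # distinct images
  local repeated=$(( total - unique ))               # references seen before
  echo "total=${total} unique=${unique} repeated=${repeated} reuse_pct=$(( 100 * repeated / total ))"
}

# 30 users x 8 turns with a 5-image window, as in this benchmark
image_reuse 30 8 5
```

Under those assumptions, 840 of the 1,200 image references (70%) repeat an image the same user already sent. That is an upper bound on how often an ideal cache could skip the vision encoder, not a measured hit rate.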

## Benchmark Setup

| Field | Value |
| ------------- | ----- |
| Model | `Qwen/Qwen3.6-35B-A3B-FP8` |
| GPUs | 1x H100 or GB200 (selected at deploy time) |
| Runtime | vLLM |
| Workload | Sliding-window multimodal dataset: 30 users x 8 turns (240 requests), 5-image window, 8,000 text tokens, `max_tokens` 1024, concurrency 30 |
| Metrics | RPS, TTFT (avg/p50/p90/p99), and ITL |
| Held constant | Model, one-GPU hardware target, generated sliding-window dataset, 240 requests at concurrency 30, and one shared `perf.yaml` benchmark template |

## Results

The result tables below are reproduced from the [source study](https://github.com/ai-dynamo/dynamo/blob/main/docs/benchmarks/embedding_cache.md).

### H100

| Config | RPS | ITL (ms) | TTFT avg (ms) | TTFT p50 (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
| -------------- | ----: | -------: | ------------: | ------------: | ------------: | ------------: |
| `vllm-serve` | 0.719 | 21.28 | 18,173 | 18,830 | 26,992 | 48,589 |
| `dynamo-fd` | 0.811 | 28.42 | 7,193 | 3,567 | 10,991 | 34,901 |
| `dynamo-fd-ec` | 0.933 | 24.56 | 6,101 | 2,369 | 22,869 | 34,583 |

Δ vs `vllm-serve`: `dynamo-fd` **+12.8% RPS / −60.4% TTFT avg**; `dynamo-fd-ec` **+29.8% RPS / −66.4% TTFT avg** (−87.4% TTFT p50).

### GB200

| Config | RPS | ITL (ms) | TTFT avg (ms) | TTFT p50 (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
| -------------- | ----: | -------: | ------------: | ------------: | ------------: | ------------: |
| `vllm-serve` | 0.940 | 15.37 | 14,954 | 15,391 | 21,660 | 24,221 |
| `dynamo-fd` | 1.117 | 16.82 | 9,478 | 9,061 | 15,399 | 17,326 |
| `dynamo-fd-ec` | 1.236 | 15.22 | 8,478 | 8,324 | 13,992 | 16,075 |

Δ vs `vllm-serve`: `dynamo-fd` **+18.8% RPS / −36.6% TTFT avg**; `dynamo-fd-ec` **+31.5% RPS / −43.3% TTFT avg** (−45.9% TTFT p50).

Frontend-decoding alone captures most of the TTFT win; the embedding cache layers on additional throughput and tighter median TTFT. ITL changes far less than TTFT (roughly flat on GB200, up to about 7 ms higher on H100) because the cache shortens the prefill path (skipping the vision encoder for repeated images), not decode.

## Compared Configurations

| Role | Configuration | Deploy | Benchmark |
| ---------- | ------------- | ------ | --------- |
| Winner | **Dynamo FD + embedding cache**: frontend-decoding ON, 8 GiB embedding cache; promoted recipe target | `dynamo-fd-ec.yaml` | `perf.yaml` |
| Comparison | **Dynamo frontend-decoding**: frontend-decoding ON, embedding cache OFF; isolates the frontend effect | `dynamo-fd.yaml` | `perf.yaml` |
| Baseline | **Vanilla vLLM serve**: plain vLLM Deployment + Service, no Dynamo | `vllm-serve.yaml` | `perf.yaml` |
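
The Δ figures quoted for each configuration can be rechecked from the raw table values. The sketch below uses a hypothetical `pct_delta` helper (not part of the recipe) that computes the percent change of a value relative to the `vllm-serve` baseline; the inputs are copied from the H100 results table.

```bash
# Hypothetical helper, not part of the recipe: percent change of VALUE
# relative to BASELINE, rounded to one decimal place with an explicit sign.
pct_delta() {
  LC_ALL=C awk -v base="$1" -v val="$2" \
    'BEGIN { printf "%+.1f\n", (val - base) / base * 100 }'
}

# H100, dynamo-fd-ec vs vllm-serve
pct_delta 0.719 0.933    # RPS
pct_delta 18173 6101     # TTFT avg (ms)
```

The two calls print `+29.8` and `-66.4`, matching the reported H100 deltas for `dynamo-fd-ec`. The same helper applies to any other row or to the GB200 table.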

## Reproduce

A dataset-generation job writes the sliding-window dataset (`30u_8t_5w_8000word_base64.jsonl`: 30 users x 8 turns, window 5, 2400x1080 base64 images, 8,000 text tokens, one `session_id=user_` row per turn) to the shared PVC. The shared `perf.yaml` wraps this AIPerf command for every configuration:

```bash
aiperf profile --model Qwen/Qwen3.6-35B-A3B-FP8 \
  --input-file /perf-cache/datasets/30u_8t_5w_8000word_base64.jsonl \
  --custom-dataset-type single_turn \
  --url http://:8000 --streaming \
  --request-count 240 --concurrency 30 --warmup-request-count 2 \
  --extra-inputs max_tokens:1024 --extra-inputs min_tokens:1024 \
  --extra-inputs ignore_eos:true
```

The recipe is driven end-to-end by scripts that template the YAML via `envsubst` (the per-configuration pod name and frontend service are injected at apply time):

```bash
cd recipes/qwen3.6-35b
export NAMESPACE=your-namespace
export HW=gb200   # or h100; fill in your hostname in hw/${HW}.env first

# Run all three configs sequentially (prep + deploy + bench + retrieve + clean per config)
./run-all-benchmarks.sh -n ${NAMESPACE} --hw ${HW}

# Or one config at a time
./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config vllm-serve
./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config dynamo-fd
./run-benchmark.sh -n ${NAMESPACE} --hw ${HW} --config dynamo-fd-ec
```

Each config's `profile_export_aiperf.json` is retrieved locally and holds the headline metrics.

## Notes

* AIPerf is installed from source pinned to a `main` SHA that includes [PR 824](https://github.com/ai-dynamo/aiperf/pull/824), which makes `single_turn` mode honor `session_id` ordering. That causal ordering is what lets prefix-cache hits across a user's turns actually land.
* The recipe expects a `shared-model-cache` RWX PVC in the namespace; `Qwen/Qwen3.6-35B-A3B-FP8` is public, so no HuggingFace token is required.
* The vLLM command uses `--mm-processor-cache-gb 30` and `--max-model-len 32768` to handle the 5-image multimodal context.
* Source: [recipes/qwen3.6-35b](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3.6-35b)

## Winning Configuration

`dynamo-fd-ec` is the winning configuration. Its deploy assets are listed above, and a recommended Recipe may be promoted from this benchmark in a future release. The `vllm-serve` baseline and the `dynamo-fd` step exist as benchmark controls.