> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt. # Qwen3-32B KV Routing A/B Both configurations run on **16x H200 GPUs across 2 nodes**: the baseline uses 8x TP2 aggregated workers with round-robin routing; the comparison splits the same GPUs into 6 prefill + 2 decode TP2 workers with KV-aware routing. The trace's 36.64% cache efficiency means KV-aware routing can send requests to workers that already hold the relevant KV blocks, cutting redundant prefill, while the prefill/decode split keeps long-context prefills (avg 12K input tokens) from injecting ITL spikes into ongoing decodes.

Benchmark setup

Model Qwen/Qwen3-32B GPUs 16x H200 (2 nodes) Runtime vLLM Workload Mooncake conversation trace, fixed-schedule replay: 12,031 requests over \~59 min, 12,035 avg ISL, 343 avg OSL, 36.64% cache efficiency Metrics TTFT, ITL, total request latency, and goodput at TTFT 2s / ITL 25ms Held constant Model, vLLM runtime, 16x H200 across 2 nodes, fixed-schedule Mooncake trace replay, and TTFT 2s / ITL 25ms goodput thresholds ## Compared Configurations

Role	Configuration	Deploy	Benchmark
Comparison	Disaggregated KV router 6 prefill + 2 decode workers (TP2), KV-aware routing	deploy.yaml	perf.yaml
Baseline	Aggregated round-robin 8x TP2 aggregated workers, round-robin routing	deploy.yaml	perf.yaml

## Reproduce The benchmark replays the [Mooncake FAST25 conversation trace](https://github.com/kvcache-ai/Mooncake) with `--fixed-schedule`, so request arrival times are pinned to the original timestamps — throughput is fixed by the trace, and the comparison is on the latency metrics (TTFT, ITL, total request latency) plus goodput. Each configuration's `perf.yaml` downloads the trace automatically and wraps this AIPerf command: ```bash aiperf profile -m Qwen/Qwen3-32B \ --input-file conversation_trace.jsonl \ --custom-dataset-type mooncake_trace \ --fixed-schedule \ --url http://:8000 \ --streaming \ --goodput "time_to_first_token:2000 inter_token_latency:25" ``` The frontend service is `agg-8xtp2-frontend` for the baseline and `disagg-router-6p-2d-frontend` for the disaggregated configuration. Deploy one configuration at a time — each is sized for the full 16 GPUs: ```bash export NAMESPACE=your-namespace # One-time prep: storage + model download kubectl apply -f recipes/qwen3-32b/model-cache/cache.yaml -n ${NAMESPACE} kubectl apply -f recipes/qwen3-32b/model-cache/model-download.yaml -n ${NAMESPACE} kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s # Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml. kubectl apply -f recipes/qwen3-32b/vllm//deploy.yaml -n ${NAMESPACE} kubectl apply -f recipes/qwen3-32b/vllm//perf.yaml -n ${NAMESPACE} ``` The benchmark runs inside a tmux session on the benchmark pod (`kubectl exec -it -- tmux a -t benchmark`), and artifacts land on the `perf-cache` PVC under `/perf-cache/artifacts/`. ## Notes * The source publishes the comparison as a results video and directional analysis rather than a numeric table; run both configurations to produce TTFT/ITL/latency deltas for your cluster. * Edit `model-cache/cache.yaml` and set `storageClassName` to match your cluster before applying. * Source: [recipes/qwen3-32b](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b) ## Related Recipe The disaggregated KV router configuration is the promoted deployment target: [Qwen3-32B](/dynamo/dev/recipes/qwen3-32b). The aggregated round-robin configuration exists as a benchmark control only.