Qwen3-32B KV Routing A/B
Does disaggregated KV-aware routing reduce TTFT and ITL on multi-turn prefix-reuse traffic compared with aggregated round-robin routing?
Both configurations run on 16x H200 GPUs across 2 nodes: the baseline uses 8x TP2 aggregated workers with round-robin routing; the comparison splits the same GPUs into 6 prefill + 2 decode TP2 workers with KV-aware routing. The trace’s 36.64% cache efficiency means KV-aware routing can send requests to workers that already hold the relevant KV blocks, cutting redundant prefill, while the prefill/decode split keeps long-context prefills (avg 12K input tokens) from injecting ITL spikes into ongoing decodes.
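The GPU budget and the prefix-reuse figure above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the recipe: the `WorkerPool` type and function names are hypothetical, and it assumes that "cache efficiency" means the fraction of input tokens whose KV blocks could be reused, and that "12K" is roughly 12,000 tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    """Hypothetical description of one group of identical workers."""
    name: str
    workers: int
    tensor_parallel: int  # GPUs per worker (TP2 -> 2)

    @property
    def gpus(self) -> int:
        return self.workers * self.tensor_parallel


def total_gpus(pools) -> int:
    """Sum the GPUs consumed by every pool in a configuration."""
    return sum(pool.gpus for pool in pools)


# Worker counts taken from the text: 8x TP2 aggregated vs 6 prefill + 2 decode TP2.
BASELINE = [WorkerPool("aggregated", workers=8, tensor_parallel=2)]
DISAGG = [
    WorkerPool("prefill", workers=6, tensor_parallel=2),
    WorkerPool("decode", workers=2, tensor_parallel=2),
]


def reusable_prefill_tokens(avg_input_tokens: float, cache_efficiency: float) -> float:
    """Upper bound on prefill tokens per request that a perfect router could skip.

    Assumes cache_efficiency is the reusable fraction of input tokens.
    """
    return avg_input_tokens * cache_efficiency


if __name__ == "__main__":
    print("baseline GPUs:", total_gpus(BASELINE))
    print("disaggregated GPUs:", total_gpus(DISAGG))
    print("reusable tokens/request (upper bound):",
          reusable_prefill_tokens(12_000, 0.3664))
```

Both configurations come out at 16 GPUs, so the comparison holds hardware constant. Under the stated assumption, ideal routing could skip on the order of 4.4K of the roughly 12K input tokens per request; real savings depend on which worker holds the blocks and on cache eviction.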
Benchmark setup
Compared Configurations
Reproduce
The benchmark replays the Mooncake FAST25 conversation trace with --fixed-schedule, so request arrival times are pinned to the original timestamps. Throughput is therefore fixed by the trace, and the comparison rests on the latency metrics (TTFT, ITL, total request latency) plus goodput. Each configuration’s perf.yaml downloads the trace automatically and wraps this AIPerf command:
The frontend service is agg-8xtp2-frontend for the baseline and disagg-router-6p-2d-frontend for the disaggregated configuration. Deploy one configuration at a time, because each is sized for the full 16 GPUs:
The benchmark runs inside a tmux session on the benchmark pod (kubectl exec -it <benchmark-pod> -- tmux a -t benchmark), and artifacts land on the perf-cache PVC under /perf-cache/artifacts/.
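To make the fixed-schedule idea concrete, the sketch below shows an open-loop replay: every request is sent at its original offset from the start of the trace, regardless of whether earlier requests have finished. It is an illustration only and not AIPerf's implementation; the function names, the `send` callback, and the assumption that trace timestamps are in milliseconds are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


def arrival_offsets(timestamps_ms: Sequence[float]) -> list[float]:
    """Seconds after the first request at which each request is due.

    Assumes timestamps are in milliseconds (an assumption about the trace).
    """
    if not timestamps_ms:
        return []
    start = min(timestamps_ms)
    return [(t - start) / 1000.0 for t in timestamps_ms]


async def replay(
    timestamps_ms: Sequence[float],
    send: Callable[[int], Awaitable[None]],
) -> None:
    """Dispatch request i at its trace offset, without waiting on earlier ones."""
    loop = asyncio.get_running_loop()
    t0 = loop.time()

    async def fire(index: int, due: float) -> None:
        # Sleep only for the time remaining until this request is due.
        delay = due - (loop.time() - t0)
        if delay > 0:
            await asyncio.sleep(delay)
        await send(index)

    offsets = arrival_offsets(timestamps_ms)
    await asyncio.gather(*(fire(i, due) for i, due in enumerate(offsets)))
```

Because the send times come from the trace rather than from server responses, a slower configuration cannot lower the offered load. It shows up as higher TTFT, ITL, and request latency instead, which is why those are the metrics being compared.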
Notes
- The source publishes the comparison as a results video and directional analysis rather than a numeric table; run both configurations to produce TTFT/ITL/latency deltas for your cluster.
- Edit model-cache/cache.yaml and set storageClassName to match your cluster before applying.
- Source: recipes/qwen3-32b
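Once both configurations have been run, the deltas mentioned in the notes can be computed with a small helper like the one below. This is a sketch under assumptions: the metric names are hypothetical, and it takes plain dictionaries of numbers rather than assuming any particular AIPerf artifact layout.

```python
def percent_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent.

    For latency metrics a negative value means the candidate is faster.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative delta")
    return (candidate - baseline) / baseline * 100.0


def compare(baseline: dict, candidate: dict) -> dict:
    """Percent delta for every metric present in both runs.

    Keys are hypothetical metric names (for example "ttft_p50_ms");
    metrics missing from either run are skipped.
    """
    shared = baseline.keys() & candidate.keys()
    return {
        name: percent_delta(baseline[name], candidate[name])
        for name in sorted(shared)
    }
```

Pass the aggregated round-robin run as `baseline` and the disaggregated KV-router run as `candidate`, using the same percentile for each metric in both runs so the deltas are comparable.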
Related Recipe
The disaggregated KV router configuration is the promoted deployment target: Qwen3-32B. The aggregated round-robin configuration exists as a benchmark control only.