Qwen3-32B KV Routing A/B

Does disaggregated KV-aware routing reduce TTFT and ITL on multi-turn prefix-reuse traffic compared with aggregated round-robin routing?

Both configurations run on 16x H200 GPUs across 2 nodes: the baseline uses 8x TP2 aggregated workers with round-robin routing; the comparison splits the same GPUs into 6 prefill + 2 decode TP2 workers with KV-aware routing. The trace’s 36.64% cache efficiency means KV-aware routing can send requests to workers that already hold the relevant KV blocks, cutting redundant prefill, while the prefill/decode split keeps long-context prefills (avg 12K input tokens) from injecting ITL spikes into ongoing decodes.
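The routing difference can be illustrated with a minimal sketch. This is not the router's actual algorithm: the worker names, the matched-block counts, and the "pick the largest overlap" rule are all hypothetical, and a production KV-aware router would typically weigh worker load as well.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only: worker names and block counts are made up.

# Round-robin: choose a worker by request index, ignoring cache contents.
# Args: request index, then the worker names.
pick_round_robin() {
  local idx=$1; shift
  local workers=("$@")
  echo "${workers[idx % ${#workers[@]}]}"
}

# KV-aware (simplified): choose the worker whose cache already holds the most
# blocks of the request's prefix. Reads "worker matched_blocks" lines on stdin.
# Ties go to the first worker listed.
pick_kv_aware() {
  awk '$2 > best || NR == 1 { best = $2; name = $1 } END { print name }'
}

overlaps="w0 0
w1 12
w2 3"

pick_round_robin 3 w0 w1 w2                 # -> w0 (index 3 wraps around)
printf '%s\n' "$overlaps" | pick_kv_aware   # -> w1 (largest prefix overlap)
```

In the sketch, round-robin sends the request to a worker with no matching blocks, so the whole prompt is prefilled again, while the overlap-based choice lands on the worker that can skip the cached portion.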

Benchmark setup

  • Model: Qwen/Qwen3-32B
  • GPUs: 16x H200 (2 nodes)
  • Runtime: vLLM
  • Workload: Mooncake conversation trace, fixed-schedule replay: 12,031 requests over ~59 min, 12,035 avg ISL, 343 avg OSL, 36.64% cache efficiency
  • Metrics: TTFT, ITL, total request latency, and goodput at TTFT 2s / ITL 25ms
  • Held constant: Model, vLLM runtime, 16x H200 across 2 nodes, fixed-schedule Mooncake trace replay, and TTFT 2s / ITL 25ms goodput thresholds
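The goodput thresholds act as a pass/fail rule per request: a request counts only if its TTFT is at most 2 s and its ITL is at most 25 ms. AIPerf computes goodput itself; the sketch below only illustrates that rule. The function name and the sample measurements are hypothetical, and it assumes a single ITL value per request.

```shell
#!/usr/bin/env bash
# Count requests that meet both SLOs. Input: "ttft_ms itl_ms" per line on stdin.
# Thresholds default to the ones used in this benchmark (2000 ms, 25 ms).
count_good() {
  awk -v ttft_max="${1:-2000}" -v itl_max="${2:-25}" '
    $1 <= ttft_max && $2 <= itl_max { good++ }
    END { printf "%d/%d\n", good, NR }'
}

# Hypothetical per-request measurements, not benchmark results.
printf '%s\n' "850 18.2" "2400 17.9" "1200 31.0" "1999 25.0" | count_good
# -> 2/4 (the second request misses the TTFT limit, the third the ITL limit)
```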

Compared Configurations

| Role | Configuration | Deploy | Benchmark |
| --- | --- | --- | --- |
| Comparison | Disaggregated KV router: 6 prefill + 2 decode workers (TP2), KV-aware routing | deploy.yaml | perf.yaml |
| Baseline | Aggregated round-robin: 8x TP2 aggregated workers, round-robin routing | deploy.yaml | perf.yaml |
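Both layouts consume the same hardware, which a quick check confirms: GPUs used equals the number of workers times the tensor-parallel size. The helper name below is hypothetical.

```shell
#!/usr/bin/env bash
# GPUs used = number of workers x tensor-parallel size.
gpus_used() { echo $(( $1 * $2 )); }

gpus_used 8 2              # baseline: 8 aggregated TP2 workers      -> 16
gpus_used $(( 6 + 2 )) 2   # comparison: 6 prefill + 2 decode, TP2   -> 16
```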

Reproduce

The benchmark replays the Mooncake FAST25 conversation trace with --fixed-schedule, so request arrival times are pinned to the original timestamps. Throughput is therefore fixed by the trace, and the comparison rests on the latency metrics (TTFT, ITL, total request latency) plus goodput. Each configuration’s perf.yaml downloads the trace automatically and wraps this AIPerf command:

$ aiperf profile -m Qwen/Qwen3-32B \
>   --input-file conversation_trace.jsonl \
>   --custom-dataset-type mooncake_trace \
>   --fixed-schedule \
>   --url http://<frontend>:8000 \
>   --streaming \
>   --goodput "time_to_first_token:2000 inter_token_latency:25"
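Because the schedule is fixed, the average offered load follows directly from the trace figures in the benchmark setup: 12,031 requests over roughly 59 minutes. The helper below is a hypothetical back-of-the-envelope check. The duration is approximate, and real arrivals follow the trace timestamps rather than a uniform rate.

```shell
#!/usr/bin/env bash
# Average offered load for a fixed-schedule replay: requests / duration.
avg_rate() {
  awk -v n="$1" -v minutes="$2" 'BEGIN { printf "%.1f\n", n / (minutes * 60) }'
}

avg_rate 12031 59   # -> 3.4 (requests per second, averaged over the trace)
```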

The frontend service is agg-8xtp2-frontend for the baseline and disagg-router-6p-2d-frontend for the disaggregated configuration. Deploy one configuration at a time, because each is sized for the full 16 GPUs:

$ export NAMESPACE=your-namespace

# One-time prep: storage + model download
$ kubectl apply -f recipes/qwen3-32b/model-cache/cache.yaml -n ${NAMESPACE}
$ kubectl apply -f recipes/qwen3-32b/model-cache/model-download.yaml -n ${NAMESPACE}
$ kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

# Pick one configuration from the table above and deploy it.
$ kubectl apply -f recipes/qwen3-32b/vllm/<configuration>/deploy.yaml -n ${NAMESPACE}
# Wait until the deployment is ready, then apply its perf.yaml.
$ kubectl apply -f recipes/qwen3-32b/vllm/<configuration>/perf.yaml -n ${NAMESPACE}

The benchmark runs inside a tmux session on the benchmark pod (kubectl exec -it <benchmark-pod> -- tmux a -t benchmark), and artifacts land on the perf-cache PVC under /perf-cache/artifacts/.

Notes

  • The source publishes the comparison as a results video and directional analysis rather than a numeric table; run both configurations to produce TTFT/ITL/latency deltas for your cluster.
  • Edit model-cache/cache.yaml and set storageClassName to match your cluster before applying.
  • Source: recipes/qwen3-32b
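Once both runs have finished, the deltas mentioned in the first note can be computed as a percent change against the baseline. The sketch below assumes you have already read one metric value per configuration out of the artifacts. The function name is hypothetical, and the numbers are placeholders, not measured results.

```shell
#!/usr/bin/env bash
# Percent change of a latency metric relative to the baseline.
# Negative means the comparison configuration is faster.
# Args: baseline value, comparison value (same unit).
pct_delta() {
  awk -v base="$1" -v cmp="$2" 'BEGIN { printf "%+.1f%%\n", (cmp - base) / base * 100 }'
}

# Placeholder values in ms, not measured results.
pct_delta 1000 800   # -> -20.0%
```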

For Qwen3-32B, the disaggregated KV router configuration is the promoted deployment target. The aggregated round-robin configuration exists only as a benchmark control.