> For clean Markdown content of this page, append .md to this URL. For the complete documentation index, see https://docs.nvidia.com/dynamo/llms.txt. For full content including API reference and SDK examples, see https://docs.nvidia.com/dynamo/llms-full.txt.

# Qwen3-32B KV Routing A/B

Both configurations run on **16x H200 GPUs across 2 nodes**: the baseline uses 8x TP2 aggregated workers with round-robin routing; the comparison splits the same GPUs into 6 prefill + 2 decode TP2 workers with KV-aware routing. The trace's 36.64% cache efficiency means KV-aware routing can send requests to workers that already hold the relevant KV blocks, cutting redundant prefill, while the prefill/decode split keeps long-context prefills (avg 12K input tokens) from injecting ITL spikes into ongoing decodes.

<p>
  Benchmark setup
</p>

<b>Model</b> Qwen/Qwen3-32B

<b>GPUs</b> 16x H200 (2 nodes)

<b>Runtime</b> vLLM

<b>Workload</b> Mooncake conversation trace, fixed-schedule replay: 12,031 requests over \~59 min, 12,035 avg ISL, 343 avg OSL, 36.64% cache efficiency

<b>Metrics</b> TTFT, ITL, total request latency, and goodput at TTFT 2s / ITL 25ms

<b>Held constant</b> Model, vLLM runtime, 16x H200 across 2 nodes, fixed-schedule Mooncake trace replay, and TTFT 2s / ITL 25ms goodput thresholds

## Compared Configurations

<table>
  <thead>
    <tr><th>Role</th><th>Configuration</th><th>Deploy</th><th>Benchmark</th></tr>
  </thead>

  <tbody>
    <tr>
      <td>
        <em>Comparison</em>
      </td>

      <td>
        <strong>Disaggregated KV router</strong>

        6 prefill + 2 decode workers (TP2), KV-aware routing
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-32b/vllm/disagg-kv-router/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-32b/vllm/disagg-kv-router/perf.yaml">perf.yaml</a>
      </td>
    </tr>

    <tr>
      <td>
        <em>Baseline</em>
      </td>

      <td>
        <strong>Aggregated round-robin</strong>

        8x TP2 aggregated workers, round-robin routing
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-32b/vllm/agg-round-robin/deploy.yaml">deploy.yaml</a>
      </td>

      <td>
        <a href="https://github.com/ai-dynamo/dynamo/blob/main/recipes/qwen3-32b/vllm/agg-round-robin/perf.yaml">perf.yaml</a>
      </td>
    </tr>
  </tbody>
</table>

## Reproduce

The benchmark replays the [Mooncake FAST25 conversation trace](https://github.com/kvcache-ai/Mooncake) with `--fixed-schedule`, so request arrival times are pinned to the original timestamps — throughput is fixed by the trace, and the comparison is on the latency metrics (TTFT, ITL, total request latency) plus goodput. Each configuration's `perf.yaml` downloads the trace automatically and wraps this AIPerf command:

```bash
aiperf profile -m Qwen/Qwen3-32B \
  --input-file conversation_trace.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --url http://<frontend>:8000 \
  --streaming \
  --goodput "time_to_first_token:2000 inter_token_latency:25"
```

The frontend service is `agg-8xtp2-frontend` for the baseline and `disagg-router-6p-2d-frontend` for the disaggregated configuration. Deploy one configuration at a time — each is sized for the full 16 GPUs:

```bash
export NAMESPACE=your-namespace

# One-time prep: storage + model download
kubectl apply -f recipes/qwen3-32b/model-cache/cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/qwen3-32b/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

# Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
kubectl apply -f recipes/qwen3-32b/vllm/<configuration>/deploy.yaml -n ${NAMESPACE}
kubectl apply -f recipes/qwen3-32b/vllm/<configuration>/perf.yaml -n ${NAMESPACE}
```

The benchmark runs inside a tmux session on the benchmark pod (`kubectl exec -it <benchmark-pod> -- tmux a -t benchmark`), and artifacts land on the `perf-cache` PVC under `/perf-cache/artifacts/`.

## Notes

* The source publishes the comparison as a results video and directional analysis rather than a numeric table; run both configurations to produce TTFT/ITL/latency deltas for your cluster.
* Edit `model-cache/cache.yaml` and set `storageClassName` to match your cluster before applying.
* Source: [recipes/qwen3-32b](https://github.com/ai-dynamo/dynamo/tree/main/recipes/qwen3-32b)

## Related Recipe

The disaggregated KV router configuration is the promoted deployment target: [Qwen3-32B](/dynamo/dev/recipes/qwen3-32b). The aggregated round-robin configuration exists as a benchmark control only.