DeepSeek V3.2 WideEP Routing A/B
Does disaggregated KV-aware routing with WideEP improve latency and goodput against an aggregated round-robin GB200 control?
Both configurations run Dynamo + TensorRT-LLM on 32x GB200 GPUs across 8 nodes: the baseline uses 4x DEP8 aggregated workers with round-robin routing; the comparison splits into 2 prefill + 2 decode workers with WideEP (DEP8) and KV-aware routing. The trace is heavily reuse-biased — roughly 44% KV cache hit rate and 57% of input tokens from shared context prefixes — so KV-aware routing can avoid large amounts of redundant long-context prefill.
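To make the intuition concrete, here is a minimal toy sketch of why routing by cached prefix helps on a reuse-heavy trace. It is not Dynamo's router: the `simulate` function, the two policy names, the unbounded per-worker cache, and the 570/430 token split (chosen only to echo the 57% shared-prefix figure) are all assumptions made for illustration, and the model ignores load balancing and cache eviction.

```python
from itertools import cycle


def simulate(requests, num_workers, policy):
    """Return the fraction of input tokens served from a worker-local prefix cache.

    requests: list of (prefix_id, prefix_tokens, unique_tokens) tuples.
    policy:   "round_robin" or "prefix_affinity" (hypothetical names).
    """
    caches = [set() for _ in range(num_workers)]  # prefix ids held by each worker
    rr = cycle(range(num_workers))
    hit_tokens = 0
    total_tokens = 0
    for prefix_id, prefix_tokens, unique_tokens in requests:
        if policy == "round_robin":
            worker = next(rr)
        else:
            # Prefer a worker that already holds the prefix; otherwise fall
            # back to round-robin placement.
            holders = [w for w in range(num_workers) if prefix_id in caches[w]]
            worker = holders[0] if holders else next(rr)
        if prefix_id in caches[worker]:
            hit_tokens += prefix_tokens  # prefix prefill is skipped on a hit
        caches[worker].add(prefix_id)
        total_tokens += prefix_tokens + unique_tokens
    return hit_tokens / total_tokens


# Illustrative trace: 3 shared prefixes cycling over 12 requests, 4 workers.
trace = [(i % 3, 570, 430) for i in range(12)]
print(f"{simulate(trace, 4, 'round_robin'):.4f}")      # 0.0000
print(f"{simulate(trace, 4, 'prefix_affinity'):.4f}")  # 0.4275
```

In this contrived trace, round-robin never lands a prefix on a worker that has seen it before, while prefix-affinity placement reuses the cached prefix for every request after the first three. Real traces and routers sit somewhere between these extremes.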
Benchmark setup
Compared Configurations
Reproduce
The trace is synthesized from the Mooncake FAST25 conversation trace using Dynamo’s prefix data generator, scaling input lengths and prefix reuse up to a coding-workload shape:
The replay uses --fixed-schedule, so request arrivals are pinned to the trace — throughput is fixed and the comparison is on TTFT, ITL, total request latency, and goodput. Each configuration’s perf.yaml wraps this AIPerf command:
The frontend services are agg-round-robin-dsv32-nvfp4-frontend and disagg-kv-dsv32-nvfp4-frontend. Deploy one configuration at a time:
The benchmark runs as a Kubernetes Job; tail it with kubectl logs -f -l job-name=<bench-job-name> -n ${NAMESPACE} (each config’s perf.yaml defines its Job name). Results land under /model-cache/perf/<epoch>_<job-name>/ on the model-cache PVC; copy them out with kubectl cp.
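Once results for both configurations are copied out, the comparison comes down to per-request latency statistics and goodput. The sketch below assumes goodput means the rate of requests that meet both a TTFT and an ITL target, which is a common definition; the source does not state its thresholds. The `RequestResult` record, the function names, and the default SLO values are hypothetical and do not reflect AIPerf's output schema, so adapt the loading step to whatever files your run produces.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestResult:
    ttft_ms: float  # time to first token for one request
    itl_ms: float   # mean inter-token latency for one request


def goodput_rps(results, duration_s, ttft_slo_ms, itl_slo_ms):
    """Requests per second that meet both latency targets."""
    good = sum(
        1 for r in results
        if r.ttft_ms <= ttft_slo_ms and r.itl_ms <= itl_slo_ms
    )
    return good / duration_s


def summarize(name, results, duration_s, ttft_slo_ms=2000.0, itl_slo_ms=50.0):
    """Collect the headline numbers for one configuration (SLOs are placeholders)."""
    return {
        "config": name,
        "median_ttft_ms": median(r.ttft_ms for r in results),
        "median_itl_ms": median(r.itl_ms for r in results),
        "goodput_rps": goodput_rps(results, duration_s, ttft_slo_ms, itl_slo_ms),
    }


# Placeholder records, not measurements: replace with parsed benchmark output.
demo = [
    RequestResult(1500.0, 40.0),
    RequestResult(2500.0, 45.0),  # misses the TTFT target
    RequestResult(1800.0, 60.0),  # misses the ITL target
    RequestResult(1000.0, 30.0),
]
print(summarize("demo", demo, duration_s=2.0))
```

Because the replay uses a fixed schedule, both configurations see the same arrival pattern over the same duration, so running `summarize` once per configuration with identical thresholds gives directly comparable numbers.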
Notes
- The source publishes the comparison as a results video plus the dataset statistics above, not a numeric results table; run both configurations to produce TTFT/ITL/goodput deltas for your cluster.
- perf.yaml pins transformers==4.57.6 alongside aiperf==0.6.0; older transformers cannot load the deepseek_v32 tokenizer, and AIPerf surfaces it as “Failed to load tokenizer”.
- Multi-node GB200 deployments need the ComputeDomain CR so the DRA scheduler co-locates worker pods on MNNVL-connected nodes; if you rename it, mirror the change in each deploy.yaml under extraPodSpec.resourceClaims and resources.claims.
- Background on the underlying optimizations: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs.
- Source: recipes/deepseek-v32-fp4
Related Recipe
The disaggregated KV router + WideEP configuration is the promoted deployment target: DeepSeek V3.2 NVFP4. The aggregated round-robin configuration exists as a benchmark control only.