# DeepSeek V3.2 WideEP Routing A/B

Both configurations run Dynamo + TensorRT-LLM on **32x GB200 GPUs across 8 nodes**: the baseline uses 4x DEP8 aggregated workers with round-robin routing; the comparison splits into 2 prefill + 2 decode workers with WideEP (DEP8) and KV-aware routing. The trace is heavily reuse-biased, with a roughly 44% KV cache hit rate and 57% of input tokens coming from shared context prefixes, so KV-aware routing can avoid large amounts of redundant long-context prefill.

## Benchmark Setup

| Setting | Value |
|---|---|
| Model | `nvidia/DeepSeek-V3.2-NVFP4` |
| GPUs | 32x GB200 (8 nodes) |
| Runtime | TensorRT-LLM |
| Workload | Mooncake-derived synthetic coding trace, fixed-schedule replay: 10,000 requests, 39,186 avg ISL (max 109,459), 344 avg OSL, 44.1% block-level KV hit rate |
| Metrics | TTFT, ITL, total request latency, and goodput at TTFT 20s / ITL 50ms |
| Held constant | Model, TensorRT-LLM runtime, 32x GB200 across 8 nodes, fixed-schedule trace replay, and TTFT 20s / ITL 50ms goodput thresholds |

## Compared Configurations

| Role | Configuration | Deploy | Benchmark |
|---|---|---|---|
| Comparison | **Disaggregated KV router + WideEP**: 2x prefill + 2x decode with WideEP (DEP8), KV-aware routing | deploy.yaml | perf.yaml |
| Baseline | **Aggregated round-robin**: 4x DEP8 aggregated workers, round-robin routing | deploy.yaml | perf.yaml |
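
As a rough sense of scale for what KV-aware routing can save, the workload statistics from the benchmark setup can be turned into a back-of-envelope token count. The sketch below assumes the 44.1% block-level hit rate carries over directly to input tokens and that every reusable block is actually found on the worker a request is routed to; both are simplifying assumptions, so treat the result as an upper bound rather than a measured saving. The function name `estimate_reuse` is made up for this example.

```bash
#!/usr/bin/env bash
# Back-of-envelope estimate of reusable prefill tokens for this trace.
# Assumption: block-level hit rate ~= token-level hit rate (upper bound).
set -euo pipefail

requests=10000     # requests in the trace
avg_isl=39186      # average input sequence length (tokens)
hit_rate=0.441     # block-level KV hit rate

# estimate_reuse <requests> <avg_isl> <hit_rate>  (hypothetical helper)
estimate_reuse() {
  awk -v n="$1" -v isl="$2" -v hit="$3" 'BEGIN {
    total  = n * isl            # total input tokens across the trace
    reused = total * hit        # tokens that could be served from KV cache
    printf "total_input_tokens=%.0f\n", total
    printf "reusable_tokens=%.0f\n", reused
    printf "avg_uncached_isl=%.0f\n", isl * (1 - hit)
  }'
}

estimate_reuse "$requests" "$avg_isl" "$hit_rate"
```

Under these assumptions, a little under half of the roughly 392 million input tokens would not need to be prefilled again if every request lands on a worker that already holds its prefix. Round-robin routing gives no such guarantee, which is the gap the A/B is designed to expose.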

## Reproduce

The trace is synthesized from the [Mooncake FAST25 conversation trace](https://github.com/kvcache-ai/Mooncake) using Dynamo's [prefix data generator](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/prefix_data_generator), scaling input lengths and prefix reuse up to a coding-workload shape:

```bash
datagen synthesize \
  --input-file conversation_trace.jsonl \
  --prefix-len-multiplier 16 \
  --prompt-len-multiplier 10 \
  --max-isl 110000 \
  --num-requests 10000
# emits conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl
```

The replay uses `--fixed-schedule`, so request arrivals are pinned to the trace: throughput is fixed and the comparison is on TTFT, ITL, total request latency, and goodput. Each configuration's `perf.yaml` wraps this AIPerf command:

```bash
aiperf profile -m nvidia/DeepSeek-V3.2-NVFP4 \
  --tokenizer nvidia/DeepSeek-V3.2-NVFP4 \
  --input-file /model-cache/traces/conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --url http://:8000 \
  --streaming \
  --goodput "time_to_first_token:20000 inter_token_latency:50"
```

The frontend services are `agg-round-robin-dsv32-nvfp4-frontend` and `disagg-kv-dsv32-nvfp4-frontend`. Deploy one configuration at a time:

```bash
export NAMESPACE=your-namespace

# One-time prep: storage, ComputeDomain (for MNNVL co-location), model download
kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f recipes/deepseek-v32-fp4/model-cache/compute-domain.yaml -n ${NAMESPACE}
kubectl apply -f recipes/deepseek-v32-fp4/model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

# Copy the synthesized trace onto the PVC
kubectl cp ${NAMESPACE}/:/model-cache/traces/

# Pick one configuration from the table above, deploy it, wait for readiness, then apply its perf.yaml.
kubectl apply -f recipes/deepseek-v32-fp4/trtllm//deploy.yaml -n ${NAMESPACE}
kubectl apply -f recipes/deepseek-v32-fp4/trtllm//perf.yaml -n ${NAMESPACE}
```

The benchmark runs as a Kubernetes Job; tail it with `kubectl logs -f -l job-name= -n ${NAMESPACE}` (each config's `perf.yaml` defines its Job name). Results land under `/model-cache/perf/_/` on the `model-cache` PVC; copy them out with `kubectl cp`.

## Notes

* The source publishes the comparison as a results video plus the dataset statistics above, not a numeric results table; run both configurations to produce TTFT/ITL/goodput deltas for your cluster.
* `perf.yaml` pins `transformers==4.57.6` alongside `aiperf==0.6.0`; older transformers versions cannot load the `deepseek_v32` tokenizer, and AIPerf surfaces this as "Failed to load tokenizer".
* Multi-node GB200 deployments need the ComputeDomain CR so the DRA scheduler co-locates worker pods on MNNVL-connected nodes; if you rename it, mirror the change in each `deploy.yaml` under `extraPodSpec.resourceClaims` and `resources.claims`.
* Background on the underlying optimizations: [Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog15_Optimizing_DeepSeek_V32_on_NVIDIA_Blackwell_GPUs.html).
* Source: [recipes/deepseek-v32-fp4](https://github.com/ai-dynamo/dynamo/tree/main/recipes/deepseek-v32-fp4)

## Related Recipe

The disaggregated KV router + WideEP configuration is the promoted deployment target: [DeepSeek V3.2 NVFP4](/dynamo/dev/recipes/deepseek-v3-2-nvfp4). The aggregated round-robin configuration exists as a benchmark control only.
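
When comparing the two runs offline, it can help to re-apply the goodput thresholds (TTFT 20 s, ITL 50 ms) to per-request measurements yourself. The sketch below is a minimal illustration, not AIPerf's own computation: it assumes a hypothetical CSV with a header row and the columns `request_id,ttft_ms,itl_ms`, which is not AIPerf's export format, and it treats a request as "good" when both values are at or below their thresholds (the inclusive comparison is also an assumption). The function name `count_good_requests` is invented for this example.

```bash
#!/usr/bin/env bash
# Count requests that meet both latency thresholds in a per-request CSV.
# Hypothetical input layout: request_id,ttft_ms,itl_ms (with a header row).
set -euo pipefail

# count_good_requests <csv> [ttft_max_ms] [itl_max_ms]
count_good_requests() {
  local csv="$1" ttft_max_ms="${2:-20000}" itl_max_ms="${3:-50}"
  awk -F',' -v ttft="$ttft_max_ms" -v itl="$itl_max_ms" '
    NR == 1 { next }                          # skip the header row
    { total++ }                               # every data row is a request
    $2 <= ttft && $3 <= itl { good++ }        # both thresholds must hold
    END {
      frac = (total > 0) ? good / total : 0
      printf "good=%d total=%d fraction=%.3f\n", good, total, frac
    }
  ' "$csv"
}

# Small made-up sample to show the behavior; these are not benchmark results.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
request_id,ttft_ms,itl_ms
r1,1500,32
r2,25000,30
r3,8000,61
r4,19999,50
EOF
count_good_requests "$demo"
rm -f "$demo"
```

In the sample, `r2` misses the TTFT threshold and `r3` misses the ITL threshold, so two of the four requests count as good. Running the same function over results from both configurations gives a like-for-like fraction to set alongside the TTFT and ITL distributions.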