# Feature Benchmarks

Feature Benchmarks evaluate Dynamo features, topologies, and feature stacks under controlled traffic. Each page states the question, compares deployable configurations, shows how to reproduce the run, and links to the [Recipe](/dynamo/dev/recipes/browse) target when one deployment should be used directly.

## Features Under Test

Serving techniques and topology changes benchmarked across the comparisons below.

### KV routing

Routes traffic to workers that already hold reusable KV cache, so time to first token (TTFT), inter-token latency (ITL), and goodput can improve on prefix-heavy workloads.
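The routing idea can be illustrated with a minimal sketch. It is not Dynamo's router: the names `shared_prefix_len` and `pick_worker`, the token-list representation of cached prefixes, and the load-based tie-break are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(
    prompt: Sequence[int],
    cached: Dict[str, List[Sequence[int]]],
    load: Dict[str, int],
) -> str:
    """Pick the worker with the longest cached prefix; break ties by lowest load.

    `cached` maps a worker name to token sequences it is assumed to hold KV for,
    and `load` maps a worker name to its in-flight request count (hypothetical inputs).
    """
    def score(worker: str):
        best = max(
            (shared_prefix_len(prompt, seq) for seq in cached.get(worker, [])),
            default=0,
        )
        return (best, -load[worker])

    return max(load, key=score)


cached = {"w0": [[1, 2, 3, 4]], "w1": [[1, 2, 9]]}
load = {"w0": 5, "w1": 0}
warm_choice = pick_worker([1, 2, 3, 7], cached, load)  # longest prefix match wins
cold_choice = pick_worker([8, 8], cached, load)        # no match, least-loaded wins
```

A longer reusable prefix means less prefill work on the chosen worker, which is why prefix-heavy traffic is where this kind of policy is expected to help TTFT.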

### Prefill/decode (P/D) split

Separates prompt prefill and token decode into specialized worker pools for long-context latency and throughput tests.

### WideEP

Spreads mixture-of-experts (MoE) experts across a wider GPU set so expert-heavy requests get more parallel capacity.

### Embedding cache

Reuses multimodal embeddings, especially repeated images, instead of recomputing them for every request.
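As a rough sketch of the reuse pattern, the cache below keys embeddings by a hash of the image bytes and evicts the least recently used entry. The class name `EmbeddingCache`, the `encode` callback, and the capacity are hypothetical; the actual vLLM and Dynamo caches may be organized differently.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """LRU cache of image embeddings keyed by a content hash (illustrative only)."""

    def __init__(self, encode: Callable[[bytes], List[float]], capacity: int = 128):
        self._encode = encode          # stand-in for the vision encoder
        self._capacity = capacity
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, image: bytes) -> List[float]:
        key = hashlib.sha256(image).hexdigest()
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        self.misses += 1
        embedding = self._encode(image)         # the expensive step we want to skip
        self._entries[key] = embedding
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return embedding


encoder_calls = []

def fake_encode(image: bytes) -> List[float]:
    """Toy encoder that records each call so reuse is observable."""
    encoder_calls.append(image)
    return [float(len(image))]

cache = EmbeddingCache(fake_encode, capacity=2)
for img in (b"cat", b"cat", b"dog", b"cat"):
    cache.get(img)
```

In this toy run the encoder is invoked once per distinct image, so the repeated `b"cat"` requests are served from the cache.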

### Speculative decoding

Drafts candidate tokens and verifies them with the target model; Eagle3 is the speculative path used here.
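The draft-and-verify loop can be sketched with greedy decoding and toy models. This is a generic illustration, not Eagle3: `speculative_step`, the `NextToken` callable type, and the toy `target` and `draft` functions are assumptions, and a real engine would verify all drafted tokens in one batched target forward pass instead of calling the target once per position.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    context: List[int], draft: NextToken, target: NextToken, k: int = 4
) -> List[int]:
    """Run one draft-and-verify round and return the tokens it commits."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposal in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(context + accepted))
    return accepted


def target(ctx: List[int]) -> int:
    return len(ctx) % 5

def draft(ctx: List[int]) -> int:
    # Agrees with the target except at one position, to force a rejection.
    return 99 if len(ctx) == 3 else len(ctx) % 5

partial = speculative_step([0], draft, target, k=4)
full = speculative_step([0], target, target, k=4)
```

Under greedy verification the committed tokens are the same ones the target would have produced on its own; the draft only changes how many tokens are committed per round.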

### KV offload

Moves colder KV blocks to a host-memory tier so longer context can fit without keeping all KV on GPU.
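A two-tier store gives a feel for the mechanism. The sketch below is an assumption-laden toy: `TieredKVStore`, its LRU eviction policy, and the promote-on-read behavior are illustrative choices, not a description of how Dynamo or vLLM manage KV blocks.

```python
from collections import OrderedDict
from typing import Any, Dict, Hashable, Optional


class TieredKVStore:
    """Toy two-tier KV block store: a bounded 'GPU' tier backed by a 'host' tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu: "OrderedDict[Hashable, Any]" = OrderedDict()  # oldest first
        self.host: Dict[Hashable, Any] = {}

    def put(self, block_id: Hashable, block: Any) -> None:
        self.host.pop(block_id, None)        # a block lives in exactly one tier
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)       # mark as most recently used
        self._offload_cold_blocks()

    def get(self, block_id: Hashable) -> Optional[Any]:
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:
            block = self.host.pop(block_id)
            self.put(block_id, block)        # promote back to the GPU tier
            return block
        return None

    def _offload_cold_blocks(self) -> None:
        # Move least recently used blocks to host memory until the GPU tier fits.
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_block = self.gpu.popitem(last=False)
            self.host[cold_id] = cold_block


store = TieredKVStore(gpu_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, f"kv-{name}")   # "a" becomes the coldest block and is offloaded
fetched = store.get("a")            # promoting "a" pushes "b" out to host memory
```

The total context held is larger than the GPU tier alone, at the cost of a host-to-GPU transfer whenever an offloaded block is needed again.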

### Frontend decoding

Moves decode coordination into the Dynamo frontend so routing and cache policy can act before backend execution.

### Multi-node topology

Runs serving workers across node boundaries to compare aggregate, single-node P/D, and multi-node P/D shapes.

| Benchmark | Type | Question | Features | Model | Traffic | Hardware |
| --- | --- | --- | --- | --- | --- | --- |
| Agentic coding throughput stack | Feature composition | How much do KV routing, speculative decoding, P/D split, and KV offload gain when composed? | KV routing, speculative decoding, P/D split, KV offload | Kimi-K2.5 NVFP4 | Agentic coding trace | 24x GB200 |
| Frontend decoding plus embedding cache | Feature composition | How do Dynamo frontend decoding and embedding cache change a single-GPU multimodal benchmark versus vanilla vLLM serve? | Frontend decoding, embedding cache | Qwen3.6-35B-A3B FP8 | Multimodal sliding window | 1x H100 or GB200 |
| Multimodal embedding cache | A/B test | How much does enabling the vLLM multimodal embedding cache improve repeated-image traffic on one GB200 worker? | Embedding cache | Qwen3-VL-30B | 80% image reuse | 1x GB200 |
| KV-aware routing + WideEP + P/D split | A/B test | Does disaggregated KV-aware routing with WideEP improve latency and goodput against a GB200 control? | KV routing, WideEP, P/D split | DeepSeek V3.2 NVFP4 | Long-context coding trace | 32x GB200 |
| KV-aware routing + prefill/decode split | A/B test | Does disaggregated KV-aware routing reduce TTFT and ITL compared with aggregated round-robin routing? | KV routing, P/D split | Qwen3-32B | Mooncake prefix reuse | 16x H200 |
| Aggregate vs single-node P/D vs multi-node P/D | Topology benchmark | How do vLLM topologies compare when normalized by GPU? | P/D split, multi-node | Llama-3.3-70B FP8 | 8K ISL / 1K OSL | 4-16 GPUs |
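Comparing topologies that use different GPU counts "normalized by GPU" amounts to dividing aggregate throughput by the number of GPUs. The sketch below shows that arithmetic only; the `Run` dataclass, the run names, and every number in it are placeholders invented for illustration, not measured results from these benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run; all field values used below are made-up placeholders."""
    name: str
    gpus: int
    output_tokens: int
    duration_s: float

    @property
    def tokens_per_s_per_gpu(self) -> float:
        # Aggregate output throughput divided by the GPU count.
        return self.output_tokens / self.duration_s / self.gpus


# Placeholder inputs, NOT benchmark data.
runs = [
    Run("aggregate", gpus=4, output_tokens=40_000, duration_s=10.0),
    Run("multi-node P/D", gpus=16, output_tokens=192_000, duration_s=10.0),
]

per_gpu = {run.name: run.tokens_per_s_per_gpu for run in runs}
best = max(runs, key=lambda run: run.tokens_per_s_per_gpu).name
```

Without the division, the 16-GPU run would look better simply because it has more hardware; the per-GPU figure is what makes a 4-GPU and a 16-GPU shape comparable.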