# Feature Benchmarks

Feature Benchmarks evaluate Dynamo features, topologies, and feature stacks under controlled traffic. Each page states the question under test, compares deployable configurations, shows how to reproduce the run, and links to the matching [Recipe](/dynamo/dev/recipes/browse) when one of those deployments can be used directly.

<h2 id="dynamo-technique-guide-title">
  Features Under Test
</h2>

<p>
  Serving techniques and topology changes benchmarked across the comparisons below.
</p>

<strong>
  KV routing
</strong>

<p>
  Routes traffic to workers with reusable KV cache so TTFT, ITL, and goodput can improve on prefix-heavy workloads.
</p>
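
The routing idea can be illustrated with a toy scorer. This is a minimal sketch under stated assumptions, not Dynamo's router: `Worker`, `block_hashes`, and `pick_worker` are hypothetical names, the block size of 4 tokens is arbitrary, and load is reduced to a single active-request count used only to break ties.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of a worker: which KV blocks it holds and how busy it is."""
    name: str
    cached_blocks: set = field(default_factory=set)
    active_requests: int = 0


def block_hashes(tokens, block_size=4):
    """Chain-hash full blocks so each hash identifies the whole prefix up to that block."""
    hashes, prev = [], 0
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(tokens[i:i + block_size])))
        hashes.append(prev)
    return hashes


def overlap(worker, hashes):
    """Count leading request blocks that the worker already has cached."""
    matched = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        matched += 1
    return matched


def pick_worker(workers, tokens, block_size=4):
    """Prefer the longest cached prefix; break ties by the lower active-request count."""
    hashes = block_hashes(tokens, block_size)
    return max(workers, key=lambda w: (overlap(w, hashes), -w.active_requests))
```

A request whose prompt starts with a prefix that one worker has already processed is sent to that worker even if it is busier, which is the behavior that helps prefix-heavy workloads. A production router would also weigh load and cache pressure rather than use them only as a tie-breaker.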

<strong>
  Prefill/decode split
</strong>

<p>
  Separates prompt prefill and token decode into specialized worker pools for long-context latency and throughput tests.
</p>

<strong>
  WideEP
</strong>

<p>
  Spreads MoE experts across a wider GPU set so expert-heavy requests get more parallel capacity.
</p>

<strong>
  Embedding cache
</strong>

<p>
  Reuses multimodal embeddings, especially repeated images, instead of recomputing them for every request.
</p>
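
A content-addressed cache is one way to picture this reuse. The sketch below is illustrative only and is not the vLLM or Dynamo implementation: `EmbeddingCache` and its `encode` callback are hypothetical, the key is assumed to be a SHA-256 digest of the raw image bytes, and eviction is assumed to be least-recently-used.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache of image embeddings keyed by a hash of the image bytes."""

    def __init__(self, encode, capacity=128):
        self._encode = encode          # hypothetical encoder: bytes -> embedding
        self._capacity = capacity
        self._entries = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            # Repeated image: skip the encoder and refresh recency.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # New image: run the encoder once and remember the result.
        self.misses += 1
        embedding = self._encode(image_bytes)
        self._entries[key] = embedding
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return embedding
```

With traffic that repeats images often, most calls to `get` return without invoking the encoder, so the ratio of `hits` to `misses` gives a rough view of how much encoder work is avoided.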

<strong>
  Speculative decoding
</strong>

<p>
  Drafts candidate tokens and verifies them with the target model; Eagle3 is the speculative path used here.
</p>
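
The draft-and-verify loop can be sketched with greedy decoding. This is a toy under stated assumptions, not Eagle3 or Dynamo's implementation: `draft_next` and `target_next` are hypothetical callables that return the next token for a given context, and the target is queried one position at a time, whereas production systems typically verify all drafted positions in a single forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Run one greedy draft-and-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    context = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_next(context)
        draft.append(token)
        context.append(token)

    # 2. The target model checks each drafted position in order.
    context = list(prefix)
    accepted = []
    for token in draft:
        expected = target_next(context)
        if expected != token:
            # First mismatch: keep the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted
```

Each round yields at least one token chosen by the target model, so under greedy decoding the output matches what the target would have produced alone; the speedup depends on how many drafted tokens are accepted per round.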

<strong>
  KV offload
</strong>

<p>
  Moves colder KV blocks to a host-memory tier so longer context can fit without keeping all KV on GPU.
</p>
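
A two-tier store gives a rough picture of the offload policy. The sketch below is an assumption-laden toy, not the mechanism Dynamo or any backend uses: `TieredKVStore` is a hypothetical class, "coldest" is assumed to mean least recently used, and plain dictionaries stand in for GPU and host memory. Real systems move blocks asynchronously and in bulk.

```python
from collections import OrderedDict


class TieredKVStore:
    """Keep hot KV blocks in a small 'GPU' tier and spill cold ones to a 'host' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> data, ordered from coldest to hottest
        self.host = {}            # block_id -> data for offloaded blocks

    def put(self, block_id, data):
        """Place a block in the GPU tier as the hottest entry."""
        self.host.pop(block_id, None)
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        self._spill()

    def get(self, block_id):
        """Return a block, promoting it back to the GPU tier if it was offloaded."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.host.pop(block_id)  # raises KeyError for unknown blocks
        self.put(block_id, data)
        return data

    def _spill(self):
        """Move the coldest blocks to host memory until the GPU tier fits."""
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_data = self.gpu.popitem(last=False)
            self.host[cold_id] = cold_data
```

The total number of blocks the store can hold is no longer bounded by `gpu_capacity`, which is the property that lets longer contexts fit; the cost is the extra transfer when an offloaded block is read again.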

<strong>
  Frontend decoding
</strong>

<p>
  Moves decode coordination into the Dynamo frontend so routing and cache policy can act before backend execution.
</p>

<strong>
  Multi-node topology
</strong>

<p>
  Runs serving workers across node boundaries to compare aggregate, single-node P/D, and multi-node P/D shapes.
</p>

Feature composition

<h2>
  Agentic coding throughput stack
</h2>

<p>
  How much throughput is gained when KV routing, speculative decoding, P/D split, and KV offload are composed?
</p>

<b>Features:</b> <strong>KV routing, Spec decoding, P/D split, KV offload</strong>

<img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/6c9ba98efd7d603b5476db9313617ebffd57a53acb0f9ef724ef4c7288a8a527/pages-dev/assets/img/recipes/providers/moonshotai.webp" alt="" />

<strong>Kimi-K2.5 NVFP4</strong>

<b>Traffic:</b> <strong>Agentic coding trace</strong>

<b>Hardware:</b> <strong>24x GB200</strong>

<a href="/dynamo/dev/benchmarks/kimi-k2-5-feature-stack" aria-label="Open the Agentic coding throughput stack benchmark">
  Open
</a>

Feature composition

<h2>
  Frontend decoding plus embedding cache
</h2>

<p>
  How do Dynamo frontend decoding and embedding cache change a single-GPU multimodal benchmark versus vanilla vLLM serve?
</p>

<b>Features:</b> <strong>Frontend decoding, Embedding cache</strong>

<img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/588266b8e1b85bad9da8849973fb14ea59b7422a2b30b9dbf7e7e8db5c246321/pages-dev/assets/img/recipes/providers/qwen.webp" alt="" />

<strong>Qwen3.6-35B-A3B FP8</strong>

<b>Traffic:</b> <strong>Multimodal sliding window</strong>

<b>Hardware:</b> <strong>1x H100 or GB200</strong>

<a href="/dynamo/dev/benchmarks/qwen3-6-35b-feature-stack" aria-label="Open the Frontend decoding plus embedding cache benchmark">
  Open
</a>

A/B test

<h2>
  Multimodal embedding cache
</h2>

<p>
  How much does enabling the vLLM multimodal embedding cache improve repeated-image traffic on one GB200 worker?
</p>

<b>Features:</b> <strong>Embedding cache</strong>

<img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/588266b8e1b85bad9da8849973fb14ea59b7422a2b30b9dbf7e7e8db5c246321/pages-dev/assets/img/recipes/providers/qwen.webp" alt="" />

<strong>Qwen3-VL-30B</strong>

<b>Traffic:</b> <strong>80% image reuse</strong>

<b>Hardware:</b> <strong>1x GB200</strong>

<a href="/dynamo/dev/benchmarks/qwen3-vl-embedding-cache" aria-label="Open the Multimodal embedding cache benchmark">
  Open
</a>

A/B test

<h2>
  KV-aware routing + WideEP + P/D split
</h2>

<p>
  Does disaggregated KV-aware routing with WideEP improve latency and goodput against a GB200 control?
</p>

<b>Features:</b> <strong>KV routing, WideEP, P/D split</strong>

<img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/4cd5dd62104f4bfa1caa6bfa3a13e06b1577d24ea03bfd1e78e4c5aea8ceed9f/pages-dev/assets/img/recipes/providers/deepseek-ai.webp" alt="" />

<strong>DeepSeek V3.2 NVFP4</strong>

<b>Traffic:</b> <strong>Long-context coding trace</strong>

<b>Hardware:</b> <strong>32x GB200</strong>

<a href="/dynamo/dev/benchmarks/deepseek-v3-2-wideep-routing" aria-label="Open the KV-aware routing + WideEP + P/D split benchmark">
  Open
</a>

A/B test

<h2>
  KV-aware routing + prefill/decode split
</h2>

<p>
  Does disaggregated KV-aware routing reduce TTFT and ITL compared with aggregated round-robin routing?
</p>

<b>Features:</b> <strong>KV routing, P/D split</strong>

<img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/588266b8e1b85bad9da8849973fb14ea59b7422a2b30b9dbf7e7e8db5c246321/pages-dev/assets/img/recipes/providers/qwen.webp" alt="" />

<strong>Qwen3-32B</strong>

<b>Traffic:</b> <strong>Mooncake prefix reuse</strong>

<b>Hardware:</b> <strong>16x H200</strong>

<a href="/dynamo/dev/benchmarks/qwen3-32b-kv-routing" aria-label="Open the KV-aware routing + prefill/decode split benchmark">
  Open
</a>

Topology benchmark

<h2>
  Aggregate vs single-node P/D vs multi-node P/D
</h2>

<p>
  How do vLLM serving topologies compare when results are normalized per GPU?
</p>

<b>Features:</b> <strong>P/D split, Multi-node</strong>

<img src="https://files.buildwithfern.com/dynamo.docs.buildwithfern.com/dynamo/b202ff7373a9f14706f9cd4afd55cbb4c513e0f53763675dd93552693a623bf8/pages-dev/assets/img/recipes/providers/meta-llama.webp" alt="" />

<strong>Llama-3.3-70B FP8</strong>

<b>Traffic:</b> <strong>8K ISL / 1K OSL</strong>

<b>Hardware:</b> <strong>4-16 GPUs</strong>

<a href="/dynamo/dev/benchmarks/llama-3-70b-topology" aria-label="Open the Aggregate vs single-node P/D vs multi-node P/D benchmark">
  Open
</a>