# Recipes

Start with the model, runtime, and hardware you need to run. Model cards are sorted from newest to oldest by release date. Use [Feature Benchmarks](/dynamo/dev/benchmarks) when you want evidence that a Dynamo feature or topology helps under controlled traffic.

## Recipe catalog

The catalog covers 10 model families and 29 deployable configurations. Recipes span these dimensions:

- Providers: NVIDIA, Qwen, DeepSeek, Moonshot, Meta, OpenAI, and Z.AI
- Runtimes: vLLM, TensorRT-LLM, and SGLang
- Hardware: H100, H200, GB200, and B200

Each card notes its serving technique and workload shape.

### Kimi-K2.6

Multi-target vLLM matrix covering B200 and H200 for chat and agentic traffic, with Eagle3 MLA speculation and KV-aware routing.

- Provider: Moonshot / NVIDIA
- Runtime: vLLM
- Hardware: 4x B200 / 8x H200
- Workload: Chat + agentic

### Nemotron 3 Ultra

Aggregated vLLM targets for B200 and H200 chat and agentic traffic, with MTP speculative decoding and trace-backed benchmarks.

- Provider: NVIDIA
- Runtime: vLLM
- Hardware: 4x B200 / 8x H200
- Workload: Chat + agentic

### Nemotron 3 Super

NVFP4 and FP8 vLLM targets for B200 and H200 chat and agentic traffic, with MTP speculative decoding and KV-aware routing.

- Provider: NVIDIA
- Runtime: vLLM
- Hardware: 4x B200 / 4x H200
- Workload: Chat + agentic

### GLM-5 NVFP4

20x GB200 SGLang P/D recipe for 1K input / 8K output traffic with EAGLE speculative decoding, plus an AWS EFA variant.

- Provider: NVIDIA / Z.AI
- Runtime: SGLang
- Hardware: 20x GB200
- Workload: Long output / static ISL-OSL

### DeepSeek V3.2 NVFP4

32x GB200 TensorRT-LLM recipe with P/D split, KV-aware routing, and WideEP for long-context coding traces.

- Provider: NVIDIA checkpoint / DeepSeek
- Runtime: TensorRT-LLM
- Hardware: 32x GB200
- Workload: Long-context reuse
- Has a related Feature Benchmark

### GPT-OSS-120B

TensorRT-LLM GB200 recipe pair organized by traffic target: aggregated serving and a disaggregated P/D split.

- Provider: OpenAI
- Runtime: TensorRT-LLM
- Hardware: 4x GB200
- Workload: Static ISL-OSL

### Qwen3-32B

Disaggregated vLLM serving with KV-aware routing on 16x H200, for multi-turn conversational traffic with prefix reuse.

- Provider: Qwen
- Runtime: vLLM
- Hardware: 16x H200
- Workload: Multi-turn conversation
- Has a related Feature Benchmark

### Qwen3-235B-A22B FP8

16-GPU TensorRT-LLM recipe matrix for Hopper and Blackwell, covering aggregated and P/D serving.

- Provider: Qwen
- Runtime: TensorRT-LLM
- Hardware: 16x H100/H200
- Workload: Static ISL-OSL

### Qwen3-32B FP8

FP8 recipe set spanning TensorRT-LLM aggregated, TensorRT-LLM P/D, and vLLM P/D targets.

- Provider: Qwen
- Runtime: TensorRT-LLM, vLLM
- Hardware: 2-8x GPUs
- Workload: Static ISL-OSL

### Llama-3.3-70B FP8

Llama FP8 topology recipe set for 8K input / 1K output traffic on H100/H200.

- Provider: Meta / Red Hat
- Runtime: vLLM
- Hardware: 4x H100/H200
- Workload: Long output / static ISL-OSL
- Has a related Feature Benchmark

## Related Feature Benchmarks

Use [Feature Benchmarks](/dynamo/dev/benchmarks) when your question starts with a feature or performance claim, such as KV routing, embedding cache, speculative decoding, or topology. Recipe pages link back to Feature Benchmark pages whenever a recipe came from a comparison or feature-stack run.