Copyright © 2026, NVIDIA Corporation.

Feature Guides

Start with Dynamo's core serving optimizations, then branch into operations and model capabilities.

Use these guides after you have Dynamo running and want to improve serving behavior, operate a deployment, or adapt Dynamo to a new workload.

Recommended path

Most deployments start with the core performance loop:

| Step | Guide | Use when |
|------|-------|----------|
| 1 | KV Cache Aware Routing | Route requests to workers that already hold useful KV cache. |
| 2 | Disaggregated Serving | Scale prefill and decode workers independently. |
| 3 | KV Cache Offloading | Extend usable cache capacity beyond GPU memory. |
| 4 | Benchmarking | Compare configurations before you move to production. |
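
To make the first step concrete, here is a toy sketch of the idea behind KV cache aware routing: send a request to the worker whose cached token prefixes overlap it the most. This is not Dynamo's router or its API. The names `shared_prefix_len`, `pick_worker`, and the `workers` map are hypothetical, and the sketch assumes overlap is the only signal, ignoring worker load and anything else a real router might weigh.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(
    request_tokens: Sequence[int],
    cached_prefixes: Dict[str, List[Sequence[int]]],
) -> str:
    """Return the worker whose cached prefixes overlap the request the most.

    Hypothetical helper for illustration only. Ties go to the worker whose
    name sorts first, so the result is deterministic.
    """

    def best_overlap(worker: str) -> int:
        # A worker with no cached prefixes scores 0.
        return max(
            (shared_prefix_len(request_tokens, p) for p in cached_prefixes[worker]),
            default=0,
        )

    return min(cached_prefixes, key=lambda w: (-best_overlap(w), w))


# Assumed example state: token-id prefixes each worker already holds.
workers = {
    "worker-a": [[1, 2, 3, 4]],
    "worker-b": [[1, 2, 9]],
    "worker-c": [],
}

chosen = pick_worker([1, 2, 3, 7], workers)
print(chosen)
```

In this example the request shares three leading tokens with `worker-a` and two with `worker-b`, so `worker-a` is selected and less of the prompt would need to be recomputed there.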

Where to go next

| Goal | Start with |
|------|------------|
| Make serving more resilient | Fault Tolerance |
| Monitor local deployments | Observability (Local) |
| Reproduce traffic without a full engine | Mocker Engine Simulation |
| Add structured model outputs | Tool Calling and Reasoning |
| Build agent workloads | Agents |
| Serve specialized workloads | LoRA Adapters, Multimodal, and Diffusion |

For cluster deployments, pair these guides with the Kubernetes Deployment docs. The same features can be explored locally, then expressed through Dynamo’s Kubernetes-native CRDs and operator when you move to a shared GPU cluster.