
Copyright © 2026, NVIDIA Corporation.

Router

The Dynamo KV Router routes each request by estimating its computational cost on each available worker. The estimate accounts for both decode cost (from active blocks) and prefill cost (from newly computed blocks), and uses KV cache overlap to minimize redundant computation. Tuning the KV Router is critical for achieving high throughput and low latency in distributed inference setups.
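
To make the cost trade-off concrete, here is a minimal sketch of overlap-aware worker selection. It is not Dynamo's implementation: the `Worker` class, its field names, and the cost formula (weighted new prefill blocks plus active decode blocks) are assumptions chosen for illustration. Only the idea of weighing prefill cost against decode cost via KV cache overlap comes from this page.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of one worker's state, for illustration only."""
    name: str
    active_blocks: int         # blocks held by in-flight decodes (decode load)
    cached_prefix_blocks: int  # request blocks already in this worker's KV cache


def routing_cost(worker: Worker, request_blocks: int, overlap_weight: float = 1.0) -> float:
    """Assumed cost: weighted new prefill work plus existing decode load."""
    overlap = min(worker.cached_prefix_blocks, request_blocks)
    new_prefill_blocks = request_blocks - overlap
    return overlap_weight * new_prefill_blocks + worker.active_blocks


def pick_worker(workers: list[Worker], request_blocks: int, overlap_weight: float = 1.0) -> Worker:
    """Choose the worker with the lowest estimated cost."""
    return min(workers, key=lambda w: routing_cost(w, request_blocks, overlap_weight))


workers = [
    Worker("a", active_blocks=10, cached_prefix_blocks=8),  # busy, but cache is warm
    Worker("b", active_blocks=4, cached_prefix_blocks=0),   # idle, but cache is cold
]
best = pick_worker(workers, request_blocks=10)
print(best.name)
```

Under these assumptions, worker "a" wins at the default weight because reusing 8 cached blocks outweighs its higher decode load. With a weight of 0.0 the same function reduces to picking the least-loaded worker.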

Quick Start

To launch the Dynamo frontend with the KV Router:

$ python -m dynamo.frontend --router-mode kv --http-port 8000

For Kubernetes, set DYN_ROUTER_MODE=kv on the Frontend service. Workers automatically report KV cache events — no worker-side configuration changes needed.

| Argument | Default | Description |
|---|---|---|
| --router-mode kv | round_robin | Enable KV cache-aware routing |
| --router-kv-overlap-score-weight | 1.0 | Balance prefill vs decode optimization (higher = better TTFT) |
| --no-router-kv-events | enabled | Fall back to approximate routing (no event consumption from workers) |
| --router-queue-threshold | 4.0 | Backpressure queue threshold; enables priority scheduling via nvext.agent_hints.priority |
| --router-queue-policy | fcfs | Queue scheduling policy: fcfs (tail TTFT), wspt (avg TTFT), or lcfs (comparison-only reverse ordering) |
| --no-router-track-prefill-tokens | disabled | Ignore prompt-side prefill tokens in router load accounting; useful for decode-only routing paths |
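
As a rough illustration of how these flags combine, the sketch below assembles a frontend launch command from a small options object. The `RouterOptions` class and `build_frontend_command` helper are hypothetical and not part of Dynamo. The flag names and defaults are taken from the table, and passing values as space-separated arguments is an assumption based on the quick-start command.

```python
import shlex
from dataclasses import dataclass


@dataclass
class RouterOptions:
    """Hypothetical container for the KV router flags listed in the table."""
    http_port: int = 8000
    overlap_score_weight: float = 1.0
    queue_threshold: float = 4.0
    queue_policy: str = "fcfs"          # fcfs, wspt, or lcfs
    use_kv_events: bool = True          # False adds --no-router-kv-events
    track_prefill_tokens: bool = True   # False adds --no-router-track-prefill-tokens


def build_frontend_command(opts: RouterOptions) -> list[str]:
    """Return an argv list for launching the frontend in KV routing mode."""
    if opts.queue_policy not in {"fcfs", "wspt", "lcfs"}:
        raise ValueError(f"unsupported queue policy: {opts.queue_policy!r}")
    cmd = [
        "python", "-m", "dynamo.frontend",
        "--router-mode", "kv",
        "--http-port", str(opts.http_port),
        "--router-kv-overlap-score-weight", str(opts.overlap_score_weight),
        "--router-queue-threshold", str(opts.queue_threshold),
        "--router-queue-policy", opts.queue_policy,
    ]
    # Boolean switches are only emitted when they differ from the default.
    if not opts.use_kv_events:
        cmd.append("--no-router-kv-events")
    if not opts.track_prefill_tokens:
        cmd.append("--no-router-track-prefill-tokens")
    return cmd


default_cmd = build_frontend_command(RouterOptions())
approx_cmd = build_frontend_command(RouterOptions(use_kv_events=False, queue_policy="wspt"))
print(shlex.join(default_cmd))
print(shlex.join(approx_cmd))
```

Building the command as a list rather than a string keeps each flag and value as a separate argument, which avoids quoting issues if the list is later passed to a process launcher.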

Standalone Router

You can also run the KV router as a standalone service (without the Dynamo frontend). See the Standalone Router component for more details.

For deployment modes and quick start steps, see the Router Guide. For CLI arguments and tuning guidelines, see Configuration and Tuning. For A/B benchmarking, see the KV Router A/B Benchmarking Guide.

Prerequisites and Limitations

Requirements:

  • Dynamic endpoints only: backend workers must call register_model() with model_input=ModelInput.Tokens (see Backend Guide). Your backend handler then receives pre-tokenized requests with token_ids instead of raw text.
  • You cannot use --static-endpoint mode with KV routing (use dynamic discovery instead)
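
The token-based contract in the first requirement can be pictured with a stand-in handler. This sketch does not import Dynamo or call register_model(). The handler name, the async-generator shape, the `max_tokens` field, and the output dictionary are all hypothetical. The only detail taken from this page is that the request carries `token_ids` rather than raw text.

```python
import asyncio
from typing import AsyncIterator


async def generate(request: dict) -> AsyncIterator[dict]:
    """Hypothetical backend handler that consumes pre-tokenized input."""
    token_ids = request["token_ids"]  # pre-tokenized prompt, not raw text
    if not all(isinstance(t, int) for t in token_ids):
        raise TypeError("token_ids must be a list of integers")
    # Stand-in for a real engine: echo the last prompt token a few times.
    for _ in range(request.get("max_tokens", 2)):
        yield {"token_ids": [token_ids[-1]]}


async def main() -> list[dict]:
    request = {"token_ids": [101, 2023, 2003], "max_tokens": 2}
    return [chunk async for chunk in generate(request)]


chunks = asyncio.run(main())
print(chunks)
```

A real worker would replace the echo loop with calls into its inference engine. The point here is only that the handler never sees or tokenizes prompt text itself.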

Multimodal Support:

  • TRT-LLM and vLLM: Multimodal routing supported for images via multimodal hashes
  • SGLang: Image routing not yet supported
  • Other modalities (audio, video, etc.): Not yet supported

Limitations:

  • Static endpoints are not supported: the KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states

If you do not need KV routing, use --router-mode round-robin, --router-mode random, --router-mode least-loaded, or --router-mode device-aware-weighted; these modes work with both static and dynamic endpoints.
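
The compatibility rules in this section can be expressed as a small pre-flight check. The function below is a hypothetical helper, not part of Dynamo. It only encodes what this page states: KV routing cannot be combined with static endpoints, while the other listed modes work with either kind.

```python
KV_MODE = "kv"
# Non-KV router modes named on this page.
OTHER_MODES = {"round-robin", "random", "least-loaded", "device-aware-weighted"}


def check_router_mode(router_mode: str, static_endpoint: bool) -> None:
    """Raise ValueError for combinations this page describes as unsupported."""
    if router_mode != KV_MODE and router_mode not in OTHER_MODES:
        raise ValueError(f"unknown router mode: {router_mode!r}")
    if router_mode == KV_MODE and static_endpoint:
        raise ValueError(
            "KV routing requires dynamic discovery; static endpoints are not supported"
        )


check_router_mode("round-robin", static_endpoint=True)  # allowed
check_router_mode("kv", static_endpoint=False)          # allowed

try:
    check_router_mode("kv", static_endpoint=True)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```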

Next Steps

  • Router Guide: Deployment modes, quick start, and page map
  • Routing Concepts: Cost model and worker-selection behavior
  • Configuration and Tuning: Router flags, transport modes, and metrics
  • Disaggregated Serving: Prefill and decode routing setups
  • Router Operations: Replicas, persistence, and recovery
  • Router Examples: Python API usage, K8s examples, and custom routing patterns
  • Router Testing: Test layers from Rust unit tests to fixture-backed replay and full process E2E
  • Standalone Indexer: Run the KV indexer as a separate service for independent scaling
  • Router Design: Architecture details, algorithms, and event transport modes