The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
To launch the Dynamo frontend with the KV Router:
This command:
Backend workers register themselves using the register_llm API, after which the KV Router automatically:
To enable the KV Router in a Kubernetes deployment, add the DYN_ROUTER_MODE environment variable to your frontend service:
Key Points:
Set DYN_ROUTER_MODE=kv on the Frontend service only
Complete K8s Examples:
For A/B Testing and Advanced K8s Setup: See the comprehensive KV Router A/B Benchmarking Guide for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
The KV Router supports several key configuration options:
--router-mode kv: Enable KV cache-aware routing (required)
--kv-cache-block-size <size>: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
--router-temperature <float>: Controls routing randomness (default: 0.0)
0.0: Deterministic selection of the best worker
> 0.0: Probabilistic selection using softmax sampling
--kv-events / --no-kv-events: Controls how the router tracks cached blocks (default: --kv-events)
--kv-events: Uses real-time events from workers for accurate cache tracking
--no-kv-events: Uses approximation based on routing decisions (lower overhead, less accurate)
--kv-overlap-score-weight <float>: Balance between prefill and decode optimization (default: 1.0)
For a complete list of available options:
All CLI arguments can be configured via environment variables in Kubernetes deployments. Use the DYN_ prefix with uppercase parameter names:
You can also pass CLI arguments directly in the container command:
Recommendation: Use environment variables for easier configuration management and consistency with Dynamo’s K8s patterns.
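As a quick illustration of the naming rule, the following sketch derives an environment variable name from a CLI flag. It assumes that hyphens become underscores, which matches the documented pair --router-mode and DYN_ROUTER_MODE; the helper `flag_to_env` is hypothetical and not part of Dynamo, and the other derived names should be checked against the output of the frontend's help command before use.

```python
def flag_to_env(flag: str, prefix: str = "DYN_") -> str:
    """Map a CLI flag such as --router-mode to a name such as DYN_ROUTER_MODE.

    Assumption: hyphens map to underscores and the name is uppercased.
    """
    name = flag.lstrip("-")
    return prefix + name.replace("-", "_").upper()


if __name__ == "__main__":
    for flag in ("--router-mode", "--router-temperature", "--kv-overlap-score-weight"):
        print(f"{flag} -> {flag_to_env(flag)}")
```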
The KV Router tracks two key metrics for each worker:
Potential Active Blocks: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
Potential New Prefill Blocks: The number of tokens that need to be computed from scratch on a worker, calculated as:
The router maintains block information through two complementary systems:
Active Decoding Blocks: Tracked locally by the router throughout the request lifecycle:
Cached Blocks: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
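To make the prefix-tree idea concrete, here is a minimal sketch of how overlap could be computed from stored block sequences. It is a toy model, not the KvIndexer implementation: the class `PrefixIndex`, its methods, and the use of short strings as block hashes are all assumptions made for illustration, and block removal events are not handled.

```python
from collections import defaultdict


class PrefixIndex:
    """Toy prefix tree mapping sequences of block hashes to workers that cache them."""

    def __init__(self):
        self.children = {}    # block hash -> PrefixIndex
        self.workers = set()  # workers that cache the block leading to this node

    def store(self, worker, block_hashes):
        """Record that `worker` has cached this sequence of blocks (a 'stored' event)."""
        node = self
        for block_hash in block_hashes:
            node = node.children.setdefault(block_hash, PrefixIndex())
            node.workers.add(worker)

    def overlap(self, block_hashes):
        """Return {worker: number of leading request blocks already cached}."""
        scores = defaultdict(int)
        node = self
        for block_hash in block_hashes:
            node = node.children.get(block_hash)
            if node is None:
                break  # no worker has cached the prefix beyond this point
            for worker in node.workers:
                scores[worker] += 1
        return dict(scores)


index = PrefixIndex()
index.store("worker-0", ["a", "b", "c"])
index.store("worker-1", ["a", "x"])
print(index.overlap(["a", "b", "d"]))
```

In this example the request shares two leading blocks with worker-0 and one with worker-1, so routing to worker-0 would leave fewer blocks to prefill.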
The KV Router’s routing decision is based on a simple cost function:
Where:
The kv-overlap-score-weight parameter (default: 1.0) controls the balance between prefill and decode optimization:
Higher values (> 1.0): Emphasize reducing prefill cost
Lower values (< 1.0): Emphasize decode performance
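The effect of the weight can be sketched with a small example. It assumes the cost is a weighted sum of the two tracked metrics, with the weight applied to the prefill term and the lowest cost winning, which is consistent with the description above but is not taken from the router's source. The names `Candidate`, `cost`, and `pick_worker` and the block counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    worker: str
    potential_prefill_blocks: float  # blocks that would need fresh prefill
    potential_active_blocks: float   # decode blocks if the request lands here


def cost(c: Candidate, overlap_weight: float = 1.0) -> float:
    # Assumed form: weighted prefill cost plus decode load; lower is better.
    return overlap_weight * c.potential_prefill_blocks + c.potential_active_blocks


def pick_worker(candidates, overlap_weight: float = 1.0) -> str:
    return min(candidates, key=lambda c: cost(c, overlap_weight)).worker


candidates = [
    # worker-0 has a warm cache but is busy; worker-1 is idle but cold.
    Candidate("worker-0", potential_prefill_blocks=2, potential_active_blocks=30),
    Candidate("worker-1", potential_prefill_blocks=10, potential_active_blocks=18),
]

for weight in (0.5, 1.0, 2.0):
    print(weight, pick_worker(candidates, weight))
```

With these made-up numbers, weights of 0.5 and 1.0 favor the lightly loaded worker-1, while a weight of 2.0 shifts the choice to worker-0, whose cache covers more of the request.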
The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the --no-kv-events flag:
With KV Events (default):
Without KV Events (--no-kv-events):
The router logs the cost calculation for each worker:
This shows:
The router-temperature parameter (default: 0.0) controls routing randomness:
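One way to picture this behavior is the sketch below: a temperature of 0.0 picks the lowest-cost worker, and a positive temperature samples from a softmax over negated costs. This is an illustrative model only. The function `select_worker` is hypothetical, and the router may normalize or scale costs differently before sampling.

```python
import math
import random


def select_worker(costs, temperature=0.0, rng=random):
    """Pick a worker from {worker: cost}; lower cost is better."""
    workers = list(costs)
    if temperature <= 0.0:
        # Deterministic: always choose the cheapest worker.
        return min(workers, key=costs.get)
    # Softmax over negated costs. Subtracting the minimum cost keeps the
    # exponentials numerically stable without changing the probabilities.
    lowest = min(costs.values())
    weights = [math.exp(-(costs[w] - lowest) / temperature) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


costs = {"worker-0": 32.0, "worker-1": 28.0}
print(select_worker(costs))  # temperature 0.0: deterministic choice
print(select_worker(costs, temperature=4.0, rng=random.Random(0)))
```

Higher temperatures flatten the distribution and spread requests more evenly, at the price of sometimes choosing a worker with less cache overlap.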
Adjust kv-overlap-score-weight to meet your performance goals: