This document describes the internal architecture of the Dynamo KV Router, including block tracking mechanisms, the KV cache optimization system, event handling, and transport modes.
The KV Router tracks two key metrics for each worker:
Potential Active Blocks: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
Potential New Prefill Blocks: The number of tokens that need to be computed from scratch on a worker, calculated as:
The router maintains block information through two complementary systems:
Active Decoding Blocks: Tracked locally by the router throughout the request lifecycle:
Cached Blocks: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
The leading Large Language Models (LLMs) today are auto-regressive and based off of the transformer architecture. One key inference optimization technique is to cache the already computed keys and values and to reuse them for the future tokens. This is called the KV Cache.
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions.
Prefill blocks: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
Decode blocks: Estimated from the request’s input tokens and each worker’s active sequences. The count updates when requests complete and their blocks are freed.
Cost formula: cost = overlap_score_weight * prefill_blocks + decode_blocks
overlap_score_weight balances cache hit optimization against load distributionThe router selects the worker with the lowest cost. When router_temperature is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with overlap_score_weight = 1.0:
Every inference framework will have a KV Cache for each worker. A popular inference framework library is vLLM where a key contribution was PagedAttention, which allowed them to manage KV Cache in an efficient way by chunking requests into blocks.
Another popular inference framework, SGLang, contributed RadixAttention which introduced a prefix tree which allows for efficient matching, inserting and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally.
To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
Further details can be found for: SGLang, TRT-LLM and vLLM.
The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
The two types of events are:
The publisher can be initialized and used through Python bindings.
Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer’s internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python’s built-in hash() for any event IDs, set PYTHONHASHSEED=0; otherwise this setting has no effect.
The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
The KVIndexer has a method find_matches_for_request, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
The KVIndexer supports two backend implementations, selected via --router-event-threads:
Single-threaded RadixTree (--router-event-threads 1): Events are processed in a dedicated single-threaded tokio runtime via channel-based dispatch. Supports TTL-based expiration and size-based pruning (for --no-kv-events approximate mode).
ConcurrentRadixTree (default, --router-event-threads N where N > 1): A thread-safe radix tree with a pool of N worker threads for event processing (default: 4). Uses sticky worker routing (events for the same worker always go to the same thread) to ensure per-worker event serialization. Read operations (find_matches) execute concurrently with writes. Does not support TTL/pruning.
In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
AddRequest: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
MarkPrefillCompleted: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
Free: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
The router supports two event transport modes for KV cache state synchronization:
NATS Core / Event Plane with Local Indexer (default): Fire-and-forget pub/sub where workers maintain local radix trees (enabled by default). Router rebuilds state by querying workers on startup. Lower latency, simpler setup. Works with both NATS Core and ZMQ event planes.
JetStream (--durable-kv-events on both frontend and workers): Persistent event stream with durable consumers. State persists across router restarts via snapshots in NATS object store. Best for production with multi-replica consistency. Important: Both the frontend and all workers must specify --durable-kv-events for JetStream mode to work correctly.
KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
--durable-kv-events flag on both the frontend and all workersBoth frontend and workers must specify --durable-kv-events for JetStream mode to work correctly. The frontend uses this flag to consume from JetStream, while workers use it to publish to JetStream instead of the local indexer.
By default, workers have local indexer enabled. Each worker maintains its own local radix tree (local indexer) and publishes events over the generic event plane (NATS Core or ZMQ, depending on --event-plane). Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker’s local indexer directly.
--durable-kv-events flag on both workers (SGLang, TRT-LLM, vLLM, mocker) and frontendHow gap detection works:
event_id > last_id + 1, the router detects a gap[last_id+1, event_id-1]Startup behavior:
By default, all workers have enable_local_indexer=true, so the router uses NATS Core / Event Plane mode with local indexer. To use JetStream mode instead, specify --durable-kv-events on both the frontend and all workers.
In addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
This is managed locally in each router via a “slot manager”. To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.