KV Event Replay — Dynamo vs vLLM
Both Dynamo and vLLM publish KV cache events (block stored, block removed, etc.) over a fire-and-forget transport (ZMQ PUB/SUB). Because PUB/SUB is lossy, both systems need a mechanism for consumers to detect missed messages and recover. This document compares the two approaches.
A KV event consumer (router, cache coordinator) subscribes to a live stream of block events from workers. Events carry monotonically increasing sequence numbers. When the consumer detects a gap in the sequence (e.g., received seq 42 then seq 45), it needs to recover the missed events or it will have a stale, incorrect view of the worker’s KV cache state.
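The gap check described above can be sketched in a few lines. This is a minimal illustration, not code from either project: the class name `GapDetector` and its `observe` method are hypothetical, and the sketch assumes sequence numbers increase by exactly one per event.

```python
from typing import Optional, Tuple


class GapDetector:
    """Tracks the last sequence number seen and reports missing ranges.

    Hypothetical helper for illustration; assumes consecutive events
    differ by exactly 1 in sequence number.
    """

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def observe(self, seq: int) -> Optional[Tuple[int, int]]:
        """Return the inclusive (first_missing, last_missing) range, or None."""
        gap = None
        if self.last_seq is not None and seq > self.last_seq + 1:
            gap = (self.last_seq + 1, seq - 1)
        # Only move forward; stale or duplicate events do not rewind the cursor.
        if self.last_seq is None or seq > self.last_seq:
            self.last_seq = seq
        return gap
```

With the example from the text, observing seq 42 and then seq 45 reports the missing range (43, 44), which is what the consumer would then ask the worker to replay.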
vLLM’s ZmqEventPublisher (in vllm/distributed/kv_events.py) runs two ZMQ sockets in a background thread:
- PUB socket (tcp://*:5557): Streams KVEventBatch messages tagged with a monotonic sequence number.
- ROUTER socket (tcp://*:5558): Handles replay requests from consumers.
The publisher keeps a deque of the last buffer_steps (default 10,000) serialized batches. When a consumer detects a gap, it sends the first missing sequence number to the ROUTER socket. The publisher linearly scans the buffer and streams back all batches from that sequence onward, ending with a sentinel (seq=-1, payload=empty).
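The buffer-and-scan behavior can be modeled without any sockets. The sketch below is an assumption-laden illustration rather than vLLM's actual code: `ReplayBuffer`, `append`, and `replay_from` are hypothetical names, and the ZMQ transport is left out entirely so that only the bounded deque, the linear scan, and the end-of-replay sentinel remain.

```python
from collections import deque
from typing import Deque, Iterator, Tuple

END_SEQ = -1  # sentinel sequence number that marks the end of a replay


class ReplayBuffer:
    """Bounded FIFO of (seq, payload) pairs; hypothetical stand-in for the publisher's buffer."""

    def __init__(self, buffer_steps: int = 10_000) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._buf: Deque[Tuple[int, bytes]] = deque(maxlen=buffer_steps)

    def append(self, seq: int, payload: bytes) -> None:
        self._buf.append((seq, payload))

    def replay_from(self, start_seq: int) -> Iterator[Tuple[int, bytes]]:
        """Linearly scan the buffer and yield every batch with seq >= start_seq."""
        for seq, payload in self._buf:
            if seq >= start_seq:
                yield seq, payload
        # Always finish with the sentinel so the consumer knows replay is done.
        yield END_SEQ, b""
```

In this model, a request for a sequence number that has already been evicted simply returns whatever newer batches remain, followed by the sentinel. The consumer has to compare the first replayed sequence number against the one it asked for to notice that the gap was only partially filled.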
Trade-offs:
replay_seq > last_seq.
Dynamo’s LocalKvIndexer (in lib/kv-router/src/indexer/local.rs) wraps a KvIndexer (backed by a RadixTree) with a circular event buffer:
When the router queries a worker for events via get_events_in_id_range(start_id, end_id), the local indexer returns one of three responses:
The tree dump fallback makes “buffer too old” a recoverable condition: when the buffer can no longer satisfy the request, the indexer dumps the entire tree state instead. The cost is additional complexity and memory for maintaining the tree.
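The fallback decision can be illustrated with a small Python model. Everything here is a hypothetical sketch and not Dynamo's Rust implementation: a plain set stands in for the RadixTree, the `Events` and `TreeDump` types are invented for the example, and only the two outcomes described in the prose (replay from buffer, or full dump) are modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Set, Tuple, Union

Event = Tuple[int, str, int]  # (event_id, op, block_hash); hypothetical shape


@dataclass
class Events:
    events: List[Event]


@dataclass
class TreeDump:
    blocks: List[int]
    last_event_id: int


class LocalIndexerSketch:
    """Event buffer plus authoritative state (a set stands in for the radix tree)."""

    def __init__(self, capacity: int) -> None:
        self._buf: Deque[Event] = deque(maxlen=capacity)
        self._blocks: Set[int] = set()
        self._last_id = -1

    def apply(self, event_id: int, op: str, block_hash: int) -> None:
        if op == "stored":
            self._blocks.add(block_hash)
        elif op == "removed":
            self._blocks.discard(block_hash)
        self._buf.append((event_id, op, block_hash))
        self._last_id = event_id

    def get_events_in_id_range(self, start_id: int, end_id: int) -> Union[Events, TreeDump]:
        # The buffer can serve the request only if it still holds start_id.
        if self._buf and self._buf[0][0] <= start_id:
            return Events([e for e in self._buf if start_id <= e[0] <= end_id])
        # Otherwise fall back to a full snapshot of the current state.
        return TreeDump(sorted(self._blocks), self._last_id)
```

The design point this shows is that the snapshot is always available because the indexer applies every event to its own state before buffering it, so eviction from the buffer never loses information the consumer needs.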
Both systems detect gaps the same way: the consumer tracks the last sequence/event ID it processed and compares it against the next one received.
vLLM (from examples/online_serving/kv_events_subscriber.py):
Dynamo (from lib/llm/src/kv_router/worker_query.rs):
The router tracks last_recovered_event_id per worker and requests recover_from_worker(worker_id, dp_rank, start_event_id, end_event_id) when it detects a gap or on initial discovery. The local indexer handles the complexity of deciding whether to replay from buffer or dump the tree.
vLLM’s built-in replay is a good fit when:
Dynamo’s local indexer is a good fit when:
The two approaches share the same core idea — a FIFO ring buffer for catching up on small, transient gaps. Dynamo adds a RadixTree underneath as authoritative state, which enables the tree dump fallback for full state recovery at the cost of additional memory and complexity. vLLM keeps it simple with just the buffer, which is sufficient when consumers are stable and gaps are short-lived.
For deployments using Dynamo’s KV-aware routing, the local indexer is used automatically. For standalone vLLM deployments where you want to build your own event consumer, vLLM’s replay buffer provides a lightweight starting point.
RadixTree, ConcurrentRadixTree, and PositionalIndexer internals