This page covers day-2 operational topics for router deployments. For flags and tuning guidance, see Configuration and Tuning.
For improved fault tolerance, you can launch multiple frontend-plus-router replicas. If multiple dynamo.frontend processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone python -m dynamo.router service.
The KV router maintains two independent state families, prefix cache state and active block state, with different synchronization, persistence, and recovery behavior.
For the architecture behind these states, see Router Design.
Prefix cache state is maintained by the KV indexer in each router or frontend. In event-driven mode, workers publish KV Stored and Removed events, and each router replica consumes those events to update its radix tree. Because KV events are distributed through the event plane, multiple router replicas naturally receive the same prefix-cache updates; they do not need router-to-router synchronization for prefix blocks.
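The convergence property described above can be illustrated with a minimal sketch. It assumes a simplified event shape (`KvEvent`) and replaces the radix tree with a per-worker set of block hashes (`PrefixIndex`); both names are hypothetical and are not part of Dynamo's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class KvEvent:
    """Hypothetical event shape: a worker stored or removed block hashes."""
    worker_id: int
    kind: str            # "stored" or "removed"
    block_hashes: tuple


class PrefixIndex:
    """Toy stand-in for the radix tree: worker -> set of cached block hashes."""

    def __init__(self):
        self.blocks = defaultdict(set)

    def apply(self, event):
        if event.kind == "stored":
            self.blocks[event.worker_id].update(event.block_hashes)
        elif event.kind == "removed":
            self.blocks[event.worker_id].difference_update(event.block_hashes)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")

    def overlap(self, worker_id, request_hashes):
        """Count the leading request blocks already cached on the worker."""
        cached = self.blocks.get(worker_id, set())
        count = 0
        for block_hash in request_hashes:
            if block_hash not in cached:
                break
            count += 1
        return count


# The same event stream is delivered to every replica by the event plane.
events = [
    KvEvent(0, "stored", (101, 102, 103)),
    KvEvent(1, "stored", (101,)),
    KvEvent(0, "removed", (103,)),
]

replica_a, replica_b = PrefixIndex(), PrefixIndex()
for event in events:
    replica_a.apply(event)
    replica_b.apply(event)
```

Because both replicas apply identical events, their indexes end up equal without any router-to-router exchange, which is the reason no extra synchronization is needed for prefix blocks.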
When --no-router-kv-events is used, the router does not consume worker KV events. It instead predicts cache state from its own routing decisions and expires predicted blocks with --router-ttl-secs. This approximate mode is useful for development or for backends whose KV events are not yet reliable, but it is not the recommended production path.
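A rough sketch of approximate mode follows. The class `PredictedCache` and its methods are hypothetical, timestamps are passed in explicitly so the behavior is deterministic, and refreshing a block's expiry when it is routed again is an assumption rather than documented behavior.

```python
class PredictedCache:
    """Toy model of TTL-based cache prediction (names are hypothetical)."""

    def __init__(self, ttl_secs):
        self.ttl_secs = ttl_secs
        self._expiry = {}  # (worker_id, block_hash) -> expiry timestamp

    def record_routing(self, worker_id, block_hashes, now):
        """Assume the worker will cache every block of a request routed to it."""
        for block_hash in block_hashes:
            # Assumption: routing the same block again refreshes its TTL.
            self._expiry[(worker_id, block_hash)] = now + self.ttl_secs

    def expire(self, now):
        """Drop predictions whose TTL has elapsed."""
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}

    def predicted_blocks(self, worker_id, now):
        self.expire(now)
        return {h for (w, h) in self._expiry if w == worker_id}


cache = PredictedCache(ttl_secs=120.0)
cache.record_routing(worker_id=0, block_hashes=[1, 2], now=0.0)
cache.record_routing(worker_id=0, block_hashes=[2, 3], now=100.0)
```

The prediction is only as good as the TTL: a block the worker evicted early still looks cached until it expires, and a block the worker kept longer is forgotten. That mismatch is why this mode is described as approximate.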
Prefix cache recovery matters because stale or missing prefix state directly affects cache-hit routing decisions. Dynamo supports two recovery strategies.
For more on gap detection and replay, see KV Event Replay — Dynamo vs vLLM.
JetStream mode requires --router-durable-kv-events on both frontend and workers.
If you need to start with a fresh state in JetStream mode, you have two options:
--router-reset-states, which purges the entire stream and radix snapshot. Only do this when launching the first router replica in a component, because it can bring existing replicas into an inconsistent state.
Active block state tracks in-flight request load. It is derived from the request lifecycle: the router records a request when it is assigned to a worker, updates prefill completion and optional output-block growth as responses arrive, and frees the request when it finishes.
This state is deliberately ephemeral. If a router replica restarts, it starts with no active-block knowledge. That is usually acceptable for fault tolerance because active requests are short-lived relative to prefix cache state: old active blocks leave the system as requests complete, and the router’s view becomes accurate again as it handles new requests.
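The request lifecycle behind active block state can be sketched as follows. `ActiveBlockTracker` and its method names are hypothetical; the sketch only mirrors the steps described above (record on assignment, mark prefill completion, grow with output blocks, free on finish).

```python
class ActiveBlockTracker:
    """Toy in-memory tracker for in-flight request load (hypothetical API)."""

    def __init__(self):
        # request_id -> {"worker": int, "blocks": int, "prefill_done": bool}
        self._requests = {}

    def add_request(self, request_id, worker_id, prompt_blocks):
        """Record a request when it is assigned to a worker."""
        self._requests[request_id] = {
            "worker": worker_id,
            "blocks": prompt_blocks,
            "prefill_done": False,
        }

    def mark_prefill_complete(self, request_id):
        self._requests[request_id]["prefill_done"] = True

    def add_output_block(self, request_id):
        """Optional growth as decoding produces another block."""
        self._requests[request_id]["blocks"] += 1

    def free(self, request_id):
        """Release the request's blocks when it finishes."""
        self._requests.pop(request_id, None)

    def active_blocks(self, worker_id):
        return sum(
            r["blocks"] for r in self._requests.values() if r["worker"] == worker_id
        )


tracker = ActiveBlockTracker()
tracker.add_request("r1", worker_id=0, prompt_blocks=4)
tracker.add_request("r2", worker_id=0, prompt_blocks=2)
tracker.mark_prefill_complete("r1")
tracker.add_output_block("r1")
load_while_running = tracker.active_blocks(0)
tracker.free("r1")
load_after_free = tracker.active_blocks(0)

# A restarted replica is modeled as a fresh tracker with no knowledge.
restarted = ActiveBlockTracker()
```

Nothing is persisted in this sketch, which matches the ephemeral design: the fresh tracker reports zero load until it routes new requests.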
The operational concern is replica synchronization. Active blocks are tracked locally by the router that routed a request, so multiple frontend or router replicas do not automatically share the same active-load view.
There are two operating modes for active blocks:
--router-replica-sync so replicas publish and subscribe to active-sequence lifecycle events through NATS core messaging. This gives each replica a more complete active-load view across the router fleet.
With replica sync enabled, a new router still starts with zero active-block knowledge, but it converges through live request handling and active-sequence events from other replicas. Without it, each replica keeps an isolated active-block view, which can lead to suboptimal load balancing.
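The difference between isolated and synchronized replicas can be sketched with an in-process stand-in for the messaging layer. `Bus` and `Replica` are hypothetical names, the event tuples are an invented format, and ignoring a "free" event for an unknown request is an assumption about how a late joiner might behave.

```python
class Bus:
    """In-process stand-in for the messaging layer (NATS core in the text)."""

    def __init__(self):
        self.subscribers = []

    def publish(self, sender, event):
        for subscriber in self.subscribers:
            if subscriber is not sender:
                subscriber.apply(event)


class Replica:
    """Router replica tracking active requests, optionally synced over a bus."""

    def __init__(self, bus=None):
        self.requests = {}  # request_id -> (worker_id, blocks)
        self.bus = bus
        if bus is not None:
            bus.subscribers.append(self)

    def apply(self, event):
        kind, request_id, worker_id, blocks = event
        if kind == "add":
            self.requests[request_id] = (worker_id, blocks)
        else:
            # Assumption: a free for a request this replica never saw is ignored.
            self.requests.pop(request_id, None)

    def _emit(self, event):
        self.apply(event)
        if self.bus is not None:
            self.bus.publish(self, event)

    def route(self, request_id, worker_id, blocks):
        self._emit(("add", request_id, worker_id, blocks))

    def finish(self, request_id):
        self._emit(("free", request_id, None, 0))

    def load(self, worker_id):
        return sum(b for w, b in self.requests.values() if w == worker_id)


# Without sync: each replica only sees what it routed itself.
solo_a, solo_b = Replica(), Replica()
solo_a.route("r1", worker_id=0, blocks=8)

# With sync: lifecycle events reach the other replicas.
bus = Bus()
synced_a, synced_b = Replica(bus), Replica(bus)
synced_a.route("r2", worker_id=0, blocks=8)
late_joiner = Replica(bus)          # starts with zero knowledge
synced_b.route("r3", worker_id=0, blocks=4)
synced_a.finish("r2")
```

In this sketch the late joiner never learns about `r2`, but it sees `r3` and is accurate again once `r2` finishes, which mirrors the convergence described above.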
For Dynamo-native deployments, the remote indexer is served by dynamo.frontend or dynamo.router, not by dynamo.indexer.
--serve-indexer on router or frontend replicas that should expose kv_indexer_query from the worker component.
--use-remote-indexer on consumer routers or frontends that should query that served endpoint instead of maintaining a local overlap indexer.
dynamo.indexer remains the standalone HTTP plus ZMQ microservice for non-Dynamo or direct-ZMQ deployments.
Frontend example:
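As a minimal sketch, the two roles can be expressed as argument lists built in Python. The helper `indexer_command` is hypothetical, and only the module names and flags mentioned above are used; a real launch would also need model, router, and discovery options that are omitted here.

```python
def indexer_command(module, role):
    """Build an argv list for a serving or consuming replica.

    module: "dynamo.frontend" or "dynamo.router"
    role:   "serve" exposes the indexer, "consume" queries a served one
    """
    if module not in ("dynamo.frontend", "dynamo.router"):
        # dynamo.indexer is the separate standalone microservice.
        raise ValueError("remote indexer is served by dynamo.frontend or dynamo.router")
    flags = {"serve": "--serve-indexer", "consume": "--use-remote-indexer"}
    return ["python", "-m", module, flags[role]]


serving_frontend = indexer_command("dynamo.frontend", "serve")
consumer_router = indexer_command("dynamo.router", "consume")
```

The resulting lists could be passed to `subprocess.Popen` once the remaining deployment-specific options are appended.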
The served service is request-plane only. Each serving router or frontend keeps its normal local KV event ingestion, gap detection, and worker-query recovery path; remote consumers only issue hash-based overlap queries.
Approximate mode (--no-router-kv-events) is singleton-only for remote serving: only one --serve-indexer replica may exist for a given worker component. Event-driven mode allows multiple serving replicas behind the same worker component.
Request-plane transport is independent of KV event transport. The request plane (DYN_REQUEST_PLANE or --request-plane) controls how requests reach workers. KV events use NATS in JetStream or NATS Core modes, or ZMQ when --event-plane zmq is set. With --event-plane zmq and --discovery-backend file or mem, the router can run without etcd or NATS. When using a NATS-based event plane, NATS is initialized automatically; set NATS_SERVER=nats://... to override the default localhost:4222.
When --router-kv-overlap-score-credit is set to 0, no KV indexer is created and prefix matching is disabled. When --no-router-kv-events is set, a KV indexer is still created but no event subscriber is launched; the router predicts cache state from its own routing decisions with TTL-based expiration.
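The two flags combine into three effective modes, which the following sketch spells out. The function `indexer_mode` and its return labels are hypothetical, and giving the zero-credit case precedence when both flags are set is an assumption based on the statement that no indexer is created in that case.

```python
def indexer_mode(overlap_score_credit, no_kv_events):
    """Return a label for the router's prefix-matching mode.

    overlap_score_credit: value of --router-kv-overlap-score-credit
    no_kv_events:         True when --no-router-kv-events is set
    """
    if overlap_score_credit == 0:
        # No KV indexer is created, so prefix matching is disabled.
        return "disabled"
    if no_kv_events:
        # Indexer exists but has no event subscriber; state is predicted
        # from routing decisions and expired by TTL.
        return "approximate"
    # Indexer exists and is updated from worker KV events.
    return "event-driven"
```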
Backend KV event publishing is independent of the frontend’s --no-router-kv-events flag. The frontend flag controls whether the router consumes events; backend flags control whether workers publish them. If the router is not consuming events, workers that still publish spend resources on events that go unused, but this causes no harm.
--kv-events-config '{"enable_kv_cache_events": false}' to disable, or '{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557"}' to enable.
--kv-events-config with a JSON config to enable, or omit it to keep publishing disabled.
--publish-events-and-metrics to enable, or omit it to keep publishing disabled.
The CLI arg --router-ttl-secs controls local cache prediction lifetime when the router operates without receiving events from workers. When workers are configured to publish KV events, the router relies on worker-side eviction events and this parameter is ignored.
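The JSON values for --kv-events-config shown above are easy to get wrong when quoting by hand. A small sketch that generates them, using only the keys and values from the examples above, might look like this; the helper names `kv_events_config` and `kv_events_flag` are hypothetical.

```python
import json
import shlex


def kv_events_config(enabled, endpoint="tcp://*:5557"):
    """Return the JSON string for --kv-events-config."""
    if enabled:
        config = {
            "enable_kv_cache_events": True,
            "publisher": "zmq",
            "endpoint": endpoint,
        }
    else:
        config = {"enable_kv_cache_events": False}
    return json.dumps(config)


def kv_events_flag(enabled):
    """Return a shell-safe fragment: the flag plus its quoted JSON value."""
    return "--kv-events-config " + shlex.quote(kv_events_config(enabled))


disable_fragment = kv_events_flag(False)
enable_fragment = kv_events_flag(True)
```

Using `json.dumps` guarantees lowercase `true` and `false` in the output, and `shlex.quote` wraps the value in single quotes so the shell passes it through as one argument.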
--router-queue-threshold and the busy thresholds (--active-decode-blocks-threshold, --active-prefill-tokens-threshold, --active-prefill-tokens-threshold-frac) serve different purposes. Busy thresholds remove an individual worker from the candidate set when it exceeds a utilization limit. In contrast, --router-queue-threshold defers the entire routing decision until at least one worker has capacity, so the request is routed with the freshest load metrics. The busy thresholds can be updated at runtime, without restarting the frontend, through the /busy_threshold HTTP endpoint. For the eligibility and backpressure distinction, see Router Filtering. For rejection behavior details, see Request Rejection.
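The distinction can be sketched with a toy dispatcher. This is a simplification under stated assumptions: each worker's utilization is collapsed into a single hypothetical load fraction, "has capacity" is modeled as load at or below the queue threshold, and the function names and return labels are invented for illustration.

```python
def filter_busy(loads, busy_threshold):
    """Busy threshold: drop individual workers above the limit."""
    return {w: load for w, load in loads.items() if load <= busy_threshold}


def should_defer(loads, queue_threshold):
    """Queue threshold: hold the request while no worker has capacity."""
    return all(load > queue_threshold for load in loads.values())


def dispatch(request, loads, pending, busy_threshold, queue_threshold):
    """Return ("deferred" | "rejected" | "routed", worker_id or None)."""
    if should_defer(loads, queue_threshold):
        # The whole decision waits; it is retried later with fresh metrics.
        pending.append(request)
        return ("deferred", None)
    candidates = filter_busy(loads, busy_threshold)
    if not candidates:
        return ("rejected", None)
    # Pick the least loaded eligible worker (a stand-in for the real cost model).
    return ("routed", min(candidates, key=candidates.get))


pending = []
routed = dispatch("req-a", {0: 0.95, 1: 0.50}, pending, 0.9, 0.8)
deferred = dispatch("req-b", {0: 0.95, 1: 0.85}, pending, 0.9, 0.8)
rejected = dispatch("req-c", {0: 0.95, 1: 0.92}, pending, 0.9, 0.99)
```

In the first call only worker 0 is filtered out, so the request goes to worker 1. In the second call every worker is above the queue threshold, so nothing is filtered or chosen yet and the request waits. In the third call the request is not deferred, but every worker is busy, so no candidate remains.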