This page covers day-2 operational topics for router deployments. For flags and tuning guidance, see Configuration and Tuning.
For improved fault tolerance, you can launch multiple frontend-plus-router replicas. If multiple dynamo.frontend processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone python -m dynamo.router service.
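As a minimal sketch of the same-host case, the helper below builds one launch command per replica and gives each a distinct HTTP port. The flag name --http-port, the base port 8000, and the helper name frontend_commands are assumptions made for illustration; check Configuration and Tuning for the actual port flag.

```python
import shlex


def frontend_commands(base_port: int, replicas: int) -> list[str]:
    """Return one launch command per replica, each on its own HTTP port."""
    commands = []
    for index in range(replicas):
        argv = [
            "python", "-m", "dynamo.frontend",
            # Assumed flag name; replicas on one host must not share a port.
            "--http-port", str(base_port + index),
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for command in frontend_commands(base_port=8000, replicas=2):
        print(command)
```

In Kubernetes or on separate hosts the port offset is usually unnecessary, since each replica has its own network namespace.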
For Dynamo-native deployments, the remote indexer is served by dynamo.frontend or dynamo.router, not by dynamo.indexer.
- Use --serve-indexer on router or frontend replicas that should expose kv_indexer_query from the worker component.
- Use --use-remote-indexer on consumer routers or frontends that should query that served endpoint instead of maintaining a local overlap indexer.
- dynamo.indexer remains the standalone HTTP plus ZMQ microservice for non-Dynamo or direct-ZMQ deployments.
Frontend example:
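A minimal sketch of the two frontend roles, assembled in Python so the flag pairing is explicit. Only the module name and the two indexer flags come from this page; the helper name indexer_command is hypothetical, and every other flag a real deployment needs (model, namespace, router mode) is deliberately omitted.

```python
import shlex

BASE = ["python", "-m", "dynamo.frontend"]

# One flag per role: a serving replica exposes the indexer endpoint,
# a consuming replica queries it instead of keeping a local indexer.
ROLE_FLAGS = {
    "serve": "--serve-indexer",
    "consume": "--use-remote-indexer",
}


def indexer_command(role: str) -> str:
    """Return the launch command for a serving or consuming frontend."""
    return shlex.join(BASE + [ROLE_FLAGS[role]])


if __name__ == "__main__":
    print(indexer_command("serve"))
    print(indexer_command("consume"))
```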
The served service is request-plane only. Each serving router or frontend keeps its normal local KV event ingestion, gap detection, and worker-query recovery path; remote consumers only issue hash-based overlap queries.
Approximate mode (--no-router-kv-events) is singleton-only for remote serving: only one --serve-indexer replica may exist for a given worker component. Event-driven mode allows multiple serving replicas behind the same worker component.
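The singleton rule above can be checked before rollout. The following is a sketch of such a check for the replicas behind one worker component, not part of Dynamo itself: the Replica dataclass and check_topology function are hypothetical names, and the check encodes only the two constraints stated on this page.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    serve_indexer: bool = False       # launched with --serve-indexer
    use_remote_indexer: bool = False  # launched with --use-remote-indexer
    kv_events: bool = True            # False when --no-router-kv-events is set


def check_topology(replicas: list[Replica]) -> list[str]:
    """Return a list of problems for replicas sharing one worker component."""
    problems = []
    servers = [r for r in replicas if r.serve_indexer]
    consumers = [r for r in replicas if r.use_remote_indexer]
    if consumers and not servers:
        problems.append("consumers are configured but no replica serves the indexer")
    # Approximate mode is singleton-only for remote serving.
    if len(servers) > 1 and any(not r.kv_events for r in servers):
        problems.append("approximate mode allows only one --serve-indexer replica")
    return problems
```

For example, two serving replicas in event-driven mode pass the check, while the same pair launched with --no-router-kv-events is reported.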
The KV router tracks two types of state:
--router-durable-kv-events) it is backed by JetStream events and object store snapshots.
For the architecture behind these states, see Router Design.
The --router-replica-sync flag enables active block synchronization between replicas.
Without this flag, each replica maintains its own isolated view of active blocks, which can lead to suboptimal routing.
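To illustrate why isolated views can hurt routing, here is a toy model in which each replica picks the least-loaded worker according to its own active-block counts. It is a sketch under simplifying assumptions, not the router's actual synchronization protocol: ReplicaView, route, and demo are hypothetical names, and sync is modeled as a direct update of peer state.

```python
class ReplicaView:
    """One router replica's view of active blocks per worker."""

    def __init__(self, workers):
        self.active = {worker: 0 for worker in workers}
        self.peers = []  # other replicas to update when sync is enabled

    def route(self, blocks: int, sync: bool) -> str:
        # Pick the worker with the fewest active blocks in the local view.
        worker = min(self.active, key=self.active.get)
        self.active[worker] += blocks
        if sync:
            # Share the routing decision so peers see the added load.
            for peer in self.peers:
                peer.active[worker] += blocks
        return worker


def demo(sync: bool):
    workers = ["w0", "w1"]
    a, b = ReplicaView(workers), ReplicaView(workers)
    a.peers.append(b)
    b.peers.append(a)
    return a.route(10, sync), b.route(10, sync)


if __name__ == "__main__":
    print("isolated:", demo(sync=False))
    print("synced:  ", demo(sync=True))
```

In the isolated run both replicas send their request to the same worker, because the second replica never learns about the first decision; with sync enabled the second request goes to the other worker.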
Persistence behavior depends on the event transport mode.
For more on gap detection and replay, see KV Event Replay — Dynamo vs vLLM.
JetStream mode requires --router-durable-kv-events on both frontend and workers.
If you need to start with a fresh state in JetStream mode, you have two options:
- --router-reset-states, which purges the entire stream and radix snapshot. Only do this when launching the first router replica in a component, because it can bring existing replicas into an inconsistent state.
State persistence depends on the event transport mode:
- Approximate mode (--no-router-kv-events): State persistence is not supported.
Request-plane transport is independent of KV event transport. The request plane (DYN_REQUEST_PLANE or --request-plane) controls how requests reach workers. KV events use NATS in JetStream or NATS Core modes, or ZMQ when --event-plane zmq is set. With --event-plane zmq and --discovery-backend file or mem, the router can run without etcd or NATS. When using a NATS-based event plane, NATS is initialized automatically; set NATS_SERVER=nats://... to override the default localhost:4222.
When --router-kv-overlap-score-weight is set to 0, no KV indexer is created and prefix matching is disabled. When --no-router-kv-events is set, a KV indexer is still created but no event subscriber is launched; the router predicts cache state from its own routing decisions with TTL-based expiration.
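The approximate mode described above can be pictured as a table of predicted blocks with expiry times. The class below is a minimal sketch of that idea, assuming a flat dictionary rather than the router's real index structure; ApproxIndexer and its methods are hypothetical names, and the injectable clock exists only to make the behavior easy to test.

```python
import time


class ApproxIndexer:
    """Predict worker cache contents from routing decisions, with TTL expiry."""

    def __init__(self, ttl_secs: float, clock=time.monotonic):
        self.ttl_secs = ttl_secs
        self.clock = clock
        self._expiry = {}  # (worker, block_hash) -> time the prediction expires

    def record_routing(self, worker: str, block_hashes: list) -> None:
        # Assume the worker now caches every block of the routed request.
        deadline = self.clock() + self.ttl_secs
        for block_hash in block_hashes:
            self._expiry[(worker, block_hash)] = deadline

    def overlap(self, worker: str, block_hashes: list) -> int:
        """Count leading blocks still predicted to be cached on the worker."""
        now = self.clock()
        matched = 0
        for block_hash in block_hashes:
            if self._expiry.get((worker, block_hash), float("-inf")) > now:
                matched += 1
            else:
                break  # a prefix match ends at the first missing block
        return matched
```

Because no worker events arrive in this mode, the TTL is the only mechanism that removes stale predictions.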
Backend KV event publishing is independent of the frontend’s --no-router-kv-events flag. The frontend flag controls whether the router consumes events; backend flags control whether workers publish them. If the router is not consuming events, workers that still publish will waste resources but cause no harm.
- --kv-events-config '{"enable_kv_cache_events": false}' to disable, or '{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557"}' to enable.
- --kv-events-config with a JSON config to enable, or omit it to keep publishing disabled.
- --publish-events-and-metrics to enable, or omit it to keep publishing disabled.
The CLI arg --router-ttl-secs controls local cache prediction lifetime when the router operates without receiving events from workers. When workers are configured to publish KV events, the router relies on worker-side eviction events and this parameter is ignored.
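Hand-writing the JSON passed to --kv-events-config is error-prone because of shell quoting. As a sketch, the helper below generates the two argument strings shown above; kv_events_arg is a hypothetical name, and the keys and default endpoint are copied from the examples on this page rather than from a full schema.

```python
import json
import shlex


def kv_events_arg(enabled: bool, endpoint: str = "tcp://*:5557") -> str:
    """Return a shell-safe --kv-events-config argument."""
    config = {"enable_kv_cache_events": enabled}
    if enabled:
        # Publisher settings are only meaningful when events are enabled.
        config["publisher"] = "zmq"
        config["endpoint"] = endpoint
    return "--kv-events-config " + shlex.quote(json.dumps(config))


if __name__ == "__main__":
    print(kv_events_arg(False))
    print(kv_events_arg(True))
```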
--router-queue-threshold and the busy thresholds (--active-decode-blocks-threshold, --active-prefill-tokens-threshold, --active-prefill-tokens-threshold-frac) serve different purposes. Busy thresholds reject a worker entirely from the candidate set when it exceeds a utilization limit. In contrast, --router-queue-threshold defers the entire routing decision until at least one worker has capacity, so the request is routed with the freshest load metrics. The busy thresholds can be updated at runtime without restarting the frontend via the /busy_threshold HTTP endpoint. For details, see Request Rejection.
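The difference between the two mechanisms can be sketched in a few lines. This is a simplified model, not the router's implementation: load is represented as a single utilization fraction per worker, "has capacity" is taken to mean utilization below the queue threshold, and busy_filter and DeferringRouter are hypothetical names.

```python
from collections import deque


def busy_filter(load: dict, busy_threshold: float) -> list:
    """Busy threshold: drop workers whose utilization exceeds the limit."""
    return [worker for worker, used in load.items() if used <= busy_threshold]


class DeferringRouter:
    """Queue threshold: hold requests until some worker has capacity."""

    def __init__(self, queue_threshold: float):
        self.queue_threshold = queue_threshold
        self.pending = deque()

    def submit(self, request, load: dict):
        self.pending.append(request)
        return self.poll(load)

    def poll(self, load: dict):
        """Route at most one pending request using the load passed in now."""
        if not self.pending:
            return None
        if all(used >= self.queue_threshold for used in load.values()):
            return None  # no worker has capacity: the request stays queued
        # The decision is made late, so it uses the freshest load values.
        worker = min(load, key=load.get)
        return self.pending.popleft(), worker
```

A request submitted while every worker is above the queue threshold stays in pending, and a later poll with updated load routes it to whichever worker is least loaded at that moment.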