Standalone KV Indexer

Run the KV cache indexer as an independent HTTP service for querying block state

Overview

The standalone KV indexer (dynamo-kv-indexer) is a lightweight HTTP binary that subscribes to ZMQ KV event streams from workers, maintains a radix tree of cached blocks, and exposes HTTP endpoints for querying and managing workers.

This is distinct from the Standalone Router, which is a full routing service. The standalone indexer provides only the indexing and query layer without routing logic.

The HTTP API follows the Mooncake KV Indexer RFC conventions.

Multi-Model and Multi-Tenant Support

The indexer maintains one radix tree per (model_name, tenant_id) pair. Workers registered with different model names or tenant IDs are isolated into separate indexers — queries against one model/tenant never return scores from another.

  • model_name (required on /register and /query): Identifies the model. Workers serving different models get separate radix trees.
  • tenant_id (optional, defaults to "default"): Enables multi-tenant isolation within the same model. Omit for single-tenant deployments.
  • block_size is per-indexer: the first /register call for a given (model_name, tenant_id) sets the block size. Subsequent registrations for the same pair must use the same block size or the request will fail.
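The per-pair isolation and block-size rule can be sketched as follows. This is an illustrative Python model, not the actual Rust implementation; names like `IndexerRegistry` are hypothetical.

```python
# Hypothetical sketch: one indexer per (model_name, tenant_id) pair,
# with the first registration fixing the block size for that pair.
class IndexerRegistry:
    def __init__(self):
        # (model_name, tenant_id) -> {"block_size": int, "blocks": set}
        self._indexers = {}

    def register(self, model_name, block_size, tenant_id="default"):
        key = (model_name, tenant_id)
        existing = self._indexers.get(key)
        if existing is None:
            # First /register for this pair sets the block size.
            self._indexers[key] = {"block_size": block_size, "blocks": set()}
        elif existing["block_size"] != block_size:
            # Later registrations must agree, or the request fails.
            raise ValueError(f"block_size mismatch for {key}")
        return self._indexers[key]

registry = IndexerRegistry()
registry.register("llama-3-8b", 16)
registry.register("llama-3-8b", 16, tenant_id="customer-a")  # separate tree
try:
    registry.register("llama-3-8b", 32)  # conflicts with the first call
except ValueError as err:
    print(err)
```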

Compatibility

The standalone indexer works with any engine that publishes KV cache events over ZMQ in the expected msgpack format. This includes bare vLLM and SGLang engines, which emit ZMQ KV events natively — no Dynamo-specific wrapper is required.

Use Cases

  • Debugging: Inspect the radix tree state to verify which blocks are cached on which workers.
  • State verification: Confirm that the indexer’s view of KV cache state matches the router’s internal state (used in integration tests).
  • Custom routing: Build external routing logic that queries the indexer for overlap scores and makes its own worker selection decisions.
  • Monitoring: Observe KV cache distribution across workers without running a full router.

P2P Recovery

Multiple indexer replicas can subscribe to the same ZMQ worker endpoints for fault tolerance. When a replica starts (or restarts after a crash), it bootstraps its radix tree state from a healthy peer before processing live events.

How It Works

  1. Workers are registered via the --workers CLI flag, which connects ZMQ SUB sockets immediately.
  2. A 1-second delay ensures the peer’s tree state has advanced past the ZMQ connection point, so the dump covers any events that would otherwise be lost to the slow-joiner window.
  3. The indexer fetches a /dump from the first reachable peer in --peers.
  4. Dump events are applied to populate the radix tree.
  5. ZMQ listeners are unblocked and begin draining any events that buffered during recovery.

If no peers are reachable, the indexer starts with an empty state.
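The ordering of the steps above matters: the peer dump is applied before any buffered live events. A minimal sketch of that ordering, with hypothetical names standing in for the real Rust implementation:

```python
# Illustrative sketch of startup recovery: dump first, buffered events second.
def recover(dump_events, live_buffer):
    """Build tree state from a peer's dump, then drain events that
    arrived over ZMQ while recovery was in progress."""
    tree = []
    for ev in dump_events:      # step 4: populate from the peer's /dump
        tree.append(ev)
    while live_buffer:          # step 5: unblock listeners, drain buffer
        tree.append(live_buffer.pop(0))
    return tree

# Events 1-3 were already in the peer's tree; 4 and a removal arrived
# over ZMQ during recovery and were buffered.
state = recover(dump_events=["store:1", "store:2", "store:3"],
                live_buffer=["store:4", "remove:2"])
```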

Example: Two-Replica Setup

$# Replica A (first instance, no peers)
$dynamo-kv-indexer --port 8090 --block-size 16 \
> --workers "1=tcp://worker1:5557,2=tcp://worker2:5558"
$
$# Replica B (recovers from A on startup)
$dynamo-kv-indexer --port 8091 --block-size 16 \
> --workers "1=tcp://worker1:5557,2=tcp://worker2:5558" \
> --peers "http://localhost:8090"

Both replicas subscribe to the same workers. Replica B recovers A’s tree state on startup, then both independently process live ZMQ events going forward.

Consistency

The dump is a weakly consistent BFS snapshot of the radix tree — concurrent writes may race with the traversal. This is acceptable because:

  • Stale blocks (partially removed branches): live Remove events will clean them up.
  • Missing blocks (partially added branches): live Stored events will add them.
  • The tree converges to the correct state after live events catch up.
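A toy illustration of why the weak snapshot is safe: whether the dump caught a stale block or missed a new one, replaying the same live events converges both replicas to the same state. The event shapes here are simplified stand-ins for real Stored/Removed events.

```python
# Toy model: replay live Store/Remove events on top of a snapshot.
def apply(blocks, events):
    for kind, block in events:
        if kind == "store":
            blocks.add(block)
        else:  # "remove"
            blocks.discard(block)
    return blocks

live = [("remove", "A"), ("store", "B")]
# Snapshot raced with a Remove: stale block "A" is still present.
with_stale = apply({"root", "A"}, live)
# Snapshot raced with a Store: block "B" is missing.
with_missing = apply({"root"}, live)
# Both converge once live events catch up.
```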

Peer Management

Peers can be registered at startup via --peers or dynamically via the HTTP API. The peer list is used for recovery only — peers do not synchronize state in real time.

Building

The binary is a feature-gated target in the dynamo-kv-router crate:

$cargo build -p dynamo-kv-router --features indexer-bin --bin dynamo-kv-indexer

CLI

$dynamo-kv-indexer --port 8090 [--threads 4] [--block-size 16 --model-name my-model --tenant-id default --workers "1=tcp://host:5557,2:1=tcp://host:5558"] [--peers "http://peer1:8090,http://peer2:8091"]
| Flag | Default | Description |
| --- | --- | --- |
| --port | 8090 | HTTP server listen port |
| --threads | 4 | Number of indexer threads (1 = single-threaded, >1 = thread pool) |
| --block-size | (none) | KV cache block size for initial --workers (required when --workers is set) |
| --workers | (none) | Initial workers as instance_id[:dp_rank]=zmq_address,... pairs (dp_rank defaults to 0) |
| --model-name | default | Model name for initial --workers |
| --tenant-id | default | Tenant ID for initial --workers |
| --peers | (none) | Comma-separated peer indexer URLs for P2P recovery on startup |
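The --workers value follows the instance_id[:dp_rank]=zmq_address format described above. A sketch of how such a spec could be parsed (this parser is illustrative, not the binary's actual code):

```python
# Hypothetical parser for "instance_id[:dp_rank]=zmq_address,..." specs;
# dp_rank defaults to 0 when omitted.
def parse_workers(spec):
    workers = {}
    for pair in spec.split(","):
        ident, endpoint = pair.split("=", 1)   # split before the address,
                                               # which itself contains ":"
        if ":" in ident:
            instance_id, dp_rank = ident.split(":", 1)
        else:
            instance_id, dp_rank = ident, "0"
        workers[(int(instance_id), int(dp_rank))] = endpoint
    return workers

parsed = parse_workers("1=tcp://host:5557,2:1=tcp://host:5558")
# {(1, 0): "tcp://host:5557", (2, 1): "tcp://host:5558"}
```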

HTTP API

GET /health — Liveness check

Returns 200 OK unconditionally.

$curl http://localhost:8090/health

GET /metrics — Prometheus metrics

Returns metrics in Prometheus text exposition format. Available when the binary is built with the metrics feature (enabled by default via standalone-indexer).

$curl http://localhost:8090/metrics
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dynamo_kvindexer_request_duration_seconds | Histogram | endpoint | HTTP request latency |
| dynamo_kvindexer_requests_total | Counter | endpoint, method | Total HTTP requests |
| dynamo_kvindexer_errors_total | Counter | endpoint, status_class | HTTP error responses (4xx/5xx) |
| dynamo_kvindexer_models | Gauge | — | Number of active model+tenant indexers |
| dynamo_kvindexer_workers | Gauge | — | Number of registered worker instances |

POST /register — Register an endpoint

Register a ZMQ endpoint for an instance. Each call creates or reuses the indexer for the given (model_name, tenant_id) pair.

$# Single model, default tenant
$curl -X POST http://localhost:8090/register \
> -H 'Content-Type: application/json' \
> -d '{
> "instance_id": 1,
> "endpoint": "tcp://127.0.0.1:5557",
> "model_name": "llama-3-8b",
> "block_size": 16
> }'
$
$# With tenant isolation
$curl -X POST http://localhost:8090/register \
> -H 'Content-Type: application/json' \
> -d '{
> "instance_id": 2,
> "endpoint": "tcp://127.0.0.1:5558",
> "model_name": "llama-3-8b",
> "tenant_id": "customer-a",
> "block_size": 16,
> "dp_rank": 0
> }'
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| instance_id | yes | — | Worker instance identifier |
| endpoint | yes | — | ZMQ PUB address to subscribe to |
| model_name | yes | — | Model name (used to select the indexer) |
| block_size | yes | — | KV cache block size (must match the engine) |
| tenant_id | no | "default" | Tenant identifier for isolation |
| dp_rank | no | 0 | Data parallel rank |
| replay_endpoint | no | — | ZMQ ROUTER address for gap replay (e.g. tcp://host:5560) |

POST /unregister — Deregister an instance

Remove an instance. Omitting tenant_id removes the instance from all tenants for the given model; providing it targets only that tenant’s indexer.

$# Remove from all tenants
$curl -X POST http://localhost:8090/unregister \
> -H 'Content-Type: application/json' \
> -d '{"instance_id": 1, "model_name": "llama-3-8b"}'
$
$# Remove from a specific tenant
$curl -X POST http://localhost:8090/unregister \
> -H 'Content-Type: application/json' \
> -d '{"instance_id": 1, "model_name": "llama-3-8b", "tenant_id": "customer-a"}'
$
$# Remove a specific dp_rank
$curl -X POST http://localhost:8090/unregister \
> -H 'Content-Type: application/json' \
> -d '{"instance_id": 1, "model_name": "llama-3-8b", "tenant_id": "default", "dp_rank": 0}'
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| instance_id | yes | — | Worker instance to remove |
| model_name | yes | — | Model name (identifies the indexer) |
| tenant_id | no | — | Tenant identifier (omit to remove from all tenants) |
| dp_rank | no | — | Specific dp_rank to remove (omit to remove all) |

GET /workers — List registered instances

$curl http://localhost:8090/workers

Returns:

[{"instance_id": 1, "endpoints": {"0": "tcp://127.0.0.1:5557", "1": "tcp://127.0.0.1:5558"}}]

POST /query — Query overlap for token IDs

Given raw token IDs, compute block hashes and return per-instance overlap scores (in matched tokens):

$curl -X POST http://localhost:8090/query \
> -H 'Content-Type: application/json' \
> -d '{"token_ids": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], "model_name": "llama-3-8b"}'

Returns:

{
  "scores": {"1": {"0": 32}, "2": {"1": 0}},
  "frequencies": [1, 1],
  "tree_sizes": {"1": {"0": 5}, "2": {"1": 3}}
}

Scores are in matched tokens (block overlap count × block size), nested by instance_id and then dp_rank.

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| token_ids | yes | — | Token sequence to query |
| model_name | yes | — | Model name (selects the indexer) |
| tenant_id | no | "default" | Tenant identifier |
| lora_name | no | — | LoRA adapter (overrides indexer-level lora_name for this query) |
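The score shape in the example response above (matched tokens, nested by instance_id then dp_rank) can be modeled as follows. This is a simplified sketch: it treats each worker's cache as a flat set of block hashes and matches the query prefix against it, rather than walking the real radix tree.

```python
# Illustrative score computation: matched blocks × block size,
# nested {instance_id: {dp_rank: score_in_tokens}}.
BLOCK_SIZE = 16  # assumed; set per-indexer in the real service

def overlap_scores(query_hashes, workers):
    """workers: {instance_id: {dp_rank: set of cached block hashes}}"""
    scores = {}
    for instance_id, ranks in workers.items():
        scores[instance_id] = {}
        for dp_rank, cached in ranks.items():
            matched = 0
            for h in query_hashes:  # prefix match: stop at first miss
                if h not in cached:
                    break
                matched += 1
            scores[instance_id][dp_rank] = matched * BLOCK_SIZE
    return scores

scores = overlap_scores([101, 102], {1: {0: {101, 102}}, 2: {1: set()}})
# e.g. {1: {0: 32}, 2: {1: 0}} — two matched blocks on instance 1, none on 2
```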

POST /query_by_hash — Query overlap for pre-computed hashes

$curl -X POST http://localhost:8090/query_by_hash \
> -H 'Content-Type: application/json' \
> -d '{"block_hashes": [123456, 789012], "model_name": "llama-3-8b"}'

Same response format as /query. Scores are in matched tokens.

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| block_hashes | yes | — | Pre-computed block hash array |
| model_name | yes | — | Model name (selects the indexer) |
| tenant_id | no | "default" | Tenant identifier |

GET /dump — Dump all radix tree events

Returns the full radix tree state as a JSON object keyed by model_name:tenant_id:

$curl http://localhost:8090/dump

Returns:

{
  "llama-3-8b:default": {
    "block_size": 16,
    "events": [<RouterEvent>, ...]
  },
  "mistral-7b:customer-a": {
    "block_size": 16,
    "events": [<RouterEvent>, ...]
  }
}

Each indexer is dumped concurrently. The block_size field lets recovering peers create indexers with the correct block size without requiring --block-size on every replica.
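A recovering replica can consume this payload by splitting each model_name:tenant_id key and creating an indexer with the dumped block size. A sketch under those assumptions (the function name and structures are hypothetical):

```python
# Hypothetical consumer of the /dump response shown above.
def apply_dump(dump):
    indexers = {}
    for key, entry in dump.items():
        # Keys are "model_name:tenant_id"; tenant IDs may contain no ":",
        # so split once from the left on the model/tenant separator.
        model_name, tenant_id = key.split(":", 1)
        indexers[(model_name, tenant_id)] = {
            "block_size": entry["block_size"],  # no --block-size needed
            "events": list(entry["events"]),    # replayed into the tree
        }
    return indexers

idx = apply_dump({"llama-3-8b:default": {"block_size": 16, "events": []}})
```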

POST /register_peer — Register a peer indexer

$curl -X POST http://localhost:8090/register_peer \
> -H 'Content-Type: application/json' \
> -d '{"url": "http://peer:8091"}'

POST /deregister_peer — Remove a peer indexer

$curl -X POST http://localhost:8090/deregister_peer \
> -H 'Content-Type: application/json' \
> -d '{"url": "http://peer:8091"}'

GET /peers — List registered peers

$curl http://localhost:8090/peers

Returns:

["http://peer:8091"]

DP Rank Handling

When a worker registers with the standalone KV indexer (/register), it provides an instance_id, a ZMQ endpoint, and an optional dp_rank (defaults to 0). The service spawns one ZMQ listener per registration.

Each incoming KvEventBatch may carry an optional data_parallel_rank field. If present, it overrides the statically-registered dp_rank for that batch. This allows a single ZMQ port to multiplex events from multiple DP ranks.

Caveat: the registry only tracks dp_ranks from explicit /register calls. If an engine dynamically emits batches with a dp_rank that was never registered, the indexer will store those blocks correctly (under the dynamic WorkerWithDpRank key), but per-dp_rank deregistration (/unregister with dp_rank) will not find them. Full-instance deregistration (/unregister without dp_rank) still cleans up all dp_ranks for a given worker_id in the tree via remove_worker.
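The per-batch override rule reduces to a small resolution step, sketched here with hypothetical names:

```python
# Illustrative dp_rank resolution: a batch-level data_parallel_rank,
# if present, overrides the dp_rank supplied at /register time.
def resolve_dp_rank(registered_dp_rank, batch):
    dynamic = batch.get("data_parallel_rank")
    return dynamic if dynamic is not None else registered_dp_rank

rank_a = resolve_dp_rank(0, {"data_parallel_rank": 3})  # batch wins: 3
rank_b = resolve_dp_rank(0, {})                         # fallback: 0
```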

Gap Detection and Replay

ZMQ PUB/SUB is lossy — messages can be dropped under backpressure or brief disconnects. The indexer detects gaps by tracking the sequence number of each batch: if seq > last_seq + 1, a gap is detected.

When a replay_endpoint is provided during /register, the indexer connects a DEALER socket to the engine’s ROUTER socket and requests the missing batches by sequence number. The engine streams back buffered (seq, payload) pairs from its ring buffer until an empty-payload sentinel.

If no replay_endpoint is configured, gaps are logged as warnings but not recovered.

The sequence counter (last_seq) persists across unregister/register cycles, so re-registering a worker after a gap will trigger replay on the first batch received by the new listener.
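The detection rule (seq > last_seq + 1) can be sketched as a small per-worker tracker. The class name is illustrative; the real indexer is in Rust.

```python
# Illustrative per-worker sequence tracking and gap detection.
class GapDetector:
    def __init__(self):
        self.last_seq = None  # persists across unregister/register cycles

    def observe(self, seq):
        """Return the missing sequence numbers to request via replay."""
        missing = []
        if self.last_seq is not None and seq > self.last_seq + 1:
            missing = list(range(self.last_seq + 1, seq))
        self.last_seq = seq
        return missing

det = GapDetector()
det.observe(1)        # first batch, nothing to compare against
det.observe(2)        # contiguous, no gap
gap = det.observe(5)  # [3, 4] — request these batches from the engine
```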

Limitations

  • ZMQ only: Workers must publish KV events via ZMQ PUB sockets. The standalone indexer does not subscribe to NATS event streams.
  • No routing logic: The indexer only maintains the radix tree and answers queries. It does not track active blocks, manage request lifecycle, or perform worker selection.

Architecture

P2P Recovery Flow

(P2P recovery flow diagram)
