Standalone Selection Service

Select workers and account for reservations without forwarding inference requests

Overview

The standalone selection service (python -m dynamo.select_service) exposes the KV router’s worker selection and active-load accounting over HTTP. It does not forward model requests or own response streams. External runtimes such as Ray register their worker catalog, request a selection, contact the selected worker, and report the reservation lifecycle.

The service combines:

  • KV overlap indexing from worker ZMQ events.
  • KV-aware and load-aware worker selection.
  • Explicit or atomic selection and reservation.
  • Best-effort active-load synchronization between selector replicas.
  • Startup KV index recovery from another selector or standalone indexer.

Build And Launch

Build the Python bindings with the select-service feature:

$ cd lib/bindings/python
$ VIRTUAL_ENV=../../../.venv ../../../.venv/bin/maturin develop --uv --features select-service

Launch the service from the repository root:

$ .venv/bin/python -m dynamo.select_service --port 8092

The service binds to 0.0.0.0 and does not provide authentication. Run it on a trusted internal network or place it behind an appropriate network policy.

CLI

| Flag | Default | Description |
| --- | --- | --- |
| --port | 8092 | HTTP server port. |
| --threads | 4 | KV indexer worker threads. |
| --indexer-peers | none | Comma-separated HTTP URLs used for startup KV recovery through /dump. |
| --replica-sync-port | none | Local ZMQ PUB port for active-load lifecycle events. The selector binds tcp://*:<port> internally. |
| --replica-sync-peers | none | Comma-separated ZMQ PUB endpoints for selector peers. Requires --replica-sync-port. |

Router scheduling behavior continues to use the standard Dynamo router environment configuration.

Worker Registration

Every selector replica must receive the same worker catalog before it serves selection traffic. Replica traffic never creates workers.

POST /workers
Content-Type: application/json

{
  "worker_id": 1,
  "model_name": "model",
  "tenant_id": "default",
  "endpoint": "http://worker:8000",
  "block_size": 16,
  "data_parallel_start_rank": 0,
  "data_parallel_size": 2,
  "kv_events_endpoints": {
    "0": "tcp://worker:5557",
    "1": "tcp://worker:5558"
  },
  "replay_endpoint": "tcp://worker:5560"
}

POST /workers returns 201. PATCH /workers/{worker_id} updates supplied fields, DELETE /workers/{worker_id} removes the worker, and GET /workers lists catalog state. model_name and tenant_id scope all selection, indexer, and load state; both default to "default" when omitted.
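Because every replica needs the same catalog, a client can build one payload and fan it out. The following is a minimal sketch using only the Python standard library. The replica hostnames and the `build_worker_request` helper are hypothetical, and the requests are constructed but not sent.

```python
import json
import urllib.request

# Hypothetical replica base URLs; the port matches the documented default.
REPLICAS = ["http://selector-a:8092", "http://selector-b:8092"]


def build_worker_request(worker: dict, base_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /workers request for one replica."""
    return urllib.request.Request(
        url=f"{base_url}/workers",
        data=json.dumps(worker).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


worker = {
    "worker_id": 1,
    "model_name": "model",
    "tenant_id": "default",
    "endpoint": "http://worker:8000",
    "block_size": 16,
    "data_parallel_start_rank": 0,
    "data_parallel_size": 2,
    "kv_events_endpoints": {"0": "tcp://worker:5557", "1": "tcp://worker:5558"},
    "replay_endpoint": "tcp://worker:5560",
}

# One identical registration per replica.
requests = [build_worker_request(worker, url) for url in REPLICAS]
```

Each request can be passed to `urllib.request.urlopen`; a successful registration is expected to return 201.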

GET /health reports process liveness. GET /ready returns 200 only after at least one worker is schedulable; otherwise it returns 503 with lifecycle details.
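A client that must not route traffic before the selector is ready can poll the readiness endpoint. This is a sketch under stated assumptions: `ready_status` and `wait_until_ready` are hypothetical helper names, and the retry count and delay are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ready_status(base_url: str) -> int:
    """Return the HTTP status of GET /ready, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while no worker is schedulable
    except urllib.error.URLError:
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(probe: Callable[[], int], attempts: int = 30,
                     delay_s: float = 1.0) -> bool:
    """Call probe() until it returns 200 or the attempts are exhausted."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Typical use would be `wait_until_ready(lambda: ready_status("http://localhost:8092"))`. Passing the probe as a callable keeps the polling loop testable without a running service.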

Selection API

POST /select

Select a worker without booking active load:

{
  "selection_id": "select-123",
  "model_name": "model",
  "tenant_id": "default",
  "block_hashes": [11, 12, 13, 14, 15, 16, 17, 18],
  "sequence_hashes": [21, 22, 23, 24, 25, 26, 27, 28],
  "isl_tokens": 512
}

POST /select_and_reserve

Select and atomically book load in the receiving selector process. Supply a globally unique reservation_id, or allow the service to generate one:

{
  "selection_id": "select-123",
  "reservation_id": "request-123",
  "model_name": "model",
  "tenant_id": "default",
  "block_hashes": [11, 12, 13, 14, 15, 16, 17, 18],
  "sequence_hashes": [21, 22, 23, 24, 25, 26, 27, 28],
  "isl_tokens": 512
}
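The two request bodies differ only in the optional reservation_id, so one builder can serve both endpoints. The sketch below is illustrative: `build_select_body` is a hypothetical helper, treating selection_id as optional in requests is an assumption, and `uuid.uuid4()` is just one way to obtain a globally unique reservation ID.

```python
import uuid
from typing import Optional, Sequence


def build_select_body(
    block_hashes: Sequence[int],
    sequence_hashes: Sequence[int],
    isl_tokens: int,
    model_name: str = "default",
    tenant_id: str = "default",
    selection_id: Optional[str] = None,
    reservation_id: Optional[str] = None,
) -> dict:
    """Build a JSON body for POST /select or POST /select_and_reserve."""
    body = {
        "model_name": model_name,
        "tenant_id": tenant_id,
        "block_hashes": list(block_hashes),
        "sequence_hashes": list(sequence_hashes),
        "isl_tokens": isl_tokens,
    }
    # Optional identifiers are left out entirely rather than sent as null.
    if selection_id is not None:
        body["selection_id"] = selection_id
    if reservation_id is not None:
        body["reservation_id"] = reservation_id
    return body


prompt = dict(
    block_hashes=[11, 12, 13, 14, 15, 16, 17, 18],
    sequence_hashes=[21, 22, 23, 24, 25, 26, 27, 28],
    isl_tokens=512,
    model_name="model",
)

# Body for POST /select: no load is booked.
select_body = build_select_body(**prompt, selection_id="select-123")

# Body for POST /select_and_reserve with a client-generated reservation ID.
reserve_body = build_select_body(
    **prompt, selection_id="select-123", reservation_id=str(uuid.uuid4())
)
```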

Both endpoints return the same selection shape:

{
  "selection_id": "select-123",
  "reservation_id": "request-123",
  "model_name": "model",
  "tenant_id": "default",
  "worker_id": 1,
  "dp_rank": 0,
  "endpoint": "http://worker:8000",
  "block_size": 16,
  "overlap": {
    "longest_matched": 128,
    "gpu": 64,
    "dp": {"0": 64, "1": 32},
    "cpu": 96,
    "disk": 128
  },
  "effective_prefill_tokens": 384
}

selection_id and reservation_id are omitted when absent. All overlap values are matched token counts. gpu, cpu, and disk use the cumulative Mooncake tier semantics documented in the standalone indexer’s per-instance tier breakdown. In a zero-overlap response, the dp map still includes the selected dp_rank, with a value of 0.

The overlap summary is raw observability. effective_prefill_tokens is the authoritative weighted prefill-load value computed by the same cache-credit formula used for scheduler booking. It is not derived from longest_matched.
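A client can keep the routing fields and the authoritative effective_prefill_tokens while treating the overlap summary as diagnostics only. The following sketch parses a response of the documented shape; the `Selection` dataclass and `parse_selection` function are hypothetical names, not part of the service.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Selection:
    worker_id: int
    dp_rank: int
    endpoint: str
    effective_prefill_tokens: int
    selection_id: Optional[str] = None    # omitted by the service when absent
    reservation_id: Optional[str] = None  # omitted by the service when absent


def parse_selection(payload: dict) -> Selection:
    """Extract the fields needed to contact the worker and book load."""
    return Selection(
        worker_id=payload["worker_id"],
        dp_rank=payload["dp_rank"],
        endpoint=payload["endpoint"],
        # Taken as-is from the service; never recomputed from overlap values.
        effective_prefill_tokens=payload["effective_prefill_tokens"],
        selection_id=payload.get("selection_id"),
        reservation_id=payload.get("reservation_id"),
    )


# A POST /select style response: no reservation_id field is present.
raw = """{
  "selection_id": "select-123", "model_name": "model", "tenant_id": "default",
  "worker_id": 1, "dp_rank": 0, "endpoint": "http://worker:8000",
  "block_size": 16,
  "overlap": {"longest_matched": 128, "gpu": 64, "dp": {"0": 64, "1": 32},
              "cpu": 96, "disk": 128},
  "effective_prefill_tokens": 384
}"""
selection = parse_selection(json.loads(raw))
```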

The previous public fields cached_tokens and effective_overlap_blocks have been removed. Their values remain internal scheduler inputs.

Ray Select-Then-Reserve Flow

Ray can keep model invocation separate from selector admission:

  1. Call POST /select.
  2. Send the request to the returned endpoint and dp_rank.
  3. Call POST /reservations with a globally unique reservation ID, selected worker identity, the same prompt representation, and the returned effective_prefill_tokens.
  4. Report prefill completion and request completion through the lifecycle API.

POST /reservations
Content-Type: application/json

{
  "reservation_id": "request-123",
  "model_name": "model",
  "tenant_id": "default",
  "worker_id": 1,
  "dp_rank": 0,
  "sequence_hashes": [21, 22, 23, 24, 25, 26, 27, 28],
  "isl_tokens": 512,
  "effective_prefill_tokens": 384
}

When supplied, effective_prefill_tokens is authoritative and directly enables prefill-load tracking. It must not exceed the normalized input sequence length. When omitted, existing router configuration controls prefill tracking. The reservation API does not accept or derive accounting from overlap fields.
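Step 3 of the flow combines fields from the original select request with fields from the selection response. A minimal sketch of that merge follows. `build_reservation_body` is a hypothetical helper, and its check against isl_tokens is only a client-side approximation of the service's limit on the normalized input sequence length.

```python
def build_reservation_body(select_body: dict, selection: dict,
                           reservation_id: str) -> dict:
    """Combine a /select request and its response into a /reservations body."""
    isl_tokens = select_body["isl_tokens"]
    effective = selection["effective_prefill_tokens"]
    if effective > isl_tokens:
        raise ValueError(
            f"effective_prefill_tokens ({effective}) exceeds isl_tokens ({isl_tokens})"
        )
    return {
        "reservation_id": reservation_id,
        "model_name": selection["model_name"],
        "tenant_id": selection["tenant_id"],
        # Worker identity comes from the selection response.
        "worker_id": selection["worker_id"],
        "dp_rank": selection["dp_rank"],
        # The prompt representation is the one sent to /select.
        "sequence_hashes": select_body["sequence_hashes"],
        "isl_tokens": isl_tokens,
        # Forwarded unchanged; overlap fields are deliberately not included.
        "effective_prefill_tokens": effective,
    }


select_body = {
    "model_name": "model",
    "tenant_id": "default",
    "block_hashes": [11, 12, 13, 14, 15, 16, 17, 18],
    "sequence_hashes": [21, 22, 23, 24, 25, 26, 27, 28],
    "isl_tokens": 512,
}
selection = {
    "model_name": "model",
    "tenant_id": "default",
    "worker_id": 1,
    "dp_rank": 0,
    "endpoint": "http://worker:8000",
    "effective_prefill_tokens": 384,
}
reservation = build_reservation_body(select_body, selection, "request-123")
```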

Reservation Lifecycle

POST /reservations/{reservation_id}/prefill_complete
POST /reservations/{reservation_id}/output_block
DELETE /reservations/{reservation_id}

prefill_complete clears active prefill load. output_block updates only the receiving selector’s local decode-block accounting and accepts an optional decay_fraction in [0.0, 1.0]. DELETE frees the reservation.

NOTE: Output-block updates are intentionally not replica-synchronized. They can occur at high frequency, and broadcasting them would consume disproportionate network bandwidth.
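The three lifecycle calls can be wrapped in small request builders. This is a sketch only: the base URL and function names are hypothetical, and two details are assumptions not confirmed above, namely that decay_fraction travels in a JSON body and that prefill_complete accepts an empty body.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

SELECTOR_URL = "http://localhost:8092"  # hypothetical base URL


def _reservation_url(reservation_id: str, suffix: str = "") -> str:
    # Percent-encode the ID so characters such as "/" cannot alter the path.
    return f"{SELECTOR_URL}/reservations/{quote(reservation_id, safe='')}{suffix}"


def prefill_complete_request(reservation_id: str) -> urllib.request.Request:
    """Clears active prefill load. Assumes an empty request body is accepted."""
    url = _reservation_url(reservation_id, "/prefill_complete")
    return urllib.request.Request(url, data=b"", method="POST")


def output_block_request(reservation_id: str,
                         decay_fraction: Optional[float] = None) -> urllib.request.Request:
    """Local decode-block update. Assumes decay_fraction is sent as JSON."""
    if decay_fraction is not None and not 0.0 <= decay_fraction <= 1.0:
        raise ValueError("decay_fraction must be within [0.0, 1.0]")
    payload = {} if decay_fraction is None else {"decay_fraction": decay_fraction}
    return urllib.request.Request(
        _reservation_url(reservation_id, "/output_block"),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def free_request(reservation_id: str) -> urllib.request.Request:
    """Frees the reservation."""
    return urllib.request.Request(_reservation_url(reservation_id), method="DELETE")
```

Since output-block updates stay local, it seems sensible to send them to the same selector replica that holds the reservation.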

Peer Planes

The selector has two independent peer configurations:

| Plane | Transport | Flags | Purpose |
| --- | --- | --- | --- |
| Indexer recovery | HTTP | --indexer-peers | Fetch a compatible /dump during startup and replay KV events into the local indexer. |
| Replica synchronization | ZMQ | --replica-sync-port, --replica-sync-peers | Share admission, prefill-complete, and free events by model and tenant. |

Example:

$ .venv/bin/python -m dynamo.select_service \
    --port 8092 \
    --indexer-peers http://selector-b:8092 \
    --replica-sync-port 9092 \
    --replica-sync-peers 'tcp://selector-b:9092'

Configure the reverse peer direction on selector B for bidirectional lifecycle synchronization. GET /dump exposes the selector’s current indexer snapshot in the same recovery format as the standalone indexer.
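Because each direction must be configured separately, it can help to generate the launch arguments for every replica from one host list. The sketch below uses only the flags documented above. `launch_args` and the hostnames are hypothetical, and listing every other replica as an indexer-recovery peer is an assumption about a reasonable topology, not a requirement.

```python
from typing import List, Sequence


def launch_args(name: str, hosts: Sequence[str],
                http_port: int = 8092, sync_port: int = 9092) -> List[str]:
    """Arguments for `python ...` that peer replica `name` with all other hosts."""
    peers = [host for host in hosts if host != name]
    args = [
        "-m", "dynamo.select_service",
        "--port", str(http_port),
        "--replica-sync-port", str(sync_port),
    ]
    if peers:
        # HTTP plane: startup KV recovery through each peer's /dump.
        args += ["--indexer-peers",
                 ",".join(f"http://{peer}:{http_port}" for peer in peers)]
        # ZMQ plane: subscribe to each peer's lifecycle events.
        args += ["--replica-sync-peers",
                 ",".join(f"tcp://{peer}:{sync_port}" for peer in peers)]
    return args


hosts = ["selector-a", "selector-b"]
args_a = launch_args("selector-a", hosts)
args_b = launch_args("selector-b", hosts)
```

Here `args_a` points at selector-b and `args_b` points back at selector-a, which gives the bidirectional lifecycle synchronization described above. Either list can be passed to `subprocess.run([".venv/bin/python", *args])`.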

Replica-sync peers may also be changed without restarting the selector:

POST /replica_sync/register_peer
Content-Type: application/json

{"endpoint":"tcp://selector-b:9092"}

The same body is accepted by POST /replica_sync/deregister_peer. GET /replica_sync/peers returns the sorted configured endpoints. Dynamic membership is in-memory; after restart, only peers supplied through --replica-sync-peers are restored. These routes only manage live ZMQ replica-sync peers. They do not alter the HTTP indexer-recovery peers.
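Since dynamic membership is lost on restart, an operator script may need to reconcile the live peer set against a desired one. The sketch below computes which endpoints to register and which to deregister. `plan_peer_changes` and `peer_body` are hypothetical helpers, and the current list is assumed to have already been extracted from the GET /replica_sync/peers response, whose exact JSON shape is not specified here.

```python
import json
from typing import Iterable, List, Tuple


def plan_peer_changes(current: Iterable[str],
                      desired: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (to_register, to_deregister) as sorted endpoint lists."""
    current_set, desired_set = set(current), set(desired)
    to_register = sorted(desired_set - current_set)
    to_deregister = sorted(current_set - desired_set)
    return to_register, to_deregister


def peer_body(endpoint: str) -> str:
    """JSON body accepted by both register_peer and deregister_peer."""
    return json.dumps({"endpoint": endpoint})


current = ["tcp://selector-b:9092", "tcp://selector-c:9092"]
desired = ["tcp://selector-b:9092", "tcp://selector-d:9092"]
to_register, to_deregister = plan_peer_changes(current, desired)
```

Each endpoint in `to_register` would be posted to /replica_sync/register_peer and each one in `to_deregister` to /replica_sync/deregister_peer, using `peer_body(endpoint)` as the request body.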

Consistency Invariants

  • Replica synchronization is bounded and best-effort. Delays, reordering, dropped events, and temporary active-load divergence are accepted.
  • There is no sequencing, acknowledgement, replay, backpressure, or resynchronization for replica lifecycle events.
  • Unknown worker, model, tenant, DP-rank, and block-size events are dropped. Register the same worker catalog on every selector before routing traffic.
  • Admission, prefill-complete, and free are synchronized. Output-block growth remains local to avoid excessive network bandwidth.
  • Startup recovery waits for recovered events to be submitted to the indexer, not for complete processing. Early selections may temporarily miss recovered KV state.
  • /select followed by /reservations provides eventual, not atomic, cross-replica admission. Use /select_and_reserve for atomic local booking.
  • Reservation IDs must be globally unique. Existing conflict and retry semantics are unchanged; no idempotency ledger is added.

Inspection APIs

  • GET /loads returns active-load snapshots, optionally filtered by model_name and tenant_id.
  • POST /potential_loads estimates worker load for a prompt without selection.
  • POST /overlap_scores returns per-worker/per-rank tiered overlap rows.
  • GET /dump returns the compatible indexer recovery snapshot.
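As an illustration of the optional filters on GET /loads, the sketch below builds the request URL. It assumes the filters are passed as query parameters, which the list above does not state explicitly; `loads_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode


def loads_url(base_url: str, model_name: Optional[str] = None,
              tenant_id: Optional[str] = None) -> str:
    """Build a GET /loads URL, adding only the filters that were supplied."""
    filters = (("model_name", model_name), ("tenant_id", tenant_id))
    params = {key: value for key, value in filters if value is not None}
    query = urlencode(params)
    return f"{base_url}/loads" + (f"?{query}" if query else "")


all_loads = loads_url("http://localhost:8092")
scoped_loads = loads_url("http://localhost:8092", model_name="model",
                         tenant_id="default")
```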