Standalone Selection Service
Overview
The standalone selection service (python -m dynamo.select_service) exposes the
KV router’s worker selection and active-load accounting over HTTP. It does not
forward model requests or own response streams. External runtimes such as Ray
register their worker catalog, request a selection, contact the selected worker,
and report the reservation lifecycle.
The service combines:
- KV overlap indexing from worker ZMQ events.
- KV-aware and load-aware worker selection.
- Explicit or atomic selection and reservation.
- Best-effort active-load synchronization between selector replicas.
- Startup KV index recovery from another selector or standalone indexer.
Build And Launch
Build the Python bindings with the select-service feature:
Launch the service from the repository root:
The service binds to 0.0.0.0 and does not provide authentication. Run it on a
trusted internal network or place it behind an appropriate network policy.
CLI
Router scheduling behavior continues to use the standard Dynamo router environment configuration.
Worker Registration
Every selector replica must receive the same worker catalog before it serves selection traffic. Replica traffic never creates workers.
POST /workers registers a worker and returns 201. PATCH /workers/{worker_id} updates supplied
fields, DELETE /workers/{worker_id} removes the worker, and GET /workers
lists catalog state. model_name and tenant_id scope all selection, indexer,
and load state; both default to "default" when omitted.
GET /health reports process liveness. GET /ready returns 200 only after at
least one worker is schedulable; otherwise it returns 503 with lifecycle details.
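The readiness contract above lends itself to a small polling helper. The following is a minimal sketch, not part of the service: `http_status` and `wait_until_ready` are hypothetical helper names, and the timeout and interval defaults are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status code of a GET, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 503; the status code is still what we want.
        return err.code


def wait_until_ready(
    get_status: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until get_status() returns 200 or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == 200:
            return True  # at least one worker is schedulable
        if clock() >= deadline:
            return False  # still not ready, e.g. 503
        sleep(interval_s)
```

A caller might use `wait_until_ready(lambda: http_status(ready_url))`, where `ready_url` points at the selector's `/ready` route on whatever host and port the deployment uses.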
Selection API
POST /select
Select a worker without booking active load:
POST /select_and_reserve
Select and atomically book load in the receiving selector process. Supply a
globally unique reservation_id, or allow the service to generate one:
Both endpoints return the same selection shape:
selection_id and reservation_id are omitted when absent. All overlap
values are matched token counts. gpu, cpu, and disk use the cumulative
Mooncake tier semantics documented in the standalone indexer’s
per-instance tier breakdown.
A zero-overlap response includes the selected dp_rank with value 0.
The overlap summary is raw observability data. effective_prefill_tokens is the
authoritative weighted prefill-load value computed by the same cache-credit
formula used for scheduler booking. It is not derived from longest_matched.
The previous public fields cached_tokens and effective_overlap_blocks have
been removed. Their values remain internal scheduler inputs.
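A client can reflect these rules by treating the two IDs as optional and by reading effective_prefill_tokens directly rather than recomputing it from overlap values. This is a minimal sketch under one loud assumption: the keys are taken to sit at the top level of the JSON response, which the text above does not confirm. `Selection` and `parse_selection` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Selection:
    endpoint: str
    dp_rank: int
    effective_prefill_tokens: int
    selection_id: Optional[str] = None    # omitted by the service when absent
    reservation_id: Optional[str] = None  # omitted by the service when absent


def parse_selection(body: Mapping[str, Any]) -> Selection:
    # ASSUMPTION: flat, top-level keys; adjust to the real response layout.
    return Selection(
        endpoint=body["endpoint"],
        dp_rank=int(body["dp_rank"]),
        # Authoritative load value; never derived from overlap fields here.
        effective_prefill_tokens=int(body["effective_prefill_tokens"]),
        selection_id=body.get("selection_id"),
        reservation_id=body.get("reservation_id"),
    )
```

Using `.get` for the IDs means a plain /select response, which carries no reservation, parses without special cases.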
Ray Select-Then-Reserve Flow
Ray can keep model invocation separate from selector admission:
- Call POST /select.
- Send the request to the returned endpoint and dp_rank.
- Call POST /reservations with a globally unique reservation ID, the selected worker identity, the same prompt representation, and the returned effective_prefill_tokens.
- Report prefill completion and request completion through the lifecycle API.
When supplied, effective_prefill_tokens is authoritative and directly enables
prefill-load tracking. It must not exceed the normalized input sequence length.
When omitted, existing router configuration controls prefill tracking. The
reservation API does not accept or derive accounting from overlap fields.
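The select-then-reserve steps can be sketched as a single client-side function. Everything below is illustrative: `post` and `send_to_worker` are caller-supplied callables standing in for an HTTP client and the Ray invocation, `prompt_fields` is an opaque dict holding whatever prompt representation the API expects, and the `worker_id` key used for worker identity is an assumption rather than a documented field.

```python
import uuid
from typing import Any, Callable, Dict, Mapping

# (path, json_body) -> parsed JSON response; supply your own HTTP client.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def select_then_reserve(
    post: Post,
    prompt_fields: Mapping[str, Any],
    input_len: int,
    send_to_worker: Callable[[str, int], None],
) -> str:
    """Run select, dispatch, and reserve; return the reservation ID."""
    # Step 1: pick a worker without booking active load.
    selection = post("/select", dict(prompt_fields))
    effective = selection["effective_prefill_tokens"]
    # The value must not exceed the normalized input sequence length.
    if effective > input_len:
        raise ValueError(
            f"effective_prefill_tokens {effective} exceeds input length {input_len}"
        )

    # Step 2: contact the selected worker directly.
    send_to_worker(selection["endpoint"], selection["dp_rank"])

    # Step 3: book the load with a globally unique reservation ID.
    reservation_id = str(uuid.uuid4())
    post("/reservations", {
        **prompt_fields,  # same prompt representation as in step 1
        "reservation_id": reservation_id,
        "worker_id": selection["worker_id"],  # ASSUMED identity field
        "dp_rank": selection["dp_rank"],
        "effective_prefill_tokens": effective,
    })
    return reservation_id
```

The returned ID is what the caller would later pass to the lifecycle API when reporting prefill completion and request completion.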
Reservation Lifecycle
prefill_complete clears active prefill load. output_block updates only the
receiving selector’s local decode-block accounting and accepts an optional
decay_fraction in [0.0, 1.0]. DELETE frees the reservation.
NOTE: Output-block updates are intentionally not replica-synchronized. They can occur at high frequency, and broadcasting them would consume disproportionate network bandwidth.
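Because decay_fraction is optional and range-limited, a client may want to validate it before sending an output_block update. A minimal sketch, assuming the update carries a JSON body keyed by `decay_fraction`; `output_block_body` is a hypothetical helper name:

```python
from typing import Any, Dict, Optional


def output_block_body(decay_fraction: Optional[float] = None) -> Dict[str, Any]:
    """Build the body for an output_block update, validating the range."""
    if decay_fraction is None:
        return {}  # the field is optional; omit it entirely
    value = float(decay_fraction)
    # NaN also fails this check because every comparison with NaN is False.
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"decay_fraction must be in [0.0, 1.0], got {decay_fraction!r}"
        )
    return {"decay_fraction": value}
```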
Peer Planes
The selector has two independent peer configurations:
Example:
Configure the reverse peer direction on selector B for bidirectional lifecycle
synchronization. GET /dump exposes the selector’s current indexer snapshot in
the same recovery format as the standalone indexer.
Replica-sync peers may also be changed without restarting the selector:
The same body is accepted by POST /replica_sync/deregister_peer.
GET /replica_sync/peers returns the sorted configured endpoints. Dynamic
membership is in-memory; after restart, only peers supplied through
--replica-sync-peers are restored. These routes only manage live ZMQ
replica-sync peers. They do not alter the HTTP indexer-recovery peers.
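The membership semantics described above can be pictured with a toy in-memory model. This is not the service's implementation: the class name, method names, and example endpoints are hypothetical, and treating repeated registration as a no-op is an assumption.

```python
from typing import Iterable, List, Set


class ReplicaSyncPeers:
    """Toy model: CLI-supplied peers plus in-memory dynamic changes."""

    def __init__(self, cli_peers: Iterable[str] = ()) -> None:
        # Peers passed via --replica-sync-peers are the only durable ones.
        self._cli_peers = frozenset(cli_peers)
        self._live: Set[str] = set(self._cli_peers)

    def register(self, endpoint: str) -> None:
        self._live.add(endpoint)  # ASSUMPTION: duplicates are a no-op

    def deregister(self, endpoint: str) -> None:
        self._live.discard(endpoint)

    def endpoints(self) -> List[str]:
        # Mirrors GET /replica_sync/peers: sorted configured endpoints.
        return sorted(self._live)

    def restart(self) -> None:
        # Dynamic membership is lost; only CLI peers come back.
        self._live = set(self._cli_peers)
```

In this model a peer added at runtime disappears on restart, and a CLI peer removed at runtime reappears, which is why durable topology changes belong in the launch flags.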
Consistency Invariants
- Replica synchronization is bounded and best-effort. Delays, reordering, dropped events, and temporary active-load divergence are accepted.
- There is no sequencing, acknowledgement, replay, backpressure, or resynchronization for replica lifecycle events.
- Unknown worker, model, tenant, DP-rank, and block-size events are dropped. Register the same worker catalog on every selector before routing traffic.
- Admission, prefill-complete, and free are synchronized. Output-block growth remains local to avoid excessive network bandwidth.
- Startup recovery waits for recovered events to be submitted to the indexer, not for complete processing. Early selections may temporarily miss recovered KV state.
- /select followed by /reservations provides eventual, not atomic, cross-replica admission. Use /select_and_reserve for atomic local booking.
- Reservation IDs must be globally unique. Existing conflict and retry semantics are unchanged; no idempotency ledger is added.
Inspection APIs
- GET /loads returns active-load snapshots, optionally filtered by model_name and tenant_id.
- POST /potential_loads estimates worker load for a prompt without selection.
- POST /overlap_scores returns per-worker/per-rank tiered overlap rows.
- GET /dump returns the compatible indexer recovery snapshot.
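As an illustration of the optional /loads filters, the helper below builds a request URL. It assumes the filters are passed as query-string parameters named model_name and tenant_id, which the text above does not state; `loads_url` is a hypothetical name and the base URL is whatever address the selector is reachable at.

```python
from typing import Optional
from urllib.parse import urlencode


def loads_url(
    base_url: str,
    model_name: Optional[str] = None,
    tenant_id: Optional[str] = None,
) -> str:
    """Build the GET /loads URL, adding only the filters that were given."""
    # ASSUMPTION: filters travel as query parameters.
    params = {
        key: value
        for key, value in (("model_name", model_name), ("tenant_id", tenant_id))
        if value is not None
    }
    url = base_url.rstrip("/") + "/loads"
    return f"{url}?{urlencode(params)}" if params else url
```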