Standalone Slot Tracker
Run active-request load accounting as an independent HTTP service
The standalone slot tracker (python -m dynamo.slot_tracker) exposes the KV router’s
active-request accounting as a small HTTP service. It is runtime-independent: consumers
register workers manually, submit request lifecycle events, and read advisory load
snapshots for their own routing decisions.
The service accepts prompts as ordered, final chained sequence hashes, one hash per prompt block. Hashes are serialized as signed 64-bit JSON integers and reinterpreted bit-for-bit as the internal unsigned hashes. Send hashes, not prompt tokens.
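A minimal Python sketch of that bit-for-bit reinterpretation, assuming the consumer already computes one unsigned 64-bit hash per prompt block. The helper names `to_wire_hash`, `from_wire_hash`, and `encode_hashes` are hypothetical and are not part of `dynamo.slot_tracker`.

```python
from __future__ import annotations

MASK_64 = (1 << 64) - 1


def to_wire_hash(unsigned_hash: int) -> int:
    """Reinterpret an unsigned 64-bit hash as a signed 64-bit JSON integer."""
    if not 0 <= unsigned_hash <= MASK_64:
        raise ValueError(f"hash out of u64 range: {unsigned_hash}")
    # Values with the top bit set become negative two's-complement integers.
    return unsigned_hash - (1 << 64) if unsigned_hash >= (1 << 63) else unsigned_hash


def from_wire_hash(signed_hash: int) -> int:
    """Inverse mapping: signed 64-bit JSON integer back to the unsigned hash."""
    if not -(1 << 63) <= signed_hash < (1 << 63):
        raise ValueError(f"hash out of i64 range: {signed_hash}")
    return signed_hash & MASK_64


def encode_hashes(block_hashes: list[int]) -> list[int]:
    """Convert an ordered list of per-block unsigned hashes for the wire."""
    return [to_wire_hash(h) for h in block_hashes]


if __name__ == "__main__":
    print(encode_hashes([0x1234, 0xFFFF_FFFF_FFFF_FFFF, 1 << 63]))
    # [4660, -1, -9223372036854775808]
```

Python’s `json` module writes the resulting negative values as plain JSON integers, so `json.dumps({"sequence_hashes": encode_hashes(hashes)})` needs no further conversion.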
This first version intentionally excludes metrics, discovery-based registration, output block updates, replica synchronization, persistence, and peer recovery.
Build the Python bindings with the slot-tracker feature:
Launch the service:
The default port is 8091. GET /health returns 200 OK with an empty body as soon as
the HTTP listener is ready. This endpoint reports liveness only: a 200 does not mean
that any registrations or active requests are present. After a restart, the registry
is empty; consumers must re-register workers and replay active requests if they need
the accounting restored.
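Because the service keeps no state across restarts and /health reports only liveness, a consumer that needs restored accounting has to keep its own journal. The sketch below assumes such a journal of the original /register and /add request bodies, plus a caller-supplied `post(path, body)` transport that returns the HTTP status. `http_health_probe`, `wait_until_live`, and `replay_state` are hypothetical helpers. Registrations are replayed before requests because /add against an unknown tracker or worker rank returns 404.

```python
from __future__ import annotations

import http.client
import time
import urllib.request
from typing import Callable


def http_health_probe(base_url: str, timeout_s: float = 1.0) -> bool:
    """GET /health; True only on 200 OK. This proves liveness, not restored state."""
    try:
        url = f"{base_url.rstrip('/')}/health"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and non-2xx (HTTPError is an OSError).
        return False


def wait_until_live(probe: Callable[[], bool], timeout_s: float = 30.0,
                    interval_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def replay_state(post: Callable[[str, dict], int],
                 registrations: list[dict],
                 active_adds: list[dict]) -> list[tuple[str, int]]:
    """Re-send journaled bodies: every /register first, then every /add."""
    results = []
    for body in registrations:
        results.append(("/register", post("/register", body)))
    for body in active_adds:
        results.append(("/add", post("/add", body)))
    return results
```

A typical sequence is `wait_until_live(lambda: http_health_probe("http://localhost:8091"))` followed by `replay_state(...)`; any 404 or 409 in the returned pairs is left for the caller to handle.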
The service binds to 0.0.0.0 and does not provide authentication. Run it on a trusted
internal network or place it behind an appropriate network policy.
Successful topology and lifecycle writes return:
Errors, including malformed JSON, oversized JSON bodies, unknown routes, and unsupported methods, return:
tenant_id defaults to "default" when omitted. Request bodies are parsed with Axum’s
default JSON handling, including its default body-size limit.
POST /register
Register one contiguous data-parallel range:
Returns 201. block_size and dp_size must be positive, and the DP range must not
overflow. Workers in the same (model_name, tenant_id) tracker must use the same block
size. Worker IDs are scoped by (model_name, tenant_id).
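The scoping rules above can be mirrored on the client to catch mistakes before they reach the service. This is a minimal sketch under stated assumptions: `LocalTopology` is a hypothetical helper, worker IDs are assumed to be integers, and the DP-range overflow check is left to the service because the exact range fields are not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LocalTopology:
    """Client-side mirror of registrations, scoped by (model_name, tenant_id)."""

    # Tracker key -> the single block size every worker in that tracker must use.
    block_sizes: dict[tuple[str, str], int] = field(default_factory=dict)
    # Tracker key -> worker IDs registered there (IDs are scoped per tracker).
    workers: dict[tuple[str, str], set[int]] = field(default_factory=dict)

    def check_register(self, model_name: str, block_size: int, dp_size: int,
                       tenant_id: str = "default") -> None:
        """Raise ValueError for registrations the documented rules forbid."""
        if block_size <= 0 or dp_size <= 0:
            raise ValueError("block_size and dp_size must be positive")
        key = (model_name, tenant_id)
        expected = self.block_sizes.get(key)
        if expected is not None and expected != block_size:
            raise ValueError(f"tracker {key} uses block_size={expected}, got {block_size}")

    def record_register(self, model_name: str, worker_id: int, block_size: int,
                        tenant_id: str = "default") -> None:
        """Remember a registration after /register returned 201."""
        key = (model_name, tenant_id)
        self.block_sizes.setdefault(key, block_size)
        self.workers.setdefault(key, set()).add(worker_id)
```

Keying both maps by `(model_name, tenant_id)` matches the page’s scoping: the same worker ID or a different block size is fine under another tenant.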
POST /unregister
Remove a worker’s full DP range and its active requests immediately:
Returns 200, or 404 if the registration does not exist.
GET /workers
List workers, with independent optional model_name and tenant_id filters:
The response is sorted for stable inspection.
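A small sketch for reading /workers (the same shape works for GET /loads), assuming the two filters are passed as query-string parameters named `model_name` and `tenant_id`. `filtered_url` and `get_json` are hypothetical helpers built on the standard library.

```python
from __future__ import annotations

import json
import urllib.request
from urllib.parse import urlencode


def filtered_url(base_url: str, path: str, model_name: str | None = None,
                 tenant_id: str | None = None) -> str:
    """Build e.g. /workers?model_name=...; each filter is independent and optional."""
    params = {name: value
              for name, value in (("model_name", model_name), ("tenant_id", tenant_id))
              if value is not None}
    query = urlencode(params)
    return f"{base_url.rstrip('/')}{path}" + (f"?{query}" if query else "")


def get_json(url: str, timeout_s: float = 5.0):
    """GET a JSON document; raises urllib.error.HTTPError on non-2xx responses."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)
```

For example, `get_json(filtered_url("http://localhost:8091", "/workers", model_name="my-model"))` lists one model’s workers across all tenants.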
POST /add
Record prompt blocks on a registered worker rank:
Returns 201. sequence_hashes is required and may be empty. new_isl_tokens defaults
to 0; positive values enable prefill-token accounting. Duplicate request IDs return
409. Unknown trackers or worker ranks return 404.
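A sketch of building an /add body and mapping its documented status codes. `sequence_hashes`, `new_isl_tokens`, `model_name`, and `tenant_id` come from this page; the `request_id` key and the contents of `target` (the fields that select a worker rank) are assumptions, so `target` is passed through unchanged. `add_body`, `AddOutcome`, and `classify_add` are hypothetical names.

```python
from __future__ import annotations

from enum import Enum

I64_MIN, I64_MAX = -(1 << 63), (1 << 63) - 1


class AddOutcome(Enum):
    ACCEPTED = 201   # request recorded
    DUPLICATE = 409  # request ID already active
    NOT_FOUND = 404  # unknown tracker or worker rank
    OTHER = 0        # any other status


def classify_add(status: int) -> AddOutcome:
    try:
        return AddOutcome(status)
    except ValueError:
        return AddOutcome.OTHER


def add_body(model_name: str, request_id: str, sequence_hashes: list[int],
             target: dict, new_isl_tokens: int = 0,
             tenant_id: str | None = None) -> dict:
    """Assemble an /add payload; hashes must already be signed 64-bit values."""
    if any(not I64_MIN <= h <= I64_MAX for h in sequence_hashes):
        raise ValueError("sequence_hashes must be signed 64-bit integers")
    if new_isl_tokens < 0:
        raise ValueError("new_isl_tokens must be >= 0")  # client-side assumption
    body = {
        "model_name": model_name,
        "request_id": request_id,                  # assumed key name
        "sequence_hashes": list(sequence_hashes),  # required; may be empty
        "new_isl_tokens": new_isl_tokens,          # positive enables prefill-token accounting
        **target,                                  # worker/rank selector, passed through
    }
    if tenant_id is not None:
        body["tenant_id"] = tenant_id  # omitted -> the service uses "default"
    return body
```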
POST /prefill_complete
Mark prompt processing complete:
Returns 200 for an active request. Repeated completion is a no-op. Unknown requests
return 404.
POST /free
Release prompt blocks and any remaining prefill state:
Returns 200. /free is idempotent as long as the (model_name, tenant_id) tracker exists,
even for an unknown request ID. Unknown trackers return 404.
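The same 404 status means different things on /prefill_complete and /free, which matters when interpreting responses. A hypothetical `interpret_lifecycle` helper that encodes only the behavior documented above:

```python
def interpret_lifecycle(path: str, status: int) -> str:
    """Map a /prefill_complete or /free status to its documented meaning."""
    if path == "/prefill_complete":
        if status == 200:
            return "ok"  # also returned when completion is repeated
        if status == 404:
            return "unknown_request"  # e.g. /add has not landed yet, or already freed
    elif path == "/free":
        if status == 200:
            return "ok"  # idempotent, even for an unknown request ID
        if status == 404:
            return "unknown_tracker"  # no such (model_name, tenant_id) tracker
    return "unexpected"
```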
Lifecycle writes are applied in arrival order, as in the core slot tracker. Consumers
should normally wait for /add to succeed before sending later lifecycle writes for the
same request. The service does not repair reordered delivery: a /free or
/prefill_complete that arrives before its /add targets an unknown request and is
forgotten, so the /add that follows stays accounted until a later free or expiry.
Requests older than 300 seconds may be removed by the stale-request cleanup inherited
from the core tracker.
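One way to follow the "wait for /add success" rule is a per-request gate. This asyncio sketch assumes an async `post(path, body)` transport that returns the HTTP status and treats only a 201 from /add as success; `RequestLifecycle` is a hypothetical name, and the body dictionaries are opaque to it.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable

Post = Callable[[str, dict], Awaitable[int]]  # hypothetical transport -> HTTP status


class RequestLifecycle:
    """Sends /add first; later writes for the same request wait for its outcome.

    Create instances inside a running event loop (older Pythons bind
    asyncio.Event to the current loop at construction).
    """

    def __init__(self, post: Post, add_body: dict, request_ref: dict) -> None:
        self._post = post
        self._add_body = add_body
        self._ref = request_ref  # body identifying the request for later writes
        self._done = asyncio.Event()
        self._add_status: int | None = None

    async def add(self) -> int:
        try:
            self._add_status = await self._post("/add", self._add_body)
            return self._add_status
        finally:
            # Release waiters on success, failure, or an exception such as a timeout.
            self._done.set()

    async def _after_add(self, path: str) -> int | None:
        await self._done.wait()
        if self._add_status != 201:
            # /add was rejected or its outcome is unknown. If an ambiguous add did
            # land, it stays accounted until a later /free or stale-request cleanup;
            # resolving that is left to the caller.
            return None
        return await self._post(path, self._ref)

    async def prefill_complete(self) -> int | None:
        return await self._after_add("/prefill_complete")

    async def free(self) -> int | None:
        return await self._after_add("/free")
```

Callers can schedule `prefill_complete()` and `free()` as soon as they know they will need them; the gate holds both until /add has returned.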
GET /loads
Read current load snapshots, with independent optional model_name and tenant_id filters:
The response is sorted for stable inspection.
POST /potential_loads
Project the loads for a new request:
Returns:
Projection response order is unspecified to keep the routing read path lean. /loads
and /potential_loads are advisory snapshots, not reservations. A selected worker may
disappear before /add; if /add returns 404, recompute the projection and choose again.
Resolving an ambiguous /add timeout is also the consumer’s responsibility: automatically
retrying the same request is not guaranteed safe, because if the first /add was applied,
the retry returns 409.
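A sketch of the consumer side of that contract: choose deterministically from a projection whose order is unspecified, call /add, and recompute on 404. No entry fields are assumed; the caller supplies `score` and `tie_key` functions. `project` and `add` are hypothetical caller-provided transports, and `pick_worker` and `route_and_add` are illustrative names. A timeout raised by `add` is deliberately not caught.

```python
from __future__ import annotations

from typing import Callable, Sequence

Entry = dict  # one /potential_loads entry; its fields are left to the caller


def pick_worker(entries: Sequence[Entry], score: Callable[[Entry], float],
                tie_key: Callable[[Entry], tuple]) -> Entry:
    """Lowest score wins; ties break on tie_key, so response order never matters."""
    if not entries:
        raise LookupError("no candidate workers")
    return min(entries, key=lambda e: (score(e), tie_key(e)))


def route_and_add(project: Callable[[], Sequence[Entry]],
                  add: Callable[[Entry], int],
                  score: Callable[[Entry], float],
                  tie_key: Callable[[Entry], tuple],
                  max_attempts: int = 3) -> Entry:
    """Project, pick, and /add; recompute only when the chosen worker is gone (404)."""
    for _ in range(max_attempts):
        chosen = pick_worker(project(), score, tie_key)
        status = add(chosen)  # exceptions such as timeouts propagate: no blind retry
        if status == 201:
            return chosen
        if status != 404:
            raise RuntimeError(f"/add failed with HTTP {status}")  # e.g. 409 duplicate
    raise RuntimeError(f"no worker accepted the request after {max_attempts} attempts")
```

Re-adding after a 404 is safe with the same request ID, because nothing was recorded for it.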