
Copyright © 2026, NVIDIA Corporation.


Standalone Slot Tracker

Run active-request load accounting as an independent HTTP service


Overview

The standalone slot tracker (python -m dynamo.slot_tracker) exposes the KV router’s active-request accounting as a small HTTP service. It is runtime-independent: consumers register workers manually, submit request lifecycle events, and read advisory load snapshots for their own routing decisions.

The service accepts ordered final chained sequence hashes, one hash per prompt block. Hashes are serialized as signed 64-bit JSON integers and reinterpreted bit-for-bit as internal unsigned hashes. Send hashes rather than prompt tokens.
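The signed/unsigned reinterpretation described above can be sketched in Python as follows. This is a minimal illustration of the bit-for-bit mapping only; the helper names `to_signed_i64` and `to_unsigned_u64` are hypothetical and are not part of the slot tracker API.

```python
U64_MASK = (1 << 64) - 1  # all 64 bits set


def to_signed_i64(unsigned_hash: int) -> int:
    """Reinterpret an unsigned 64-bit hash as the signed integer sent in JSON."""
    if not 0 <= unsigned_hash <= U64_MASK:
        raise ValueError("hash does not fit in 64 unsigned bits")
    # Values with the top bit set map to negative numbers (two's complement).
    if unsigned_hash >= (1 << 63):
        return unsigned_hash - (1 << 64)
    return unsigned_hash


def to_unsigned_u64(signed_hash: int) -> int:
    """Recover the unsigned 64-bit hash from its signed JSON representation."""
    if not -(1 << 63) <= signed_hash < (1 << 63):
        raise ValueError("value does not fit in 64 signed bits")
    return signed_hash & U64_MASK
```

A consumer that computes unsigned 64-bit block hashes would apply `to_signed_i64` to each hash before placing it in `sequence_hashes`; a negative entry such as `-22` corresponds to a hash whose top bit is set.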

This first version intentionally excludes metrics, discovery-based registration, output block updates, replica synchronization, persistence, and peer recovery.

Build And Launch

Build the Python bindings with the slot-tracker feature:

$ cd lib/bindings/python
$ VIRTUAL_ENV=../../../.venv ../../../.venv/bin/maturin develop --uv --features slot-tracker

Launch the service:

$ .venv/bin/python -m dynamo.slot_tracker --port 8091

The default port is 8091. GET /health returns 200 OK with an empty body as soon as the HTTP listener is ready. This endpoint is liveness-only. After a restart the registry is empty; consumers must re-register workers and replay active requests if they need restored accounting.

The service binds to 0.0.0.0 and does not provide authentication. Run it on a trusted internal network or place it behind an appropriate network policy.

Common Responses

Successful topology and lifecycle writes return:

{"status": "ok"}

Errors, including malformed JSON, oversized JSON bodies, unknown routes, and unsupported methods, return:

{"error": "concise description"}

tenant_id defaults to "default" when omitted. Request bodies use Axum’s default bounded JSON handling.

Topology API

POST /register

Register one contiguous data-parallel range:

{
  "worker_id": 7,
  "model_name": "llama-3-8b",
  "tenant_id": "default",
  "block_size": 16,
  "dp_start": 0,
  "dp_size": 2
}

Returns 201. block_size and dp_size must be positive, and the DP range must not overflow. Workers in the same (model_name, tenant_id) tracker must use the same block size. Worker IDs are scoped by (model_name, tenant_id).
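A client might build and send the registration body along these lines. This is a sketch using only the Python standard library; `build_register_body` and `post_json` are hypothetical helper names, the base URL is an assumption, and the client-side checks mirror only the positivity rules stated above (the service remains the authority on range overflow and block-size consistency).

```python
import json
import urllib.error
import urllib.request


def build_register_body(worker_id: int, model_name: str, block_size: int,
                        dp_start: int, dp_size: int,
                        tenant_id: str = "default") -> dict:
    """Build a /register body, rejecting values the service documents as invalid."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    return {
        "worker_id": worker_id,
        "model_name": model_name,
        "tenant_id": tenant_id,
        "block_size": block_size,
        "dp_start": dp_start,
        "dp_size": dp_size,
    }


def post_json(base_url: str, path: str, body: dict, timeout: float = 5.0):
    """POST a JSON body and return (status_code, raw_response_bytes)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses carry the {"error": ...} body described above.
        return err.code, err.read()
```

With a tracker listening on the default port, `post_json("http://localhost:8091", "/register", build_register_body(7, "llama-3-8b", 16, 0, 2))` would be expected to return status 201 on success.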

POST /unregister

Remove a worker’s full DP range and active requests immediately:

{
  "worker_id": 7,
  "model_name": "llama-3-8b",
  "tenant_id": "default"
}

Returns 200, or 404 if the registration does not exist.

GET /workers

List workers with independent optional model_name and tenant_id filters:

[
  {
    "worker_id": 7,
    "model_name": "llama-3-8b",
    "tenant_id": "default",
    "block_size": 16,
    "dp_start": 0,
    "dp_size": 2
  }
]

The response is sorted for stable inspection.

Lifecycle API

POST /add

Record prompt blocks on a registered worker rank:

{
  "model_name": "llama-3-8b",
  "tenant_id": "default",
  "request_id": "req-123",
  "worker_id": 7,
  "dp_rank": 0,
  "sequence_hashes": [101, -22, 303],
  "new_isl_tokens": 48
}

Returns 201. sequence_hashes is required and may be empty. new_isl_tokens defaults to 0; positive values enable prefill-token accounting. Duplicate request IDs return 409. Unknown trackers or worker ranks return 404.

POST /prefill_complete

Mark prompt processing complete:

{
  "model_name": "llama-3-8b",
  "tenant_id": "default",
  "request_id": "req-123"
}

Returns 200 for an active request. Repeated completion is a no-op. Unknown requests return 404.

POST /free

Release prompt blocks and any remaining prefill state:

{
  "model_name": "llama-3-8b",
  "tenant_id": "default",
  "request_id": "req-123"
}

Returns 200. Free is idempotent while the model/tenant tracker exists, including for an unknown request. Unknown trackers return 404.

Lifecycle writes preserve the core slot tracker’s arrival ordering. Consumers should normally wait for /add to succeed before sending later lifecycle writes for the same request. The service does not repair reordered delivery: a /free or /prefill_complete that arrives before its /add is not remembered, so the later /add stays accounted until another /free arrives or the request expires. A request older than 300 seconds may be removed by inherited stale-request cleanup.
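One way for a consumer to keep lifecycle writes in order is to tie them to a context manager, so that /prefill_complete and /free are only sent after /add has succeeded. The sketch below is illustrative: `TrackedRequest` and the injected `post` callable (which takes a path and a JSON body and returns the HTTP status code) are hypothetical names, not part of the service.

```python
from typing import Callable, Sequence

# Hypothetical transport: sends `body` as JSON to `path`, returns the HTTP status.
Post = Callable[[str, dict], int]


class TrackedRequest:
    """Sends /add on entry and /free on exit, in that order."""

    def __init__(self, post: Post, model_name: str, request_id: str,
                 worker_id: int, dp_rank: int,
                 sequence_hashes: Sequence[int], new_isl_tokens: int = 0,
                 tenant_id: str = "default") -> None:
        self._post = post
        self._key = {
            "model_name": model_name,
            "tenant_id": tenant_id,
            "request_id": request_id,
        }
        self._add_body = {
            **self._key,
            "worker_id": worker_id,
            "dp_rank": dp_rank,
            "sequence_hashes": list(sequence_hashes),
            "new_isl_tokens": new_isl_tokens,
        }

    def __enter__(self) -> "TrackedRequest":
        status = self._post("/add", self._add_body)
        if status != 201:
            # No later lifecycle writes are sent for a request that was not added.
            raise RuntimeError(f"/add failed with HTTP {status}")
        return self

    def prefill_complete(self) -> None:
        """Call once prompt processing has finished; repeats are a no-op server-side."""
        self._post("/prefill_complete", self._key)

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs only if __enter__ succeeded, so /free never precedes /add.
        self._post("/free", self._key)
        return False  # do not swallow exceptions from the with-body
```

Because `__exit__` runs even when the body of the `with` block raises, the request is released on error paths as well, which avoids relying on the 300-second stale-request cleanup.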

Load API

GET /loads

Read current load snapshots with independent optional model_name and tenant_id filters:

[
  {
    "model_name": "llama-3-8b",
    "tenant_id": "default",
    "worker_id": 7,
    "dp_rank": 0,
    "active_prefill_tokens": 48,
    "active_decode_blocks": 3
  }
]

The response is sorted for stable inspection.
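As an illustration of reading these snapshots, the sketch below builds a filtered /loads URL and sums the per-rank entries for each worker. It assumes the optional filters are passed as URL query parameters named `model_name` and `tenant_id`, which this page does not state explicitly; `loads_url` and `total_by_worker` are hypothetical helper names.

```python
from typing import Optional
from urllib.parse import urlencode


def loads_url(base_url: str, model_name: Optional[str] = None,
              tenant_id: Optional[str] = None) -> str:
    """Build a /loads URL, adding only the filters that were supplied."""
    # Assumption: filters travel as query parameters.
    params = {}
    if model_name is not None:
        params["model_name"] = model_name
    if tenant_id is not None:
        params["tenant_id"] = tenant_id
    query = urlencode(params)
    url = base_url.rstrip("/") + "/loads"
    return f"{url}?{query}" if query else url


def total_by_worker(loads: list) -> dict:
    """Sum load across DP ranks, keyed by (model_name, tenant_id, worker_id)."""
    totals: dict = {}
    for entry in loads:
        key = (entry["model_name"], entry["tenant_id"], entry["worker_id"])
        total = totals.setdefault(
            key, {"active_prefill_tokens": 0, "active_decode_blocks": 0}
        )
        total["active_prefill_tokens"] += entry["active_prefill_tokens"]
        total["active_decode_blocks"] += entry["active_decode_blocks"]
    return totals
```

Summing across ranks is only one possible view; a router that places requests on individual DP ranks would keep the per-rank entries as returned.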

POST /potential_loads

Project the loads for a new request:

{
  "model_name": "llama-3-8b",
  "tenant_id": "default",
  "sequence_hashes": [101, -22, 303, 404],
  "new_isl_tokens": 48
}

Returns:

[
  {
    "worker_id": 7,
    "dp_rank": 0,
    "potential_prefill_tokens": 96,
    "potential_decode_blocks": 4
  }
]

Projection response order is unspecified to keep the routing read path lean. /loads and /potential_loads return advisory snapshots, not reservations. A selected worker may disappear before /add; if /add returns 404, recompute the projection and choose again. Handling an ambiguous /add timeout is also the consumer's responsibility: automatically retrying the same request is not guaranteed to be safe, because a duplicate add returns 409.
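The project-then-add flow can be sketched as follows. The selection policy here (fewest projected prefill tokens, then fewest projected decode blocks) is an assumption chosen for illustration; the service does not prescribe one. `pick_worker`, `route`, and the injected `project` and `add` callables are hypothetical names standing in for calls to /potential_loads and /add.

```python
from typing import Callable, Optional, Tuple

Target = Tuple[int, int]  # (worker_id, dp_rank)


def pick_worker(projections: list) -> Optional[Target]:
    """Pick the rank with the lowest projected load (illustrative policy)."""
    if not projections:
        return None
    best = min(
        projections,
        key=lambda p: (
            p["potential_prefill_tokens"],
            p["potential_decode_blocks"],
            # Response order is unspecified, so break ties deterministically.
            p["worker_id"],
            p["dp_rank"],
        ),
    )
    return best["worker_id"], best["dp_rank"]


def route(project: Callable[[], list],
          add: Callable[[int, int], int],
          max_attempts: int = 3) -> Optional[Target]:
    """Project, pick, then /add; recompute if the chosen worker vanished (404)."""
    for _ in range(max_attempts):
        target = pick_worker(project())
        if target is None:
            return None  # no registered ranks to choose from
        status = add(*target)
        if status == 201:
            return target
        if status != 404:
            # 409 and other errors are not retried automatically.
            raise RuntimeError(f"/add failed with HTTP {status}")
    return None
```

Timeouts are deliberately left out of the retry loop: after an ambiguous timeout the consumer cannot tell whether the add was recorded, so it has to decide separately how to reconcile that request.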