Dynamic Batching
Dynamic batching wraps an existing ANN index and combines many concurrent small search requests into larger GPU batches. It is useful for serving systems where requests arrive from many CPU threads, each request may contain only a few queries, and launching one GPU search per request would leave the GPU underutilized.
The dynamic batching API does not build a new ANN index. It owns batching queues and temporary buffers, then calls the upstream index search function with a larger batch. The upstream index, upstream search parameters, and optional filter are chosen when the dynamic batching index is created.
Example API Usage
Wrapping an upstream index
The example below wraps a CAGRA index. The same pattern can wrap supported C++ upstream indexes such as CAGRA, IVF-Flat, IVF-PQ, and brute-force.
C++
Searching through the batcher
Dynamic batching search is thread-safe. Share copies of the lightweight dynamic batching index across request threads, and call dynamic_batching::search from each thread.
C++
Using separate priority classes
There is one queue pool inside each dynamic batching index. To separate latency-sensitive requests from throughput-oriented requests, create more than one dynamic batching index over the same upstream index.
C++
How Dynamic Batching works
Each dynamic batching index owns a fixed set of request queues. Each queue has its own CUDA stream and temporary device buffers for queries, neighbors, and distances. CPU request threads commit their query rows into a queue. The batcher gathers those query rows into a contiguous device batch, calls the upstream search function, and scatters the output back to the request-owned output matrices.
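The gather, search, and scatter flow described above can be illustrated with a small CPU-only model. This is a sketch under stated assumptions, not the library implementation: Request, toy_upstream_search, and run_batch are hypothetical names, std::vector stands in for device buffers, and the stub search simply derives neighbor ids from the first value of each query row so that the scatter step can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical request: a few query rows plus a request-owned output buffer.
struct Request {
  std::vector<float> queries;  // n_rows * dim values, row-major
  std::size_t n_rows;
  std::vector<int> neighbors;  // n_rows * k values, filled by run_batch
};

// Stand-in for the upstream search (an assumption for illustration only).
// Neighbor j of a row is the row's first value plus j.
inline void toy_upstream_search(const std::vector<float>& batch, std::size_t n_rows,
                                std::size_t dim, std::size_t k, std::vector<int>& out)
{
  assert(dim > 0);
  out.assign(n_rows * k, 0);
  for (std::size_t r = 0; r < n_rows; ++r) {
    for (std::size_t j = 0; j < k; ++j) {
      out[r * k + j] = static_cast<int>(batch[r * dim]) + static_cast<int>(j);
    }
  }
}

// Gather rows from all requests into one contiguous batch, run one search,
// then scatter each row's results back to the request that owns it.
inline void run_batch(std::vector<Request>& requests, std::size_t dim, std::size_t k)
{
  assert(dim > 0);
  std::vector<float> batch;
  for (const Request& r : requests) {
    assert(r.queries.size() == r.n_rows * dim);
    batch.insert(batch.end(), r.queries.begin(), r.queries.end());
  }
  const std::size_t total_rows = batch.size() / dim;

  std::vector<int> batch_out;
  toy_upstream_search(batch, total_rows, dim, k, batch_out);

  std::size_t offset = 0;  // first batch row that belongs to the current request
  for (Request& r : requests) {
    const auto first = batch_out.begin() + static_cast<std::ptrdiff_t>(offset * k);
    const auto last  = first + static_cast<std::ptrdiff_t>(r.n_rows * k);
    r.neighbors.assign(first, last);
    offset += r.n_rows;
  }
}
```

Two requests with one and two rows end up as a single three-row search call, and each request receives only the output rows that correspond to its own queries.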
The upstream index is not copied. The dynamic batching index keeps a reference to the upstream index, so the upstream index must outlive the dynamic batching wrapper. The upstream search parameters are captured by value when the wrapper is constructed, which means all requests submitted to the same dynamic batching index use the same upstream search configuration.
If a request already contains at least max_batch_size query rows, dynamic batching bypasses the queue and calls the upstream search directly.
When to use Dynamic Batching
Use dynamic batching for high-concurrency serving workloads where many request threads submit small search batches. It can improve GPU utilization by turning many small searches into fewer larger upstream searches and by using multiple queues to overlap queue filling, kernel launch overhead, and search work.
Avoid dynamic batching when requests already arrive in large batches, when a single thread submits work serially, or when each request needs different upstream search parameters. In those cases, call the upstream index search API directly.
Dynamic batching is currently a C++ API. Use the index-specific guides for normal build and search workflows, then add dynamic batching only when serving concurrency makes request aggregation useful.
Configuration parameters
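Index parameters
k: Number of neighbors returned per query. The output staging buffers are sized from this value.
max_batch_size: Maximum number of query rows per queue. Requests that already contain at least this many rows bypass the queue.
n_queues: Number of independent request queues. Each queue has its own CUDA stream and temporary device buffers.
conservative_dispatch: When false, the upstream search can be launched with the maximum batch shape even when fewer request rows are valid, which suits latency-sensitive small batches. When true, the batcher avoids running the upstream search on mostly empty batches.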
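Search parameters
dispatch_timeout_ms: Dispatch timeout in milliseconds. Lower values favor latency. Higher values give dense request traffic more time to fill larger batches.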
Tuning
Start by tuning the upstream index normally. For example, tune CAGRA, IVF-Flat, or IVF-PQ search parameters on representative batches before adding the dynamic batching wrapper.
Set max_batch_size to the largest request group that still meets latency and memory targets. Larger batches usually improve throughput, but they also increase temporary buffers and can delay small requests.
Increase n_queues when many CPU threads submit work concurrently and one queue is often busy. More queues can hide launch overhead and keep the GPU fed, but each queue allocates its own query and output buffers.
Use conservative_dispatch=false for latency-sensitive small batches. Use conservative_dispatch=true when max_batch_size is large and it is too expensive to run upstream search on mostly empty batches.
Lower dispatch_timeout_ms for latency-sensitive traffic. Raise it when throughput matters more than tail latency and request arrivals are dense enough to fill larger batches.
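Whether arrivals are "dense enough" can be checked with back-of-the-envelope arithmetic before measuring. The sketch below is a rough planning heuristic, not a formula from the library: it assumes a steady arrival rate and treats the timeout as the window in which rows can accumulate. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Rough estimate (an assumption, not a library formula): with a steady
// arrival rate, about rate * timeout query rows arrive while a request waits.
inline double expected_rows_in_window(double rows_per_ms, double dispatch_timeout_ms)
{
  assert(rows_per_ms >= 0.0 && dispatch_timeout_ms >= 0.0);
  return rows_per_ms * dispatch_timeout_ms;
}

// Fraction of max_batch_size that those rows would fill, capped at 1.0.
inline double expected_fill_fraction(double rows_per_ms, double dispatch_timeout_ms,
                                     double max_batch_size)
{
  assert(max_batch_size > 0.0);
  const double rows = expected_rows_in_window(rows_per_ms, dispatch_timeout_ms);
  return std::min(1.0, rows / max_batch_size);
}
```

If the estimated fill fraction stays far below 1.0 at the intended timeout, a smaller max_batch_size or a longer timeout is worth testing. Treat the result only as a starting point and confirm it with measured batch sizes.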
Memory footprint
Dynamic batching memory has two parts: the upstream index memory and the batching buffers. Use the upstream index guide to estimate index memory. This section estimates the extra buffers owned by the dynamic batching wrapper.
Variables:
Q: Number of independent request queues. This is n_queues.
B: Maximum number of query rows per queue. This is max_batch_size.
D: Vector dimension, or number of values in each query vector.
K: Number of neighbors returned per query. This is k.
S_q: Bytes per query value. Use 4 for float, 2 for half, and 1 for int8_t or uint8_t.
S_i: Bytes per output index. Use 4 for uint32_t or 8 for int64_t.
S_d: Bytes per output distance. Distances are stored as float, so use 4.
M_upstream: Device memory used by the upstream index.
M_scratch: Temporary memory used by the upstream search implementation, CUDA libraries, memory-resource padding, and allocator overhead.
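The query staging buffers use:
$$M_{\text{query}} = Q \times B \times D \times S_q$$
The output staging buffers use:
$$M_{\text{output}} = Q \times B \times K \times (S_i + S_d)$$
The peak memory is approximately:
$$M_{\text{peak}} \approx M_{\text{upstream}} + M_{\text{query}} + M_{\text{output}} + M_{\text{scratch}}$$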
There are also small pinned host-memory structures for request pointers, queue tokens, events, and timeout bookkeeping. These are usually much smaller than the device query and output buffers.
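The buffer estimate can be turned into a small calculator. This is a minimal sketch, assuming each of the Q queues holds B query rows of D values plus B × K index and distance entries. BatchingShape and the function names are hypothetical and are not part of the dynamic batching API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the variables defined in this section.
struct BatchingShape {
  std::uint64_t n_queues;        // Q
  std::uint64_t max_batch_size;  // B
  std::uint64_t dim;             // D
  std::uint64_t k;               // K
  std::uint64_t query_bytes;     // S_q
  std::uint64_t index_bytes;     // S_i
  std::uint64_t distance_bytes;  // S_d
};

// Query staging: one B x D buffer of query values per queue.
inline std::uint64_t query_staging_bytes(const BatchingShape& s)
{
  return s.n_queues * s.max_batch_size * s.dim * s.query_bytes;
}

// Output staging: one B x K buffer of indices and one of distances per queue.
inline std::uint64_t output_staging_bytes(const BatchingShape& s)
{
  return s.n_queues * s.max_batch_size * s.k * (s.index_bytes + s.distance_bytes);
}

// Approximate peak: upstream index + staging buffers + scratch.
inline std::uint64_t peak_bytes(const BatchingShape& s, std::uint64_t m_upstream,
                                std::uint64_t m_scratch)
{
  assert(s.n_queues > 0 && s.max_batch_size > 0);
  return m_upstream + query_staging_bytes(s) + output_staging_bytes(s) + m_scratch;
}
```

For Q = 3, B = 100, D = 128, and K = 10 with float queries and uint32_t indices, the query staging buffers come to 153,600 bytes and the output staging buffers to 24,000 bytes. At this scale the upstream index and its scratch space usually dominate the total.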
Scratch and maximum batch size
The upstream search path may allocate temporary buffers based on max_batch_size. This matters most when conservative_dispatch=false, because the upstream search can then be launched with the maximum batch shape even when fewer request rows are valid. Plan memory as if the upstream search can see B query rows.
To estimate the largest max_batch_size that fits, first reserve memory for the upstream index and other application buffers:
$$M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \times (1 - H)$$
where:
M_free: Free device memory before constructing the dynamic batching wrapper.
M_other: Device memory reserved for application buffers that are not included in this formula.
H: Headroom fraction for scratch and allocator overhead. Start with 0.10 to 0.20, then replace it with measured peak overhead from a representative run.
M_usable: Device memory available for dynamic batching buffers after reservations and headroom.
Then solve:
$$B_{\max} = \left\lfloor \frac{M_{\text{usable}}}{Q \times \left(D \times S_q + K \times (S_i + S_d)\right)} \right\rfloor$$
Use the result as a planning estimate, then validate with the actual upstream index and request mix.
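The same planning step can be written as code. This is a sketch under the assumptions of this section: usable memory is the free memory minus other reservations, scaled down by the headroom fraction, and every batch row costs D × S_q bytes of query staging plus K × (S_i + S_d) bytes of output staging in each of the Q queues. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// M_usable: free memory minus other reservations, reduced by headroom H.
inline std::uint64_t usable_bytes(std::uint64_t m_free, std::uint64_t m_other, double headroom)
{
  assert(headroom >= 0.0 && headroom < 1.0);
  if (m_other >= m_free) { return 0; }  // nothing left for batching buffers
  const double remaining = static_cast<double>(m_free - m_other);
  return static_cast<std::uint64_t>(remaining * (1.0 - headroom));
}

// Largest B such that the staging buffers of all Q queues fit in m_usable.
inline std::uint64_t max_batch_rows(std::uint64_t m_usable, std::uint64_t n_queues,
                                    std::uint64_t dim, std::uint64_t k,
                                    std::uint64_t query_bytes, std::uint64_t index_bytes,
                                    std::uint64_t distance_bytes)
{
  // Bytes consumed across all queues by one additional batch row.
  const std::uint64_t bytes_per_row =
    n_queues * (dim * query_bytes + k * (index_bytes + distance_bytes));
  assert(bytes_per_row > 0);
  return m_usable / bytes_per_row;  // integer division rounds down
}
```

For example, usable_bytes(1000000, 200000, 0.25) gives 600,000 bytes. With Q = 3, D = 128, K = 10, float queries, and uint32_t indices, each batch row costs 1,776 bytes, so max_batch_rows returns 337.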