All-neighbors
All-neighbors
All-neighbors builds a k-NN graph for every vector in a dataset. Instead of asking “what are the nearest neighbors for this new query?”, it asks “what are the nearest neighbors for every row I already have?”
Use it when the graph itself is the output, for example for clustering, visualization, graph analytics, or algorithms that need a neighborhood graph before doing their own work.
Example API Usage
C API | C++ API | Python API
Building a graph
C
C++
Python
C, C++, and Python expose all-neighbors bindings. Java, Rust, and Go do not currently expose standalone all-neighbors wrappers.
Building in batches
For host-resident datasets, all-neighbors can partition the data into overlapping clusters. Each row is assigned to overlap_factor clusters, local k-NN graphs are built inside each cluster, and the local results are merged into one global graph.
Use batching when the full graph build does not fit comfortably on one GPU, or when you want to distribute work across multiple GPUs.
C++
Python
Device-resident datasets require n_clusters = 1. Put the dataset in host memory when using n_clusters > 1.
Saving graph outputs
All-neighbors does not expose cuVS index serialization or deserialization APIs because the graph arrays are the output. Persist indices, distances, and core_distances if used with your application’s tensor, array, or table storage format, then load those arrays into the downstream graph-processing workflow.
Mutual-reachability distances
All-neighbors can compute mutual-reachability distances for workflows such as robust single linkage or HDBSCAN-style graph construction. Provide both distances and core_distances; alpha controls the mutual-reachability scaling.
C++
Python
IVF-PQ does not support mutual-reachability distances in all-neighbors. Use brute-force or NN-Descent for that mode.
Searching an index
All-neighbors does not build a reusable search index. It writes k-NN graph outputs for the input dataset. Use CAGRA, IVF-Flat, IVF-PQ, brute-force, or Vamana when you need to search new query vectors.
How All-neighbors works
When n_clusters = 1, all-neighbors builds one k-NN graph over the whole dataset using the selected local graph builder.
When n_clusters > 1, all-neighbors first trains cluster centroids from a sample of the host dataset. Then it assigns each row to the overlap_factor nearest clusters. Each cluster builds a local k-NN graph, and cuVS merges those local graphs back into one global graph.
The local graph builder can be brute-force, IVF-PQ, or NN-Descent. Brute-force is exact when n_clusters = 1. With batching, even brute-force becomes approximate because each row only compares against rows in its assigned clusters.
When to use All-neighbors
Use all-neighbors when you need a graph for every vector in the dataset.
Use n_clusters = 1 when the dataset and chosen local builder fit on one GPU.
Use host data with n_clusters > 1 when you need batching or multi-GPU execution.
Use brute-force as the local builder when exactness matters and the local batches are small enough.
Use NN-Descent when graph build speed matters and approximate graph quality is acceptable.
Use IVF-PQ when you want a partitioned approximate local builder and do not need mutual-reachability distances.
Using Filters
All-neighbors does not expose filtered search because it does not search new query vectors. Apply filtering in the downstream graph workflow, or use a search index that supports filters.
Configuration parameters
Build parameters
Tuning
Start with the local graph builder. Brute-force gives the clearest baseline, NN-Descent is usually the fastest approximate graph builder, and IVF-PQ can be useful for larger local batches when L2 distance is sufficient.
Set K to the number of neighbors required by the downstream graph algorithm. Increasing K increases output memory and local graph-builder work.
For batched builds, start with n_clusters = 4 or 8 and overlap_factor = 2. Increase n_clusters to reduce per-batch device memory. Increase overlap_factor to improve recall, because each row is compared inside more clusters.
Keep the ratio overlap_factor / n_clusters in mind. It roughly controls how much of the dataset is active in one batch. A larger ratio usually improves graph quality but uses more device memory and work.
Tune the local builder after the batching shape is stable. For NN-Descent, tune graph_degree, intermediate_graph_degree, and max_iterations. For IVF-PQ, tune the IVF-PQ build/search parameters and refinement rate.
Memory footprint
All-neighbors memory has three main parts: graph outputs, optional batching metadata, and the temporary memory used by the selected local graph builder. The output graph is always written to device memory. Batched builds also keep managed global merge buffers while local cluster results are merged.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors.D: Vector dimension.K: Output graph degree, or number of neighbors requested.B: Bytes per dataset value. All-neighbors currently accepts fp32 data, soB = 4.C: Number of clusters, orn_clusters.O: Cluster overlap, oroverlap_factor.M: Maximum cluster size after overlap assignment. A planning estimate isceil(N * O / C), but real clusters may be uneven.S: Subsample rows used for centroid training. Approximatelymin(N / C, 50000), with a small-data fallback up to5000.R:1whendistancesis provided, otherwise0.Q:1whencore_distancesis provided, otherwise0.S_idx: Bytes per output neighbor ID, currentlysizeof(int64_t).
The named terms in the formulas are also memory sizes:
indices_size: Device memory for output neighbor IDs.distances_size: Device memory for output distances.core_distances_size: Device memory for mutual-reachability core distances.local_builder_peak: Peak temporary memory for the chosen local graph builder on at mostMrows.
Scratch and maximum vectors
The formulas below include output buffers, batching metadata, and local builder peaks. Additional scratch comes from allocator padding, local graph-builder workspaces, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.25 for all-neighbors builds, then add the scratch/headroom recommended by the chosen local builder. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
With n_clusters = 1, solve the single-cluster formula using the local builder peak for N rows. With batched builds, substitute M = ceil(N * O / C) and solve:
In this formula, M(N) = ceil(N * O / C) is the estimated largest local cluster size for a candidate dataset size N. B_global_per_vector is the per-vector memory for global outputs and merge metadata. B_local_per_vector is the per-vector memory used by the selected local builder when it processes one local cluster. Use the local builder’s own memory formula to estimate B_local_per_vector.
Then try increasing N until peak_without_scratch(N) <= M_usable no longer holds. The largest feasible dataset is the smaller value from device memory and host/managed-memory limits.
Baseline output memory
The output graph memory is:
The baseline output footprint is:
Example (N = 1e6, K = 32, with distances and no core distances):
indices_size = 256000000 B = 244.14 MiBdistances_size = 128000000 B = 122.07 MiBoutput_size = 384000000 B = 366.21 MiB
Single-cluster build memory
With n_clusters = 1, all-neighbors runs the selected local builder over the full dataset. Device-resident input stays on device; host input is copied to a device build buffer by local builders that require device data.
The planning estimate is:
For brute-force, local_builder_peak is mostly the brute-force query workspace and any temporary distance output if distances was not provided.
For NN-Descent, use the NN-Descent guide formulas with N rows and output degree K.
For IVF-PQ, use the IVF-PQ guide formulas with a candidate count based on the IVF-PQ refinement settings.
Batched build memory
With n_clusters > 1, the dataset must be host-resident. All-neighbors builds local graphs on at most M rows at a time and merges each local graph into managed global buffers.
Centroid training and cluster assignment use:
The global merge buffers are allocated even if the caller does not request final distances:
Reusable per-cluster buffers include the gathered cluster data, local graph outputs, and merge buffers:
Host batching metadata stores inverted cluster membership:
The batched peak is approximately:
If core_distances is provided, all-neighbors computes a first graph to get core distances and then rebuilds in mutual-reachability space. Peak memory is similar, but runtime roughly includes two local graph builds. The extra persistent output is core_distances_size.
Example (N = 1e6, D = 128, K = 32, C = 8, O = 2, with distances):
M ≈ ceil(1e6 * 2 / 8) = 250000output_size = 384000000 B = 366.21 MiBglobal_merge_size = 384000000 B = 366.21 MiBbatch_metadata_size = 16000128 B = 15.26 MiBcluster_buffer_size = 224000000 B = 213.62 MiB
Add the selected local builder peak on M = 250000 rows to estimate the dominant per-cluster build phase.
Search memory usage
All-neighbors does not search new query vectors, so it has no query workspace. Downstream search memory depends on the index or graph workflow that consumes the generated k-NN graph.