NN-Descent
NN-Descent
NN-Descent is a GPU-accelerated graph builder for approximate all-neighbors k-NN graphs. It starts with a rough random graph, repeatedly asks each vector to check its neighbors’ neighbors, and keeps better edges as it finds them.
NN-Descent is useful when you need a k-NN graph as an output, or when another cuVS index such as CAGRA needs an approximate graph as an input. It is not a query-time search index for new query vectors.
Example API Usage
C API | C++ API | Python API
Building a graph
C
C++
Python
C, C++, and Python expose standalone NN-Descent bindings. Java, Rust, and Go do not currently expose standalone NN-Descent wrappers, although those bindings may expose CAGRA settings that use NN-Descent internally.
Reading the graph
C
C++
Python
Distances are available only when return_distances is enabled during build. If you provide a preallocated output graph, disable return_distances unless you also use a binding that can provide a matching distance output.
Searching an index
NN-Descent builds a graph over the input dataset. It does not expose a cuVS search API for new query vectors. Use CAGRA, IVF-Flat, IVF-PQ, brute-force, or Vamana when you need query-time search.
Saving graph outputs
NN-Descent does not expose cuVS index serialization or deserialization APIs because its output is a graph, not a query-time search index. Persist the returned graph and optional distances with your application’s tensor, array, or table storage format, then load those arrays into the downstream workflow that consumes the graph.
How NN-Descent works
NN-Descent starts with a random neighbor graph. Each vector has a list of candidate neighbors, but those candidates are only a first guess.
During each iteration, the algorithm samples new and old neighbors, checks pairs of vectors that are likely to become better neighbors, and updates the graph when it finds closer candidates.
The process stops when max_iterations is reached or when the number of graph updates falls below termination_threshold.
The final result is an all-neighbors graph with graph_degree neighbors per vector.
When to use NN-Descent
Use NN-Descent when you need an approximate k-NN graph rather than a searchable index.
Use NN-Descent as a graph builder when CAGRA or all-neighbors needs a fast approximate initial graph.
Use brute-force all-neighbors when exact graph quality matters more than build speed.
Use CAGRA, IVF-Flat, IVF-PQ, or Vamana when the main task is searching new query vectors.
Interoperability with CAGRA and all-neighbors
CAGRA can use NN-Descent to build its initial graph before pruning. This is useful when you want CAGRA search behavior but prefer NN-Descent graph construction.
C++
Python
All-neighbors can also use NN-Descent as its local graph builder.
Using Filters
NN-Descent does not expose filtered search because it does not expose search. Apply filtering in the downstream search index or graph-processing workflow that consumes the generated graph.
Configuration parameters
Build parameters
Tuning
Start with graph_degree. It controls the size of the final graph and should match the number of neighbors your downstream workflow needs.
Increase intermediate_graph_degree when the final graph quality is too low. This gives NN-Descent more candidates to consider before it trims the graph down to graph_degree.
Increase max_iterations when graph quality is still improving at the end of the build. Lower it when build time matters more than graph quality.
Tune termination_threshold after choosing the degrees. A larger threshold stops earlier; a smaller threshold keeps refining until updates become rarer.
Use dist_comp_dtype="fp16" or DIST_COMP_DTYPE::FP16 when speed and memory matter more than distance precision. Use fp32 when small numeric differences matter.
Memory footprint
NN-Descent memory has two parts: the graph returned by the index and temporary graph-building workspaces. The final index does not keep the original dataset attached, but build needs a converted copy of the dataset on the GPU.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors.D: Vector dimension.B_in: Bytes per input vector value, such as4for fp32 or2for fp16.B_comp: Bytes per stored compute value. Use4for fp32 distance computation and2for fp16 distance computation.G: Final graph degree, orgraph_degree, after clipping to the dataset size.I: Intermediate graph degree, orintermediate_graph_degree, after clipping to the dataset size.E_G: Expanded graph degree,roundUp32(G)whenG <= 32, otherwiseroundUp32(1.3 * G).E_I: Expanded intermediate degree,roundUp32(I)whenI <= 32, otherwiseroundUp32(1.3 * I).S: Sample width used by the implementation, currently32.R:1whenreturn_distancesis enabled, otherwise0.M_l2:1for L2 metrics that store norms, otherwise0.S_idx: Bytes per graph ID, currentlysizeof(uint32_t).
The named terms in the formulas are also memory sizes:
graph_size: Host memory for the returned neighbor IDs.distances_size: Device memory for returned distances when requested.compute_dataset_size: Device memory for the converted build dataset.*_peak: Temporary peak memory for build workspaces.
Scratch and maximum vectors
The formulas below already include the major NN-Descent workspaces. Additional scratch comes from allocator padding, CUDA library workspaces, memory-resource pools, and small implementation buffers. Use H = 0.30 for build estimates. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
For fixed D, G, I, and dtype settings, the device and host peaks are linear in N. Rewrite the selected peak as:
and solve:
Check device and host memory separately. The usable N is the smaller of the device-memory limit and the host/pinned-memory limit.
Baseline memory after build
The baseline host memory kept by the NN-Descent index is:
If return_distances is enabled, the index also keeps graph distances on device:
The total index footprint is approximately:
Example (N = 1e6, graph_degree = 64, return_distances = true):
graph_size = 256000000 B = 244.14 MiBdistances_size = 256000000 B = 244.14 MiBindex_size = 512000000 B = 488.28 MiB
Build peak memory usage
During build, NN-Descent stores a converted dataset on the GPU:
For L2 metrics, it also stores one fp32 norm per vector:
The main device workspaces are a sampled graph buffer, a distance buffer, locks, and list-size counters:
Using the current sample width and uint32_t IDs, this is:
The approximate device peak is:
If the input dataset is already on the host, the first term is not device-resident input data. The build still keeps the converted compute dataset on device.
Host and pinned memory include the returned graph, the expanded internal graph, graph distances, sample buffers, reverse-edge buffers, and a Bloom filter used for sampling:
The N * 912 term comes from fixed-width sample, reverse-edge, update, and list-size buffers. The N * E_I * 2 term estimates the Bloom filter bitset.
Example (N = 1e6, D = 128, fp32 input, fp16 compute, G = 64, I = 128, return_distances = true):
compute_dataset_size = 256000000 B = 244.14 MiBdevice_workspace_size = 276000000 B = 263.21 MiBdistances_size = 256000000 B = 244.14 MiBdevice_build_peak = 1304000000 B = 1243.59 MiB
With G = 64, E_G = 96; with I = 128, E_I = 192:
host_workspace_size = 2320000000 B = 2212.52 MiB
Search memory usage
NN-Descent does not search new query vectors, so it has no query workspace. Downstream search memory depends on the index or workflow that consumes the generated graph.