CAGRA

CAGRA is a GPU-optimized graph index for approximate nearest-neighbor search. Think of every vector as a point, and think of the graph as a map that connects each point to nearby points. During search, CAGRA follows that map toward better matches instead of checking every vector.

CAGRA works well when you want strong recall, high GPU throughput, and fast graph construction.

Example API Usage

Building an index

Examples are available for C, C++, Python, Java, Rust, and Go. The C version:

#include <cuvs/neighbors/cagra.h>

cuvsResources_t res;
cuvsCagraIndexParams_t index_params;
cuvsCagraIndex_t index;
DLManagedTensor dataset;

// Populate the tensor with data (load_dataset is a placeholder helper).
load_dataset(&dataset);

cuvsResourcesCreate(&res);
cuvsCagraIndexParamsCreate(&index_params);
cuvsCagraIndexCreate(&index);

cuvsCagraBuild(res, index_params, &dataset, index);

cuvsCagraIndexDestroy(index);
cuvsCagraIndexParamsDestroy(index_params);
cuvsResourcesDestroy(res);

Extending an index

Examples are available for C, C++, Python, and Go. The C version:

#include <cuvs/neighbors/cagra.h>

cuvsResources_t res;
cuvsCagraExtendParams_t extend_params;
cuvsCagraIndex_t index;
DLManagedTensor additional_dataset;

// Populate the tensor with the new vectors
// (load_additional_dataset is a placeholder helper).
load_additional_dataset(&additional_dataset);

cuvsResourcesCreate(&res);
cuvsCagraExtendParamsCreate(&extend_params);
cuvsCagraIndexCreate(&index);

// ... build or load index ...
cuvsCagraExtend(res, extend_params, &additional_dataset, index);

cuvsCagraIndexDestroy(index);
cuvsCagraExtendParamsDestroy(extend_params);
cuvsResourcesDestroy(res);

See the C, C++, Python, and Go API references for the full signatures.

Searching an index

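Examples are available for C, C++, Python, Java, Rust, and Go. The C version:

#include <cuvs/neighbors/cagra.h>
#include <cuvs/neighbors/common.h>

cuvsResources_t res;
cuvsCagraSearchParams_t search_params;
cuvsCagraIndex_t index;
DLManagedTensor queries;
DLManagedTensor neighbors;
DLManagedTensor distances;

// Populate the query tensor and allocate the output tensors
// (load_queries and allocate_outputs are placeholder helpers).
load_queries(&queries);
allocate_outputs(&neighbors, &distances);

cuvsResourcesCreate(&res);
cuvsCagraSearchParamsCreate(&search_params);
cuvsCagraIndexCreate(&index);

// ... build or load index ...

// The C search call takes a filter argument; pass NO_FILTER for an unfiltered search.
cuvsFilter filter;
filter.type = NO_FILTER;
filter.addr = (uintptr_t)NULL;

cuvsCagraSearch(res, search_params, index, &queries, &neighbors, &distances, filter);

cuvsCagraIndexDestroy(index);
cuvsCagraSearchParamsDestroy(search_params);
cuvsResourcesDestroy(res);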

Saving and loading an index

Serialize a CAGRA index when you want to reuse the graph without rebuilding it. Include the dataset when the loaded index should be searchable immediately; omit it only when your workflow will attach or provide the dataset separately.

Go does not currently expose CAGRA save/load wrappers.

Examples are available for C, C++, Python, Java, and Rust. The C version:

#include <cuvs/neighbors/cagra.h>

cuvsResources_t res;
cuvsCagraIndex_t index;
cuvsCagraIndex_t loaded_index;

cuvsResourcesCreate(&res);
cuvsCagraIndexCreate(&index);
cuvsCagraIndexCreate(&loaded_index);

// ... build index ...

// The final argument (true) includes the dataset in the serialized file.
cuvsCagraSerialize(res, "/tmp/cuvs-cagra.bin", index, true);
cuvsCagraDeserialize(res, "/tmp/cuvs-cagra.bin", loaded_index);

cuvsCagraIndexDestroy(loaded_index);
cuvsCagraIndexDestroy(index);
cuvsResourcesDestroy(res);

How CAGRA works

CAGRA builds and searches a nearest-neighbor graph.

First, CAGRA builds an initial kNN graph. This is the first draft of the map: each vector is connected to vectors that look nearby. An exact brute-force build can create a very accurate initial graph, but it is usually too slow. In practice, the first graph does not need to be perfect because CAGRA improves it later. NVIDIA cuVS can build this initial graph with IVF-PQ or NN-Descent.

Second, CAGRA prunes the initial graph. This removes redundant paths and keeps the links that are most useful for search.

At search time, CAGRA starts from one or more graph vertices, follows links to better candidates, and keeps a working set of the best candidates it has seen so far.
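The search loop can be illustrated with a toy, CPU-only sketch. It is not the cuVS search kernel: the 1-D points, the fixed out-degree, the working-set size TOY_TOPK, and every toy_* name are assumptions made for illustration. The sketch keeps a small sorted working set, repeatedly expands the best candidate that has not been expanded yet, and stops when nothing in the working set is left to expand.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DEGREE 2 /* fixed out-degree of the toy graph (assumption) */
#define TOY_TOPK 4   /* size of the working candidate set (assumption) */

typedef struct {
    int id;        /* vertex id */
    float dist;    /* squared distance to the query */
    bool expanded; /* true once this vertex's links were followed */
} toy_candidate;

/* Squared distance between a 1-D point and the query. */
static float toy_sqdist(float point, float query)
{
    float d = point - query;
    return d * d;
}

/* Insert (id, dist) into the sorted working set unless the id is already
 * present or the entry is no better than the worst entry of a full set. */
static void toy_insert(toy_candidate *set, int *count, int id, float dist)
{
    for (int i = 0; i < *count; ++i) {
        if (set[i].id == id) return;
    }
    if (*count == TOY_TOPK && dist >= set[TOY_TOPK - 1].dist) return;

    /* Take a new slot, or overwrite the worst entry when the set is full. */
    int pos = (*count < TOY_TOPK) ? (*count)++ : TOY_TOPK - 1;
    while (pos > 0 && set[pos - 1].dist > dist) {
        set[pos] = set[pos - 1];
        --pos;
    }
    set[pos].id = id;
    set[pos].dist = dist;
    set[pos].expanded = false;
}

/* graph is row-major: the neighbors of v are graph[v * TOY_DEGREE + j].
 * Returns the id of the best candidate found. */
static int toy_greedy_search(const float *points, const int *graph, int n,
                             int start, float query)
{
    assert(n > 0 && start >= 0 && start < n);

    toy_candidate set[TOY_TOPK];
    int count = 0;
    toy_insert(set, &count, start, toy_sqdist(points[start], query));

    for (;;) {
        /* Pick the best candidate that has not been expanded yet. */
        int pick = -1;
        for (int i = 0; i < count; ++i) {
            if (!set[i].expanded) { pick = i; break; }
        }
        if (pick < 0) break; /* every kept candidate was expanded */

        set[pick].expanded = true;
        int v = set[pick].id;
        for (int j = 0; j < TOY_DEGREE; ++j) {
            int u = graph[v * TOY_DEGREE + j];
            toy_insert(set, &count, u, toy_sqdist(points[u], query));
        }
    }
    return set[0].id;
}
```

On a ring of eight points at positions 0 through 7, a search that starts at vertex 0 with a query at 5.2 follows links around the ring and returns vertex 5. A larger working set explores more of the graph before stopping, which is the same trade-off the itopk_size search parameter controls in CAGRA.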

When to use CAGRA

Use CAGRA when the index fits in GPU memory and you want fast approximate search.

Use CAGRA when build speed matters. CAGRA can build graphs quickly on the GPU.

Use CAGRA in hybrid environments where a GPU-built graph is converted to HNSW for CPU search.

Use brute-force instead when exact results are required or the dataset is small enough that a full scan is already fast enough.

Interoperability with HNSW

NVIDIA cuVS can convert a CAGRA graph to an HNSW graph. This lets the GPU build the graph while the CPU handles search later. This is useful when GPUs are available for indexing, but production search runs on CPUs.

If the graph is being serialized or converted to HNSW right after build, avoid keeping the dataset attached to the CAGRA index when the binding exposes that option. In C++, for example, set attach_dataset_on_build to false.

These examples cover the bindings that currently expose CAGRA and HNSW interoperability. Go supports CAGRA build and search, but does not currently expose HNSW conversion.

Examples are available for C, C++, Python, Java, and Rust. The C version:

#include <cuvs/neighbors/cagra.h>
#include <cuvs/neighbors/hnsw.h>

cuvsResources_t res;
cuvsCagraIndexParams_t cagra_params;
cuvsCagraIndex_t cagra_index;
cuvsHnswIndexParams_t hnsw_params;
cuvsHnswIndex_t hnsw_index;
cuvsHnswSearchParams_t hnsw_search_params;
DLManagedTensor dataset;
DLManagedTensor queries;
DLManagedTensor neighbors;
DLManagedTensor distances;

int64_t n_rows = 1000000;
int64_t dim = 128;
int M = 32;
int ef_construction = 200;

cuvsResourcesCreate(&res);
cuvsCagraIndexParamsCreate(&cagra_params);
cuvsCagraIndexCreate(&cagra_index);
cuvsHnswIndexParamsCreate(&hnsw_params);
cuvsHnswIndexCreate(&hnsw_index);
cuvsHnswSearchParamsCreate(&hnsw_search_params);

// Placeholder helpers: the dataset is used for the GPU build, while the
// queries and outputs are host tensors because HNSW search runs on the CPU.
load_dataset(&dataset);
load_host_queries(&queries);
allocate_hnsw_outputs(&neighbors, &distances);

// Derive CAGRA build parameters from the target HNSW settings.
cuvsCagraIndexParamsFromHnswParams(
    cagra_params,
    n_rows,
    dim,
    M,
    ef_construction,
    CUVS_CAGRA_HEURISTIC_SIMILAR_SEARCH_PERFORMANCE,
    L2Expanded);

hnsw_params->hierarchy = GPU;
hnsw_search_params->ef = 200;
hnsw_search_params->num_threads = 0;

// Build on the GPU, convert to HNSW, then search with HNSW.
cuvsCagraBuild(res, cagra_params, &dataset, cagra_index);
cuvsHnswFromCagra(res, hnsw_params, cagra_index, hnsw_index);
cuvsHnswSearch(res, hnsw_search_params, hnsw_index, &queries, &neighbors, &distances);

cuvsHnswSearchParamsDestroy(hnsw_search_params);
cuvsHnswIndexDestroy(hnsw_index);
cuvsHnswIndexParamsDestroy(hnsw_params);
cuvsCagraIndexDestroy(cagra_index);
cuvsCagraIndexParamsDestroy(cagra_params);
cuvsResourcesDestroy(res);

Using Filters

CAGRA supports filtered search. A filter hides some vectors from the search result, so CAGRA may need to explore more of the graph to find enough valid neighbors.

CAGRA can adjust itopk_size internally based on the filtering rate. To disable this automatic adjustment, set filtering_rate to 0.0.

The examples below use a bitset filter. A bit value of 1 means a vector is allowed; a bit value of 0 means it is filtered out.
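As a rough host-side illustration of the bitset layout, the sketch below packs one bit per indexed vector into 32-bit words. The host_bitset_* names are hypothetical and not part of cuVS. The sketch also assumes that vector i maps to bit i % 32 of word i / 32, and it stops short of copying the words to device memory or wrapping them in a DLPack tensor.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of 32-bit words needed to hold one bit per indexed vector. */
static size_t host_bitset_words(size_t n_vectors)
{
    return (n_vectors + 31) / 32;
}

/* Allocate a zeroed bitset, so every vector starts filtered out.
 * The caller releases it with free(). */
static uint32_t *host_bitset_create(size_t n_vectors)
{
    assert(n_vectors > 0);
    return (uint32_t *)calloc(host_bitset_words(n_vectors), sizeof(uint32_t));
}

/* Mark vector i as allowed (bit value 1). */
static void host_bitset_allow(uint32_t *bits, size_t i)
{
    bits[i / 32] |= (uint32_t)1 << (i % 32);
}

/* Return 1 if vector i is allowed and 0 if it is filtered out. */
static int host_bitset_is_allowed(const uint32_t *bits, size_t i)
{
    return (int)((bits[i / 32] >> (i % 32)) & 1u);
}
```

Starting from all zeros and setting bits only for allowed vectors matches the convention above, where 0 means filtered out.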

Examples are available for C, C++, Python, Java, Rust, and Go. The C version:

#include <cuvs/neighbors/cagra.h>
#include <cuvs/neighbors/common.h>

cuvsResources_t res;
cuvsCagraIndexParams_t index_params;
cuvsCagraSearchParams_t search_params;
cuvsCagraIndex_t index;
DLManagedTensor dataset;
DLManagedTensor queries;
DLManagedTensor neighbors;
DLManagedTensor distances;

cuvsResourcesCreate(&res);
cuvsCagraIndexParamsCreate(&index_params);
cuvsCagraSearchParamsCreate(&search_params);
cuvsCagraIndexCreate(&index);

// Populate DLPack tensors with dataset, query, and output data
// (load_dataset, load_queries, and allocate_outputs are placeholder helpers).
load_dataset(&dataset);
load_queries(&queries);
allocate_outputs(&neighbors, &distances);

cuvsCagraBuild(res, index_params, &dataset, index);

// Create a device uint32 bitset with one bit per indexed vector. Bit 1 means
// allowed; bit 0 means filtered out. make_device_bitset, allowed_indices, and
// n_vectors are placeholders supplied by the application.
DLManagedTensor *bitset = make_device_bitset(allowed_indices, n_vectors);

cuvsFilter filter;
filter.type = BITSET;
filter.addr = (uintptr_t)bitset;

cuvsCagraSearch(res, search_params, index, &queries, &neighbors, &distances, filter);

cuvsCagraIndexDestroy(index);
cuvsCagraSearchParamsDestroy(search_params);
cuvsCagraIndexParamsDestroy(index_params);
cuvsResourcesDestroy(res);

Configuration parameters

Build parameters

Name	Default	Description
`metric`	`L2Expanded` / `sqeuclidean`	Distance metric used to build and search the graph.
`metric_arg`	`2.0`	Extra argument for metrics that need one, such as Minkowski distance.
`intermediate_graph_degree`	`128`	Number of neighbors kept in the initial graph before pruning. Larger values can improve the final graph, but increase build time and memory use.
`graph_degree`	`64`	Number of neighbors kept for each vertex in the final graph. Larger values can improve recall, but use more memory and search work.
`compression`	None	Optional vector product quantization parameters. When set, the compressed dataset is attached to the index and `attach_dataset_on_build` is effectively enabled.
`graph_build_params`	`std::monostate`	Parameters for the initial graph builder. The default lets NVIDIA cuVS choose a heuristic; explicit options include IVF-PQ, NN-Descent, ACE, and iterative-search graph build parameters.
`guarantee_connectivity`	`False`	Uses a degree-constrained minimum spanning tree to guarantee the initial kNN graph is connected. This can improve recall on some datasets.
`attach_dataset_on_build`	`True`	Keeps the dataset attached to the index after build. Set to `False` when serializing or converting to another graph format right after build.

Search parameters

Name	Default	Description
`max_queries`	`0`	Maximum number of queries searched concurrently. `0` lets NVIDIA cuVS choose automatically.
`itopk_size`	`64`	Number of intermediate search results kept during search. This must be at least `k` and is the main search tuning knob.
`max_iterations`	`0`	Maximum number of search iterations. `0` lets NVIDIA cuVS choose automatically.
`algo`	`AUTO`	Search implementation. Options include `SINGLE_CTA`, `MULTI_CTA`, `MULTI_KERNEL`, and `AUTO`.
`team_size`	`0`	Number of CUDA threads used to calculate each distance. Valid values are 4, 8, 16, or 32. `0` lets NVIDIA cuVS choose automatically.
`search_width`	`1`	Number of vertices selected as starting points for each search iteration.
`min_iterations`	`0`	Minimum number of search iterations.
`thread_block_size`	`0`	CUDA thread block size. Supported values include 64, 128, 256, 512, and 1024. `0` lets NVIDIA cuVS choose automatically.
`hashmap_mode`	`AUTO`	Hash map implementation used during search. Options include `HASH`, `SMALL`, and `AUTO`.
`hashmap_min_bitlen`	`0`	Lower limit for the hash map bit length. `0` lets NVIDIA cuVS choose automatically.
`hashmap_max_fill_rate`	`0.5`	Maximum hash map fill rate. Valid values are greater than 0.1 and less than 0.9.
`num_random_samplings`	`1`	Number of initial random seed-node selection iterations.
`rand_xor_mask`	`0x128394`	Bit mask used for initial random seed-node selection.
`persistent`	`False`	Uses the persistent search kernel where supported. Currently this applies only to `SINGLE_CTA`.
`persistent_lifetime`	`2.0`	Seconds before a persistent kernel stops when no requests are received.
`persistent_device_usage`	`1.0`	Fraction of the maximum grid size used by the persistent kernel. Lower values can leave GPU capacity for other work.
`filtering_rate`	`-1.0`	Expected fraction of nodes filtered out during filtered search. Negative values let NVIDIA cuVS estimate it automatically.

Tuning

The three parameters most often tuned are itopk_size, graph_degree, and intermediate_graph_degree.

Start with itopk_size. Increasing it usually improves recall, but lowers throughput because CAGRA keeps more candidates during search.

If search-time tuning is not enough, increase graph_degree. This gives each vertex more links to follow, but uses more memory and search work.

If the final graph quality is still too low, increase intermediate_graph_degree. This gives pruning more choices, but makes build more expensive.

Persistent search

Persistent search can improve throughput in services that run many concurrent CAGRA searches. Instead of launching a new search kernel for each request, NVIDIA cuVS can keep a persistent search kernel resident on the GPU and feed it incoming work. This reduces launch overhead and can help high-volume search services keep the GPU busy.

Enable it with the persistent search parameter. Persistent search currently applies to the SINGLE_CTA search implementation, so set algo to SINGLE_CTA when you want to force this path instead of relying on AUTO.

Use persistent_lifetime to control how long the persistent kernel waits for new work before stopping. Use persistent_device_usage to reserve less than the full GPU for the persistent kernel when the same GPU also needs to run other kernels. Keeping other GPU work active alongside a persistent kernel can be fragile, so tune this setting carefully and validate it under the same concurrency pattern used in production.

Memory footprint

CAGRA memory has two main parts: the dataset and the graph. During build, the dataset must be in GPU memory. After build, the dataset can be detached if it is not needed for search, for example when immediately converting the graph to HNSW.

To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.

N: Number of database vectors, or rows in the dataset being indexed.
D: Vector dimension, or number of values in each vector.
B: Bytes stored for each vector value. Use 4 for fp32, 2 for fp16, or the byte width of the attached dataset representation.
G: Final graph degree. This is the graph_degree build parameter, and each vector keeps G neighbor IDs after pruning.
I: Intermediate graph degree. This is the intermediate_graph_degree build parameter, and CAGRA uses this larger graph before pruning down to G.
C: Number of IVF-PQ coarse clusters/lists. This is the IVF-PQ n_lists value used by the graph build parameters.
R: IVF-PQ training-set ratio. This is train_set_ratio; R = 10 means training uses roughly N / 10 vectors.
Q: Query batch size, or number of query vectors processed together.
K: Search result count, or the requested k/topk nearest neighbors per query.
S_idx: Bytes per graph neighbor ID. This is sizeof(IdxT), usually 4 for int32_t or uint32_t.

The named terms in the formulas are also memory sizes:

dataset_size: Device memory used by the attached dataset vectors.
graph_size: Host memory used by the CAGRA graph neighbor IDs.
*_peak: Temporary peak memory for one build phase. Sequential phases are not added together.
query_size: Device memory for the current query batch.
result_size: Device memory for neighbor IDs and distances returned for the current query batch.
workspace_size: Query and result memory used during search.

Scratch and maximum vectors

Most CAGRA formulas below are linear in N once build parameters are fixed. The named temporary peaks are the main scratch terms for build phases, but real runs can also include allocator padding, CUDA library workspaces, memory-resource pools, and small implementation buffers. Reserve a headroom factor H = 0.20 for IVF-PQ graph builds and H = 0.30 for NN-Descent or iterative-search graph builds. If you can measure a representative smaller run, use:

H_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
observed_peak: Peak memory observed during a smaller representative run.
formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
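peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors. Where a formula is solved for K-selection batch rows instead, the count is written B (as in B_max), which is distinct from the bytes-per-value symbol B defined earlier.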
B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

Choose the build or search formula that matches the operation, remove the explicit scratch/headroom from it, and rewrite it as:

\text{peak\_without\_scratch}(N) = N \cdot B_{\text{per\_vector}} + B_{\text{fixed}}

Then estimate:

N_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_vector}}} \right\rfloor
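The two capacity formulas can be sketched as small integer helpers. The capacity_* names are hypothetical. Passing the headroom H as a whole-number percentage is an assumption made here to keep the arithmetic exact, and both results round down so the estimate stays conservative.

```c
#include <assert.h>
#include <stdint.h>

/* M_usable = (M_free - M_other) * (1 - H), with H given as a percentage.
 * Returns 0 when the reserved memory already exceeds the free memory. */
static uint64_t capacity_usable_bytes(uint64_t m_free, uint64_t m_other,
                                      unsigned headroom_percent)
{
    assert(headroom_percent < 100);
    if (m_other >= m_free) return 0;
    /* Divide first so the product cannot overflow; this rounds down. */
    return (m_free - m_other) / 100 * (100 - headroom_percent);
}

/* N_max = floor((M_usable - B_fixed) / B_per_vector).
 * Returns 0 when the fixed bytes alone do not fit. */
static uint64_t capacity_max_vectors(uint64_t m_usable, uint64_t b_fixed,
                                     uint64_t b_per_vector)
{
    assert(b_per_vector > 0);
    if (b_fixed >= m_usable) return 0;
    return (m_usable - b_fixed) / b_per_vector;
}
```

For example, 1000 free bytes with 200 reserved and H = 0.20 leaves 640 usable bytes. With 40 fixed bytes and 100 bytes per vector, that budget holds 6 vectors.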

For out-of-core IVF-PQ graph build, Q, C, and R can make several terms fixed or sublinear for a fixed configuration. Solve the full max(...) expression if the largest phase changes as N changes.

Baseline memory after build

The baseline memory footprint after index construction is:

\begin{aligned} \text{dataset\_size (device)} &= N \times D \times B \end{aligned}

\begin{aligned} \text{graph\_size (host)} &= N \times G \times S_{\text{idx}} \end{aligned}

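For in-memory builds, the dataset must be in GPU memory during index build; it can be detached afterward if it is not needed for search. Batched IVF-PQ builds, described under build peak memory usage, relax the build-time requirement.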

Example (1,000,000 vectors, dim = 1024, fp32, graph_degree = 64, IdxT = int32):

dataset_size = 4,096,000,000 B = 3906.25 MiB
graph_size = 256,000,000 B = 244.14 MiB
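The baseline formulas translate directly into code. The sketch below uses hypothetical helper names (baseline_*) and plain 64-bit arithmetic; it only restates the formulas and does not query cuVS.

```c
#include <assert.h>
#include <stdint.h>

/* dataset_size = N * D * B */
static uint64_t baseline_dataset_bytes(uint64_t n, uint64_t d, uint64_t b)
{
    assert(b > 0);
    return n * d * b;
}

/* graph_size = N * G * S_idx */
static uint64_t baseline_graph_bytes(uint64_t n, uint64_t g, uint64_t s_idx)
{
    assert(s_idx > 0);
    return n * g * s_idx;
}

/* Convert bytes to MiB by dividing by 1024 * 1024. */
static double baseline_bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

With N = 1,000,000, D = 1024, B = 4, G = 64, and S_idx = 4, the helpers return 4,096,000,000 and 256,000,000 bytes, which is 3906.25 MiB and about 244.14 MiB.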

Build peak memory usage

Index build has two phases: construct an initial kNN graph, then optimize it by pruning redundant paths. These steps run sequentially, so their peak memory use is not additive. The overall peak depends on the configured RMM memory resource.

The initial graph can be built with IVF-PQ, NN-Descent, or the experimental iterative CAGRA-search builder. IVF-PQ can build in batches, which allows CAGRA to train on datasets larger than available GPU memory. The iterative builder requires the aligned dataset to fit in GPU memory because it repeatedly searches the partially built CAGRA graph.

Initial graph build using IVF-PQ

IVF-PQ builds the initial graph in two stages. First, it trains cluster centroids and PQ codebooks. Then it queries the IVF-PQ index in batches to form approximate nearest-neighbor lists.

IVF-PQ build peak:

Here, N / R is the IVF-PQ training sample size. The 4 byte factors are fp32 values for training vectors and cluster centroids. The uint32_t term stores one 32-bit ID per training vector.

\begin{aligned} \text{IVFPQ\_build\_peak} &= \frac{N}{R} \times D \times 4 \\ &\quad + C \times D \times 4 \\ &\quad + \frac{N}{R} \times \operatorname{sizeof}(\mathrm{uint32\_t}) \end{aligned}

Example (N = 1e6, D = 1024, C = 1024, R = 10): 395.01 MiB

IVF-PQ search peak:

Here, Q is the number of vectors in one search batch and I is the number of candidates kept per query while building the intermediate graph. The three terms estimate query vectors, candidate IDs, and candidate distances.

\begin{aligned} \text{IVFPQ\_search\_peak} &= Q \times D \times 4 \\ &\quad + Q \times I \times \operatorname{sizeof}(\mathrm{uint32\_t}) \\ &\quad + Q \times I \times 4 \end{aligned}

Example (Q = 1024, D = 1024, I = 128): 5.00 MiB
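A minimal sketch of the two IVF-PQ peaks follows. The ivfpq_* names are hypothetical, and N / R is computed with integer division, which is an assumption about how the training sample size is rounded.

```c
#include <assert.h>
#include <stdint.h>

/* IVFPQ_build_peak = (N / R) * D * 4 + C * D * 4 + (N / R) * sizeof(uint32_t) */
static uint64_t ivfpq_build_peak_bytes(uint64_t n, uint64_t r, uint64_t d,
                                       uint64_t c)
{
    assert(r > 0);
    uint64_t n_train = n / r;                /* training sample size */
    return n_train * d * 4                   /* fp32 training vectors */
         + c * d * 4                         /* fp32 cluster centroids */
         + n_train * sizeof(uint32_t);       /* one 32-bit ID per training vector */
}

/* IVFPQ_search_peak = Q * D * 4 + Q * I * sizeof(uint32_t) + Q * I * 4 */
static uint64_t ivfpq_search_peak_bytes(uint64_t q, uint64_t d, uint64_t i)
{
    assert(q > 0);
    return q * d * 4                         /* fp32 query vectors */
         + q * i * sizeof(uint32_t)          /* candidate IDs */
         + q * i * 4;                        /* fp32 candidate distances */
}
```

For the example values in this section, the build peak is 414,194,304 bytes (about 395.01 MiB) and the search peak is 5,242,880 bytes (5.00 MiB).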

Initial graph build using NN-Descent

Peak device memory:

The constants in the NN-Descent formulas are per-vector workspace estimates from the implementation. They are added to the vector storage terms before multiplying by N.

\begin{aligned} \text{NND\_device\_peak} &= N \times (D \times 2 + 276) \end{aligned}

Data vectors are transferred to device and stored as fp16: D * 2 bytes per vector.
The small working graph, locks, and edge counters use 276 bytes per vector.
L2 metric adds 4 bytes per vector for precomputed norms.

Peak host memory:

\begin{aligned} \text{NND\_host\_peak} &= N \times (13 \times I + 912) \end{aligned}

Full graph with distances: 1.3 * 8 * I bytes per vector.
Bloom filter for sampling: 1.3 * 2 * I bytes per vector.
5 sample buffers with degree 32: 640 bytes per vector.
Graph update buffer with degree 32: 256 bytes per vector.
Edge counters: 16 bytes per vector.
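The NN-Descent estimates can be sketched the same way. The nnd_* names are hypothetical, and the constants 276, 13, and 912 are taken from the formulas above rather than from the implementation.

```c
#include <assert.h>
#include <stdint.h>

/* NND_device_peak = N * (D * 2 + 276) */
static uint64_t nnd_device_peak_bytes(uint64_t n, uint64_t d)
{
    assert(d > 0);
    return n * (d * 2 + 276);
}

/* NND_host_peak = N * (13 * I + 912)
 * 13 * I combines the 1.3 * 8 * I graph term and the 1.3 * 2 * I Bloom filter
 * term; 912 is 640 + 256 + 16 from the sample buffers, update buffer, and
 * edge counters. */
static uint64_t nnd_host_peak_bytes(uint64_t n, uint64_t i)
{
    assert(i > 0);
    return n * (13 * i + 912);
}
```

Evaluating the formulas for N = 1,000,000, D = 1024, and I = 128 gives 2,324,000,000 bytes on the device and 2,576,000,000 bytes on the host.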

Initial graph build using iterative CAGRA search

The iterative builder starts with a small connected graph, then repeatedly uses CAGRA search to find neighbors for a larger prefix of the dataset. After each search pass, it optimizes the graph and doubles the active graph size until all rows are included.

This path is useful when the metric or data type is better served by CAGRA search itself, but it is not an out-of-core builder. The dataset is copied or aligned into GPU memory before the first iteration.

Variables used only in this subsection:

D_align: Aligned device stride used by CAGRA search. Use D when no padding is required.
Q_iter: Maximum query chunk size used by the iterative builder. The implementation currently uses min(N, 8192).
K_iter: Number of temporary neighbors kept per query during the last pass. Use I + 1.
G_iter: Largest graph degree used by the temporary searchable graph. Use G; early iterations use a smaller degree and the final iterations use G.
D_iter: Aligned device dataset memory.
G_tmp: Largest temporary device graph memory.
Q_tile: Query tile memory for one search chunk.
R_tile: Result tile memory for one search chunk.
W_iter: Temporary device workspace used by one iterative search pass.
H_iter: Host neighbors-list capacity in bytes after rounding up to a 2 MiB boundary. One MiB is 1024 * 1024 bytes.

The aligned device dataset is:

\begin{aligned} D_{\text{iter}} &= N \times D_{\text{align}} \times B \end{aligned}

The largest temporary device graph used during the search pass is:

\begin{aligned} G_{\text{tmp}} &= N \times G_{\text{iter}} \times S_{\text{idx}} \end{aligned}

Each search chunk needs query storage plus temporary neighbor IDs and distances:

\begin{aligned} Q_{\text{tile}} &= Q_{\text{iter}} \times D_{\text{align}} \times B \\ R_{\text{tile}} &= Q_{\text{iter}} \times K_{\text{iter}} \times (S_{\text{idx}} + 4) \end{aligned}

The host neighbors list stores the temporary neighbor candidates for all rows:

\begin{aligned} H_{\text{iter}} &= \operatorname{round\_up} \big( N \times K_{\text{iter}} \times S_{\text{idx}}, 2\ \text{MiB} \big) \end{aligned}

The temporary device workspace for one search pass is:

\begin{aligned} W_{\text{iter}} &= G_{\text{tmp}} \\ &\quad + Q_{\text{tile}} \\ &\quad + R_{\text{tile}} \end{aligned}

The practical device peak for the iterative graph build is:

\begin{aligned} \text{iterative\_device\_peak} &\approx D_{\text{iter}} \\ &\quad + \max\!\big( W_{\text{iter}}, \text{optimize\_peak} \big) \end{aligned}

The practical host peak is:

\begin{aligned} \text{iterative\_host\_peak} &\approx H_{\text{iter}} + N \times G \times S_{\text{idx}} \end{aligned}

The final N * G * S_idx term is the host graph that remains after build. Check device and host memory separately. The usable N is the smaller value allowed by iterative_device_peak and iterative_host_peak.
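The iterative-builder formulas chain several intermediate terms, so a small sketch may help when plugging in numbers. Everything named iter_* is hypothetical. The sketch takes Q_iter = min(N, 8192), K_iter = I + 1, and G_iter = G as stated above, and it uses the optimize-phase estimate N * (4 + (S_idx + 1) * I) for optimize_peak.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t device_peak; /* iterative_device_peak in bytes */
    uint64_t host_peak;   /* iterative_host_peak in bytes */
} iter_build_peaks;

/* Round value up to the next multiple of `multiple`. */
static uint64_t iter_round_up(uint64_t value, uint64_t multiple)
{
    assert(multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}

static iter_build_peaks iter_estimate(uint64_t n, uint64_t d_align, uint64_t b,
                                      uint64_t g, uint64_t i, uint64_t s_idx)
{
    assert(n > 0 && d_align > 0 && b > 0 && s_idx > 0);

    const uint64_t two_mib = 2ULL * 1024 * 1024;
    uint64_t q_iter = (n < 8192) ? n : 8192;          /* query chunk size */
    uint64_t k_iter = i + 1;                          /* temporary neighbors */

    uint64_t d_iter = n * d_align * b;                /* aligned dataset */
    uint64_t g_tmp  = n * g * s_idx;                  /* temporary graph */
    uint64_t q_tile = q_iter * d_align * b;           /* query tile */
    uint64_t r_tile = q_iter * k_iter * (s_idx + 4);  /* IDs plus fp32 distances */
    uint64_t w_iter = g_tmp + q_tile + r_tile;        /* search workspace */
    uint64_t optimize_peak = n * (4 + (s_idx + 1) * i);
    uint64_t h_iter = iter_round_up(n * k_iter * s_idx, two_mib);

    iter_build_peaks peaks;
    peaks.device_peak = d_iter + (w_iter > optimize_peak ? w_iter : optimize_peak);
    peaks.host_peak   = h_iter + n * g * s_idx;
    return peaks;
}
```

For N = 1,000,000, D_align = 1024, fp32 data, G = 64, I = 128, and 4-byte IDs, the sketch gives a device peak of 4,740,000,000 bytes and a host peak of 773,996,544 bytes. In that configuration the optimize phase, not the search workspace, sets the device peak.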

Optimize phase

The optimize phase prunes and reorders the intermediate graph. Its peak memory scales linearly with the intermediate degree.

In this formula, the 4 byte term is per-vector bookkeeping. The (S_idx + 1) * I term stores I candidate neighbor IDs plus one byte of pruning state per candidate.

\begin{aligned} \text{optimize\_peak} &= N \times \Big( 4 + (S_{\text{idx}} + 1) \times I \Big) \end{aligned}

Example (N = 1e6, I = 128, IdxT = int32): 614.17 MiB

Out-of-core CAGRA build consists of IVF-PQ build, IVF-PQ search, and CAGRA optimization. These steps are sequential, so their temporary memory peaks are not added together.

Overall build peak memory usage

The overall device peak is the dataset size plus the largest temporary allocation from the sequential build steps.

Using IVF-PQ:

\begin{aligned} \text{build\_peak} &= \text{dataset\_size} \\ &\quad + \max\!\big( \text{IVFPQ\_build\_peak}, \\ &\qquad\qquad \text{IVFPQ\_search\_peak}, \\ &\qquad\qquad \text{optimize\_peak} \big) \end{aligned}

Example: 3906.25 + max(395.01, 5.00, 614.17) = 4520.42 MiB
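The IVF-PQ build peak combines the dataset with the largest of three sequential phases. The sketch below uses hypothetical build_* helper names and takes the per-phase peaks as inputs, so it only demonstrates the max-then-add structure of the formula.

```c
#include <assert.h>
#include <stdint.h>

/* optimize_peak = N * (4 + (S_idx + 1) * I) */
static uint64_t build_optimize_peak_bytes(uint64_t n, uint64_t i, uint64_t s_idx)
{
    assert(s_idx > 0);
    return n * (4 + (s_idx + 1) * i);
}

/* Largest of three sequential phase peaks. */
static uint64_t build_max3(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t m = (a > b) ? a : b;
    return (m > c) ? m : c;
}

/* build_peak = dataset_size + max(IVFPQ_build_peak, IVFPQ_search_peak, optimize_peak) */
static uint64_t build_peak_ivfpq_bytes(uint64_t dataset_size,
                                       uint64_t ivfpq_build_peak,
                                       uint64_t ivfpq_search_peak,
                                       uint64_t optimize_peak)
{
    return dataset_size
         + build_max3(ivfpq_build_peak, ivfpq_search_peak, optimize_peak);
}
```

With the byte values from the earlier examples (4,096,000,000 for the dataset, 414,194,304 and 5,242,880 for IVF-PQ, 644,000,000 for optimize), the result is 4,740,000,000 bytes, about 4520.42 MiB. The optimize phase is the largest temporary term in that configuration.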

Using NN-Descent:

\begin{aligned} \text{build\_peak} &= \text{dataset\_size}^{*} \\ &\quad + \max\!\big( \text{NND\_device\_peak}, \\ &\qquad\qquad \text{optimize\_peak} \big) \end{aligned}

dataset_size* applies only when the user passes data that is already in device memory. NN-Descent internally copies the dataset to the device as fp16, so host-memory inputs do not add this term.

Using iterative CAGRA search:

Use iterative_device_peak for device memory and iterative_host_peak for host memory. These estimates already include the aligned dataset, temporary search chunks, temporary graph storage, optimization workspace, and final host graph.

Search peak memory usage

CAGRA search requires the dataset and graph to already be resident in GPU memory. When using CAGRA-Q, the original dataset can reside in host memory instead. Search also needs temporary workspace for query vectors and results.

If multiple batches run concurrently or overlap, each batch needs separate result buffers. The estimate below assumes one query batch at a time and reused buffers.

\begin{aligned} \text{search\_memory} &= \text{dataset\_size} \\ &\quad + \text{graph\_size} \\ &\quad + \text{workspace\_size} \end{aligned}

The workspace contains query vectors and result storage:

In the query formula, sizeof(float) is 4 bytes because CAGRA search uses fp32 query storage here. In the result formula, each returned neighbor stores one graph ID of size S_idx and one fp32 distance.

\begin{aligned} \text{query\_size} &= Q \times D \times \operatorname{sizeof}(\mathrm{float}) \end{aligned}

\begin{aligned} \text{result\_size} &= Q \times K \\ &\quad \times \big(S_{\text{idx}} + \operatorname{sizeof}(\mathrm{float})\big) \end{aligned}

\begin{aligned} \text{workspace\_size} &= \text{query\_size} + \text{result\_size} \end{aligned}

Example (D = 1024, Q = 100, K = 10, IdxT = int32):

query_size = 409,600 B = 0.39 MiB
result_size = 8,000 B = 0.0076 MiB
workspace_size = query_size + result_size = 0.40 MiB
total search memory ~= 3906.25 + 244.14 + 0.40 = 4150.79 MiB
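The search estimate can be sketched with a few helpers. The search_* names are hypothetical, and the sketch assumes one query batch at a time with reused buffers, as the estimate above does.

```c
#include <assert.h>
#include <stdint.h>

/* query_size = Q * D * sizeof(float) */
static uint64_t search_query_bytes(uint64_t q, uint64_t d)
{
    assert(q > 0);
    return q * d * sizeof(float);
}

/* result_size = Q * K * (S_idx + sizeof(float)) */
static uint64_t search_result_bytes(uint64_t q, uint64_t k, uint64_t s_idx)
{
    assert(s_idx > 0);
    return q * k * (s_idx + sizeof(float));
}

/* search_memory = dataset_size + graph_size + workspace_size,
 * where workspace_size = query_size + result_size. */
static uint64_t search_total_bytes(uint64_t dataset_size, uint64_t graph_size,
                                   uint64_t q, uint64_t d, uint64_t k,
                                   uint64_t s_idx)
{
    uint64_t workspace = search_query_bytes(q, d)
                       + search_result_bytes(q, k, s_idx);
    return dataset_size + graph_size + workspace;
}
```

For D = 1024, Q = 100, K = 10, and 4-byte IDs, the workspace is 417,600 bytes. Adding the 4,096,000,000-byte dataset and 256,000,000-byte graph from the baseline example gives 4,352,417,600 bytes, about 4150.79 MiB.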