CAGRA
CAGRA is a GPU-optimized graph index for approximate nearest-neighbor search. Think of every vector as a point, and think of the graph as a map that connects each point to nearby points. During search, CAGRA follows that map toward better matches instead of checking every vector.
CAGRA works well when you want strong recall, high GPU throughput, and fast graph construction.
Example API Usage
C API | C++ API | Python API | Java API | Rust API | Go API
Building an index
C
C++
Python
Java
Rust
Go
Searching an index
C
C++
Python
Java
Rust
Go
Saving and loading an index
Serialize a CAGRA index when you want to reuse the graph without rebuilding it. Include the dataset when the loaded index should be searchable immediately; omit it only when your workflow will attach or provide the dataset separately.
Go does not currently expose CAGRA save/load wrappers.
C
C++
Python
Java
Rust
How CAGRA works
CAGRA builds and searches a nearest-neighbor graph.
First, CAGRA builds an initial kNN graph. This is the first draft of the map: each vector is connected to vectors that look nearby. An exact brute-force build can create a very accurate initial graph, but it is usually too slow. In practice, the first graph does not need to be perfect because CAGRA improves it later. cuVS can build this initial graph with IVF-PQ or NN-Descent.
Second, CAGRA prunes the initial graph. This removes redundant paths and keeps the links that are most useful for search.
At search time, CAGRA starts from one or more graph vertices, follows links to better candidates, and keeps a working set of the best candidates it has seen so far.
When to use CAGRA
Use CAGRA when the index fits in GPU memory and you want fast approximate search.
Use CAGRA when build speed matters. CAGRA can build graphs quickly on the GPU.
Use CAGRA in hybrid environments where a GPU-built graph is converted to HNSW for CPU search.
Use brute-force instead when exact results are required or the dataset is small enough that a full scan is already fast enough.
Interoperability with HNSW
cuVS can convert a CAGRA graph to an HNSW graph. This lets the GPU build the graph while the CPU handles search later. This is useful when GPUs are available for indexing, but production search runs on CPUs.
If the graph is being serialized or converted to HNSW right after build, avoid keeping the dataset attached to the CAGRA index when the binding exposes that option. In C++, for example, set attach_dataset_on_build to false.
These examples cover the bindings that currently expose CAGRA and HNSW interoperability. Go supports CAGRA build and search, but does not currently expose HNSW conversion.
C
C++
Python
Java
Rust
Using Filters
CAGRA supports filtered search. A filter hides some vectors from the search result, so CAGRA may need to explore more of the graph to find enough valid neighbors.
CAGRA can adjust itopk_size internally based on the filtering rate. To disable this automatic adjustment, set filtering_rate to 0.0.
The examples below use a bitset filter. A bit value of 1 means a vector is allowed; a bit value of 0 means it is filtered out.
C
C++
Python
Java
Rust
Go
Configuration parameters
Build parameters
Search parameters
Tuning
The three parameters most often tuned are itopk_size, graph_degree, and intermediate_graph_degree.
Start with itopk_size. Increasing it usually improves recall, but lowers throughput because CAGRA keeps more candidates during search.
If search-time tuning is not enough, increase graph_degree. This gives each vertex more links to follow, but uses more memory and search work.
If the final graph quality is still too low, increase intermediate_graph_degree. This gives pruning more choices, but makes build more expensive.
Persistent search
Persistent search can improve throughput in services that run many concurrent CAGRA searches. Instead of launching a new search kernel for each request, cuVS can keep a persistent search kernel resident on the GPU and feed it incoming work. This reduces launch overhead and can help high-volume search services keep the GPU busy.
Enable it with the persistent search parameter. Persistent search currently applies to the SINGLE_CTA search implementation, so set algo to SINGLE_CTA when you want to force this path instead of relying on AUTO.
Use persistent_lifetime to control how long the persistent kernel waits for new work before stopping. Use persistent_device_usage to reserve less than the full GPU for the persistent kernel when the same GPU also needs to run other kernels. Keeping other GPU work active alongside a persistent kernel can be fragile, so tune this setting carefully and validate it under the same concurrency pattern used in production.
Memory footprint
CAGRA memory has two main parts: the dataset and the graph. During build, the dataset must be in GPU memory. After build, the dataset can be detached if it is not needed for search, for example when immediately converting the graph to HNSW.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors, or rows in the dataset being indexed.D: Vector dimension, or number of values in each vector.B: Bytes stored for each vector value. Use4for fp32,2for fp16, or the byte width of the attached dataset representation.G: Final graph degree. This is thegraph_degreebuild parameter, and each vector keepsGneighbor IDs after pruning.I: Intermediate graph degree. This is theintermediate_graph_degreebuild parameter, and CAGRA uses this larger graph before pruning down toG.C: Number of IVF-PQ coarse clusters/lists. This is the IVF-PQn_listsvalue used by the graph build parameters.R: IVF-PQ training-set ratio. This istrain_set_ratio;R = 10means training uses roughlyN / 10vectors.Q: Query batch size, or number of query vectors processed together.K: Search result count, or the requestedk/topknearest neighbors per query.S_idx: Bytes per graph neighbor ID. This issizeof(IdxT), usually4forint32_toruint32_t.
The named terms in the formulas are also memory sizes:
dataset_size: Device memory used by the attached dataset vectors.graph_size: Host memory used by the CAGRA graph neighbor IDs.*_peak: Temporary peak memory for one build phase. Sequential phases are not added together.query_size: Device memory for the current query batch.result_size: Device memory for neighbor IDs and distances returned for the current query batch.workspace_size: Query and result memory used during search.
Scratch and maximum vectors
Most CAGRA formulas below are linear in N once build parameters are fixed. The named temporary peaks are the main scratch terms for build phases, but real runs can also include allocator padding, CUDA library workspaces, memory-resource pools, and small implementation buffers. Reserve a headroom factor H = 0.20 for IVF-PQ graph builds and H = 0.30 for NN-Descent or iterative-search graph builds. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
Choose the build or search formula that matches the operation, remove the explicit scratch/headroom from it, and rewrite it as:
Then estimate:
For out-of-core IVF-PQ graph build, Q, C, and R can make several terms fixed or sublinear for a fixed configuration. Solve the full max(...) expression if the largest phase changes as N changes.
Baseline memory after build
The baseline memory footprint after index construction is:
The dataset must be in GPU memory during index build, but can be detached afterward if it is not needed for search.
Example (1,000,000 vectors, dim = 1024, fp32, graph_degree = 64, IdxT = int32):
dataset_size = 4,096,000,000 B = 3906.25 MBgraph_size = 256,000,000 B = 244.14 MB
Build peak memory usage
Index build has two phases: construct an initial kNN graph, then optimize it by pruning redundant paths. These steps run sequentially, so their peak memory use is not additive. The overall peak depends on the configured RMM memory resource.
The initial graph can be built with IVF-PQ, NN-Descent, or the experimental iterative CAGRA-search builder. IVF-PQ can build in batches, which allows CAGRA to train on datasets larger than available GPU memory. The iterative builder requires the aligned dataset to fit in GPU memory because it repeatedly searches the partially built CAGRA graph.
Initial graph build using IVF-PQ
IVF-PQ builds the initial graph in two stages. First, it trains cluster centroids and PQ codebooks. Then it queries the IVF-PQ index in batches to form approximate nearest-neighbor lists.
IVF-PQ build peak:
Here, N / R is the IVF-PQ training sample size. The 4 byte factors are fp32 values for training vectors and cluster centroids. The uint32_t term stores one 32-bit ID per training vector.
Example (N = 1e6, D = 1024, C = 1024, R = 10): 395.01 MB
IVF-PQ search peak:
Here, Q is the number of vectors in one search batch and I is the number of candidates kept per query while building the intermediate graph. The three terms estimate query vectors, candidate IDs, and candidate distances.
Example (Q = 1024, D = 1024, I = 128): 5.00 MB
Initial graph build using NN-Descent
Peak device memory:
The constants in the NN-Descent formulas are per-vector workspace estimates from the implementation. They are added to the vector storage terms before multiplying by N.
- Data vectors are transferred to device and stored as fp16:
D * 2bytes per vector. - The small working graph, locks, and edge counters use 276 bytes per vector.
- L2 metric adds 4 bytes per vector for precomputed norms.
Peak host memory:
- Full graph with distances:
1.3 * 8 * Ibytes per vector. - Bloom filter for sampling:
1.3 * 2 * Ibytes per vector. - 5 sample buffers with degree 32: 640 bytes per vector.
- Graph update buffer with degree 32: 256 bytes per vector.
- Edge counters: 16 bytes per vector.
Initial graph build using iterative CAGRA search
The iterative builder starts with a small connected graph, then repeatedly uses CAGRA search to find neighbors for a larger prefix of the dataset. After each search pass, it optimizes the graph and doubles the active graph size until all rows are included.
This path is useful when the metric or data type is better served by CAGRA search itself, but it is not an out-of-core builder. The dataset is copied or aligned into GPU memory before the first iteration.
Variables used only in this subsection:
D_align: Aligned device stride used by CAGRA search. UseDwhen no padding is required.Q_iter: Maximum query chunk size used by the iterative builder. The implementation currently usesmin(N, 8192).K_iter: Number of temporary neighbors kept per query during the last pass. UseI + 1.G_iter: Largest graph degree used by the temporary searchable graph. UseG; early iterations use a smaller degree and the final iterations useG.D_iter: Aligned device dataset memory.G_tmp: Largest temporary device graph memory.Q_tile: Query tile memory for one search chunk.R_tile: Result tile memory for one search chunk.W_iter: Temporary device workspace used by one iterative search pass.H_iter: Host neighbors-list capacity in bytes after rounding up to a 2 MiB boundary. One MiB is1024 * 1024bytes.
The aligned device dataset is:
The largest temporary device graph used during the search pass is:
Each search chunk needs query storage plus temporary neighbor IDs and distances:
The host neighbors list stores the temporary neighbor candidates for all rows:
The temporary device workspace for one search pass is:
The practical device peak for the iterative graph build is:
The practical host peak is:
The final N * G * S_idx term is the host graph that remains after build. Check device and host memory separately. The usable N is the smaller value allowed by iterative_device_peak and iterative_host_peak.
Optimize phase
The optimize phase prunes and reorders the intermediate graph. Its peak memory scales linearly with the intermediate degree:
In this formula, the 4 byte term is per-vector bookkeeping. The (S_idx + 1) * I term stores I candidate neighbor IDs plus one byte of pruning state per candidate.
Example (N = 1e6, I = 128, IdxT = int32): 614.17 MB
Out-of-core CAGRA build consists of IVF-PQ build, IVF-PQ search, and CAGRA optimization. These steps are sequential, so their temporary memory peaks are not added together.
Overall build peak memory usage
The overall device peak is the dataset size plus the largest temporary allocation from the sequential build steps.
Using IVF-PQ:
Example: 3906.25 + max(395.01, 5.00, 614.17) = 4520.42 MB
Using NN-Descent:
dataset_size* applies only when the user passes data that is already in device memory. NN-Descent internally copies the dataset to the device as fp16, so host-memory inputs do not add this term.
Using iterative CAGRA search:
Use iterative_device_peak for device memory and iterative_host_peak for host memory. These estimates already include the aligned dataset, temporary search chunks, temporary graph storage, optimization workspace, and final host graph.
Search peak memory usage
CAGRA search requires the dataset and graph to already be resident in GPU memory. When using CAGRA-Q, the original dataset can reside in host memory instead. Search also needs temporary workspace for query vectors and results.
If multiple batches run concurrently or overlap, each batch needs separate result buffers. The estimate below assumes one query batch at a time and reused buffers.
The workspace contains query vectors and result storage:
In the query formula, sizeof(float) is 4 bytes because CAGRA search uses fp32 query storage here. In the result formula, each returned neighbor stores one graph ID of size S_idx and one fp32 distance.
Example (D = 1024, Q = 100, K = 10, IdxT = int32):
query_size = 409,600 B = 0.39 MBresult_size = 8,000 B = 0.0076 MBworkspace_size = query_size + result_size = 0.40 MBtotal search memory ~= 3906.25 + 244.14 + 0.40 = 4150.79 MB