IVF-Flat | cuVS

IVF-Flat is a GPU-accelerated approximate nearest-neighbor index. It partitions vectors into coarse clusters, also called inverted lists, and stores the original vectors inside those lists.

During search, IVF-Flat first finds the closest lists, then scans only those lists. This can be much faster than brute-force because search does not need to compare each query with every vector in the dataset.

IVF-Flat works well when the index fits in GPU memory, exact recall is not required, and partitioning the dataset gives a useful speedup.

Example API Usage

C API | C++ API | Python API | Rust API | Go API

Building an index

C

C++

Python

Rust

Go

1 #include <cuvs/neighbors/ivf_flat.h>
2 
3 cuvsResources_t res;
4 cuvsIvfFlatIndexParams_t index_params;
5 cuvsIvfFlatIndex_t index;
6 DLManagedTensor *dataset;
7 
8 // Populate tensor with data.
9 load_dataset(dataset);
10 
11 cuvsResourcesCreate(&res);
12 cuvsIvfFlatIndexParamsCreate(&index_params);
13 cuvsIvfFlatIndexCreate(&index);
14 
15 index_params->n_lists = 1024;
16 index_params->kmeans_trainset_fraction = 0.5;
17 
18 cuvsIvfFlatBuild(res, index_params, dataset, index);
19 
20 cuvsIvfFlatIndexDestroy(index);
21 cuvsIvfFlatIndexParamsDestroy(index_params);
22 cuvsResourcesDestroy(res);

Extending an index

C

C++

Python

1 #include <cuvs/neighbors/ivf_flat.h>
2 
3 cuvsResources_t res;
4 cuvsIvfFlatIndex_t index;
5 DLManagedTensor *new_vectors;
6 DLManagedTensor *new_indices;
7 
8 load_additional_dataset(new_vectors);
9 load_additional_indices(new_indices);
10 
11 cuvsResourcesCreate(&res);
12 cuvsIvfFlatIndexCreate(&index);
13 
14 // ... build or load index ...
15 cuvsIvfFlatExtend(res, new_vectors, new_indices, index);
16 
17 cuvsIvfFlatIndexDestroy(index);
18 cuvsResourcesDestroy(res);

See the C, C++, and Python API references for the full signatures.

Searching an index

C

C++

Python

Rust

Go

1 #include <cuvs/neighbors/ivf_flat.h>
2 
3 cuvsResources_t res;
4 cuvsIvfFlatSearchParams_t search_params;
5 cuvsIvfFlatIndex_t index;
6 DLManagedTensor *queries;
7 DLManagedTensor *neighbors;
8 DLManagedTensor *distances;
9 
10 // Populate tensor with data.
11 load_queries(queries);
12 allocate_outputs(neighbors, distances);
13 
14 cuvsResourcesCreate(&res);
15 cuvsIvfFlatSearchParamsCreate(&search_params);
16 search_params->n_probes = 20;
17 
18 // ... build or load index ...
19 cuvsFilter filter = {0, NO_FILTER};
20 cuvsIvfFlatSearch(
21     res,
22     search_params,
23     index,
24     queries,
25     neighbors,
26     distances,
27     filter);
28 
29 cuvsIvfFlatSearchParamsDestroy(search_params);
30 cuvsResourcesDestroy(res);

Saving and loading an index

Serialize an IVF-Flat index when you want to reuse trained lists and stored vectors without rebuilding the index.

Java does not currently expose a standalone IVF-Flat binding. Rust and Go currently expose build/search wrappers, but not IVF-Flat save/load wrappers.

C

C++

Python

1 #include <cuvs/neighbors/ivf_flat.h>
2 
3 cuvsResources_t res;
4 cuvsIvfFlatIndex_t index;
5 cuvsIvfFlatIndex_t loaded_index;
6 
7 cuvsResourcesCreate(&res);
8 cuvsIvfFlatIndexCreate(&index);
9 cuvsIvfFlatIndexCreate(&loaded_index);
10 
11 // ... build index ...
12 cuvsIvfFlatSerialize(res, "/tmp/cuvs-ivf-flat.bin", index);
13 cuvsIvfFlatDeserialize(res, "/tmp/cuvs-ivf-flat.bin", loaded_index);
14 
15 cuvsIvfFlatIndexDestroy(loaded_index);
16 cuvsIvfFlatIndexDestroy(index);
17 cuvsResourcesDestroy(res);

How IVF-Flat works

IVF-Flat has two main steps.

First, build trains coarse cluster centers with k-means. Each vector is assigned to the closest center, and vectors assigned to the same center are stored together in one inverted list.

Second, search compares each query with the cluster centers, chooses the closest n_probes lists, and scans the full-precision vectors inside those lists.

The search is approximate because vectors in lists that are not probed are skipped. Increasing n_probes searches more lists and usually improves recall, but it also increases latency.

When to use IVF-Flat

Use IVF-Flat when brute-force is too slow, but you still want to store full-precision vectors.

Use IVF-Flat when the dataset is large enough that partitioning helps search avoid most vectors.

Use IVF-Flat when you can tune recall and throughput by changing how many lists are searched.

Use brute-force instead when exact results are required or the dataset is small enough that scanning every vector is already fast enough.

Use IVF-PQ instead when the full-precision IVF-Flat index is too large for device memory and some compression loss is acceptable.

IVF-SQ and scalar quantization

IVF-SQ follows the same partition-first search pattern as IVF-Flat, but stores scalar-quantized vector values inside each partition. NVIDIA cuVS exposes scalar quantization as a preprocessing API rather than a separate standalone IVF-SQ index API.

Use this pattern when full-precision IVF-Flat storage is too large, but you want simpler compression than IVF-PQ. Scalar quantization reduces memory bandwidth and index size while usually requiring less tuning than product quantization.

Using Filters

IVF-Flat supports filtered search, but filtering happens only inside the lists selected by n_probes.

This means a filtered IVF-Flat search can miss an allowed vector if that vector lives in a list that was not probed. If filtered recall matters, increase n_probes, use a filter-aware evaluation set, or use brute-force when the filter is expected to remove most vectors.

The examples below use a bitset filter. A bit value of 1 means a vector is allowed; a bit value of 0 means it is filtered out.

These examples cover the IVF-Flat bindings that currently expose filters. Rust and Go currently expose IVF-Flat search without a filter argument, and Java does not currently expose a standalone IVF-Flat binding.

C

C++

Python

1 #include <cuvs/neighbors/common.h>
2 #include <cuvs/neighbors/ivf_flat.h>
3 
4 cuvsResources_t res;
5 cuvsIvfFlatIndexParams_t index_params;
6 cuvsIvfFlatSearchParams_t search_params;
7 cuvsIvfFlatIndex_t index;
8 DLManagedTensor *dataset;
9 DLManagedTensor *queries;
10 DLManagedTensor *neighbors;
11 DLManagedTensor *distances;
12 
13 cuvsResourcesCreate(&res);
14 cuvsIvfFlatIndexParamsCreate(&index_params);
15 cuvsIvfFlatSearchParamsCreate(&search_params);
16 cuvsIvfFlatIndexCreate(&index);
17 
18 // Populate DLPack tensors with dataset, query, and output data.
19 load_dataset(dataset);
20 load_queries(queries);
21 allocate_outputs(neighbors, distances);
22 
23 cuvsIvfFlatBuild(res, index_params, dataset, index);
24 
25 // Create a device uint32 bitset with one bit per indexed vector. Bit 1 means
26 // allowed; bit 0 means filtered out.
27 DLManagedTensor *bitset = make_device_bitset(allowed_indices, n_vectors);
28 
29 cuvsFilter filter;
30 filter.type = BITSET;
31 filter.addr = (uintptr_t)bitset;
32 
33 cuvsIvfFlatSearch(
34     res,
35     search_params,
36     index,
37     queries,
38     neighbors,
39     distances,
40     filter);
41 
42 cuvsIvfFlatIndexDestroy(index);
43 cuvsIvfFlatSearchParamsDestroy(search_params);
44 cuvsIvfFlatIndexParamsDestroy(index_params);
45 cuvsResourcesDestroy(res);

Configuration parameters

Build parameters

Name	Default	Description
`metric`	`L2Expanded` / `sqeuclidean`	Distance metric used for cluster training and search.
`metric_arg`	`2.0`	Extra argument for metrics that need one, such as Minkowski distance.
`n_lists`	`1024`	Number of coarse clusters, or inverted lists. More lists make each list smaller, but may require more probes to maintain recall.
`kmeans_n_iters`	`20`	Maximum number of k-means iterations used to train the list centers.
`kmeans_trainset_fraction`	`0.5`	Fraction of the dataset used for k-means training. Reducing this can speed up build on very large datasets.
`adaptive_centers`	`False`	Updates centers when new vectors are added. This can help when appended data shifts distribution, but center values then depend on insertion order.
`conservative_memory_allocation`	`False`	Uses tighter per-list allocations. The default overallocates lists to reduce reallocations during repeated `extend` calls.
`add_data_on_build`	`True`	Adds vectors to the index during build. Set to `False` to train the centers first and add vectors later with `extend`.

Search parameters

Name	Default	Description
`n_probes`	`20`	Number of closest lists to scan for each query. This is the main recall and latency knob.
`metric_udf`	None	Optional custom metric UDF code used by the C++ search path.

Filters are passed as search function arguments in bindings that expose filtered search, not as fields in ivf_flat::search_params.

Tuning

Start with n_lists. A common starting point is near sqrt(N), then adjust based on recall, latency, and list balance.

Tune n_probes next. Increasing n_probes usually improves recall because each query scans more lists, but it also does more distance computation.

If build time is high, reduce kmeans_trainset_fraction or kmeans_n_iters. If recall is poor even with higher n_probes, the trained clusters may be too coarse or poorly matched to the data.

For append-heavy workflows, decide whether add_data_on_build, adaptive_centers, and conservative_memory_allocation match the update pattern. Tighter allocation uses less memory, while the default allocation strategy can make repeated extend calls cheaper.

Memory footprint

IVF-Flat memory has three main parts: full-precision vectors stored in inverted lists, source row IDs for those vectors, and cluster centers. The index does not need the original dataset after build because it stores the vectors in its own list layout.

To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.

N: Number of database vectors, or rows in the dataset being indexed.
D: Vector dimension, or number of values in each vector.
B: Bytes stored for each vector value. Use 4 for fp32, 2 for fp16, or 1 for int8 and uint8.
L: Number of inverted lists. This is the n_lists build parameter.
N_cap: Total allocated list capacity across all lists after padding and over-allocation.
S_idx: Bytes per source row ID. This is sizeof(IdxT), usually 8 for int64_t.
Q: Query batch size, or number of query vectors processed together.
K: Search result count, or the requested k nearest neighbors per query.
P: Number of lists probed during search. This is the n_probes search parameter.
F: Training-set fraction. This is the kmeans_trainset_fraction build parameter.

The named terms in the formulas are also memory sizes:

list_data_size: Device memory used by vectors stored inside IVF lists.
list_index_size: Device memory used by source row IDs stored beside list vectors.
centers_size: Device memory used by the L cluster centers.
index_size: Approximate device memory used by the trained IVF-Flat index.
query_size: Device memory for the current query batch.
result_size: Device memory for neighbor IDs and distances returned for the current query batch.
workspace_size: Temporary search workspace.

Scratch and maximum vectors

The formulas below show the major persistent and temporary buffers. Additional scratch comes from k-means workspaces, list allocation padding, allocator fragmentation, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.20 for build estimates and H = 0.10 for search estimates. If you can measure a representative smaller run, use:

H_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
observed_peak: Peak memory observed during a smaller representative run.
formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

IVF-Flat uses padded list capacity, so estimate N_cap first. With balanced lists, substitute the approximation for N_cap; with measured list sizes, use the exact sum. Then solve the selected peak formula:

\text{peak\_without\_scratch}(N) = N_{\text{cap}}(N) \cdot B_{\text{per\_stored\_vector}} + N \cdot B_{\text{per\_input\_vector}} + B_{\text{fixed}}

Because N_cap(N) is stepwise, the most reliable capacity estimate is to try increasing N values until peak_without_scratch(N) <= M_usable no longer holds. For rough planning, use N_cap ≈ N plus 5-10% padding when lists are balanced, or more if many lists are small.

List capacity

Each inverted list is padded for efficient memory access. The minimum alignment is 32 rows. With the default allocation strategy, lists may be overallocated up to 1024-row chunks to make future extend calls cheaper. With conservative_memory_allocation = true, the maximum alignment is 32 rows.

For each list i, let n_i be the number of vectors assigned to that list.

\begin{aligned} A_{\text{min}} &= 32 \end{aligned}

\begin{aligned} A_{\text{max}} &= \begin{cases} 32, & \text{conservative allocation} \\ 1024, & \text{default allocation} \end{cases} \end{aligned}

The implementation chooses a capacity for each list based on A_min and A_max:

\begin{aligned} \operatorname{capacity}(n_i) &= \begin{cases} \min\!\big( \operatorname{nextpow2} (\max(n_i, A_{\text{min}})), A_{\text{max}} \big), & n_i < A_{\text{max}} \\ \operatorname{round\_up}(n_i, A_{\text{max}}), & n_i \ge A_{\text{max}} \end{cases} \end{aligned}

The total allocated list capacity is:

\begin{aligned} N_{\text{cap}} &= \sum_{i=1}^{L} \operatorname{capacity}(n_i) \end{aligned}

For balanced lists, a simple approximation is:

\begin{aligned} N_{\text{cap}} &\approx L \times \operatorname{capacity} \left(\frac{N}{L}\right) \end{aligned}

Use the exact list sizes when estimating tightly. Empty or very small lists still carry padding overhead, so choosing far more lists than needed can waste memory.

Baseline memory after build

The baseline memory footprint after index construction is:

\begin{aligned} \text{list\_data\_size} &= N_{\text{cap}} \times D \times B \end{aligned}

\begin{aligned} \text{list\_index\_size} &= N_{\text{cap}} \times S_{\text{idx}} \end{aligned}

\begin{aligned} \text{centers\_size} &= L \times D \times \operatorname{sizeof}(\mathrm{float}) \end{aligned}

\begin{aligned} \text{index\_size} &\approx \text{list\_data\_size} \\ &\quad + \text{list\_index\_size} \\ &\quad + \text{centers\_size} \\ &\quad + \text{small list metadata} \end{aligned}

Example (N = 1e6, D = 128, B = 4, L = 1024, balanced lists, default allocation, IdxT = int64):

N / L = 976.56, so N_cap ~= 1024 * 1024 = 1,048,576
list_data_size = 536,870,912 B = 512.00 MiB
list_index_size = 8,388,608 B = 8.00 MiB
centers_size = 524,288 B = 0.50 MiB
index_size ~= 520.50 MiB + small metadata

Build peak memory usage

Build trains k-means centers, then fills the lists when add_data_on_build = true. Peak memory depends on whether the input dataset is already on device and how the workspace memory resource is configured.

The k-means training sample size is:

\begin{aligned} N_{\text{train}} &= F \times N \end{aligned}

A practical build estimate is:

\begin{aligned} \text{training\_sample\_size} &= N_{\text{train}} \times D \times B \end{aligned}

\begin{aligned} \text{training\_labels\_size} &= N_{\text{train}} \times \operatorname{sizeof}(\mathrm{uint32\_t}) \end{aligned}

\begin{aligned} \text{build\_peak} &\approx \text{input\_dataset\_size} \\ &\quad + \text{training\_sample\_size} \\ &\quad + \text{training\_labels\_size} \\ &\quad + \text{centers\_size} \\ &\quad + \text{index\_size} \end{aligned}

When the input is already on device, input_dataset_size = N * D * B. Host inputs may stage batches on the device instead of adding the full dataset as a separate device allocation.

Search peak memory usage

IVF-Flat search requires the index to be resident on the GPU. Search also needs query vectors, output buffers, and temporary workspace.

\begin{aligned} \text{query\_size} &= Q \times D \times B \end{aligned}

\begin{aligned} \text{result\_size} &= Q \times K \\ &\quad \times \big(S_{\text{idx}} + \operatorname{sizeof}(\mathrm{float})\big) \end{aligned}

A useful workspace estimate is:

\begin{aligned} \text{workspace\_size} &\lesssim \min\!\Big( 1\ \mathrm{GiB}, \\ &\qquad Q \times \Big[ \big(L + 1 + P \times (K + 1)\big) \times \operatorname{sizeof}(\mathrm{float}) \\ &\qquad\qquad + P \times K \times S_{\text{idx}} \Big] \Big) \end{aligned}

The total search memory is:

\begin{aligned} \text{search\_memory} &\approx \text{index\_size} \\ &\quad + \text{query\_size} \\ &\quad + \text{result\_size} \\ &\quad + \text{workspace\_size} \end{aligned}

For many lists, use a pool memory resource when possible. Each IVF list is allocated separately, and a pool allocator helps avoid large per-allocation overhead from the underlying device memory resource.