IVF-Flat
IVF-Flat
IVF-Flat is a GPU-accelerated approximate nearest-neighbor index. It partitions vectors into coarse clusters, also called inverted lists, and stores the original vectors inside those lists.
During search, IVF-Flat first finds the closest lists, then scans only those lists. This can be much faster than brute-force because search does not need to compare each query with every vector in the dataset.
IVF-Flat works well when the index fits in GPU memory, exact recall is not required, and partitioning the dataset gives a useful speedup.
Example API Usage
C API | C++ API | Python API | Rust API | Go API
Building an index
C
C++
Python
Rust
Go
Searching an index
C
C++
Python
Rust
Go
Saving and loading an index
Serialize an IVF-Flat index when you want to reuse trained lists and stored vectors without rebuilding the index.
Java does not currently expose a standalone IVF-Flat binding. Rust and Go currently expose build/search wrappers, but not IVF-Flat save/load wrappers.
C
C++
Python
How IVF-Flat works
IVF-Flat has two main steps.
First, build trains coarse cluster centers with k-means. Each vector is assigned to the closest center, and vectors assigned to the same center are stored together in one inverted list.
Second, search compares each query with the cluster centers, chooses the closest n_probes lists, and scans the full-precision vectors inside those lists.
The search is approximate because vectors in lists that are not probed are skipped. Increasing n_probes searches more lists and usually improves recall, but it also increases latency.
When to use IVF-Flat
Use IVF-Flat when brute-force is too slow, but you still want to store full-precision vectors.
Use IVF-Flat when the dataset is large enough that partitioning helps search avoid most vectors.
Use IVF-Flat when you can tune recall and throughput by changing how many lists are searched.
Use brute-force instead when exact results are required or the dataset is small enough that scanning every vector is already fast enough.
Use IVF-PQ instead when the full-precision IVF-Flat index is too large for device memory and some compression loss is acceptable.
IVF-SQ and scalar quantization
IVF-SQ follows the same partition-first search pattern as IVF-Flat, but stores scalar-quantized vector values inside each partition. cuVS exposes scalar quantization as a preprocessing API rather than a separate standalone IVF-SQ index API.
Use this pattern when full-precision IVF-Flat storage is too large, but you want simpler compression than IVF-PQ. Scalar quantization reduces memory bandwidth and index size while usually requiring less tuning than product quantization.
Using Filters
IVF-Flat supports filtered search, but filtering happens only inside the lists selected by n_probes.
This means a filtered IVF-Flat search can miss an allowed vector if that vector lives in a list that was not probed. If filtered recall matters, increase n_probes, use a filter-aware evaluation set, or use brute-force when the filter is expected to remove most vectors.
The examples below use a bitset filter. A bit value of 1 means a vector is allowed; a bit value of 0 means it is filtered out.
These examples cover the IVF-Flat bindings that currently expose filters. Rust and Go currently expose IVF-Flat search without a filter argument, and Java does not currently expose a standalone IVF-Flat binding.
C
C++
Python
Configuration parameters
Build parameters
Search parameters
Filters are passed as search function arguments in bindings that expose filtered search, not as fields in ivf_flat::search_params.
Tuning
Start with n_lists. A common starting point is near sqrt(N), then adjust based on recall, latency, and list balance.
Tune n_probes next. Increasing n_probes usually improves recall because each query scans more lists, but it also does more distance computation.
If build time is high, reduce kmeans_trainset_fraction or kmeans_n_iters. If recall is poor even with higher n_probes, the trained clusters may be too coarse or poorly matched to the data.
For append-heavy workflows, decide whether add_data_on_build, adaptive_centers, and conservative_memory_allocation match the update pattern. Tighter allocation uses less memory, while the default allocation strategy can make repeated extend calls cheaper.
Memory footprint
IVF-Flat memory has three main parts: full-precision vectors stored in inverted lists, source row IDs for those vectors, and cluster centers. The index does not need the original dataset after build because it stores the vectors in its own list layout.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors, or rows in the dataset being indexed.D: Vector dimension, or number of values in each vector.B: Bytes stored for each vector value. Use4for fp32,2for fp16, or1for int8 and uint8.L: Number of inverted lists. This is then_listsbuild parameter.N_cap: Total allocated list capacity across all lists after padding and over-allocation.S_idx: Bytes per source row ID. This issizeof(IdxT), usually8forint64_t.Q: Query batch size, or number of query vectors processed together.K: Search result count, or the requestedknearest neighbors per query.P: Number of lists probed during search. This is then_probessearch parameter.F: Training-set fraction. This is thekmeans_trainset_fractionbuild parameter.
The named terms in the formulas are also memory sizes:
list_data_size: Device memory used by vectors stored inside IVF lists.list_index_size: Device memory used by source row IDs stored beside list vectors.centers_size: Device memory used by theLcluster centers.index_size: Approximate device memory used by the trained IVF-Flat index.query_size: Device memory for the current query batch.result_size: Device memory for neighbor IDs and distances returned for the current query batch.workspace_size: Temporary search workspace.
Scratch and maximum vectors
The formulas below show the major persistent and temporary buffers. Additional scratch comes from k-means workspaces, list allocation padding, allocator fragmentation, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.20 for build estimates and H = 0.10 for search estimates. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
IVF-Flat uses padded list capacity, so estimate N_cap first. With balanced lists, substitute the approximation for N_cap; with measured list sizes, use the exact sum. Then solve the selected peak formula:
Because N_cap(N) is stepwise, the most reliable capacity estimate is to try increasing N values until peak_without_scratch(N) <= M_usable no longer holds. For rough planning, use N_cap ≈ N plus 5-10% padding when lists are balanced, or more if many lists are small.
List capacity
Each inverted list is padded for efficient memory access. The minimum alignment is 32 rows. With the default allocation strategy, lists may be overallocated up to 1024-row chunks to make future extend calls cheaper. With conservative_memory_allocation = true, the maximum alignment is 32 rows.
For each list i, let n_i be the number of vectors assigned to that list.
The implementation chooses a capacity for each list based on A_min and A_max:
The total allocated list capacity is:
For balanced lists, a simple approximation is:
Use the exact list sizes when estimating tightly. Empty or very small lists still carry padding overhead, so choosing far more lists than needed can waste memory.
Baseline memory after build
The baseline memory footprint after index construction is:
Example (N = 1e6, D = 128, B = 4, L = 1024, balanced lists, default allocation, IdxT = int64):
N / L = 976.56, soN_cap ~= 1024 * 1024 = 1,048,576list_data_size = 536,870,912 B = 512.00 MiBlist_index_size = 8,388,608 B = 8.00 MiBcenters_size = 524,288 B = 0.50 MiBindex_size ~= 520.50 MiB + small metadata
Build peak memory usage
Build trains k-means centers, then fills the lists when add_data_on_build = true. Peak memory depends on whether the input dataset is already on device and how the workspace memory resource is configured.
The k-means training sample size is:
A practical build estimate is:
When the input is already on device, input_dataset_size = N * D * B. Host inputs may stage batches on the device instead of adding the full dataset as a separate device allocation.
Search peak memory usage
IVF-Flat search requires the index to be resident on the GPU. Search also needs query vectors, output buffers, and temporary workspace.
A useful workspace estimate is:
The total search memory is:
For many lists, use a pool memory resource when possible. Each IVF list is allocated separately, and a pool allocator helps avoid large per-allocation overhead from the underlying device memory resource.