IVF-PQ
IVF-PQ
IVF-PQ is a GPU-accelerated approximate nearest-neighbor index. It first partitions vectors into coarse clusters, also called inverted lists, then stores a compressed product-quantized code for each vector in the selected list.
Compared with IVF-Flat, IVF-PQ uses much less GPU memory because it stores compressed codes instead of full vectors. The trade-off is that search distances are approximate. For high recall, IVF-PQ is commonly paired with refinement, where the search returns more candidates than needed and the final top results are reranked with the original vectors.
IVF-PQ works well when the dataset is too large for full-precision storage on the GPU, some recall loss is acceptable, and the original vectors are available for optional reranking.
Example API Usage
C API | C++ API | Python API | Java IVF-PQ Params | Rust API | Go API
Java currently exposes IVF-PQ parameter classes for CAGRA graph construction, not a standalone IVF-PQ index/search binding. The runnable standalone examples below cover C, C++, Python, Rust, and Go.
Building an index
C
C++
Python
Rust
Go
Searching an index
C
C++
Python
Rust
Go
Saving and loading an index
Serialize an IVF-PQ index when you want to reuse trained coarse clusters, PQ codebooks, and compressed lists without rebuilding the index.
Java currently exposes IVF-PQ parameter classes for CAGRA graph construction, not a standalone IVF-PQ index/search binding. Rust and Go currently expose IVF-PQ build/search wrappers, but not IVF-PQ save/load wrappers.
C
C++
Python
How IVF-PQ works
IVF-PQ combines two ideas:
- IVF partitions the dataset into
n_listscoarse clusters. During search, each query visits only the closestn_probeslists. - PQ compresses each vector into a short code. Instead of comparing the query to every original vector value, search compares the query to precomputed codebook entries.
Think of each vector as a long row of numbers. PQ splits that row into smaller chunks, gives each chunk a short code, and stores the codes. At search time, cuVS uses lookup tables to score those codes quickly.
The compression is lossy, so IVF-PQ trades some recall for a smaller and faster index. Refinement can recover much of that recall when the original vectors are still available.
When to use IVF-PQ
Use IVF-PQ when GPU memory is the main limit and IVF-Flat would store too much full-precision data.
IVF-PQ is often a good fit for large datasets, high-throughput candidate generation, and workflows that can rerank candidates with exact distances afterward.
Avoid IVF-PQ when exact distances are required for every result, when recall cannot tolerate compression error, or when the original vectors are unavailable but high recall depends on refinement.
Refinement and reranking
IVF-PQ search returns approximate distances from compressed vectors. A common pattern is to ask IVF-PQ for K0 candidates, where K0 is larger than the final K, then rerank those candidates with exact distances against the original dataset.
For example, search for K0 = 4 * K, refine the candidates, and keep the best K. This needs the original vectors, but it can improve recall substantially without storing full vectors inside the IVF-PQ index.
C
C++
Python
Rust and Go currently expose standalone IVF-PQ search without a separate refinement wrapper.
Using Filters
IVF-PQ C++ search supports sample filters. Filtering happens only inside the lists selected by n_probes, so a filtered search can miss an allowed vector if that vector lives in a list that was not probed.
If filtered recall matters, increase n_probes, evaluate with representative filters, or use brute-force when the filter is expected to remove most vectors. C, Python, Rust, and Go currently expose standalone IVF-PQ search without a filter argument, and Java does not currently expose a standalone IVF-PQ binding.
Configuration parameters
Build parameters
Search parameters
Filters are passed as search function arguments in bindings that expose filtered search, not as fields in ivf_pq::search_params.
Tuning
Start with n_lists. A common target is roughly 1,000 to 10,000 vectors per list. More lists make each list smaller, but search usually needs a higher n_probes value to maintain recall.
Tune n_probes next. Increasing n_probes usually improves recall because each query scans more lists, but it also increases latency.
Then tune compression. Lower pq_bits and lower pq_dim reduce memory and can speed up search, but they also make each stored vector less precise. If recall drops too much, increase pq_bits, increase pq_dim, use per_cluster codebooks, or add refinement.
Use lower-precision lut_dtype, internal_distance_dtype, or coarse_search_dtype only after the basic recall and latency balance is acceptable. These options can improve throughput, but they add another source of approximation.
Memory footprint
IVF-PQ memory has four main parts: compressed vector codes, source row IDs, coarse cluster centers, and PQ codebooks. The index is much smaller than IVF-Flat because it stores pq_bits-wide codes rather than full vectors.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors, or rows in the dataset being indexed.D: Original vector dimension, or number of values in each vector.D_rot: Rotated and padded dimension used internally by PQ. This is a multiple ofM.B: Bytes stored for each original vector value. Use4for fp32,2for fp16, or1for int8 and uint8.L: Number of inverted lists. This is then_listsbuild parameter.M: Number of PQ code dimensions. This is thepq_dimbuild parameter.b: Bits per PQ code. This is thepq_bitsbuild parameter.C: Number of entries in each PQ codebook, equal to2^b.S_idx: Bytes per source row ID. This issizeof(IdxT), usually8forint64_t.N_cap: Total allocated list capacity across all lists after padding and over-allocation.Q: Query batch size, or number of query vectors processed together.K: Search result count, or the requestedknearest neighbors per query.K0: Candidate count used before refinement, whereK0 >= K.P: Number of lists probed during search. This is then_probessearch parameter.F: Training-set fraction. This is thekmeans_trainset_fractionbuild parameter.B_lut: Bytes per lookup-table value. This depends onlut_dtype.
The named terms in the formulas are also memory sizes:
encoded_data_size: Device memory used by compressed PQ codes stored inside IVF lists.list_index_size: Device memory used by source row IDs stored beside list codes.centers_size: Device memory used by theLcoarse cluster centers.rotation_size: Device memory used by the optional random rotation matrix and rotated centers.codebook_size: Device memory used by PQ codebooks.index_size: Approximate device memory used by the trained IVF-PQ index.query_size: Device memory for the current query batch.result_size: Device memory for neighbor IDs and distances returned for the current query batch.workspace_size: Temporary search workspace.refinement_dataset_size: Memory needed for original vectors when refinement reranks candidates.
Scratch and maximum vectors
The formulas below show the major persistent buffers and search workspaces. Additional scratch comes from k-means training, PQ training, list allocation padding, allocator fragmentation, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.20 for build estimates and H = 0.10 for search estimates. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
IVF-PQ capacity depends on list padding and the selected code layout. Estimate N_cap first for the default interleaved layout. Then solve the selected peak formula:
For a rough maximum-vector estimate, try increasing N values until peak_without_scratch(N) <= M_usable no longer holds. If using the flat layout and no refinement dataset, most terms are linear in N, so the shortcut is:
List capacity
The default IVF-PQ list layout is interleaved. It pads each list for efficient GPU memory access. The minimum alignment is 32 rows. With the default allocation strategy, lists may be overallocated up to 1024-row chunks to make future extend calls cheaper. With conservative_memory_allocation = true, the maximum alignment is 32 rows.
For each list i, let n_i be the number of vectors assigned to that list.
For the default interleaved layout, the implementation chooses a capacity for each list based on A_min and A_max:
The total allocated list capacity is:
For balanced lists, a simple approximation is:
The flat code layout stores exactly one contiguous code per vector, so its list capacity is approximately N instead of N_cap. Use the exact list sizes when estimating tightly.
Baseline memory after build
For the flat code layout, each compressed vector uses:
The flat encoded data size is:
For the default interleaved code layout, codes are stored in 32-row groups and 16-byte vectorized chunks. Let:
Then the encoded data size is approximately:
Source row IDs are stored beside the compressed codes:
Coarse centers are stored in full precision:
When a rotation is needed, the rotated centers and rotation matrix add:
For per_subspace codebooks:
For per_cluster codebooks:
The baseline index footprint is:
Example (N = 1e6, D = 128, D_rot = 128, L = 1024, M = 64, b = 8, balanced lists, default interleaved layout, IdxT = int64, per_subspace codebooks):
N / L = 976.56, soN_cap ~= 1024 * 1024 = 1,048,576encoded_data_size ~= 67,108,864 B = 64.00 MiBlist_index_size = 8,388,608 B = 8.00 MiBcenters_size = 524,288 B = 0.50 MiBrotation_size ~= 589,824 B = 0.56 MiBcodebook_size = 131,072 B = 0.13 MiBindex_size ~= 73.19 MiB + small metadata
The original dataset would be 1e6 * 128 * 4 = 488.28 MiB in fp32, so IVF-PQ stores the searchable index much more compactly.
Refinement memory
Refinement needs the original vectors in addition to the IVF-PQ index:
It also needs candidate IDs from the approximate IVF-PQ search:
When the original dataset is too large to keep on the GPU, refinement can be staged or performed where the original vectors are available.
Build peak memory usage
Build trains coarse centers, trains PQ codebooks, and fills the lists when add_data_on_build = true. Peak memory depends on whether the input dataset is already on device and how the workspace memory resource is configured.
The k-means training sample size is:
A practical build estimate is:
The PQ training workspace depends on pq_bits, pq_dim, codebook_kind, and max_train_points_per_pq_code. A useful estimate for the number of training vectors per codebook is:
The total build peak is approximately:
When the workspace memory resource does not have enough room for the training set and labels, IVF-PQ build can fall back to managed memory for those temporary allocations.
Search peak memory usage
IVF-PQ search requires the compressed index to be resident on the GPU. Search also needs query vectors, output buffers, PQ lookup tables, and temporary workspace.
A useful lookup-table estimate is:
A practical workspace estimate is:
The total search memory is:
For many lists, use a pool memory resource when possible. Each IVF list is allocated separately, and a pool allocator helps avoid large per-allocation overhead from the underlying device memory resource.