IVF-PQ

View as Markdown

IVF-PQ is a GPU-accelerated approximate nearest-neighbor index. It first partitions vectors into coarse clusters, also called inverted lists, then stores a compressed product-quantized code for each vector in the selected list.

Compared with IVF-Flat, IVF-PQ uses much less GPU memory because it stores compressed codes instead of full vectors. The trade-off is that search distances are approximate. For high recall, IVF-PQ is commonly paired with refinement, where the search returns more candidates than needed and the final top results are reranked with the original vectors.

IVF-PQ works well when the dataset is too large for full-precision storage on the GPU, some recall loss is acceptable, and the original vectors are available for optional reranking.

Example API Usage

C API | C++ API | Python API | Java IVF-PQ Params | Rust API | Go API

Java currently exposes IVF-PQ parameter classes for CAGRA graph construction, not a standalone IVF-PQ index/search binding. The runnable standalone examples below cover C, C++, Python, Rust, and Go.

Building an index

1#include <cuvs/core/c_api.h>
2#include <cuvs/neighbors/ivf_pq.h>
3
4cuvsResources_t res;
5cuvsIvfPqIndexParams_t index_params;
6cuvsIvfPqIndex_t index;
7DLManagedTensor *dataset;
8
9// Populate tensor with data.
10load_dataset(dataset);
11
12cuvsResourcesCreate(&res);
13cuvsIvfPqIndexParamsCreate(&index_params);
14cuvsIvfPqIndexCreate(&index);
15
16index_params->n_lists = 1024;
17index_params->kmeans_trainset_fraction = 0.5;
18index_params->pq_bits = 8;
19index_params->pq_dim = 64;
20
21cuvsIvfPqBuild(res, index_params, dataset, index);
22
23// Keep index alive while you search, then destroy it when finished.
24cuvsIvfPqIndexDestroy(index);
25cuvsIvfPqIndexParamsDestroy(index_params);
26cuvsResourcesDestroy(res);

Searching an index

1#include <cuvs/core/c_api.h>
2#include <cuvs/neighbors/ivf_pq.h>
3
4cuvsResources_t res;
5cuvsIvfPqSearchParams_t search_params;
6cuvsIvfPqIndex_t index;
7DLManagedTensor *queries;
8DLManagedTensor *neighbors;
9DLManagedTensor *distances;
10
11// Populate tensors with query and output data.
12load_queries(queries);
13allocate_outputs(neighbors, distances);
14
15cuvsResourcesCreate(&res);
16cuvsIvfPqSearchParamsCreate(&search_params);
17search_params->n_probes = 20;
18
19// ... build or load index ...
20cuvsIvfPqSearch(
21 res,
22 search_params,
23 index,
24 queries,
25 neighbors,
26 distances);
27
28cuvsIvfPqSearchParamsDestroy(search_params);
29cuvsResourcesDestroy(res);

Saving and loading an index

Serialize an IVF-PQ index when you want to reuse trained coarse clusters, PQ codebooks, and compressed lists without rebuilding the index.

Java currently exposes IVF-PQ parameter classes for CAGRA graph construction, not a standalone IVF-PQ index/search binding. Rust and Go currently expose IVF-PQ build/search wrappers, but not IVF-PQ save/load wrappers.

1#include <cuvs/neighbors/ivf_pq.h>
2
3cuvsResources_t res;
4cuvsIvfPqIndex_t index;
5cuvsIvfPqIndex_t loaded_index;
6
7cuvsResourcesCreate(&res);
8cuvsIvfPqIndexCreate(&index);
9cuvsIvfPqIndexCreate(&loaded_index);
10
11// ... build index ...
12cuvsIvfPqSerialize(res, "/tmp/cuvs-ivf-pq.bin", index);
13cuvsIvfPqDeserialize(res, "/tmp/cuvs-ivf-pq.bin", loaded_index);
14
15cuvsIvfPqIndexDestroy(loaded_index);
16cuvsIvfPqIndexDestroy(index);
17cuvsResourcesDestroy(res);

How IVF-PQ works

IVF-PQ combines two ideas:

  1. IVF partitions the dataset into n_lists coarse clusters. During search, each query visits only the closest n_probes lists.
  2. PQ compresses each vector into a short code. Instead of comparing the query to every original vector value, search compares the query to precomputed codebook entries.

Think of each vector as a long row of numbers. PQ splits that row into smaller chunks, gives each chunk a short code, and stores the codes. At search time, cuVS uses lookup tables to score those codes quickly.

The compression is lossy, so IVF-PQ trades some recall for a smaller and faster index. Refinement can recover much of that recall when the original vectors are still available.

When to use IVF-PQ

Use IVF-PQ when GPU memory is the main limit and IVF-Flat would store too much full-precision data.

IVF-PQ is often a good fit for large datasets, high-throughput candidate generation, and workflows that can rerank candidates with exact distances afterward.

Avoid IVF-PQ when exact distances are required for every result, when recall cannot tolerate compression error, or when the original vectors are unavailable but high recall depends on refinement.

Refinement and reranking

IVF-PQ search returns approximate distances from compressed vectors. A common pattern is to ask IVF-PQ for K0 candidates, where K0 is larger than the final K, then rerank those candidates with exact distances against the original dataset.

For example, search for K0 = 4 * K, refine the candidates, and keep the best K. This needs the original vectors, but it can improve recall substantially without storing full vectors inside the IVF-PQ index.

1#include <cuvs/neighbors/ivf_pq.h>
2#include <cuvs/neighbors/refine.h>
3
4int64_t k = 10;
5int64_t k0 = 4 * k;
6
7DLManagedTensor *dataset;
8DLManagedTensor *queries;
9DLManagedTensor *candidate_neighbors;
10DLManagedTensor *candidate_distances;
11DLManagedTensor *refined_neighbors;
12DLManagedTensor *refined_distances;
13
14// Allocate candidate output as [n_queries, k0].
15allocate_outputs(candidate_neighbors, candidate_distances);
16
17cuvsIvfPqSearch(
18 res,
19 search_params,
20 index,
21 queries,
22 candidate_neighbors,
23 candidate_distances);
24
25// Allocate refined output as [n_queries, k].
26allocate_outputs(refined_neighbors, refined_distances);
27
28cuvsRefine(
29 res,
30 dataset,
31 queries,
32 candidate_neighbors,
33 index_params->metric,
34 refined_neighbors,
35 refined_distances);

Rust and Go currently expose standalone IVF-PQ search without a separate refinement wrapper.

Using Filters

IVF-PQ C++ search supports sample filters. Filtering happens only inside the lists selected by n_probes, so a filtered search can miss an allowed vector if that vector lives in a list that was not probed.

If filtered recall matters, increase n_probes, evaluate with representative filters, or use brute-force when the filter is expected to remove most vectors. C, Python, Rust, and Go currently expose standalone IVF-PQ search without a filter argument, and Java does not currently expose a standalone IVF-PQ binding.

1#include <cuvs/core/bitset.hpp>
2#include <cuvs/neighbors/ivf_pq.hpp>
3
4using namespace cuvs::neighbors;
5using indexing_dtype = int64_t;
6
7raft::device_resources res;
8ivf_pq::index<indexing_dtype> index(res);
9ivf_pq::search_params search_params;
10auto queries = load_queries();
11
12auto neighbors =
13 raft::make_device_matrix<indexing_dtype, indexing_dtype>(res, n_queries, k);
14auto distances =
15 raft::make_device_matrix<float, indexing_dtype>(res, n_queries, k);
16
17// Load a list of all samples that should be allowed.
18std::vector<indexing_dtype> allowed_indices_host = get_allowed_indices();
19auto allowed_indices_device =
20 raft::make_device_vector<indexing_dtype, indexing_dtype>(
21 res,
22 allowed_indices_host.size());
23
24raft::copy(
25 allowed_indices_device.data_handle(),
26 allowed_indices_host.data(),
27 allowed_indices_host.size(),
28 raft::resource::get_cuda_stream(res));
29
30cuvs::core::bitset<uint32_t, indexing_dtype> allowed_indices_bitset(
31 res,
32 allowed_indices_device.view(),
33 index.size());
34auto bitset_filter =
35 filtering::bitset_filter(allowed_indices_bitset.view());
36
37ivf_pq::search(
38 res,
39 search_params,
40 index,
41 queries,
42 neighbors.view(),
43 distances.view(),
44 bitset_filter);

Configuration parameters

Build parameters

NameDefaultDescription
metricL2Expanded / sqeuclideanDistance metric used for cluster training, PQ scoring, and search.
metric_arg2.0Extra argument for metrics that need one, such as Minkowski distance.
n_lists1024Number of coarse clusters, or inverted lists. More lists make each list smaller, but may require more probes to maintain recall.
kmeans_n_iters20Maximum number of k-means iterations used to train the coarse list centers.
kmeans_trainset_fraction0.5Fraction of the dataset used for coarse k-means training. Reducing this can speed up build on very large datasets.
pq_bits8Number of bits used for each PQ code. Smaller values reduce memory and lookup-table size, but can lower recall.
pq_dim0Number of PQ code dimensions. 0 lets cuVS choose a heuristic value. pq_dim * pq_bits must be a multiple of 8.
codebook_kindper_subspaceTrains one codebook per PQ subspace, or separate codebooks per cluster with per_cluster. Per-cluster codebooks can improve recall but use more memory.
codes_layoutinterleavedMemory layout for compressed list data. interleaved is optimized for GPU search. flat stores each vector code contiguously.
force_random_rotationFalseApplies a random rotation even when the input dimension is already compatible with pq_dim. Rotation is always needed when the dimension must be padded to a multiple of pq_dim.
conservative_memory_allocationFalseUses tighter per-list allocations. The default overallocates lists to reduce reallocations during repeated extend calls.
add_data_on_buildTrueAdds vectors to the index during build. Set to False to train the quantizers first and add vectors later with extend.
max_train_points_per_pq_code256Maximum training points used per PQ code during codebook training. Higher values can improve codebooks but increase build time.

Search parameters

NameDefaultDescription
n_probes20Number of closest lists to scan for each query. This is the main recall and latency knob.
lut_dtypeCUDA_R_32F / float32Data type used for PQ lookup tables. Lower precision can reduce shared memory use and improve speed, with possible recall loss.
internal_distance_dtypeCUDA_R_32F / float32Data type used for search-time distance accumulation. CUDA_R_16F can improve performance when memory access dominates.
preferred_shmem_carveout1.0Hint for the fraction of unified L1/shared memory used as shared memory. This is a low-level performance knob.
coarse_search_dtypeCUDA_R_32F / float32Data type used when finding the closest coarse clusters. Lower precision is faster but can reduce recall.
max_internal_batch_size4096Internal search batch size. Larger batches can improve GPU utilization but require more temporary memory.

Filters are passed as search function arguments in bindings that expose filtered search, not as fields in ivf_pq::search_params.

Tuning

Start with n_lists. A common target is roughly 1,000 to 10,000 vectors per list. More lists make each list smaller, but search usually needs a higher n_probes value to maintain recall.

Tune n_probes next. Increasing n_probes usually improves recall because each query scans more lists, but it also increases latency.

Then tune compression. Lower pq_bits and lower pq_dim reduce memory and can speed up search, but they also make each stored vector less precise. If recall drops too much, increase pq_bits, increase pq_dim, use per_cluster codebooks, or add refinement.

Use lower-precision lut_dtype, internal_distance_dtype, or coarse_search_dtype only after the basic recall and latency balance is acceptable. These options can improve throughput, but they add another source of approximation.

Memory footprint

IVF-PQ memory has four main parts: compressed vector codes, source row IDs, coarse cluster centers, and PQ codebooks. The index is much smaller than IVF-Flat because it stores pq_bits-wide codes rather than full vectors.

To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.

  • N: Number of database vectors, or rows in the dataset being indexed.
  • D: Original vector dimension, or number of values in each vector.
  • D_rot: Rotated and padded dimension used internally by PQ. This is a multiple of M.
  • B: Bytes stored for each original vector value. Use 4 for fp32, 2 for fp16, or 1 for int8 and uint8.
  • L: Number of inverted lists. This is the n_lists build parameter.
  • M: Number of PQ code dimensions. This is the pq_dim build parameter.
  • b: Bits per PQ code. This is the pq_bits build parameter.
  • C: Number of entries in each PQ codebook, equal to 2^b.
  • S_idx: Bytes per source row ID. This is sizeof(IdxT), usually 8 for int64_t.
  • N_cap: Total allocated list capacity across all lists after padding and over-allocation.
  • Q: Query batch size, or number of query vectors processed together.
  • K: Search result count, or the requested k nearest neighbors per query.
  • K0: Candidate count used before refinement, where K0 >= K.
  • P: Number of lists probed during search. This is the n_probes search parameter.
  • F: Training-set fraction. This is the kmeans_trainset_fraction build parameter.
  • B_lut: Bytes per lookup-table value. This depends on lut_dtype.

The named terms in the formulas are also memory sizes:

  • encoded_data_size: Device memory used by compressed PQ codes stored inside IVF lists.
  • list_index_size: Device memory used by source row IDs stored beside list codes.
  • centers_size: Device memory used by the L coarse cluster centers.
  • rotation_size: Device memory used by the optional random rotation matrix and rotated centers.
  • codebook_size: Device memory used by PQ codebooks.
  • index_size: Approximate device memory used by the trained IVF-PQ index.
  • query_size: Device memory for the current query batch.
  • result_size: Device memory for neighbor IDs and distances returned for the current query batch.
  • workspace_size: Temporary search workspace.
  • refinement_dataset_size: Memory needed for original vectors when refinement reranks candidates.

Scratch and maximum vectors

The formulas below show the major persistent buffers and search workspaces. Additional scratch comes from k-means training, PQ training, list allocation padding, allocator fragmentation, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.20 for build estimates and H = 0.10 for search estimates. If you can measure a representative smaller run, use:

Hmeasured=observed_peakformula_without_scratchformula_without_scratchH_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

Musable=(MfreeMother)(1H)M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

  • M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
  • M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
  • H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
  • M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
  • observed_peak: Peak memory observed during a smaller representative run.
  • formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
  • peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
  • B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
  • B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
  • N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

IVF-PQ capacity depends on list padding and the selected code layout. Estimate N_cap first for the default interleaved layout. Then solve the selected peak formula:

peak_without_scratch(N)= encoded_data_size(Ncap)+NBper_input_vector+Bfixed\begin{aligned} \text{peak\_without\_scratch}(N) =&\ \text{encoded\_data\_size}(N_{\text{cap}}) \\ &+ N \cdot B_{\text{per\_input\_vector}} + B_{\text{fixed}} \end{aligned}

For a rough maximum-vector estimate, try increasing N values until peak_without_scratch(N) <= M_usable no longer holds. If using the flat layout and no refinement dataset, most terms are linear in N, so the shortcut is:

Nmax=MusableBfixedBper_vectorN_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_vector}}} \right\rfloor

List capacity

The default IVF-PQ list layout is interleaved. It pads each list for efficient GPU memory access. The minimum alignment is 32 rows. With the default allocation strategy, lists may be overallocated up to 1024-row chunks to make future extend calls cheaper. With conservative_memory_allocation = true, the maximum alignment is 32 rows.

For each list i, let n_i be the number of vectors assigned to that list.

Amin=32\begin{aligned} A_{\text{min}} &= 32 \end{aligned} Amax={32,conservative allocation1024,default allocation\begin{aligned} A_{\text{max}} &= \begin{cases} 32, & \text{conservative allocation} \\ 1024, & \text{default allocation} \end{cases} \end{aligned}

For the default interleaved layout, the implementation chooses a capacity for each list based on A_min and A_max:

capacity(ni)={min ⁣(nextpow2(max(ni,Amin)),Amax),ni<Amaxround_up(ni,Amax),niAmax\begin{aligned} \operatorname{capacity}(n_i) &= \begin{cases} \min\!\big( \operatorname{nextpow2} (\max(n_i, A_{\text{min}})), A_{\text{max}} \big), & n_i < A_{\text{max}} \\ \operatorname{round\_up}(n_i, A_{\text{max}}), & n_i \ge A_{\text{max}} \end{cases} \end{aligned}

The total allocated list capacity is:

Ncap=i=1Lcapacity(ni)\begin{aligned} N_{\text{cap}} &= \sum_{i=1}^{L} \operatorname{capacity}(n_i) \end{aligned}

For balanced lists, a simple approximation is:

NcapL×capacity(NL)\begin{aligned} N_{\text{cap}} &\approx L \times \operatorname{capacity} \left(\frac{N}{L}\right) \end{aligned}

The flat code layout stores exactly one contiguous code per vector, so its list capacity is approximately N instead of N_cap. Use the exact list sizes when estimating tightly.

Baseline memory after build

For the flat code layout, each compressed vector uses:

bytes_per_code=M×b8\begin{aligned} \text{bytes\_per\_code} &= \left\lceil \frac{M \times b}{8} \right\rceil \end{aligned}

The flat encoded data size is:

encoded_data_sizeflat=N×bytes_per_code\begin{aligned} \text{encoded\_data\_size}_{\text{flat}} &= N \times \text{bytes\_per\_code} \end{aligned}

For the default interleaved code layout, codes are stored in 32-row groups and 16-byte vectorized chunks. Let:

Cchunk=16×8b\begin{aligned} C_{\text{chunk}} &= \frac{16 \times 8}{b} \end{aligned}

Then the encoded data size is approximately:

encoded_data_sizeinterleaved=i=1Lcapacity(ni)32×MCchunk×32×16\begin{aligned} \text{encoded\_data\_size}_{\text{interleaved}} &= \sum_{i=1}^{L} \left\lceil \frac{\operatorname{capacity}(n_i)}{32} \right\rceil \\ &\quad \times \left\lceil \frac{M}{C_{\text{chunk}}} \right\rceil \times 32 \times 16 \end{aligned}

Source row IDs are stored beside the compressed codes:

list_index_size=Ncap×Sidx\begin{aligned} \text{list\_index\_size} &= N_{\text{cap}} \times S_{\text{idx}} \end{aligned}

Coarse centers are stored in full precision:

centers_size=L×D×sizeof(float)\begin{aligned} \text{centers\_size} &= L \times D \times \operatorname{sizeof}(\mathrm{float}) \end{aligned}

When a rotation is needed, the rotated centers and rotation matrix add:

rotation_sizeL×Drot×sizeof(float)+Drot×D×sizeof(float)\begin{aligned} \text{rotation\_size} &\approx L \times D_{\text{rot}} \times \operatorname{sizeof}(\mathrm{float}) \\ &\quad + D_{\text{rot}} \times D \times \operatorname{sizeof}(\mathrm{float}) \end{aligned}

For per_subspace codebooks:

codebook_sizeper_subspace=M×DrotM×C×sizeof(float)\begin{aligned} \text{codebook\_size}_{\text{per\_subspace}} &= M \times \frac{D_{\text{rot}}}{M} \times C \times \operatorname{sizeof}(\mathrm{float}) \end{aligned}

For per_cluster codebooks:

codebook_sizeper_cluster=L×DrotM×C×sizeof(float)\begin{aligned} \text{codebook\_size}_{\text{per\_cluster}} &= L \times \frac{D_{\text{rot}}}{M} \times C \times \operatorname{sizeof}(\mathrm{float}) \end{aligned}

The baseline index footprint is:

index_sizeencoded_data_size+list_index_size+centers_size+rotation_size+codebook_size+small list metadata\begin{aligned} \text{index\_size} &\approx \text{encoded\_data\_size} \\ &\quad + \text{list\_index\_size} \\ &\quad + \text{centers\_size} \\ &\quad + \text{rotation\_size} \\ &\quad + \text{codebook\_size} \\ &\quad + \text{small list metadata} \end{aligned}

Example (N = 1e6, D = 128, D_rot = 128, L = 1024, M = 64, b = 8, balanced lists, default interleaved layout, IdxT = int64, per_subspace codebooks):

  • N / L = 976.56, so N_cap ~= 1024 * 1024 = 1,048,576
  • encoded_data_size ~= 67,108,864 B = 64.00 MiB
  • list_index_size = 8,388,608 B = 8.00 MiB
  • centers_size = 524,288 B = 0.50 MiB
  • rotation_size ~= 589,824 B = 0.56 MiB
  • codebook_size = 131,072 B = 0.13 MiB
  • index_size ~= 73.19 MiB + small metadata

The original dataset would be 1e6 * 128 * 4 = 488.28 MiB in fp32, so IVF-PQ stores the searchable index much more compactly.

Refinement memory

Refinement needs the original vectors in addition to the IVF-PQ index:

refinement_dataset_size=N×D×B\begin{aligned} \text{refinement\_dataset\_size} &= N \times D \times B \end{aligned}

It also needs candidate IDs from the approximate IVF-PQ search:

candidate_size=Q×K0×Sidx\begin{aligned} \text{candidate\_size} &= Q \times K0 \times S_{\text{idx}} \end{aligned}

When the original dataset is too large to keep on the GPU, refinement can be staged or performed where the original vectors are available.

Build peak memory usage

Build trains coarse centers, trains PQ codebooks, and fills the lists when add_data_on_build = true. Peak memory depends on whether the input dataset is already on device and how the workspace memory resource is configured.

The k-means training sample size is:

Ntrain=F×N\begin{aligned} N_{\text{train}} &= F \times N \end{aligned}

A practical build estimate is:

training_sample_size=Ntrain×D×B\begin{aligned} \text{training\_sample\_size} &= N_{\text{train}} \times D \times B \end{aligned} training_labels_size=Ntrain×sizeof(uint32_t)\begin{aligned} \text{training\_labels\_size} &= N_{\text{train}} \times \operatorname{sizeof}(\mathrm{uint32\_t}) \end{aligned}

The PQ training workspace depends on pq_bits, pq_dim, codebook_kind, and max_train_points_per_pq_code. A useful estimate for the number of training vectors per codebook is:

Npq_trainC×max_train_points_per_pq_code\begin{aligned} N_{\text{pq\_train}} &\lesssim C \times \text{max\_train\_points\_per\_pq\_code} \end{aligned}

The total build peak is approximately:

build_peakinput_dataset_size+training_sample_size+training_labels_size+centers_size+rotation_size+codebook_training_workspace+index_size\begin{aligned} \text{build\_peak} &\approx \text{input\_dataset\_size} \\ &\quad + \text{training\_sample\_size} \\ &\quad + \text{training\_labels\_size} \\ &\quad + \text{centers\_size} \\ &\quad + \text{rotation\_size} \\ &\quad + \text{codebook\_training\_workspace} \\ &\quad + \text{index\_size} \end{aligned}

When the workspace memory resource does not have enough room for the training set and labels, IVF-PQ build can fall back to managed memory for those temporary allocations.

Search peak memory usage

IVF-PQ search requires the compressed index to be resident on the GPU. Search also needs query vectors, output buffers, PQ lookup tables, and temporary workspace.

query_size=Q×D×B\begin{aligned} \text{query\_size} &= Q \times D \times B \end{aligned} result_size=Q×K×(Sidx+sizeof(float))\begin{aligned} \text{result\_size} &= Q \times K \\ &\quad \times \big(S_{\text{idx}} + \operatorname{sizeof}(\mathrm{float})\big) \end{aligned}

A useful lookup-table estimate is:

Pcodebook={1,per-subspace codebooksP,per-cluster codebooks\begin{aligned} P_{\text{codebook}} &= \begin{cases} 1, & \text{per-subspace codebooks} \\ P, & \text{per-cluster codebooks} \end{cases} \end{aligned} lut_sizeQ×Pcodebook×M×C×Blut\begin{aligned} \text{lut\_size} &\approx Q \times P_{\text{codebook}} \times M \times C \times B_{\text{lut}} \end{aligned}

A practical workspace estimate is:

workspace_sizelut_size+Q×[(L+1+P×(K+1))×sizeof(float)+P×K×Sidx]\begin{aligned} \text{workspace\_size} &\approx \text{lut\_size} \\ &\quad + Q \times \Big[ \big(L + 1 + P \times (K + 1)\big) \times \operatorname{sizeof}(\mathrm{float}) \\ &\qquad + P \times K \times S_{\text{idx}} \Big] \end{aligned}

The total search memory is:

search_memoryindex_size+query_size+result_size+workspace_size\begin{aligned} \text{search\_memory} &\approx \text{index\_size} \\ &\quad + \text{query\_size} \\ &\quad + \text{result\_size} \\ &\quad + \text{workspace\_size} \end{aligned}

For many lists, use a pool memory resource when possible. Each IVF list is allocated separately, and a pool allocator helps avoid large per-allocation overhead from the underlying device memory resource.