> For clean Markdown of any page, append .md to the page URL. > For a complete documentation index, see https://docs.nvidia.com/cuvs/llms.txt. > For full documentation content, see https://docs.nvidia.com/cuvs/llms-full.txt. > For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/cuvs/_mcp/server. # IVF-PQ IVF-PQ is a GPU-accelerated approximate nearest-neighbor index. It first partitions vectors into coarse clusters, also called inverted lists, then stores a compressed product-quantized code for each vector in the selected list. Compared with IVF-Flat, IVF-PQ uses much less GPU memory because it stores compressed codes instead of full vectors. The trade-off is that search distances are approximate. For high recall, IVF-PQ is commonly paired with refinement, where the search returns more candidates than needed and the final top results are reranked with the original vectors. IVF-PQ works well when the dataset is too large for full-precision storage on the GPU, some recall loss is acceptable, and the original vectors are available for optional reranking. ## Example API Usage [C API](/api-reference/c-api-neighbors-ivf-pq) | [C++ API](/api-reference/cpp-api-neighbors-ivf-pq) | [Python API](/api-reference/python-api-neighbors-ivf-pq) | [Java IVF-PQ Params](/api-reference/java-api-com-nvidia-cuvs-cuvsivfpqparams) | [Rust API](/api-reference/rust-api-cuvs-ivf-pq) | [Go API](/api-reference/go-api-ivf-pq) Java currently exposes IVF-PQ parameter classes for CAGRA graph construction, not a standalone IVF-PQ index/search binding. The runnable standalone examples below cover C, C++, Python, Rust, and Go. ### Building an index ```c #include #include cuvsResources_t res; cuvsIvfPqIndexParams_t index_params; cuvsIvfPqIndex_t index; DLManagedTensor *dataset; // Populate tensor with data. load_dataset(dataset); cuvsResourcesCreate(&res); cuvsIvfPqIndexParamsCreate(&index_params); cuvsIvfPqIndexCreate(&index); index_params->n_lists = 1024; index_params->kmeans_trainset_fraction = 0.5; index_params->pq_bits = 8; index_params->pq_dim = 64; cuvsIvfPqBuild(res, index_params, dataset, index); // Keep index alive while you search, then destroy it when finished. cuvsIvfPqIndexDestroy(index); cuvsIvfPqIndexParamsDestroy(index_params); cuvsResourcesDestroy(res); ``` ```cpp #include using namespace cuvs::neighbors; raft::device_resources res; auto dataset = load_dataset(); ivf_pq::index_params index_params; index_params.n_lists = 1024; index_params.kmeans_trainset_fraction = 0.5; index_params.pq_bits = 8; index_params.pq_dim = 64; index_params.metric = cuvs::distance::DistanceType::L2Expanded; auto index = ivf_pq::build(res, index_params, dataset); ``` ```python from cuvs.neighbors import ivf_pq dataset = load_data() index_params = ivf_pq.IndexParams( n_lists=1024, kmeans_trainset_fraction=0.5, pq_bits=8, pq_dim=64, ) index = ivf_pq.build(index_params, dataset) ``` ```rust use cuvs::ivf_pq::{Index, IndexParams}; use cuvs::{ManagedTensor, Resources, Result}; fn build_ivf_pq_index(dataset: &ndarray::Array2) -> Result { let res = Resources::new()?; let dataset = ManagedTensor::from(dataset).to_device(&res)?; let index_params = IndexParams::new()? .set_n_lists(1024) .set_pq_bits(8) .set_pq_dim(64); Index::build(&res, &index_params, dataset) } ``` ```go package main import ( cuvs "github.com/rapidsai/cuvs/go" "github.com/rapidsai/cuvs/go/ivf_pq" ) func buildIvfPqIndex(dataset cuvs.Tensor[float32]) (*ivf_pq.IvfPqIndex, error) { resource, err := cuvs.NewResource(nil) if err != nil { return nil, err } indexParams, err := ivf_pq.CreateIndexParams() if err != nil { return nil, err } _, err = indexParams.SetNLists(1024) if err != nil { return nil, err } _, err = indexParams.SetPQBits(8) if err != nil { return nil, err } _, err = indexParams.SetPQDim(64) if err != nil { return nil, err } index, err := ivf_pq.CreateIndex(indexParams, &dataset) if err != nil { return nil, err } _, err = dataset.ToDevice(&resource) if err != nil { return nil, err } err = ivf_pq.BuildIndex(resource, indexParams, &dataset, index) return index, err } ``` ### Searching an index ```c #include #include cuvsResources_t res; cuvsIvfPqSearchParams_t search_params; cuvsIvfPqIndex_t index; DLManagedTensor *queries; DLManagedTensor *neighbors; DLManagedTensor *distances; // Populate tensors with query and output data. load_queries(queries); allocate_outputs(neighbors, distances); cuvsResourcesCreate(&res); cuvsIvfPqSearchParamsCreate(&search_params); search_params->n_probes = 20; // ... build or load index ... cuvsIvfPqSearch( res, search_params, index, queries, neighbors, distances); cuvsIvfPqSearchParamsDestroy(search_params); cuvsResourcesDestroy(res); ``` ```cpp #include using namespace cuvs::neighbors; raft::device_resources res; ivf_pq::index index(res); auto queries = load_queries(); auto neighbors = raft::make_device_matrix(res, n_queries, k); auto distances = raft::make_device_matrix(res, n_queries, k); ivf_pq::search_params search_params; search_params.n_probes = 20; // ... build or load index ... ivf_pq::search( res, search_params, index, queries, neighbors.view(), distances.view()); ``` ```python from cuvs.neighbors import ivf_pq queries = load_queries() search_params = ivf_pq.SearchParams(n_probes=20) k = 10 # ... build or load index ... distances, neighbors = ivf_pq.search(search_params, index, queries, k) ``` ```rust use cuvs::ivf_pq::SearchParams; use cuvs::{ManagedTensor, Resources, Result}; fn search_ivf_pq_index( res: &Resources, index: &cuvs::ivf_pq::Index, queries: &ndarray::ArrayView2, k: usize, ) -> Result<()> { let n_queries = queries.shape()[0]; let queries = ManagedTensor::from(queries).to_device(res)?; let mut neighbors_host = ndarray::Array::::zeros((n_queries, k)); let neighbors = ManagedTensor::from(&neighbors_host).to_device(res)?; let mut distances_host = ndarray::Array::::zeros((n_queries, k)); let distances = ManagedTensor::from(&distances_host).to_device(res)?; let search_params = SearchParams::new()?.set_n_probes(20); index.search(res, &search_params, &queries, &neighbors, &distances)?; distances.to_host(res, &mut distances_host)?; neighbors.to_host(res, &mut neighbors_host)?; Ok(()) } ``` ```go searchParams, err := ivf_pq.CreateSearchParams() if err != nil { return err } defer searchParams.Close() _, err = searchParams.SetNProbes(20) if err != nil { return err } _, err = queries.ToDevice(&resource) if err != nil { return err } err = ivf_pq.SearchIndex( resource, searchParams, index, &queries, &neighbors, &distances, ) if err != nil { return err } _, err = neighbors.ToHost(&resource) if err != nil { return err } _, err = distances.ToHost(&resource) if err != nil { return err } ``` ### Saving and loading an index Serialize an IVF-PQ index when you want to reuse trained coarse clusters, PQ codebooks, and compressed lists without rebuilding the index. Java currently exposes IVF-PQ parameter classes for CAGRA graph construction, not a standalone IVF-PQ index/search binding. Rust and Go currently expose IVF-PQ build/search wrappers, but not IVF-PQ save/load wrappers. ```c #include cuvsResources_t res; cuvsIvfPqIndex_t index; cuvsIvfPqIndex_t loaded_index; cuvsResourcesCreate(&res); cuvsIvfPqIndexCreate(&index); cuvsIvfPqIndexCreate(&loaded_index); // ... build index ... cuvsIvfPqSerialize(res, "/tmp/cuvs-ivf-pq.bin", index); cuvsIvfPqDeserialize(res, "/tmp/cuvs-ivf-pq.bin", loaded_index); cuvsIvfPqIndexDestroy(loaded_index); cuvsIvfPqIndexDestroy(index); cuvsResourcesDestroy(res); ``` ```cpp #include using namespace cuvs::neighbors; raft::device_resources res; auto dataset = load_dataset(); ivf_pq::index_params index_params; auto index = ivf_pq::build(res, index_params, dataset); ivf_pq::serialize(res, "/tmp/cuvs-ivf-pq.bin", index); ivf_pq::index loaded_index(res); ivf_pq::deserialize(res, "/tmp/cuvs-ivf-pq.bin", &loaded_index); ``` ```python from cuvs.neighbors import ivf_pq dataset = load_data() index = ivf_pq.build(ivf_pq.IndexParams(), dataset) ivf_pq.save("/tmp/cuvs-ivf-pq.bin", index, include_dataset=True) loaded_index = ivf_pq.load("/tmp/cuvs-ivf-pq.bin") ``` ## How IVF-PQ works IVF-PQ combines two ideas: 1. IVF partitions the dataset into `n_lists` coarse clusters. During search, each query visits only the closest `n_probes` lists. 2. PQ compresses each vector into a short code. Instead of comparing the query to every original vector value, search compares the query to precomputed codebook entries. Think of each vector as a long row of numbers. PQ splits that row into smaller chunks, gives each chunk a short code, and stores the codes. At search time, cuVS uses lookup tables to score those codes quickly. The compression is lossy, so IVF-PQ trades some recall for a smaller and faster index. Refinement can recover much of that recall when the original vectors are still available. ## When to use IVF-PQ Use IVF-PQ when GPU memory is the main limit and IVF-Flat would store too much full-precision data. IVF-PQ is often a good fit for large datasets, high-throughput candidate generation, and workflows that can rerank candidates with exact distances afterward. Avoid IVF-PQ when exact distances are required for every result, when recall cannot tolerate compression error, or when the original vectors are unavailable but high recall depends on refinement. ## Refinement and reranking IVF-PQ search returns approximate distances from compressed vectors. A common pattern is to ask IVF-PQ for `K0` candidates, where `K0` is larger than the final `K`, then rerank those candidates with exact distances against the original dataset. For example, search for `K0 = 4 * K`, refine the candidates, and keep the best `K`. This needs the original vectors, but it can improve recall substantially without storing full vectors inside the IVF-PQ index. ```c #include #include int64_t k = 10; int64_t k0 = 4 * k; DLManagedTensor *dataset; DLManagedTensor *queries; DLManagedTensor *candidate_neighbors; DLManagedTensor *candidate_distances; DLManagedTensor *refined_neighbors; DLManagedTensor *refined_distances; // Allocate candidate output as [n_queries, k0]. allocate_outputs(candidate_neighbors, candidate_distances); cuvsIvfPqSearch( res, search_params, index, queries, candidate_neighbors, candidate_distances); // Allocate refined output as [n_queries, k]. allocate_outputs(refined_neighbors, refined_distances); cuvsRefine( res, dataset, queries, candidate_neighbors, index_params->metric, refined_neighbors, refined_distances); ``` ```cpp #include #include using namespace cuvs::neighbors; int64_t k = 10; int64_t k0 = 4 * k; auto candidate_neighbors = raft::make_device_matrix(res, n_queries, k0); auto candidate_distances = raft::make_device_matrix(res, n_queries, k0); ivf_pq::search( res, search_params, index, queries, candidate_neighbors.view(), candidate_distances.view()); auto refined_neighbors = raft::make_device_matrix(res, n_queries, k); auto refined_distances = raft::make_device_matrix(res, n_queries, k); refine( res, dataset, queries, raft::make_const_mdspan(candidate_neighbors.view()), refined_neighbors.view(), refined_distances.view(), index.metric()); ``` ```python from cuvs.neighbors import ivf_pq, refine k = 10 k0 = 4 * k candidate_distances, candidates = ivf_pq.search( search_params, index, queries, k0, ) distances, neighbors = refine( dataset, queries, candidates, k, metric="sqeuclidean", ) ``` Rust and Go currently expose standalone IVF-PQ search without a separate refinement wrapper. ## Using Filters IVF-PQ C++ search supports sample filters. Filtering happens only inside the lists selected by `n_probes`, so a filtered search can miss an allowed vector if that vector lives in a list that was not probed. If filtered recall matters, increase `n_probes`, evaluate with representative filters, or use brute-force when the filter is expected to remove most vectors. C, Python, Rust, and Go currently expose standalone IVF-PQ search without a filter argument, and Java does not currently expose a standalone IVF-PQ binding. ```cpp #include #include using namespace cuvs::neighbors; using indexing_dtype = int64_t; raft::device_resources res; ivf_pq::index index(res); ivf_pq::search_params search_params; auto queries = load_queries(); auto neighbors = raft::make_device_matrix(res, n_queries, k); auto distances = raft::make_device_matrix(res, n_queries, k); // Load a list of all samples that should be allowed. std::vector allowed_indices_host = get_allowed_indices(); auto allowed_indices_device = raft::make_device_vector( res, allowed_indices_host.size()); raft::copy( allowed_indices_device.data_handle(), allowed_indices_host.data(), allowed_indices_host.size(), raft::resource::get_cuda_stream(res)); cuvs::core::bitset allowed_indices_bitset( res, allowed_indices_device.view(), index.size()); auto bitset_filter = filtering::bitset_filter(allowed_indices_bitset.view()); ivf_pq::search( res, search_params, index, queries, neighbors.view(), distances.view(), bitset_filter); ``` ## Configuration parameters ### Build parameters | Name | Default | Description | | --- | --- | --- | | `metric` | `L2Expanded` / `sqeuclidean` | Distance metric used for cluster training, PQ scoring, and search. | | `metric_arg` | `2.0` | Extra argument for metrics that need one, such as Minkowski distance. | | `n_lists` | `1024` | Number of coarse clusters, or inverted lists. More lists make each list smaller, but may require more probes to maintain recall. | | `kmeans_n_iters` | `20` | Maximum number of k-means iterations used to train the coarse list centers. | | `kmeans_trainset_fraction` | `0.5` | Fraction of the dataset used for coarse k-means training. Reducing this can speed up build on very large datasets. | | `pq_bits` | `8` | Number of bits used for each PQ code. Smaller values reduce memory and lookup-table size, but can lower recall. | | `pq_dim` | `0` | Number of PQ code dimensions. `0` lets cuVS choose a heuristic value. `pq_dim * pq_bits` must be a multiple of 8. | | `codebook_kind` | `per_subspace` | Trains one codebook per PQ subspace, or separate codebooks per cluster with `per_cluster`. Per-cluster codebooks can improve recall but use more memory. | | `codes_layout` | `interleaved` | Memory layout for compressed list data. `interleaved` is optimized for GPU search. `flat` stores each vector code contiguously. | | `force_random_rotation` | `False` | Applies a random rotation even when the input dimension is already compatible with `pq_dim`. Rotation is always needed when the dimension must be padded to a multiple of `pq_dim`. | | `conservative_memory_allocation` | `False` | Uses tighter per-list allocations. The default overallocates lists to reduce reallocations during repeated `extend` calls. | | `add_data_on_build` | `True` | Adds vectors to the index during build. Set to `False` to train the quantizers first and add vectors later with `extend`. | | `max_train_points_per_pq_code` | `256` | Maximum training points used per PQ code during codebook training. Higher values can improve codebooks but increase build time. | ### Search parameters | Name | Default | Description | | --- | --- | --- | | `n_probes` | `20` | Number of closest lists to scan for each query. This is the main recall and latency knob. | | `lut_dtype` | `CUDA_R_32F` / `float32` | Data type used for PQ lookup tables. Lower precision can reduce shared memory use and improve speed, with possible recall loss. | | `internal_distance_dtype` | `CUDA_R_32F` / `float32` | Data type used for search-time distance accumulation. `CUDA_R_16F` can improve performance when memory access dominates. | | `preferred_shmem_carveout` | `1.0` | Hint for the fraction of unified L1/shared memory used as shared memory. This is a low-level performance knob. | | `coarse_search_dtype` | `CUDA_R_32F` / `float32` | Data type used when finding the closest coarse clusters. Lower precision is faster but can reduce recall. | | `max_internal_batch_size` | `4096` | Internal search batch size. Larger batches can improve GPU utilization but require more temporary memory. | Filters are passed as search function arguments in bindings that expose filtered search, not as fields in `ivf_pq::search_params`. ## Tuning Start with `n_lists`. A common target is roughly 1,000 to 10,000 vectors per list. More lists make each list smaller, but search usually needs a higher `n_probes` value to maintain recall. Tune `n_probes` next. Increasing `n_probes` usually improves recall because each query scans more lists, but it also increases latency. Then tune compression. Lower `pq_bits` and lower `pq_dim` reduce memory and can speed up search, but they also make each stored vector less precise. If recall drops too much, increase `pq_bits`, increase `pq_dim`, use `per_cluster` codebooks, or add refinement. Use lower-precision `lut_dtype`, `internal_distance_dtype`, or `coarse_search_dtype` only after the basic recall and latency balance is acceptable. These options can improve throughput, but they add another source of approximation. ## Memory footprint IVF-PQ memory has four main parts: compressed vector codes, source row IDs, coarse cluster centers, and PQ codebooks. The index is much smaller than IVF-Flat because it stores `pq_bits`-wide codes rather than full vectors. To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by `1024 * 1024`. - `N`: Number of database vectors, or rows in the dataset being indexed. - `D`: Original vector dimension, or number of values in each vector. - `D_rot`: Rotated and padded dimension used internally by PQ. This is a multiple of `M`. - `B`: Bytes stored for each original vector value. Use `4` for fp32, `2` for fp16, or `1` for int8 and uint8. - `L`: Number of inverted lists. This is the `n_lists` build parameter. - `M`: Number of PQ code dimensions. This is the `pq_dim` build parameter. - `b`: Bits per PQ code. This is the `pq_bits` build parameter. - `C`: Number of entries in each PQ codebook, equal to `2^b`. - `S_idx`: Bytes per source row ID. This is `sizeof(IdxT)`, usually `8` for `int64_t`. - `N_cap`: Total allocated list capacity across all lists after padding and over-allocation. - `Q`: Query batch size, or number of query vectors processed together. - `K`: Search result count, or the requested `k` nearest neighbors per query. - `K0`: Candidate count used before refinement, where `K0 >= K`. - `P`: Number of lists probed during search. This is the `n_probes` search parameter. - `F`: Training-set fraction. This is the `kmeans_trainset_fraction` build parameter. - `B_lut`: Bytes per lookup-table value. This depends on `lut_dtype`. The named terms in the formulas are also memory sizes: - `encoded_data_size`: Device memory used by compressed PQ codes stored inside IVF lists. - `list_index_size`: Device memory used by source row IDs stored beside list codes. - `centers_size`: Device memory used by the `L` coarse cluster centers. - `rotation_size`: Device memory used by the optional random rotation matrix and rotated centers. - `codebook_size`: Device memory used by PQ codebooks. - `index_size`: Approximate device memory used by the trained IVF-PQ index. - `query_size`: Device memory for the current query batch. - `result_size`: Device memory for neighbor IDs and distances returned for the current query batch. - `workspace_size`: Temporary search workspace. - `refinement_dataset_size`: Memory needed for original vectors when refinement reranks candidates. ### Scratch and maximum vectors The formulas below show the major persistent buffers and search workspaces. Additional scratch comes from k-means training, PQ training, list allocation padding, allocator fragmentation, CUDA library workspaces, and memory held by the active memory resource. Use `H = 0.20` for build estimates and `H = 0.10` for search estimates. If you can measure a representative smaller run, use: $$ H_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}} $$ Then set: $$ M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H) $$ The capacity variables in this subsection are: - `M_free`: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory. - `M_other`: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula. - `H`: Scratch headroom fraction reserved for temporary buffers and allocator overhead. - `M_usable`: Memory budget left for the formula after subtracting `M_other` and reserving headroom. - `observed_peak`: Peak memory observed during a smaller representative run. - `formula_without_scratch`: Value of the selected peak formula with explicit `scratch` terms removed and without applying headroom. - `peak_without_scratch(count)`: The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually `N` for rows or vectors and `B` for K-selection batch rows. - `B_per_row` / `B_per_vector`: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as `D`, `K`, `Q`, and `L` are substituted. - `B_fixed`: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata. - `N_max` / `B_max`: Estimated largest row, vector, or batch-row count that fits in `M_usable`. IVF-PQ capacity depends on list padding and the selected code layout. Estimate `N_cap` first for the default interleaved layout. Then solve the selected peak formula: $$ \begin{aligned} \text{peak\_without\_scratch}(N) =&\ \text{encoded\_data\_size}(N_{\text{cap}}) \\ &+ N \cdot B_{\text{per\_input\_vector}} + B_{\text{fixed}} \end{aligned} $$ For a rough maximum-vector estimate, try increasing `N` values until `peak_without_scratch(N) <= M_usable` no longer holds. If using the `flat` layout and no refinement dataset, most terms are linear in `N`, so the shortcut is: $$ N_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_vector}}} \right\rfloor $$ ### List capacity The default IVF-PQ list layout is `interleaved`. It pads each list for efficient GPU memory access. The minimum alignment is 32 rows. With the default allocation strategy, lists may be overallocated up to 1024-row chunks to make future `extend` calls cheaper. With `conservative_memory_allocation = true`, the maximum alignment is 32 rows. For each list `i`, let `n_i` be the number of vectors assigned to that list. $$ \begin{aligned} A_{\text{min}} &= 32 \end{aligned} $$ $$ \begin{aligned} A_{\text{max}} &= \begin{cases} 32, & \text{conservative allocation} \\ 1024, & \text{default allocation} \end{cases} \end{aligned} $$ For the default `interleaved` layout, the implementation chooses a capacity for each list based on `A_min` and `A_max`: $$ \begin{aligned} \operatorname{capacity}(n_i) &= \begin{cases} \min\!\big( \operatorname{nextpow2} (\max(n_i, A_{\text{min}})), A_{\text{max}} \big), & n_i < A_{\text{max}} \\ \operatorname{round\_up}(n_i, A_{\text{max}}), & n_i \ge A_{\text{max}} \end{cases} \end{aligned} $$ The total allocated list capacity is: $$ \begin{aligned} N_{\text{cap}} &= \sum_{i=1}^{L} \operatorname{capacity}(n_i) \end{aligned} $$ For balanced lists, a simple approximation is: $$ \begin{aligned} N_{\text{cap}} &\approx L \times \operatorname{capacity} \left(\frac{N}{L}\right) \end{aligned} $$ The `flat` code layout stores exactly one contiguous code per vector, so its list capacity is approximately `N` instead of `N_cap`. Use the exact list sizes when estimating tightly. ### Baseline memory after build For the `flat` code layout, each compressed vector uses: $$ \begin{aligned} \text{bytes\_per\_code} &= \left\lceil \frac{M \times b}{8} \right\rceil \end{aligned} $$ The flat encoded data size is: $$ \begin{aligned} \text{encoded\_data\_size}_{\text{flat}} &= N \times \text{bytes\_per\_code} \end{aligned} $$ For the default `interleaved` code layout, codes are stored in 32-row groups and 16-byte vectorized chunks. Let: $$ \begin{aligned} C_{\text{chunk}} &= \frac{16 \times 8}{b} \end{aligned} $$ Then the encoded data size is approximately: $$ \begin{aligned} \text{encoded\_data\_size}_{\text{interleaved}} &= \sum_{i=1}^{L} \left\lceil \frac{\operatorname{capacity}(n_i)}{32} \right\rceil \\ &\quad \times \left\lceil \frac{M}{C_{\text{chunk}}} \right\rceil \times 32 \times 16 \end{aligned} $$ Source row IDs are stored beside the compressed codes: $$ \begin{aligned} \text{list\_index\_size} &= N_{\text{cap}} \times S_{\text{idx}} \end{aligned} $$ Coarse centers are stored in full precision: $$ \begin{aligned} \text{centers\_size} &= L \times D \times \operatorname{sizeof}(\mathrm{float}) \end{aligned} $$ When a rotation is needed, the rotated centers and rotation matrix add: $$ \begin{aligned} \text{rotation\_size} &\approx L \times D_{\text{rot}} \times \operatorname{sizeof}(\mathrm{float}) \\ &\quad + D_{\text{rot}} \times D \times \operatorname{sizeof}(\mathrm{float}) \end{aligned} $$ For `per_subspace` codebooks: $$ \begin{aligned} \text{codebook\_size}_{\text{per\_subspace}} &= M \times \frac{D_{\text{rot}}}{M} \times C \times \operatorname{sizeof}(\mathrm{float}) \end{aligned} $$ For `per_cluster` codebooks: $$ \begin{aligned} \text{codebook\_size}_{\text{per\_cluster}} &= L \times \frac{D_{\text{rot}}}{M} \times C \times \operatorname{sizeof}(\mathrm{float}) \end{aligned} $$ The baseline index footprint is: $$ \begin{aligned} \text{index\_size} &\approx \text{encoded\_data\_size} \\ &\quad + \text{list\_index\_size} \\ &\quad + \text{centers\_size} \\ &\quad + \text{rotation\_size} \\ &\quad + \text{codebook\_size} \\ &\quad + \text{small list metadata} \end{aligned} $$ **Example** (`N = 1e6`, `D = 128`, `D_rot = 128`, `L = 1024`, `M = 64`, `b = 8`, balanced lists, default interleaved layout, `IdxT = int64`, `per_subspace` codebooks): - `N / L = 976.56`, so `N_cap ~= 1024 * 1024 = 1,048,576` - `encoded_data_size ~= 67,108,864 B = 64.00 MiB` - `list_index_size = 8,388,608 B = 8.00 MiB` - `centers_size = 524,288 B = 0.50 MiB` - `rotation_size ~= 589,824 B = 0.56 MiB` - `codebook_size = 131,072 B = 0.13 MiB` - `index_size ~= 73.19 MiB + small metadata` The original dataset would be `1e6 * 128 * 4 = 488.28 MiB` in fp32, so IVF-PQ stores the searchable index much more compactly. ### Refinement memory Refinement needs the original vectors in addition to the IVF-PQ index: $$ \begin{aligned} \text{refinement\_dataset\_size} &= N \times D \times B \end{aligned} $$ It also needs candidate IDs from the approximate IVF-PQ search: $$ \begin{aligned} \text{candidate\_size} &= Q \times K0 \times S_{\text{idx}} \end{aligned} $$ When the original dataset is too large to keep on the GPU, refinement can be staged or performed where the original vectors are available. ### Build peak memory usage Build trains coarse centers, trains PQ codebooks, and fills the lists when `add_data_on_build = true`. Peak memory depends on whether the input dataset is already on device and how the workspace memory resource is configured. The k-means training sample size is: $$ \begin{aligned} N_{\text{train}} &= F \times N \end{aligned} $$ A practical build estimate is: $$ \begin{aligned} \text{training\_sample\_size} &= N_{\text{train}} \times D \times B \end{aligned} $$ $$ \begin{aligned} \text{training\_labels\_size} &= N_{\text{train}} \times \operatorname{sizeof}(\mathrm{uint32\_t}) \end{aligned} $$ The PQ training workspace depends on `pq_bits`, `pq_dim`, `codebook_kind`, and `max_train_points_per_pq_code`. A useful estimate for the number of training vectors per codebook is: $$ \begin{aligned} N_{\text{pq\_train}} &\lesssim C \times \text{max\_train\_points\_per\_pq\_code} \end{aligned} $$ The total build peak is approximately: $$ \begin{aligned} \text{build\_peak} &\approx \text{input\_dataset\_size} \\ &\quad + \text{training\_sample\_size} \\ &\quad + \text{training\_labels\_size} \\ &\quad + \text{centers\_size} \\ &\quad + \text{rotation\_size} \\ &\quad + \text{codebook\_training\_workspace} \\ &\quad + \text{index\_size} \end{aligned} $$ When the workspace memory resource does not have enough room for the training set and labels, IVF-PQ build can fall back to managed memory for those temporary allocations. ### Search peak memory usage IVF-PQ search requires the compressed index to be resident on the GPU. Search also needs query vectors, output buffers, PQ lookup tables, and temporary workspace. $$ \begin{aligned} \text{query\_size} &= Q \times D \times B \end{aligned} $$ $$ \begin{aligned} \text{result\_size} &= Q \times K \\ &\quad \times \big(S_{\text{idx}} + \operatorname{sizeof}(\mathrm{float})\big) \end{aligned} $$ A useful lookup-table estimate is: $$ \begin{aligned} P_{\text{codebook}} &= \begin{cases} 1, & \text{per-subspace codebooks} \\ P, & \text{per-cluster codebooks} \end{cases} \end{aligned} $$ $$ \begin{aligned} \text{lut\_size} &\approx Q \times P_{\text{codebook}} \times M \times C \times B_{\text{lut}} \end{aligned} $$ A practical workspace estimate is: $$ \begin{aligned} \text{workspace\_size} &\approx \text{lut\_size} \\ &\quad + Q \times \Big[ \big(L + 1 + P \times (K + 1)\big) \times \operatorname{sizeof}(\mathrm{float}) \\ &\qquad + P \times K \times S_{\text{idx}} \Big] \end{aligned} $$ The total search memory is: $$ \begin{aligned} \text{search\_memory} &\approx \text{index\_size} \\ &\quad + \text{query\_size} \\ &\quad + \text{result\_size} \\ &\quad + \text{workspace\_size} \end{aligned} $$ For many lists, use a pool memory resource when possible. Each IVF list is allocated separately, and a pool allocator helps avoid large per-allocation overhead from the underlying device memory resource.