Product Quantization
Product Quantization, or PQ, is a GPU-accelerated preprocessing algorithm for compressing vectors. It learns small codebooks from a training dataset, then represents each vector with compact integer codes instead of storing every original floating-point value.
Use PQ when memory footprint or bandwidth is the bottleneck and approximate reconstruction is acceptable. PQ is not an ANN index by itself. It is a compression step that can be used before storage, approximate scoring, vector quantization workflows, or index construction.
Example API Usage
C API | C++ API | Python API
Building a quantizer
Building trains the PQ codebooks from a representative dataset. The trained quantizer can then transform compatible datasets into compact PQ codes.
C
C++
Python
Transforming data
Transforming replaces each vector with its PQ code. If vector quantization is enabled with use_vq=True, the transform also returns one VQ label per row.
C
C++
Python
Reconstructing vectors
Inverse transform reconstructs approximate floating-point vectors from PQ codes. Reconstruction is lossy because each code stores the nearest learned codebook entry, not the original subvector.
C
C++
Python
How Product Quantization works
PQ divides each vector into smaller subvectors. For each subvector position, it trains a codebook with 2^pq_bits representative values. Transforming a vector stores the ID of the nearest codebook entry for each subvector.
This turns a high-dimensional floating-point vector into a much shorter code. For example, if pq_bits = 8 and pq_dim = 16, each vector is represented by 16 one-byte codes.
PQ can also be combined with vector quantization. With use_vq=True, cuVS first assigns each vector to a coarse VQ centroid, then trains PQ on the residuals. This can improve reconstruction quality when the dataset has strong global structure, at the cost of extra labels and VQ codebook memory.
When to use Product Quantization
Use PQ when vectors are too large to store or move efficiently in full precision, when approximate reconstruction is acceptable, or when a downstream workflow can use compact codes instead of original vectors.
PQ is especially useful for large vector search systems because it reduces memory traffic. It can also be useful for compression before storage, coarse candidate scoring, or as a building block inside ANN indexes such as IVF-PQ.
Avoid PQ when exact vector values are required. PQ is lossy, so increasing compression usually reduces reconstruction quality and can reduce recall in downstream search workflows.
Configuration parameters
Build parameters
Tuning
Start with pq_bits = 8. Lower values reduce memory but make each code less precise. Higher values can improve reconstruction quality but increase code size and codebook memory.
Choose pq_dim based on the desired code size and the input dimension. More PQ dimensions usually improve reconstruction quality because each code represents a smaller subvector, but the encoded row becomes longer.
Keep use_subspaces=True for most workloads. Separate subspace codebooks usually improve quality because each part of the vector gets its own codebook.
Enable use_vq when a single global PQ codebook is not accurate enough and the data has meaningful coarse groups. VQ adds a coarse label and VQ codebook, so it improves quality at the cost of more metadata and training work.
Increase max_train_points_per_pq_code or kmeans_n_iters when reconstruction quality is poor and build time is acceptable. Decrease them when build time is the main constraint.
Use kmeans_balanced for PQ codebook training unless there is a specific reason to prefer regular K-Means.
Memory footprint
PQ memory is dominated by the encoded output, PQ codebooks, optional VQ labels, and temporary training buffers. The estimates below use bytes and are meant for planning.
Variables:
N: Number of vectors.D: Number of input features.P: PQ dimension, or number of PQ code dimensions.b: Bits per PQ code, orpq_bits.C: Number of entries in each PQ codebook, equal to2^b.S: Subvector width, approximatelyD / P.L: Number of VQ centers, orvq_n_centers.B_x: Bytes per input element.B_c: Bytes per codebook element.B_l: Bytes per VQ label.
Scratch and maximum rows
The scratch term covers K-Means workspace for codebook training, residual buffers, transform buffers, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.20 for build and H = 0.10 for transform-only planning. If you can measure a representative run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
Most PQ storage terms are linear in N. Rewrite the selected peak as:
and solve:
Training buffers use min(...) terms. If the cap is active, treat the capped training term as fixed; otherwise include it in B_per_row.
Encoded vectors
Each row stores P codes with b bits per code:
This is the main persistent output of PQ.
Codebooks
With separate subspace codebooks, the PQ codebook size is:
When use_subspaces=False, a shared codebook is used:
If VQ is enabled, add the coarse VQ codebook:
VQ labels
When use_vq=True, transform also stores one VQ label per row:
Training peak
PQ training samples up to C * max_train_points_per_pq_code rows for each PQ codebook. A rough training buffer estimate per codebook is:
If VQ is enabled, VQ training can additionally use:
The overall build peak is approximately:
For large datasets, reduce max_train_points_per_pq_code, reduce max_train_points_per_vq_cluster, or disable VQ to reduce training memory and build time.