ScaNN
ScaNN is an experimental cuVS index builder for the open-source ScaNN format. It combines partitioning, residual product quantization, SOAR spilling, and optional bfloat16 reordering data. Think of it as a pipeline that first sorts vectors into buckets, then stores compact shortcuts for the vectors in each bucket, and finally writes those pieces so OSS ScaNN can search them.
The cuVS ScaNN API currently builds and serializes indexes from C++. It does not expose a cuVS search API.
Example API Usage
Building an index
C++
The build API also accepts host row-major float data. C, Python, Java, Rust, and Go do not currently expose ScaNN bindings.
Serializing an index
C++
Serialization writes the files needed by OSS ScaNN, including partition centers, datapoint labels, PQ codebooks, quantized residuals, SOAR residuals, and optional bfloat16 reordering data.
Searching an index
cuVS does not currently provide a ScaNN search API or ScaNN search parameters. To search a ScaNN index, serialize it with cuVS and load the generated files with OSS ScaNN.
Loading a serialized index
cuVS does not currently expose a ScaNN deserialization API. The serialized directory is intended to be loaded by OSS ScaNN or another consumer that understands the OSS ScaNN file layout.
How ScaNN works
ScaNN first partitions the dataset into leaves, each represented by a partition center. A query only needs to consider the most promising leaves instead of the full dataset.
Next, ScaNN stores residual product quantization codes. A residual is the leftover difference between a vector and its assigned partition center. Product quantization compresses those residuals into compact codes.
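The two steps above can be sketched in a few lines of pure Python. This is a toy illustration of residual product quantization in general, not the cuVS implementation: the centers and codebooks are made-up values rather than trained ones, and the helper names are hypothetical.

```python
# Illustrative only: tiny pure-Python residual product quantization.
# The centers and codebooks below are made-up toy values, not trained ones.

def sq_dist(a, b):
    """Squared L2 distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, candidates):
    """Index of the candidate closest to vec in squared L2 distance."""
    return min(range(len(candidates)), key=lambda i: sq_dist(vec, candidates[i]))

def residual_pq_encode(vec, centers, codebooks, pq_dim):
    """Return (leaf, codes) for one vector.

    centers:   list of partition centers, each of length D.
    codebooks: one list of codewords per subspace, each codeword of length pq_dim.
    """
    leaf = nearest(vec, centers)
    # The residual is what is left after subtracting the assigned center.
    residual = [v - c for v, c in zip(vec, centers[leaf])]
    codes = []
    for s, codebook in enumerate(codebooks):
        chunk = residual[s * pq_dim:(s + 1) * pq_dim]
        codes.append(nearest(chunk, codebook))
    return leaf, codes

centers = [[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]]
codebooks = [
    [[-1.0, -1.0], [1.0, 1.0]],   # codewords for subspace 0
    [[-1.0, 1.0], [1.0, -1.0]],   # codewords for subspace 1
]
leaf, codes = residual_pq_encode([10.9, 11.2, 9.1, 10.8], centers, codebooks, pq_dim=2)
print(leaf, codes)
```

Here pq_dim is the subspace dimension, so a 4-dimensional vector with pq_dim = 2 yields two codes.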
ScaNN also computes SOAR labels. SOAR gives each vector a second leaf assignment that can help recover good candidates that would otherwise be missed near partition boundaries.
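As a rough sketch of how a second assignment might be chosen, the function below follows the published SOAR idea of penalizing secondary residuals that are parallel to the primary residual. The loss form used here, the squared norm of the secondary residual plus soar_lambda times the squared length of its projection onto the primary residual, is an assumption based on that formulation. The cuVS implementation may differ, and the function name is hypothetical.

```python
def soar_secondary(vec, centers, primary, soar_lambda):
    """Pick a secondary leaf for vec, given its primary leaf index.

    Assumed loss per candidate: ||r2||^2 + soar_lambda * (<r, r2> / ||r||)^2,
    where r is the primary residual and r2 is the candidate residual.
    """
    r = [v - c for v, c in zip(vec, centers[primary])]
    r_norm_sq = sum(x * x for x in r)
    best, best_loss = None, float("inf")
    for j, center in enumerate(centers):
        if j == primary:
            continue
        r2 = [v - c for v, c in zip(vec, center)]
        dot = sum(a * b for a, b in zip(r, r2))
        # Squared length of the projection of r2 onto r (zero if r is zero).
        parallel_sq = dot * dot / r_norm_sq if r_norm_sq > 0 else 0.0
        loss = sum(x * x for x in r2) + soar_lambda * parallel_sq
        if loss < best_loss:
            best, best_loss = j, loss
    return best

toy_centers = [[1.0, 0.0], [2.0, 0.0], [0.0, 2.2]]
print(soar_secondary([0.0, 0.0], toy_centers, primary=0, soar_lambda=0.0))
print(soar_secondary([0.0, 0.0], toy_centers, primary=0, soar_lambda=1.0))
```

With soar_lambda = 0 the sketch picks the plain second-nearest center. A positive soar_lambda favors a center whose residual points in a different direction from the primary residual, even if that center is slightly farther away.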
If reordering_bf16 is enabled, cuVS also stores a bfloat16 copy of the dataset. OSS ScaNN can use that copy to rerank candidates with more accurate distances after the quantized first stage.
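To give a feel for the precision that a bfloat16 copy keeps, the sketch below round-trips a float32 value through bfloat16 by keeping only its upper 16 bits. Truncation is used here for simplicity and is an assumption. The rounding mode cuVS actually uses is not stated in this document.

```python
import struct

def bf16_round_trip(value):
    """Convert value to float32, keep the top 16 bits, and convert back.

    bfloat16 shares the sign and 8 exponent bits of float32 and keeps 7
    mantissa bits, so dropping the low 16 bits yields a valid bfloat16.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bf16_bits = bits >> 16            # truncation, not round-to-nearest
    return struct.unpack("<f", struct.pack("<I", bf16_bits << 16))[0]

print(bf16_round_trip(3.14159))
print(bf16_round_trip(1.0))
```

Each stored value takes 2 bytes instead of 4, which is why the reordering copy costs about half the memory of the fp32 dataset.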
When to use ScaNN
Use ScaNN when you want cuVS to build an OSS ScaNN-compatible index from C++ and you are comfortable with an experimental API.
Use ScaNN when partitioning plus quantization is a good fit for the dataset and you plan to search with OSS ScaNN.
Use IVF-Flat, IVF-PQ, CAGRA, brute-force, or Vamana instead when you need a cuVS search API, multi-language bindings, or a non-experimental API surface.
Interoperability with OSS ScaNN
The ScaNN serializer writes a directory of files that OSS ScaNN can consume:
cuvs_metadata.bin
centers.npy
datapoint_to_token.npy
pq_codebook.npy
hashed_dataset.npy
hashed_dataset_soar.npy
bf16_dataset.npy, when reordering_bf16 is enabled
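Before handing a serialized directory to OSS ScaNN, it can be useful to confirm that every expected file is present. The helper below is a minimal sketch that only checks file names taken from the list above. It does not validate file contents, and the function name is hypothetical.

```python
from pathlib import Path

# File names as listed in this section.
REQUIRED_FILES = [
    "cuvs_metadata.bin",
    "centers.npy",
    "datapoint_to_token.npy",
    "pq_codebook.npy",
    "hashed_dataset.npy",
    "hashed_dataset_soar.npy",
]
OPTIONAL_BF16_FILE = "bf16_dataset.npy"

def missing_scann_files(index_dir, reordering_bf16=False):
    """Return the expected file names that are not present in index_dir."""
    expected = list(REQUIRED_FILES)
    if reordering_bf16:
        expected.append(OPTIONAL_BF16_FILE)
    root = Path(index_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty return value means all expected files exist. Pass reordering_bf16=True only if the index was built with that option enabled.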
The implementation is experimental. Accuracy and performance are not currently guaranteed to match OSS ScaNN across releases.
Using Filters
cuVS ScaNN does not expose a search API, so it does not expose cuVS filtering controls. Apply filtering in the OSS ScaNN search layer after loading the serialized index.
Configuration parameters
Build parameters
Tuning
Start with the defaults, then tune one part of the pipeline at a time.
Increase n_leaves when partitions are too large. Smaller partitions can reduce search work in OSS ScaNN, but too many leaves can make partition selection harder and increase metadata.
Tune pq_dim and pq_bits together. Smaller codes reduce memory, but can lower recall unless reranking has enough good candidates.
Use soar_lambda when recall suffers near partition boundaries. It controls the extra SOAR assignment that helps recover vectors that sit between leaves.
Enable reordering_bf16 when final recall needs help and the extra host memory is acceptable.
Memory footprint
ScaNN memory has three main parts: partition metadata, residual PQ codes, and optional reranking data. During build, cuVS also uses temporary training and batch workspaces. These estimates are derived from the current C++ storage layout and are intended for planning, not as exact allocator accounting.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors.
D: Vector dimension.
B: Bytes per input vector value. Use 4 for fp32.
L: Number of partition leaves, or n_leaves.
P: PQ subspace dimension, or pq_dim.
S: Number of PQ subspaces, where S = D / P.
b: Bits per PQ code, or pq_bits.
C: Number of PQ clusters per subspace, where C = 2^b.
T_k: K-means training rows, or kmeans_n_rows_train.
T_p: PQ training rows, or min(pq_n_rows_train, 100000).
Q_b: Build batch size, currently min(N, 65536).
R: 1 when reordering_bf16 is enabled, otherwise 0.
S_idx: Bytes per stored label, currently sizeof(uint32_t).
The named terms in the formulas are also memory sizes:
centers_size: Device memory for partition centers.
labels_size: Device memory for normal and SOAR leaf labels.
pq_codebook_size: Device memory for residual PQ codebooks.
residual_codes_size: Host memory for normal and SOAR residual PQ codes.
bf16_dataset_size: Optional host memory for bfloat16 reranking data.
*_peak: Temporary peak memory for one build phase. Sequential phases are not added together.
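The derived symbols S, C, T_p, and Q_b follow directly from the build parameters. The helper below is a small sketch that computes them from the definitions above. The function name and the choice to reject a dimension that pq_dim does not divide evenly are assumptions made for illustration.

```python
def derived_symbols(N, D, pq_dim, pq_bits, pq_n_rows_train):
    """Compute S, C, T_p, and Q_b from the symbol definitions in this section."""
    if D % pq_dim != 0:
        # Assumption for this sketch: S = D / P is expected to be an integer.
        raise ValueError("D must be divisible by pq_dim")
    return {
        "S": D // pq_dim,                      # number of PQ subspaces
        "C": 2 ** pq_bits,                     # PQ clusters per subspace
        "T_p": min(pq_n_rows_train, 100000),   # capped PQ training rows
        "Q_b": min(N, 65536),                  # build batch size
    }

print(derived_symbols(N=1_000_000, D=128, pq_dim=8, pq_bits=8, pq_n_rows_train=200_000))
```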
Scratch and maximum vectors
The formulas below include the largest visible build phases, but additional scratch can come from k-means, PQ training, SOAR workspace, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.25 for ScaNN build planning. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
observed_peak: Peak memory observed during a smaller representative run.
formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.
For fixed D, L, P, pq_bits, and batch settings, most ScaNN storage terms are linear in N, while capped training terms can become fixed. Solve the full build formula or rewrite the dominant phase as:
and estimate:
Check device and host memory separately because ScaNN keeps residual codes and optional bfloat16 reranking data on host.
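The capacity estimate described in this subsection can be sketched as follows. The exact expressions are assumptions inferred from the variable definitions: M_usable is taken to be (M_free - M_other) * (1 - H), and the dominant phase is modeled as B_fixed + N * B_per_vector. Treat the result as a planning aid, not as the formula cuVS uses.

```python
import math

def usable_memory(m_free, m_other, headroom=0.25):
    """Assumed form of M_usable: subtract M_other, then reserve a fraction H."""
    return (m_free - m_other) * (1.0 - headroom)

def max_vectors(m_usable, b_fixed, b_per_vector):
    """Largest N with b_fixed + N * b_per_vector <= m_usable (never negative)."""
    if b_per_vector <= 0:
        raise ValueError("b_per_vector must be positive")
    return max(0, math.floor((m_usable - b_fixed) / b_per_vector))

# Example: 16 GiB of free host memory, nothing else reserved, and 32 bytes
# of host residual codes per vector (an illustrative per-vector cost).
budget = usable_memory(16 * 2**30, 0)
print(max_vectors(budget, b_fixed=0, b_per_vector=32))
```

Run the estimate once with device quantities and once with host quantities, then take the smaller N_max.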
Baseline memory after build
The baseline device memory kept by the ScaNN index is:
The baseline host memory kept by the ScaNN index is:
The total index footprint is approximately:
Example (N = 1e6, D = 128, L = 1000, pq_dim = 8, pq_bits = 8, reordering_bf16 = false):
centers_size = 512000 B = 0.49 MiB
labels_size = 8000000 B = 7.63 MiB
pq_codebook_size = 131072 B = 0.13 MiB
residual_codes_size = 32000000 B = 30.52 MiB
index_size = 40643072 B = 38.76 MiB
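The numbers in this example can be reproduced with the sketch below. The per-term expressions are assumptions chosen to be consistent with the example values and the symbol definitions: fp32 centers and codebooks, two uint32 labels per vector, and one byte per PQ code for both the normal and SOAR residual codes. They are not copied from the cuVS source.

```python
FLOAT_BYTES = 4   # assumption: centers and codebooks are stored as fp32
LABEL_BYTES = 4   # S_idx = sizeof(uint32_t)
BF16_BYTES = 2    # bytes per bfloat16 value

def baseline_sizes(N, D, L, pq_dim, pq_bits, reordering_bf16=False):
    """Estimate the baseline index footprint in bytes, term by term."""
    S = D // pq_dim        # number of PQ subspaces
    C = 2 ** pq_bits       # PQ clusters per subspace
    sizes = {
        "centers_size": L * D * FLOAT_BYTES,
        "labels_size": 2 * N * LABEL_BYTES,        # normal + SOAR labels
        "pq_codebook_size": C * D * FLOAT_BYTES,   # C * P * S floats
        "residual_codes_size": 2 * N * S,          # assumes 1 byte per code
        "bf16_dataset_size": N * D * BF16_BYTES if reordering_bf16 else 0,
    }
    sizes["index_size"] = sum(sizes.values())
    return sizes

sizes = baseline_sizes(N=1_000_000, D=128, L=1000, pq_dim=8, pq_bits=8)
for name, value in sizes.items():
    print(f"{name} = {value} B = {value / (1024 * 1024):.2f} MiB")
```

Under these assumptions, the first three terms live in device memory, while residual_codes_size and bf16_dataset_size live in host memory.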
Build peak memory usage
ScaNN build runs in phases, so the temporary allocations below are not all held at once. The largest active phase usually dominates the extra build memory.
K-means training samples rows into device memory:
PQ training samples residuals into device memory:
Batch quantization stores normal residuals, SOAR residuals, and packed PQ codes for one build batch. Packed code width is ceil(S * b / 8) bytes per vector.
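The packed code width ceil(S * b / 8) can be computed with integer arithmetic, which avoids floating-point rounding. The function name below is hypothetical.

```python
def packed_code_bytes(S, pq_bits):
    """Bytes per vector for S codes of pq_bits each: ceil(S * pq_bits / 8)."""
    return (S * pq_bits + 7) // 8

print(packed_code_bytes(16, 8))   # 16 subspaces at 8 bits each
print(packed_code_bytes(16, 4))   # 16 subspaces at 4 bits each
```

With S = 16, 8-bit codes pack into 16 bytes per vector and 4-bit codes into 8 bytes per vector.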
When pq_bits = 4, cuVS also unpacks codes before copying them to the host index:
SOAR label computation uses a score matrix between one batch and all leaves. This is often the largest temporary in the quantization phase:
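A rough size for this score matrix is one score per (batch row, leaf) pair. The sketch below assumes fp32 scores, which is not stated in this document, and uses the batch size Q_b = min(N, 65536) from the symbol list.

```python
def soar_score_matrix_bytes(N, L, score_bytes=4, max_batch=65536):
    """Estimate the batch-by-leaves score matrix size in bytes.

    score_bytes=4 assumes fp32 scores; adjust it if the element type differs.
    """
    Q_b = min(N, max_batch)
    return Q_b * L * score_bytes

size = soar_score_matrix_bytes(N=1_000_000, L=1000)
print(size, "B =", size / (1024 * 1024), "MiB")
```

For N = 1e6 and L = 1000 this comes to 262144000 B, or 250 MiB, which is far larger than the baseline device index in the earlier example. Because the matrix grows linearly with n_leaves, raising n_leaves raises this temporary proportionally.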
If bfloat16 reordering is enabled, one device batch is quantized before being copied to host:
The overall build peak can be estimated as the dataset, the baseline device index, and the largest temporary phase:
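Because the phases run sequentially, only the largest temporary is added to the resident terms. The sketch below expresses that combination directly. The phase names and byte values in the example are placeholders for illustration, not measured values.

```python
def build_peak(dataset_size, baseline_device_size, phase_peaks):
    """Dataset + baseline device index + the largest single temporary phase.

    phase_peaks maps a phase name to its temporary bytes. Phases are
    sequential, so they are combined with max() rather than summed.
    """
    largest_phase = max(phase_peaks.values(), default=0)
    return dataset_size + baseline_device_size + largest_phase

# Placeholder phase sizes, in bytes, for illustration only.
example_phases = {
    "kmeans_training": 50_000_000,
    "pq_training": 51_200_000,
    "soar_scores": 262_144_000,
}
print(build_peak(512_000_000, 8_643_072, example_phases))
```

In this illustration the SOAR score matrix is the dominant phase, so the other two temporaries do not affect the estimate.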
Serialization peak memory usage
Serialization writes host and device arrays to disk. It also creates a temporary device vector that combines normal labels and SOAR labels:
Search memory usage
cuVS does not currently search ScaNN indexes. Search memory depends on the OSS ScaNN configuration used after serialization.