Scalar Quantizer

View as Markdown

Scalar quantization compresses floating-point vectors by mapping each value into a smaller integer type. In cuVS, the scalar quantizer learns a global floating-point range and then linearly maps values in that range to int8.

Use scalar quantization when full-precision vectors are too expensive to store or move, but you still want a simple element-wise approximation that can be reconstructed later. It is often useful before storage, approximate scoring, or ANN workflows that can tolerate quantization error.

Example API Usage

C API | C++ API | Python API

Training a quantizer

Training finds the value range used for the linear mapping. The quantile parameter can ignore extreme high and low values so that a small number of outliers do not stretch the range for the entire dataset.

1#include <cuvs/core/c_api.h>
2#include <cuvs/preprocessing/quantize/scalar.h>
3
4cuvsResources_t res;
5cuvsScalarQuantizerParams_t params;
6cuvsScalarQuantizer_t quantizer;
7DLManagedTensor *dataset;
8
9load_dataset(dataset);
10
11cuvsResourcesCreate(&res);
12cuvsScalarQuantizerParamsCreate(&params);
13cuvsScalarQuantizerCreate(&quantizer);
14
15params->quantile = 0.99f;
16
17cuvsScalarQuantizerTrain(res, params, dataset, quantizer);
18
19cuvsScalarQuantizerDestroy(quantizer);
20cuvsScalarQuantizerParamsDestroy(params);
21cuvsResourcesDestroy(res);

Transforming data

Transforming replaces each floating-point value with an int8 code. Inverse transform maps those codes back into approximate floating-point values.

1DLManagedTensor *codes;
2DLManagedTensor *reconstructed;
3
4allocate_int8_matrix(codes, n_rows, n_features);
5allocate_float_matrix(reconstructed, n_rows, n_features);
6
7cuvsScalarQuantizerTransform(res, quantizer, dataset, codes);
8cuvsScalarQuantizerInverseTransform(res, quantizer, codes, reconstructed);

How Scalar Quantization works

The quantizer stores a minimum and maximum value. Each input value is clipped to that range and mapped into the available int8 codes. Inverse transform applies the reverse mapping, so reconstructed values are approximate rather than exact.

The range is learned from the training data. With quantile = 1.0, the full observed range is used. With a smaller quantile, cuVS ignores a small fraction of values at the high and low ends before choosing the range.

When to use Scalar Quantization

Use scalar quantization when memory bandwidth or storage size is more important than exact reconstruction. It is a good fit when every dimension has roughly comparable scale, or when the downstream algorithm is robust to small per-value errors.

Avoid scalar quantization when exact values are required, when very small numeric differences matter, or when a few dimensions have very different ranges and need more specialized handling.

Configuration parameters

Train parameters

ParameterDefaultDescription
quantile0.99Fraction of the distribution used to choose the quantization range. Lower values reduce the influence of outliers but can clip more values. Must be in (0, 1].

Tuning

Start with quantile = 0.99. Increase it toward 1.0 when clipping hurts reconstruction quality. Decrease it when a few outliers stretch the range and make most values less precise.

Check downstream recall or reconstruction error rather than only looking at compression ratio. Scalar quantization always stores one byte per value, so the main tradeoff is how well the learned range matches the data.

Memory footprint

Scalar quantization memory is dominated by the input matrix and the int8 output matrix.

Variables:

  • N: Number of vectors.
  • D: Number of features per vector.
  • B_x: Bytes per input element.
  • B_q: Bytes per quantized output element. For int8, B_q = 1.
  • B_c: Bytes per stored quantizer value.

Scratch and maximum rows

The scratch term covers temporary transform buffers, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Scalar quantization is mostly a streaming transform, so use H = 0.10 as a first scratch headroom estimate. If you can measure a representative run, use:

Hmeasured=observed_peakformula_without_scratchformula_without_scratchH_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

Musable=(MfreeMother)(1H)M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

  • M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
  • M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
  • H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
  • M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
  • observed_peak: Peak memory observed during a smaller representative run.
  • formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
  • peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
  • B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
  • B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
  • N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

The transform formulas are linear in N. Rewrite the selected peak as:

peak_without_scratch(N)=NBper_row+Bfixed\text{peak\_without\_scratch}(N) = N \cdot B_{\text{per\_row}} + B_{\text{fixed}}

and solve:

Nmax=MusableBfixedBper_rowN_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_row}}} \right\rfloor

Persistent data

The quantized dataset stores one byte per input value:

codes_size=NDBq\text{codes\_size} = N \cdot D \cdot B_q

The scalar quantizer stores the learned range:

quantizer_size2Bc\text{quantizer\_size} \approx 2 \cdot B_c

Transform peak

The transform peak is approximately:

transform_peak NDBx+codes_size+quantizer_size+scratch\begin{aligned} \text{transform\_peak} \approx&\ N \cdot D \cdot B_x + \text{codes\_size} \\ &+ \text{quantizer\_size} + \text{scratch} \end{aligned}

Inverse transform replaces codes_size with the reconstructed floating-point output:

reconstructed_size=NDBx\text{reconstructed\_size} = N \cdot D \cdot B_x