Binary Quantizer

View as Markdown

Binary quantization compresses each vector into a packed bit vector. Each input value becomes one bit, and each bit records whether the value is greater than a threshold.

Use binary quantization when you want very compact vectors for workflows based on bitwise distances such as Hamming distance. It is a strong compression step, so it trades away most magnitude information for much smaller storage and faster bitwise comparisons.

Example API Usage

C API | C++ API | Python API

Training a quantizer

Training is needed when thresholds are learned from data, such as per-dimension means or sampled medians. The Python API currently exposes the direct zero-threshold transform path.

1#include <cuvs/core/c_api.h>
2#include <cuvs/preprocessing/quantize/binary.h>
3
4cuvsResources_t res;
5cuvsBinaryQuantizerParams_t params;
6cuvsBinaryQuantizer_t quantizer;
7DLManagedTensor *dataset;
8
9load_dataset(dataset);
10
11cuvsResourcesCreate(&res);
12cuvsBinaryQuantizerParamsCreate(&params);
13cuvsBinaryQuantizerCreate(&quantizer);
14
15params->threshold = MEAN;
16params->sampling_ratio = 0.1f;
17
18cuvsBinaryQuantizerTrain(res, params, dataset, quantizer);
19
20cuvsBinaryQuantizerDestroy(quantizer);
21cuvsBinaryQuantizerParamsDestroy(params);
22cuvsResourcesDestroy(res);

Transforming data

Transforming packs every group of eight input dimensions into one byte. In C and C++, use the trained quantizer when using learned thresholds. Python uses a zero threshold and sets a bit when the corresponding input value is positive.

1DLManagedTensor *codes;
2uint64_t packed_dim = (n_features + 7) / 8;
3
4allocate_uint8_matrix(codes, n_rows, packed_dim);
5
6cuvsBinaryQuantizerTransformWithParams(res, quantizer, dataset, codes);

How Binary Quantization works

Binary quantization compares each value with a threshold. If the value is greater than the threshold, the output bit is set to 1; otherwise it is set to 0.

The threshold can be fixed at zero, learned as a per-dimension mean, or learned as a sampled per-dimension median. The output is packed into bytes, so a vector with D features needs ceil(D / 8) bytes.

When to use Binary Quantization

Use binary quantization when compact storage and bitwise comparison speed matter more than preserving floating-point magnitude. It is especially useful for binary vector search with bitwise Hamming distance.

Avoid binary quantization when vector magnitude or fine-grained distance information is important. Because each value becomes only one bit, this is the most aggressive quantizer in the preprocessing guide.

Configuration parameters

Train parameters

ParameterDefaultDescription
thresholdmeanThreshold strategy. zero uses zero, mean learns a per-dimension mean, and sampling_median estimates a per-dimension median from samples.
sampling_ratio0.1Fraction of rows sampled when using the sampled median threshold. Higher values can improve the median estimate but increase training work.

Tuning

Use zero when the data is already centered or when you want the fastest path. Use mean when dimensions have different offsets. Use sampling_median when outliers skew the mean and a more robust threshold is needed.

Increase sampling_ratio only when the sampled median is unstable. It does not affect the compressed size, only threshold quality and training cost.

Memory footprint

Binary quantization is usually dominated by the input matrix and the packed output matrix.

Variables:

  • N: Number of vectors.
  • D: Number of features per vector.
  • B_x: Bytes per input element.
  • B_t: Bytes per threshold value.

Scratch and maximum rows

The scratch term covers temporary packing buffers, threshold-training buffers, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.10 for direct transform and H = 0.20 when training learned thresholds. If you can measure a representative run, use:

Hmeasured=observed_peakformula_without_scratchformula_without_scratchH_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

Musable=(MfreeMother)(1H)M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

  • M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
  • M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
  • H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
  • M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
  • observed_peak: Peak memory observed during a smaller representative run.
  • formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
  • peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
  • B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
  • B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
  • N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

The packed output is still linear in N. Rewrite the selected peak as:

peak_without_scratch(N)=NBper_row+Bfixed\text{peak\_without\_scratch}(N) = N \cdot B_{\text{per\_row}} + B_{\text{fixed}}

and solve:

Nmax=MusableBfixedBper_rowN_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_row}}} \right\rfloor

Packed vectors

Each row stores one bit per input feature:

packed_dim=D8codes_size=Npacked_dim\begin{aligned} \text{packed\_dim} &= \left\lceil \frac{D}{8} \right\rceil \\ \text{codes\_size} &= N \cdot \text{packed\_dim} \end{aligned}

Thresholds

Learned thresholds store one threshold per input dimension:

threshold_sizeDBt\text{threshold\_size} \approx D \cdot B_t

The zero-threshold path does not need a learned threshold vector.

Transform peak

The transform peak is approximately:

transform_peak NDBx+codes_size+threshold_size+scratch\begin{aligned} \text{transform\_peak} \approx&\ N \cdot D \cdot B_x + \text{codes\_size} \\ &+ \text{threshold\_size} + \text{scratch} \end{aligned}