K-selection
K-selection
K-selection selects the k smallest or largest values from each row of an input matrix. It also returns payload indices for the selected values, so it can turn a score or distance matrix into top-k results.
Use K-selection when you already have per-row scores, distances, or candidate values and need to keep only the best k entries per row.
Example API Usage
Selecting smallest values
For distance matrices, nearest neighbors are usually the smallest values in each row.
C++
Selecting largest values
For similarity scores, best matches are often the largest values.
C++
Variable row lengths
Use len_i when each row has a different valid length, such as padded candidate lists.
C++
How K-selection works
K-selection processes each row independently. For each row, it keeps the best k values according to select_min. If select_min=true, lower values are better. If select_min=false, higher values are better.
The output values and output indices have shape batch_size x k. When in_idx is omitted, the selected payload is the column index from the input row. When in_idx is provided, the output payload comes from the corresponding entry in in_idx.
When to use K-selection
Use K-selection after computing scores or distances when the full matrix is too large or noisy to keep. Common examples include exact nearest-neighbor search, reranking candidate lists, selecting top scoring documents, pruning graph candidates, and keeping top-k model outputs.
K-selection is also useful inside multi-stage workflows. A first stage can produce many candidates, then K-selection keeps a smaller set for refinement, reranking, or downstream processing.
Avoid using K-selection as a substitute for indexing when the expensive part is computing all distances. K-selection reduces output size, but it does not avoid the cost of generating the input values.
Tuning
Start with SelectAlgo::kAuto. The automatic choice is the safest default across different k, row lengths, and data types.
Set sorted=false when result ordering is not needed. Sorting selected values adds work, and many downstream stages only need the selected set.
Use the smallest k that satisfies the downstream workflow. Selection cost and output memory both grow with k.
Use len_i instead of sentinel values when rows have different valid lengths. This avoids selecting padded values and can reduce unnecessary work.
Memory footprint
K-selection memory is dominated by the input matrix and the selected output matrices.
Variables:
B: Batch size, or number of rows.L: Number of input values per row.K: Number of selected values per row.B_v: Bytes per value.B_i: Bytes per payload index.
Scratch and maximum rows
The scratch term covers temporary selection buffers, optional sorting workspace, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.10 for unsorted selection and H = 0.15 when sorted=true. If you can measure a representative run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
K-selection scales with batch rows B, not database vectors directly. For fixed row length L and selected count K, rewrite the peak as:
and solve:
For nearest-neighbor workflows, this B_max is the maximum number of query rows that can be selected from one materialized candidate matrix.
Inputs and outputs
The input values use:
If in_idx is provided, input payloads use:
The output matrices use:
The peak memory is approximately:
For exact nearest-neighbor workflows, avoid materializing both a large distance matrix and large selected outputs at the same time when tiled distance computation can select and merge candidates incrementally.