K-selection

View as Markdown

K-selection selects the k smallest or largest values from each row of an input matrix. It also returns payload indices for the selected values, so it can turn a score or distance matrix into top-k results.

Use K-selection when you already have per-row scores, distances, or candidate values and need to keep only the best k entries per row.

Example API Usage

C++ API

Selecting smallest values

For distance matrices, nearest neighbors are usually the smallest values in each row.

1#include <cuvs/selection/select_k.hpp>
2
3#include <raft/core/device_mdarray.hpp>
4#include <raft/core/resources.hpp>
5
6raft::device_resources res;
7raft::device_matrix_view<const float, int64_t, raft::row_major> distances =
8 load_distance_matrix();
9int64_t k = 10;
10
11auto out_values = raft::make_device_matrix<float, int64_t>(
12 res, distances.extent(0), k);
13auto out_indices = raft::make_device_matrix<int64_t, int64_t>(
14 res, distances.extent(0), k);
15
16cuvs::selection::select_k(res,
17 distances,
18 std::nullopt,
19 out_values.view(),
20 out_indices.view(),
21 true,
22 true);

Selecting largest values

For similarity scores, best matches are often the largest values.

1auto out_scores = raft::make_device_matrix<float, int64_t>(
2 res, scores.extent(0), k);
3auto out_ids = raft::make_device_matrix<uint32_t, int64_t>(
4 res, scores.extent(0), k);
5
6cuvs::selection::select_k(res,
7 scores,
8 candidate_ids,
9 out_scores.view(),
10 out_ids.view(),
11 false,
12 true);

Variable row lengths

Use len_i when each row has a different valid length, such as padded candidate lists.

1std::optional<raft::device_vector_view<const int64_t, int64_t>> row_lengths =
2 valid_lengths.view();
3
4cuvs::selection::select_k(res,
5 candidates,
6 candidate_ids,
7 out_values.view(),
8 out_indices.view(),
9 true,
10 false,
11 cuvs::selection::SelectAlgo::kAuto,
12 row_lengths);

How K-selection works

K-selection processes each row independently. For each row, it keeps the best k values according to select_min. If select_min=true, lower values are better. If select_min=false, higher values are better.

The output values and output indices have shape batch_size x k. When in_idx is omitted, the selected payload is the column index from the input row. When in_idx is provided, the output payload comes from the corresponding entry in in_idx.

When to use K-selection

Use K-selection after computing scores or distances when the full matrix is too large or noisy to keep. Common examples include exact nearest-neighbor search, reranking candidate lists, selecting top scoring documents, pruning graph candidates, and keeping top-k model outputs.

K-selection is also useful inside multi-stage workflows. A first stage can produce many candidates, then K-selection keeps a smaller set for refinement, reranking, or downstream processing.

Avoid using K-selection as a substitute for indexing when the expensive part is computing all distances. K-selection reduces output size, but it does not avoid the cost of generating the input values.

Tuning

Start with SelectAlgo::kAuto. The automatic choice is the safest default across different k, row lengths, and data types.

Set sorted=false when result ordering is not needed. Sorting selected values adds work, and many downstream stages only need the selected set.

Use the smallest k that satisfies the downstream workflow. Selection cost and output memory both grow with k.

Use len_i instead of sentinel values when rows have different valid lengths. This avoids selecting padded values and can reduce unnecessary work.

Memory footprint

K-selection memory is dominated by the input matrix and the selected output matrices.

Variables:

  • B: Batch size, or number of rows.
  • L: Number of input values per row.
  • K: Number of selected values per row.
  • B_v: Bytes per value.
  • B_i: Bytes per payload index.

Scratch and maximum rows

The scratch term covers temporary selection buffers, optional sorting workspace, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.10 for unsorted selection and H = 0.15 when sorted=true. If you can measure a representative run, use:

Hmeasured=observed_peakformula_without_scratchformula_without_scratchH_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

Musable=(MfreeMother)(1H)M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

  • M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
  • M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
  • H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
  • M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
  • observed_peak: Peak memory observed during a smaller representative run.
  • formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
  • peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
  • B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
  • B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
  • N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

K-selection scales with batch rows B, not database vectors directly. For fixed row length L and selected count K, rewrite the peak as:

peak_without_scratch(B)=BBper_row+Bfixed\text{peak\_without\_scratch}(B) = B \cdot B_{\text{per\_row}} + B_{\text{fixed}}

and solve:

Bmax=MusableBfixedBper_rowB_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_row}}} \right\rfloor

For nearest-neighbor workflows, this B_max is the maximum number of query rows that can be selected from one materialized candidate matrix.

Inputs and outputs

The input values use:

input_values_size=BLBv\text{input\_values\_size} = B \cdot L \cdot B_v

If in_idx is provided, input payloads use:

input_payload_size=BLBi\text{input\_payload\_size} = B \cdot L \cdot B_i

The output matrices use:

output_size=BK(Bv+Bi)\text{output\_size} = B \cdot K \cdot (B_v + B_i)

The peak memory is approximately:

select_peak input_values_size+input_payload_size+output_size+scratch\begin{aligned} \text{select\_peak} \approx&\ \text{input\_values\_size} + \text{input\_payload\_size} \\ &+ \text{output\_size} + \text{scratch} \end{aligned}

For exact nearest-neighbor workflows, avoid materializing both a large distance matrix and large selected outputs at the same time when tiled distance computation can select and merge candidates incrementally.