K-Means

View as Markdown

K-Means is a GPU-accelerated clustering algorithm. It groups rows into n_clusters groups by learning one centroid for each group, then assigning every row to the closest centroid.

Use K-Means when you want to summarize a dataset with representative centers, assign vectors to coarse groups, build vector quantizers, or partition data before another algorithm. Unlike an ANN index, K-Means does not build a search structure for nearest-neighbor lookup. Its primary outputs are centroids, labels, inertia, and the number of training iterations.

Example API Usage

C API | C++ API | Python API | Rust API

Fitting clusters

Fitting learns the cluster centroids from a dataset. The input data can be on the device, and C, C++, and Python also support host-data paths that stream batches to the GPU.

1#include <cuvs/cluster/kmeans.h>
2#include <cuvs/core/c_api.h>
3
4cuvsResources_t res;
5cuvsKMeansParams_t params;
6DLManagedTensor *dataset;
7DLManagedTensor *centroids;
8double inertia;
9int n_iter;
10
11load_dataset(dataset);
12allocate_centroids(centroids);
13
14cuvsResourcesCreate(&res);
15cuvsKMeansParamsCreate(&params);
16
17params->n_clusters = 1024;
18params->max_iter = 300;
19params->tol = 1e-4;
20
21cuvsKMeansFit(res, params, dataset, NULL, centroids, &inertia, &n_iter);
22
23cuvsKMeansParamsDestroy(params);
24cuvsResourcesDestroy(res);

Assigning labels

Prediction assigns each row to the nearest learned centroid. Use it after fitting when you need a cluster label per row.

1#include <cuvs/cluster/kmeans.h>
2
3cuvsResources_t res;
4cuvsKMeansParams_t params;
5DLManagedTensor *dataset;
6DLManagedTensor *centroids;
7DLManagedTensor *labels;
8double inertia;
9
10load_dataset(dataset);
11load_centroids(centroids);
12allocate_labels(labels);
13
14cuvsResourcesCreate(&res);
15cuvsKMeansParamsCreate(&params);
16
17params->n_clusters = 1024;
18
19cuvsKMeansPredict(res, params, dataset, NULL, centroids, labels, true, &inertia);
20
21cuvsKMeansParamsDestroy(params);
22cuvsResourcesDestroy(res);

Evaluating centroids

The cluster cost, also called inertia, is the sum of squared distances from each row to its closest centroid. Use it to compare different runs or to check whether additional iterations meaningfully improve the clustering.

1double cost;
2cuvsKMeansClusterCost(res, dataset, centroids, &cost);

How K-Means works

K-Means alternates between two steps:

  1. Assign each row to the closest centroid.
  2. Move each centroid to the average of the rows assigned to it.

The algorithm repeats those steps until it reaches max_iter or the inertia changes by less than tol. The GPU is useful because the expensive part is repeatedly comparing many rows against many centroids.

When to use K-Means

Use K-Means when you need compact representatives for a dataset, cluster labels for downstream analysis, coarse partitions for batching or indexing, or vector quantization codebooks.

K-Means works best when clusters are roughly spherical under the selected distance metric. If the data has complex shapes, very uneven cluster sizes, or strong outliers, consider using K-Means as a fast preprocessing step rather than treating the labels as final ground truth.

Standard and balanced K-Means

Standard K-Means minimizes inertia. It can produce uneven cluster sizes if the data distribution is uneven.

Balanced K-Means encourages more even cluster sizes. It is useful when clusters will be used as partitions for later work and very large clusters would create load imbalance. In cuVS, balanced K-Means is exposed through balanced_params in C++ and through hierarchical=True in C and Python.

Configuration parameters

Fit parameters

ParameterDefaultDescription
metricL2Expanded / language defaultDistance metric used to compare rows and centroids. Standard K-Means commonly uses squared L2 distance.
n_clusters8Number of clusters and output centroids. Larger values create finer groups but increase work and centroid memory.
init / init_methodKMeansPlusPlusInitialization strategy. Use k-means++ for robust default seeding, random for faster but less stable seeding, or array initialization when providing centroids.
max_iter300Maximum number of training iterations for one run.
tol1e-4Relative inertia tolerance used for convergence.
n_init1Number of independent runs with different seeds. More runs can improve quality but multiply build time.
oversampling_factor2.0Oversampling factor used by k-means parallel initialization.
batch_samples32768Number of samples per tile for the nearest-centroid computation. Lower values reduce temporary memory.
batch_centroids0Number of centroids per tile. 0 means all centroids. Lower values reduce temporary memory.
init_size0Number of rows sampled for k-means++ initialization on host-data paths. 0 uses the default heuristic.
streaming_batch_size0Number of host rows streamed to the GPU per batch. 0 processes all host rows at once.
hierarchicalfalseEnables hierarchical, balanced K-Means in C and Python.
hierarchical_n_itersimplementation defaultNumber of training iterations for hierarchical K-Means.

Tuning

Start with n_clusters. More clusters reduce average within-cluster distance, but they increase memory and the work per iteration. If clusters become too small or unstable, reduce n_clusters or increase the amount of data used for fitting.

Use KMeansPlusPlus initialization for most workloads. Increase n_init when quality matters more than fit time, especially when different random seeds produce noticeably different inertia.

Tune max_iter and tol together. If n_iter often reaches max_iter, increase max_iter or relax tol. If inertia stops improving early, lowering max_iter can reduce runtime.

Use batch_samples and batch_centroids to control device memory for device-resident data. Smaller tiles reduce temporary memory but add more tiled work.

Use streaming_batch_size when fitting host-resident datasets that do not fit on the GPU. Smaller batches reduce GPU memory pressure, while larger batches reduce transfer and launch overhead.

Use balanced K-Means when cluster size matters. This is often useful when clusters are later used as work partitions, batches, or coarse groups for another algorithm.

Memory footprint

K-Means memory is dominated by the input data, the centroid matrix, optional labels or weights, and temporary tiles used to compare samples with centroids. The exact scratch space depends on the selected data type and implementation path, but the estimates below are useful for planning.

Variables:

  • N: Number of rows, or samples.
  • D: Number of features per row.
  • K: Number of clusters, or centroids.
  • S: Number of samples processed in one GPU batch.
  • T_s: Sample tile size, controlled by batch_samples.
  • T_c: Centroid tile size, controlled by batch_centroids; if batch_centroids = 0, use K.
  • B_x: Bytes per input element.
  • B_c: Bytes per centroid element.
  • B_l: Bytes per output label.

Scratch and maximum rows

The scratch term covers temporary buffers that are not part of the persistent inputs or outputs: reduction buffers, assignment buffers, allocator padding, CUDA library workspaces, and memory held by the active memory resource. For a first capacity estimate, reserve a scratch headroom factor H. Use H = 0.20 for K-Means fit and H = 0.10 for prediction. If you can measure a representative run, use:

Hmeasured=observed_peakformula_without_scratchformula_without_scratchH_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

Musable=(MfreeMother)(1H)M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

  • M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
  • M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
  • H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
  • M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
  • observed_peak: Peak memory observed during a smaller representative run.
  • formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
  • peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
  • B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
  • B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
  • N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

To estimate the largest usable row count, rewrite the selected peak formula as:

peak_without_scratch(N)=NBper_row+Bfixed\text{peak\_without\_scratch}(N) = N \cdot B_{\text{per\_row}} + B_{\text{fixed}}

and solve:

Nmax=MusableBfixedBper_rowN_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_row}}} \right\rfloor

For host streaming fit, solve for S first. S controls the active GPU batch size even when the full dataset has more than S rows.

Device-resident fit

When the dataset is already on the GPU, the persistent data and centroids use:

dataset_size=NDBxcentroids_size=KDBc\begin{aligned} \text{dataset\_size} &= N \cdot D \cdot B_x \\ \text{centroids\_size} &= K \cdot D \cdot B_c \end{aligned}

The nearest-centroid computation is tiled. A useful estimate for the main distance tile is:

distance_tile_sizemin(N,Ts)min(K,Tc)Bc\text{distance\_tile\_size} \approx \min(N, T_s) \cdot \min(K, T_c) \cdot B_c

The fit peak is approximately:

fit_peak dataset_size+centroids_size+distance_tile_size+scratch\begin{aligned} \text{fit\_peak} \approx&\ \text{dataset\_size} + \text{centroids\_size} \\ &+ \text{distance\_tile\_size} + \text{scratch} \end{aligned}

Host streaming fit

For host-resident data, cuVS can stream rows to the GPU in batches. If streaming_batch_size = 0, then S = N; otherwise S = streaming_batch_size.

streaming_batch_size=SDBxfit_peak streaming_batch_size+centroids_size+min(S,Ts)min(K,Tc)Bc+scratch\begin{aligned} \text{streaming\_batch\_size} &= S \cdot D \cdot B_x \\ \text{fit\_peak} \approx&\ \text{streaming\_batch\_size} + \text{centroids\_size} \\ &+ \min(S, T_s) \cdot \min(K, T_c) \cdot B_c + \text{scratch} \end{aligned}

Use a smaller streaming_batch_size when the host-data fit path runs out of GPU memory. Use a larger value when GPU memory is available and transfer overhead dominates.

Prediction

Prediction needs the input rows, centroids, output labels, and a distance tile:

labels_size=NBlpredict_peak NDBx+KDBc+labels_size+distance_tile_size+scratch\begin{aligned} \text{labels\_size} &= N \cdot B_l \\ \text{predict\_peak} \approx&\ N \cdot D \cdot B_x + K \cdot D \cdot B_c \\ &+ \text{labels\_size} + \text{distance\_tile\_size} + \text{scratch} \end{aligned}

For large K, reduce batch_centroids. For large N, reduce batch_samples or use the host streaming fit path when fitting from host data.