Single-linkage
Single-linkage
Single-linkage is a hierarchical clustering algorithm. It builds a tree of merges by repeatedly connecting the two closest clusters, where cluster distance is defined by the closest pair of points across the two clusters.
Use single-linkage when you want a dendrogram, want to cut a hierarchy into a chosen number of clusters, or need clustering that can capture chain-like structure. Unlike K-Means, single-linkage does not learn centroids.
Example API Usage
Clustering data
The C++ API can build labels directly from a dense row-major device matrix. The output dendrogram has n_rows - 1 rows and two columns, and the labels vector has one label per input row.
C++
Building a linkage
Use the helper API when you need the minimum spanning tree, merge distances, or cluster sizes in addition to the dendrogram.
C++
How Single-linkage works
Single-linkage first builds a graph over the data and then finds a minimum spanning tree. Sorting the tree edges from shortest to longest gives the merge order for the dendrogram. Cutting that tree at the right number of components gives cluster labels.
The PAIRWISE mode builds from pairwise distances. It is simple and fast for smaller datasets, but its memory grows quadratically. The KNN_GRAPH mode builds a sparse nearest-neighbor graph and can scale to much larger datasets, at the cost of more graph construction work.
When to use Single-linkage
Use single-linkage when the hierarchy itself matters, when clusters may have elongated or connected shapes, or when you need a dendrogram for later analysis.
Avoid single-linkage when small bridges between groups should not merge them. Because it uses the closest pair of points, it can chain through sparse connections that other clustering methods would separate.
Configuration parameters
Clustering parameters
Tuning
Start with KNN_GRAPH for large datasets and PAIRWISE for smaller datasets where the pairwise distance matrix fits comfortably in memory.
Increase c if the KNN graph is too sparse or produces unstable connectivity. Decrease it when memory use or graph construction time is too high.
Choose the distance metric to match the data representation. The hierarchy can change substantially when the metric changes.
Memory footprint
Single-linkage memory depends strongly on the graph construction mode.
Variables:
N: Number of rows.D: Number of features per row.k: Number of neighbors used by the sparse graph.B_x: Bytes per input element.B_d: Bytes per distance value.B_i: Bytes per index value.
Scratch and maximum rows
The formulas below show the large persistent arrays and graph buffers. scratch covers temporary graph-construction buffers, sort or merge workspace, allocator padding, CUDA library workspaces, and memory held by the active memory resource. For planning, reserve a headroom factor H = 0.30 for single-linkage builds. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
For KNN graph mode, substitute k and rewrite the peak as a linear function of N:
Then estimate:
For pairwise mode, the dense distance matrix is quadratic. If both inputs have N rows, solve:
Pairwise mode
The pairwise path can require a dense distance matrix:
This mode is usually only practical when N is small enough for the quadratic matrix to fit comfortably.
KNN graph mode
The sparse graph path stores roughly N * k weighted edges:
The dendrogram and labels are smaller:
The sparse path peak is approximately: