Agglomerative

View as Markdown

Source header: cuvs/cluster/agglomerative.hpp

agglomerative clustering hyperparameters

cluster::agglomerative::Linkage

Determines the method for computing the minimum spanning tree (MST)

1enum Linkage { ... };

Values

NameValue
PAIRWISE0
KNN_GRAPH1

Types

cluster::agglomerative::single_linkage_output

Simple container object for consolidating linkage results. This closely

mirrors the trained instance variables populated in Scikit-learn’s AgglomerativeClustering estimator.

1template <typename idx_t>
2class single_linkage_output { ... };

single-linkage clustering APIs

cluster::agglomerative::single_linkage

Single-linkage clustering, capable of constructing a KNN graph to

1void single_linkage(
2raft::resources const& handle,
3raft::device_matrix_view<const float, int, raft::row_major> X,
4raft::device_matrix_view<int, int, raft::row_major> dendrogram,
5raft::device_vector_view<int, int> labels,
6cuvs::distance::DistanceType metric,
7size_t n_clusters,
8cuvs::cluster::agglomerative::Linkage linkage = cuvs::cluster::agglomerative::Linkage::KNN_GRAPH,
9std::optional<int> c = std::make_optional<int>(DEFAULT_CONST_C));

scale the algorithm beyond the n^2 memory consumption of implementations that use the fully-connected graph of pairwise distances by connecting a knn graph when k is not large enough to connect it.

Parameters

NameDirectionTypeDescription
handleinraft::resources const&raft handle
Xinraft::device_matrix_view<const float, int, raft::row_major>dense input matrix in row-major layout
dendrogramoutraft::device_matrix_view<int, int, raft::row_major>output dendrogram (size [n_rows - 1] * 2)
labelsoutraft::device_vector_view<int, int>output labels vector (size n_rows)
metricincuvs::distance::DistanceTypedistance metric to use when constructing connectivities graph
n_clustersinsize_tnumber of clusters to assign data samples
linkageincuvs::cluster::agglomerative::Linkagestrategy for constructing the linkage. PAIRWISE uses more memory but can be faster for smaller datasets. KNN_GRAPH allows the memory usage to be controlled (using parameter c) at the expense of potentially additional minimum spanning tree iterations. Default: cuvs::cluster::agglomerative::Linkage::KNN_GRAPH.
cinstd::optional<int>a constant used when constructing linkage from knn graph. Allows the indirect control of k. The algorithm will set k = log(n) + c Default: std::make_optional&lt;int&gt;(DEFAULT_CONST_C).

Returns

void

cluster::agglomerative::helpers::linkage_graph_params::distance_params

Specialized parameters to build the KNN graph with regular distances

1struct distance_params { ... };

Fields

NameTypeDescription
cinta constant used when constructing linkage from knn graph. Allows the indirect control of k. The algorithm will set k = log(n) + c
dist_typecuvs::cluster::agglomerative::Linkagestrategy for constructing the linkage. PAIRWISE uses more memory but can be faster for smaller datasets. KNN_GRAPH allows the memory usage to be controlled (using parameter c)

cluster::agglomerative::helpers::linkage_graph_params::mutual_reachability_params

Specialized parameters to build the Mutual Reachability graph

1struct mutual_reachability_params { ... };

Fields

NameTypeDescription
min_samplesintthis neighborhood will be selected for core distances.
alphafloatweight applied when internal distance is chosen for mutual reachability (value of 1.0 disables the weighting)
brute_force_paramscuvs::neighbors::all_neighbors::all_neighbors_params all_neighbors_params{ cuvs::neighbors::graph_build_params::Parameters for building the mutual reachability graph using an underlying KNN algorithm. The all-neighbors graph construction algorithm enables building the mutual reachability graph on datasets larger than device memory by:
1. Partitioning the dataset into overlapping clusters,
2. Computing local KNN graphs within each cluster, and
3. Merging the local graphs into a single global graph. Key fields:
- graph_build_params: Selects the KNN construction method (Brute Force or NN Descent) and controls algorithm-specific parameters.
- n_clusters: Number of partitions (batches) to split the data into. Larger n_clusters reduces memory usage but may reduce accuracy if overlap_factor is too low. Recommended starting value: n_clusters = 4. Increase progressively (4 → 8 → 16 …) to reduce memory usage at the cost of some accuracy. This is independent of overlap_factor as long as overlap_factor &lt; n_clusters.
- overlap_factor: Number of nearest clusters each data point is assigned to. Higher overlap_factor improves accuracy at the cost of memory and performance. Recommended starting value: overlap_factor = 2. Increase gradually (2 → 3 → 4 …) for better accuracy with higher device memory usage.
- metric: Distance metric to use when computing nearest neighbors.

cluster::agglomerative::helpers::build_linkage

Given a dataset, builds the KNN graph, connects graph components and builds a linkage

1void build_linkage(
2raft::resources const& handle,
3raft::device_matrix_view<const float, int64_t, raft::row_major> X,
4std::variant<linkage_graph_params::distance_params,
5linkage_graph_params::mutual_reachability_params> linkage_graph_params,
6cuvs::distance::DistanceType metric,
7raft::device_coo_matrix_view<float, int64_t, int64_t, size_t> out_mst,
8raft::device_matrix_view<int64_t, int64_t> dendrogram,
9raft::device_vector_view<float, int64_t> out_distances,
10raft::device_vector_view<int64_t, int64_t> out_sizes,
11std::optional<raft::device_vector_view<float, int64_t>> core_dists);

(dendrogram). Returns the Minimum Spanning Tree edges sorted by weight and the dendrogram. Reachability space

Parameters

NameDirectionTypeDescription
handleinraft::resources const&raft handle for resource reuse
Xinraft::device_matrix_view<const float, int64_t, raft::row_major>data points on device memory (size n_rows * d)
linkage_graph_paramsinstd::variant<linkage_graph_params::distance_params, linkage_graph_params::mutual_reachability_params>Parameters controlling how the KNN graph is built. This can be either:
- distance_params: standard distance-based KNN graph construction for traditional agglomerative clustering.
- mutual_reachability_params: parameters to compute a mutual reachability graph for density-aware hierarchical clustering (e.g. HDBSCAN).
metricincuvs::distance::DistanceTypedistance metric to use
out_mstoutraft::device_coo_matrix_view<float, int64_t, int64_t, size_t>output MST sorted by edge weights (size n_rows - 1)
dendrogramoutraft::device_matrix_view<int64_t, int64_t>output dendrogram (size [n_rows - 1] * 2)
out_distancesoutraft::device_vector_view<float, int64_t>distances for output
out_sizesoutraft::device_vector_view<int64_t, int64_t>cluster sizes of output
core_distsoutstd::optional<raft::device_vector_view<float, int64_t>>(optional) core distances (size m). Must be supplied in the Mutual

Returns

void

Additional overload: cluster::agglomerative::helpers::build_linkage

Given a dataset, builds the KNN graph, connects graph components and builds a linkage

1void build_linkage(
2raft::resources const& handle,
3raft::host_matrix_view<const float, int64_t, raft::row_major> X,
4std::variant<linkage_graph_params::distance_params,
5linkage_graph_params::mutual_reachability_params> linkage_graph_params,
6cuvs::distance::DistanceType metric,
7raft::device_coo_matrix_view<float, int64_t, int64_t, size_t> out_mst,
8raft::device_matrix_view<int64_t, int64_t> dendrogram,
9raft::device_vector_view<float, int64_t> out_distances,
10raft::device_vector_view<int64_t, int64_t> out_sizes,
11std::optional<raft::device_vector_view<float, int64_t>> core_dists);

(dendrogram). Returns the Minimum Spanning Tree edges sorted by weight and the dendrogram. Reachability space

Parameters

NameDirectionTypeDescription
handleinraft::resources const&raft handle for resource reuse
Xinraft::host_matrix_view<const float, int64_t, raft::row_major>data points on host memory (size n_rows * d)
linkage_graph_paramsinstd::variant<linkage_graph_params::distance_params, linkage_graph_params::mutual_reachability_params>Parameters controlling how the KNN graph is built. This can be either:
- distance_params: standard distance-based KNN graph construction for traditional agglomerative clustering.
- mutual_reachability_params: parameters to compute a mutual reachability graph for density-aware hierarchical clustering (e.g. HDBSCAN).
metricincuvs::distance::DistanceTypedistance metric to use
out_mstoutraft::device_coo_matrix_view<float, int64_t, int64_t, size_t>output MST sorted by edge weights (size n_rows - 1)
dendrogramoutraft::device_matrix_view<int64_t, int64_t>output dendrogram (size [n_rows - 1] * 2)
out_distancesoutraft::device_vector_view<float, int64_t>distances for output
out_sizesoutraft::device_vector_view<int64_t, int64_t>cluster sizes of output
core_distsoutstd::optional<raft::device_vector_view<float, int64_t>>(optional) core distances (size m). Must be supplied in the Mutual

Returns

void

cluster::agglomerative::helpers::build_dendrogram

Build dendrogram from a Minimum Spanning Tree (MST).

1void build_dendrogram(raft::resources const& handle,
2raft::device_vector_view<const int64_t, int64_t> rows,
3raft::device_vector_view<const int64_t, int64_t> cols,
4raft::device_vector_view<const float, int64_t> data,
5raft::device_matrix_view<int64_t, int64_t, raft::row_major> children,
6raft::device_vector_view<float, int64_t> out_delta,
7raft::device_vector_view<int64_t, int64_t> out_size);

This function takes a sorted MST (represented as edges with source, destination, and weights) and constructs a dendrogram (hierarchical clustering tree) on the host.

nnz)

Parameters

NameDirectionTypeDescription
handleinraft::resources const&The raft resources handle
rowsinraft::device_vector_view<const int64_t, int64_t>Source nodes of the MST edges (device memory, size: nnz)
colsinraft::device_vector_view<const int64_t, int64_t>Destination nodes of the MST edges (device memory, size: nnz)
datainraft::device_vector_view<const float, int64_t>Edge weights/distances of the MST (device memory, size: nnz)
childrenoutraft::device_matrix_view<int64_t, int64_t, raft::row_major>Output dendrogram children array (device memory, size: nnz * 2) Each pair of consecutive elements represents the two children merged at each step of the hierarchy
out_deltaoutraft::device_vector_view<float, int64_t>Output distances/heights at which clusters are merged (device memory, size:
out_sizeoutraft::device_vector_view<int64_t, int64_t>Output cluster sizes at each merge step (device memory, size: nnz)

Returns

void