K-Means

Source header: cuvs/cluster/kmeans.hpp

Types

cluster::kmeans::base_params

Base structure for parameters that are common to all k-means algorithms

struct base_params {
  cuvs::distance::DistanceType metric;
};

Fields

| Name | Type | Description |
| --- | --- | --- |
| `metric` | `cuvs::distance::DistanceType` | Metric to use for distance computation. The supported metrics can vary per algorithm. |

k-means hyperparameters

cluster::kmeans::params

Simple object to specify hyper-parameters to the kmeans algorithm.

struct params : base_params {
  int n_clusters;
  InitMethod init;
  int max_iter;
  double tol;
  rapids_logger::level_enum verbosity;
  raft::random::RngState rng_state;
  int n_init;
  double oversampling_factor;
  int batch_samples;
  int batch_centroids;
  int64_t init_size;
  int64_t streaming_batch_size;
};

Fields

| Name | Type | Description |
| --- | --- | --- |
| `n_clusters` | `int` | The number of clusters to form, which is also the number of centroids to generate (default: 8). |
| `init` | `InitMethod` | Method for initialization; defaults to k-means++. `InitMethod::KMeansPlusPlus` (k-means++): use the scalable k-means++ algorithm to select the initial cluster centers. `InitMethod::Random` (random): choose `n_clusters` observations (rows) at random from the input data for the initial centroids. `InitMethod::Array` (ndarray): use `centroids` as the initial cluster centers. |
| `max_iter` | `int` | Maximum number of iterations of the k-means algorithm for a single run. |
| `tol` | `double` | Relative tolerance with regard to inertia to declare convergence. |
| `verbosity` | `rapids_logger::level_enum` | Verbosity level. |
| `rng_state` | `raft::random::RngState` | Seed for the random number generator. |
| `n_init` | `int` | Number of times the k-means algorithm will be run with different seeds. |
| `oversampling_factor` | `double` | Oversampling factor for use in the k-means\|\| algorithm. |
| `batch_samples` | `int` | `batch_samples` and `batch_centroids` are used to tile the 1-nearest-neighbor (1NN) computation, which is useful to optimize and control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when `batch_centroids` is 0 the centroids are not tiled. NB: these parameters are unrelated to `streaming_batch_size`, which controls how many samples to transfer from host to device per batch when processing out-of-core data. |
| `batch_centroids` | `int` | If 0, then `batch_centroids` = `n_clusters`. |
| `init_size` | `int64_t` | Number of samples to randomly draw for the KMeansPlusPlus initialization step. A random subset of this size is used for centroid seeding. Only applies when the dataset is on the host; for device data the full dataset is always used for seeding and this parameter is ignored. When set to 0 with host data, min(3 * n_clusters, n_samples) is used. In batched multi-GPU host-data fits, the effective KMeansPlusPlus initialization sample is materialized on device on every rank. Every rank must have enough GPU memory for this sample, and rank 0 must also have enough GPU memory for the seeding workspace. Default: 0. |
| `streaming_batch_size` | `int64_t` | Number of samples to process per GPU batch when fitting with host data. When set to 0, defaults to n_samples (process all at once). Only used by the batched (host-data) code path and ignored by device-data overloads. In multi-GPU mode, this is a per-rank batch size: each rank processes up to this many local samples per batch, clamped to that rank's local sample count. Default: 0 (process all data at once). |
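
The "0 means default" rules for `init_size` and `batch_centroids` can be illustrated with a small host-side sketch. The helpers `effective_init_size`, `resolve_tile` and the struct `tile_shape` are hypothetical names introduced here, not cuVS APIs, and clamping the tile to the problem size is an assumption made for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a cuVS API): the init_size used for host data.
inline std::int64_t effective_init_size(int n_clusters,
                                        std::int64_t n_samples,
                                        std::int64_t init_size)
{
  assert(n_clusters > 0 && n_samples > 0 && init_size >= 0);
  if (init_size != 0) { return init_size; }
  // Documented default for host data: min(3 * n_clusters, n_samples).
  return std::min<std::int64_t>(3 * static_cast<std::int64_t>(n_clusters), n_samples);
}

// Hypothetical description of how the 1NN computation is tiled.
struct tile_shape {
  std::int64_t rows;     // samples per tile
  std::int64_t cols;     // centroids per tile
  std::int64_t n_tiles;  // number of tiles covering n_samples x n_clusters
};

inline tile_shape resolve_tile(int n_clusters,
                               int batch_samples,
                               int batch_centroids,
                               std::int64_t n_samples)
{
  assert(n_clusters > 0 && batch_samples > 0 && batch_centroids >= 0 && n_samples > 0);
  // Assumption: a tile never exceeds the problem size.
  const std::int64_t rows = std::min<std::int64_t>(batch_samples, n_samples);
  // batch_centroids == 0 means "do not tile the centroids".
  const std::int64_t cols =
    batch_centroids == 0 ? n_clusters : std::min(batch_centroids, n_clusters);
  const std::int64_t row_tiles = (n_samples + rows - 1) / rows;   // ceiling division
  const std::int64_t col_tiles = (n_clusters + cols - 1) / cols;  // ceiling division
  return tile_shape{rows, cols, row_tiles * col_tiles};
}
```

With 8 clusters and 10 samples, `resolve_tile(8, 4, 0, 10)` describes tiles of 4 x 8, while `resolve_tile(8, 4, 3, 10)` also splits the centroids into groups of at most 3.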

cluster::kmeans::balanced_params

Simple object to specify hyper-parameters to the balanced k-means algorithm.

The following metrics are currently supported in k-means balanced:

  • CosineExpanded
  • InnerProduct
  • L2Expanded
  • L2SqrtExpanded

struct balanced_params : base_params {
  uint32_t n_iters;
};

Fields

| Name | Type | Description |
| --- | --- | --- |
| `n_iters` | `uint32_t` | Number of training iterations |
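
For readers unfamiliar with the metric names listed for balanced k-means, the following host-side sketch shows the textbook quantities they are usually associated with. The function names are hypothetical, and the mapping from each `DistanceType` enumerator to these exact formulas is an assumption here, not something this page states.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain dot product of two equally sized vectors.
inline float dot_product(const std::vector<float>& a, const std::vector<float>& b)
{
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) { sum += a[i] * b[i]; }
  return sum;
}

// Assumed meaning of L2Expanded: squared Euclidean distance in expanded form,
// ||a||^2 + ||b||^2 - 2 * <a, b>.
inline float l2_expanded(const std::vector<float>& a, const std::vector<float>& b)
{
  return dot_product(a, a) + dot_product(b, b) - 2.0f * dot_product(a, b);
}

// Assumed meaning of L2SqrtExpanded: square root of the expanded squared distance.
inline float l2_sqrt_expanded(const std::vector<float>& a, const std::vector<float>& b)
{
  // Clamp at zero to guard against small negative values from rounding.
  return std::sqrt(std::max(0.0f, l2_expanded(a, b)));
}

// Assumed meaning of InnerProduct: the raw dot product (larger means closer).
inline float inner_product_score(const std::vector<float>& a, const std::vector<float>& b)
{
  return dot_product(a, b);
}

// Assumed meaning of CosineExpanded: 1 - cosine similarity.
inline float cosine_expanded(const std::vector<float>& a, const std::vector<float>& b)
{
  const float norm_a = std::sqrt(dot_product(a, a));
  const float norm_b = std::sqrt(dot_product(b, b));
  assert(norm_a > 0.0f && norm_b > 0.0f);
  return 1.0f - dot_product(a, b) / (norm_a * norm_b);
}
```

Note that the inner product is a similarity rather than a distance, so "closest" means the largest value under that metric.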

cluster::kmeans::kmeans_type

Type of k-means algorithm.

enum class kmeans_type {
  KMeans = 0,
  KMeansBalanced = 1
};

Values

| Name | Value |
| --- | --- |
| `KMeans` | 0 |
| `KMeansBalanced` | 1 |

k-means clustering APIs

cluster::kmeans::fit

Find clusters with k-means algorithm using batched processing of host data.

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::host_matrix_view<const float, int64_t> X,
         std::optional<raft::host_vector_view<const float, int64_t>> sample_weight,
         raft::device_matrix_view<float, int64_t> centroids,
         raft::host_scalar_view<float> inertia,
         raft::host_scalar_view<int64_t> n_iter);

TODO: Evaluate replacing the extent type with int64_t. Reference issue: https://github.com/rapidsai/cuvs/issues/1961

This overload supports out-of-core computation where the dataset resides on the host. Data is processed in GPU-sized batches, streaming from host to device. The batch size is controlled by params.streaming_batch_size. In multi-GPU mode, this is a per-rank batch size.
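
The batching described above can be pictured as walking the host rows in fixed-size chunks. The sketch below only computes the row ranges; `batch_ranges` is a hypothetical helper, not a cuVS function, and the actual host-to-device copies are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: split n_samples host rows into [begin, end) ranges of
// at most streaming_batch_size rows each. A value of 0 means "one batch".
inline std::vector<std::pair<std::int64_t, std::int64_t>> batch_ranges(
  std::int64_t n_samples, std::int64_t streaming_batch_size)
{
  assert(n_samples > 0 && streaming_batch_size >= 0);
  // Clamp the batch size to the number of rows that are actually available.
  const std::int64_t step =
    streaming_batch_size == 0 ? n_samples : std::min(streaming_batch_size, n_samples);
  std::vector<std::pair<std::int64_t, std::int64_t>> ranges;
  for (std::int64_t begin = 0; begin < n_samples; begin += step) {
    ranges.emplace_back(begin, std::min(begin + step, n_samples));
  }
  return ranges;
}
```

For 10 rows and a batch size of 4 this yields three ranges, the last one shorter than the others. In multi-GPU mode the same reasoning would apply to each rank's local row count.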

Multi-GPU dispatch is selected automatically based on the handle state:

  • If raft::resource::is_multi_gpu(handle) (cuVS SNMG): the full dataset X is split across GPUs internally with an OpenMP parallel region and NCCL.
  • If raft::resource::comms_initialized(handle) (Dask/Ray/MPI): X is treated as this worker’s partition, and RAFT communicators are used for collectives.
  • Otherwise: single-GPU batched k-means.
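
The three cases in the list above amount to a simple decision function. The sketch below models the two handle queries as plain booleans; `fit_path` and `select_fit_path` are hypothetical names, and checking the SNMG case first when both conditions hold is an assumption taken from the order of the list.

```cpp
#include <cassert>

// Hypothetical labels for the three documented code paths.
enum class fit_path { snmg, communicator, single_gpu };

// is_multi_gpu stands in for raft::resource::is_multi_gpu(handle),
// comms_initialized for raft::resource::comms_initialized(handle).
inline fit_path select_fit_path(bool is_multi_gpu, bool comms_initialized)
{
  if (is_multi_gpu) { return fit_path::snmg; }                // X is split across GPUs internally
  if (comms_initialized) { return fit_path::communicator; }   // X is this worker's partition
  return fit_path::single_gpu;                                // single-GPU batched k-means
}

// Self-check of the three documented cases.
inline void check_fit_path_examples()
{
  assert(select_fit_path(true, false) == fit_path::snmg);
  assert(select_fit_path(false, true) == fit_path::communicator);
  assert(select_fit_path(false, false) == fit_path::single_gpu);
}
```

The practical difference between the first two paths is what `X` means: the whole dataset in the SNMG case, and only the local partition when communicators are initialized.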

With params.init == InitMethod::KMeansPlusPlus in multi-GPU mode, the effective initialization sample must fit in GPU memory on every rank because it is materialized on every device. Rank 0 must also have enough GPU memory for the seeding workspace before centroids are broadcast.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. When a multi-GPU resource is attached, multi-GPU dispatch is used automatically. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. Batch size is read from `params.streaming_batch_size`. |
| `X` | in | `raft::host_matrix_view<const float, int64_t>` | Training instances in host memory. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::host_vector_view<const float, int64_t>>` | Optional weights for each observation in X (on host). [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<float, int64_t>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int64_t>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit

Find clusters with k-means algorithm using batched processing of host data.

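void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::host_matrix_view<const double, int64_t> X,
         std::optional<raft::host_vector_view<const double, int64_t>> sample_weight,
         raft::device_matrix_view<double, int64_t> centroids,
         raft::host_scalar_view<double> inertia,
         raft::host_scalar_view<int64_t> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. Batch size is read from `params.streaming_batch_size`. |
| `X` | in | `raft::host_matrix_view<const double, int64_t>` | Training instances in host memory. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::host_vector_view<const double, int64_t>>` | Optional weights for each observation in X (on host). [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<double, int64_t>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<double>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int64_t>` | Number of iterations run. |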

Returns

void

Additional overload: cluster::kmeans::fit

Find clusters with k-means algorithm. Initial centroids are chosen with k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with k-means++ algorithm.

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const float, int> X,
         std::optional<raft::device_vector_view<const float, int>> sample_weight,
         raft::device_matrix_view<float, int> centroids,
         raft::host_scalar_view<float> inertia,
         raft::host_scalar_view<int> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const float, int>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<float, int>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int>` | Number of iterations run. |

Returns

void
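
To make the outputs of `fit` concrete, here is a host-side sketch of a single Lloyd iteration (assign each sample to its nearest centroid, then move each centroid to the mean of its samples). It is a plain CPU reference with hypothetical names (`matrix`, `lloyd_step`), not the cuVS implementation, and it uses squared Euclidean distance. Unlike the library, it leaves empty clusters untouched instead of reinitializing them.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Row-major matrix with rows * cols values (hypothetical helper type).
struct matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;
};

// Runs one Lloyd iteration. Updates centroids in place and returns the
// inertia measured against the centroids as they were before the update.
inline float lloyd_step(const matrix& X, matrix& centroids)
{
  assert(X.cols == centroids.cols && centroids.rows > 0);
  assert(X.data.size() == X.rows * X.cols);
  assert(centroids.data.size() == centroids.rows * centroids.cols);
  const std::size_t k = centroids.rows;
  const std::size_t d = X.cols;
  std::vector<float> sums(k * d, 0.0f);
  std::vector<std::size_t> counts(k, 0);
  float inertia = 0.0f;

  for (std::size_t i = 0; i < X.rows; ++i) {
    // Assignment step: find the closest centroid for sample i.
    float best = std::numeric_limits<float>::max();
    std::size_t best_c = 0;
    for (std::size_t c = 0; c < k; ++c) {
      float dist = 0.0f;
      for (std::size_t j = 0; j < d; ++j) {
        const float diff = X.data[i * d + j] - centroids.data[c * d + j];
        dist += diff * diff;
      }
      if (dist < best) {
        best   = dist;
        best_c = c;
      }
    }
    inertia += best;
    ++counts[best_c];
    for (std::size_t j = 0; j < d; ++j) { sums[best_c * d + j] += X.data[i * d + j]; }
  }

  // Update step: each non-empty cluster moves to the mean of its samples.
  for (std::size_t c = 0; c < k; ++c) {
    if (counts[c] == 0) { continue; }  // empty cluster: left unchanged in this sketch
    for (std::size_t j = 0; j < d; ++j) {
      centroids.data[c * d + j] = sums[c * d + j] / static_cast<float>(counts[c]);
    }
  }
  return inertia;
}
```

Repeating this step until the inertia stops changing (or `max_iter` is reached) gives the classic k-means loop; the number of repetitions corresponds to what `n_iter` reports.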

Additional overload: cluster::kmeans::fit

Find clusters with k-means algorithm. Initial centroids are chosen with k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with k-means++ algorithm.

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const float, int64_t> X,
         std::optional<raft::device_vector_view<const float, int64_t>> sample_weight,
         raft::device_matrix_view<float, int64_t> centroids,
         raft::host_scalar_view<float> inertia,
         raft::host_scalar_view<int64_t> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const float, int64_t>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<float, int64_t>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int64_t>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit

Find clusters with k-means algorithm. Initial centroids are chosen with k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with k-means++ algorithm.

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const double, int> X,
         std::optional<raft::device_vector_view<const double, int>> sample_weight,
         raft::device_matrix_view<double, int> centroids,
         raft::host_scalar_view<double> inertia,
         raft::host_scalar_view<int> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const double, int>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const double, int>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<double, int>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<double>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit

Find clusters with k-means algorithm. Initial centroids are chosen with k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with k-means++ algorithm.

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const double, int64_t> X,
         std::optional<raft::device_vector_view<const double, int64_t>> sample_weight,
         raft::device_matrix_view<double, int64_t> centroids,
         raft::host_scalar_view<double> inertia,
         raft::host_scalar_view<int64_t> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const double, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const double, int64_t>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<double, int64_t>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<double>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int64_t>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit

Find clusters with k-means algorithm. Initial centroids are chosen with k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with k-means++ algorithm.

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const int8_t, int> X,
         std::optional<raft::device_vector_view<const int8_t, int>> sample_weight,
         raft::device_matrix_view<int8_t, int> centroids,
         raft::host_scalar_view<int8_t> inertia,
         raft::host_scalar_view<int> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const cuvs::cluster::kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const int8_t, int>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const int8_t, int>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `raft::device_matrix_view<int8_t, int>` | [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `raft::host_scalar_view<int8_t>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit

Find balanced clusters with k-means algorithm.

void fit(const raft::resources& handle,
         cuvs::cluster::kmeans::balanced_params const& params,
         raft::device_matrix_view<const float, int64_t> X,
         raft::device_matrix_view<float, int64_t> centroids,
         std::optional<raft::host_scalar_view<float>> inertia = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `centroids` | out | `raft::device_matrix_view<float, int64_t>` | [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `std::optional<raft::host_scalar_view<float>>` | Sum of squared distances of samples to their closest cluster center. Default: `std::nullopt`. |

Returns

void

Additional overload: cluster::kmeans::fit

Find balanced clusters with k-means algorithm.

void fit(const raft::resources& handle,
         cuvs::cluster::kmeans::balanced_params const& params,
         raft::device_matrix_view<const int8_t, int64_t> X,
         raft::device_matrix_view<float, int64_t> centroids,
         std::optional<raft::host_scalar_view<float>> inertia = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const int8_t, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `centroids` | inout | `raft::device_matrix_view<float, int64_t>` | [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `std::optional<raft::host_scalar_view<float>>` | Sum of squared distances of samples to their closest cluster center. Default: `std::nullopt`. |

Returns

void

Additional overload: cluster::kmeans::fit

Find balanced clusters with k-means algorithm.

void fit(const raft::resources& handle,
         cuvs::cluster::kmeans::balanced_params const& params,
         raft::device_matrix_view<const half, int64_t> X,
         raft::device_matrix_view<float, int64_t> centroids,
         std::optional<raft::host_scalar_view<float>> inertia = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const half, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `centroids` | inout | `raft::device_matrix_view<float, int64_t>` | [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `std::optional<raft::host_scalar_view<float>>` | Sum of squared distances of samples to their closest cluster center. Default: `std::nullopt`. |

Returns

void

Additional overload: cluster::kmeans::fit

Find balanced clusters with k-means algorithm.

void fit(const raft::resources& handle,
         cuvs::cluster::kmeans::balanced_params const& params,
         raft::device_matrix_view<const uint8_t, int64_t> X,
         raft::device_matrix_view<float, int64_t> centroids,
         std::optional<raft::host_scalar_view<float>> inertia = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const uint8_t, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `centroids` | inout | `raft::device_matrix_view<float, int64_t>` | [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `inertia` | out | `std::optional<raft::host_scalar_view<float>>` | Sum of squared distances of samples to their closest cluster center. Default: `std::nullopt`. |

Returns

void

cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(raft::resources const& handle,
             const kmeans::params& params,
             raft::device_matrix_view<const float, int> X,
             std::optional<raft::device_vector_view<const float, int>> sample_weight,
             raft::device_matrix_view<const float, int> centroids,
             raft::device_vector_view<int, int> labels,
             bool normalize_weight,
             raft::host_scalar_view<float> inertia);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int>` | New data to predict. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const float, int>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | in | `raft::device_matrix_view<const float, int>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int, int>` | Index of the cluster each sample in X belongs to. [len = n_samples] |
| `normalize_weight` | in | `bool` | True if the weights should be normalized. |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |

Returns

void
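
What `predict` returns can be mirrored by a small CPU reference: each sample gets the index of its nearest centroid, and the inertia accumulates the (weighted) squared distances. The names `prediction` and `predict_reference` are hypothetical. Two points are assumptions rather than statements from this page: that normalization rescales the weights so they sum to `n_samples`, and that each squared distance is multiplied by its sample weight.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical result type for the CPU reference.
struct prediction {
  std::vector<int> labels;  // nearest-centroid index per sample
  float inertia;            // weighted sum of squared distances
};

// X and centroids are row-major with n_features columns.
// An empty weights vector means "all weights are 1".
inline prediction predict_reference(const std::vector<float>& X,
                                    std::size_t n_features,
                                    const std::vector<float>& centroids,
                                    std::vector<float> weights,
                                    bool normalize_weight)
{
  assert(n_features > 0 && X.size() % n_features == 0);
  assert(!centroids.empty() && centroids.size() % n_features == 0);
  const std::size_t n_samples  = X.size() / n_features;
  const std::size_t n_clusters = centroids.size() / n_features;
  if (weights.empty()) { weights.assign(n_samples, 1.0f); }
  assert(weights.size() == n_samples);

  if (normalize_weight) {
    // Assumption: rescale so that the weights sum to n_samples.
    const float total = std::accumulate(weights.begin(), weights.end(), 0.0f);
    assert(total > 0.0f);
    const float scale = static_cast<float>(n_samples) / total;
    for (float& w : weights) { w *= scale; }
  }

  prediction out{std::vector<int>(n_samples, 0), 0.0f};
  for (std::size_t i = 0; i < n_samples; ++i) {
    float best = std::numeric_limits<float>::max();
    for (std::size_t c = 0; c < n_clusters; ++c) {
      float dist = 0.0f;
      for (std::size_t j = 0; j < n_features; ++j) {
        const float diff = X[i * n_features + j] - centroids[c * n_features + j];
        dist += diff * diff;
      }
      if (dist < best) {
        best          = dist;
        out.labels[i] = static_cast<int>(c);
      }
    }
    out.inertia += weights[i] * best;
  }
  return out;
}
```

With uniform weights the normalization has no effect on the labels, only on the scale of the reported inertia.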

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(raft::resources const& handle,
             const kmeans::params& params,
             raft::device_matrix_view<const float, int64_t> X,
             std::optional<raft::device_vector_view<const float, int64_t>> sample_weight,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<int64_t, int64_t> labels,
             bool normalize_weight,
             raft::host_scalar_view<float> inertia);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const float, int64_t>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int64_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |
| `normalize_weight` | in | `bool` | True if the weights should be normalized. |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(raft::resources const& handle,
             const kmeans::params& params,
             raft::device_matrix_view<const double, int> X,
             std::optional<raft::device_vector_view<const double, int>> sample_weight,
             raft::device_matrix_view<const double, int> centroids,
             raft::device_vector_view<int, int> labels,
             bool normalize_weight,
             raft::host_scalar_view<double> inertia);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const double, int>` | New data to predict. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const double, int>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | in | `raft::device_matrix_view<const double, int>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int, int>` | Index of the cluster each sample in X belongs to. [len = n_samples] |
| `normalize_weight` | in | `bool` | True if the weights should be normalized. |
| `inertia` | out | `raft::host_scalar_view<double>` | Sum of squared distances of samples to their closest cluster center. |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(raft::resources const& handle,
             const kmeans::params& params,
             raft::device_matrix_view<const double, int64_t> X,
             std::optional<raft::device_vector_view<const double, int64_t>> sample_weight,
             raft::device_matrix_view<const double, int64_t> centroids,
             raft::device_vector_view<int64_t, int64_t> labels,
             bool normalize_weight,
             raft::host_scalar_view<double> inertia);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const double, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const double, int64_t>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | in | `raft::device_matrix_view<const double, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int64_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |
| `normalize_weight` | in | `bool` | True if the weights should be normalized. |
| `inertia` | out | `raft::host_scalar_view<double>` | Sum of squared distances of samples to their closest cluster center. |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(const raft::resources& handle,
             cuvs::cluster::kmeans::balanced_params const& params,
             raft::device_matrix_view<const int8_t, int64_t> X,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<uint32_t, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const int8_t, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<uint32_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(const raft::resources& handle,
             cuvs::cluster::kmeans::balanced_params const& params,
             raft::device_matrix_view<const int8_t, int64_t> X,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<int, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const int8_t, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(const raft::resources& handle,
             cuvs::cluster::kmeans::balanced_params const& params,
             raft::device_matrix_view<const float, int64_t> X,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<int, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(const raft::resources& handle,
             cuvs::cluster::kmeans::balanced_params const& params,
             raft::device_matrix_view<const float, int64_t> X,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<uint32_t, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<uint32_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(const raft::resources& handle,
             cuvs::cluster::kmeans::balanced_params const& params,
             raft::device_matrix_view<const half, int64_t> X,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<uint32_t, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const half, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<uint32_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

Additional overload: cluster::kmeans::predict

Predict the closest cluster each sample in X belongs to.

void predict(const raft::resources& handle,
             cuvs::cluster::kmeans::balanced_params const& params,
             raft::device_matrix_view<const uint8_t, int64_t> X,
             raft::device_matrix_view<const float, int64_t> centroids,
             raft::device_vector_view<uint32_t, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `const raft::resources&` | The raft handle. |
| `params` | in | `cuvs::cluster::kmeans::balanced_params const&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const uint8_t, int64_t>` | New data to predict. [dim = n_samples x n_features] |
| `centroids` | in | `raft::device_matrix_view<const float, int64_t>` | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<uint32_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

cluster::kmeans::fit_predict

Compute k-means clustering and predict the cluster index for each sample in the input.

void fit_predict(raft::resources const& handle,
                 const kmeans::params& params,
                 raft::device_matrix_view<const float, int> X,
                 std::optional<raft::device_vector_view<const float, int>> sample_weight,
                 std::optional<raft::device_matrix_view<float, int>> centroids,
                 raft::device_vector_view<int, int> labels,
                 raft::host_scalar_view<float> inertia,
                 raft::host_scalar_view<int> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const float, int>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `std::optional<raft::device_matrix_view<float, int>>` | Optional. [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int, int>` | Index of the cluster each sample in X belongs to. [len = n_samples] |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit_predict

Compute k-means clustering and predict the cluster index for each sample in the input.

void fit_predict(raft::resources const& handle,
                 const kmeans::params& params,
                 raft::device_matrix_view<const float, int64_t> X,
                 std::optional<raft::device_vector_view<const float, int64_t>> sample_weight,
                 std::optional<raft::device_matrix_view<float, int64_t>> centroids,
                 raft::device_vector_view<int64_t, int64_t> labels,
                 raft::host_scalar_view<float> inertia,
                 raft::host_scalar_view<int64_t> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| `handle` | in | `raft::resources const&` | The raft handle. |
| `params` | in | `const kmeans::params&` | Parameters for KMeans model. |
| `X` | in | `raft::device_matrix_view<const float, int64_t>` | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| `sample_weight` | in | `std::optional<raft::device_vector_view<const float, int64_t>>` | Optional weights for each observation in X. [len = n_samples] |
| `centroids` | inout | `std::optional<raft::device_matrix_view<float, int64_t>>` | Optional. [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by `centroids`. [dim = n_clusters x n_features] |
| `labels` | out | `raft::device_vector_view<int64_t, int64_t>` | Index of the cluster each sample in X belongs to. [len = n_samples] |
| `inertia` | out | `raft::host_scalar_view<float>` | Sum of squared distances of samples to their closest cluster center. |
| `n_iter` | out | `raft::host_scalar_view<int64_t>` | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit_predict

Compute k-means clustering and predict the cluster index for each sample in the input.

void fit_predict(raft::resources const& handle,
                 const kmeans::params& params,
                 raft::device_matrix_view<const double, int> X,
                 std::optional<raft::device_vector_view<const double, int>> sample_weight,
                 std::optional<raft::device_matrix_view<double, int>> centroids,
                 raft::device_vector_view<int, int> labels,
                 raft::host_scalar_view<double> inertia,
                 raft::host_scalar_view<int> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | raft::resources const& | The raft handle. |
| params | in | const kmeans::params& | Parameters for KMeans model. |
| X | in | raft::device_matrix_view<const double, int> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| sample_weight | in | std::optional<raft::device_vector_view<const double, int>> | Optional weights for each observation in X. [len = n_samples] |
| centroids | inout | std::optional<raft::device_matrix_view<double, int>> | Optional. [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by ‘centroids’. [dim = n_clusters x n_features] |
| labels | out | raft::device_vector_view<int, int> | Index of the cluster each sample in X belongs to. [len = n_samples] |
| inertia | out | raft::host_scalar_view<double> | Sum of squared distances of samples to their closest cluster center. |
| n_iter | out | raft::host_scalar_view<int> | Number of iterations run. |

Returns

void

Additional overload: cluster::kmeans::fit_predict

Compute k-means clustering and predict the cluster index for each sample in the input.

void fit_predict(raft::resources const& handle,
                 const kmeans::params& params,
                 raft::device_matrix_view<const double, int64_t> X,
                 std::optional<raft::device_vector_view<const double, int64_t>> sample_weight,
                 std::optional<raft::device_matrix_view<double, int64_t>> centroids,
                 raft::device_vector_view<int64_t, int64_t> labels,
                 raft::host_scalar_view<double> inertia,
                 raft::host_scalar_view<int64_t> n_iter);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | raft::resources const& | The raft handle. |
| params | in | const kmeans::params& | Parameters for KMeans model. |
| X | in | raft::device_matrix_view<const double, int64_t> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| sample_weight | in | std::optional<raft::device_vector_view<const double, int64_t>> | Optional weights for each observation in X. [len = n_samples] |
| centroids | inout | std::optional<raft::device_matrix_view<double, int64_t>> | Optional. [in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed to by ‘centroids’. [dim = n_clusters x n_features] |
| labels | out | raft::device_vector_view<int64_t, int64_t> | Index of the cluster each sample in X belongs to. [len = n_samples] |
| inertia | out | raft::host_scalar_view<double> | Sum of squared distances of samples to their closest cluster center. |
| n_iter | out | raft::host_scalar_view<int64_t> | Number of iterations run. |

Returns

void
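
As a rough illustration of what the `labels` and `inertia` outputs of the overloads above contain, the following host-side sketch assigns each row of X to its nearest centroid and accumulates the squared distances. It assumes squared Euclidean distance and centroids that are already fitted. The names `assignment` and `assign_to_nearest` are hypothetical and not part of cuVS, which performs this work on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper type for this sketch: one label per sample plus total inertia.
struct assignment {
  std::vector<int> labels;
  double inertia;
};

// Host-side reference, NOT the cuVS implementation.
// X is row-major [n_samples x n_features]; centroids is row-major
// [n_clusters x n_features]. Distance is assumed to be squared Euclidean.
inline assignment assign_to_nearest(const std::vector<double>& X,
                                    const std::vector<double>& centroids,
                                    std::size_t n_features)
{
  assert(n_features > 0);
  assert(X.size() % n_features == 0);
  assert(!centroids.empty() && centroids.size() % n_features == 0);
  const std::size_t n_samples  = X.size() / n_features;
  const std::size_t n_clusters = centroids.size() / n_features;

  assignment out{std::vector<int>(n_samples, 0), 0.0};
  for (std::size_t i = 0; i < n_samples; ++i) {
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t j = 0; j < n_clusters; ++j) {
      double d2 = 0.0;
      for (std::size_t f = 0; f < n_features; ++f) {
        const double diff = X[i * n_features + f] - centroids[j * n_features + f];
        d2 += diff * diff;
      }
      // Keep the closest centroid seen so far for sample i.
      if (d2 < best) {
        best          = d2;
        out.labels[i] = static_cast<int>(j);
      }
    }
    // Inertia is the sum of squared distances to the chosen centroids.
    out.inertia += best;
  }
  return out;
}
```

The flat row-major vectors mirror the [n_samples x n_features] and [n_clusters x n_features] layouts listed in the parameter tables.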

Additional overload: cluster::kmeans::fit_predict

Compute balanced k-means clustering and predict the cluster index for each sample in the input.

void fit_predict(const raft::resources& handle,
                 cuvs::cluster::kmeans::balanced_params const& params,
                 raft::device_matrix_view<const float, int64_t> X,
                 raft::device_matrix_view<float, int64_t> centroids,
                 raft::device_vector_view<uint32_t, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | const raft::resources& | The raft handle. |
| params | in | cuvs::cluster::kmeans::balanced_params const& | Parameters for KMeans model. |
| X | in | raft::device_matrix_view<const float, int64_t> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | inout | raft::device_matrix_view<float, int64_t> | Optional [in] When init is InitMethod::Array, use centroids as the initial cluster centers [out] The generated centroids from the kmeans algorithm are stored at the address pointed by ‘centroids’. [dim = n_clusters x n_features] |
| labels | out | raft::device_vector_view<uint32_t, int64_t> | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

Additional overload: cluster::kmeans::fit_predict

Compute balanced k-means clustering and predict the cluster index for each sample in the input.

void fit_predict(const raft::resources& handle,
                 cuvs::cluster::kmeans::balanced_params const& params,
                 raft::device_matrix_view<const int8_t, int64_t> X,
                 raft::device_matrix_view<float, int64_t> centroids,
                 raft::device_vector_view<uint32_t, int64_t> labels);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | const raft::resources& | The raft handle. |
| params | in | cuvs::cluster::kmeans::balanced_params const& | Parameters for KMeans model. |
| X | in | raft::device_matrix_view<const int8_t, int64_t> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | inout | raft::device_matrix_view<float, int64_t> | Optional [in] When init is InitMethod::Array, use centroids as the initial cluster centers [out] The generated centroids from the kmeans algorithm are stored at the address pointed by ‘centroids’. [dim = n_clusters x n_features] |
| labels | out | raft::device_vector_view<uint32_t, int64_t> | Index of the cluster each sample in X belongs to. [len = n_samples] |

Returns

void

cluster::kmeans::transform

Transform X to a cluster-distance space.

void transform(raft::resources const& handle,
               const kmeans::params& params,
               raft::device_matrix_view<const float, int> X,
               raft::device_matrix_view<const float, int> centroids,
               raft::device_matrix_view<float, int> X_new);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | raft::resources const& | The raft handle. |
| params | in | const kmeans::params& | Parameters for KMeans model. |
| X | in | raft::device_matrix_view<const float, int> | Samples to transform. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | raft::device_matrix_view<const float, int> | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| X_new | out | raft::device_matrix_view<float, int> | X transformed into the cluster-distance space, one column per centroid. [dim = n_samples x n_clusters] |

Returns

void

Additional overload: cluster::kmeans::transform

Transform X to a cluster-distance space.

void transform(raft::resources const& handle,
               const kmeans::params& params,
               raft::device_matrix_view<const double, int> X,
               raft::device_matrix_view<const double, int> centroids,
               raft::device_matrix_view<double, int> X_new);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | raft::resources const& | The raft handle. |
| params | in | const kmeans::params& | Parameters for KMeans model. |
| X | in | raft::device_matrix_view<const double, int> | Samples to transform. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | raft::device_matrix_view<const double, int> | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| X_new | out | raft::device_matrix_view<double, int> | X transformed into the cluster-distance space, one column per centroid. [dim = n_samples x n_clusters] |

Returns

void
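
To make "cluster-distance space" concrete, the host-side sketch below computes, for every sample, its distance to every centroid, so each output row has one entry per cluster. It assumes plain Euclidean distance for illustration; the library call uses the metric configured in `params`, which may differ (for example a squared distance). `transform_reference` is a hypothetical name, not a cuVS function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference, NOT the cuVS implementation.
// Returns a row-major [n_samples x n_clusters] matrix whose entry (i, j) is
// the Euclidean distance (an assumption of this sketch) between sample i
// and centroid j.
inline std::vector<double> transform_reference(const std::vector<double>& X,
                                               const std::vector<double>& centroids,
                                               std::size_t n_features)
{
  assert(n_features > 0);
  assert(X.size() % n_features == 0);
  assert(centroids.size() % n_features == 0);
  const std::size_t n_samples  = X.size() / n_features;
  const std::size_t n_clusters = centroids.size() / n_features;

  std::vector<double> X_new(n_samples * n_clusters, 0.0);
  for (std::size_t i = 0; i < n_samples; ++i) {
    for (std::size_t j = 0; j < n_clusters; ++j) {
      double d2 = 0.0;
      for (std::size_t f = 0; f < n_features; ++f) {
        const double diff = X[i * n_features + f] - centroids[j * n_features + f];
        d2 += diff * diff;
      }
      X_new[i * n_clusters + j] = std::sqrt(d2);
    }
  }
  return X_new;
}
```

With 2 samples and 3 centroids the result holds 6 values, whatever the number of input features.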

cluster::kmeans::cluster_cost

Compute (optionally weighted) cluster cost

void cluster_cost(
  const raft::resources& handle,
  raft::device_matrix_view<const float, int> X,
  raft::device_matrix_view<const float, int> centroids,
  raft::host_scalar_view<float> cost,
  std::optional<raft::device_vector_view<const float, int>> sample_weight = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | const raft::resources& | The raft handle |
| X | in | raft::device_matrix_view<const float, int> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | raft::device_matrix_view<const float, int> | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| cost | out | raft::host_scalar_view<float> | Resulting cluster cost |
| sample_weight | in | std::optional<raft::device_vector_view<const float, int>> | Optional per-sample weights. [len = n_samples] Default: std::nullopt. |

Returns

void

Additional overload: cluster::kmeans::cluster_cost

Compute (optionally weighted) cluster cost

void cluster_cost(
  const raft::resources& handle,
  raft::device_matrix_view<const double, int> X,
  raft::device_matrix_view<const double, int> centroids,
  raft::host_scalar_view<double> cost,
  std::optional<raft::device_vector_view<const double, int>> sample_weight = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | const raft::resources& | The raft handle |
| X | in | raft::device_matrix_view<const double, int> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | raft::device_matrix_view<const double, int> | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| cost | out | raft::host_scalar_view<double> | Resulting cluster cost |
| sample_weight | in | std::optional<raft::device_vector_view<const double, int>> | Optional per-sample weights. [len = n_samples] Default: std::nullopt. |

Returns

void

Additional overload: cluster::kmeans::cluster_cost

Compute (optionally weighted) cluster cost

void cluster_cost(
  const raft::resources& handle,
  raft::device_matrix_view<const float, int64_t> X,
  raft::device_matrix_view<const float, int64_t> centroids,
  raft::host_scalar_view<float> cost,
  std::optional<raft::device_vector_view<const float, int64_t>> sample_weight = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | const raft::resources& | The raft handle |
| X | in | raft::device_matrix_view<const float, int64_t> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | raft::device_matrix_view<const float, int64_t> | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| cost | out | raft::host_scalar_view<float> | Resulting cluster cost |
| sample_weight | in | std::optional<raft::device_vector_view<const float, int64_t>> | Optional per-sample weights. [len = n_samples] Default: std::nullopt. |

Returns

void

Additional overload: cluster::kmeans::cluster_cost

Compute (optionally weighted) cluster cost

void cluster_cost(
  const raft::resources& handle,
  raft::device_matrix_view<const double, int64_t> X,
  raft::device_matrix_view<const double, int64_t> centroids,
  raft::host_scalar_view<double> cost,
  std::optional<raft::device_vector_view<const double, int64_t>> sample_weight = std::nullopt);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| handle | in | const raft::resources& | The raft handle |
| X | in | raft::device_matrix_view<const double, int64_t> | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | raft::device_matrix_view<const double, int64_t> | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| cost | out | raft::host_scalar_view<double> | Resulting cluster cost |
| sample_weight | in | std::optional<raft::device_vector_view<const double, int64_t>> | Optional per-sample weights. [len = n_samples] Default: std::nullopt. |

Returns

void
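
The effect of the optional `sample_weight` argument can be sketched on the host as follows. This is a minimal reference under stated assumptions, not the cuVS implementation: it assumes the cost is the sum over samples of the squared Euclidean distance to the nearest centroid, multiplied by the sample's weight when weights are given. `cluster_cost_reference` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>
#include <vector>

// Host-side reference, NOT the cuVS implementation.
// Assumed definition: cost = sum_i w_i * min_j ||x_i - c_j||^2,
// with w_i = 1 for every sample when no weights are supplied.
inline double cluster_cost_reference(
  const std::vector<double>& X,
  const std::vector<double>& centroids,
  std::size_t n_features,
  const std::optional<std::vector<double>>& sample_weight = std::nullopt)
{
  assert(n_features > 0);
  assert(X.size() % n_features == 0);
  assert(!centroids.empty() && centroids.size() % n_features == 0);
  const std::size_t n_samples  = X.size() / n_features;
  const std::size_t n_clusters = centroids.size() / n_features;
  if (sample_weight.has_value()) { assert(sample_weight->size() == n_samples); }

  double cost = 0.0;
  for (std::size_t i = 0; i < n_samples; ++i) {
    // Squared distance from sample i to its closest centroid.
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t j = 0; j < n_clusters; ++j) {
      double d2 = 0.0;
      for (std::size_t f = 0; f < n_features; ++f) {
        const double diff = X[i * n_features + f] - centroids[j * n_features + f];
        d2 += diff * diff;
      }
      if (d2 < best) { best = d2; }
    }
    const double w = sample_weight.has_value() ? (*sample_weight)[i] : 1.0;
    cost += w * best;
  }
  return cost;
}
```

Omitting the last argument mirrors the `std::nullopt` default in the signatures above and yields the unweighted cost.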

k-means API helpers

cluster::kmeans::helpers::find_k

Automatically find the optimal value of k using a binary search. This method maximizes the Calinski-Harabasz Index while minimizing the per-cluster inertia.

void find_k(raft::resources const& handle,
            raft::device_matrix_view<const float, int> X,
            raft::host_scalar_view<int> best_k,
            raft::host_scalar_view<float> inertia,
            raft::host_scalar_view<int> n_iter,
            int kmax,
            int kmin = 1,
            int maxiter = 100,
            float tol = 1e-3);

Parameters

NameDirectionTypeDescription
handleraft::resources const&raft handle
Xraft::device_matrix_view<const float, int>input observations (shape n_samples, n_dims)
best_kraft::host_scalar_view<int>best k found from binary search
inertiaraft::host_scalar_view<float>inertia of best k found
n_iterraft::host_scalar_view<int>number of iterations used to find best k
kmaxintmaximum k to try in search
kminintminimum k to try in search (should be >= 1)
Default: 1.
maxiterintmaximum number of iterations to run
Default: 100.
tolfloattolerance for early stopping convergence
Default: 1e-3.

Returns

void
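
For readers unfamiliar with the Calinski-Harabasz Index mentioned above, the host-side sketch below computes its textbook definition: the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k). It is only an illustration of the score under that standard definition and does not claim to reproduce how `find_k` evaluates it or how the binary search proceeds. `calinski_harabasz` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference of the textbook Calinski-Harabasz index,
// NOT the cuVS implementation. X is row-major [n_samples x n_features];
// labels[i] is the cluster of sample i, in [0, n_clusters).
// The result is infinite or NaN if the within-cluster dispersion is zero.
inline double calinski_harabasz(const std::vector<double>& X,
                                const std::vector<int>& labels,
                                std::size_t n_features,
                                std::size_t n_clusters)
{
  assert(n_features > 0 && X.size() % n_features == 0);
  const std::size_t n_samples = X.size() / n_features;
  assert(labels.size() == n_samples);
  assert(n_clusters > 1 && n_samples > n_clusters);

  // Accumulate per-cluster sums and the global sum in one pass.
  std::vector<double> centroids(n_clusters * n_features, 0.0);
  std::vector<double> mean(n_features, 0.0);
  std::vector<std::size_t> counts(n_clusters, 0);
  for (std::size_t i = 0; i < n_samples; ++i) {
    const std::size_t c = static_cast<std::size_t>(labels[i]);
    assert(c < n_clusters);
    ++counts[c];
    for (std::size_t f = 0; f < n_features; ++f) {
      centroids[c * n_features + f] += X[i * n_features + f];
      mean[f] += X[i * n_features + f];
    }
  }
  for (std::size_t f = 0; f < n_features; ++f) {
    mean[f] /= static_cast<double>(n_samples);
  }
  for (std::size_t c = 0; c < n_clusters; ++c) {
    if (counts[c] == 0) { continue; }  // empty clusters contribute nothing
    for (std::size_t f = 0; f < n_features; ++f) {
      centroids[c * n_features + f] /= static_cast<double>(counts[c]);
    }
  }

  // Between-cluster dispersion: cluster sizes times squared centroid offsets.
  double between = 0.0;
  for (std::size_t c = 0; c < n_clusters; ++c) {
    double d2 = 0.0;
    for (std::size_t f = 0; f < n_features; ++f) {
      const double diff = centroids[c * n_features + f] - mean[f];
      d2 += diff * diff;
    }
    between += static_cast<double>(counts[c]) * d2;
  }

  // Within-cluster dispersion: squared distances to each sample's own centroid.
  double within = 0.0;
  for (std::size_t i = 0; i < n_samples; ++i) {
    const std::size_t c = static_cast<std::size_t>(labels[i]);
    for (std::size_t f = 0; f < n_features; ++f) {
      const double diff = X[i * n_features + f] - centroids[c * n_features + f];
      within += diff * diff;
    }
  }

  return (between / static_cast<double>(n_clusters - 1)) /
         (within / static_cast<double>(n_samples - n_clusters));
}
```

Higher values indicate tighter, better separated clusters. The index is undefined for a single cluster, which is why this sketch requires n_clusters > 1.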