Cluster Kmeans

Source header: cuvs/cluster/kmeans.h

k-means hyperparameters

cuvsKMeansInitMethod

k-means hyperparameters

1 typedef enum { ... } cuvsKMeansInitMethod;

Values

Name	Value
`KMeansPlusPlus`	`0`
`Random`	`1`
`Array`	`2`

cuvsKMeansParams

Hyper-parameters for the kmeans algorithm

NB: The inertia_check field is kept for ABI compatibility. Removed in cuvsKMeansParams_v2. TODO: CalVer for the replacement: 26.08

1 struct cuvsKMeansParams { ... };

Fields

Name	Type	Description
`n_clusters`	`int`	The number of clusters to form as well as the number of centroids to generate (default:8).
`init`	`cuvsKMeansInitMethod`	Method for initialization, defaults to k-means++: - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm to select the initial cluster centers. - cuvsKMeansInitMethod::Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids. - cuvsKMeansInitMethod::Array (ndarray): Use ‘centroids’ as initial cluster centers.
`max_iter`	`int`	Maximum number of iterations of the k-means algorithm for a single run.
`tol`	`double`	Relative tolerance with regards to inertia to declare convergence.
`n_init`	`int`	Number of instance k-means algorithm will be run with different seeds.
`oversampling_factor`	`double`	Oversampling factor for use in the k-means\|\| algorithm
`batch_samples`	`int`	batch_samples and batch_centroids are used to tile 1NN computation which is useful to optimize/control the memory footprint Default tile is [batch_samples x n_clusters] i.e. when batch_centroids is 0 then don’t tile the centroids
`batch_centroids`	`int`	if 0 then batch_centroids = n_clusters
`inertia_check`	`bool`	Deprecated, ignored. Kept for ABI compatibility.
`hierarchical`	`bool`	Whether to use hierarchical (balanced) kmeans or not
`hierarchical_n_iters`	`int`	For hierarchical k-means , defines the number of training iterations
`streaming_batch_size`	`int64_t`	Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
`init_size`	`int64_t`	Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data.
`metric`	`cuvsDistanceType`

cuvsKMeansParams_v2

Hyper-parameters for the kmeans algorithm

TODO: Remove this after cuvsKMeansParams is replaced in ABI 2.0

1 struct cuvsKMeansParams_v2 { ... };

Fields

Name	Type	Description
`n_clusters`	`int`	The number of clusters to form as well as the number of centroids to generate (default:8).
`init`	`cuvsKMeansInitMethod`	Method for initialization, defaults to k-means++: - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm to select the initial cluster centers. - cuvsKMeansInitMethod::Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids. - cuvsKMeansInitMethod::Array (ndarray): Use ‘centroids’ as initial cluster centers.
`max_iter`	`int`	Maximum number of iterations of the k-means algorithm for a single run.
`tol`	`double`	Relative tolerance with regards to inertia to declare convergence.
`n_init`	`int`	Number of instance k-means algorithm will be run with different seeds.
`oversampling_factor`	`double`	Oversampling factor for use in the k-means\|\| algorithm
`batch_samples`	`int`	batch_samples and batch_centroids are used to tile 1NN computation which is useful to optimize/control the memory footprint Default tile is [batch_samples x n_clusters] i.e. when batch_centroids is 0 then don’t tile the centroids
`batch_centroids`	`int`	if 0 then batch_centroids = n_clusters
`hierarchical`	`bool`	Whether to use hierarchical (balanced) kmeans or not
`hierarchical_n_iters`	`int`	For hierarchical k-means , defines the number of training iterations
`streaming_batch_size`	`int64_t`	Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
`init_size`	`int64_t`	Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data.
`metric`	`cuvsDistanceType`

cuvsKMeansParamsCreate

Allocate KMeans params, and populate with default values

1 CUVS_EXPORT cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t* params);

replaced by cuvsKMeansParamsCreate_v2.

Parameters

Name	Direction	Type	Description
`params`	in	`cuvsKMeansParams_t*`	cuvsKMeansParams_t to allocate

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansParamsDestroy

De-allocate KMeans params

1 CUVS_EXPORT cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params);

replaced by cuvsKMeansParamsDestroy_v2.

Parameters

Name	Direction	Type	Description
`params`	in	`cuvsKMeansParams_t`

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansParamsCreate_v2

Allocate KMeans params

1 CUVS_EXPORT cuvsError_t cuvsKMeansParamsCreate_v2(cuvsKMeansParams_v2_t* params);

Mirrors cuvsKMeansParamsCreate but operates on cuvsKMeansParams_v2. Will become the unsuffixed cuvsKMeansParamsCreate in cuVS 26.08.

Parameters

Name	Direction	Type	Description
`params`	in	`cuvsKMeansParams_v2_t*`	cuvsKMeansParams_v2_t to allocate

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansParamsDestroy_v2

De-allocate KMeans params allocated by cuvsKMeansParamsCreate_v2.

1 CUVS_EXPORT cuvsError_t cuvsKMeansParamsDestroy_v2(cuvsKMeansParams_v2_t params);

Parameters

Name	Direction	Type	Description
`params`	in	`cuvsKMeansParams_v2_t`

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansType

Type of k-means algorithm.

1 typedef enum { ... } cuvsKMeansType;

Values

Name	Value
`CUVS_KMEANS_TYPE_KMEANS`	`0`
`CUVS_KMEANS_TYPE_KMEANS_BALANCED`	`1`

k-means clustering APIs

cuvsKMeansFit

Find clusters with k-means algorithm.

1 CUVS_EXPORT cuvsError_t cuvsKMeansFit(cuvsResources_t res,
2 cuvsKMeansParams_t params,
3 DLManagedTensor* X,
4 DLManagedTensor* sample_weight,
5 DLManagedTensor* centroids,
6 double* inertia,
7 int* n_iter);

Initial centroids are chosen with k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with k-means++ algorithm.

X may reside on either host (CPU) or device (GPU) memory. When X is on the host the data is streamed to the GPU in batches controlled by params->streaming_batch_size.

replaced by cuvsKMeansFit_v2.

Parameters

Name	Direction	Type	Description
`res`	in	`cuvsResources_t`	opaque C handle
`params`	in	`cuvsKMeansParams_t`	Parameters for KMeans model.
`X`	in	`DLManagedTensor*`	Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
`sample_weight`	in	`DLManagedTensor*`	Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
`centroids`	inout	`DLManagedTensor*`	[in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed by ‘centroids’. Must be on device. [dim = n_clusters x n_features]
`inertia`	out	`double*`	Sum of squared distances of samples to their closest cluster center.
`n_iter`	out	`int*`	Number of iterations run.

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansFit_v2

Find clusters with k-means algorithm (v2 params layout).

1 CUVS_EXPORT cuvsError_t cuvsKMeansFit_v2(cuvsResources_t res,
2 cuvsKMeansParams_v2_t params,
3 DLManagedTensor* X,
4 DLManagedTensor* sample_weight,
5 DLManagedTensor* centroids,
6 double* inertia,
7 int* n_iter);

Mirrors cuvsKMeansFit but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansFit in cuVS 26.08.

Parameters

Name	Direction	Type	Description
`res`	in	`cuvsResources_t`	opaque C handle
`params`	in	`cuvsKMeansParams_v2_t`	Parameters for KMeans model (v2 layout).
`X`	in	`DLManagedTensor*`	Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
`sample_weight`	in	`DLManagedTensor*`	Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
`centroids`	inout	`DLManagedTensor*`	[in] When init is InitMethod::Array, use centroids as the initial cluster centers. [out] The generated centroids from the kmeans algorithm are stored at the address pointed by ‘centroids’. Must be on device. [dim = n_clusters x n_features]
`inertia`	out	`double*`	Sum of squared distances of samples to their closest cluster center.
`n_iter`	out	`int*`	Number of iterations run.

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansPredict

Predict the closest cluster each sample in X belongs to.

1 CUVS_EXPORT cuvsError_t cuvsKMeansPredict(cuvsResources_t res,
2 cuvsKMeansParams_t params,
3 DLManagedTensor* X,
4 DLManagedTensor* sample_weight,
5 DLManagedTensor* centroids,
6 DLManagedTensor* labels,
7 bool normalize_weight,
8 double* inertia);

replaced by cuvsKMeansPredict_v2.

Parameters

Name	Direction	Type	Description
`res`	in	`cuvsResources_t`	opaque C handle
`params`	in	`cuvsKMeansParams_t`	Parameters for KMeans model.
`X`	in	`DLManagedTensor*`	New data to predict. [dim = n_samples x n_features]
`sample_weight`	in	`DLManagedTensor*`	Optional weights for each observation in X. [len = n_samples]
`centroids`	in	`DLManagedTensor*`	Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
`labels`	out	`DLManagedTensor*`	Index of the cluster each sample in X belongs to. [len = n_samples]
`normalize_weight`	in	`bool`	True if the weights should be normalized
`inertia`	out	`double*`	Sum of squared distances of samples to their closest cluster center.

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansPredict_v2

Predict the closest cluster each sample in X belongs to (v2 params layout).

1 CUVS_EXPORT cuvsError_t cuvsKMeansPredict_v2(cuvsResources_t res,
2 cuvsKMeansParams_v2_t params,
3 DLManagedTensor* X,
4 DLManagedTensor* sample_weight,
5 DLManagedTensor* centroids,
6 DLManagedTensor* labels,
7 bool normalize_weight,
8 double* inertia);

Mirrors cuvsKMeansPredict but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansPredict in cuVS 26.08.

Parameters

Name	Direction	Type	Description
`res`	in	`cuvsResources_t`	opaque C handle
`params`	in	`cuvsKMeansParams_v2_t`	Parameters for KMeans model (v2 layout).
`X`	in	`DLManagedTensor*`	New data to predict. [dim = n_samples x n_features]
`sample_weight`	in	`DLManagedTensor*`	Optional weights for each observation in X. [len = n_samples]
`centroids`	in	`DLManagedTensor*`	Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
`labels`	out	`DLManagedTensor*`	Index of the cluster each sample in X belongs to. [len = n_samples]
`normalize_weight`	in	`bool`	True if the weights should be normalized
`inertia`	out	`double*`	Sum of squared distances of samples to their closest cluster center.

Returns

CUVS_EXPORT cuvsError_t

cuvsKMeansClusterCost

Compute cluster cost

1 CUVS_EXPORT cuvsError_t cuvsKMeansClusterCost(cuvsResources_t res,
2 DLManagedTensor* X,
3 DLManagedTensor* centroids,
4 double* cost);

Parameters

Name	Direction	Type	Description
`res`	in	`cuvsResources_t`	opaque C handle
`X`	in	`DLManagedTensor*`	Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features]
`centroids`	in	`DLManagedTensor*`	Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
`cost`	out	`double*`	Resulting cluster cost

Returns

CUVS_EXPORT cuvsError_t