K-Means


Source header: cuvs/cluster/kmeans.h

k-means hyperparameters

cuvsKMeansInitMethod

k-means hyperparameters

typedef enum { ... } cuvsKMeansInitMethod;

Values

| Name | Value |
| --- | --- |
| KMeansPlusPlus | 0 |
| Random | 1 |
| Array | 2 |

cuvsKMeansParams

Hyper-parameters for the kmeans algorithm

Note: the inertia_check field is deprecated and retained only for ABI compatibility. It is removed in cuvsKMeansParams_v2, which is scheduled to replace this struct in cuVS 26.08.

struct cuvsKMeansParams { ... };

Fields

| Name | Type | Description |
| --- | --- | --- |
| n_clusters | int | The number of clusters to form, which is also the number of centroids to generate (default: 8). |
| init | cuvsKMeansInitMethod | Initialization method; defaults to k-means++. KMeansPlusPlus (k-means++): use the scalable k-means++ algorithm to select the initial cluster centers. Random (random): choose n_clusters observations (rows) at random from the input data as the initial centroids. Array (ndarray): use centroids as the initial cluster centers. |
| max_iter | int | Maximum number of iterations of the k-means algorithm for a single run. |
| tol | double | Relative tolerance with regard to inertia used to declare convergence. |
| n_init | int | Number of times the k-means algorithm is run with different seeds. |
| oversampling_factor | double | Oversampling factor for use in the k-means\|\| algorithm. |
| batch_samples | int | batch_samples and batch_centroids tile the 1NN computation, which helps control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0 the centroids are not tiled. |
| batch_centroids | int | If 0, then batch_centroids = n_clusters. |
| inertia_check | bool | Deprecated and ignored. Kept for ABI compatibility. |
| hierarchical | bool | Whether to use hierarchical (balanced) k-means. |
| hierarchical_n_iters | int | For hierarchical k-means, the number of training iterations. |
| streaming_batch_size | int64_t | Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once). |
| init_size | int64_t | Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses the heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data. |
| metric | cuvsDistanceType | |
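
The "0 means default" rules for init_size and streaming_batch_size can be restated as plain host-side C. This is a minimal sketch for illustration only: the helper names resolve_init_size, resolve_streaming_batch_size, and count_streaming_batches are hypothetical and are not part of cuVS, and the sketch assumes that non-zero values are used unchanged (any clamping the library may apply is not modeled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a cuVS function): number of samples drawn for
 * KMeansPlusPlus initialization. 0 selects the documented heuristic. */
int64_t resolve_init_size(int64_t init_size, int n_clusters,
                          int64_t n_samples, bool x_on_host)
{
  assert(n_clusters > 0 && n_samples > 0 && init_size >= 0);
  if (init_size != 0) { return init_size; } /* assumption: used as given */
  if (!x_on_host) { return n_samples; }     /* device data: all samples */
  int64_t heuristic = 3 * (int64_t)n_clusters;
  return heuristic < n_samples ? heuristic : n_samples;
}

/* Hypothetical helper: samples per GPU batch when X is on the host. */
int64_t resolve_streaming_batch_size(int64_t streaming_batch_size,
                                     int64_t n_samples)
{
  assert(n_samples > 0 && streaming_batch_size >= 0);
  return streaming_batch_size == 0 ? n_samples : streaming_batch_size;
}

/* Number of batches needed to cover all samples (ceiling division). */
int64_t count_streaming_batches(int64_t streaming_batch_size,
                                int64_t n_samples)
{
  int64_t batch = resolve_streaming_batch_size(streaming_batch_size, n_samples);
  return (n_samples + batch - 1) / batch;
}
```

For instance, with n_clusters = 8 and 1000 host-resident samples, an init_size of 0 resolves to 24 samples, and a streaming_batch_size of 300 gives four batches.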

cuvsKMeansParams_v2

Hyper-parameters for the kmeans algorithm

TODO: Remove this after cuvsKMeansParams is replaced in ABI 2.0

struct cuvsKMeansParams_v2 { ... };

Fields

| Name | Type | Description |
| --- | --- | --- |
| n_clusters | int | The number of clusters to form, which is also the number of centroids to generate (default: 8). |
| init | cuvsKMeansInitMethod | Initialization method; defaults to k-means++. KMeansPlusPlus (k-means++): use the scalable k-means++ algorithm to select the initial cluster centers. Random (random): choose n_clusters observations (rows) at random from the input data as the initial centroids. Array (ndarray): use centroids as the initial cluster centers. |
| max_iter | int | Maximum number of iterations of the k-means algorithm for a single run. |
| tol | double | Relative tolerance with regard to inertia used to declare convergence. |
| n_init | int | Number of times the k-means algorithm is run with different seeds. |
| oversampling_factor | double | Oversampling factor for use in the k-means\|\| algorithm. |
| batch_samples | int | batch_samples and batch_centroids tile the 1NN computation, which helps control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0 the centroids are not tiled. |
| batch_centroids | int | If 0, then batch_centroids = n_clusters. |
| hierarchical | bool | Whether to use hierarchical (balanced) k-means. |
| hierarchical_n_iters | int | For hierarchical k-means, the number of training iterations. |
| streaming_batch_size | int64_t | Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once). |
| init_size | int64_t | Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses the heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data. |
| metric | cuvsDistanceType | |
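
To make the batch_samples / batch_centroids tiling concrete, the sketch below computes the tile shape and tile count for the nearest-centroid (1NN) distance computation. It is an illustration under stated assumptions, not the library's implementation: tile_plan and plan_1nn_tiles are hypothetical names, batch_samples is assumed to be positive, and tile dimensions are assumed to be clamped to the problem size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of how the [n_samples x n_clusters] distance
 * computation is split into tiles. */
typedef struct {
  int64_t rows;  /* samples per tile */
  int64_t cols;  /* centroids per tile */
  int64_t count; /* total number of tiles */
} tile_plan;

/* Hypothetical helper (not a cuVS function). */
tile_plan plan_1nn_tiles(int64_t n_samples, int n_clusters,
                         int batch_samples, int batch_centroids)
{
  assert(n_samples > 0 && n_clusters > 0);
  assert(batch_samples > 0 && batch_centroids >= 0);

  tile_plan plan;
  /* Assumption: a tile never exceeds the problem size. */
  plan.rows = batch_samples < n_samples ? batch_samples : n_samples;

  /* Documented rule: batch_centroids == 0 means "do not tile the centroids". */
  int effective = batch_centroids == 0 ? n_clusters : batch_centroids;
  plan.cols = effective < n_clusters ? effective : n_clusters;

  /* Ceiling division along each axis. */
  int64_t row_tiles = (n_samples + plan.rows - 1) / plan.rows;
  int64_t col_tiles = (n_clusters + plan.cols - 1) / plan.cols;
  plan.count = row_tiles * col_tiles;
  return plan;
}
```

Smaller tiles reduce the size of the temporary distance buffer (rows x cols values per tile) at the cost of more passes over the data.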

cuvsKMeansParamsCreate

Allocate KMeans params, and populate with default values

CUVS_EXPORT cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t* params);

Replaced by cuvsKMeansParamsCreate_v2.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| params | in | cuvsKMeansParams_t* | cuvsKMeansParams_t to allocate |

Returns

cuvsError_t

cuvsKMeansParamsDestroy

De-allocate KMeans params

CUVS_EXPORT cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params);

Replaced by cuvsKMeansParamsDestroy_v2.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| params | in | cuvsKMeansParams_t | |

Returns

cuvsError_t

cuvsKMeansParamsCreate_v2

Allocate KMeans params

CUVS_EXPORT cuvsError_t cuvsKMeansParamsCreate_v2(cuvsKMeansParams_v2_t* params);

Mirrors cuvsKMeansParamsCreate but operates on cuvsKMeansParams_v2. Will become the unsuffixed cuvsKMeansParamsCreate in cuVS 26.08.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| params | in | cuvsKMeansParams_v2_t* | cuvsKMeansParams_v2_t to allocate |

Returns

cuvsError_t

cuvsKMeansParamsDestroy_v2

De-allocate KMeans params allocated by cuvsKMeansParamsCreate_v2.

CUVS_EXPORT cuvsError_t cuvsKMeansParamsDestroy_v2(cuvsKMeansParams_v2_t params);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| params | in | cuvsKMeansParams_v2_t | |

Returns

cuvsError_t

cuvsKMeansType

Type of k-means algorithm.

typedef enum { ... } cuvsKMeansType;

Values

| Name | Value |
| --- | --- |
| CUVS_KMEANS_TYPE_KMEANS | 0 |
| CUVS_KMEANS_TYPE_KMEANS_BALANCED | 1 |

k-means clustering APIs

cuvsKMeansFit

Find clusters with k-means algorithm.

CUVS_EXPORT cuvsError_t cuvsKMeansFit(cuvsResources_t res,
                                      cuvsKMeansParams_t params,
                                      DLManagedTensor* X,
                                      DLManagedTensor* sample_weight,
                                      DLManagedTensor* centroids,
                                      double* inertia,
                                      int* n_iter);

By default, initial centroids are chosen with the k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with the k-means++ algorithm.

X may reside in either host (CPU) or device (GPU) memory. When X is on the host, the data is streamed to the GPU in batches whose size is controlled by params->streaming_batch_size.

Replaced by cuvsKMeansFit_v2.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| res | in | cuvsResources_t | Opaque C handle. |
| params | in | cuvsKMeansParams_t | Parameters for the KMeans model. |
| X | in | DLManagedTensor* | Training instances to cluster. The data must be in row-major format. May be in host or device memory. [dim = n_samples x n_features] |
| sample_weight | in | DLManagedTensor* | Optional weights for each observation in X. Must be in the same memory space as X. [len = n_samples] |
| centroids | inout | DLManagedTensor* | [in] When init is Array, centroids is used as the initial cluster centers. [out] The centroids generated by the k-means algorithm are stored at the address pointed to by centroids. Must be on device. [dim = n_clusters x n_features] |
| inertia | out | double* | Sum of squared distances of samples to their closest cluster center. |
| n_iter | out | int* | Number of iterations run. |

Returns

cuvsError_t
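
The inertia output is described as the sum of squared distances of samples to their closest cluster center. A host-side reference for that definition is sketched below, which may be handy for sanity-checking results on small inputs. It is not the cuVS implementation: squared_distance and reference_inertia are hypothetical names, the sketch assumes float data, unweighted samples, and squared Euclidean distance, and it ignores the metric field.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two vectors of length n_features. */
double squared_distance(const float* a, const float* b, size_t n_features)
{
  double sum = 0.0;
  for (size_t j = 0; j < n_features; ++j) {
    double diff = (double)a[j] - (double)b[j];
    sum += diff * diff;
  }
  return sum;
}

/* Hypothetical reference (not a cuVS function): sum over all samples of the
 * squared distance to the nearest centroid. Both X [n_samples x n_features]
 * and centroids [n_clusters x n_features] are row-major. */
double reference_inertia(const float* X, size_t n_samples,
                         const float* centroids, size_t n_clusters,
                         size_t n_features)
{
  assert(X != NULL && centroids != NULL && n_clusters > 0);
  double inertia = 0.0;
  for (size_t i = 0; i < n_samples; ++i) {
    double best = DBL_MAX;
    for (size_t k = 0; k < n_clusters; ++k) {
      double d = squared_distance(X + i * n_features,
                                  centroids + k * n_features, n_features);
      if (d < best) { best = d; }
    }
    inertia += best;
  }
  return inertia;
}
```

Because the GPU computation may accumulate in a different order or precision, a comparison against this reference should use a tolerance rather than exact equality.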

cuvsKMeansFit_v2

Find clusters with k-means algorithm (v2 params layout).

CUVS_EXPORT cuvsError_t cuvsKMeansFit_v2(cuvsResources_t res,
                                         cuvsKMeansParams_v2_t params,
                                         DLManagedTensor* X,
                                         DLManagedTensor* sample_weight,
                                         DLManagedTensor* centroids,
                                         double* inertia,
                                         int* n_iter);

Mirrors cuvsKMeansFit but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansFit in cuVS 26.08.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| res | in | cuvsResources_t | Opaque C handle. |
| params | in | cuvsKMeansParams_v2_t | Parameters for the KMeans model (v2 layout). |
| X | in | DLManagedTensor* | Training instances to cluster. The data must be in row-major format. May be in host or device memory. [dim = n_samples x n_features] |
| sample_weight | in | DLManagedTensor* | Optional weights for each observation in X. Must be in the same memory space as X. [len = n_samples] |
| centroids | inout | DLManagedTensor* | [in] When init is Array, centroids is used as the initial cluster centers. [out] The centroids generated by the k-means algorithm are stored at the address pointed to by centroids. Must be on device. [dim = n_clusters x n_features] |
| inertia | out | double* | Sum of squared distances of samples to their closest cluster center. |
| n_iter | out | int* | Number of iterations run. |

Returns

cuvsError_t

cuvsKMeansPredict

Predict the closest cluster each sample in X belongs to.

CUVS_EXPORT cuvsError_t cuvsKMeansPredict(cuvsResources_t res,
                                          cuvsKMeansParams_t params,
                                          DLManagedTensor* X,
                                          DLManagedTensor* sample_weight,
                                          DLManagedTensor* centroids,
                                          DLManagedTensor* labels,
                                          bool normalize_weight,
                                          double* inertia);

Replaced by cuvsKMeansPredict_v2.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| res | in | cuvsResources_t | Opaque C handle. |
| params | in | cuvsKMeansParams_t | Parameters for the KMeans model. |
| X | in | DLManagedTensor* | New data to predict. [dim = n_samples x n_features] |
| sample_weight | in | DLManagedTensor* | Optional weights for each observation in X. [len = n_samples] |
| centroids | in | DLManagedTensor* | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| labels | out | DLManagedTensor* | Index of the cluster each sample in X belongs to. [len = n_samples] |
| normalize_weight | in | bool | True if the weights should be normalized. |
| inertia | out | double* | Sum of squared distances of samples to their closest cluster center. |

Returns

cuvsError_t
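
Conceptually, prediction assigns each row of X the index of its nearest centroid. The sketch below shows that assignment on the host for small inputs. It is an illustration rather than the library's code: reference_predict is a hypothetical name, the sketch assumes float data, int labels, and squared Euclidean distance, it breaks ties in favor of the lowest centroid index, and it ignores sample_weight and normalize_weight.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical reference (not a cuVS function): writes to labels[i] the index
 * of the centroid nearest to row i of X. X [n_samples x n_features] and
 * centroids [n_clusters x n_features] are row-major. */
void reference_predict(const float* X, size_t n_samples,
                       const float* centroids, size_t n_clusters,
                       size_t n_features, int* labels)
{
  assert(X != NULL && centroids != NULL && labels != NULL && n_clusters > 0);
  for (size_t i = 0; i < n_samples; ++i) {
    double best = DBL_MAX;
    int best_k  = 0;
    for (size_t k = 0; k < n_clusters; ++k) {
      /* Squared Euclidean distance between sample i and centroid k. */
      double d = 0.0;
      for (size_t j = 0; j < n_features; ++j) {
        double diff = (double)X[i * n_features + j] -
                      (double)centroids[k * n_features + j];
        d += diff * diff;
      }
      /* Strict comparison keeps the lowest index on ties (an assumption). */
      if (d < best) {
        best   = d;
        best_k = (int)k;
      }
    }
    labels[i] = best_k;
  }
}
```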

cuvsKMeansPredict_v2

Predict the closest cluster each sample in X belongs to (v2 params layout).

CUVS_EXPORT cuvsError_t cuvsKMeansPredict_v2(cuvsResources_t res,
                                             cuvsKMeansParams_v2_t params,
                                             DLManagedTensor* X,
                                             DLManagedTensor* sample_weight,
                                             DLManagedTensor* centroids,
                                             DLManagedTensor* labels,
                                             bool normalize_weight,
                                             double* inertia);

Mirrors cuvsKMeansPredict but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansPredict in cuVS 26.08.

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| res | in | cuvsResources_t | Opaque C handle. |
| params | in | cuvsKMeansParams_v2_t | Parameters for the KMeans model (v2 layout). |
| X | in | DLManagedTensor* | New data to predict. [dim = n_samples x n_features] |
| sample_weight | in | DLManagedTensor* | Optional weights for each observation in X. [len = n_samples] |
| centroids | in | DLManagedTensor* | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| labels | out | DLManagedTensor* | Index of the cluster each sample in X belongs to. [len = n_samples] |
| normalize_weight | in | bool | True if the weights should be normalized. |
| inertia | out | double* | Sum of squared distances of samples to their closest cluster center. |

Returns

cuvsError_t

cuvsKMeansClusterCost

Compute cluster cost

CUVS_EXPORT cuvsError_t cuvsKMeansClusterCost(cuvsResources_t res,
                                              DLManagedTensor* X,
                                              DLManagedTensor* centroids,
                                              double* cost);

Parameters

| Name | Direction | Type | Description |
| --- | --- | --- | --- |
| res | in | cuvsResources_t | Opaque C handle. |
| X | in | DLManagedTensor* | Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features] |
| centroids | in | DLManagedTensor* | Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features] |
| cost | out | double* | Resulting cluster cost. |

Returns

cuvsError_t