Kmeans

Python module: cuvs.cluster.kmeans

KMeansParams

cdef class KMeansParams

Hyperparameters for the k-means algorithm.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| metric | str | String denoting the metric type. |
| n_clusters | int | The number of clusters to form, which is also the number of centroids to generate. |
| init_method | str | Method for initializing clusters. One of: "KMeansPlusPlus" (use the scalable k-means++ algorithm to select initial cluster centers), "Random" (choose n_clusters observations at random from the input data), or "Array" (use the supplied centroids as initial cluster centers). |
| max_iter | int | Maximum number of iterations of the k-means algorithm for a single run. |
| tol | float | Relative tolerance with regard to inertia to declare convergence. |
| n_init | int | Number of times the k-means algorithm is run with different seeds. |
| oversampling_factor | double | Oversampling factor for use in the k-means\|\| algorithm. |
| batch_samples | int | Number of samples to process in each batch for tiled 1NN computation. Useful to optimize or control the memory footprint. The default tile is [batch_samples x n_clusters]. |
| batch_centroids | int | Number of centroids to process in each batch. If 0, uses n_clusters. |
| inertia_check | bool | Deprecated and ignored. Will be removed in a future release. Inertia-based convergence checking always runs. |
| init_size | int | Number of samples to draw for KMeansPlusPlus initialization with host (out-of-core) data. When set to 0, uses the heuristic min(3 * n_clusters, n_samples). Default: 0. |
| streaming_batch_size | int | Number of samples to process per GPU batch when fitting with host (numpy) data. When set to 0, defaults to n_samples (process all at once). Only used by the batched (host-data) code path. Reducing streaming_batch_size can reduce GPU memory pressure but adds overhead, because centroid adjustments are computed more often. Default: 0 (process all data at once). |
| hierarchical | bool | Whether to use hierarchical (balanced) k-means. |
| hierarchical_n_iters | int | For hierarchical k-means, the number of training iterations. |
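
The zero-valued defaults for init_size and streaming_batch_size resolve to data-dependent sizes. The following is a minimal sketch of that arithmetic in plain Python. The helper names `effective_init_size`, `effective_streaming_batch_size`, and `n_streaming_batches` are hypothetical and are not part of cuVS, and the batch count assumes that the final batch may be partial.

```python
import math


def effective_init_size(init_size: int, n_clusters: int, n_samples: int) -> int:
    """Hypothetical helper: resolve init_size=0 to min(3 * n_clusters, n_samples)."""
    if init_size == 0:
        return min(3 * n_clusters, n_samples)
    return init_size


def effective_streaming_batch_size(streaming_batch_size: int, n_samples: int) -> int:
    """Hypothetical helper: resolve streaming_batch_size=0 to n_samples."""
    if streaming_batch_size == 0:
        return n_samples
    return streaming_batch_size


def n_streaming_batches(streaming_batch_size: int, n_samples: int) -> int:
    """Batches per pass over the data, assuming the last batch may be partial."""
    batch = effective_streaming_batch_size(streaming_batch_size, n_samples)
    return math.ceil(n_samples / batch)
```

Under these assumptions, 10,000,000 host samples with streaming_batch_size=1_000_000 give ten batches per pass, while the default of 0 gives a single batch.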

Constructor

def __init__(self, *, metric=None, n_clusters=None, init_method=None, max_iter=None, tol=None, n_init=None, oversampling_factor=None, batch_samples=None, batch_centroids=None, inertia_check=None, init_size=None, streaming_batch_size=None, hierarchical=None, hierarchical_n_iters=None)

Members

| Name | Kind |
| --- | --- |
| metric | property |
| n_clusters | property |
| init_method | property |
| max_iter | property |
| tol | property |
| n_init | property |
| oversampling_factor | property |
| batch_samples | property |
| batch_centroids | property |
| init_size | property |
| streaming_batch_size | property |
| hierarchical | property |
| hierarchical_n_iters | property |

metric

def metric(self)

n_clusters

def n_clusters(self)

init_method

def init_method(self)

max_iter

def max_iter(self)

tol

def tol(self)

n_init

def n_init(self)

oversampling_factor

def oversampling_factor(self)

batch_samples

def batch_samples(self)

batch_centroids

def batch_centroids(self)

init_size

def init_size(self)

streaming_batch_size

def streaming_batch_size(self)

hierarchical

def hierarchical(self)

hierarchical_n_iters

def hierarchical_n_iters(self)

cluster_cost

@auto_sync_resources @auto_convert_output

def cluster_cost(X, centroids, resources=None)

Compute cluster cost given an input matrix and existing centroids

Parameters

| Name | Type | Description |
| --- | --- | --- |
| X | CUDA array interface compliant matrix | Input matrix, shape (m, k). |
| centroids | CUDA array interface compliant matrix | Input centroids, shape (n_clusters, k). |
| resources | cuvs.common.Resources, optional | |

Returns

| Name | Type | Description |
| --- | --- | --- |
| inertia | float | The cluster cost between the input matrix and existing centroids. |

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import cluster_cost
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
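
For spot-checking on small inputs, the quantity can be reproduced on the CPU. This is a minimal NumPy sketch, not the cuVS implementation. It assumes a squared Euclidean metric and assumes that the cluster cost is the same inertia that fit reports (the sum of squared distances of samples to their closest cluster center). The function name `cluster_cost_reference` is hypothetical.

```python
import numpy as np


def cluster_cost_reference(X: np.ndarray, centroids: np.ndarray) -> float:
    """Hypothetical CPU reference: sum of squared Euclidean distances
    from each row of X to its nearest centroid (assumed definition)."""
    # Pairwise squared distances, shape (m, n_clusters).
    diffs = X[:, None, :] - centroids[None, :, :]
    sq_dists = np.sum(diffs * diffs, axis=2)
    # Keep only the distance to the closest centroid for each sample.
    return float(sq_dists.min(axis=1).sum())


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]], dtype=np.float32)
centroids = np.array([[0.0, 0.0], [10.0, 11.0]], dtype=np.float32)
inertia = cluster_cost_reference(X, centroids)  # 0 + 1 + 1
```

The broadcasted intermediate has shape (m, n_clusters, k), so this form is only practical for small test arrays.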

fit

@auto_sync_resources @auto_convert_output

def fit(KMeansParams params, X, centroids=None, sample_weights=None, resources=None)

Find clusters with the k-means algorithm

When X is a device array (CUDA array interface), standard on-device k-means is used. When X is a host array (numpy ndarray or __array_interface__), data is streamed to the GPU in batches controlled by params.streaming_batch_size. For large host datasets, consider reducing streaming_batch_size to reduce GPU memory usage.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| params | KMeansParams | Parameters to use to fit the KMeans model. For host data, params.streaming_batch_size controls how many samples are sent to the GPU per batch. |
| X | array-like | Training instances, shape (m, k). Accepts both device arrays (cupy / CUDA array interface) and host arrays (numpy). |
| centroids | CUDA array interface compliant matrix, optional | Writable matrix, shape (n_clusters, k). |
| sample_weights | | Optional weights per observation. Must reside in the same memory space as X (device or host). Default: None. |
| resources | cuvs.common.Resources, optional | |

Returns

| Name | Type | Description |
| --- | --- | --- |
| centroids | raft.device_ndarray | The computed centroids for each cluster. |
| inertia | float | Sum of squared distances of samples to their closest cluster center. |
| n_iter | int | The number of iterations used to fit the model. |

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)

Host-data (batched) example:

>>> import numpy as np
>>> X_host = np.random.random((10_000_000, 128)).astype(np.float32)
>>> params = KMeansParams(n_clusters=1000, streaming_batch_size=1_000_000)
>>> centroids, inertia, n_iter = fit(params, X_host)
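
To illustrate why host data can be processed in slices, the sketch below performs one Lloyd-style centroid update in NumPy by accumulating per-cluster sums and counts batch by batch. It is a conceptual illustration only: the function name `batched_centroid_update` is hypothetical, the squared Euclidean metric is an assumption, and cuVS may organize its batched path differently (for example, in how often centroid adjustments are applied).

```python
import numpy as np


def batched_centroid_update(X, centroids, batch_size):
    """Hypothetical sketch: one Lloyd-style update computed batch by batch."""
    n_clusters, n_features = centroids.shape
    sums = np.zeros((n_clusters, n_features), dtype=np.float64)
    counts = np.zeros(n_clusters, dtype=np.int64)
    for start in range(0, X.shape[0], batch_size):
        # Only this slice needs to be resident at once.
        batch = X[start:start + batch_size]
        # Squared distances from each batch row to each centroid.
        d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Accumulate per-cluster sums and counts across batches.
        np.add.at(sums, labels, batch)
        counts += np.bincount(labels, minlength=n_clusters)
    # Clusters that received no samples keep their previous centroid.
    new_centroids = centroids.astype(np.float64).copy()
    nonempty = counts > 0
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids


rng = np.random.default_rng(0)
X_small = rng.random((1000, 8)).astype(np.float32)
init = X_small[:4].copy()
full = batched_centroid_update(X_small, init, batch_size=1000)
tiled = batched_centroid_update(X_small, init, batch_size=128)
```

In this sketch the two calls produce matching centroids, because the sums and counts do not depend on how the rows are split into batches. Only the peak working-set size changes.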

predict

@auto_sync_resources @auto_convert_output

def predict(KMeansParams params, X, centroids, sample_weights=None, labels=None, normalize_weight=True, resources=None)

Predict clusters with the k-means algorithm

Parameters

| Name | Type | Description |
| --- | --- | --- |
| params | KMeansParams | Parameters used in fitting the KMeans model. |
| X | CUDA array interface compliant matrix | Input matrix, shape (m, k). |
| centroids | CUDA array interface compliant matrix | Centroids calculated by fit, shape (n_clusters, k). |
| sample_weights | CUDA array interface compliant matrix, optional | Input weights, shape (n_clusters, 1). Default: None. |
| labels | CUDA array interface matrix, optional | Preallocated matrix, shape (m, 1), to hold the output. |
| normalize_weight | bool | True if the weights should be normalized. |
| resources | cuvs.common.Resources, optional | |

Returns

| Name | Type | Description |
| --- | --- | --- |
| labels | raft.device_ndarray | The label for each datapoint in X. |
| inertia | float | Sum of squared distances of samples to their closest cluster center. |

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, predict, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
>>>
>>> labels, inertia = predict(params, X, centroids)
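
The label assignment itself can be sanity-checked on the CPU for small inputs. The following NumPy sketch is not the cuVS implementation. It assumes unweighted samples and a squared Euclidean metric, and the function name `predict_labels_reference` is hypothetical.

```python
import numpy as np


def predict_labels_reference(X: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Hypothetical CPU reference: index of the nearest centroid per row of X."""
    # Pairwise squared distances, shape (m, n_clusters).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each sample is labeled with its closest centroid.
    return d2.argmin(axis=1)


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]], dtype=np.float32)
centroids = np.array([[0.0, 0.0], [10.0, 11.0]], dtype=np.float32)
labels = predict_labels_reference(X, centroids)
```

Here the first two rows fall in cluster 0 and the last row in cluster 1. Comparing such labels against a small GPU run (after copying the device output to the host) is one way to confirm that inputs are laid out as shape (m, k).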