PCA

View as Markdown

Principal Component Analysis, or PCA, is a GPU-accelerated dimensionality reduction algorithm. It learns directions of high variance in a dataset and projects each row onto a smaller number of components.

Use PCA when you want to reduce vector dimensionality, denoise data, visualize high-dimensional data, or prepare a lower-dimensional representation before another algorithm. PCA is lossy when n_components is smaller than the original feature count.

Example API Usage

C API | C++ API | Python API

Fitting components

Fitting learns the principal components, explained variances, singular values, and column means from a col-major float32 input matrix.

1#include <cuvs/core/c_api.h>
2#include <cuvs/preprocessing/pca.h>
3
4cuvsResources_t res;
5cuvsPcaParams_t params;
6DLManagedTensor *input;
7DLManagedTensor *components;
8DLManagedTensor *explained_var;
9DLManagedTensor *explained_var_ratio;
10DLManagedTensor *singular_vals;
11DLManagedTensor *mu;
12DLManagedTensor *noise_vars;
13
14load_col_major_dataset(input);
15allocate_pca_outputs(components,
16 explained_var,
17 explained_var_ratio,
18 singular_vals,
19 mu,
20 noise_vars);
21
22cuvsResourcesCreate(&res);
23cuvsPcaParamsCreate(&params);
24
25params->n_components = 32;
26params->copy = true;
27params->whiten = false;
28
29cuvsPcaFit(res,
30 params,
31 input,
32 components,
33 explained_var,
34 explained_var_ratio,
35 singular_vals,
36 mu,
37 noise_vars,
38 false);
39
40cuvsPcaParamsDestroy(params);
41cuvsResourcesDestroy(res);

Transforming data

Transforming projects rows into the PCA component space. fit_transform combines fitting and transforming in one call.

1DLManagedTensor *transformed;
2
3allocate_col_major_matrix(transformed, n_rows, params->n_components);
4
5cuvsPcaFitTransform(res,
6 params,
7 input,
8 transformed,
9 components,
10 explained_var,
11 explained_var_ratio,
12 singular_vals,
13 mu,
14 noise_vars,
15 false);

Reconstructing data

Inverse transform maps PCA-space rows back to the original feature space. The reconstruction is approximate when fewer components are kept.

1DLManagedTensor *reconstructed;
2
3allocate_col_major_matrix(reconstructed, n_rows, n_features);
4
5cuvsPcaInverseTransform(res,
6 params,
7 transformed,
8 components,
9 singular_vals,
10 mu,
11 reconstructed);

How PCA works

PCA centers the input columns and finds orthogonal directions that explain the most variance. Keeping the first n_components directions gives a lower-dimensional representation that preserves as much variance as possible under a linear projection.

When whiten=True, the transformed components are scaled to have unit component-wise variance. Whitening can help downstream models that assume similarly scaled features, but it also removes the original variance scale.

When to use PCA

Use PCA when the data has redundant or noisy dimensions and a linear lower-dimensional representation is acceptable. It can reduce memory use, reduce distance-computation cost, and make later algorithms easier to tune.

Avoid PCA when interpretability of original dimensions is required, when nonlinear structure is the main signal, or when dropping low-variance directions would remove important information.

Configuration parameters

Fit parameters

ParameterDefaultDescription
n_components1Number of principal components to keep. Larger values preserve more information but reduce memory and compute less.
copytruePreserves the input during fit. If false, fit may temporarily overwrite input data.
whitenfalseScales transformed components to unit variance.
algorithmcov_eig_dqPCA solver. Python accepts cov_eig_dq or cov_eig_jacobi; C uses CUVS_PCA_COV_EIG_DQ or CUVS_PCA_COV_EIG_JACOBI.
tol0.0Tolerance used by the Jacobi solver.
n_iterations15Number of Jacobi solver iterations.
flip_signs_based_on_UfalseControls the sign convention for fitted components in C and C++.

Tuning

Start with n_components based on the target dimensionality or the amount of variance you need to preserve. Increase it when reconstruction quality or downstream accuracy is too low.

Use the default divide-and-conquer covariance eigensolver for most workloads. Try the Jacobi solver when you need its convergence behavior, then tune tol and n_iterations together.

Enable whiten only when the downstream workflow benefits from unit-variance components. Whitening changes component scaling, so compare downstream metrics before making it the default.

Memory footprint

PCA memory is dominated by the input matrix, the covariance workspace, the component matrix, and the transformed matrix when fit_transform is used.

Variables:

  • N: Number of rows.
  • D: Number of input features.
  • K: Number of retained components.
  • B_x: Bytes per floating-point element.

Scratch and maximum rows

The scratch term covers covariance or solver workspace, temporary centered data, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.30 for PCA fit and fit_transform, because eigensolver workspace can be significant. If you can measure a representative run, use:

Hmeasured=observed_peakformula_without_scratchformula_without_scratchH_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

Musable=(MfreeMother)(1H)M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

  • M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
  • M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
  • H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
  • M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
  • observed_peak: Peak memory observed during a smaller representative run.
  • formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
  • peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
  • B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
  • B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
  • N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

For fixed D and K, solve the fit peak as a linear function of N:

peak_without_scratch(N)=NBper_row+Bfixed\text{peak\_without\_scratch}(N) = N \cdot B_{\text{per\_row}} + B_{\text{fixed}} Nmax=MusableBfixedBper_rowN_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_row}}} \right\rfloor

The covariance workspace scales with D^2, so it belongs in B_fixed when D is fixed. If D also changes, solve the full formula rather than using the linear shortcut.

Persistent arrays

The main arrays are:

input_size=NDBxcomponents_size=KDBxtransformed_size=NKBx\begin{aligned} \text{input\_size} &= N \cdot D \cdot B_x \\ \text{components\_size} &= K \cdot D \cdot B_x \\ \text{transformed\_size} &= N \cdot K \cdot B_x \end{aligned}

The vectors for explained variance, explained variance ratio, singular values, means, and noise variance are smaller:

stats_size(3K+D+1)Bx\text{stats\_size} \approx (3K + D + 1) \cdot B_x

Fit peak

The covariance-based solvers can require a feature-by-feature covariance workspace:

covariance_sizeD2Bx\text{covariance\_size} \approx D^2 \cdot B_x

The fit-transform peak is approximately:

fit_transform_peak input_size+covariance_size+components_size+transformed_size+stats_size+scratch\begin{aligned} \text{fit\_transform\_peak} \approx&\ \text{input\_size} + \text{covariance\_size} \\ &+ \text{components\_size} + \text{transformed\_size} + \text{stats\_size} + \text{scratch} \end{aligned}

For very high-dimensional data, reduce n_components only reduces the component and transformed matrices. The covariance workspace still scales with D^2.