PCA
Principal Component Analysis, or PCA, is a GPU-accelerated dimensionality reduction algorithm. It learns directions of high variance in a dataset and projects each row onto a smaller number of components.
Use PCA when you want to reduce vector dimensionality, denoise data, visualize high-dimensional data, or prepare a lower-dimensional representation before another algorithm. PCA is lossy when n_components is smaller than the original feature count.
Example API Usage
C API | C++ API | Python API
Fitting components
Fitting learns the principal components, explained variances, singular values, and column means from a column-major float32 input matrix.
C
C++
Python
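The layout requirement above can be met on the host before data is handed to the library. The following is a minimal NumPy sketch, not the library's own Python example; NumPy and the variable names are assumptions for illustration, and the fit call itself is not shown.

```python
import numpy as np

# Illustrative data: NumPy creates float64, row-major arrays by default.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))

# Convert to float32 in column-major (Fortran) order, the layout the
# fit input is described as requiring.
X_colmajor = np.asfortranarray(X, dtype=np.float32)

print(X_colmajor.dtype, X_colmajor.flags["F_CONTIGUOUS"])
```

The conversion copies the data, so for large inputs it may be cheaper to create the array in this layout from the start.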
Transforming data
Transforming projects rows into the PCA component space. fit_transform combines fitting and transforming in one call.
C
C++
Python
Reconstructing data
Inverse transform maps PCA-space rows back to the original feature space. The reconstruction is approximate when fewer components are kept.
C
C++
Python
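To illustrate why reconstruction is exact only when all components are kept, here is a library-independent NumPy sketch. The helper names (pca_fit, transform, inverse_transform, reconstruction_error) are hypothetical and are not this library's API; the sketch uses an SVD of the centered data rather than any particular solver.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return column means and the top principal directions (rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def transform(X, mean, components):
    """Project rows into PCA space."""
    return (X - mean) @ components.T

def inverse_transform(Z, mean, components):
    """Map PCA-space rows back to the original feature space."""
    return Z @ components + mean

def reconstruction_error(X, n_components):
    """Relative Frobenius error after a transform / inverse_transform round trip."""
    mean, components = pca_fit(X, n_components)
    X_hat = inverse_transform(transform(X, mean, components), mean, components)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))

err_full = reconstruction_error(X, 5)   # all components kept
err_two = reconstruction_error(X, 2)    # three directions dropped
print(err_full, err_two)
```

With all five components the round trip recovers the input up to floating-point error; with two components the dropped directions show up as reconstruction error.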
How PCA works
PCA centers the input columns and finds orthogonal directions that explain the most variance. Keeping the first n_components directions gives a lower-dimensional representation that preserves as much variance as possible under a linear projection.
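The centering and variance-ranking steps described above can be sketched with a covariance eigendecomposition. This is a minimal NumPy illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the covariance uses the N - 1 denominator, and np.linalg.eigh stands in for whichever eigensolver the library uses.

```python
import numpy as np

def pca_fit(X, n_components):
    """Covariance-based PCA fit (illustrative sketch)."""
    n = X.shape[0]
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center each column
    cov = (Xc.T @ Xc) / (n - 1)                    # D x D covariance matrix
    evals, evecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components] # keep the largest K
    components = evecs[:, order].T                 # K x D, orthonormal rows
    explained_variance = np.maximum(evals[order], 0.0)
    singular_values = np.sqrt(explained_variance * (n - 1))
    return mean, components, explained_variance, singular_values

def pca_transform(X, mean, components):
    """Project centered rows onto the retained directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))

mean, components, explained_variance, singular_values = pca_fit(X, 3)
Z = pca_transform(X, mean, components)
print(Z.shape)
print(explained_variance)
```

The variance of each column of Z matches the corresponding explained variance, which is what "preserves as much variance as possible" means in practice.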
When whiten=True, the transformed components are scaled to have unit component-wise variance. Whitening can help downstream models that assume similarly scaled features, but it also removes the original variance scale.
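The effect of whitening can be shown with a short NumPy sketch. It assumes whitening divides each transformed column by the square root of its explained variance (computed with the N - 1 denominator); the variable names are illustrative and this is not the library's code path.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four features with very different scales.
X = rng.standard_normal((500, 4)) * np.array([10.0, 3.0, 1.0, 0.1])

mean = X.mean(axis=0)
Xc = X - mean
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
components = Vt[:k]
explained_variance = S[:k] ** 2 / (X.shape[0] - 1)

Z = Xc @ components.T                       # plain transform
Z_white = Z / np.sqrt(explained_variance)   # whitened transform

print(Z.var(axis=0, ddof=1))        # equals explained_variance
print(Z_white.var(axis=0, ddof=1))  # unit variance per component
```

After whitening, the relative importance of the components is no longer visible in the transformed values, which is the loss of variance scale noted above.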
When to use PCA
Use PCA when the data has redundant or noisy dimensions and a linear lower-dimensional representation is acceptable. It can reduce memory use, reduce distance-computation cost, and make later algorithms easier to tune.
Avoid PCA when interpretability of original dimensions is required, when nonlinear structure is the main signal, or when dropping low-variance directions would remove important information.
Configuration parameters
Fit parameters
Tuning
Start with n_components based on the target dimensionality or the amount of variance you need to preserve. Increase it when reconstruction quality or downstream accuracy is too low.
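One way to pick n_components from a variance target is to take the smallest K whose cumulative explained-variance ratio reaches that target. The sketch below uses NumPy on a sample of the data; components_for_variance is a hypothetical helper, not a library function.

```python
import numpy as np

def components_for_variance(X, target_ratio):
    """Smallest K whose cumulative explained-variance ratio >= target_ratio."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    ratio = S ** 2 / np.sum(S ** 2)           # explained-variance ratio
    cumulative = np.cumsum(ratio)
    # Index of the first cumulative value that reaches the target, plus one.
    k = int(np.searchsorted(cumulative, target_ratio)) + 1
    return min(k, len(ratio)), cumulative

rng = np.random.default_rng(0)
# Features with standard deviations 10, 1, 0.1, 0.01.
X = rng.standard_normal((2000, 4)) * np.array([10.0, 1.0, 0.1, 0.01])

k_95, cumulative = components_for_variance(X, 0.95)
k_995, _ = components_for_variance(X, 0.995)
print(k_95, k_995)
print(cumulative)
```

A variance target is only a starting point; reconstruction quality or downstream accuracy should still decide the final value.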
Use the default divide-and-conquer covariance eigensolver for most workloads. The Jacobi solver is iterative, so try it when you need its convergence behavior, and tune tol and n_iterations together because both control when it stops.
Enable whiten only when the downstream workflow benefits from unit-variance components. Whitening changes component scaling, so compare downstream metrics before making it the default.
Memory footprint
PCA memory is dominated by the input matrix, the covariance workspace, the component matrix, and the transformed matrix when fit_transform is used.
Variables:
N: Number of rows.
D: Number of input features.
K: Number of retained components.
B_x: Bytes per floating-point element.
Scratch and maximum rows
The scratch term covers covariance or solver workspace, temporary centered data, allocator padding, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.30 for PCA fit and fit_transform, because eigensolver workspace can be significant. If you can measure a representative run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
observed_peak: Peak memory observed during a smaller representative run.
formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.
For fixed D and K, the fit peak is a linear function of N, B_fixed + N * B_per_row, so N_max is the largest N for which that sum still fits in M_usable:
The covariance workspace scales with D^2, so it belongs in B_fixed when D is fixed. If D also changes, solve the full formula rather than using the linear shortcut.
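The linear shortcut can be sketched in Python. Everything specific here is an assumption made for illustration: the function name, the budget rule M_usable = (M_free - M_other) * (1 - H), the per-row cost of D elements (D + K for fit_transform), and a fixed cost made of the D x D covariance workspace, the K x D component matrix, and the small result vectors. Substitute the library's own formulas where they differ.

```python
def pca_max_rows(m_free, m_other, d, k, bytes_per_elem=4,
                 headroom=0.30, fit_transform=False):
    """Estimate N_max for fixed D and K (illustrative assumptions only)."""
    # Assumed budget rule: subtract other memory, then reserve headroom H.
    m_usable = (m_free - m_other) * (1.0 - headroom)
    # Assumed fixed bytes: covariance (D*D), components (K*D),
    # three length-K vectors, one length-D mean vector, one scalar.
    b_fixed = (d * d + k * d + 3 * k + d + 1) * bytes_per_elem
    # Assumed per-row bytes: one input row, plus one transformed row
    # when fit_transform is used.
    b_per_row = (d + (k if fit_transform else 0)) * bytes_per_elem
    n_max = int((m_usable - b_fixed) // b_per_row)
    return max(n_max, 0)

GiB = 1024 ** 3
n_fit = pca_max_rows(16 * GiB, 2 * GiB, d=1024, k=64)
n_fit_transform = pca_max_rows(16 * GiB, 2 * GiB, d=1024, k=64,
                               fit_transform=True)
print(n_fit, n_fit_transform)
```

Treat the result as a planning estimate and confirm it with a measured run before relying on it near the memory limit.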
Persistent arrays
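The main arrays are the N x D input matrix (N * D * B_x bytes), the K x D component matrix (K * D * B_x bytes), and, when transforming, the N x K transformed matrix (N * K * B_x bytes).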
The vectors for explained variance, explained variance ratio, singular values, means, and noise variance are smaller: each holds at most D elements, so together they are negligible next to the matrices above.
Fit peak
The covariance-based solvers can require a feature-by-feature covariance workspace of D * D * B_x bytes, in addition to the input matrix and the persistent arrays.
The fit-transform peak is approximately the fit peak plus the N * K * B_x bytes of the transformed matrix.
For very high-dimensional data, reducing n_components only shrinks the component and transformed matrices. The covariance workspace still scales with D^2.
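A small calculator makes the D^2 effect concrete. The term sizes below are assumptions that follow the array shapes described in this section (N x D input, K x D components, N x K transformed output, D x D covariance workspace, a few small vectors); pca_memory_bytes is a hypothetical helper, and scratch is excluded.

```python
def pca_memory_bytes(n, d, k, bytes_per_elem=4):
    """Per-term byte estimates, excluding scratch (illustrative assumptions)."""
    terms = {
        "input": n * d * bytes_per_elem,
        "components": k * d * bytes_per_elem,
        "small_vectors": (3 * k + d + 1) * bytes_per_elem,
        "covariance": d * d * bytes_per_elem,
        "transformed": n * k * bytes_per_elem,
    }
    # Fit holds everything except the transformed matrix.
    terms["fit_peak"] = (terms["input"] + terms["components"]
                         + terms["small_vectors"] + terms["covariance"])
    terms["fit_transform_peak"] = terms["fit_peak"] + terms["transformed"]
    return terms

wide_k256 = pca_memory_bytes(n=100_000, d=20_000, k=256)
wide_k128 = pca_memory_bytes(n=100_000, d=20_000, k=128)

for name in ("components", "transformed", "covariance", "fit_peak"):
    print(name, wide_k256[name], wide_k128[name])
```

Halving K halves the component and transformed terms, but the covariance term is unchanged, so the fit peak barely moves when D is large.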