Cholesky Factorization For a Large Matrix Using Blocked Algorithm

blocked_potrf.cu

This example implements Cholesky factorization with a blocked algorithm, for matrices that are too large to fit in shared memory or for which cuSolverDx’s unblocked Cholesky API is too slow to use directly because of its high register and shared memory usage.

The code uses a left-looking blocked algorithm with an out-of-core implementation, in which a single thread block processes each matrix A in the batch. The factorization of the N x N matrix A proceeds in N / NB steps over sub-matrices of size NB x NB, and each step consists of a sequence of calls to the unblocked Cholesky factorization, the triangular solver (TRSM), and cuBLASDx’s GEMM. The results are compared with those obtained using the cuSolver host API cusolverDnXpotrf.
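The order of operations in one left-looking step can be illustrated with a minimal host-side reference in plain C++. This is only a sketch under stated assumptions: it is not the code in blocked_potrf.cu and calls no cuSolverDx or cuBLASDx API. The `Matrix` alias, the function names, the row-major layout, and the lower-triangular convention are all hypothetical choices made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical storage: dense n x n matrix, row-major, in a std::vector.
using Matrix = std::vector<double>;

// Unblocked Cholesky of the kb x kb diagonal block that starts at (k0, k0).
// Only columns k0..k0+kb-1 are read; returns false if the block is not
// positive definite.
bool potrf_unblocked(Matrix& a, std::size_t n, std::size_t k0, std::size_t kb) {
    for (std::size_t j = k0; j < k0 + kb; ++j) {
        double d = a[j * n + j];
        for (std::size_t p = k0; p < j; ++p) d -= a[j * n + p] * a[j * n + p];
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        a[j * n + j] = d;
        for (std::size_t i = j + 1; i < k0 + kb; ++i) {
            double s = a[i * n + j];
            for (std::size_t p = k0; p < j; ++p) s -= a[i * n + p] * a[j * n + p];
            a[i * n + j] = s / d;
        }
    }
    return true;
}

// Left-looking blocked Cholesky, A = L * L^T. On success the lower triangle
// of `a` holds L; the strictly upper triangle is left untouched.
bool potrf_blocked_left_looking(Matrix& a, std::size_t n, std::size_t nb) {
    assert(nb > 0 && a.size() == n * n);
    for (std::size_t k0 = 0; k0 < n; k0 += nb) {
        const std::size_t kb = std::min(nb, n - k0);  // last block may be smaller

        // GEMM-like update with all previously factored block columns:
        // A[k0:n, k0:k0+kb] -= L[k0:n, 0:k0] * L[k0:k0+kb, 0:k0]^T
        for (std::size_t i = k0; i < n; ++i) {
            for (std::size_t j = k0; j < std::min(i + 1, k0 + kb); ++j) {
                double s = 0.0;
                for (std::size_t p = 0; p < k0; ++p) s += a[i * n + p] * a[j * n + p];
                a[i * n + j] -= s;
            }
        }

        // Unblocked Cholesky of the updated diagonal block.
        if (!potrf_unblocked(a, n, k0, kb)) return false;

        // TRSM: solve X * L_kk^T = A[k0+kb:n, k0:k0+kb] for the panel below.
        for (std::size_t i = k0 + kb; i < n; ++i) {
            for (std::size_t j = k0; j < k0 + kb; ++j) {
                double s = a[i * n + j];
                for (std::size_t p = k0; p < j; ++p) s -= a[i * n + p] * a[j * n + p];
                a[i * n + j] = s / a[j * n + j];
            }
        }
    }
    return true;
}
```

Calling `potrf_blocked_left_looking(a, n, nb)` on a symmetric positive definite matrix overwrites its lower triangle with the factor. In a left-looking scheme each block column is read and updated once, just before it is factored, which is one plausible reason it suits a setting where only one NB-wide panel is held in fast memory at a time.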

reg_least_squares.cu

This example solves a batch of regularized least squares problems

\[\min_x \|b - Ax\|_2^2 + \lambda \|x\|_2^2\]

with three different approaches:

Using cuBLASDx’s GEMM and cuSolverDx’s POTRF and TRSM to build a single fused kernel to solve the normal equations

\[(A' A + \lambda I) x = A' b\]
Using cuSolverDx’s GELS function to do a Householder QR on A augmented by \(\sqrt{\lambda} I\)
Using the cuBLAS and cuSolver host APIs to solve the normal equations
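The normal-equations route shared by the first and third approaches can be sketched on the host for a single problem with one right-hand side. This is a minimal illustration, not the code in reg_least_squares.cu, and it uses no MathDx, cuBLAS, or cuSolver call. The function name `ridge_normal_equations` and the row-major layout are assumptions introduced here; the comments only mark which stage corresponds to GEMM, POTRF, and TRSM.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: solves min ||b - A x||_2^2 + lambda ||x||_2^2 through
// the normal equations (A' A + lambda I) x = A' b.
// `a` is m x n, row-major; `b` has length m; lambda is assumed positive.
std::vector<double> ridge_normal_equations(const std::vector<double>& a,
                                           const std::vector<double>& b,
                                           std::size_t m, std::size_t n,
                                           double lambda) {
    assert(a.size() == m * n && b.size() == m);

    // GEMM stage: G = A' A + lambda I (lower triangle only) and x = A' b.
    std::vector<double> g(n * n, 0.0), x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double s = (i == j) ? lambda : 0.0;
            for (std::size_t k = 0; k < m; ++k) s += a[k * n + i] * a[k * n + j];
            g[i * n + j] = s;
        }
        for (std::size_t k = 0; k < m; ++k) x[i] += a[k * n + i] * b[k];
    }

    // POTRF stage: in-place lower Cholesky factorization G = L * L'.
    for (std::size_t j = 0; j < n; ++j) {
        double d = g[j * n + j];
        for (std::size_t p = 0; p < j; ++p) d -= g[j * n + p] * g[j * n + p];
        assert(d > 0.0);  // holds in exact arithmetic when lambda > 0
        g[j * n + j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = g[i * n + j];
            for (std::size_t p = 0; p < j; ++p) s -= g[i * n + p] * g[j * n + p];
            g[i * n + j] = s / g[j * n + j];
        }
    }

    // TRSM stage 1: forward substitution, L y = A' b (y overwrites x).
    for (std::size_t i = 0; i < n; ++i) {
        double s = x[i];
        for (std::size_t p = 0; p < i; ++p) s -= g[i * n + p] * x[p];
        x[i] = s / g[i * n + i];
    }
    // TRSM stage 2: backward substitution, L' x = y.
    for (std::size_t i = n; i-- > 0;) {
        double s = x[i];
        for (std::size_t p = i + 1; p < n; ++p) s -= g[p * n + i] * x[p];
        x[i] = s / g[i * n + i];
    }
    return x;
}
```

For the 3 x 2 matrix with rows (1, 0), (0, 1), (1, 1), right-hand side b = (1, 2, 3), and \(\lambda = 1\), the normal equations have coefficient matrix with rows (3, 1), (1, 3) and right-hand side (4, 5), so the function returns x = (0.875, 1.375). Every intermediate here lives in local storage, which loosely mirrors the motivation for fusing the three stages into one kernel rather than writing G back to global memory between library calls.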

The comparison of the three approaches shows the performance benefit of using MathDx functions in a fused kernel, which improves throughput and reduces global memory access.