Advanced Examples

All examples in this section are located in the 10_Advanced/ directory.

blocked_potrf.cu

This example implements a blocked Cholesky factorization for large matrices that either do not fit in shared memory or run too slowly with cuSolverDx’s unblocked Cholesky API because of its register and shared memory usage.

The code uses a left-looking blocked algorithm with an out-of-core implementation, in which a single thread block processes each matrix A in the batch. The factorization of the N x N matrix A proceeds in N / NB steps over submatrices of size NB x NB, and each step calls the unblocked Cholesky factorization, the triangular solve (TRSM), and cuBLASDx GEMM. Results are compared against the cuSolver host API cusolverDnXpotrf.
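The step structure described above can be sketched on the CPU. This is a minimal reference sketch, not the code in blocked_potrf.cu: the function names (`potrf_unblocked`, `potrf_blocked`), the row-major `double` storage, and the plain loops standing in for GEMM and TRSM are all assumptions made for illustration. It only shows how a left-looking blocked factorization orders its GEMM, unblocked Cholesky, and TRSM calls within each step.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n storage (assumption)

// Unblocked Cholesky of the kb x kb diagonal block starting at (k0, k0).
// Returns false if the block is not positive definite.
bool potrf_unblocked(Matrix& a, std::size_t n, std::size_t k0, std::size_t kb) {
    for (std::size_t j = k0; j < k0 + kb; ++j) {
        double d = a[j * n + j];
        for (std::size_t p = k0; p < j; ++p) d -= a[j * n + p] * a[j * n + p];
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        a[j * n + j] = d;
        for (std::size_t i = j + 1; i < k0 + kb; ++i) {
            double s = a[i * n + j];
            for (std::size_t p = k0; p < j; ++p) s -= a[i * n + p] * a[j * n + p];
            a[i * n + j] = s / d;
        }
    }
    return true;
}

// Left-looking blocked Cholesky, A = L * L^T, overwriting the lower triangle.
// nb plays the role of NB; a trailing block smaller than nb is allowed.
bool potrf_blocked(Matrix& a, std::size_t n, std::size_t nb) {
    for (std::size_t k0 = 0; k0 < n; k0 += nb) {
        const std::size_t kb = std::min(nb, n - k0);

        // GEMM-like update: subtract the contribution of all block columns
        // to the left from the current block column (lower part only).
        for (std::size_t i = k0; i < n; ++i) {
            for (std::size_t j = k0; j < std::min(i + 1, k0 + kb); ++j) {
                double s = 0.0;
                for (std::size_t p = 0; p < k0; ++p) s += a[i * n + p] * a[j * n + p];
                a[i * n + j] -= s;
            }
        }

        // Unblocked Cholesky of the diagonal block.
        if (!potrf_unblocked(a, n, k0, kb)) return false;

        // TRSM: solve X * L_kk^T = A_ik for the rows below the diagonal block.
        for (std::size_t i = k0 + kb; i < n; ++i) {
            for (std::size_t j = k0; j < k0 + kb; ++j) {
                double s = a[i * n + j];
                for (std::size_t p = k0; p < j; ++p) s -= a[i * n + p] * a[j * n + p];
                a[i * n + j] = s / a[j * n + j];
            }
        }
    }
    return true;
}
```

Calling `potrf_blocked(a, n, nb)` on a symmetric positive definite matrix leaves the factor L in the lower triangle of `a`; the strict upper triangle is not touched.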

reg_least_squares.cu

This example solves a batch of regularized least squares problems

\[\min_x \|b - Ax\|_2^2 + \lambda \|x\|_2^2\]

with three different approaches:

Using cuBLASDx’s GEMM and cuSolverDx’s POTRF and TRSM in a single fused kernel that solves the normal equations

\[(A' A + \lambda I) x = A' b\]
Using cuSolverDx’s GELS function to do a Householder QR on A augmented by \(\sqrt{\lambda} I\)
Using the cuBLAS and cuSolver host APIs to solve the normal equations
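The link between the objective and the first two approaches can be written out in a few lines. Setting the gradient of the objective to zero gives the normal equations. The augmented form below assumes the identity block is scaled by \(\sqrt{\lambda}\), so that its squared norm reproduces the penalty term:

\[\nabla_x \left( \|b - Ax\|_2^2 + \lambda \|x\|_2^2 \right) = -2A'(b - Ax) + 2\lambda x = 0 \;\Longleftrightarrow\; (A'A + \lambda I)x = A'b\]

\[\left\| \begin{bmatrix} b \\ 0 \end{bmatrix} - \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} x \right\|_2^2 = \|b - Ax\|_2^2 + \lambda \|x\|_2^2\]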

The comparison of the three approaches illustrates the performance benefit of using MathDx functions in a fused kernel to improve throughput and reduce global memory access.
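For reference, the normal-equations approach can be sketched for a single problem on the CPU. This is an illustrative sketch under stated assumptions, not the code in reg_least_squares.cu: the function name `ridge_normal_equations`, the row-major `double` storage, and the plain loops standing in for GEMM, POTRF, and TRSM are hypothetical. It shows the sequence of operations that a fused kernel would combine: forming \(A'A + \lambda I\) and \(A'b\), a Cholesky factorization, and two triangular solves.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Solves min_x ||b - A x||^2 + lambda ||x||^2 through the normal equations
// (A^T A + lambda I) x = A^T b. A is row-major with m rows and n columns.
std::vector<double> ridge_normal_equations(const std::vector<double>& A,
                                           std::size_t m, std::size_t n,
                                           const std::vector<double>& b,
                                           double lambda) {
    std::vector<double> G(n * n, 0.0);  // lower triangle of A^T A + lambda I
    std::vector<double> x(n, 0.0);      // starts as A^T b, ends as the solution

    // GEMM-like step: G = A^T A + lambda I and x = A^T b.
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t k = 0; k <= j; ++k) {
            double s = 0.0;
            for (std::size_t i = 0; i < m; ++i) s += A[i * n + j] * A[i * n + k];
            G[j * n + k] = s + (j == k ? lambda : 0.0);
        }
        for (std::size_t i = 0; i < m; ++i) x[j] += A[i * n + j] * b[i];
    }

    // POTRF-like step: in-place Cholesky G = L L^T (lower triangle).
    for (std::size_t j = 0; j < n; ++j) {
        double d = G[j * n + j];
        for (std::size_t p = 0; p < j; ++p) d -= G[j * n + p] * G[j * n + p];
        if (d <= 0.0) throw std::runtime_error("matrix is not positive definite");
        d = std::sqrt(d);
        G[j * n + j] = d;
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = G[i * n + j];
            for (std::size_t p = 0; p < j; ++p) s -= G[i * n + p] * G[j * n + p];
            G[i * n + j] = s / d;
        }
    }

    // TRSM-like steps: forward solve L y = A^T b, then backward solve L^T x = y.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t p = 0; p < i; ++p) x[i] -= G[i * n + p] * x[p];
        x[i] /= G[i * n + i];
    }
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t p = i + 1; p < n; ++p) x[i] -= G[p * n + i] * x[p];
        x[i] /= G[i * n + i];
    }
    return x;
}
```

In this sketch every intermediate (the Gram matrix, the right-hand side, the factor) stays in local storage between steps, which mirrors on a small scale why a fused kernel can avoid round trips through global memory.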