First Example of Using cuSolverDx API#
In this introduction, we will perform a batched linear system solve based on Cholesky factorization using the cuSolverDx library. This section is based on the Introduction Example shipped with cuSolverDx. Refer to the Examples section for other cuSolverDx samples.
Defining The Problem With Operators#
NVIDIA MathDX uses a list of operators to create a full description of the function to solve. Therefore, the first step is defining a function descriptor by adding together cuSolverDx operators that capture the properties of the function.
The correctness of this description is evaluated at compile time.
A well-defined cuSolverDx routine description must include two parts:
- Selected linear algebra routine. In this case it is the Cholesky decomposition and solver: cusolverdx::function::posv. Please refer to the Functionality section for the list of available functions.
- Valid and sufficient description of the input/output matrices. In this case it includes the dimensions (Size M, N, K of matrices A, X, and B), the precision (float or double), the data type (real or complex), the fill mode (lower or upper), and the data arrangement of the matrices (row- or column-major). Different functions may require different description operators. Please refer to the Description Operators section for the list of description operators and descriptions of each operator.
To get a descriptor for a Cholesky solver with M = N = 32 and K = 1, we just need to write the following lines:
#include <cusolverdx.hpp>
using namespace cusolverdx;
using POSV = decltype(Size<32 /* = M */, 32 /* = N */, 1 /* = K */>()
+ Precision<double>()
+ Type<type::complex>()
+ Function<function::posv>()
+ FillMode<lower>()
+ Arrangement<col_major /* A */, row_major /* B */>());
In order to encode the operation properties, cuSolverDx provides the operators Size, Precision, Type, Function, FillMode, and Arrangement, which can be combined by using addition (+).
Optionally, the user can set leading dimensions for each matrix using the LeadingDimension operator, or set them dynamically during the execution.
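As a rough sketch, assuming the LeadingDimension operator takes the leading dimensions of A and B as template parameters, the description above could be extended with padding as follows (the value 34 is purely illustrative):

// A hedged sketch: pad the matrices with explicit leading dimensions
// (assumes LeadingDimension<lda, ldb>; 34 is an illustrative value)
using POSV_PADDED = decltype(POSV() + LeadingDimension<34 /* A */, 34 /* B */>());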
Defining A Full Function Descriptor#
To obtain a fully usable operation that can compute the function, we need to provide at least two additional pieces of information:
- SM operator: Indicates the targeted CUDA architecture on which we want to run the function. Each GPU architecture is different, therefore each may use a different implementation and require a different CUDA block size for the best performance. In the introduction example (00_Introduction/posv_batched_block.cu) this is passed as a template parameter, but in this example we can assume we're targeting H100 GPUs (SM<900>()).
- Execution Operator: Either the Thread Operator or the Block Operator is required to indicate how the function will be executed. In this case, we want to execute the function on a CUDA block. At this point, cuSolverDx performs additional verifications to make sure the provided description is valid and that it is possible to execute it on the requested architecture.
#include <cusolverdx.hpp>
using namespace cusolverdx;
using POSV = decltype(Size<32 /* = M */, 32 /* = N */, 1 /* = K */>()
+ Precision<double>()
+ Type<type::complex>()
+ Function<function::posv>()
+ FillMode<lower>()
+ Arrangement<col_major /* A */, row_major /* B */>()
+ SM<900>()
+ Block());
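Because all operators are encoded in the type, the properties of the descriptor are available as compile-time constants; m_size, n_size, and k_size below are the same members used in the full example later in this section:

// The description is fully evaluated at compile time
static_assert(POSV::m_size == 32 && POSV::n_size == 32 && POSV::k_size == 1,
              "the descriptor encodes its sizes as compile-time constants");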
As the Block Operator is used, the user can optionally also specify the number of threads that will perform the function. This is done with the BlockDim operator. If the BlockDim operator is not defined, cuSolverDx will select a preferred block size, which can be obtained with Solver::block_dim.
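For example, a minimal sketch of requesting a custom block size, assuming the BlockDim operator takes the thread-block extents as template parameters (the value 64 is purely illustrative):

// A hedged sketch: extend the descriptor with explicit block dimensions
// (assumes BlockDim<X>; consult the BlockDim documentation for valid values)
using POSV_FIXED_DIM = decltype(POSV() + BlockDim<64>());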
Tip
If there is no need to set custom block dimensions, it is recommended not to use the BlockDim operator and to rely on Solver::block_dim instead. For more details, see the Block Execute Method section, BlockDim, and Suggested block dim.
Executing the Function#
The class POSV which describes the function can be instantiated into an object (or objects). Forming the object has no computational cost and should be seen as a handle.
The function descriptor object provides a compute method, execute(...), that performs the requested function.
#include <cusolverdx.hpp>
using namespace cusolverdx;
using POSV = decltype(Size<32 /* = M */, 32 /* = N */, 1 /* = K */>()
+ Precision<double>()
+ Type<type::complex>()
+ Function<function::posv>()
+ FillMode<lower>()
+ Arrangement<col_major /* A */, row_major /* B */>()
+ SM<900>()
+ Block());
template<class POSV>
__global__ void posv_kernel(typename POSV::a_data_type* A, typename POSV::b_data_type* B, typename POSV::status_type* info) {
    // Copy A and B from global memory to shared memory
    // (A_smem and B_smem denote the shared-memory copies; see the full
    //  example below for the actual declarations)

    // Execute the solver
    POSV().execute(A_smem, B_smem, info);

    // Copy the output A and B from shared memory back to global memory
}
If the Cholesky factorization fails, i.e., some leading minor of A is not positive definite, the output parameter info indicates, for each batch, the smallest leading minor of A which is not positive definite.
As shown in the comments, the execution method assumes that all the matrices reside in shared memory. Users are responsible for loading the matrices from global to shared memory before calling the execution method, and for saving the results afterwards. The examples shipped with cuSolverDx provide common load and store utility functions for users' convenience (see Other Methods).
Important
Block execution functions, i.e., functions described with the Block Operator, accept input and output data in either global memory or shared memory; however, using shared memory is highly recommended to achieve optimal performance. Please refer to the Execute Method section for more details.
Launching the Kernel#
To launch a kernel executing the defined POSV, we need to specify the block dimensions and the amount of shared memory needed.
The required shared memory can be obtained using Solver::shared_memory_size. It accounts for any padding declared using LeadingDimension Operator.
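Note that if shared_memory_size exceeds the default per-block dynamic shared memory limit, the kernel must opt in to a larger limit before the launch. This is standard CUDA rather than a cuSolverDx requirement; a minimal sketch using the kernel defined below:

// Allow the kernel to use more dynamic shared memory than the default
// limit (standard CUDA; error handling omitted)
const auto sm_size = POSV::shared_memory_size;
cudaFuncSetAttribute(posv_kernel<POSV>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     sm_size);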
For simplicity, the example below shows only the main component of the code. Please check the full example shipped with cuSolverDx.
template<class POSV, unsigned int BatchesPerBlock = 1,
         class DataType = typename POSV::a_data_type>
__global__ void posv_kernel(DataType* A, const unsigned int lda_gmem,
                            DataType* B, const unsigned int ldb_gmem,
                            typename POSV::status_type* info,
                            const unsigned int batches) {
    using namespace cusolverdx;
    constexpr auto m    = POSV::m_size;
    constexpr auto nrhs = POSV::k_size;

    const auto one_batch_size_a_gmem = lda_gmem * m;
    const auto one_batch_size_b_gmem =
        (arrangement_of_v_b<POSV> == arrangement::col_major) ? ldb_gmem * nrhs : m * ldb_gmem;

    constexpr auto lda_smem = POSV::lda;
    constexpr auto ldb_smem = POSV::ldb;
    constexpr auto one_batch_size_a_smem = lda_smem * m;
    constexpr auto one_batch_size_b_smem =
        (arrangement_of_v_b<POSV> == arrangement::col_major) ? ldb_smem * nrhs : m * ldb_smem;

    extern __shared__ __align__(sizeof(DataType)) cusolverdx::byte shared_mem[];
    DataType* As = reinterpret_cast<DataType*>(shared_mem);
    DataType* Bs = As + one_batch_size_a_smem * BatchesPerBlock;

    const auto batch_idx = blockIdx.x * BatchesPerBlock;
    if (batch_idx >= batches)
        return;

    auto Ag = A + size_t(one_batch_size_a_gmem) * batch_idx;
    auto Bg = B + size_t(one_batch_size_b_gmem) * batch_idx;

    // Load data from global memory to shared memory
    common::io<POSV, BatchesPerBlock>::load(Ag, lda_gmem, As, lda_smem);
    common::io<POSV, BatchesPerBlock>::load_rhs(Bg, ldb_gmem, Bs, ldb_smem);

    // Execute the solver
    POSV().execute(As, lda_smem, Bs, &info[batch_idx]);

    // Store the results back to global memory
    common::io<POSV, BatchesPerBlock>::store(As, lda_smem, Ag, lda_gmem);
    common::io<POSV, BatchesPerBlock>::store_rhs(Bs, ldb_smem, Bg, ldb_gmem);
}
int posv_batched() {
    using POSV = decltype(Size<32 /* = M */, 32 /* = N */, 1 /* = K */>()
                          + Precision<double>()
                          + Type<type::complex>()
                          + Function<function::posv>()
                          + FillMode<lower>()
                          + Arrangement<col_major /* A */, row_major /* B */>()
                          + SM<900>()
                          + Block());
    using data_type = typename POSV::a_data_type;

    constexpr auto m    = POSV::m_size;
    constexpr auto n    = POSV::n_size;
    constexpr auto nrhs = POSV::k_size;
    static_assert(m == n, "posv is for Hermitian positive-definite matrices only");

    constexpr auto lda_smem = POSV::lda;
    constexpr auto ldb_smem = POSV::ldb;
    constexpr auto lda = m;
    constexpr auto ldb = (arrangement_of_v_b<POSV> == arrangement::col_major) ? m : nrhs;

    const auto batches = 2;
    const auto one_batch_size_A = lda * n;
    const auto one_batch_size_B = m * nrhs;

    std::vector<data_type> A(one_batch_size_A * batches);
    // fill the A matrix
    std::vector<data_type> B(one_batch_size_B * batches);
    // fill the B matrix
    std::vector<int> info(batches, 0);

    data_type* d_A    = nullptr; /* device copy of A */
    data_type* d_B    = nullptr; /* device copy of B */
    int*       d_info = nullptr; /* error info */

    cudaMalloc(reinterpret_cast<void**>(&d_A), sizeof(data_type) * A.size());
    cudaMalloc(reinterpret_cast<void**>(&d_B), sizeof(data_type) * B.size());
    cudaMalloc(reinterpret_cast<void**>(&d_info), sizeof(int) * batches);

    cudaMemcpy(d_A, A.data(), sizeof(data_type) * A.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B.data(), sizeof(data_type) * B.size(), cudaMemcpyHostToDevice);

    const auto sm_size = POSV::shared_memory_size;

    // Invoke the kernel; batches per block is 1
    posv_kernel<POSV><<<batches, POSV::block_dim, sm_size>>>(d_A, lda, d_B, ldb, d_info, batches);
    cudaDeviceSynchronize();

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_info);
    return 0;
}
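After the kernel has completed, the per-batch statuses can be copied back and inspected, following the semantics of info described earlier. A minimal sketch, to be placed before the cudaFree calls above (requires <cstdio>):

// Check the per-batch status reported by posv (0 means success)
cudaMemcpy(info.data(), d_info, sizeof(int) * batches, cudaMemcpyDeviceToHost);
for (int b = 0; b < batches; ++b) {
    if (info[b] != 0) {
        std::printf("batch %d: leading minor %d of A is not positive definite\n", b, info[b]);
    }
}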
It is important to note that, unlike the cuSolver library, cuSolverDx does not require moving data back to global memory after executing an operation, nor does it require the input data to be loaded from global memory. Those properties can be a major performance advantage for certain use cases. The list of possible optimizations includes, but is not limited to:
Fusing Solver functions with custom pre- and post-processing (see the sketch after this list).
Fusing multiple Solver functions together.
Fusing Solver functions with BLAS, RAND, or FFT operations.
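As a rough illustration of the first point, a custom element-wise post-processing step could be fused into posv_kernel between the execute call and the store calls, while the solution is still resident in shared memory. A sketch under the assumption that the data type supports scalar multiplication (the scaling is purely illustrative):

// Inside posv_kernel, after POSV().execute(...) and before the store calls:
const auto tid      = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
const auto nthreads = blockDim.x * blockDim.y * blockDim.z;
__syncthreads();
for (unsigned int i = tid; i < one_batch_size_b_smem * BatchesPerBlock; i += nthreads) {
    Bs[i] = Bs[i] * 0.5; // illustrative custom post-processing
}
__syncthreads();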
Tip
It is recommended to use the cusolverdx::byte type for declaring shared memory variables, like the shared_mem variable in the code snippet above, but it is also valid to use the unsigned char or std::byte types.
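For reference, the equivalent declarations are (only one is used at a time; std::byte requires <cstddef>):

// Recommended:
extern __shared__ __align__(sizeof(DataType)) cusolverdx::byte shared_mem[];
// Also valid:
// extern __shared__ __align__(sizeof(DataType)) unsigned char shared_mem[];
// extern __shared__ __align__(sizeof(DataType)) std::byte shared_mem[];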
Compilation#
For instructions on how to compile programs with cuSolverDx, see the Installation Guide.