Execution Methods

Execution methods run the BLAS function defined by the user with cuBLASDx operators.

Note

Currently, cuBLASDx supports execution only at the CUDA thread block level (block execution).

A code example:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 32>()
              + cublasdx::Precision<double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

using data_type = GEMM::value_type;

// __shared__ arrays must be declared inside a kernel (or device function)
__global__ void gemm_kernel(data_type alpha, const data_type* a, const data_type* b,
                            data_type beta, data_type* c) {
    __shared__ data_type sa[GEMM::a_size];
    __shared__ data_type sb[GEMM::b_size];
    __shared__ data_type sc[GEMM::c_size];

    // Load data from global to shared memory
    // sa <-- a
    // sb <-- b
    // sc <-- c

    // Execute
    GEMM().execute(alpha, sa, sb, beta, sc);

    // Store results back to global memory
    // c <-- sc
}

Block Execute Method

The block execution methods are available if the descriptor has been constructed using the Block Operator and the is_complete_blas_execution Trait is true.
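
Whether a given descriptor satisfies this can be checked at compile time. A minimal sketch, reusing the GEMM descriptor from the example above and assuming the trait exposes a boolean ::value member, as is conventional for such traits:

// Fails compilation early if GEMM cannot be executed as a block
static_assert(cublasdx::is_complete_blas_execution<GEMM>::value,
              "GEMM is not a complete block execution descriptor");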

// #1
void BLAS().execute<typename T>(const T   alpha,
                                T*        matrix_a,
                                T*        matrix_b,
                                const T   beta,
                                T*        matrix_c)

// #2: Allows providing runtime/dynamic leading dimensions for matrices A, B, and C
void BLAS().execute<typename T>(const T            alpha,
                                T*                 matrix_a,
                                const unsigned int lda,
                                T*                 matrix_b,
                                const unsigned int ldb,
                                const T            beta,
                                T*                 matrix_c,
                                const unsigned int ldc)

Method execute(...) runs the calculations defined by the BLAS descriptor.

By default T is BLAS::value_type, but it can be any type (such as float2 or cuda::std::complex<double>) whose alignment and size are the same as those of BLAS::value_type. Pointers matrix_a, matrix_b, and matrix_c must point to shared memory regions aligned to alignof(BLAS::value_type).
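
For example, a complex<float> GEMM can be driven with float2 arguments. A minimal sketch; the CGEMM descriptor and kernel are hypothetical, defined here only to illustrate the size and alignment requirement:

#include <cublasdx.hpp>

using CGEMM = decltype(cublasdx::Size<16, 16, 16>()
               + cublasdx::Precision<float>()
               + cublasdx::Type<cublasdx::type::complex>()
               + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
               + cublasdx::Function<cublasdx::function::MM>()
               + cublasdx::SM<700>()
               + cublasdx::Block());

// float2 is a valid T: it matches CGEMM::value_type in size and alignment
static_assert(sizeof(float2) == sizeof(CGEMM::value_type), "size mismatch");
static_assert(alignof(float2) == alignof(CGEMM::value_type), "alignment mismatch");

__global__ void cgemm_kernel(float2 alpha, float2 beta) {
    // __shared__ arrays naturally satisfy the alignment requirement here
    __shared__ float2 sa[CGEMM::a_size];
    __shared__ float2 sb[CGEMM::b_size];
    __shared__ float2 sc[CGEMM::c_size];

    // ... load sa, sb, sc from global memory ...

    CGEMM().execute(alpha, sa, sb, beta, sc);
}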

Method #2 allows the user to provide custom dynamic leading dimensions via the lda, ldb, and ldc arguments. In this case, leading dimension values set via the LeadingDimension operator are ignored. The values lda, ldb, and ldc have to follow the same rules as presented in the LeadingDimension operator.

After method #1 or #2 returns, the user has to perform a CUDA block synchronization (for example, __syncthreads()) before accessing matrix_a, matrix_b, or matrix_c.
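
A minimal kernel sketch combining method #2 with the required synchronization, reusing the GEMM descriptor from the example above; the shared memory sizes and the bound on the leading dimensions are illustrative assumptions:

// Assumes GEMM from the example above: Size<32, 32, 32>,
// TransposeMode<T, N>, real double precision
__global__ void gemm_dynamic_ld_kernel(double alpha, double beta,
                                       unsigned int lda,   // >= 32; A stored as k x m
                                       unsigned int ldb,   // >= 32; B stored as k x n
                                       unsigned int ldc) { // >= 32; C stored as m x n
    // Illustrative upper bound: leading dimensions of up to 40 fit here
    __shared__ double sa[40 * 32];
    __shared__ double sb[40 * 32];
    __shared__ double sc[40 * 32];

    // ... load sa, sb, sc from global memory using lda, ldb, ldc ...

    // Method #2: runtime values override the LeadingDimension operator
    GEMM().execute(alpha, sa, lda, sb, ldb, beta, sc, ldc);

    // Results in sc may be read only after a block synchronization
    __syncthreads();

    // ... store sc back to global memory ...
}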

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different leading dimensions will produce bit-identical results.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.

Value Format

BLAS::value_type

For complex numbers of every precision, the first value in a complex number is the real part and the second is the imaginary part. For real numbers, BLAS::value_type is the same as P in Precision<P> used to describe BLAS (or the default precision).
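
Since the real part comes first, an array of a complex value_type can be reinterpreted as consecutive scalars. A minimal sketch; the ZGEMM descriptor and helper functions are hypothetical, for illustration only:

#include <cublasdx.hpp>

using ZGEMM = decltype(cublasdx::Size<16, 16, 16>()
               + cublasdx::Precision<double>()
               + cublasdx::Type<cublasdx::type::complex>()
               + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
               + cublasdx::Function<cublasdx::function::MM>()
               + cublasdx::SM<700>()
               + cublasdx::Block());

using complex_t = ZGEMM::value_type;

// Element i viewed as two consecutive doubles:
// real part first, imaginary part second
__device__ double real_part(const complex_t* data, unsigned int i) {
    return reinterpret_cast<const double*>(data)[2 * i];
}

__device__ double imag_part(const complex_t* data, unsigned int i) {
    return reinterpret_cast<const double*>(data)[2 * i + 1];
}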

Input/Output Data Format

This section describes the input and output data format required for correct calculations.

GEMM (function::MM)

The general matrix multiplication assumes that the values in matrices matrix_a, matrix_b, and matrix_c are stored in column-major format. Results are stored in matrix_c, also in column-major format.
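
For illustration, a minimal sketch of a column-major load from global to shared memory; the helper name, the 1D thread block assumption, and the lda parameter for the global matrix are assumptions, not part of the cuBLASDx API:

// Copies a column-major m x k matrix from global memory (leading
// dimension lda) into a densely packed shared memory tile
__device__ void load_column_major(const double* global_a, double* sa,
                                  unsigned int m, unsigned int k,
                                  unsigned int lda) {
    for (unsigned int idx = threadIdx.x; idx < m * k; idx += blockDim.x) {
        const unsigned int row = idx % m;  // fastest-varying index
        const unsigned int col = idx / m;
        // Element (row, col) in column-major order
        sa[col * m + row] = global_a[col * lda + row];
    }
}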

Shared Memory Usage

Large BLAS operations (as defined by the Size operator) may require more than 48 KB of shared memory per CUDA block for the matrices. Therefore, as described in the CUDA Programming Guide (#1, #2, #3), kernels with such BLAS operations must use dynamic shared memory rather than statically sized shared memory arrays. Additionally, these kernels require an explicit opt-in via cudaFuncSetAttribute() setting cudaFuncAttributeMaxDynamicSharedMemorySize. See the example code below.

#include <cublasdx.hpp>
using namespace cublasdx;

using GEMM = decltype(cublasdx::Size<128, 128, 64>()
              + cublasdx::Precision<double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<900>()
              + cublasdx::Block());

// Kernel using dynamic shared memory for the A, B, and C matrices
template<class BLAS>
__global__ void gemm_kernel(typename BLAS::value_type alpha,
                            typename BLAS::value_type* a,
                            typename BLAS::value_type* b,
                            typename BLAS::value_type beta,
                            typename BLAS::value_type* c);

void example() {
  (...)

  // Increases the max dynamic shared memory size to match GEMM requirements
  cudaFuncSetAttribute(gemm_kernel<GEMM>, cudaFuncAttributeMaxDynamicSharedMemorySize, GEMM::shared_memory_size);
  // Invokes kernel with GEMM::block_dim threads in CUDA block
  gemm_kernel<GEMM><<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, a, b, beta, c);

  (...)
}