General Matrix Multiply Using cuBLASDx

In this introduction, we will perform a general matrix multiplication \(C = {\alpha} * op(A) * op(B) + {\beta} * C\) using the cuBLASDx library. This section is based on the introduction_example.cu example shipped with cuBLASDx. See the Examples section for other cuBLASDx samples.

Defining GEMM Operation

The first step is defining the GEMM we want to perform. This is done by adding together cuBLASDx operators to create a GEMM description. The correctness of the resulting type is evaluated at compile time every time a new operator is added. A well-defined cuBLASDx GEMM routine description must include two parts:

  1. Selected linear algebra routine. In this case that is matrix multiplication: cublasdx::function::MM.

  2. Valid and sufficient description of the inputs and outputs: the dimensions of the matrices (M, N, K), the precision (half, float, double, etc.), the data type (real or complex), and the transpose mode of each matrix.

To get a descriptor for \(C = {\alpha} * A * B + {\beta} * C\) with M = N = K = 32, where A and B are non-transposed matrices, we just need to write the following lines:

#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
              + Precision<double>()
              + Type<type::real>()
              + TransposeMode<t_mode /* A */, t_mode /* B */>()
              + Function<function::MM>());

In order to encode the operation properties, cuBLASDx provides operators Size, Precision, Type, TransposeMode, and Function, which can be combined by using the standard addition operator (+).

Optionally, the user can set leading dimensions for each matrix using the LeadingDimension operator. It is also possible to set leading dimensions dynamically during the execution; however, it is worth noting that this may affect performance.
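
For illustration only, the sketch below adds static leading dimensions to a description like the one above. It assumes the operator takes the three values as LeadingDimension<lda, ldb, ldc>; the padded value 34 and the alias name GEMMWithLD are arbitrary choices made here, not part of the shipped example.

// A sketch of a description with explicit (static) leading dimensions
// (two rows of padding per matrix in this case).
using GEMMWithLD = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                     + Precision<double>()
                     + Type<type::real>()
                     + TransposeMode<t_mode /* A */, t_mode /* B */>()
                     + LeadingDimension<34 /* lda */, 34 /* ldb */, 34 /* ldc */>()
                     + Function<function::MM>());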

To obtain a fully usable operation that executes GEMM on CUDA block level, we need to provide at least two additional pieces of information:

  • The first one is the SM Operator, which indicates the target CUDA architecture on which we want to run the GEMM. Each GPU architecture is different; therefore, each can use a different implementation and may require a different CUDA block size for the best performance. In the introduction_example.cu example this is passed as a template parameter (a sketch of that parametrization follows the next code snippet), but here we can assume we are targeting Volta GPUs (SM<700>()).

  • Finally, we use the Block Operator to indicate that the BLAS routine will be performed by multiple threads in a single CUDA block. At this point, cuBLASDx performs additional verifications to make sure the provided description is valid and that it is possible to execute it on the requested architecture.

#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32, 32, 32>()
              + Precision<double>()
              + Type<type::real>()
              + TransposeMode<t_mode, t_mode>()
              + Function<function::MM>()
              + SM<700>()
              + Block());
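
The shipped introduction_example.cu does not hard-code the architecture. As a rough sketch of that parametrization (the names Arch and GEMM_for are chosen here purely for illustration), the SM value can be left as a template parameter:

// A sketch: leaving the target architecture as a template parameter, so e.g.
// GEMM_for<700> targets Volta and GEMM_for<800> targets Ampere.
template<unsigned int Arch>
using GEMM_for = decltype(Size<32, 32, 32>()
                  + Precision<double>()
                  + Type<type::real>()
                  + TransposeMode<t_mode, t_mode>()
                  + Function<function::MM>()
                  + SM<Arch>()
                  + Block());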

The user can also specify the layout and the number of threads that will perform the GEMM. This is done with the BlockDim Operator. Adding BlockDim<X, Y, Z> means that the GEMM will only work correctly if the kernel is launched with block dimensions dim3(X1, Y1, Z1) where X1 >= X, Y1 >= Y, and Z1 >= Z. Detailed requirements can be found in the section dedicated to the BlockDim operator. If the BlockDim operator is not used, cuBLASDx will select a preferred block size, which can be obtained with GEMM::block_dim.

Tip

If there is no need to set custom block dimensions, it is recommended not to use the BlockDim operator and to rely on GEMM::block_dim. For more details, see the Block Execute Method section, the BlockDim Operator, and the Suggested Block Dim Trait.

For this sample, let’s assume we want to use a 1D CUDA thread block with 256 threads.

#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32, 32, 32>()
              + Precision<double>()
              + Type<type::real>()
              + TransposeMode<t_mode, t_mode>()
              + Function<function::MM>()
              + SM<700>()
              + Block()
              + BlockDim<256>());

Executing GEMM

The GEMM class, which describes the matrix multiplication, can be instantiated into an object (or objects). Forming the object has no computational cost and should be seen as a handle. The function descriptor object provides a compute method, execute(...), that performs the requested function.

#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32, 32, 32>()
              + Precision<double>()
              + Type<type::real>()
              + TransposeMode<t_mode, t_mode>()
              + Function<function::MM>()
              + SM<700>()
              + Block()
              + BlockDim<256>());

__global__ void gemm_kernel(double alpha, double *a, double *b, double beta, double *c) {
      // Execute GEMM
      GEMM().execute(/* What are the arguments? */);
}

It is assumed that all the matrices reside in shared memory. It is up to the user to load the matrices from global to shared memory before calling the execution method. In the same way, the user is responsible for storing the results back to global memory.
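
How the data is staged is entirely up to the kernel author. As one possibility, a block-strided loop in which every thread participates could be used; the sketch below assumes a 1D thread block, and strided_copy is a name chosen here for illustration (the shipped example uses a simpler copy, shown later).

// A sketch of a block-strided copy in which every thread of a 1D block participates.
template<class T>
inline __device__ void strided_copy(T* dst, const T* src, unsigned int size) {
    for (unsigned int i = threadIdx.x; i < size; i += blockDim.x) {
        dst[i] = src[i];
    }
}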

#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;

using GEMM = decltype(Size<32, 32, 32>()
              + Precision<double>()
              + Type<type::real>()
              + TransposeMode<t_mode, t_mode>()
              + Function<function::MM>()
              + SM<700>()
              + Block()
              + BlockDim<256>());
// Type value_type is defined based on the GEMM description. The Precision operator defines its numerical
// precision, and via the Type operator the user specifies whether it is complex or real.
//
// In this case, value_type is double since the precision is double and the type is real.
using value_type = typename GEMM::value_type;

__global__ void gemm_kernel(value_type alpha, value_type *a, value_type *b, value_type beta, value_type *c) {
      extern __shared__ double smem[];

      // GEMM::<a/b/c>_size provides the number of elements in the <a/b/c> matrix, including padding defined by the leading dimension.
      value_type* sa = smem;
      value_type* sb = smem + GEMM::a_size;
      value_type* sc = smem + GEMM::a_size + GEMM::b_size;

      // Load data from global to shared memory
      // sa <-- a
      // sb <-- b
      // sc <-- c
      __syncthreads();

      // Execute GEMM
      GEMM().execute(alpha, sa, sb, beta, sc);
      __syncthreads();

      // Store data to global memory
      // c <-- sc
}

Launching GEMM Kernel

To launch a kernel executing the defined GEMM, we need to know the required block dimensions and the amount of shared memory needed for all three matrices (A, B, and C). Elements in the matrices should be stored in column-major format (accounting for leading dimensions).
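
As a reminder, column-major storage with leading dimension ld places element (row, col) at index row + col * ld; with no LeadingDimension operator, the leading dimension of a non-transposed matrix is its number of rows. The helper below is only an illustration of this indexing (col_major_at is not part of the cuBLASDx API):

// A sketch of column-major addressing; e.g. for the non-transposed 32 x 32 matrix A
// with no extra padding, ld is 32 and element (i, j) lives at a[i + j * 32].
template<class T>
__host__ __device__ inline T& col_major_at(T* matrix, unsigned int row, unsigned int col, unsigned int ld) {
    return matrix[row + col * ld];
}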

#include <cublasdx.hpp>
using namespace cublasdx;

template<class GEMM>
__global__ void gemm_kernel(typename GEMM::value_type  alpha,
                            typename GEMM::value_type* a,
                            typename GEMM::value_type* b,
                            typename GEMM::value_type  beta,
                            typename GEMM::value_type* c) {
      using value_type = typename GEMM::value_type;
      extern __shared__ value_type smem[];

      // GEMM::<a/b/c>_size provides the number of elements in the <a/b/c> matrix, including padding defined by the leading dimension.
      value_type* sa = smem;
      value_type* sb = smem + GEMM::a_size;
      value_type* sc = smem + GEMM::a_size + GEMM::b_size;

      // Load data from global to shared memory
      // sa <-- a
      // sb <-- b
      // sc <-- c
      __syncthreads();

      // Execute GEMM
      GEMM().execute(alpha, sa, sb, beta, sc);
      __syncthreads();

      // Store data to global memory
      // c <-- sc
}

// CUDA_CHECK_AND_EXIT - macro that checks if a function returns cudaSuccess; if not, it prints
// the error code and exits the program
void introduction_example(double alpha, double *a, double *b, double beta, double *c) {
  constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;
  using GEMM = decltype(Size<32, 32, 32>()
                + Precision<double>()
                + Type<type::real>()
                + TransposeMode<t_mode, t_mode>()
                + Function<function::MM>()
                + SM<700>()
                + Block()
                + BlockDim<256>());

  // Invokes kernel with GEMM::block_dim threads in CUDA block
  gemm_kernel<GEMM><<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, a, b, beta, c);
  CUDA_CHECK_AND_EXIT(cudaPeekAtLastError());
  CUDA_CHECK_AND_EXIT(cudaDeviceSynchronize());
}

The required shared memory can be obtained using GEMM::shared_memory_size. It accounts for any padding declared using the LeadingDimension Operator.
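
For the 32 x 32 x 32 double-precision case this amounts to roughly 24 KB, which fits within the default dynamic shared memory limit. For larger descriptions, the standard CUDA opt-in may be required inside introduction_example before the kernel launch; a sketch (not taken from the shipped example):

// A sketch: opt in to a larger dynamic shared memory carve-out when
// GEMM::shared_memory_size exceeds the default per-block limit.
CUDA_CHECK_AND_EXIT(cudaFuncSetAttribute(gemm_kernel<GEMM>,
                                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                                         GEMM::shared_memory_size));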

For simplicity, in the example we allocate managed memory for the device matrices, assume that the Volta architecture is used, and don't check CUDA error codes returned by CUDA API functions. In addition, the function that copies data between global and shared memory is implemented in a naive way. Please check the full introduction_example.cu example, as well as the others shipped with cuBLASDx, for more detailed code.

#include <cublasdx.hpp>
using namespace cublasdx;

// Naive copy; one thread does all the work
template<class T>
inline __device__ void naive_copy(T* dst, const T* src, unsigned int size) {
    if ((threadIdx.x == 0) && (threadIdx.y == 0) && (threadIdx.z == 0)) {
        for (unsigned int i = 0; i < size; ++i) {
            dst[i] = src[i];
        }
    }
}

template<class GEMM>
__global__ void gemm_kernel(typename GEMM::value_type  alpha,
                            typename GEMM::value_type* a,
                            typename GEMM::value_type* b,
                            typename GEMM::value_type  beta,
                            typename GEMM::value_type* c) {
      using value_type = typename GEMM::value_type;
      extern __shared__ value_type smem[];

      value_type* sa = smem;
      value_type* sb = smem + GEMM::a_size;
      value_type* sc = smem + GEMM::a_size + GEMM::b_size;

      // Load data from global to shared memory
      naive_copy(sa, a, GEMM::a_size);
      naive_copy(sb, b, GEMM::b_size);
      naive_copy(sc, c, GEMM::c_size);
      __syncthreads();

      // Execute GEMM
      GEMM().execute(alpha, sa, sb, beta, sc);
      __syncthreads();

      // Store data to global memory
      naive_copy(c, sc, GEMM::c_size);
}

void introduction_example() {
  constexpr auto t_mode = cublasdx::transpose_mode::non_transposed;
  using GEMM = decltype(Size<32, 32, 32>()
                + Precision<double>()
                + Type<type::real>()
                + TransposeMode<t_mode, t_mode>()
                + Function<function::MM>()
                + SM<700>()
                + Block()
                + BlockDim<256>());

  // value_type is double since the precision is double and the type is real
  using value_type = typename GEMM::value_type;

  // Allocate managed memory for A, B, C matrices in one go
  value_type* abc;
  auto        size       = GEMM::a_size + GEMM::b_size + GEMM::c_size;
  auto        size_bytes = size * sizeof(value_type);
  cudaMallocManaged(&abc, size_bytes);
  // Generate data
  for (size_t i = 0; i < size; i++) {
      abc[i] = double(i) / size;
  }

  value_type* a = abc;
  value_type* b = abc + GEMM::a_size;
  value_type* c = abc + GEMM::a_size + GEMM::b_size;

  // Invokes kernel with GEMM::block_dim threads in CUDA block
  gemm_kernel<GEMM><<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(1.0, a, b, 1.0, c);
  cudaDeviceSynchronize();

  cudaFree(abc);
}

It is important to notice that, unlike the cuBLAS library, cuBLASDx does not require moving data back to global memory after executing a BLAS operation, nor does it require the input data to be loaded from global memory. These properties can be a major performance advantage for certain use cases. The list of possible optimizations includes, but is not limited to, the following (a sketch of the first item follows the list):

  • Fusing BLAS routines with custom pre- and post-processing.

  • Fusing multiple BLAS operations together.

  • Fusing BLAS and FFT operations (using cuFFTDx) together.

  • Generating input matrices or parts of them.
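
To make the first item concrete, the kernel below is a hypothetical sketch (gemm_clamp_kernel is not one of the shipped examples): it fuses a made-up elementwise epilogue, clamping C to [0, 1], into the same kernel, so the result never leaves shared memory between the GEMM and the post-processing. It reuses naive_copy from the example above and assumes a 1D thread block.

// A sketch of fusing custom post-processing with the GEMM. After the in-shared-memory
// multiplication, every thread clamps its part of C before it is written back.
template<class GEMM>
__global__ void gemm_clamp_kernel(typename GEMM::value_type  alpha,
                                  typename GEMM::value_type* a,
                                  typename GEMM::value_type* b,
                                  typename GEMM::value_type  beta,
                                  typename GEMM::value_type* c) {
    using value_type = typename GEMM::value_type;
    extern __shared__ value_type smem[];

    value_type* sa = smem;
    value_type* sb = smem + GEMM::a_size;
    value_type* sc = smem + GEMM::a_size + GEMM::b_size;

    naive_copy(sa, a, GEMM::a_size);
    naive_copy(sb, b, GEMM::b_size);
    naive_copy(sc, c, GEMM::c_size);
    __syncthreads();

    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Fused post-processing: clamp each element of C in shared memory,
    // without an extra round trip through global memory.
    for (unsigned int i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) {
        sc[i] = sc[i] < value_type(0) ? value_type(0) : (sc[i] > value_type(1) ? value_type(1) : sc[i]);
    }
    __syncthreads();

    naive_copy(c, sc, GEMM::c_size);
}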

Compilation

For instructions on how to compile programs with cuBLASDx, see the Quick Installation Guide.