Input / Output Format#

Value Format#

BLAS::a_value_type
BLAS::b_value_type
BLAS::c_value_type

For complex numbers of every precision, the first value in a complex number is the real part and the second is the imaginary part. For real numbers, BLAS::<a/b/c>_value_type is the same as the corresponding P in Precision<PA, PB, PC> used to describe BLAS (or the default precision).

Input/Output Data Format#

This section describes the input and output data format (layout) required for correct calculations.

The tensor API methods for general matrix multiplication and triangular solve (the tensor overloads of execute()) are execution descriptor methods that expect matrices represented as cublasdx::tensor objects. They accept matrices with arbitrary layouts: since a tensor object carries all the information about a matrix's dimensions, memory location, and layout, no other implicit assumptions are needed.

The dimensions of the matrices must match the dimensions defined by the Size operator. See also the Get Memory Layout and Suggested Shared Memory Layout sections for information on using optimized layouts.

The pointer API methods for general matrix multiplication and triangular solve (the #2 and #3 overloads of execute()) assume that the values of the input matrices matrix_a, matrix_b (and, for GEMM, optionally matrix_c) are stored as defined by the Arrangement operator added to the description (by default row-major for matrix_a, column-major for matrix_b, and column-major for matrix_c).

Shared Memory Usage#

It’s important to note that large BLAS operations (as defined by the Size operator) may require more than 48 KB of shared memory per CUDA block for the matrices. Therefore, as described in the CUDA Programming Guide (Technical Specifications per Compute Capability table, Shared Memory section of the Compute Capabilities chapter), kernels with such BLAS operations must use dynamic shared memory rather than statically sized shared memory arrays. Additionally, these kernels require an explicit opt-in via cudaFuncSetAttribute() to set cudaFuncAttributeMaxDynamicSharedMemorySize. See the example code below.

#include <cublasdx.hpp>
using namespace cublasdx;

using GEMM = decltype(cublasdx::Size<128, 128, 64>()
              + cublasdx::Precision<__nv_fp8_e4m3, __nv_fp8_e5m2, float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::Arrangement<cublasdx::row_major, cublasdx::col_major>()
              + cublasdx::Function</* MM or TRSM */>()
              + cublasdx::SM<900>()
              + cublasdx::Block()); // For TRSM, Thread() would instead describe a single per-thread problem

void example() {
  (...)

  // Get required shared memory sizes, options:

  // Shared Memory API GEMM
  // 1 - Shared memory size required for matrices based on GEMM definition
  auto shared_memory_size = cublasdx::get_shared_storage_size<GEMM>();
  // 2 - Shared memory size when dynamic leading dimensions are used
  shared_memory_size = cublasdx::get_shared_storage_size<GEMM>(lda, ldb, ldc);
  // 3 - Shared memory size calculated based on custom matrix layouts for A, B, C matrices
  shared_memory_size = cublasdx::get_shared_storage_size<GEMM>(matrix_a_layout, matrix_b_layout, matrix_c_layout);

  // TRSM and register API GEMM
  // 1 - Shared memory size required for the A and B matrices based on the operation definition
  shared_memory_size = cublasdx::get_shared_storage_size_ab<GEMM>();
  // 2 - Shared memory size when dynamic leading dimensions are used
  shared_memory_size = cublasdx::get_shared_storage_size_ab<GEMM>(lda, ldb);
  // 3 - Shared memory size calculated based on custom matrix layouts for A, B matrices
  shared_memory_size = cublasdx::get_shared_storage_size_ab<GEMM>(matrix_a_layout, matrix_b_layout);

  // Increase the max dynamic shared memory size to match the GEMM requirements
  cudaFuncSetAttribute(gemm_kernel<GEMM>, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
  // Invoke the kernel with GEMM::block_dim threads per CUDA block
  gemm_kernel<GEMM><<<1, GEMM::block_dim, shared_memory_size>>>(alpha, a, b, beta, c);

  (...)
}