Requirements and Functionality#

Requirements#

cuBLASDx is a header-only CUDA C++ library, so the list of software required to use it is short:

  • CUDA Toolkit 11.4 or newer

  • Supported CUDA compiler (C++17 required)

  • Supported host compiler (C++17 required)

  • (Optionally) CMake (version 3.18 or greater)

Dependencies:

  • commonDx (shipped with the MathDx package)

  • CUTLASS 3.6.0 or newer (CUTLASS 3.6.0 is shipped with the MathDx package)

Supported Compilers#

CUDA Compilers:

  • NVCC 11.4.152+ (CUDA Toolkit 11.4 or newer)

  • (Experimental support) NVRTC 11.4.152+ (CUDA Toolkit 11.4 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (only on Linux/WSL2)

  • HPC SDK nvc++ 23.1+

Note

We recommend using GCC 9+ as the host compiler, and NVCC shipped with the latest CUDA Toolkit as the CUDA compiler.

Warning

Compiling cuBLASDx on Windows with MSVC has not been tested and is not yet supported. However, it is possible to compile kernels with cuBLASDx on Windows using NVRTC, as presented in one of the examples.

Note

cuBLASDx emits errors for unsupported C++ standard versions; these can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation. cuBLASDx is not guaranteed to work with C++ standard versions it does not support.
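For example, the check can be silenced at the top of a translation unit (or with an equivalent -D compiler flag):

```cpp
// Silences the unsupported C++ dialect error. cuBLASDx is still not
// guaranteed to work with C++ standard versions it does not support.
#define CUBLASDX_IGNORE_DEPRECATED_DIALECT
#include <cublasdx.hpp>
```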

Warning

Due to a known GCC issue, cuBLASDx supports GCC 11+ as the host compiler only together with NVCC 11.6+. In addition, we recommend using NVCC 12.3+ when running kernels on Volta and Turing architectures, to avoid a known compiler bug that leads to wrong results in certain use cases.

Supported Functionality#

This is an Early Access (EA) version of cuBLASDx. Its current functionality is a subset of the capabilities cuBLASDx will have in its first full release.

Supported features include:

  • Creating block descriptors that run GEMM, the general matrix multiply routine: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\) (see Function operator and the sketch after this list).

  • Automatic use of Tensor Cores and automatic data layouts for best memory access patterns.

  • Using either register fragments or shared memory as input / output memory space for accumulation.

  • Bi-directional information flow, from the user to the descriptor via Operators and from the descriptor to the user via Traits.

  • Targeting specific GPU architectures using the SM Operator. This enables users to configure the descriptor with suggested parameters that target performance on that architecture.
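As a minimal sketch of how these pieces fit together (operator values are illustrative; see the Operators and Traits sections for the exact set):

```cpp
#include <cublasdx.hpp>

using namespace cublasdx;

// Block-level GEMM descriptor: C = alpha * A * B + beta * C with
// m = n = k = 32, real single precision, targeting SM 8.0.
using GEMM = decltype(Size<32, 32, 32>()
                      + Precision<float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

// Traits flow information back from the descriptor to the user,
// e.g. the suggested CUDA block dimensions for launching the kernel.
constexpr auto block_dim = GEMM::block_dim;
```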

Supported Memory Spaces#

cuBLASDx supports all GEMM sizes, defined by the m, n, and k dimensions, that fit into the register file (RF) and shared memory combined. Matrices A and B have to fit into shared memory to perform computations; these input matrices may overlap or alias each other. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C Programming Guide.

The input / output C matrix can be (see the sketch after this list):
  1. provided in shared memory (it cannot alias any input elements) for \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)

  2. provided as a register fragment for accumulation to \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)

  3. returned by value as a register fragment from \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
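A hedged sketch of the three variants is shown below. It follows the shared memory and register APIs, but the helper names used here (slice_shared_memory, make_tensor, copy, copy_wait) may differ between versions, so treat it as an illustration rather than a definitive recipe:

```cpp
#include <cublasdx.hpp>

using namespace cublasdx;

using GEMM = decltype(Size<16, 16, 16>() + Precision<float>() + Type<type::real>()
                      + Function<function::MM>() + SM<800>() + Block());

__global__ void gemm_kernel(float alpha, const float* a, const float* b,
                            float beta, float* c) {
    extern __shared__ __align__(16) char smem[];

    // Stage A, B, and C in shared memory.
    auto [smem_a, smem_b, smem_c] = slice_shared_memory<GEMM>(smem);
    auto a_s = make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_s = make_tensor(smem_b, GEMM::get_layout_smem_b());
    auto c_s = make_tensor(smem_c, GEMM::get_layout_smem_c());
    using alignment = alignment_of<GEMM>;
    copy<GEMM, alignment::a>(make_tensor(a, GEMM::get_layout_gmem_a()), a_s);
    copy<GEMM, alignment::b>(make_tensor(b, GEMM::get_layout_gmem_b()), b_s);
    copy<GEMM, alignment::c>(make_tensor(c, GEMM::get_layout_gmem_c()), c_s);
    copy_wait();

    // (1) C in shared memory: C = alpha * A * B + beta * C.
    GEMM().execute(alpha, a_s, b_s, beta, c_s);

    // (2) Accumulation into an existing register fragment: C += A * B.
    //     GEMM().execute(a_s, b_s, c_frag);
    // (3) C as a register fragment returned by value: C = A * B.
    //     auto [c_frag, partitioner] = GEMM().execute(a_s, b_s);

    // Write the shared-memory result back to global memory.
    __syncthreads();
    copy<GEMM, alignment::c>(c_s, make_tensor(c, GEMM::get_layout_gmem_c()));
}
```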

Supported Computation Types#

cuBLASDx supports calculations in 2 domains:
  1. real

  2. complex

in 7 floating-point precisions:
  1. half (__half)

  2. single (float)

  3. double (double)

  4. fp8_e4m3 (__nv_fp8_e4m3)

  5. fp8_e5m2 (__nv_fp8_e5m2)

  6. bf16 (__nv_bfloat16)

  7. tf32 (cublasdx::tfloat32_t)

and in 8 integral precisions:
  1. Signed 8-bit (int8_t)

  2. Unsigned 8-bit (uint8_t)

  3. Signed 16-bit (int16_t)

  4. Unsigned 16-bit (uint16_t)

  5. Signed 32-bit (int32_t)

  6. Unsigned 32-bit (uint32_t)

  7. Signed 64-bit (int64_t)

  8. Unsigned 64-bit (uint64_t)

Starting from cuBLASDx 0.2.0, matrix multiplication with different precisions for A, B, and C is supported.

Any combination of 3 precisions is supported, as long as either:
  1. They are all floating-point precisions.

  2. They are all integral precisions and:
    1. the accumulator is at least 4× wider than either input,

    2. the input signedness implies the accumulator signedness (signed inputs require a signed accumulator).

Mixed floating-point / integral GEMMs are unsupported, but in-register input conversion can be applied to the same effect (a sketch of a valid integral combination follows).
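For instance, int8_t inputs with an int32_t accumulator satisfy both integral rules (the accumulator is 4× wider, and signed inputs get a signed accumulator). A hedged sketch of such a descriptor:

```cpp
#include <cublasdx.hpp>
#include <cstdint>

using namespace cublasdx;

// Mixed integral precisions: int8_t inputs, int32_t accumulator.
// Valid: the accumulator is 4x wider than either input, and the
// signed inputs imply the signed accumulator.
using IGEMM = decltype(Size<32, 32, 32>()
                       + Precision<int8_t, int8_t, int32_t>()
                       + Type<type::real>()
                       + Function<function::MM>()
                       + SM<800>()
                       + Block());

// Invalid examples (rejected by the rules above):
//   Precision<int8_t, int8_t, int16_t>   - accumulator only 2x wider
//   Precision<int8_t, uint8_t, uint32_t> - signed input, unsigned accumulator
```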

Supported Input Types#

Note

Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e., the input / output data for each matrix can be of an arbitrary type (even integral input for a floating-point GEMM), provided that the Alignment Operator is set and at least one of the following conditions is met (a sketch follows the list):

  1. It is implicitly convertible to the data type chosen with the Precision Operator and Type Operator.

  2. For inputs: an appropriate converting load operation is provided as one of the arguments. It takes a value of the input type, and its result must be at least implicitly convertible to the compute type.

  3. For output: an appropriate converting store operation is provided as one of the arguments. It takes the computational result type (usually the C type, as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
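As a hypothetical illustration of conditions 2 and 3, converting functors could be passed alongside the tensors in the execute call; the functor names below are placeholders, and the exact overload taking load/store operations is described with the execution methods:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

using namespace cublasdx;

// Compute in float, but keep the input / output data in __half.
using GEMM = decltype(Size<16, 16, 16>() + Precision<float>() + Type<type::real>()
                      + Function<function::MM>() + SM<800>() + Block()
                      + Alignment<2, 2, 2>());

// Hypothetical converting load: input type -> compute type.
struct half_to_float {
    __device__ float operator()(__half v) const { return __half2float(v); }
};

// Hypothetical converting store: compute (C) type -> output type.
struct float_to_half {
    __device__ __half operator()(float v) const { return __float2half(v); }
};

// Inside a kernel, with a_s / b_s / c_s being shared-memory tensors of
// __half data (staging not shown):
//   GEMM().execute(alpha, a_s, b_s, beta, c_s,
//                  half_to_float{}, half_to_float{},  // load ops for A and B
//                  half_to_float{}, float_to_half{}); // load / store ops for C
```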

Supported Input Layouts#

Data can be provided in any layout described by a cublasdx::tensor (or the underlying cute::Tensor), as long as it is a 2-dimensional tensor. Its modes can be hierarchical, but there must be exactly 2 of them.
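For example, a padded column-major tile can be wrapped into a tensor and handed to the execution methods. A hedged sketch using cute layout helpers (the layout values are illustrative):

```cpp
#include <cublasdx.hpp>

// A 16x16 column-major tile stored with a padded leading dimension of 20:
// 2 modes (rows, columns), strides (1, 20).
__device__ auto make_padded_tile(float* smem_ptr) {
    auto layout = cute::make_layout(cute::make_shape(16, 16),
                                    cute::make_stride(1, 20));
    return cublasdx::make_tensor(smem_ptr, layout);
}
```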

Supported Maximal Sizes with non-overlapping A and B#

The table below presents the maximal supported sizes for three commonly used floating-point precisions (half, single, and double) and both types (real and complex), assuming the m, n, and k dimensions are equal and the precisions of A, B, and C are the same.

The effective supported dimensions are much bigger if:
  1. The dimensions are not equal (tall or wide matrices).

  2. A and B are aliased and share elements (e.g., multiplying A by its transpose requires fitting only A).

| Function | Type, Precision of A/B/C | Architecture | Max Size (restricted AB, C in shared) | Max Size (restricted AB, C in registers) |
|----------|--------------------------|--------------|---------------------------------------|------------------------------------------|
| GEMM | Real, half | 70, 72 | 128 | 156 |
| GEMM | Real, half | 75 | 104 | 127 |
| GEMM | Real, half | 80, 87 | 166 | 203 |
| GEMM | Real, half | 86, 89 | 129 | 157 |
| GEMM | Real, half | 90 | 196 | 240 |
| GEMM | Real, float; Complex, half | 70, 72 | 90 | 110 |
| GEMM | Real, float; Complex, half | 75 | 73 | 89 |
| GEMM | Real, float; Complex, half | 80, 87 | 117 | 143 |
| GEMM | Real, float; Complex, half | 86, 89 | 91 | 111 |
| GEMM | Real, float; Complex, half | 90 | 139 | 170 |
| GEMM | Real, double; Complex, float | 70, 72 | 64 | 78 |
| GEMM | Real, double; Complex, float | 75 | 52 | 63 |
| GEMM | Real, double; Complex, float | 80, 87 | 83 | 101 |
| GEMM | Real, double; Complex, float | 86, 89 | 64 | 78 |
| GEMM | Real, double; Complex, float | 90 | 98 | 120 |
| GEMM | Complex, double | 70, 72 | 45 | 55 |
| GEMM | Complex, double | 75 | 36 | 44 |
| GEMM | Complex, double | 80, 87 | 58 | 71 |
| GEMM | Complex, double | 86, 89 | 45 | 55 |
| GEMM | Complex, double | 90 | 69 | 84 |

Warning

Starting with cuBLASDx 0.3.0, there are no static assertions checking whether the chosen size will fit on the device. This is a consequence of allowing inputs to overlap and of providing a register-based accumulation API alongside the shared memory one.
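In practice this means the user is responsible for requesting enough dynamic shared memory at launch; amounts above the default per-block limit require the standard CUDA runtime opt-in, e.g.:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for a cuBLASDx GEMM kernel.
__global__ void gemm_kernel(float* c) { }

// Opt in to more dynamic shared memory than the default limit before
// launching; 'shared_bytes' would come from the descriptor's shared
// memory size trait.
cudaError_t enable_large_smem(unsigned shared_bytes) {
    return cudaFuncSetAttribute(gemm_kernel,
                                cudaFuncAttributeMaxDynamicSharedMemorySize,
                                shared_bytes);
}
```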

Supported MMA Data Types#

The table below lists the precisions of A, B, and C for which a specialized Tensor Core (MMA) operation is available.

The type and precision of the scales, i.e., \({\alpha}\) and \({\beta}\), are expected to be the same as those of matrix C, and A, B, and C must be either all real or all complex. If the I/O precision is decoupled from the computation precision, the scale type must be compatible with the compute type.

| Precision A | Precision B | Precision C | Note |
|-------------|-------------|-------------|------------|
| fp8_e4m3 | fp8_e4m3 | float | MMA, SM89+ |
| fp8_e4m3 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e4m3 | float | MMA, SM89+ |
| half | half | half | MMA, SM70+ |
| half | half | float | MMA, SM70+ |
| bf16 | bf16 | float | MMA, SM80+ |
| tf32 | tf32 | float | MMA, SM80+ |
| double | double | double | MMA, SM80+ |
| int8_t | int8_t | int32_t | MMA, SM75+ |
| uint8_t | int8_t | int32_t | MMA, SM75+ |
| int8_t | uint8_t | int32_t | MMA, SM75+ |
| uint8_t | uint8_t | int32_t | MMA, SM75+ |

Note

If an MMA exists for the combination of precisions of A, B, and C, cuBLASDx will use the MMA instruction automatically on the supported GPU architectures. Otherwise, cuBLASDx will fall back to FMA instructions, and there are no performance guarantees.
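For example, a hedged sketch of a descriptor that hits the half × half → float row of the table on Volta or newer (the MMA is picked automatically; no extra operator is needed to enable Tensor Cores):

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

using namespace cublasdx;

// half x half -> float has an MMA on SM70+, so Tensor Cores are used
// automatically for this descriptor.
using HGEMM = decltype(Size<64, 64, 64>()
                       + Precision<__half, __half, float>()
                       + Type<type::real>()
                       + Function<function::MM>()
                       + SM<700>()
                       + Block());
```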