Requirements and Functionality#

Requirements#

cuBLASDx is a CUDA C++ header-only library. Therefore, the list of required software to use the library is relatively short:

  • CUDA Toolkit 12.0 or newer

  • Supported CUDA compiler (C++17 required)

  • Supported host compiler (C++17 required)

  • (Optional) CMake (version 3.30 or greater)

Dependencies:

  • commonDx (included with the MathDx package)

  • CUTLASS 4.2.0 or newer (CUTLASS 4.2.0 is included with the MathDx package)

Supported Compilers#

CUDA Compilers:

  • NVCC 12.0.76+ (CUDA Toolkit 12.0 or newer)

  • (Experimental support) NVRTC 12.0.76+ (CUDA Toolkit 12.0 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (Linux/WSL2 only)

Note

We recommend using GCC 9+ as the host compiler, and NVCC shipped with the latest CUDA Toolkit as the CUDA compiler.

Warning

Compiling cuBLASDx on Windows with MSVC has not been tested and is not supported yet. However, it is possible to compile kernels with cuBLASDx on Windows using NVRTC, as demonstrated in one of the examples.

Note

cuBLASDx emits errors for unsupported C++ standard versions, which can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation. cuBLASDx is not guaranteed to work with C++ standard versions that are not officially supported.
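
For example, the macro can be defined before the first include (or, equivalently, passed on the compiler command line):

```c++
// Acknowledge an officially unsupported C++ dialect and silence the error.
#define CUBLASDX_IGNORE_DEPRECATED_DIALECT
#include <cublasdx.hpp>
```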

Supported Functionality#

This is an Early Access (EA) version of cuBLASDx. The current functionality of the library is a subset of the capabilities that will be available in the first official release.

Supported features include:

  • Creating block descriptors that execute the GEMM (general matrix multiply) routine: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\) (See Function operator).

  • Creating pipeline objects that introduce asynchronous overlapping between stages of tile GEMM execution.

  • Automatic use of Tensor Cores and the Tensor Memory Accelerator (TMA), as well as automatic data layouts for optimal memory access patterns.

  • Using either register fragments or shared memory as input/output memory space for accumulation.

  • Bi-directional information flow: from the user to the descriptor via Operators, and from the descriptor to the user via Traits (see the sketch after this list).

  • Targeting specific GPU architectures using the SM Operator. This enables users to configure the descriptor with suggested parameters for optimal performance.
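
As a rough sketch of how operators and traits fit together (the operator and trait names follow the cuBLASDx API; the concrete values, such as the 32×32×32 tile, half precision, and the SM 80 target, are illustrative assumptions):

```c++
#include <cuda_fp16.h>   // __half
#include <cublasdx.hpp>

// Operators flow information from the user into the descriptor.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()              // m, n, k
                    + cublasdx::Precision<__half>()             // one precision for A, B, and C
                    + cublasdx::Type<cublasdx::type::real>()    // real or complex domain
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()                       // target GPU architecture
                    + cublasdx::Block());                       // block-level execution

// Traits flow information from the descriptor back to the user.
constexpr dim3         block_dim  = GEMM::block_dim;            // suggested launch configuration
constexpr unsigned int smem_bytes = GEMM::shared_memory_size;   // shared memory the tile needs
```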

Supported Memory Spaces#

cuBLASDx tiles support all GEMM sizes defined by the m, n, and k dimensions that can fit into the combined register file (RF) and shared memory. Matrices A and B must fit into shared memory to perform computations. These input matrices may overlap or alias each other. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C Programming Guide.
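
Since a large tile can require more than the default static limit, kernels typically opt in to a larger dynamic shared memory allocation. A minimal sketch (the kernel body is elided; GEMM is a block execution descriptor as above):

```c++
template<class BLAS>
__global__ void gemm_kernel(/* global memory pointers */) {
    extern __shared__ __align__(16) char smem[];  // dynamic shared memory for the A/B/C tiles
    // ... stage tiles into smem and call BLAS().execute(...) ...
}

void launch(dim3 grid) {
    // Raise the dynamic shared memory limit before launching large tiles.
    cudaFuncSetAttribute(gemm_kernel<GEMM>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         GEMM::shared_memory_size);
    gemm_kernel<GEMM><<<grid, GEMM::block_dim, GEMM::shared_memory_size>>>();
}
```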

The input/output C matrix can be (see the sketch after this list):
  1. Provided in shared memory (it cannot alias any input elements) for \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)

  2. Provided as a register fragment for accumulation to \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)

  3. Returned by value as a register fragment from \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
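
A sketch of the three variants using the tensor API (smem_a/smem_b/smem_c, alpha, and beta are placeholders; the get_layout_smem_* helpers and execute overloads follow the cuBLASDx shared memory and register fragment APIs, but exact names and return shapes should be checked against the release in use):

```c++
// Inside the kernel body: wrap shared memory tiles in tensors
// with the descriptor's shared memory layouts.
auto a = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
auto b = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
auto c = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());

// (1) C in shared memory: C = alpha * A * B + beta * C.
GEMM().execute(alpha, a, b, beta, c);

// (3) C returned by value as a register fragment (C = A * B); the call also
// returns a partitioner describing which elements each thread owns.
auto [c_frag, partitioner] = GEMM().execute(a, b);

// (2) Accumulation into an existing register fragment (C = A * B + C).
GEMM().execute(a, b, c_frag);
```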

The pipeline extension to cuBLASDx allows GEMM execution (as described with operators) to start from global memory and efficiently pipelines subsequent stages until all tiles have been loaded and multiplied.

Supported Computation Types#

cuBLASDx supports calculations in two domains:
  1. Real

  2. Complex

In seven floating point precisions:
  1. half (__half)

  2. single (float)

  3. double (double)

  4. fp8_e4m3 (__nv_fp8_e4m3)

  5. fp8_e5m2 (__nv_fp8_e5m2)

  6. bf16 (__nv_bfloat16)

  7. tf32 (cublasdx::tfloat32_t)

In eight integral precisions:
  1. Signed 8-bit (int8_t)

  2. Unsigned 8-bit (uint8_t)

  3. Signed 16-bit (int16_t)

  4. Unsigned 16-bit (uint16_t)

  5. Signed 32-bit (int32_t)

  6. Unsigned 32-bit (uint32_t)

  7. Signed 64-bit (int64_t)

  8. Unsigned 64-bit (uint64_t)

Starting from cuBLASDx 0.2.0, matrix multiplication with different precisions for A, B, and C is supported.

Any combination of three precisions is supported, as long as either:
  1. All are floating point precisions.

  2. All are integral precisions and:
    1. The accumulator is at least 4x wider than any input,

    2. Input signedness implies accumulator signedness (a signed input requires a signed accumulator).

Mixed floating-point/integral GEMMs are unsupported, but in-register input conversion can be used to achieve the same effect.
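
For instance, an int8 × int8 GEMM accumulated in int32 satisfies both rules: the accumulator is 4x wider than the inputs, and signed inputs get a signed accumulator. A sketch with the three-argument Precision operator (the tile size and SM target are illustrative assumptions):

```c++
#include <cublasdx.hpp>

// int8_t inputs, int32_t accumulator: 4x wider and signed, as required.
using IGEMM = decltype(cublasdx::Size<64, 64, 64>()
                     + cublasdx::Precision<int8_t, int8_t, int32_t>()  // A, B, C precisions
                     + cublasdx::Type<cublasdx::type::real>()
                     + cublasdx::Function<cublasdx::function::MM>()
                     + cublasdx::SM<800>()
                     + cublasdx::Block());
```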

Supported Input Types#

Note

Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e., the input/output data for each matrix can be of arbitrary type (even integral input for a floating point GEMM), provided that the Alignment Operator is set and at least one of the following conditions is met (a sketch follows the list):

  1. It is implicitly convertible to the compute type chosen with the Precision Operator and Type Operator.

  2. For inputs: an appropriate converting load operation is provided as one of the arguments. It takes a value of the input type, and its result must be at least implicitly convertible to the compute type.

  3. For output: an appropriate converting store operation is provided as one of the arguments. It takes the computational result type (usually the C type as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
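
A rough sketch of conditions 2 and 3, passing converting load/store functors to execute (the trailing-argument order shown here, load ops for A, B, and C followed by a store op for C, is an assumption; consult the execute method reference for the exact overload):

```c++
// Hypothetical converting operations: int8_t storage <-> float computation.
struct load_as_float {
    __device__ float operator()(int8_t v) const { return static_cast<float>(v); }
};
struct store_as_int8 {
    __device__ int8_t operator()(float v) const { return static_cast<int8_t>(v); }
};

// Inside the kernel: a, b, c are tensors over int8_t data; the GEMM computes in float.
GEMM().execute(alpha, a, b, beta, c,
               load_as_float{},    // converting load for A
               load_as_float{},    // converting load for B
               load_as_float{},    // converting load for C
               store_as_int8{});   // converting store for C
```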

Supported Input Layouts#

Data can be provided in any layout described by a cublasdx::tensor (or the underlying cute::Tensor), as long as it is a 2-dimensional tensor. Its modes can be hierarchical, but there must be exactly two of them.
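
For example, a column-major m × k tile of A can be wrapped as follows (make_tensor is the documented entry point; the pointer, sizes, and leading dimension are placeholders):

```c++
// Column-major m x k tile: stride 1 along rows, lda between columns.
auto layout_a = cute::make_layout(cute::make_shape(m, k),
                                  cute::make_stride(1, lda));
auto tensor_a = cublasdx::make_tensor(smem_a, layout_a);
```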

Supported Maximal Sizes with non-overlapping A and B#

The table below presents the maximal supported sizes for three commonly used floating point precisions (half, single, and double) and both types (real and complex), assuming the m, n, and k dimensions are equal and the precisions of A, B, and C are the same.

The effectively supported dimensions are much larger if:
  1. Dimensions are not equal (tall or wide matrices)

  2. A and B are aliased and share elements (e.g., multiplying A by its own transposition requires fitting only A)

| Function | Type, Precision of A/B/C | Architecture (SM) | Max size (C in shared memory) | Max size (C in registers) |
|----------|--------------------------|-------------------|-------------------------------|---------------------------|
| GEMM | Real, half | 70, 72 | 128 | 156 |
| GEMM | Real, half | 75 | 104 | 127 |
| GEMM | Real, half | 80, 87 | 166 | 203 |
| GEMM | Real, half | 86, 89, 120, 121 | 129 | 157 |
| GEMM | Real, half | 90, 100, 101, 103, 110 | 196 | 240 |
| GEMM | Real, float; Complex, half | 70, 72 | 90 | 110 |
| GEMM | Real, float; Complex, half | 75 | 73 | 89 |
| GEMM | Real, float; Complex, half | 80, 87 | 117 | 143 |
| GEMM | Real, float; Complex, half | 86, 89, 120, 121 | 91 | 111 |
| GEMM | Real, float; Complex, half | 90, 100, 101, 103, 110 | 139 | 170 |
| GEMM | Real, double; Complex, float | 70, 72 | 64 | 78 |
| GEMM | Real, double; Complex, float | 75 | 52 | 63 |
| GEMM | Real, double; Complex, float | 80, 87 | 83 | 101 |
| GEMM | Real, double; Complex, float | 86, 89, 120, 121 | 64 | 78 |
| GEMM | Real, double; Complex, float | 90, 100, 101, 103, 110 | 98 | 120 |
| GEMM | Complex, double | 70, 72 | 45 | 55 |
| GEMM | Complex, double | 75 | 36 | 44 |
| GEMM | Complex, double | 80, 87 | 58 | 71 |
| GEMM | Complex, double | 86, 89, 120, 121 | 45 | 55 |
| GEMM | Complex, double | 90, 100, 101, 103, 110 | 69 | 84 |

Both "Max size" columns assume A and B are restricted to non-overlapping regions of shared memory; they differ in whether C resides in shared memory or in registers.

Warning

Starting with cuBLASDx 0.3.0, there are no static assertions on whether the chosen size will fit on the device. This is a consequence of allowing inputs to overlap and of providing a register-based accumulation API alongside the shared memory one.

Supported MMA Data Types#

The table below lists the precisions of A, B, and C for which a specialized Tensor Core operation is available.

The type and precision of the scaling factors, i.e., \({\alpha}\) and \({\beta}\), are expected to match those of matrix C. Matrices A, B, and C must all be either real or complex data types. If the I/O precision is decoupled from the computation precision, the scale type must be compatible with the compute type.

| Precision A | Precision B | Precision C | Note |
|-------------|-------------|-------------|------|
| fp8_e4m3 | fp8_e4m3 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| fp8_e4m3 | fp8_e5m2 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| fp8_e5m2 | fp8_e5m2 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| fp8_e5m2 | fp8_e4m3 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| half | half | half | MMA for SM70+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| half | half | float | MMA for SM70+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| bf16 | bf16 | float | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| tf32 | tf32 | float | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| int8_t | int8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| uint8_t | int8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| int8_t | uint8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| uint8_t | uint8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| double | double | double | MMA, SM80+ |

Note

If an MMA instruction exists for the combination of precisions of A, B, and C, cuBLASDx will use it automatically on supported GPU architectures. Otherwise, cuBLASDx falls back to FMA instructions, with no performance guarantees.
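
For example, to hit the fp8 Tensor Core path from the first row of the table, the descriptor pairs fp8 inputs with a float accumulator and targets an architecture with fp8 MMA support (SM 89 here; the tile size is an illustrative assumption):

```c++
#include <cuda_fp8.h>    // __nv_fp8_e4m3
#include <cublasdx.hpp>

// fp8_e4m3 x fp8_e4m3 -> float: selects the fp8 MMA on SM89 and newer.
using FP8GEMM = decltype(cublasdx::Size<64, 64, 64>()
                       + cublasdx::Precision<__nv_fp8_e4m3, __nv_fp8_e4m3, float>()
                       + cublasdx::Type<cublasdx::type::real>()
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<890>()
                       + cublasdx::Block());
```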