Requirements and Functionality#

Requirements#

cuBLASDx is a CUDA C++ header-only library. Therefore, the list of required software to use the library is relatively short:

  • CUDA Toolkit 11.4 or newer

  • Supported CUDA compiler (C++17 required)

  • Supported host compiler (C++17 required)

  • (Optional) CMake (version 3.18 or greater)

Dependencies:

  • commonDx (included with the MathDx package)

  • CUTLASS 3.9.0 or newer (CUTLASS 3.9.0 is included with the MathDx package)
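Since cuBLASDx is header-only, a minimal translation unit only needs the main header and a C++17-capable CUDA compiler (assuming the MathDx include directory is on the compiler's include path):

```cpp
// Minimal sketch: nothing to link against; compile with e.g. nvcc -std=c++17.
#include <cublasdx.hpp>

int main() {
    // The library is usable at this point; descriptors are built at compile
    // time from operators, as shown in the sections below.
    return 0;
}
```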

Supported Compilers#

CUDA Compilers:

  • NVCC 11.4.152+ (CUDA Toolkit 11.4 or newer)

  • (Experimental support) NVRTC 11.4.152+ (CUDA Toolkit 11.4 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (Linux/WSL2 only)

  • HPC SDK nvc++ 23.1+

Note

We recommend using GCC 9+ as the host compiler, and NVCC shipped with the latest CUDA Toolkit as the CUDA compiler.

Warning

Compiling cuBLASDx on Windows with MSVC has not been tested and is not supported yet. However, it is possible to compile kernels with cuBLASDx on Windows using NVRTC, as demonstrated in one of the examples.

Note

cuBLASDx emits errors for unsupported C++ standard versions, which can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation. cuBLASDx is not guaranteed to work with C++ standard versions that are not officially supported.

Warning

Due to a known GCC issue, when the host compiler is GCC 11 or newer, cuBLASDx requires NVCC 11.6+. In addition, we recommend using NVCC 12.3+ when running kernels on Volta and Turing architectures to avoid a known compiler bug that can lead to incorrect results for certain use cases.

Supported Functionality#

This is an Early Access (EA) version of cuBLASDx. The current functionality of the library is a subset of the capabilities that will be available in the first official release.

Supported features include (a minimal descriptor sketch follows the list):

  • Creating block descriptors that execute the GEMM (general matrix multiply) routine: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\) (See Function operator).

  • Automatic use of Tensor Cores and automatic data layouts for optimal memory access patterns.

  • Using either register fragments or shared memory as input/output memory space for accumulation.

  • Bi-directional information flow: from the user to the descriptor via Operators, and from the descriptor to the user via Traits.

  • Targeting specific GPU architectures using the SM Operator. This enables users to configure the descriptor with suggested parameters for optimal performance.
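As a hedged illustration of the operator/trait flow, the sketch below builds a hypothetical 32x32x32 block GEMM descriptor; the chosen size, half precision, and SM<800> target are assumptions, not requirements:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

// Operators flow information from the user into the descriptor; combining
// them with operator+ produces a complete block execution descriptor.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<__half>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Traits flow information back from the descriptor to the user, e.g. the
// block dimensions suggested for the targeted architecture (trait name per
// the Traits documentation).
const dim3 suggested = GEMM::suggested_block_dim;
```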

Supported Memory Spaces#

cuBLASDx supports all GEMM sizes defined by the m, n, and k dimensions that can fit into the combined register file (RF) and shared memory. Matrices A and B must fit into shared memory to perform computations. These input matrices may overlap or alias each other. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C Programming Guide.

The input/output C matrix can be (see the sketch after this list):
  1. Provided in shared memory (it cannot alias any input elements) for \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)

  2. Provided as a register fragment for accumulation to \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)

  3. Returned by value as a register fragment from \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
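A condensed sketch of the three forms, reusing a hypothetical 32x32x32 half-precision GEMM descriptor like the one above. The shared memory call follows the pointer-based execute API; the register fragment forms are shown schematically in comments because their exact signatures depend on the library version:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

template<class GEMM>
__global__ void gemm_kernel(__half alpha, __half beta) {
    extern __shared__ __align__(16) __half smem[];
    __half* sa = smem;          // A: m x k
    __half* sb = sa + 32 * 32;  // B: k x n (offsets assume the 32x32x32 size)
    __half* sc = sb + 32 * 32;  // C: m x n
    // ... load A, B, C from global memory into shared memory (omitted) ...

    // (1) C in shared memory: C = alpha * A * B + beta * C.
    GEMM().execute(alpha, sa, sb, beta, sc);

    // (2) C as a register fragment passed in for accumulation (C = A*B + C),
    // and (3) C returned by value as a register fragment (C = A*B), are
    // schematically:
    //   GEMM().execute(a_tensor, b_tensor, c_fragment);        // accumulate
    //   auto c_fragment = GEMM().execute(a_tensor, b_tensor);  // C = A*B
}
```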

Supported Computation Types#

cuBLASDx supports calculations in two domains:
  1. Real

  2. Complex

In seven floating point precisions:
  1. half (__half)

  2. single (float)

  3. double (double)

  4. fp8_e4m3 (__nv_fp8_e4m3)

  5. fp8_e5m2 (__nv_fp8_e5m2)

  6. bf16 (__nv_bfloat16)

  7. tf32 (cublasdx::tfloat32_t)

In eight integral precisions:
  1. Signed 8-bit (int8_t)

  2. Unsigned 8-bit (uint8_t)

  3. Signed 16-bit (int16_t)

  4. Unsigned 16-bit (uint16_t)

  5. Signed 32-bit (int32_t)

  6. Unsigned 32-bit (uint32_t)

  7. Signed 64-bit (int64_t)

  8. Unsigned 64-bit (uint64_t)

Starting from cuBLASDx 0.2.0, matrix multiplication with different precisions for A, B, and C is supported.

Any combination of three precisions is supported, as long as either:
  1. All are floating point precisions.

  2. All are integral precisions and:
    1. The accumulator is at least 4x wider than any input,

    2. If either input precision is signed, the accumulator must also be signed.

Mixed floating/integral GEMMs are unsupported, but in-register input conversion can be applied to achieve this effect.
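As an illustration, the sketch below combines int8_t inputs with an int32_t accumulator, satisfying both rules above (the accumulator is 4x wider, and the signed inputs meet a signed accumulator); the size and SM target are arbitrary choices:

```cpp
#include <cublasdx.hpp>
#include <cstdint>

// Since cuBLASDx 0.2.0 the Precision operator may take three types,
// one each for A, B, and C.
using MixedGEMM = decltype(cublasdx::Size<32, 32, 32>()
                           + cublasdx::Precision<int8_t, int8_t, int32_t>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<800>()
                           + cublasdx::Block());
```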

Supported Input Types#

Note

Starting from cuBLASDx 0.3.0, computational precision is decoupled from data precision: the input/output data for each matrix can be of an arbitrary type (even integral input for a floating point GEMM), provided that the Alignment Operator is set and at least one of the following conditions is met (illustrated in the sketch after this list):

  1. The type is implicitly convertible to the compute type chosen with the Precision Operator and Type Operator.

  2. For inputs: a converting load operation is provided as one of the arguments. It takes a value of the input type, and its result must be at least implicitly convertible to the compute type.

  3. For output: a converting store operation is provided as one of the arguments. It takes a value of the result compute type (usually the C type as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
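A sketch of what such converting operations might look like; the struct names are assumptions, and any callable with a compatible signature should work when passed to execute() alongside the matrix arguments:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>
#include <cstdint>

// Illustrative converting load operation: int8_t input data promoted to a
// half compute precision.
struct a_load_op {
    __device__ __half operator()(int8_t v) const {
        return __half(static_cast<float>(v));
    }
};

// Illustrative converting store operation: half compute result widened to a
// float output type.
struct c_store_op {
    __device__ float operator()(__half v) const {
        return static_cast<float>(v);
    }
};
```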

Supported Input Layouts#

Data can be provided in any layout described by a cublasdx::tensor (or underlying cute::Tensor) as long as it is a 2-dimensional tensor. Its modes can be hierarchical, but there must be exactly two of them.
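For instance, a shared memory tile can be wrapped in a 2-D tensor with an explicit cute layout; the 32x32 extents and the column-major strides below are illustrative:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

// Build a 2-D (m, k) tensor over shared memory with compile-time extents
// and column-major strides.
__device__ auto make_a_tensor(__half* smem_a) {
    auto layout = cute::make_layout(
        cute::make_shape(cute::Int<32>{}, cute::Int<32>{}),   // (m, k)
        cute::make_stride(cute::Int<1>{}, cute::Int<32>{}));  // column-major
    return cublasdx::make_tensor(smem_a, layout);
}
```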

Supported Maximal Sizes with non-overlapping A and B#

The table below presents the maximal supported sizes for commonly used combinations of floating point precision (half, single, and double) and type (real or complex), assuming the m, n, and k dimensions are equal and the precisions of A, B, and C are the same.

The effective supported dimensions are much larger if:
  1. The dimensions are not equal (long and wide matrices)

  2. A and B are aliased and share elements (e.g. multiplying A by its transpose requires fitting only A)

| Function | Type, Precision of A/B/C | Architecture | Max Size (Restricted AB, C in shared) | Max Size (Restricted AB, C in registers) |
|----------|--------------------------|--------------|---------------------------------------|------------------------------------------|
| GEMM | Real, half | 70, 72 | 128 | 156 |
| | | 75 | 104 | 127 |
| | | 80, 87 | 166 | 203 |
| | | 86, 89, 120, 121 | 129 | 157 |
| | | 90, 100, 101, 103 | 196 | 240 |
| | Real, float / Complex, half | 70, 72 | 90 | 110 |
| | | 75 | 73 | 89 |
| | | 80, 87 | 117 | 143 |
| | | 86, 89, 120, 121 | 91 | 111 |
| | | 90, 100, 101, 103 | 139 | 170 |
| | Real, double / Complex, float | 70, 72 | 64 | 78 |
| | | 75 | 52 | 63 |
| | | 80, 87 | 83 | 101 |
| | | 86, 89, 120, 121 | 64 | 78 |
| | | 90, 100, 101, 103 | 98 | 120 |
| | Complex, double | 70, 72 | 45 | 55 |
| | | 75 | 36 | 44 |
| | | 80, 87 | 58 | 71 |
| | | 86, 89, 120, 121 | 45 | 55 |
| | | 90, 100, 101, 103 | 69 | 84 |

Warning

Starting with cuBLASDx 0.3.0, there are no static assertions on whether the chosen size will fit on the device. This is a consequence of allowing inputs to overlap and of providing a register-based accumulation API alongside the shared memory one.
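Consequently, a launch-time check is advisable. The sketch below assumes a shared_memory_size trait on the descriptor (consult the Traits section for the exact name in your version) and opts the kernel into the larger dynamic shared memory carve-out:

```cpp
#include <cublasdx.hpp>
#include <cuda_runtime.h>

// Verify the descriptor's shared memory requirement against the device limit
// before launching, and raise the dynamic shared memory cap if it fits.
template<class GEMM>
bool configure_launch(const void* kernel, int device) {
    int max_smem = 0;
    cudaDeviceGetAttribute(&max_smem, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    if (GEMM::shared_memory_size > static_cast<unsigned>(max_smem)) {
        return false;  // the chosen size does not fit on this device
    }
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         GEMM::shared_memory_size);
    return true;
}
```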

Supported MMA Data Types#

The table below lists the combinations of A, B, and C precisions for which a specialized Tensor Core (MMA) operation is available.

The type and precision of the scaling factors, i.e., \({\alpha}\) and \({\beta}\), are expected to match those of matrix C. Matrices A, B, and C must all be either real or complex data types. If the I/O precision is decoupled from the computation precision, the scale type must be compatible with the compute type.

| Precision A | Precision B | Precision C | Note |
|-------------|-------------|-------------|------|
| fp8_e4m3 | fp8_e4m3 | float | MMA, SM89+ |
| fp8_e4m3 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e4m3 | float | MMA, SM89+ |
| half | half | half | MMA, SM70+ |
| half | half | float | MMA, SM70+ |
| bf16 | bf16 | float | MMA, SM80+ |
| tf32 | tf32 | float | MMA, SM80+ |
| double | double | double | MMA, SM80+ |
| int8_t | int8_t | int32_t | MMA, SM80+ |
| uint8_t | int8_t | int32_t | MMA, SM80+ |
| int8_t | uint8_t | int32_t | MMA, SM80+ |
| uint8_t | uint8_t | int32_t | MMA, SM80+ |

Note

If an MMA instruction exists for the combination of precisions of A, B, and C, cuBLASDx will use the MMA instruction automatically on supported GPU architectures. Otherwise, cuBLASDx will use an FMA instruction, and there are no performance guarantees.
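As an illustration, the descriptor below uses the half/half/float combination from the table, so on SM70 and newer architectures the MMA path is expected to be selected automatically; no extra operator is needed to opt in, and the size and SM<800> target are arbitrary choices:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

// half inputs with a float accumulator: a table entry with MMA support,
// so Tensor Cores are used automatically on supported architectures.
using MmaGEMM = decltype(cublasdx::Size<64, 64, 64>()
                         + cublasdx::Precision<__half, __half, float>()
                         + cublasdx::Type<cublasdx::type::real>()
                         + cublasdx::Function<cublasdx::function::MM>()
                         + cublasdx::SM<800>()
                         + cublasdx::Block());
```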