Requirements and Functionality

Requirements

cuBLASDx is a CUDA C++ header-only library. Therefore, the list of software required to use the library is relatively small:

  • CUDA Toolkit 11.4 or newer

  • Supported CUDA compiler (C++17 required)

  • Supported host compiler (C++17 required)

  • (Optionally) CMake (version 3.18 or greater)

Dependencies:

  • commonDx (shipped with the MathDx package)

  • CUTLASS 3.3.0 or newer (CUTLASS 3.3.0 is shipped with the MathDx package)

Supported Compilers

CUDA Compilers:

  • NVCC 11.4.152+ (CUDA Toolkit 11.4 or newer)

  • (Experimental support) NVRTC 11.4.152+ (CUDA Toolkit 11.4 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (only on Linux/WSL2)

Warning

Compiling cuBLASDx on Windows with MSVC has not been tested and is not yet supported. However, it is possible to compile kernels with cuBLASDx on Windows using NVRTC, as presented in one of the examples.

Note

cuBLASDx emits errors for unsupported compiler versions; these can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_COMPILER during compilation. cuBLASDx is not guaranteed to work with compiler versions it does not support.

Note

cuBLASDx emits errors for unsupported C++ standard versions; these can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation. cuBLASDx is not guaranteed to work with C++ standard versions it does not support.
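
Both macros are ordinary preprocessor definitions, so they can be passed on the compiler command line. The invocation below is only a sketch; the include path and file names are placeholders, not part of the MathDx package.

```shell
# Sketch only: <mathdx_include_dir> and my_kernel.cu are placeholders.
nvcc -std=c++17 \
     -I<mathdx_include_dir> \
     -DCUBLASDX_IGNORE_DEPRECATED_COMPILER \
     -DCUBLASDX_IGNORE_DEPRECATED_DIALECT \
     my_kernel.cu -o my_kernel
```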

Supported Functionality

This is an Early Access (EA) version of cuBLASDx. The current functionality of the library is a subset of the capabilities cuBLASDx will have in its first full release.

Supported features include:

  • Creating block descriptors that run GEMM, the general matrix multiply routine: \(C = \alpha \cdot op(A) \cdot op(B) + \beta \cdot C\). See Block Operator.

  • Automatic use of Tensor Cores when processing real half-precision, real double-precision, and complex double-precision data.

  • Bi-directional information flow, from the user to the descriptor via Operators and from the descriptor to the user via Traits.

  • Targeting specific GPU architectures using the SM Operator, which lets users configure the descriptor with parameters suggested for performance on the target architecture.
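
Putting the operators and traits together, a block descriptor is composed at compile time by combining operators inside a `decltype`. The sketch below is illustrative only: the operator and trait spellings (`Size`, `Precision`, `Type`, `Function`, `SM`, `Block`, `shared_memory_size`) are assumed to match the Block Operator documentation and may differ in your cuBLASDx version; consult the examples shipped with the MathDx package for the exact API.

```cpp
#include <cublasdx.hpp>

// Illustrative sketch of composing a GEMM block descriptor.
// Operator/trait names are assumptions based on the Block Operator docs.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()             // M, N, K
                    + cublasdx::Precision<float>()             // real single precision
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()                      // target SM 80 (Ampere)
                    + cublasdx::Block());                      // execute per thread block

// Traits carry information back from the descriptor to the user,
// e.g. how much shared memory the kernel must allocate:
// constexpr unsigned smem_bytes = GEMM::shared_memory_size;
```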

cuBLASDx supports all GEMM sizes, defined by the M, N, and K dimensions, for which matrices A, B, and C fit into shared memory, since all three matrices must reside in shared memory to perform computations. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C++ Programming Guide.

As of now, cuBLASDx supports calculations with two data types, real and complex, in three floating-point precisions: half (__half), single (float), and double (double). Each matrix can be in one of three transpose modes: non-transposed, transposed, or conjugate-transposed.

The table below lists the maximum supported size for every precision and type, assuming the M, N, and K dimensions are equal.

Function | Type, Precision               | Architecture | Max Size
---------|-------------------------------|--------------|---------
GEMM     | Real, half                    | 70, 72       | 128
         |                               | 75           | 104
         |                               | 80, 87       | 166
         |                               | 86, 89       | 129
         |                               | 90           | 196
GEMM     | Real, float;                  | 70, 72       | 90
         | Complex, half                 | 75           | 73
         |                               | 80, 87       | 117
         |                               | 86, 89       | 91
         |                               | 90           | 139
GEMM     | Real, double;                 | 70, 72       | 64
         | Complex, float                | 75           | 52
         |                               | 80, 87       | 83
         |                               | 86, 89       | 64
         |                               | 90           | 98
GEMM     | Complex, double               | 70, 72       | 45
         |                               | 75           | 36
         |                               | 80, 87       | 58
         |                               | 86, 89       | 45
         |                               | 90           | 69