Requirements and Functionality¶
Requirements¶
cuBLASDx is a CUDA C++ header-only library. Therefore, the list of required software to use the library is relatively small:
CUDA Toolkit 11.4 or newer
Supported CUDA compiler (C++17 required)
Supported host compiler (C++17 required)
(Optionally) CMake (version 3.18 or greater)
Dependencies:
commonDx (shipped with MathDx package)
CUTLASS 3.5.1 or newer (CUTLASS 3.5.1 shipped with MathDx package)
Supported Compilers¶
CUDA Compilers:
NVCC 11.4.152+ (CUDA Toolkit 11.4 or newer)
(Experimental support) NVRTC 11.4.152+ (CUDA Toolkit 11.4 or newer)
Host / C++ Compilers:
GCC 7+
Clang 9+ (only on Linux/WSL2)
HPC SDK nvc++ 23.1+
Note
We recommend always using the latest CUDA Toolkit and NVCC compiler, and GCC 9 or newer.
Warning
Compiling cuBLASDx on Windows with MSVC has not been tested and is not supported yet. However, it’s possible to compile kernels with cuBLASDx on Windows using NVRTC as presented in one of the examples.
Note
cuBLASDx emits errors for unsupported versions of the C++ standard. These can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation, but cuBLASDx is not guaranteed to work with unsupported C++ standard versions.
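For reference, the macro can also be defined in source before the first include (a minimal sketch; passing -DCUBLASDX_IGNORE_DEPRECATED_DIALECT on the compiler command line has the same effect):

```cpp
// Sketch: silence the C++ dialect check before including the library header.
// cuBLASDx is still not guaranteed to work with unsupported dialects.
#define CUBLASDX_IGNORE_DEPRECATED_DIALECT
#include <cublasdx.hpp>
```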
Warning
Because of a known GCC issue, cuBLASDx supports NVCC 11.6+ only if the host compiler is GCC 11+. In addition, we recommend using NVCC 12.3+ when running kernels on Volta and Turing architectures to avoid a known compiler bug that leads to incorrect results in certain use cases.
Supported Functionality¶
This is an Early Access (EA) version of cuBLASDx. The current functionality of the library is a subset of the capabilities cuBLASDx will have on the first release.
Supported features include:
Creating block descriptors that run GEMM, the general matrix multiply routine: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\). See Block Operator.
Automatic use of Tensor Cores.
Bi-directional information flow, from the user to the descriptor via Operators and from the descriptor to the user via Traits (see the sketch after this list).
Targeting specific GPU architectures using the SM Operator. This enables users to configure the descriptor with suggested parameters to obtain good performance on the target architecture.
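As a sketch of this flow (the operator and trait names below follow the library's examples and may vary between versions), a block GEMM descriptor can be assembled from operators and queried through traits:

```cpp
#include <cuda_fp16.h>
#include <cublasdx.hpp>

// Operators (user -> descriptor): a 32 x 32 x 32 real, half-precision GEMM
// targeting SM80, executed collectively by a single CUDA thread block.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<__half>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Traits (descriptor -> user): suggested launch configuration, assuming the
// block_dim and shared_memory_size traits shown in the library's examples:
// my_kernel<GEMM><<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(...);
```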
cuBLASDx supports all GEMM sizes, defined by the m, n, and k dimensions, that fit into shared memory: matrices A, B, and C have to fit into shared memory to perform the computation. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C Programming Guide.
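To make this constraint concrete, here is a minimal sketch (a hypothetical helper, not part of the cuBLASDx API) that checks whether unpadded A, B, and C matrices fit into a given shared memory budget:

```cpp
#include <cuda_fp16.h>

// Hypothetical helper: A is m x k, B is k x n, C is m x n, all stored
// unpadded with element type T in shared memory.
template<class T>
constexpr bool fits_in_shared_memory(unsigned long long m, unsigned long long n,
                                     unsigned long long k,
                                     unsigned long long max_shared_bytes) {
    return (m * k + k * n + m * n) * sizeof(T) <= max_shared_bytes;
}

// Example: a 128 x 128 x 128 half-precision GEMM needs exactly
// 3 * 128 * 128 * 2 bytes = 96 KiB of shared memory.
static_assert(fits_in_shared_memory<__half>(128, 128, 128, 96 * 1024), "");
```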
cuBLASDx supports calculations with two types, real and complex, in seven floating-point precisions: half (__half), single (float), double (double), fp8_e4m3 (__nv_fp8_e4m3), fp8_e5m2 (__nv_fp8_e5m2), bf16 (__nv_bfloat16), and tf32 (cublasdx::tfloat32_t).
The memory layout of each matrix can be either column-major or row-major.
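As an illustration, a complex, single-precision GEMM with a row-major A matrix could be described as below; the Arrangement operator and the cublasdx::row_major / cublasdx::col_major tags are assumptions about the operator-style API and may differ between library versions:

```cpp
#include <cublasdx.hpp>

// Sketch: complex, single-precision GEMM with row-major A and
// column-major B and C (operator name and tags assumed, see above).
using CGEMM = decltype(cublasdx::Size<32, 32, 32>()
                       + cublasdx::Precision<float>()
                       + cublasdx::Type<cublasdx::type::complex>()
                       + cublasdx::Arrangement<cublasdx::row_major,   // A
                                               cublasdx::col_major,   // B
                                               cublasdx::col_major>() // C
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<800>()
                       + cublasdx::Block());
```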
The table below presents the maximum supported sizes for three commonly used floating-point precisions (half, single, and double) and both types (real and complex), assuming the m, n, and k dimensions are equal and the precisions of A, B, and C are the same.
| Function | Type, Precision of A/B/C | Architecture | Max Size |
|---|---|---|---|
| GEMM | Real, half | 70, 72 | 128 |
| | | 75 | 104 |
| | | 80, 87 | 166 |
| | | 86, 89 | 129 |
| | | 90 | 196 |
| | Complex, half; Real, single | 70, 72 | 90 |
| | | 75 | 73 |
| | | 80, 87 | 117 |
| | | 86, 89 | 91 |
| | | 90 | 139 |
| | Complex, single; Real, double | 70, 72 | 64 |
| | | 75 | 52 |
| | | 80, 87 | 83 |
| | | 86, 89 | 64 |
| | | 90 | 98 |
| | Complex, double | 70, 72 | 45 |
| | | 75 | 36 |
| | | 80, 87 | 58 |
| | | 86, 89 | 45 |
| | | 90 | 69 |
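These limits follow directly from the shared memory budget: the maximum size is the largest n for which three n x n matrices still fit. A sketch of the arithmetic (the helper is hypothetical; the per-block opt-in shared memory maximums, e.g. 227 KiB on SM90 and 64 KiB on SM75, come from the CUDA C Programming Guide):

```cpp
// Hypothetical helper: largest n such that three n x n matrices of the
// given element size fit into the given shared memory capacity.
constexpr unsigned max_square_size(unsigned long long element_bytes,
                                   unsigned long long shared_kib) {
    unsigned n = 0;
    while (3 * (n + 1ull) * (n + 1ull) * element_bytes <= shared_kib * 1024) {
        ++n;
    }
    return n;
}

static_assert(max_square_size(2, 227) == 196, "");  // real half on SM90
static_assert(max_square_size(16, 64) == 36, "");   // complex double on SM75
```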
Supported GEMM Data Types¶
Starting from cuBLASDx 0.2.0, matrix multiplication with different precisions of A, B, and C is supported. The table below lists the supported precisions of A, B, and C. The type and precision of the scales, i.e., \({\alpha}\) and \({\beta}\), are expected to be the same as those of matrix C, and A, B, and C have to be either all real or all complex.
| Precision A | Precision B | Precision C | Note |
|---|---|---|---|
| fp8_e4m3 | fp8_e4m3 | float | MMA, SM89+ |
| fp8_e4m3 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e4m3 | float | MMA, SM89+ |
| half | half | half | MMA |
| half | half | float | MMA |
| bf16 | bf16 | float | MMA |
| tf32 | tf32 | float | MMA |
| double | double | double | MMA |
| fp8_e4m3 | fp8_e4m3 | half | |
| fp8_e4m3 | fp8_e4m3 | bf16 | |
| fp8_e4m3 | fp8_e4m3 | double | |
| fp8_e4m3 | float | float | |
| fp8_e4m3 | float | double | |
| fp8_e4m3 | double | double | |
| fp8_e5m2 | fp8_e5m2 | half | |
| fp8_e5m2 | fp8_e5m2 | bf16 | |
| fp8_e5m2 | fp8_e5m2 | double | |
| fp8_e5m2 | float | float | |
| fp8_e5m2 | float | double | |
| fp8_e5m2 | double | double | |
| half | half | double | |
| half | float | float | |
| half | float | double | |
| half | double | double | |
| bf16 | bf16 | bf16 | |
| bf16 | bf16 | double | |
| bf16 | float | float | |
| bf16 | float | double | |
| bf16 | double | double | |
| tf32 | tf32 | double | |
| tf32 | float | float | |
| tf32 | float | double | |
| tf32 | double | double | |
| float | fp8_e4m3 | float | |
| float | fp8_e4m3 | double | |
| float | fp8_e5m2 | float | |
| float | fp8_e5m2 | double | |
| float | half | float | |
| float | half | double | |
| float | bf16 | float | |
| float | bf16 | double | |
| float | tf32 | float | |
| float | tf32 | double | |
| float | float | float | |
| float | float | double | |
| float | double | double | |
| double | fp8_e5m2 | double | |
| double | fp8_e4m3 | double | |
| double | half | double | |
| double | bf16 | double | |
| double | tf32 | double | |
| double | float | double | |
Note
If an MMA instruction exists for the combination of precisions of A, B, and C, cuBLASDx will use it automatically on the supported GPU architectures. Otherwise, cuBLASDx will use FMA instructions and there are no performance guarantees.
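For example, a mixed-precision descriptor with fp8 inputs and a float output could be written as below, assuming the three-type form of the Precision operator used for mixed-precision GEMM:

```cpp
#include <cuda_fp8.h>
#include <cublasdx.hpp>

// Sketch: A and B in fp8_e4m3, C (and the alpha/beta scales) in float.
// On SM89+ this combination maps to an fp8 MMA instruction (see the table
// above); on older architectures cuBLASDx falls back to FMA.
using MixedGEMM = decltype(cublasdx::Size<16, 16, 16>()
                           + cublasdx::Precision<__nv_fp8_e4m3, __nv_fp8_e4m3, float>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<890>()
                           + cublasdx::Block());
```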