Requirements and Functionality¶
Requirements¶
cuBLASDx is a CUDA C++ header-only library. Therefore, the list of required software to use the library is relatively small:
CUDA Toolkit 11.4 or newer
Supported CUDA compiler (C++17 required)
Supported host compiler (C++17 required)
(Optionally) CMake (version 3.18 or greater)
Dependencies:
commonDx (shipped with MathDx package)
CUTLASS 3.5.1 or newer (CUTLASS 3.5.1 shipped with MathDx package)
Supported Compilers¶
CUDA Compilers:
NVCC 11.4.152+ (CUDA Toolkit 11.4 or newer)
(Experimental support) NVRTC 11.4.152+ (CUDA Toolkit 11.4 or newer)
Host / C++ Compilers:
GCC 7+
Clang 9+ (only on Linux/WSL2)
HPC SDK nvc++ 23.1+
Note
We recommend always using the latest CUDA Toolkit and NVCC compiler, and GCC 9 or newer.
Warning
Compiling cuBLASDx on Windows with MSVC has not been tested and is not supported yet. However, it’s possible to compile kernels with cuBLASDx on Windows using NVRTC as presented in one of the examples.
Note
cuBLASDx emits errors for unsupported versions of the C++ standard. These can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation, but cuBLASDx is not guaranteed to work with unsupported C++ standard versions.
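For reference, the macro can also be defined in source before the first include (a minimal sketch; passing -DCUBLASDX_IGNORE_DEPRECATED_DIALECT on the compiler command line has the same effect):

```cpp
// Sketch: silence the C++ dialect check before including the library header.
// cuBLASDx is still not guaranteed to work with unsupported dialects.
#define CUBLASDX_IGNORE_DEPRECATED_DIALECT
#include <cublasdx.hpp>
```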
Warning
Because of a known GCC issue, cuBLASDx supports NVCC 11.6+ only if the host compiler is GCC 11+. In addition, we recommend using NVCC 12.3+ when running kernels on Volta and Turing architectures to avoid a known compiler bug that leads to incorrect results in certain use cases.
Supported Functionality¶
This is an Early Access (EA) version of cuBLASDx. The current functionality of the library is a subset of the capabilities cuBLASDx will have on the first release.
Supported features include:
Creating block descriptors that run GEMM, the general matrix multiply routine: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\). See Block Operator.
Automatic use of Tensor Cores.
Bi-directional information flow, from the user to the descriptor via Operators and from the descriptor to the user via Traits (see the sketch after this list).
Targeting specific GPU architectures using the SM Operator. This enables users to configure the descriptor with suggested parameters to obtain good performance on the target architecture.
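As a sketch of this flow (the operator and trait names below follow the library's examples and may vary between versions), a block GEMM descriptor can be assembled from operators and queried through traits:

```cpp
#include <cuda_fp16.h>
#include <cublasdx.hpp>

// Operators (user -> descriptor): a 32 x 32 x 32 real, half-precision GEMM
// targeting SM80, executed collectively by a single CUDA thread block.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<__half>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Traits (descriptor -> user): suggested launch configuration, assuming the
// block_dim and shared_memory_size traits shown in the library's examples:
// my_kernel<GEMM><<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(...);
```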
cuBLASDx supports all GEMM sizes, defined by the m, n, and k dimensions, that fit into shared memory: matrices A, B, and C have to fit into shared memory to perform the computation. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C Programming Guide.
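To make this constraint concrete, here is a minimal sketch (a hypothetical helper, not part of the cuBLASDx API) that checks whether unpadded A, B, and C matrices fit into a given shared memory budget:

```cpp
#include <cuda_fp16.h>

// Hypothetical helper: A is m x k, B is k x n, C is m x n, all stored
// unpadded with element type T in shared memory.
template<class T>
constexpr bool fits_in_shared_memory(unsigned long long m, unsigned long long n,
                                     unsigned long long k,
                                     unsigned long long max_shared_bytes) {
    return (m * k + k * n + m * n) * sizeof(T) <= max_shared_bytes;
}

// Example: a 128 x 128 x 128 half-precision GEMM needs exactly
// 3 * 128 * 128 * 2 bytes = 96 KiB of shared memory.
static_assert(fits_in_shared_memory<__half>(128, 128, 128, 96 * 1024), "");
```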
cuBLASDx supports calculations with two types, real and complex, in seven floating-point precisions: half (__half), single (float), double (double), fp8_e4m3 (__nv_fp8_e4m3), fp8_e5m2 (__nv_fp8_e5m2), bf16 (__nv_bfloat16), and tf32 (cublasdx::tfloat32_t).
The memory layout of each matrix can be either column-major or row-major.
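As an illustration, a complex, single-precision GEMM with a row-major A matrix could be described as below; the Arrangement operator and the cublasdx::row_major / cublasdx::col_major tags are assumptions about the operator-style API and may differ between library versions:

```cpp
#include <cublasdx.hpp>

// Sketch: complex, single-precision GEMM with row-major A and
// column-major B and C (operator name and tags assumed, see above).
using CGEMM = decltype(cublasdx::Size<32, 32, 32>()
                       + cublasdx::Precision<float>()
                       + cublasdx::Type<cublasdx::type::complex>()
                       + cublasdx::Arrangement<cublasdx::row_major,   // A
                                               cublasdx::col_major,   // B
                                               cublasdx::col_major>() // C
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<800>()
                       + cublasdx::Block());
```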
The table below presents the maximum supported sizes for three commonly used floating-point precisions (half, single, and double) and both types (real and complex), assuming the m, n, and k dimensions are equal and the precisions of A, B, and C are the same.
| Function | Type, Precision of A/B/C | Architecture | Max Size |
|---|---|---|---|
| GEMM | Real, half | 70, 72 | 128 |
| | | 75 | 104 |
| | | 80, 87 | 166 |
| | | 86, 89 | 129 |
| | | 90 | 196 |
| | Complex, half; Real, single | 70, 72 | 90 |
| | | 75 | 73 |
| | | 80, 87 | 117 |
| | | 86, 89 | 91 |
| | | 90 | 139 |
| | Complex, single; Real, double | 70, 72 | 64 |
| | | 75 | 52 |
| | | 80, 87 | 83 |
| | | 86, 89 | 64 |
| | | 90 | 98 |
| | Complex, double | 70, 72 | 45 |
| | | 75 | 36 |
| | | 80, 87 | 58 |
| | | 86, 89 | 45 |
| | | 90 | 69 |
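These limits follow directly from the shared memory budget: the maximum size is the largest n for which three n x n matrices still fit. A sketch of the arithmetic (the helper is hypothetical; the per-block opt-in shared memory maximums, e.g. 227 KiB on SM90 and 64 KiB on SM75, come from the CUDA C Programming Guide):

```cpp
// Hypothetical helper: largest n such that three n x n matrices of the
// given element size fit into the given shared memory capacity.
constexpr unsigned max_square_size(unsigned long long element_bytes,
                                   unsigned long long shared_kib) {
    unsigned n = 0;
    while (3 * (n + 1ull) * (n + 1ull) * element_bytes <= shared_kib * 1024) {
        ++n;
    }
    return n;
}

static_assert(max_square_size(2, 227) == 196, "");  // real half on SM90
static_assert(max_square_size(16, 64) == 36, "");   // complex double on SM75
```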
Supported GEMM Data Types¶
Starting from cuBLASDx 0.2.0, matrix multiplication with different precisions of A, B, and C is supported. The table below lists the supported precisions of A, B, and C. The type and precision of the scales, i.e., \({\alpha}\) and \({\beta}\), are expected to be the same as those of matrix C, and A, B, and C have to be either all real or all complex.
| Precision A | Precision B | Precision C | Note |
|---|---|---|---|
| fp8_e4m3 | fp8_e4m3 | float | MMA, SM89+ |
| fp8_e4m3 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e5m2 | float | MMA, SM89+ |
| fp8_e5m2 | fp8_e4m3 | float | MMA, SM89+ |
| half | half | half | MMA |
| half | half | float | MMA |
| bf16 | bf16 | float | MMA |
| tf32 | tf32 | float | MMA |
| double | double | double | MMA |
| fp8_e4m3 | fp8_e4m3 | half | |
| fp8_e4m3 | fp8_e4m3 | bf16 | |
| fp8_e4m3 | fp8_e4m3 | double | |
| fp8_e4m3 | float | float | |
| fp8_e4m3 | float | double | |
| fp8_e4m3 | double | double | |
| fp8_e5m2 | fp8_e5m2 | half | |
| fp8_e5m2 | fp8_e5m2 | bf16 | |
| fp8_e5m2 | fp8_e5m2 | double | |
| fp8_e5m2 | float | float | |
| fp8_e5m2 | float | double | |
| fp8_e5m2 | double | double | |
| half | half | double | |
| half | float | float | |
| half | float | double | |
| half | double | double | |
| bf16 | bf16 | bf16 | |
| bf16 | bf16 | double | |
| bf16 | float | float | |
| bf16 | float | double | |
| bf16 | double | double | |
| tf32 | tf32 | double | |
| tf32 | float | float | |
| tf32 | float | double | |
| tf32 | double | double | |
| float | fp8_e4m3 | float | |
| float | fp8_e4m3 | double | |
| float | fp8_e5m2 | float | |
| float | fp8_e5m2 | double | |
| float | half | float | |
| float | half | double | |
| float | bf16 | float | |
| float | bf16 | double | |
| float | tf32 | float | |
| float | tf32 | double | |
| float | float | float | |
| float | float | double | |
| float | double | double | |
| double | fp8_e5m2 | double | |
| double | fp8_e4m3 | double | |
| double | half | double | |
| double | bf16 | double | |
| double | tf32 | double | |
| double | float | double | |
Note
If an MMA instruction exists for the combination of precisions of A, B, and C, cuBLASDx will use it automatically on the supported GPU architectures. Otherwise, cuBLASDx will use FMA instructions and there are no performance guarantees.
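For example, a mixed-precision descriptor with fp8 inputs and a float output could be written as below, assuming the three-type form of the Precision operator used for mixed-precision GEMM:

```cpp
#include <cuda_fp8.h>
#include <cublasdx.hpp>

// Sketch: A and B in fp8_e4m3, C (and the alpha/beta scales) in float.
// On SM89+ this combination maps to an fp8 MMA instruction (see the table
// above); on older architectures cuBLASDx falls back to FMA.
using MixedGEMM = decltype(cublasdx::Size<16, 16, 16>()
                           + cublasdx::Precision<__nv_fp8_e4m3, __nv_fp8_e4m3, float>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<890>()
                           + cublasdx::Block());
```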