Requirements and Functionality#

Requirements#

cuBLASDx is a CUDA C++ header-only library. Therefore, the list of required software to use the library is relatively short:

  • CUDA Toolkit 12.0 or newer

  • Supported CUDA compiler (C++17 required)

  • Supported host compiler (C++17 required)

  • (Optional) CMake (version 3.30 or greater)

Dependencies:

  • commonDx (included with the MathDx package)

  • CUTLASS 4.2.0 or newer (CUTLASS 4.2.0 is included with the MathDx package)

Supported Compilers#

CUDA Compilers:

  • NVCC 12.0.76+ (CUDA Toolkit 12.0 or newer)

  • (Experimental support) NVRTC 12.0.76+ (CUDA Toolkit 12.0 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (Linux/WSL2 only)

Note

We recommend using GCC 9+ as the host compiler, and NVCC shipped with the latest CUDA Toolkit as the CUDA compiler.

Warning

Compiling cuBLASDx on Windows with MSVC has not been tested and is not supported yet. However, it is possible to compile kernels with cuBLASDx on Windows using NVRTC, as demonstrated in one of the examples.

Note

cuBLASDx emits errors for unsupported C++ standard versions, which can be silenced by defining CUBLASDX_IGNORE_DEPRECATED_DIALECT during compilation. cuBLASDx is not guaranteed to work with C++ standard versions that are not officially supported.
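
For example, the macro can be defined before the first include (or, equivalently, passed on the compiler command line):

```c++
// Acknowledge an officially unsupported C++ dialect and silence the error.
#define CUBLASDX_IGNORE_DEPRECATED_DIALECT
#include <cublasdx.hpp>
```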

Supported Functionality#

This is an Early Access (EA) version of cuBLASDx. The current functionality of the library is a subset of the capabilities that will be available in the first official release.

Supported features include:

  • Creating block descriptors that execute the GEMM (general matrix multiply) routine: \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\) (See Function operator).

  • Creating pipeline objects that introduce asynchronous overlapping between stages of tile GEMM execution.

  • Automatic use of Tensor Cores and the Tensor Memory Accelerator (TMA), as well as automatic data layouts for optimal memory access patterns.

  • Using either register fragments or shared memory as input/output memory space for accumulation.

  • Bi-directional information flow: from the user to the descriptor via Operators, and from the descriptor to the user via Traits (see the sketch after this list).

  • Targeting specific GPU architectures using the SM Operator. This enables users to configure the descriptor with suggested parameters for optimal performance.
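
As a rough sketch of how operators and traits fit together (the operator and trait names follow the cuBLASDx API; the concrete values, such as the 32×32×32 tile, half precision, and the SM 80 target, are illustrative assumptions):

```c++
#include <cuda_fp16.h>   // __half
#include <cublasdx.hpp>

// Operators flow information from the user into the descriptor.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()              // m, n, k
                    + cublasdx::Precision<__half>()             // one precision for A, B, and C
                    + cublasdx::Type<cublasdx::type::real>()    // real or complex domain
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()                       // target GPU architecture
                    + cublasdx::Block());                       // block-level execution

// Traits flow information from the descriptor back to the user.
constexpr dim3         block_dim  = GEMM::block_dim;            // suggested launch configuration
constexpr unsigned int smem_bytes = GEMM::shared_memory_size;   // shared memory the tile needs
```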

Supported Memory Spaces#

cuBLASDx tiles support all GEMM sizes defined by the m, n, and k dimensions that can fit into the combined register file (RF) and shared memory. Matrices A and B must fit into shared memory to perform computations. These input matrices may overlap or alias each other. The maximum amount of shared memory per CUDA thread block can be found in the CUDA C Programming Guide.
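
Since a large tile can require more than the default static limit, kernels typically opt in to a larger dynamic shared memory allocation. A minimal sketch (the kernel body is elided; GEMM is a block execution descriptor as above):

```c++
template<class BLAS>
__global__ void gemm_kernel(/* global memory pointers */) {
    extern __shared__ __align__(16) char smem[];  // dynamic shared memory for the A/B/C tiles
    // ... stage tiles into smem and call BLAS().execute(...) ...
}

void launch(dim3 grid) {
    // Raise the dynamic shared memory limit before launching large tiles.
    cudaFuncSetAttribute(gemm_kernel<GEMM>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         GEMM::shared_memory_size);
    gemm_kernel<GEMM><<<grid, GEMM::block_dim, GEMM::shared_memory_size>>>();
}
```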

The input/output C matrix can be (see the sketch after this list):
  1. Provided in shared memory (it cannot alias any input elements) for \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)

  2. Provided as a register fragment for accumulation to \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)

  3. Returned by value as a register fragment from \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
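
A sketch of the three variants using the tensor API (smem_a/smem_b/smem_c, alpha, and beta are placeholders; the get_layout_smem_* helpers and execute overloads follow the cuBLASDx shared memory and register fragment APIs, but exact names and return shapes should be checked against the release in use):

```c++
// Inside the kernel body: wrap shared memory tiles in tensors
// with the descriptor's shared memory layouts.
auto a = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
auto b = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
auto c = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());

// (1) C in shared memory: C = alpha * A * B + beta * C.
GEMM().execute(alpha, a, b, beta, c);

// (3) C returned by value as a register fragment (C = A * B); the call also
// returns a partitioner describing which elements each thread owns.
auto [c_frag, partitioner] = GEMM().execute(a, b);

// (2) Accumulation into an existing register fragment (C = A * B + C).
GEMM().execute(a, b, c_frag);
```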

The pipeline extension to cuBLASDx allows GEMM execution (as described with operators) to start from global memory and efficiently pipelines subsequent stages until all tiles have been loaded and multiplied.

Supported Computation Types#

cuBLASDx supports calculations in two domains:
  1. Real

  2. Complex

In seven floating point precisions:
  1. half (__half)

  2. single (float)

  3. double (double)

  4. fp8_e4m3 (__nv_fp8_e4m3)

  5. fp8_e5m2 (__nv_fp8_e5m2)

  6. bf16 (__nv_bfloat16)

  7. tf32 (cublasdx::tfloat32_t)

In eight integral precisions:
  1. Signed 8-bit (int8_t)

  2. Unsigned 8-bit (uint8_t)

  3. Signed 16-bit (int16_t)

  4. Unsigned 16-bit (uint16_t)

  5. Signed 32-bit (int32_t)

  6. Unsigned 32-bit (uint32_t)

  7. Signed 64-bit (int64_t)

  8. Unsigned 64-bit (uint64_t)

Starting from cuBLASDx 0.2.0, matrix multiplication with different precisions for A, B, and C is supported.

Any combination of three precisions is supported, as long as either:
  1. All are floating point precisions.

  2. All are integral precisions and:
    1. The accumulator is at least 4x wider than any input,

    2. Input signedness implies accumulator signedness (a signed input requires a signed accumulator).

Mixed floating-point/integral GEMMs are unsupported, but in-register input conversion can be used to achieve the same effect.
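
For instance, an int8 × int8 GEMM accumulated in int32 satisfies both rules: the accumulator is 4x wider than the inputs, and signed inputs get a signed accumulator. A sketch with the three-argument Precision operator (the tile size and SM target are illustrative assumptions):

```c++
#include <cublasdx.hpp>

// int8_t inputs, int32_t accumulator: 4x wider and signed, as required.
using IGEMM = decltype(cublasdx::Size<64, 64, 64>()
                     + cublasdx::Precision<int8_t, int8_t, int32_t>()  // A, B, C precisions
                     + cublasdx::Type<cublasdx::type::real>()
                     + cublasdx::Function<cublasdx::function::MM>()
                     + cublasdx::SM<800>()
                     + cublasdx::Block());
```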

Supported Input Types#

Note

Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e., the input/output data for each matrix can be of arbitrary type (even integral input for a floating point GEMM), provided that the Alignment Operator is set and at least one of the following conditions is met (a sketch follows the list):

  1. It is implicitly convertible to the compute type chosen with the Precision Operator and Type Operator.

  2. For inputs: an appropriate converting load operation is provided as one of the arguments. It takes a value of the input type, and its result must be at least implicitly convertible to the compute type.

  3. For output: an appropriate converting store operation is provided as one of the arguments. It takes the computational result type (usually the C type as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
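
A rough sketch of conditions 2 and 3, passing converting load/store functors to execute (the trailing-argument order shown here, load ops for A, B, and C followed by a store op for C, is an assumption; consult the execute method reference for the exact overload):

```c++
// Hypothetical converting operations: int8_t storage <-> float computation.
struct load_as_float {
    __device__ float operator()(int8_t v) const { return static_cast<float>(v); }
};
struct store_as_int8 {
    __device__ int8_t operator()(float v) const { return static_cast<int8_t>(v); }
};

// Inside the kernel: a, b, c are tensors over int8_t data; the GEMM computes in float.
GEMM().execute(alpha, a, b, beta, c,
               load_as_float{},    // converting load for A
               load_as_float{},    // converting load for B
               load_as_float{},    // converting load for C
               store_as_int8{});   // converting store for C
```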

Supported Input Layouts#

Data can be provided in any layout described by a cublasdx::tensor (or the underlying cute::Tensor), as long as it is a 2-dimensional tensor. Its modes can be hierarchical, but there must be exactly two of them.
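
For example, a column-major m × k tile of A can be wrapped as follows (make_tensor is the documented entry point; the pointer, sizes, and leading dimension are placeholders):

```c++
// Column-major m x k tile: stride 1 along rows, lda between columns.
auto layout_a = cute::make_layout(cute::make_shape(m, k),
                                  cute::make_stride(1, lda));
auto tensor_a = cublasdx::make_tensor(smem_a, layout_a);
```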

Supported Maximal Sizes with non-overlapping A and B#

The table below presents the maximal supported sizes for three commonly used floating point precisions (half, single, and double) and both types (real and complex), assuming the m, n, and k dimensions are equal and the precisions of A, B, and C are the same.

The effectively supported dimensions are much larger if:
  1. Dimensions are not equal (tall or wide matrices)

  2. A and B are aliased and share elements (e.g., multiplying A by its own transposition requires fitting only A)

| Function | Type, Precision of A/B/C | Architecture (SM) | Max size (C in shared memory) | Max size (C in registers) |
|----------|--------------------------|-------------------|-------------------------------|---------------------------|
| GEMM | Real, half | 70, 72 | 128 | 156 |
| GEMM | Real, half | 75 | 104 | 127 |
| GEMM | Real, half | 80, 87 | 166 | 203 |
| GEMM | Real, half | 86, 89, 120, 121 | 129 | 157 |
| GEMM | Real, half | 90, 100, 101, 103, 110 | 196 | 240 |
| GEMM | Real, float; Complex, half | 70, 72 | 90 | 110 |
| GEMM | Real, float; Complex, half | 75 | 73 | 89 |
| GEMM | Real, float; Complex, half | 80, 87 | 117 | 143 |
| GEMM | Real, float; Complex, half | 86, 89, 120, 121 | 91 | 111 |
| GEMM | Real, float; Complex, half | 90, 100, 101, 103, 110 | 139 | 170 |
| GEMM | Real, double; Complex, float | 70, 72 | 64 | 78 |
| GEMM | Real, double; Complex, float | 75 | 52 | 63 |
| GEMM | Real, double; Complex, float | 80, 87 | 83 | 101 |
| GEMM | Real, double; Complex, float | 86, 89, 120, 121 | 64 | 78 |
| GEMM | Real, double; Complex, float | 90, 100, 101, 103, 110 | 98 | 120 |
| GEMM | Complex, double | 70, 72 | 45 | 55 |
| GEMM | Complex, double | 75 | 36 | 44 |
| GEMM | Complex, double | 80, 87 | 58 | 71 |
| GEMM | Complex, double | 86, 89, 120, 121 | 45 | 55 |
| GEMM | Complex, double | 90, 100, 101, 103, 110 | 69 | 84 |

Both "Max size" columns assume A and B are restricted to non-overlapping regions of shared memory; they differ in whether C resides in shared memory or in registers.

Warning

Starting with cuBLASDx 0.3.0, there are no static assertions on whether the chosen size will fit on the device. This is a consequence of allowing inputs to overlap and of providing a register-based accumulation API alongside the shared memory one.

Supported MMA Data Types#

The table below lists the precisions of A, B, and C for which a specialized Tensor Core operation is available.

The type and precision of the scaling factors, i.e., \({\alpha}\) and \({\beta}\), are expected to match those of matrix C. Matrices A, B, and C must all be either real or complex data types. If the I/O precision is decoupled from the computation precision, the scale type must be compatible with the compute type.

| Precision A | Precision B | Precision C | Note |
|-------------|-------------|-------------|------|
| fp8_e4m3 | fp8_e4m3 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| fp8_e4m3 | fp8_e5m2 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| fp8_e5m2 | fp8_e5m2 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| fp8_e5m2 | fp8_e4m3 | float | MMA for SM89+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| half | half | half | MMA for SM70+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| half | half | float | MMA for SM70+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| bf16 | bf16 | float | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| tf32 | tf32 | float | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| int8_t | int8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| uint8_t | int8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| int8_t | uint8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| uint8_t | uint8_t | int32_t | MMA for SM80+, WGMMA for SM90a, and 1SM UTCMMA for SM100a |
| double | double | double | MMA, SM80+ |

Note

If an MMA instruction exists for the combination of precisions of A, B, and C, cuBLASDx will use it automatically on supported GPU architectures. Otherwise, cuBLASDx falls back to FMA instructions, with no performance guarantees.
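
For example, to hit the fp8 Tensor Core path from the first row of the table, the descriptor pairs fp8 inputs with a float accumulator and targets an architecture with fp8 MMA support (SM 89 here; the tile size is an illustrative assumption):

```c++
#include <cuda_fp8.h>    // __nv_fp8_e4m3
#include <cublasdx.hpp>

// fp8_e4m3 x fp8_e4m3 -> float: selects the fp8 MMA on SM89 and newer.
using FP8GEMM = decltype(cublasdx::Size<64, 64, 64>()
                       + cublasdx::Precision<__nv_fp8_e4m3, __nv_fp8_e4m3, float>()
                       + cublasdx::Type<cublasdx::type::real>()
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<890>()
                       + cublasdx::Block());
```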