NVIDIA cuBLASDx#

The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernels. It is currently limited to General Matrix Multiplication (GEMM). Fusing linear algebra routines with other operations can decrease latency and improve the overall performance of your application.
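
To make this concrete, below is a minimal sketch of what an embedded, fused GEMM can look like, written against the operator-based description API used throughout the cuBLASDx examples. The pointer-based execute() overload and the exact operator and trait spellings are assumptions here and may differ between library versions; consult the API reference for the authoritative signatures.

```cuda
#include <cublasdx.hpp>

// Compile-time description of one block-level GEMM: C = alpha * A * B + beta * C,
// 32 x 32 x 32, single precision, real valued, targeting SM 8.0.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

constexpr unsigned int tile = 32 * 32; // elements per 32x32 tile

// One CUDA block stages its tiles in shared memory, runs the embedded GEMM, and fuses
// an elementwise epilogue (here: ReLU) before writing back, saving a round trip to
// global memory compared to running a separate epilogue kernel.
__global__ void fused_gemm_kernel(const float* a, const float* b, float* c,
                                  float alpha, float beta) {
    __shared__ __align__(16) float smem[3 * tile];
    float* sa = smem;
    float* sb = smem + tile;
    float* sc = smem + 2 * tile;

    const unsigned int tid      = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    const unsigned int nthreads = blockDim.x * blockDim.y * blockDim.z;

    // Load the A, B, C tiles from global to shared memory (simplified strided copy).
    for (unsigned int i = tid; i < tile; i += nthreads) {
        sa[i] = a[i];
        sb[i] = b[i];
        sc[i] = c[i];
    }
    __syncthreads();

    // Cooperative, block-wide GEMM. The pointer-based shared-memory overload shown here
    // is an assumption; check the cuBLASDx API reference for the exact execute() signatures.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Fused epilogue applied while the result is still in shared memory.
    for (unsigned int i = tid; i < tile; i += nthreads) {
        c[i] = sc[i] > 0.0f ? sc[i] : 0.0f;
    }
}
```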


The cuBLASDx library currently provides:#


[Figure: tiled execution]

  • BLAS GEMM routine embeddable into a CUDA kernel.
    • Static database dispatch to the appropriate FMA / MMA instruction configuration.

    • Automatic dispatch to optimal autovectorized LDSM / STSM MMA loading instructions.

    • Opaque analytical generation of swizzled layouts for best performance.

  • Customizability: options to adjust the GEMM routine to different needs (size, precision, type, targeted CUDA architecture, etc.); see the sketch after this list.

  • Flexibility to perform accumulation and fusion in either shared memory or registers.

  • Ability to fuse BLAS kernels with other operations in order to save global memory trips.

  • Compatibility with future versions of the CUDA Toolkit.

  • Autovectorized, high-performance data movement between shared and global memory.
    • Opaque analytical generation of swizzled layouts for best performance.

    • Currently dispatches to either vectorized LDG / STG or LDGSTS.
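
The description operators are the customization knobs from the list above: changing the size, precision, number type, or target architecture is a matter of swapping one operator. A minimal sketch, assuming the block_dim and shared_memory_size trait names used in the cuBLASDx examples (verify them against the version you are using):

```cuda
#include <cublasdx.hpp>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Same execution model, different problem: here the GEMM is switched to
// 16 x 8 x 32 tiles, half precision, complex numbers, targeting SM 9.0.
using GEMM_FP16C = decltype(cublasdx::Size<16, 8, 32>()
                          + cublasdx::Precision<__half>()
                          + cublasdx::Type<cublasdx::type::complex>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<900>()
                          + cublasdx::Block());

// Kernel body omitted; it follows the pattern from the previous sketch.
template<class BLAS>
__global__ void gemm_kernel() { /* stage tiles, BLAS().execute(...), store */ }

void launch(cudaStream_t stream) {
    // block_dim and shared_memory_size are assumed trait names (taken from the
    // cuBLASDx examples) describing the suggested launch configuration.
    gemm_kernel<GEMM_FP16C><<<1, GEMM_FP16C::block_dim,
                              GEMM_FP16C::shared_memory_size, stream>>>();
}
```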

The pipelining extension to cuBLASDx (0.5.0+) offers:#


[Figure: pipelined execution]

  • Automatic N-buffer staged pipelined GEMM execution; see the conceptual sketch after this list.
    • Increases the number of bytes in flight during computation.

    • Better exposes GPU asynchronicity to cuBLASDx users.

    • Uses barrier-based synchronization, compatible with Volta+ GPUs.

  • Automatic dispatch between TMA (Tensor Memory Accelerator), LDGSTS, and LDG+STG for global-to-shared transfers.
    • Fully asynchronous transfers where possible.

    • Automatic maximal vectorization and equal work division among threads.

    • Opaque generation of swizzled layouts for shared memory.

  • Automatic internal use of selected WGMMA / 1SM UTCMMA instructions.

  • Automatic opaque warp specialization in selected cases.

  • Automatic opaque register trading in selected cases.

  • The same interface on all CUDA architectures.
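
For readers unfamiliar with staged pipelining, the sketch below illustrates the underlying idea using the plain libcu++ cuda::pipeline API: up to N asynchronous global-to-shared copies are kept in flight while the oldest stage is being consumed. This is not the cuBLASDx pipelining interface, which hides this machinery behind its own opaque API; the stage count, tile size, and consume_tile() placeholder are illustrative assumptions.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr unsigned int stages     = 2;    // N shared-memory buffers (double buffering)
constexpr unsigned int tile_elems = 1024; // elements copied per stage

__device__ void consume_tile(const float* tile) {
    // Placeholder for the per-tile computation (e.g. a GEMM on the staged tile).
}

__global__ void staged_pipeline_kernel(const float* in, int num_tiles) {
    auto block = cg::this_thread_block();
    __shared__ float smem[stages][tile_elems];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, stages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    for (int compute = 0, fetch = 0; compute < num_tiles; ++compute) {
        // Producer side: keep up to `stages` asynchronous copies in flight,
        // increasing the number of bytes in flight while earlier tiles are consumed.
        for (; fetch < num_tiles && fetch < compute + static_cast<int>(stages); ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[fetch % stages], in + fetch * tile_elems,
                               sizeof(float) * tile_elems, pipe);
            pipe.producer_commit();
        }
        // Consumer side: wait for the oldest stage, compute on it, then recycle its buffer.
        pipe.consumer_wait();
        consume_tile(smem[compute % stages]);
        pipe.consumer_release();
    }
}
```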