NVIDIA cuBLASDx#
The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernel. Currently this is limited to General Matrix Multiplication (GEMM). Fusing linear algebra routines with other operations can decrease latency and improve the overall performance of your application.
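For orientation, the sketch below shows what this can look like in practice. It follows the operator-composition style of the cuBLASDx quick start, but exact traits and execute() overloads vary between cuBLASDx versions, so treat the kernel name, variable names, and sizes as illustrative rather than canonical.

```cpp
#include <cublasdx.hpp>

// Describe a 32x32x32 single-precision real GEMM, executed collectively
// by one 256-thread block on an SM 8.0 (Ampere) GPU.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::BlockDim<256>()
                      + cublasdx::Block());

// Computes C = alpha * A * B + beta * C for a single tile. Assumes a, b,
// and c are stored in the layouts the GEMM description expects.
__global__ void gemm_kernel(float alpha, const float* a, const float* b,
                            float beta, float* c) {
    extern __shared__ __align__(16) char smem_bytes[];
    auto* sa = reinterpret_cast<float*>(smem_bytes);
    auto* sb = sa + GEMM::a_size;
    auto* sc = sb + GEMM::b_size;

    // Stage the operand tiles in shared memory (stride matches BlockDim<256>).
    for (unsigned i = threadIdx.x; i < GEMM::a_size; i += 256) sa[i] = a[i];
    for (unsigned i = threadIdx.x; i < GEMM::b_size; i += 256) sb[i] = b[i];
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += 256) sc[i] = c[i];
    __syncthreads();

    // The whole block cooperates on the multiplication.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Write the result tile back; a fused epilogue could run here first.
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += 256) c[i] = sc[i];
}
```

On the host, the description's traits drive the launch, for example gemm_kernel<<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, a, b, beta, c).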
The documentation consists of two main components:
A quick start guide, General Matrix Multiply Using cuBLASDx.
An API reference for a comprehensive overview of the provided functionality.
Warning
CUDA 12.8 and 12.9 are known to miscompile cuBLASDx 0.3.1 in some rare slow-path cases (see the CUDA 12.9 release notes for more details).
It has been observed to happen when all of the following conditions are met:
- Any of the computation types is fp8_e5m2, fp8_e4m3, fp16, bf16, or int8.
- Any of M, N, or K (Size Operator) is not a multiple of 16.
- A custom static leading dimension is used (LeadingDimension Operator).
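For illustration only, a hypothetical description that meets all three conditions, and would therefore be at risk, might look like the following; the sizes and leading dimensions are made up:

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

// Hypothetical configuration hitting all three conditions:
//   - fp16 computation type,
//   - K = 24, which is not a multiple of 16,
//   - custom static leading dimensions via the LeadingDimension operator.
using AffectedGEMM = decltype(cublasdx::Size<32, 32, 24>()
                              + cublasdx::Precision<__half>()
                              + cublasdx::Type<cublasdx::type::real>()
                              + cublasdx::Function<cublasdx::function::MM>()
                              + cublasdx::LeadingDimension<40, 40, 40>()
                              + cublasdx::SM<800>()
                              + cublasdx::Block());
```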
These code corruptions may manifest as either incorrect results or illegal memory access errors.
Affected cuBLASDx 0.3.1 users can add the -Xptxas -O1 flag to their compilation command to disable the PTX optimization phase; this produces slower code, but not a corrupted binary.
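For example, assuming a translation unit named gemm_fused.cu (the file name and include path are placeholders):

```bash
# Workaround: disable the PTX optimization phase with -Xptxas -O1.
nvcc -std=c++17 -arch=sm_80 -I/path/to/cublasdx/include \
     -Xptxas -O1 gemm_fused.cu -o gemm_fused
```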
This warning will be updated when the NVCC issue is resolved. We are actively working to fix this issue in an upcoming CUDA release.
Highlights#
The cuBLASDx library currently provides:
BLAS GEMM routine embeddable into a CUDA kernel.
High performance, with no unnecessary data movement to and from global memory.
Customizability, options to adjust GEMM routine for different needs (size, precision, type, targeted CUDA architecture, etc.).
Flexibility of performing accumulation and fusion in either shared memory or registers.
Ability to fuse BLAS kernels with other operations to save round trips to global memory (see the sketch after this list).
Compatibility with future versions of the CUDA Toolkit.
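As a sketch of the fusion and shared-memory accumulation highlights, the kernel below reuses the hypothetical GEMM description from the sketch in the introduction and applies a ReLU epilogue while the result tile is still in shared memory, so the data makes only one trip back to global memory. As before, the names are illustrative and the exact execute() overload is version-dependent.

```cpp
// Fused GEMM + ReLU: the activation runs on the shared-memory result,
// avoiding a separate elementwise kernel and an extra global-memory round trip.
__global__ void gemm_relu_kernel(float alpha, const float* a, const float* b,
                                 float beta, float* c) {
    extern __shared__ __align__(16) char smem_bytes[];
    auto* sa = reinterpret_cast<float*>(smem_bytes);
    auto* sb = sa + GEMM::a_size;
    auto* sc = sb + GEMM::b_size;

    // ... stage the a, b, and c tiles into sa, sb, and sc as in the earlier sketch ...
    __syncthreads();

    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Fused epilogue: ReLU applied in shared memory, then a single store.
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += 256) {
        c[i] = sc[i] > 0.0f ? sc[i] : 0.0f;
    }
}
```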