NVIDIA cuBLASDx#
The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernel. This is currently limited to General Matrix Multiplication (GEMM). Fusing linear algebra routines with other operations can decrease latency and improve the overall performance of your application.
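For a sense of what a GEMM embedded in a CUDA kernel looks like, here is a minimal sketch following the pattern of the quick start guide: the GEMM is described by composing operators at compile time, executed on shared-memory tiles, and a simple bias epilogue is fused into the same kernel. The naive copy loops, the bias term, and the 1-D thread indexing are illustrative simplifications, not the library's recommended approach.

```cpp
#include <cublasdx.hpp>

// Compile-time GEMM description: 32x32x32 real FP32 GEMM, executed by one thread block,
// targeting SM 8.0 (adjust SM<> to your architecture).
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

__global__ void fused_gemm_kernel(float alpha, const float* a, const float* b,
                                  float beta, float* c, float bias) {
    extern __shared__ __align__(16) char smem[];
    auto* sa = reinterpret_cast<float*>(smem);
    auto* sb = sa + GEMM::a_size;
    auto* sc = sb + GEMM::b_size;

    // Stage A, B, C tiles in shared memory (naive copies, assuming a 1-D thread block
    // and inputs stored contiguously in the default arrangement; for brevity only).
    for (unsigned i = threadIdx.x; i < GEMM::a_size; i += blockDim.x) sa[i] = a[i];
    for (unsigned i = threadIdx.x; i < GEMM::b_size; i += blockDim.x) sb[i] = b[i];
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) sc[i] = c[i];
    __syncthreads();

    // Block-wide GEMM executed entirely inside the kernel: C = alpha*A*B + beta*C.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Fused epilogue: add a bias before writing back, saving an extra global memory trip.
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) c[i] = sc[i] + bias;
}
```

A launch would use GEMM::block_dim threads and GEMM::shared_memory_size bytes of dynamic shared memory, both exposed as traits of the description; see General Matrix Multiply Using cuBLASDx for the complete, recommended version.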
The documentation consists of three main components:
- A quick start guide, General Matrix Multiply Using cuBLASDx.
- An advanced pipelining API guide, Pipelined GEMM Using cuBLASDx Pipelines.
- An API reference for a comprehensive overview of the provided functionality.
The cuBLASDx library currently provides:#
- BLAS GEMM routine embeddable into a CUDA kernel.
  - Static database dispatching to appropriate FMA / MMA instruction configuration.
  - Automatic dispatch to optimal autovectorized LDSM / STSM MMA loading instructions.
  - Opaque analytical generation of swizzled layouts for best performance.
  - Customizability, options to adjust GEMM routine for different needs (size, precision, type, targeted CUDA architecture, etc.).
  - Flexibility of performing accumulation and fusion in either shared memory or registers.
  - Ability to fuse BLAS kernels with other operations in order to save global memory trips.
  - Compatibility with future versions of the CUDA Toolkit.
- Autovectorizing high performance data movement between shared and global memory (see the sketch after this list).
  - Opaque analytical generation of swizzled layouts for best performance.
  - Currently dispatches to either vectorized LDG / STG or LDGSTS.
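The data movement listed above is exposed through the tensor-based copy API. The sketch below follows the example pattern from the API reference (make_tensor, slice_shared_memory, copy, copy_wait, and the layout/alignment traits); exact names and availability should be checked against the cuBLASDx version in use.

```cpp
#include <cublasdx.hpp>

// GEMM description composed from operators, as in the earlier sketch.
using GEMM = decltype(cublasdx::Size<64, 64, 64>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

__global__ void gemm_with_copy(float alpha, const float* a, const float* b,
                               float beta, float* c) {
    extern __shared__ __align__(16) char smem[];
    using alignment = cublasdx::alignment_of<GEMM>;

    // Wrap global memory in tensors carrying the layouts chosen by cuBLASDx.
    auto a_global = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Shared-memory tensors use opaque, possibly swizzled layouts.
    auto [smem_a, smem_b, smem_c] = GEMM::slice_shared_memory(smem);
    auto a_shared = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
    auto c_shared = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());

    // Autovectorized global-to-shared copies (LDG/STG or LDGSTS underneath),
    // with work divided among the threads of the block.
    cublasdx::copy<GEMM, alignment::a>(a_global, a_shared);
    cublasdx::copy<GEMM, alignment::b>(b_global, b_shared);
    cublasdx::copy<GEMM, alignment::c>(c_global, c_shared);
    cublasdx::copy_wait();

    GEMM().execute(alpha, a_shared, b_shared, beta, c_shared);
    __syncthreads();

    // Copy the result back to global memory.
    cublasdx::copy<GEMM, alignment::c>(c_shared, c_global);
}
```

The shared-memory layouts returned by get_layout_smem_* are opaque, so the kernel never needs to reason about swizzling; the copy helper vectorizes the transfers where alignment allows.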
The pipelining extension to cuBLASDx (0.5.0+) offers:#
- Automatic N-buffer staged pipelined GEMM execution (a conceptual illustration follows this list).
  - Increases the number of bytes in flight during computation.
  - Better exposes GPU asynchronicity to cuBLASDx users.
  - Barrier based synchronization, compatible with Volta+ GPUs.
- Automatic dispatching between Tensor TMA / LDGSTS / LDG+STG for global to shared transfers.
  - Fully asynchronous transfers where possible.
  - Automatic maximal vectorization and equal work division among threads.
  - Opaque generation of swizzled layouts for shared memory.
- Automatic dispatching between MMA instructions.
  - Automatic internal exposure of selected WGMMA / 1SM UTCMMA instructions.
  - Automatic opaque warp specialization in selected cases.
  - Automatic opaque register trading in selected cases.
  - The same interface on all CUDA architectures.
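The cuBLASDx pipelines API itself is described in Pipelined GEMM Using cuBLASDx Pipelines. Purely as a conceptual reference, the sketch below shows the kind of manual multi-stage staging, written here with libcu++'s cuda::pipeline and two shared-memory buffers, that the extension automates and hides behind its interface. It illustrates keeping more bytes in flight during computation; it is not the cuBLASDx pipelines API.

```cpp
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Generic 2-stage staging loop: while the block computes on tile t, the copy for
// tile t+1 is already in flight. cuBLASDx pipelines automate this kind of N-buffer
// staging (and the TMA / LDGSTS dispatch) for GEMM operands.
__global__ void staged_scale(const float* in, float* out, int num_tiles, int tile_elems) {
    extern __shared__ float smem[];                 // holds stages_count * tile_elems floats
    constexpr size_t stages_count = 2;
    auto block = cooperative_groups::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, stages_count> state;
    auto pipe = cuda::make_pipeline(block, &state);

    // Prologue: start copies for the first stages_count tiles.
    int prefetched = 0;
    for (; prefetched < num_tiles && prefetched < (int)stages_count; ++prefetched) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, smem + (prefetched % stages_count) * tile_elems,
                           in + prefetched * tile_elems,
                           sizeof(float) * tile_elems, pipe);
        pipe.producer_commit();
    }

    for (int t = 0; t < num_tiles; ++t) {
        // Consume the oldest stage: wait for its copy, then compute on it.
        pipe.consumer_wait();
        block.sync();
        const float* tile = smem + (t % stages_count) * tile_elems;
        for (int i = block.thread_rank(); i < tile_elems; i += block.size())
            out[t * tile_elems + i] = 2.0f * tile[i];   // stand-in for real compute
        block.sync();
        pipe.consumer_release();

        // Keep bytes in flight: immediately refill the freed buffer with a future tile.
        if (prefetched < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem + (prefetched % stages_count) * tile_elems,
                               in + prefetched * tile_elems,
                               sizeof(float) * tile_elems, pipe);
            pipe.producer_commit();
            ++prefetched;
        }
    }
}
```

The pipelining extension performs this staging, the choice of transfer instructions, and, where applicable, warp specialization and register trading on the user's behalf, behind the same interface on all CUDA architectures.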