NVIDIA cuBLASDx
The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernel. Available routines include General Matrix Multiplication (GEMM) and Triangular Solve (TRSM). Fusing linear algebra routines with other operations can decrease the latency and improve the overall performance of your application.
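As a sketch, the in-kernel workflow looks roughly like this: a BLAS routine is described at compile time by composing cuBLASDx operators into a type, and then executed cooperatively by a thread block inside your own kernel. This is a minimal illustration following the pattern in the quick-start guides; the kernel name and the elided staging/epilogue steps are ours, and helper APIs may differ between library versions.

```cuda
#include <cublasdx.hpp>

// Describe a block-wide 32x32x32 real single-precision GEMM targeting SM 8.0
// by composing cuBLASDx operators into a single type.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

// Hypothetical fused kernel (sketch): the multiply runs in shared memory, and
// any epilogue (bias, activation, clamping, ...) can be fused in-kernel before
// a single store to global memory, saving a round trip.
__global__ void fused_gemm_kernel(float alpha, const float* a, const float* b,
                                  float beta, float* c) {
    extern __shared__ __align__(16) char smem[];
    // ... stage tiles of a, b, c into smem as shown in the GEMM guide ...
    // GEMM().execute(...);   // block-cooperative matrix multiplication
    // ... apply the fused epilogue, then store the result tile to c ...
}
```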
The documentation consists of the following components:
- A GEMM quick start guide, Using cuBLASDx GEMM.
- A TRSM quick start guide, Using cuBLASDx TRSM.
- An advanced pipelining API guide, Using Pipelined GEMM.
- An API reference for a comprehensive overview of the provided functionality.
The cuBLASDx library currently provides:
- Customizability: options to adjust BLAS routines for different needs (size, precision, type, targeted CUDA architecture, etc.).
- Flexibility to perform accumulation and fusion in either shared memory or registers.
- The ability to fuse BLAS kernels with other operations in order to save global memory trips.
- Compatibility with future versions of the CUDA Toolkit.
- Autovectorized, high-performance data movement between shared and global memory.
- Opaque analytical generation of swizzled layouts for best performance.
  - Currently dispatches to either vectorized LDG+STS or LDGSTS.
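The customizability above is expressed by swapping operators in the description type. For example, the same GEMM routine can be reconfigured for a different precision, tile size, data arrangement, and target architecture; this is a sketch, and the exact set of available operators (e.g. `Arrangement`) should be checked against the API reference for your library version.

```cuda
#include <cublasdx.hpp>
#include <cuda_fp16.h>

// Same routine, different configuration: half precision, a 64x64x64 tile,
// column-major A with row-major B, targeting SM 9.0. Only the operators
// change; the execution pattern in the kernel stays the same.
using GEMMHalf = decltype(cublasdx::Size<64, 64, 64>()
                        + cublasdx::Precision<__half>()
                        + cublasdx::Type<cublasdx::type::real>()
                        + cublasdx::Function<cublasdx::function::MM>()
                        + cublasdx::Arrangement<cublasdx::col_major,
                                                cublasdx::row_major>()
                        + cublasdx::SM<900>()
                        + cublasdx::Block());
```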
The pipelining extension to cuBLASDx (0.5.0+) offers:
- Automatic N-buffer staged pipelined GEMM execution:
  - Increases the number of bytes in flight during computation.
  - Better exposes GPU asynchronicity to cuBLASDx users.
- Barrier-based synchronization, compatible with Turing+ GPUs.
- Automatic dispatching between TMA / LDGSTS / LDG+STS for global-to-shared transfers:
  - Fully asynchronous transfers where possible.
  - Automatic maximal vectorization and equal division of work among threads.
- Opaque generation of swizzled layouts for shared memory.
- Automatic internal use of selected WGMMA / 1SM UTCMMA instructions:
  - Automatic opaque warp specialization in selected cases.
  - Automatic opaque register trading in selected cases.
- The same interface on all CUDA architectures.
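The N-buffer staged execution above follows the standard CUDA asynchronous-copy pattern: while the block computes on one shared-memory buffer, the next tiles are already being copied in. The sketch below illustrates that underlying pattern with the libcudacxx `cuda::pipeline` API, not the cuBLASDx pipeline interface itself; the tile size and kernel structure are assumptions for illustration.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int stages     = 2;     // N-buffering depth (double buffering here)
constexpr int tile_elems = 1024;  // elements per tile (illustrative choice)

__global__ void staged_kernel(const float* in, float* out, int num_tiles) {
    auto block = cg::this_thread_block();
    __shared__ float smem[stages][tile_elems];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, stages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    // Prologue: kick off the first `stages` asynchronous global-to-shared copies,
    // increasing the number of bytes in flight before any computation starts.
    for (int s = 0; s < stages && s < num_tiles; ++s) {
        pipe.producer_acquire();
        cuda::memcpy_async(block, smem[s], in + s * tile_elems,
                           sizeof(float) * tile_elems, pipe);
        pipe.producer_commit();
    }

    for (int t = 0; t < num_tiles; ++t) {
        pipe.consumer_wait();            // tile t has arrived in shared memory
        block.sync();
        // ... compute on smem[t % stages] (e.g. one fused GEMM step) ...
        block.sync();
        pipe.consumer_release();         // free the buffer for the next copy

        int next = t + stages;           // refill the just-released buffer
        if (next < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[next % stages],
                               in + next * tile_elems,
                               sizeof(float) * tile_elems, pipe);
            pipe.producer_commit();
        }
    }
}
```

On Ampere+ the copies lower to `LDGSTS` (`cp.async`); on older architectures the same code falls back to synchronous `LDG+STS`, which mirrors the automatic dispatching described above.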