NVIDIA cuBLASDx

The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernel. Support is currently limited to General Matrix Multiplication (GEMM). Fusing linear algebra routines with other operations can decrease latency and improve the overall performance of your application.

Highlights

The cuBLASDx library currently provides:

  • A BLAS GEMM routine embeddable into a CUDA kernel.

  • High performance with no unnecessary data movement to and from global memory.

  • Customizability: options to adjust the GEMM routine for different needs (size, precision, type, targeted CUDA architecture, etc.).

  • Flexibility to perform accumulation and fusion in either shared memory or registers.

  • The ability to fuse BLAS kernels with other operations, saving round trips to global memory.

  • Compatibility with future versions of the CUDA Toolkit.
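
To illustrate the points above, the sketch below shows the general shape of an in-kernel GEMM fused with an element-wise epilogue. It is a simplified example, not a copy of the library's reference code: the compile-time operators (`Size`, `Precision`, `Type`, `Function`, `SM`, `Block`) follow the cuBLASDx operator-based API, but the chosen sizes, the `SM<800>` target, the raw shared-memory partitioning, and members such as `a_size`/`b_size`/`c_size` are illustrative and should be checked against the library headers and examples before use.

```cuda
#include <cublasdx.hpp>
using namespace cublasdx;

// Compile-time description of C = alpha * A * B + beta * C.
using GEMM = decltype(Size<32, 32, 32>()       // m, n, k (illustrative)
                    + Precision<float>()
                    + Type<type::real>()
                    + Function<function::MM>()
                    + SM<800>()                // targeted architecture, e.g. sm_80
                    + Block());                // one GEMM per thread block

__global__ void fused_gemm_relu(const float* a, const float* b, float* c,
                                float alpha, float beta) {
    extern __shared__ __align__(16) char smem[];

    // Partition dynamic shared memory into A, B, and C tiles (simplified;
    // real code should use the layouts and sizes suggested by the GEMM type).
    float* sa = reinterpret_cast<float*>(smem);
    float* sb = sa + GEMM::a_size;
    float* sc = sb + GEMM::b_size;

    // ... cooperatively load the A and B tiles from global memory
    //     into sa and sb, then __syncthreads() ...

    // Block-wide matrix multiply executed entirely in shared memory.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Fused epilogue: apply ReLU before the single write back to global
    // memory, avoiding an extra global-memory round trip.
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) {
        c[i] = sc[i] > 0.f ? sc[i] : 0.f;
    }
}
```

The key idea is that the GEMM result never leaves the chip between the multiply and the epilogue: accumulation, fusion, and the final store all happen from shared memory within one kernel.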