NVIDIA cuBLASDx#

The cuBLAS Device Extensions (cuBLASDx) library enables you to perform selected linear algebra functions known from cuBLAS inside your CUDA kernels. It is currently limited to General Matrix Multiplication (GEMM). Fusing linear algebra routines with other operations can decrease latency and improve the overall performance of your application.
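
To make this concrete, below is a minimal sketch of what an embedded, fused GEMM can look like, written against the operator-based description API used throughout the cuBLASDx examples. The pointer-based execute() overload and the exact operator and trait spellings are assumptions here and may differ between library versions; consult the API reference for the authoritative signatures.

```cuda
#include <cublasdx.hpp>

// Compile-time description of one block-level GEMM: C = alpha * A * B + beta * C,
// 32 x 32 x 32, single precision, real valued, targeting SM 8.0.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

constexpr unsigned int tile = 32 * 32; // elements per 32x32 tile

// One CUDA block stages its tiles in shared memory, runs the embedded GEMM, and fuses
// an elementwise epilogue (here: ReLU) before writing back, saving a round trip to
// global memory compared to running a separate epilogue kernel.
__global__ void fused_gemm_kernel(const float* a, const float* b, float* c,
                                  float alpha, float beta) {
    __shared__ __align__(16) float smem[3 * tile];
    float* sa = smem;
    float* sb = smem + tile;
    float* sc = smem + 2 * tile;

    const unsigned int tid      = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    const unsigned int nthreads = blockDim.x * blockDim.y * blockDim.z;

    // Load the A, B, C tiles from global to shared memory (simplified strided copy).
    for (unsigned int i = tid; i < tile; i += nthreads) {
        sa[i] = a[i];
        sb[i] = b[i];
        sc[i] = c[i];
    }
    __syncthreads();

    // Cooperative, block-wide GEMM. The pointer-based shared-memory overload shown here
    // is an assumption; check the cuBLASDx API reference for the exact execute() signatures.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Fused epilogue applied while the result is still in shared memory.
    for (unsigned int i = tid; i < tile; i += nthreads) {
        c[i] = sc[i] > 0.0f ? sc[i] : 0.0f;
    }
}
```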


The cuBLASDx library currently provides:#


[Figure: tiled execution]

  • BLAS GEMM routine embeddable into a CUDA kernel.
    • Static database dispatch to the appropriate FMA / MMA instruction configuration.

    • Automatic dispatch to optimal autovectorized LDSM / STSM MMA loading instructions.

    • Opaque analytical generation of swizzled layouts for best performance.

  • Customizability: options to adjust the GEMM routine to different needs (size, precision, type, targeted CUDA architecture, etc.); see the sketch after this list.

  • Flexibility to perform accumulation and fusion in either shared memory or registers.

  • Ability to fuse BLAS kernels with other operations in order to save global memory trips.

  • Compatibility with future versions of the CUDA Toolkit.

  • Autovectorized, high-performance data movement between shared and global memory.
    • Opaque analytical generation of swizzled layouts for best performance.

    • Currently dispatches to either vectorized LDG / STG or LDGSTS.
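
The description operators are the customization knobs from the list above: changing the size, precision, number type, or target architecture is a matter of swapping one operator. A minimal sketch, assuming the block_dim and shared_memory_size trait names used in the cuBLASDx examples (verify them against the version you are using):

```cuda
#include <cublasdx.hpp>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Same execution model, different problem: here the GEMM is switched to
// 16 x 8 x 32 tiles, half precision, complex numbers, targeting SM 9.0.
using GEMM_FP16C = decltype(cublasdx::Size<16, 8, 32>()
                          + cublasdx::Precision<__half>()
                          + cublasdx::Type<cublasdx::type::complex>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<900>()
                          + cublasdx::Block());

// Kernel body omitted; it follows the pattern from the previous sketch.
template<class BLAS>
__global__ void gemm_kernel() { /* stage tiles, BLAS().execute(...), store */ }

void launch(cudaStream_t stream) {
    // block_dim and shared_memory_size are assumed trait names (taken from the
    // cuBLASDx examples) describing the suggested launch configuration.
    gemm_kernel<GEMM_FP16C><<<1, GEMM_FP16C::block_dim,
                              GEMM_FP16C::shared_memory_size, stream>>>();
}
```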

The pipelining extension to cuBLASDx (0.5.0+) offers:#


[Figure: pipelined execution]

  • Automatic N-buffer staged pipelined GEMM execution; see the conceptual sketch after this list.
    • Increases the number of bytes in flight during computation.

    • Better exposes GPU asynchronicity to cuBLASDx users.

    • Uses barrier-based synchronization, compatible with Volta+ GPUs.

  • Automatic dispatch between TMA (Tensor Memory Accelerator), LDGSTS, and LDG+STG for global-to-shared transfers.
    • Fully asynchronous transfers where possible.

    • Automatic maximal vectorization and equal work division among threads.

    • Opaque generation of swizzled layouts for shared memory.

  • Automatic internal use of selected WGMMA / 1SM UTCMMA instructions.

  • Automatic opaque warp specialization in selected cases.

  • Automatic opaque register trading in selected cases.

  • The same interface on all CUDA architectures.
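
For readers unfamiliar with staged pipelining, the sketch below illustrates the underlying idea using the plain libcu++ cuda::pipeline API: up to N asynchronous global-to-shared copies are kept in flight while the oldest stage is being consumed. This is not the cuBLASDx pipelining interface, which hides this machinery behind its own opaque API; the stage count, tile size, and consume_tile() placeholder are illustrative assumptions.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr unsigned int stages     = 2;    // N shared-memory buffers (double buffering)
constexpr unsigned int tile_elems = 1024; // elements copied per stage

__device__ void consume_tile(const float* tile) {
    // Placeholder for the per-tile computation (e.g. a GEMM on the staged tile).
}

__global__ void staged_pipeline_kernel(const float* in, int num_tiles) {
    auto block = cg::this_thread_block();
    __shared__ float smem[stages][tile_elems];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, stages> state;
    auto pipe = cuda::make_pipeline(block, &state);

    for (int compute = 0, fetch = 0; compute < num_tiles; ++compute) {
        // Producer side: keep up to `stages` asynchronous copies in flight,
        // increasing the number of bytes in flight while earlier tiles are consumed.
        for (; fetch < num_tiles && fetch < compute + static_cast<int>(stages); ++fetch) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[fetch % stages], in + fetch * tile_elems,
                               sizeof(float) * tile_elems, pipe);
            pipe.producer_commit();
        }
        // Consumer side: wait for the oldest stage, compute on it, then recycle its buffer.
        pipe.consumer_wait();
        consume_tile(smem[compute % stages]);
        pipe.consumer_release();
    }
}
```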