Examples#

The cuBLASDx library provides multiple block-level BLAS samples covering basic GEMM operations with various precisions and types, as well as a few special examples that highlight the performance benefits of cuBLASDx.

Introduction Examples
  • introduction_example - cuBLASDx API introduction example

Simple GEMM Examples
  Basic Example
    • simple_gemm_fp32 - Performs fp32 GEMM
    • simple_gemm_int8_int8_int32 - Performs integral GEMM using Tensor Cores
    • simple_gemm_cfp16 - Performs complex fp16 GEMM
    • simple_gemm_fp8 - Performs fp8 GEMM
  Extra Examples
    • simple_gemm_leading_dimensions - Performs GEMM with non-default leading dimensions
    • simple_gemm_fp32_decoupled - Performs fp32 GEMM using a 16-bit input type to save on storage and transfers
    • simple_gemm_std_complex_fp32 - Performs GEMM with cuda::std::complex as the data type
    • simple_gemm_mixed_precision - Performs GEMM with different data types for matrices A, B, and C
    • simple_gemm_transform - Performs GEMM with transform operators
    • simple_gemm_custom_layout - Performs GEMM with custom matrix layouts
    • simple_gemm_aat - Performs GEMM where C = A * A^T

NVRTC Examples
  • nvrtc_gemm - Performs GEMM; the kernel is compiled using NVRTC

GEMM Performance
  • single_gemm_performance - Benchmark for a single GEMM
  • fused_gemm_performance - Benchmark for 2 GEMMs fused into a single kernel
  • device_gemm_performance - Benchmark for entire device GEMMs using cuBLASDx for a single tile

Advanced Examples
  Fusion
    • gemm_fusion - Performs 2 GEMMs in a single kernel
    • gemm_fft - Performs GEMM and FFT in a single kernel
    • gemm_fft_fp16 - Performs GEMM and FFT in a single kernel (half-precision complex type)
    • gemm_fft_performance - Benchmark for GEMM and FFT fused into a single kernel
  Deep Learning
    • scaled_dot_prod_attn - Scaled dot product attention using cuBLASDx
    • scaled_dot_prod_attn_batched - Multi-head attention using cuBLASDx
  Other
    • batched_gemm_fp64 - Manual batching in a single CUDA block
    • blockdim_gemm_fp16 - BLAS execution with different block dimensions

Introduction Examples#

  • introduction_example

Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. introduction_example is used in the beginner’s guide to run a GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.

Simple GEMM Examples#

  • simple_gemm_fp32

  • simple_gemm_fp32_decoupled

  • simple_gemm_int8_int8_int32

  • simple_gemm_cfp16

  • simple_gemm_fp8

  • simple_gemm_leading_dimensions

  • simple_gemm_std_complex_fp32

  • simple_gemm_mixed_precision

  • simple_gemm_transform

  • simple_gemm_custom_layout

  • simple_gemm_aat

In each of the examples listed above, a general matrix multiply (GEMM) operation is performed in a CUDA block. The examples show how to create a complete BLAS description, allocate memory, and set the block dimensions and the necessary amount of shared memory. The input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into registers and shared memory. The execute() method computes the GEMM and stores the results either in matrix C or in a register fragment accumulator; the results are then copied to global memory and finally back to the host, where they are verified against cuBLAS.
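
The common pattern is sketched below. This is a simplified, illustrative kernel based on the shared-memory pointer API; names such as GEMM::a_size, GEMM::shared_memory_size, and the execute() overload vary between cuBLASDx versions (newer releases express the same flow with tensors and copy helpers), so treat it as a sketch rather than the shipped example.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

// Block-level GEMM description: C (32x32) = alpha * A (32x32) * B (32x32) + beta * C,
// single precision, real type, targeting SM 8.0.
using GEMM = decltype(Size<32, 32, 32>()
                      + Precision<float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

__global__ void gemm_kernel(float alpha, const float* a, const float* b,
                            float beta, float* c) {
    extern __shared__ __align__(16) char smem[];

    // Partition dynamic shared memory between A, B, and C.
    float* sa = reinterpret_cast<float*>(smem);
    float* sb = sa + GEMM::a_size;
    float* sc = sb + GEMM::b_size;

    // Naive global-to-shared staging; the shipped examples use library copy helpers.
    const unsigned tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nt  = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < GEMM::a_size; i += nt) sa[i] = a[i];
    for (unsigned i = tid; i < GEMM::b_size; i += nt) sb[i] = b[i];
    for (unsigned i = tid; i < GEMM::c_size; i += nt) sc[i] = c[i];
    __syncthreads();

    // C = alpha * A * B + beta * C, computed cooperatively by the block.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Store the result back to global memory.
    for (unsigned i = tid; i < GEMM::c_size; i += nt) c[i] = sc[i];
}

// Host-side launch: block dimensions and shared memory size come from the description.
// gemm_kernel<<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, d_a, d_b, beta, d_c);
```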

The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C via the LeadingDimension operator. For performance reasons, it is recommended to try the leading dimensions suggested by suggested_leading_dimension_of.
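
For illustration, a description with explicit leading dimensions could look like the sketch below; the member names used for the suggested values (::lda, ::ldb, ::ldc) are an assumption and may differ between versions.

```cpp
// Base description without leading dimensions (same operators as in the simple examples).
using GEMMBase = decltype(cublasdx::Size<32, 32, 32>()
                          + cublasdx::Precision<float>()
                          + cublasdx::Type<cublasdx::type::real>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<800>()
                          + cublasdx::Block());

// Static (compile-time) leading dimensions lda, ldb, ldc for A, B, and C.
using GEMMWithLD = decltype(GEMMBase() + cublasdx::LeadingDimension<34, 34, 32>());

// Leading dimensions suggested for the target architecture (member names assumed):
// using suggested = cublasdx::suggested_leading_dimension_of<GEMMBase, 800>;
// using GEMMTuned = decltype(GEMMBase()
//                   + cublasdx::LeadingDimension<suggested::lda, suggested::ldb, suggested::ldc>());
```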

The simple_gemm_fp32_decoupled example shows how to decouple the input (storage) precision from the compute precision: the inputs are kept in a 16-bit type to save on storage and global memory transfers, while the GEMM itself is computed in fp32.
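
The core idea can be sketched with a small helper that widens the 16-bit inputs to fp32 while staging them into shared memory (generic CUDA code, not the exact helper used in the example):

```cpp
#include <cuda_fp16.h>

// Widen 16-bit inputs to fp32 while copying from global to shared memory,
// so the GEMM itself runs in single precision.
__device__ void load_half_as_float(const __half* src, float* smem_dst, unsigned count) {
    const unsigned tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nt  = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < count; i += nt) {
        smem_dst[i] = __half2float(src[i]); // conversion happens on load
    }
}
```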

The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts matrices of types other than BLAS::a_value_type, BLAS::b_value_type, and BLAS::c_value_type. In this case, it is the complex type from the CUDA C++ Standard Library, cuda::std::complex<float>, but it could also be float2 provided by CUDA.
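
A sketch of a complex single-precision description and the corresponding execution call (the sizes and SM value are placeholders):

```cpp
#include <cuda/std/complex>
#include <cublasdx.hpp>

// Complex fp32 GEMM description.
using CGEMM = decltype(cublasdx::Size<16, 16, 16>()
                       + cublasdx::Precision<float>()
                       + cublasdx::Type<cublasdx::type::complex>()
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<800>()
                       + cublasdx::Block());

// Inside the kernel, the shared-memory matrices may use a compatible complex type
// instead of CGEMM::c_value_type, for example:
// cuda::std::complex<float>* sa, * sb, * sc;            // staged in shared memory
// cuda::std::complex<float> alpha{1.0f, 0.0f}, beta{0.0f, 0.0f};
// CGEMM().execute(alpha, sa, sb, beta, sc);
```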

The simple_gemm_mixed_precision example shows how to compute a mixed-precision GEMM, where matrices A, B, and C hold data of different precisions. Note that the scaling factors alpha and beta are expected to have the same precision and type as the elements of matrix C.
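
Assuming a cuBLASDx release in which the Precision operator accepts separate precisions for A, B, and C, a mixed-precision description could be sketched as:

```cpp
#include <cuda_fp16.h>
#include <cublasdx.hpp>

// Mixed-precision GEMM: half-precision A and B, single-precision C.
// (Per-matrix precisions in Precision<> are an assumption about the release in use.)
using MixedGEMM = decltype(cublasdx::Size<64, 64, 64>()
                           + cublasdx::Precision<__half, __half, float>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<800>()
                           + cublasdx::Block());

// alpha and beta must match the precision and type of C, i.e. float here.
```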

The simple_gemm_transform example shows how to compute a GEMM with the transform operators a_load_op, b_load_op, c_load_op, and c_store_op, which are applied element-wise when loading matrices A, B, and C and when storing matrix C, respectively.
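
A sketch of such functors and how they might be passed to execute(); the exact overload taking the four transform arguments after the usual GEMM arguments is an assumption based on the example's description:

```cpp
// Element-wise transforms applied while loading A and B and while loading/storing C.
struct negate {
    __device__ float operator()(float x) const { return -x; }
};
struct identity {
    __device__ float operator()(float x) const { return x; }
};

// Inside the kernel (sketch): the transforms follow the usual GEMM arguments.
// GEMM().execute(alpha, sa, sb, beta, sc,
//                negate{},    // a_load_op: applied to each element of A on load
//                identity{},  // b_load_op
//                identity{},  // c_load_op
//                negate{});   // c_store_op: applied to each element of C on store
```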

The simple_gemm_custom_layout example shows how to compute a GEMM where the shared-memory matrices A, B, and C use custom CuTe layouts.
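
The idea can be sketched as follows, assuming a cuBLASDx release with the tensor API; the cublasdx::make_tensor call and the cute::* names are assumptions based on that API:

```cpp
#include <cute/layout.hpp>

// A 32x32 column-major layout for A padded to a leading dimension of 33,
// a common trick to avoid shared-memory bank conflicts.
using a_layout = cute::Layout<cute::Shape<cute::Int<32>, cute::Int<32>>,
                              cute::Stride<cute::Int<1>, cute::Int<33>>>;

// Inside the kernel (sketch): wrap the shared-memory pointer in a tensor with the
// custom layout and pass the tensors to execute().
// auto ta = cublasdx::make_tensor(smem_a, a_layout{});
// ... build tb and tc similarly, then:
// GEMM().execute(alpha, ta, tb, beta, tc);
```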

The simple_gemm_aat example shows how to perform C = A * A^T, where A and A^T occupy the same shared memory, allowing users to increase kernel occupancy or operate on larger matrices.

NVRTC Examples#

  • nvrtc_gemm

The NVRTC example presents how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM. The BLAS descriptions created with cuBLASDx operators are defined only in the device code. The header file cublasdx.hpp is also included only in the device code that is passed to NVRTC.
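
A minimal NVRTC flow is sketched below; the include path and architecture flag are placeholders, and the device source is abbreviated:

```cpp
#include <nvrtc.h>
#include <string>
#include <vector>

int main() {
    // The BLAS description and cublasdx.hpp appear only in the device source string.
    const char* kernel_source = R"src(
        #include <cublasdx.hpp>
        using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                              + cublasdx::Precision<float>()
                              + cublasdx::Type<cublasdx::type::real>()
                              + cublasdx::Function<cublasdx::function::MM>()
                              + cublasdx::SM<800>()
                              + cublasdx::Block());
        extern "C" __global__ void gemm_kernel(float alpha, float* a, float* b, float beta, float* c) {
            // ... shared-memory staging and GEMM().execute(...) as in the simple examples ...
        }
    )src";

    nvrtcProgram program;
    nvrtcCreateProgram(&program, kernel_source, "gemm_kernel.cu", 0, nullptr, nullptr);

    // Compilation options; the include path is a placeholder for the cuBLASDx headers.
    std::vector<const char*> options = {"--std=c++17",
                                        "--include-path=/path/to/cublasdx/include",
                                        "--gpu-architecture=compute_80"};
    nvrtcCompileProgram(program, static_cast<int>(options.size()), options.data());

    // Retrieve the PTX; it can then be loaded and launched with the CUDA driver API
    // (cuModuleLoadData, cuModuleGetFunction, cuLaunchKernel).
    size_t ptx_size = 0;
    nvrtcGetPTXSize(program, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(program, &ptx[0]);
    nvrtcDestroyProgram(&program);
    return 0;
}
```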

Note

Since version 0.0.1, cuBLASDx has offered experimental support for compilation with NVRTC. See the Requirements and Functionality section.

GEMM Performance#

  • single_gemm_performance

  • fused_gemm_performance

The examples listed above illustrate the performance of cuBLASDx.

The single_gemm_performance program presents the performance of a cuBLASDx device function executing a general matrix multiply (GEMM). Users can easily modify this sample to test the performance of the particular GEMM they need.

The fused_gemm_performance example shows the performance of two GEMM operations fused together into a single kernel. The kernel execution time is compared against the time cuBLAS requires to do the same calculations. In both cases, the measured operation is run multiple times and the average speed is reported.
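
The measurement pattern used by both benchmarks is the usual CUDA-event timing loop; a generic sketch (not the exact harness shipped with the samples):

```cpp
#include <cuda_runtime.h>

// Run a callable that launches the measured kernel(s); report the average time per launch.
template <typename Launcher>
float average_time_ms(Launcher&& launch, int warmup_runs = 5, int timed_runs = 100) {
    for (int i = 0; i < warmup_runs; ++i) launch();   // warm-up, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < timed_runs; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / timed_runs;                     // average per launch
}
```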

Advanced Examples#

  • gemm_fusion

  • gemm_fft

  • gemm_fft_fp16

  • gemm_fft_performance

  • scaled_dot_prod_attn

  • scaled_dot_prod_attn_batched

  • batched_gemm_fp64

  • blockdim_gemm_fp16

The advanced cuBLASDx examples show how cuBLASDx can be used to improve performance by fusing multiple calculations into a single kernel, which ultimately means fewer global memory accesses.

The gemm_fusion, gemm_fft, gemm_fft_fp16, and gemm_fft_performance examples present how to fuse multiple GEMMs, or a GEMM and an FFT, together in one kernel. This can be especially useful for pipelines with many small input matrices, as the Dx libraries can be easily adapted to batched execution by launching many CUDA blocks in a grid.
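
The fusion idea can be sketched as follows: the intermediate result of the first GEMM never leaves shared memory, so the second GEMM reads it directly. The skeleton below is illustrative only; GEMM1 and GEMM2 stand for two cuBLASDx descriptions with compatible sizes.

```cpp
// Skeleton of a kernel fusing two GEMMs: OUT = (alpha * A * B) * D + beta * OUT.
// The first product stays in shared memory and feeds the second GEMM directly.
__global__ void fused_gemm_kernel(float alpha, const float* a, const float* b,
                                  const float* d, float beta, float* out) {
    extern __shared__ __align__(16) char smem[];

    // ... stage A, B, D (and OUT) into shared-memory buffers sa, sb, sd, s_out ...

    // First GEMM: sc = alpha * sa * sb          (sc lives only in shared memory)
    // GEMM1().execute(alpha, sa, sb, 0.0f, sc);
    // __syncthreads();

    // Second GEMM: s_out = sc * sd + beta * s_out
    // GEMM2().execute(1.0f, sc, sd, beta, s_out);
    // __syncthreads();

    // ... copy s_out back to global memory ...
}
```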

The scaled_dot_prod_attn and scaled_dot_prod_attn_batched examples explore the areas of deep learning and natural language processing, showcasing implementations of the scaled dot-product attention and multi-head attention (MHA) algorithms. The performance of half-precision MHA in the scaled_dot_prod_attn_batched example was compared with PyTorch’s scaled_dot_product_attention function on an H100 PCIe 80GB; the results are presented in Fig. 1.

[Figure: Multi-Head Attention (half precision) performance with PyTorch and cuBLASDx on H100 PCIe 80GB with maximum clocks set.]

Fig. 1 Comparison of the Multi-Head Attention algorithm between PyTorch (light blue) and cuBLASDx (green) on H100 PCIe 80GB with maximum clocks set. The chart presents the speed-ups of cuBLASDx over PyTorch’s scaled_dot_product_attention function for sequences of different lengths, with the batch size set to 64. Both the input data and the computations were in half precision (fp16). NVIDIA PyTorch container image 23.04 was used for the PyTorch performance evaluation.#

Examples batched_gemm_fp64 and blockdim_gemm_fp16 demonstrate uses of the BlockDim operator. In batched_gemm_fp64, adding a 1D BlockDim to the BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid). blockdim_gemm_fp16 includes multiple scenarios that show how to safely and correctly execute a BLAS operation when the kernel is launched with block dimensions different from the layout and the number of threads specified in BlockDim.
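
A sketch of the batched_gemm_fp64 idea is shown below; the shared-memory member name and the exact launch details are assumptions based on the example's description:

```cpp
// fp64 GEMM executed by 128 threads, declared via the BlockDim operator.
using GEMM = decltype(cublasdx::Size<16, 16, 16>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block()
                      + cublasdx::BlockDim<128>());

// Launching with a 2D block of (128, 4) threads batches four GEMMs in one CUDA
// block: threadIdx.y selects the batch element, and each slice of 128 threads
// (threadIdx.x) executes its own GEMM on its own shared-memory matrices.
// dim3 block_dim(128, 4);
// batched_gemm_kernel<<<grid, block_dim, 4 * GEMM::shared_memory_size>>>(...);
```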