Examples

The cuBLASDx library provides multiple block-level BLAS samples covering basic GEMM operations with various precisions and types, as well as a few special examples that highlight the performance benefits of cuBLASDx.

Group                 | Subgroup       | Example                        | Description
----------------------+----------------+--------------------------------+------------------------------------------------------------------------
Introduction Examples |                | introduction_example           | cuBLASDx API introduction example
Simple GEMM Examples  | Basic Example  | simple_gemm_fp32               | Performs fp32 GEMM
                      |                | simple_gemm_cfp16              | Performs complex fp16 GEMM
                      | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions
                      |                | simple_gemm_std_complex_fp32   | Performs GEMM with cuda::std::complex as data type
NVRTC Examples        |                | nvrtc_gemm                     | Performs GEMM, kernel is compiled using NVRTC
GEMM Performance      |                | single_gemm_performance        | Benchmark for single GEMM
                      |                | fused_gemm_performance         | Benchmark for 2 GEMMs fused into a single kernel
Advanced Examples     | Fusion         | fused_gemm                     | Performs 2 GEMMs in a single kernel
                      |                | gemm_fft                       | Perform GEMM and FFT in a single kernel
                      |                | gemm_fft_fp16                  | Perform GEMM and FFT in a single kernel (half-precision complex type)
                      |                | gemm_fft_performance           | Benchmark for GEMM and FFT fused into a single kernel
                      | Deep Learning  | scaled_dot_prod_attn           | Scaled dot product attention using cuBLASDx
                      |                | scaled_dot_prod_attn_batched   | Multi-head attention using cuBLASDx
                      | Other          | multiblock_gemm                | Proof-of-concept for single large GEMM using multiple CUDA blocks
                      |                | batched_gemm_fp64              | Manual batching in a single CUDA block
                      |                | blockdim_gemm_fp16             | BLAS execution with different block dimensions

Introduction Examples

  • introduction_example

Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. introduction_example is used in the beginner's guide to run a GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.
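For orientation, a minimal sketch of the operator composition used throughout the examples is shown below. The operators are those of the cuBLASDx API; the sizes, precision, and SM value chosen here are illustrative only.

    #include <cublasdx.hpp>

    // Illustrative block-level GEMM description: C (MxN) = alpha * A (MxK) * B (KxN) + beta * C.
    using GEMM = decltype(cublasdx::Size<32, 32, 32>()                   // M, N, K
                          + cublasdx::Precision<float>()                  // computation precision
                          + cublasdx::Type<cublasdx::type::real>()        // real or complex
                          + cublasdx::Function<cublasdx::function::MM>()  // matrix multiplication
                          + cublasdx::SM<800>()                           // target architecture
                          + cublasdx::Block());                           // block-level execution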

Simple GEMM Examples

  • simple_gemm_fp32

  • simple_gemm_cfp16

  • simple_gemm_leading_dimensions

  • simple_gemm_std_complex_fp32

In each of the examples listed above, a general matrix multiply (GEMM) operation is performed in a single CUDA block. The examples show how to create a complete BLAS description, allocate memory, and set the block dimensions and the necessary amount of shared memory. The input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into shared memory. The execute() method computes the GEMM and stores the results in matrix C, which is then copied back to global memory and finally to the host. The results are verified against cuBLAS.
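A condensed sketch of that flow is given below. It assumes a shared-memory pointer overload of execute(alpha, a, b, beta, c) and the size_of, block_dim, and shared_memory_size traits; BLAS stands for a block-level description like the one composed in the introduction sketch, and the cooperative global-to-shared copies are elided.

    #include <cublasdx.hpp>

    template<class BLAS>
    __global__ void simple_gemm_kernel(const typename BLAS::value_type* a,
                                       const typename BLAS::value_type* b,
                                       typename BLAS::value_type*       c) {
        using value_type = typename BLAS::value_type;
        extern __shared__ __align__(16) char smem_bytes[];
        value_type* smem = reinterpret_cast<value_type*>(smem_bytes);

        // Partition dynamic shared memory into A, B, and C tiles (default leading dimensions).
        value_type* sa = smem;
        value_type* sb = sa + cublasdx::size_of<BLAS>::m * cublasdx::size_of<BLAS>::k;
        value_type* sc = sb + cublasdx::size_of<BLAS>::k * cublasdx::size_of<BLAS>::n;

        // ... cooperatively load a, b, c from global memory into sa, sb, sc; __syncthreads() ...

        BLAS().execute(value_type(1), sa, sb, value_type(0), sc);  // C = alpha*A*B + beta*C
        __syncthreads();

        // ... store sc back to global memory ...
    }

    // Host side (sketch): block dimensions and shared memory size come from the description.
    // simple_gemm_kernel<BLAS><<<1, BLAS::block_dim, BLAS::shared_memory_size>>>(d_a, d_b, d_c);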

The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C via the LeadingDimension operator. For performance reasons, it is recommended to try the leading dimensions suggested by suggested_leading_dimension_of.
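A possible sketch, assuming suggested_leading_dimension_of exposes lda, ldb, and ldc members for a given description and target architecture (GEMM is the description from the sketches above; the member names are an assumption based on the example's description):

    // Add static leading dimensions suggested for compute capability 8.0 (illustrative).
    using suggested_ld = cublasdx::suggested_leading_dimension_of<GEMM, 800>;
    using GEMM_with_ld = decltype(GEMM()
                                  + cublasdx::LeadingDimension<suggested_ld::lda,
                                                               suggested_ld::ldb,
                                                               suggested_ld::ldc>());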

The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts input types other than BLAS::value_type. In this case, it is the complex type from the CUDA C++ Standard Library, cuda::std::complex<float>, but float2 provided by CUDA would work as well.
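A brief sketch of the idea, with the specific sizes being illustrative:

    #include <cuda/std/complex>
    #include <cublasdx.hpp>

    // Complex single-precision description; the shared-memory tiles passed to execute()
    // may use cuda::std::complex<float> (or float2) instead of CGEMM::value_type.
    using CGEMM = decltype(cublasdx::Size<16, 16, 16>()
                           + cublasdx::Precision<float>()
                           + cublasdx::Type<cublasdx::type::complex>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<800>()
                           + cublasdx::Block());

    __global__ void cgemm_kernel(/* ... */) {
        extern __shared__ __align__(16) char smem_bytes[];
        auto* smem = reinterpret_cast<cuda::std::complex<float>*>(smem_bytes);
        // ... stage tiles and call CGEMM().execute(...) on the cuda::std::complex<float> pointers ...
    }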

NVRTC Examples

  • nvrtc_gemm

The NVRTC example presents how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM. The BLAS descriptions created with cuBLASDx operators are defined only in the device code. The header file cublasdx.hpp is also included only in the device code that is passed to NVRTC.
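A minimal host-side sketch of this flow is shown below. The NVRTC calls are standard; the include paths are placeholders that must point to the cuBLASDx and CUDA headers on the build machine, and the device-code string reuses the description pattern from the earlier sketches.

    #include <nvrtc.h>

    const char* kernel_source = R"src(
    #include <cublasdx.hpp>
    using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                          + cublasdx::Precision<float>()
                          + cublasdx::Type<cublasdx::type::real>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<800>()
                          + cublasdx::Block());
    __global__ void gemm_kernel(GEMM::value_type* a, GEMM::value_type* b, GEMM::value_type* c) {
        // ... shared-memory staging and GEMM().execute(...) as in the simple examples ...
    }
    )src";

    void compile_gemm_kernel() {
        nvrtcProgram program;
        nvrtcCreateProgram(&program, kernel_source, "gemm_kernel.cu", 0, nullptr, nullptr);
        const char* options[] = {
            "--std=c++17",
            "--gpu-architecture=compute_80",
            "--include-path=<path-to-cublasdx-include>",  // placeholder
            "--include-path=<path-to-cuda-include>"       // placeholder
        };
        nvrtcCompileProgram(program, 4, options);
        // ... check the compilation log, retrieve the PTX, and launch via the CUDA driver API ...
        nvrtcDestroyProgram(&program);
    }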

Note

Since version 0.0.1, cuBLASDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.

GEMM Performance

  • single_gemm_performance

  • fused_gemm_performance

The examples listed above illustrate the performance of cuBLASDx.

The single_gemm_performance example measures the performance of a cuBLASDx device function executing a general matrix multiply (GEMM). Users can easily modify this sample to test the performance of a particular GEMM they need.

The fused_gemm_performance example shows the performance of two GEMM operations fused together into a single kernel. The kernel execution time is compared against the time cuBLAS requires to do the same calculations. In both cases, the measured operation is run multiple times and the average speed is reported.
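A minimal sketch of such a measurement, using CUDA events to time repeated launches and report the average (the warm-up policy and run counts in the shipped benchmarks may differ):

    #include <cuda_runtime.h>

    // Returns the average time of a single run in milliseconds.
    template<class Launcher>
    float average_time_ms(Launcher&& launch, unsigned int runs) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (unsigned int i = 0; i < runs; ++i) {
            launch();  // e.g. a lambda launching the fused-GEMM kernel or the equivalent cuBLAS calls
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float total_ms = 0.0f;
        cudaEventElapsedTime(&total_ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return total_ms / runs;
    }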

Advanced Examples

  • multiblock_gemm

  • fused_gemm

  • gemm_fft

  • gemm_fft_fp16

  • gemm_fft_performance

  • scaled_dot_prod_attn

  • scaled_dot_prod_attn_batched

  • batched_gemm_fp64

  • blockdim_gemm_fp16

The advanced cuBLASDx examples show how cuBLASDx can be used to improve performance by fusing multiple calculations into a single kernel, which ultimately means fewer global memory accesses.

The multiblock_gemm example is proof-of-concept code for executing a GEMM operation with multiple CUDA blocks in cuBLASDx, which is useful when the matrices do not fit into shared memory or when more parallelism is needed. Users can experiment with the problem size, precision, data type, local block size, etc., and observe the effect on performance.

The fused_gemm, gemm_fft, gemm_fft_fp16, and gemm_fft_performance examples present how to fuse multiple GEMMs, or a GEMM and an FFT, together in one kernel. This can be especially useful for pipelines with many small input matrices, as the Dx libraries can easily be adapted to batched execution by launching many CUDA blocks in a grid.
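A simplified sketch of the two-GEMM fusion pattern follows. GEMM1 and GEMM2 stand for cuBLASDx descriptions composed as in the simple examples, the shared-memory execute() overload from those sketches is assumed, and the global-memory loads and stores are elided.

    #include <cublasdx.hpp>

    // Computes D = (A * B) * C in one kernel; the intermediate product T = A * B stays in
    // shared memory instead of being round-tripped through global memory.
    template<class GEMM1, class GEMM2, class T = typename GEMM1::value_type>
    __global__ void fused_gemm_kernel(const T* a, const T* b, const T* c, T* d) {
        extern __shared__ __align__(16) char smem_bytes[];
        T* sa = reinterpret_cast<T*>(smem_bytes);                                // A tile
        T* sb = sa + cublasdx::size_of<GEMM1>::m * cublasdx::size_of<GEMM1>::k;  // B tile
        T* st = sb + cublasdx::size_of<GEMM1>::k * cublasdx::size_of<GEMM1>::n;  // T = A * B
        T* sc = st + cublasdx::size_of<GEMM1>::m * cublasdx::size_of<GEMM1>::n;  // C tile
        T* sd = sc + cublasdx::size_of<GEMM2>::k * cublasdx::size_of<GEMM2>::n;  // D = T * C

        // ... cooperatively load a, b, c into sa, sb, sc; __syncthreads() ...

        GEMM1().execute(T(1), sa, sb, T(0), st);  // first GEMM, result kept on chip
        __syncthreads();
        GEMM2().execute(T(1), st, sc, T(0), sd);  // second GEMM consumes the intermediate
        __syncthreads();

        // ... store sd to d in global memory ...
    }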

The scaled_dot_prod_attn and scaled_dot_prod_attn_batched examples explore the areas of deep learning and natural language processing, showcasing implementations of the scaled dot-product attention and multi-head attention (MHA) algorithms. The performance of the half-precision MHA in the scaled_dot_prod_attn_batched example was compared with PyTorch's scaled_dot_product_attention function on an H100 PCIe 80GB GPU, and the results are presented in Fig. 1.
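For reference, the operation that both examples map onto block-level GEMMs is the standard scaled dot-product attention (d_k is the key dimension; the softmax is applied row-wise), which MHA applies independently per head:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V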


Fig. 1 Comparison of the multi-head attention algorithm between PyTorch (light blue) and cuBLASDx (green) on an H100 PCIe 80GB GPU with maximum clocks set. The chart presents speed-ups of cuBLASDx over PyTorch's scaled_dot_product_attention function for sequences of different lengths with the batch size set to 64. Both input data and computations were in half precision (fp16). The NVIDIA PyTorch container image 23.04 was used for the PyTorch performance evaluation.

The batched_gemm_fp64 and blockdim_gemm_fp16 examples demonstrate the use of the BlockDim operator. In batched_gemm_fp64, adding a 1D BlockDim to a BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid). blockdim_gemm_fp16 includes multiple scenarios that present how to safely and correctly execute a BLAS operation when the kernel is launched with block dimensions different from the layout and number of threads specified in BlockDim.
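A rough sketch of the batched_gemm_fp64 pattern, with the sizes, batch count, and partitioning details being illustrative rather than taken from the shipped example:

    #include <cublasdx.hpp>

    // 1D BlockDim: a single GEMM of this description is executed by 128 threads.
    using GEMM = decltype(cublasdx::Size<16, 16, 16>()
                          + cublasdx::Precision<double>()
                          + cublasdx::Type<cublasdx::type::real>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<800>()
                          + cublasdx::Block()
                          + cublasdx::BlockDim<128>());

    constexpr unsigned int batch_per_block = 4;

    __global__ void batched_gemm_kernel(/* pointers to the batched A, B, C matrices */) {
        // Each threadIdx.y slice of the block stages its own A/B/C tiles in its portion of
        // shared memory and runs its own GEMM, so batch_per_block GEMMs are computed by a
        // single CUDA block (the exact partitioning follows the shipped example).
    }

    // Launched with 2D block dimensions, e.g.:
    // batched_gemm_kernel<<<grid, dim3(128, batch_per_block), smem_size>>>(...);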