Examples#

The cuBLASDx library provides several block-level BLAS samples, covering basic GEMM operations with various precisions and types, as well as special examples that highlight the performance benefits of cuBLASDx.

Examples

Group                 | Subgroup       | Example                        | Description
----------------------|----------------|--------------------------------|---------------------------------------------------------------------------
Introduction Examples |                | introduction_example           | cuBLASDx API introduction example
Simple GEMM Examples  | Basic Example  | simple_gemm_fp32               | Performs fp32 GEMM
                      |                | simple_gemm_int8_int8_int32    | Performs integral GEMM using Tensor Cores
                      |                | simple_gemm_cfp16              | Performs complex fp16 GEMM
                      |                | simple_gemm_fp8                | Performs fp8 GEMM
                      | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions
                      |                | simple_gemm_fp32_decoupled     | Performs fp32 GEMM using 16-bit input type to save on storage and transfers
                      |                | simple_gemm_std_complex_fp32   | Performs GEMM with cuda::std::complex as data type
                      |                | simple_gemm_mixed_precision    | Performs GEMM with different data type for matrices A, B and C
                      |                | simple_gemm_transform          | Performs GEMM with transform operators
                      |                | simple_gemm_custom_layout      | Performs GEMM with custom matrix layouts
                      |                | simple_gemm_aat                | Performs GEMM where C = A * A^T
NVRTC Examples        |                | nvrtc_gemm                     | Performs GEMM, kernel is compiled using NVRTC
GEMM Performance      |                | single_gemm_performance        | Benchmark for single GEMM
                      |                | fused_gemm_performance         | Benchmark for 2 GEMMs fused into a single kernel
                      |                | device_gemm_performance        | Benchmark entire device GEMMs using cuBLASDx for single tile
Advanced Examples     | Fusion         | gemm_fusion                    | Performs 2 GEMMs in a single kernel
                      |                | gemm_fft                       | Performs GEMM and FFT in a single kernel
                      |                | gemm_fft_fp16                  | Performs GEMM and FFT in a single kernel (half-precision complex type)
                      |                | gemm_fft_performance           | Benchmark for GEMM and FFT fused into a single kernel
                      | Other          | batched_gemm_fp64              | Manual batching in a single CUDA block
                      |                | blockdim_gemm_fp16             | BLAS execution with different block dimensions
                      |                | gemm_device_partial_sums       | Use extra register array in higher precision to offload partial accumulation

Introduction Examples#

  • introduction_example

Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. The introduction_example is used in the beginner’s guide to demonstrate GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.
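
For reference, the compile-time description used in this kind of example can be composed from cuBLASDx operators roughly as sketched below; the 32 x 32 x 32 single-precision problem and the SM<800> (sm_80) target are assumptions made for this sketch.

#include <cublasdx.hpp>

// GEMM description built entirely at compile time from cuBLASDx operators:
// C (32x32) = alpha * A (32x32) * B (32x32) + beta * C, in fp32, executed by one CUDA block.
// SM<800> selects the target architecture and should match the GPU being used.
using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());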

Simple GEMM Examples#

  • simple_gemm_fp32

  • simple_gemm_fp32_decoupled

  • simple_gemm_int8_int8_int32

  • simple_gemm_cfp16

  • simple_gemm_fp8

  • simple_gemm_leading_dimensions

  • simple_gemm_std_complex_fp32

  • simple_gemm_mixed_precision

  • simple_gemm_transform

  • simple_gemm_custom_layout

  • simple_gemm_aat

Each of the examples listed above performs a general matrix multiply (GEMM) operation within a CUDA block. The examples demonstrate how to create a complete BLAS description, allocate memory, set block dimensions, and configure the necessary amount of shared memory. Input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into registers and shared memory. The execute() method computes GEMM and stores the results in either matrix C or a register fragment accumulator. The results are then copied to global memory and finally back to the host. All results are verified against cuBLAS.
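
The sketch below illustrates that flow for the 32 x 32 x 32 fp32 description, assuming the pointer-based shared-memory execute() overload and a naive cooperative copy between global and shared memory; the individual samples use more refined loading helpers and, in some cases, the register-fragment API instead.

#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

using value_type = typename BLAS::c_value_type;
constexpr unsigned tile_elements = 32 * 32;  // A, B and C are all 32x32 in this description

__global__ void gemm_kernel(value_type alpha, const value_type* a, const value_type* b,
                            value_type beta, value_type* c) {
    // Dynamic shared memory holds the A, B and C tiles back to back.
    extern __shared__ __align__(16) char smem[];
    value_type* sa = reinterpret_cast<value_type*>(smem);
    value_type* sb = sa + tile_elements;
    value_type* sc = sb + tile_elements;

    // Naive cooperative copy of all three matrices into shared memory.
    const unsigned tid      = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nthreads = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < tile_elements; i += nthreads) {
        sa[i] = a[i];
        sb[i] = b[i];
        sc[i] = c[i];
    }
    __syncthreads();

    // Block-level GEMM: sc = alpha * sa * sb + beta * sc.
    BLAS().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Copy the result back to global memory.
    for (unsigned i = tid; i < tile_elements; i += nthreads) {
        c[i] = sc[i];
    }
}

// Hypothetical launch, taking block dimensions and shared memory size from the description:
// gemm_kernel<<<1, BLAS::block_dim, BLAS::shared_memory_size>>>(alpha, d_a, d_b, beta, d_c);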

The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C using the LeadingDimension operator. For optimal performance, it is recommended to use the suggested leading dimensions from suggested_leading_dimension_of.
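
A compile-time sketch of both options follows; the lda/ldb/ldc member names of suggested_leading_dimension_of are an assumption of this sketch and may differ from the actual trait.

// Base 32x32x32 fp32 description without a LeadingDimension operator
// (default leading dimensions equal the corresponding matrix dimensions).
using BLAS_base = decltype(cublasdx::Size<32, 32, 32>()
                           + cublasdx::Precision<float>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<800>()
                           + cublasdx::Block());

// Static, user-chosen leading dimensions for A, B and C.
using BLAS_ld = decltype(BLAS_base() + cublasdx::LeadingDimension<34, 34, 32>());

// Leading dimensions suggested for the target architecture (member names assumed).
using suggested = cublasdx::suggested_leading_dimension_of<BLAS_base, 800>;
using BLAS_fast = decltype(BLAS_base()
                           + cublasdx::LeadingDimension<suggested::lda, suggested::ldb, suggested::ldc>());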

The simple_gemm_fp32_decoupled example demonstrates how to decouple input precision from compute precision: the inputs are stored in a 16-bit type to save on storage and global memory transfers, while the GEMM itself is computed in fp32.


The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts matrices of types other than BLAS::a_value_type, BLAS::b_value_type, and BLAS::c_value_type. In this case, the type is cuda::std::complex<float> from the CUDA C++ Standard Library, but it could also be float2 provided by CUDA.
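
A short illustration of this, assuming a complex fp32 description and the pointer-based shared-memory execute() overload; whether a given user type is accepted depends on it matching the size and alignment of the corresponding BLAS value type.

#include <cublasdx.hpp>
#include <cuda/std/complex>

using BLAS_c32 = decltype(cublasdx::Size<32, 32, 32>()
                          + cublasdx::Precision<float>()
                          + cublasdx::Type<cublasdx::type::complex>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<800>()
                          + cublasdx::Block());

// Shared-memory tiles declared directly with cuda::std::complex<float>
// instead of BLAS_c32::a/b/c_value_type.
__device__ void complex_gemm(cuda::std::complex<float>  alpha,
                             cuda::std::complex<float>* sa,
                             cuda::std::complex<float>* sb,
                             cuda::std::complex<float>  beta,
                             cuda::std::complex<float>* sc) {
    BLAS_c32().execute(alpha, sa, sb, beta, sc);
}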

The simple_gemm_mixed_precision example shows how to compute a mixed-precision GEMM, where matrices A, B, and C have different data precisions. Note that the scaling factors alpha and beta are expected to have the same precision and type as the elements of matrix C.
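
A sketch of such a description, assuming the three-argument form of the Precision operator (one precision per matrix) with fp16 inputs and an fp32 output:

#include <cublasdx.hpp>
#include <cuda_fp16.h>

// A and B are stored and loaded as fp16, C is fp32; alpha and beta are then expected as fp32 as well.
using BLAS_mixed = decltype(cublasdx::Size<32, 32, 32>()
                            + cublasdx::Precision<__half, __half, float>()
                            + cublasdx::Type<cublasdx::type::real>()
                            + cublasdx::Function<cublasdx::function::MM>()
                            + cublasdx::SM<800>()
                            + cublasdx::Block());

// Resulting element types:
//   BLAS_mixed::a_value_type, BLAS_mixed::b_value_type -> fp16
//   BLAS_mixed::c_value_type                           -> fp32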

The simple_gemm_transform example demonstrates how to compute GEMM with transform operators a_load_op, b_load_op, c_load_op, and c_store_op, which are applied element-wise when loading matrices A, B, and C, and when storing matrix C, respectively.
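
The transform operators are ordinary element-wise device functors. The sketch below assumes an execute() overload that accepts them after the usual alpha/A/B/beta/C arguments, in the order a_load_op, b_load_op, c_load_op, c_store_op; the functor names are placeholders.

// Element-wise functors applied while loading A, B, C and storing C (placeholder names).
struct negate_op {
    __device__ float operator()(float v) const { return -v; }
};
struct identity_op {
    __device__ float operator()(float v) const { return v; }
};
struct double_op {
    __device__ float operator()(float v) const { return 2.0f * v; }
};

// BLAS is the 32x32x32 fp32 description composed in the earlier sketches.
__device__ void transformed_gemm(float alpha, float* sa, float* sb, float beta, float* sc) {
    // Assumed overload: stores sc = 2 * (alpha * (-sa) * sb + beta * sc),
    // with all transforms applied on the fly, element by element.
    BLAS().execute(alpha, sa, sb, beta, sc,
                   negate_op{},    // a_load_op
                   identity_op{},  // b_load_op
                   identity_op{},  // c_load_op
                   double_op{});   // c_store_op
}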

The simple_gemm_custom_layout example shows how to compute a GEMM where matrices A, B, and C in shared memory use custom CuTe layouts.

The simple_gemm_aat example demonstrates how to perform C = A * A^T, where both A and A^T occupy the same shared memory, allowing users to increase kernel occupancy or operate on larger matrices.

NVRTC Examples#

  • nvrtc_gemm

The NVRTC example demonstrates how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM. The BLAS descriptions created with cuBLASDx operators are defined only in the device code. The header file cublasdx.hpp is also included only in the device code that is passed to NVRTC.
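
A condensed sketch of that flow using standard NVRTC calls; the kernel string is abbreviated, and the include paths and architecture flag are placeholders that must match the actual installation and GPU.

#include <nvrtc.h>
#include <string>
#include <vector>

// Device code passed to NVRTC as a string; cublasdx.hpp is included only here.
const char* kernel_source = R"kernel(
#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

__global__ void gemm_kernel(float alpha, float* a, float* b, float beta, float* c) {
    // Shared-memory setup and BLAS().execute(...) omitted for brevity.
}
)kernel";

bool compile_to_ptx(std::string& ptx) {
    nvrtcProgram program;
    if (nvrtcCreateProgram(&program, kernel_source, "nvrtc_gemm.cu", 0, nullptr, nullptr) != NVRTC_SUCCESS) {
        return false;
    }

    // Placeholder options: include paths must point at the cuBLASDx (MathDx) and CUDA headers.
    std::vector<const char*> options = {
        "--std=c++17",
        "--include-path=/path/to/mathdx/include",  // placeholder path
        "--include-path=/path/to/cuda/include",    // placeholder path
        "--gpu-architecture=compute_80"
    };

    if (nvrtcCompileProgram(program, static_cast<int>(options.size()), options.data()) != NVRTC_SUCCESS) {
        nvrtcDestroyProgram(&program);
        return false;
    }

    size_t ptx_size = 0;
    nvrtcGetPTXSize(program, &ptx_size);
    ptx.resize(ptx_size);
    nvrtcGetPTX(program, &ptx[0]);  // the PTX can then be loaded with the CUDA driver API
    nvrtcDestroyProgram(&program);
    return true;
}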

Note

Since version 0.0.1, cuBLASDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.

GEMM Performance#

  • single_gemm_performance

  • fused_gemm_performance

  • device_gemm_performance

The examples listed above illustrate the performance of cuBLASDx.

The single_gemm_performance program demonstrates the performance of the cuBLASDx device function executing a general matrix multiply (GEMM) operation. Users can easily modify this sample to test the performance of a specific GEMM configuration.

The fused_gemm_performance example shows the performance of two GEMM operations fused together into a single kernel. The kernel execution time is compared against the time required by cuBLAS to perform the same calculations. In both cases, the measured operation is run multiple times and the average speed is reported.

The device_gemm_performance example compares the performance of cuBLASDx with cuBLAS when performing a GEMM that spans the entire GPU. The example does not ship the best-performing tile sizes for every precision; obtaining good results for other data types and custom sizes may require a parameter search. The global GEMM dimensions are dynamic values and can be passed as command-line arguments:

# Perform default size GEMM with cuBLASDx static tile specified in code
./device_gemm_performance

# Perform custom size GEMM with cuBLASDx static tile specified in code
./device_gemm_performance m n k

Advanced Examples#

  • gemm_fusion

  • gemm_fft

  • gemm_fft_fp16

  • gemm_fft_performance

  • batched_gemm_fp64

  • blockdim_gemm_fp16

  • gemm_device_partial_sums

The advanced cuBLASDx examples demonstrate how cuBLASDx can be utilized to improve performance by fusing multiple calculations into a single kernel, which ultimately reduces global memory accesses.

Examples gemm_fusion, gemm_fft, gemm_fft_fp16, and gemm_fft_performance show how to fuse multiple GEMMs or a GEMM and an FFT together in one kernel. This is especially useful for pipelines with many small input matrices, as Dx libraries can be easily adapted to batched execution by launching many CUDA blocks in a grid.
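
Schematically, fusion with the shared-memory API looks like the sketch below: two descriptions are executed back to back inside one kernel, with the intermediate result staying in shared memory (descriptions, loads, and stores are omitted, and the pointer-based execute() overload is assumed).

// Fused kernel body: E = alpha2 * (alpha1 * A * B) * D + beta2 * E, with the
// intermediate C = alpha1 * A * B never leaving shared memory.
template<class GEMM1, class GEMM2, class T = typename GEMM1::c_value_type>
__device__ void fused_gemms(T alpha1, T* sa, T* sb, T* sc,
                            T alpha2, T* sd, T beta2, T* se) {
    GEMM1().execute(alpha1, sa, sb, T(0), sc);   // first GEMM writes the intermediate tile
    __syncthreads();                             // make it visible to the whole block
    GEMM2().execute(alpha2, sc, sd, beta2, se);  // second GEMM consumes it directly
}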

Examples batched_gemm_fp64 and blockdim_gemm_fp16 demonstrate the use of the BlockDim operator. In batched_gemm_fp64, adding a 1D BlockDim to a BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid). blockdim_gemm_fp16 includes multiple scenarios that demonstrate how to safely and correctly execute BLAS operations when the kernel is launched with block dimensions different from the layout and the number of threads specified in BlockDim.
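
A sketch of the manual-batching mechanism under the stated assumptions: a 1D BlockDim<128> in the description, a 2D launch of dim3(128, batches), and one set of shared-memory tiles per threadIdx.y slice (global-to-shared loads are omitted).

using GEMM_fp64 = decltype(cublasdx::Size<16, 16, 16>()
                           + cublasdx::Precision<double>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<800>()
                           + cublasdx::Block()
                           + cublasdx::BlockDim<128>());  // each GEMM is executed by 128 threads

__global__ void batched_gemm_kernel(double alpha, double beta) {
    extern __shared__ __align__(16) char smem[];
    constexpr unsigned tile = 16 * 16;

    // Launched with dim3(128, batches): every y-slice of 128 threads owns its own
    // A/B/C tiles in shared memory and performs an independent GEMM.
    double* sa = reinterpret_cast<double*>(smem) + threadIdx.y * 3 * tile;
    double* sb = sa + tile;
    double* sc = sb + tile;

    // Global-to-shared loads and stores omitted for brevity.
    GEMM_fp64().execute(alpha, sa, sb, beta, sc);
}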

The gemm_device_partial_sums example demonstrates how to use an extra register array in higher precision to offload partial accumulation of the result of a GEMM every N iterations and avoid precision loss. This can be useful for improving the performance of GEMMs with a large number of operations, or when the GEMM is part of a larger computation that requires a high precision accumulator.
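
The numerical idea behind this pattern can be sketched independently of the register-fragment machinery: keep the running accumulator in the working precision and, every N k-tiles, fold it into a second, higher-precision register array before resetting it. The fragment size, tile count, and compute_partial_gemm placeholder below are illustrative only.

// Schematic only: FRAGMENT_ELEMENTS, k_tiles and compute_partial_gemm() are placeholders
// standing in for the register-fragment machinery used by the example.
constexpr unsigned FRAGMENT_ELEMENTS = 8;  // per-thread accumulator size (placeholder)
constexpr unsigned OFFLOAD_EVERY_N   = 4;  // fold partial sums after this many k-tiles

__device__ void gemm_with_partial_sums(unsigned k_tiles) {
    float  partial[FRAGMENT_ELEMENTS] = {};  // working-precision accumulator
    double total[FRAGMENT_ELEMENTS]   = {};  // higher-precision running result

    for (unsigned t = 0; t < k_tiles; ++t) {
        // compute_partial_gemm(partial, t);  // placeholder: accumulate one k-tile into `partial`

        if ((t + 1) % OFFLOAD_EVERY_N == 0 || t + 1 == k_tiles) {
            // Offload: add the partial sums into the high-precision array and reset them,
            // so rounding error cannot build up across many k-tiles.
            for (unsigned i = 0; i < FRAGMENT_ELEMENTS; ++i) {
                total[i] += static_cast<double>(partial[i]);
                partial[i] = 0.0f;
            }
        }
    }
    // `total` now holds the accumulated result in higher precision (store omitted).
}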