Examples#

The cuBLASDx library provides multiple block-level BLAS samples covering basic GEMM operations with various precisions and types, as well as a few special examples that highlight the performance benefits of cuBLASDx.

Introduction Examples
  • introduction_example - cuBLASDx API introduction example

Simple GEMM Examples
  Basic Example
    • simple_gemm_fp32 - Performs fp32 GEMM
    • simple_gemm_int8_int8_int32 - Performs integral GEMM using Tensor Cores
    • simple_gemm_cfp16 - Performs complex fp16 GEMM
    • simple_gemm_fp8 - Performs fp8 GEMM
  Extra Examples
    • simple_gemm_leading_dimensions - Performs GEMM with non-default leading dimensions
    • simple_gemm_fp32_decoupled - Performs fp32 GEMM using a 16-bit input type to save on storage and transfers
    • simple_gemm_std_complex_fp32 - Performs GEMM with cuda::std::complex as the data type
    • simple_gemm_mixed_precision - Performs GEMM with different data types for matrices A, B, and C
    • simple_gemm_transform - Performs GEMM with transform operators
    • simple_gemm_custom_layout - Performs GEMM with custom matrix layouts
    • simple_gemm_aat - Performs GEMM where C = A * A^T

NVRTC Examples
  • nvrtc_gemm - Performs GEMM; the kernel is compiled using NVRTC

GEMM Performance
  • single_gemm_performance - Benchmark for a single GEMM
  • fused_gemm_performance - Benchmark for 2 GEMMs fused into a single kernel
  • device_gemm_performance - Benchmark for entire device GEMMs using cuBLASDx for a single tile

Advanced Examples
  Fusion
    • gemm_fusion - Performs 2 GEMMs in a single kernel
    • gemm_fft - Performs GEMM and FFT in a single kernel
    • gemm_fft_fp16 - Performs GEMM and FFT in a single kernel (half-precision complex type)
    • gemm_fft_performance - Benchmark for GEMM and FFT fused into a single kernel
  Deep Learning
    • scaled_dot_prod_attn - Scaled dot product attention using cuBLASDx
    • scaled_dot_prod_attn_batched - Multi-head attention using cuBLASDx
  Other
    • batched_gemm_fp64 - Manual batching in a single CUDA block
    • blockdim_gemm_fp16 - BLAS execution with different block dimensions

Introduction Examples#

  • introduction_example

Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. introduction_example is used in the beginner’s guide to run a GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.

Simple GEMM Examples#

  • simple_gemm_fp32

  • simple_gemm_fp32_decoupled

  • simple_gemm_int8_int8_int32

  • simple_gemm_cfp16

  • simple_gemm_fp8

  • simple_gemm_leading_dimensions

  • simple_gemm_std_complex_fp32

  • simple_gemm_mixed_precision

  • simple_gemm_transform

  • simple_gemm_custom_layout

  • simple_gemm_aat

In each of the examples listed above, a general matrix multiply (GEMM) operation is performed in a CUDA block. The examples show how to create a complete BLAS description, allocate memory, and set the block dimensions and the necessary amount of shared memory. The input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into registers and shared memory. The execute() method computes the GEMM and stores the results either in matrix C or in a register fragment accumulator; the results are then copied to global memory and finally back to the host, where they are verified against cuBLAS.
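
The common pattern is sketched below. This is a simplified, illustrative kernel based on the shared-memory pointer API; names such as GEMM::a_size, GEMM::shared_memory_size, and the execute() overload vary between cuBLASDx versions (newer releases express the same flow with tensors and copy helpers), so treat it as a sketch rather than the shipped example.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

// Block-level GEMM description: C (32x32) = alpha * A (32x32) * B (32x32) + beta * C,
// single precision, real type, targeting SM 8.0.
using GEMM = decltype(Size<32, 32, 32>()
                      + Precision<float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

__global__ void gemm_kernel(float alpha, const float* a, const float* b,
                            float beta, float* c) {
    extern __shared__ __align__(16) char smem[];

    // Partition dynamic shared memory between A, B, and C.
    float* sa = reinterpret_cast<float*>(smem);
    float* sb = sa + GEMM::a_size;
    float* sc = sb + GEMM::b_size;

    // Naive global-to-shared staging; the shipped examples use library copy helpers.
    const unsigned tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nt  = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < GEMM::a_size; i += nt) sa[i] = a[i];
    for (unsigned i = tid; i < GEMM::b_size; i += nt) sb[i] = b[i];
    for (unsigned i = tid; i < GEMM::c_size; i += nt) sc[i] = c[i];
    __syncthreads();

    // C = alpha * A * B + beta * C, computed cooperatively by the block.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Store the result back to global memory.
    for (unsigned i = tid; i < GEMM::c_size; i += nt) c[i] = sc[i];
}

// Host-side launch: block dimensions and shared memory size come from the description.
// gemm_kernel<<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, d_a, d_b, beta, d_c);
```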

The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C via the LeadingDimension operator. For performance reasons, it is recommended to try the leading dimensions suggested by suggested_leading_dimension_of.
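
For illustration, a description with explicit leading dimensions could look like the sketch below; the member names used for the suggested values (::lda, ::ldb, ::ldc) are an assumption and may differ between versions.

```cpp
// Base description without leading dimensions (same operators as in the simple examples).
using GEMMBase = decltype(cublasdx::Size<32, 32, 32>()
                          + cublasdx::Precision<float>()
                          + cublasdx::Type<cublasdx::type::real>()
                          + cublasdx::Function<cublasdx::function::MM>()
                          + cublasdx::SM<800>()
                          + cublasdx::Block());

// Static (compile-time) leading dimensions lda, ldb, ldc for A, B, and C.
using GEMMWithLD = decltype(GEMMBase() + cublasdx::LeadingDimension<34, 34, 32>());

// Leading dimensions suggested for the target architecture (member names assumed):
// using suggested = cublasdx::suggested_leading_dimension_of<GEMMBase, 800>;
// using GEMMTuned = decltype(GEMMBase()
//                   + cublasdx::LeadingDimension<suggested::lda, suggested::ldb, suggested::ldc>());
```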

The simple_gemm_fp32_decoupled example shows how to decouple the input (storage) precision from the compute precision: the inputs are kept in a 16-bit type to save on storage and global memory transfers, while the GEMM itself is computed in fp32.
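
The core idea can be sketched with a small helper that widens the 16-bit inputs to fp32 while staging them into shared memory (generic CUDA code, not the exact helper used in the example):

```cpp
#include <cuda_fp16.h>

// Widen 16-bit inputs to fp32 while copying from global to shared memory,
// so the GEMM itself runs in single precision.
__device__ void load_half_as_float(const __half* src, float* smem_dst, unsigned count) {
    const unsigned tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nt  = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < count; i += nt) {
        smem_dst[i] = __half2float(src[i]); // conversion happens on load
    }
}
```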

The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts matrices of types other than BLAS::a_value_type, BLAS::b_value_type, and BLAS::c_value_type. In this case, it is the complex type from the CUDA C++ Standard Library, cuda::std::complex<float>, but it could also be float2 provided by CUDA.
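
A sketch of a complex single-precision description and the corresponding execution call (the sizes and SM value are placeholders):

```cpp
#include <cuda/std/complex>
#include <cublasdx.hpp>

// Complex fp32 GEMM description.
using CGEMM = decltype(cublasdx::Size<16, 16, 16>()
                       + cublasdx::Precision<float>()
                       + cublasdx::Type<cublasdx::type::complex>()
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<800>()
                       + cublasdx::Block());

// Inside the kernel, the shared-memory matrices may use a compatible complex type
// instead of CGEMM::c_value_type, for example:
// cuda::std::complex<float>* sa, * sb, * sc;            // staged in shared memory
// cuda::std::complex<float> alpha{1.0f, 0.0f}, beta{0.0f, 0.0f};
// CGEMM().execute(alpha, sa, sb, beta, sc);
```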

The simple_gemm_mixed_precision example shows how to compute a mixed-precision GEMM, where matrices A, B, and C hold data of different precisions. Note that the scaling factors alpha and beta are expected to have the same precision and type as the elements of matrix C.
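
Assuming a cuBLASDx release in which the Precision operator accepts separate precisions for A, B, and C, a mixed-precision description could be sketched as:

```cpp
#include <cuda_fp16.h>
#include <cublasdx.hpp>

// Mixed-precision GEMM: half-precision A and B, single-precision C.
// (Per-matrix precisions in Precision<> are an assumption about the release in use.)
using MixedGEMM = decltype(cublasdx::Size<64, 64, 64>()
                           + cublasdx::Precision<__half, __half, float>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::SM<800>()
                           + cublasdx::Block());

// alpha and beta must match the precision and type of C, i.e. float here.
```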

The simple_gemm_transform example shows how to compute a GEMM with the transform operators a_load_op, b_load_op, c_load_op, and c_store_op, which are applied element-wise when loading matrices A, B, and C and when storing matrix C, respectively.
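
A sketch of such functors and how they might be passed to execute(); the exact overload taking the four transform arguments after the usual GEMM arguments is an assumption based on the example's description:

```cpp
// Element-wise transforms applied while loading A and B and while loading/storing C.
struct negate {
    __device__ float operator()(float x) const { return -x; }
};
struct identity {
    __device__ float operator()(float x) const { return x; }
};

// Inside the kernel (sketch): the transforms follow the usual GEMM arguments.
// GEMM().execute(alpha, sa, sb, beta, sc,
//                negate{},    // a_load_op: applied to each element of A on load
//                identity{},  // b_load_op
//                identity{},  // c_load_op
//                negate{});   // c_store_op: applied to each element of C on store
```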

The simple_gemm_custom_layout example shows how to compute a GEMM where the shared-memory matrices A, B, and C use custom CuTe layouts.
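
The idea can be sketched as follows, assuming a cuBLASDx release with the tensor API; the cublasdx::make_tensor call and the cute::* names are assumptions based on that API:

```cpp
#include <cute/layout.hpp>

// A 32x32 column-major layout for A padded to a leading dimension of 33,
// a common trick to avoid shared-memory bank conflicts.
using a_layout = cute::Layout<cute::Shape<cute::Int<32>, cute::Int<32>>,
                              cute::Stride<cute::Int<1>, cute::Int<33>>>;

// Inside the kernel (sketch): wrap the shared-memory pointer in a tensor with the
// custom layout and pass the tensors to execute().
// auto ta = cublasdx::make_tensor(smem_a, a_layout{});
// ... build tb and tc similarly, then:
// GEMM().execute(alpha, ta, tb, beta, tc);
```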

The simple_gemm_aat example shows how to perform C = A * A^T, where A and A^T occupy the same shared memory, allowing users to increase kernel occupancy or operate on larger matrices.

NVRTC Examples#

  • nvrtc_gemm

The NVRTC example presents how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM. The BLAS descriptions created with cuBLASDx operators are defined only in the device code. The header file cublasdx.hpp is also included only in the device code that is passed to NVRTC.
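
A minimal NVRTC flow is sketched below; the include path and architecture flag are placeholders, and the device source is abbreviated:

```cpp
#include <nvrtc.h>
#include <string>
#include <vector>

int main() {
    // The BLAS description and cublasdx.hpp appear only in the device source string.
    const char* kernel_source = R"src(
        #include <cublasdx.hpp>
        using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                              + cublasdx::Precision<float>()
                              + cublasdx::Type<cublasdx::type::real>()
                              + cublasdx::Function<cublasdx::function::MM>()
                              + cublasdx::SM<800>()
                              + cublasdx::Block());
        extern "C" __global__ void gemm_kernel(float alpha, float* a, float* b, float beta, float* c) {
            // ... shared-memory staging and GEMM().execute(...) as in the simple examples ...
        }
    )src";

    nvrtcProgram program;
    nvrtcCreateProgram(&program, kernel_source, "gemm_kernel.cu", 0, nullptr, nullptr);

    // Compilation options; the include path is a placeholder for the cuBLASDx headers.
    std::vector<const char*> options = {"--std=c++17",
                                        "--include-path=/path/to/cublasdx/include",
                                        "--gpu-architecture=compute_80"};
    nvrtcCompileProgram(program, static_cast<int>(options.size()), options.data());

    // Retrieve the PTX; it can then be loaded and launched with the CUDA driver API
    // (cuModuleLoadData, cuModuleGetFunction, cuLaunchKernel).
    size_t ptx_size = 0;
    nvrtcGetPTXSize(program, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(program, &ptx[0]);
    nvrtcDestroyProgram(&program);
    return 0;
}
```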

Note

Since version 0.0.1, cuBLASDx has offered experimental support for compilation with NVRTC. See the Requirements and Functionality section.

GEMM Performance#

  • single_gemm_performance

  • fused_gemm_performance

The examples listed above illustrate the performance of cuBLASDx.

The single_gemm_performance program presents the performance of a cuBLASDx device function executing a general matrix multiply (GEMM). Users can easily modify this sample to test the performance of the particular GEMM they need.

The fused_gemm_performance example shows the performance of two GEMM operations fused together into a single kernel. The kernel execution time is compared against the time cuBLAS requires to do the same calculations. In both cases, the measured operation is run multiple times and the average speed is reported.
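
The measurement pattern used by both benchmarks is the usual CUDA-event timing loop; a generic sketch (not the exact harness shipped with the samples):

```cpp
#include <cuda_runtime.h>

// Run a callable that launches the measured kernel(s); report the average time per launch.
template <typename Launcher>
float average_time_ms(Launcher&& launch, int warmup_runs = 5, int timed_runs = 100) {
    for (int i = 0; i < warmup_runs; ++i) launch();   // warm-up, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < timed_runs; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / timed_runs;                     // average per launch
}
```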

Advanced Examples#

  • gemm_fusion

  • gemm_fft

  • gemm_fft_fp16

  • gemm_fft_performance

  • scaled_dot_prod_attn

  • scaled_dot_prod_attn_batched

  • batched_gemm_fp64

  • blockdim_gemm_fp16

The advanced cuBLASDx examples show how cuBLASDx can be used to improve performance by fusing multiple calculations into a single kernel, which ultimately means fewer global memory accesses.

The gemm_fusion, gemm_fft, gemm_fft_fp16, and gemm_fft_performance examples present how to fuse multiple GEMMs, or a GEMM and an FFT, together in one kernel. This can be especially useful for pipelines with many small input matrices, as the Dx libraries can be easily adapted to batched execution by launching many CUDA blocks in a grid.
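
The fusion idea can be sketched as follows: the intermediate result of the first GEMM never leaves shared memory, so the second GEMM reads it directly. The skeleton below is illustrative only; GEMM1 and GEMM2 stand for two cuBLASDx descriptions with compatible sizes.

```cpp
// Skeleton of a kernel fusing two GEMMs: OUT = (alpha * A * B) * D + beta * OUT.
// The first product stays in shared memory and feeds the second GEMM directly.
__global__ void fused_gemm_kernel(float alpha, const float* a, const float* b,
                                  const float* d, float beta, float* out) {
    extern __shared__ __align__(16) char smem[];

    // ... stage A, B, D (and OUT) into shared-memory buffers sa, sb, sd, s_out ...

    // First GEMM: sc = alpha * sa * sb          (sc lives only in shared memory)
    // GEMM1().execute(alpha, sa, sb, 0.0f, sc);
    // __syncthreads();

    // Second GEMM: s_out = sc * sd + beta * s_out
    // GEMM2().execute(1.0f, sc, sd, beta, s_out);
    // __syncthreads();

    // ... copy s_out back to global memory ...
}
```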

The scaled_dot_prod_attn and scaled_dot_prod_attn_batched examples explore the areas of deep learning and natural language processing, showcasing implementations of the scaled dot-product attention and multi-head attention (MHA) algorithms. The performance of half-precision MHA in the scaled_dot_prod_attn_batched example was compared with PyTorch’s scaled_dot_product_attention function on an H100 PCIe 80GB; the results are presented in Fig. 1.

[Figure: Multi-Head Attention (half precision) performance with PyTorch and cuBLASDx on H100 PCIe 80GB with maximum clocks set.]

Fig. 1 Comparison of the Multi-Head Attention algorithm between PyTorch (light blue) and cuBLASDx (green) on H100 PCIe 80GB with maximum clocks set. The chart presents the speed-ups of cuBLASDx over PyTorch’s scaled_dot_product_attention function for sequences of different lengths, with the batch size set to 64. Both the input data and the computations were in half precision (fp16). NVIDIA PyTorch container image 23.04 was used for the PyTorch performance evaluation.#

Examples batched_gemm_fp64 and blockdim_gemm_fp16 demonstrate uses of the BlockDim operator. In batched_gemm_fp64, adding a 1D BlockDim to the BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid). blockdim_gemm_fp16 includes multiple scenarios that show how to safely and correctly execute a BLAS operation when the kernel is launched with block dimensions different from the layout and the number of threads specified in BlockDim.
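
A sketch of the batched_gemm_fp64 idea is shown below; the shared-memory member name and the exact launch details are assumptions based on the example's description:

```cpp
// fp64 GEMM executed by 128 threads, declared via the BlockDim operator.
using GEMM = decltype(cublasdx::Size<16, 16, 16>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block()
                      + cublasdx::BlockDim<128>());

// Launching with a 2D block of (128, 4) threads batches four GEMMs in one CUDA
// block: threadIdx.y selects the batch element, and each slice of 128 threads
// (threadIdx.x) executes its own GEMM on its own shared-memory matrices.
// dim3 block_dim(128, 4);
// batched_gemm_kernel<<<grid, block_dim, 4 * GEMM::shared_memory_size>>>(...);
```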