Examples#
The cuBLASDx library provides several block-level BLAS samples, covering basic GEMM operations with various precisions and types, as well as special examples that highlight the performance benefits of cuBLASDx.
| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| | | simple_gemm_int8_int8_int32 | Performs integral GEMM using Tensor Cores |
| | | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| | | simple_gemm_fp8 | Performs fp8 GEMM |
| | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| | | simple_gemm_fp32_decoupled | Performs fp32 GEMM using 16-bit input type to save on storage and transfers |
| | | simple_gemm_std_complex_fp32 | Performs GEMM with cuda::std::complex as the data type |
| | | simple_gemm_mixed_precision | Performs GEMM with different data types for matrices |
| | | simple_gemm_transform | Performs GEMM with transform operators |
| | | simple_gemm_custom_layout | Performs GEMM with custom matrix layouts |
| | | simple_gemm_aat | Performs GEMM where C = A * A^T |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM, kernel is compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for a single GEMM |
| | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| | | device_gemm_performance | Benchmark of a device-wide GEMM using cuBLASDx for a single tile |
| Advanced Examples | Fusion | gemm_fusion | Performs 2 GEMMs in a single kernel |
| | | gemm_fft | Performs GEMM and FFT in a single kernel |
| | | gemm_fft_fp16 | Performs GEMM and FFT in a single kernel (half-precision complex type) |
| | | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| | Other | batched_gemm_fp64 | Manual batching in a single CUDA block |
| | | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
| | | gemm_device_partial_sums | Use extra register array in higher precision to offload partial accumulation |
Introduction Examples#
- introduction_example

Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. The introduction_example is used in the beginner's guide to demonstrate GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.
Simple GEMM Examples#
- simple_gemm_fp32
- simple_gemm_fp32_decoupled
- simple_gemm_int8_int8_int32
- simple_gemm_cfp16
- simple_gemm_fp8
- simple_gemm_leading_dimensions
- simple_gemm_std_complex_fp32
- simple_gemm_mixed_precision
- simple_gemm_transform
- simple_gemm_custom_layout
- simple_gemm_aat
Each of the examples listed above performs a general matrix multiply (GEMM) operation within a CUDA block.
The examples demonstrate how to create a complete BLAS description, allocate memory, set block dimensions, and configure the necessary amount of shared memory.
Input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into registers and shared memory. The execute() method computes GEMM and stores the results in either matrix C or a register fragment accumulator. The results are then copied to global memory and finally back to the host. All results are verified against cuBLAS.
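For orientation, the following is a minimal sketch of this flow in the spirit of simple_gemm_fp32. It assumes the pointer-based shared-memory execute() overload and members such as BLAS::a_size, BLAS::block_dim, and BLAS::shared_memory_size; the 32x32x32 size, the kernel name, and the naive copy loops are illustrative and not the exact code of the shipped example (newer cuBLASDx releases expose a tensor-based API).

```cpp
#include <cublasdx.hpp>

// Block-level fp32 GEMM description: C = alpha * A * B + beta * C (size and SM value are illustrative).
using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::Block()
                      + cublasdx::SM<800>());

template<class GEMM>
__global__ void gemm_kernel(float alpha, const float* a, const float* b, float beta, float* c) {
    extern __shared__ float smem[];
    float* sa = smem;
    float* sb = sa + GEMM::a_size;
    float* sc = sb + GEMM::b_size;

    // Load the A, B, and C tiles from global to shared memory (naive copies for brevity).
    for (unsigned i = threadIdx.x; i < GEMM::a_size; i += blockDim.x) sa[i] = a[i];
    for (unsigned i = threadIdx.x; i < GEMM::b_size; i += blockDim.x) sb[i] = b[i];
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) sc[i] = c[i];
    __syncthreads();

    // Compute the GEMM; the result is accumulated into the C tile in shared memory.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Store the result back to global memory.
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) c[i] = sc[i];
}

// Host-side launch (error checking and host/device transfers omitted):
//   gemm_kernel<BLAS><<<1, BLAS::block_dim, BLAS::shared_memory_size>>>(alpha, d_a, d_b, beta, d_c);
```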
The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C using the LeadingDimension operator. For optimal performance, it is recommended to use the suggested leading dimensions from suggested_leading_dimension_of.
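As a hedged illustration, the description below adds compile-time leading dimensions. The LeadingDimension operator is the one named above, while the exact spelling of the suggested_leading_dimension_of trait (assumed here to expose lda/ldb/ldc members) and the size/SM values are assumptions.

```cpp
#include <cublasdx.hpp>

// Base description without explicit leading dimensions.
using BLAS = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::Block()
                      + cublasdx::SM<800>());

// Static (compile-time) leading dimensions chosen by hand...
using BLASCustomLD    = decltype(BLAS() + cublasdx::LeadingDimension<66, 66, 64>());

// ...or the leading dimensions suggested for the target architecture.
using suggested       = cublasdx::suggested_leading_dimension_of<BLAS, 800>;
using BLASSuggestedLD = decltype(BLAS() + cublasdx::LeadingDimension<suggested::lda,
                                                                     suggested::ldb,
                                                                     suggested::ldc>());
```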
The simple_gemm_fp32_decoupled example demonstrates how to decouple input precision from compute precision.
The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts matrices of types other than BLAS::a_value_type, BLAS::b_value_type, and BLAS::c_value_type. In this case, the type is cuda::std::complex<float> from the CUDA C++ Standard Library, but it could also be float2 provided by CUDA.
The simple_gemm_mixed_precision example shows how to compute a mixed-precision GEMM, where matrices A, B, and C have different data precisions. Note that the scaling factors alpha and beta are expected to have the same precision and type as the elements of matrix C.
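A short sketch of such a description follows, assuming a three-parameter Precision operator (one precision per matrix); the concrete precisions, size, and SM value are illustrative.

```cpp
#include <cublasdx.hpp>
#include <cuda_fp16.h>

// A and B are stored in half precision; C, alpha, and beta use float.
using MixedGEMM = decltype(cublasdx::Size<64, 64, 64>()
                           + cublasdx::Precision<__half, __half, float>()
                           + cublasdx::Type<cublasdx::type::real>()
                           + cublasdx::Function<cublasdx::function::MM>()
                           + cublasdx::Block()
                           + cublasdx::SM<800>());

// Scaling factors match the value type of C:
// float alpha = 1.0f;
// float beta  = 0.0f;
```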
The simple_gemm_transform example demonstrates how to compute GEMM with transform operators a_load_op, b_load_op, c_load_op, and c_store_op, which are applied element-wise when loading matrices A, B, and C, and when storing matrix C, respectively.
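The transform operators are plain element-wise callables. The functor shapes below are a sketch; the exact execute() overload that accepts them differs between cuBLASDx releases, so the call is shown only as a hedged comment.

```cpp
#include <cublasdx.hpp>

// Element-wise transform functors (purely illustrative).
struct negate_op {
    template<class T>
    __device__ T operator()(T v) const { return -v; }     // applied when loading A
};
struct double_op {
    template<class T>
    __device__ T operator()(T v) const { return v + v; }  // applied when loading C
};
struct identity_op {
    template<class T>
    __device__ T operator()(T v) const { return v; }      // no-op for the B load and the C store
};

// Inside the kernel the functors are passed alongside the matrices; the overload
// sketched here (matrices followed by a_load_op, b_load_op, c_load_op, c_store_op)
// is an assumption:
//   GEMM().execute(alpha, sa, sb, beta, sc,
//                  negate_op{}, identity_op{}, double_op{}, identity_op{});
```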
The simple_gemm_custom_layout example shows how to compute a GEMM where matrices A, B, and C in shared memory use custom CuTe layouts.
The simple_gemm_aat example demonstrates how to perform C = A * A^T, where both A and A^T occupy the same shared memory, allowing users to increase kernel occupancy or operate on larger matrices.
NVRTC Examples#
- nvrtc_gemm
The NVRTC example demonstrates how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM.
The BLAS descriptions created with cuBLASDx operators are defined only in the device code.
The header file cublasdx.hpp is also included only in the device code that is passed to NVRTC.
Note: Since version 0.0.1, cuBLASDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.
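The host-side flow can be sketched as follows; the include path, architecture flag, and kernel body are placeholders for whatever matches the local installation, and loading the resulting PTX with the CUDA driver API (cuModuleLoadDataEx, cuLaunchKernel) is left out.

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

// Device code passed to NVRTC as a string; cublasdx.hpp is included only here.
const char* kernel_source = R"kernel(
#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::Block()
                      + cublasdx::SM<800>());

extern "C" __global__ void gemm_kernel(float alpha, float* a, float* b, float beta, float* c) {
    // ... load tiles to shared memory, BLAS().execute(...), store back ...
}
)kernel";

int main() {
    nvrtcProgram program;
    nvrtcCreateProgram(&program, kernel_source, "gemm_kernel.cu", 0, nullptr, nullptr);

    // Include path for the MathDx headers is assumed; adjust to your installation.
    std::vector<const char*> options = {
        "--std=c++17",
        "--gpu-architecture=compute_80",
        "--include-path=/opt/nvidia/mathdx/include",
    };
    nvrtcResult result = nvrtcCompileProgram(program, static_cast<int>(options.size()), options.data());

    // Print the compilation log if there is one.
    size_t log_size = 0;
    nvrtcGetProgramLogSize(program, &log_size);
    if (log_size > 1) {
        std::string log(log_size, '\0');
        nvrtcGetProgramLog(program, &log[0]);
        std::printf("%s\n", log.c_str());
    }
    if (result != NVRTC_SUCCESS) { return 1; }

    // Retrieve the PTX; loading and launching it with the CUDA driver API follows.
    size_t ptx_size = 0;
    nvrtcGetPTXSize(program, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(program, &ptx[0]);

    nvrtcDestroyProgram(&program);
    return 0;
}
```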
GEMM Performance#
- single_gemm_performance
- fused_gemm_performance
- device_gemm_performance
The examples listed above illustrate the performance of cuBLASDx.
The single_gemm_performance program demonstrates the performance of the cuBLASDx device function executing a general matrix multiply (GEMM) operation. Users can easily modify this sample to test the performance of a specific GEMM configuration.
The fused_gemm_performance example shows the performance of two GEMM operations fused together into a single kernel.
The kernel execution time is compared against the time required by cuBLAS to perform the same calculations.
In both cases, the measured operation is run multiple times and the average speed is reported.
The device_gemm_performance example compares the performance of cuBLASDx with cuBLAS when performing a GEMM that spans the entire GPU. This example does not provide the best-performing tile sizes for all precisions; checking other data types with custom sizes may require a parameter search. Global GEMM dimensions are dynamic values and can be passed as command-line arguments:
```bash
# Perform default size GEMM with cuBLASDx static tile specified in code
./device_gemm_performance
# Perform custom size GEMM with cuBLASDx static tile specified in code
./device_gemm_performance m n k
```
Advanced Examples#
- gemm_fusion
- gemm_fft
- gemm_fft_fp16
- gemm_fft_performance
- batched_gemm_fp64
- blockdim_gemm_fp16
- gemm_device_partial_sums
The advanced cuBLASDx examples demonstrate how cuBLASDx can be utilized to improve performance by fusing multiple calculations into a single kernel, which ultimately reduces global memory accesses.
Examples gemm_fusion, gemm_fft, gemm_fft_fp16, and gemm_fft_performance show how to fuse multiple GEMMs, or a GEMM and an FFT, together in one kernel. This is especially useful for pipelines with many small input matrices, as Dx libraries can be easily adapted to batched execution by launching many CUDA blocks in a grid.
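The fusion pattern can be sketched as follows, again assuming the pointer-based shared-memory execute() overload: the output tile of the first GEMM stays in shared memory and is consumed directly by the second GEMM, so the intermediate result never travels through global memory. Names, sizes, and the shared-memory layout are illustrative.

```cpp
#include <cublasdx.hpp>

using GEMM1 = decltype(cublasdx::Size<32, 32, 32>() + cublasdx::Precision<float>()
                       + cublasdx::Type<cublasdx::type::real>()
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::Block() + cublasdx::SM<800>());
// The second GEMM consumes the 32x32 output of the first one as its A matrix.
using GEMM2 = GEMM1;

template<class G1, class G2>
__global__ void fused_kernel(const float* a, const float* b, const float* d, float* out) {
    extern __shared__ float smem[];
    float* sa = smem;                     // A tile
    float* sb = sa + G1::a_size;          // B tile
    float* sc = sb + G1::b_size;          // intermediate C = A * B, reused as A of the second GEMM
    float* sd = sc + G1::c_size;          // D tile
    float* se = sd + G2::b_size;          // final result E = C * D

    // ... load a, b, d into sa, sb, sd ...
    __syncthreads();

    G1().execute(1.0f, sa, sb, 0.0f, sc); // first GEMM
    __syncthreads();
    G2().execute(1.0f, sc, sd, 0.0f, se); // second GEMM, fused in the same kernel
    __syncthreads();

    // ... store se to out ...
}
```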
Examples batched_gemm_fp64 and blockdim_gemm_fp16 demonstrate the use of the BlockDim operator. In batched_gemm_fp64, adding a 1D BlockDim to a BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid). blockdim_gemm_fp16 includes multiple scenarios that demonstrate how to safely and correctly execute BLAS operations when the kernel is launched with block dimensions different from the layout and the number of threads specified in BlockDim.
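A sketch of that batching pattern follows; the 16x16x16 size, the batch count, and the per-batch shared-memory layout are assumptions, but the key idea matches the description above: the BLAS description uses a 1D BlockDim, and the kernel is launched with a second block dimension that indexes the in-block batch.

```cpp
#include <cublasdx.hpp>

// 1D BlockDim: each GEMM in the batch is computed by 128 threads (the x dimension).
using BatchedGEMM = decltype(cublasdx::Size<16, 16, 16>() + cublasdx::Precision<double>()
                             + cublasdx::Type<cublasdx::type::real>()
                             + cublasdx::Function<cublasdx::function::MM>()
                             + cublasdx::BlockDim<128>()
                             + cublasdx::Block() + cublasdx::SM<800>());

constexpr unsigned batches_per_block = 4;

template<class GEMM>
__global__ void batched_kernel(double alpha, double* a, double* b, double beta, double* c) {
    extern __shared__ double smem[];
    // threadIdx.y selects which GEMM of the in-block batch this thread works on.
    const unsigned batch = threadIdx.y;
    const unsigned tile  = GEMM::a_size + GEMM::b_size + GEMM::c_size;
    double* sa = smem + batch * tile;
    double* sb = sa + GEMM::a_size;
    double* sc = sb + GEMM::b_size;

    // ... load the batch-th A/B/C tiles into sa/sb/sc ...
    __syncthreads();
    GEMM().execute(alpha, sa, sb, beta, sc);  // 128 threads per GEMM
    __syncthreads();
    // ... store sc back to global memory ...
}

// Host side: launch with 2D block dimensions, e.g. dim3(128, batches_per_block),
// so batches_per_block GEMMs are computed concurrently in one CUDA block.
```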
The gemm_device_partial_sums example demonstrates how to use an extra register array in higher precision to offload partial accumulation of the result of a GEMM every N iterations and avoid precision loss. This can be useful for improving the performance of GEMMs with a large number of operations, or when the GEMM is part of a larger computation that requires a high-precision accumulator.
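The numerical idea can be illustrated with a library-agnostic sketch in which every name is assumed: a low-precision per-thread accumulator is flushed into a higher-precision register array every N iterations, so rounding error does not grow with the length of the K loop.

```cpp
#include <cuda_fp16.h>

// Each thread owns `fragment_size` output elements (an assumption for illustration).
constexpr int fragment_size = 8;
constexpr int flush_every_n = 16;  // offload partial sums every N iterations

__device__ void accumulate_with_partial_sums(const __half* a_slices,
                                             const __half* b_slices,
                                             int k_iterations,
                                             float* final_acc /* [fragment_size] */) {
    // Higher-precision accumulator that holds the running total.
    for (int e = 0; e < fragment_size; ++e) final_acc[e] = 0.0f;

    // Lower-precision (here: fp16) partial accumulator, flushed periodically.
    __half partial[fragment_size] = {};

    for (int k = 0; k < k_iterations; ++k) {
        // One step of the per-thread multiply-accumulate (stand-in for the GEMM inner loop).
        for (int e = 0; e < fragment_size; ++e) {
            partial[e] = __hfma(a_slices[k], b_slices[k * fragment_size + e], partial[e]);
        }

        // Every N iterations, move the partial sums into the float array and reset them,
        // so the fp16 accumulator never has to represent a large running total.
        if ((k + 1) % flush_every_n == 0 || k + 1 == k_iterations) {
            for (int e = 0; e < fragment_size; ++e) {
                final_acc[e] += __half2float(partial[e]);
                partial[e] = __half(0.0f);
            }
        }
    }
}
```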