Examples#
The cuBLASDx library provides multiple block-level BLAS samples covering basic GEMM operations with various precisions and types, as well as a few special examples that highlight performance benefits of cuBLASDx.
| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| | | simple_gemm_int8_int8_int32 | Performs integer GEMM using Tensor Cores |
| | | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| | | simple_gemm_fp8 | Performs fp8 GEMM |
| | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| | | simple_gemm_fp32_decoupled | Performs fp32 GEMM using a 16-bit input type to save on storage and transfers |
| | | simple_gemm_std_complex_fp32 | Performs GEMM with cuda::std::complex<float> as the data type |
| | | simple_gemm_mixed_precision | Performs GEMM with different data types for the matrices |
| | | simple_gemm_transform | Performs GEMM with transform operators |
| | | simple_gemm_custom_layout | Performs GEMM with custom matrix layouts |
| | | simple_gemm_aat | Performs GEMM where C = A * A^T |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM; the kernel is compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for a single GEMM |
| | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| | | device_gemm_performance | Benchmark of entire device GEMMs using cuBLASDx for a single tile |
| Advanced Examples | Fusion | gemm_fusion | Performs 2 GEMMs in a single kernel |
| | | gemm_fft | Performs GEMM and FFT in a single kernel |
| | | gemm_fft_fp16 | Performs GEMM and FFT in a single kernel (half-precision complex type) |
| | | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| | Deep Learning | scaled_dot_prod_attn | Scaled dot product attention using cuBLASDx |
| | | scaled_dot_prod_attn_batched | Multi-head attention using cuBLASDx |
| | Other | batched_gemm_fp64 | Manual batching in a single CUDA block |
| | | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
Introduction Examples#
introduction_example
Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. introduction_example is used in the beginner's guide to run a GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.
Simple GEMM Examples#
simple_gemm_fp32
simple_gemm_fp32_decoupled
simple_gemm_int8_int8_int32
simple_gemm_cfp16
simple_gemm_fp8
simple_gemm_leading_dimensions
simple_gemm_std_complex_fp32
simple_gemm_mixed_precision
simple_gemm_transform
simple_gemm_custom_layout
simple_gemm_aat
In each of the examples listed above, a general matrix multiply (GEMM) operation is performed in a single CUDA block.
The examples show how to create a complete BLAS description, allocate memory, and set the block dimensions and the necessary amount of shared memory.
The input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into registers and shared memory.
The execute() method computes the GEMM and stores the results either in matrix C or in a register fragment accumulator; the results are then copied to global memory, and finally back to the host.
The results are verified against cuBLAS.
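The sketch below illustrates this flow with the shared-memory pointer API: a BLAS description built from cuBLASDx operators and a kernel that stages the matrices in shared memory, calls execute(), and writes C back. It is a minimal, hedged sketch rather than one of the shipped examples; the size, precision, and the SM<800> target are assumptions, and the global-to-shared copies are simplified compared to the cuBLASDx copy helpers used in the examples.

```cpp
#include <cublasdx.hpp>

// Compile-time description of one 32x32x32 fp32 GEMM executed by a single CUDA block.
// The target architecture (SM<800>) is an assumption; pick the one you compile for.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::SM<800>());

using value_type = typename GEMM::c_value_type;

__global__ void gemm_kernel(value_type alpha,
                            const value_type* a, const value_type* b,
                            value_type beta, value_type* c) {
    extern __shared__ __align__(16) char smem[];

    // Partition dynamic shared memory into A, B and C tiles
    // (a_size / b_size / c_size are element counts exposed by the description).
    auto* sa = reinterpret_cast<value_type*>(smem);
    auto* sb = sa + GEMM::a_size;
    auto* sc = sb + GEMM::b_size;

    // Simplified global-to-shared copies over all threads of the block.
    const unsigned tid      = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nthreads = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < GEMM::a_size; i += nthreads) sa[i] = a[i];
    for (unsigned i = tid; i < GEMM::b_size; i += nthreads) sb[i] = b[i];
    for (unsigned i = tid; i < GEMM::c_size; i += nthreads) sc[i] = c[i];
    __syncthreads();

    // C = alpha * A * B + beta * C, computed cooperatively by the whole block.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    for (unsigned i = tid; i < GEMM::c_size; i += nthreads) c[i] = sc[i];
}

// Launch with the block dimensions and shared memory size reported by the description:
//   gemm_kernel<<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, d_a, d_b, beta, d_c);
```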
The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C via the LeadingDimension operator.
For performance reasons, it is recommended to try the leading dimensions suggested by suggested_leading_dimension_of.
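For illustration, a hedged sketch of adding static leading dimensions to a description; the padded values below are arbitrary assumptions, not the ones used by the example.

```cpp
#include <cublasdx.hpp>

// Base description of a 64x64x64 fp32 block GEMM.
using Base = decltype(cublasdx::Size<64, 64, 64>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::SM<800>());

// Static (compile-time) leading dimensions for A, B and C. The padded values are
// arbitrary assumptions; in practice, query suggested_leading_dimension_of for the
// target architecture and plug in the values it reports.
using GEMMWithLd = decltype(Base() + cublasdx::LeadingDimension<72, 72, 72>());

// The shared memory requirement reported by the description
// (GEMMWithLd::shared_memory_size) already accounts for the extra padding.
```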
The simple_gemm_fp32_decoupled example shows how to decouple the input precision from the compute precision.
The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts matrices of types other than BLAS::a_value_type, BLAS::b_value_type, and BLAS::c_value_type. In this case, it is the complex type from the CUDA C++ Standard Library, cuda::std::complex<float>, but it could also be float2 provided by CUDA.
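A minimal sketch of that idea, assuming cuda::std::complex<float> is layout-compatible with the description's complex value type (which is what the example relies on); the sizes and SM target are assumptions.

```cpp
#include <cuda/std/complex>
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<16, 16, 16>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::complex>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::SM<800>());

// cuda::std::complex<float> has the same size as the description's complex value type,
// so shared-memory tiles of that type can be handed to execute() directly.
static_assert(sizeof(cuda::std::complex<float>) == sizeof(typename GEMM::c_value_type), "");

__device__ void block_gemm(cuda::std::complex<float>  alpha,
                           cuda::std::complex<float>* sa,
                           cuda::std::complex<float>* sb,
                           cuda::std::complex<float>  beta,
                           cuda::std::complex<float>* sc) {
    GEMM().execute(alpha, sa, sb, beta, sc);
}
```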
The simple_gemm_mixed_precision example shows how to compute a mixed-precision GEMM, where matrices A, B, and C hold data of different precisions. Note that the scaling factors alpha and beta are expected to have the same precision and type as the elements of matrix C.
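A hedged sketch of such a description, assuming the per-matrix form of the Precision operator; the chosen precisions (fp16 inputs, fp32 output) and the SM target are assumptions for illustration only.

```cpp
#include <type_traits>
#include <cuda_fp16.h>
#include <cublasdx.hpp>

// Hypothetical mixed-precision description: A and B hold fp16 data while C holds fp32.
// alpha and beta follow the precision and type of C (plain float here).
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<__half, __half, float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::SM<800>());

static_assert(std::is_same_v<typename GEMM::c_value_type, float>,
              "alpha, beta and the C elements share the fp32 precision");
```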
The simple_gemm_transform example shows how to compute a GEMM with the transform operators a_load_op, b_load_op, c_load_op, and c_store_op, which are applied element-wise when loading matrices A, B, and C and when storing matrix C, respectively.
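A hedged sketch of that idea is shown below; the functor names and the execute() overload taking the four transform functors after the usual arguments are assumptions based on the description above, with cublasdx::identity used as the no-op transform.

```cpp
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::SM<800>());

struct negate_op {
    __device__ float operator()(float v) const { return -v; }
};
struct double_op {
    __device__ float operator()(float v) const { return 2.0f * v; }
};

__device__ void transformed_gemm(float alpha, float* sa, float* sb, float beta, float* sc) {
    GEMM().execute(alpha, sa, sb, beta, sc,
                   negate_op{},           // a_load_op: applied to every A element on load
                   cublasdx::identity{},  // b_load_op: no transform
                   cublasdx::identity{},  // c_load_op: no transform
                   double_op{});          // c_store_op: applied to every C element on store
}
```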
The simple_gemm_custom_layout example shows how to compute a GEMM where matrices A, B, and C in shared memory use custom CuTe layouts.
The simple_gemm_aat example shows how to perform C = A * A^T, where both A and A^T occupy the same shared memory, allowing users to increase kernel occupancy or operate on larger matrices.
NVRTC Examples#
nvrtc_gemm
The NVRTC example shows how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM.
The BLAS descriptions created with cuBLASDx operators are defined only in the device code, and the header file cublasdx.hpp is included only in the device code passed to NVRTC.
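A rough sketch of the host-side NVRTC flow is shown below; the include paths, compile options, and target architecture are assumptions to be adapted to the local installation, and error handling is omitted.

```cpp
#include <nvrtc.h>
#include <vector>

// Device source passed to NVRTC as a string; cublasdx.hpp is included only here,
// never in the host translation unit.
const char* gemm_source = R"src(
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::SM<800>());

__global__ void gemm_kernel(float alpha, float* a, float* b, float beta, float* c) {
    extern __shared__ __align__(16) char smem[];
    // ... stage A, B, C in shared memory and call GEMM().execute(...) as in the simple examples ...
}
)src";

int main() {
    nvrtcProgram program;
    nvrtcCreateProgram(&program, gemm_source, "gemm_kernel.cu", 0, nullptr, nullptr);

    // Include paths and architecture are assumptions; point them at the local
    // cuBLASDx (MathDx) and CUDA Toolkit installations.
    std::vector<const char*> options = {
        "--std=c++17",
        "--gpu-architecture=compute_80",
        "--include-path=/opt/nvidia/mathdx/include",
        "--include-path=/usr/local/cuda/include"
    };
    nvrtcResult status = nvrtcCompileProgram(program, static_cast<int>(options.size()), options.data());

    // On success, retrieve the PTX/CUBIN with nvrtcGetPTX / nvrtcGetCUBIN and
    // load and launch it through the CUDA driver API.
    nvrtcDestroyProgram(&program);
    return status == NVRTC_SUCCESS ? 0 : 1;
}
```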
Note
Since version 0.0.1, cuBLASDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.
GEMM Performance#
single_gemm_performance
fused_gemm_performance
The examples listed above illustrate the performance of cuBLASDx.
The single_gemm_performance program presents the performance of a cuBLASDx device function executing a general matrix multiply (GEMM). Users can easily modify this sample to test the performance of the particular GEMM they need.
The fused_gemm_performance example shows the performance of two GEMM operations fused into a single kernel.
The kernel execution time is compared against the time cuBLAS requires to do the same calculations.
In both cases, the measured operation is run multiple times and the average speed is reported.
Advanced Examples#
gemm_fusion
gemm_fft
gemm_fft_fp16
gemm_fft_performance
scaled_dot_prod_attn
scaled_dot_prod_attn_batched
batched_gemm_fp64
blockdim_gemm_fp16
The advanced cuBLASDx examples show how cuBLASDx can be used to improve performance by fusing many calculations into a single kernel, which ultimately means fewer global memory accesses.
The gemm_fusion, gemm_fft, gemm_fft_fp16, and gemm_fft_performance examples present how to fuse multiple GEMMs, or a GEMM and an FFT, together in one kernel (see the sketch after this paragraph). This can be especially useful for pipelines with many small input matrices, since the Dx libraries can easily be adapted to batched execution by launching many CUDA blocks in a grid.
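A hedged sketch of the fusion idea for two GEMMs, where the intermediate product stays in shared memory instead of being written to global memory; the sizes, arrangements, and SM<800> target are assumptions, and the shipped gemm_fusion example differs in its details.

```cpp
#include <cublasdx.hpp>

// Two 32x32x32 fp32 block GEMMs; the second consumes the first one's result
// directly from shared memory, so the intermediate never touches global memory.
// All matrices use a column-major arrangement so the intermediate tile produced
// by GEMM1 can be fed to GEMM2 as-is.
using GEMM1 = decltype(cublasdx::Size<32, 32, 32>()
                     + cublasdx::Precision<float>()
                     + cublasdx::Type<cublasdx::type::real>()
                     + cublasdx::Function<cublasdx::function::MM>()
                     + cublasdx::Arrangement<cublasdx::col_major, cublasdx::col_major, cublasdx::col_major>()
                     + cublasdx::Block()
                     + cublasdx::SM<800>());
using GEMM2 = GEMM1;

// D = alpha * (A * B) * C2
__global__ void fused_gemm_kernel(float alpha, const float* a, const float* b,
                                  const float* c2, float* d) {
    extern __shared__ __align__(16) char smem[];
    auto* sa  = reinterpret_cast<float*>(smem);
    auto* sb  = sa  + GEMM1::a_size;
    auto* st  = sb  + GEMM1::b_size;   // intermediate T = A * B
    auto* sc2 = st  + GEMM1::c_size;
    auto* sd  = sc2 + GEMM2::b_size;

    const unsigned tid      = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    const unsigned nthreads = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned i = tid; i < GEMM1::a_size; i += nthreads) sa[i]  = a[i];
    for (unsigned i = tid; i < GEMM1::b_size; i += nthreads) sb[i]  = b[i];
    for (unsigned i = tid; i < GEMM2::b_size; i += nthreads) sc2[i] = c2[i];
    __syncthreads();

    GEMM1().execute(1.0f, sa, sb, 0.0f, st);    // T = A * B
    __syncthreads();
    GEMM2().execute(alpha, st, sc2, 0.0f, sd);  // D = alpha * T * C2
    __syncthreads();

    for (unsigned i = tid; i < GEMM2::c_size; i += nthreads) d[i] = sd[i];
}
// Launch with GEMM1::block_dim and enough dynamic shared memory for all five tiles.
```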
The scaled_dot_prod_attn and scaled_dot_prod_attn_batched examples explore the areas of deep learning and natural language processing, showcasing implementations of the scaled dot-product attention and multi-head attention (MHA) algorithms.
The performance of half-precision MHA in the scaled_dot_prod_attn_batched example was compared with PyTorch's scaled_dot_product_attention function on an H100 PCIe 80GB GPU, and the results are presented in Fig. 1.
Fig. 1 Comparison of the multi-head attention algorithm between PyTorch (light blue) and cuBLASDx (green) on an H100 PCIe 80GB GPU with maximum clocks set. The chart presents speed-ups of cuBLASDx over PyTorch's scaled_dot_product_attention function for sequences of different lengths, with the batch size set to 64. Both the input data and the computations were in half precision (fp16). The NVIDIA PyTorch container image 23.04 was used for the PyTorch performance evaluation.
The batched_gemm_fp64 and blockdim_gemm_fp16 examples demonstrate uses of the BlockDim operator.
In batched_gemm_fp64, adding a 1D BlockDim to the BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid).
blockdim_gemm_fp16 includes multiple scenarios that show how to safely and correctly execute a BLAS operation when the kernel is launched with block dimensions different from the layout and number of threads specified in BlockDim.
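A hedged sketch of the manual-batching idea from batched_gemm_fp64: a description with a 1D BlockDim<128> is executed by each threadIdx.y slice of a 2D block on its own shared-memory tiles. The sizes and the SM<800> target are assumptions, and the data layout and error handling are simplified compared to the shipped example.

```cpp
#include <cublasdx.hpp>

// fp64 GEMM description restricted to a 1D block of 128 threads via BlockDim.
using GEMM = decltype(cublasdx::Size<16, 16, 16>()
                    + cublasdx::Precision<double>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::Block()
                    + cublasdx::BlockDim<128>()
                    + cublasdx::SM<800>());

// Kernel launched with dim3(128, batch): each threadIdx.y slice of the block
// computes an independent GEMM on its own shared-memory tiles.
__global__ void batched_gemm_kernel(double alpha, const double* a, const double* b,
                                    double beta, double* c) {
    extern __shared__ __align__(16) char smem[];
    constexpr unsigned tile = GEMM::a_size + GEMM::b_size + GEMM::c_size;

    const unsigned batch = threadIdx.y;
    double* sa = reinterpret_cast<double*>(smem) + batch * tile;
    double* sb = sa + GEMM::a_size;
    double* sc = sb + GEMM::b_size;

    for (unsigned i = threadIdx.x; i < GEMM::a_size; i += blockDim.x) sa[i] = a[batch * GEMM::a_size + i];
    for (unsigned i = threadIdx.x; i < GEMM::b_size; i += blockDim.x) sb[i] = b[batch * GEMM::b_size + i];
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) sc[i] = c[batch * GEMM::c_size + i];
    __syncthreads();

    // Every threadIdx.y slice runs its own GEMM; with a 1D BlockDim only
    // threadIdx.x is used to assign work within each slice.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) c[batch * GEMM::c_size + i] = sc[i];
}

// Hypothetical launch for a batch of 4 GEMMs in one block:
//   batched_gemm_kernel<<<1, dim3(128, 4),
//                         4 * (GEMM::a_size + GEMM::b_size + GEMM::c_size) * sizeof(double)>>>(
//       alpha, d_a, d_b, beta, d_c);
```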