Examples¶
The cuBLASDx library provides multiple block-level BLAS samples covering basic GEMM operations with various precisions and types, as well as a few special examples that highlight the performance benefits of cuBLASDx.
| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| | | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| | | simple_gemm_std_complex_fp32 | Performs GEMM with cuda::std::complex<float> as the data type |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM, kernel is compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for single GEMM |
| | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| Advanced Examples | Fusion | fused_gemm | Performs 2 GEMMs in a single kernel |
| | | gemm_fft | Performs GEMM and FFT in a single kernel |
| | | gemm_fft_fp16 | Performs GEMM and FFT in a single kernel (half-precision complex type) |
| | | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| | Deep Learning | scaled_dot_prod_attn | Scaled dot product attention using cuBLASDx |
| | | scaled_dot_prod_attn_batched | Multi-head attention using cuBLASDx |
| | Other | multiblock_gemm | Proof-of-concept for single large GEMM using multiple CUDA blocks |
| | | batched_gemm_fp64 | Manual batching in a single CUDA block |
| | | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
Introduction Examples¶
- introduction_example
Introduction examples are used in the documentation to explain the basics of the cuBLASDx library and its API. introduction_example is used in the beginner's guide to show how to run GEMM with the cuBLASDx API: General Matrix Multiply Using cuBLASDx.
Simple GEMM Examples¶
- simple_gemm_fp32
- simple_gemm_cfp16
- simple_gemm_leading_dimensions
- simple_gemm_std_complex_fp32
In each of the examples listed above, a general matrix multiply (GEMM) operation is performed in a single CUDA block.
The examples show how to create a complete BLAS description, allocate memory, set the block dimensions, and reserve the necessary amount of shared memory.
The input data (matrices A, B, and C) is generated on the host, copied into a device buffer, and loaded into shared memory.
The execute() method computes the GEMM and stores the results in matrix C; the results are then copied back to global memory and finally to the host.
The results are verified against cuBLAS.
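The sketch below illustrates this flow. It is a minimal sketch assuming the pointer-based execute() API of the early cuBLASDx releases; the trait names (a_size, b_size, c_size, block_dim, shared_memory_size, treated here as element counts and launch parameters) follow the library's block traits but may differ between versions, and the shared-memory loads are deliberately naive.

```cpp
#include <cublasdx.hpp>

// Block-level GEMM description: C = alpha * A * B + beta * C with
// 32x32x32 fp32 matrices, compiled for SM 8.0 (adjust SM<> to your GPU).
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

using value_type = typename GEMM::value_type;

__global__ void gemm_kernel(value_type alpha, const value_type* a,
                            const value_type* b, value_type beta,
                            value_type* c) {
    extern __shared__ value_type smem[];
    value_type* sa = smem;
    value_type* sb = sa + GEMM::a_size;
    value_type* sc = sb + GEMM::b_size;

    // Naive element-wise copy of A, B, and C into shared memory.
    for (unsigned i = threadIdx.x; i < GEMM::a_size; i += blockDim.x) sa[i] = a[i];
    for (unsigned i = threadIdx.x; i < GEMM::b_size; i += blockDim.x) sb[i] = b[i];
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) sc[i] = c[i];
    __syncthreads();

    // Block-level GEMM computed cooperatively by all threads in the block.
    GEMM().execute(alpha, sa, sb, beta, sc);
    __syncthreads();

    // Copy the results back to global memory.
    for (unsigned i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) c[i] = sc[i];
}

// Host-side launch (error checking omitted):
//   gemm_kernel<<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(alpha, d_a, d_b, beta, d_c);
```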
The simple_gemm_leading_dimensions example shows how to set static (compile-time) leading dimensions for matrices A, B, and C via the LeadingDimension operator.
For performance reasons, it is recommended to try the suggested leading dimensions obtained with suggested_leading_dimension_of.
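For illustration, a description with explicit leading dimensions might look like the following sketch, extending the GEMM description from above; the padding values are placeholders:

```cpp
// Pad the leading dimensions of A, B, and C at compile time.
// The values here are placeholders; suggested_leading_dimension_of can be
// queried for values tuned for a given architecture (see the library docs).
using GEMMWithLD = decltype(GEMM() + cublasdx::LeadingDimension<34, 34, 34>());
```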
The simple_gemm_std_complex_fp32 example demonstrates that cuBLASDx accepts input types other than BLAS::value_type.
In this case, it is the complex type from the CUDA C++ Standard Library, cuda::std::complex<float>, but float2 provided by CUDA can be used as well.
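A complex single-precision description could be built as follows; this is a sketch assuming the same operator set as in the earlier examples:

```cpp
#include <cublasdx.hpp>
#include <cuda/std/complex>

// Complex fp32 GEMM description; BLAS::value_type is a complex type here.
using CGEMM = decltype(cublasdx::Size<32, 32, 32>()
                     + cublasdx::Precision<float>()
                     + cublasdx::Type<cublasdx::type::complex>()
                     + cublasdx::Function<cublasdx::function::MM>()
                     + cublasdx::SM<800>()
                     + cublasdx::Block());

// Pointers to cuda::std::complex<float> (or float2) can be passed in
// place of CGEMM::value_type* as long as the types are layout-compatible.
```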
NVRTC Examples¶
- nvrtc_gemm
The NVRTC example shows how to use cuBLASDx with NVRTC runtime compilation to perform a GEMM.
The BLAS descriptions created with cuBLASDx operators are defined only in the device code.
The header file cublasdx.hpp is also included only in the device code that is passed to NVRTC.
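A sketch of the host-side flow is shown below; the include path and the architecture flag are assumptions and should be adjusted to your installation:

```cpp
#include <nvrtc.h>

// Device code passed to NVRTC as a string; cublasdx.hpp is included
// only here, not in the host translation unit.
const char* kernel_source = R"(
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());
// ... kernel definition using GEMM ...
)";

int main() {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernel_source, "gemm.cu", 0, nullptr, nullptr);
    // Placeholder include path for the cuBLASDx headers; the GPU
    // architecture flag must match the SM<> operator in the source.
    const char* opts[] = { "--std=c++17",
                           "--gpu-architecture=compute_80",
                           "-I/opt/nvidia/mathdx/include" };
    nvrtcCompileProgram(prog, 3, opts);
    // ... retrieve PTX with nvrtcGetPTX() and launch via the CUDA driver API ...
    nvrtcDestroyProgram(&prog);
    return 0;
}
```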
Note
Since version 0.0.1, cuBLASDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.
GEMM Performance¶
- single_gemm_performance
- fused_gemm_performance
The examples listed above illustrate the performance of cuBLASDx.
The single_gemm_performance program measures the performance of a cuBLASDx device function executing a general matrix multiply (GEMM). Users can easily modify this sample to test the performance of a particular GEMM they need.
The fused_gemm_performance
example shows the performance of two GEMM operations fused together into a single kernel.
The kernel execution time is compared against the time cuBLAS requires to do the same calculations.
In both cases, the measured operation is run multiple times and the average speed is reported.
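A common way to obtain such an average is to time a loop of kernel launches with CUDA events; the harness below is a generic sketch of that technique, not the sample's actual code, and the empty gemm_kernel and launch configuration are placeholders:

```cpp
#include <cuda_runtime.h>

// Placeholder for the kernel under test.
__global__ void gemm_kernel() { /* GEMM body omitted */ }

float average_kernel_time_ms(int runs) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so the timing excludes one-time initialization costs.
    gemm_kernel<<<1, 256>>>();
    cudaDeviceSynchronize();

    // Time `runs` back-to-back launches and report the per-launch average.
    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i) {
        gemm_kernel<<<1, 256>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / runs;
}
```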
Advanced Examples¶
- multiblock_gemm
- fused_gemm
- gemm_fft
- gemm_fft_fp16
- gemm_fft_performance
- scaled_dot_prod_attn
- scaled_dot_prod_attn_batched
- batched_gemm_fp64
- blockdim_gemm_fp16
The advanced cuBLASDx examples show how cuBLASDx can be used to improve performance by fusing many calculations into a single kernel, which ultimately means fewer global memory accesses.
The multiblock_gemm example is a proof of concept for executing a GEMM operation using multiple CUDA blocks with cuBLASDx, which is useful when the matrices don't fit into shared memory, or to introduce more parallelism. Users can experiment with the problem size, precision, data type, local block size, etc., and observe the effect on performance.
The fused_gemm, gemm_fft, gemm_fft_fp16, and gemm_fft_performance examples present how to fuse multiple GEMMs, or a GEMM and an FFT, together in one kernel. This can be especially useful for pipelines with many small input matrices, as the Dx libraries can be easily adapted to batched execution by launching many CUDA blocks in a grid.
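As a sketch of the fusion idea (assuming the same pointer-based API and trait names as in the earlier sketches), the output of the first GEMM can stay in shared memory and feed the second, never touching global memory in between:

```cpp
// Two chained block-level GEMMs in one kernel: E = (alpha * A * B) * D.
// Sizes are placeholders; GEMM2's input shape must match GEMM1's output.
using GEMM1 = decltype(cublasdx::Size<32, 32, 32>()
                     + cublasdx::Precision<float>()
                     + cublasdx::Type<cublasdx::type::real>()
                     + cublasdx::Function<cublasdx::function::MM>()
                     + cublasdx::SM<800>()
                     + cublasdx::Block());
using GEMM2 = GEMM1;  // same shape for simplicity
using value_type = typename GEMM1::value_type;

__global__ void fused_gemm_kernel(value_type alpha, const value_type* a,
                                  const value_type* b, const value_type* d,
                                  value_type* e) {
    extern __shared__ value_type smem[];
    value_type* sa = smem;
    value_type* sb = sa + GEMM1::a_size;
    value_type* sc = sb + GEMM1::b_size;  // intermediate result, A * B
    value_type* sd = sc + GEMM1::c_size;
    value_type* se = sd + GEMM2::b_size;

    // ... load a, b into sa, sb and d into sd (as in the simple example) ...

    GEMM1().execute(alpha, sa, sb, value_type(0), sc);       // sc = alpha * A * B
    __syncthreads();
    GEMM2().execute(value_type(1), sc, sd, value_type(0), se);  // se = sc * D
    __syncthreads();

    // ... store se to e ...
    // Launch must request shared memory for all five matrices combined.
}
```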
The scaled_dot_prod_attn and scaled_dot_prod_attn_batched examples explore the areas of deep learning and natural language processing, showcasing implementations of the scaled dot-product attention and multi-head attention (MHA) algorithms.
The performance of the half-precision MHA in the scaled_dot_prod_attn_batched example was compared with PyTorch's scaled_dot_product_attention function on an H100 PCIe 80GB GPU, and the results are presented in Fig. 1.
Fig. 1 Comparison of the multi-head attention algorithm between PyTorch (light blue) and cuBLASDx (green) on H100 PCIe 80GB with maximum clocks set. The chart presents speed-ups of cuBLASDx over PyTorch's scaled_dot_product_attention function for sequences of different lengths, with the batch size set to 64. Both input data and computations were in half precision (fp16). NVIDIA PyTorch container image 23.04 was used for PyTorch performance evaluation.¶
The batched_gemm_fp64 and blockdim_gemm_fp16 examples demonstrate the use of the BlockDim operator.
In batched_gemm_fp64, adding a 1D BlockDim to a BLAS description and launching the kernel with 2D block dimensions allows for manual batching of GEMMs in a single CUDA block (in contrast to batching by launching multiple blocks in a grid).
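The following sketch illustrates that idea, assuming (as the description above suggests) that a 1D BlockDim confines each execute() call to a 128-thread slice; the batch count and matrix sizes are placeholders, and the exact scheme is in the batched_gemm_fp64 source:

```cpp
// 1D BlockDim: each GEMM is computed by 128 threads.
using GEMM = decltype(cublasdx::Size<16, 16, 16>()
                    + cublasdx::Precision<double>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block()
                    + cublasdx::BlockDim<128>());

constexpr unsigned batch = 4;  // GEMMs per CUDA block (placeholder)

__global__ void batched_gemm_kernel(typename GEMM::value_type alpha,
                                    typename GEMM::value_type beta) {
    extern __shared__ typename GEMM::value_type smem[];
    // Launched with dim3(128, batch): each threadIdx.y slice of 128
    // threads runs its own GEMM on its own slice of shared memory.
    constexpr unsigned per_gemm = GEMM::a_size + GEMM::b_size + GEMM::c_size;
    auto* sa = smem + threadIdx.y * per_gemm;
    auto* sb = sa + GEMM::a_size;
    auto* sc = sb + GEMM::b_size;
    // ... load this batch element's matrices into sa, sb, sc, then:
    GEMM().execute(alpha, sa, sb, beta, sc);
}
```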
blockdim_gemm_fp16 includes multiple scenarios that show how to safely and correctly execute a BLAS operation when the kernel is launched with block dimensions different from the layout and the number of threads specified in BlockDim.