Examples

| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| | | simple_gemm_int8_int8_int32 | Performs integral GEMM using Tensor Cores |
| | | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| | | simple_gemm_fp8 | Performs fp8 GEMM |
| | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| | | simple_gemm_fp32_decoupled | Performs fp32 GEMM using a 16-bit input type to save on storage and transfers |
| | | simple_gemm_std_complex_fp32 | Performs GEMM with `cuda::std::complex` as the data type |
| | | simple_gemm_mixed_precision | Performs GEMM with different data types for the matrices |
| | | simple_gemm_transform | Performs GEMM with transform operators |
| | | simple_gemm_custom_layout | Performs GEMM with custom matrix layouts |
| | | simple_gemm_aat | Performs GEMM where C = A * A^T |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM; the kernel is compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for a single GEMM |
| | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| | | device_gemm_performance | Benchmark for entire-device GEMMs using cuBLASDx for a single tile |
| Advanced Examples | Fusion | gemm_fusion | Performs 2 GEMMs in a single kernel |
| | | gemm_fft | Performs GEMM and FFT in a single kernel |
| | | gemm_fft_fp16 | Performs GEMM and FFT in a single kernel (half-precision complex type) |
| | | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| | Deep Learning | scaled_dot_prod_attn | Scaled dot-product attention using cuBLASDx |
| | | scaled_dot_prod_attn_batched | Multi-head attention using cuBLASDx |
| | Other | batched_gemm_fp64 | Manual batching in a single CUDA block |
| | | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
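For orientation, the simple GEMM examples share a common shape, sketched roughly below: a GEMM type is composed from cuBLASDx operators, then executed on shared-memory tiles inside a single CUDA block. This is a hedged sketch, not a drop-in example — the operator names (`Size`, `Precision`, `Type`, `Function`, `SM`, `Block`) and members (`a_size`, `b_size`, `block_dim`, `shared_memory_size`) follow the cuBLASDx documentation, but exact signatures vary between library versions, and the global/shared memory copies are omitted; `introduction_example` is the authoritative version.

```cuda
// Sketch of a simple_gemm_fp32-style kernel (assumptions noted above).
#include <cublasdx.hpp>

// Compose the GEMM description from cuBLASDx operators.
using GEMM = decltype(cublasdx::Size<32, 32, 32>()              // M, N, K tile
                    + cublasdx::Precision<float>()              // fp32 compute
                    + cublasdx::Type<cublasdx::type::real>()    // real-valued GEMM
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()                       // target architecture
                    + cublasdx::Block());                       // block-level execution

__global__ void gemm_kernel(float alpha, const float* a, const float* b,
                            float beta, float* c) {
    extern __shared__ __align__(16) char smem[];
    // Partition dynamic shared memory into A, B, and C tiles.
    auto* sa = reinterpret_cast<float*>(smem);
    auto* sb = sa + GEMM::a_size;
    auto* sc = sb + GEMM::b_size;
    // (Global-to-shared copies of a, b, c omitted for brevity.)
    GEMM().execute(alpha, sa, sb, beta, sc);
    // (Shared-to-global copy of the C tile omitted.)
}
```

A typical launch uses the block dimensions and shared-memory size exposed by the description, e.g. `gemm_kernel<<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(...)`.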