Examples

| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| | | simple_gemm_int8_int8_int32 | Performs integer GEMM using Tensor Cores |
| | | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| | | simple_gemm_fp8 | Performs fp8 GEMM |
| | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| | | simple_gemm_fp32_decoupled | Performs fp32 GEMM using a 16-bit input type to save on storage and transfers |
| | | simple_gemm_std_complex_fp32 | Performs GEMM with cuda::std::complex as the data type |
| | | simple_gemm_mixed_precision | Performs GEMM with different data types for matrices A, B, and C |
| | | simple_gemm_transform | Performs GEMM with transform operators |
| | | simple_gemm_custom_layout | Performs GEMM with custom matrix layouts |
| | | simple_gemm_aat | Performs GEMM where C = A * A^T |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM with the kernel compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for a single GEMM |
| | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| | | device_gemm_performance | Benchmark for a device-level GEMM built from a single cuBLASDx tile |
| Advanced Examples | Fusion | gemm_fusion | Performs 2 GEMMs in a single kernel |
| | | gemm_fft | Performs GEMM and FFT in a single kernel |
| | | gemm_fft_fp16 | Performs GEMM and FFT in a single kernel (half-precision complex type) |
| | | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| | Deep Learning | scaled_dot_prod_attn | Scaled dot product attention using cuBLASDx |
| | | scaled_dot_prod_attn_batched | Multi-head attention using cuBLASDx |
| | Other | batched_gemm_fp64 | Manual batching in a single CUDA block |
| | | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
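For orientation, the sketch below shows the compile-time operator-composition pattern that the simple GEMM examples above (e.g. simple_gemm_fp32) are built around: a GEMM type is described by combining operators, then executed by one CUDA block on shared-memory tiles. This is a rough sketch, not code from the examples — the tile size, SM target, shared-memory layout, and the exact `execute` signature are assumptions that should be verified against the cuBLASDx release in use.

```cuda
#include <cublasdx.hpp>
using namespace cublasdx;

// Describe a 32x32x32 single-precision real GEMM executed by one CUDA block.
// SM<800>() targets sm_80; all values here are illustrative assumptions.
using GEMM = decltype(Size<32, 32, 32>()
                    + Precision<float>()
                    + Type<type::real>()
                    + Function<function::MM>()
                    + SM<800>()
                    + Block());

__global__ void gemm_kernel(float alpha, const float* a, const float* b,
                            float beta, float* c) {
    // Stage the A, B, and C tiles in dynamic shared memory.
    extern __shared__ __align__(16) float smem[];
    float* sa = smem;             // A tile: 32x32
    float* sb = sa + 32 * 32;     // B tile: 32x32
    float* sc = sb + 32 * 32;     // C tile: 32x32
    // ... cooperatively copy a, b (and c if beta != 0) into sa, sb, sc,
    //     then __syncthreads() ...

    // Compute C = alpha * A * B + beta * C on this block's tile.
    GEMM().execute(alpha, sa, sb, beta, sc);

    __syncthreads();
    // ... copy sc back to global memory ...
}
```

The full, version-accurate versions of this pattern — including the global-to-shared copies elided here — are what the introduction_example and Basic Example entries above provide.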