Achieving High Performance#

Below, we present general advice and examples to help you achieve high performance with cuBLASDx.

General Advice#

  • Start with the library-provided defaults for optimal compute performance.

  • Prefer the register fragment API over the shared memory API.

  • The best parameters for compute-bound and memory-bound kernels may differ.

  • Whenever possible, batch BLAS operations so that enough CUDA blocks are launched in a grid to fully utilize the GPU for peak performance (see the sketch after this list).

  • Merge adjacent memory-bound kernels (pre- and post-processing) with a BLAS kernel to reduce global memory accesses.

  • If possible, use a combination of precisions that has MMA support.

  • Use the tensor execute(…) API for improved I/O and matrix multiplication performance.

  • Use cublasdx::copy for copying shared and global memory tensors. This should automatically vectorize loads and stores.

  • Use cublasdx::copy_fragment for copying register fragments to shared and global memory tensors. This should also automatically vectorize loads and stores.

  • Use 16-byte (128-bit) aligned pointers, and use MaxAlignment (an alias for Alignment<16, 16, 16>).
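
The minimal sketch below ties together the batching, MMA-precision, and alignment advice above. The kernel name, the device pointers (d_a, d_b, d_c), and batch_count are hypothetical; the descriptor pairs an MMA-friendly precision (__half inputs with float accumulation) with MaxAlignment, and the launch uses one CUDA block per GEMM so the grid is large enough to saturate the GPU.

#include <cublasdx.hpp>
using namespace cublasdx;

// GEMM descriptor: __half inputs with a float accumulator map onto MMA instructions,
// and MaxAlignment (Alignment<16, 16, 16>) enables vectorized copies
using GEMM = decltype(Size<64, 64, 64>()
                      + Precision<__half, __half, float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + MaxAlignment()
                      + SM<800>()
                      + Block());

// Kernel defined elsewhere; each CUDA block computes one GEMM of the batch,
// e.g. by offsetting its pointers by blockIdx.x * GEMM::a_size (and b_size, c_size)
template<class BLAS>
__global__ void single_gemm_per_block(const __half* a, const __half* b, float* c);

// One block per GEMM: a sufficiently large grid is what saturates the GPU
single_gemm_per_block<GEMM><<<batch_count, GEMM::block_dim, GEMM::shared_memory_size>>>(d_a, d_b, d_c);

Because every block computes an independent GEMM, increasing batch_count directly increases the number of resident blocks until the GPU is fully occupied.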

Matrix Layouts#

We recommend using the tensor API along with get_layout_smem_*(), suggest_layout_smem_*(), and cublasdx::make_tensor functions. This allows you to use custom layouts for matrices, which can provide better performance and better match your kernel’s requirements.

  • Try using suggest_layout_smem_*(), especially for A and B matrices, to achieve better GEMM and I/O performance.

  • Use get_layout_smem_*() for a matrix if you need it to be in plain column- or row-major order.

  • The best layouts for GEMM may differ from the best layouts for the entire kernel; experiment with various approaches.

  • For use cases that do not have dedicated MMA (such as fp32-fp32-fp32), try using get_layout_smem_*() and suggested_leading_dimension_of to improve shared memory access patterns (see the sketch after this list).

  • For register fragment APIs, always choose a data partitioner that matches your layouts:
    • suggest_partitioner() for suggest_layout_smem_*()

    • get_partitioner() for get_layout_smem_*()
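
A minimal sketch of the leading-dimension approach is shown below, assuming that cublasdx::suggested_leading_dimension_of takes the description and target architecture and exposes lda/ldb/ldc members; smem_a and smem_b are shared memory pointers as in the example in the next section. The padded leading dimensions keep the plain column-/row-major layouts returned by get_layout_smem_*() while reducing shared memory bank conflicts.

// Base description without leading dimensions
using BASE = decltype(Size<128, 128, 128>()
                      + Precision<float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + MaxAlignment()
                      + SM<800>()
                      + Block());

// Suggested (padded) leading dimensions for the target architecture
constexpr unsigned int lda = cublasdx::suggested_leading_dimension_of<BASE, 800>::lda;
constexpr unsigned int ldb = cublasdx::suggested_leading_dimension_of<BASE, 800>::ldb;
constexpr unsigned int ldc = cublasdx::suggested_leading_dimension_of<BASE, 800>::ldc;

using BLAS = decltype(BASE() + LeadingDimension<lda, ldb, ldc>());

// Plain column-/row-major shared memory layouts (now padded) and the matching partitioner
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto partitioner     = BLAS::get_partitioner();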

Example

The example below uses maximum alignment, cublasdx::copy, and suggested shared memory layouts to increase performance.

using BLAS = decltype(Size<128, 128, 128>()
                      + Type<type::real>()
                      + Precision<__half, __half, double>()
                      + MaxAlignment()
                      + Block()
                      + ...);

// Tensors with global memory data
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());

// Tensors with shared memory data
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());

// 16-byte (128-bit) alignment helps vectorize (if possible) copying between shared and global memory
using blas_alignment = cublasdx::alignment_of<BLAS>; // 16, 16, 16
cublasdx::copy<BLAS, blas_alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<BLAS, blas_alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy_wait();

// Get suggested partitioner, matching shared memory layouts for A and B
auto partitioner = BLAS::suggest_partitioner();

// Make a register result accumulator for this GEMM execution
auto c_fragment_accumulator = partitioner.make_accumulator_fragment();

// Partition c_global and copy this thread's elements into register fragment
cublasdx::copy_fragment<blas_alignment::c>(c_global_tensor, c_fragment_accumulator, partitioner);

// 16-byte (128-bit) alignment and suggested layouts help improve shared memory IO in GEMM
BLAS().execute(a_shared_tensor, b_shared_tensor, c_fragment_accumulator);
__syncthreads();

auto out_global_tensor = cublasdx::make_tensor(output, BLAS::get_layout_gmem_c());

// Partition out_global_tensor and copy this thread's elements into appropriate locations
cublasdx::copy_fragment<blas_alignment::c>(c_fragment_accumulator, out_global_tensor, partitioner);

Memory Management#

  • Avoid unnecessary reads/writes to global memory.

  • Ensure global memory accesses are coalesced.

  • Use shared memory or extra registers to store temporary data.

  • Offload any element-wise pre- and post-processing to transform functors that can be passed to execute(…) to avoid trips to shared memory (see the sketch below).
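
A minimal sketch of such fusion is shown below, reusing the tensors from the example above and assuming the tensor execute(…) overload that accepts optional a_load_op/b_load_op transform functors; the functor names are hypothetical.

// Hypothetical element-wise pre-processing functors
struct scale_by_two {
    template<class T>
    __device__ T operator()(T v) const { return v + v; }
};
struct negate {
    template<class T>
    __device__ T operator()(T v) const { return -v; }
};

// A is scaled and B is negated as they are read for the matrix multiplication,
// without an extra pass over shared or global memory
BLAS().execute(a_shared_tensor, b_shared_tensor, c_fragment_accumulator, scale_by_two{}, negate{});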
