Achieving High Performance#
Below, we present general advice and examples to help you achieve high performance with cuBLASDx.
Building Advice#
- Use the newest CUDA SDK available.
- Prefer CMake as a build system.
- Always use the -O3 -DNDEBUG flags for compilation (or the CMake Release build type); cuBLASDx and CuTe make extensive use of runtime asserts, which are included only in debug builds.
General Advice#
- Use the pipelining API whenever possible.
- Use the suggested layouts whenever possible.
- Start with the library-provided defaults for optimal compute performance.
- Prefer the register fragment API over the shared memory API.
- The best parameters for compute-bound and memory-bound kernels may differ.
- Whenever possible, batch BLAS operations so that enough CUDA blocks are launched in a grid to fully utilize the GPU for peak performance.
- Merge adjacent memory-bound kernels (pre- and post-processing) with a BLAS kernel to reduce global memory accesses.
- If possible, use a combination of precisions that has MMA support.
- Use the tensor execute(…) API for improved I/O and matrix multiplication performance.
- Use cublasdx::copy for copying shared and global memory tensors; it should automatically vectorize loads and stores.
- Use cublasdx::copy_fragment for copying register fragments to shared and global memory tensors; it should also automatically vectorize loads and stores.
- Use 16-byte (128-bit) aligned pointers, and use MaxAlignment (an alias for Alignment<16, 16, 16>); a sketch combining several of these points follows this list.
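For illustration, a minimal descriptor that follows several of these recommendations (an MMA-supported precision combination plus maximum alignment) might look like the sketch below. The size, precision, and SM<800> target are placeholder values chosen for this example, not library defaults; adjust them for your problem and GPU.

#include <cublasdx.hpp>
#include <cuda_fp16.h>

// Sketch: real __half x __half -> float block-level GEMM with 16-byte alignment.
// Size, Precision, and SM<800> are illustrative placeholders.
using GEMM = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Precision<__half, __half, float>()   // precision combination with MMA support
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::MaxAlignment()                       // alias for Alignment<16, 16, 16>
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

Block() makes the GEMM a collective operation executed by all threads of a CUDA block, which is the execution model assumed throughout this section.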
Matrix Layouts#
We recommend using the tensor API along with get_layout_smem_*(), suggest_layout_smem_*(), and cublasdx::make_tensor functions. This allows you to use custom layouts for matrices, which can provide better performance and better match your kernel’s requirements.
- Try using suggest_layout_smem_*(), especially for the A and B matrices, to achieve better GEMM and I/O performance.
- Use get_layout_smem_*() for a matrix if you need it to be in plain column- or row-major order.
- The best layouts for GEMM may differ from the best layouts for the entire kernel; experiment with various approaches.
- For use cases that do not have dedicated MMA (such as fp32-fp32-fp32), try using get_layout_smem_*() and suggested_leading_dimension_of to improve shared memory access patterns.
- For register fragment APIs, always choose an accumulator that matches your layouts: suggest_accumulator() for suggest_layout_smem_*(), and get_accumulator() for get_layout_smem_*(); see the sketch after this list.
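A minimal sketch of this pairing rule, using only the functions already shown in the example that follows (BLAS, smem_a, and smem_b are assumed to be defined as there):

// Option 1: suggested (optimized) shared memory layouts paired with the suggested accumulator
auto a_opt   = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_opt   = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto acc_opt = BLAS::suggest_accumulator();   // matches suggest_layout_smem_*()

// Option 2: plain column-/row-major shared memory layouts paired with the plain accumulator
auto a_plain   = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_plain   = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto acc_plain = BLAS::get_accumulator();     // matches get_layout_smem_*()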
Example
Example of using maximum alignment, cublasdx::copy, and suggested layouts for shared memory to increase performance.
using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<__half, __half, double>() + MaxAlignment() + Block() + ...);
// Tensors with global memory data
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());
auto out_global_tensor = cublasdx::make_tensor(output, BLAS::get_layout_gmem_c());
// Tensors with shared memory data
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
// 16-byte (128-bit) alignment helps vectorize (if possible) copying between shared and global memory
using blas_alignment = cublasdx::alignment_of<BLAS>; // 16, 16, 16
cublasdx::copy<BLAS, blas_alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<BLAS, blas_alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy_wait();
// Get suggested accumulator, matching shared memory layouts for A and B
auto accumulator = BLAS::suggest_accumulator();
// 16-byte (128-bit) alignment and suggested layouts help improve shared memory IO in GEMM
BLAS().execute(a_shared_tensor, b_shared_tensor, accumulator);
auto d_fragment = accumulator.make_partition_and_copy(c_global_tensor);
cublasdx::axpby(alpha, accumulator.get_results(), beta, d_fragment);
// Partition out_global_tensor and copy this thread's elements into appropriate locations
accumulator.partition_and_copy(d_fragment, out_global_tensor);
Memory Management#
- Avoid unnecessary reads/writes to global memory.
- Ensure global memory accesses are coalesced.
- Use shared memory or extra registers to store temporary data.
- Offload any element-wise pre- and post-processing to transform functors that can be passed to execute(…) to avoid trips to shared memory; a sketch follows below.
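One possible shape for the functor approach is sketched below. It assumes an execute(…) overload that accepts optional element-wise transform functors as trailing arguments, and all variable names (alpha, beta, the *_shared_tensor tensors) and the lambdas themselves are illustrative placeholders; consult the execute(…) reference of your cuBLASDx version for the exact parameter list.

// Sketch only: fuse element-wise pre-processing of A and B into the GEMM
// instead of materializing transformed copies in memory first. The
// trailing-functor overload of execute(...) is an assumption to verify.
auto scale_a  = [](auto v) { return v + v; };  // example element-wise op applied to A on load
auto negate_b = [](auto v) { return -v; };     // example element-wise op applied to B on load

BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor,
               scale_a, negate_b);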
Advanced#
- If matrix-wide pre- and/or post-processing is required, try to partition data into registers to avoid shared memory accesses.
- For BLAS loads that do not fully utilize the GPU, consider running parallel kernels in separate streams.
- Use the Nsight Compute Occupancy Calculator [7] and/or the cudaOccupancyMaxActiveBlocksPerMultiprocessor [9] function to determine the optimal launch parameters; a sketch of the occupancy query follows this list.
- Use the Nsight Compute Occupancy Calculator [7] or Nsight Compute [8] to determine what extra resources are available without reducing occupancy.
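A minimal host-side sketch of the occupancy query using the CUDA runtime is shown below; gemm_kernel stands in for your own cuBLASDx kernel, and block_size / shared_memory_bytes for its launch configuration.

#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for a cuBLASDx GEMM kernel.
__global__ void gemm_kernel() { /* cuBLASDx GEMM body goes here */ }

void report_occupancy(int block_size, size_t shared_memory_bytes) {
    int max_active_blocks_per_sm = 0;
    // How many blocks of gemm_kernel can be resident on one SM with this
    // block size and dynamic shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_active_blocks_per_sm,
                                                  gemm_kernel,
                                                  block_size,
                                                  shared_memory_bytes);
    std::printf("Active blocks per SM: %d\n", max_active_blocks_per_sm);
}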
Pipelining#
- Start with the default pipeline depth (skip this template argument).
- If you don’t need extra flexibility, use the internal_accumulation API; on architectures not supporting TMEM, external_accumulation may cause increased register pressure and decreased performance.
- If possible, stick to heuristic_blocksize and use consumer_sync inside the epilogue functor; this is necessary to allow warp-specialization with register trading in some cases.