Achieving High Performance#
Below we present general advice and examples that may help in achieving high performance.
General Advice#
- Start with the library-provided defaults for the best compute performance.
- Use the register fragment API instead of the shared memory API.
- The best parameters for compute-bound and memory-bound kernels might not be identical.
- If possible, ensure BLAS operations are batched so that enough CUDA blocks run in the grid to fill the GPU for peak performance (see the sketch after this list).
- Merge adjacent memory-bound kernels (pre- and post-processing) with a BLAS kernel to save global memory trips.
- If possible, use a combination of precisions that has MMA support.
- Use the tensor execute(…) API to get better IO and matrix multiplication performance.
- Use cublasdx::copy for copying shared and global memory tensors; it should automatically vectorize loads and stores.
- Use cublasdx::copy_fragment for copying register fragments to and from shared and global memory tensors; it should automatically vectorize loads and stores.
- Use 16-byte (128-bit) aligned pointers, and use MaxAlignment (an alias for Alignment<16, 16, 16>).
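A minimal sketch of the batching advice, assuming a hypothetical gemm_kernel wrapper (not part of the library) that computes one GEMM of the batch per CUDA block:

// Hypothetical wrapper: one GEMM of the batch per CUDA block.
template<class BLAS>
__global__ void gemm_kernel(const typename BLAS::a_value_type* a,
                            const typename BLAS::b_value_type* b,
                            typename BLAS::c_value_type*       c) {
    // Offset a, b, c by blockIdx.x to select this block's GEMM, then run
    // a single-GEMM body such as the one in the Example below.
}

// On the host: launch enough blocks to fill the GPU
// (the batch count is an assumed value).
constexpr unsigned int batches = 4096;
gemm_kernel<BLAS><<<batches, BLAS::block_dim, BLAS::shared_memory_size>>>(a, b, c);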
Matrix Layouts#
We recommend using the tensor API along with the get_layout_smem_*(), suggest_layout_smem_*(), and cublasdx::make_tensor functions. This allows using custom layouts for matrices, which can provide better performance and a better match for your kernel.
- Try using suggest_layout_smem_*(), especially for the A and B matrices, to get better GEMM and IO performance.
- Use get_layout_smem_*() for a matrix if you need it to be in plain column- or row-major ordering.
- The best layouts for GEMM may be different from the best layouts for the whole kernel; experiment and try various approaches.
- For use cases which don't have dedicated MMA support (like fp32-fp32-fp32), try using get_layout_smem_*() and suggested_leading_dimension_of to improve shared memory access patterns (see the sketch after this list).
- For register fragment APIs, always choose the data partitioner matching your layouts:
  - suggest_partitioner() for suggest_layout_smem_*()
  - get_partitioner() for get_layout_smem_*()
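A sketch of the fp32 case (the tile size, the SM<800> target, and the operator list are assumptions for illustration): suggested_leading_dimension_of can be folded back into the description so that get_layout_smem_*() uses padded leading dimensions.

// fp32 GEMM without dedicated MMA; pad leading dimensions as suggested
// for the target architecture (SM<800> here is an assumption).
using Base = decltype(Size<64, 64, 64>() + Type<type::real>() + Precision<float>()
                      + Function<function::MM>() + MaxAlignment() + SM<800>() + Block());
using BLAS = decltype(Base() + cublasdx::suggested_leading_dimension_of<Base, 800>());

// Plain column-/row-major layouts now carry the padded leading dimensions,
// improving shared memory access patterns.
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());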
Example
An example of using maximum alignment, cublasdx::copy, and suggested layouts for shared memory to increase performance.
using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<__half, __half, double>() + MaxAlignment() + Block() + ...);
// Tensors with global memory data
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());
// Tensors with shared memory data
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
// 16-byte (128-bit) alignment helps vectorize (if possible) copying between shared and global memory
using blas_alignment = cublasdx::alignment_of<BLAS>; // 16, 16, 16
cublasdx::copy<BLAS, blas_alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<BLAS, blas_alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy_wait();
// Get suggested partitioner, matching shared memory layouts for A and B
auto partitioner = BLAS::suggest_partitioner();
// Make a register result accumulator for this GEMM execution
auto c_fragment_accumulator = partitioner.make_accumulator_fragment();
// Partition c_global and copy this thread's elements into register fragment
cublasdx::copy_fragment<blas_alignment::c>(c_global_tensor, c_fragment_accumulator, partitioner);
// 16-byte (128-bit) alignment and suggested layouts help improve shared memory IO in GEMM
BLAS().execute(a_shared_tensor, b_shared_tensor, c_fragment_accumulator);
__syncthreads();
auto out_global_tensor = cublasdx::make_tensor(output, BLAS::get_layout_gmem_c());
// Partition out_global_tensor and copy this thread's elements into appropriate locations
cublasdx::copy_fragment<blas_alignment::c>(c_fragment_accumulator, out_global_tensor, partitioner);
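Note that with the register fragment API, execute(…) accumulates into c_fragment_accumulator, so the values copied in from c_global_tensor serve as the initial C of the GEMM; this overload does not take alpha or beta, so any scaling has to be applied separately.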
Memory Management#
- Avoid reading/writing data from/to global memory unnecessarily.
- Ensure global memory reads/writes are coalesced.
- Use shared memory or extra registers to store temporary data.
- Offload any element-wise pre- and post-processing to transform functors that can be passed to execute(…) to avoid trips to shared memory (see the sketch below).
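A sketch of fusing element-wise pre-processing into the GEMM, assuming the register fragment execute(…) overload that accepts load transform functors and reusing the tensors from the Example above (the negation functor is illustrative, not library API):

// Illustrative element-wise pre-processing, applied as each A element is read:
auto negate_a = [](auto v) { return -v; };

// The functors are applied on load, so no extra shared memory pass is needed.
BLAS().execute(a_shared_tensor, b_shared_tensor, c_fragment_accumulator,
               negate_a /* a_load_op */, cublasdx::identity{} /* b_load_op */);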
Advanced#
- If matrix-wide pre- and/or post-processing is required, try to partition the data into registers to avoid shared memory trips.
- For BLAS loads that do not fill the GPU entirely, consider running parallel kernels in separate streams.
- Use the Nsight Compute Occupancy Calculator [6] and/or the cudaOccupancyMaxActiveBlocksPerMultiprocessor [8] function to determine the optimum launch parameters (see the sketch after this list).
- Use the Nsight Compute Occupancy Calculator [6] or Nsight Compute [7] to determine what extra resources are available without losing occupancy.
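A minimal sketch of the occupancy query on the host, reusing the hypothetical gemm_kernel from the batching sketch above:

// How many blocks can be resident per SM, given the kernel's resource use
// and the dynamic shared memory required by the BLAS description?
int max_blocks_per_sm = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                              gemm_kernel<BLAS>,
                                              BLAS::max_threads_per_block,
                                              BLAS::shared_memory_size);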