Achieving High Performance

Below we list general advice that may help in achieving high performance.

General Advice

  • Start with the library-provided default for the best compute performance.

  • The best parameters for compute-bound and memory-bound kernels might not be identical.

  • If possible, batch BLAS operations so that the grid launches enough CUDA blocks to fill the GPU and reach peak performance.

  • Merge adjacent memory-bound kernels (pre- and post-processing) with a BLAS kernel to save round trips to global memory.

  • Try using suggested_leading_dimension_of to improve shared memory access patterns.
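
To illustrate why the leading dimension affects shared memory access patterns, here is a minimal host-side sketch of a simplified bank model. It assumes 32 shared memory banks that are each 4 bytes wide, 32 threads per warp, and 4-byte elements. The function name `bank_conflict_degree` and the constants are hypothetical and are not part of the library; the values that suggested_leading_dimension_of actually returns may differ from what this model would pick.

```cpp
#include <algorithm>
#include <array>

// Assumed hardware model (illustrative only): 32 banks, 4 bytes per bank,
// 32 threads per warp, one 4-byte element per thread.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Returns the largest number of threads in one warp that map to the same
// bank when thread t accesses the element at word offset t * ld, i.e. when
// the warp walks along the strided dimension of a matrix whose leading
// dimension is ld. A result of 1 means the access is conflict-free; a result
// of n means the access is serialized into n passes under this model.
int bank_conflict_degree(int ld) {
    std::array<int, kBanks> hits{};  // zero-initialized hit count per bank
    for (int t = 0; t < kWarpSize; ++t) {
        ++hits[(t * ld) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, `bank_conflict_degree(32)` is 32 because every thread lands in bank 0, while `bank_conflict_degree(33)` is 1 because one element of padding spreads the warp across all banks. This is the kind of effect a padded leading dimension is meant to achieve.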

Memory Management

  • Avoid reading/writing data from/to global memory unnecessarily.

  • Ensure global memory reads/writes are coalesced.

  • Use shared memory or extra registers to store temporary data.
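
The effect of coalescing can be estimated with a small host-side sketch. It assumes that global memory is transferred in 32-byte sectors, that the base address is sector-aligned, and that each of the 32 threads in a warp loads one 4-byte float. Caching is ignored. The names `sectors_touched` and `load_efficiency` are hypothetical helpers introduced for illustration only.

```cpp
#include <cstddef>
#include <set>

// Assumed model (illustrative only): 32 threads per warp, 32-byte sectors.
constexpr std::size_t kThreadsPerWarp = 32;
constexpr std::size_t kSectorBytes = 32;

// Number of distinct 32-byte sectors a warp touches when thread t loads the
// float at element index t * stride from a sector-aligned base address.
std::size_t sectors_touched(std::size_t stride) {
    std::set<std::size_t> sectors;
    for (std::size_t t = 0; t < kThreadsPerWarp; ++t) {
        sectors.insert(t * stride * sizeof(float) / kSectorBytes);
    }
    return sectors.size();
}

// Fraction of the transferred bytes that the warp actually requested.
double load_efficiency(std::size_t stride) {
    const std::size_t useful = kThreadsPerWarp * sizeof(float);
    const std::size_t moved = sectors_touched(stride) * kSectorBytes;
    return static_cast<double>(useful) / static_cast<double>(moved);
}
```

With a stride of 1 the warp touches 4 sectors and every transferred byte is used. With a stride of 8 or more, each thread lands in its own sector, so the warp moves 32 sectors to obtain the same 128 bytes and the efficiency under this model drops to 0.125.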

Advanced