Achieving High Performance

Below we present general advice that may help in achieving high performance using cuSolverDx.

General Advice

  • Start with the library-provided default settings for the best compute performance.

  • For fused operations, also try values other than the library-provided defaults, since the defaults may not give the best performance for the kernel as a whole.

  • When matrix A is small, use the suggested number of batches per block instead of the default of 1.

  • The best parameters for compute-bound and memory-bound kernels might not be identical.

  • If possible, provide enough batches that the grid launches enough CUDA blocks to fill the GPU and reach peak performance.

  • Merge adjacent memory-bound kernels (pre- and post-processing) with the Solver kernel to save round trips to global memory.
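The relationship between batches, batches per block, and GPU fill can be sketched with a little host-side arithmetic. This is a minimal illustration and not part of cuSolverDx: the helper names `blocks_needed` and `waves` are hypothetical, and the SM count and resident blocks per SM are inputs you would normally query from the CUDA runtime (for example via `cudaGetDeviceProperties` and `cudaOccupancyMaxActiveBlocksPerMultiprocessor`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of CUDA blocks in the grid when every block
// solves `batches_per_block` independent problems (rounded up).
inline std::int64_t blocks_needed(std::int64_t batches, std::int64_t batches_per_block) {
    assert(batches > 0 && batches_per_block > 0);
    return (batches + batches_per_block - 1) / batches_per_block;
}

// Hypothetical helper: how many times the grid fills all block slots on the GPU.
// A value well below 1.0 means part of the GPU sits idle; `sm_count` and
// `blocks_per_sm` are assumed to come from a device/occupancy query.
inline double waves(std::int64_t blocks, std::int64_t sm_count, std::int64_t blocks_per_sm) {
    assert(sm_count > 0 && blocks_per_sm > 0);
    return static_cast<double>(blocks) / static_cast<double>(sm_count * blocks_per_sm);
}
```

For instance, 1000 batches at 4 batches per block give `blocks_needed(1000, 4) == 250` blocks; on an assumed GPU with 100 SMs and 2 resident blocks per SM that is `waves(250, 100, 2) == 1.25`. Raising the batches per block reduces the block count, so it is worth checking that the grid still covers the GPU after the change.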

Memory Management

  • Avoid reading/writing data from/to global memory unnecessarily.

  • Ensure that copies between global memory and shared memory are coalesced, i.e., keep the same layout (row-major or column-major) for A and B in both global and shared memory.

  • Use shared memory or extra registers to store temporary data.
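To see why matching layouts helps coalescing, the following sketch computes the distance in global memory between the elements touched by two neighbouring threads. It is an illustration only and not cuSolverDx code: the `Layout` enum and both functions are hypothetical, matrices are assumed to be densely packed, and thread `t` is assumed to copy the `t`-th element in shared-memory storage order.

```cpp
#include <cassert>
#include <cstddef>

enum class Layout { ColMajor, RowMajor };

// Linear offset of element (row, col) in a densely packed m x n matrix.
inline std::size_t offset(Layout layout, std::size_t row, std::size_t col,
                          std::size_t m, std::size_t n) {
    return layout == Layout::ColMajor ? row + col * m : row * n + col;
}

// Distance (in elements) between the global-memory addresses accessed by
// thread 0 and thread 1, assuming thread t owns the t-th element in
// shared-memory order. A stride of 1 means neighbouring threads touch
// neighbouring addresses, which is the coalesced case.
inline std::size_t global_stride(Layout gmem, Layout smem, std::size_t m, std::size_t n) {
    assert(m > 1 && n > 1);
    // Element owned by thread 1 in shared-memory order.
    const std::size_t row = (smem == Layout::ColMajor) ? 1 : 0;
    const std::size_t col = (smem == Layout::ColMajor) ? 0 : 1;
    return offset(gmem, row, col, m, n) - offset(gmem, 0, 0, m, n);
}
```

With identical layouts the stride is 1 regardless of the matrix size. With a row-major global matrix and a column-major shared copy, neighbouring threads are `n` elements apart in global memory, so each warp access is spread over many memory transactions.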

Advanced

Further Reading

References