Achieving High Performance

Below we present general advice that may help in achieving high performance using cuSolverDx.

General Advice

Start with the library-provided default for the best compute performance.
For fused operations, also try values other than the library-provided default to find the best performance for the whole kernel.
For small matrices A, use the suggested number of batches per block instead of the default of 1.
The best parameters for compute-bound and memory-bound kernels might not be identical.
If possible, provide enough batches so that the grid contains enough CUDA blocks to fill the GPU and reach peak performance.
Merge adjacent memory-bound kernels (pre- and post-processing) with a Solver kernel to save round trips to global memory.
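
As a rough illustration of the advice on batches per block and filling the GPU, the host-side sketch below estimates how many CUDA blocks a batched launch produces and how that compares with the number of blocks the GPU can keep resident at once. The helper names `grid_blocks` and `waves` are hypothetical and not part of cuSolverDx; `sm_count` is assumed to come from `cudaDeviceProp::multiProcessorCount` and `blocks_per_sm` from an occupancy query.

```cpp
#include <cstdint>

// Hypothetical helper: number of CUDA blocks needed when each block
// processes `batches_per_block` problems (ceiling division).
constexpr std::int64_t grid_blocks(std::int64_t batches,
                                   std::int64_t batches_per_block) {
    return (batches + batches_per_block - 1) / batches_per_block;
}

// Hypothetical helper: grid size divided by the number of blocks that can
// be resident on the whole GPU at once. A result below 1.0 means the launch
// cannot occupy every available block slot.
constexpr double waves(std::int64_t blocks, int sm_count, int blocks_per_sm) {
    return static_cast<double>(blocks) /
           (static_cast<double>(sm_count) * blocks_per_sm);
}
```

For example, with 100 SMs and 2 resident blocks per SM, 1000 batches at 4 batches per block give 250 blocks, or 1.25 waves, whereas 400 batches give only 100 blocks, or 0.5 waves. In the second case, a smaller number of batches per block may use the GPU better, so it is worth measuring both settings.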

Avoid reading/writing data from/to global memory unnecessarily.
Ensure that transfers between global memory and shared memory are coalesced, i.e., maintain the same layout (row-major or column-major) for A and B in both global and shared memory.
Use shared memory or extra registers to store temporary data.

For Solver workloads that do not fill the GPU entirely, consider running other kernels concurrently in a separate stream.
Use the CUDA Occupancy Calculator [6] and/or the cudaOccupancyMaxActiveBlocksPerMultiprocessor [7] function to determine the optimal launch parameters.
Use the Nsight Compute CUDA Occupancy Calculator [6] to determine what extra resources are available without losing occupancy.
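
To make the idea of "extra resources available without losing occupancy" concrete, the sketch below estimates how much more shared memory a block could use while keeping the same number of resident blocks per SM. It is a deliberately simplified model: the function names are hypothetical, and it ignores allocation granularity, any per-block reservation made by the runtime, and register limits. The occupancy tools referenced above remain the authoritative source.

```cpp
#include <cstddef>

// Hypothetical helper: largest shared-memory allocation per block that still
// lets `resident_blocks` blocks fit on one SM with `smem_per_sm` bytes.
// Simplification: ignores allocation granularity and per-block reservations.
constexpr std::size_t max_smem_per_block(std::size_t smem_per_sm,
                                         int resident_blocks) {
    return smem_per_sm / static_cast<std::size_t>(resident_blocks);
}

// Hypothetical helper: extra bytes of shared memory a block could claim
// without reducing the number of resident blocks; 0 if already over the limit.
constexpr std::size_t smem_headroom(std::size_t smem_per_sm,
                                    int resident_blocks,
                                    std::size_t smem_used_per_block) {
    const std::size_t limit = max_smem_per_block(smem_per_sm, resident_blocks);
    return limit > smem_used_per_block ? limit - smem_used_per_block : 0;
}
```

Under these assumptions, an SM with 65536 bytes of shared memory and 4 resident blocks allows 16384 bytes per block, so a kernel currently using 12288 bytes per block has 4096 bytes of headroom for temporary data.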