Achieving High Performance
Below we list general advice that may help in achieving high performance.
General Advice
- Start with the library-provided defaults for the best compute performance.
- The best parameters for compute-bound and memory-bound kernels might not be identical.
- If possible, ensure BLAS operations are batched so that enough CUDA blocks run in the grid to fill the GPU for peak performance.
- Merge adjacent memory-bound kernels (pre- and post-processing) with a BLAS kernel to save global memory round trips.
- Try using `suggested_leading_dimension_of` to improve shared memory access patterns.
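As a sketch of the last point: cuBLASDx exposes a `suggested_leading_dimension_of` trait that reports leading dimensions which pad shared-memory tiles to reduce bank conflicts. The snippet below follows the cuBLASDx descriptor style, but the exact member names (`lda`, `ldb`, `ldc`) are assumptions to verify against the library documentation.

```cuda
#include <cublasdx.hpp>

// Describe a 64x64x64 single-precision GEMM executed by one CUDA block
// (descriptor style as in the cuBLASDx docs; SM<800> targets compute 8.0).
using BLAS = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Query the leading dimensions suggested for bank-conflict-free shared
// memory access (member names assumed), then bake them into the descriptor.
using suggested = cublasdx::suggested_leading_dimension_of<BLAS, 800>;
using BLASWithLd = decltype(BLAS()
                            + cublasdx::LeadingDimension<suggested::lda,
                                                         suggested::ldb,
                                                         suggested::ldc>());
```

Shared-memory buffers for A, B, and C then have to be allocated with these (possibly padded) leading dimensions rather than the tight matrix sizes.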
Memory Management
- Avoid reading/writing data from/to global memory unnecessarily.
- Ensure global memory reads/writes are coalesced.
- Use shared memory or extra registers to store temporary data.
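These points combine in the classic shared-memory tiled transpose: a naive transpose must issue strided (uncoalesced) accesses on either the read or the write side, while staging a tile in shared memory keeps both sides coalesced. A minimal sketch (kernel name and tile size are illustrative):

```cuda
#define TILE 32

// Launch with block dim (TILE, TILE) and grid dim
// (ceil(width/TILE), ceil(height/TILE)).
__global__ void transpose(const float* __restrict__ in,
                          float* __restrict__ out,
                          int width, int height) {
    // +1 column of padding avoids shared memory bank conflicts on the
    // transposed (column-wise) reads below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        // Consecutive threads read consecutive addresses: coalesced.
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        // The transpose happens in shared memory, so the global write
        // is also coalesced.
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}
```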
Advanced
- For BLAS workloads that do not fill the GPU entirely, consider running parallel kernels in separate streams.
- Use the CUDA Occupancy Calculator [6] and/or the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` [8] function to determine the optimal launch parameters.
- Use the CUDA Occupancy Calculator [6] or Nsight Compute [7] to determine what extra resources are available without losing occupancy.
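The occupancy query can be done at runtime from host code. A self-contained sketch using the CUDA runtime API (the kernel here is only a placeholder):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel; occupancy depends on its register and shared
// memory usage, so query the kernel you actually intend to launch.
__global__ void my_kernel(float* data) {
    if (data) data[threadIdx.x] *= 2.0f;
}

int main() {
    const int block_size = 256;
    const size_t dynamic_smem = 0;

    // How many blocks of my_kernel can be resident per SM at this
    // block size and dynamic shared memory usage.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, block_size, dynamic_smem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = (double)(blocks_per_sm * block_size)
                     / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks/SM: %d, occupancy: %.0f%%\n",
           blocks_per_sm, occupancy * 100.0);
    return 0;
}
```

If the reported occupancy is already at its ceiling, any registers or shared memory still unused per SM are the "extra resources" the last point refers to.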
Further Reading

References
- [1] https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
- [6] https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
- [7] https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#occupancy-calculator
- [8] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html