Achieving high performance
In High-Performance Computing, the ability to write customized code enables users to target better performance. With cuFFTDx, the potential performance improvement over existing FFT applications is high, but it depends strongly on how the library is used. Taking the regular cuFFT library as a baseline, performance may be up to one order of magnitude better or worse. For this reason, porting existing sources to cuFFTDx should always be done in parallel with performance analysis. Below we list general advice that may help in this process.
General advice

- Start with the library-provided default settings; they target the best compute performance.
- The best parameters for compute-bound and memory-bound kernels may not be identical.
- Ensure the FFT kernel runs enough blocks to fill the GPU for peak performance.
- Merge adjacent memory-bound kernels (pre- and post-processing) with the FFT kernel to save global memory trips.
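As a starting point, the first and third points can be sketched as follows. This is a minimal, non-authoritative sketch assuming cuFFTDx's operator-based description API; the `fft_kernel` declaration, the `batches` parameter, and the `SM<800>` target are illustrative assumptions, not part of the original text.

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

// Describe a 128-point single-precision C2C FFT executed by one CUDA block.
// With no ElementsPerThread/FFTsPerBlock operators given, the library picks
// its defaults, which it exposes back as traits:
//   FFT::elements_per_thread, FFT::ffts_per_block,
//   FFT::block_dim, FFT::shared_memory_size
using FFT = decltype(Size<128>() + Precision<float>() +
                     Type<fft_type::c2c>() + Direction<fft_direction::forward>() +
                     SM<800>() + Block());  // SM<800>: assumes an A100-class GPU

__global__ void fft_kernel(typename FFT::value_type* data);  // defined elsewhere

void launch(typename FFT::value_type* data, unsigned int batches) {
    // One block handles FFT::ffts_per_block transforms; launch enough blocks
    // to cover all batches (and, ideally, enough to fill every SM).
    const unsigned int blocks =
        (batches + FFT::ffts_per_block - 1) / FFT::ffts_per_block;
    fft_kernel<<<blocks, FFT::block_dim, FFT::shared_memory_size>>>(data);
}
```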
Memory management

- Avoid unnecessary reads/writes of data from global memory.
- Ensure global memory reads/writes are coalesced (increase the value of the FFTs Per Block Operator if needed).
- Use shared memory or extra registers to store temporary data.
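The points above can be sketched in one kernel. This is an assumption-laden sketch, not library-provided code: the 64-point size, `FFTsPerBlock<8>`, and `SM<800>` values are arbitrary examples chosen so that consecutive threads in a block touch consecutive global addresses.

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

// Eight FFTs per block: across the block, the strided per-thread accesses
// below become contiguous, coalesced global transactions.
using FFT = decltype(Size<64>() + Precision<float>() +
                     Type<fft_type::c2c>() + Direction<fft_direction::forward>() +
                     FFTsPerBlock<8>() + SM<800>() + Block());

__launch_bounds__(FFT::max_threads_per_block)
__global__ void fft_kernel(typename FFT::value_type* data) {
    // Temporary data lives in registers...
    typename FFT::value_type thread_data[FFT::storage_size];

    // threadIdx.y selects which of the block's FFTs this thread works on.
    const unsigned int global_fft_id = blockIdx.x * FFT::ffts_per_block + threadIdx.y;
    const unsigned int offset        = global_fft_id * size_of<FFT>::value;
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i] = data[offset + threadIdx.x + i * FFT::stride];
    }

    // ...and in shared memory during the transform; global memory is touched
    // only once on the way in and once on the way out.
    extern __shared__ __align__(alignof(float4)) char shared_mem[];
    FFT().execute(thread_data, shared_mem);

    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        data[offset + threadIdx.x + i * FFT::stride] = thread_data[i];
    }
}
```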
Kernel fusion

- For complex kernels, consider adjusting the FFT operation to match the user kernel (e.g. tweaking the Elements Per Thread Operator will change the required CUDA block size). Upcoming versions of cuFFTDx will offer more customization options.
- For simple operations, consider merging them into an FFT kernel optimized for FFT performance.
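Both points can be illustrated in one hedged sketch: the `ElementsPerThread<8>` choice below fixes the FFT's block shape (512/8 = 64 threads in x), and a simple 1/N normalization is merged into the FFT kernel. The kernel name, sizes, and `SM<800>` target are illustrative assumptions.

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

// ElementsPerThread<8> on a 512-point FFT yields 64 threads in x
// (FFT::block_dim); picking EPT is how the FFT's block shape can be
// matched to an existing user kernel.
using FFT = decltype(Size<512>() + Precision<float>() +
                     Type<fft_type::c2c>() + Direction<fft_direction::inverse>() +
                     ElementsPerThread<8>() + SM<800>() + Block());

__launch_bounds__(FFT::max_threads_per_block)
__global__ void ifft_and_normalize(typename FFT::value_type* data) {
    typename FFT::value_type thread_data[FFT::storage_size];

    const unsigned int offset = blockIdx.x * size_of<FFT>::value;
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i] = data[offset + threadIdx.x + i * FFT::stride];
    }

    extern __shared__ __align__(alignof(float4)) char shared_mem[];
    FFT().execute(thread_data, shared_mem);

    // Simple post-processing (1/N normalization) fused into the FFT kernel:
    // it operates on registers, saving a separate kernel launch and an
    // extra round trip to global memory.
    const float scale = 1.0f / size_of<FFT>::value;
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i].x *= scale;
        thread_data[i].y *= scale;
        data[offset + threadIdx.x + i * FFT::stride] = thread_data[i];
    }
}
```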
Advanced

- For FFT loads that do not fill the GPU entirely, consider running parallel kernels in a separate stream.
- Use the CUDA Occupancy Calculator [5] and/or the cudaOccupancyMaxActiveBlocksPerMultiprocessor [7] function to determine optimal launch parameters.
- Use the CUDA Occupancy Calculator [5] or Nsight Compute [6] to determine what extra resources are available without losing occupancy.
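A host-side sketch of the occupancy query, assuming a hypothetical `fft_kernel` defined elsewhere; the helper name and parameters are illustrative, only the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` runtime call itself comes from the CUDA Runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fft_kernel(float2* data);  // hypothetical kernel, defined elsewhere

void report_occupancy(int block_size, size_t shared_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of this kernel can be resident on one SM given its
    // block size and dynamic shared memory usage?
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, fft_kernel, block_size, shared_bytes);

    // Grids smaller than this leave SMs idle; that is when running another
    // kernel in a separate stream can recover the unused capacity.
    const int blocks_to_fill_gpu = blocks_per_sm * prop.multiProcessorCount;
    std::printf("resident blocks/SM: %d, blocks to fill GPU: %d\n",
                blocks_per_sm, blocks_to_fill_gpu);
}
```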
Further reading
References

- [1] https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
- [5] https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
- [6] https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#occupancy-calculator
- [7] https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html