Achieving High Performance

In high-performance computing, the ability to write customized code lets users target better performance. With cuFFTDx, the potential to speed up existing FFT applications is high, but it depends greatly on how the library is used. Taking the regular cuFFT library as a baseline, performance may be up to an order of magnitude better or worse. For this reason, porting existing sources to cuFFTDx should always be done in parallel with performance analysis. Below we list general advice that may help in this process.

General Advice

Start with the library-provided default settings for the best compute performance.
The best parameters for compute-bound and memory-bound kernels might not be identical.
Ensure the FFT kernel runs enough blocks to fill the GPU for peak performance.
Merge adjacent memory-bound kernels (pre- and post-processing) with the FFT kernel to save global memory trips.
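
As a starting point, it can help to inspect the defaults before tuning anything. The following is a minimal sketch, assuming a 128-point single-precision complex-to-complex FFT compiled for SM 8.0 (both chosen only for illustration). It omits the Elements Per Thread Operator so the library default applies, adopts the library-suggested FFTs Per Block value, and prints the resulting launch configuration. The helper name print_default_config is hypothetical.

```cuda
#include <cstdio>
#include <cufftdx.hpp>

using namespace cufftdx;

// Illustrative description: the size, precision, and SM<800> target are
// assumptions. ElementsPerThread is omitted so the library default applies.
using FFT_base = decltype(Block() + Size<128>() + Type<fft_type::c2c>() +
                          Direction<fft_direction::forward>() +
                          Precision<float>() + SM<800>());

// Adopt the library-suggested number of FFTs per block.
using FFT = decltype(FFT_base() +
                     FFTsPerBlock<FFT_base::suggested_ffts_per_block>());

// Hypothetical helper: prints the launch configuration implied by the defaults.
void print_default_config() {
    const dim3 block = FFT::block_dim;
    std::printf("elements per thread: %u\n",
                static_cast<unsigned>(FFT::elements_per_thread));
    std::printf("FFTs per block:      %u\n",
                static_cast<unsigned>(FFT::ffts_per_block));
    std::printf("block dim:           (%u, %u, %u)\n", block.x, block.y, block.z);
    std::printf("shared memory bytes: %u\n",
                static_cast<unsigned>(FFT::shared_memory_size));
}
```

Calling print_default_config() from host code gives a baseline to compare against when Elements Per Thread or FFTs Per Block are later changed by hand.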

Memory Management

Avoid unnecessary reads from and writes to global memory.
Ensure global memory reads/writes are coalesced (increase the value of the FFTs Per Block Operator if needed).
Use shared memory or extra registers to store temporary data.

Kernel Fusion

For complex kernels, consider adjusting the FFT operation to match the user kernel (e.g., tweaking the Elements Per Thread Operator changes the required CUDA block size). Upcoming versions of cuFFTDx will offer more customization options.
For simple operations, consider merging them into an FFT kernel that is optimized for FFT performance.
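
The second case can be sketched as follows. This is only an illustration under stated assumptions: the FFT description (size 128, 8 elements per thread, 2 FFTs per block, SM 8.0) is arbitrary, the element-wise scaling stands in for whatever simple pre- and post-processing an application needs, and the names fused_fft_kernel and launch_fused_fft are hypothetical. It also assumes an FFT configuration for which the library needs no extra workspace and that the number of FFTs is a multiple of FFTs per block. Each value is read from global memory once, kept in registers through all three stages, and written back once.

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

// Illustrative description; every parameter here is an assumption.
using FFT = decltype(Block() + Size<128>() + Type<fft_type::c2c>() +
                     Direction<fft_direction::forward>() + Precision<float>() +
                     ElementsPerThread<8>() + FFTsPerBlock<2>() + SM<800>());
using complex_type = FFT::value_type;

// Hypothetical fused kernel: scale -> FFT -> scale, with one global memory
// read and one global memory write per element.
__launch_bounds__(FFT::max_threads_per_block)
__global__ void fused_fft_kernel(complex_type* data, float pre_scale, float post_scale) {
    extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];
    complex_type thread_data[FFT::storage_size];  // per-thread registers

    // threadIdx.y selects one of the FFTs handled by this block.
    const unsigned int fft_id = blockIdx.x * FFT::ffts_per_block + threadIdx.y;
    const unsigned int offset = fft_id * size_of<FFT>::value;

    // Load: neighboring threads read neighboring elements (coalesced),
    // and the pre-processing step is applied in registers.
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        const unsigned int idx = i * FFT::stride + threadIdx.x;
        if (idx < size_of<FFT>::value) {
            thread_data[i] = data[offset + idx];
            thread_data[i].x *= pre_scale;
            thread_data[i].y *= pre_scale;
        }
    }

    FFT().execute(thread_data, shared_mem);

    // Post-process in registers, then store with the same layout.
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        const unsigned int idx = i * FFT::stride + threadIdx.x;
        if (idx < size_of<FFT>::value) {
            thread_data[i].x *= post_scale;
            thread_data[i].y *= post_scale;
            data[offset + idx] = thread_data[i];
        }
    }
}

// Hypothetical launcher; num_ffts must be a multiple of FFT::ffts_per_block.
cudaError_t launch_fused_fft(complex_type* data, unsigned int num_ffts,
                             float pre_scale, float post_scale,
                             cudaStream_t stream = 0) {
    const unsigned int blocks = num_ffts / FFT::ffts_per_block;
    fused_fft_kernel<<<blocks, FFT::block_dim, FFT::shared_memory_size, stream>>>(
        data, pre_scale, post_scale);
    return cudaGetLastError();
}
```

Compared with three separate kernels, this version removes two round trips through global memory. Whether it is faster in practice depends on register pressure and occupancy, so it should be profiled against the unfused version.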

Advanced

For FFT workloads that do not fill the GPU entirely, consider running kernels concurrently in separate streams.
Use the Nsight Compute Occupancy Calculator [6] and/or the cudaOccupancyMaxActiveBlocksPerMultiprocessor [8] function to determine the optimum launch parameters.
Use Nsight Compute [7] to determine what extra resources are available without losing occupancy.
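
The occupancy query and the multi-stream launch can be sketched with the CUDA runtime API alone. In the example below, placeholder_kernel is a hypothetical stand-in for an FFT kernel, and blocks_per_wave, query_blocks_per_wave, and launch_in_two_streams are illustrative helper names. The sketch estimates how many blocks are needed to occupy every multiprocessor once, and shows how two independent batches that are each too small to fill the GPU might be issued in separate streams.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for an FFT kernel; it reverses each block's data
// through dynamic shared memory so that it uses realistic resources.
__global__ void placeholder_kernel(float* data) {
    extern __shared__ float smem[];
    const unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = data[gid];
    __syncthreads();
    data[gid] = smem[blockDim.x - 1 - threadIdx.x];
}

// Blocks needed to occupy every multiprocessor once (one full "wave").
inline int blocks_per_wave(int blocks_per_sm, int sm_count) {
    return blocks_per_sm * sm_count;
}

// Queries occupancy for the given launch parameters on the current device.
// Returns the number of blocks in one full wave, or -1 on error.
int query_blocks_per_wave(int block_size, size_t dynamic_smem_bytes) {
    int device = 0;
    cudaDeviceProp prop{};
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, placeholder_kernel, block_size,
            dynamic_smem_bytes) != cudaSuccess) {
        return -1;
    }
    std::printf("active blocks per SM: %d, SMs: %d\n",
                blocks_per_sm, prop.multiProcessorCount);
    return blocks_per_wave(blocks_per_sm, prop.multiProcessorCount);
}

// Issues two independent batches in separate streams so they may overlap
// when neither fills the GPU alone. Each buffer must hold
// blocks * block_size floats.
cudaError_t launch_in_two_streams(float* batch_a, float* batch_b,
                                  int blocks, int block_size) {
    cudaStream_t s0, s1;
    cudaError_t err = cudaStreamCreate(&s0);
    if (err != cudaSuccess) return err;
    err = cudaStreamCreate(&s1);
    if (err != cudaSuccess) { cudaStreamDestroy(s0); return err; }

    const size_t smem = static_cast<size_t>(block_size) * sizeof(float);
    placeholder_kernel<<<blocks, block_size, smem, s0>>>(batch_a);
    placeholder_kernel<<<blocks, block_size, smem, s1>>>(batch_b);
    err = cudaGetLastError();

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return err;
}
```

If the number of blocks in a batch is well below the value returned by query_blocks_per_wave, the batch alone leaves multiprocessors idle, which is the situation where separate streams or a larger batch are worth trying. Streams only permit overlap; whether kernels actually run concurrently depends on the resources each one leaves free.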