Performance considerations

Because multi-GPU FFTs are communication-bound operations, the performance of cuFFTMp depends strongly on the interconnect between GPUs. cuFFTMp performs best when:

Within a node, GPUs are connected with a fast interconnect such as NVLink with NVSwitch.
Between nodes, GPUs are connected using many fast InfiniBand Network Interface Cards.
The network has a fat-tree topology, where every node (GPU) has maximum bandwidth to all other nodes (GPUs).

In addition, the best performance (in Flop/s) is achieved with large FFTs. Note that there is no guarantee that cuFFTMp will outperform a single GPU. In particular, for small transforms, a single GPU may deliver the best performance.
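To make the Flop/s figure concrete, the sketch below converts a measured transform time into a nominal rate. It assumes the widely used 5 N log2(N) operation count for a complex-to-complex FFT of N points; this convention is an assumption made here for illustration, not something cuFFTMp itself reports. The function names `nominal_fft_flops` and `nominal_fft_flop_rate` are hypothetical helpers, and the elapsed time is whatever the caller measured around the transform.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Nominal operation count of a complex-to-complex FFT with `total_points`
// elements, using the conventional 5 * N * log2(N) estimate.
// ASSUMPTION: this convention is used here for illustration only.
double nominal_fft_flops(std::uint64_t total_points) {
    assert(total_points > 0);
    const double n = static_cast<double>(total_points);
    return 5.0 * n * std::log2(n);
}

// Nominal Flop/s of an nx * ny * nz transform that took `seconds` to run.
// `seconds` is a wall-clock time measured by the caller (hypothetical input).
double nominal_fft_flop_rate(std::uint64_t nx, std::uint64_t ny,
                             std::uint64_t nz, double seconds) {
    assert(seconds > 0.0);
    return nominal_fft_flops(nx * ny * nz) / seconds;
}
```

Under this convention, a 1024 x 1024 x 1024 transform counts as roughly 1.6e11 nominal floating-point operations. Comparing the resulting rate for the same problem size on one GPU and on several GPUs is one way to check whether distributing the transform pays off.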

Warning

Because cuFFTMp uses NVSHMEM, and NVSHMEM spawns a proxy thread, the user should ensure that every process has exclusive access to at least two CPU cores.
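As a rough sanity check of this requirement, a process can count the CPUs in its own affinity mask at startup. The sketch below assumes Linux with glibc and uses `sched_getaffinity`; the helper names `allowed_cpu_count` and `has_enough_cores` are hypothetical. It counts logical CPUs (hardware threads) rather than physical cores, and it cannot tell whether those CPUs are exclusive to the process, so it only catches the obvious case of a process pinned to a single CPU.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros on glibc
#endif
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT
#include <cassert>

// Number of logical CPUs the calling process may run on, or -1 on error.
// ASSUMPTION: Linux/glibc only; this is not part of cuFFTMp or NVSHMEM.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

// True if `allowed` CPUs satisfy the requirement (two by default:
// one for the application thread, one for the NVSHMEM proxy thread).
bool has_enough_cores(int allowed, int required = 2) {
    assert(required > 0);
    return allowed >= required;
}
```

A launcher could call `has_enough_cores(allowed_cpu_count())` in each process before initializing cuFFTMp and print a warning when it returns false. Actual exclusivity still has to come from the job scheduler or the process binding options of the MPI launcher.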