Performance considerations
Since FFTs are communication-bound operations, the performance of cuFFTMp depends strongly on the interconnect between GPUs. cuFFTMp performs best when:
Within a node, GPUs are connected with a fast interconnect such as NVLink with NVSwitch.
Between nodes, GPUs are connected using many fast InfiniBand Network Interface Cards.
The network has a fat-tree topology, where every node (GPU) has maximum bandwidth to every other node (GPU).
In addition, the best performance (in Flop/s) is achieved with large FFTs. Note that there is no guarantee that cuFFTMp will outperform a single GPU. In particular, for small transforms, a single GPU may be faster.
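To compare a multi-GPU run against a single-GPU run in Flop/s, the measured time has to be converted into an operation rate. The sketch below assumes the widely used 5 N log2(N) operation count for a complex-to-complex FFT of N points; this documentation does not say which convention its own Flop/s figures use, so treat the formula as an assumption. The helper names `fft_flops` and `fft_gflops` are hypothetical and are not part of cuFFTMp.

```cpp
#include <cassert>
#include <cmath>

// Estimated floating-point operations for one complex-to-complex FFT of
// nx * ny * nz points, using the 5 * N * log2(N) convention (an assumption;
// other conventions exist, for example for real-to-complex transforms).
double fft_flops(double nx, double ny, double nz) {
    const double n = nx * ny * nz;
    return 5.0 * n * std::log2(n);
}

// Achieved rate in GFlop/s for a transform that took `seconds` of
// wall-clock time. The time must be measured by the caller.
double fft_gflops(double nx, double ny, double nz, double seconds) {
    assert(seconds > 0.0);
    return fft_flops(nx, ny, nz) / seconds / 1e9;
}
```

Calling `fft_gflops` with the same transform size and the timings measured on one GPU and on several GPUs gives two rates that can be compared directly, which is one way to check whether a given problem size is large enough to benefit from cuFFTMp.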
Warning
Because cuFFTMp uses NVSHMEM, and NVSHMEM spawns a proxy thread, the user should ensure that every process has exclusive access to at least two CPU cores.
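As a rough sanity check for the core requirement above, a process can inspect its own CPU affinity mask at startup. The sketch below is Linux-specific and relies on the glibc `sched_getaffinity` call; the helper names `allowed_cpu_count` and `has_core_budget` are hypothetical and are not part of cuFFTMp or NVSHMEM. It only counts the logical CPUs the process may run on. It cannot tell whether those CPUs are exclusive to the process or whether two of them are hardware threads of the same core; both depend on how the job launcher binds processes.

```cpp
#include <cassert>
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_COUNT (Linux/glibc)

// Number of logical CPUs the calling process is allowed to run on,
// or -1 if the affinity mask cannot be read.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

// True when the process may run on at least `required` logical CPUs.
bool has_core_budget(int required) {
    assert(required > 0);
    return allowed_cpu_count() >= required;
}
```

A process could call `has_core_budget(2)` before initializing cuFFTMp and print a diagnostic if it returns false, which usually points to a launcher binding each process to a single core.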