Because FFTs are communication-bound operations, the performance of cuFFTMp depends strongly on the interconnect between GPUs. cuFFTMp performs best when:
Within a node, GPUs are connected by a fast interconnect such as NVLink with NVSwitch.
Between nodes, GPUs are connected through many fast InfiniBand network interface cards (NICs).
The network has a fat-tree topology, in which every node (GPU) has maximum bandwidth to all other nodes (GPUs).
In addition, the best performance (in Flop/s) is achieved with large FFTs. Note that cuFFTMp is not guaranteed to outperform a single GPU. In particular, for small transforms, a single GPU may be faster.
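To make the Flop/s metric concrete, here is a minimal sketch of how an FFT flop rate is commonly estimated. It assumes the conventional 5 N log2(N) operation count for a complex-to-complex transform of N total elements; the function names `fft_flops` and `fft_flop_rate` are hypothetical helpers and not part of cuFFTMp.

```python
import math


def fft_flops(shape):
    """Estimate the flop count of a complex-to-complex FFT.

    Uses the conventional 5 * N * log2(N) estimate, where N is the
    total number of elements (assumption: this is a benchmarking
    convention, not an exact count of executed instructions).
    """
    n = math.prod(shape)
    return 5.0 * n * math.log2(n)


def fft_flop_rate(shape, seconds):
    """Return the estimated Flop/s for a transform that took `seconds`."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return fft_flops(shape) / seconds


if __name__ == "__main__":
    # Hypothetical timing: a 1024^3 transform measured at 10 ms.
    rate = fft_flop_rate((1024, 1024, 1024), 10e-3)
    print(f"{rate / 1e12:.2f} TFlop/s")
```

Because the estimate grows as N log2(N) while fixed launch and communication latencies do not shrink with N, small transforms tend to report much lower rates than large ones.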
Because cuFFTMp uses NVSHMEM, and NVSHMEM spawns a proxy thread, the user should ensure that every process has exclusive access to at least two CPU cores.
It is highly recommended to bind each process to the cores, socket, NUMA node, etc. nearest its corresponding GPU. Failure to do so may result in performance regressions, in particular when using many GPUs or when the problem size is small. For instance, if each GPU is associated with a distinct NUMA node, it is recommended to bind each process to the NUMA node closest to its GPU.
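The binding advice above can be illustrated with a minimal sketch. It assumes one process per GPU, that local ranks map to NUMA nodes in order, and that the caller supplies the core IDs of each NUMA node (for example, read from the system topology). The helpers `cores_for_local_rank` and `bind_current_process` are hypothetical and not part of cuFFTMp or NVSHMEM; in practice the same effect is usually achieved through the job launcher's binding options or a tool such as `numactl`.

```python
import os


def cores_for_local_rank(local_rank, numa_cores, ranks_per_numa=1, min_cores=2):
    """Pick the CPU cores that a process should be bound to.

    numa_cores[i] lists the core IDs of NUMA node i (assumed input).
    Assumes the GPU used by `local_rank` is attached to NUMA node
    local_rank // ranks_per_numa. Each process gets an exclusive,
    equal share of its node's cores, and at least `min_cores`
    (two by default: one for the application thread, one for the
    NVSHMEM proxy thread).
    """
    node = local_rank // ranks_per_numa
    cores = numa_cores[node]
    share = len(cores) // ranks_per_numa
    if share < min_cores:
        raise ValueError(
            f"only {share} core(s) per process on NUMA node {node}; "
            f"need at least {min_cores}"
        )
    slot = local_rank % ranks_per_numa
    return set(cores[slot * share:(slot + 1) * share])


def bind_current_process(cores):
    """Bind the calling process to `cores` (Linux only; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)


if __name__ == "__main__":
    # Hypothetical node: 2 NUMA nodes with 4 cores each, one GPU per NUMA node.
    layout = [[0, 1, 2, 3], [4, 5, 6, 7]]
    for rank in range(2):
        print(rank, sorted(cores_for_local_rank(rank, layout)))
```

The sketch raises an error rather than silently oversubscribing cores, since a process left with a single core would have to share it with the proxy thread.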