Performance considerations

Since FFTs are communication-bound operations, the performance of cuFFTMp depends strongly on the interconnect between GPUs. cuFFTMp performs best when:

Within a node, GPUs are connected with a fast interconnect such as NVLink with NVSwitch.
Between nodes, GPUs are connected using many fast InfiniBand Network Interface Cards.
The network has a fat-tree topology, in which every node (GPU) has maximum bandwidth to all other nodes (GPUs).

In addition, the best performance (in Flop/s) is achieved with large FFTs. Note that there is no guarantee that cuFFTMp will outperform a single GPU. In particular, for small transforms, a single GPU may deliver the best performance.
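
When comparing a single-GPU run against a multi-GPU run, it can help to convert measured times into a Flop/s figure. The sketch below assumes the conventional nominal operation count of 5 N log2(N) for a complex transform of N points, which is a common benchmarking convention rather than a count taken from this document; the function names are hypothetical.

```python
import math

def fft_flops(shape):
    """Nominal operation count 5 * N * log2(N), with N the total number of points.

    This is a benchmarking convention (an assumption here), not the exact
    number of operations a given FFT implementation executes.
    """
    n = math.prod(shape)
    return 5.0 * n * math.log2(n)

def fft_gflops(shape, seconds):
    """Convert a measured wall-clock time for one transform into GFlop/s."""
    return fft_flops(shape) / seconds / 1e9
```

For example, `fft_gflops((1024, 1024, 1024), t)` with `t` the measured time of one 1024³ transform gives a rate that can be compared across GPU counts for the same problem size.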

Warning

Because cuFFTMp uses NVSHMEM, and NVSHMEM spawns a proxy thread, the user should ensure that every process has exclusive access to at least two CPU cores.
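
As a quick sanity check, a launcher script can inspect how many cores each process is allowed to run on. The following is a minimal sketch, assuming a Linux system; the helper names are hypothetical. It only inspects the affinity mask, so it can show that a process has fewer than two cores, but it cannot prove that those cores are exclusive to the process.

```python
import os

def available_cores():
    """Return the number of CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: size of the affinity mask set by the launcher, cgroup, etc.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support (assumption: all cores usable).
    return os.cpu_count() or 1

def has_enough_cores(minimum=2):
    """True if the affinity mask leaves room for the main thread and a proxy thread."""
    return available_cores() >= minimum
```

A process that reports fewer than two cores is likely to have its main thread and the NVSHMEM proxy thread competing for the same core.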

Warning

It is highly recommended to bind each process to the cores, socket, NUMA node, etc. nearest to its corresponding GPU. Failure to do so may result in performance regressions, in particular on many GPUs or when the problem size is small. For instance, if each GPU is associated with a distinct NUMA node, it is recommended to bind each process to the NUMA node closest to its GPU.
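
Binding is normally done by the job launcher or a tool such as numactl, but the idea can be sketched directly. The example below assumes Linux and relies on the sysfs files `numa_node` (per PCI device) and `cpulist` (per NUMA node); the function names and the way the PCI bus ID is obtained are hypothetical and not part of cuFFTMp.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of core IDs."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores

def cores_near_gpu(pci_bus_id, sysfs="/sys"):
    """Return the cores of the NUMA node that sysfs reports for a PCI device."""
    device = pci_bus_id.lower()  # sysfs uses lowercase hex, e.g. "0000:3b:00.0"
    with open(f"{sysfs}/bus/pci/devices/{device}/numa_node") as f:
        node = int(f.read())
    if node < 0:  # -1 means the kernel has no NUMA information for this device
        return set()
    with open(f"{sysfs}/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())

def bind_to_gpu(pci_bus_id):
    """Restrict the calling process to the cores closest to the given GPU."""
    cores = cores_near_gpu(pci_bus_id)
    if cores:
        os.sched_setaffinity(0, cores)
    return cores
```

Each process would call `bind_to_gpu` with the PCI bus ID of the GPU it uses before creating any threads, so that threads spawned later inherit the same affinity.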

Note

InfiniBand adaptive routing, when available, is usually beneficial for large FFTs on many GPUs. It is recommended to compare performance with and without adaptive routing. For NVSHMEM on InfiniBand, adaptive routing can usually be enabled by setting the service level to 1 using the environment variable NVSHMEM_IB_SL=1.
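
One way to run that comparison is to time the same launch command twice, once with the variable set and once without. The sketch below is an illustration only: the helper names are hypothetical, and the command passed to `compare` stands in for whatever launch line the application actually uses. Depending on the launcher, the variable may also need to be forwarded explicitly to processes on remote nodes.

```python
import os
import subprocess
import time

def routing_env(adaptive, base=None):
    """Return a copy of the environment with NVSHMEM_IB_SL set to 1 or removed."""
    env = dict(os.environ if base is None else base)
    if adaptive:
        env["NVSHMEM_IB_SL"] = "1"
    else:
        env.pop("NVSHMEM_IB_SL", None)
    return env

def time_command(cmd, env):
    """Run cmd (a list of arguments) to completion and return the elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start

def compare(cmd):
    """Time cmd without and with adaptive routing; keys are False and True."""
    return {adaptive: time_command(cmd, routing_env(adaptive))
            for adaptive in (False, True)}
```

Wall-clock time of the whole command includes startup costs, so for short runs it is better to compare timings reported by the application itself.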