Supported functionalities

The following limitations apply to cuFFTMp:

If defined, CUDA_VISIBLE_DEVICES should be identical on all processes within a node.
Callbacks are not supported.
Because NVSHMEM spawns a hidden thread to handle communications, each process should have exclusive access to at least 2 CPU cores.
Only 2D and 3D transforms are supported, with the following restrictions:
- The first two dimensions must each have a length greater than or equal to the number of GPUs.
- When using built-in data layouts (CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED):
  - in 2D, R2C only supports a CUFFT_XT_FORMAT_INPLACE input;
  - in 2D, C2R only supports a CUFFT_XT_FORMAT_INPLACE_SHUFFLED input;
  - no strides are allowed;
  - only in-place data layouts are allowed. In particular, for R2C, the real dimension has to be padded to accommodate the complex elements in the output.
Using different MPI communicators (for different processes in MPI_COMM_WORLD) is allowed, but those MPI communicators cannot overlap: for a given process, one cannot use cuFFTMp with two distinct MPI communicators.
Only NVSHMEM-allocated memory can be used for descriptors and workspace. In particular, cudaMalloc’ed memory cannot be used. Note that memory allocated using cufftXtMalloc is automatically NVSHMEM-allocated.
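The dimension and padding restrictions above can be checked before a plan is created. The following is a minimal sketch, not part of the cuFFTMp API: the helper names `shape_is_supported` and `padded_real_length` are hypothetical, and the padding formula assumes the usual cuFFT in-place R2C convention, where a real dimension of length n yields n/2 + 1 complex outputs, each taking the space of two real elements.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuFFTMp function): returns true when a transform
// of the given shape satisfies the dimension restrictions listed above.
bool shape_is_supported(const std::vector<std::size_t>& dims, std::size_t num_gpus) {
    // Only 2D and 3D transforms are supported.
    if (dims.size() != 2 && dims.size() != 3) return false;
    // A run with zero GPUs is never valid.
    if (num_gpus == 0) return false;
    // The first two dimensions must each be >= the number of GPUs.
    return dims[0] >= num_gpus && dims[1] >= num_gpus;
}

// Hypothetical helper: length, in real elements, of the padded real dimension
// for an in-place R2C transform. Assumes n real inputs produce n/2 + 1 complex
// outputs, each occupying the space of two reals.
std::size_t padded_real_length(std::size_t n) {
    return 2 * (n / 2 + 1);
}
```

For example, under these assumptions a 256 x 256 x 256 transform on 4 GPUs passes the shape check, and its real dimension would be padded from 256 to 258 elements for an in-place R2C transform.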