Supported functionalities

The following limitations apply to cuFFTMp:

If defined, CUDA_VISIBLE_DEVICES should be identical on all processes within a node.

Callbacks are not supported.

Because NVSHMEM spawns a hidden thread to handle communications, each process should have exclusive access to at least 2 CPU cores.
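One way to sanity-check the core requirement at startup is to count the cores in the process's CPU affinity mask. The sketch below is Linux-specific and uses `sched_getaffinity`. The helper names `cores_in_affinity_mask` and `has_enough_cores` are hypothetical and not part of cuFFTMp. The check is necessary but not sufficient: the mask shows which cores the process may run on, not whether other processes share them.

```cpp
#include <cassert>
#include <sched.h>

// Hypothetical helper (not a cuFFTMp API): number of CPU cores in the
// calling process's affinity mask, or -1 if the query fails.
int cores_in_affinity_mask() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}

// Hypothetical helper: true if `available` cores meet the requirement
// (2 by default, per the limitation above).
bool has_enough_cores(int available, int required = 2) {
    assert(required > 0);
    return available >= required;
}
```

A process could call `has_enough_cores(cores_in_affinity_mask())` before creating its plan and print a warning if the result is false.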

Only 2D and 3D transforms are supported, with the following restrictions:

The first two dimensions must each have a length greater than or equal to the number of GPUs.

When using built-in data layouts (CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED):

in 2D, R2C only supports a CUFFT_XT_FORMAT_INPLACE input;

in 2D, C2R only supports a CUFFT_XT_FORMAT_INPLACE_SHUFFLED input;

no strides are allowed;

only in-place data layouts are allowed. In particular, for R2C, the real dimension has to be padded to accommodate the complex elements in the output.
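The shape and padding restrictions above can be sketched as two small helpers. Both names (`shape_is_supported`, `padded_real_length`) are hypothetical and not part of cuFFT. The padding formula assumes the usual cuFFT in-place R2C convention, in which a real dimension of length n yields n/2 + 1 complex elements, so it must hold 2 * (n/2 + 1) real elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuFFT API): true if the transform shape meets
// the restrictions above: 2D or 3D, and the first two dimensions are each
// at least as long as the number of GPUs.
bool shape_is_supported(const std::vector<std::size_t>& dims,
                        std::size_t num_gpus) {
    if (dims.size() != 2 && dims.size() != 3) {
        return false;
    }
    return dims[0] >= num_gpus && dims[1] >= num_gpus;
}

// Hypothetical helper: number of real elements needed along the padded
// dimension of an in-place R2C transform of logical length n.
// Assumption: n real inputs produce n/2 + 1 complex outputs, each of which
// occupies the space of two real elements.
std::size_t padded_real_length(std::size_t n) {
    assert(n > 0);
    return 2 * (n / 2 + 1);
}
```

For example, under this assumption a real dimension of length 8 is padded to 10 elements, and one of length 7 is padded to 8.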

Using different MPI communicators (for different processes in MPI_COMM_WORLD) is allowed, but those MPI communicators cannot overlap: a given process cannot use cuFFTMp with two distinct MPI communicators.
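The non-overlap rule can be illustrated without MPI by describing each communicator as the list of MPI_COMM_WORLD ranks it contains. The helper `communicators_are_disjoint` below is a hypothetical sketch, not a cuFFTMp or MPI function, and it assumes that no rank is listed twice within a single communicator.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper (not a cuFFTMp or MPI API). Each inner vector lists
// the MPI_COMM_WORLD ranks of one communicator intended for use with
// cuFFTMp. Returns true if no rank appears in more than one communicator.
// Assumption: a rank is not repeated within a single inner vector.
bool communicators_are_disjoint(const std::vector<std::vector<int>>& comms) {
    std::set<int> seen;
    for (const auto& ranks : comms) {
        for (int rank : ranks) {
            assert(rank >= 0);
            // insert().second is false when the rank was already present.
            if (!seen.insert(rank).second) {
                return false;
            }
        }
    }
    return true;
}
```

Splitting four processes into ranks {0, 1} and {2, 3} passes this check, whereas {0, 1} and {1, 2} does not, because rank 1 would belong to two communicators.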

Users cannot use NVSHMEM directly (by linking to it); it is available only through the functions re-exposed by cuFFT. See cuFFTMp and NVSHMEM.

Only NVSHMEM-allocated memory can be used for descriptors and workspace. In particular, cudaMalloc’ed memory cannot be used. Note that memory allocated using cufftXtMalloc is automatically NVSHMEM-allocated.