Supported functionalities

The following limitations apply to cuFFTMp:

  • If defined, CUDA_VISIBLE_DEVICES should be identical on all processes within a node.

  • Callbacks are not supported.

  • Because NVSHMEM spawns a hidden thread to handle communications, each process should have exclusive access to at least 2 CPU cores.

  • Only 2D and 3D transforms are supported, with the following restrictions:

    • The first two dimensions must each have a length greater than or equal to the number of GPUs.

    • When using built-in data layouts (CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED):

      • in 2D, R2C only supports a CUFFT_XT_FORMAT_INPLACE input;

      • in 2D, C2R only supports a CUFFT_XT_FORMAT_INPLACE_SHUFFLED input;

      • no strides are allowed;

      • only in-place data layouts are allowed. In particular, for R2C, the real dimension has to be padded to accommodate the complex elements in the output.

  • Different processes in MPI_COMM_WORLD may use different MPI communicators, but those communicators cannot overlap: a given process cannot use cuFFTMp with two distinct MPI communicators.

  • The user cannot use NVSHMEM directly (by linking to it), but only through the functions that cuFFT re-exposes. See the section "cuFFTMp and NVSHMEM".

  • Only NVSHMEM-allocated memory can be used for descriptors and workspace. In particular, cudaMalloc’ed memory cannot be used. Note that memory allocated using cufftXtMalloc is automatically NVSHMEM-allocated.
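
The shape and padding restrictions above can be checked with simple arithmetic before creating a plan. The following is a minimal sketch, not part of the cuFFT API: the helper names padded_real_length and shape_is_supported are hypothetical, and the code assumes the dimensions are listed from slowest to fastest varying, so that the last one is the real dimension that gets padded. It covers only the rank, the minimum-length rule, and the in-place R2C padding; it does not check layouts, strides, or communicators.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a cuFFT function).
// For an in-place R2C transform, n real inputs along the last dimension
// produce n/2 + 1 complex outputs, and each complex value occupies two
// reals, so the real dimension must be padded to 2 * (n/2 + 1) elements.
inline std::size_t padded_real_length(std::size_t n) {
    return 2 * (n / 2 + 1);
}

// Hypothetical helper (not a cuFFT function).
// Returns true when the transform is 2D or 3D and each of the first two
// dimensions is at least as long as the number of GPUs.
inline bool shape_is_supported(const std::vector<std::size_t>& dims,
                               std::size_t num_gpus) {
    if (dims.size() != 2 && dims.size() != 3) {
        return false;  // only 2D and 3D transforms are supported
    }
    if (num_gpus == 0) {
        return false;  // at least one GPU is needed
    }
    return dims[0] >= num_gpus && dims[1] >= num_gpus;
}
```

For example, a 256 x 256 x 256 transform on 8 GPUs passes the shape check, and its in-place R2C buffer needs 258 real elements along the last dimension instead of 256.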