Release Notes

cuFFTMp 11.0.5 EA (HPC-SDK 23.3)

cuFFTMp 11.0.5 integrates NVSHMEM 2.8 and supports both CUDA 11 and CUDA 12. A matching libnvshmem_host.so library (with a matching NVSHMEM and CUDA version) should be available at runtime.
Added support for the Hopper GPU architecture.
Added NVSHMEM interoperability support. Applications and libraries can now all use NVSHMEM and share resources, such as NVSHMEM-allocated buffers. This requires the application and all NVSHMEM-enabled libraries to dynamically link libnvshmem_host.so.
cuFFTMp can now be bootstrapped without an MPI communicator. See cufftMpAttachComm for more details.
The cufftXtSetDistribution API was changed. See cufftXtSetDistribution.
Added a new cufftXtSetSubformatDefault API to let users use cuFFTMp without cuFFT multi-GPU descriptors through the cufftExecC2C, cufftXtExec and similar APIs. See cufftXtSetSubformatDefault.
Improved performance on single-node, 3D, complex-to-complex transforms.

HPC-SDK 23.3 ships NVSHMEM 2.9, but cuFFTMp users should point to NVSHMEM 2.8 in the compatible folder at runtime. See Compatibility.

cuFFTMp now supports the same GPU architectures as cuFFT for all single-process functionality.

HPC-SDK 23.1 ships NVSHMEM 2.8, but cuFFTMp users should point to NVSHMEM 2.6 in the compatible folder at runtime. See Compatibility.
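The two HPC-SDK notes above follow the same pattern: the NVSHMEM version bundled with the SDK is newer than the one cuFFTMp expects at runtime. A minimal sketch of that lookup is shown below. The table only contains the pairs stated in these notes, and the names `NVSHMEM_BY_HPC_SDK` and `needs_compat_nvshmem` are hypothetical helpers, not part of cuFFTMp or the HPC-SDK.

```python
# Pairs taken from these release notes:
# HPC-SDK release -> (NVSHMEM version shipped, NVSHMEM version cuFFTMp expects).
NVSHMEM_BY_HPC_SDK = {
    "23.3": ("2.9", "2.8"),
    "23.1": ("2.8", "2.6"),
}


def needs_compat_nvshmem(hpc_sdk_version):
    """Return (needs_compat_folder, required_nvshmem_version).

    Hypothetical helper: raises KeyError for releases not listed above.
    """
    shipped, required = NVSHMEM_BY_HPC_SDK[hpc_sdk_version]
    return shipped != required, required


if __name__ == "__main__":
    for sdk in sorted(NVSHMEM_BY_HPC_SDK):
        print(sdk, needs_compat_nvshmem(sdk))
```

For both releases listed, the helper reports that the compatible folder is needed. See Compatibility for the authoritative version matrix.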

cuFFTMp 10.8.1 integrates NVSHMEM 2.5.0 and fixes a few issues as indicated below.

Resolved the issue in which single-node, single-precision, 3D, complex-to-complex power-of-2 transforms with Z > 8192 produced incorrect results.
cuFFTMp’s versioning has been corrected. Going forward, cuFFTMp will be versioned similarly to cuFFT. See Versioning.

Improved performance of cufftXtSetDistribution and distributed descriptors. This effectively gives full support to Pencil data decompositions.
Improved performance of the Reshape API.
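To make the Pencil decomposition mentioned above concrete, the sketch below computes the sub-domain each rank would own when a 3D grid is split along two axes. It is an illustration only and makes several assumptions: ranks are laid out row-major on a hypothetical `p x q` process grid, X and Y are split while Z stays whole, remainders go to the lowest-numbered chunks, and each sub-domain is described by a lower corner and an exclusive upper corner. The helpers `split` and `pencil_box` are not cuFFTMp functions.

```python
def split(n, parts, index):
    """Half-open range [lo, hi) of chunk `index` when n elements are
    divided into `parts` near-equal chunks (extras go to the first chunks)."""
    base, extra = divmod(n, parts)
    lo = index * base + min(index, extra)
    hi = lo + base + (1 if index < extra else 0)
    return lo, hi


def pencil_box(shape, grid, rank):
    """Lower and (exclusive) upper corners of the pencil owned by `rank`.

    shape = (nx, ny, nz); grid = (p, q) splits X into p and Y into q pieces.
    Row-major rank ordering is an assumption made for this sketch.
    """
    nx, ny, nz = shape
    p, q = grid
    i, j = divmod(rank, q)
    x_lo, x_hi = split(nx, p, i)
    y_lo, y_hi = split(ny, q, j)
    return (x_lo, y_lo, 0), (x_hi, y_hi, nz)


if __name__ == "__main__":
    for r in range(4):
        print(r, pencil_box((5, 4, 2), (2, 2), r))
```

Corners like these are the kind of per-rank description a custom decomposition needs. The exact arguments expected by cufftXtSetDistribution are documented in its API reference.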

N/A

Single-node, single-precision, 3D, complex-to-complex power-of-2 transforms in which Z > 8192 (e.g. a transform of size 2x2x16384) produce incorrect results when using built-in Slab decompositions (i.e. CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED). This will be fixed in a future release of cuFFTMp. cufftXtSetDistribution can be used as a workaround.
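The conditions for this known issue can be written as a small predicate, which may help when deciding whether to apply the cufftXtSetDistribution workaround. This is a sketch under stated assumptions: "power-of-2 transform" is read as all three dimensions being powers of two, and Z is taken to be the last dimension, as in the 2x2x16384 example. The function `hits_known_issue` is hypothetical and not part of cuFFTMp.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (positive integers with a single set bit)."""
    return n > 0 and (n & (n - 1)) == 0


def hits_known_issue(nx, ny, nz, single_node=True, single_precision=True,
                     c2c=True, builtin_slab=True):
    """True if a transform matches every condition listed in the known issue.

    Assumption: all of nx, ny, nz must be powers of two, and nz is "Z".
    """
    sizes_match = all(is_power_of_two(n) for n in (nx, ny, nz)) and nz > 8192
    return (single_node and single_precision and c2c
            and builtin_slab and sizes_match)


if __name__ == "__main__":
    print(hits_known_issue(2, 2, 16384))
    print(hits_known_issue(2, 2, 8192))
```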

New multi-process API interoperable with MPI.
Built-in Slab decompositions (using CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED descriptors) using cufftMpAttachComm
Custom data decomposition (using CUFFT_XT_FORMAT_DISTRIBUTED_INPUT and CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT descriptors) using cufftXtSetDistribution and cufftMpAttachComm
cufftXtMalloc, cufftXtFree and cufftXtMemcpy are fully compatible with the above
Standalone distributed reshape API with cufftReshapeHandle and associated APIs
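As an illustration of the built-in Slab decompositions listed above, the sketch below computes which part of a 3D grid each rank would own when a single axis is split across ranks. The details are assumptions made for this example: `axis=0` (X) is used for the CUFFT_XT_FORMAT_INPLACE layout and `axis=1` (Y) for CUFFT_XT_FORMAT_INPLACE_SHUFFLED, and any remainder is given to the lowest-numbered ranks. The helper `slab_box` is hypothetical; consult the cuFFTMp data-layout documentation for the exact rule.

```python
def slab_box(shape, nranks, rank, axis=0):
    """Lower and (exclusive) upper corners of the slab owned by `rank`.

    shape = (nx, ny, nz). Only `axis` is split; the other two axes are
    kept whole. The first `shape[axis] % nranks` ranks get one extra plane.
    """
    lower = [0, 0, 0]
    upper = list(shape)
    base, extra = divmod(shape[axis], nranks)
    lower[axis] = rank * base + min(rank, extra)
    upper[axis] = lower[axis] + base + (1 if rank < extra else 0)
    return tuple(lower), tuple(upper)


if __name__ == "__main__":
    shape = (10, 4, 4)
    for r in range(4):
        print(r, slab_box(shape, 4, r, axis=0))
```

With 10 planes over 4 ranks, the slabs along X have thicknesses 3, 3, 2 and 2, and together they cover the whole grid without overlap.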

In addition, the following limitations have been lifted

The following restrictions have been lifted for CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED

The following restrictions have been lifted for R2C/D2Z/C2R/Z2D with CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED

N/A

cufftXtMemcpy with CUFFT_COPY_DEVICE_TO_DEVICE was returning wrong results for 2D and 3D transforms in all previous versions of cuFFT. This has been fixed.