Release Notes

cuFFTMp 11.0.14 EA (HPC-SDK 23.11)

New features

  • Support for Grace-Hopper CPU-GPU architecture on CUDA 12.

Deprecations

  • N/A

Known issues

  • NVSHMEM 2.10.1 has a bug that causes the first call to nvshmem_finalize to release all resources; any subsequent nvshmem_init re-initializes them. cuFFTMp only initializes NVSHMEM if it hasn’t already been initialized, and finalizes NVSHMEM once all plans have been destroyed. As a consequence, any NVSHMEM or cuFFTMp API call (other than nvshmem_finalize) made after that first nvshmem_finalize and before a subsequent nvshmem_init may crash the application. As a workaround, keep at least one cuFFTMp plan alive throughout the application lifetime, and destroy it right before the last call to nvshmem_finalize.

For instance, the following sequence may crash:

nvshmem_init(...);
void* ptr = nvshmem_malloc(...);
cufftCreate(...);
cufftMpAttachComm(...);
cufftXtMakePlan(...);
cufftExecC2C(...);
cufftDestroy(...); // Last plan destroyed: finalizes all NVSHMEM resources (bug)
nvshmem_free(ptr); // May crash
nvshmem_finalize();

and should be replaced by:

nvshmem_init(...);
void* ptr = nvshmem_malloc(...);
cufftCreate(...);
cufftMpAttachComm(...);
cufftXtMakePlan(...);
cufftExecC2C(...);
nvshmem_free(ptr); // Will not crash: the plan is still alive
cufftDestroy(...);
nvshmem_finalize();

Refer to the cuFFTMp and NVSHMEM C++ sample for a working example.
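The teardown order required by the workaround can also be enforced with scope guards, since C++ destroys local objects in reverse order of declaration. The following is a minimal sketch of that idea only: ScopeGuard and teardown_order are hypothetical names, and the lambdas merely record the name of the call they stand in for instead of calling NVSHMEM or cuFFTMp.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs a cleanup callback when it goes out of scope.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> cleanup) : cleanup_(std::move(cleanup)) {
        assert(cleanup_ && "cleanup callback must be callable");
    }
    ~ScopeGuard() { cleanup_(); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> cleanup_;
};

// Records the order in which the stand-in cleanup steps run.
// The strings name the real calls; nothing here touches NVSHMEM or cuFFTMp.
std::vector<std::string> teardown_order() {
    std::vector<std::string> log;
    {
        // Declared first, so it runs last.
        ScopeGuard finalize([&log] { log.push_back("nvshmem_finalize"); });
        // The plan outlives every NVSHMEM buffer declared after it.
        ScopeGuard plan([&log] { log.push_back("cufftDestroy"); });
        // Declared last, so it is released first, while the plan is still alive.
        ScopeGuard buffer([&log] { log.push_back("nvshmem_free"); });

        log.push_back("cufftExecC2C");
    }
    return log;
}
```

In a real application, each lambda would wrap the corresponding call, and the guard for the long-lived plan would be declared immediately after NVSHMEM initialization.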

  • NVSHMEM 2.10.1 has minor resource leaks, which limit the number of times NVSHMEM can be initialized and finalized within a process's lifetime. This bug manifests itself as deadlocks or segmentation faults at cuFFTMp plan creation. As a workaround, keep at least one cuFFTMp plan alive throughout the application to avoid repeated nvshmem_init / nvshmem_finalize calls.

Resolved issues

  • N/A

cuFFTMp 11.0.5 EA (HPC-SDK 23.3)

New features

  • cuFFTMp 11.0.5 integrates NVSHMEM 2.8 and supports both CUDA 11 and CUDA 12. A libnvshmem_host.so library matching both the NVSHMEM and CUDA versions should be available at runtime.

  • Added support for the Hopper GPU architecture.

  • Added NVSHMEM interoperability support. Applications and libraries can now all use NVSHMEM and share resources, such as NVSHMEM-allocated buffers. This requires the application and all NVSHMEM-enabled libraries to dynamically link libnvshmem_host.so.

  • cuFFTMp can now be bootstrapped without an MPI communicator. See cufftMpAttachComm for more details.

  • The cufftXtSetDistribution API was changed; see cufftXtSetDistribution.

  • Added a new cufftXtSetSubformatDefault API to let users use cuFFTMp without cuFFT multi-GPU descriptors through the cufftExecC2C, cufftXtExec and similar APIs. See cufftXtSetSubformatDefault.

  • Improved performance on single-node, 3D, complex-to-complex transforms.

Deprecations

  • N/A

Known issue

  • HPC-SDK 23.3 ships NVSHMEM 2.9, but cuFFTMp users should point to NVSHMEM 2.8 in the compatible folder at runtime. See Compatibility.

Resolved issues

  • cuFFTMp now supports the same GPU architectures as cuFFT for all single-process functionalities.

cuFFTMp 10.8.1 EA (HPC-SDK 23.1)

New features

  • N/A

Deprecations

  • N/A

Known / resolved issues

  • HPC-SDK 23.1 ships NVSHMEM 2.8, but cuFFTMp users should point to NVSHMEM 2.6 in the compatible folder at runtime. See Compatibility.

cuFFTMp 10.8.1 EA (HPC-SDK 22.5+)

cuFFTMp 10.8.1 integrates NVSHMEM 2.5.0 and fixes a few issues as indicated below.

New features

  • N/A

Deprecations

  • N/A

Known / resolved issues

  • Resolved the issue in which single-node, single-precision, 3D, complex-to-complex, power-of-2 transforms with Z > 8192 produced incorrect results.

  • cuFFTMp’s versioning has been corrected. Going forward, cuFFTMp will be versioned similarly to cuFFT. See Versioning.

cuFFTMp 0.0.2 EA (HPC-SDK 22.3)

New features

  • Improved performance of cufftXtSetDistribution and distributed descriptors. This effectively gives full support to Pencil data decompositions.

  • Improved performance of the Reshape API.

Deprecations

  • N/A

Known / resolved issues

  • Single-node, single-precision, 3D, complex-to-complex, power-of-2 transforms in which Z > 8192 (e.g. a transform of size 2x2x16384) will lead to incorrect results when using built-in Slab decompositions (i.e. CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED). This will be fixed in a future release of cuFFTMp. cufftXtSetDistribution can be used as a workaround.
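To decide when the cufftXtSetDistribution workaround is needed, an application can test its transform size up front. The sketch below is one reading of the condition above, not part of the cuFFTMp API: it assumes that "powers of 2" means all three dimensions are powers of 2, and the names is_power_of_two and hits_slab_issue are hypothetical. The caller is still responsible for checking the remaining conditions (single node, single precision, complex-to-complex, built-in Slab formats).

```cpp
#include <cassert>
#include <cstdint>

// True when n is a power of 2 (1, 2, 4, ...).
inline bool is_power_of_two(std::uint64_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper: true when an nx x ny x nz transform matches the size
// pattern described in the known issue. Assumption: every dimension must be
// a power of 2, and only the Z extent is compared against 8192.
inline bool hits_slab_issue(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    return is_power_of_two(nx) && is_power_of_two(ny) && is_power_of_two(nz) && nz > 8192;
}
```

Under these assumptions, the 2x2x16384 example from the note is flagged, whereas 2x2x8192 is not, because Z must strictly exceed 8192.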

Standalone EA (November 2021)

New features

  • New multi-process API interoperable with MPI.

  • Built-in Slab decompositions (using CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED descriptors) via cufftMpAttachComm.

  • Custom data decompositions (using CUFFT_XT_FORMAT_DISTRIBUTED_INPUT and CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT descriptors) via cufftXtSetDistribution and cufftMpAttachComm.

  • cufftXtMalloc, cufftXtFree and cufftXtMemcpy are fully compatible with the above.

  • Standalone distributed reshape API with cufftReshapeHandle and associated APIs.
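As an illustration of what a Slab decomposition assigns to each process, the sketch below splits one axis of length n across a given number of ranks. It assumes the common convention that planes are divided as evenly as possible and that the first n % size ranks each receive one extra plane; the Slab struct and slab_for_rank are hypothetical names, and the cuFFTMp documentation remains the reference for the exact layout of each descriptor format.

```cpp
#include <cassert>
#include <cstddef>

// Contiguous range of planes owned by one rank along the partitioned axis.
struct Slab {
    std::size_t offset;  // index of the first plane owned by the rank
    std::size_t count;   // number of planes owned by the rank
};

// Hypothetical helper: splits n planes across `size` ranks.
// Assumption: the first (n % size) ranks each take one extra plane.
inline Slab slab_for_rank(std::size_t n, std::size_t rank, std::size_t size) {
    assert(size > 0 && rank < size);
    const std::size_t base = n / size;
    const std::size_t extra = n % size;
    const std::size_t count = base + (rank < extra ? 1 : 0);
    // Ranks before this one hold `base` planes each, plus one extra plane
    // for each of them that falls within the first `extra` ranks.
    const std::size_t offset = rank * base + (rank < extra ? rank : extra);
    return Slab{offset, count};
}
```

For example, 10 planes over 4 ranks yield counts of 3, 3, 2 and 2, starting at offsets 0, 3, 6 and 8.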

In addition, the following limitations have been lifted:

  • C2R/Z2D now support CUFFT_XT_FORMAT_INPLACE in 3D

  • R2C/D2Z now support CUFFT_XT_FORMAT_INPLACE_SHUFFLED in 3D

The following restrictions have been lifted for CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED:

  • “Dimension must factor into primes less than or equal to 127”

  • “Maximum dimension size is 4096 for single precision”

  • “Maximum dimension size is 2048 for double precision”
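To identify sizes that previously had to be avoided, the lifted restrictions can be expressed as a predicate. This is a sketch for illustration only, with hypothetical function names, and it assumes that all three restrictions applied jointly to each dimension.

```cpp
#include <cassert>
#include <cstdint>

// True when n factors entirely into primes <= max_prime.
inline bool factors_into_small_primes(std::uint64_t n, std::uint64_t max_prime = 127) {
    assert(n > 0);
    for (std::uint64_t p = 2; p <= max_prime && n > 1; ++p) {
        // Composite p never divides n here: its prime factors were already removed.
        while (n % p == 0) {
            n /= p;
        }
    }
    return n == 1;
}

// Hypothetical helper: true when a dimension satisfied the former limits
// (4096 for single precision, 2048 for double precision, primes <= 127).
inline bool within_former_limits(std::uint64_t n, bool double_precision) {
    const std::uint64_t max_size = double_precision ? 2048 : 4096;
    return n <= max_size && factors_into_small_primes(n);
}
```

Under these assumptions, a dimension of 4096 met the former limits in single precision but not in double precision, and a dimension of 131 (a prime greater than 127) met neither.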

The following restriction has been lifted for R2C/D2Z/C2R/Z2D with CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED:

  • “Fastest changing dimension size needs to be even”

Deprecations

  • N/A

Known / resolved issues

  • cufftXtMemcpy with CUFFT_COPY_DEVICE_TO_DEVICE returned incorrect results for 2D and 3D transforms in all previous versions of cuFFT. This has been fixed.