Release Notes

cuFFTMp 11.2.6 EA (HPC-SDK 24.07)

New features

  • Support for systems with Multi-Node NVLINK (MNNVL).

  • Support for NVSHMEM 3.0.6, which provides ABI backward compatibility between NVSHMEM host and device libraries. Starting with NVSHMEM 3.0.6, later host libraries remain compatible with earlier device library versions within the same major version. This means cuFFTMp 11.2.6 (released with NVSHMEM 3.0.6) can be linked to future NVSHMEM 3.x host libraries. Refer to Compatibility for more details.
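The compatibility rule in the last item can be read as a simple version check. The sketch below only illustrates that rule as it is worded here. `Version`, `older_than` and `host_covers_device` are hypothetical names, not NVSHMEM or cuFFTMp APIs, and treating device libraries older than 3.0.6 as outside the guarantee is an assumption drawn from the phrase "starting from NVSHMEM 3.0.6". The Compatibility section remains the authoritative reference.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical version triple (illustration only, not an NVSHMEM type).
struct Version {
    int major_num;
    int minor_num;
    int patch_num;
};

// Lexicographic comparison on (major, minor, patch).
inline bool older_than(const Version& a, const Version& b) {
    return std::tie(a.major_num, a.minor_num, a.patch_num) <
           std::tie(b.major_num, b.minor_num, b.patch_num);
}

// Returns true when a host library at version `host` is covered by the
// backward-compatibility rule for a device library at version `device`.
inline bool host_covers_device(const Version& host, const Version& device) {
    assert(host.major_num >= 0 && device.major_num >= 0);
    const Version first_covered{3, 0, 6};
    // Assumption: the guarantee does not reach device libraries before 3.0.6.
    if (older_than(device, first_covered)) return false;
    // The rule only holds within one major version.
    if (host.major_num != device.major_num) return false;
    // The host library must be the same version or later.
    return !older_than(host, device);
}
```

Under these assumptions, a 3.1.0 host library paired with the 3.0.6 device library passes the check, while a 4.0.0 host library or a host library older than the device library does not.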

Deprecations

  • N/A

Known issue

  • cuFFT LTO features are currently not supported in cuFFTMp.

Resolved issues

  • The resource allocation issues with nvshmem_init/nvshmem_finalize in the previous release have been fixed.

cuFFTMp 11.0.14 EA (HPC-SDK 23.11)

New features

  • Support for Grace-Hopper CPU-GPU architecture on CUDA 12.

Deprecations

  • N/A

Known issue

  • NVSHMEM 2.10.1 has a bug that causes the first call to nvshmem_finalize to release all resources. Any subsequent nvshmem_init will re-initialize all resources. cuFFTMp only initializes NVSHMEM if it hasn’t already been initialized, and finalizes NVSHMEM when all plans have been destroyed. As a consequence, any NVSHMEM or cuFFTMp API call (other than nvshmem_finalize) made after that first nvshmem_finalize and before any subsequent nvshmem_init may crash the application. A workaround consists of keeping at least one cuFFTMp plan alive throughout the application lifetime, and destroying it right before the last call to nvshmem_finalize.

For instance, this may crash:

nvshmem_init(...);
void* ptr = nvshmem_malloc(...);
cufftCreate(...);
cufftMpAttachComm(...);
cufftXtMakePlan(...);
cufftExecC2C(...);
cufftDestroy(...); // Last plan destroyed: will finalize all NVSHMEM resources (bug)
nvshmem_free(ptr); // May crash
nvshmem_finalize();

and should be replaced by:

nvshmem_init(...);
void* ptr = nvshmem_malloc(...);
cufftCreate(...);
cufftMpAttachComm(...);
cufftXtMakePlan(...);
cufftExecC2C(...);
nvshmem_free(ptr); // Will not crash: the plan is still alive
cufftDestroy(...); // Destroy the last plan right before nvshmem_finalize
nvshmem_finalize();

Refer to the cuFFTMp and NVSHMEM C++ sample for a working example.

  • NVSHMEM 2.10.1 has minor resource leaks, which put a ceiling on the number of times NVSHMEM can be initialized and finalized during a process's lifetime. This bug manifests itself through deadlocks or segfaults at cuFFTMp plan creation. A workaround consists of keeping at least one cuFFTMp plan alive throughout the application, to avoid repeated nvshmem_init / nvshmem_finalize.
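Both workarounds above rely on one plan outliving every other NVSHMEM and cuFFTMp call. One way to make that ordering hard to get wrong is a scope guard that releases the plan only when its scope ends. The sketch below is a generic, library-independent illustration: `KeepAliveGuard` is a hypothetical helper, not part of cuFFTMp or NVSHMEM, and the release callback is supplied by the caller.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper (not a cuFFTMp API): holds one handle and releases it
// exactly once, when the guard goes out of scope.
template <typename Handle>
class KeepAliveGuard {
public:
    KeepAliveGuard(Handle handle, std::function<void(Handle)> release)
        : handle_(handle), release_(std::move(release)) {
        assert(release_ && "a release callback is required");
    }

    // Runs the release callback at scope exit.
    ~KeepAliveGuard() { release_(handle_); }

    // Non-copyable, so the handle cannot be released twice.
    KeepAliveGuard(const KeepAliveGuard&) = delete;
    KeepAliveGuard& operator=(const KeepAliveGuard&) = delete;

    Handle get() const { return handle_; }

private:
    Handle handle_;
    std::function<void(Handle)> release_;
};
```

In an application, `Handle` would presumably be `cufftHandle`, the guarded plan would have gone through cufftCreate, cufftMpAttachComm and plan creation, and the callback would call cufftDestroy. The guard's scope would open after nvshmem_init and close right before the last nvshmem_finalize, so that every nvshmem_free and every other plan destruction happens while the guarded plan is still alive.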

Resolved issues

  • N/A

cuFFTMp 11.0.5 EA (HPC-SDK 23.3)

New features

  • cuFFTMp 11.0.5 integrates NVSHMEM 2.8 and supports both CUDA 11 and CUDA 12. A matching libnvshmem_host.so library (with a matching NVSHMEM and CUDA version) should be available at runtime.

  • Added support for the Hopper GPU architecture.

  • Added NVSHMEM interoperability support. Applications and libraries can now all use NVSHMEM and share resources, such as NVSHMEM-allocated buffers. This requires the application and all NVSHMEM-enabled libraries to dynamically link libnvshmem_host.so.

  • cuFFTMp can now be bootstrapped without an MPI communicator. See cufftMpAttachComm for more details.

  • The cufftXtSetDistribution API was changed; see cufftXtSetDistribution.

  • Added a new cufftXtSetSubformatDefault API to let users use cuFFTMp without cuFFT multi-GPU descriptors through the cufftExecC2C, cufftXtExec and similar APIs. See cufftXtSetSubformatDefault.

  • Improved performance on single-node, 3D, complex-to-complex transforms.

Deprecations

  • N/A

Known issue

  • HPC-SDK 23.3 releases NVSHMEM 2.9, but cuFFTMp users should point to NVSHMEM 2.8 in the compatible folder at runtime. See Compatibility.

Resolved issues

  • cuFFTMp now supports the same GPU architectures as cuFFT for all single-process functionalities.

cuFFTMp 10.8.1 EA (HPC-SDK 23.1)

New features

  • N/A

Deprecations

  • N/A

Known / resolved issues

  • HPC-SDK 23.1 releases NVSHMEM 2.8, but cuFFTMp users should point to NVSHMEM 2.6 in the compatible folder at runtime. See Compatibility.

cuFFTMp 10.8.1 EA (HPC-SDK 22.5+)

cuFFTMp 10.8.1 integrates NVSHMEM 2.5.0 and fixes a few issues as indicated below.

New features

  • N/A

Deprecations

  • N/A

Known / resolved issues

  • Resolved the issue in which single-node, single-precision, 3D, complex-to-complex power-of-2 transforms with Z > 8192 produced incorrect results.

  • cuFFTMp’s versioning has been corrected. Going forward, cuFFTMp will be versioned similarly to cuFFT. See Versioning.

cuFFTMp 0.0.2 EA (HPC-SDK 22.3)

New features

  • Improved performance of cufftXtSetDistribution and distributed descriptors. This effectively gives full support to Pencil data decompositions.

  • Improved performance of the Reshape API.

Deprecations

  • N/A

Known / resolved issues

  • Single-node, single-precision, 3D, complex-to-complex power-of-2 transforms in which Z > 8192 (e.g. a transform of size 2x2x16384) will produce incorrect results when using built-in Slab decompositions (i.e. CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED). This will be fixed in a future release of cuFFTMp. cufftXtSetDistribution can be used as a workaround.
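The conditions of this known issue can be collected into a small predicate, for instance to decide when to fall back to cufftXtSetDistribution. The sketch below is an illustration only. `TransformInfo`, `Precision`, `Layout` and `matches_known_issue` are hypothetical names, not cuFFTMp APIs. It assumes that "power-of-2 transform" means all three sizes are powers of two and that Z is the third size, as in the 2x2x16384 example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a transform (illustration only).
enum class Precision { Single, Double };
enum class Layout { BuiltInSlab, CustomDistribution };

struct TransformInfo {
    std::int64_t nx, ny, nz;   // sizes in X, Y, Z (assumed order)
    Precision precision;
    Layout layout;             // BuiltInSlab stands for CUFFT_XT_FORMAT_INPLACE(_SHUFFLED)
    bool single_node;
    bool complex_to_complex;
};

inline bool is_power_of_two(std::int64_t n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// True when every condition listed in the known issue holds.
inline bool matches_known_issue(const TransformInfo& t) {
    assert(t.nx > 0 && t.ny > 0 && t.nz > 0);
    return t.single_node &&
           t.precision == Precision::Single &&
           t.complex_to_complex &&
           t.layout == Layout::BuiltInSlab &&
           is_power_of_two(t.nx) && is_power_of_two(t.ny) && is_power_of_two(t.nz) &&
           t.nz > 8192;
}
```

With these assumptions, the 2x2x16384 example matches, while the same sizes with a custom distribution set through cufftXtSetDistribution do not.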

Standalone EA (November 2021)

New features

  • New multi-process API interoperable with MPI.

  • Built-in Slab decompositions (using CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED descriptors) using cufftMpAttachComm.

  • Custom data decomposition (using CUFFT_XT_FORMAT_DISTRIBUTED_INPUT and CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT descriptors) using cufftXtSetDistribution and cufftMpAttachComm.

  • cufftXtMalloc, cufftXtFree and cufftXtMemcpy are fully compatible with the above.

  • Standalone distributed reshape API with cufftReshapeHandle and associated APIs.

In addition, the following limitations have been lifted:

  • C2R/Z2D now support CUFFT_XT_FORMAT_INPLACE in 3D.

  • R2C/D2Z now support CUFFT_XT_FORMAT_INPLACE_SHUFFLED in 3D.

The following restrictions have been lifted for CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED:

  • “Dimension must factor into primes less than or equal to 127”

  • “Maximum dimension size is 4096 for single precision”

  • “Maximum dimension size is 2048 for double precision”
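To see which sizes the three lifted restrictions used to exclude, the quoted limits can be written out as a check. The sketch below is an illustration only. `factors_into_primes_up_to` and `met_former_limits` are hypothetical names, not cuFFT or cuFFTMp APIs, and reading "maximum dimension size" as an inclusive bound is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// True when n is a product of primes that are all <= max_prime.
inline bool factors_into_primes_up_to(std::int64_t n, std::int64_t max_prime) {
    assert(max_prime >= 2);
    if (n < 1) return false;
    // Divide out every factor up to max_prime. Composite candidates never
    // divide n here, because their prime factors were already removed.
    for (std::int64_t p = 2; p <= max_prime && n > 1; ++p) {
        while (n % p == 0) n /= p;
    }
    return n == 1;
}

// True when a dimension size satisfied the former limits quoted above:
// primes <= 127, and at most 4096 (single precision) or 2048 (double precision).
inline bool met_former_limits(std::int64_t dim, bool double_precision) {
    const std::int64_t max_size = double_precision ? 2048 : 4096;
    return dim >= 1 && dim <= max_size && factors_into_primes_up_to(dim, 127);
}
```

For example, a dimension of 8192 in single precision, 4096 in double precision, or the prime 131 in either precision fails this check. These are the kinds of sizes that became usable once the restrictions were lifted.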

The following restrictions have been lifted for R2C/D2Z/C2R/Z2D with CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED:

  • “Fastest changing dimension size needs to be even”

Deprecations

  • N/A

Known / resolved issues

  • cufftXtMemcpy with CUFFT_COPY_DEVICE_TO_DEVICE was returning wrong results for 2D and 3D transforms in all previous versions of cuFFT. This has been fixed.