Release Notes
cuFFTMp 11.0.14 EA (HPC-SDK 23.11)
New features
Support for Grace-Hopper CPU-GPU architecture on CUDA 12.
Deprecations
N/A
Known issue
NVSHMEM 2.10.1 has a bug that causes the first call to nvshmem_finalize to release all resources. Any subsequent nvshmem_init will re-initialize all resources. cuFFTMp only initializes NVSHMEM if it hasn’t already been initialized, and finalizes NVSHMEM when all plans have been destroyed. As a consequence, any NVSHMEM or cuFFTMp API call (other than nvshmem_finalize) after that first nvshmem_finalize and before any subsequent nvshmem_init may crash the application. A workaround is to keep at least one cuFFTMp plan alive throughout the application lifetime, and to destroy it right before the last call to nvshmem_finalize.
For instance, this may crash:
    nvshmem_init(...);
    void* ptr = nvshmem_malloc(...);
    cufftCreate(...);
    cufftMpAttachComm(...);
    cufftXtMakePlan(...);
    cufftExecC2C(...);
    cufftDestroy(...);    // Will finalize all NVSHMEM resources (bug)
    nvshmem_free(ptr);    // May crash
    nvshmem_finalize();
and should be replaced by:
    nvshmem_init(...);
    void* ptr = nvshmem_malloc(...);
    cufftCreate(...);
    cufftMpAttachComm(...);
    cufftXtMakePlan(...);
    cufftExecC2C(...);
    nvshmem_free(ptr);    // Will not crash
    cufftDestroy(...);
    nvshmem_finalize();
Refer to the cuFFTMp and NVSHMEM C++ sample for a working example.
NVSHMEM 2.10.1 has minor resource leaks, which put a ceiling on the number of times NVSHMEM can be initialized and finalized over a process lifetime. This bug manifests as deadlocks or segfaults at cuFFTMp plan creation. A workaround is to keep at least one cuFFTMp plan alive throughout the application, to avoid repeated nvshmem_init/nvshmem_finalize calls.
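Both workarounds above come down to the same ordering: one long-lived plan is destroyed last, just before nvshmem_finalize. The following is a minimal C++ sketch of that ordering using a scope guard. PlanGuard and run_teardown_order are hypothetical names, not part of cuFFTMp or NVSHMEM, and the logged strings only stand in for the real calls. In an application, the guard's callable would presumably wrap cufftDestroy on the long-lived plan.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs a cleanup callable once when it goes out of scope.
// In real code the callable would wrap cufftDestroy(long_lived_plan).
class PlanGuard {
public:
    explicit PlanGuard(std::function<void()> destroy) : destroy_(std::move(destroy)) {}
    ~PlanGuard() { if (destroy_) destroy_(); }
    PlanGuard(const PlanGuard&) = delete;
    PlanGuard& operator=(const PlanGuard&) = delete;
private:
    std::function<void()> destroy_;
};

// Records the order of teardown steps. The strings are placeholders for the
// real NVSHMEM and cuFFTMp calls, which are not invoked here.
std::vector<std::string> run_teardown_order() {
    std::vector<std::string> log;
    log.push_back("nvshmem_init");
    {
        // Long-lived plan: its destruction is deferred to the end of this scope.
        PlanGuard keep_alive([&log] { log.push_back("cufftDestroy(keep_alive)"); });

        // Short-lived plans and NVSHMEM buffers can come and go in between.
        log.push_back("cufftDestroy(work_plan)");
        log.push_back("nvshmem_free");
    }  // keep_alive is destroyed here, after everything else in the scope
    log.push_back("nvshmem_finalize");
    return log;
}
```

Because the guard is the first object declared in the scope, it is released after every other plan and buffer handled in that scope, and the scope closes right before the final nvshmem_finalize.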
Resolved issues
N/A
cuFFTMp 11.0.5 EA (HPC-SDK 23.3)
New features
cuFFTMp 11.0.5 integrates NVSHMEM 2.8 and supports both CUDA 11 and CUDA 12. A matching libnvshmem_host.so library (with a matching NVSHMEM and CUDA version) should be available at runtime.
Added support for the Hopper GPU architecture.
Added NVSHMEM interoperability support. Applications and libraries can now all use NVSHMEM and share resources, such as NVSHMEM-allocated buffers. This requires the application and all NVSHMEM-enabled libraries to dynamically link libnvshmem_host.so.
cuFFTMp can now be bootstrapped without an MPI communicator. See cufftMpAttachComm for more details.
The cufftXtSetDistribution API was changed; see cufftXtSetDistribution.
Added a new cufftXtSetSubformatDefault API to let users use cuFFTMp without cuFFT multi-GPU descriptors through the cufftExecC2C, cufftXtExec and similar APIs. See cufftXtSetSubformatDefault.
Improved performance on single-node, 3D, complex-to-complex transforms.
Deprecations
N/A
Known issue
HPC-SDK 23.3 ships NVSHMEM 2.9, but cuFFTMp users should point to NVSHMEM 2.8 in the compatible folder at runtime. See Compatibility.
Resolved issues
cuFFTMp now supports the same GPU architectures as cuFFT for all single-process functionalities.
cuFFTMp 10.8.1 EA (HPC-SDK 23.1)
New features
N/A
Deprecations
N/A
Known / resolved issues
HPC-SDK 23.1 ships NVSHMEM 2.8, but cuFFTMp users should point to NVSHMEM 2.6 in the compatible folder at runtime. See Compatibility.
cuFFTMp 10.8.1 EA (HPC-SDK 22.5+)
cuFFTMp 10.8.1 integrates NVSHMEM 2.5.0 and fixes a few issues as indicated below.
New features
N/A
Deprecations
N/A
Known / resolved issues
The issue where single-node, single-precision, 3D, complex-to-complex, power-of-2 transforms with Z > 8192 produced incorrect results has been resolved.
cuFFTMp’s versioning has been corrected. Going forward, cuFFTMp will be versioned similarly to cuFFT. See Versioning.
cuFFTMp 0.0.2 EA (HPC-SDK 22.3)
New features
Improved performance of cufftXtSetDistribution and distributed descriptors. This effectively gives full support to Pencil data decompositions.
Improved performance of the Reshape API.
Deprecations
N/A
Known / resolved issues
Single-node, single-precision, 3D, complex-to-complex, power-of-2 transforms in which Z > 8192 (e.g. a transform of size 2x2x16384) will lead to incorrect results when using built-in Slab decompositions (i.e. CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED). This will be fixed in a future release of cuFFTMp. cufftXtSetDistribution can be used as a workaround.
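As a rough aid for deciding when to apply the workaround, the size condition above can be sketched as a predicate. This is only an illustration: may_hit_large_z_issue is a hypothetical helper, and it assumes that "power-of-2 transforms" means all three dimensions are powers of two and that Z is the last dimension, as in the 2x2x16384 example. It checks sizes only; whether the transform is single-node, single-precision, complex-to-complex and uses a built-in Slab format still has to be determined by the caller.

```cpp
#include <cstdint>

// True if n is a positive power of two.
bool is_power_of_two(std::int64_t n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Hypothetical helper (not a cuFFTMp API): flags 3D sizes that match the
// size condition of the known issue.
// Assumption: all three dimensions must be powers of two, and Z is nz.
bool may_hit_large_z_issue(std::int64_t nx, std::int64_t ny, std::int64_t nz) {
    return is_power_of_two(nx) && is_power_of_two(ny) &&
           is_power_of_two(nz) && nz > 8192;
}
```

With these assumptions, a 2x2x16384 transform is flagged, while 2x2x8192 is not because Z does not exceed 8192.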
Standalone EA (November 2021)
New features
New multi-process API interoperable with MPI.
Built-in Slab decompositions (using CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED descriptors) using cufftMpAttachComm.
Custom data decomposition (using CUFFT_XT_FORMAT_DISTRIBUTED_INPUT and CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT descriptors) using cufftXtSetDistribution and cufftMpAttachComm.
cufftXtMalloc, cufftXtFree and cufftXtMemcpy are fully compatible with the above.
Standalone distributed reshape API with cufftReshapeHandle and associated APIs.
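To make the idea of a Slab decomposition concrete, the sketch below computes which indices along the split dimension a given process would own. It is an illustration under stated assumptions, not the library's implementation: slab_range is a hypothetical helper, and the rule that the first (n % size) ranks each own one extra index is an assumption about how an uneven split is handled. For a custom decomposition, bounds of this kind are what a caller would typically compute before describing the layout to the library.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper (not a cuFFTMp API): returns the half-open range
// [first, second) of indices along the split dimension owned by `rank`.
// Assumption: when n is not divisible by size, the first (n % size) ranks
// each own one extra index.
std::pair<std::int64_t, std::int64_t> slab_range(std::int64_t n, int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    const std::int64_t base = n / size;   // indices every rank owns
    const std::int64_t extra = n % size;  // leftover indices to spread out
    const std::int64_t lo = rank * base + (rank < extra ? rank : extra);
    const std::int64_t hi = lo + base + (rank < extra ? 1 : 0);
    return {lo, hi};
}
```

For example, splitting 10 indices over 4 processes under this rule gives the ranges [0, 3), [3, 6), [6, 8) and [8, 10).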
In addition, the following limitations have been lifted:
C2R/Z2D now support CUFFT_XT_FORMAT_INPLACE in 3D
R2C/D2Z now support CUFFT_XT_FORMAT_INPLACE_SHUFFLED in 3D
The following restrictions have been lifted for CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED:
“Dimension must factor into primes less than or equal to 127”
“Maximum dimension size is 4096 for single precision”
“Maximum dimension size is 2048 for double precision”
The following restrictions have been lifted for R2C/D2Z/C2R/Z2D with CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED:
“Fastest changing dimension size needs to be even”
Deprecations
N/A
Known / resolved issues
cufftXtMemcpy with CUFFT_COPY_DEVICE_TO_DEVICE was returning wrong results for 2D and 3D transforms in all previous versions of cuFFT. This has been fixed.