Release Notes#

This section lists significant changes, new features, performance improvements, and known or resolved issues. Unless noted otherwise, listed issues should not impact functionality. When functionality is impacted, a workaround is provided if one is available.

1.5.0#

Update for MathDx 25.06 release. It does not include features from 1.4.0 EA.

Added a new chapter in the documentation: Python Bindings.
Deprecated legacy architecture: sm72.

New Features#

Added support for new Blackwell architectures:
- SM100, SM101, SM120.
- Experimental support: SM103, SM121.

Known Issues#

cuFFTDx discontinues experimental support for the MSVC compiler. See MSVC compiler warning for more information.

1.4.0 Early Access#

The first Early Access (EA) release of cuFFTDx library with Link Time Optimization via cuFFT. It is packaged as a separate library and can be downloaded from here.

New Features#

Over 1000 additional sizes supported with improved performance and without a workspace requirement, via code sharing across cuFFT and cuFFTDx enabled by LTO.
The new LTO-enabled features support both offline (NVCC) and online (NVRTC / nvJitLink) kernel generation.

1.3.1#

Minor update for MathDx 25.01.1 release.

Updated version incompatibilities: it is now recommended to disable the CUTLASS dependency when using CUDA Toolkit versions earlier than 11.4. See Installation Guide for more information.

Resolved Issues#

Removed the cufftdx_separate_twiddles_lut target from the default all target when including the cuFFTDx package.
Minor corrections and fixes in documentation.

1.3.0#

cuFFTDx update with performance improvements.

New Features#

Extended the number of sizes supported without additional workspace. Check here for the full list of supported sizes.
Added a new linking target for optimizing the final executable size. See Installation Guide for more information.
Examples updates
- Introduced convolution_3d examples.
- Extended explanation in docs of mixed_precision examples.

Resolved Issues#

cuFFTDx internal versioning definitions have been fixed:
- CUFFTDX_VERSION_MAJOR is now encoded in CUFFTDX_VERSION with a multiplier of 10000, i.e. CUFFTDX_VERSION_MAJOR = CUFFTDX_VERSION / 10000.
- CUFFTDX_VERSION_MINOR can now have up to two digits.
- Absolute ordering between versions remains valid. However, CUFFTDX_VERSION values between 1201 and 10300 will not exist.
- All definitions can be checked in the file cufftdx_version.hpp.
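The encoding described above can be sketched as follows. This is a minimal illustration, not code from cufftdx_version.hpp: the helper names are hypothetical, and the assumption that the patch number occupies the lowest two digits is inferred from the value 10300 for version 1.3.0 rather than stated in these notes.

```cpp
// Hypothetical helpers illustrating the version encoding; they are not
// part of cuFFTDx. Assumption: patch occupies the lowest two digits,
// minor the next two, and major everything above that.
struct VersionParts {
    int major_part;
    int minor_part;
    int patch_part;
};

// Packs the three components into a single integer, e.g. 1.3.0 -> 10300.
constexpr int encode_version(int major_part, int minor_part, int patch_part) {
    return major_part * 10000 + minor_part * 100 + patch_part;
}

// Splits a packed version back into its components.
constexpr VersionParts decode_version(int version) {
    return VersionParts{version / 10000, (version / 100) % 100, version % 100};
}

// Compile-time checks of the properties listed in the release notes.
static_assert(encode_version(1, 3, 0) == 10300, "1.3.0 is encoded as 10300");
static_assert(decode_version(10300).major_part == 1, "major = version / 10000");
static_assert(encode_version(1, 3, 0) > 1201, "ordering against the old 1.2.1 value holds");
```

Because ordering is preserved, a preprocessor guard such as `#if CUFFTDX_VERSION >= 10300` should be enough to detect 1.3.0 or newer, assuming the macro is visible at that point in the translation unit.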

1.2.1#

Minor update for MathDx 24.08 release.

All Device Extensions libraries are bundled together in a single package named nvidia-mathdx-24.08.0.tar.gz.
cuFFTDx now has an indirect dependency on CUTLASS library. See Installation Guide for more information.

1.2.0#

cuFFTDx update extending R2C/C2R capabilities.

New Features#

Added RealFFTOptions Operator with two knobs:
- complex_layout allows for changing complex value layouts for both real-to-complex and complex-to-real transforms.
- real_mode introduces an optimized mode, providing up to 2x performance improvement for real-to-complex and complex-to-real transforms and allowing sequences twice as long.
Extended thread range twofold for each precision and architecture (e.g. SM<900> can now execute up to 64-point thread FFTs instead of 32-point).
Added an optional CUDA stream parameter to the make_workspace functions.
Added new input/output traits (e.g. Input Length Trait)
- Among the new traits, the input_type and output_type traits have been repurposed from their previous use. Please refer to Traits for further detail.
Examples updates
- Updated all input/output to follow the new simplified trait-based idiom.
- Updated real-to-complex and complex-to-real examples to utilize the new RealFFTOptions Operator.
- Introduced convolution_padded and mixed_precision examples.
Improvements to the documentation:
- Added missing information regarding memory alignment requirements for shared memory.
- Updated the documentation to follow the new trait-based input/output idiom.
- Added description of features introduced in this release: RealFFTOptions Operator, new make_workspace overloads, new supported size ranges.
- Added padded convolution performance comparison with cuFFT and non-padded cuFFTDx to the examples chapter.

Resolved Issues#

Added missing acquire synchronization for complex-to-real block FFTs when input data is in thread local arrays.

1.1.1#

cuFFTDx patch update to accommodate cuBLASDx EA 0.1.0 release.

New Features#

Added 2D and 3D FFT examples.
cuFFTDx now depends on commonDx headers. commonDx includes private tools and types that all Dx libraries use.

Resolved Issues#

Disabled runtime CUDA block dimension assertions (using assert()) in execute() methods by default. In previous versions they were disabled only when NDEBUG (defined by CMake in Release mode) or CUFFTDX_DISABLE_RUNTIME_ASSERTS was defined, or when compilation was done by NVRTC. Keeping the assertions enabled could result in a performance penalty. Users now have to define CUFFTDX_ENABLE_RUNTIME_ASSERTS to enable them.
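The opt-in behavior described above follows a common pattern, sketched below. This is not cuFFTDx's internal implementation: SKETCH_RUNTIME_ASSERT, runtime_asserts_enabled, and checked_block_size are hypothetical names used only for illustration. Only the CUFFTDX_ENABLE_RUNTIME_ASSERTS macro name comes from these notes.

```cpp
#include <cassert>

// Hypothetical stand-in for a library-internal check macro. The check is
// compiled in only when the user opts in; cuFFTDx itself may differ.
#ifdef CUFFTDX_ENABLE_RUNTIME_ASSERTS
#define SKETCH_RUNTIME_ASSERT(cond) assert(cond)
#else
#define SKETCH_RUNTIME_ASSERT(cond) ((void)0)
#endif

// Reports whether the opt-in macro was defined for this translation unit.
constexpr bool runtime_asserts_enabled() {
#ifdef CUFFTDX_ENABLE_RUNTIME_ASSERTS
    return true;
#else
    return false;
#endif
}

// Hypothetical check: returns the actual block size and, only when opted
// in, asserts that it matches the expected one. In this sketch, assert()
// is still compiled out whenever NDEBUG is defined.
inline unsigned checked_block_size(unsigned actual, unsigned expected) {
    SKETCH_RUNTIME_ASSERT(actual == expected);
    (void)expected;  // silences unused-parameter warnings when checks are off
    return actual;
}
```

For a macro of this kind to take effect, it has to be defined before the header that tests it is included, for example by passing -DCUFFTDX_ENABLE_RUNTIME_ASSERTS on the compiler command line.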

1.1.0#

The first release of cuFFTDx library with support for Hopper and Ada architectures.

New Features#

Initial support for Orin architecture (SM87).
Initial support for Ada architecture (SM89).
Initial support for Hopper architecture (SM90).
Added cufftdx::is_supported.
Added preliminary support for MSVC.
Improvements to the documentation:
- Examples chapter,
- Quick Installation Guide chapter,
- information about shared memory usage in cuFFTDx, and
- updated introduction chapters: First FFT Using cuFFTDx and Your Next Custom FFT Kernels.

Known Issues#

Compiling using MSVC as the CUDA host compiler requires enabling __cplusplus (/Zc:__cplusplus). To do so, pass -Xcompiler "/Zc:__cplusplus" as an option to NVCC (NVCC: Options for Passing Specific Phase Options).
When compiling using MSVC as the CUDA host compiler, be aware of the limit on the length of mangled names and of other compiler limits, which in extreme cases can result in calling incorrect instances of kernel templates involving cuFFTDx.
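The /Zc:__cplusplus requirement above exists because MSVC, by default, reports __cplusplus as 199711L regardless of the selected language standard, so headers that test this macro see an outdated value. The following is a minimal sketch, not part of cuFFTDx, for confirming what the host compiler reports. The function names are hypothetical.

```cpp
#include <cstdio>

// Hypothetical helpers, not part of cuFFTDx.

// Value of __cplusplus as seen by the compiler processing this file.
constexpr long reported_cplusplus() {
    return static_cast<long>(__cplusplus);
}

// True when the macro reports at least C++11. MSVC without
// /Zc:__cplusplus reports 199711L, for which this returns false.
constexpr bool cplusplus_macro_is_reliable() {
    return reported_cplusplus() >= 201103L;
}

// Prints the reported value so two build configurations can be compared.
inline void print_reported_cplusplus() {
    std::printf("__cplusplus = %ld\n", reported_cplusplus());
}
```

Calling print_reported_cplusplus() from a small program built once with and once without -Xcompiler "/Zc:__cplusplus" should make the difference visible on MSVC. Other host compilers are expected to report the selected standard without any extra flag.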

1.0.0#

The first general availability (GA) release of cuFFTDx library.

New Features#

Added new shared API for block FFT execution, see block execution methods.
Added and documented FFT::stride.
Optimized default ElementsPerThread and FFTsPerBlock values for SM80 (targeting A100) and SM70 (targeting V100).
Restored full performance of powers-of-two kernels in cuFFTDx.

Resolved Issues#

The ptxas warning about a pointer size conflict (Program uses 32-bit address on line 'XXX' which is conflicting with .address_size 64) should no longer appear.

0.3.1#

The last early access (EA) release of cuFFTDx library.

Known Issues#

ptxas warning about pointer size conflict:
```
ptxas warning : Program uses 32-bit address on line 'XXX' which is conflicting with .address_size 64
```
This warning may appear when compiling, but it does not impact functionality or performance.