Requirements and Functionality#


Requirements#

The cuFFTDx library is a CUDA C++ header-only library, so the list of software required to use it is relatively small. Users need:

  • CUDA Toolkit 11.0 or newer

  • Supported CUDA compiler

  • Supported host compiler (C++17 required)

  • (Optionally) CMake (version 3.18 or greater)

Supported Compilers#

CUDA Compilers:

  • NVCC 11.0.194+ (CUDA Toolkit 11.0 or newer)

  • (Experimental support) NVRTC 11.0.194+ (CUDA Toolkit 11.0 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (only on Linux/WSL2)

  • (Preliminary support) MSVC 1920+ (Windows/Visual Studio 2019 16.0 or newer)

Warning

Compiling with MSVC as the CUDA host compiler requires enabling the updated __cplusplus macro (/Zc:__cplusplus). To do so, pass -Xcompiler "/Zc:__cplusplus" as an option to NVCC. Failing to do so may result in errors reporting that a variable of class dim3 cannot be declared constexpr.
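For example, an NVCC invocation forwarding the flag to MSVC might look like this (the file and output names are illustrative):

```bash
nvcc -std=c++17 -Xcompiler "/Zc:__cplusplus" -o fft_example fft_example.cu
```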

Note

cuFFTDx emits errors for unsupported compiler versions; these can be silenced by defining CUFFTDX_IGNORE_DEPRECATED_COMPILER during compilation. cuFFTDx is not guaranteed to work with compiler versions it does not support.

Note

cuFFTDx emits errors for unsupported C++ standard versions; these can be silenced by defining CUFFTDX_IGNORE_DEPRECATED_DIALECT during compilation. cuFFTDx is not guaranteed to work with C++ standard versions it does not support.

Supported Functionality#

Supported functionality includes:
  • Creating block descriptors that run collective FFT operations (with one or more threads collaborating to compute one or more FFTs) in a single CUDA block. See Block Operator.

  • Creating thread descriptors that run a single FFT operation per thread. This approach may require more cuFFTDx expertise to obtain correct results with high performance. See Thread Operator.

  • Bi-directional information flow: from the user to the descriptor via Operators, and from the descriptor to the user via Traits.

  • Targeting specific GPU architectures using the SM Operator. This enables the descriptor to suggest parameters tuned for performance on that architecture.
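As a sketch of how these operators compose (assuming the cufftdx.hpp header and a C++17 compiler; the size and architecture values are illustrative):

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

// Describe a single-precision, forward, complex-to-complex FFT of size 128,
// computed collectively by one CUDA block on SM 8.0.
using FFT = decltype(Size<128>()
                     + Precision<float>()
                     + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>()
                     + Block()
                     + SM<800>());

// Traits flow information back from the descriptor to the user.
constexpr auto ept           = FFT::elements_per_thread; // register elements per thread
constexpr auto block_dim     = FFT::block_dim;           // suggested block dimensions
constexpr auto shared_memory = FFT::shared_memory_size;  // required shared memory (bytes)
```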

cuFFTDx supports selected FFT sizes in the range [0; max_size], where max_size depends on the precision and CUDA architecture as presented in the table below, and all FFT sizes in the range [0; max_size_fp64/2], where max_size_fp64 is the maximum FFT size for double precision on a given CUDA architecture. However, not every combination of size, precision, elements per thread, and FFTs per block is valid and available. The following table summarizes the available configurations:

Note

Real-to-complex and complex-to-real block FFTs have extra available sizes in each precision (marked in the table with a plus), but they are accessible only when using the real_mode::folded value of the RealFFTOptions operator. Analogously, thread execution can employ real_mode::folded for all even sizes up to twice the regular range, as shown in the table below. For further details, please refer to the RealFFTOptions Operator.

| Precision | Thread FFT: C2C and R2C/C2R Normal | Thread FFT: R2C/C2R Folded | Architecture | Block FFT: C2C and R2C/C2R Normal | Block FFT: R2C/C2R Folded |
|-----------|------------------------------------|----------------------------|--------------|-----------------------------------|---------------------------|
| half      | [2; 64] | even in range: [4; 128] | 75           | [2; 16384] | powers of 2 in range: [4; 8192]  |
|           |         |                         | 70; 72; 86; 89 | [2; 24389] | powers of 2 in range: [4; 32768] |
|           |         |                         | 80; 87; 90   | [2; 32768] | powers of 2 in range: [2; 65536] |
| float     | [2; 64] | even in range: [4; 128] | 75           | [2; 16384] | powers of 2 in range: [2; 8192]  |
|           |         |                         | 70; 72; 86; 89 | [2; 24389] | powers of 2 in range: [2; 32768] |
|           |         |                         | 80; 87; 90   | [2; 32768] | powers of 2 in range: [2; 65536] |
| double    | [2; 32] | even in range: [4; 64]  | 75           | [2; 8192]  | powers of 2 in range: [2; 4096]  |
|           |         |                         | 70; 72; 86; 89 | [2; 12167] | powers of 2 in range: [2; 16384] |
|           |         |                         | 80; 87; 90   | [2; 16384] | powers of 2 in range: [2; 32768] |
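To illustrate the folded real mode (a sketch assuming the cufftdx.hpp header; the size and architecture values are illustrative), an extended-size real-to-complex block FFT might be described as:

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

// Folded R2C FFT: real_mode::folded unlocks the extended power-of-2 sizes
// listed in the table above (here, 65536 in single precision on SM 9.0).
using FFT = decltype(Size<65536>()
                     + Precision<float>()
                     + Type<fft_type::r2c>()
                     + RealFFTOptions<complex_layout::natural, real_mode::folded>()
                     + Block()
                     + SM<900>());
```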

Note

cuFFTDx 0.3.0 added preliminary support for all sizes in the range [0; max_size_fp64/2]. Most sizes require creating an additional workspace with a global memory allocation. See Make Workspace Function for more details about workspace. You can check whether a given FFT requires a workspace with the FFT::requires_workspace trait.
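A sketch of the workspace check and creation (assuming cufftdx.hpp and an FFT description defined earlier with cuFFTDx operators; error handling is minimal):

```cuda
#include <cufftdx.hpp>

// Host-side setup: workspace is only needed when the trait says so.
if constexpr (FFT::requires_workspace) {
    cudaError_t error = cudaSuccess;
    // make_workspace performs the global memory allocation the FFT needs.
    auto workspace = cufftdx::make_workspace<FFT>(error);
    // ... pass `workspace` to FFT().execute(...) inside the kernel ...
}
```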

Hint

Since cuFFTDx 1.1.0, whether an FFT description is supported on a given CUDA architecture can be checked using cufftdx::is_supported.
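For instance (a sketch assuming cufftdx.hpp, and assuming is_supported takes the description and the SM version; the size here is illustrative):

```cuda
#include <cufftdx.hpp>

using namespace cufftdx;

using FFT = decltype(Size<12167>() + Precision<double>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + Block());

// Compile-time check: is this description supported on SM 8.0?
static_assert(is_supported<FFT, 800>::value,
              "FFT description not supported on this architecture");
```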

Workspace is not required for FFTs of following sizes:

  • Powers of 2 up to 32768.

  • Powers of 3 up to 19683.

  • Powers of 5 up to 15625.

  • Powers of 6 up to 7776.

  • Powers of 7 up to 16807.

  • Powers of 10 up to 10000.

  • Powers of 11 up to 1331.

  • Powers of 12 up to 1728.

  • Powers of 13 up to 2197.

  • Powers of 14 up to 2744.

  • Powers of 15 up to 3375.

  • Powers of 17 up to 4913.

  • Powers of 18 up to 5832.

  • Powers of 19 up to 6859.

  • Powers of 20 up to 8000.

  • Powers of 21 up to 9261.

  • Powers of 22 up to 10648.

  • Powers of 23 up to 12167.

  • Powers of 24 up to 13824.

  • Powers of 26 up to 17576.

  • Powers of 28 up to 21952.

  • Powers of 30 up to 27000.

  • Powers of 31 up to 29791.

  • Multiples of 4 lower than 512.

In future versions of cuFFTDx:
  • The workspace requirement may be removed for other configurations.

  • FFT configurations that do not require workspace will continue not to require it.

The performance speedup obtained for a sample of the sizes listed above with no workspace requirement can be seen in Fig. 1 and Fig. 2. The charts show the speedup of cuFFTDx 1.3.0 for these FFT sizes compared with their previous implementation in cuFFTDx 1.2.1 (which required workspace).


Fig. 1 Speedup of complex-to-complex forward FFT in single precision with suggested parameters with cuFFTDx 1.3.0 on H100 80GB HBM3 with maximum clocks set. Chart presents speedup compared to cuFFTDx 1.2.1.#


Fig. 2 Speedup of complex-to-complex forward FFT in double precision with suggested parameters with cuFFTDx 1.3.0 on H100 80GB HBM3 with maximum clocks set. Chart presents speedup compared to cuFFTDx 1.2.1.#

Functionality not yet supported includes:
  • Input/output stored in global memory. Input data must be in registers or shared memory.

  • The BlockDim Operator, which enables fine-grain customization of the CUDA block dimensions.