Requirements and Functionality


Requirements

The cuFFTDx library is a CUDA C++ header-only library, so the list of required software is relatively short. Users need:

  • CUDA Toolkit 11.0 or newer

  • Supported CUDA compiler

  • Supported host compiler (C++17 required)

  • (Optionally) CMake (version 3.18 or greater)
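If CMake is used, a minimal build configuration might look like the sketch below. It assumes cuFFTDx was installed as part of the MathDx package and that the package exports a mathdx CMake config with a mathdx::cufftdx target; both names are assumptions to verify against your installation.

    cmake_minimum_required(VERSION 3.18)
    project(cufftdx_example LANGUAGES CXX CUDA)

    # Assumed package and target names; verify against your MathDx installation.
    find_package(mathdx REQUIRED COMPONENTS cufftdx CONFIG)

    add_executable(example example.cu)
    target_link_libraries(example PRIVATE mathdx::cufftdx)
    set_property(TARGET example PROPERTY CUDA_ARCHITECTURES 80)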

Supported Compilers

CUDA Compilers:

  • NVCC 11.0.194+ (CUDA Toolkit 11.0 or newer)

  • (Experimental support) NVRTC 11.0.194+ (CUDA Toolkit 11.0 or newer)

Host / C++ Compilers:

  • GCC 7+

  • Clang 9+ (only on Linux/WSL2)

  • (Preliminary support) MSVC 1920+ (Windows/Visual Studio 2019 16.0 or newer)

Warning

Compiling with MSVC as the CUDA host compiler requires enabling the updated __cplusplus macro (/Zc:__cplusplus). To do so, pass -Xcompiler "/Zc:__cplusplus" as an option to NVCC, as shown below. Failing to do so may result in errors reporting that a variable of class dim3 cannot be declared constexpr.
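For example, an NVCC invocation with MSVC as the host compiler might look like this (file names and the architecture flag are placeholders):

    nvcc -std=c++17 -arch=sm_80 -Xcompiler "/Zc:__cplusplus" -o example.exe example.cu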

Note

cuFFTDx emits errors for unsupported compiler versions; these can be silenced by defining CUFFTDX_IGNORE_DEPRECATED_COMPILER during compilation. However, cuFFTDx is not guaranteed to work with unsupported compiler versions.

Note

cuFFTDx emits errors for unsupported versions of the C++ standard; these can be silenced by defining CUFFTDX_IGNORE_DEPRECATED_DIALECT during compilation. However, cuFFTDx is not guaranteed to work with unsupported versions of the C++ standard.
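Both macros can be defined on the compiler command line; for example:

    nvcc -std=c++17 -DCUFFTDX_IGNORE_DEPRECATED_COMPILER -DCUFFTDX_IGNORE_DEPRECATED_DIALECT -o example example.cu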

Supported Functionality

Supported functionality includes:
  • Create block descriptors that run collective FFT operations (with one or more threads collaborating to compute one or more FFTs) in a single CUDA block. See Block Operator.

  • Create thread descriptors that run a single FFT operation per thread. This mode may require more cuFFTDx expertise to obtain correct results with high performance. See Thread Operator.

  • Bi-directional information flow, from the user to the descriptor via Operators and from the descriptor to the user via Traits.

  • Target specific GPU architectures using the SM Operator. This lets users configure the descriptor with parameters suggested for good performance on the target architecture (see the sketch after this list).
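As an illustration, the sketch below combines these operators into a block FFT descriptor, reads back a few traits, and executes the FFT collectively in a kernel. It follows the patterns of cuFFTDx's documented API; the size, precision, and SM<800> architecture are arbitrary example choices:

    #include <cufftdx.hpp>

    // Describe a 128-point, single-precision, complex-to-complex forward FFT,
    // computed collectively by one CUDA block, targeting SM 8.0.
    using FFT = decltype(cufftdx::Size<128>()
                         + cufftdx::Precision<float>()
                         + cufftdx::Type<cufftdx::fft_type::c2c>()
                         + cufftdx::Direction<cufftdx::fft_direction::forward>()
                         + cufftdx::Block()
                         + cufftdx::SM<800>());

    // Traits carry information from the descriptor back to the user:
    //   FFT::block_dim          - suggested CUDA block dimensions
    //   FFT::shared_memory_size - shared memory required, in bytes
    //   FFT::storage_size       - register elements held per thread

    template<class FFT>
    __global__ void block_fft_kernel(typename FFT::value_type* data) {
        using complex_type = typename FFT::value_type;

        // Each thread owns storage_size elements of the FFT in registers.
        complex_type thread_data[FFT::storage_size];

        // Load this thread's elements; consecutive elements owned by one
        // thread are FFT::stride apart in the input.
        unsigned int index = blockIdx.x * cufftdx::size_of<FFT>::value + threadIdx.x;
        for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
            thread_data[i] = data[index];
            index += FFT::stride;
        }

        // Threads in the block cooperate through shared memory.
        extern __shared__ complex_type shared_mem[];
        FFT().execute(thread_data, shared_mem);

        // Store the results back.
        index = blockIdx.x * cufftdx::size_of<FFT>::value + threadIdx.x;
        for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
            data[index] = thread_data[i];
            index += FFT::stride;
        }
    }

A launch would then use the descriptor's suggested parameters, e.g. block_fft_kernel<FFT><<<num_ffts, FFT::block_dim, FFT::shared_memory_size>>>(data). Replacing cufftdx::Block() with cufftdx::Thread() yields a per-thread descriptor instead.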

cuFFTDx supports selected FFT sizes in the range [0; max_size], where max_size depends on precision and CUDA architecture as presented in the table below, as well as all FFT sizes in the range [0; max_size_fp64/2], where max_size_fp64 is the maximum double-precision FFT size for a given CUDA architecture. However, not every combination of size, precision, elements per thread, and FFTs per block is valid and available. The following table summarizes the available configurations:

All entries below apply to each transform type: complex-to-complex, real-to-complex, and complex-to-real.

    Precision | Thread FFT Sizes       | Architecture  | Block FFT Sizes
    ----------+------------------------+---------------+----------------
    half      | All sizes in [2; 32]   | 75            | [2; 4096]
              |                        | 70;72;86;89   | [2; 16384]
              |                        | 80;87;90      | [2; 32768]
    float     | All sizes in [2; 32]   | 75            | [2; 4096]
              |                        | 70;72;86;89   | [2; 16384]
              |                        | 80;87;90      | [2; 32768]
    double    | All sizes in [2; 16]   | 75            | [2; 2048]
              |                        | 70;72;86;89   | [2; 8192]
              |                        | 80;87;90      | [2; 16384]

Note

cuFFTDx 0.3.0 added preliminary support for all sizes in the range [0; max_size_fp64/2]. Most sizes require creating an additional workspace backed by a global memory allocation. See Make Workspace Function for more details about workspace. You can check whether a given FFT requires a workspace with the FFT::requires_workspace trait.
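A minimal host-side sketch, assuming FFT is a block descriptor like the one above:

    // make_workspace allocates any global-memory scratch the FFT needs and
    // reports failures through the cudaError_t output argument.
    cudaError_t error = cudaSuccess;
    auto workspace = cufftdx::make_workspace<FFT>(error);
    // ... check error before launching ...

    // Inside the kernel, the workspace is passed to execute():
    //     FFT().execute(thread_data, shared_mem, workspace);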

Hint

Since cuFFTDx 1.1.0, whether an FFT description is supported on a given CUDA architecture can be checked using cufftdx::is_supported.
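For example (700 denoting the sm_70 architecture):

    // Compile-time check whether the description runs on sm_70.
    static_assert(cufftdx::is_supported<FFT, 700>::value,
                  "this FFT description is not supported on sm_70");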

A workspace is not required for FFTs of the following sizes:

  • Powers of 2 up to 32768

  • Powers of 3 up to 19683

  • Powers of 5 up to 15625

  • Powers of 6 up to 1296

  • Powers of 7 up to 2401

  • Powers of 10 up to 10000

  • Powers of 11 up to 1331

  • Powers of 12 up to 1728
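For instance, a power-of-2 size such as 4096 should report no workspace requirement via the FFT::requires_workspace trait (descriptor values as in the earlier sketch):

    using FFT4096 = decltype(cufftdx::Size<4096>()
                             + cufftdx::Precision<float>()
                             + cufftdx::Type<cufftdx::fft_type::c2c>()
                             + cufftdx::Direction<cufftdx::fft_direction::forward>()
                             + cufftdx::Block()
                             + cufftdx::SM<800>());
    static_assert(!FFT4096::requires_workspace,
                  "power-of-2 sizes up to 32768 do not require a workspace");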

In future versions of cuFFTDx:
  • The workspace requirement may be removed for more configurations.

  • FFT configurations that do not require a workspace will continue to work without one.

Functionality not yet supported includes:
  • Input/output stored in global memory. Input data must currently reside in registers or shared memory.

  • The BlockDim Operator, which would enable fine-grained customization of the CUDA block dimensions.