Requirements and Functionality¶
Requirements¶
The cuFFTDx library is a CUDA C++ header-only library, so the list of software required to use it is short. Users need:
CUDA Toolkit 11.0 or newer
Supported CUDA compiler
Supported host compiler (C++17 required)
(Optionally) CMake (version 3.18 or greater)
Supported Compilers¶
CUDA Compilers:
NVCC 11.0.194+ (CUDA Toolkit 11.0 or newer)
(Experimental support) NVRTC 11.0.194+ (CUDA Toolkit 11.0 or newer)
Host / C++ Compilers:
GCC 7+
Clang 9+ (only on Linux/WSL2)
Compiling with MSVC (Windows) is not supported
Note
cuFFTDx emits errors for unsupported compiler versions; these can be silenced by defining CUFFTDX_IGNORE_DEPRECATED_COMPILER
during compilation. cuFFTDx is not guaranteed to work with compiler versions it does not support.
Note
cuFFTDx emits errors for unsupported versions of the C++ standard; these can be silenced by defining CUFFTDX_IGNORE_DEPRECATED_DIALECT
during compilation. cuFFTDx is not guaranteed to work with C++ standard versions it does not support.
Supported Functionality¶
Supported functions include:
Create block descriptors that run collective FFT operations (with one or more threads collaborating to compute one or more FFTs) in a single CUDA block. See Block Operator.
Create thread descriptors that run a single FFT operation per thread. This mode may require more cuFFTDx expertise to obtain correct results with higher performance. See Thread Operator.
Bi-directional information flow: from the user to the descriptor via Operators, and from the descriptor back to the user via Traits (illustrated in the sketch after this list).
Target specific GPU architectures using the SM Operator. This lets users configure the descriptor with suggested parameters to target performance.
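For illustration, here is a minimal sketch of how these operators and traits might be combined; the size, precision, direction, and SM<800>() target below are arbitrary example choices, not requirements.

```cpp
#include <cufftdx.hpp>
using namespace cufftdx;

// Block descriptor: the threads of one CUDA block cooperate on two 128-point
// complex-to-complex forward FFTs, tuned for architecture 80 (example values).
using BlockFFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                          + Direction<fft_direction::forward>()
                          + FFTsPerBlock<2>() + SM<800>() + Block());

// Thread descriptor: every thread computes its own 16-point FFT.
using ThreadFFT = decltype(Size<16>() + Precision<float>() + Type<fft_type::c2c>()
                           + Direction<fft_direction::forward>() + Thread());

// Traits return information from the descriptor to the user.
using complex_type = BlockFFT::value_type;                        // complex value type
constexpr unsigned int registers_per_thread = BlockFFT::storage_size;
constexpr unsigned int shared_memory_bytes  = BlockFFT::shared_memory_size;
// BlockFFT::block_dim additionally suggests kernel launch dimensions.
```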
cuFFTDx supports selected FFT sizes in the range [0; max_size] and all sizes in the range [0; max_size/2], where max_size depends on the precision, type, and CUDA architecture. However, not every combination of size, precision, elements per thread, and FFTs per block is correct and available. The following table summarizes the available configurations (a short example follows the table):
| Type | Precision | Thread FFT Sizes | Block FFT Sizes (Architecture) | Block FFT Sizes (Size Range) |
|------|-----------|------------------|--------------------------------|------------------------------|
|      | half      | All sizes in range: [2; 32] | 75         | [2; 4096]  |
|      |           |                             | 70; 72; 86 | [2; 16384] |
|      |           |                             | 80         | [2; 32768] |
|      | float     | All sizes in range: [2; 32] | 75         | [2; 4096]  |
|      |           |                             | 70; 72; 86 | [2; 16384] |
|      |           |                             | 80         | [2; 32768] |
|      | double    | All sizes in range: [2; 16] | 75         | [2; 2048]  |
|      |           |                             | 70; 72; 86 | [2; 8192]  |
|      |           |                             | 80         | [2; 16384] |
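To show how the table is read, the sketch below (again with otherwise arbitrary parameters) describes the largest double-precision block FFT listed for architecture 80; on architecture 75 the same size would exceed the [2; 2048] limit for double precision.

```cpp
#include <cufftdx.hpp>
using namespace cufftdx;

// Largest double-precision block FFT size listed in the table for architecture 80.
// The same Size<16384>() would not be available with SM<750>(), where the table
// caps double-precision block FFTs at 2048.
using MaxDoubleBlockFFT = decltype(Size<16384>() + Precision<double>()
                                   + Type<fft_type::c2c>()
                                   + Direction<fft_direction::forward>()
                                   + SM<800>() + Block());
```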
Note
cuFFTDx 0.3.0 added preliminary support for all sizes in the range [0; max_size/2]. Most of these sizes require you to create an additional workspace backed by a global memory allocation. See Make Workspace Function for more details about the workspace. You can check whether a given FFT requires a workspace with the FFT::requires_workspace trait (a short sketch follows below).
Workspace is not required for FFTs of the following sizes:
Powers of 2 up to 32768
Powers of 3 up to 19683
Powers of 5 up to 15625
Powers of 6 up to 1296
Powers of 7 up to 2401
Powers of 10 up to 10000
Powers of 11 up to 1331
Powers of 12 up to 1728
In future versions of cuFFTDx:
Workspace requirements may be removed for other configurations.
FFT configurations that do not require a workspace today will continue not to require one.
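As a host-side sketch (assuming a block descriptor FFT defined as in the earlier examples), the workspace is created with make_workspace and later passed to the execution method inside the kernel; the exact interface is documented in the Make Workspace Function section.

```cpp
#include <cufftdx.hpp>
#include <cuda_runtime_api.h>
#include <stdexcept>

// Host-side sketch: create the global-memory workspace for an FFT description
// and treat allocation errors as fatal only when a workspace is actually needed.
template<class FFT>
auto create_workspace_if_needed() {
    cudaError_t error = cudaSuccess;
    // make_workspace<FFT> allocates the global memory required by this FFT size.
    auto workspace = cufftdx::make_workspace<FFT>(error);
    // FFT::requires_workspace is false for the sizes listed above
    // (for example, powers of 2 up to 32768).
    if (FFT::requires_workspace && (error != cudaSuccess)) {
        throw std::runtime_error("cuFFTDx workspace allocation failed");
    }
    // Inside the kernel the workspace is passed as an extra argument:
    //   FFT().execute(thread_data, shared_mem, workspace);
    return workspace;
}
```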
Functionality not yet supported includes:
Input/output stored in global memory. Input data must be placed in registers (local memory) or shared memory, as the sketch below illustrates.
The BlockDim Operator, which would enable fine-grained customization of the CUDA block dimensions.
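For example, a block FFT kernel under the current constraints first copies its portion of the input from global memory into registers, runs the collective FFT (which also uses shared memory), and writes the result back. The indexing below is a simplified sketch that assumes one complex-to-complex FFT per block and in-range indices; real code should follow the data layout described by the library's traits and examples.

```cpp
#include <cufftdx.hpp>

// Sketch of a block FFT kernel: global memory is touched only by the explicit
// load/store loops, while FFT::execute() operates on registers and shared memory.
// Assumes FFTsPerBlock<1>() and a C2C block descriptor (see earlier sketches).
template<class FFT>
__launch_bounds__(FFT::max_threads_per_block)
__global__ void block_fft_kernel(typename FFT::value_type* data) {
    using complex_type = typename FFT::value_type;

    extern __shared__ __align__(alignof(float4)) char shared_mem[];
    complex_type thread_data[FFT::storage_size]; // per-thread registers

    // Load this thread's strided elements from global memory into registers.
    const unsigned int stride = cufftdx::size_of<FFT>::value / FFT::elements_per_thread;
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i] = data[threadIdx.x + i * stride];
    }

    // Collective FFT across the block, using registers and shared memory.
    FFT().execute(thread_data, shared_mem);

    // Store the results back to global memory.
    for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
        data[threadIdx.x + i * stride] = thread_data[i];
    }
}
```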