Operators#

Operators are used to describe the FFT operation to solve, and to configure the execution. They are divided into Description Operators and Execution Operators.


Description Operators#

Operator

Default value

Description

Size<unsigned int S>

Not set.

Size S of the FFT to calculate.

Direction<fft_direction>

Not set.

Direction of the FFT, either fft_direction::inverse or fft_direction::forward.

Type<fft_type>

fft_type::c2c

Types of input and output data (C2C, R2C, C2R).

Precision<P>

float

Precision P of the floating-point values used to compute the FFT: double, float or __half.

SM<unsigned int CC>

Not set.

Target CUDA architecure for which the FFT function should be generated.

RealFFTOptions<complex_layout, real_mode>

complex_layout::natural and real_mode::normal.

Complex data layout choice and execution mode for R2C/C2R FFT types

experimental::CodeType<experimental::code_type>

experimental::code_type::ptx

Code type of the FFT compute device function, either experimental::code_type::ptx or experimental::code_type::ltoir.

Description operators define the FFT operation to be solved. Combined with Execution Operators, they form a complete FFT descriptor that can be executed on a GPU.

Operators are added to construct the FFT descriptor type. For example, for a forward FFT operation consisting of a FFT with 8 double elements per thread:

#include <cufftdx.hpp>

using FFT = decltype( cufftdx::Size<8>() + cufftdx::Type<fft_type::c2c>()
                    + cufftdx::Direction<fft_direction::forward>()
                    + cufftdx::Precision<double>() + cufftdx::Thread() );
For an FFT descriptor to be complete, the following is required:

Size Operator#

cufftdx::Size<unsigned int S>()

Sets the size S of the FFT operation to compute.

There is no default size.

Restrictions:
  • S must be greater than 1.

Direction Operator#

cufftdx::Direction<cufftdx::fft_direction>()

Sets the direction of the FFT, either fft_direction::inverse or fft_direction::forward.

There is no default direction.

If the FFT is constructed with the Type<R2C> operator, direction is assumed to be forward and a direction operator is not necessary.

If the FFT is constructed with the Type<C2R> operator, direction is assumed to be inverse and a direction operator is not necessary.

Restrictions:
  • fft_direction::forward requires Type<C2C> or Type<R2C>.

  • fft_direction::inverse requires Type<C2C> or Type<C2R>.

cuFFTDx performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the size of the FFT. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.

Type Operator#

cufftdx::Type<cufftdx::fft_type>()

Sets the type of the FFT to compute, either fft_type::c2c for complex-to-complex; fft_type::r2c for real-to-complex; or fft_type::c2r for complex-to-real.

The default type is fft_type::c2c.

Restrictions:
  • fft_type::r2c requires fft_direction::forward. If no direction is specified, it is assumed to be fft_direction::forward.

  • fft_type::c2r requires fft_direction::inverse. If no direction is specified, it is assumed to be fft_direction::inverse.

  • cuFFDx performs unnormalized Fast Fourier Transform calculations.

Precision Operator#

cufftdx::Precision<__half>()

cufftdx::Precision<float>()

cufftdx::Precision<double>()

Sets the floating-point precision used to compute the FFT. This is the type of the values used for input and output, as well as the underlying type of the values used to compute the FFT.

The default precision is float.

SM Operator#

cufftdx::SM<unsigned int CC>()

Sets the target architecture CC for the underlying FFT function to use. Supported architectures are:

  • Volta: 700 (sm_70).

  • Turing: 750 (sm_75).

  • Ampere: 800, 860 and 870 (sm_80, sm_86, sm_87).

  • Ada: 890 (sm_89).

  • Hopper: 900 (sm_90).

  • Blackwell: 1000, 1010, 1030, 1200, 1210 (sm_100, sm_101, sm_103, sm_120, sm_121).

Warning

Starting with cuFFTDx 1.5.0, support for NVIDIA Xavier Tegra SoC (SM<720> or sm_72) is deprecated.

Warning

Support for architectures sm_103, and sm_121 is experimental in this release.

Note

When compiling for XYa or XYf compute capability use XY0 in the SM operator (see also CUDA C++ Programming Guide: Feature Availability).

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

RealFFTOptions Operator#

cufftdx::RealFFTOptions<complex_layout, real_mode>()

The first argument defines the input and output data layouts for real-to-complex (fft_type::r2c) and complex-to-real (fft_type::c2r) FFT types, which is explained in detail in Complex Element Layouts for complex inputs and outputs and Real Element Layouts for real inputs and outputs.

Possible complex_layout values:

  • complex_layout::natural (default)

  • complex_layout::packed

  • complex_layout::full

The second argument, a real_mode enum value, allows to choose between optimized and non optimized execution kernel. There are two modes: normal (default), and optimized, which is called folded. The optimized execution is an opt-in, as it changes how the data is partitioned between threads in block execution.

Possible real_mode values:

  • real_mode::normal (default)

  • real_mode::folded

All traits necessary for correct I/O are shared by the topmost FFT type. Please refer to examples (e.g. simple_fft_block_r2c and simple_fft_block_c2r) for a direct showcase of proper idioms.

CodeType Operator#

cufftdx::experimental::CodeType<cufftdx::experimental::code_type>()

Sets the type of code to use for kernel generation. Available options are:

  • experimental::code_type::ptx (default) - Use inlined-PTX implementation of the FFT compute device functions.

  • experimental::code_type::ltoir - If available, use LTOIR implementation of the FFT compute device functions.

The default code type is experimental::code_type::ptx.

Note

If experimental::code_type::ltoir code type is specified but not available, cuFFTDx will fall back to using inlined-PTX implementation. On the other hand, if experimental::code_type::ptx code type is specified, cuFFTDx will not use LTOIR implementation even if it is available.


Execution Operators#

Operator

Default value

Description

Thread

Not set.

Creates FFT thread execution object.

Block

Not set.

Creates FFT block execution object. See Block Configuration Operators.

Execution operators configure how the FFT operation will run on the GPU. Combined with Description Operators, they form a complete FFT descriptor that can be executed on a GPU.

Operators are added to construct the FFT descriptor type. For example, for a forward FFT operation consisting of two FFTs with 128 float elements each, running simultaneously in one CUDA block:

#include <cufftdx.hpp>

using FFT = decltype( cufftdx::Size<128>() + cufftdx::Type<fft_type::c2c>()
                    + cufftdx::Direction<fft_direction::forward>()
                    + cufftdx::Precision<float>() + cufftdx::Block()
                    + cufftdx::ElementsPerThread<8>() + cufftdx::FFTsPerBlock<2>() );

Thread Operator#

cufftdx::Thread()

Sets the FFT operation to run in a thread context. The FFT operation will simultaneously run a single, independent FFT (described using Description Operators) per thread.

Each thread will compute one FFT of the size defined by the Size Operator.

Restrictions:
  • Is mutually exclusive with Block operator

  • Compilation will fail when used with block-only operators: FFTsPerBlock, ElementsPerThread, BlockDim.

  • With Precision<__half> restricts Size to range \([2, 32]\).

  • With Precision<float> restricts Size to range \([2, 32]\).

  • With Precision<double> restricts Size to range \([2, 16]\).

Block Operator#

cufftdx::Block()

Generates a collective FFT operation to run in a single CUDA block. One or more threads will cooperate to compute the collective FFT operation.

The number of FFTs to compute, as well as the number of threads used to calculate each FFT, can be configured using Block Configuration Operators.

Restrictions:
  • Is mutually exclusive with Thread operator

  • Unless a BlockDim Operator is used, the collective FFT operation can only be executed inside a 2D block of sizes:

    • blockDim.x = size_of<Description>::value/Description::elements_per_thread.

    • blockDim.y = Description::ffts_per_block/Description::implicit_batching.

    • blockDim.z = 1.

  • BlockDim Operator is not implemented yet.

  • Operator cufftdx::Precision<__half>() restricts cufftdx::Size<U>() to range \([2, 32768]\).

  • Operator cufftdx::Precision<float>() restricts cufftdx::Size<U>() to range \([2, 32768]\).

  • Operator cufftdx::Precision<double>() restricts cufftdx::Size<U>() to range \([2, 16384]\).

Block Configuration Operators#

Operators

Default value

Description

FFTsPerBlock<unsigned int F>

1

Number F of FFTs calculated per CUDA block.

ElementsPerThread<unsigned int E>

Heuristic.

Number E of FFT values per CUDA thread.

BlockDim<unsigned int X, Y, Z>

(size/elements_per_thread, ffts_per_block/implicit_batching, 1)

Required for executing block FFT within block with custom dimensions.

experimental::DynamicBatching

false

Configures the FFT execution to use dynamic batching.

Block-configuration operators allow the user to tune how the collective FFT operation will run on a single CUDA block. When no block-configuration operators are used, the FFT execution will use the default block dimensions.

Note

Block configuration operators can only be used with Block Operator.

Warning

It is not guaranteed that executions of the same FFTs (size, direction, type, precision) but with different

will produce bit-identical results.

FFTs Per Block Operator#

cufftdx::FFTsPerBlock<unsigned int>()

Sets the number of FFT to compute in parallel within a single CUDA block. Each FFT is computed concurrently by a separate group of threads.

The default is one FFT per block.

Elements Per Thread Operator#

cufftdx::ElementsPerThread<unsigned int>()

Sets the number of FFT elements to be computed by each thread.

The default is determined heuristically to target performance.

Restrictions:

  • If FFT::requires_workspace is false, it must be a divisor of the requested FFT size.

  • If FFT::requires_workspace is true, it must be a power of two smaller than the size of the FFT.

  • Must be in range \([2; 32]\) for cufftdx::Precision<float>() and cufftdx::Precision<__half>().

  • Must be in range \([2; 16]\) for cufftdx::Precision<double>().

BlockDim Operator#

struct cufftdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()

Note

BlockDim operator is not officially supported with CUDA Toolkit 13.1.0 (NVCC 13.1.80)and earlier. If you are using cuFFTDx with a previous version of nvcc, and you want to use the BlockDim operator, you can:

  • Define the CUFFTDX_IGNORE_BLOCK_DIM_UNSUPPORTED macro to ignore these assertions and use the BlockDim operator.

  • Add the -Xptxas -O1 flag to the compilation command and verify correctness of the results every time.

From version 1.6.0 onwards it is possible to set the CUDA block size to (X, Y, Z), to configure the FFT execution.

Using this operator, the user can adapt the FFT execution to use custom block dimensions.

The default block dimensions for an FFT are:

  • blockDim.x = size_of<Description>::value/Description::elements_per_thread.

  • blockDim.y = Description::ffts_per_block/Description::implicit_type_batching.

  • blockDim.z = 1.

The use of this operator does not guarantee that the FFT execute will make use of the full block dimensions but it ensures a correct execution of the FFT when the kernel is launched with the specified block dimensions. When setting custom block dimensions with BlockDim<X, Y, Z> certain restrictions apply:

  • X must be larger or equal than size_of<Description>::value/Description::elements_per_thread.

  • Y must be larger or equal than Description::ffts_per_block/Description::implicit_type_batching.

  • Runtime block dimensions must be equal to those set in the BlockDim operator.

  • experimental::CodeType operator must be set to experimental::code_type::ptx.

Examples

FFT Size<128>, ElementsPerThread<8>, FFTsPerBlock<1>, BlockDim<64> - OK
FFT Size<128>, ElementsPerThread<8>, FFTsPerBlock<1>, BlockDim<8> - INCORRECT, 8 is less than size_of<Description>::value/Description::elements_per_thread (16)
FFT Size<128>, ElementsPerThread<8>, FFTsPerBlock<6>, BlockDim<64, 2, 1> - INCORRECT, 2 is less than Description::ffts_per_block (6)
FFT Size<128>, ElementsPerThread<8>, FFTsPerBlock<6>, Precision<__half>, BlockDim<64, 3, 1> - OK, due to implicit batching caused by the use of __half precision.

If only BlockDim<X, Y, Z> is provided, the FFT execution will be configured to use the best possible elements_per_thread and ffts_per_block that fit the block dimension. For example, for an FFT Size<128>, Precision<float> and BlockDim<32,4,1>, the FFT execution will be configured to use the best elements per thread that makes 32 >= size_of<Description>::value / elements_per_thread. For the case of ffts_per_block, the best value will be 4. These values can be overridden by adding the ElementsPerThread and FFTsPerBlock operators. Since not all threads will participate equally in the FFT execution, we provide a new trait to properly handle input output for the participating threads. Check the ::Working group<working-group-trait-label> for more information.

See FFT::block_dim.

Dynamic Batching#

cufftdx::experimental::DynamicBatching()

When added to the description, the FFT execution will be configured to use dynamic batching.

When the FFT execution is configured to use dynamic batching, the number of FFTs per block is not fixed and can be set at runtime depending on the y dimension of the runtime block. The DynamicBatching operator is not compatible with the BlockDim Operator operator or the FFTs Per Block Operator operator.

Examples

FFT Size<128>, ElementsPerThread<8>, DynamicBatching, runtime block dimensions (16, 8, 1) - Executed FFTs per block: 8
FFT Size<128>, ElementsPerThread<8>, runtime block dimensions (16, 1, 1) - Executed FFTs per block: 1

Note

The dynamic batching operator is experimental and may be changed or removed in future releases.

Warning

When using the DynamicBatching operator, the number of FFTs per block is not fixed at compile time. The user must allocate sufficient shared memory based on the actual number of FFTs to be computed.

To compute the required shared memory:

where ffts_per_block is the runtime number of FFTs you intend to compute per block.

Warning

Only the y dimension of the runtime block can be dynamic. Runtime block dimensions for x and z must coincide with the FFT::block_dim trait dimensions obtained from the execution in order to ensure correct results.