Execution Methods

These methods are used to run the FFT operation.

A code example:

#include <cufftdx.hpp>

using FFT = decltype( cufftdx::Size<128>() + cufftdx::Type<cufftdx::fft_type::c2c>()
                    + cufftdx::Direction<cufftdx::fft_direction::forward>()
                    + cufftdx::Precision<float>() + cufftdx::Block() );

using complex_type = typename FFT::value_type;

__global__ void kernel(... /* arguments */) {

  // Shared memory pointer
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  // Register data
  complex_type thread_data[FFT::storage_size];

  // Load data into registers (thread_data)
  // ...

  FFT().execute(thread_data, shared_mem);

  // Store results (thread_data) into global memory
}

Thread Execute Method

void FFT().execute<typename T>(T* input)

Runs the FFT operation defined by the FFT descriptor. T can be any type (such as float2 or double2), as long as its alignment and element size are the same as those of FFT::value_type.

This method is available if the descriptor has been constructed using the Thread Operator and cufftdx::is_complete_fft_execution is true.

The input array should reside in registers and must be able to hold FFT::storage_size elements of type FFT::value_type.

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.
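For completeness, a minimal sketch of thread execution follows; the 8-point size, the descriptor parameters, and the launch arithmetic are illustrative placeholders, not part of the API:

#include <cufftdx.hpp>

using FFT = decltype( cufftdx::Size<8>() + cufftdx::Type<cufftdx::fft_type::c2c>()
                    + cufftdx::Direction<cufftdx::fft_direction::forward>()
                    + cufftdx::Precision<float>() + cufftdx::Thread() );

__global__ void thread_fft_kernel(typename FFT::value_type* data) {
  // One thread computes one FFT entirely in registers
  typename FFT::value_type thread_data[FFT::storage_size];

  // Each thread loads its own contiguous sequence
  const auto offset = (blockIdx.x * blockDim.x + threadIdx.x) * cufftdx::size_of<FFT>::value;
  for (unsigned int i = 0; i < cufftdx::size_of<FFT>::value; ++i) {
    thread_data[i] = data[offset + i];
  }

  FFT().execute(thread_data);

  for (unsigned int i = 0; i < cufftdx::size_of<FFT>::value; ++i) {
    data[offset + i] = thread_data[i];
  }
}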

Block Execute Method

// #1
void FFT().execute<typename T>(T* input, void* shared_memory, FFT::workspace_type& workspace)

// #2: Version of #1 for FFTs which don't require workspace
void FFT().execute<typename T>(T* input, void* shared_memory)

// #3: Execute with input data in shared memory
void FFT().execute<typename T>(T* shared_memory_input, FFT::workspace_type& workspace)

// #4: Version of #3 for FFTs which don't require workspace
void FFT().execute<typename T>(T* shared_memory_input)

Runs the FFT operation defined by the FFT descriptor. T can be any type (such as float2 or double2), as long as its alignment and element size are the same as those of FFT::value_type.

This method is available if the descriptor has been constructed using the Block Operator and cufftdx::is_complete_fft_execution is true.

When FFT::requires_workspace is false, overloads #2 and #4 can be used. Otherwise, the user has to use overload #1 or #3 and pass a reference to a workspace.
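Since FFT::requires_workspace is a compile-time constant, the choice of overload can be made with C++17 if constexpr; a minimal dispatch sketch, assuming thread_data, shared_mem, and workspace are declared as in the surrounding examples:

if constexpr (FFT::requires_workspace) {
    FFT().execute(thread_data, shared_mem, workspace); // overload #1
} else {
    FFT().execute(thread_data, shared_mem);            // overload #2
}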

Warning

The library code assumes that both shared_memory and shared_memory_input are aligned to 128 bits for optimal memory operations. This can be accomplished by using either the __align__ or alignas compiler directives. Proper alignment is demonstrated in the example simple_fft_block_shared. Please consult the CUDA C++ Programming Guide for further details on memory alignment. Although not required, the user may also consider aligning thread-local arrays in the same way to reduce kernel resource usage.

Example

// Examples of how to make sure shared memory pointer is aligned to 128 bits (16 bytes):
using value_type = typename FFT::value_type;
extern __shared__ alignas(float4) unsigned char shared_mem[];                               // 1
extern __shared__ __align__(16) value_type shared_mem[];                                    // 2
extern __shared__ __align__(alignof(float4)) value_type shared_mem[];                       // 3
extern __shared__ float4 shared_mem[];                                                      // 4

// Warning: std::aligned_storage became deprecated in C++23
extern __shared__ std::aligned_storage_t<sizeof(float4), alignof(float4)> shared_mem[];     // 5

In methods #1 and #2, input is in thread-local arrays, and shared_memory is a pointer to shared memory of size FFT::shared_memory_size bytes. The operation is in-place, meaning the results are stored in input. input must be able to hold FFT::storage_size elements of type FFT::value_type.

Note

Methods #1 and #2 don’t assume that shared memory (shared_memory) is safe to modify or access without block synchronization, and they perform the required synchronization (__syncthreads()) before its first use. They don’t, however, synchronize threads within a block after the last operation on shared memory. If that shared memory is going to be reused later, a synchronization has to be performed first.

In methods #3 and #4 the input data is passed in shared memory (shared_memory_input). The operation is in-place, meaning the results are stored back to shared_memory_input. These methods don’t require an additional shared_memory pointer to be passed, as shared_memory_input will be used for the required communication between threads. Thus, shared_memory_input must fit all input and output values and can’t be smaller than FFT::shared_memory_size bytes; i.e., the shared memory size in bytes is the maximum of FFT::shared_memory_size, FFT::ffts_per_block * <FFT_input_size_in_bytes>, and FFT::ffts_per_block * <FFT_output_size_in_bytes>.
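A sketch of computing that size at compile time, using the IO traits described later in this section (FFT::input_length, FFT::output_length, FFT::input_type, FFT::output_type); treat the exact expression as an illustration of the rule rather than library-provided code:

#include <algorithm> // std::max

constexpr size_t input_bytes  = FFT::ffts_per_block * FFT::input_length  * sizeof(typename FFT::input_type);
constexpr size_t output_bytes = FFT::ffts_per_block * FFT::output_length * sizeof(typename FFT::output_type);
constexpr size_t shared_bytes = std::max({size_t(FFT::shared_memory_size), input_bytes, output_bytes});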

Note

Methods #3 and #4, which receive input via shared memory, assume that a synchronization was already performed and the data can be safely accessed. These methods don’t synchronize threads within a block after the last operation on shared memory. Before reading from or writing to that shared memory, a synchronization has to be performed first.

Warning

It is not guaranteed that executions of the same FFTs (size, direction, type, precision) but with different execution configurations (for example, a different execution mode, ElementsPerThread, or FFTsPerBlock) will produce bit-identical results.

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

Shared Memory Usage

It’s important to note that large FFTs may require more than 48 KB of shared memory per CUDA block. Therefore, as described in the CUDA Programming Guide (#1, #2, #3), kernels with such FFTs must use dynamic shared memory rather than statically sized shared memory arrays. Additionally, these kernels must explicitly opt in by using cudaFuncSetAttribute() to set cudaFuncAttributeMaxDynamicSharedMemorySize. See the example code below and the introduction example.

#include <cufftdx.hpp>
using namespace cufftdx;

using FFT = decltype(Size<16384>() + Precision<float>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + SM<800>() + Block());

__global__ void block_fft_kernel(FFT::value_type* data) {
  // dynamic shared memory
  extern __shared__ __align__(alignof(float4)) FFT::value_type shared_mem[];

  (...)
}

void example() {
  (...)

  // Increases the max dynamic shared memory size to match FFT requirements
  cudaFuncSetAttribute(block_fft_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, FFT::shared_memory_size);
  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<<<1, FFT::block_dim, FFT::shared_memory_size>>>(data);

  (...)
}

Make Workspace Function

template<class FFT>
auto cufftdx::make_workspace<FFT>(cudaError_t& error, cudaStream_t stream = 0)

cufftdx::make_workspace<FFT>(cudaError_t&, cudaStream_t) is a helper function for creating the workspace required by the block execute(...) method when FFT::requires_workspace is true. FFT is the type of the FFT descriptor. If no stream argument is passed, the default stream 0 is used. If, after calling the function, error is not cudaSuccess, the workspace was not created correctly and is invalid.

  • If the FFT::requires_workspace trait is false, the user doesn’t have to create a workspace.

  • A workspace can be created for an FFT with FFT::requires_workspace equal to false: such a workspace is an empty workspace with no global memory allocation.

  • A workspace object is valid only for the FFT it was created for.

  • A workspace object can allocate global memory, though never more than FFT::workspace_size, and it is responsible for freeing it.

  • A workspace can’t be used concurrently, since all copies share the same underlying global memory allocation. Using a workspace concurrently will result in memory races.

  • The allocated global memory is freed upon destruction of the last copy of the created workspace object.

  • A workspace object can be implicitly cast to FFT::workspace_type.

Note

Workspace is not required for FFTs of following sizes:

  • Powers of 2 up to 32768

  • Powers of 3 up to 19683

  • Powers of 5 up to 15625

  • Powers of 6 up to 1296

  • Powers of 7 up to 2401

  • Powers of 10 up to 10000

  • Powers of 11 up to 1331

  • Powers of 12 up to 1728

In future versions of cuFFTDx:
  • The workspace requirement may be removed for other configurations.

  • FFT configurations that do not require a workspace will continue to work without one.

Warning

The FFT::workspace_type object doesn’t track the lifetime of the underlying memory and is only valid within the lifetime of the workspace object it was cast from.

Warning

The type returned by cufftdx::make_workspace<FFT>(cudaError_t&, cudaStream_t) can be different for different FFT descriptions and is not the same as FFT::workspace_type. Users should use auto when creating a workspace object. Example:

// Kernel
template<class FFT>
__launch_bounds__(FFT::max_threads_per_block)
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
    // ...

    // Execute FFT
    FFT().execute(thread_data, shared_mem, workspace);
}

// Create workspace
cudaError_t error = 0;
auto workspace = cufftdx::make_workspace<FFT>(error, stream);

// ...

// Run kernel with FFT
block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size, stream>>>(data, workspace);

Value Formats and Data Layouts

To perform the FFT correctly, data needs to be passed to the library in a specific format, described in section Value Formats, and partitioned among threads, as detailed in section Data Layouts. The user is responsible for getting these two factors right, but the library provides idioms, traits, and examples which should make the API accessible and intuitive.

Value Formats

For complex numbers of single and double precision, the first value in a complex number is the real part and the second is the imaginary part.

The input value format for complex-to-complex and complex-to-real FFTs is the aforementioned complex type. For real-to-complex FFTs it depends on the RealFFTOptions Operator, specifically its real_mode enum value. By default the real-to-complex execution takes real values as input, but if the execution mode is set to real_mode::folded, the library performs an optimized execution which treats the real input as complex input of half the size. In this case the value format is complex (identical to that of complex-to-complex or complex-to-real), with each value containing two real values.

Similarly, the output value format for complex-to-complex and real-to-complex FFTs is the complex type. For complex-to-real FFTs it depends on the RealFFTOptions Operator, specifically its real_mode enum value. By default the complex-to-real execution outputs real values, but if the execution mode is set to real_mode::folded, the library performs an optimized execution which treats the real output as complex output of half the size. In this case the value format is complex (identical to that of complex-to-complex or complex-to-real); otherwise it consists of real values.

Half Precision Implicit Batching

Processing of half (fp16) precision FFTs in cuFFTDx is implicitly batched, that is, a single computation processes two FFT batches. cuFFTDx expects that a complex number of half precision has 2 real parts and 2 imaginary parts, in that order (i.e., real0, real1, imaginary0, imaginary1). Real values of half precision (for R2C and C2R FFTs) follow the same logic: each value contains two real values. See also the FFT::implicit_type_batching trait.

\[\begin{split}\text{complex half precision value:}\;\; & (real_0, real_1, imaginary_0, imaginary_1) \\ \text{real half precision value:}\;\; & (real_0, real_1) \\\end{split}\]
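A sketch of this layout in code; the struct and helper below are hypothetical illustrations of the (real0, real1, imaginary0, imaginary1) ordering, not cuFFTDx APIs:

#include <cuda_fp16.h>

// Hypothetical representation of one implicitly batched half-precision complex value
struct half2_complex {
    __half2 real;  // (real0, real1): real parts of batch 0 and batch 1
    __half2 imag;  // (imaginary0, imaginary1): imaginary parts of both batches
};

// Packs one complex value from each of the two batches
__device__ half2_complex pack_two_batches(float r0, float i0, float r1, float i1) {
    // __floats2half2_rn packs two floats into one __half2
    return { __floats2half2_rn(r0, r1), __floats2half2_rn(i0, i1) };
}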

Data Layouts

Input and output data layouts of an FFT depend strictly on its configuration and the selected transform type. A data layout describes the size and the pattern of input or output data for the execute methods, regardless of whether the input/output is partitioned between threads into thread-local arrays or stored contiguously in shared memory.

As the section Value Formats puts forth, the input/output formats are either real or complex. Another important execution attribute is the sequence length. In complex-to-complex transforms this property is equal to the transform size, but for real-to-complex and complex-to-real transforms the library offers a few options to choose from. This choice is made by means of the RealFFTOptions Operator.

The sections below provide detailed descriptions of the input and output data layouts for both thread and block execution modes.

Complex-to-complex

In the case of complex-to-complex transforms, both the input and the output data must be complex arrays of the corresponding precision. Both input and output arrays are always of the same length, equal to the size of the FFT.

Input/Output in Registers for Thread Execution

The input values of the FFT should be stored in input in the natural order. Results are stored in input following the same rule.

Input/Output in Registers for Block Execution

The n-th thread (indexing from 0) participating in the FFT should hold the following elements of the FFT in its input values: n + FFT::stride * i, where i is the index within input. Results are later stored in input following the same rule. Please note that in certain cases, especially if FFT::requires_workspace is true, the values are not partitioned equally between threads.

See also FFT::stride.

Example

The 0th thread of an 8-point FFT with FFT::stride equal to 2 should have values 0, 2, 4, and 6 in its input.

Elements are not always perfectly divisible among threads: the 0th thread of a 7-point FFT with FFT::stride equal to 2 should have values 0, 2, 4, and 6 in its input, while the 1st thread should have values 1, 3, and 5.
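As a sketch, the rule translates into the following load loop; global_in is a hypothetical pointer to this FFT's input batch in global memory, and thread_data is the register array from the earlier example:

for (unsigned int i = 0; i < FFT::elements_per_thread; ++i) {
  const auto idx = threadIdx.x + FFT::stride * i;
  if (idx < cufftdx::size_of<FFT>::value) {
    thread_data[i] = global_in[idx];
  }
}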

Input/Output in Shared Memory for Block Execution

The input values of the FFT should be stored in shared_memory_input in the natural order. Results are stored in shared_memory_input following the same rule.

Real-to-complex and complex-to-real

For real-to-complex (R2C) and complex-to-real (C2R) FFTs, the input and output data layouts depend on the RealFFTOptions Operator. By default RealFFTOptions is set to complex_layout::natural and real_mode::normal. A descriptor sketch follows the list below.

  • Complex layouts: complex_layout::natural (default), complex_layout::packed, complex_layout::full.

  • Real modes: real_mode::normal (default), real_mode::folded.
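A descriptor sketch using these options; the 128-point size and float precision are placeholders, and real_mode::folded requires a supported size (see the note further below):

using FFT = decltype( cufftdx::Size<128>() + cufftdx::Type<cufftdx::fft_type::r2c>()
                    + cufftdx::RealFFTOptions<cufftdx::complex_layout::natural,
                                              cufftdx::real_mode::folded>()
                    + cufftdx::Precision<float>() + cufftdx::Block() );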

Complex Element Layouts

Complex element layout is defined for the complex-to-real input and real-to-complex output. It depends on the passed complex_layout value of the RealFFTOptions operator.

  • complex_layout::natural for even length N: ((r0, i0 = 0), (r1, i1), … , (rN/2, iN/2 = 0)); only the non-redundant N/2 + 1 elements, with the first and last elements containing only real parts.

    • imag0 and imagN/2 are both assumed to be 0 due to mathematical properties of the complex-to-real FFT input signal.

  • complex_layout::natural for odd length N: ((r0, i0 = 0), (r1, i1), … , (r⌊N/2⌋, i⌊N/2⌋)); only the non-redundant ⌊N/2⌋ + 1 elements, with the first element containing only the real part.

    • imag0 is assumed to be 0 due to mathematical properties of the complex-to-real FFT input signal.

  • complex_layout::packed: ((r0, i0), (r1, i1), … , (rN/2 - 1, iN/2 - 1)); packs the last real element into the imaginary component of the first element.

    • Packing means that the first element x0 contains (real0, realN/2).

    • Allowed only for even sizes, e.g. for a 16-point FFT the length is 8; for a 15-point FFT it is unavailable.

  • complex_layout::full for even length N: ((r0, i0 = 0), (r1, i1), … , (rN/2, iN/2 = 0), … , (rN - 1, iN - 1)); all elements, including the redundant ones, since the output is Hermitian.

    • For example, for a 16-point FFT the length is 16.

    • imag0 and imagN/2 are both assumed to be 0 due to mathematical properties of the complex-to-real FFT input signal.

  • complex_layout::full for odd length N: ((r0, i0 = 0), (r1, i1), … , (rN - 1, iN - 1)); all elements, including the redundant ones, since the output is Hermitian.

    • For example, for a 15-point FFT the length is 15.

    • imag0 is assumed to be 0 due to mathematical properties of the complex-to-real FFT input signal.

Complex Input/Output in Registers for Thread Execution

Follows the same rules as described in Complex Input/Output in Shared Memory for Block Execution, but the input/output data is stored in a thread-local array.

Complex Input/Output in Registers for Block Execution

The n-th thread (indexing from 0) participating in the FFT should hold the following elements of the FFT in its input values: n + FFT::stride * i, where i is the index within input. Results are later stored in input following the same rule. Please note that in certain cases, especially if FFT::requires_workspace is true, the values are not partitioned equally between threads.

Example

For 16-point FP32/FP64 C2R FFT with ElementsPerThread set or defaulted to 4 (implying 4 threads, and FFT::stride equal to 4):

complex_layout::natural: the input layout (first row) and the required partitioning of the data (entire table) into thread-local arrays are presented below:

thread / element | (r0, i0 = 0) | (r1, i1) | (r2, i2) | (r3, i3) | (r4, i4) | (r5, i5) | (r6, i6) | (r7, i7) | (r8, i8 = 0)
0                |      X       |          |          |          |    X     |          |          |          |      X
1                |              |    X     |          |          |          |    X     |          |          |
2                |              |          |    X     |          |          |          |    X     |          |
3                |              |          |          |    X     |          |          |          |    X     |

complex_layout::packed: the input layout (first row) and the required partitioning of the data (entire table) into thread-local arrays are presented below:

thread / element | (r0, r8) | (r1, i1) | (r2, i2) | (r3, i3) | (r4, i4) | (r5, i5) | (r6, i6) | (r7, i7)
0                |    X     |          |          |          |    X     |          |          |
1                |          |    X     |          |          |          |    X     |          |
2                |          |          |    X     |          |          |          |    X     |
3                |          |          |          |    X     |          |          |          |    X

complex_layout::full: the input layout (first row) and the required partitioning of the data (entire table) into thread-local arrays are presented below:

thread / element | (r0, i0 = 0) | (r1, i1) | (r2, i2) | (r3, i3) | (r4, i4) | (r5, i5) | (r6, i6) | (r7, i7) | (r8, i8 = 0) | (r9, i9) | (r10, i10) | (r11, i11) | (r12, i12) | (r13, i13) | (r14, i14) | (r15, i15)
0                |      X       |          |          |          |    X     |          |          |          |      X       |          |            |            |     X      |            |            |
1                |              |    X     |          |          |          |    X     |          |          |              |    X     |            |            |            |     X      |            |
2                |              |          |    X     |          |          |          |    X     |          |              |          |     X      |            |            |            |     X      |
3                |              |          |          |    X     |          |          |          |    X     |              |          |            |     X      |            |            |            |     X

The layouts and the required partitioning look similar for R2C output.

Half-precision FFTs follow the same rules, but the implicit batching has to be taken into account.

Complex Input/Output in Shared Memory for Block Execution

The input values of the FFT should be stored in shared_memory_input in the natural order. Results are stored in shared_memory_input following the same rule.

Example

For 8-point FP32/FP64 C2R FFT:

  • complex_layout::natural input layout for shared_memory_input looks like this: [(real0, imag0 = 0), (r1, i1), (r2, i2), (r3, i3), (r4, i4 = 0)].

  • complex_layout::packed input layout for shared_memory_input looks like this: [(real0, real4), (r1, i1), (r2, i2), (r3, i3)].

  • complex_layout::full input layout for shared_memory_input looks like this: [(real0, imag0 = 0), (r1, i1), (r2, i2), (r3, i3), (r4, i4 = 0), (r5, i5), (r6, i6), (r7, i7)].

The layouts look similar for R2C output.

Half-precision FFTs follow the same rules, but the implicit batching has to be taken into account.

Real Element Layouts

Real element layout is defined for the complex-to-real output and real-to-complex input. It depends on the real_mode parameter of the RealFFTOptions operator:

  • real_mode::normal (x0, x1, … , xN - 1), a real element array of length equal to the transform size.

  • real_mode::folded ((x0, x1), (x2, x3), … , (xN - 2, xN - 1)), a complex element array of length equal to half the transform size.

Note

The physical layout of the elements in shared memory is the same as in the real_mode::normal case, but the logical one changes: instead of being an array of real values, it is an array of complex values.
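In other words, only the element type of the view changes; a sketch, with float and float2 standing in for the actual real and complex value types:

// Same bytes in shared memory, two logical views (sketch)
float*  normal_view = reinterpret_cast<float*>(shared_mem);  // x0, x1, x2, x3, ...
float2* folded_view = reinterpret_cast<float2*>(shared_mem); // (x0, x1), (x2, x3), ...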

Note

real_mode::folded execution depends on certain FFT characteristics. Currently, for block execution it is available only for sizes equal to 2^N, where N is any exponent which still makes the transform fit into the available size range. For thread execution this optimization supports all sizes equal to 2 * N, where N is any multiplier which makes the transform fit into the available size range. Using folded execution doubles the size limit available otherwise. Consult Supported Functionality for further details.

Real Input/Output in Registers for Thread Execution

Follows the same rules as described in Real Input/Output in Shared Memory for Block Execution, but the input/output data is stored in a thread-local array.

Real Input/Output in Registers for Block Execution

The n-th thread (indexing from 0) participating in the FFT should hold the following values of type Input Type Trait: n + FFT::stride * i, where i is the index within input. Results are later stored in input following the same rule. Please note that in certain cases, especially if FFT::requires_workspace is true, the values are not partitioned equally between threads. It is also important to remember that the count of values of this type per thread is given by the Input EPT Trait, not the Elements Per Thread Trait.

Example

For an 8-point FP32/FP64 R2C FFT executed by 4 threads (FFT::stride equal to 4):

real_mode::normal input layout looks like this: [real0, r1, r2, r3, r4, r5, r6, r7]. The required partitioning of the data into thread-local arrays is presented below:

thread / element | r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
0                | X  |    |    |    | X  |    |    |
1                |    | X  |    |    |    | X  |    |
2                |    |    | X  |    |    |    | X  |
3                |    |    |    | X  |    |    |    | X

real_mode::folded input layout looks like this: [(real0, r1), (r2, r3), (r4, r5), (r6, r7)]. The required partitioning of the data into thread-local arrays is presented below:

thread / element | (r0, r1) | (r2, r3) | (r4, r5) | (r6, r7)
0                |    X     |          |          |
1                |          |    X     |          |
2                |          |          |    X     |
3                |          |          |          |    X

The layouts and the required partitioning look similar for C2R output.

Half-precision FFTs follow the same rules, but the implicit batching has to be taken into account.

Real Input/Output in Shared Memory for Block Execution

The input values of the FFT should be stored in shared_memory_input in the natural order. Results are stored in shared_memory_input following the same rule.

Example

For 8-point FP32/FP64 R2C FFT:

  • real_mode::normal input layout for shared_memory_input looks like this: [real0, r1, r2, r3, r4, r5, r6, r7]

  • real_mode::folded input layout for shared_memory_input looks like this: [(real0, r1), (r2, r3), (r4, r5), (r6, r7)]

The layouts look similar for C2R output.

Half-precision FFTs follow the same rules, but the implicit batching has to be taken into account.

Loading and Storing Data

To deal with the complexity of input and output lengths that change depending on the configuration, the library provides various traits allowing for easy memory operations. The Input Length Trait and Output Length Trait describe the lengths of the input and output arrays, respectively. This covers the real-to-complex (R2C) and complex-to-real (C2R) cases described above. The Input EPT Trait describes the count of elements to be loaded by each thread, assuming that each element is of the type given by the Input Type Trait. The Output EPT Trait and Output Type Trait describe the same properties for storing output.
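A short sketch of querying these traits at compile time, assuming the spellings used in the code examples below (FFT::input_length and friends):

using input_t  = typename FFT::input_type;   // element type to load
using output_t = typename FFT::output_type;  // element type to store
constexpr auto input_length  = FFT::input_length;   // input_t elements per FFT
constexpr auto output_length = FFT::output_length;  // output_t elements per FFT
constexpr auto input_ept     = FFT::input_ept;      // input_t elements per thread
constexpr auto output_ept    = FFT::output_ept;     // output_t elements per thread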

The detailed idiomatic IO is shown in examples, but the general approach for loading data from global memory is as follows:

Example

Block FFT with Data in Registers

The following example shows a block trait-based universal loading scheme for the cuFFTDx register API execution mode. Here a thread group performs an FFT cooperatively, so the data needs to be spread among all participants. The following code snippet already accounts for this partitioning, as well as the different value formats (as described in Value Formats) and data layouts (as described in Real Element Layouts and Complex Element Layouts).

// Which FFT in this block is this thread performing
const auto local_fft_id = threadIdx.y;
// Which FFT in this grid is this thread performing
const auto global_fft_id = FFT::ffts_per_block * blockIdx.x + local_fft_id;
// Memory offset for accessing the first element of the global_fft
const auto global_offset = global_fft_id * FFT::input_length;

// Cast registers to type required as input to FFT execution
using input_t = typename FFT::input_type;
auto thread_fft_data = reinterpret_cast<input_t*>(thread_data);
auto fft_data = reinterpret_cast<const input_t*>(input);

auto index = threadIdx.x;
for (unsigned int i = 0; i < FFT::input_ept; ++i) {
  if (index < FFT::input_length) {
    thread_fft_data[i] = fft_data[global_offset + index];
  }
  index += FFT::stride;
}
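The matching store is symmetric; a sketch under the assumption that output is the destination pointer in global memory and the FFT execution has finished:

using output_t = typename FFT::output_type;
auto thread_fft_out = reinterpret_cast<const output_t*>(thread_data);
auto out_data       = reinterpret_cast<output_t*>(output);
const auto output_offset = global_fft_id * FFT::output_length;

auto out_index = threadIdx.x;
for (unsigned int i = 0; i < FFT::output_ept; ++i) {
  if (out_index < FFT::output_length) {
    out_data[output_offset + out_index] = thread_fft_out[i];
  }
  out_index += FFT::stride;
}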

Block FFT with Data in Shared Memory

The following example shows a block trait-based universal loading scheme for the cuFFTDx shared memory API execution mode. Here the entire CUDA block performs FFT::ffts_per_block FFTs cooperatively, so the data in shared memory needs to contain all necessary batches. To access memory in a coalesced manner, the loading of all batches is performed by the entire thread block. The following code snippet accounts for the different value formats (as described in Value Formats) and data layouts (as described in Real Element Layouts and Complex Element Layouts).

// The index of first FFT being performed by threads of this block
const auto block_fft_id = blockIdx.x * FFT::ffts_per_block;
// Offset in memory to the first element accessed by threads in this block
const auto block_offset = block_fft_id * FFT::input_length;
// Combined length of all FFTs performed by threads of this block
constexpr auto block_input_length = FFT::ffts_per_block * FFT::input_length;

// Cast shared memory and global memory pointers to the type required as input to FFT execution
using input_t = typename FFT::input_type;
auto shared_fft_data = reinterpret_cast<input_t*>(shared_memory);
auto fft_data = reinterpret_cast<const input_t*>(input);

// The entire block loads all required batches in a coalesced manner,
// threads will load elements from different batches than they will later
// execute on, and this is on purpose.
const auto stride = blockDim.x * blockDim.y;

auto index  = threadIdx.y * blockDim.x + threadIdx.x;
for (int i = 0; i < FFT::input_ept; ++i) {
    if (index < block_input_length) {
        shared_fft_data[index] = fft_data[block_offset + index];
    }
    index += stride;
}
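After this cooperative load, the data a thread needs may have been written by other threads, so a synchronization is required before calling the shared-memory execute overload (methods #3/#4 assume it already happened); a minimal sketch:

__syncthreads();
FFT().execute(reinterpret_cast<typename FFT::value_type*>(shared_memory)); // overload #4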

The following example shows a thread trait-based universal loading scheme for cuFFTDx thread execution. Here a single thread performs the entire FFT, so there is no need to partition the data; each thread loads its entire sequence. This already accounts for the different value formats (as described in Value Formats) and data layouts (as described in Real Element Layouts and Complex Element Layouts).

// This example assumes a 2-dimensional block and 1-dimensional grid

// Which FFT in block is this thread performing
const auto local_fft_id = threadIdx.y * blockDim.x + threadIdx.x;
// Which FFT in grid is this thread performing
const auto global_fft_id = (blockDim.x * blockDim.y) * blockIdx.x + local_fft_id;
// Memory offset for accessing first element of this FFT
const auto global_offset = global_fft_id * FFT::input_length;

// Cast registers to type required as input to FFT execution
using input_t = typename FFT::input_type;
auto thread_fft_data = reinterpret_cast<input_t*>(thread_data);
auto fft_data = reinterpret_cast<const input_t*>(input);

for (unsigned int i = 0; i < FFT::input_length; ++i) {
  thread_fft_data[i] = fft_data[global_offset + i];
}
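A matching store sketch for thread execution, assuming output is the destination pointer in global memory; each thread stores its whole output sequence:

using output_t = typename FFT::output_type;
auto thread_fft_out = reinterpret_cast<const output_t*>(thread_data);
auto out_data       = reinterpret_cast<output_t*>(output);
const auto output_offset = global_fft_id * FFT::output_length;

for (unsigned int i = 0; i < FFT::output_length; ++i) {
  out_data[output_offset + i] = thread_fft_out[i];
}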