Execution Methods

These methods are used to run the FFT operation.

A code example:

#include <cufftdx.hpp>

using FFT = decltype( cufftdx::Size<128>() + cufftdx::Type<cufftdx::fft_type::c2c>()
                    + cufftdx::Direction<cufftdx::fft_direction::forward>()
                    + cufftdx::Precision<float>() + cufftdx::Block() );

using complex_type = typename FFT::value_type;

__global__ void kernel(... /* arguments */) {

  // Shared memory pointer
  extern __shared__ complex_type shared_mem[];

  // Register data
  complex_type thread_data[FFT::storage_size];

  // Load data into registers (thread_data)
  // ...

  FFT().execute(thread_data, shared_mem);

  // Store results (thread_data) into global memory
}

Thread Execute Method

void FFT().execute<typename T>(T* input)

Runs the FFT operation defined by the FFT descriptor. T can be any type (such as float2 or double2), as long as its alignment and element size are the same as those of FFT::value_type.

This method is available if the descriptor has been constructed using the Thread Operator and cufftdx::is_complete_fft_execution is true.

The input array should reside in per-thread registers. input must fit FFT::storage_size elements of type FFT::value_type.

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.
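A minimal sketch of a thread-level kernel using this method (the FFT size, kernel name, and the global-memory indexing scheme are illustrative, not prescribed by cuFFTDx):

```cuda
#include <cufftdx.hpp>

// Thread-level FFT descriptor built with the Thread operator
// (size and precision chosen for illustration).
using FFT = decltype(cufftdx::Size<8>() + cufftdx::Type<cufftdx::fft_type::c2c>()
                     + cufftdx::Direction<cufftdx::fft_direction::forward>()
                     + cufftdx::Precision<float>() + cufftdx::Thread());

using complex_type = typename FFT::value_type;

// Each thread computes one independent 8-point FFT entirely in registers.
__global__ void thread_fft_kernel(complex_type* data) {
    complex_type thread_data[FFT::storage_size];

    // Load this thread's FFT input into registers (illustrative layout:
    // consecutive values per thread)
    for (unsigned int i = 0; i < FFT::storage_size; ++i) {
        thread_data[i] = data[threadIdx.x * FFT::storage_size + i];
    }

    // In-place FFT in registers; thread execution needs no shared memory
    FFT().execute(thread_data);

    // Store the results back to global memory
    for (unsigned int i = 0; i < FFT::storage_size; ++i) {
        data[threadIdx.x * FFT::storage_size + i] = thread_data[i];
    }
}
```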

Block Execute Method

// #1
void FFT().execute<typename T>(T* input, void* shared_memory, FFT::workspace_type& workspace)

// #2: Version of #1 for FFTs which don't require workspace
void FFT().execute<typename T>(T* input, void* shared_memory)

// #3: Execute with input data in shared memory
void FFT().execute<typename T>(T* shared_memory_input, FFT::workspace_type& workspace)

// #4: Version of #3 for FFTs which don't require workspace
void FFT().execute<typename T>(T* shared_memory_input)

Runs the FFT operation defined by the FFT descriptor. T can be any type (such as float2 or double2), as long as its alignment and element size are the same as those of FFT::value_type. Pointers input, shared_memory, shared_memory_input should be aligned to alignof(FFT::value_type).

This method is available if the descriptor has been constructed using the Block Operator and cufftdx::is_complete_fft_execution is true.

When FFT::requires_workspace is false, overloads #2 and #4 can be used. Otherwise, the user must use overload #1 or #3 and pass a reference to a workspace.

In methods #1 and #2, input is in per-thread registers, and shared_memory is a pointer to shared memory of FFT::shared_memory_size bytes. The operation is in-place, meaning the results are stored in input. input must fit FFT::storage_size elements of type FFT::value_type.

Note

Methods #1 and #2 don’t assume that shared memory (shared_memory) is safe to modify or access without block synchronization, and perform the required synchronization (__syncthreads()) before its first use. However, methods #1 and #2 don’t synchronize threads within a block after the last operation on shared memory. If that shared memory is going to be reused later, a synchronization has to be performed first.

In methods #3 and #4, the input data is passed in shared memory (shared_memory_input). The operation is in-place, meaning the results are stored back to shared_memory_input. These methods don’t require an additional shared_memory pointer to be passed, as shared_memory_input will be used for the required communication between threads. Thus, shared_memory_input must fit all input and output values, and can’t be smaller than FFT::shared_memory_size bytes (i.e. the shared memory size in bytes is the maximum of FFT::shared_memory_size, FFT::ffts_per_block * <FFT_input_size_in_bytes>, and FFT::ffts_per_block * <FFT_output_size_in_bytes>).

Note

Methods #3 and #4, which receive input via shared memory, assume that a synchronization was already performed and the data can be safely accessed. These methods don’t synchronize threads within a block after the last operation on shared memory. A synchronization has to be performed before that shared memory is read from or written to again.
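The sizing rule for the shared-memory-input overloads (#3 and #4) can be sketched as a host-side helper; the function name and parameter names here are illustrative, not part of the cuFFTDx API:

```cpp
#include <algorithm>
#include <cstddef>

// Required dynamic shared memory, in bytes, for the shared-memory-input
// execute overloads: the buffer must be large enough for the FFT's working
// area and for all input and output values of every FFT in the block.
std::size_t required_shared_memory_bytes(std::size_t fft_shared_memory_size,
                                         std::size_t ffts_per_block,
                                         std::size_t input_size_bytes,
                                         std::size_t output_size_bytes) {
    return std::max({fft_shared_memory_size,
                     ffts_per_block * input_size_bytes,
                     ffts_per_block * output_size_bytes});
}
```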

Warning

It is not guaranteed that executions of the same FFTs (size, direction, type, precision) but with different execution configurations (for example, a different execute(...) overload or a different number of FFTs per block) will produce bit-identical results.

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

Value Format

For complex numbers of single and double precision, the first value in a complex number is the real part and the second is the imaginary part.
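For illustration, a float2-style stand-in with this layout (assuming only that FFT::value_type for single precision is layout-compatible with two consecutive floats, real first):

```cpp
#include <cstddef>

// Stand-in for a single-precision complex value: the first member is the
// real part, the second the imaginary part.
struct complex_f32 {
    float x; // real part
    float y; // imaginary part
};

// Read the parts back through the raw float layout, demonstrating that the
// real value occupies the first slot in memory and the imaginary the second.
inline float real_part(const complex_f32& v) {
    return reinterpret_cast<const float*>(&v)[0];
}
inline float imag_part(const complex_f32& v) {
    return reinterpret_cast<const float*>(&v)[1];
}
```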

Half Precision Implicit Batching

Processing of half (fp16) precision FFTs in cuFFTDx is implicitly batched, that is, a single computation processes two FFT batches. cuFFTDx expects that a complex number of half precision has 2 real parts and 2 imaginary parts in that order (i.e. real_1, real_2, imaginary_1, imaginary_2). Real values of half precision (for R2C and C2R FFTs) follow the same logic, and each should contain two real values. See also the FFT::implicit_type_batching trait.

Input/Output Data Format

This section describes the input and output data format.

Data In Registers

The N-th thread (indexing from 0) participating in the FFT should hold the following values of the FFT in its input: N + FFT::stride * i, where i is the index into input. Results are later stored in input following the same rule.

See also FFT::stride.

Example

The 0-th thread of an 8-point FFT with FFT::stride equal to 2 should have values 0, 2, 4, and 6 in its input.
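The indexing rule can be sketched as a small host-side helper (the function is illustrative, not part of the cuFFTDx API):

```cpp
#include <vector>

// Indices of the FFT input values held by thread n, for an FFT of the given
// size and stride: n + stride * i for i = 0 .. size/stride - 1.
std::vector<int> thread_value_indices(int n, int size, int stride) {
    std::vector<int> indices;
    for (int idx = n; idx < size; idx += stride) {
        indices.push_back(idx);
    }
    return indices;
}
```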

Data In Shared Memory

The input values of the FFT should be stored in shared_memory_input in natural order. Results are stored in shared_memory_input following the same rule.

Shared Memory Usage

It’s important to note that large FFTs may require more than 48 KB of shared memory per CUDA block. Therefore, as described in the CUDA Programming Guide (#1, #2, #3), kernels with such FFTs must use dynamic shared memory rather than statically sized shared memory arrays. Additionally, these kernels require an explicit opt-in via cudaFuncSetAttribute() to set cudaFuncAttributeMaxDynamicSharedMemorySize. See the example code below and the introduction example.

#include <cufftdx.hpp>
using namespace cufftdx;

using FFT = decltype(Size<16384>() + Precision<float>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + SM<800>() + Block());

__global__ void block_fft_kernel(FFT::value_type* data) {
  // dynamic shared memory
  extern __shared__ FFT::value_type shared_mem[];

  (...)
}

void example() {
  (...)

  // Increases the max dynamic shared memory size to match FFT requirements
  cudaFuncSetAttribute(block_fft_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, FFT::shared_memory_size);
  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<<<1, FFT::block_dim, FFT::shared_memory_size>>>(data);

  (...)
}

Make Workspace Function

template<class FFT>
auto cufftdx::make_workspace<FFT>(cudaError_t& error)

cufftdx::make_workspace<FFT>(cudaError_t&) is a helper function for creating the workspace required by the block execute(...) method when FFT::requires_workspace is true. FFT is the type of the FFT descriptor. If error is not cudaSuccess after the call, the workspace was not created correctly and is invalid.

  • If the FFT::requires_workspace trait is false, the user doesn’t have to create a workspace.

  • A workspace can still be created for an FFT with FFT::requires_workspace equal to false: such a workspace is empty, with no global memory allocation.

  • A workspace object is valid only for the FFT it was created for.

  • A workspace object can allocate global memory, though never more than FFT::workspace_size bytes, and is responsible for freeing it.

  • A workspace can’t be used concurrently, since all copies share the same underlying global memory allocation. Using a workspace concurrently will result in memory races.

  • Allocated global memory is freed upon destruction of the last copy of the workspace object.

  • A workspace object can be implicitly cast to FFT::workspace_type.

Note

Workspace is not required for FFTs of following sizes:

  • Powers of 2 up to 32768

  • Powers of 3 up to 19683

  • Powers of 5 up to 15625

  • Powers of 6 up to 1296

  • Powers of 7 up to 2401

  • Powers of 10 up to 10000

  • Powers of 11 up to 1331

  • Powers of 12 up to 1728

In future versions of cuFFTDx:
  • The workspace requirement may be removed for other configurations.

  • FFT configurations that do not require a workspace will continue not to require one.
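Mirroring the size list above, a host-side check of whether a size falls into one of the documented workspace-free families can be sketched as follows; this is purely illustrative, and the FFT::requires_workspace trait remains the authoritative answer:

```cpp
// Returns true if size is one of the power-of-radix sizes listed as not
// requiring a workspace (e.g. powers of 2 up to 32768, powers of 3 up to
// 19683, and so on).
bool workspace_free_size(unsigned int size) {
    struct Family { unsigned int base, limit; };
    const Family families[] = {
        {2, 32768}, {3, 19683}, {5, 15625}, {6, 1296},
        {7, 2401}, {10, 10000}, {11, 1331}, {12, 1728},
    };
    for (const Family& f : families) {
        for (unsigned int v = f.base; v <= f.limit; v *= f.base) {
            if (v == size) return true;
        }
    }
    return false;
}
```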

Warning

An FFT::workspace_type object doesn’t track the lifetime of the underlying memory, and is only valid within the lifetime of the workspace object it was cast from.

Warning

The type returned by cufftdx::make_workspace<FFT>(cudaError_t&) can be different for different FFT descriptions, and is not the same as FFT::workspace_type. The user should use auto when creating a workspace object. Example:

// Kernel
template<class FFT>
__launch_bounds__(FFT::max_threads_per_block)
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
    // ...

    // Execute FFT
    FFT().execute(thread_data, shared_mem, workspace);
}

// Create workspace
cudaError_t error = 0;
auto workspace = cufftdx::make_workspace<FFT>(error);

// ...

// Run kernel with FFT
block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);