# Execution Methods¶

These methods are used to run the FFT operation.

A code example:

```
#include <cufftdx.hpp>
using FFT = decltype( cufftdx::Size<128>() + cufftdx::Type<cufftdx::fft_type::c2c>()
                      + cufftdx::Direction<cufftdx::fft_direction::forward>()
                      + cufftdx::Precision<float>() + cufftdx::Block() );
using complex_type = typename FFT::value_type;

__global__ void kernel(... /* arguments */) {
    // Shared memory pointer
    extern __shared__ complex_type shared_mem[];
    // Register data
    complex_type thread_data[FFT::storage_size];
    // Load data into registers (thread_data)
    // ...
    FFT().execute(thread_data, shared_mem);
    // Store results (thread_data) into global memory
}
```

## Thread Execute Method¶

```
void FFT().execute<typename T>(T* input)
```

Runs the FFT operation defined by the FFT descriptor. `T` can be any type (such as `float2` or `double2`), as long as its alignment and element size are the same as those of FFT::value_type.

This method is available if the descriptor has been constructed using the Thread Operator and cufftdx::is_complete_fft_execution is `true`.

The `input` array should be in per-thread registers, and `input` must fit FFT::storage_size elements of type FFT::value_type.

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

## Block Execute Method¶

```
// #1
void FFT().execute<typename T>(T* input, void* shared_memory, FFT::workspace_type& workspace)
// #2: Version of #1 for FFTs which don't require workspace
void FFT().execute<typename T>(T* input, void* shared_memory)
// #3: Execute with input data in shared memory
void FFT().execute<typename T>(T* shared_memory_input, FFT::workspace_type& workspace)
// #4: Version of #3 for FFTs which don't require workspace
void FFT().execute<typename T>(T* shared_memory_input)
```

Runs the FFT operation defined by the FFT descriptor. `T` can be any type (such as `float2` or `double2`), as long as its alignment and element size are the same as those of FFT::value_type. The pointers `input`, `shared_memory`, and `shared_memory_input` should be aligned to `alignof(FFT::value_type)`.

This method is available if the descriptor has been constructed using the Block Operator and cufftdx::is_complete_fft_execution is `true`.

When FFT::requires_workspace is `false`, overloads #2 and #4 can be used. Otherwise, the user has to use method #1 or #3 and pass a reference to a workspace.

In methods #1 and #2, `input` is in per-thread registers, and `shared_memory` is a pointer to a shared memory region of `FFT::shared_memory_size` bytes. The operation is in-place, meaning the results are stored in `input`. `input` must fit FFT::storage_size elements of type FFT::value_type.

In methods #3 and #4, the input data is passed in shared memory (`shared_memory_input`). The operation is in-place, meaning the results are stored back to `shared_memory_input`. These methods don't require an additional `shared_memory` pointer to be passed, as `shared_memory_input` will be used for the required communication between threads. Thus, `shared_memory_input` must fit all input and output values, and can't be smaller than `FFT::shared_memory_size` bytes (i.e. the shared memory size in bytes is the maximum of `FFT::shared_memory_size`, `FFT::ffts_per_block * <FFT_input_size_in_bytes>`, and `FFT::ffts_per_block * <FFT_output_size_in_bytes>`).

Warning

It is not guaranteed that executions of the same FFTs (size, direction, type, precision) but with a different

- number of elements per thread (ElementsPerThread),
- number of FFTs calculated per CUDA block (FFTsPerBlock), or
- block dimension (BlockDim)

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

### Value Format¶

For complex numbers of single and double precision, the first value in a complex number is the real part and the second is the imaginary part.

#### Half Precision Implicit Batching¶

Processing of half (fp16) precision FFTs in cuFFTDx is implicitly batched, that is, a single computation processes two FFT batches. cuFFTDx expects that a complex number of half precision has 2 real parts and 2 imaginary parts, in that order (i.e. real_1, real_2, imaginary_1, imaginary_2). Real values of half precision (for R2C and C2R FFTs) follow the same logic, and each should contain two real values. See also the FFT::implicit_type_batching trait.

### Input/Output Data Format¶

This section describes the input and output data format.

#### Data In Registers¶

The `n`-th thread (indexing from 0) participating in the FFT should include the following values of the FFT in its `input` array: `n + FFT::stride * i`, where `i` is an index into `input`. Results are later stored in `input` following the same rule.

See also FFT::stride.

Example

The `0`-th thread of an 8-point FFT with `FFT::stride` equal to 2 should have values 0, 2, 4, and 6 in its `input`.

## Make Workspace Function¶

```
template<class FFT>
auto cufftdx::make_workspace<FFT>(cudaError_t& error)
```

`cufftdx::make_workspace<FFT>(cudaError_t&)` is a helper function for creating the workspace required by the block `execute(...)` method when FFT::requires_workspace is `true`. `FFT` is the type of the FFT descriptor. If, after calling the function, `error` is not `cudaSuccess`, the workspace was not created correctly and is invalid.

- If the FFT::requires_workspace trait is `false`, the user doesn't have to create a workspace.
- A workspace can be created for an FFT with `FFT::requires_workspace` equal to `false`: such a workspace is an empty workspace with no global memory allocation.
- A workspace object is valid only for the FFT it was created for.
- A workspace object can allocate global memory (never more than FFT::workspace_size), and it is responsible for freeing it.
- A workspace can't be used concurrently, since all copies share the same underlying global memory allocation. Using a workspace concurrently will result in memory races.
- The allocated global memory is freed upon destruction of the last copy of the created workspace object.
- A workspace object can be implicitly cast to FFT::workspace_type.

Note

Workspace is not required for FFTs of the following sizes:

- Powers of 2 up to 32768
- Powers of 3 up to 19683
- Powers of 5 up to 15625
- Powers of 6 up to 1296
- Powers of 7 up to 2401
- Powers of 10 up to 10000
- Powers of 11 up to 1331
- Powers of 12 up to 1728

In future versions of cuFFTDx:

- The workspace requirement may be removed for other configurations.
- FFT configurations that do not require a workspace will continue not to require one.

Warning

An FFT::workspace_type object doesn't track the lifetime of the underlying memory, and is only valid within the lifetime of the workspace object it was cast from.

Warning

The type returned by `cufftdx::make_workspace<FFT>(cudaError_t&)` can be different for different FFT descriptions, and is not the same as FFT::workspace_type. Users should use `auto` when creating a workspace object. Example:

```
// Kernel
template<class FFT>
__launch_bounds__(FFT::max_threads_per_block)
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
    using complex_type = typename FFT::value_type;
    // Shared memory pointer
    extern __shared__ complex_type shared_mem[];
    // Register data
    complex_type thread_data[FFT::storage_size];
    // Load data into registers (thread_data)
    // ...
    // Execute FFT
    FFT().execute(thread_data, shared_mem, workspace);
    // Store results (thread_data) into global memory
}

// Create workspace
cudaError_t error = cudaSuccess;
auto workspace = cufftdx::make_workspace<FFT>(error);
// Check that error is cudaSuccess before using the workspace
// ...
// Run kernel with FFT
block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);
```