# Execution Methods¶

These methods are used to run the FFT operation.

A code example:

```
#include <cufftdx.hpp>
using FFT = decltype( cufftdx::Size<128>() + cufftdx::Type<cufftdx::fft_type::c2c>()
                      + cufftdx::Direction<cufftdx::fft_direction::forward>()
                      + cufftdx::Precision<float>() + cufftdx::Block() );
using complex_type = typename FFT::value_type;

__global__ void kernel(... /* arguments */) {
    // Shared memory pointer
    extern __shared__ complex_type shared_mem[];
    // Register data
    complex_type thread_data[FFT::storage_size];
    // Load data into registers (thread_data)
    // ...
    FFT().execute(thread_data, shared_mem);
    // Store results (thread_data) into global memory
}
```

## Thread Execute Method¶

```
void FFT().execute<typename T>(T* input)
```

Runs the FFT operation defined by the FFT descriptor. `T` can be any type (such as `float2` or `double2`), as long as its alignment and element size are the same as those of FFT::value_type.

This method is available if the descriptor has been constructed using the Thread Operator and cufftdx::is_complete_fft_execution is `true`.

The `input` array should be in per-thread registers, and `input` must fit FFT::storage_size elements of type FFT::value_type.

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

## Block Execute Method¶

```
// #1
void FFT().execute<typename T>(T* input, void* shared_memory, FFT::workspace_type& workspace)
// #2: Version of #1 for FFTs which don't require workspace
void FFT().execute<typename T>(T* input, void* shared_memory)
// #3: Execute with input data in shared memory
void FFT().execute<typename T>(T* shared_memory_input, FFT::workspace_type& workspace)
// #4: Version of #3 for FFTs which don't require workspace
void FFT().execute<typename T>(T* shared_memory_input)
```

Runs the FFT operation defined by the FFT descriptor. `T` can be any type (such as `float2` or `double2`), as long as its alignment and element size are the same as those of FFT::value_type. The pointers `input`, `shared_memory`, and `shared_memory_input` should be aligned to `alignof(FFT::value_type)`.

This method is available if the descriptor has been constructed using the Block Operator and cufftdx::is_complete_fft_execution is `true`.

When FFT::requires_workspace is `false`, overloads #2 and #4 can be used. Otherwise, the user has to use method #1 or #3 and pass a reference to a workspace.

In methods #1 and #2, `input` is in per-thread registers, and `shared_memory` is a pointer to a shared memory region of `FFT::shared_memory_size` bytes. The operation is in-place, meaning the results are stored in `input`. `input` must fit FFT::storage_size elements of type FFT::value_type.

In methods #3 and #4, the input data is passed in shared memory (`shared_memory_input`). The operation is in-place, meaning the results are stored back to `shared_memory_input`. These methods don't require an additional `shared_memory` pointer to be passed, as `shared_memory_input` will be used for the required communication between threads. Thus, `shared_memory_input` must fit all input and output values, and can't be smaller than `FFT::shared_memory_size` bytes (i.e. the shared memory size in bytes is the maximum of `FFT::shared_memory_size`, `FFT::ffts_per_block * <FFT_input_size_in_bytes>`, and `FFT::ffts_per_block * <FFT_output_size_in_bytes>`).

Warning

It is not guaranteed that executions of the same FFTs (size, direction, type, precision) but with a different

- number of elements per thread (ElementsPerThread),
- number of FFTs calculated per CUDA block (FFTsPerBlock), or
- block dimension (BlockDim)

Warning

It is not guaranteed that executions of exactly the same FFTs on GPUs of different CUDA architectures will produce bit-identical results.

### Value Format¶

For complex numbers of single and double precision, the first value in a complex number is the real part and the second is the imaginary part.

#### Half Precision Implicit Batching¶

Processing of half (fp16) precision FFTs in cuFFTDx is implicitly batched, that is, a single computation processes two FFT batches. cuFFTDx expects that a complex number of half precision has 2 real parts and 2 imaginary parts, in that order (i.e. real_1, real_2, imaginary_1, imaginary_2). Real values of half precision (for R2C and C2R FFTs) follow the same logic, and each should contain two real values. See also the FFT::implicit_type_batching trait.

### Input/Output Data Format¶

This section describes the input and output data format.

#### Data In Registers¶

The `n`-th thread (indexing from 0) participating in the FFT should include the following values of the FFT in its `input` array: `n + FFT::stride * i`, where `i` is an index into `input`. Results are later stored in `input` following the same rule.

See also FFT::stride.

Example

The `0`-th thread of an 8-point FFT with `FFT::stride` equal to 2 should have values 0, 2, 4, and 6 in its `input`.

## Make Workspace Function¶

```
template<class FFT>
auto cufftdx::make_workspace<FFT>(cudaError_t& error)
```

`cufftdx::make_workspace<FFT>(cudaError_t&)` is a helper function for creating the workspace required by the block `execute(...)` method when FFT::requires_workspace is `true`. `FFT` is the type of the FFT descriptor. If, after calling the function, `error` is not `cudaSuccess`, the workspace was not created correctly and is invalid.

- If the FFT::requires_workspace trait is `false`, the user doesn't have to create a workspace.
- A workspace can be created for an FFT with `FFT::requires_workspace` equal to `false`: such a workspace is an empty workspace with no global memory allocation.
- A workspace object is valid only for the FFT it was created for.
- A workspace object can allocate global memory (never more than FFT::workspace_size), and it is responsible for freeing it.
- A workspace can't be used concurrently, since all copies share the same underlying global memory allocation. Using a workspace concurrently will result in memory races.
- The allocated global memory is freed upon destruction of the last copy of the created workspace object.
- A workspace object can be implicitly cast to FFT::workspace_type.

Note

Workspace is not required for FFTs of the following sizes:

- Powers of 2 up to 32768
- Powers of 3 up to 19683
- Powers of 5 up to 15625
- Powers of 6 up to 1296
- Powers of 7 up to 2401
- Powers of 10 up to 10000
- Powers of 11 up to 1331
- Powers of 12 up to 1728

In future versions of cuFFTDx:

- The workspace requirement may be removed for other configurations.
- FFT configurations that do not require a workspace will continue not to require one.

Warning

An FFT::workspace_type object doesn't track the lifetime of the underlying memory, and is only valid within the lifetime of the workspace object it was cast from.

Warning

The type returned by `cufftdx::make_workspace<FFT>(cudaError_t&)` can be different for different FFT descriptions, and is not the same as FFT::workspace_type. Users should use `auto` when creating a workspace object. Example:

```
// Kernel
template<class FFT>
__launch_bounds__(FFT::max_threads_per_block)
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
    using complex_type = typename FFT::value_type;
    // Shared memory pointer
    extern __shared__ complex_type shared_mem[];
    // Register data
    complex_type thread_data[FFT::storage_size];
    // Load data into registers (thread_data)
    // ...
    // Execute FFT
    FFT().execute(thread_data, shared_mem, workspace);
    // Store results (thread_data) into global memory
}

// Create workspace
cudaError_t error = cudaSuccess;
auto workspace = cufftdx::make_workspace<FFT>(error);
// Check that error is cudaSuccess before using the workspace
// ...
// Run kernel with FFT
block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);
```