Your Next Custom FFT Kernels¶
For real-world use cases, a single kernel is rarely enough. Even a single use case, aiming at maximum performance on multiple architectures, may require a number of different implementations. cuFFTDx was designed to handle this burden automatically, while offering users full control over the implementation details.
Using Optimal Parameters¶
cuFFTDx allows the user to defer certain implementation details (such as the number of FFT elements computed per thread, or the number of FFTs computed per block) to the library. The default value of ElementsPerThread is the suggested one; it is based on a heuristic which depends on the FFT itself (size, precision, GPU architecture, etc.). The suggested FFTsPerBlock for the selected ElementsPerThread (whether the default or a manually set value) can be obtained via the FFT::suggested_ffts_per_block static field. The default value of FFTsPerBlock is 1.
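If ElementsPerThread is set manually, the suggested FFTsPerBlock for that particular choice can be queried in the same way. A minimal sketch, assuming cufftdx.hpp is included and the cufftdx namespace is in scope, as in the full example below (ElementsPerThread<8> is an illustrative choice, not a recommendation):

// Description with ElementsPerThread chosen manually; 8 is an illustrative value
using FFT_manual_base = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                                 + Direction<fft_direction::forward>()
                                 + ElementsPerThread<8>() + SM<700>() + Block());
// Suggested FFTs per CUDA block for this specific ElementsPerThread choice
using FFT_manual = decltype(FFT_manual_base()
                            + FFTsPerBlock<FFT_manual_base::suggested_ffts_per_block>());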
#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
    using complex_type = typename FFT::value_type;

    // Registers
    complex_type thread_data[FFT::storage_size];

    // Local batch id of this FFT in CUDA block, in range [0; FFT::ffts_per_block)
    const unsigned int local_fft_id = threadIdx.y;
    // Global batch id of this FFT in CUDA grid is equal to number of batches per CUDA block
    // (ffts_per_block) times CUDA block id, plus local batch id.
    const unsigned int global_fft_id = (blockIdx.x * FFT::ffts_per_block) + local_fft_id;

    // Load data from global memory to registers
    const unsigned int offset = cufftdx::size_of<FFT>::value * global_fft_id;
    const unsigned int stride = FFT::stride;
    unsigned int index        = offset + threadIdx.x;
    for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
        // Make sure not to go out-of-bounds
        if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
            thread_data[i] = data[index];
            index += stride;
        }
    }

    // FFT::shared_memory_size bytes of shared memory
    extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

    // Execute FFT
    FFT().execute(thread_data, shared_mem, workspace);

    // Save results
    index = offset + threadIdx.x;
    for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
        if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
            data[index] = thread_data[i];
            index += stride;
        }
    }
}
void introduction_example(cudaStream_t stream = 0) {
    // Base of the FFT description
    using FFT_base = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                              + Direction<fft_direction::forward>()
                              /* Notice lack of ElementsPerThread and FFTsPerBlock operators */
                              + SM<700>() + Block());
    // FFT description with suggested FFTs per CUDA block for the default (optimal) elements per thread
    using FFT = decltype(FFT_base() + FFTsPerBlock<FFT_base::suggested_ffts_per_block>());
    using complex_type = typename FFT::value_type;

    // Allocate managed memory for input/output
    complex_type* data;
    auto size       = FFT::ffts_per_block * cufftdx::size_of<FFT>::value;
    auto size_bytes = size * sizeof(complex_type);
    cudaMallocManaged(&data, size_bytes);
    // Generate data
    for (size_t i = 0; i < size; i++) {
        data[i] = complex_type {float(i), -float(i)};
    }

    // Create the workspace required by the FFT (may be empty for sizes that do not need one)
    cudaError_t error_code = cudaSuccess;
    auto workspace = make_workspace<FFT>(error_code, stream);

    // Invokes kernel with FFT::block_dim threads in CUDA block
    block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size, stream>>>(data, workspace);
    cudaDeviceSynchronize();

    cudaFree(data);
}
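The example above launches a single CUDA block, so it computes only FFT::ffts_per_block transforms. For a larger batch, the grid dimension follows directly from the batch count. A minimal sketch of such a launch (the batch size is an assumption, taken to be a multiple of FFT::ffts_per_block; otherwise the input would need padding, and data would have to hold batches times the FFT size elements):

// Assumed batch size, chosen as a multiple of FFT::ffts_per_block
constexpr unsigned int batches = 1024;
const unsigned int blocks = batches / FFT::ffts_per_block;
// Larger shared memory requirements may need an explicit opt-in
cudaFuncSetAttribute(block_fft_kernel<FFT>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     FFT::shared_memory_size);
block_fft_kernel<FFT><<<blocks, FFT::block_dim, FFT::shared_memory_size, stream>>>(data, workspace);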
To retrieve the optimal FFTsPerBlock value, we require a complete execution descriptor (as indicated by cufftdx::is_complete_fft_execution). This is because some of the details are only available after the FFT operation has been fully described and the target architecture and execution mode have been selected. The SM operator, evaluated at compile time on the host, allows the user to query launch parameters for a particular architecture.
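For instance, reusing FFT_base from the example above, the completeness requirement can be verified at compile time before querying the field (a sketch for illustration; this check is not required in user code):

// FFT_base includes both SM<700>() and Block(), so it is a complete execution descriptor
static_assert(cufftdx::is_complete_fft_execution<FFT_base>::value,
              "suggested_ffts_per_block requires a complete execution descriptor");
constexpr unsigned int suggested = FFT_base::suggested_ffts_per_block;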
What happens under the hood¶
- Expression templates
The cuFFTDx API uses a variation of a C++ technique called expression templates. Expression templates allow the user to construct compile-time objects that describe the FFT calculation to compute (a minimal sketch of this technique follows the list). Compile-time C++ mechanisms allow cuFFTDx to attach optimized FFT routines to the object, and expose them as a compute method that can be called by the user.
- Header only
cuFFTDx is a header-only library: its FFT routines are shipped as optimized inlined PTX.
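As a minimal, self-contained illustration of the expression-templates idea referenced above (this is a sketch of the general technique, not cuFFTDx's actual implementation):

// Sketch of the expression-templates idea; not cuFFTDx's actual code.
// Each operator is a small tag type carrying its settings.
template<unsigned int N>
struct size_op { static constexpr unsigned int value = N; };
struct forward_op {};

// A description is a type that accumulates the list of operators.
template<class... Operators>
struct description {
    // A real library would derive traits (threads, shared memory, routines, ...)
    // from the operator pack here.
};

// operator+ appends an operator to the description, purely at compile time.
template<class... Ls, class R>
constexpr description<Ls..., R> operator+(description<Ls...>, R) { return {}; }

// Composing a description performs no FFT work; it is pure type manipulation.
using my_fft = decltype(description<>() + size_op<128>() + forward_op());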
Why?¶
For a library to be useful, it needs to abstract functionality in a future-proof manner. By future-proof we mean that existing user code should not need to be modified in the future, and that new functionality should consist of simple extensions to the existing code. On the CUDA platform, this requires adapting to quickly evolving GPU hardware.
cuFFTDx approaches future-proofing in two ways. On one hand, the API is a source-level abstraction, which decouples the library from ABI changes. Along with the PTX code in its headers, this makes cuFFTDx forward-compatible with any CUDA toolkit, driver, and compiler that supports the hardware cuFFTDx was released for: PTX can be recompiled by the CUDA compiler to run on future GPU architectures.
On the other hand, the API organization preserves the operators describing what gets computed and how. Depending on their type, new features can either be picked up automatically, if the code defers implementation choices to the library, or require adding operators to an existing expression.