Your Next Custom FFT Kernels

For real-world use cases, it is likely we will need more than a single kernel. Even a single use case, if it aims at maximum performance across multiple architectures, may require a number of different implementations. cuFFTDx was designed to handle this burden automatically, while offering users full control over the implementation details.

Using Optimal Parameters

cuFFTDx allows the user to defer the definition of certain implementation details (such as the number of FFT elements computed per thread, or the number of FFTs computed per block) to the library. The default value of ElementsPerThread is the suggested one; it is based on a heuristic that depends on the FFT itself (size, precision, GPU architecture, etc.). The suggested FFTsPerBlock for the selected ElementsPerThread, whether that is the default value or one set manually, can be obtained via the FFT::suggested_ffts_per_block static field. The default value of FFTsPerBlock is 1.
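
If ElementsPerThread is set explicitly instead of being left to the heuristic, FFT::suggested_ffts_per_block reports the suggestion for that choice. A minimal sketch, assuming an explicit value of 8 purely for illustration (it is not a recommendation for this FFT):

#include <cufftdx.hpp>
using namespace cufftdx;

// ElementsPerThread fixed explicitly; 8 is an arbitrary illustrative value
using FFT_with_ept = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                              + Direction<fft_direction::forward>()
                              + ElementsPerThread<8>() + SM<700>() + Block());
// suggested_ffts_per_block now corresponds to the explicitly chosen ElementsPerThread
using FFT_explicit = decltype(FFT_with_ept() + FFTsPerBlock<FFT_with_ept::suggested_ffts_per_block>());

The full example below keeps the default ElementsPerThread and only queries the suggested FFTs per block: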

#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
  using complex_type = typename FFT::value_type;

  // Registers
  complex_type thread_data[FFT::storage_size];

  // Local batch id of this FFT in CUDA block, in range [0; FFT::ffts_per_block)
  const unsigned int local_fft_id = threadIdx.y;
  // Global batch id of this FFT in CUDA grid is equal to number of batches per CUDA block (ffts_per_block)
  // times CUDA block id, plus local batch id.
  const unsigned int global_fft_id = (blockIdx.x * FFT::ffts_per_block) + local_fft_id;

  // Load data from global memory to registers
  const unsigned int offset = cufftdx::size_of<FFT>::value * global_fft_id;
  const unsigned int stride = FFT::stride;
  unsigned int       index  = offset + threadIdx.x;
  for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
      // Make sure not to go out-of-bounds
      if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
          thread_data[i] = data[index];
          index += stride;
      }
  }

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ complex_type shared_mem[];

  // Execute FFT
  FFT().execute(thread_data, shared_mem, workspace);

  // Save results
  index = offset + threadIdx.x;
  for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
      if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
          data[index] = thread_data[i];
          index += stride;
      }
  }
}

void introduction_example() {
  // Base of the FFT description
  using FFT_base = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                            + Direction<fft_direction::forward>()
                            /* Notice lack of ElementsPerThread and FFTsPerBlock operators */
                            + SM<700>() + Block());
  // FFT description with suggested FFTs per CUDA block for the default (optimal) elements per thread
  using FFT = decltype(FFT_base() + FFTsPerBlock<FFT_base::suggested_ffts_per_block>());

  // Allocate managed memory for input/output
  using complex_type = typename FFT::value_type;
  complex_type* data;
  auto          size       = FFT::ffts_per_block * cufftdx::size_of<FFT>::value;
  auto          size_bytes = size * sizeof(complex_type);
  cudaMallocManaged(&data, size_bytes);
  // Generate data
  for (size_t i = 0; i < size; i++) {
      data[i] = complex_type {float(i), -float(i)};
  }

  cudaError_t error_code = cudaSuccess;
  auto workspace = make_workspace<FFT>(error_code);

  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);
  cudaDeviceSynchronize();

  cudaFree(data);
}

To retrieve the suggested FFTsPerBlock value, we require a complete execution descriptor (as indicated by cufftdx::is_complete_fft_execution). This is because some of the details are only known after the FFT operation has been fully described, and the target architecture and the execution mode have been selected. Because the SM operator fixes the target architecture in the compile-time description, the user can query the launch parameters for that architecture from host code.
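
As a sketch of how this requirement can be made explicit, a static_assert may confirm the description is complete before the suggestion is queried (this assumes is_complete_fft_execution exposes a boolean ::value member; adjust to the trait form your cuFFTDx version provides):

// Confirm FFT_base is a complete execution description before querying it
// (assumption: is_complete_fft_execution<...>::value is the boolean trait form)
static_assert(cufftdx::is_complete_fft_execution<FFT_base>::value,
              "FFT_base must be a complete execution description");
constexpr unsigned int suggested_fpb = FFT_base::suggested_ffts_per_block;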

Extra Shared Memory

As described in the Shared Memory Usage section, some FFTs may require more shared memory per CUDA block than the default limit allows. In such cases we need to opt in to a larger shared memory allocation, using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize attribute to FFT::shared_memory_size for the kernel instantiated with FFT.

#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(typename FFT::value_type* data, typename FFT::workspace_type workspace) {
  using complex_type = typename FFT::value_type;

  (...)

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ complex_type shared_mem[];

  (...)
}

void introduction_example() {
  // Base of the FFT description
  using FFT_base = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                            + Direction<fft_direction::forward>()
                            /* Notice lack of ElementsPerThread and FFTsPerBlock operators */
                            + SM<700>() + Block());
  // FFT description with suggested FFTs per CUDA block for the default (optimal) elements per thread
  using FFT = decltype(FFT_base() + FFTsPerBlock<FFT_base::suggested_ffts_per_block>());

  (...)

  // Increases the max dynamic shared memory size to match FFT requirements
  cudaFuncSetAttribute(block_fft_kernel<FFT>,
      cudaFuncAttributeMaxDynamicSharedMemorySize,
      FFT::shared_memory_size);

  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);

  (...)
}
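
cudaFuncSetAttribute returns a cudaError_t, so the opt-in can be guarded and checked. A hedged sketch, assuming the usual 48 KB default limit for dynamic shared memory on the target GPU:

// Opt in only when the FFT needs more dynamic shared memory than the
// default limit (assumed to be 48 KB here), and check the result
if (FFT::shared_memory_size > 48 * 1024) {
    cudaError_t attr_status = cudaFuncSetAttribute(block_fft_kernel<FFT>,
        cudaFuncAttributeMaxDynamicSharedMemorySize,
        FFT::shared_memory_size);
    if (attr_status != cudaSuccess) {
        // Handle the failure, e.g. report the error and abort
    }
}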

What happens under the hood

Expression templates

The cuFFTDx API uses a variation of a C++ technique called expression templates. Expression templates allow the user to construct compile-time objects that describe the FFT computation to be performed. Compile-time C++ mechanisms allow cuFFTDx to attach optimized FFT routines to that object and to expose them as a compute method that can be called by the user.
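
As an illustration of the general technique (a simplified, hypothetical sketch; the names below are invented for this example and are not cuFFTDx internals), an overloaded operator+ can fold options into a single description type entirely at compile time:

#include <type_traits>

// Hypothetical, simplified option types
template<unsigned int N> struct MySize {};
struct MyForwardDirection {};

// A description type that accumulates every option as a template parameter
template<class... Options>
struct MyDescription {
    // A real library would generate members here (e.g. an execute() method)
    // specialized on the accumulated Options...
};

// operator+ folds one more option into the description purely at the type level
template<class... Options, class Option>
constexpr auto operator+(MyDescription<Options...>, Option) {
    return MyDescription<Options..., Option>{};
}

// Only the resulting *type* matters; no run-time work is performed
using Desc = decltype(MyDescription<>{} + MySize<128>{} + MyForwardDirection{});
static_assert(std::is_same<Desc, MyDescription<MySize<128>, MyForwardDirection>>::value,
              "the expression's type encodes the whole computation");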

Header only

cuFFTDx is a header-only library; its FFT routines are shipped as optimized, inlined PTX, so there is no separate library to link against.

Why?

For a library to be useful, it needs to abstract functionality in a future-proof manner. By future-proof we mean that existing user code should not need to be modified in the future, and that new functionality should consist of simple extensions to the existing code. On the CUDA platform, this requires adapting to quickly evolving GPU hardware.

cuFFTDx approaches future-proofing in two ways. On one hand, the API is a source-level abstraction, which decouples the library from ABI changes. Together with the PTX code in the headers, this makes cuFFTDx forward-compatible with any CUDA toolkit, driver, and compiler that supports the hardware cuFFTDx was released for. PTX can be recompiled by the CUDA compiler to run on future GPU architectures.

On the other hand, the API organization preserves the operators that describe what gets computed and how. Depending on their nature, new features can either be picked up automatically, when the code defers implementation choices to the library, or require only adding new operators to an existing expression.