First FFT Using cuFFTDx

In this introduction, we will calculate an FFT of size 128 using a standalone kernel. This section is based on the introduction_example.cu example shipped with cuFFTDx. See Examples section to check other cuFFTDx samples.

Defining Basic FFT

The first step is defining the FFT we want to perform. It’s done by adding together cuFFTDx operators to create an FFT description. The correctness of this type is evaluated at compile time. A well-defined FFT must include the problem size, the precision used (float, double, etc.), the type of operation (complex-to-complex, real-to-complex, etc.), and its direction (forward, or inverse).

// cuFFTDx header
#include <cufftdx.hpp>

// FFT description:
// A 128-point single precision complex-to-complex forward FFT description
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>() + Direction<fft_direction::forward>());

In order to encode the FFT properties, cuFFTDx provides operators Size Operator, Precision Operator, Type Operator, and Direction Operator. As mentioned before, listed operators can be combined by using the addition operator (+).

To obtain a fully usable CUDA FFT kernel, we need to provide three additional pieces of information. The first one is how many FFTs we would like to compute, the second one is how to map the calculations into a CUDA block, and the last one is what CUDA architecture we are targeting.

Note

If FFT is to be executed in a single thread (see Thread Operator), both FFTsPerBlock and ElementsPerThread operators are not allowed. In that mode each thread executes one FFT.

In cuFFTDx, we specify how many FFTs we want to compute using the FFTs Per Block Operator. It defines how many FFT to do in parallel inside of a single CUDA block. In this example, we will set it to 2 FFT per CUDA block (the default value is 1 FFT per CUDA block):

// cuFFTDx header
#include <cufftdx.hpp>

// FFT description
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>() + Direction<fft_direction::forward>()
                     + FFTsPerBlock<2>());

To map the computing of the FFT to the CUDA block, we use the Elements Per Thread Operator. This operator determines the number of registers required per thread and the exact implementation to be used. It also influences the required CUDA block size. We add that operator to the description:

#include <cufftdx.hpp>

// FFT description
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>() + Direction<fft_direction::forward>()
                     + FFTsPerBlock<2>() + ElementsPerThread<8>());

Note

If ElementsPerThread and FFTsPerBlock are not set, the default values are used. See Block Configuration Operators section.

Finally, we use the SM Operator to indicate the target CUDA architecture on which we want to build the FFT descriptor. Each GPU architecture can use different parameters. Therefore, the choice of architecture potentially affects the configuration to maximize performance. In the introduction_example.cu example this is passed as template parameter, but in here we can assume we’re targeting Volta GPUs (SM<700>()):

#include <cufftdx.hpp>

// FFT description
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>() + Direction<fft_direction::forward>()
                     + FFTsPerBlock<2>() + ElementsPerThread<8>()
                     + SM<700>());

Once the FFT description is fully formed, we can finalize it by adding the Block Operator. It indicates that we are asking for the collective FFT operation to be performed by a single CUDA block. The operator verifies correctness of the description, and it is a type of Execution Operators, (the other being the Thread Operator).

#include <cufftdx.hpp>

// FFT description:
// This description says that we want to execute a 128-point single precision complex-to-complex forward FFT with
// 2 batches per CUDA block and with 8 elements per thread on Volta GPU.
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>() + Direction<fft_direction::forward>()
                     + FFTsPerBlock<2>() + ElementsPerThread<8>()
                     + SM<700>()
                     + Block());

Executing FFT

FFT description types can be instantiated into objects. Forming the object has no computational cost, and should be seen as a handle. The FFT descriptor object provides a compute method, execute(...) that performs the requested FFT.

#include <cufftdx.hpp>
using namespace cufftdx;

// FFT description
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + FFTsPerBlock<1>()
                     + ElementsPerThread<8>() + SM<700>() + Block());
using complex_type = typename FFT::value_type;

__global__ void block_fft_kernel(complex_type* data) {
  // Execute FFT
  FFT().execute(/* What are the arguments? */);
}

cuFFTDx operations require registers and shared memory to operate. Users can query the FFT descriptor for needed resources.

#include <cufftdx.hpp>
using namespace cufftdx;

// FFT description
using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + FFTsPerBlock<1>()
                     + ElementsPerThread<8>() + SM<700>() + Block());
using complex_type = typename FFT::value_type;

__global__ void block_fft_kernel(complex_type* data) {
  // Registers
  complex_type thread_data[FFT::storage_size];

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  // Execute FFT
  FFT().execute(thread_data, shared_mem);
}

In the execute(...) method presented above the cuFFTDx requires the input data to be in thread_data registers and stores the FFT results there. Users can also API which takes only pointer to shared memory and assumes all data is there in a natural order, see for more details Block Execute Method section.

Some FFTs, depending on the selected size, may also require additional global memory workspace, which needs to be allocated on the host and passed to the kernel. You can check if you have to create workspace using FFT::requires_workspace trait.

#include <cufftdx.hpp>
using namespace cufftdx;

using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + FFTsPerBlock<1>()
                     + ElementsPerThread<8>() + SM<700>() + Block());
using complex_type = typename FFT::value_type;

__global__ void block_fft_kernel(complex_type* data, typename FFT::workspace_type workspace) {
  // Registers
  complex_type thread_data[FFT::storage_size];

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  // Execute FFT
  FFT().execute(thread_data, shared_mem, workspace /* additional workspace */);
}

Launching FFT Kernel

To launch a kernel we need to know the block size and required amount of shared memory needed to perform the FFT operation. Both are fixed and determined by the FFT description.

Since we defined the FFT description in device code, information about the block size needs to be propagated to the host. When all parameters are fully specified, all GPU architectures use the same block size, so the kernel can be launched in the same manner for all architectures.

#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(FFT::value_type* data, typename FFT::workspace_type workspace) {
  using complex_type = typename FFT::value_type;
  // Registers
  complex_type thread_data[FFT::storage_size];

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  // Execute FFT
  FFT().execute(thread_data, shared_mem, workspace);
}

// CUDA_CHECK_AND_EXIT - marco checks if function returns cudaSuccess; if not it prints
// the error code and exits the program
void introduction_example(FFT::value_type* data /* data is a manage memory pointer*/) {
  // FFT description
  using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                       + Direction<fft_direction::forward>() + FFTsPerBlock<1>()
                       + ElementsPerThread<8>() + SM<700>() + Block());

  cudaError_t error_code = cudaSuccess;
  auto workspace = make_workspace<FFT>(error_code);
  CUDA_CHECK_AND_EXIT(error_code);

  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);
  CUDA_CHECK_AND_EXIT(cudaPeekAtLastError());
  CUDA_CHECK_AND_EXIT(cudaDeviceSynchronize());
}

If we also add input/output operations from/to global memory, we obtain a kernel that is functionally equivalent to the cuFFT complex-to-complex kernel for size 128 and single precision. The data is loaded from global memory and stored into registers as described in Input/Output Data Format section, and similarly result are saved back to global memory.

For simplicity, in the example we allocate managed memory for device input/output array, assume that Volta architecture is used, and don’t check CUDA error codes returned by CUDA API functions and cufftdx::make_workspace. Please check the full introduction_example.cu example, as well as others shipped with cuFFTDx, for more detailed code.

#include <cufftdx.hpp>
using namespace cufftdx;

template<class FFT>
__global__ void block_fft_kernel(FFT::value_type* data, typename FFT::workspace_type workspace) {
  using complex_type = typename FFT::value_type;

  // Registers
  complex_type thread_data[FFT::storage_size];

  // Local batch id of this FFT in CUDA block, in range [0; FFT::ffts_per_block)
  const unsigned int local_fft_id = threadIdx.y;
  // Global batch id of this FFT in CUDA grid is equal to number of batches per CUDA block (ffts_per_block)
  // times CUDA block id, plus local batch id.
  const unsigned int global_fft_id = (blockIdx.x * FFT::ffts_per_block) + local_fft_id;

  // Load data from global memory to registers
  const unsigned int offset = cufftdx::size_of<FFT>::value * global_fft_id;
  const unsigned int stride = FFT::stride;
  unsigned int       index  = offset + threadIdx.x;
  for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
      // Make sure not to go out-of-bounds
      if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
          thread_data[i] = data[index];
          index += stride;
      }
  }

  // FFT::shared_memory_size bytes of shared memory
  extern __shared__ __align__(alignof(float4)) complex_type shared_mem[];

  // Execute FFT
  FFT().execute(thread_data, shared_mem, workspace);

  // Save results
  index = offset + threadIdx.x;
  for (unsigned int i = 0; i < FFT::elements_per_thread; i++) {
      if ((i * stride + threadIdx.x) < cufftdx::size_of<FFT>::value) {
          data[index] = thread_data[i];
          index += stride;
      }
  }
}

void introduction_example() {
  // FFT description
  using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                       + Direction<fft_direction::forward>() + FFTsPerBlock<1>()
                       + ElementsPerThread<8>() + SM<700>() + Block());

  // Allocate managed memory for input/output
  complex_type* data;
  auto          size       = FFT::ffts_per_block * cufftdx::size_of<FFT>::value;
  auto          size_bytes = size * sizeof(complex_type);
  cudaMallocManaged(&data, size_bytes);
  // Generate data
  for (size_t i = 0; i < size; i++) {
      data[i] = complex_type {float(i), -float(i)};
  }

  cudaError_t error_code = cudaSuccess;
  auto workspace = make_workspace<FFT>(error_code);

  // Invokes kernel with FFT::block_dim threads in CUDA block
  block_fft_kernel<FFT><<<1, FFT::block_dim, FFT::shared_memory_size>>>(data, workspace);
  cudaDeviceSynchronize();

  cudaFree(data);
}

It’s important to notice that unlike cuFFT, cuFFTDx does not require moving data back to global memory after executing a FFT operation. This can be a major performance advantage as FFT calculations can be fused together with custom pre- and post-processing operations.

Compilation

In order to compile program which includes cufftdx.hpp, users only need to pass the location of the cuFFTDx library (the directory with the cufftdx.hpp file). More on how to use cuFFTDx in your project can be found in Quick Installation Guide.

nvcc -std=c++17 -arch sm_70 -O3 -I<mathdx_include_dir> introduction_example.cu -o introduction_example

Note

Since version 0.3.0 cuFFTDx has an experimental support for compilation with NVRTC. See Requirements and Functionality section.