Examples

The cuFFTDx library provides multiple thread and block-level FFT samples covering all supported precisions and types, as well as a few special examples that highlight performance benefits of cuFFTDx.

| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuFFTDx API introduction |
| Simple FFT Examples | Thread FFT Examples | simple_fft_thread | Complex-to-complex thread FFT |
| | | simple_fft_thread_fp16 | Complex-to-complex thread FFT half-precision |
| | Block FFT Examples | simple_fft_block | Complex-to-complex block FFT |
| | | simple_fft_block_r2c | Real-to-complex block FFT |
| | | simple_fft_block_c2r | Complex-to-real block FFT |
| | | simple_fft_block_half2 | Complex-to-complex block FFT with __half2 as data type |
| | | simple_fft_block_fp16 | Complex-to-complex block FFT half-precision |
| | | simple_fft_block_r2c_fp16 | Real-to-complex block FFT half-precision |
| | | simple_fft_block_c2r_fp16 | Complex-to-real block FFT half-precision |
| | Extra Block FFT Examples | simple_fft_block_shared | Complex-to-complex block FFT shared-memory API |
| | | simple_fft_block_std_complex | Complex-to-complex block FFT with cuda::std::complex as data type |
| | | simple_fft_block_cub_io | Complex-to-complex block FFT with CUB used for loading/storing data |
| NVRTC Examples | | nvrtc_fft_thread | Complex-to-complex thread FFT |
| | | nvrtc_fft_block | Complex-to-complex block FFT |
| FFT Performance | | block_fft_performance | Benchmark for C2C block FFT |
| | | block_fft_performance_many | Benchmark for C2C/R2C/C2R block FFT |
| Convolution Examples | | convolution | Simplified FFT convolution |
| | | convolution_r2c_c2r | Simplified R2C-C2R FFT convolution |
| | | convolution_performance | Benchmark for FFT convolution using cuFFTDx and cuFFT |
| 2D/3D FFT Advanced Examples | | fft_2d | Example showing how to perform 2D FP32 C2C FFT with cuFFTDx |
| | | fft_2d_r2c_c2r | Example showing how to perform 2D FP32 R2C/C2R convolution with cuFFTDx |
| | | fft_2d_single_kernel | 2D FP32 FFT in a single kernel using Cooperative Groups kernel launch |
| | | fft_3d_box_single_block | Small 3D FP32 FFT that fits into a single block, each dimension is different |
| | | fft_3d_cube_single_block | Small 3D (equal dimensions) FP32 FFT that fits into a single block |

Introduction Examples

  • introduction_example

The introduction examples are used in the documentation to explain the basics of the cuFFTDx library and its API. introduction_example is used in the introductory guide to the cuFFTDx API: First FFT Using cuFFTDx.

Simple FFT Examples

simple_fft_thread* Examples

  • simple_fft_thread

  • simple_fft_thread_fp16

In each of the examples listed above a one-dimensional complex-to-complex FFT routine is performed by a single CUDA thread. In both samples multiple threads are run, and each thread calculates an FFT. The input data is generated on the host, copied to a device buffer, and then the final results are copied back to the host.

The simple_fft_thread_fp16 example showcases the support for half-precision (fp16) in cuFFTDx. Note that for half-precision cuFFTDx processes values in implicit batches of two FFTs, i.e., each thread processes two FFTs. See also the Half-Precision Implicit Batching section.

simple_fft_block* Examples

  • simple_fft_block

  • simple_fft_block_r2c

  • simple_fft_block_c2r

  • simple_fft_block_half2

  • simple_fft_block_fp16

  • simple_fft_block_r2c_fp16

  • simple_fft_block_c2r_fp16

In each of the examples listed above, a one-dimensional complex-to-complex, real-to-complex, or complex-to-real FFT is performed in a CUDA block. The examples show how to create a complete FFT description, and then set the correct block dimensions and the necessary amount of shared memory. In the kernels, the required array (thread_data) is allocated in per-thread registers, the input data is copied into it, the FFT is executed, and the results are transferred back to global memory. All samples use input/output functions from block_io.hpp. The input data is generated on the host, copied to a device buffer, and then the final results are copied back to the host.

The simple_fft_block_(*)_fp16 examples showcase the support for half-precision (fp16) in cuFFTDx. Note that in half-precision cuFFTDx processes values in implicit batches of two FFTs, i.e., each thread processes two FFTs. See also the Half-Precision Implicit Batching section.

The simple_fft_block_half2 example differs from simple_fft_block_fp16 in that it uses the __half2 type instead of cufftdx::complex<__half2> for half-precision complex values, which means the data is not implicitly batched on the type level. For this reason, this example uses a special load function (and a matching store function) that loads values from the input buffer and rearranges them into cufftdx::complex<__half2> values, introducing implicit batching. See also the Half-Precision Implicit Batching section.
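The rearrangement described above can be illustrated with a host-side C++ sketch. It is not the load function from the sample. It assumes the implicitly batched layout is (real0, real1, imag0, imag1), which should be checked against the Half-Precision Implicit Batching section. The types Pair and BatchedComplex and the helpers pack_two_ffts and unpack_two_ffts are hypothetical stand-ins for __half2, cufftdx::complex<__half2>, and the sample's load/store functions, and float replaces half so that the code runs without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for a two-element half-precision vector such as __half2.
// float is used instead of half so the sketch runs without CUDA.
struct Pair {
    float x;
    float y;
};

// Host-side stand-in for cufftdx::complex<__half2> under the assumed layout:
// x = (real of FFT 0, real of FFT 1), y = (imag of FFT 0, imag of FFT 1).
struct BatchedComplex {
    Pair x;
    Pair y;
};

// Hypothetical helper: packs two inputs stored as (real, imag) pairs into
// the implicitly batched layout, element by element.
std::vector<BatchedComplex> pack_two_ffts(const std::vector<Pair>& fft0,
                                          const std::vector<Pair>& fft1) {
    assert(fft0.size() == fft1.size());
    std::vector<BatchedComplex> packed(fft0.size());
    for (std::size_t i = 0; i < fft0.size(); ++i) {
        packed[i].x = Pair{fft0[i].x, fft1[i].x}; // both real parts
        packed[i].y = Pair{fft0[i].y, fft1[i].y}; // both imaginary parts
    }
    return packed;
}

// Hypothetical helper: the matching store step, which splits the batched
// values back into two separate (real, imag) sequences.
void unpack_two_ffts(const std::vector<BatchedComplex>& packed,
                     std::vector<Pair>& fft0,
                     std::vector<Pair>& fft1) {
    fft0.resize(packed.size());
    fft1.resize(packed.size());
    for (std::size_t i = 0; i < packed.size(); ++i) {
        fft0[i] = Pair{packed[i].x.x, packed[i].y.x};
        fft1[i] = Pair{packed[i].x.y, packed[i].y.y};
    }
}
```

In this sketch, element i of both inputs ends up in a single batched value, so one pass over the packed data touches two FFTs at once, which is the idea behind implicit batching.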

Extra simple_fft_block(*) Examples

  • simple_fft_block_shared

  • simple_fft_block_std_complex

  • simple_fft_block_cub_io

The simple_fft_block_shared example differs from the other simple_fft_block_(*) examples because it uses the shared-memory cuFFTDx API; see methods #3 and #4 in the Block Execute Method section.

The simple_fft_block_std_complex sample shows that the cuda::std::complex type can be used as the complex value type for data passed to cuFFTDx. This works because cuda::std::complex has the same size and alignment as cufftdx::complex.

In the simple_fft_block_cub_io example, the NVIDIA CUB library (https://github.com/NVIDIA/cub) is used for input/output functions instead of the functions from block_io.hpp. It requires CUB version 1.13 or newer.

NVRTC Examples

  • nvrtc_fft_thread

  • nvrtc_fft_block

The NVRTC examples show how to use cuFFTDx at the thread and block level with NVRTC runtime compilation. The FFT descriptions created with cuFFTDx operators are defined only in the device code. The header file cufftdx.hpp is also included only in the device code that is passed to NVRTC. This works as long as the FFT does not require extra workspace; see the Make Workspace Function section and FFT::requires_workspace.

Note

Since version 0.3.0, cuFFTDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.

FFT Performance

  • block_fft_performance

  • block_fft_performance_many

The examples listed above report the performance of cuFFTDx device functions calculating FFTs. Users can easily modify block_fft_performance to test the performance of a particular FFT they want to use. The block_fft_performance_many example runs benchmarks for multiple single-precision FFT problems to show how performance changes depending on the size and the type of an FFT.

Convolution Examples

  • convolution

  • convolution_r2c_c2r

  • convolution_performance

The convolution examples perform a simplified FFT convolution, either with complex-to-complex forward and inverse FFTs (convolution), or real-to-complex and complex-to-real FFTs (convolution_r2c_c2r).
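The forward FFT, pointwise operation, inverse FFT sequence can be sketched in host-side C++. This is not the code of the samples. It assumes a naive O(N^2) DFT in place of cuFFTDx, a pointwise multiplication by a filter spectrum as the pointwise step (the samples may use a simpler operation), and hypothetical function names (dft, fft_convolve, direct_convolve).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(N^2) DFT used as a stand-in for an FFT.
// sign = -1 gives the forward transform, sign = +1 the unnormalized inverse.
std::vector<cplx> dft(const std::vector<cplx>& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // (j * k) % n keeps the angle small for better accuracy.
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>((j * k) % n) /
                                 static_cast<double>(n);
            sum += in[j] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution: forward transform, pointwise multiply, inverse
// transform, then normalization by the transform size.
std::vector<cplx> fft_convolve(const std::vector<cplx>& signal,
                               const std::vector<cplx>& filter) {
    assert(signal.size() == filter.size());
    const std::size_t n = signal.size();
    std::vector<cplx> spectrum = dft(signal, -1);
    const std::vector<cplx> filter_spectrum = dft(filter, -1);
    for (std::size_t i = 0; i < n; ++i) {
        spectrum[i] *= filter_spectrum[i]; // pointwise operation
    }
    std::vector<cplx> out = dft(spectrum, +1);
    for (cplx& v : out) {
        v /= static_cast<double>(n);
    }
    return out;
}

// Reference: circular convolution computed directly from its definition.
std::vector<cplx> direct_convolve(const std::vector<cplx>& signal,
                                  const std::vector<cplx>& filter) {
    assert(signal.size() == filter.size());
    const std::size_t n = signal.size();
    std::vector<cplx> out(n, cplx(0.0, 0.0));
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j) {
            out[k] += signal[j] * filter[(k + n - j) % n];
        }
    }
    return out;
}
```

On a GPU, the motivation for fusing the three steps into one kernel is that the intermediate spectrum never has to be written to and read back from global memory between steps.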

The convolution_performance example reports the performance difference between three options: (1) a single-kernel path using cuFFTDx (forward FFT, pointwise operation, and inverse FFT in a single kernel), (2) a 3-kernel path using cuFFT calls and a custom kernel for the pointwise operation, and (3) a 2-kernel path using the cuFFT callback API (requires the CUFFTDX_EXAMPLES_CUFFT_CALLBACK cmake option to be set to ON: -DCUFFTDX_EXAMPLES_CUFFT_CALLBACK=ON). Depending on the device, the precision, and the size of a given FFT, the improvements from using cuFFTDx range from 20% to 3x speed-ups. The results of convolution_performance on an NVIDIA A100 80GB GPU are presented in Fig. 1.


Fig. 1 Comparison of batched complex-to-complex convolution with pointwise scaling (forward FFT, scaling, inverse FFT) performed with cuFFT, cuFFT with the callback API (requires separate compilation), and cuFFTDx on A100 80GB with maximum clocks set. The chart presents relative performance compared to cuFFT (light blue).

2D/3D FFT Advanced Examples

  • fft_2d

  • fft_2d_r2c_c2r

  • fft_2d_single_kernel

  • fft_3d_box_single_block

  • fft_3d_cube_single_block

In each of the examples listed above cuFFTDx is used to perform multi-dimensional FFTs. Additionally, some of them include a performance comparison with cuFFT. The final performance of using cuFFTDx for 2D or 3D FFTs will depend on input/output functions, exact definitions of FFTs (precision, size, etc.), and custom pre- and post-processing functions that can be fused into kernels.

The fft_2d, fft_2d_r2c_c2r, and fft_2d_single_kernel examples show how to calculate 2D FFTs using cuFFTDx block-level execution (cufftdx::Block). The dimensions are big enough that the data does not fit into shared memory, so synchronization and data exchange have to be done via global memory. The fft_2d_r2c_c2r example is similar to convolution_r2c_c2r in that it transforms the input with a real-to-complex FFT and then back with a complex-to-real FFT. The fft_2d_single_kernel example attempts to perform a 2D FFT in a single kernel using a Cooperative Groups grid launch and grid-wide synchronization.
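To show why a synchronization point is needed between dimensions, here is a host-side C++ sketch of a 2D transform built from 1D transforms along rows and then along columns. It assumes the common row-column decomposition and a naive DFT in place of cuFFTDx. The samples may organize the work differently, and all function names here are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive forward 1D DFT, computed in place over n elements that start at
// `offset` and are `stride` elements apart. Stand-in for a 1D FFT.
void dft_1d_strided(std::vector<cplx>& data, std::size_t offset,
                    std::size_t stride, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<cplx> tmp(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = -2.0 * pi *
                                 static_cast<double>((j * k) % n) /
                                 static_cast<double>(n);
            sum += data[offset + j * stride] *
                   cplx(std::cos(angle), std::sin(angle));
        }
        tmp[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) {
        data[offset + k * stride] = tmp[k];
    }
}

// Forward 2D DFT of a row-major rows x cols array in two passes.
void dft_2d_row_column(std::vector<cplx>& data, std::size_t rows,
                       std::size_t cols) {
    assert(data.size() == rows * cols);
    // Pass 1: transform every row (contiguous elements).
    for (std::size_t r = 0; r < rows; ++r) {
        dft_1d_strided(data, r * cols, 1, cols);
    }
    // Pass 2: transform every column (strided elements). Each column reads
    // one value from every row, so all of pass 1 must be finished first.
    for (std::size_t c = 0; c < cols; ++c) {
        dft_1d_strided(data, c, cols, rows);
    }
}

// Reference: 2D DFT computed directly from its definition.
std::vector<cplx> dft_2d_direct(const std::vector<cplx>& in, std::size_t rows,
                                std::size_t cols) {
    assert(in.size() == rows * cols);
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(rows * cols, cplx(0.0, 0.0));
    for (std::size_t u = 0; u < rows; ++u) {
        for (std::size_t v = 0; v < cols; ++v) {
            for (std::size_t r = 0; r < rows; ++r) {
                for (std::size_t c = 0; c < cols; ++c) {
                    const double angle = -2.0 * pi *
                        (static_cast<double>(u * r) / static_cast<double>(rows) +
                         static_cast<double>(v * c) / static_cast<double>(cols));
                    out[u * cols + v] += in[r * cols + c] *
                                         cplx(std::cos(angle), std::sin(angle));
                }
            }
        }
    }
    return out;
}
```

The dependency noted in pass 2 is the reason a multi-block GPU implementation needs either two kernel launches or a grid-wide synchronization between the two passes.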

In the fft_3d_box_single_block and fft_3d_cube_single_block samples, cuFFTDx is used at the thread level (cufftdx::Thread) to execute small 3D FFTs in a single block.

Input/Output Helper Functions

  • block_io.hpp

block_io.hpp contains all helper input/output functions that are used in the example kernels. They are implemented according to the data layout requirements described in the Input/Output Data Format and Value Format sections. The included I/O functions are not guaranteed to deliver the best performance for every FFT configuration. Users may have to write their own to match their needs.