Examples
The cuFFTDx library provides multiple thread- and block-level FFT samples covering all supported precisions and types, as well as a few special examples that highlight performance benefits of cuFFTDx.
| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuFFTDx API introduction |
| Simple FFT Examples | Thread FFT Examples | simple_fft_thread | Complex-to-complex thread FFT |
| | | simple_fft_thread_fp16 | Complex-to-complex thread FFT half-precision |
| | Block FFT Examples | simple_fft_block | Complex-to-complex block FFT |
| | | simple_fft_block_r2c | Real-to-complex block FFT |
| | | simple_fft_block_c2r | Complex-to-real block FFT |
| | | simple_fft_block_half2 | Complex-to-complex block FFT with __half2 as data type |
| | | simple_fft_block_fp16 | Complex-to-complex block FFT half-precision |
| | | simple_fft_block_r2c_fp16 | Real-to-complex block FFT half-precision |
| | | simple_fft_block_c2r_fp16 | Complex-to-real block FFT half-precision |
| | Extra Block FFT Examples | simple_fft_block_shared | Complex-to-complex block FFT shared-memory API |
| | | simple_fft_block_std_complex | Complex-to-complex block FFT with cuda::std::complex as data type |
| | | simple_fft_block_cub_io | Complex-to-complex block FFT with CUB used for input/output |
| NVRTC Examples | | nvrtc_fft_thread | Complex-to-complex thread FFT |
| | | nvrtc_fft_block | Complex-to-complex block FFT |
| FFT Performance | | block_fft_performance | Benchmark for C2C block FFT |
| | | block_fft_performance_many | Benchmark for C2C/R2C/C2R block FFT |
| Convolution Examples | | convolution | Simplified FFT convolution |
| | | convolution_r2c_c2r | Simplified R2C-C2R FFT convolution |
| | | convolution_performance | Benchmark for FFT convolution using cuFFTDx and cuFFT |
| 2D/3D FFT Advanced Examples | | fft_2d | Example showing how to perform 2D FP32 C2C FFT with cuFFTDx |
| | | fft_2d_r2c_c2r | Example showing how to perform 2D FP32 R2C/C2R convolution with cuFFTDx |
| | | fft_2d_single_kernel | 2D FP32 FFT in a single kernel using Cooperative Groups kernel launch |
| | | fft_3d_box_single_block | Small 3D FP32 FFT that fits into a single block, each dimension is different |
| | | fft_3d_cube_single_block | Small 3D (equal dimensions) FP32 FFT that fits into a single block |
Introduction Examples
introduction_example
Examples used in the documentation to explain the basics of the cuFFTDx library and its API. introduction_example is used in the introductory guide to the cuFFTDx API: First FFT Using cuFFTDx.
Simple FFT Examples
simple_fft_thread* Examples
simple_fft_thread
simple_fft_thread_fp16
In each of the examples listed above, a one-dimensional complex-to-complex FFT routine is performed by a single CUDA thread. In both samples multiple threads are run, and each thread calculates an FFT. The input data is generated on the host and copied to a device buffer; after the FFTs are executed, the final results are copied back to the host.
The simple_fft_thread_fp16 example showcases the support for half-precision (fp16) in cuFFTDx. Please note that for half-precision cuFFTDx processes values in implicit batches of two FFTs, i.e., each thread processes two FFTs. See also the Half-Precision Implicit Batching section.
simple_fft_block* Examples
simple_fft_block
simple_fft_block_r2c
simple_fft_block_c2r
simple_fft_block_half2
simple_fft_block_fp16
simple_fft_block_r2c_fp16
simple_fft_block_c2r_fp16
In each of the examples listed above, a one-dimensional complex-to-complex, real-to-complex, or complex-to-real FFT is performed in a CUDA block.
The examples show how to create a complete FFT description, and then set the correct block dimensions and the necessary amount of shared memory.
In the kernels, the required array (thread_data) is allocated in per-thread registers, the input data is copied into it, the FFT is executed, and the results are transferred back to global memory. All samples use input/output functions from block_io.hpp. The input data is generated on the host and copied to a device buffer; after the FFT is executed, the final results are copied back to the host.
The simple_fft_block_(*)_fp16 examples showcase the support for half-precision (fp16) in cuFFTDx. Please note that in half-precision cuFFTDx processes values in implicit batches of two FFTs, i.e., each thread processes two FFTs. See also the Half-Precision Implicit Batching section.
The simple_fft_block_half2 example differs from simple_fft_block_fp16 in that it uses the __half2 type instead of cufftdx::complex<__half2> for half-precision complex values, which means the data is not implicitly batched on the type level. For this reason, this example uses a special load function (and a corresponding store function) that loads and rearranges values from the input buffer into cufftdx::complex<__half2> values, introducing implicit batching. See also the Half-Precision Implicit Batching section.
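As a host-side illustration of the rearrangement described above, the following C++ sketch packs the values of two FFTs into one batched sequence. It is not the load function from the example: half_t, half2_t, batched_complex, and pack_two_ffts are hypothetical stand-ins (plain float replaces __half so the code runs without CUDA), and the layout with the real parts of both FFTs in one pair and the imaginary parts in the other is an assumption. The Half-Precision Implicit Batching section remains the authoritative description of the layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 16-bit float; a real kernel would use __half.
using half_t = float;

// Hypothetical stand-in for __half2 holding one complex value as (re, im).
struct half2_t {
    half_t x;
    half_t y;
};

// Hypothetical stand-in for cufftdx::complex<__half2>.
// Assumed layout: real parts of both FFTs together, then imaginary parts.
struct batched_complex {
    half2_t real; // (re of FFT 0, re of FFT 1)
    half2_t imag; // (im of FFT 0, im of FFT 1)
};

// Packs element k of two separate FFT inputs into one batched value.
std::vector<batched_complex> pack_two_ffts(const std::vector<half2_t>& fft0,
                                           const std::vector<half2_t>& fft1) {
    assert(fft0.size() == fft1.size());
    std::vector<batched_complex> out(fft0.size());
    for (std::size_t k = 0; k < fft0.size(); ++k) {
        out[k].real = {fft0[k].x, fft1[k].x};
        out[k].imag = {fft0[k].y, fft1[k].y};
    }
    return out;
}
```

A matching store function would apply the inverse mapping, writing out[k].real.x and out[k].imag.x back to the first FFT and the .y components back to the second.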
Extra simple_fft_block(*) Examples
simple_fft_block_shared
simple_fft_block_std_complex
simple_fft_block_cub_io
The simple_fft_block_shared example differs from the other simple_fft_block_(*) examples because it uses the shared memory cuFFTDx API; see methods #3 and #4 in the Block Execute Method section.
The simple_fft_block_std_complex sample shows that the cuda::std::complex type can be used as the complex value type for data passed to cuFFTDx. This works because it has the same size and alignment as cufftdx::complex.
In simple_fft_block_cub_io, the NVIDIA CUB library (https://github.com/NVIDIA/cub) is used for input/output functions instead of the functions from block_io.hpp. It requires CUB version 1.13 or newer.
NVRTC Examples
nvrtc_fft_thread
nvrtc_fft_block
The NVRTC examples show how to use cuFFTDx at the thread and block level with NVRTC runtime compilation. The FFT descriptions created with cuFFTDx operators are defined only in the device code. The header file cufftdx.hpp is also included only in the device code that is passed to NVRTC. This works as long as the FFT doesn’t require extra workspace; see the Make Workspace Function section and FFT::requires_workspace.
Note
Since version 0.3.0, cuFFTDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.
FFT Performance
block_fft_performance
block_fft_performance_many
The examples listed above report the performance of cuFFTDx device functions calculating FFTs. Users can easily modify block_fft_performance to test the performance of a particular FFT they want to use. The block_fft_performance_many example runs benchmarks for multiple different single-precision FFT problems to show how performance changes depending on the size and the type of an FFT.
Convolution Examples
convolution
convolution_r2c_c2r
convolution_performance
The convolution examples perform a simplified FFT convolution, either with complex-to-complex forward and inverse FFTs (convolution), or with real-to-complex and complex-to-real FFTs (convolution_r2c_c2r).
The convolution_performance example reports the performance difference between 3 options: a single-kernel path using cuFFTDx (forward FFT, pointwise operation, and inverse FFT in a single kernel), a 3-kernel path using cuFFT calls and a custom kernel for the pointwise operation, and a 2-kernel path using the cuFFT callback API (requires the CUFFTDX_EXAMPLES_CUFFT_CALLBACK cmake option to be set to ON: -DCUFFTDX_EXAMPLES_CUFFT_CALLBACK=ON). Depending on the device, the precision, and the size of a given FFT, the speedup from using cuFFTDx ranges from 20% up to 3x. The results of convolution_performance on an NVIDIA A100 80GB GPU are presented in Fig. 1.
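The forward FFT, pointwise operation, inverse FFT sequence can be sketched on the host in plain C++. This is only a reference for the math, not the code of the examples: it uses a naive O(N^2) DFT in place of an FFT, the names dft and circular_convolution are hypothetical, and the pointwise step is assumed to be a multiplication by the spectrum of a filter, chosen here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive, unnormalized DFT standing in for an FFT.
// sign = -1 gives the forward transform, sign = +1 the inverse.
cvec dft(const cvec& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * 2.0 * pi *
                                 static_cast<double>((j * k) % n) /
                                 static_cast<double>(n);
            sum += in[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution: forward transform, pointwise multiply,
// inverse transform, then normalization by the transform size.
cvec circular_convolution(const cvec& signal, const cvec& filter) {
    assert(signal.size() == filter.size());
    const std::size_t n = signal.size();
    cvec spectrum = dft(signal, -1);
    const cvec filter_spectrum = dft(filter, -1);
    for (std::size_t k = 0; k < n; ++k) {
        spectrum[k] *= filter_spectrum[k]; // pointwise operation
    }
    cvec result = dft(spectrum, +1);
    for (auto& v : result) {
        v /= static_cast<double>(n);
    }
    return result;
}
```

In the single-kernel cuFFTDx path all three steps run inside one kernel, so the intermediate spectrum does not have to be written out between separate kernel launches.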
2D/3D FFT Advanced Examples
fft_2d
fft_2d_r2c_c2r
fft_2d_single_kernel
fft_3d_box_single_block
fft_3d_cube_single_block
In each of the examples listed above, cuFFTDx is used to perform multi-dimensional FFTs. Additionally, some of them include a performance comparison with cuFFT. The final performance of using cuFFTDx for 2D or 3D FFTs will depend on input/output functions, exact definitions of FFTs (precision, size, etc.), and custom pre- and post-processing functions that can be fused into kernels.
The fft_2d, fft_2d_r2c_c2r, and fft_2d_single_kernel examples show how to calculate 2D FFTs using cuFFTDx block-level execution (cufftdx::Block). The dimensions are big enough that the data doesn’t fit into shared memory, thus synchronization and data exchange have to be done via global memory. The fft_2d_r2c_c2r example is similar to convolution_r2c_c2r as it transforms the input with a real-to-complex FFT and then back with a complex-to-real FFT. The fft_2d_single_kernel example is an attempt to do a 2D FFT in a single kernel using Cooperative Groups grid launch and grid-wide synchronization.
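A 2D transform is commonly computed as two passes of 1D transforms, one along each dimension, which is why the whole first pass has to finish before the second can start. The host-side C++ sketch below illustrates that decomposition only. It uses a naive DFT instead of an FFT, the names dft_line and dft_2d are hypothetical, and the row-major layout is an assumption, not necessarily the layout used by the examples.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive forward DFT of n elements starting at `start` and spaced by
// `stride`, written back in place. Stands in for one 1D FFT.
void dft_line(std::vector<cplx>& data, std::size_t start,
              std::size_t stride, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = -2.0 * pi *
                                 static_cast<double>((j * k) % n) /
                                 static_cast<double>(n);
            sum += data[start + j * stride] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) {
        data[start + k * stride] = out[k];
    }
}

// 2D forward DFT of a row-major rows x cols array.
void dft_2d(std::vector<cplx>& data, std::size_t rows, std::size_t cols) {
    assert(data.size() == rows * cols);
    // Pass 1: every row is contiguous in memory.
    for (std::size_t r = 0; r < rows; ++r) {
        dft_line(data, r * cols, 1, cols);
    }
    // Pass 2: every column is strided by `cols`. Each column reads one
    // value from every row, so pass 1 must be complete before this starts.
    for (std::size_t c = 0; c < cols; ++c) {
        dft_line(data, c, cols, rows);
    }
}
```

The dependency between the two passes corresponds to the point where the 2D examples exchange data through global memory, either between kernels or, in fft_2d_single_kernel, across a grid-wide synchronization.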
In the fft_3d_box_single_block and fft_3d_cube_single_block samples, cuFFTDx is used at the thread level (cufftdx::Thread) to execute small 3D FFTs in a single block.
Input/Output Helper Functions
block_io.hpp
block_io.hpp contains all helper input/output functions that are used in the example kernels. They are implemented according to the data layout requirements described in the Input/Output Data Format and Value Format sections. The included I/O functions are not guaranteed to deliver the best performance for every FFT configuration. Users may have to write their own to match their needs.
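For readers writing their own I/O functions, the following host-side C++ sketch shows one way to reason about which input elements a thread keeps in its registers. It assumes a strided layout in which thread t of an FFT holds elements t, t + stride, t + 2 * stride, and so on, with stride equal to the FFT size divided by the number of elements per thread. The function names and this layout are assumptions made for illustration; the Input/Output Data Format section is the authoritative source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the input indices held by one thread, assuming a strided layout
// with stride = fft_size / elements_per_thread (hypothetical helper).
std::vector<std::size_t> indices_for_thread(std::size_t fft_size,
                                            std::size_t elements_per_thread,
                                            std::size_t thread_id) {
    assert(elements_per_thread > 0);
    assert(fft_size % elements_per_thread == 0);
    const std::size_t stride = fft_size / elements_per_thread;
    assert(thread_id < stride); // one thread per stride position
    std::vector<std::size_t> indices;
    indices.reserve(elements_per_thread);
    for (std::size_t i = 0; i < elements_per_thread; ++i) {
        indices.push_back(thread_id + i * stride);
    }
    return indices;
}

// Simulates the load step for one thread: copies its elements from the
// input buffer into a local array that plays the role of thread_data.
template <typename T>
std::vector<T> load_thread_data(const std::vector<T>& input,
                                std::size_t elements_per_thread,
                                std::size_t thread_id) {
    std::vector<T> thread_data;
    for (std::size_t idx :
         indices_for_thread(input.size(), elements_per_thread, thread_id)) {
        thread_data.push_back(input[idx]);
    }
    return thread_data;
}
```

With an 8-point FFT and 2 elements per thread, for example, this layout gives a stride of 4, so the thread with index 1 holds input elements 1 and 5.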