Examples#
The cuFFTDx library provides multiple thread- and block-level FFT samples covering all supported precisions and types, as well as a few special examples that highlight the performance benefits of cuFFTDx.
Examples

| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuFFTDx API introduction |
| Simple FFT Examples | Thread FFT Examples | simple_fft_thread | Complex-to-complex thread FFT |
| | | simple_fft_thread_fp16 | Complex-to-complex thread FFT half-precision |
| | Block FFT Examples | simple_fft_block | Complex-to-complex block FFT |
| | | simple_fft_block_r2c | Real-to-complex block FFT |
| | | simple_fft_block_c2r | Complex-to-real block FFT |
| | | simple_fft_block_half2 | Complex-to-complex block FFT with __half2 type |
| | | simple_fft_block_fp16 | Complex-to-complex block FFT half-precision |
| | | simple_fft_block_r2c_fp16 | Real-to-complex block FFT half-precision |
| | | simple_fft_block_c2r_fp16 | Complex-to-real block FFT half-precision |
| | Extra Block FFT Examples | simple_fft_block_shared | Complex-to-complex block FFT shared-memory API |
| | | simple_fft_block_std_complex | Complex-to-complex block FFT with cuda::std::complex type |
| | | simple_fft_block_cub_io | Complex-to-complex block FFT with CUB input/output functions |
| NVRTC Examples | | nvrtc_fft_thread | Complex-to-complex thread FFT |
| | | nvrtc_fft_block | Complex-to-complex block FFT |
| FFT Performance | | block_fft_performance | Benchmark for C2C block FFT |
| | | block_fft_performance_many | Benchmark for C2C/R2C/C2R block FFT |
| Convolution Examples | | convolution | Simplified FFT convolution |
| | | convolution_r2c_c2r | Simplified R2C-C2R FFT convolution |
| | | convolution_padded | R2C-C2R FFT convolution with optimization and zero padding |
| | | convolution_performance | Benchmark for FFT convolution using cuFFTDx and cuFFT |
| | | conv_3d/convolution_3d | cuFFTDx fused 3D convolution with preprocessing, filtering and postprocessing |
| | | conv_3d/convolution_3d_r2c | cuFFTDx fused 3D R2C/C2R FFT convolution |
| | | conv_3d/convolution_3d_c2r | cuFFTDx fused 3D C2R/R2C FFT convolution |
| | | conv_3d/convolution_3d_padded | cuFFTDx fused 3D FFT convolution using zero padding |
| | | conv_3d/convolution_3d_padded_r2c | cuFFTDx fused 3D R2C/C2R FFT convolution with zero padding |
| 2D/3D FFT Advanced Examples | | fft_2d | Example showing how to perform 2D FP32 C2C FFT with cuFFTDx |
| | | fft_2d_r2c_c2r | Example showing how to perform 2D FP32 R2C/C2R convolution with cuFFTDx |
| | | fft_2d_single_kernel | 2D FP32 FFT in a single kernel using Cooperative Groups kernel launch |
| | | fft_3d_box_single_block | Small 3D FP32 FFT that fits into a single block, each dimension is different |
| | | fft_3d_cube_single_block | Small 3D (equal dimensions) FP32 FFT that fits into a single block |
| | | fft_3d | Example showing how to perform 3D FP32 C2C FFT with cuFFTDx |
| Mixed Precision Examples | | mixed_precision_fft_1d | Example showing how to use separate storage and compute precisions |
| | | mixed_precision_fft_2d | Mixed precision 2D FFT with benchmarking and accuracy comparison |
Introduction Examples#
introduction_example
Examples used in the documentation to explain the basics of the cuFFTDx library and its API. introduction_example is used in the introductory guide to the cuFFTDx API: First FFT Using cuFFTDx.
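For orientation, below is a minimal sketch in the spirit of introduction_example: an FFT description is built by combining cuFFTDx operators, and the resulting type is executed in a kernel. The size, precision, and layout choices here are placeholders, not the example's exact values.

```cpp
#include <cufftdx.hpp>

// Describe the FFT by combining operators; the decltype of the sum is the FFT type.
using FFT = decltype(cufftdx::Size<128>() + cufftdx::Precision<float>()
                     + cufftdx::Type<cufftdx::fft_type::c2c>()
                     + cufftdx::Direction<cufftdx::fft_direction::forward>()
                     + cufftdx::Thread());

__global__ void thread_fft_kernel(typename FFT::value_type* data) {
    using complex_type = typename FFT::value_type;
    // With thread execution, each thread owns one full FFT in registers.
    complex_type thread_data[FFT::storage_size];

    const unsigned int offset =
        (blockIdx.x * blockDim.x + threadIdx.x) * cufftdx::size_of<FFT>::value;
    for (unsigned int i = 0; i < cufftdx::size_of<FFT>::value; ++i) {
        thread_data[i] = data[offset + i];
    }
    FFT().execute(thread_data);
    for (unsigned int i = 0; i < cufftdx::size_of<FFT>::value; ++i) {
        data[offset + i] = thread_data[i];
    }
}
```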
Simple FFT Examples#
simple_fft_thread* Examples#
simple_fft_thread
simple_fft_thread_fp16
In each of the examples listed above, a one-dimensional complex-to-complex FFT routine is performed by a single CUDA thread. In both samples multiple threads are launched, and each thread calculates an FFT. The input data is generated on the host, copied to a device buffer, and then the final results are copied back to the host.
The simple_fft_thread_fp16 example showcases the support for half-precision (fp16) in cuFFTDx. Please note that for half precision, cuFFTDx processes values in implicit batches of two FFTs, i.e., each thread processes two FFTs. See also the Half-Precision Implicit Batching section.
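To make the implicit batching concrete, here is a hedged sketch of packing two FFT batches into the half-precision value type. The packing layout shown (real parts of both batches together, imaginary parts together) is an assumption for illustration; the authoritative description is in the Half-Precision Implicit Batching section.

```cpp
#include <cuda_fp16.h>
#include <cufftdx.hpp>
using namespace cufftdx;

// With Precision<__half>, the value type packs two FFT batches per element.
using FFT = decltype(Size<128>() + Precision<__half>()
                     + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + Thread());
using complex_type = typename FFT::value_type; // complex<__half2>

// Illustrative helper: zip one complex value from batch 0 and one from
// batch 1 into a single implicitly batched register value.
__device__ complex_type make_batched_value(float2 batch0, float2 batch1) {
    return complex_type{__floats2half2_rn(batch0.x, batch1.x),   // real parts
                        __floats2half2_rn(batch0.y, batch1.y)};  // imaginary parts
}
```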
simple_fft_block* Examples#
simple_fft_block
simple_fft_block_r2c
simple_fft_block_c2r
simple_fft_block_half2
simple_fft_block_fp16
simple_fft_block_r2c_fp16
simple_fft_block_c2r_fp16
In each of the examples listed above, a one-dimensional complex-to-complex, real-to-complex, or complex-to-real FFT is performed in a CUDA block. The examples show how to create a complete FFT description, and then set the correct block dimensions and the necessary amount of shared memory. In the kernels, the required per-thread register array (thread_data) is allocated, the input data is copied into it, the FFT is executed, and the results are transferred back to global memory. All samples use input/output functions from block_io.hpp. The input data is generated on the host, copied to a device buffer, and then the final results are copied back to the host.
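The common kernel structure looks roughly like the sketch below: a simplified fp32 C2C variant where the manual loads and stores stand in for the block_io.hpp helpers, and the launch configuration comes from the FFT traits.

```cpp
#include <cufftdx.hpp>
using namespace cufftdx;

using FFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>()
                     + FFTsPerBlock<2>() + SM<800>() + Block());

__global__ void block_fft_kernel(typename FFT::value_type* data) {
    using complex_type = typename FFT::value_type;
    complex_type thread_data[FFT::storage_size];  // per-thread registers
    extern __shared__ complex_type shared_mem[];  // FFT::shared_memory_size bytes
                                                  // (the samples also align this)

    // Simplified load: one FFT per threadIdx.y, elements strided by FFT::stride.
    const unsigned fft_id = blockIdx.x * FFT::ffts_per_block + threadIdx.y;
    unsigned index = fft_id * size_of<FFT>::value + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i] = data[index];
        index += FFT::stride;
    }

    FFT().execute(thread_data, shared_mem);

    index = fft_id * size_of<FFT>::value + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        data[index] = thread_data[i];
        index += FFT::stride;
    }
}

// Launch with the block dimensions and shared memory the description requires:
// block_fft_kernel<<<batches / FFT::ffts_per_block, FFT::block_dim,
//                    FFT::shared_memory_size>>>(device_data);
```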
The simple_fft_block_(*)_fp16 examples showcase the support for half-precision (fp16) in cuFFTDx. Please note that in half precision, cuFFTDx processes values in implicit batches of two FFTs, i.e., each thread processes two FFTs. See also the Half-Precision Implicit Batching section.
The simple_fft_block_half2 example differs from simple_fft_block_fp16 in that it uses the __half2 type instead of cufftdx::complex<__half2> for half-precision complex values, which means the data is not implicitly batched on the type level. For this reason, this example uses a special load function (and a matching store function) that loads values from the input buffer and rearranges them into cufftdx::complex<__half2> values, introducing the implicit batching. See also the Half-Precision Implicit Batching section.
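A hedged sketch of what such a rearranging load boils down to: two plain __half2 complex values (x = real, y = imag), one from each batch, are zipped into one implicitly batched value. The target layout is an assumption for illustration.

```cpp
#include <cuda_fp16.h>
#include <cufftdx.hpp>

// Zip one complex value from each of two batches: the low halves of both
// inputs form the real pair, the high halves the imaginary pair (assumed layout).
__device__ cufftdx::complex<__half2> zip_batches(__half2 batch0, __half2 batch1) {
    return cufftdx::complex<__half2>{__lows2half2(batch0, batch1),
                                     __highs2half2(batch0, batch1)};
}
```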
Extra simple_fft_block(*) Examples#
simple_fft_block_shared
simple_fft_block_std_complex
simple_fft_block_cub_io
The simple_fft_block_shared example is different from the other simple_fft_block_(*) examples because it uses the shared memory cuFFTDx API; see methods #3 and #4 in the Block Execute Method section.
The simple_fft_block_std_complex sample shows that the cuda::std::complex type can be used as the complex value type for data passed to cuFFTDx. This works because it has the same size and alignment as cufftdx::complex.
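The compatibility claim can be checked directly at compile time; a minimal sketch:

```cpp
#include <cuda/std/complex>
#include <cufftdx.hpp>

// cuda::std::complex<float> can stand in for cufftdx::complex<float> because
// the two types match in size and alignment, so buffers of one can be
// reinterpreted as buffers of the other.
static_assert(sizeof(cuda::std::complex<float>) == sizeof(cufftdx::complex<float>),
              "same size");
static_assert(alignof(cuda::std::complex<float>) == alignof(cufftdx::complex<float>),
              "same alignment");
```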
In simple_fft_block_cub_io, the NVIDIA CUB library (NVIDIA/cub) is used for the input/output functions instead of the functions from block_io.hpp. It requires CUB version 1.13 or newer.
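Conceptually, the replacement looks like the sketch below: cub::BlockLoad/cub::BlockStore with a striped arrangement move each thread's elements between global memory and registers. This is a hedged sketch assuming an fp32 C2C FFT with one FFT per block, where FFT::storage_size equals FFT::elements_per_thread.

```cpp
#include <cub/cub.cuh>
#include <cufftdx.hpp>

template<class FFT>
__global__ void block_fft_cub_kernel(typename FFT::value_type* data) {
    using complex_type = typename FFT::value_type;
    constexpr unsigned fft_threads =
        cufftdx::size_of<FFT>::value / FFT::elements_per_thread;
    // The striped arrangement matches the register layout the FFT expects.
    using BlockLoad  = cub::BlockLoad<complex_type, fft_threads,
                                      FFT::elements_per_thread, cub::BLOCK_LOAD_STRIPED>;
    using BlockStore = cub::BlockStore<complex_type, fft_threads,
                                       FFT::elements_per_thread, cub::BLOCK_STORE_STRIPED>;

    complex_type thread_data[FFT::elements_per_thread];
    extern __shared__ complex_type shared_mem[];

    complex_type* block_data = data + blockIdx.x * cufftdx::size_of<FFT>::value;
    BlockLoad().Load(block_data, thread_data);
    FFT().execute(thread_data, shared_mem);
    BlockStore().Store(block_data, thread_data);
}
```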
NVRTC Examples#
nvrtc_fft_thread
nvrtc_fft_block
The NVRTC examples show how to use cuFFTDx at the thread and block level with NVRTC runtime compilation. The FFT descriptions created with cuFFTDx operators are defined only in the device code. The header file cufftdx.hpp is also included only in the device code that is passed to NVRTC. This works as long as the FFT doesn't require extra workspace; see the Make Workspace Function section and FFT::requires_workspace.
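The host-side flow is standard NVRTC; a hedged sketch follows, where the include path and architecture flag are placeholders to be adapted to the local setup.

```cpp
#include <nvrtc.h>

// The FFT description and cufftdx.hpp live only in the device source string.
const char* fft_kernel_source = R"src(
#include <cufftdx.hpp>
using FFT = decltype(cufftdx::Size<64>() + cufftdx::Precision<float>()
                     + cufftdx::Type<cufftdx::fft_type::c2c>()
                     + cufftdx::Direction<cufftdx::fft_direction::forward>()
                     + cufftdx::SM<800>() + cufftdx::Block());
__global__ void fft_kernel(typename FFT::value_type* data) { /* ... */ }
)src";

void compile_fft_kernel() {
    nvrtcProgram program;
    nvrtcCreateProgram(&program, fft_kernel_source, "fft_kernel.cu",
                       0, nullptr, nullptr);
    // cufftdx.hpp is found via an include path passed to NVRTC.
    const char* options[] = {"--std=c++17",
                             "--include-path=/path/to/cufftdx/include", // placeholder
                             "--gpu-architecture=compute_80"};          // placeholder
    nvrtcCompileProgram(program, 3, options);
    // ... retrieve PTX with nvrtcGetPTX and launch via the CUDA driver API
    nvrtcDestroyProgram(&program);
}
```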
Note
Since version 0.3.0, cuFFTDx has experimental support for compilation with NVRTC. See the Requirements and Functionality section.
FFT Performance#
block_fft_performance
block_fft_performance_many
The examples listed above report the performance of the cuFFTDx device functions calculating FFTs. Users can easily modify block_fft_performance to test the performance of a particular FFT they want to use. The block_fft_performance_many example runs benchmarks for multiple different single-precision FFT problems to show how performance changes depending on the size and type of an FFT.
Convolution Examples#
convolution
convolution_r2c_c2r
convolution_padded
convolution_performance
The convolution examples perform a simplified FFT convolution, either with complex-to-complex forward and inverse FFTs (convolution), or with real-to-complex and complex-to-real FFTs (convolution_r2c_c2r). The most detailed example (convolution_padded) performs a real convolution in 3 ways:
* by padding the input with 0s to the closest power of 2 and executing an optimized cuFFTDx R2C / C2R convolution,
* by leaving the input as is and executing a non-optimized cuFFTDx R2C / C2R convolution,
* by using a 3-kernel cuFFT convolution method,
and compares their accuracy and performance on 8 different FFT sizes to point out where such an optimization may be the most useful.
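The core single-kernel convolution pattern shared by these examples can be sketched as follows: forward FFT, pointwise multiply in the frequency domain, inverse FFT, all in one kernel. This is a hedged fp32 C2C sketch with the filter assumed to be pre-transformed and laid out like the data, and no normalization applied; the real examples use the block_io.hpp helpers and more careful indexing.

```cpp
#include <cufftdx.hpp>
using namespace cufftdx;

using FFT  = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                      + Direction<fft_direction::forward>() + SM<800>() + Block());
using IFFT = decltype(Size<128>() + Precision<float>() + Type<fft_type::c2c>()
                      + Direction<fft_direction::inverse>() + SM<800>() + Block());

__global__ void convolution_kernel(typename FFT::value_type* data,
                                   const typename FFT::value_type* filter) {
    using complex_type = typename FFT::value_type;
    complex_type thread_data[FFT::storage_size];
    extern __shared__ complex_type shared_mem[];

    const unsigned offset = blockIdx.x * size_of<FFT>::value;
    unsigned index = offset + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i] = data[index];
        index += FFT::stride;
    }

    FFT().execute(thread_data, shared_mem);   // forward FFT

    // Pointwise complex multiply by the (pre-transformed) filter.
    index = offset + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        const complex_type a = thread_data[i];
        const complex_type b = filter[index];
        thread_data[i] = complex_type{a.x * b.x - a.y * b.y,
                                      a.x * b.y + a.y * b.x};
        index += FFT::stride;
    }

    IFFT().execute(thread_data, shared_mem);  // inverse FFT

    index = offset + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        data[index] = thread_data[i];
        index += FFT::stride;
    }
}
```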
Fig. 3 Comparison of batched real-to-real convolution with pointwise scaling (forward FFT, scaling, inverse FFT) performed with cuFFT, cuFFTDx with default settings and unchanged input, and cuFFTDx with input zero-padded to the closest power of 2 and the real_mode::folded optimization enabled, on H100 80GB with maximum clocks set. Chart presents relative performance compared to cuFFT (light blue).#
The convolution_performance example reports the performance difference between 3 options: the single-kernel path using cuFFTDx (forward FFT, pointwise operation, inverse FFT in a single kernel); the 3-kernel path using cuFFT calls and a custom kernel for the pointwise operation; and the 2-kernel path using the cuFFT callback API (requires the CUFFTDX_EXAMPLES_CUFFT_CALLBACK CMake option to be set to ON: -DCUFFTDX_EXAMPLES_CUFFT_CALLBACK=ON). Depending on the device, the precision, and the size of a given FFT, the improvements from using cuFFTDx range from 45% up to 3x speed-ups. The performance comparison between the cuFFTDx and cuFFT convolution_performance results on an NVIDIA H100 80GB HBM3 GPU is presented in Fig. 4.
Fig. 4 Comparison of batched complex-to-complex convolution with pointwise scaling (forward FFT, scaling, inverse FFT) performed with cuFFT and cuFFTDx on H100 80GB HBM3 with maximum clocks set. Chart presents relative performance compared to cuFFT (light blue).#
3D Convolution Examples#
conv_3d/convolution_3d
conv_3d/convolution_3d_r2c
conv_3d/convolution_3d_c2r
conv_3d/convolution_3d_padded
conv_3d/convolution_3d_padded_r2c
The conv_3d folder contains several 3D FFT convolution examples as well as the required I/O functions.
In conv_3d/convolution_3d we present an idiomatic way to perform a large 3D FFT convolution that does not fit into shared memory, with:
* an elementwise preprocessing function,
* an elementwise filtering function,
* an elementwise postprocessing function.
Preprocessing is merged with the first execution of the outermost dimension, and postprocessing is merged with its second execution. The innermost-dimension FFTs are merged together with the filtering function. This brings the number of kernels down from 9 (six FFT kernels for the forward and inverse passes over three dimensions, plus three elementwise kernels) to only 5. Since strided kernels are necessary for the outer dimensions, input and output handling is added that stages those transfers through padded shared memory to maximize global memory coalescing and minimize shared memory bank conflicts.
The conv_3d/convolution_3d_padded example shows a similar approach to the 3D FFT convolution, but with the dimensions padded to different sizes. This can improve performance for unusual dimensions which require a fallback to a slower path, for example prime numbers. Normally, padding to the closest power of 2 gives the best speed-ups, but padding to the closest optimized size can provide higher accuracy of the results. All the supported optimized (no workspace required) sizes can be checked here.
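For reference, picking the padded size can be as simple as rounding the FFT length up to the next power of 2; a tiny illustrative helper:

```cpp
// Round an FFT length up to the next power of 2 (illustrative helper).
constexpr unsigned next_power_of_two(unsigned n) {
    unsigned p = 1;
    while (p < n) { p <<= 1; }
    return p;
}
static_assert(next_power_of_two(97) == 128, "a prime size like 97 pads to 128");
```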
When comparing a 3D FFT convolution implemented with cuFFTDx against a cuFFT+Thrust implementation (with Thrust handling the pre-/post-processing and filtering functions), cuFFTDx obtains speedups between 1.3x and 3.9x depending on the size. A comparison between cuFFTDx and cuFFT on H100 80GB HBM3 is shown in Fig. 5.
Fig. 5 Comparison of 3D complex-to-complex convolution with pre/post processing and filtering performed with cuFFT+Thrust and cuFFTDx with default EPT and suggested FFTs per block on H100 80GB HBM3 with maximum clocks set. For non-optimized sizes results from 3D FFT using cuFFTDx with zero-padded input to the closest power of 2 are also shown. Chart presents relative speedup compared to cuFFT (light blue).#
Other examples show how to perform a forward-inverse 3D R2C/C2R FFT convolution (conv_3d/convolution_3d_r2c) or an inverse-forward 3D C2R/R2C FFT convolution (conv_3d/convolution_3d_c2r), with the convolution fused either with the X or Z dimension FFT kernel. The main difference from the previous complex-to-complex convolution examples is the configuration of the Z dimension kernel that performs the C2R/R2C change. If the size is a power of 2, this kernel can benefit from the real_mode::folded optimization.
Note
3D convolution examples have limitations on their shape and chosen parameters. Assuming the convolution size is X/Y/Z, the number of subbatches of each dimension (i.e., the product of the remaining dimensions) must be divisible by the ffts-per-block parameter of the dimension in question; for example, for a 64/64/64 convolution, the X dimension has 64 * 64 = 4096 subbatches, so its ffts-per-block value must divide 4096. For C2R/R2C this number of subbatches depends on the input/output length, which is different from the FFT size. Please refer to the notes in the examples for further detail.
2D/3D FFT Advanced Examples#
fft_2d
fft_2d_r2c_c2r
fft_2d_single_kernel
fft_3d
fft_3d_box_single_block
fft_3d_cube_single_block
In each of the examples listed above cuFFTDx is used to perform multi-dimensional FFTs. Additionally, some of them include a performance comparison with cuFFT. The final performance of using cuFFTDx for 2D or 3D FFTs will depend on input/output functions, exact definitions of FFTs (precision, size, etc.), and custom pre- and post-processing functions that can be fused into kernels.
The fft_2d, fft_2d_r2c_c2r, and fft_2d_single_kernel examples show how to calculate 2D FFTs using cuFFTDx block-level execution (cufftdx::Block). The dimensions are big enough that the data doesn't fit into shared memory, thus synchronization and data exchange have to be done via global memory. The fft_2d_r2c_c2r example is similar to convolution_r2c_c2r as it transforms the input with a real-to-complex FFT and then back with a complex-to-real FFT. The fft_2d_single_kernel example is an attempt to do a 2D FFT in a single kernel using Cooperative Groups grid launch and grid-wide synchronization.
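Structurally, the two-kernel 2D FFT boils down to one batched FFT over contiguous rows followed by one over strided columns, exchanging intermediate data through global memory. A hedged sketch (layout, offsets, and one-FFT-per-block configuration are assumptions):

```cpp
#include <cufftdx.hpp>
using namespace cufftdx;

// One kernel per dimension: GlobalStride = 1 walks a contiguous row (X),
// GlobalStride = size_x walks a strided column (Y). One FFT per block assumed.
template<class FFT, unsigned GlobalStride>
__global__ void fft_dim_kernel(typename FFT::value_type* data) {
    using complex_type = typename FFT::value_type;
    complex_type thread_data[FFT::storage_size];
    extern __shared__ complex_type shared_mem[];

    // Row b starts at b * size; column b starts at element b.
    const unsigned batch_offset =
        (GlobalStride == 1) ? blockIdx.x * size_of<FFT>::value : blockIdx.x;
    unsigned index = batch_offset + threadIdx.x * GlobalStride;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        thread_data[i] = data[index];
        index += FFT::stride * GlobalStride;
    }
    FFT().execute(thread_data, shared_mem);
    index = batch_offset + threadIdx.x * GlobalStride;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        data[index] = thread_data[i];
        index += FFT::stride * GlobalStride;
    }
}

// Launch sketch (size_x, size_y compile-time constants in this illustration):
// fft_dim_kernel<FFT_X, 1><<<size_y, FFT_X::block_dim, FFT_X::shared_memory_size>>>(d);
// fft_dim_kernel<FFT_Y, size_x><<<size_x, FFT_Y::block_dim, FFT_Y::shared_memory_size>>>(d);
```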
The fft_3d example shows how to calculate 3D FFTs using cuFFTDx block-level execution (cufftdx::Block). The dimensions are big enough that the data doesn't fit into shared memory, thus synchronization and data exchange have to be done via global memory. The fft_3d example is similar to fft_2d as it is an evolution of the idea of unfused kernels processing one 1D FFT at a time. It shows the general concept, but for the best performance the remaining 3D FFT examples mentioned below should be considered. The fft_3d example can be used as a foundation for experiments with different precisions for data and computation, or for fusing pre- and post-processing into a single kernel.
In the fft_3d_box_single_block and fft_3d_cube_single_block samples, cuFFTDx is used at the thread level (cufftdx::Thread) to execute small 3D FFTs in a single block.
Mixed Precision FFT Examples#
mixed_precision_fft_1d
mixed_precision_fft_2d
The mixed precision examples demonstrate how to decouple the cuFFTDx computation precision from the type of the input and output global memory buffers. The techniques presented in the two aforementioned files let the user minimize I/O, allowing for a significant speedup. The decrease in accuracy of the results can be minimal, but may depend on the input data and the exact algorithm. Both examples include performance benchmarking and an accuracy comparison with an equivalent full-precision computation. It is important to note that this does not imply utilizing any of the CUDA Mixed Precision capabilities.
The mixed_precision_fft_1d example shows how to compute a 1D FFT (forward, complex-to-complex) in double precision using cuFFTDx, with the data stored in global memory in single precision. It shows the performance of this approach using both the register and shared memory APIs. The example also compares the accuracy of the results with the typical approach where the same precision is used for both I/O and computations. The speed-up obtained by using mixed precision depends on several factors, reaching 1.7x for an FFT of 512 elements. Performance results are shown in Fig. 6.
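The idea reduces to widening values on load and narrowing on store; a hedged sketch, where names and layout are illustrative rather than the example's exact code: data lives in global memory as float (storage precision) and is converted to double (compute precision) in registers.

```cpp
#include <cufftdx.hpp>
using namespace cufftdx;

using FFT = decltype(Size<512>() + Precision<double>() + Type<fft_type::c2c>()
                     + Direction<fft_direction::forward>() + SM<800>() + Block());

__global__ void mixed_precision_kernel(const float2* input, float2* output) {
    using complex_type = typename FFT::value_type; // complex<double>
    complex_type thread_data[FFT::storage_size];
    extern __shared__ complex_type shared_mem[];

    const unsigned offset = blockIdx.x * size_of<FFT>::value;
    unsigned index = offset + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        const float2 v = input[index]; // load in storage precision (float)
        thread_data[i] = complex_type{static_cast<double>(v.x),
                                      static_cast<double>(v.y)};
        index += FFT::stride;
    }

    FFT().execute(thread_data, shared_mem); // compute in double precision

    index = offset + threadIdx.x;
    for (unsigned i = 0; i < FFT::elements_per_thread; ++i) {
        const complex_type r = thread_data[i];
        output[index] = make_float2(static_cast<float>(r.x),
                                    static_cast<float>(r.y));
        index += FFT::stride;
    }
}
```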
Fig. 6 Comparison of batched complex-to-complex FFT with mixed precision (storage: float, compute: double) performed with cuFFT and cuFFTDx with register API on H100 80GB HBM3 with maximum clocks set. Chart presents relative performance compared to cuFFT (light blue).#
In the mixed_precision_fft_2d example, a similar approach is taken to compute a 2D FFT (forward, complex-to-complex) in double precision using cuFFTDx, with the data in global memory stored in single precision. The 2D FFT is implemented in one kernel, just as in the fft_2d_single_kernel example.
Both examples use specific I/O functions that convert from the original to the target precision while loading from and storing to global memory (see Input/Output Helper Functions).
Input/Output Helper Functions#
Note
The included I/O functions are not guaranteed to deliver the best performance for every FFT configuration. Users may have to write their own to match their needs.
block_io.hpp
block_io.hpp contains all the helper input/output functions that are used in the example kernels. They are implemented according to the data layout requirements described in the Data Layouts and Value Format sections.
mixed_io.hpp
Additionally, mixed_io.hpp contains the helper input/output functions and structures necessary for mixed precision cuFFTDx usage, performing data type conversions where required. Currently only a few storage/compute precision combinations are handled: fp16/fp32, bf16/fp32, and fp32/fp64.
padded_io.hpp
padded_io.hpp contains the helper input/output functions and structures necessary for the zero-padded convolutions presented in the convolution_padded example. The utilities there load data conditionally, based on the signal length, and offset memory accesses accordingly.
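The conditional-load idea can be sketched in a few lines (illustrative only; the real helpers also handle offsets and strides):

```cpp
#include <cufftdx.hpp>

// Values past the real signal length come back as zero, so the FFT sees a
// zero-padded input without the buffer itself being padded.
template<class FFT>
__device__ typename FFT::value_type load_padded(const typename FFT::value_type* input,
                                                unsigned index,
                                                unsigned signal_length) {
    using complex_type = typename FFT::value_type;
    return (index < signal_length) ? input[index] : complex_type{0.0f, 0.0f};
}
```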