Accuracy and performance considerations

NVPL FFT employs the Cooley-Tukey algorithm to optimize the performance of particular transform sizes.

The Cooley-Tukey algorithm expresses the DFT matrix as a product of sparse building block matrices, where NVPL FFT has implementations of building blocks of radix-2, radix-3, radix-5 and radix-7. Hence NVPL FFT is more efficient when the transform sizes can be written in the form \(2^a \times 3^b \times 5^c \times 7^d\), for a, b, c, and d being some non-negative integers.

For the sizes that can not be written in this form, the Bluestein algorithm is used. Since the implementation of the Bluestein algorithm requires more computations per output point than the implementation of the Cooley-Tukey algorithm, the Cooley-Tukey algorithm has better accuracy.

For general tips for how to get better performance of NVPL FFT, please see Table: NVPL FFT performance tips.

Table: NVPL FFT performance tips

Applies to

Recommendation

Comment

All

Use single precision transforms.

Single precision transforms require less bandwidth per computation than double precision transforms.

All

Restrict the size along each dimension to use fewer distinct prime factors.

A transform of \(2^n\) or \(3^m\) will usually be faster than one of size \(2^i\times 3^j\) even if the latter is slightly smaller, due to the composition of specialized paths.

All

Restrict the data to be contiguous in memory when performing a single transform. When performing multiple transforms make the individual datasets contiguous.

The library is optimized for this data layout.

All

Perform multiple, i.e., batched, transforms.

The library has additional optimizations performed in batched mode.

Grace & AWS Graviton 3 CPUs

Use number of threads equal to the number CPU cores

The library achieves best performance with this configuration.

Warning

As noted in Release Notes, some of the supported FFT sizes, including composite sizes and sizes greater than 50K elements, are not optimized to the full extent.

Result reproducibility

Results produced by NVPL FFT are deterministic, i.e., bitwise reproducible, as long as (1) the plan input parameters, (2) the multi-threading setup, (3) the NVPL FFT version and (4) the CPU model are kept constant between runs.

Note

Different number of threads could lead to different code paths for performance optimizations. Hence the results from running with different number of threads could be different from each other.

Note

As cuFFT, when computing DFTs in batched mode, signals from different batches could be combined for processing for performance optimizations. Therefore, the results from batched mode and non-batched executions could exhibit differences.