Accuracy and performance considerations¶
NVPL FFT employs the Cooley-Tukey algorithm to optimize the performance of particular transform sizes.
The Cooley-Tukey algorithm expresses the DFT matrix as a product of sparse building block matrices, where NVPL FFT has implementations of building blocks of radix-2
, radix-3
, radix-5
and radix-7
.
Hence NVPL FFT is more efficient when the transform sizes can be written in the form \(2^a \times 3^b \times 5^c \times 7^d\), for a, b, c, and d being some non-negative integers.
For the sizes that can not be written in this form, the Bluestein algorithm is used. Since the implementation of the Bluestein algorithm requires more computations per output point than the implementation of the Cooley-Tukey algorithm, the Cooley-Tukey algorithm has better accuracy.
For general tips for how to get better performance of NVPL FFT, please see Table: NVPL FFT performance tips.
Applies to |
Recommendation |
Comment |
---|---|---|
All |
Use single precision transforms. |
Single precision transforms require less bandwidth per computation than double precision transforms. |
All |
Restrict the size along each dimension to use fewer distinct prime factors. |
A transform of \(2^n\) or \(3^m\) will usually be faster than one of size \(2^i\times 3^j\) even if the latter is slightly smaller, due to the composition of specialized paths. |
All |
Restrict the data to be contiguous in memory when performing a single transform. When performing multiple transforms make the individual datasets contiguous. |
The library is optimized for this data layout. |
All |
Perform multiple, i.e., batched, transforms. |
The library has additional optimizations performed in batched mode. |
Grace & AWS Graviton 3 CPUs |
Use number of threads equal to the number CPU cores |
The library achieves best performance with this configuration. |
Warning
As noted in Release Notes, some of the supported FFT sizes, including composite sizes and sizes greater than 50K elements, are not optimized to the full extent.
Result reproducibility¶
Results produced by NVPL FFT are deterministic, i.e., bitwise reproducible, as long as (1) the plan input parameters, (2) the multi-threading setup, (3) the NVPL FFT version and (4) the CPU model are kept constant between runs.
Note
Different number of threads could lead to different code paths for performance optimizations. Hence the results from running with different number of threads could be different from each other.
Note
As cuFFT, when computing DFTs in batched mode, signals from different batches could be combined for processing for performance optimizations. Therefore, the results from batched mode and non-batched executions could exhibit differences.