Fast Fourier Transform#

Overview#

The Fast Fourier Transform (FFT) module nvmath.fft in nvmath-python leverages the NVIDIA cuFFT library and provides a suite of APIs that can be called directly from the host to compute discrete Fourier transforms efficiently. Both stateless function-form APIs and stateful class-form APIs are provided to support a range of N-dimensional FFT operations. These include forward and inverse transforms, as well as complex-to-complex (C2C), complex-to-real (C2R), and real-to-complex (R2C) transforms.
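As a rough illustration of the two API styles, the sketch below wraps a function-form call and a class-form call in two small helpers and prepares a NumPy reference result to compare against. The helper names (fft_stateless, fft_stateful, check_against_numpy) are made up for this example, and the plan()/execute() usage is an assumption based on the FFT class summary in the API reference; nvmath is imported lazily so the snippet loads even where it is not installed.

```python
import numpy as np


def fft_stateless(a, axes):
    """Function-form API: one call plans, executes, and releases resources."""
    import nvmath  # imported lazily so the sketch loads without nvmath

    return nvmath.fft.fft(a, axes=axes)


def fft_stateful(a, axes):
    """Class-form API: create the object, plan once, then execute."""
    import nvmath

    with nvmath.fft.FFT(a, axes=axes) as f:
        f.plan()
        return f.execute()


# A batch of four complex 8x8 signals; only the last two axes are transformed.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 8)) + 1j * rng.standard_normal((4, 8, 8))
reference = np.fft.fftn(a, axes=(1, 2))


def check_against_numpy():
    """Compare both nvmath styles with NumPy; requires a working nvmath install."""
    assert np.allclose(fft_stateless(a, [1, 2]), reference)
    assert np.allclose(fft_stateful(a, [1, 2]), reference)
```

Calling check_against_numpy() on a machine with nvmath-python and a suitable backend should pass silently. Because the operand here is a NumPy array in host memory, the default execution space would be the CPU.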

Furthermore, the nvmath.fft.FFT class includes utility APIs that help users cache FFT plans, so that repeated calculations can reuse an existing plan instead of creating a new one (see create_key()).
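The caching idea can be sketched independently of the library. The example below is a generic, pure-Python plan cache: everything in it (PlanCache, Operand, simple_key, fake_plan) is hypothetical and only stands in for the real pieces. With nvmath, the key function would presumably wrap FFT.create_key() and the plan function would create and plan an FFT object, whose operand would then have to be swapped before each reuse.

```python
from collections import namedtuple

# Stand-in for an array: only the attributes that the key depends on.
Operand = namedtuple("Operand", ["shape", "dtype"])


class PlanCache:
    """Reuse planned objects for problems that map to the same key."""

    def __init__(self, key_fn, plan_fn):
        self._key_fn = key_fn      # maps (operand, axes) to a hashable key
        self._plan_fn = plan_fn    # builds and plans a new object on a miss
        self._plans = {}
        self.misses = 0

    def get(self, operand, axes):
        key = self._key_fn(operand, axes)
        if key not in self._plans:
            self.misses += 1
            self._plans[key] = self._plan_fn(operand, axes)
        return self._plans[key]


def simple_key(operand, axes):
    """Hypothetical key; with nvmath, FFT.create_key() would play this role."""
    return (operand.shape, operand.dtype, tuple(axes))


def fake_plan(operand, axes):
    """Hypothetical planner; with nvmath this would create and plan an FFT."""
    return {"shape": operand.shape, "axes": tuple(axes)}


cache = PlanCache(simple_key, fake_plan)
first = cache.get(Operand((4, 8, 8), "complex64"), [1, 2])
second = cache.get(Operand((4, 8, 8), "complex64"), [1, 2])   # same key: reused
third = cache.get(Operand((4, 16, 16), "complex64"), [1, 2])  # new key: planned
```

Only two plans are built for the three requests, which is the saving that a key-based cache is meant to provide.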

FFTs performed on the GPU can be fused with other operations using FFT callbacks. This enables users to write custom functions in Python for pre- or post-processing, while leveraging Just-In-Time (JIT) compilation and Link-Time Optimization (LTO).

Users can also choose CPU execution to utilize all available computational resources.

Note

The API fft() and related function-form APIs perform N-D FFT operations, similar to numpy.fft.fftn(). There are no special 1-D (numpy.fft.fft()) or 2-D (numpy.fft.fft2()) FFT APIs. This not only reduces the API surface but also avoids the potential for incorrect use, because the number of batch dimensions is \(N - 1\) for numpy.fft.fft() and \(N - 2\) for numpy.fft.fft2(), where \(N\) is the number of dimensions of the operand.
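The batch-dimension counts in this note can be checked with NumPy alone. The sketch below uses a 3-D operand (so \(N = 3\)) and shows that numpy.fft.fft() and numpy.fft.fft2() are the N-D transform restricted to the last one or two axes. The array shape and variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4, 5)) + 1j * rng.standard_normal((3, 4, 5))  # N = 3

# numpy.fft.fft transforms one axis: the other N - 1 = 2 axes are batch dimensions.
one_d = np.fft.fft(a)
assert np.allclose(one_d, np.fft.fftn(a, axes=(-1,)))

# numpy.fft.fft2 transforms two axes: the remaining N - 2 = 1 axis is a batch dimension.
two_d = np.fft.fft2(a)
assert np.allclose(two_d, np.fft.fftn(a, axes=(-2, -1)))

# With no axes given, fftn transforms all N axes and there is no batch dimension.
full = np.fft.fftn(a)
assert not np.allclose(full, one_d)
```

With the N-D style API, the intent is stated through the axes argument instead of the function name, so a batched 1-D transform would be requested with something like fft(a, axes=[-1]).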

FFT Callbacks#

User-defined functions can be compiled to the LTO-IR format and provided as an epilog or prolog to the FFT operation, allowing for Link-Time Optimization and fusing. This can be used, for example, to implement DFT-based convolutions or to scale the FFT output.

The FFT module comes with convenient helper functions nvmath.fft.compile_prolog() and nvmath.fft.compile_epilog() that compile functions written in Python to LTO-IR format. Under the hood, the helpers rely on Numba as the compiler. The compiled callbacks can be passed to functional or stateful FFT APIs as DeviceCallable. Alternatively, users can compile the callbacks to LTO-IR format with a compiler of their choice and pass them as DeviceCallable to the FFT call.

Examples illustrating use of prolog and epilog functions can be found in the FFT examples directory.
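A minimal sketch of an output-scaling epilog follows. The five-argument callback signature, the third argument to compile_epilog() (the dtype of the user data), and the dictionary form of the epilog argument are assumptions based on the library's published examples; the API reference for compile_epilog() and DeviceCallable is authoritative. The emulate_epilog() helper is hypothetical and only mimics, on the host, how the callback is invoked once per output element.

```python
N = 8
SCALE = 1.0 / N  # closed-over constant, assumed to be frozen when Numba compiles


def rescale(data_out, offset, data, user_info, reserved):
    """Epilog: store each output element scaled by 1/N.

    The argument list is an assumption; see the compile_epilog() reference.
    """
    data_out[offset] = data * SCALE


def emulate_epilog(epilog, fft_output):
    """Host-side emulation: call the epilog once per output element."""
    out = [0j] * len(fft_output)
    for offset, value in enumerate(fft_output):
        epilog(out, offset, value, None, None)
    return out


def gpu_fft_with_epilog(a):
    """Sketch of the GPU path; needs a GPU operand, Numba, and cuFFT with LTO."""
    import nvmath  # imported lazily so the sketch loads without nvmath

    ltoir = nvmath.fft.compile_epilog(rescale, "complex128", "complex128")
    return nvmath.fft.fft(a, axes=[-1], epilog={"ltoir": ltoir})
```

Emulating the callback on the host, as in emulate_epilog(rescale, [8 + 0j, 16j]), is a cheap way to test its arithmetic before compiling it for the device.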

Note

FFT Callbacks are not currently supported on Windows.

Setting up#

The fastest way to start using cuFFT LTO with nvmath is to install it with device API dependencies. Pip users should run the following command:

pip install nvmath-python[cu12,dx]

Required dependencies#

For those who need to collect the required dependencies manually:

  • LTO callbacks are supported by cuFFT 11.3, which is shipped with CUDA Toolkit 12.6 Update 2 and newer.

  • Using cuFFT LTO callbacks requires nvJitLink from the same CUDA toolkit or newer (within the same major CUDA release, e.g. 12).

  • Compiling the callbacks with the nvmath.fft.compile_prolog() and nvmath.fft.compile_epilog() helpers requires Numba 0.59+ and nvcc/nvvm from the same CUDA toolkit as nvJitLink or older (within the same major CUDA release). The helpers require the target device to have compute capability 7.0 or higher.

For further details, refer to the cuFFT LTO documentation.
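The Numba requirement listed above can be checked from Python before attempting to compile a callback. The following is a small sketch that uses only the standard library; the function names are illustrative, and it checks only the Numba version, not the CUDA Toolkit, nvJitLink, or compute capability requirements.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric (major, minor) pair of a version string."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def numba_supports_callback_helpers(minimum=(0, 59)):
    """True if an installed Numba meets the documented 0.59+ requirement."""
    try:
        installed = metadata.version("numba")
    except metadata.PackageNotFoundError:
        return False  # Numba is not installed at all
    return parse_version(installed) >= minimum
```

A False result suggests installing the dx extra shown earlier or upgrading Numba manually.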

Older CTKs#

Adventurous users who want to try the callback functionality but cannot upgrade the CUDA Toolkit to 12.6U2 can download and install the older preview release, cuFFT LTO EA version 11.1.3.0, which requires at least CUDA Toolkit 12.2. When using LTO EA, setting environment variables may be needed for nvmath to pick up the desired cuFFT version. Users should adjust the LD_PRELOAD variable so that the right cuFFT shared library is used:

export LD_PRELOAD="/path_to_cufft_lto_ea/libcufft.so"

Execution space#

FFTs can be executed on either an NVIDIA GPU or a CPU. By default, the execution space is selected based on the memory space of the operand passed to the FFT call, but it can be controlled explicitly by passing ExecutionCUDA or ExecutionCPU as the execution option to the call (e.g. FFT or fft()).
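The explicit selection described above might look like the sketch below. The fft_on() helper and its space argument are hypothetical; ExecutionCPU, ExecutionCUDA, and the execution option are taken from the API reference on this page, with the num_threads keyword assumed from the ExecutionCPU([num_threads]) summary.

```python
def fft_on(a, space, num_threads=None):
    """Run an N-D FFT of `a` in the requested execution space ("cpu" or "cuda")."""
    if space not in ("cpu", "cuda"):
        raise ValueError(f"unknown execution space: {space!r}")

    import nvmath  # imported lazily so the sketch loads without nvmath

    if space == "cpu":
        # Leaving num_threads as None defers to the library default.
        execution = nvmath.fft.ExecutionCPU(num_threads=num_threads)
    else:
        execution = nvmath.fft.ExecutionCUDA()
    return nvmath.fft.fft(a, execution=execution)
```

Omitting the execution option altogether keeps the default behavior, in which the operand's memory space decides where the transform runs.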

Note

CPU execution is not currently supported on Windows.

Required dependencies#

With ARM CPUs, such as NVIDIA Grace, nvmath-python can use NVPL (NVIDIA Performance Libraries) FFT to run the transform. On the x86_64 architecture, the MKL library can be used.

For pip users, the fastest way to get the required dependencies is to use 'cu12' / 'cu11' and 'cpu' extras:

# for CPU-only dependencies
pip install nvmath-python[cpu]

# for CUDA-only dependencies (assuming CUDA 12)
pip install nvmath-python[cu12]

# for CUDA 12 and CPU dependencies
pip install nvmath-python[cu12,cpu]

Custom CPU library#

Other libraries that conform to the FFTW3 API and ship both single- and double-precision symbols in a single .so file can be used to back the CPU FFT execution. Users who would like to use a different library for CPU FFT, or point to a custom installation of the NVPL or MKL library, can do so by including the library path in LD_LIBRARY_PATH and specifying the library name with NVMATH_FFT_CPU_LIBRARY. For example:

# nvpl
export LD_LIBRARY_PATH=/path/to/nvpl/:$LD_LIBRARY_PATH
export NVMATH_FFT_CPU_LIBRARY=libnvpl_fftw.so.0

# mkl
export LD_LIBRARY_PATH=/path/to/mkl/:$LD_LIBRARY_PATH
export NVMATH_FFT_CPU_LIBRARY=libmkl_rt.so.2
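Before pointing NVMATH_FFT_CPU_LIBRARY at a custom library, it can help to confirm that the library loads and exposes both precisions. The sketch below does this with ctypes. The helper name is made up, and the two symbols checked (fftw_plan_many_dft for double precision, fftwf_plan_many_dft for single precision) are standard FFTW3 planner names chosen as representatives; this is not necessarily the set of symbols that nvmath itself resolves.

```python
import ctypes

# One double-precision and one single-precision FFTW3 planner symbol.
REQUIRED_SYMBOLS = ("fftw_plan_many_dft", "fftwf_plan_many_dft")


def missing_fftw3_symbols(library_name):
    """Return the required symbols that `library_name` lacks.

    Returns None when the library cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(library_name)
    except OSError:
        return None
    # Attribute lookup on a CDLL raises AttributeError for unknown symbols.
    return [name for name in REQUIRED_SYMBOLS if not hasattr(lib, name)]
```

An empty list means both symbols were found, and None means the dynamic loader could not find or open the library. Since the loader reads LD_LIBRARY_PATH at process start, the variable should be exported before launching Python.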

Host API Reference#

FFT support (nvmath.fft)#

fft(operand[, axes, direction, options, ...])

Perform an N-D complex-to-complex (C2C) FFT on the provided complex operand.

ifft(operand[, axes, options, execution, ...])

Perform an N-D complex-to-complex (C2C) inverse FFT on the provided complex operand.

rfft(operand[, axes, options, execution, ...])

Perform an N-D real-to-complex (R2C) FFT on the provided real operand.

irfft(operand[, axes, options, execution, ...])

Perform an N-D complex-to-real (C2R) FFT on the provided complex operand.

FFT(operand, *[, axes, options, execution, ...])

Create a stateful object that encapsulates the specified FFT computations and required resources.

compile_prolog(prolog_fn, element_dtype, ...)

Compile a Python function to LTO-IR to provide as a prolog function for fft() and plan().

compile_epilog(epilog_fn, element_dtype, ...)

Compile a Python function to LTO-IR to provide as an epilog function for fft() and plan().

UnsupportedLayoutError(message, permutation, ...)

Error type for layouts not supported by the library.

FFTOptions([fft_type, inplace, ...])

A data class for providing options to the FFT object and the family of wrapper functions fft(), ifft(), rfft(), and irfft().

FFTDirection(value[, names, module, ...])

An IntEnum class specifying the direction of the transform.

ExecutionCUDA([device_id])

A data class for providing GPU execution options to the FFT object and the family of wrapper functions fft(), ifft(), rfft(), and irfft().

ExecutionCPU([num_threads])

A data class for providing CPU execution options to the FFT object and the family of wrapper functions fft(), ifft(), rfft(), and irfft().

DeviceCallable([ltoir, size, data])

A data class capturing LTO-IR callables.