cuFFTDx Python Bindings
cuFFTDx offers a C++ API that is callable from CUDA C++ kernels, but its functionality is also accessible from Python through either NVIDIA Warp or nvmath-python.
Note
The cuFFTDx Python bindings match C++ code in performance and offer extensive, easy-to-use autotuning opportunities that make it possible to bypass CUDA C++ compilation limitations.
NVIDIA Warp
NVIDIA Warp is a Python library that allows developers to write high-performance simulation and graphics code that runs efficiently on both CPUs and NVIDIA GPUs. It uses just-in-time (JIT) compilation to turn Python functions into fast, parallel kernels, making it ideal for tasks like physics simulation, robotics, and geometry processing. Warp also supports differentiable programming, allowing integration with machine learning frameworks for gradient-based optimization, all while keeping the simplicity of Python.
Warp uses cuFFTDx for efficient signal processing operations in Tile mode.
This is what a simple convolution kernel using Warp looks like:
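import warp as wp

# Example tile dimensions (assumed values): TILE_M signals of TILE_N samples each.
TILE_M = 1
TILE_N = 128

# Scaling by 1/TILE_N assumes the FFT/IFFT pair applies no normalization itself.
scale = wp.vec2d(wp.float64(1 / TILE_N), wp.float64(1 / TILE_N))

@wp.func
def filter(x: wp.vec2d):
    # Component-wise multiply of the (real, imaginary) pair by the scale.
    return wp.cw_mul(x, scale)

@wp.kernel
def conv_tiled(x: wp.array2d(dtype=wp.vec2d), y: wp.array2d(dtype=wp.vec2d)):
    i, j, _ = wp.tid()
    a = wp.tile_load(x, shape=(TILE_M, TILE_N))  # load one tile of the input
    wp.tile_fft(a)                                # forward FFT, in place
    b = wp.tile_map(filter, a)                    # apply the filter to every element
    wp.tile_ifft(b)                               # inverse FFT, in place
    wp.tile_store(y, b)                           # write the tile to the output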
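To make the data flow of the kernel above easier to follow, here is a plain-Python reference of the same three steps: forward transform, element-wise filter, inverse transform. It is only a sketch and does not use Warp or cuFFTDx. The helper names `dft` and `convolve_tile` are hypothetical, the transform is a naive O(n²) DFT, and it assumes that neither transform direction applies any normalization (as is the convention in cuFFT), so a real scale factor of 1/n in the frequency domain makes the round trip return the input.

```python
import cmath

def dft(values, sign):
    """Naive unnormalized DFT. sign=-1 is the forward transform, sign=+1 the inverse."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def convolve_tile(tile, scale):
    """Forward transform, scale every frequency bin, then inverse transform."""
    spectrum = dft(tile, sign=-1)
    filtered = [scale * s for s in spectrum]  # stand-in for the element-wise filter
    return dft(filtered, sign=+1)

# One hypothetical "tile" of four complex samples.
tile = [complex(1, 0), complex(2, -1), complex(0, 3), complex(-4, 0.5)]

# With scale = 1/n the unnormalized round trip reproduces the input.
result = convolve_tile(tile, scale=1.0 / len(tile))
max_error = max(abs(r - t) for r, t in zip(result, tile))
```

Replacing the constant `scale` with a per-bin factor would turn this identity round trip into an actual frequency-domain filter.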
The Warp GitHub repository can be accessed here and offers multiple examples.
Warp provides autotuning out of the box through its Tile programming model: the user describes the problem at a high level, and Warp autotunes it and maps it onto the hardware.
nvmath-python
nvmath-python is a Python library that provides high-performance, Pythonic access to NVIDIA’s CUDA-X math libraries, enabling accelerated mathematical operations like linear algebra and fast Fourier transforms on both CPUs and NVIDIA GPUs. It integrates seamlessly with popular Python libraries such as CuPy, PyTorch, and NumPy, allowing users to leverage NVIDIA hardware acceleration within familiar workflows without needing C or C++ bindings. With both stateless and stateful APIs, nvmath-python delivers near-native performance for computational tasks in deep learning, data processing, and scientific computing, while supporting advanced features like device kernel fusion and customizable callbacks.
This is what a simple convolution using nvmath-python looks like:
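from numba import cuda

# FFT_base, value_type, storage_size, ffts_per_block, elements_per_thread,
# stride, and size are assumed to be defined by the surrounding setup code.
FFT = FFT_base(direction="forward")
IFFT = FFT_base(direction="inverse")

@cuda.jit(link=FFT.files + IFFT.files)
def f(data):
    # Thread-local storage for this thread's share of one FFT.
    thread_data = cuda.local.array(shape=(storage_size,), dtype=value_type)
    # Each block computes ffts_per_block FFTs; threadIdx.y selects one of them.
    local_fft_id = cuda.threadIdx.y
    fft_id = cuda.blockIdx.x * ffts_per_block + local_fft_id
    # Load: each thread reads elements_per_thread values, stride apart.
    index = cuda.threadIdx.x
    for i in range(elements_per_thread):
        thread_data[i] = data[fft_id, index]
        index += stride
    # Dynamically sized shared memory passed to the FFT calls.
    shared_mem = cuda.shared.array(shape=(0,), dtype=value_type)
    FFT(thread_data, shared_mem)
    # Scale by 1/size in the frequency domain before transforming back.
    for i in range(elements_per_thread):
        thread_data[i] = thread_data[i] / size
    IFFT(thread_data, shared_mem)
    # Store: write the results back using the same indexing as the load.
    index = cuda.threadIdx.x
    for i in range(elements_per_thread):
        data[fft_id, index] = thread_data[i]
        index += stride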
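The load and store loops in the kernel above distribute one FFT across several threads in a strided pattern. The following plain-Python sketch reproduces only that index arithmetic so the mapping can be inspected on the host. The function name `elements_for_thread` and the concrete values are hypothetical, and the sketch assumes that `stride` equals `size // elements_per_thread` and that this is also the number of threads cooperating on one FFT.

```python
def elements_for_thread(thread_x, elements_per_thread, stride):
    """Return the element indices one thread visits, mirroring the kernel loops."""
    indices = []
    index = thread_x
    for _ in range(elements_per_thread):
        indices.append(index)
        index += stride
    return indices

# Hypothetical configuration, chosen only for illustration.
size = 16
elements_per_thread = 4
stride = size // elements_per_thread  # assumption: also the threads per FFT

# Map every thread of one FFT to the elements it owns.
mapping = {
    t: elements_for_thread(t, elements_per_thread, stride) for t in range(stride)
}

# Flatten and sort to check that every element is covered exactly once.
covered = sorted(i for indices in mapping.values() for i in indices)
```

Under these assumptions thread 0 handles elements 0, 4, 8, and 12, thread 1 handles 1, 5, 9, and 13, and so on, so neighbouring threads touch neighbouring elements on each loop iteration.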
The nvmath-python GitHub repository can be accessed here and offers multiple examples as well, mirroring the examples from the cuFFTDx C++ repository.
nvmath-python also offers autotuning by iterating over all possible configurations, measuring their performance and selecting the best one.
This can be done in a simple Python for loop and does not require any additional code from the user. An example of this functionality can be seen here.
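The general shape of such a loop can be sketched as follows. This is not the nvmath-python API: `run_workload` is a hypothetical CPU stand-in for building and launching a kernel, and `chunk` stands in for a tunable parameter such as `elements_per_thread` or `ffts_per_block`. The sketch only shows the pattern of timing every candidate and keeping the fastest.

```python
import time

def run_workload(data, chunk):
    """Hypothetical stand-in workload: sum the data in chunks of the given size."""
    total = 0
    for start in range(0, len(data), chunk):
        total += sum(data[start:start + chunk])
    return total

def autotune(candidates, data, repeats=3):
    """Time every candidate configuration and return (best_candidate, timings)."""
    timings = {}
    for chunk in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_workload(data, chunk)
            best = min(best, time.perf_counter() - start)
        timings[chunk] = best  # keep the fastest of the repeated runs
    return min(timings, key=timings.get), timings

data = list(range(20_000))
candidates = [1, 8, 64, 512]
best_chunk, timings = autotune(candidates, data)
```

When timing real GPU kernels, the measured region would also need to include a device synchronization, since kernel launches are asynchronous.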