cuSolverDx Python Bindings#

Overview#

cuSolverDx offers C++ APIs that are callable from CUDA C++ kernels, but its functionality can also be accessed easily from Python using NVIDIA Warp. We expect to add cuSolverDx support to nvmath-python in the future.

Note

The cuSolverDx Python bindings match equivalent C++ code in performance and, because kernels are generated and compiled at runtime rather than ahead of time as in CUDA C++, offer extensive and easy autotuning opportunities.

NVIDIA Warp Integration#

NVIDIA Warp is a Python library that allows developers to write high-performance simulation and graphics code that runs efficiently on both CPUs and NVIDIA GPUs. It uses just-in-time (JIT) compilation to turn Python functions into fast, parallel kernels, making it ideal for tasks like physics simulation, robotics, and geometry processing. Warp also supports differentiable programming, allowing integration with machine learning frameworks for gradient-based optimization, all while keeping the simplicity of Python.

Warp uses cuSolverDx for its linear solver operations in Tile mode.

Example Implementation#

This is what a simple Cholesky factorization and solve kernel using Warp looks like:

import warp as wp

TILE = 32                # tile size (assumed value; defined by the application)
wp_type = wp.float64     # scalar type (assumed value)

@wp.kernel
def cholesky(
    A: wp.array2d(dtype=wp_type),
    L: wp.array2d(dtype=wp_type),
    X: wp.array1d(dtype=wp_type),
    Y: wp.array1d(dtype=wp_type),
):
    i, j, _ = wp.tid()

    # Factor A = L L^T and store the factor.
    a = wp.tile_load(A, shape=(TILE, TILE))
    l = wp.tile_cholesky(a)
    wp.tile_store(L, l)

    # Solve L L^T y = x using the factor computed above.
    x = wp.tile_load(X, shape=TILE)
    y = wp.tile_cholesky_solve(l, x)
    wp.tile_store(Y, y)

Additional Resources#

The Warp GitHub repository offers multiple examples, including a simple example using Warp's Cholesky Tile APIs, backed by cuSolverDx's Cholesky factorization and triangular solver, and an example of computing a large Cholesky factorization using a blocked algorithm.
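The blocked algorithm mentioned above factors the matrix panel by panel: each diagonal block gets a dense Cholesky factorization, and the panel beneath it is updated with a triangular solve. A NumPy sketch of the idea (illustrative only, not the repository's Warp implementation):

```python
import numpy as np

def blocked_cholesky(A, b):
    """Lower-triangular Cholesky factor of SPD matrix A using b-by-b blocks."""
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        # Update the diagonal block with previously computed panels, then factor it.
        Akk = A[k:e, k:e] - L[k:e, :k] @ L[k:e, :k].T
        L[k:e, k:e] = np.linalg.cholesky(Akk)
        # Update each block row below the diagonal and solve a triangular system.
        for i in range(e, n, b):
            f = min(i + b, n)
            Aik = A[i:f, k:e] - L[i:f, :k] @ L[k:e, :k].T
            L[i:f, k:e] = np.linalg.solve(L[k:e, k:e], Aik.T).T
    return L
```

Each diagonal-block factorization and triangular solve operates on a single tile, which is exactly the granularity at which Warp's Tile APIs (and cuSolverDx underneath) work.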

Warp provides autotuning out of the box through its Tile programming model: the user describes the problem at a high level, and Warp autotunes it and maps it onto the hardware.