cuBLASDx Python Bindings#
cuBLASDx offers a C++ API that’s callable from CUDA C++ kernels, but its functionality can also be easily accessed from Python using either NVIDIA Warp or nvmath-python.
Note
cuBLASDx Python bindings match C++ code in performance and offer extensive, easy autotuning opportunities, allowing users to bypass CUDA C++ compilation limitations.
NVIDIA Warp#
NVIDIA Warp is a Python library that allows developers to write high-performance simulation and graphics code that runs efficiently on both CPUs and NVIDIA GPUs. It uses just-in-time (JIT) compilation to turn Python functions into fast, parallel kernels, making it ideal for tasks like physics simulation, robotics, and geometry processing. Warp also supports differentiable programming, allowing integration with machine learning frameworks for gradient-based optimization, all while keeping the simplicity of Python.
Warp uses cuBLASDx for its matrix multiplication operations in Tile mode.
This is what a simple matmul kernel using Warp looks like:
import warp as wp

# Tile sizes are module-level constants captured at kernel compile time
# (the values here are illustrative).
TILE_M = 64
TILE_N = 64
TILE_K = 8

@wp.kernel
def tile_gemm(A: wp.array2d(dtype=wp.float32), B: wp.array2d(dtype=wp.float16), C: wp.array2d(dtype=wp.float64)):
    # output tile index
    i, j = wp.tid()

    sum = wp.tile_zeros(shape=(TILE_M, TILE_N), dtype=wp.float64)

    _M = A.shape[0]
    _N = B.shape[1]
    K = A.shape[1]

    count = int(K / TILE_K)

    for k in range(0, count):
        a = wp.tile_load(A, shape=(TILE_M, TILE_K), offset=(i * TILE_M, k * TILE_K))
        b = wp.tile_load(B, shape=(TILE_K, TILE_N), offset=(k * TILE_K, j * TILE_N))

        # sum += a*b
        wp.tile_matmul(a, b, sum)

    wp.tile_store(C, sum, offset=(i * TILE_M, j * TILE_N))
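The kernel is launched with Warp's tiled launch, which assigns one thread block per output tile. A minimal launch sketch, assuming A, B, and C are Warp arrays allocated by the caller and an illustrative block size of 128 threads:

M, N = A.shape[0], B.shape[1]

# One block of block_dim threads cooperates on each (TILE_M, TILE_N) tile.
wp.launch_tiled(
    tile_gemm,
    dim=(M // TILE_M, N // TILE_N),
    inputs=[A, B, C],
    block_dim=128,  # illustrative; Warp maps tiles onto the hardware
)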
The Warp GitHub repository can be accessed here and offers multiple examples, including one that implements a multilayer perceptron using cuBLASDx for the necessary matrix multiplications, and another that simulates an N-body gravitational problem using cuBLASDx for the matrix multiplications.
Warp provides autotuning out of the box through its Tile programming model: the user describes the problem at a high level, and Warp autotunes it and maps it onto the hardware.
nvmath-python#
nvmath-python is a Python library that provides high-performance, Pythonic access to NVIDIA's CUDA-X math libraries, enabling accelerated mathematical operations like linear algebra and fast Fourier transforms on both CPUs and NVIDIA GPUs. It integrates seamlessly with popular Python libraries such as CuPy, PyTorch, and NumPy, allowing users to leverage NVIDIA hardware acceleration within familiar workflows without needing C or C++ bindings. With both stateless and stateful APIs, nvmath-python delivers near-native performance for computational tasks in deep learning, data processing, and scientific computing, while supporting advanced features like device kernel fusion and customizable callbacks.
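For example, the stateless host API multiplies two GPU arrays in a single call. A minimal sketch, assuming CuPy arrays and the nvmath.linalg.advanced module:

import cupy as cp
import nvmath

# Two single-precision operands resident on the GPU.
a = cp.random.rand(1024, 512, dtype=cp.float32)
b = cp.random.rand(512, 256, dtype=cp.float32)

# Stateless API: one call plans, executes, and returns a CuPy array.
c = nvmath.linalg.advanced.matmul(a, b)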
This is what a simple matmul kernel using nvmath-python looks like:
MM = matmul(
    size=(m, n, k),
    precision=np.float16,
    data_type="complex",
    transpose_mode=("non_transposed", "transposed"),
    execution="Block",
    compiler="numba",
)

# value_type, a_size, b_size, the a/b/c dimensions, and the leading
# dimensions are derived from MM (e.g. MM.value_type, MM.a_size);
# load_to_shared and store_from_shared are helper functions from the
# nvmath-python examples.
@cuda.jit(link=MM.files)
def f(a, b, c, alpha, beta, output):
    # Partition dynamic shared memory into the A, B, and C tiles.
    smem = cuda.shared.array(shape=(0,), dtype=value_type)
    smem_a = smem[0:]
    smem_b = smem[a_size:]
    smem_c = smem[a_size + b_size:]

    load_to_shared(a, smem_a, a_dim, lda)
    load_to_shared(b, smem_b, b_dim, ldb)
    load_to_shared(c, smem_c, c_dim, ldc)

    cuda.syncthreads()

    # C = alpha * op(A) * op(B) + beta * C, computed by the whole block.
    MM(alpha, smem_a, smem_b, beta, smem_c)

    cuda.syncthreads()

    store_from_shared(smem_c, output, c_dim, ldc)
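The kernel is then launched with the block dimensions and dynamic shared memory size reported by the compiled descriptor. A minimal sketch, assuming device arrays a_d, b_d, c_d, and o_d already exist:

# One block computes the whole (m, n) problem; MM reports the block
# shape and the dynamic shared memory it requires.
f[1, MM.block_dim, 0, MM.shared_memory_size](a_d, b_d, c_d, alpha, beta, o_d)
cuda.synchronize()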
The nvmath-python GitHub repository can be accessed here and offers multiple examples as well, mirroring the examples from the cuBLASDx C++ repository.
nvmath-python also offers autotuning by iterating over all possible configurations, measuring their performance, and selecting the best one.
This can be done in a simple Python for loop and does not require any additional tooling from the user.
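A minimal sketch of such a sweep, assuming the matmul descriptor accepts a block_size option and a hypothetical user-supplied time_kernel helper that compiles a kernel around MM, runs it, and returns the elapsed time:

import numpy as np
from nvmath.device import matmul

best_time, best_block_size = float("inf"), None

# Sweep candidate block sizes and keep the fastest configuration.
for block_size in (64, 128, 256):
    MM = matmul(
        size=(m, n, k),
        precision=np.float16,
        data_type="complex",
        transpose_mode=("non_transposed", "transposed"),
        execution="Block",
        block_size=block_size,
        compiler="numba",
    )
    elapsed = time_kernel(MM)  # hypothetical timing helper
    if elapsed < best_time:
        best_time, best_block_size = elapsed, block_size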