Overview

The primary goal of nvmath-python is to bring the power of the NVIDIA math libraries to the Python ecosystem. The package aims to provide intuitive, pythonic APIs that give users full access to all the features offered by our libraries in a variety of execution spaces.

We hope to empower a wide range of Python users by providing easy access to high-performance core math operations such as FFT, dense and sparse linear algebra, and more. This includes the following groups of users:

  1. Practitioners: Researchers and application programmers who require robust, high-performance mathematical tools.

  2. Library Package Developers: Developers crafting libraries that rely on advanced mathematical operations.

  3. CUDA Kernel Authors: Programmers who write CUDA kernels and need customized mathematical functionality.

The APIs provided by nvmath-python can be categorized into:

  • Host APIs: Invoked from the host and executed in the chosen space (currently limited to single GPUs).

  • Device APIs: Called directly from within CUDA kernels.

nvmath-python is dedicated to delivering the following key features and commitments:

  1. Logical Feature Parity: While the pythonic API surface (the number of APIs and the complexity of each) is more concise compared to that of the C libraries, it provides access to their complete functionality.

  2. Consistent Design Patterns: Uniform design across all modules to simplify user experience.

  3. Transparency and Explicitness: Avoiding implicit, costly operations such as copying data across the same memory space, automatic type promotion, and alterations to the user environment or state (current device, current stream, etc.). This allows users to perform the required conversion once for use in all subsequent operations instead of incurring hidden costs on each call.

  4. Clear, Actionable Error Messages: Ensuring that errors are informative and helpful in resolving the problem.

  5. DRY Principle Compliance: Automatically utilizing available information such as the current stream and memory pool to avoid redundant specification (“don’t repeat yourself”).

With nvmath-python, a few lines of code are sufficient to unlock the extensive performance capabilities of the NVIDIA math libraries. Explore our Python samples and more detailed examples in the examples directory on GitHub.

Architecture

nvmath-python is designed to support integration at any level desired by the user. This flexibility allows:

  • Alice, a Python package developer, to utilize core math operations to compose into higher-level algorithms or adapt these operations into her preferred interfaces.

  • Bob, an application developer, to use core operations directly from nvmath-python or indirectly through other libraries that leverage nvmath-python.

  • Carol, a researcher, to write kernels entirely in Python that call core math operations such as FFT.

(Figure: nvmath-python architecture diagram.)

Additionally, we offer Python bindings that provide a 1:1 mapping with the C APIs. These bindings, which serve as wrappers with API signatures similar to their C counterparts, are ideal for library developers looking to integrate the capabilities of the NVIDIA Math Libraries in a customized manner, in the event that the pythonic APIs don’t meet their specific requirements. Conversely, our high-level pythonic APIs deliver a fully integrated solution suitable for native Python users as well as library developers, encompassing both host and device APIs. In the future, select host APIs will accept callback functions written in Python and compiled into supported formats such as LTO-IR, using compilers like Numba.

Host APIs

nvmath-python provides a collection of APIs that can be directly invoked from the CPU (host). At present, these APIs encompass a selection of functionalities within the following categories:

Effortless Interoperability

All host APIs support input arrays/tensors from NumPy, CuPy, and PyTorch while returning output operands using the same package, thus offering effortless interoperability with these frameworks. An example of this interoperability is shown below:

import numpy as np
import nvmath

# Create a numpy.ndarray as input
a = np.random.random(128) + 1.j * np.random.random(128)

# Call nvmath-python pythonic APIs
b = nvmath.fft.fft(a)

# Verify that output is also a numpy.ndarray
assert isinstance(b, np.ndarray)

Stateless and Stateful APIs

The host APIs within nvmath-python can be generally categorized into two types: stateless function-form APIs and stateful class-form APIs.

The function-form APIs, such as nvmath.fft.fft() and nvmath.linalg.advanced.matmul(), are designed to deliver quick, end-to-end results with a single function call. These APIs are ideal for instances where a user needs to perform a single computation without the need for intermediate steps, customization of algorithm selection, or cost amortization of preparatory steps. Conversely, the stateful class-form APIs, like nvmath.fft.FFT and nvmath.linalg.advanced.Matmul, offer a more comprehensive and flexible approach. They not only encompass the functionality found in their function-form counterparts but also allow for amortization of one-time costs, potentially enhancing performance significantly.

The design pattern for all stateful APIs in nvmath-python consists of several key phases:

  • Problem Specification: This initial phase involves defining the operation and setting options that affect its execution. It’s designed to be as lightweight as possible, ensuring the problem is well-defined and supported by the current implementation.

  • Preparation: Using FFT as an example, this phase includes a planning step to select the optimal algorithm for the defined FFT operation. An optional autotuning operation, when available, also falls within the preparation phase. The preparation phase is generally the most resource-intensive and may incorporate user-specified planning and autotuning options.

  • Execution: This phase allows for repeated execution, where the operand can be either modified in-place or explicitly reset using the reset_operand/reset_operands method. The costs associated with the first two phases are therefore amortized over these multiple executions.

  • Resource Release: Users are advised to use stateful objects from within a context using the with statement, which automatically handles the release of internal resources upon exit. If the object is not used as a context manager using with, it is necessary to explicitly call the free method to ensure all resources are properly released.
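This lifecycle can be illustrated with a small stand-in class (a toy, stdlib-only sketch of the pattern; ToyFFT and its placeholder computation are illustrative and not the real nvmath.fft.FFT API):

```python
class ToyFFT:
    """Toy stand-in illustrating the stateful-API lifecycle (not nvmath.fft.FFT)."""

    def __init__(self, operand):
        # Problem specification: lightweight validation only.
        self.operand = operand
        self.planned = False
        self.freed = False

    def plan(self):
        # Preparation: the expensive, one-time step (algorithm selection, autotuning).
        self.planned = True

    def execute(self):
        # Execution: repeatable, amortizing the cost of plan().
        assert self.planned and not self.freed
        return [x * 2 for x in self.operand]  # placeholder computation

    def reset_operand(self, operand):
        # Swap in a new operand without re-planning.
        self.operand = operand

    def free(self):
        # Resource release.
        self.freed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.free()


with ToyFFT([1, 2, 3]) as f:
    f.plan()
    r1 = f.execute()
    f.reset_operand([4, 5, 6])
    r2 = f.execute()
# Internal resources are released automatically on exiting the `with` block.
```

With the real stateful APIs, the preparation cost is amortized over the repeated execute() calls in exactly this way.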

Note

By design, nvmath-python does NOT cache plans with the stateless function-form APIs. This enables library developers and others to layer their own caching mechanisms on top of nvmath-python. Therefore, to avoid incurring repeated preparatory costs, users should use the stateful object APIs for repeated use (as well as for benchmarking), or use a cached API (see caching.py for an example implementation).

Note

The decision to require explicit free calls for resource release is driven by the fact that Python’s garbage collector can delay freeing object resources when the object goes out of scope or its reference count drops to zero. For details, refer to the Python documentation for the __del__ method.

Generic and Specialized APIs

Another way of categorizing the host APIs within nvmath-python is by splitting them into generic and specialized APIs, based on their flexibility and the scope of their functionality:

  • Generic APIs are designed to accommodate a broad range of operands; customization with these APIs is confined to options that are universally applicable across all supported operand types. For instance, the generic matrix multiplication API can handle structured matrices (such as triangular and banded, in full or packed form) in addition to dense full matrices, but the available options are limited to those applicable to all these matrix types.

  • Specialized APIs, on the other hand, are tailored for specific types of operands, allowing for the full range of customization available to that operand type. A prime example is the specialized matrix multiplication API for dense matrices, which provides numerous options specifically suited to dense matrices.

It should be noted that the notion of generic and specialized APIs is orthogonal to the notion of stateful versus stateless APIs. Currently, nvmath-python offers the specialized interface for dense matrix multiplication, in stateful and stateless forms.

Full Logging Support

nvmath-python integrates with the Python standard library logging module to offer full logging of computational details at various levels (e.g., debug, info, warning, and error). An example illustrating the use of the global Python logger is shown below:

import logging
import nvmath

# Turn on logging with level set to "debug" and use a custom format for the log
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(levelname)-8s %(message)s', datefmt='%m-%d %H:%M:%S')

# Call nvmath-python pythonic APIs
out = nvmath.linalg.advanced.matmul(...)

Alternatively, for APIs that contain the options argument, users can set a custom logger by directly passing it inside a dictionary or as part of the corresponding Options object, e.g., nvmath.fft.FFTOptions.logger for nvmath.fft.fft() and nvmath.fft.FFT. An example based on FFT is shown below:

import logging
import nvmath

# Create a custom logger
logger = logging.getLogger('userlogger')
...

# Call nvmath-python pythonic APIs
out = nvmath.fft.fft(..., options={'logger': logger})

For the complete examples, please refer to global logging example and custom user logging example.

Note

The Python logging is orthogonal to the logging provided by certain NVIDIA math libraries, which encapsulates low-level implementation details and can be activated via either specific environment variables (e.g., CUBLASLT_LOG_LEVEL for cuBLASLt) or programmatically through the Python bindings (e.g., nvmath.bindings.cusolverDn.logger_set_level() for cuSOLVER).

Call Blocking Behavior

By default, calls to all pythonic host APIs that require GPU execution are not blocking if the input operands reside on the device. This means that functions like nvmath.linalg.advanced.matmul(), nvmath.fft.FFT.execute(), and nvmath.linalg.advanced.Matmul.execute() will return immediately after the operation is launched on the GPU without waiting for it to complete. Users are therefore responsible for properly synchronizing the stream when needed. The default behavior can be modified by setting the blocking attribute (default 'auto') of the relevant Options object to True. For example, users may set nvmath.fft.FFTOptions.blocking to True and pass this options object to the corresponding FFT API calls. If the input operands are on the host, the pythonic API calls will always block since the computation yields an output operand that will also reside on the host. Meanwhile, APIs that execute on the host (such as nvmath.fft.FFT.create_key()) always block.
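For example, blocking execution can be requested through the options argument (sketch below; the nvmath call itself is shown commented because it requires a GPU and device operands):

```python
# Request blocking execution: with blocking=True, the call returns only after
# the GPU work completes ('auto' is the documented default).
options = {"blocking": True}

# With a device operand `a_gpu` (e.g., a CuPy array), the call below would then
# wait for the FFT to finish before returning:
# b = nvmath.fft.fft(a_gpu, options=options)
```

Equivalently, the same setting can be made on the corresponding Options object (e.g., nvmath.fft.FFTOptions.blocking) and passed in place of the dictionary.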

Stream Semantics

The stream semantics depend on whether the behavior of the execution APIs is chosen to be blocking or non-blocking (see Call Blocking Behavior).

For blocking behavior, stream ordering is automatically handled by the nvmath-python high-level APIs for operations that are performed within the package. A stream can be provided for two reasons:

  1. When the computation that prepares the input operands is not already complete by the time the execution APIs are called. This is a correctness requirement for user-provided data.

  2. To enable parallel computations across multiple streams, if the device has sufficient resources and the current stream (which is the default) has other operations in flight. This can be done for performance reasons.

For non-blocking behavior, it is the user’s responsibility to ensure correct stream ordering between the execution API calls.

In any case, the execution APIs are launched on the provided stream.

For examples on stream ordering, please refer to FFT with multiple streams.

Memory Management

By default, the host APIs use the memory pool from the package that their operands belong to. This ensures that there is no contention for memory or spurious out-of-memory errors. However, the user can also provide their own memory allocator. In our pythonic APIs, we support an EMM-like interface, as proposed and supported by Numba, for users to plug in their own Python memory pool. Taking FFT as an example, users can set the option nvmath.fft.FFTOptions.allocator to a Python object complying with the nvmath.BaseCUDAMemoryManager protocol, and pass the options to the high-level APIs like nvmath.fft.fft() or nvmath.fft.FFT. Temporary memory allocations will then be performed through this interface. Internally, we use the same interface to tap CuPy's or PyTorch's memory pool, depending on the operands.

Note

nvmath’s BaseCUDAMemoryManager protocol is slightly different from Numba’s EMM interface (numba.cuda.BaseCUDAMemoryManager), but duck typing with an existing EMM instance (not type!) at runtime should be possible.
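A minimal sketch of the shape such a plugin can take is below (illustrative only: it hands out fake integer "pointers" instead of device memory, and the actual required interface is defined by the nvmath.BaseCUDAMemoryManager protocol, whose allocation method returns a MemoryPointer(device_ptr, size, finalizer)):

```python
class ToyMemoryManager:
    """Toy allocator mirroring the shape of an EMM-like plugin (no real device memory)."""

    def __init__(self):
        self.next_ptr = 0x1000  # fake device address counter
        self.live = set()       # addresses currently handed out

    def memalloc(self, size):
        # The real protocol returns a MemoryPointer(device_ptr, size, finalizer);
        # here we return a simple record carrying the same three pieces.
        ptr = self.next_ptr
        self.next_ptr += size
        self.live.add(ptr)
        finalizer = lambda: self.live.discard(ptr)  # runs when the buffer is released
        return {"device_ptr": ptr, "size": size, "finalizer": finalizer}


mm = ToyMemoryManager()
buf = mm.memalloc(256)
buf["finalizer"]()  # releasing the buffer runs the finalizer
```

A real implementation would obtain device memory (e.g., from a CuPy or RMM pool) and register a finalizer that returns it to that pool.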

Common Objects (nvmath)

BaseCUDAMemoryManager(*args, **kwargs)

Protocol for memory manager plugins.

MemoryPointer(device_ptr, size, finalizer)

An RAII class for a device memory buffer.

Device APIs

The device APIs enable the user to call core mathematical operations in their Python CUDA kernels, resulting in a fully fused kernel. Fusion is essential for performance in latency-dominated cases to reduce the number of kernel launches, and in memory-bound operations to avoid the extra roundtrip to global memory.

We currently offer support for calling FFT and matrix multiplication APIs in kernels written using Numba, with plans to offer more core operations and support other compilers in the future. The design of the device APIs closely mimics the corresponding C++ APIs from NVIDIA MathDx libraries including cuFFTDx and cuBLASDx.

Compatibility Policy

nvmath-python is no different from any Python package, in that we would not succeed without depending on, collaborating with, and evolving alongside the Python community. Given these considerations, we strive to meet the following commitments:

  1. For the low-level Python bindings,

    • if the library to be bound is part of CUDA Toolkit, we support the library from the most recent two CUDA major versions (currently CUDA 11/12)

    • otherwise, we support the library within its major version

    Note that all bindings are currently experimental.

  2. For the high-level pythonic APIs, we maintain backward compatibility to the greatest extent feasible. When a breaking change is necessary, we issue a runtime warning to alert users of the upcoming changes in the next major release. This practice ensures that breaking changes are clearly communicated and reserved for major version updates, allowing users to prepare and adapt without surprises.

  3. We comply with NEP-29 and support a community-defined set of core dependencies (CPython, NumPy, etc).

Note

The policy on backwards compatibility will apply starting with release 1.0.0.