cuBLASDx APIs (nvmath.device)
Overview
These APIs offer integration with the NVIDIA cuBLASDx library. Detailed documentation of cuBLASDx can be found in the cuBLASDx documentation.
Note
The Matmul device API in module nvmath.device currently supports cuBLASDx 0.5.1, which is also available as part of MathDx 25.12.1.
Traits Feature Readiness
The tables below outline the readiness of cuBLASDx traits in the Python API (nvmath.device).
1. Description Traits
These traits provide information about the function descriptor constructed using Operators.
| C++ Trait | Python | Status | Notes |
|---|---|---|---|
| | | ✅ | Returns |
| | | ✅ | Returns |
| | | ✅ | Returns |
| | | ✅ | Returns the string (e.g., |
| | | ✅ | Returns |
| | | ✅ | Returns |
| | | ✅ | Returns |
| | | ✅ | Returns |
| | | ✅ | Returns |
| | N/A | ❌ | Unnecessary in Python. The |
| | N/A | ❌ | The execution state is handled internally/implicitly. |
| | N/A | ❌ | Construction of |
| | N/A | ❌ | Same as above. |
2. Execution Traits (Block Traits)
These traits describe execution configuration when using Block() operators.
| C++ Trait | Python | Status | Notes |
|---|---|---|---|
| | | ✅ | Returns the NumPy compute data type for A, B, and C. |
| | | ✅ | Returns the dimensions as |
| | | ✅ | Exposed as part of the |
| | | ✅ | Exposed as part of the |
| | | ✅ | Number of elements in matrices, inclusive of padding. |
| | | ✅ | Returns |
| | N/A | ✅ | Automatically calculated and used if |
| | | ✅ | Calculated as |
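To illustrate what "number of elements in matrices, inclusive of padding" can mean, here is a minimal sketch in plain Python. It assumes the usual BLAS storage convention, in which a matrix occupies its leading dimension times the extent of its other dimension. The function name `padded_size` and the arrangement strings are hypothetical and are not part of nvmath.device; the library's own calculation may differ.

```python
def padded_size(rows, cols, ld, arrangement="col_major"):
    """Element count of a rows x cols matrix stored with leading dimension ld.

    Hypothetical helper for illustration only (not an nvmath.device API).
    Assumes the conventional BLAS layout: in column-major storage each of the
    `cols` columns occupies `ld` elements; in row-major storage each of the
    `rows` rows occupies `ld` elements.
    """
    if arrangement == "col_major":
        if ld < rows:
            raise ValueError("leading dimension must be >= number of rows")
        return ld * cols
    if arrangement == "row_major":
        if ld < cols:
            raise ValueError("leading dimension must be >= number of columns")
        return ld * rows
    raise ValueError(f"unknown arrangement: {arrangement!r}")


# A 32 x 16 column-major matrix padded to a leading dimension of 33
# occupies 33 * 16 elements instead of 32 * 16.
unpadded = padded_size(32, 16, ld=32)
padded = padded_size(32, 16, ld=33)
```

Padding the leading dimension by one element is a common way to reduce shared memory bank conflicts, which is why a padded size can exceed rows times columns.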
3. Other Traits
Helper traits regarding hardware support and performance suggestions.
| C++ Trait | Python | Status | Notes |
|---|---|---|---|
| | N/A | ❌ | Not currently implemented or exposed to the user. |
| | N/A | ❌ | Not currently implemented or exposed to the user. |
| | N/A | ✅ | Automatically calculated and used if |
| | N/A | ❌ | Not explicitly implemented (although the backend imports |
API Reference
|
A class that encapsulates a partial Matmul device function. |
|
Create an |
|
make_tensor is a helper function for creating |
|
AXPBY operation: y = alpha * x + beta * y |
|
Copies data from the source tensor to the destination tensor. |
|
A bidirectional copying method to copy data between register fragments and global memory tensors. |
|
Clears the contents of the given tensor by setting all elements to zero. |
Creates synchronization point. |
|
|
Abstraction over the cuBLASDx tensor type (an alias of the CuTe tensor type). |
|
Layout for the |
|
Accumulator is an abstraction that provides the link between the global memory and register layouts. |
|
DevicePipeline allows users to optimally configure kernel calls for pipelined matrix multiplication. |
|
TilePipeline allows users to execute a pipelined matrix multiplication with partial tile results accumulated into an accumulator. |
Helper class to calculate shared storage size. |
|
A namedtuple class that encapsulates the three leading dimensions in matrix multiplication \(C = \alpha Op(A) Op(B) + \beta C\). |
|
A namedtuple class that encapsulates the transpose mode for input matrices |
|
A namedtuple class that encapsulates the three precisions in matrix multiplication \(C = \alpha Op(A) Op(B) + \beta C\). |
|
A namedtuple class that encapsulates the three arrangements in matrix allocation. |
|
A type to encapsulate the memory alignment in bytes of the matrix operands A, B, and C. |
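The two formulas quoted in the entries above, \(C = \alpha Op(A) Op(B) + \beta C\) and y = alpha * x + beta * y, can be made concrete with a host-side reference in plain Python. This is only a sketch of the mathematics, not of the device API: `TransposeSketch`, `gemm_reference`, `axpby_reference`, and the mode strings are hypothetical names chosen for illustration, and nothing here calls nvmath.device.

```python
from collections import namedtuple

# Hypothetical stand-in for a namedtuple holding the transpose modes of A and B.
# It is NOT the nvmath.device class; the field names are assumptions.
TransposeSketch = namedtuple("TransposeSketch", ["a", "b"])


def op(matrix, mode):
    """Apply Op() to a matrix given as a list of rows."""
    if mode == "non_transposed":
        return [list(row) for row in matrix]
    transposed = [list(col) for col in zip(*matrix)]
    if mode == "transposed":
        return transposed
    if mode == "conj_transposed":
        return [[complex(x).conjugate() for x in row] for row in transposed]
    raise ValueError(f"unknown transpose mode: {mode!r}")


def gemm_reference(alpha, a, b, beta, c, transpose):
    """Return alpha * Op(A) @ Op(B) + beta * C as a new list of rows."""
    op_a = op(a, transpose.a)
    op_b = op(b, transpose.b)
    m, k, n = len(op_a), len(op_b), len(op_b[0])
    if len(op_a[0]) != k or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("incompatible matrix shapes")
    return [
        [
            alpha * sum(op_a[i][p] * op_b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


def axpby_reference(alpha, x, beta, y):
    """Return alpha * x + beta * y element-wise for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + beta * yi for xi, yi in zip(x, y)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
plain = gemm_reference(2, a, b, 3, c, TransposeSketch("non_transposed", "non_transposed"))
with_at = gemm_reference(1, a, b, 0, c, TransposeSketch("transposed", "non_transposed"))
```

A reference like this is mainly useful for checking the output of a device kernel on small inputs, where the cost of the triple loop does not matter.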