cuBLASDx APIs (nvmath.device)#

Overview#

These APIs offer integration with the NVIDIA cuBLASDx library. For detailed documentation, refer to the cuBLASDx documentation.

Note

The Matmul device API in the nvmath.device module currently supports cuBLASDx 0.5.1, which is also available as part of MathDx 25.12.1.

Traits Feature Readiness#

The following tables outline the readiness of cuBLASDx traits in the Python API (nvmath.device).

1. Description Traits#

These traits provide information about the function descriptor constructed using Operators.

| C++ Trait | Python nvmath.device Implementation | Notes |
|---|---|---|
| size_of | size | Returns an (m, n, k) tuple. |
| type_of | data_type | Returns 'real' or 'complex'. |
| precision_of | precision | Returns a Precision named tuple. |
| function_of | function | Returns the function as a string (e.g., 'MM'). |
| arrangement_of | arrangement | Returns an Arrangement named tuple. |
| transpose_mode_of | transpose_mode | Returns a TransposeMode named tuple (deprecated). |
| alignment_of | alignment | Returns an Alignment named tuple. |
| leading_dimension_of | leading_dimension | Returns a LeadingDimension named tuple. |
| sm_of | sm | Returns a ComputeCapability. |
| is_blas | N/A | Unnecessary in Python; the Matmul class acts as the guaranteed descriptor. |
| is_blas_execution | N/A | The execution state is handled internally and implicitly. |
| is_complete_blas | N/A | Construction of Matmul inherently validates completeness. |
| is_complete_blas_execution | N/A | Same as above. |
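As a rough illustration of the trait value shapes listed above, the sketch below uses plain Python with a hypothetical stand-in namedtuple rather than actual nvmath.device imports (the real Precision tuple holds NumPy dtypes, and the values come from a constructed Matmul object, not module-level variables):

```python
from collections import namedtuple

# Hypothetical stand-in for the nvmath.device Precision named tuple, used
# only to illustrate the shapes of the description traits in the table above.
Precision = namedtuple("Precision", ["a", "b", "c"])

# For a real-valued float32 32 x 32 x 32 problem, the description traits
# would resemble the following values:
size = (32, 32, 32)    # size      -> (m, n, k)
data_type = "real"     # data_type -> 'real' or 'complex'
function = "MM"        # function  -> the function name as a string
precision = Precision("float32", "float32", "float32")

m, n, k = size
print(k)            # 32
print(precision.c)  # float32
```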

2. Execution Traits (Block Traits)#

These traits describe the execution configuration when the Block() operator is used.

| C++ Trait | Python nvmath.device Implementation | Notes |
|---|---|---|
| <a/b/c>_value_type | a_value_type, b_value_type, c_value_type | Returns the NumPy compute data type for A, B, and C. |
| <a/b/c>_dim | a_dim, b_dim, c_dim | Returns the dimensions as (rows, columns) tuples. |
| ld<a/b/c> | leading_dimension | Exposed as part of the LeadingDimension tuple. |
| <a/b/c>_alignment | alignment | Exposed as part of the Alignment tuple. |
| <a/b/c>_size | a_size, b_size, c_size | Number of elements in each matrix, inclusive of padding. |
| block_dim | block_dim | Returns a Dim3 representing the CUDA block dimensions. |
| suggested_block_dim | N/A | Calculated and used automatically when block_dim="suggested" is passed during Matmul initialization. |
| max_threads_per_block | max_threads_per_block | Calculated as x * y * z threads. |
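The size and block-dimension relations above can be illustrated with plain-Python arithmetic (Dim3 and both helper functions below are hypothetical stand-ins, not nvmath.device imports): for a column-major rows x cols matrix stored with leading dimension ld >= rows, the element count inclusive of padding is ld * cols, and max_threads_per_block is simply the product of the three block dimensions.

```python
from collections import namedtuple

# Hypothetical stand-in for the Dim3 type returned by block_dim.
Dim3 = namedtuple("Dim3", ["x", "y", "z"])

def padded_size(rows: int, cols: int, ld: int) -> int:
    """Elements occupied by a column-major matrix with leading dimension
    ld >= rows, inclusive of padding -- mirrors the a_size/b_size/c_size
    convention described above."""
    assert ld >= rows
    return ld * cols

def max_threads(block_dim: Dim3) -> int:
    """max_threads_per_block is calculated as x * y * z."""
    return block_dim.x * block_dim.y * block_dim.z

# A 30 x 16 matrix padded to a leading dimension of 32 occupies
# 32 * 16 = 512 elements.
print(padded_size(30, 16, 32))       # 512
print(max_threads(Dim3(128, 2, 1)))  # 256
```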

3. Other Traits#

Helper traits regarding hardware support and performance suggestions.

| C++ Trait | Python nvmath.device Implementation | Notes |
|---|---|---|
| is_supported_smem_restrict | N/A | Not currently implemented or exposed to the user. |
| is_supported_rmem_restrict | N/A | Not currently implemented or exposed to the user. |
| suggested_leading_dimension_of | N/A | Calculated and used automatically when leading_dimension="suggested" is passed during Matmul initialization. |
| suggested_alignment_of | N/A | Not explicitly implemented: the backend imports MAX_ALIGNMENT, but no trait method returns the suggested tuple for A, B, and C. |

API Reference#

Matmul(size, precision, data_type, *[, sm, ...])

A class that encapsulates a partial Matmul device function.

matmul(*[, compiler, code_type, ...])

Create a Matmul object that encapsulates a compiled, ready-to-use device function for matrix multiplication.

make_tensor(array, layout)

make_tensor is a helper function for creating nvmath.device.OpaqueTensor objects.

axpby(alpha, x_tensor, beta, y_tensor)

AXPBY operation: y = alpha * x + beta * y

copy(src, dst[, alignment])

Copies data from the source tensor to the destination tensor.

copy_fragment(src, dst[, alignment])

Copies data bidirectionally between register fragments and global memory tensors.

clear(arr)

Clears the contents of the given tensor by setting all elements to zero.

copy_wait()

Creates a synchronization point.

OpaqueTensor(*args)

Abstraction over the cuBLASDx tensor type (an alias of the CuTe tensor type).

Layout()

Layout for the nvmath.device.OpaqueTensor.

Accumulator(*args)

Accumulator is an abstraction that provides the link between the global memory and register layouts.

DevicePipeline(mm, pipeline_depth, a, b)

DevicePipeline allows users to optimally configure kernel calls for pipelined matrix multiplication.

TilePipeline(device_pipeline)

TilePipeline allows users to execute a pipelined matrix multiplication with partial tile results accumulated into an accumulator.

SharedStorageCalc()

Helper class to calculate shared storage size.

LeadingDimension(a, b, c)

A namedtuple class that encapsulates the three leading dimensions in matrix multiplication \(C = \alpha Op(A) Op(B) + \beta C\).

TransposeMode(a, b)

A namedtuple class that encapsulates the transpose mode for input matrices A and B in matrix multiplication.

Precision(a, b, c)

A namedtuple class that encapsulates the three precisions in matrix multiplication \(C = \alpha Op(A) Op(B) + \beta C\).

Arrangement(a, b, c)

A namedtuple class that encapsulates the three arrangements in matrix allocation.

Alignment(a, b, c)

A type to encapsulate the memory alignment in bytes of the matrix operands A, B, and C.
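To make the namedtuple helpers above concrete, here is a sketch using hypothetical stand-in definitions (the real classes live in nvmath.device and may carry additional validation): LeadingDimension, Precision, Arrangement, and Alignment each bundle one value per operand A, B, and C, while TransposeMode covers only the inputs A and B.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the field layout of the nvmath.device
# namedtuple helpers; these are NOT imports from nvmath itself.
LeadingDimension = namedtuple("LeadingDimension", ["a", "b", "c"])
Arrangement = namedtuple("Arrangement", ["a", "b", "c"])
TransposeMode = namedtuple("TransposeMode", ["a", "b"])

# Leading dimensions for a problem in which all three operands are
# padded to a leading dimension of 64:
ld = LeadingDimension(a=64, b=64, c=64)

# One arrangement value per operand A, B, and C:
arr = Arrangement("col_major", "col_major", "col_major")

# TransposeMode has only two fields, one each for A and B:
tm = TransposeMode("non_transposed", "transposed")

print(ld.b)   # 64
print(arr.c)  # col_major
print(tm.a)   # non_transposed
```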