cuBLASDx APIs (nvmath.device)#

Overview#

These APIs offer integration with the NVIDIA cuBLASDx library. For detailed documentation, refer to the cuBLASDx documentation.

Note

The Matmul device API in the nvmath.device module currently supports cuBLASDx 0.5.1, which is also available as part of MathDx 25.12.1.

Traits Feature Readiness#

The following tables outline the readiness of cuBLASDx traits in the Python API (nvmath.device).

1. Description Traits#

These traits provide information about the function descriptor constructed using Operators.

| C++ Trait | Python nvmath.device Implementation | Notes |
|---|---|---|
| size_of | size | Returns an (m, n, k) tuple. |
| type_of | data_type | Returns 'real' or 'complex'. |
| precision_of | precision | Returns a Precision named tuple. |
| function_of | function | Returns the function as a string (e.g., 'MM'). |
| arrangement_of | arrangement | Returns an Arrangement named tuple. |
| transpose_mode_of | transpose_mode | Returns a TransposeMode named tuple (deprecated). |
| alignment_of | alignment | Returns an Alignment named tuple. |
| leading_dimension_of | leading_dimension | Returns a LeadingDimension named tuple. |
| sm_of | sm | Returns a ComputeCapability. |
| is_blas | N/A | Unnecessary in Python; the Matmul class acts as the guaranteed descriptor. |
| is_blas_execution | N/A | The execution state is handled internally and implicitly. |
| is_complete_blas | N/A | Construction of Matmul inherently validates completeness. |
| is_complete_blas_execution | N/A | Same as above. |
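As a rough illustration of the trait value shapes listed above, the sketch below uses plain Python with a hypothetical stand-in namedtuple rather than actual nvmath.device imports (the real Precision tuple holds NumPy dtypes, and the values come from a constructed Matmul object, not module-level variables):

```python
from collections import namedtuple

# Hypothetical stand-in for the nvmath.device Precision named tuple, used
# only to illustrate the shapes of the description traits in the table above.
Precision = namedtuple("Precision", ["a", "b", "c"])

# For a real-valued float32 32 x 32 x 32 problem, the description traits
# would resemble the following values:
size = (32, 32, 32)    # size      -> (m, n, k)
data_type = "real"     # data_type -> 'real' or 'complex'
function = "MM"        # function  -> the function name as a string
precision = Precision("float32", "float32", "float32")

m, n, k = size
print(k)            # 32
print(precision.c)  # float32
```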

2. Execution Traits (Block Traits)#

These traits describe the execution configuration when the Block() operator is used.

| C++ Trait | Python nvmath.device Implementation | Notes |
|---|---|---|
| <a/b/c>_value_type | a_value_type, b_value_type, c_value_type | Returns the NumPy compute data type for A, B, and C. |
| <a/b/c>_dim | a_dim, b_dim, c_dim | Returns the dimensions as (rows, columns) tuples. |
| ld<a/b/c> | leading_dimension | Exposed as part of the LeadingDimension tuple. |
| <a/b/c>_alignment | alignment | Exposed as part of the Alignment tuple. |
| <a/b/c>_size | a_size, b_size, c_size | Number of elements in each matrix, inclusive of padding. |
| block_dim | block_dim | Returns a Dim3 representing the CUDA block dimensions. |
| suggested_block_dim | N/A | Calculated and used automatically when block_dim="suggested" is passed during Matmul initialization. |
| max_threads_per_block | max_threads_per_block | Calculated as x * y * z threads. |
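The size and block-dimension relations above can be illustrated with plain-Python arithmetic (Dim3 and both helper functions below are hypothetical stand-ins, not nvmath.device imports): for a column-major rows x cols matrix stored with leading dimension ld >= rows, the element count inclusive of padding is ld * cols, and max_threads_per_block is simply the product of the three block dimensions.

```python
from collections import namedtuple

# Hypothetical stand-in for the Dim3 type returned by block_dim.
Dim3 = namedtuple("Dim3", ["x", "y", "z"])

def padded_size(rows: int, cols: int, ld: int) -> int:
    """Elements occupied by a column-major matrix with leading dimension
    ld >= rows, inclusive of padding -- mirrors the a_size/b_size/c_size
    convention described above."""
    assert ld >= rows
    return ld * cols

def max_threads(block_dim: Dim3) -> int:
    """max_threads_per_block is calculated as x * y * z."""
    return block_dim.x * block_dim.y * block_dim.z

# A 30 x 16 matrix padded to a leading dimension of 32 occupies
# 32 * 16 = 512 elements.
print(padded_size(30, 16, 32))       # 512
print(max_threads(Dim3(128, 2, 1)))  # 256
```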

3. Other Traits#

Helper traits regarding hardware support and performance suggestions.

| C++ Trait | Python nvmath.device Implementation | Notes |
|---|---|---|
| is_supported_smem_restrict | N/A | Not currently implemented or exposed to the user. |
| is_supported_rmem_restrict | N/A | Not currently implemented or exposed to the user. |
| suggested_leading_dimension_of | N/A | Calculated and used automatically when leading_dimension="suggested" is passed during Matmul initialization. |
| suggested_alignment_of | N/A | Not explicitly implemented: the backend imports MAX_ALIGNMENT, but no trait method returns the suggested tuple for A, B, and C. |

API Reference#

Matmul(size, precision, data_type, *[, sm, ...])

A class that encapsulates a partial Matmul device function.

matmul(*[, compiler, code_type, ...])

Create a Matmul object that encapsulates a compiled, ready-to-use device function for matrix multiplication.

make_tensor(array, layout)

make_tensor is a helper function for creating nvmath.device.OpaqueTensor objects.

axpby(alpha, x_tensor, beta, y_tensor)

AXPBY operation: y = alpha * x + beta * y

copy(src, dst[, alignment])

Copies data from the source tensor to the destination tensor.

copy_fragment(src, dst[, alignment])

Copies data bidirectionally between register fragments and global memory tensors.

clear(arr)

Clears the contents of the given tensor by setting all elements to zero.

copy_wait()

Creates a synchronization point.

OpaqueTensor(*args)

Abstraction over the cuBLASDx tensor type (an alias of the CuTe tensor type).

Layout()

Layout for the nvmath.device.OpaqueTensor.

Accumulator(*args)

Accumulator is an abstraction that provides the link between the global memory and register layouts.

DevicePipeline(mm, pipeline_depth, a, b)

DevicePipeline allows users to optimally configure kernel calls for pipelined matrix multiplication.

TilePipeline(device_pipeline)

TilePipeline allows users to execute a pipelined matrix multiplication with partial tile results accumulated into an accumulator.

SharedStorageCalc()

Helper class to calculate shared storage size.

LeadingDimension(a, b, c)

A namedtuple class that encapsulates the three leading dimensions in matrix multiplication \(C = \alpha Op(A) Op(B) + \beta C\).

TransposeMode(a, b)

A namedtuple class that encapsulates the transpose mode for input matrices A and B in matrix multiplication.

Precision(a, b, c)

A namedtuple class that encapsulates the three precisions in matrix multiplication \(C = \alpha Op(A) Op(B) + \beta C\).

Arrangement(a, b, c)

A namedtuple class that encapsulates the three arrangements in matrix allocation.

Alignment(a, b, c)

A type to encapsulate the memory alignment in bytes of the matrix operands A, B, and C.
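To make the namedtuple helpers above concrete, here is a sketch using hypothetical stand-in definitions (the real classes live in nvmath.device and may carry additional validation): LeadingDimension, Precision, Arrangement, and Alignment each bundle one value per operand A, B, and C, while TransposeMode covers only the inputs A and B.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the field layout of the nvmath.device
# namedtuple helpers; these are NOT imports from nvmath itself.
LeadingDimension = namedtuple("LeadingDimension", ["a", "b", "c"])
Arrangement = namedtuple("Arrangement", ["a", "b", "c"])
TransposeMode = namedtuple("TransposeMode", ["a", "b"])

# Leading dimensions for a problem in which all three operands are
# padded to a leading dimension of 64:
ld = LeadingDimension(a=64, b=64, c=64)

# One arrangement value per operand A, B, and C:
arr = Arrangement("col_major", "col_major", "col_major")

# TransposeMode has only two fields, one each for A and B:
tm = TransposeMode("non_transposed", "transposed")

print(ld.b)   # 64
print(arr.c)  # col_major
print(tm.a)   # non_transposed
```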