matmul

nvmath.device.matmul(*, compiler=None, code_type=None, execute_api=None, tensor_types=None, global_memory_alignment=None, **kwargs)
Create a Matmul object that encapsulates a compiled and ready-to-use device function for matrix multiplication.

Deprecated since version 0.7.0.
Parameters:

- size – A sequence of integers denoting the three dimensions (m, n, k) of the matrix multiplication problem.
- precision – The computation precision specified as a NumPy float dtype; currently supports numpy.float16, numpy.float32, and numpy.float64.
- data_type – The data type of the input matrices, either 'real' or 'complex'.
- compiler – A string specifying the compiler for the device code; currently supports None (default) and 'numba'. Changed in version 0.7.0: compiler is no longer needed and does not take effect. Use nvmath.device.compile_blas_execute() to get device function code.
- code_type (CodeType) – The target GPU code and compute capability. Changed in version 0.7.0: code_type should be used by nvmath.device.compile_blas_execute() and is no longer needed for numba-cuda usage.
- block_size (int) – The total block size, optional. If not provided or set to 'suggested', it will be set to a suggested value for a 1D block dimension (see the sketch after this parameter list).
- block_dim (Dim3) – The block dimension for launching the CUDA kernel, optional. If not provided or set to 'suggested', it will be set to a suggested value. Cannot be used when block_size is explicitly specified.
- leading_dimension (LeadingDimension) – The leading dimensions for the input matrices, optional. If not provided, it will be set to match the matrix row/column dimensions. Alternatively, if provided as 'suggested', it will be set to a suggested value for optimal performance.
- transpose_mode (TransposeMode) – The transpose mode for all input matrices; either transpose_mode or arrangement must be provided.
- arrangement (Arrangement) – The arrangement for all input matrices; either transpose_mode or arrangement must be provided.
- alignment (Alignment) – The alignment for the input matrices in shared memory. Defines the alignments (in bytes) of the input matrices A, B, and C (either arrays or wrapped in opaque tensors) that are passed to the execute(...) method. The default alignment equals the element size of the matrix, unless a suggested layout is used, in which case the alignment is greater than or equal to the element size.
- function (str) – A string specifying the name of the function. Currently supports 'MM' (default) for matrix multiplication.
- execution (str) – A string specifying the execution method, either 'Block' or 'Thread'.
- execute_api (str) – A string specifying the signature of the function that handles problems with default or custom/dynamic leading dimensions; either 'static_leading_dimensions' or 'dynamic_leading_dimensions'. Changed in version 0.7.0: execute_api should be used by nvmath.device.compile_blas_execute() and is no longer needed for numba-cuda usage.
- tensor_types (str) – A list of strings specifying the tensors used in the execute signature. Changed in version 0.7.0: tensor_types should be used by nvmath.device.compile_blas_execute() and is no longer needed for numba-cuda usage.
- global_memory_alignment (Alignment) – Same as alignment, but for global memory. Used to optimize copying between shared and global memory. Changed in version 0.7.0: global_memory_alignment should be used by nvmath.device.compile_blas_execute() for non numba-cuda usage; alignment should be set at nvmath.device.copy().
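Several of these options accept the string 'suggested'. Below is a minimal illustrative sketch (mirroring the sizes used in the Examples section, not an additional API) that lets the library pick both the block size and the leading dimensions:

>>> from nvmath.device import matmul
>>> import numpy as np
>>> MM = matmul(
...     size=(32, 16, 64),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size="suggested",  # suggested value for a 1D block dimension
...     leading_dimension="suggested",  # leading dimensions chosen for performance
...     compiler="numba",
... )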
See also
The attributes of Matmul provide a 1:1 mapping with the CUDA C++ cuBLASDx APIs. For further details, please refer to the cuBLASDx documentation.

Examples
>>> from numba import cuda
>>> from nvmath.device import matmul
>>> import numpy as np
>>> m, n, k = 32, 16, 64
>>> block_size = 256
Use nvmath.device.matmul() to create the compiled matrix multiplication object:

>>> MM = matmul(
...     size=(m, n, k),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size=block_size,
...     compiler="numba",
... )
Pass link=MM.files to the numba.cuda.jit() decorator when defining your kernel to link with the compiled code.

cuBLASDx works on shared memory arrays. It requires column-major (F order) arrays, but cuda.shared.array creates row-major (C order) arrays only. You can emulate a column-major array by flipping its dimensions. With your shared memory arrays ready and filled with actual data, you can run the matrix multiplication by calling MM:

>>> a_dim, b_dim, c_dim = MM.a_dim, MM.b_dim, MM.c_dim
>>> @cuda.jit(link=MM.files)
... def f():
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     # TODO: Populate the arrays with actual data.
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # TODO: Copy the result (c) from the shared memory.
>>> f[1, block_size]()
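The TODO comments above leave the data movement to the caller. Below is a minimal sketch of one way to fill them in with plain per-thread strided copies; it assumes a_global, b_global, and c_global are device arrays whose shapes match MM.a_dim, MM.b_dim, and MM.c_dim (these argument names are illustrative, and this is not the library's own copy helper, for which see nvmath.device.copy()):

>>> @cuda.jit(link=MM.files)
... def f2(a_global, b_global, c_global):
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     tid = cuda.threadIdx.x
...     # Per-thread strided copy into shared memory; indexing [col, row]
...     # emulates a column-major layout on the C-order shared arrays.
...     for i in range(tid, a_dim[0] * a_dim[1], block_size):
...         row, col = i % a_dim[0], i // a_dim[0]
...         a[col, row] = a_global[row, col]
...     for i in range(tid, b_dim[0] * b_dim[1], block_size):
...         row, col = i % b_dim[0], i // b_dim[0]
...         b[col, row] = b_global[row, col]
...     cuda.syncthreads()
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # Copy the result from shared memory back to global memory.
...     for i in range(tid, c_dim[0] * c_dim[1], block_size):
...         row, col = i % c_dim[0], i // c_dim[0]
...         c_global[row, col] = c[col, row]

It can then be launched, for example, as f2[1, block_size](d_a, d_b, d_c), where d_a, d_b, and d_c are device arrays created with numba.cuda.to_device.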
Further examples can be found in the nvmath/examples/device directory.