matmul#

nvmath.device.matmul(*, compiler=None, **kwargs)[source]#

Create an BlasOptions object that encapsulates a compiled and ready-to-use device function for matrix multiplication.

Parameters:

size – A sequence of integers denoting the three dimensions (m, n, k) for the matrix multiplication problem.
precision – The computation precision specified as a numpy float dtype, currently supports numpy.float16, numpy.float32 and numpy.float64.
data_type – The data type of the input matrices, can be either 'real' or 'complex'.
compiler – A string to specify the compiler for the device code, currently supports None (default) and 'numba'
code_type (CodeType) – The target GPU code and compute-capability.
block_size (int) – The total block size, optional. If not provided or set to 'suggested', will be set to a suggested value for 1D block dim.
block_dim (Dim3) – The block dimension for launching the CUDA kernel, optional. If not provided or set to 'suggested', will be set to a suggested value. Cannot be used when block_size is explicitly specified.
leading_dimension (LeadingDimension) – The leading dimensions for the input matrices, optional. If not provided, will be set to match the matrix row/column dimension. Alternatively, if provided as 'suggested', will be set to a suggested value for optimal performance.
transpose_mode (TransposeMode) – The transpose mode for all input matrices ; transpose_mode or arrangement must be provided.
arrangement (Arrangement) – The arrangement for all input matrices ; transpose_mode or arrangement must be provided.
alignment (Alignment) – The alignment for the input matrices in shared memory. Defines the alignments (in bytes) of the input matrices A, B, and C (either arrays or wrapped in opaque tensors) that are passed to the execute(…) method. Default alignment is equal to an element size of the matrix unless used suggested layout. In that case alignment is greater or equal than the element size.
function (str) – A string specifying the name of the function. Currently supports 'MM' (default) for matrix multiplication.
execution (str) – A string specifying the execution method, can be 'Block' or 'Thread'.
execute_api (str) – A string specifying the signature of the function that handles problems with default or custom/dynamic leading dimensions. Could be 'static_leading_dimensions' or 'dynamic_leading_dimensions'.
tensor_types (str) – A list of strings specifying the tensors being used at execute signature.
global_memory_alignment (Alignment) – Same as alignment, but for the global memory. Used to optimize copying between shared and global memory.

See also

The attributes of BlasOptions provide a 1:1 mapping with the CUDA C++ cuBLASDx APIs. For further details, please refer to cuBLASDx documentation.

Examples

>>> from numba import cuda
>>> from nvmath.device import matmul
>>> import numpy as np
>>> m, n, k = 32, 16, 64
>>> block_size = 256

Use nvmath.device.matmul() to create the compiled matrix multiplication object:

>>> MM = matmul(
...     size=(m, n, k),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size=block_size,
...     compiler="numba",
... )

Pass link=MM.files to the numba.cuda.jit() decorator when defining your kernel to link with the compiled code.

cuBLASDx works on shared memory arrays. It requires column-major (F order) arrays but cuda.shared.array creates row-major (C order) arrays only. You can emulate a column-major array by flipping dimensions. With your shared memory arrays ready and filled with actual data, you can run the matrix multiplication by calling MM

>>> a_dim, b_dim, c_dim = MM.a_dim, MM.b_dim, MM.c_dim
>>> @cuda.jit(link=MM.files)
... def f():
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     # TODO: Populate the arrays with actual data.
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # TODO: Copy the result (c) from the shared memory
>>> f[1, block_size]()

Further examples can be found in the nvmath/examples/device directory.