matmul

nvmath.device.matmul(*, compiler=None, **kwargs)

Create a BlasOptions object that encapsulates a compiled and ready-to-use device function for matrix multiplication.

Parameters:
size – A sequence of integers denoting the three dimensions (m, n, k) for the matrix multiplication problem.

precision – The computation precision specified as a numpy float dtype, currently supports numpy.float16, numpy.float32 and numpy.float64.

data_type – The data type of the input matrices, can be either 'real' or 'complex'.

compiler – A string to specify the compiler for the device code, currently supports None (default) and 'Numba'.

code_type (CodeType) – The target GPU code and compute capability.

block_size (int) – The total block size, optional. If not provided or set to 'suggested', will be set to a suggested value for a 1D block dimension.

block_dim (Dim3) – The block dimension for launching the CUDA kernel, optional. If not provided or set to 'suggested', will be set to a suggested value. Cannot be used when block_size is explicitly specified.

leading_dimension (LeadingDimension) – The leading dimensions for the input matrices, optional. If not provided, will be set to match the matrix row/column dimensions. Alternatively, if provided as 'suggested', will be set to a suggested value for optimal performance.

transpose_mode (TransposeMode) – The transpose mode for all input matrices. If not provided, no transposition is applied by default.

function (str) – A string specifying the name of the function. Currently supports 'MM' (default) for matrix multiplication.

execution (str) – A string specifying the execution method, can be 'Block' or 'Thread'.
See also

The attributes of BlasOptions provide a 1:1 mapping with the CUDA C++ cuBLASDx APIs. For further details, please refer to the cuBLASDx documentation.

Examples
>>> from numba import cuda
>>> from nvmath.device import matmul
>>> import numpy as np
>>> m, n, k = 32, 16, 64
>>> block_size = 256
Use nvmath.device.matmul() to create the compiled matrix multiplication object:

>>> MM = matmul(
...     size=(m, n, k),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size=block_size,
...     compiler="numba",
... )
Pass link=MM.files to the numba.cuda.jit() decorator when defining your kernel to link with the compiled code.

cuBLASDx works on shared memory arrays. It requires column-major (F order) arrays, but cuda.shared.array creates row-major (C order) arrays only. You can emulate a column-major array by flipping dimensions. With your shared memory arrays ready and filled with actual data, you can run the matrix multiplication by calling MM:

>>> a_dim, b_dim, c_dim = MM.a_dim, MM.b_dim, MM.c_dim
>>> @cuda.jit(link=MM.files)
... def f():
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     # TODO: Populate the arrays with actual data.
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # TODO: Copy the result (c) from the shared memory.
>>> f[1, block_size]()
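The TODO placeholders above can be filled in several ways. As a hedged sketch (not part of the reference example), the kernel below takes hypothetical global-memory arrays a_gmem, b_gmem and c_gmem, stages A and B into the dimension-flipped shared arrays, and copies the result back after the multiplication, with each thread striding over the elements using its 1D thread index:

>>> @cuda.jit(link=MM.files)
... def f_staged(a_gmem, b_gmem, c_gmem):
...     # Shared arrays with flipped dimensions to emulate column-major (F order) layout.
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     tid = cuda.threadIdx.x
...     nthreads = cuda.blockDim.x
...     # Stage A and B from global to shared memory (indices swapped on the shared side).
...     for idx in range(tid, a_dim[0] * a_dim[1], nthreads):
...         i = idx % a_dim[0]
...         j = idx // a_dim[0]
...         a[j, i] = a_gmem[i, j]
...     for idx in range(tid, b_dim[0] * b_dim[1], nthreads):
...         i = idx % b_dim[0]
...         j = idx // b_dim[0]
...         b[j, i] = b_gmem[i, j]
...     cuda.syncthreads()
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # Copy the result (c) back to global memory.
...     for idx in range(tid, c_dim[0] * c_dim[1], nthreads):
...         i = idx % c_dim[0]
...         j = idx // c_dim[0]
...         c_gmem[i, j] = c[j, i]

Assuming device arrays whose shapes match MM.a_dim, MM.b_dim and MM.c_dim, this kernel can then be launched with a single block of block_size threads:

>>> a = cuda.to_device(np.random.rand(*a_dim).astype(np.float32))
>>> b = cuda.to_device(np.random.rand(*b_dim).astype(np.float32))
>>> c = cuda.to_device(np.zeros(c_dim, dtype=np.float32))
>>> f_staged[1, block_size](a, b, c)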
Further examples can be found in the nvmath/examples/device directory.