matmul

nvmath.device.matmul(*, compiler=None, **kwargs)

Create a BlasOptions object that encapsulates a compiled and ready-to-use device function for matrix multiplication.

Parameters:
size – A sequence of integers denoting the three dimensions (m, n, k) for the matrix multiplication problem.

precision – The computation precision specified as a numpy float dtype, currently supports numpy.float16, numpy.float32 and numpy.float64.

data_type – The data type of the input matrices, can be either 'real' or 'complex'.

compiler – A string to specify the compiler for the device code, currently supports None (default) and 'Numba'.

code_type (CodeType) – The target GPU code and compute capability.

block_size (int) – The total block size, optional. If not provided or set to 'suggested', will be set to a suggested value for a 1D block dimension.

block_dim (Dim3) – The block dimension for launching the CUDA kernel, optional. If not provided or set to 'suggested', will be set to a suggested value. Cannot be used when block_size is explicitly specified.

leading_dimension (LeadingDimension) – The leading dimensions for the input matrices, optional. If not provided, will be set to match the matrix row/column dimensions. Alternatively, if provided as 'suggested', will be set to a suggested value for optimal performance.

transpose_mode (TransposeMode) – The transpose mode for all input matrices. If not provided, no transposition is applied by default.

function (str) – A string specifying the name of the function. Currently supports 'MM' (default) for matrix multiplication.

execution (str) – A string specifying the execution method, can be 'Block' or 'Thread'.
See also

The attributes of BlasOptions provide a 1:1 mapping with the CUDA C++ cuBLASDx APIs. For further details, please refer to the cuBLASDx documentation.

Examples
>>> from numba import cuda
>>> from nvmath.device import matmul
>>> import numpy as np
>>> m, n, k = 32, 16, 64
>>> block_size = 256
Use nvmath.device.matmul() to create the compiled matrix multiplication object:

>>> MM = matmul(
...     size=(m, n, k),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size=block_size,
...     compiler="numba",
... )
Pass link=MM.files to the numba.cuda.jit() decorator when defining your kernel to link with the compiled code.

cuBLASDx works on shared memory arrays. It requires column-major (F order) arrays, but cuda.shared.array creates row-major (C order) arrays only. You can emulate a column-major array by flipping dimensions. With your shared memory arrays ready and filled with actual data, you can run the matrix multiplication by calling MM:

>>> a_dim, b_dim, c_dim = MM.a_dim, MM.b_dim, MM.c_dim
>>> @cuda.jit(link=MM.files)
... def f():
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     # TODO: Populate the arrays with actual data.
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # TODO: Copy the result (c) from the shared memory.
>>> f[1, block_size]()
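The TODO placeholders above can be filled in several ways. As a hedged sketch (not part of the reference example), the kernel below takes hypothetical global-memory arrays a_gmem, b_gmem and c_gmem, stages A and B into the dimension-flipped shared arrays, and copies the result back after the multiplication, with each thread striding over the elements using its 1D thread index:

>>> @cuda.jit(link=MM.files)
... def f_staged(a_gmem, b_gmem, c_gmem):
...     # Shared arrays with flipped dimensions to emulate column-major (F order) layout.
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     tid = cuda.threadIdx.x
...     nthreads = cuda.blockDim.x
...     # Stage A and B from global to shared memory (indices swapped on the shared side).
...     for idx in range(tid, a_dim[0] * a_dim[1], nthreads):
...         i = idx % a_dim[0]
...         j = idx // a_dim[0]
...         a[j, i] = a_gmem[i, j]
...     for idx in range(tid, b_dim[0] * b_dim[1], nthreads):
...         i = idx % b_dim[0]
...         j = idx // b_dim[0]
...         b[j, i] = b_gmem[i, j]
...     cuda.syncthreads()
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # Copy the result (c) back to global memory.
...     for idx in range(tid, c_dim[0] * c_dim[1], nthreads):
...         i = idx % c_dim[0]
...         j = idx // c_dim[0]
...         c_gmem[i, j] = c[j, i]

Assuming device arrays whose shapes match MM.a_dim, MM.b_dim and MM.c_dim, this kernel can then be launched with a single block of block_size threads:

>>> a = cuda.to_device(np.random.rand(*a_dim).astype(np.float32))
>>> b = cuda.to_device(np.random.rand(*b_dim).astype(np.float32))
>>> c = cuda.to_device(np.zeros(c_dim, dtype=np.float32))
>>> f_staged[1, block_size](a, b, c)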
Further examples can be found in the nvmath/examples/device directory.