matmul#

nvmath.device.matmul(*, compiler=None, **kwargs)[source]#

Create a BlasOptions object that encapsulates a compiled and ready-to-use device function for matrix multiplication.

Parameters:
size – A sequence of integers denoting the three dimensions (m, n, k) of the matrix multiplication problem.

precision – The computation precision, specified as a NumPy float dtype; currently supports numpy.float16, numpy.float32, and numpy.float64.

data_type – The data type of the input matrices, either 'real' or 'complex'.

compiler – A string specifying the compiler for the device code; currently supports None (default) and 'numba'.
code_type (CodeType) – The target GPU code and compute-capability.
block_size (int) – The total block size, optional. If not provided or set to 'suggested', will be set to a suggested value for a 1D block dimension.

block_dim (Dim3) – The block dimension for launching the CUDA kernel, optional. If not provided or set to 'suggested', will be set to a suggested value. Cannot be used when block_size is explicitly specified.

leading_dimension (LeadingDimension) – The leading dimensions for the input matrices, optional. If not provided, will be set to match the matrix row/column dimensions. Alternatively, if provided as 'suggested', will be set to a suggested value for optimal performance.

transpose_mode (TransposeMode) – The transpose mode for all input matrices; either transpose_mode or arrangement must be provided.

arrangement (Arrangement) – The arrangement for all input matrices; either transpose_mode or arrangement must be provided.
alignment (Alignment) – The alignment, in bytes, of the input matrices A, B, and C in shared memory (whether passed as plain arrays or wrapped in opaque tensors) when they are handed to the execute(…) method. By default, the alignment equals the element size of the matrix; when a suggested layout is used, the alignment is greater than or equal to the element size.
function (str) – A string specifying the name of the function; currently supports 'MM' (default) for matrix multiplication.

execution (str) – A string specifying the execution method, either 'Block' or 'Thread'.

execute_api (str) – A string specifying the signature of the function, which handles problems with either default or custom/dynamic leading dimensions; can be 'static_leading_dimensions' or 'dynamic_leading_dimensions'.

global_memory_alignment (Alignment) – Same as alignment, but for global memory. Used to optimize copies between shared and global memory.
See also

The attributes of BlasOptions provide a 1:1 mapping to the CUDA C++ cuBLASDx APIs. For further details, please refer to the cuBLASDx documentation.

Examples
>>> from numba import cuda
>>> from nvmath.device import matmul
>>> import numpy as np
>>> m, n, k = 32, 16, 64
>>> block_size = 256
Use nvmath.device.matmul() to create the compiled matrix multiplication object:

>>> MM = matmul(
...     size=(m, n, k),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size=block_size,
...     compiler="numba",
... )
Pass link=MM.files to the numba.cuda.jit() decorator when defining your kernel to link with the compiled code.

cuBLASDx operates on shared memory arrays. It requires column-major (F-order) arrays, but cuda.shared.array creates row-major (C-order) arrays only. You can emulate a column-major array by flipping the dimensions. With your shared memory arrays ready and filled with actual data, you can run the matrix multiplication by calling MM:
>>> a_dim, b_dim, c_dim = MM.a_dim, MM.b_dim, MM.c_dim
>>> @cuda.jit(link=MM.files)
... def f():
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     # TODO: Populate the arrays with actual data.
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # TODO: Copy the result (c) from the shared memory.
>>> f[1, block_size]()
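The dimension-flipping trick above works because a row-major (C-order) array of shape (cols, rows) has exactly the same linear memory layout as a column-major (F-order) array of shape (rows, cols). A minimal host-side sketch of this equivalence (pure Python, no GPU required; the sizes and helper names are illustrative, not part of the nvmath API):

```python
# A column-major m x k matrix stores element (i, j) at linear offset i + j * m.
# A row-major k x m array stores element (j, i) at linear offset j * m + i,
# i.e. the same location. Flipping the shape therefore emulates F order.

def col_major_offset(i, j, m):
    # Offset of element (i, j) in a column-major matrix with m rows.
    return i + j * m

def row_major_offset(r, c, ncols):
    # Offset of element (r, c) in a row-major array with ncols columns.
    return r * ncols + c

m, k = 3, 4  # illustrative sizes
for i in range(m):
    for j in range(k):
        # Element (i, j) of the column-major m x k matrix coincides with
        # element (j, i) of the row-major k x m array.
        assert col_major_offset(i, j, m) == row_major_offset(j, i, m)

print("layouts match")
```

This is the same reasoning behind writing cuda.shared.array(shape=(a_dim[1], a_dim[0]), ...) in the kernel above: the row-major shared array with flipped dimensions is bit-for-bit the column-major array that cuBLASDx expects.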
Further examples can be found in the nvmath/examples/device directory.