matmul

nvmath.device.matmul(*, compiler=None, code_type=None, execute_api=None, tensor_types=None, global_memory_alignment=None, **kwargs)
Create a Matmul object that encapsulates a compiled and ready-to-use device function for matrix multiplication.

Deprecated since version 0.7.0.
Parameters:

- size – A sequence of integers denoting the three dimensions (m, n, k) of the matrix multiplication problem.
- precision – The computation precision specified as a NumPy float dtype; currently supports numpy.float16, numpy.float32, and numpy.float64.
- data_type – The data type of the input matrices, either 'real' or 'complex'.
- compiler – A string specifying the compiler for the device code; currently supports None (default) and 'numba'. Changed in version 0.7.0: compiler is no longer needed and does not take effect. Use nvmath.device.compile_blas_execute() to get device function code.
- code_type (CodeType) – The target GPU code and compute capability. Changed in version 0.7.0: code_type should be used by nvmath.device.compile_blas_execute() and is no longer needed for numba-cuda usage.
- block_size (int) – The total block size, optional. If not provided or set to 'suggested', it will be set to a suggested value for a 1D block dimension (see the sketch after this parameter list).
- block_dim (Dim3) – The block dimension for launching the CUDA kernel, optional. If not provided or set to 'suggested', it will be set to a suggested value. Cannot be used when block_size is explicitly specified.
- leading_dimension (LeadingDimension) – The leading dimensions for the input matrices, optional. If not provided, it will be set to match the matrix row/column dimensions. Alternatively, if provided as 'suggested', it will be set to a suggested value for optimal performance.
- transpose_mode (TransposeMode) – The transpose mode for all input matrices; either transpose_mode or arrangement must be provided.
- arrangement (Arrangement) – The arrangement for all input matrices; either transpose_mode or arrangement must be provided.
- alignment (Alignment) – The alignment for the input matrices in shared memory. Defines the alignments (in bytes) of the input matrices A, B, and C (either arrays or wrapped in opaque tensors) that are passed to the execute(...) method. The default alignment equals the element size of the matrix, unless a suggested layout is used, in which case the alignment is greater than or equal to the element size.
- function (str) – A string specifying the name of the function. Currently supports 'MM' (default) for matrix multiplication.
- execution (str) – A string specifying the execution method, either 'Block' or 'Thread'.
- execute_api (str) – A string specifying the signature of the function that handles problems with default or custom/dynamic leading dimensions; either 'static_leading_dimensions' or 'dynamic_leading_dimensions'. Changed in version 0.7.0: execute_api should be used by nvmath.device.compile_blas_execute() and is no longer needed for numba-cuda usage.
- tensor_types (str) – A list of strings specifying the tensors used in the execute signature. Changed in version 0.7.0: tensor_types should be used by nvmath.device.compile_blas_execute() and is no longer needed for numba-cuda usage.
- global_memory_alignment (Alignment) – Same as alignment, but for global memory. Used to optimize copying between shared and global memory. Changed in version 0.7.0: global_memory_alignment should be used by nvmath.device.compile_blas_execute() for non numba-cuda usage; alignment should be set at nvmath.device.copy().
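Several of these options accept the string 'suggested'. Below is a minimal illustrative sketch (mirroring the sizes used in the Examples section, not an additional API) that lets the library pick both the block size and the leading dimensions:

>>> from nvmath.device import matmul
>>> import numpy as np
>>> MM = matmul(
...     size=(32, 16, 64),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size="suggested",  # suggested value for a 1D block dimension
...     leading_dimension="suggested",  # leading dimensions chosen for performance
...     compiler="numba",
... )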
See also
The attributes of Matmul provide a 1:1 mapping with the CUDA C++ cuBLASDx APIs. For further details, please refer to the cuBLASDx documentation.

Examples
>>> from numba import cuda
>>> from nvmath.device import matmul
>>> import numpy as np
>>> m, n, k = 32, 16, 64
>>> block_size = 256
Use nvmath.device.matmul() to create the compiled matrix multiplication object:

>>> MM = matmul(
...     size=(m, n, k),
...     precision=np.float32,
...     data_type="real",
...     transpose_mode=("non_transposed", "transposed"),
...     execution="Block",
...     block_size=block_size,
...     compiler="numba",
... )
Pass link=MM.files to the numba.cuda.jit() decorator when defining your kernel to link with the compiled code.

cuBLASDx works on shared memory arrays. It requires column-major (F order) arrays, but cuda.shared.array creates row-major (C order) arrays only. You can emulate a column-major array by flipping its dimensions. With your shared memory arrays ready and filled with actual data, you can run the matrix multiplication by calling MM:

>>> a_dim, b_dim, c_dim = MM.a_dim, MM.b_dim, MM.c_dim
>>> @cuda.jit(link=MM.files)
... def f():
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     # TODO: Populate the arrays with actual data.
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # TODO: Copy the result (c) from the shared memory.
>>> f[1, block_size]()
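The TODO comments above leave the data movement to the caller. Below is a minimal sketch of one way to fill them in with plain per-thread strided copies; it assumes a_global, b_global, and c_global are device arrays whose shapes match MM.a_dim, MM.b_dim, and MM.c_dim (these argument names are illustrative, and this is not the library's own copy helper, for which see nvmath.device.copy()):

>>> @cuda.jit(link=MM.files)
... def f2(a_global, b_global, c_global):
...     a = cuda.shared.array(shape=(a_dim[1], a_dim[0]), dtype=np.float32)
...     b = cuda.shared.array(shape=(b_dim[1], b_dim[0]), dtype=np.float32)
...     c = cuda.shared.array(shape=(c_dim[1], c_dim[0]), dtype=np.float32)
...     tid = cuda.threadIdx.x
...     # Per-thread strided copy into shared memory; indexing [col, row]
...     # emulates a column-major layout on the C-order shared arrays.
...     for i in range(tid, a_dim[0] * a_dim[1], block_size):
...         row, col = i % a_dim[0], i // a_dim[0]
...         a[col, row] = a_global[row, col]
...     for i in range(tid, b_dim[0] * b_dim[1], block_size):
...         row, col = i % b_dim[0], i // b_dim[0]
...         b[col, row] = b_global[row, col]
...     cuda.syncthreads()
...     alpha, beta = 1.0, 0.0
...     MM(alpha, a, b, beta, c)
...     cuda.syncthreads()
...     # Copy the result from shared memory back to global memory.
...     for i in range(tid, c_dim[0] * c_dim[1], block_size):
...         row, col = i % c_dim[0], i // c_dim[0]
...         c_global[row, col] = c[col, row]

It can then be launched, for example, as f2[1, block_size](d_a, d_b, d_c), where d_a, d_b, and d_c are device arrays created with numba.cuda.to_device.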
Further examples can be found in the nvmath/examples/device directory.