matmul
nvmath.distributed.linalg.advanced.matmul(
    a,
    b,
    /,
    c=None,
    *,
    distributions: Sequence[Distribution],
    alpha=None,
    beta=None,
    epilog=None,
    epilog_inputs=None,
    qualifiers=None,
    options=None,
    preferences=None,
    stream: AnyStream | int | None = None,
)
Perform the specified distributed matrix multiplication computation \(F(\alpha a @ b + \beta c)\), where \(F\) is the epilog. This function-form API is a wrapper around the stateful Matmul object APIs and is meant for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs.

Detailed information on what's happening within this function can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.
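Alternatively, a dedicated logger can be routed through the options object instead of the root logger. A minimal sketch, assuming the MatmulOptions field is named logger (see MatmulOptions for the exact field name):

>>> my_logger = logging.getLogger("distributed_matmul")
>>> my_logger.setLevel(logging.INFO)
>>> o = nvmath.distributed.linalg.advanced.MatmulOptions(logger=my_logger)
>>> # Diagnostics from matmul() calls made with options=o are emitted
>>> # through my_logger rather than the root logger.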
- Parameters:
  - a – A distributed tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
  - b – A distributed tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
  - c – (Optional) A distributed tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
  - distributions – Sequence specifying the distribution across processes of matrices A, B, and C/D. The distribution needs to be BlockCyclic or compatible.
  - alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).
  - beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.
  - epilog – Specify an epilog \(F\) as an object of type MatmulEpilog to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog. See the cuBLASMp documentation for the list of available epilogs. (An illustrative epilog call appears in the Examples below.)
  - epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX.
  - qualifiers – Specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.
  - options – Specify options for the matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.
  - preferences – This parameter specifies the preferences for planning as a MatmulPlanPreferences object. Alternatively, a dictionary containing the parameters for the MatmulPlanPreferences constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulPlanPreferences object.
  - stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
- Returns:
The result of the specified matrix multiplication (epilog applied), which remains on the same device and belongs to the same package as the input operands.
- Semantics:
The matrix multiplication follows numpy.matmul() semantics, with some restrictions on broadcasting.
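For reference, the same shape rule in plain single-process NumPy (purely illustrative; the distributed operation applies it to the global matrix shapes):

>>> import numpy as np
>>> # (M, K) @ (K, N) yields (M, N), as in numpy.matmul:
>>> np.matmul(np.ones((128, 256)), np.ones((256, 64))).shape
(128, 64)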
See also
Matmul, MatmulOptions, MatmulEpilog, MatmulPlanPreferences

Examples
>>> import cupy as cp
>>> import nvmath.distributed
>>> from nvmath.distributed.distribution import Slab
Get the MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the Matmul examples in nvmath/examples/distributed/linalg/advanced):

>>> comm = nvmath.distributed.get_context().communicator
Get my process rank:
>>> rank = comm.Get_rank()
Create three float32 ndarrays on the GPU:
>>> M, N, K = 128, 64, 256
>>> a_shape = Slab.X.shape(rank, (M, K))
>>> b_shape = Slab.Y.shape(rank, (K, N))
>>> c_shape = Slab.X.shape(rank, (M, N))
>>> device_id = nvmath.distributed.get_context().device_id
>>> with cp.cuda.Device(device_id):
...     a = cp.asfortranarray(cp.random.rand(*a_shape, dtype=cp.float32))
...     b = cp.asfortranarray(cp.random.rand(*b_shape, dtype=cp.float32))
...     c = cp.asfortranarray(cp.random.rand(*c_shape, dtype=cp.float32))
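Assuming Slab.X partitions the first (X) dimension across processes and Slab.Y the second, as in other nvmath.distributed distributions, the local shapes returned above work out as in this illustration:

>>> # Illustration only (assumed layout; see the Distribution docs):
>>> # with 2 processes, Slab.X on A (128 x 256) gives each rank a
>>> # (64, 256) row slab; Slab.Y on B (256 x 64) gives each rank a
>>> # (256, 32) column slab; Slab.X on C (128 x 64) gives (64, 64).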
Perform the operation \(\alpha A @ B + \beta C\) using matmul(). The result r is also a CuPy float32 ndarray:

>>> distributions = [Slab.X, Slab.Y, Slab.X]
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, c, alpha=1.23, beta=0.74, distributions=distributions
... )
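An epilog can be applied in the same call. A minimal sketch, assuming the RELU epilog is available in this build (the cuBLASMp documentation lists the supported epilogs; some, like DRELU, also require epilog_inputs):

>>> epilog = nvmath.distributed.linalg.advanced.MatmulEpilog.RELU
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, distributions=distributions, epilog=epilog
... )
>>> # r now holds RELU(a @ b); epilogs that need extra tensors take them
>>> # via epilog_inputs={...} as described under Parameters.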
Options can be provided to customize the operation:
>>> compute_type = (
...     nvmath.distributed.linalg.advanced.MatmulComputeType.COMPUTE_32F_FAST_TF32
... )
>>> o = nvmath.distributed.linalg.advanced.MatmulOptions(compute_type=compute_type)
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, distributions=distributions, options=o
... )
See MatmulOptions for the complete list of available options.

The package current stream is used by default, but a stream can be explicitly provided to the Matmul operation. This can be done if the operands are computed on a different stream, for example:
>>> with cp.cuda.Device(device_id):
...     s = cp.cuda.Stream()
...     with s:
...         a = cp.asfortranarray(cp.random.rand(*a_shape))
...         b = cp.asfortranarray(cp.random.rand(*b_shape))
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, distributions=distributions, stream=s
... )
The operation above runs on stream s and is ordered with respect to the input computation.

Create NumPy ndarrays on the CPU.
>>> import numpy as np
>>> a = np.asfortranarray(np.random.rand(*a_shape))
>>> b = np.asfortranarray(np.random.rand(*b_shape))
Provide the NumPy ndarrays to matmul(), with the result also being a NumPy ndarray:

>>> r = nvmath.distributed.linalg.advanced.matmul(a, b, distributions=distributions)
Notes
This function is a convenience wrapper around Matmul and is specifically meant for single use.
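For repeated multiplications with the same problem specification, the stateful object amortizes preparatory (planning) costs across executions. A rough sketch, assuming the distributed Matmul object mirrors the plan/execute/free pattern of the stateful nvmath APIs (see the Matmul documentation for the exact interface):

>>> mm = nvmath.distributed.linalg.advanced.Matmul(
...     a, b, distributions=distributions
... )
>>> mm.plan()          # preparatory cost paid once (assumed method names)
>>> r1 = mm.execute()  # ... amortized over repeated executions
>>> r2 = mm.execute()
>>> mm.free()          # release resources held by the stateful object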
Further examples can be found in the nvmath/distributed/examples/linalg/advanced/matmul directory.