matmul#

nvmath.distributed.linalg.advanced.matmul(
a,
b,
/,
c=None,
*,
distributions: Sequence[Distribution],
alpha=None,
beta=None,
epilog=None,
epilog_inputs=None,
qualifiers=None,
quantization_scales=None,
options=None,
preferences=None,
stream: AnyStream | int | None = None,
)[source]#

Perform the specified distributed matrix multiplication computation \(F(\alpha a @ b + \beta c)\), where \(F\) is the epilog. This function form is a wrapper around the stateful Matmul object APIs and is meant for single use (that is, when the user needs to perform just one matrix multiplication), in which case there is no possibility of amortizing preparatory costs.

Detailed information on what’s happening within this function can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.
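For instance, a dedicated logger can be configured and then supplied to MatmulOptions instead of relying on the root logger. Only the stdlib logging setup is shown below; the commented final line assumes the option is named `logger` (per the MatmulOptions documentation):

```python
import logging

# Create a dedicated logger instead of configuring the root logger.
logger = logging.getLogger("nvmath_matmul_demo")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)-8s %(message)s", datefmt="%m-%d %H:%M:%S")
)
logger.addHandler(handler)

# The logger would then be passed to MatmulOptions (option name assumed
# to be `logger`, per the MatmulOptions documentation):
#     options = nvmath.distributed.linalg.advanced.MatmulOptions(logger=logger)
```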

Parameters:
  • a – A distributed tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • b – A distributed tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • c – (Optional) A distributed tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • distributions – Sequence specifying the distribution across processes of matrices A, B, and C/D. The distributions must be BlockCyclic or compatible.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

  • epilog – Specify an epilog \(F\) as an object of type MatmulEpilog to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog. See cuBLASMp documentation for the list of available epilogs.

  • epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX.

  • qualifiers – Specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.

  • quantization_scales – Specify scale factors for the matrix multiplication as a MatmulQuantizationScales object. Alternatively, a dict containing the parameters for the MatmulQuantizationScales constructor can also be provided. The scale factors can be provided as scalars or tensors. If a scale factor is provided as a tensor, it must be from the same package and in the same memory space (CPU or GPU device) as the operands of the matmul. If a scale factor is provided as a scalar and the execution space is GPU, a CPU-to-GPU copy is required. To avoid this copy, provide the quantization scale as a one-element array on the GPU. Allowed and required only for narrow-precision (FP8 and lower) operations.

  • options – Specify options for the matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.

  • preferences – This parameter specifies the preferences for planning as a MatmulPlanPreferences object. Alternatively, a dictionary containing the parameters for the MatmulPlanPreferences constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulPlanPreferences object.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

Returns:

The result of the specified matrix multiplication (with the epilog applied), which remains on the same device and belongs to the same package as the input operands. If an epilog that produces extra output is used (such as nvmath.distributed.linalg.advanced.MatmulEpilog.RELU_AUX), or extra output is requested (for example, by setting the result_amax option in the options argument), a tuple is returned instead: the first element is the matrix multiplication result (with the epilog applied) and the second element is the auxiliary output, provided as a dict.

Semantics:

The semantics of the matrix multiplication follow numpy.matmul semantics, with some restrictions on broadcasting. In addition, the semantics of the fused matrix addition are described below:

  • For in-place matrix multiplication (where the result is written into c), the result has the same shape as c.
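To make the formula concrete, the computation \(F(\alpha a @ b + \beta c)\) can be sketched as a single-process, plain-Python reference. This performs no distribution and uses no nvmath APIs; `relu` stands in for an epilog \(F\):

```python
# Plain-Python, single-process reference of F(alpha * a @ b + beta * c),
# for illustration of the semantics only (no distribution, no nvmath calls).

def matmul_reference(a, b, c=None, alpha=1.0, beta=0.0, epilog=None):
    m, k = len(a), len(a[0])
    n = len(b[0])
    # alpha * a @ b
    r = [[alpha * sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    # + beta * c (fused matrix addition)
    if c is not None:
        r = [[r[i][j] + beta * c[i][j] for j in range(n)] for i in range(m)]
    # Apply the epilog F elementwise, if any.
    if epilog is not None:
        r = [[epilog(x) for x in row] for row in r]
    return r

def relu(x):
    return x if x > 0.0 else 0.0

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so a @ b == a
c = [[1.0, 1.0], [1.0, 1.0]]

# F(1.0 * a @ b + 0.5 * c) with F = ReLU:
r = matmul_reference(a, b, c, alpha=1.0, beta=0.5, epilog=relu)
```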

Narrow-precision support:

Matrix multiplication with narrow-precision operands is supported in both FP8 and MXFP8 formats.

Note

FP8 requires a device with compute capability 8.9 or higher (Ada, Hopper, Blackwell or newer architecture). MXFP8 requires a device with compute capability 10.0 or higher (Blackwell or newer architecture). Please refer to the compute capability table to check the compute capability of your device.

For FP8 operations:

  • For each operand, a scaling factor needs to be specified via the quantization_scales argument.

  • The maximum absolute value of the result (amax) can be requested via the result_amax option in the options argument.

  • A custom result type (both FP8 and non-FP8) can be requested via the result_type option in the options argument.

For MXFP8 operations:

  • To enable MXFP8 operations, the block_scaling option must be set to True.

  • Block scaling factors need to be specified via the quantization_scales argument.

  • Utilities in nvmath.distributed.linalg.advanced.helpers.matmul can be used to create and modify block scaling factors.

  • When MXFP8 is used and the result type is a narrow-precision data type, the auxiliary output "d_out_scale" will be returned in the auxiliary output tensor. It will contain the scales that were used for the result quantization.

Please refer to the examples and the narrow-precision operations tutorial for more details. cuBLASMp follows the cuBLAS specification and usage for FP8 and MXFP8 formats, scaling modes, scaling factor layouts, and so on. For more details, see the cublasLtMatmul documentation.
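As intuition for why per-operand scales are needed, a toy scaled-matmul can be sketched in plain Python. This mimics only the idea of stored-value-times-scale; the actual FP8/MXFP8 data formats and scaling-factor layouts are defined by cuBLAS/cuBLASMp and are not reproduced here:

```python
# Toy illustration of scaled ("quantized") matmul in plain Python. Real FP8
# uses hardware 8-bit float formats; here we only mimic the idea of one
# scale factor per operand: stored = value / scale, so value = stored * scale.

def quantize(matrix, scale):
    # Store values divided by the scale (rounded, to mimic low precision).
    return [[round(x / scale, 2) for x in row] for row in matrix]

def scaled_matmul(qa, qb, scale_a, scale_b):
    # Dequantize on the fly: (qa * scale_a) @ (qb * scale_b).
    k = len(qb)
    n = len(qb[0])
    return [
        [scale_a * scale_b * sum(qa[i][p] * qb[p][j] for p in range(k))
         for j in range(n)]
        for i in range(len(qa))
    ]

a = [[10.0, 20.0], [30.0, 40.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity
scale_a, scale_b = 10.0, 1.0
qa, qb = quantize(a, scale_a), quantize(b, scale_b)
r = scaled_matmul(qa, qb, scale_a, scale_b)

# amax: maximum absolute value of the result (cf. the result_amax option).
amax = max(abs(x) for row in r for x in row)
```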

Examples

>>> import cupy as cp
>>> import nvmath.distributed
>>> from nvmath.distributed.distribution import Slab

Get process group used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the Matmul examples in nvmath/examples/distributed/linalg/advanced):

>>> process_group = nvmath.distributed.get_context().process_group

Get my process rank:

>>> rank = process_group.rank

Create three float32 ndarrays on the GPU:

>>> M, N, K = 128, 64, 256
>>> a_shape = Slab.X.shape(rank, (M, K))
>>> b_shape = Slab.Y.shape(rank, (K, N))
>>> c_shape = Slab.X.shape(rank, (M, N))
>>> device_id = nvmath.distributed.get_context().device_id
>>> with cp.cuda.Device(device_id):
...     a = cp.asfortranarray(cp.random.rand(*a_shape, dtype=cp.float32))
...     b = cp.asfortranarray(cp.random.rand(*b_shape, dtype=cp.float32))
...     c = cp.asfortranarray(cp.random.rand(*c_shape, dtype=cp.float32))
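For intuition on why a_shape above differs per rank: a slab distribution splits the global matrix along one axis across processes. A hypothetical even-split helper (the exact partitioning rule used by Slab is defined by nvmath and may differ) could look like:

```python
# Hypothetical sketch of slab partitioning: split a global matrix along one
# axis across nranks processes. The exact rule used by nvmath's Slab
# distribution may differ; this is intuition only.

def slab_shape(rank, nranks, global_shape, axis):
    dim = global_shape[axis]
    base, rem = divmod(dim, nranks)
    local = base + (1 if rank < rem else 0)  # early ranks take the remainder
    shape = list(global_shape)
    shape[axis] = local
    return tuple(shape)

M, K = 128, 256
nranks = 4
# Slab.X-style split: along the first axis (rows).
shapes = [slab_shape(r, nranks, (M, K), axis=0) for r in range(nranks)]
```

Each rank then allocates only its local slab, which is why the example queries Slab.X.shape(rank, ...) rather than using the global shape directly.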

Perform the operation \(\alpha A @ B + \beta C\) using matmul(). The result r is also a CuPy float32 ndarray:

>>> distributions = [Slab.X, Slab.Y, Slab.X]
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, c, alpha=1.23, beta=0.74, distributions=distributions
... )

Options can be provided to customize the operation:

>>> compute_type = (
...     nvmath.distributed.linalg.advanced.MatmulComputeType.COMPUTE_32F_FAST_TF32
... )
>>> o = nvmath.distributed.linalg.advanced.MatmulOptions(compute_type=compute_type)
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, distributions=distributions, options=o
... )

See MatmulOptions for the complete list of available options.

The operand package's current stream is used by default, but a stream can be explicitly provided to the matmul operation. This is needed, for example, if the operands are computed on a different stream:

>>> with cp.cuda.Device(device_id):
...     s = cp.cuda.Stream()
...     with s:
...         a = cp.asfortranarray(cp.random.rand(*a_shape))
...         b = cp.asfortranarray(cp.random.rand(*b_shape))
>>> r = nvmath.distributed.linalg.advanced.matmul(
...     a, b, distributions=distributions, stream=s
... )

The operation above runs on stream s and is ordered with respect to the input computation.

Create NumPy ndarrays on the CPU:

>>> import numpy as np
>>> a = np.asfortranarray(np.random.rand(*a_shape))
>>> b = np.asfortranarray(np.random.rand(*b_shape))

Provide the NumPy ndarrays to matmul(), with the result also being a NumPy ndarray:

>>> r = nvmath.distributed.linalg.advanced.matmul(a, b, distributions=distributions)

Notes

  • This function is a convenience wrapper around Matmul and is specifically meant for single use.

Further examples can be found in the nvmath/distributed/examples/linalg/advanced/matmul directory.