matmul

nvmath.linalg.advanced.matmul(a, b, /, c=None, *, alpha=None, beta=None, epilog=None, epilog_inputs=None, qualifiers=None, quantization_scales=None, options=None, preferences=None, algorithm=None, stream=None)
Perform the specified matrix multiplication computation \(F(\alpha a @ b + \beta c)\), where \(F\) is the epilog. This function-form is a wrapper around the stateful Matmul object APIs and is meant for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs.

Detailed information on what’s happening within this function can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )
A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.

Parameters:
- a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

- b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

- c – (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  Changed in version 0.3.0: In order to avoid broadcasting behavior ambiguity, nvmath-python no longer accepts a 1-D (vector) c. Use a singleton dimension to convert your input array to 2-D.

- alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

- beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

- epilog – Specify an epilog \(F\) as an object of type MatmulEpilog to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog. See the cuBLASLt documentation for the list of available epilogs.

- epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX. A sketch of passing an epilog input is shown after this parameter list.

- qualifiers – If desired, specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c.

- options – Specify options for the matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.

- preferences – This parameter specifies the preferences for planning as a MatmulPlanPreferences object. Alternatively, a dictionary containing the parameters for the MatmulPlanPreferences constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulPlanPreferences object.

- algorithm – An object of type Algorithm can be provided directly to bypass planning, if desired. The algorithm object must be compatible with the matrix multiplication. A typical use for this option is to provide an algorithm that has been serialized (pickled) from a previously planned and autotuned matrix multiplication.

- stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

- quantization_scales – Specify scale factors for the matrix multiplication as a MatmulQuantizationScales object. Alternatively, a dict containing the parameters for the MatmulQuantizationScales constructor can also be provided. Allowed and required only for narrow-precision (FP8 and lower) operations.
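As a hedged illustration of epilog_inputs (not taken from the original page), the sketch below adds a per-row bias with MatmulEpilog.BIAS. The epilog input name ("bias") and its (M, 1) shape are assumptions here; if they do not match, the raised exception lists the required epilog inputs:

>>> import cupy as cp
>>> import nvmath
>>> M, N, K = 128, 64, 256
>>> a = cp.random.rand(M, K, dtype=cp.float32)
>>> b = cp.random.rand(K, N, dtype=cp.float32)
>>> bias = cp.random.rand(M, 1, dtype=cp.float32)  # assumed input name "bias" and shape (M, 1)
>>> r = nvmath.linalg.advanced.matmul(
...     a, b, epilog=nvmath.linalg.advanced.MatmulEpilog.BIAS, epilog_inputs={"bias": bias}
... )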
Returns:
The result of the specified matrix multiplication (epilog applied), which remains on the same device and belongs to the same package as the input operands. If an epilog that results in extra output is used (like nvmath.linalg.advanced.MatmulEpilog.RELU_AUX), or an extra output is requested (for example, by setting the result_amax option in the options argument), a tuple is returned whose first element is the matrix multiplication result (epilog applied) and whose second element is the auxiliary output provided as a dict.
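For instance, the hedged sketch below requests the auxiliary ReLU output via MatmulEpilog.RELU_AUX and unpacks the returned tuple; the key names inside the auxiliary dict depend on the epilog and are not assumed here:

>>> # Reusing the CuPy operands a and b from the sketch above.
>>> epilog = nvmath.linalg.advanced.MatmulEpilog.RELU_AUX
>>> result, aux = nvmath.linalg.advanced.matmul(a, b, epilog=epilog)
>>> # aux is a dict of auxiliary output tensors; the key names depend on the epilog.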
Semantics:
The semantics of the matrix multiplication follow numpy.matmul() semantics, with some restrictions on broadcasting. In addition, the semantics for the fused matrix addition are described below:

- If arguments a and b are matrices, they are multiplied according to the rules of matrix multiplication.
- If argument a is 1-D, it is promoted to a matrix by prefixing 1 to its dimensions. After matrix multiplication, the prefixed 1 is removed from the result’s dimensions.
- If argument b is 1-D, it is promoted to a matrix by appending 1 to its dimensions. After matrix multiplication, the appended 1 is removed from the result’s dimensions.
- If a or b is N-D (N > 2), the operand is treated as a batch of matrices. If both a and b are N-D, their batch dimensions must match. If exactly one of a or b is N-D, the other operand is broadcast.
- The operand for the matrix addition c may be a matrix of shape (M, 1) or (M, N), or the batched versions (…, M, 1) or (…, M, N). Here M and N are the dimensions of the result of the matrix multiplication. If N = 1, the columns of c are broadcast for the addition; the rows of c are never broadcast. If batch dimensions are not present, c is broadcast across batches as needed. A sketch of this column broadcast is shown after this list.
- Similarly, when operating on a batch, auxiliary outputs are 3-D for all epilogs. Therefore, epilogs that return 1-D vectors of length N in non-batched mode return 3-D matrices of size (batch, N, 1) in batched mode.
Narrow-precision support:
Matrix multiplication with narrow-precision operands is supported in both FP8 and MXFP8 formats.
Note
Narrow-precision matrix multiplication in nvmath-python requires CUDA Toolkit 12.8 or newer. FP8 requires a device with compute capability 8.9 or higher (Ada, Hopper, Blackwell or newer architecture). MXFP8 requires a device with compute capability 10.0 or higher (Blackwell or newer architecture). Please refer to the compute capability table to check the compute capability of your device.
For FP8 operations:

- For each operand, a scaling factor needs to be specified via the quantization_scales argument (see the sketch after this list).
- The maximum absolute value of the result (amax) can be requested via the result_amax option in the options argument.
- A custom result type (both FP8 and non-FP8) can be requested via the result_type option in the options argument.
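A hedged FP8 sketch, not taken from the original page: it assumes CUDA Toolkit 12.8+, a device of compute capability 8.9 or higher, PyTorch operands cast to torch.float8_e4m3fn, and that the MatmulQuantizationScales fields are named a, b, and d. cuBLASLt additionally imposes layout and dimension constraints on FP8 operands, so small multiples of 16 are used for the shapes here:

>>> import torch
>>> import nvmath
>>> m, n, k = 64, 32, 16
>>> a = torch.rand(m, k, device="cuda").to(torch.float8_e4m3fn)
>>> b = torch.rand(k, n, device="cuda").to(torch.float8_e4m3fn)
>>> # Per-operand quantization scales; the field names a, b, d are an assumption here.
>>> scales = {"a": 1.0, "b": 1.0, "d": 1.0}
>>> r8 = nvmath.linalg.advanced.matmul(a, b, quantization_scales=scales)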
For MXFP8 operations:

- To enable MXFP8 operations, the block_scaling option must be set to True.
- Block scaling factors need to be specified via the quantization_scales argument.
- Utilities in nvmath.linalg.advanced.helpers.matmul can be used to create and modify block scaling factors.
- When MXFP8 is used and the result type is a narrow-precision data type, the auxiliary output "d_out_scale" will be returned in the auxiliary output tensor. It will contain the scales that were used for the result quantization.
Please refer to the examples and the narrow-precision operations tutorial for more details. For information on the FP8 and MXFP8 formats in cuBLAS, see the cublasLtMatmul documentation.
Examples
>>> import cupy as cp
>>> import nvmath
Create three float32 ndarrays on the GPU:
>>> M, N, K = 128, 64, 256
>>> a = cp.random.rand(M, K, dtype=cp.float32)
>>> b = cp.random.rand(K, N, dtype=cp.float32)
>>> c = cp.random.rand(M, N, dtype=cp.float32)
Perform the operation \(\alpha A @ B + \beta C\) using matmul(). The result r is also a CuPy float32 ndarray:

>>> r = nvmath.linalg.advanced.matmul(a, b, c, alpha=1.23, beta=0.74)
An epilog can be used as well. Here we perform \(RELU(\alpha A @ B + \beta C)\):
>>> epilog = nvmath.linalg.advanced.MatmulEpilog.RELU
>>> r = nvmath.linalg.advanced.matmul(a, b, c, alpha=1.23, beta=0.74, epilog=epilog)
Options can be provided to customize the operation:
>>> compute_type = nvmath.linalg.advanced.MatmulComputeType.COMPUTE_32F_FAST_TF32
>>> o = nvmath.linalg.advanced.MatmulOptions(compute_type=compute_type)
>>> r = nvmath.linalg.advanced.matmul(a, b, options=o)
See MatmulOptions for the complete list of available options.

The package current stream is used by default, but a stream can be explicitly provided to the Matmul operation. This can be done if the operands are computed on a different stream, for example:
>>> s = cp.cuda.Stream()
>>> with s:
...     a = cp.random.rand(M, K)
...     b = cp.random.rand(K, N)
>>> r = nvmath.linalg.advanced.matmul(a, b, stream=s)
The operation above runs on stream s and is ordered with respect to the input computation.

Create NumPy ndarrays on the CPU.
>>> import numpy as np
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)
Provide the NumPy ndarrays to matmul(), with the result also being a NumPy ndarray:

>>> r = nvmath.linalg.advanced.matmul(a, b)
Notes
This function is a convenience wrapper around Matmul and is specifically meant for single use.
Further examples can be found in the nvmath/examples/linalg/advanced/matmul directory.