matmul

nvmath.linalg.advanced.matmul(a, b, /, c=None, *, alpha=None, beta=None, epilog=None, epilog_inputs=None, qualifiers=None, quantization_scales=None, options=None, preferences=None, algorithm=None, stream: AnyStream | int | None = None)
Perform the specified matrix multiplication computation \(F(\alpha a @ b + \beta c)\), where \(F\) is the epilog. This function-form API is a wrapper around the stateful `Matmul` object APIs and is meant for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs.

Detailed information on what's happening within this function can be obtained by passing in a `logging.Logger` object to `MatmulOptions` or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python `logging` module.

- Parameters:
a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are `numpy.ndarray`, `cupy.ndarray`, and `torch.Tensor`.

b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are `numpy.ndarray`, `cupy.ndarray`, and `torch.Tensor`.

c – (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are `numpy.ndarray`, `cupy.ndarray`, and `torch.Tensor`.

Changed in version 0.3.0: To avoid ambiguous broadcasting behavior, nvmath-python no longer accepts a 1-D (vector) `c`. Use a singleton dimension to convert your input array to 2-D.

alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

beta – The scale factor for the matrix addition term as a real or complex number. A value for `beta` must be provided if operand `c` is specified.

epilog – Specify an epilog \(F\) as an object of type `MatmulEpilog` to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog. See the cuBLASLt documentation for the list of available epilogs.

epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor from the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for `MatmulEpilog.DRELU` is generated by a matrix multiplication with the same operands using `MatmulEpilog.RELU_AUX`.

qualifiers – If desired, specify the matrix qualifiers as a `numpy.ndarray` of `matrix_qualifiers_dtype` objects of length 3 corresponding to the operands `a`, `b`, and `c`. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.

quantization_scales – Specify scale factors for the matrix multiplication as a `MatmulQuantizationScales` object. Alternatively, a `dict` containing the parameters for the `MatmulQuantizationScales` constructor can also be provided. The scale factors can be provided as scalars or tensors. If a scale factor is provided as a tensor, it must be from the same package and in the same memory space (CPU or GPU device) as the operands of the matmul. If a scale factor is provided as a scalar and the execution space is GPU, a CPU-to-GPU copy is inevitable. To avoid this copy, provide the quantization scale as a one-element array on the GPU. Allowed and required only for narrow-precision (FP8 and lower) operations.

options – Specify options for the matrix multiplication as a `MatmulOptions` object. Alternatively, a `dict` containing the parameters for the `MatmulOptions` constructor can also be provided. If not specified, the value will be set to the default-constructed `MatmulOptions` object.

preferences – This parameter specifies the preferences for planning as a `MatmulPlanPreferences` object. Alternatively, a `dict` containing the parameters for the `MatmulPlanPreferences` constructor can also be provided. If not specified, the value will be set to the default-constructed `MatmulPlanPreferences` object.

algorithm – An object of type `Algorithm` can be directly provided to bypass planning, if desired. The algorithm object must be compatible with the matrix multiplication. A typical use for this option is to provide an algorithm that has been serialized (pickled) from a previously planned and autotuned matrix multiplication.

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include `cudaStream_t` (as Python `int`), `cupy.cuda.Stream`, and `torch.cuda.Stream`. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
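As a reference for the shapes and math involved, the computation \(F(\alpha a @ b + \beta c)\) can be modeled in a few lines of NumPy. This is an illustrative model of the math only, not how nvmath executes it, and `matmul_reference` is a hypothetical name, not an nvmath API:

```python
import numpy as np

def matmul_reference(a, b, c=None, alpha=1.0, beta=0.0, epilog=None):
    """NumPy model of F(alpha * a @ b + beta * c); `epilog` is a callable F."""
    r = alpha * (a @ b)
    if c is not None:
        r = r + beta * c
    return epilog(r) if epilog is not None else r

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.eye(2)
c = np.full((2, 2), -10.0)

relu = lambda x: np.maximum(x, 0.0)
r = matmul_reference(a, b, c, alpha=1.0, beta=1.0, epilog=relu)
# Every entry of a @ b + c is negative here, so the ReLU epilog clamps
# the whole result to zero.
assert np.array_equal(r, np.zeros((2, 2)))
```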
- Returns:
The result of the specified matrix multiplication (epilog applied), which remains on the same device and belongs to the same package as the input operands. If an epilog that produces extra output (like `nvmath.linalg.advanced.MatmulEpilog.RELU_AUX`) is used, or an extra output is requested (for example by setting the `result_amax` option in the `options` argument), a tuple is returned with the first element being the matrix multiplication result (epilog applied) and the second element being the auxiliary output provided as a `dict`.
- Semantics:
The semantics of the matrix multiplication follow `numpy.matmul` semantics, with some restrictions on broadcasting. In addition, the semantics for the fused matrix addition are described below.

Note

For narrow-precision formats (FP8, MXFP8, NVFP4), some of the rules below are restricted; see the narrow-precision section for details.

- For in-place matrix multiplication (where the result is written into `c`), the result has the same shape as `c`.
- If arguments `a` and `b` are matrices, they are multiplied according to the rules of matrix multiplication.
- If argument `a` is 1-D, it is promoted to a matrix by prefixing 1 to its dimensions. After matrix multiplication, the prefixed 1 is removed from the result's dimensions if the operation is not in-place.
- If argument `b` is 1-D, it is promoted to a matrix by appending 1 to its dimensions. After matrix multiplication, the appended 1 is removed from the result's dimensions if the operation is not in-place.
- If `a` or `b` is N-D (N > 2), the operand is treated as a batch of matrices. If both `a` and `b` are N-D, their batch dimensions must match. If exactly one of `a` or `b` is N-D, the other operand is broadcast.
- The operand for the matrix addition `c` may be a matrix of shape (M, 1) or (M, N), or the batched versions (..., M, 1) or (..., M, N). Here M and N are the dimensions of the result of the matrix multiplication. If N = 1, the columns of `c` are broadcast for the addition; the rows of `c` are never broadcast. If batch dimensions are not present, `c` is broadcast across batches as needed. If the operation is in-place, `c` cannot be broadcast since it must be large enough to hold the result.
- Similarly, when operating on a batch, auxiliary outputs are 3-D for all epilogs. Therefore, epilogs that return 1-D vectors of length N in non-batched mode return 3-D matrices of size (batch, N, 1) in batched mode.
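Because the rules above follow `numpy.matmul`, the promotion and broadcasting behavior can be checked directly in pure NumPy (no GPU required; the sizes are arbitrary illustrative values):

```python
import numpy as np

M, N, K, batch = 4, 3, 5, 2

# A 1-D `a` is promoted to (1, K); the prefixed 1 is removed from the result.
a_vec = np.random.rand(K)
b_mat = np.random.rand(K, N)
assert np.matmul(a_vec, b_mat).shape == (N,)

# If exactly one operand is batched, the other is broadcast across the batch.
a_batched = np.random.rand(batch, M, K)
assert np.matmul(a_batched, b_mat).shape == (batch, M, N)

# An (M, 1) addend `c` broadcasts its single column across the N result
# columns, and across the batch when batch dimensions are absent.
c = np.random.rand(M, 1)
r = np.matmul(a_batched, b_mat) + c
assert r.shape == (batch, M, N)
```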
For narrow-precision operations (FP8 and lower), further restrictions apply; see the narrow-precision support section below.
- Narrow-precision support:
Matrix multiplication with narrow-precision operands is supported in FP8, MXFP8, and NVFP4 formats.

FP8 and MXFP8

FP8 and MXFP8 use the `float8_e4m3fn` or `float8_e5m2` data types. The difference is the scaling mode: FP8 (`block_scaling=False`) uses per-tensor scaling, where a single scalar scale is applied to each operand; MXFP8 (`block_scaling=True`) uses microscaling with 32-element blocks arranged in 128x128 tiles.

Note
FP8 and MXFP8 matrix multiplication requires CUDA Toolkit 12.8 or newer. FP8 requires a device with compute capability 8.9 or higher (Ada, Hopper, Blackwell or newer architecture). MXFP8 requires a device with compute capability 10.0 or higher (Blackwell or newer architecture). Please refer to the compute capability table to check the compute capability of your device.
For FP8 operations:

- For each operand, a scaling factor needs to be specified via the `quantization_scales` argument.
- The maximum absolute value of the result (amax) can be requested via the `result_amax` option in the `options` argument.
- A custom result type (both FP8 and non-FP8) can be requested via the `result_type` option in the `options` argument.
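The idea of per-tensor scaling can be sketched in pure NumPy. The example below scales by amax against 448, the largest finite `float8_e4m3fn` value; this is a common convention shown for illustration only, not nvmath's internal algorithm:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def per_tensor_scale(x):
    """Pick one scalar scale mapping the whole tensor into the FP8 range."""
    amax = np.abs(x).max()  # analogous to what the `result_amax` option reports
    return FP8_E4M3_MAX / amax

x = np.array([10.0, -896.0, 100.0])
s = per_tensor_scale(x)
scaled = x * s  # all values now fit in the representable FP8 range
assert s == 0.5
assert np.abs(scaled).max() == FP8_E4M3_MAX
```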
For MXFP8 operations:

- 1-D (vector) operands are not supported. Both `a` and `b` must be at least 2-D matrices.
- Broadcasting of batch dimensions is not supported. The batch shapes of `a` and `b` must match exactly.
- All operand dimensions (M, N, K) must be multiples of 128.
- The `block_scaling` option must be set to `True`, and block scaling factors need to be specified via the `quantization_scales` argument. Utilities in `nvmath.linalg.advanced.helpers.matmul` can be used to create and modify block scaling factors; see e.g. `create_mxfp8_scale()`.
- When the result type is a narrow-precision data type, the auxiliary output `"d_out_scale"` will be returned, containing the scales used for result quantization.
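The shape constraints above can be sketched as a small checker. This is a hypothetical helper for illustration only, not part of nvmath; real block-scale tensors should be created with `create_mxfp8_scale()`:

```python
def check_mxfp8_shapes(a_shape, b_shape):
    """Validate MXFP8 constraints: >= 2-D, matching batches, dims % 128 == 0."""
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("1-D operands are not supported for MXFP8")
    if a_shape[:-2] != b_shape[:-2]:
        raise ValueError("batch dimensions must match exactly (no broadcasting)")
    m, k = a_shape[-2], a_shape[-1]
    k2, n = b_shape[-2], b_shape[-1]
    if k != k2:
        raise ValueError("contracting dimensions of a and b do not match")
    if any(d % 128 for d in (m, n, k)):
        raise ValueError("M, N, and K must all be multiples of 128")
    # One scale per 32-element block along K, per row/column of each operand.
    return (m * k) // 32, (k * n) // 32  # block-scale counts for a and b

assert check_mxfp8_shapes((2, 128, 256), (2, 256, 384)) == (1024, 3072)
```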
Layout Requirements
Due to the requirements of narrow-precision GEMM kernels, the contracting dimension K must be contiguous (stride-1) for both operands. The following layout constraints apply to both FP8 and MXFP8:

- Operand `a` must be `(..., M, K)` with `stride[-1] == 1` and `stride[-2] >= K` (row-major). The leading dimension (`stride[-2]`) can be larger than `K` to support sliced or padded views.
- Operand `b` must be `(..., K, N)` with `stride[-2] == 1` and `stride[-1] >= K` (column-major). The leading dimension (`stride[-1]`) can be larger than `K` to support sliced or padded views.
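These constraints can be verified against the element strides of the operands. NumPy reports strides in bytes, so the sketch below divides by `itemsize`:

```python
import numpy as np

def elem_strides(x):
    """Strides in elements rather than bytes."""
    return tuple(s // x.itemsize for s in x.strides)

K, M, N = 64, 32, 16
a = np.zeros((M, K), dtype=np.float32)              # row-major: K is stride-1
b = np.zeros((K, N), dtype=np.float32, order="F")   # column-major: K is stride-1

assert elem_strides(a)[-1] == 1 and elem_strides(a)[-2] >= K
assert elem_strides(b)[-2] == 1 and elem_strides(b)[-1] >= K

# A sliced view keeps K contiguous while the leading dimension exceeds K.
a_sliced = np.zeros((M, 2 * K), dtype=np.float32)[:, :K]
assert elem_strides(a_sliced) == (2 * K, 1)
```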
Attention
Epilog support for MXFP8 is still evolving in the underlying cuBLASLt library, so not every combination of epilog, data type, and layout is guaranteed to work. If you run into an unsupported combination, a cuBLASLt error revealing the root cause will be raised at either planning time or execution time. These gaps are expected to be filled in future cuBLASLt releases.
For more details on the FP8 and MXFP8 formats in cuBLAS, see the cublasLtMatmul documentation.
NVFP4
Added in version 1.0: NVFP4 support.
NVFP4 uses the `float4_e2m1fn_x2` data type with block scaling (16-element blocks arranged in 128x64 tiles).

Note

NVFP4 matrix multiplication currently requires CUDA Toolkit 12.8 or newer, a device with compute capability 10.0 or higher (Blackwell or newer architecture), and PyTorch 2.9 or newer for `float4_e2m1fn_x2` dtype support. Please refer to the compute capability table to check the compute capability of your device.

For NVFP4 operations:
- 1-D (vector) operands are not supported. Both `a` and `b` must be at least 2-D matrices.
- Broadcasting of batch dimensions is not supported. The batch shapes of `a` and `b` must match exactly.
- The outer dimensions of `a` and `b` (M and N) must be multiples of 128, and the contracting dimension K must be a multiple of 64.
- The `block_scaling` option must be set to `True`, and block scaling factors need to be specified via the `quantization_scales` argument.
- When the result type is a narrow-precision data type, the auxiliary output `"d_out_scale"` will be returned, containing the scales used for result quantization.
Layout and Packing Requirements
FP4 data is per-byte packed: `float4_e2m1fn_x2` stores 2 FP4 values per byte. The block scaling (VEC16_UE4M3) assigns one scale factor per 16 consecutive elements along the innermost (stride-1) dimension of each operand. The layout requirements below ensure that this innermost dimension corresponds to the contracting dimension K for both operands.

- Operand `a` must be `(..., M, K//2)` with `stride[-1] == 1` and `stride[-2] >= K//2`, i.e., row-wise packed along K. Note that the leading dimension (`stride[-2]`) can be larger than `K//2` to support sliced views, as long as the stride remains 16-byte aligned.
- Operand `b` must be `(..., K//2, N)` with `stride[-2] == 1` and `stride[-1] >= K//2`, i.e., column-wise packed along K. Note that the leading dimension (`stride[-1]`) can be larger than `K//2` to support sliced views, as long as the stride remains 16-byte aligned.
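The two-values-per-byte packing can be illustrated with raw 4-bit codes in NumPy. This is illustration only: the nibble order below is an arbitrary choice, and encoding real float values to FP4 is the job of helpers such as `quantize_to_fp4()`:

```python
import numpy as np

def pack_fp4_pairs(codes):
    """Pack pairs of 4-bit codes along the last axis into one byte each."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.shape[-1] % 2 == 0, "packed dimension must be even"
    lo, hi = codes[..., 0::2], codes[..., 1::2]
    return (lo | (hi << 4)).astype(np.uint8)  # low nibble first (arbitrary)

def unpack_fp4_pairs(packed):
    """Inverse of pack_fp4_pairs: one byte back into two 4-bit codes."""
    lo, hi = packed & 0xF, packed >> 4
    return np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)

codes = np.arange(8, dtype=np.uint8).reshape(2, 4)  # 4-bit values < 16
packed = pack_fp4_pairs(codes)
assert packed.shape == (2, 2)  # (M, K) -> (M, K//2), as in the a-operand layout
assert np.array_equal(unpack_fp4_pairs(packed), codes)
```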
If your data has the stride-1 axis along a dimension other than K, you must repack it before calling `matmul()`.

When the result type is also FP4, the output is packed along a dimension that depends on the result layout order:

- Row-major result: packed along N, with shape `(..., M, N//2)` and strides `(..., N//2, 1)`.
- Column-major result: packed along M, with shape `(..., M//2, N)` and strides `(..., 1, M//2)`.
The result layout order is determined by the following priority:

1. If `c` is provided, the result inherits `c`'s layout order.
2. Otherwise, if the epilog requests a specific layout, that layout is used.
3. Otherwise, the result inherits `a`'s layout order as a fallback.
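These priority rules can be sketched as a small helper (hypothetical, for illustration only; nvmath resolves the result layout internally):

```python
def result_layout_order(c_order=None, epilog_order=None, a_order="row"):
    """Resolve the FP4 result layout: c's order, else the epilog's, else a's."""
    if c_order is not None:        # 1. inherit from c when it is provided
        return c_order
    if epilog_order is not None:   # 2. honor an epilog-requested layout
        return epilog_order
    return a_order                 # 3. fall back to a's layout order

assert result_layout_order(c_order="col", epilog_order="row") == "col"
assert result_layout_order(epilog_order="row", a_order="col") == "row"
assert result_layout_order() == "row"
```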
Epilog Support
NVFP4 matmul supports epilogs. The following have been verified:

- `RELU`, `GELU` – with both row-major and column-major output.
- `BIAS`, `RELU_BIAS`, `GELU_BIAS` – with column-major output only (`BIAS` with `float16` C/D requires cuBLASLt >= 13.0).
Attention
Epilog support for NVFP4 is still evolving in the underlying cuBLASLt library, so not every combination of epilog, data type, and layout is guaranteed to work. If you run into an unsupported combination, a cuBLASLt error revealing the root cause will be raised at either planning time or execution time. These gaps are expected to be filled in future cuBLASLt releases.
Helper Functions
The `nvmath.linalg.advanced.helpers.matmul` module provides helpers for working with FP4 encoding/decoding and NVFP4 block scales; see e.g. `quantize_to_fp4()`, `unpack_fp4()`, `get_block_scale_offset()`, `to_block_scale()`, `expand_block_scale()`.

For more details on the NVFP4 format in cuBLAS, see the cublasLtMatmul documentation. For usage examples, see the relevant files in the examples/linalg/advanced/matmul directory.
Examples
>>> import cupy as cp
>>> import nvmath
Create three float32 ndarrays on the GPU:
>>> M, N, K = 128, 64, 256
>>> a = cp.random.rand(M, K, dtype=cp.float32)
>>> b = cp.random.rand(K, N, dtype=cp.float32)
>>> c = cp.random.rand(M, N, dtype=cp.float32)
Perform the operation \(\alpha A @ B + \beta C\) using `matmul()`. The result `r` is also a CuPy float32 ndarray:

>>> r = nvmath.linalg.advanced.matmul(a, b, c, alpha=1.23, beta=0.74)
An epilog can be used as well. Here we perform \(RELU(\alpha A @ B + \beta C)\):
>>> epilog = nvmath.linalg.advanced.MatmulEpilog.RELU
>>> r = nvmath.linalg.advanced.matmul(a, b, c, alpha=1.23, beta=0.74, epilog=epilog)
Options can be provided to customize the operation:
>>> compute_type = nvmath.linalg.advanced.MatmulComputeType.COMPUTE_32F_FAST_TF32
>>> o = nvmath.linalg.advanced.MatmulOptions(compute_type=compute_type)
>>> r = nvmath.linalg.advanced.matmul(a, b, options=o)
See `MatmulOptions` for the complete list of available options.

The package current stream is used by default, but a stream can be explicitly provided to the Matmul operation. This can be done if the operands are computed on a different stream, for example:
>>> s = cp.cuda.Stream()
>>> with s:
...     a = cp.random.rand(M, K)
...     b = cp.random.rand(K, N)
>>> r = nvmath.linalg.advanced.matmul(a, b, stream=s)
The operation above runs on stream `s` and is ordered with respect to the input computation.

Create NumPy ndarrays on the CPU:
>>> import numpy as np
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)
Provide the NumPy ndarrays to `matmul()`, with the result also being a NumPy ndarray:

>>> r = nvmath.linalg.advanced.matmul(a, b)
Notes
This function is a convenience wrapper around `Matmul` and is specifically meant for single use.
Further examples can be found in the nvmath/examples/linalg/advanced/matmul directory.