Matmul#
class nvmath.distributed.linalg.advanced.Matmul(
    a,
    b,
    /,
    c=None,
    *,
    distributions: Sequence[Distribution],
    alpha=None,
    beta=None,
    qualifiers=None,
    quantization_scales=None,
    options=None,
    stream: AnyStream | int | None = None,
)
Create a stateful object encapsulating the specified distributed matrix multiplication computation \(\alpha a @ b + \beta c\) and the required resources to perform the operation. A stateful object can be used to amortize the cost of preparation (planning in the case of matrix multiplication) across multiple executions (also see the Stateful APIs section).
The function-form API matmul() is a convenient alternative to using stateful objects for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs. The function-form APIs are just convenience wrappers around the stateful object APIs.

Using the stateful object typically involves the following steps:
1. Problem Specification: Initialize the object with a defined operation and options.
2. Preparation: Use plan() to determine the best algorithmic implementation for this specific matrix multiplication operation.
3. Execution: Perform the matrix multiplication computation with execute().
4. Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.
Detailed information on what's happening in the various phases described above can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )
A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.

- Parameters:
a – A distributed tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

b – A distributed tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

c – (Optional) A distributed tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

distributions – Sequence specifying the distribution across processes of matrices A, B, and C/D. The distributions need to be BlockCyclic or compatible.

alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

qualifiers – Specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.

quantization_scales – Specify scale factors for the matrix multiplication as a MatmulQuantizationScales object. Alternatively, a dict containing the parameters for the MatmulQuantizationScales constructor can also be provided. The scale factors can be provided as scalars or tensors. If a scale factor is provided as a tensor, it must be from the same package and in the same memory space (CPU or GPU device) as the operands of the matmul. If a scale factor is provided as a scalar and the execution space is GPU, a CPU-to-GPU copy is unavoidable; to avoid this copy, provide the quantization scale as a one-element array on the GPU. Allowed and required only for narrow-precision (FP8 and lower) operations.

options – Specify options for the matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
- Semantics:
The semantics of the matrix multiplication follow numpy.matmul semantics, with some restrictions on broadcasting. In addition, the semantics for the fused matrix addition are described below:

For in-place matrix multiplication (where the result is written into c), the result has the same shape as c.
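On a single process, the fused computation \(\alpha a @ b + \beta c\) (including the in-place case where the result overwrites c) can be checked against a plain NumPy reference. This is an illustrative sketch only; it does not use the distributed API:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 5
a = rng.random((m, k))
b = rng.random((k, n))
c = rng.random((m, n))
alpha, beta = 2.0, 0.5

# Reference for alpha * (a @ b) + beta * c.
expected = alpha * (a @ b) + beta * c

# In-place variant: the result is written into c, so it has c's shape.
c *= beta
c += alpha * (a @ b)

assert c.shape == (m, n)
assert np.allclose(c, expected)
```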
- Narrow-precision support:
Matrix multiplication with narrow-precision operands is supported in both FP8 and MXFP8 formats.
Note
FP8 requires a device with compute capability 8.9 or higher (Ada, Hopper, Blackwell or newer architecture). MXFP8 requires a device with compute capability 10.0 or higher (Blackwell or newer architecture). Please refer to the compute capability table to check the compute capability of your device.
For FP8 operations:

- For each operand, a scaling factor needs to be specified via the quantization_scales argument.
- The maximum absolute value of the result (amax) can be requested via the result_amax option in the options argument.
- A custom result type (both FP8 and non-FP8) can be requested via the result_type option in the options argument.
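The arithmetic behind per-tensor quantization scales can be illustrated with a NumPy sketch. This emulates only the scaling; it is not the FP8 storage format, rounding, or the cuBLASMp API:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 8))
b = rng.standard_normal((8, 8))

FP8_E4M3_MAX = 448.0  # largest representable magnitude in E4M3

# Per-tensor dequantization scales, chosen so the scaled values
# span the representable range.
scale_a = np.abs(a).max() / FP8_E4M3_MAX
scale_b = np.abs(b).max() / FP8_E4M3_MAX

# "Quantize" (here: just divide; real FP8 also rounds to 8-bit values).
a_q = a / scale_a
b_q = b / scale_b

# The matmul of quantized operands is rescaled by the product of scales.
d = scale_a * scale_b * (a_q @ b_q)

# result_amax corresponds to the maximum absolute value of the result.
amax = np.abs(d).max()

# Without quantization rounding, the rescaled product matches the reference.
assert np.allclose(d, a @ b)
```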
For MXFP8 operations:

- To enable MXFP8 operations, the block_scaling option must be set to True.
- Block scaling factors need to be specified via the quantization_scales argument.
- Utilities in nvmath.distributed.linalg.advanced.helpers.matmul can be used to create and modify block scaling factors.
- When MXFP8 is used and the result type is a narrow-precision data type, the auxiliary output "d_out_scale" will be returned in the auxiliary output tensor. It will contain the scales that were used for the result quantization.
Please refer to the examples and the narrow-precision operations tutorial for more details. cuBLASMp follows the cuBLAS specification and usage for FP8 and MXFP8 formats, scaling modes, scaling factor layouts, etc. For more details, see the cublasLtMatmul documentation.
Examples
>>> import numpy as np
>>> import nvmath.distributed
>>> from nvmath.distributed.distribution import Slab
>>> from nvmath.distributed.linalg.advanced import matrix_qualifiers_dtype
Get the process group used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the Matmul examples in nvmath/examples/distributed/linalg/advanced):

>>> process_group = nvmath.distributed.get_context().process_group
Get my process rank:
>>> rank = process_group.rank
Create two 2-D float64 ndarrays on the CPU (using Slab distributions to distribute the matrices across processes):
>>> M, N, K = 1024, 1024, 1024
>>> a_shape = Slab.X.shape(rank, (K, M))
>>> b_shape = Slab.X.shape(rank, (K, N))
>>> a = np.asfortranarray(np.random.rand(*a_shape))
>>> b = np.asfortranarray(np.random.rand(*b_shape))
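The local shapes produced by Slab distributions can be sketched as follows. `slab_x_shape` is a hypothetical reimplementation for illustration only, assuming the partitioned dimension is split as evenly as possible with earlier ranks taking the remainder; it is not the nvmath implementation:

```python
def slab_x_shape(rank, nranks, global_shape):
    # Slab.X partitions the first dimension across ranks; ranks with
    # index < remainder receive one extra row (an assumption for this sketch).
    rows, cols = global_shape
    base, rem = divmod(rows, nranks)
    local_rows = base + (1 if rank < rem else 0)
    return (local_rows, cols)

# With 4 processes and a (1024, 1024) global matrix, each rank holds
# a 256-row slab.
shapes = [slab_x_shape(r, 4, (1024, 1024)) for r in range(4)]
assert shapes == [(256, 1024)] * 4

# Uneven case: 10 rows over 4 ranks -> 3, 3, 2, 2.
assert [slab_x_shape(r, 4, (10, 8))[0] for r in range(4)] == [3, 3, 2, 2]
```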
We will define a matrix multiplication operation followed by an AllReduce epilog using the specialized matrix multiplication interface.
Create a Matmul object encapsulating the problem specification above:
>>> qualifiers = np.zeros((3,), dtype=matrix_qualifiers_dtype)
>>> qualifiers[0]["is_transpose"] = True  # a is transposed
>>> distributions = [Slab.X, Slab.X, Slab.Y]
>>> mm = nvmath.distributed.linalg.advanced.Matmul(
...     a, b, distributions=distributions, qualifiers=qualifiers
... )
Options can be provided above to control the behavior of the operation using the options argument (see MatmulOptions).

Next, plan the operation. The epilog is specified, and optionally, preferences can be specified for planning:
>>> epilog = nvmath.distributed.linalg.advanced.MatmulEpilog.ALLREDUCE
>>> mm.plan(epilog=epilog)
Now execute the matrix multiplication, and obtain the result r1 as a NumPy ndarray.

>>> r1 = mm.execute()
Finally, free the object’s resources. To avoid having to explicitly make this call, it’s recommended to use the Matmul object as a context manager as shown below, if possible.
>>> mm.free()
Note that all Matmul methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.

Let's now look at the same problem with CuPy ndarrays on the GPU.
>>> device_id = nvmath.distributed.get_context().device_id
>>> import cupy as cp
>>> with cp.cuda.Device(device_id):
...     a = cp.asfortranarray(cp.random.rand(*a_shape))
...     b = cp.asfortranarray(cp.random.rand(*b_shape))
Create a Matmul object encapsulating the problem specification described earlier and use it as a context manager.
>>> with nvmath.distributed.linalg.advanced.Matmul(
...     a, b, distributions=distributions, qualifiers=qualifiers
... ) as mm:
...     mm.plan(epilog=epilog)
...
...     # Execute the operation to get the first result.
...     r1 = mm.execute()
...
...     # Update operands A and B in-place (see reset_operands() for an
...     # alternative).
...     with cp.cuda.Device(device_id):
...         a[:] = cp.random.rand(*a_shape)
...         b[:] = cp.random.rand(*b_shape)
...
...     # Execute the operation to get the new result.
...     r2 = mm.execute()
All the resources used by the object are released at the end of the block.
Further examples can be found in the nvmath/examples/distributed/linalg/advanced/matmul directory.
Methods
- execute()[source]#
Execute a planned distributed matrix multiplication.
- Parameters:
release_workspace – A value of
Truespecifies that the stateful object should release workspace memory back to the package memory pool on function return, while a value ofFalsespecifies that the object should retain the memory. This option may be set toTrueif the application performs other operations that consume a lot of memory between successive calls to the (same or different)execute()API, but incurs a small overhead due to obtaining and releasing workspace memory from and to the package memory pool on every call. The default isFalse.stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include
cudaStream_t(as Pythonint),cupy.cuda.Stream, andtorch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
- Returns:
The result of the specified matrix multiplication (epilog applied), which remains on the same device and belongs to the same package as the input operands. If an epilog that produces extra output (like nvmath.distributed.linalg.advanced.MatmulEpilog.RELU_AUX) is used, or an extra output is requested (for example, by setting the result_amax option in the options argument), a tuple is returned with the first element being the matrix multiplication result (epilog applied) and the second element being the auxiliary output provided as a dict.
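Because the return value is either the result alone or a `(result, aux)` tuple depending on the epilog and options, callers often normalize it before use. `split_result` below is a small generic helper written for this sketch, not part of nvmath:

```python
def split_result(ret):
    # Normalize execute()'s return value to (result, aux_dict).
    if isinstance(ret, tuple):
        result, aux = ret
        return result, aux
    return ret, {}

# Plain result:
r, aux = split_result("matmul-result")
assert r == "matmul-result" and aux == {}

# Result with auxiliary output (e.g. from RELU_AUX or result_amax):
r, aux = split_result(("matmul-result", {"relu_aux": "mask"}))
assert aux["relu_aux"] == "mask"
```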
- free()[source]#
Free Matmul resources.
It is recommended that the Matmul object be used within a context, but if that is not possible then this method must be called explicitly to ensure that the matrix multiplication resources (especially internal library objects) are properly cleaned up.
- plan()[source]#
Plan the matrix multiplication operation, considering the epilog (if provided).
- Parameters:
preferences – This parameter specifies the preferences for planning as a MatmulPlanPreferences object. Alternatively, a dictionary containing the parameters for the MatmulPlanPreferences constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulPlanPreferences object.

epilog – Specify an epilog \(F\) as an object of type MatmulEpilog to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog. See the cuBLASMp documentation for the list of available epilogs.

epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor from the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX.

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
Notes
Epilogs that have BIAS in their name need an epilog input with the key 'bias'. Epilogs that have DRELU need an epilog input with the key 'relu_aux', which is produced in a "forward pass" epilog like RELU_AUX or RELU_AUX_BIAS. Similarly, epilogs with DGELU in their name require an epilog input with the key 'gelu_aux', produced in the corresponding forward pass operation.

See Matmul for an example, and further examples can be found in the nvmath/examples/distributed/linalg/advanced/matmul directory.
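The forward/backward pairing described above can be illustrated in plain NumPy: a RELU_AUX-style forward pass saves the activation mask as auxiliary output, and a DRELU-style backward pass consumes it. This is a sketch of the underlying mathematics, not of the cuBLASMp epilogs themselves:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal((4, 4))     # stand-in for alpha * A @ B
grad = rng.standard_normal((4, 4))  # upstream gradient in the backward pass

# Forward, RELU_AUX-style: apply ReLU and save the mask as auxiliary output.
relu_aux = z > 0
out = np.where(relu_aux, z, 0.0)

# Backward, DRELU-style: the saved 'relu_aux' mask gates the incoming gradient.
dz = np.where(relu_aux, grad, 0.0)

assert np.all(out >= 0)
assert np.allclose(dz, grad * relu_aux)
```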
- release_operands()[source]#
This method is experimental and potentially subject to future changes.
Added in version 0.9.0.
This method does two things:
Releases internal references to the user-provided operands, so that this instance no longer contributes to their reference counts.
Frees any internal copies (mirrors) that were created when the user-provided operands reside in a different memory space than the execution space (i.e., copies made during construction or reset_operands() / reset_operands_unchecked() if present).
This functionality can be useful in memory-constrained scenarios, e.g. where multiple stateful objects need to coexist. Leveraging this functionality, the caller can reduce memory usage while retaining the planned state.
- Parameters:
None
- Returns:
None
- Semantics:
Preserves the planned state of the stateful object.
After calling this method, reset_operands() (or reset_operands_unchecked() if present) must be called to supply new operands before the next execute() call. Failure to do so will result in a runtime error. Device-side copies will be re-allocated as needed.

For cross-space scenarios (e.g. CPU operands with GPU execution, or GPU operands with CPU execution): execution is guaranteed to always be blocking, so execute() does not return until all computation is complete. It is therefore always safe to call this method after calling execute() without additional synchronization.

When the operands are in the same memory space as the execution (e.g. GPU operands with GPU execution): this method drops this instance's internal reference to the user-provided operands. If the reference count of the operands reaches zero, their memory may be freed, so particular care is required. The caller is responsible for ensuring that any such deallocation is ordered after pending computation (e.g. by retaining a reference until the computation is complete, or by synchronizing the stream). Failure to do so is analogous to a use-after-free.
See Overview, Stateful APIs: Design and Usage Patterns for operand lifecycle and usage patterns, and Stream Semantics for stream ordering rules.
- reset_operands(
    *,
    a=None,
    b=None,
    c=None,
    alpha=None,
    beta=None,
    quantization_scales=None,
    epilog_inputs=None,
    stream: AnyStream | int | None = None,
)[source]#
Reset one or more operands held by this Matmul instance. Only the operands explicitly passed are updated; omitted operands retain their current values.

This method will perform various checks on the new operands to make sure:
The distributions, shapes, strides, and datatypes match those of the old ones.
The packages that the operands belong to match those of the old ones.
If input tensors are on the GPU, the device must match.
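The kind of validation listed above can be sketched for array operands. `compatible` below is a hypothetical illustration written for this sketch, not the actual nvmath implementation:

```python
import numpy as np

def compatible(old, new):
    # New operands must match the old ones in shape, strides, and dtype,
    # and must come from the same package (here: both NumPy arrays).
    return (
        type(old) is type(new)
        and old.shape == new.shape
        and old.strides == new.strides
        and old.dtype == new.dtype
    )

a = np.asfortranarray(np.zeros((4, 6), dtype=np.float64))
ok = np.asfortranarray(np.ones((4, 6), dtype=np.float64))
bad_layout = np.ones((4, 6), dtype=np.float64)  # C order: strides differ
bad_dtype = np.asfortranarray(np.ones((4, 6), dtype=np.float32))

assert compatible(a, ok)
assert not compatible(a, bad_layout)
assert not compatible(a, bad_dtype)
```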
Changed in version 0.9: All parameters are now keyword-only.
- Parameters:
a – A distributed tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

b – A distributed tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

c – (Optional) A distributed tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

quantization_scales – Specify scale factors for the matrix multiplication as a MatmulQuantizationScales object. Alternatively, a dict containing the parameters for the MatmulQuantizationScales constructor can also be provided. The scale factors can be provided as scalars or tensors. If a scale factor is provided as a tensor, it must be from the same package and in the same memory space (CPU or GPU device) as the operands of the matmul. If a scale factor is provided as a scalar and the execution space is GPU, a CPU-to-GPU copy is unavoidable; to avoid this copy, provide the quantization scale as a one-element array on the GPU. Allowed and required only for narrow-precision (FP8 and lower) operations.

epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor from the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX.

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
Examples
>>> import cupy as cp
>>> import nvmath.distributed
>>> from nvmath.distributed.distribution import Slab
Get the process group used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the Matmul examples in nvmath/examples/distributed/linalg/advanced):

>>> process_group = nvmath.distributed.get_context().process_group
Get my process rank:
>>> rank = process_group.rank
Create two 2-D float64 ndarrays on the GPU (using Slab distributions to distribute the matrices across processes):
>>> M, N, K = 128, 128, 256
>>> a_shape = Slab.X.shape(rank, (M, K))
>>> b_shape = Slab.Y.shape(rank, (K, N))
>>> device_id = nvmath.distributed.get_context().device_id
>>> with cp.cuda.Device(device_id):
...     a = cp.asfortranarray(cp.random.rand(*a_shape))
...     b = cp.asfortranarray(cp.random.rand(*b_shape))
Create a matrix multiplication object as a context manager:
>>> d = [Slab.X, Slab.Y, Slab.X]
>>> with nvmath.distributed.linalg.advanced.Matmul(a, b, distributions=d) as mm:
...     # Plan the operation.
...     mm.plan()
...
...     # Execute the MM to get the first result.
...     r1 = mm.execute()
...
...     # Reset the operands to new CuPy ndarrays.
...     with cp.cuda.Device(device_id):
...         a_new = cp.asfortranarray(cp.random.rand(*a_shape))
...         b_new = cp.asfortranarray(cp.random.rand(*b_shape))
...     mm.reset_operands(a=a_new, b=b_new)
...
...     # Execute to get the new result corresponding to the updated operands.
...     r2 = mm.execute()
Note that if only a subset of operands are reset, the operands that are not reset hold their original values.
With reset_operands(), minimal overhead is achieved because problem specification and planning are performed only once.

For the particular example above, explicitly calling reset_operands() is functionally equivalent to updating the operands in-place, i.e., replacing mm.reset_operands(a=a_new, b=b_new) with a[:] = a_new and b[:] = b_new. Note that updating the operands in-place should be adopted with caution, as it can only yield the expected result when the operand memory space is accessible from the execution space.

For more details, please refer to the in-place update example.