Matmul#

class nvmath.linalg.advanced.Matmul(
a,
b,
/,
c=None,
*,
alpha=None,
beta=None,
qualifiers=None,
quantization_scales=None,
options=None,
stream: AnyStream | int | None = None,
)[source]#

Create a stateful object encapsulating the specified matrix multiplication computation \(\alpha a @ b + \beta c\) and the required resources to perform the operation. A stateful object can be used to amortize the cost of preparation (planning in the case of matrix multiplication) across multiple executions (also see the Stateful APIs section).

The function-form API matmul() is a convenient alternative to using stateful objects for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs. The function-form APIs are just convenience wrappers around the stateful object APIs.

Using the stateful object typically involves the following steps:

  1. Problem Specification: Initialize the object with a defined operation and options.

  2. Preparation: Use plan() to determine the best algorithmic implementation for this specific matrix multiplication operation.

  3. Execution: Perform the matrix multiplication computation with execute().

  4. Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.

Detailed information on what’s happening in the various phases described above can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.

Parameters:
  • a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • c

    (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

    Changed in version 0.3.0: In order to avoid broadcasting behavior ambiguity, nvmath-python no longer accepts a 1-D (vector) c. Use a singleton dimension to convert your input array to 2-D.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

  • qualifiers – If desired, specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.

  • options – Specify options for the matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

  • quantization_scales – Specify scale factors for the matrix multiplication as a MatmulQuantizationScales object. Alternatively, a dict containing the parameters for the MatmulQuantizationScales constructor can also be provided. The scale factors can be provided as scalars or tensors. If a scale factor is provided as a tensor, it must be from the same package and in the same memory space (CPU or GPU device) as the operands of the matmul. If a scale factor is provided as a scalar and the execution space is GPU, a CPU-to-GPU copy is unavoidable; to avoid this copy, provide the quantization scale as a one-element array on the GPU. Allowed and required only for narrow-precision (FP8 and lower) operations.

Semantics:

The semantics of the matrix multiplication follow those of numpy.matmul, with some restrictions on broadcasting. In addition, the semantics for the fused matrix addition are described below.

Note

For narrow-precision formats (FP8, MXFP8, NVFP4), some of the rules below are restricted — see the narrow-precision section for details.

  • For in-place matrix multiplication (where the result is written into c) the result has the same shape as c.

  • If arguments a and b are matrices, they are multiplied according to the rules of matrix multiplication.

  • If argument a is 1-D, it is promoted to a matrix by prepending 1 to its dimensions. After matrix multiplication, the prepended 1 is removed from the result’s dimensions if the operation is not in-place.

  • If argument b is 1-D, it is promoted to a matrix by appending 1 to its dimensions. After matrix multiplication, the appended 1 is removed from the result’s dimensions if the operation is not in-place.

  • If a or b is N-D (N > 2), then the operand is treated as a batch of matrices. If both a and b are N-D, their batch dimensions must match. If exactly one of a or b is N-D, the other operand is broadcast.

  • The operand for the matrix addition c may be a matrix of shape (M, 1) or (M, N), or the batched versions (…, M, 1) or (…, M, N), where M and N are the dimensions of the result of the matrix multiplication. If c has a single column (shape (M, 1)), that column is broadcast across the N columns of the result; the rows of c are never broadcast. If batch dimensions are not present, c is broadcast across batches as needed. If the operation is in-place, c cannot be broadcast since it must be large enough to hold the result.

  • Similarly, when operating on a batch, auxiliary outputs are 3-D for all epilogs. Therefore, epilogs that return 1-D vectors of length N in non-batched mode return 3-D tensors of shape (batch, N, 1) in batched mode.

For narrow-precision operations (FP8 and lower), further restrictions apply; see the narrow-precision support section below.
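Since the multiplication semantics mirror numpy.matmul, the shape and broadcasting rules above can be illustrated with plain NumPy (a sketch of the semantics only, not an nvmath call):

```python
import numpy as np

M, N, K, batch = 64, 32, 128, 8

# Batched a with a 2-D b: b is broadcast across the batch.
a = np.random.rand(batch, M, K)
b = np.random.rand(K, N)
r = a @ b
assert r.shape == (batch, M, N)

# A c with a single column, shape (M, 1), broadcasts that column
# across N, mirroring the fused matrix-addition semantics above.
c = np.random.rand(M, 1)
assert (r + c).shape == (batch, M, N)

# A 1-D b has 1 appended, is multiplied, then the 1 is removed.
v = np.random.rand(K)
assert (a @ v).shape == (batch, M)
```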

Narrow-precision support:

Matrix multiplication with narrow-precision operands is supported in the FP8, MXFP8, and NVFP4 formats.

FP8 and MXFP8

FP8 and MXFP8 use float8_e4m3fn or float8_e5m2 data types. The difference is the scaling mode: FP8 (block_scaling=False) uses per-tensor scaling where a single scalar scale is applied to each operand; MXFP8 (block_scaling=True) uses microscaling with 32-element blocks arranged in 128x128 tiles.

Note

FP8 and MXFP8 matrix multiplication requires CUDA Toolkit 12.8 or newer. FP8 requires a device with compute capability 8.9 or higher (Ada, Hopper, Blackwell or newer architecture). MXFP8 requires a device with compute capability 10.0 or higher (Blackwell or newer architecture). Please refer to the compute capability table to check the compute capability of your device.

For FP8 operations:

  • For each operand, a scaling factor needs to be specified via the quantization_scales argument.

  • The maximum absolute value of the result (amax) can be requested via the result_amax option in the options argument.

  • A custom result type (both FP8 and non-FP8) can be requested via the result_type option in the options argument.

For MXFP8 operations:

  • 1-D (vector) operands are not supported. Both a and b must be at least 2-D matrices.

  • Broadcasting of batch dimensions is not supported. The batch shapes of a and b must match exactly.

  • All operand dimensions (M, N, K) must be multiples of 128.

  • The block_scaling option must be set to True, and block scaling factors must be specified via the quantization_scales argument. Utilities in nvmath.linalg.advanced.helpers.matmul can be used to create and modify block scaling factors; see e.g. create_mxfp8_scale().

  • When the result type is a narrow-precision data type, the auxiliary output "d_out_scale" will be returned containing the scales used for result quantization.

Layout Requirements

Due to the requirements of narrow-precision GEMM kernels, the contracting dimension K must be contiguous (stride-1) for both operands. The following layout constraints apply to both FP8 and MXFP8:

  • Operand a must be (..., M, K) with stride[-1] == 1 and stride[-2] >= K (row-major). The leading dimension (stride[-2]) can be larger than K to support sliced or padded views.

  • Operand b must be (..., K, N) with stride[-2] == 1 and stride[-1] >= K (column-major). The leading dimension (stride[-1]) can be larger than K to support sliced or padded views.

Attention

Epilog support for MXFP8 is still evolving in the underlying cuBLASLt library, so not every combination of epilog, data type, and layout is guaranteed to work. If you run into an unsupported combination, a cuBLASLt error revealing the root cause will be raised at either planning or execution time. These gaps are expected to be filled in future cuBLASLt releases.

For more details on the FP8 and MXFP8 formats in cuBLAS, see the cublasLtMatmul documentation.

NVFP4

Added in version 1.0: NVFP4 support.

NVFP4 uses float4_e2m1fn_x2 data type with block scaling (16-element blocks arranged in 128x64 tiles).

Note

NVFP4 matrix multiplication currently requires CUDA Toolkit 12.8 or newer, a device with compute capability 10.0 or higher (Blackwell or newer architecture), and PyTorch 2.9 or newer for float4_e2m1fn_x2 dtype support. Please refer to the compute capability table to check the compute capability of your device.

For NVFP4 operations:

  • 1-D (vector) operands are not supported. Both a and b must be at least 2-D matrices.

  • Broadcasting of batch dimensions is not supported. The batch shapes of a and b must match exactly.

  • The outer dimensions of a and b (M and N) must be multiples of 128, and the contracting dimension K must be a multiple of 64.

  • The block_scaling option must be set to True, and block scaling factors must be specified via the quantization_scales argument.

  • When the result type is a narrow-precision data type, the auxiliary output "d_out_scale" will be returned containing the scales used for result quantization.

Layout and Packing Requirements

FP4 data is per-byte packed: float4_e2m1fn_x2 stores 2 FP4 values per byte. The block scaling (VEC16_UE4M3) assigns one scale factor per 16 consecutive elements along the innermost (stride-1) dimension of each operand. The layout requirements below ensure that this innermost dimension corresponds to the contracting dimension K for both operands.

  • Operand a must be (..., M, K//2) with stride[-1] == 1 and stride[-2] >= K//2, i.e., row-wise packed along K. Note that the leading dimension (stride[-2]) can be larger than K//2 to support sliced views, as long as the stride remains 16-byte aligned.

  • Operand b must be (..., K//2, N) with stride[-2] == 1 and stride[-1] >= K//2, i.e., column-wise packed along K. Note that the leading dimension (stride[-1]) can be larger than K//2 to support sliced views, as long as the stride remains 16-byte aligned.

If your data has the stride-1 axis along a dimension other than K, you must repack it before calling matmul().

When the result type is also FP4, the output is packed along a dimension that depends on the result layout order:

  • Row-major result: packed along N — shape (..., M, N//2), strides (..., N//2, 1).

  • Column-major result: packed along M — shape (..., M//2, N), strides (..., 1, M//2).

The result layout order is determined by the following priority:

  1. If c is provided, the result inherits c’s layout order.

  2. Otherwise, if the epilog requests a specific layout, that layout is used.

  3. Otherwise, the result inherits a’s layout order as a fallback.

Epilog Support

NVFP4 matmul supports epilogs. The following have been verified:

  • RELU, GELU – with both row-major and column-major output.

  • BIAS, RELU_BIAS, GELU_BIAS – with column-major output only (BIAS with float16 C/D requires cuBLASLt >= 13.0).

Attention

Epilog support for NVFP4 is still evolving in the underlying cuBLASLt library, so not every combination of epilog, data type, and layout is guaranteed to work. If you run into an unsupported combination, a cuBLASLt error revealing the root cause will be raised at either planning or execution time. These gaps are expected to be filled in future cuBLASLt releases.

Helper Functions

The nvmath.linalg.advanced.helpers.matmul module provides helpers for working with FP4 encoding/decoding and NVFP4 block scales, see e.g. quantize_to_fp4(), unpack_fp4(), get_block_scale_offset(), to_block_scale(), expand_block_scale().

For more details on the NVFP4 format in cuBLAS, see the cublasLtMatmul documentation. For usage examples, see the relevant files in the examples/linalg/advanced/matmul directory.

Examples

>>> import numpy as np
>>> import nvmath

Create two 2-D float64 ndarrays on the CPU:

>>> M, N, K = 1024, 1024, 1024
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)

We will define a matrix multiplication operation followed by a RELU epilog function using the specialized matrix multiplication interface.

Create a Matmul object encapsulating the problem specification above:

>>> mm = nvmath.linalg.advanced.Matmul(a, b)

Options can be provided above to control the behavior of the operation using the options argument (see MatmulOptions).

Next, plan the operation. The epilog is specified, and optionally, preferences can be specified for planning:

>>> epilog = nvmath.linalg.advanced.MatmulEpilog.RELU
>>> algorithms = mm.plan(epilog=epilog)

Certain epilog choices (like nvmath.linalg.advanced.MatmulEpilog.BIAS) require additional input provided using the epilog_inputs argument to plan().

Now execute the matrix multiplication, and obtain the result r1 as a NumPy ndarray.

>>> r1 = mm.execute()

Finally, free the object’s resources. To avoid having to explicitly make this call, it’s recommended to use the Matmul object as a context manager as shown below, if possible.

>>> mm.free()

Note that all Matmul methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.

Let’s now look at the same problem with CuPy ndarrays on the GPU.

>>> import cupy as cp
>>> a = cp.random.rand(M, K)
>>> b = cp.random.rand(K, N)

Create a Matmul object encapsulating the problem specification described earlier and use it as a context manager.

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     algorithms = mm.plan(epilog=epilog)
...
...     # Execute the operation to get the first result.
...     r1 = mm.execute()
...
...     # Update operands A and B in-place (see reset_operands() for an
...     # alternative).
...     a[:] = cp.random.rand(M, K)
...     b[:] = cp.random.rand(K, N)
...
...     # Execute the operation to get the new result.
...     r2 = mm.execute()

All the resources used by the object are released at the end of the block.

Further examples can be found in the nvmath/examples/linalg/advanced/matmul directory.

Attributes

algorithms#

After planning using plan(), get the sequence of algorithm objects to inquire their capabilities, configure them, or serialize them for later use.

Returns:

A sequence of nvmath.linalg.advanced.Algorithm objects that are applicable to this matrix multiplication problem specification.

Methods

applicable_algorithm_ids(limit=8)[source]#

Obtain the algorithm IDs that are applicable to this matrix multiplication.

Parameters:

limit – The maximum number of applicable algorithm IDs desired.

Returns:

A sequence of algorithm IDs that are applicable to this matrix multiplication problem specification, in random order.

autotune(
iterations=10,
prune=None,
release_workspace=False,
stream: AnyStream | int | None = None,
)[source]#

Autotune the matrix multiplication to order the algorithms from the fastest measured execution time to the slowest. Once autotuned, the optimally-ordered algorithm sequence can be accessed using algorithms.

Note

This function will benchmark each of the algorithms and order the algorithms based on the benchmark results. The measurements can be impacted by factors such as GPU temperature, clock settings, or power consumption. Autotuning in an unstable environment can result in a suboptimal algorithm ordering. If you experience performance problems, consider omitting the autotuning.

Parameters:
  • iterations – The number of autotuning iterations to perform.

  • prune – An integer N, specifying the top N fastest algorithms to retain after autotuning. The default is to retain all algorithms.

  • release_workspace – A value of True specifies that the stateful object should release workspace memory back to the package memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs a small overhead due to obtaining and releasing workspace memory from and to the package memory pool on every call. The default is False.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

execute(
*,
algorithm=None,
release_workspace=False,
stream: AnyStream | int | None = None,
)[source]#

Execute a prepared (planned and possibly autotuned) matrix multiplication.

Parameters:
  • algorithm – (Experimental) An algorithm chosen from the sequence returned by plan() or algorithms. By default, the first algorithm in the sequence is used.

  • release_workspace – A value of True specifies that the stateful object should release workspace memory back to the package memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs a small overhead due to obtaining and releasing workspace memory from and to the package memory pool on every call. The default is False.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

Returns:

The result of the specified matrix multiplication (epilog applied), which remains on the same device and belongs to the same package as the input operands. If an epilog (like nvmath.linalg.advanced.MatmulEpilog.RELU_AUX) that results in extra output is used, or an extra output is requested (for example by setting result_amax option in options argument), a tuple is returned with the first element being the matrix multiplication result (epilog applied) and the second element being the auxiliary output provided as a dict.

free()[source]#

Free Matmul resources.

It is recommended that the Matmul object be used within a context, but if it is not possible then this method must be called explicitly to ensure that the matrix multiplication resources (especially internal library objects) are properly cleaned up.

plan(
*,
preferences=None,
algorithms=None,
epilog=None,
epilog_inputs=None,
stream: AnyStream | int | None = None,
)[source]#

Plan the matrix multiplication operation, considering the epilog (if provided).

Parameters:
  • preferences – This parameter specifies the preferences for planning as a MatmulPlanPreferences object. Alternatively, a dictionary containing the parameters for the MatmulPlanPreferences constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulPlanPreferences object.

  • algorithms – A sequence of Algorithm objects that can be directly provided to bypass planning. The algorithm objects must be compatible with the matrix multiplication. A typical use for this option is to provide algorithms serialized (pickled) from a previously planned and autotuned matrix multiplication.

  • epilog – Specify an epilog \(F\) as an object of type MatmulEpilog to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog. See the cuBLASLt documentation for the list of available epilogs.

  • epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

Returns:

A sequence of nvmath.linalg.advanced.Algorithm objects that are applicable to this matrix multiplication problem specification, heuristically ordered from fastest to slowest.

Notes

Epilogs that have BIAS in their name need an epilog input with the key 'bias'. Epilogs that have DRELU need an epilog input with the key 'relu_aux', which is produced in a “forward pass” epilog like RELU_AUX or RELU_AUX_BIAS. Similarly, epilogs with DGELU in their name require an epilog input with the key 'gelu_aux', produced in the corresponding forward pass operation.

Examples

>>> import numpy as np
>>> import nvmath

Create two 3-D float64 ndarrays on the CPU representing batched matrices, along with a bias vector:

>>> batch = 32
>>> M, N, K = 1024, 1024, 1024
>>> a = np.random.rand(batch, M, K)
>>> b = np.random.rand(batch, K, N)
>>> # The bias vector will be broadcast along the columns, as well as along the
>>> # batch dimension.
>>> bias = np.random.rand(M)

We will define a matrix multiplication operation followed by a nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS epilog function.

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     # Plan the operation with RELU_BIAS epilog and corresponding epilog
...     # input.
...     p = nvmath.linalg.advanced.MatmulPlanPreferences(limit=8)
...     epilog = nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS
...     epilog_inputs = {"bias": bias}
...     # The preferences can also be provided as a dict: {'limit': 8}
...     algorithms = mm.plan(
...         preferences=p,
...         epilog=epilog,
...         epilog_inputs=epilog_inputs,
...     )
...
...     # Execute the matrix multiplication, and obtain the result `r` as a
...     # NumPy ndarray.
...     r = mm.execute()

Some epilogs like nvmath.linalg.advanced.MatmulEpilog.RELU_AUX produce auxiliary output.

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     # Plan the operation with RELU_AUX epilog.
...     epilog = nvmath.linalg.advanced.MatmulEpilog.RELU_AUX
...     algorithms = mm.plan(epilog=epilog)
...
...     # Execute the matrix multiplication, and obtain the result `r` along
...     # with the auxiliary output.
...     r, auxiliary = mm.execute()

The auxiliary output is a Python dict with the names of each auxiliary output as keys.

Further examples can be found in the nvmath/examples/linalg/advanced/matmul directory.

release_operands()[source]#

This method is experimental and potentially subject to future changes.

Added in version 0.9.0.

This method does two things:

  • Releases internal references to the user-provided operands, so that this instance no longer contributes to their reference counts.

  • Frees any internal copies (mirrors) that were created when the user-provided operands reside in a different memory space than the execution (i.e., copies made during construction or reset_operands() / reset_operands_unchecked() if present).

This functionality can be useful in memory-constrained scenarios, e.g. where multiple stateful objects need to coexist. Leveraging this functionality, the caller can reduce memory usage while retaining the planned state.

Parameters:

None

Returns:

None

Semantics:
  • Preserves the planned state of the stateful object.

  • After calling this method, reset_operands() (or reset_operands_unchecked() if present) must be called to supply new operands before the next execute() call. Failure to do so will result in a runtime error. Device-side copies will be re-allocated as needed.

  • For cross-space scenarios (e.g. CPU operands with GPU execution, or GPU operands with CPU execution): execution is always blocking, so execute() does not return until all computation is complete. It is therefore always safe to call this method after calling execute() without additional synchronization.

  • When the operands are in the same memory space as the execution (e.g. GPU operands with GPU execution): in such case, this method drops this instance’s internal reference to the user-provided operands. If the reference count of the operands reaches zero, their memory may be freed, so particular attention should be paid. The caller is responsible to ensure that if such deallocation happens, it is ordered after pending computation (e.g. by retaining a reference until the computation is complete, or by synchronizing the stream). Failure to do so is analogous to use-after-free.

See Overview, Stateful APIs: Design and Usage Patterns for operand lifecycle and usage patterns, and Stream Semantics for stream ordering rules.

reset_operands(
*,
a=None,
b=None,
c=None,
alpha=None,
beta=None,
quantization_scales=None,
epilog_inputs=None,
stream: AnyStream | int | None = None,
)[source]#

Reset one or more operands held by this Matmul instance. Only the operands explicitly passed are updated; omitted operands retain their current values.

This method will perform various checks on the new operands to make sure:

  • The shapes, strides, datatypes match those of the old ones.

  • The packages that the operands belong to match those of the old ones.

  • If input tensors are on GPU, the device must match.

Changed in version 0.9: All parameters are now keyword-only.

Parameters:
  • a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • c

    (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

    Changed in version 0.3.0: In order to avoid broadcasting behavior ambiguity, nvmath-python no longer accepts a 1-D (vector) c. Use a singleton dimension to convert your input array to 2-D.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

  • epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs. Some epilog inputs are generated by other epilogs. For example, the epilog input for MatmulEpilog.DRELU is generated by matrix multiplication with the same operands using MatmulEpilog.RELU_AUX.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

  • quantization_scales – Specify scale factors for the matrix multiplication as a MatmulQuantizationScales object. Alternatively, a dict containing the parameters for the MatmulQuantizationScales constructor can also be provided. The scale factors can be provided as scalars or tensors. If a scale factor is provided as a tensor, it must be from the same package and on the same memory space (CPU or GPU device) as the operands of the matmul. If a scale factor is provided as a scalar, and the execution space is GPU, a CPU->GPU copy is inevitable. To avoid this copy, provide the quantization scale as one-element array on the GPU. Allowed and required only for narrow-precision (FP8 and lower) operations.

Examples

>>> import cupy as cp
>>> import nvmath

Create two 2-D float64 ndarrays on the GPU:

>>> M, N, K = 128, 128, 256
>>> a = cp.random.rand(M, K)
>>> b = cp.random.rand(K, N)

Create a matrix multiplication object as a context manager:

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     # Plan the operation.
...     algorithms = mm.plan()
...
...     # Execute the MM to get the first result.
...     r1 = mm.execute()
...
...     # Reset the operands to new CuPy ndarrays.
...     a_new = cp.random.rand(M, K)
...     b_new = cp.random.rand(K, N)
...     mm.reset_operands(a=a_new, b=b_new)
...
...     # Execute to get the new result corresponding to the updated operands.
...     r2 = mm.execute()

With reset_operands(), minimal overhead is achieved as problem specification and planning are only performed once.

For the particular example above, the operands are on the GPU, so calling reset_operands() only updates internal references and is efficient. An alternative would be to modify the existing operands in-place (e.g. a[:] = a_new and b[:] = b_new), but that would copy data and have performance implications. When using in-place updates, the operand memory space must be accessible from the execution space.

For more details, please refer to the in-place update example.

reset_operands_unchecked(
*,
a=None,
b=None,
c=None,
alpha=None,
beta=None,
quantization_scales=None,
epilog_inputs=None,
stream: AnyStream | int | None = None,
)[source]#

This method is experimental and potentially subject to future changes.

Added in version 0.9.0.

This method is a performance-optimized alternative to reset_operands() that eliminates validation and logging overhead, making it ideal for performance-critical loops where operand compatibility is guaranteed by the caller.

This method accepts the same parameters as reset_operands().

Semantics:

The semantics are the same as in reset_operands(), except that this method does not perform any validation (e.g. package match, data type match, etc.) or logging.

When to Use:
  • Performance-critical loops with repeated executions on different operands

  • After verifying correctness with reset_operands() during development

  • When operand compatibility is guaranteed by construction or by an invariant.