Matmul#

class nvmath.linalg.Matmul(
a: AnyTensor,
b: AnyTensor,
/,
c: AnyTensor | None = None,
*,
alpha: float | complex | None = None,
beta: float | complex | None = None,
qualifiers: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None,
options: MatmulOptions | None = None,
execution: ExecutionCPU | ExecutionCUDA | None = None,
stream: AnyStream | int | None = None,
)[source]#

Create a stateful object encapsulating the specified matrix multiplication computation \(\alpha a @ b + \beta c\) and the required resources to perform the operation. A stateful object can be used to amortize the cost of preparation (planning in the case of matrix multiplication) across multiple executions (also see the Stateful APIs section).

The function-form API matmul() is a convenient alternative to using stateful objects for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs. The function-form APIs are just convenience wrappers around the stateful object APIs.

Using the stateful object typically involves the following steps:

  1. Problem Specification: Initialize the object with a defined operation and options.

  2. Preparation: Use plan() to determine the best algorithmic implementation for this specific matrix multiplication operation.

  3. Execution: Perform the matrix multiplication computation with execute().

Detailed information on what’s happening in the various phases described above can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.

Parameters:
  • a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • c – (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

  • qualifiers – If desired, specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length <= 3 corresponding to the operands a, b, and c. By default, GeneralMatrixQualifier is assumed for each tensor. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.

  • options – Specify options for the matrix multiplication as a MatmulOptions object. If not specified, a default-constructed MatmulOptions object will be used.

  • execution – Specify execution space options for the Matmul as an ExecutionCUDA or ExecutionCPU object. If not specified, the execution space will be selected to match the operands' storage (in GPU or host memory), and the corresponding ExecutionCUDA or ExecutionCPU object will be default-constructed.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
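For reference, the expression computed from these parameters can be sketched in plain NumPy (an illustrative sketch only; matmul_reference is a hypothetical helper, not part of nvmath):

```python
import numpy as np


def matmul_reference(a, b, c=None, alpha=None, beta=None):
    # Computes alpha * (a @ b), plus beta * c when the addition operand is given.
    result = (1.0 if alpha is None else alpha) * np.matmul(a, b)
    if c is not None:
        # beta must be provided whenever c is specified.
        result = result + beta * c
    return result


a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.ones((3, 2))
c = np.eye(2)
r = matmul_reference(a, b, c, alpha=2.0, beta=0.5)
```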

Semantics:

The semantics of the matrix multiplication follow those of numpy.matmul(), with some restrictions.

  • Batching is not supported in this API, but is planned for a future release. See the advanced API (nvmath.linalg.advanced.matmul()) for an API that supports batching.

  • Broadcasting c is not supported in this API, but may be supported in the future. See the advanced API (nvmath.linalg.advanced.matmul()) for an API that supports broadcasting c.

In addition, the semantics for the fused matrix addition are described below:

  • If arguments a and b are matrices, they are multiplied according to the rules of matrix multiplication.

  • If argument a is 1-D, it is promoted to a matrix by prefixing 1 to its dimensions. After matrix multiplication, the prefixed 1 is removed from the result’s dimensions.

  • If argument b is 1-D, it is promoted to a matrix by appending 1 to its dimensions. After matrix multiplication, the appended 1 is removed from the result’s dimensions.

  • The operand for the matrix addition c must have the same shape as the expected result of the matrix multiplication.
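Since these promotion rules match numpy.matmul(), they can be illustrated with NumPy directly:

```python
import numpy as np

a = np.ones((3, 4))
v = np.ones(4)

# A 1-D second operand is promoted to shape (4, 1); the appended 1 is
# removed from the result, giving shape (3,) rather than (3, 1).
r1 = np.matmul(a, v)

# A 1-D first operand is promoted to shape (1, 3); the prefixed 1 is
# removed from the result, giving shape (4,) rather than (1, 4).
w = np.ones(3)
r2 = np.matmul(w, a)
```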

Examples

>>> import numpy as np
>>> import nvmath

Create two 2-D float64 ndarrays on the CPU:

>>> M, N, K = 1024, 1024, 1024
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)

We will define a matrix multiplication operation using the generic matrix multiplication interface.

Create a Matmul object encapsulating the problem specification above:

>>> mm = nvmath.linalg.Matmul(a, b)

Options can be provided to control the behavior of the operation via the options argument (see MatmulOptions).

Next, plan the operation. The operands’ layouts, qualifiers, and dtypes will be considered to select an appropriate matrix multiplication:

>>> mm.plan()

Now execute the matrix multiplication and obtain the result r1 as a NumPy ndarray:

>>> r1 = mm.execute()

Note that all Matmul methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.

Let’s now look at the same problem with CuPy ndarrays on the GPU.

Create two 2-D float64 CuPy ndarrays on the GPU:

>>> import cupy as cp
>>> a = cp.random.rand(M, K)
>>> b = cp.random.rand(K, N)

Create a Matmul object encapsulating the problem specification described earlier and use it as a context manager:

>>> with nvmath.linalg.Matmul(a, b) as mm:
...     # Plan the operation.
...     mm.plan()
...
...     # Execute the operation to get the first result.
...     r1 = mm.execute()
...
...     # Update operands A and B in-place (see reset_operands() for an
...     # alternative).
...     a[:] = cp.random.rand(M, K)
...     b[:] = cp.random.rand(K, N)
...
...     # Execute the operation to get the new result.
...     r2 = mm.execute()

All the resources used by the object are released at the end of the block.

Further examples can be found in the nvmath/examples/linalg/generic/matmul directory.

Methods

__init__(
a: AnyTensor,
b: AnyTensor,
/,
c: AnyTensor | None = None,
*,
alpha: float | complex | None = None,
beta: float | complex | None = None,
qualifiers: ndarray[tuple[Any, ...], dtype[_ScalarT]] | None = None,
options: MatmulOptions | None = None,
execution: ExecutionCPU | ExecutionCUDA | None = None,
stream: AnyStream | int | None = None,
)[source]#

Copy operands to the execution space and set up options.

When inheriting from this class, you must create valid operands and options in the child class before calling StatefulAPI.__init__( … ).

execute(
*,
stream: AnyStream | int | None = None,
) AnyTensor[source]#

Execute a prepared (planned) matrix multiplication.

This method is a wrapper around _execute(), which takes the same arguments but skips as many correctness and safety checks as possible.

Parameters:

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

Returns:

The result of the specified matrix multiplication, which remains on the same device and belongs to the same package as the input operands.

plan() None[source]#

Plan the matrix multiplication operation.

Unlike nvmath.linalg.advanced.Matmul.plan(), this method takes no tuning parameters. Its primary function is to find the correct matrix multiplication implementation based on the operands and options provided to the constructor.


Returns:

Nothing.

reset_operands(
a=None,
b=None,
c=None,
*,
alpha=None,
beta=None,
stream: AnyStream | int | None = None,
)[source]#

Reset the operands held by this Matmul instance.

This method has two use cases:
  1. It can be used to provide new operands for execution when the original operands are on the CPU.

  2. It can be used to release the internal reference to the previous operands and make their memory available for other use by passing None for all arguments. In this case, this method must be called again to provide the desired operands before another call to execution APIs like autotune() or execute().

This method is not needed when the operands reside on the GPU and in-place operations are used to update the operand values.

This method will perform various checks on the new operands to ensure that:

  • The shapes, strides, and data types match those of the old operands.

  • The packages that the operands belong to match those of the old operands.

  • If the input tensors are on the GPU, the device matches that of the old operands.

Parameters:
  • a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • c – (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

Examples

>>> import cupy as cp
>>> import nvmath

Create two 2-D float64 ndarrays on the GPU:

>>> M, N, K = 128, 128, 256
>>> a = cp.random.rand(M, K)
>>> b = cp.random.rand(K, N)

Create a matrix multiplication object as a context manager:

>>> with nvmath.linalg.Matmul(a, b) as mm:
...     # Plan the operation.
...     mm.plan()
...
...     # Execute the MM to get the first result.
...     r1 = mm.execute()
...
...     # Reset the operands to new CuPy ndarrays.
...     c = cp.random.rand(M, K)
...     d = cp.random.rand(K, N)
...     mm.reset_operands(c, d)
...
...     # Execute to get the new result corresponding to the updated operands.
...     r2 = mm.execute()

Note that if only a subset of operands are reset, the operands that are not reset hold their original values.

With reset_operands(), minimal overhead is achieved as problem specification and planning are only performed once.

For the particular example above, explicitly calling reset_operands() is equivalent to updating the operands in-place, i.e., replacing mm.reset_operands(c, d) with a[:] = c and b[:] = d. Note that updating operands in-place should be adopted with caution because it can only yield the expected result under the additional constraint below:

  • The operand is on the GPU (more precisely, the operand memory space should be accessible from the execution space).

For more details, please refer to the in-place update example.

Attributes

options#

The options object that was used to construct this Matmul instance.

execution: Final[ExecutionCPU | ExecutionCUDA]#

An object describing the execution space parameters.