nvmath.linalg.advanced.Matmul
- class nvmath.linalg.advanced.Matmul(a, b, /, c=None, *, alpha=None, beta=None, qualifiers=None, options=None, stream=None)
Create a stateful object encapsulating the specified matrix multiplication computation \(\alpha a @ b + \beta c\) and the required resources to perform the operation. A stateful object can be used to amortize the cost of preparation (planning in the case of matrix multiplication) across multiple executions (also see the Stateful APIs section).
The function-form API matmul() is a convenient alternative to using stateful objects for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs. The function-form APIs are just convenience wrappers around the stateful object APIs.
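As a quick illustration, a single multiplication with the function-form API is a one-liner (a minimal sketch, assuming NumPy operands):

>>> import numpy as np
>>> import nvmath
>>> a = np.random.rand(64, 32)
>>> b = np.random.rand(32, 16)
>>> r = nvmath.linalg.advanced.matmul(a, b)  # specification, planning, and execution in one call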
Using the stateful object typically involves the following steps:

1. Problem Specification: Initialize the object with a defined operation and options.
2. Preparation: Use plan() to determine the best algorithmic implementation for this specific matrix multiplication operation.
3. Execution: Perform the matrix multiplication computation with execute().
4. Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.
Detailed information on what's happening in the various phases described above can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)-8s %(message)s', datefmt='%m-%d %H:%M:%S')

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.
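For instance, a dedicated logger can be passed to the computation through MatmulOptions (a minimal sketch; the logger name is arbitrary, and the operands a and b are assumed to be defined as in the examples below):

>>> import logging
>>> logger = logging.getLogger('nvmath_demo')  # arbitrary, illustrative logger name
>>> logger.setLevel(logging.DEBUG)
>>> logger.addHandler(logging.StreamHandler())
>>> options = nvmath.linalg.advanced.MatmulOptions(logger=logger)
>>> mm = nvmath.linalg.advanced.Matmul(a, b, options=options)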
- Parameters:
- a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
- b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
- c – (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
- alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).
- beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.
- qualifiers – If desired, specify the matrix qualifiers as a numpy.ndarray of matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c.
- options – Specify options for the matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.
- stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
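Tying the parameters together, here is a minimal constructor sketch with NumPy operands; the options dict is forwarded to the MatmulOptions constructor, and the device_id field is an assumption about MatmulOptions (selecting the GPU used when operands reside on the CPU):

>>> import numpy as np
>>> import nvmath
>>> M, N, K = 64, 32, 16
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)
>>> c = np.random.rand(M)  # vector of length M (see Semantics below)
>>> mm = nvmath.linalg.advanced.Matmul(a, b, c=c, alpha=2.0, beta=1.0, options={'device_id': 0})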
- Semantics:
The semantics of the matrix multiplication follows numpy.matmul() semantics, with some restrictions on broadcasting. In addition, the semantics for the fused matrix addition are described below:

- If arguments a and b are matrices, they are multiplied according to the rules of matrix multiplication.
- If argument a is 1-D, it is promoted to a matrix by prefixing 1 to its dimensions. After matrix multiplication, the prefixed 1 is removed from the result's dimensions.
- If argument b is 1-D, it is promoted to a matrix by appending 1 to its dimensions. After matrix multiplication, the appended 1 is removed from the result's dimensions.
- If a or b is N-D (N > 2), then the operand is treated as a batch of matrices. If both a and b are N-D, their batch dimensions must match. If exactly one of a or b is N-D, the other operand is broadcast.
- The operand for the matrix addition c may be a vector of length M, a matrix of shape (M, 1) or (M, N), or batched versions of the latter (…, M, 1) or (…, M, N). Here M and N are the dimensions of the result of the matrix multiplication. If a vector is provided or N = 1, the columns of c are broadcast for the addition. If batch dimensions are not present, c is broadcast across batches as needed.
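The column-wise broadcast of c differs from NumPy's usual row-wise broadcasting, so a small sketch may help (function-form API for brevity; the comparison should hold to within floating-point tolerance):

>>> import numpy as np
>>> import nvmath
>>> M, N, K = 64, 32, 16
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)
>>> c = np.random.rand(M)  # vector of length M
>>> r = nvmath.linalg.advanced.matmul(a, b, c=c, beta=1.0)
>>> assert np.allclose(r, a @ b + c[:, None])  # c is added column-broadcast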
See also
matmul()
Examples
>>> import numpy as np
>>> import nvmath
Create two 2-D float64 ndarrays on the CPU:
>>> M, N, K = 1024, 1024, 1024
>>> a = np.random.rand(M, K)
>>> b = np.random.rand(K, N)
We will define a matrix multiplication operation followed by a RELU epilog function using the specialized matrix multiplication interface.
Create a Matmul object encapsulating the problem specification above:
>>> mm = nvmath.linalg.advanced.Matmul(a, b)
Options can be provided above to control the behavior of the operation using the options argument (see MatmulOptions).

Next, plan the operation. The epilog is specified, and optionally, preferences can be specified for planning:
>>> epilog = nvmath.linalg.advanced.MatmulEpilog.RELU
>>> mm.plan(epilog=epilog)
Certain epilog choices (like nvmath.linalg.advanced.MatmulEpilog.BIAS) require additional input provided using the epilog_inputs argument to plan().

Now execute the matrix multiplication, and obtain the result r1 as a NumPy ndarray.

>>> r1 = mm.execute()
Finally, free the object's resources. To avoid having to make this call explicitly, it is recommended to use the Matmul object as a context manager as shown below, if possible.
>>> mm.free()
Note that all Matmul methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.
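For example, a CuPy stream can be passed to individual method calls (a minimal sketch, assuming import cupy as cp and CuPy operands a and b):

>>> s = cp.cuda.Stream()
>>> with nvmath.linalg.advanced.Matmul(a, b, stream=s) as mm:
...     mm.plan(stream=s)
...     r = mm.execute(stream=s)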
Let's now look at the same problem with CuPy ndarrays on the GPU.
Create two 2-D float64 CuPy ndarrays on the GPU:
>>> import cupy as cp
>>> a = cp.random.rand(M, K)
>>> b = cp.random.rand(K, N)
Create a Matmul object encapsulating the problem specification described earlier and use it as a context manager.
>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     mm.plan(epilog=epilog)
...
...     # Execute the operation to get the first result.
...     r1 = mm.execute()
...
...     # Update operands A and B in-place (see reset_operands() for an alternative).
...     a[:] = cp.random.rand(M, K)
...     b[:] = cp.random.rand(K, N)
...
...     # Execute the operation to get the new result.
...     r2 = mm.execute()
All the resources used by the object are released at the end of the block.
Further examples can be found in the nvmath/examples/linalg/advanced/matmul directory.
Methods
- __init__(a, b, /, c=None, *, alpha=None, beta=None, qualifiers=None, options=None, stream=None)
- applicable_algorithm_ids(limit=8)
Obtain the algorithm IDs that are applicable to this matrix multiplication.
- Parameters:
- limit – The maximum number of applicable algorithm IDs desired.
- Returns:
A sequence of algorithm IDs that are applicable to this matrix multiplication problem specification, in random order.
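A minimal usage sketch, assuming an existing Matmul object mm:

>>> ids = mm.applicable_algorithm_ids(limit=4)  # a sequence of at most 4 applicable IDs, in random order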
- autotune(iterations=3, prune=None, release_workspace=False, stream=None)
Autotune the matrix multiplication to order the algorithms from the fastest measured execution time to the slowest. Once autotuned, the optimally-ordered algorithm sequence can be accessed using algorithms.

- Parameters:
- iterations – The number of autotuning iterations to perform.
- prune – An integer N, specifying the top N fastest algorithms to retain after autotuning. The default is to retain all algorithms.
- release_workspace – A value of True specifies that the stateful object should release workspace memory back to the package memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs a small overhead due to obtaining and releasing workspace memory from and to the package memory pool on every call. The default is False.
- stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
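A minimal sketch of planning followed by autotuning, assuming operands a and b as in the class-level examples:

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     mm.plan()
...     # Measure the planned algorithms over 5 iterations and keep only the 4 fastest.
...     mm.autotune(iterations=5, prune=4)
...     # The first algorithm in mm.algorithms is now the fastest measured one.
...     r = mm.execute()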
- execute(*, algorithm=None, release_workspace=False, stream=None)
Execute a prepared (planned and possibly autotuned) matrix multiplication.
- Parameters:
- algorithm – (Experimental) An algorithm chosen from the sequence returned by plan() or algorithms. By default, the first algorithm in the sequence is used.
- release_workspace – A value of True specifies that the stateful object should release workspace memory back to the package memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs a small overhead due to obtaining and releasing workspace memory from and to the package memory pool on every call. The default is False.
- stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
- Returns:
The result of the specified matrix multiplication (epilog applied), which remains on the same device and belongs to the same package as the input operands. If an epilog (like nvmath.linalg.advanced.MatmulEpilog.RELU_AUX) that results in extra output is used, a tuple is returned with the first element being the matrix multiplication result (epilog applied) and the second element being the auxiliary output provided by the selected epilog as a dict.
- free()
Free Matmul resources.
It is recommended that the Matmul object be used within a context, but if it is not possible then this method must be called explicitly to ensure that the matrix multiplication resources (especially internal library objects) are properly cleaned up.
- plan(*, preferences=None, algorithms=None, epilog=None, epilog_inputs=None, stream=None)
Plan the matrix multiplication operation, considering the epilog (if provided).
- Parameters:
- preferences – This parameter specifies the preferences for planning as a MatmulPlanPreferences object. Alternatively, a dictionary containing the parameters for the MatmulPlanPreferences constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulPlanPreferences object.
- algorithms – A sequence of Algorithm objects that can be directly provided to bypass planning. The algorithm objects must be compatible with the matrix multiplication. A typical use for this option is to provide algorithms serialized (pickled) from a previously planned and autotuned matrix multiplication.
- epilog – Specify an epilog \(F\) as an object of type MatmulEpilog to apply to the result of the matrix multiplication: \(F(\alpha A @ B + \beta C)\). The default is no epilog.
- epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs.
- stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
- Returns:
A sequence of nvmath.linalg.advanced.Algorithm objects that are applicable to this matrix multiplication problem specification, heuristically ordered from fastest to slowest.
Notes

Epilogs that have BIAS in their name need an epilog input with the key 'bias'. Epilogs that have DRELU in their name need an epilog input with the key 'relu_aux', which is produced in a “forward pass” epilog like RELU_AUX or RELU_AUX_BIAS. Similarly, epilogs with DGELU in their name require an epilog input with the key 'gelu_aux', produced in the corresponding forward pass operation.

Examples
>>> import numpy as np
>>> import nvmath
Create two 3-D float64 ndarrays on the CPU representing batched matrices, along with a bias vector:
>>> batch = 32
>>> M, N, K = 1024, 1024, 1024
>>> a = np.random.rand(batch, M, K)
>>> b = np.random.rand(batch, K, N)
>>> # The bias vector will be broadcast along the columns, as well as along the batch dimension.
>>> bias = np.random.rand(M)
We will define a matrix multiplication operation followed by an nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS epilog function.

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     # Plan the operation with the RELU_BIAS epilog and the corresponding epilog input.
...     p = nvmath.linalg.advanced.MatmulPlanPreferences(limit=8)
...     epilog = nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS
...     epilog_inputs = {'bias': bias}
...     # The preferences can also be provided as a dict: {'limit': 8}.
...     mm.plan(preferences=p, epilog=epilog, epilog_inputs=epilog_inputs)
...
...     # Execute the matrix multiplication, and obtain the result `r` as a NumPy ndarray.
...     r = mm.execute()
Some epilogs like nvmath.linalg.advanced.MatmulEpilog.RELU_AUX produce auxiliary output.

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     # Plan the operation with the RELU_AUX epilog.
...     epilog = nvmath.linalg.advanced.MatmulEpilog.RELU_AUX
...     mm.plan(epilog=epilog)
...
...     # Execute the matrix multiplication, and obtain the result `r` along with the auxiliary output.
...     r, auxiliary = mm.execute()
The auxiliary output is a Python dict with the names of each auxiliary output as keys.

Further examples can be found in the nvmath/examples/linalg/advanced/matmul directory.
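As a sketch of the forward/backward pairing described in the notes above, the auxiliary output of a RELU_AUX plan can be fed to a DRELU plan of a second matrix multiplication. The backward-pass operand grad here is illustrative only; it must yield a result of the same shape as the forward pass, and additional layout constraints may apply:

>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     mm.plan(epilog=nvmath.linalg.advanced.MatmulEpilog.RELU_AUX)
...     r, aux = mm.execute()  # aux holds the 'relu_aux' mask
>>> grad = np.random.rand(*a.shape)  # illustrative backward-pass operand
>>> with nvmath.linalg.advanced.Matmul(grad, b) as mm_bwd:
...     mm_bwd.plan(epilog=nvmath.linalg.advanced.MatmulEpilog.DRELU, epilog_inputs={'relu_aux': aux['relu_aux']})
...     dr = mm_bwd.execute()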
- reset_operands(a=None, b=None, c=None, *, alpha=None, beta=None, epilog_inputs=None, stream=None)
Reset the operands held by this Matmul instance.

This method has two use cases: (1) it can be used to provide new operands for execution when the original operands are on the CPU, or (2) it can be used to release the internal reference to the previous operands and make their memory available for other use by passing None for all arguments. In the latter case, this method must be called again to provide the desired operands before another call to execution APIs like autotune() or execute().

This method is not needed when the operands reside on the GPU and in-place operations are used to update the operand values.
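A minimal sketch of use case (1) with CPU operands; the new operands must match the old ones per the checks listed below:

>>> import numpy as np
>>> import nvmath
>>> a = np.random.rand(128, 256)
>>> b = np.random.rand(256, 128)
>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     mm.plan()
...     r1 = mm.execute()
...     # Provide new CPU operands with matching shapes, strides, and dtypes.
...     mm.reset_operands(a=np.random.rand(128, 256), b=np.random.rand(256, 128))
...     r2 = mm.execute()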
This method will perform various checks on the new operands to make sure:

- The shapes, strides, and datatypes match those of the old ones.
- The packages that the operands belong to match those of the old ones.
- If the input tensors are on the GPU, the device must match.
- Parameters:
- a – A tensor representing the first operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
- b – A tensor representing the second operand to the matrix multiplication (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
- c – (Optional) A tensor representing the operand to add to the matrix multiplication result (see Semantics). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
- alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).
- beta – The scale factor for the matrix addition term as a real or complex number. A value for beta must be provided if operand c is specified.
- epilog_inputs – Specify the additional inputs needed for the selected epilog as a dictionary, where the key is the epilog input name and the value is the epilog input. The epilog input must be a tensor with the same package and in the same memory space as the operands (see the constructor for more information on the operands). If the required epilog inputs are not provided, an exception is raised that lists the required epilog inputs.
- stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
Examples
>>> import cupy as cp
>>> import nvmath
Create two 2-D float64 ndarrays on the GPU:
>>> M, N, K = 128, 128, 256
>>> a = cp.random.rand(M, K)
>>> b = cp.random.rand(K, N)
Create a matrix multiplication object as a context manager:
>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     # Plan the operation.
...     mm.plan()
...
...     # Execute the MM to get the first result.
...     r1 = mm.execute()
...
...     # Reset the operands to new CuPy ndarrays.
...     c = cp.random.rand(M, K)
...     d = cp.random.rand(K, N)
...     mm.reset_operands(c, d)
...
...     # Execute to get the new result corresponding to the updated operands.
...     r2 = mm.execute()
Note that if only a subset of operands are reset, the operands that are not reset hold their original values.
With reset_operands(), minimal overhead is achieved as problem specification and planning are only performed once.

For the particular example above, explicitly calling reset_operands() is equivalent to updating the operands in-place, i.e., replacing mm.reset_operands(c, d) with a[:] = c and b[:] = d. Note that updating the operands in-place should be adopted with caution, as it can only yield the expected result under the additional constraint below:

- The operand is on the GPU (more precisely, the operand memory space should be accessible from the execution space).
For more details, please refer to the in-place update example.
Attributes
- algorithms
After planning using plan(), get the sequence of algorithm objects to inquire their capabilities, configure them, or serialize them for later use.

- Returns:
A sequence of nvmath.linalg.advanced.Algorithm objects that are applicable to this matrix multiplication problem specification.
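Since plan() accepts previously serialized algorithms, the autotuned ordering can be round-tripped with pickle (a minimal sketch, assuming compatible operands a and b in both sessions):

>>> import pickle
>>> with nvmath.linalg.advanced.Matmul(a, b) as mm:
...     mm.plan()
...     mm.autotune()
...     blob = pickle.dumps(mm.algorithms)  # serialize the autotuned algorithm sequence
>>> # Later, or in another process, with operands of the same specification:
>>> with nvmath.linalg.advanced.Matmul(a, b) as mm2:
...     mm2.plan(algorithms=pickle.loads(blob))  # bypass planning
...     r = mm2.execute()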