Matmul#

class nvmath.sparse.Matmul(
a,
b,
/,
c=None,
*,
alpha=None,
beta=None,
qualifiers=None,
options=None,
execution: ExecutionCUDA | None = None,
stream: AnyStream | int | None = None,
)[source]#

Create a stateful object encapsulating the specified matrix multiplication computation, which is one of \(epilog(\alpha \, op_h(a) \, @ \, op_h(b) + \beta \, c)\) or \(epilog(prolog_a(op_t(a)) \, @ \, prolog_b(op_t(b)) + prolog_c(c))\), along with the required resources to perform the operation. The \(op_h\) and \(op_t\) operators optionally specify transpose/hermitian or transpose operations respectively via the qualifiers argument. In addition, the scalar multiplication and addition operators (“semiring”) can be customized by the user, if desired.

Note

The complex conjugate operation is mutually exclusive with prolog since it can be absorbed into the prolog.

Note

Currently only in-place sparse matrix multiplication is supported, so operand c must be provided. This restriction will be removed in a future release.

A stateful object can be used to amortize the cost of preparation (planning in the case of matrix multiplication) across multiple executions (also see the Stateful APIs section). Prolog, epilog, and semiring operations can be specified in plan().

The function-form API matmul() is a convenient alternative to using stateful objects for single use (the user needs to perform just one matrix multiplication, for example), in which case there is no possibility of amortizing preparatory costs. The function-form APIs are just convenience wrappers around the stateful object APIs.

Using the stateful object typically involves the following steps:

  1. Problem Specification: Initialize the object with a defined operation and options.

  2. Preparation: Use plan() to determine the best algorithmic implementation for this specific matrix multiplication operation.

  3. Execution: Perform the matrix multiplication computation with execute().

  4. Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.

Detailed information on what’s happening in the various phases described above can be obtained by passing in a logging.Logger object to MatmulOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )

A user can select the desired logging level and, in general, take advantage of all of the functionality offered by the Python logging module.
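Instead of configuring the root logger, a dedicated logger can be built and supplied to the operation (the text above notes that a logging.Logger can be passed to MatmulOptions; the exact option field name is not shown here, so the comment below treats it as an assumption). The snippet uses only the standard logging module:

```python
import logging

# Build a dedicated, module-scoped logger instead of configuring the root logger.
logger = logging.getLogger("spmm")
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)-8s %(message)s", datefmt="%m-%d %H:%M:%S")
)
logger.addHandler(handler)

# This logger would then be handed to MatmulOptions (see its documentation for
# the exact parameter name, which is assumed here) so that only SpMM-related
# messages are captured, independent of the root logger's configuration.
```

This keeps library logging separate from the application's root logger, which is useful when the application itself logs at a different level.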

Parameters:
  • a – A sparse tensor representing the first operand a in the sparse matrix multiplication (SpMM) from one of the supported sparse packages: SciPy, CuPy, PyTorch, or a universal sparse tensor (UST) object (see semantics). The sparse representation may be in any of the formats supported by the sparse package (CSR, BSC, COO, …), including novel formats defined using the UST DSL.

  • b – A dense tensor representing the second operand b in the SpMM (see semantics). The currently supported types are numpy.ndarray, cupy.ndarray, torch.Tensor, and nvmath.sparse.ust.Tensor.

  • c – A dense tensor representing the addend c in the SpMM (see semantics). The currently supported types are numpy.ndarray, cupy.ndarray, torch.Tensor, and nvmath.sparse.ust.Tensor.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the addend term in the matrix multiplication as a real or complex number. The default is \(1.0\).

  • qualifiers – If desired, specify the matrix qualifiers as a numpy.ndarray of matmul_matrix_qualifiers_dtype objects of length 3 corresponding to the operands a, b, and c. See Matrix and Tensor Qualifiers for the motivation behind qualifiers.

  • options – Specify options for the sparse matrix multiplication as a MatmulOptions object. Alternatively, a dict containing the parameters for the MatmulOptions constructor can also be provided. If not specified, the value will be set to the default-constructed MatmulOptions object.

  • execution – Specify execution space options for the SpMM as an ExecutionCUDA object (the only execution space currently supported). If not specified, an ExecutionCUDA object will be default-constructed.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream for the operand device will be queried from the package of the dense operands b (and c).

Semantics:

The semantics of the matrix multiplication follow numpy.matmul semantics, with some restrictions on broadcasting. In addition, the semantics for the fused matrix addition are described below.

  • For in-place matrix multiplication (where the result is written into c) the result has the same shape as c.

  • The operand a must be a sparse matrix or batched sparse matrix. Popular named formats like BSC, BSR, COO, CSR, … are supported in addition to custom formats defined using the UST DSL.

  • The operands b and c must be “dense” matrices (that is, their layout is strided).

  • If the operands a and b are matrices, they are multiplied according to the rules of matrix multiplication.

  • If argument b is 1-D, it is promoted to a matrix by appending 1 to its dimensions. After matrix multiplication, the appended 1 is removed from the result’s dimensions if the operation is not in-place.

  • If a or b is N-D (N > 2), then the operand is treated as a batch of matrices. If both a and b are N-D, their batch dimensions must match. If exactly one of a or b is N-D, the other operand is broadcast.

  • The operand for the matrix addition c must be a matrix of shape (M, N), or the batched equivalent (…, M, N). Here M and N are the dimensions of the result of the matrix multiplication. If batch dimensions are not present, c is broadcast across batches as needed. If the operation is in-place, c cannot be broadcast since it must be large enough to hold the result.
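The shape rules above can be sketched in plain Python (an illustration only, not part of the nvmath API), covering 1-D promotion of b, matching inner dimensions, and batch broadcasting:

```python
def matmul_result_shape(a_shape, b_shape):
    """Result shape of a @ b under the semantics above (illustration only)."""
    a_shape, b_shape = list(a_shape), list(b_shape)
    # A 1-D b is promoted to a matrix by appending 1 to its dimensions.
    squeeze = False
    if len(b_shape) == 1:
        b_shape = b_shape + [1]
        squeeze = True
    if a_shape[-1] != b_shape[-2]:
        raise ValueError("inner dimensions must match")
    # N-D operands are treated as batches; a missing batch is broadcast.
    batch_a, batch_b = a_shape[:-2], b_shape[:-2]
    if batch_a and batch_b and batch_a != batch_b:
        raise ValueError("batch dimensions must match")
    batch = batch_a or batch_b
    out = batch + [a_shape[-2], b_shape[-1]]
    if squeeze:
        out = out[:-1]  # the appended 1 is removed (out-of-place case)
    return tuple(out)

print(matmul_result_shape((4, 8), (8, 2)))     # (4, 2)
print(matmul_result_shape((3, 4, 8), (8, 2)))  # (3, 4, 2): b broadcast across the batch
print(matmul_result_shape((4, 8), (8,)))       # (4,): 1-D b promoted, then squeezed
```

For the in-place case the result shape must of course equal the shape of c, as stated above.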

Examples

>>> import cupy as cp
>>> import cupyx.scipy.sparse as sp
>>> import nvmath

The problem parameters.

>>> m, n, k = 4, 2, 4
>>> index_type, dtype = "int32", "float64"

Create a sparse float64 ndarray in CSR format on the GPU.

>>> crow_indices = cp.array([0, 2, 4, 6, 8], dtype=index_type)
>>> col_indices = cp.array([0, 1, 0, 1, 2, 3, 2, 3], dtype=index_type)
>>> values = cp.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], dtype=dtype)
>>> a = sp.csr_matrix((values, col_indices, crow_indices), shape=(m, k))
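To see what this CSR triplet encodes, here is a quick pure-Python expansion to dense form (stdlib only, independent of CuPy):

```python
crow = [0, 2, 4, 6, 8]          # row i owns stored entries crow[i]:crow[i+1]
col = [0, 1, 0, 1, 2, 3, 2, 3]  # column index of each stored value
val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

m, k = 4, 4
dense = [[0.0] * k for _ in range(m)]
for i in range(m):
    for p in range(crow[i], crow[i + 1]):
        dense[i][col[p]] = val[p]

for row in dense:
    print(row)
# [1.0, 2.0, 0.0, 0.0]
# [3.0, 4.0, 0.0, 0.0]
# [0.0, 0.0, 5.0, 6.0]
# [0.0, 0.0, 7.0, 8.0]
```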

Create the dense operands b and c.

>>> b = cp.ones((k, n), dtype=dtype)
>>> c = cp.zeros((m, n), dtype=dtype)

We will define a sparse matrix multiplication (SpMM) operation \(c := \alpha \, a \, @ \, b + \beta \, c\) using the generic sparse matrix multiplication interface.

>>> mm = nvmath.sparse.Matmul(a, b, c=c, alpha=1.2, beta=1.0)

Options can be provided above to control the behavior of the operation using the options argument (see MatmulOptions).

Next, plan the operation. Optionally, user-defined prologs, epilog, and semiring can be provided.

>>> mm.plan()

Now execute the matrix multiplication. The result r is also a CuPy ndarray, and specifically in this example, it is the same as operand c since the operation is in-place.

>>> r = mm.execute()
>>> assert r is c, "Error: the operation is not in-place."

Finally, free the object’s resources. To avoid having to explicitly make this call, it’s recommended to use the Matmul object as a context manager as shown below, if possible.

>>> mm.free()

Note

All Matmul methods execute on the package's current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.

Let’s now look at the SpMM \(c := a.T \, @ \, b.H + c\) with PyTorch sparse/dense tensors on the CPU.

>>> import numpy as np
>>> import torch
>>> m, n, k = 2, 2, 2
>>> index_type, dtype = torch.int32, torch.float32

Create and coalesce a sparse COO tensor on the CPU.

>>> indices = torch.tensor([[0, 1], [0, 1]], dtype=index_type)
>>> values = torch.tensor([2.0, 4.0], dtype=dtype) + 1.0j
>>> a = torch.sparse_coo_tensor(indices, values, (k, m))
>>> a = a.coalesce()
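Coalescing a sparse COO tensor sorts its indices and sums any duplicate entries; a minimal pure-Python model of that step (illustration only, independent of PyTorch):

```python
# COO entries before coalescing: two values stored at the same index (0, 0).
rows = [0, 1, 0]
cols = [0, 1, 0]
vals = [2.0 + 1.0j, 4.0 + 1.0j, 1.0 + 0.0j]

# Sum duplicates, then sort the indices -- the essence of coalesce().
merged = {}
for r, c, v in zip(rows, cols, vals):
    merged[(r, c)] = merged.get((r, c), 0) + v

coalesced = sorted(merged.items())
print(coalesced)  # [((0, 0), (3+1j)), ((1, 1), (4+1j))]
```

A coalesced tensor is the canonical form most sparse kernels expect, which is why the example above calls coalesce() before use.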

Create the dense operands b and c.

>>> b = torch.ones(n, k, dtype=dtype) + 1.0j
>>> c = torch.zeros(m, n, dtype=dtype) + 0.0j

The transpose/hermitian operations on a and b will be specified using qualifiers.

>>> qualifiers = np.zeros((3,), dtype=nvmath.sparse.matmul_matrix_qualifiers_dtype)
>>> qualifiers[0]["is_transpose"] = 1
>>> qualifiers[1]["is_transpose"] = qualifiers[1]["is_conjugate"] = 1

Create the SpMM operation and use it as a context manager.

>>> with nvmath.sparse.Matmul(a, b, c=c, qualifiers=qualifiers) as mm:
...     # Plan the operation.
...     mm.plan()
...
...     # Execute it.
...     r = mm.execute()

All the resources used by the object are released at the end of the block.
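The qualifier semantics can be checked by hand: with a and b as above, a.T is the same diagonal matrix and b.H is the elementwise conjugate transpose. A stdlib-only reference computation (an illustration, independent of the library):

```python
# Dense versions of the operands from the example above.
a = [[2.0 + 1.0j, 0.0], [0.0, 4.0 + 1.0j]]  # the coalesced COO matrix, densified
b = [[1.0 + 1.0j] * 2 for _ in range(2)]    # ones(2, 2) + 1j

def transpose(x):
    return [list(r) for r in zip(*x)]

def conj(x):
    return [[complex(v).conjugate() for v in r] for r in x]

def matmul(x, y):
    return [
        [sum(x[i][p] * y[p][j] for p in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]

# c := a.T @ b.H + c with c = 0, matching the qualifiers set above.
r = matmul(transpose(a), transpose(conj(b)))
print(r)  # [[(3-1j), (3-1j)], [(5-3j), (5-3j)]]
```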

Finally, let’s see how to perform a matrix multiplication on a novel format using UST operands.

>>> device_id = 0
>>> dtype = torch.float64
>>> m, n, k = 3, 2, 8

Create a dense torch tensor and view it as UST.

>>> a = torch.tensor(
...     [[1, 0, 0, 0, 0, 0, 0, 2], [0, 0, 3, 4, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 5]],
...     dtype=dtype,
...     device=device_id,
... )
>>> a = nvmath.sparse.ust.Tensor.from_package(a)

Create a delta-compression format, using the predefined type or directly with the UST DSL.

>>> delta = nvmath.sparse.ust.NamedFormats.DELTA(2)

Convert the torch tensor to the delta-compressed format.

>>> a = a.convert(tensor_format=delta)

Create dense operands b and c for the SpMM.

>>> b = torch.ones(k, n, dtype=dtype, device=device_id)
>>> b = nvmath.sparse.ust.Tensor.from_package(b)
>>> c = torch.zeros(m, n, dtype=dtype, device=device_id)
>>> c = nvmath.sparse.ust.Tensor.from_package(c)

Create, plan, and execute the SpMM operation.

>>> with nvmath.sparse.Matmul(a, b, c=c) as mm:
...     # Plan the SpMM.
...     mm.plan()
...
...     # Execute it.
...     r = mm.execute()

View the UST result as a torch tensor.

>>> r = r.to_package()

Examples sampling the vast space of possibilities can be found in the nvmath/examples/sparse/generic/matmul directory.

Methods

execute(
*,
release_workspace=False,
stream: AnyStream | int | None = None,
)[source]#

Execute a prepared (planned) sparse matrix multiplication.

Parameters:
  • release_workspace – A value of True specifies that the stateful object should release workspace memory back to the package memory pool on function return, while a value of False specifies that the object should retain the memory. Setting this option to True may be useful if the application performs other memory-intensive operations between successive calls to the (same or different) execute() API, but it incurs a small overhead from obtaining and releasing workspace memory from and to the package memory pool on every call. The default is False.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream for the operand device will be queried from the package of the dense operands b (and c).

Returns:

The result of the sparse matrix multiplication (epilog applied). Currently only in-place SpMM is supported (the result of the computation is written into the addend c).

free()[source]#

Free Matmul resources.

It is recommended that the Matmul object be used within a context, but if that is not possible, this method must be called explicitly to ensure that the matrix multiplication resources (especially internal library objects) are properly cleaned up.

plan(
*,
prologs=None,
epilog=None,
semiring=None,
compute_capability=None,
stream: AnyStream | int | None = None,
)[source]#

Plan the sparse matrix multiplication operation, considering the prolog(s), epilog, and semiring operations.

Parameters:
  • prologs – A dict mapping an operand label ("a", "b", "c") to its prolog operation in LTO-IR format (as a bytes object). The prolog is a user-written unary function in Python that returns the transformed value, which has the data type of the operand to which it is applied. This function can be compiled to LTO-IR using the helper compile_matmul_prolog() or your own compiler of choice. If not specified, no prolog will be applied to the operands.

  • epilog – The epilog operation in LTO-IR format (as a bytes object). The epilog is a user-written unary function in Python that returns the transformed value, which has the data type of the SpMM result. This function can be compiled to LTO-IR using the helper compile_matmul_epilog() or your own compiler of choice. If not specified, no epilog will be applied to the SpMM result.

  • semiring – A dict mapping the semiring operations ("mul", "add", "atomic_add") to LTO-IR code (as a bytes object). Each semiring operation is a binary function in Python that returns a value. These functions can be compiled to LTO-IR using the helpers compile_matmul_mul(), compile_matmul_add(), or compile_matmul_atomic_add(), or your own compiler of choice. If not specified, the standard definitions of these operations from elementary algebra will be used.

  • compute_capability – The target compute capability, specified as a string ('80', '89', …). The default is the compute capability of the current device.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream for the operand device will be queried from the package of the dense operands b (and c).

Examples

We’ll see how to use prologs to specify the SpMM \(c := 3.14 \, \sin(a) \, @ \, b + c\), where the elementwise transformation is fully fused into the matrix multiplication.

>>> import math
>>> import cupy as cp
>>> import cupyx.scipy.sparse as sp
>>> import nvmath

Create a sparse operand in DIA format and view it as UST.

>>> n = 4
>>> values = cp.array(
...     [[0.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [-1.0, -2.0, -3.0, 0.0]]
... )
>>> offsets = cp.array([1, 0, -1], dtype=cp.int32)
>>> a = sp.dia_matrix((values, offsets), shape=(n, n))
>>> a = nvmath.sparse.ust.Tensor.from_package(a)

Dense b and c, also viewed as UST.

>>> b = cp.ones((n, n))
>>> b = nvmath.sparse.ust.Tensor.from_package(b)
>>> c = cp.zeros((n, n))
>>> c = nvmath.sparse.ust.Tensor.from_package(c)

Define the prolog for a, and compile to LTO-IR using the helper (or your own compiler).

>>> def transform_a(a):
...     return 3.14 * math.sin(a)
>>>
>>> prolog_a = nvmath.sparse.compile_matmul_prolog(
...     transform_a, operand_label="a", dtype="float64"
... )

Create, plan, and execute the SpMM.

>>> with nvmath.sparse.Matmul(a, b, c, beta=1.0) as mm:
...     # Plan the SpMM operation with the prologs.
...     mm.plan(prologs={"a": prolog_a})
...
...     # Execute the SpMM.
...     r = mm.execute()

Further examples can be found in the nvmath/examples/sparse/generic/matmul directory.
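To build intuition for the semiring hooks, here is a plain-Python CSR matrix-vector product with pluggable mul/add operations, swapping in the (min, +) tropical semiring. This is a conceptual model only, not the LTO-IR compilation path that plan() actually uses:

```python
import math

def csr_spmv(crow, col, val, x, mul, add, identity):
    """y[i] = add-reduction over mul(val[p], x[col[p]]) for row i (illustration)."""
    y = []
    for i in range(len(crow) - 1):
        acc = identity
        for p in range(crow[i], crow[i + 1]):
            acc = add(acc, mul(val[p], x[col[p]]))
        y.append(acc)
    return y

crow, col = [0, 2, 4], [0, 1, 0, 1]
val, x = [1.0, 2.0, 3.0, 4.0], [10.0, 100.0]

# Standard (+, *) semiring: the ordinary matrix-vector product.
print(csr_spmv(crow, col, val, x, lambda a, b: a * b, lambda a, b: a + b, 0.0))
# [210.0, 430.0]

# Tropical (min, +) semiring: shortest-path style relaxation.
print(csr_spmv(crow, col, val, x, lambda a, b: a + b, min, math.inf))
# [11.0, 13.0]
```

The "mul" and "add" hooks passed to plan() play the same roles as the mul/add callables here, except that they are compiled to LTO-IR and fused into the device kernel.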

release_operands()[source]#

This method is experimental and potentially subject to future changes.

This method does two things:

  • Releases internal references to the user-provided operands, so that this instance no longer contributes to their reference counts.

  • Frees any internal copies (mirrors) that were created when the user-provided operands reside in a different memory space than the execution (i.e., copies made during construction or reset_operands() / reset_operands_unchecked() if present).

This functionality can be useful in memory-constrained scenarios, e.g. where multiple stateful objects need to coexist. Leveraging this functionality, the caller can reduce memory usage while retaining the planned state.

Parameters:

None

Returns:

None

Semantics:
  • Preserves the planned state of the stateful object.

  • After calling this method, reset_operands() (or reset_operands_unchecked() if present) must be called to supply new operands before the next execute() call. Failure to do so will result in a runtime error. Device-side copies will be re-allocated as needed.

  • For cross-space scenarios (e.g. CPU operands with GPU execution, or GPU operands with CPU execution): execution is guaranteed to be blocking, so execute() does not return until all computation is complete. It is therefore always safe to call this method after execute() without additional synchronization.

  • When the operands are in the same memory space as the execution (e.g. GPU operands with GPU execution): this method drops this instance’s internal reference to the user-provided operands. If the reference count of an operand reaches zero, its memory may be freed, so particular care is required. The caller is responsible for ensuring that any such deallocation is ordered after pending computation (e.g. by retaining a reference until the computation is complete, or by synchronizing the stream). Failure to do so is analogous to use-after-free.

See Overview, Stateful APIs: Design and Usage Patterns for operand lifecycle and usage patterns, and Stream Semantics for stream ordering rules.

reset_operands(
*,
a=None,
b=None,
c=None,
alpha=None,
beta=None,
stream: AnyStream | int | None = None,
)[source]#

Reset one or more operands held by this Matmul instance. Only the operands explicitly passed are updated; omitted operands retain their current values.

This method will perform various checks on the new operands to make sure:

  • The shapes, index types, and data types match those of the old operands.

  • The packages that the operands belong to match those of the old ones.

  • If the input tensors are on the GPU, their device must match that of the old ones.

Parameters:
  • a – A sparse tensor representing the first operand a in the sparse matrix multiplication (SpMM) from one of the supported sparse packages: SciPy, CuPy, PyTorch, or a universal sparse tensor (UST) object (see semantics). The sparse representation may be in any of the formats supported by the sparse package (CSR, BSC, COO, …), including novel formats defined using the UST DSL.

  • b – A dense tensor representing the second operand b in the SpMM (see semantics). The currently supported types are numpy.ndarray, cupy.ndarray, torch.Tensor, and nvmath.sparse.ust.Tensor.

  • c – A dense tensor representing the addend c in the SpMM (see semantics). The currently supported types are numpy.ndarray, cupy.ndarray, torch.Tensor, and nvmath.sparse.ust.Tensor.

  • alpha – The scale factor for the matrix multiplication term as a real or complex number. The default is \(1.0\).

  • beta – The scale factor for the addend term in the matrix multiplication as a real or complex number. The default is \(1.0\).

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream for the operand device will be queried from the package of the dense operands b (and c).

Examples

>>> import torch
>>> import nvmath

Prepare sample input data.

>>> device_id = 0
>>> dtype = torch.float64
>>> m, n, k = 4, 2, 8

Create a torch CSR tensor.

>>> a = torch.ones(m, k, dtype=dtype, device=device_id)
>>> a = a.to_sparse_csr()

Dense b and c.

>>> b = torch.ones(k, n, dtype=dtype, device=device_id)
>>> c = torch.ones(m, n, dtype=dtype, device=device_id)

Create a stateful object that specifies the operation \(c := \alpha \, a \, @ \, b + \beta \, c\).

>>> alpha, beta = 1.2, 2.4
>>> with nvmath.sparse.Matmul(a, b, c, alpha=alpha, beta=beta) as mm:
...     # Plan the operation.
...     mm.plan()
...
...     # The first execution.
...     r = mm.execute()
...
...     # The operands can be reset in-place using the array library.
...     # Note that since `c` has been updated in-place, it also needs to
...     # be reset unless accumulating into it is desired.
...     b *= 3.14
...     c[:] = 1.0
...
...     # Execute the operation with updated `b`.
...     r = mm.execute()
...
...     # The operands can also be reset using the reset_operand[_unchecked]
...     # methods. This is needed when the memory space doesn't match the
...     # execution space or if the operands have not been updated in-place.
...     # Any operands that are not reset retain their prior values.
...
...     # The sparse matrix needs to be compatible (have the same sparse
...     # format, shape, dtype, NNZ etc.).
...     a = 2.718 * torch.ones(m, k, dtype=dtype, device=device_id)
...     a = a.to_sparse_csr()
...
...     c = 6.28 * torch.ones(m, n, dtype=dtype, device=device_id)
...
...     mm.reset_operands(a=a, c=c)
...
...     # Execute the operation with the reset `a` and `c`, `b` retains
...     # its previous value.
...     r = mm.execute()

For more details, please refer to the reset operand examples in the nvmath/examples/sparse/generic/matmul directory.

reset_operands_unchecked(
*,
a=None,
b=None,
c=None,
alpha=None,
beta=None,
stream: AnyStream | int | None = None,
)[source]#

This method is experimental and potentially subject to future changes.

This method is a performance-optimized alternative to reset_operands() that eliminates validation and logging overhead, making it ideal for performance-critical loops where operand compatibility is guaranteed by the caller.

This method accepts the same parameters as reset_operands().

Semantics:

The semantics are the same as in reset_operands(), except that this method does not perform any validation (e.g. package match, data type match, etc.) or logging.

When to Use:
  • Performance-critical loops with repeated executions on different operands

  • After verifying correctness with reset_operands() during development

  • When operand compatibility is guaranteed by construction or by an invariant