Tensor Network Contraction

Introduction

The goal behind the high-level APIs is to provide an interface to the cuTensorNet library that feels natural for Python programmers. The APIs support ndarray-like objects from NumPy, CuPy, and PyTorch, and allow specifying the tensor network as an Einstein summation expression.

The high-level APIs can be further categorized into two levels:

  • The “coarse-grained” level, where the user deals with Python functions like contract(), contract_path(), einsum(), and einsum_path(). The coarse-grained level is an abstraction layer that is typically meant for single contraction operations.

  • The “fine-grained” level, where the interaction is through operations on a Network object. The fine-grained level allows the user to invest significant resources into finding an optimal contraction path and autotuning the network where repeated contractions on the same network object allow for amortization of the cost (see also Resource management in stateful objects).

The APIs also allow for interoperability between the cuTensorNet library and external packages. For example, the user can specify a contraction order obtained from a different package (perhaps a research project). Alternatively, the user can obtain the contraction order and the sliced modes from cuTensorNet for downstream use elsewhere.
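
To illustrate the path format involved, a pairwise contraction order can be obtained from an external package such as NumPy's built-in optimizer. The following is a minimal sketch (operand shapes are arbitrary); a path in this list-of-pairs format is what path-accepting optimizer options such as optimize={'path': ...} expect:

```python
import numpy as np

# Obtain a contraction order from an external optimizer (here, NumPy's own).
a = np.random.rand(8, 4)
b = np.random.rand(4, 6)
c = np.random.rand(6, 8)

path, info = np.einsum_path("ij,jk,kl->il", a, b, c, optimize="optimal")

# The first entry is the literal string 'einsum_path'; the remaining entries
# are pairs of operand indices, one pair per pairwise contraction step.
pairwise_path = path[1:]
print(pairwise_path)  # e.g. [(0, 1), (0, 1)]
```

A list of pairs in this format could then be supplied to the coarse-grained APIs (e.g. via the optimizer options of contract()); conversely, a path computed by cuTensorNet can be exported for use elsewhere in the same format.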

Usage example

Contracting the same tensor network demonstrated in the cuTensorNet C example is as simple as:

from cuquantum import contract
from numpy.random import rand

a = rand(96, 64, 64, 96)
b = rand(96, 64, 64)
c = rand(64, 96, 64)

r = contract("mhkn,ukh,xuy->mxny", a, b, c)

If desired, various options can be provided for the contraction. For PyTorch tensors, starting cuQuantum Python v23.10, the contract() function works like a native PyTorch operator: it can be recorded in the autograd graph, enabling backward-mode automatic differentiation. See contract() for more details and examples.

Starting cuQuantum v22.11 / cuTensorNet v2.0.0, cuTensorNet supports automatic MPI parallelism if users bind an MPI communicator to the library handle, among other requirements as outlined here. To illustrate, assuming all processes hold the same set of input tensors (on distinct GPUs) and the same network expression, this should work out of the box:

from cupy.cuda.runtime import getDeviceCount
import cupy as cp
from mpi4py import MPI

from cuquantum import contract
from cuquantum import cutensornet as cutn

# bind comm to cuTensorNet handle
handle = cutn.create()
comm = MPI.COMM_WORLD
cutn.distributed_reset_configuration(
    handle, *cutn.get_mpi_comm_pointer(comm))

# make each process run on different GPU
rank = comm.Get_rank()
device_id = rank % getDeviceCount()
cp.cuda.Device(device_id).use()

# 1. assuming input tensors a, b, and c are created on the right GPU
# 2. passing handle explicitly allows reusing it to reduce the handle creation overhead
r = contract(
    "mhkn,ukh,xuy->mxny", a, b, c,
    options={'device_id': device_id, 'handle': handle})

An end-to-end Python example of such auto-MPI usage can be found at https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py.

Note

As of cuQuantum v22.11 / cuTensorNet v2.0.0, the Python wheel does not have the required MPI wrapper library included. Users need to either build it from source (included in the wheel), or use the Conda package from conda-forge instead.

Finally, for users seeking full control over the tensor network operations and parallelization, we offer fine-grained APIs as illustrated by the examples in the documentation for Network. A complete example illustrating parallel implementation of tensor network contraction using the fine-grained API is shown below:

from cupy.cuda.runtime import getDeviceCount
from mpi4py import MPI
import numpy as np

from cuquantum import Network

root = 0
comm = MPI.COMM_WORLD

rank, size = comm.Get_rank(), comm.Get_size()

expr = 'ehl,gj,edhg,bif,d,c,k,iklj,cf,a->ba'
shapes = [(8, 2, 5), (5, 7), (8, 8, 2, 5), (8, 6, 3), (8,), (6,), (5,), (6, 5, 5, 7), (6, 3), (3,)]

# Set the operand data on root.
operands = [np.random.rand(*shape) for shape in shapes] if rank == root else None

# Broadcast the operand data.
operands = comm.bcast(operands, root)
    
# Assign the device for each process.
device_id = rank % getDeviceCount()

# Create network object.
network = Network(expr, *operands, options={'device_id' : device_id})

# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction.
path, info = network.contract_path(optimize={'samples': 8, 'slicing': {'min_slices': max(16, size)}})

# Select the best path from all ranks.
opt_cost, sender = comm.allreduce(sendobj=(info.opt_cost, rank), op=MPI.MINLOC)
if rank == root:
    print(f"Process {sender} has the path with the lowest FLOP count {opt_cost}.")

# Broadcast info from the sender to all other ranks.
info = comm.bcast(info, sender)

# Set path and slices.
path, info = network.contract_path(optimize={'path': info.path, 'slicing': info.slices})

# Calculate this process's share of the slices.
num_slices = info.num_slices
chunk, extra = num_slices // size, num_slices % size
slice_begin = rank * chunk + min(rank, extra)
slice_end = num_slices if rank == size - 1 else (rank + 1) * chunk + min(rank + 1, extra)
slices = range(slice_begin, slice_end)

print(f"Process {rank} is processing slice range: {slices}.")

# Contract the group of slices the process is responsible for.
result = network.contract(slices=slices)

# Sum the partial contribution from each process on root.
result = comm.reduce(sendobj=result, op=MPI.SUM, root=root)

# Check correctness.
if rank == root:
    result_np = np.einsum(expr, *operands, optimize=True)
    print("Does the cuQuantum parallel contraction result match the numpy.einsum result?", np.allclose(result, result_np))

This “manual” MPI Python example can be found in the NVIDIA/cuQuantum repository (here).
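
The slice-distribution arithmetic in the example above hands out num_slices slice indices across size ranks so that every slice is processed exactly once and rank loads differ by at most one slice. A small self-contained sketch (no MPI required) replicating that arithmetic:

```python
def partition(num_slices, size):
    """Replicate the per-rank slice partitioning used in the example above."""
    ranges = []
    for rank in range(size):
        chunk, extra = num_slices // size, num_slices % size
        slice_begin = rank * chunk + min(rank, extra)
        slice_end = num_slices if rank == size - 1 else (rank + 1) * chunk + min(rank + 1, extra)
        ranges.append(range(slice_begin, slice_end))
    return ranges

# Every slice index in [0, num_slices) is assigned to exactly one rank,
# and per-rank chunk sizes differ by at most one.
ranges = partition(16, 5)
assert sorted(i for r in ranges for i in r) == list(range(16))
assert max(map(len, ranges)) - min(map(len, ranges)) <= 1
```

The extra slices left over by the integer division are absorbed by the lowest-numbered ranks, which keeps the load imbalance to at most one slice per rank.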

Call blocking behavior

By default, calls to the execution APIs (Network.autotune() and Network.contract() on the Network object as well as the function contract()) block and do not return until the operation is completed. This behavior can be changed by setting NetworkOptions.blocking and passing in the options to Network. When NetworkOptions.blocking is set to 'auto', calls to the execution APIs will return immediately after the operation is launched on the GPU without waiting for it to complete if the input tensors are on the device. If the input tensors are on the host, the execution API calls will always block since the result of the contraction is a tensor that will also reside on the host.

APIs that execute on the host (such as Network.contract_path() on the Network object, and the contract_path() and einsum_path() functions) always block.

Stream semantics

The stream semantics depends on whether the behavior of the execution APIs is chosen to be blocking or non-blocking (see Call blocking behavior).

For blocking behavior, stream ordering is automatically handled by the cuQuantum Python high-level APIs for operations that are performed within the package. A stream can be provided for two reasons:

  1. When the computation that prepares the input tensors is not already complete by the time the execution APIs are called. This is a correctness requirement for user-provided data.

  2. To enable parallel computations across multiple streams if the device has sufficient resources and the current stream (which is the default) has concomitant operations. This can be done for performance reasons.

For non-blocking behavior, it is the user’s responsibility to ensure correct stream ordering between the execution API calls.

In any case, the execution APIs are launched on the provided stream.

Resource management in stateful objects

An important aspect of the fine-grained, stateful object APIs (e.g. Network) is resource management: the internal resources, including library resources, memory resources, and user-provided input operands, must be managed safely throughout the object's lifetime. As such, there are caveats that users should be aware of, due to their impact on the memory watermark.

The stateful object APIs allow users to prepare an object and reuse it for multiple contractions or gradient calculations, amortizing the preparation cost. Depending on the specific problem, an up-front investment in preparation that leads to shorter execution time may be the ideal solution. During the preparation step, an object inevitably holds references to device memory for later reuse. However, such problems often involve high memory usage, which can make it infeasible to keep multiple objects alive at the same time; interleaving contractions of multiple large tensor networks is an example of this.

To address this use case, starting cuQuantum Python v24.03, two new features have been added:

  1. Every execution method now accepts a release_workspace option. When this option is set to True (default is False), the memory needed to perform an operation is freed before the method returns, making this memory available for other tasks. The next time the same (or different) method is called, memory is allocated on demand. Therefore, there is a small overhead associated with release_workspace=True, as allocating/deallocating memory could take time depending on the implementation of the underlying memory allocator (see next section); however, making multiple Network objects coexist becomes possible, see, e.g., the example6_resource_mgmt_contraction.py sample.

  2. The reset_operands() method now accepts setting operands=None to free the internal reference to the input operands after the execution. This reduces potential memory contention and thereby allows contracting multiple networks with large input tensors in an interleaved fashion. In such cases, before a subsequent execution on the same Network object is called, the reset_operands() method should be called again with the new operands, see, e.g., the example8_reset_operand_none.py sample.

These two features, used separately or jointly as the problem requires, make it possible to prepare and use a large number of Network objects when the device memory available is not enough to fit all problems at once.

External memory management

Starting cuQuantum Python v22.03, we support an EMM-like interface, as proposed and supported by Numba, for users to plug in their own Python memory pool. Users set the option NetworkOptions.allocator to a Python object complying with the cuquantum.BaseCUDAMemoryManager protocol and pass the options to high-level APIs like contract() or Network. Temporary memory allocations will then be made through this interface. (Internally, we use the same interface to use CuPy's or PyTorch's mempool, depending on the input tensor operands.)

Note

cuQuantum’s BaseCUDAMemoryManager protocol is slightly different from Numba’s EMM interface (numba.cuda.BaseCUDAMemoryManager), but duck typing with an existing EMM instance (not type!) at runtime should be possible.
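
To sketch what such duck typing might look like, the following mock is purely illustrative: it allocates nothing on a real device, the memalloc(size, stream) signature and the returned pointer-like object are assumptions modeled on the cuquantum.BaseCUDAMemoryManager protocol, and the API reference should be consulted for the exact contract.

```python
# Illustrative sketch only. MockPointer and MockMemoryManager are hypothetical
# names; the memalloc(size, stream) signature is an assumption modeled on the
# cuquantum.BaseCUDAMemoryManager protocol.
class MockPointer:
    """Stand-in for a device-pointer wrapper returned by an allocator."""
    def __init__(self, device_ptr, size):
        self.device_ptr = device_ptr
        self.size = size

class MockMemoryManager:
    """Duck-typed allocator: no inheritance from any base class is required."""
    def __init__(self):
        self.requests = []

    def memalloc(self, size, stream):
        # A real implementation would draw from a device memory pool here;
        # this mock merely records the request and returns a dummy pointer.
        self.requests.append((size, stream))
        return MockPointer(device_ptr=0, size=size)

mm = MockMemoryManager()
ptr = mm.memalloc(1024, stream=None)
assert ptr.size == 1024 and mm.requests == [(1024, None)]
```

In actual use, an object of this shape (backed by a real pool such as an EMM plugin instance) would be passed via NetworkOptions.allocator so that temporary allocations go through the user's pool.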