Overview

This section describes the basic working principles of the cuTensorNet library. For a general introduction to quantum circuits, please refer to Introduction to quantum computing.

Introduction to tensor networks

Tensor networks appear in many mathematical and scientific domains, including quantum circuit simulation, quantum many-body physics, quantum chemistry, and machine learning. As network sizes grow, the cost of contracting them can rise exponentially, so there is an increasing need for a high-performance tensor network library that performs tensor contractions efficiently. cuTensorNet aims to meet this need.

A tensor network is a collection of tensors contracted together to form a tensor of arbitrary rank. The contractions between the constituent tensors fully determine the network topology. For example, the tensor \(T\) below is given by contracting the tensors \(A\), \(B\), \(C\), and \(D\):

\[T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid},\]

where the repeated modes are implicitly summed over following the Einstein summation convention. In this example, the index \(i\) connects the tensors \(D\) and \(A\), the index \(j\) connects the tensors \(A\) and \(B\), the index \(k\) connects the tensors \(B\) and \(C\), and the index \(l\) connects the tensors \(C\) and \(D\). The four uncontracted modes \(a\), \(b\), \(c\), and \(d\) are free modes (sometimes also referred to as external modes), so the resulting tensor \(T\) is of rank 4.
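As a small illustration, the example network can be written as a single NumPy einsum call. This is only a sketch on the CPU, not cuTensorNet code; the extents (3, 4, 5, 6 for the free modes and 2 for the contracted modes) are arbitrary values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Free modes a, b, c, d get distinct extents so the output shape is easy to read;
# the contracted modes i, j, k, l all have extent 2 (arbitrary choices).
A = rng.standard_normal((3, 2, 2))  # A_{aij}
B = rng.standard_normal((4, 2, 2))  # B_{bjk}
C = rng.standard_normal((2, 2, 5))  # C_{klc}
D = rng.standard_normal((2, 2, 6))  # D_{lid}

# Repeated modes (i, j, k, l) are summed over; a, b, c, d remain as free modes.
T = np.einsum("aij,bjk,klc,lid->abcd", A, B, C, D)

print(T.ndim, T.shape)  # 4 (3, 4, 5, 6)
```

The output has one mode per free index, which is why \(T\) is a rank-4 tensor.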

Description of tensor networks

In the cuTensorNet library, we follow cuTENSOR’s nomenclature:

  • A rank (or order) \(N\) tensor has \(N\) modes

  • Each mode has an extent (the size of the mode), so a \(3\times 3\) matrix has two modes of extent 3

  • Each mode has a stride accounting for the distance in physical memory between two logically consecutive elements along that mode, in units of elements

Note

For NumPy/CuPy users, rank/order translates to the ndim attribute, extent translates to shape, and stride corresponds to strides. Note, however, that NumPy and CuPy report strides in bytes, whereas the strides here are counted in elements.
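The correspondence can be checked on a small array. The sketch below uses NumPy only; since NumPy reports strides in bytes, it divides by the item size to obtain strides in units of elements.

```python
import numpy as np

m = np.zeros((3, 3), dtype=np.float32)  # C-ordered (row-major) 3x3 matrix

rank = m.ndim        # number of modes
extents = m.shape    # extent of each mode

# NumPy strides are in bytes; divide by the item size to get units of elements.
strides_in_elements = tuple(s // m.itemsize for s in m.strides)

print(rank, extents, strides_in_elements)  # 2 (3, 3) (3, 1)
```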

A tensor network in the cuTensorNet library is represented by the cutensornetNetworkDescriptor_t descriptor that effectively encodes the topology and data type of the network. To be precise, this descriptor specifies the number of input tensors in numInputs and the number of modes for each tensor in the array numModesIn, along with each tensor’s modes, extents, and strides in the arrays of pointers modesIn, extentsIn, and stridesIn, respectively.

Likewise, it holds similar information about the output tensor (e.g., numModesOut, modesOut, extentsOut, stridesOut). Note that there is only one output tensor per network, so there is no need to set numOutputs and the corresponding arguments are just plain arrays.

All of this network metadata can live on the host: constructing a tensor network requires only its topology and data-access pattern, not the actual content of the input tensors.
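To make the metadata concrete, the following sketch collects the fields named above for the example network \(T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid}\) into a plain Python dictionary. It does not call cuTensorNet. The helper describe(), the integer mode labels obtained with ord(), and the extents are illustrative assumptions, and the strides are simply those of C-ordered NumPy arrays converted to elements.

```python
import numpy as np

def describe(modes, array):
    """Return (mode labels, extents, strides in elements) for one tensor."""
    labels = [ord(m) for m in modes]  # integer mode labels (illustrative choice)
    extents = list(array.shape)
    strides = [s // array.itemsize for s in array.strides]
    return labels, extents, strides

# Only shapes and layouts matter here; the tensor contents are never read.
inputs = {
    "aij": np.empty((3, 2, 2), dtype=np.float32),
    "bjk": np.empty((4, 2, 2), dtype=np.float32),
    "klc": np.empty((2, 2, 5), dtype=np.float32),
    "lid": np.empty((2, 2, 6), dtype=np.float32),
}
described = [describe(modes, array) for modes, array in inputs.items()]

network = {
    "numInputs": len(described),
    "numModesIn": [len(labels) for labels, _, _ in described],
    "modesIn": [labels for labels, _, _ in described],
    "extentsIn": [extents for _, extents, _ in described],
    "stridesIn": [strides for _, _, strides in described],
}

# There is a single output tensor, so its fields are plain (non-nested) lists.
out = np.empty((3, 4, 5, 6), dtype=np.float32)
modes_out, extents_out, strides_out = describe("abcd", out)
network.update(
    numModesOut=len(modes_out),
    modesOut=modes_out,
    extentsOut=extents_out,
    stridesOut=strides_out,
)

for key, value in network.items():
    print(key, value)
```

Everything in the dictionary is host-side integer data, which reflects the point that the descriptor needs no access to tensor contents.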

Internally, cuTensorNet utilizes cuTENSOR to create tensor objects and perform pairwise tensor contractions. cuTensorNet’s APIs are designed such that users can just focus on creating the network description without having to manage such “low-level” details by themselves. The tensor contraction can be computed in a different precision from the data type, given by a cutensornetComputeType_t constant.

Once a valid tensor network is created, one can

  1. Find a low-cost contraction path, possibly with slicing and additional constraints

  2. Access information concerning the contraction path

  3. Get the needed workspace size to accommodate intermediate tensors

  4. Create a contraction plan according to the info collected above

  5. Perform the actual contraction to retrieve the output tensor

It is the users’ responsibility to manage device memory for the workspace (from Step 3) and input/output tensors (for Step 5). See API Reference for the cuTensorNet APIs.
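The shape of this workflow can be previewed with NumPy, which exposes a similar find-path-then-contract split. The sketch below is a CPU analogue only and uses no cuTensorNet API; Steps 3 and 4 have no NumPy counterpart because NumPy allocates intermediate tensors itself. The extents are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(3, 2, 2), (4, 2, 2), (2, 2, 5), (2, 2, 6)]
operands = [rng.standard_normal(shape) for shape in shapes]
expr = "aij,bjk,klc,lid->abcd"

# Step 1 (analogue): find a low-cost contraction path.
path, info = np.einsum_path(expr, *operands, optimize="greedy")

# Step 2 (analogue): access information about the path.
print(path)  # a list starting with 'einsum_path', followed by operand positions
print(info)  # human-readable report with estimated cost and intermediate sizes

# Steps 3 and 4 (workspace sizing and plan creation) are handled implicitly by NumPy.

# Step 5 (analogue): perform the contraction along the chosen path.
T = np.einsum(expr, *operands, optimize=path)

# The result does not depend on the path, only the cost does.
reference = np.einsum(expr, *operands)
print(np.allclose(T, reference))
```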

Contraction pathfinder

A contraction path is a sequence of pairwise contractions represented in the numpy.einsum_path() format. The role of a path optimizer is to find a contraction path that minimizes the cost of contracting the tensor network. The cuTensorNet pathfinder is based on a graph-partitioning approach (called phase 1), followed by slicing and reconfiguration (called phase 2). Practically, experience indicates that finding an optimal contraction path can be sensitive to the choice of configuration parameters. Therefore, many of these are available to be configured via cutensornetContractionOptimizerConfigSetAttribute().
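To see why the choice of path matters, the sketch below evaluates the cost of two paths for the same small network, a chain of three matrices. The function path_cost() is a hypothetical helper written for this illustration: it walks a list of position pairs in the numpy.einsum_path() style and counts one multiply-add per combination of the modes involved in each pairwise contraction. It is a simplified cost model, not the one used by cuTensorNet.

```python
from math import prod

def path_cost(inputs, output, extents, path):
    """Count multiply-adds for a pairwise path given in numpy.einsum_path style.

    Each path entry is a pair of positions in the current operand list; the two
    operands are removed and their product is appended at the end of the list.
    """
    operands = [set(modes) for modes in inputs]
    total = 0
    for i, j in path:
        left, right = operands[i], operands[j]
        # Remove the higher position first so the lower one stays valid.
        for pos in sorted((i, j), reverse=True):
            del operands[pos]
        # Modes still needed afterwards: those in the output or in other operands.
        needed = set(output).union(*operands)
        involved = left | right
        total += prod(extents[m] for m in involved)
        operands.append(involved & needed)
    return total

# Matrix chain X_{ij} Y_{jk} Z_{kl} -> R_{il} with deliberately unbalanced extents.
inputs, output = ["ij", "jk", "kl"], "il"
extents = {"i": 10, "j": 1000, "k": 10, "l": 1000}

left_first = [(0, 1), (0, 1)]   # (X Y) first, then the result with Z
right_first = [(1, 2), (0, 1)]  # (Y Z) first, then X with the result

print(path_cost(inputs, output, extents, left_first))   # 200000
print(path_cost(inputs, output, extents, right_first))  # 20000000
```

Both paths give the same result tensor, but in this model they differ in cost by a factor of 100.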

Slicing

In order to fit a tensor network contraction into available device memory, as specified by workspaceSizeConstraint, it may be necessary to use slicing (also known as variable projection or bond cutting). Each slice can be computed independently of the others, so slicing is also one of the best techniques for parallel computation: it creates independent work for each device, and we may wish to use a sliced contraction simply to create work for all available nodes. Slicing means that we compute the contraction for only one particular position in some mode (or combination of modes), creating a number of slices equal to the product of the extents of the sliced modes. We can then sum over the individually computed values to reproduce the full tensor network contraction. Such a technique is useful for large tensor networks, in particular quantum circuits, where the memory footprint of the contraction could easily exceed any existing memory storage. Taking the above \(T\) tensor as an example, if we slice over the mode \(i\) we obtain the following:

\[T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid} \longrightarrow \sum_{i_s} \left( A_{a {i_s} j} B_{bjk} C_{klc} D_{l {i_s} d} \right),\]

where the sliced mode \(i_s\) is no longer implicitly summed over as part of tensor contraction, but instead explicitly summed (potentially in parallel). Although slicing reduces the memory footprint, it usually worsens the flops count of the contraction, and there is no simple way to determine what set of sliced modes will yield the best performance.
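The identity above can be verified numerically. The following NumPy sketch (a CPU illustration with arbitrary extents, not cuTensorNet code) fixes the mode \(i\) to each position \(i_s\) in turn, contracts the remaining network, and sums the slices explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2, 2))  # A_{aij}
B = rng.standard_normal((4, 2, 2))  # B_{bjk}
C = rng.standard_normal((2, 2, 5))  # C_{klc}
D = rng.standard_normal((2, 2, 6))  # D_{lid}

# Full contraction: mode i is summed implicitly together with j, k, l.
full = np.einsum("aij,bjk,klc,lid->abcd", A, B, C, D)

# Sliced contraction: fix i = i_s in A and D, contract the rest, then sum.
# Each term is independent of the others and could run on a different device.
extent_i = A.shape[1]
sliced = sum(
    np.einsum("aj,bjk,klc,ld->abcd", A[:, i_s, :], B, C, D[:, i_s, :])
    for i_s in range(extent_i)
)

print(np.allclose(full, sliced))  # True
```

Each slice works with the smaller tensors \(A_{a i_s j}\) and \(D_{l i_s d}\), which is where the memory saving comes from.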

The cuTensorNet library offers some handles to influence the slice-finding algorithm:

Reconfiguration

At the end of each slice-finding iteration, the quality of the contraction tree has been diminished by the slicing. We can improve the contraction tree at this stage by performing reconfiguration. Reconfiguration considers a number of small subtrees within the overall contraction tree and attempts to improve their quality. Although the process is computationally expensive, a non-reconfigured sliced contraction tree may be orders of magnitude more expensive to execute than expected. The cuTensorNet library offers some handles to influence the reconfiguration algorithm:

  • CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_ITERATIONS: Specifies the number of subtrees to consider during each reconfiguration. The amount of time spent by reconfiguration is linearly proportional to this quantity. Default is 500. Setting this to 0 will disable reconfiguration.

  • CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_LEAVES: Specifies the maximum number of leaf nodes in each subtree considered by reconfiguration. Since the time spent is exponential in this quantity for optimal subtree reconfiguration, selecting large values will invoke faster non-optimal algorithms. Nonetheless, the time spent by reconfiguration increases very rapidly as this quantity is increased. Default is 8. Must be at least 2.
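For a rough sense of why the subtree size matters so much, the sketch below counts the distinct binary contraction trees over a given number of leaves, which is the double factorial \((2n-3)!!\). This is a general combinatorial fact about the size of the search space; it is not a description of how cuTensorNet searches it, and num_contraction_trees() is a hypothetical helper.

```python
def num_contraction_trees(num_leaves):
    """Number of distinct binary contraction trees over labelled leaves: (2n-3)!!."""
    if num_leaves < 2:
        raise ValueError("a contraction tree needs at least 2 leaves")
    count = 1
    # Multiply the odd numbers 1, 3, ..., 2n-3.
    for odd in range(1, 2 * num_leaves - 2, 2):
        count *= odd
    return count

for n in (2, 4, 6, 8):
    print(n, num_contraction_trees(n))
# 2 1
# 4 15
# 6 945
# 8 135135
```

Even at the default of 8 leaves there are over a hundred thousand candidate trees per subtree, which is consistent with the rapid growth in reconfiguration time described above.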

Deferred rank simplification

Since the time taken by the path-finding algorithm increases quickly as the number of tensors increases, it is advantageous to minimize the number of tensors, if possible. Rank simplification removes trivial tensor contractions from the network in order to improve performance. These are contractions in which a tensor is connected to at most two neighbors, so that absorbing it effectively amounts to a matrix multiplication. The contractions needed for the simplification are not performed immediately; instead, they are prepended to the returned contraction path. If, for some reason, such simplification is not desired, it can be disabled:
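To illustrate what rank simplification does (this is a NumPy illustration of the idea, not the option that disables it), consider a matrix \(M_{jk}\) sitting between two larger tensors. Absorbing it into one neighbor is a single matrix multiplication and leaves a network with one tensor fewer for the pathfinder to handle. The tensor names and extents are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 2))  # A_{abj}
M = rng.standard_normal((2, 2))     # M_{jk}: a matrix with exactly two neighbors
B = rng.standard_normal((2, 5, 6))  # B_{kcd}

# Original network with three tensors.
full = np.einsum("abj,jk,kcd->abcd", A, M, B)

# Simplification: absorb M into A with one matrix multiplication over mode j.
A_absorbed = A @ M                  # shape (3, 4, 2), modes a, b, k

# The simplified network has only two tensors left to find a path for.
simplified = np.einsum("abk,kcd->abcd", A_absorbed, B)

print(np.allclose(full, simplified))  # True
```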

Hyper-optimizer

cuTensorNet provides a hyper-optimizer for the pathfinder that can automatically generate many candidate contraction paths and return the best of them in terms of total flops. The number of instances is controlled by the user through CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES and is set to 0 by default. The idea is that the hyper-optimizer creates CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES instances, each with different parameters for the pathfinder algorithm. Each instance runs the full pathfinder algorithm, including reconfiguration and slicing (if requested). At the end of the hyper-optimizer loop, the best path (in terms of flops) is returned. The configuration parameters that are varied by the hyper-optimizer are:

The hyper-optimizer may be used to generate the values of all these parameters, or some of these parameters may be fixed to a given value (via cutensornetContractionOptimizerConfigSetAttribute()). When a parameter is fixed, the hyper-optimizer will not randomize it.
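The sample-and-keep-the-best loop can be sketched in a few lines. The toy below is a stand-in under loud assumptions: where the real hyper-optimizer varies pathfinder parameters, this version simply draws random pairwise paths, and it ranks them with a simplified multiply-add count. The helpers path_cost(), random_path(), and hyper_optimize() are hypothetical names, not cuTensorNet functions.

```python
import random
from math import prod

def path_cost(inputs, output, extents, path):
    """Simplified multiply-add count for a pairwise path (einsum_path style)."""
    operands = [set(modes) for modes in inputs]
    total = 0
    for i, j in path:
        left, right = operands[i], operands[j]
        for pos in sorted((i, j), reverse=True):
            del operands[pos]
        needed = set(output).union(*operands)
        involved = left | right
        total += prod(extents[m] for m in involved)
        operands.append(involved & needed)
    return total

def random_path(num_tensors, rng):
    """Draw a random sequence of pairwise contractions (toy stand-in for one sample)."""
    path = []
    remaining = num_tensors
    while remaining > 1:
        i, j = sorted(rng.sample(range(remaining), 2))
        path.append((i, j))
        remaining -= 1  # two operands removed, one result appended
    return path

def hyper_optimize(inputs, output, extents, num_samples, seed=0):
    """Generate num_samples candidate paths and return the cheapest one."""
    rng = random.Random(seed)
    best_path, best_cost = None, float("inf")
    for _ in range(num_samples):
        candidate = random_path(len(inputs), rng)
        cost = path_cost(inputs, output, extents, candidate)
        if cost < best_cost:
            best_path, best_cost = candidate, cost
    return best_path, best_cost

inputs, output = ["ij", "jk", "kl"], "il"
extents = {"i": 10, "j": 1000, "k": 10, "l": 1000}
best_path, best_cost = hyper_optimize(inputs, output, extents, num_samples=32)
print(best_path, best_cost)
```

More samples raise the chance of finding a cheap path at the price of a longer search, which is the trade-off that the number of samples controls.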

Supported data types

The valid combinations of data and compute types for tensor network contractions are inherited directly from cuTENSOR. Please refer to cuTENSOR’s User Guide for details.

Reference

For further information about general tensor networks, please refer to the following:

For the application of tensor networks to quantum circuit simulations, please see: