This section describes the basic working principles of the cuTensorNet library. For a general introduction to quantum circuits, please refer to Introduction to quantum computing.

Introduction to tensor networks

Tensor networks arise in many mathematical and scientific domains, ranging from quantum circuit simulation, quantum many-body physics, and quantum chemistry to machine learning. As network sizes scale up exponentially, there is an increasing need for a high-performance tensor network library that performs tensor network contractions efficiently, a need that cuTensorNet aims to serve.

A tensor network is a collection of tensors contracted together to form a tensor of arbitrary rank. The contractions between the constituent tensors fully determine the network topology. For example, the tensor \(T\) below is given by contracting the tensors \(A\), \(B\), \(C\), and \(D\):

\[T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid},\]

where modes with the same label are implicitly summed over, following the Einstein summation convention. In this example, the mode label (index) \(i\) connects the tensors \(D\) and \(A\), the mode label \(j\) connects \(A\) and \(B\), the mode label \(k\) connects \(B\) and \(C\), and the mode label \(l\) connects \(C\) and \(D\). The four uncontracted modes with labels \(a\), \(b\), \(c\), and \(d\) are free modes (sometimes also referred to as external modes), so the resulting tensor \(T\) is of rank 4.
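The contraction above can be reproduced on small random tensors with NumPy's einsum. This is only an illustrative sketch: the extents chosen for each mode and the helper random_tensor are assumptions made for the example, not values or names taken from cuTensorNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed extents for every mode label (chosen arbitrarily for this sketch).
extent_of = {"a": 2, "b": 3, "c": 4, "d": 5, "i": 6, "j": 7, "k": 8, "l": 9}

def random_tensor(modes):
    """Hypothetical helper: a random tensor whose shape follows the mode labels."""
    return rng.standard_normal([extent_of[m] for m in modes])

A = random_tensor("aij")
B = random_tensor("bjk")
C = random_tensor("klc")
D = random_tensor("lid")

# Repeated labels (i, j, k, l) are summed over; a, b, c, d remain as free modes.
T = np.einsum("aij,bjk,klc,lid->abcd", A, B, C, D)
print(T.ndim, T.shape)
```

The result has four modes, one per free label, with the extents assigned to \(a\), \(b\), \(c\), and \(d\).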

Description of tensor networks

In the cuTensorNet library, we follow cuTENSOR’s nomenclature:

  • A rank (or order) \(N\) tensor has \(N\) modes

  • Each mode has an extent (the size of the mode), so a \(3\times 3\) matrix has two modes, each of extent 3

  • Each mode has a stride accounting for the distance in physical memory between two logically consecutive elements along that mode, in units of elements


For NumPy/CuPy users, rank/order translates to the ndim attribute, the sequence of extents translates to shape, and the sequence of strides corresponds to strides, except that NumPy/CuPy report strides in bytes rather than in elements.
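As a quick check of this correspondence, the sketch below inspects a \(3\times 3\) NumPy matrix. NumPy reports strides in bytes, so the example divides by the item size to obtain strides in elements; the variable names are illustrative only.

```python
import numpy as np

m = np.zeros((3, 3), dtype=np.float32)

rank = m.ndim        # number of modes
extents = m.shape    # extent of each mode

# NumPy strides are in bytes; divide by the item size to count elements.
strides_in_elements = tuple(s // m.itemsize for s in m.strides)
print(rank, extents, m.strides, strides_in_elements)

# A transposed view shares the same memory but swaps the strides.
t = m.T
t_strides_in_elements = tuple(s // t.itemsize for s in t.strides)
print(t.shape, t_strides_in_elements)
```

For the row-major matrix the element strides are (3, 1); the transposed view reports (1, 3) without copying any data.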

A tensor network in the cuTensorNet library is represented by the cutensornetNetworkDescriptor_t descriptor that effectively encodes the topology and data type of the network. To be precise, this descriptor specifies the number of input tensors in numInputs and the number of modes for each tensor in the array numModesIn, along with each tensor’s modes, extents, and strides in the arrays of pointers modesIn, extentsIn, and stridesIn, respectively.

Likewise, it holds similar information about the output tensor (e.g., numModesOut, modesOut, extentsOut, stridesOut). Note that there is only one output tensor per network, so there is no need to set numOutputs and the corresponding arguments are just plain arrays.

All of this network metadata can live on the host, since when constructing a tensor network only its topology and the data-access pattern matter; we do not need to know the actual content of the input tensors at network descriptor creation.
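To make the descriptor contents concrete, the following host-side sketch assembles the same kinds of arrays for the four-tensor example. It is plain Python, not a call into cuTensorNet: the snake_case variable names merely mirror the arguments named above, the integer mode labels are an arbitrary mapping, and the packed row-major strides are an assumption chosen to match NumPy's default layout.

```python
# Einsum-style description of the example network (no tensor data needed).
input_modes = ["aij", "bjk", "klc", "lid"]
output_modes = "abcd"
extent_of = {"a": 2, "b": 3, "c": 4, "d": 5, "i": 6, "j": 7, "k": 8, "l": 9}

# Map each mode label to an integer id (assumed convention for this sketch).
mode_id = {label: n for n, label in enumerate(sorted(extent_of))}

def packed_strides(extents):
    """Row-major strides in units of elements (an assumed layout)."""
    strides = [1] * len(extents)
    for n in range(len(extents) - 2, -1, -1):
        strides[n] = strides[n + 1] * extents[n + 1]
    return strides

# Metadata for the inputs: one entry per tensor.
num_inputs = len(input_modes)
num_modes_in = [len(t) for t in input_modes]
modes_in = [[mode_id[m] for m in t] for t in input_modes]
extents_in = [[extent_of[m] for m in t] for t in input_modes]
strides_in = [packed_strides(e) for e in extents_in]

# Metadata for the single output tensor: plain arrays.
num_modes_out = len(output_modes)
modes_out = [mode_id[m] for m in output_modes]
extents_out = [extent_of[m] for m in output_modes]
strides_out = packed_strides(extents_out)

print(num_inputs, num_modes_in, extents_in, strides_in)
print(num_modes_out, extents_out, strides_out)
```

Nothing here touches tensor data or device memory, which is the point: topology and access pattern alone are enough to describe the network.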

Internally, cuTensorNet utilizes cuTENSOR to create tensor objects and perform pairwise tensor contractions. cuTensorNet’s APIs are designed such that users can just focus on creating the network description without having to manage such “low-level” details by themselves. The tensor contraction can be computed in a different precision from the data type, given by a cutensornetComputeType_t constant.

Once a valid tensor network is created, one can

  1. Find a low-cost contraction path, possibly with slicing and additional constraints

  2. Access information concerning the contraction path

  3. Get the needed workspace size to accommodate intermediate tensors

  4. Create a contraction plan according to the info collected above

  5. Auto tune the contraction plan to optimize the runtime of the network contraction

  6. Perform the actual contraction to retrieve the output tensor

It is the users’ responsibility to manage device memory for the workspace (from Step 3) and for the input/output tensors (for Steps 5 and 6). See API Reference for the cuTensorNet APIs (section Workspace Management API). Alternatively, the user can provide a stream-ordered memory pool to the library to facilitate workspace memory allocations; see Memory Management API for details.
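A rough NumPy analogue of part of this workflow is sketched below. It only mirrors Steps 1, 2, and 6 (find a path, inspect it, contract); NumPy has no counterpart to the workspace query, plan creation, or autotuning steps, and none of the calls shown are cuTensorNet APIs. The tensor shapes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 6, 7))   # modes a, i, j
B = rng.standard_normal((3, 7, 8))   # modes b, j, k
C = rng.standard_normal((8, 9, 4))   # modes k, l, c
D = rng.standard_normal((9, 6, 5))   # modes l, i, d
expr = "aij,bjk,klc,lid->abcd"

# Step 1 analogue: find a low-cost contraction path.
path, info = np.einsum_path(expr, A, B, C, D, optimize="greedy")

# Step 2 analogue: inspect the path and the printable cost report.
print(path)
print(info)

# Step 6 analogue: perform the contraction along the chosen path.
T = np.einsum(expr, A, B, C, D, optimize=path)
print(T.shape)
```

In cuTensorNet the user additionally allocates the workspace and the input/output tensors on the device before the plan is autotuned and executed.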

Contraction pathfinder

A contraction path is a sequence of pairwise contractions represented in the numpy.einsum_path() format. The role of a path optimizer is to find a contraction path that minimizes the cost of contracting the tensor network. The cuTensorNet pathfinder is based on a graph-partitioning approach (called phase 1), followed by slicing and reconfiguration (called phase 2). In practice, the quality of the path found can be sensitive to the choice of configuration parameters, so many of them can be set via cutensornetContractionOptimizerConfigSetAttribute().
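To illustrate how a path in this format is read, the sketch below walks a list of index pairs and accumulates a simple cost estimate. In the numpy.einsum_path() convention, each pair refers to positions in the current operand list; the two operands are removed and their result is appended to the end. The function path_cost and its cost model (product of the extents of all modes involved in each pairwise contraction) are assumptions for illustration, not the cost model used by cuTensorNet.

```python
from math import prod

def path_cost(inputs, output, extent_of, path):
    """Return (total multiply-add estimate, largest intermediate size)."""
    operands = [set(t) for t in inputs]
    total_cost, largest = 0, 0
    for i, j in path:
        first, second = operands[i], operands[j]
        # Operands not taking part in this pairwise contraction.
        rest = [t for n, t in enumerate(operands) if n not in (i, j)]
        # A mode survives if the output or a remaining operand still needs it.
        needed = set(output).union(*rest)
        result = (first | second) & needed
        total_cost += prod(extent_of[m] for m in first | second)
        largest = max(largest, prod(extent_of[m] for m in result))
        # Contracted operands are removed; the result goes to the end.
        operands = rest + [result]
    return total_cost, largest

inputs = ["aij", "bjk", "klc", "lid"]
extent_of = {"a": 2, "b": 3, "c": 4, "d": 5, "i": 6, "j": 7, "k": 8, "l": 9}

cost_ab_first = path_cost(inputs, "abcd", extent_of, [(0, 1), (0, 1), (0, 1)])
cost_ad_first = path_cost(inputs, "abcd", extent_of, [(0, 3), (0, 1), (0, 1)])
print(cost_ab_first, cost_ad_first)
```

Even for this four-tensor ring the two orderings differ in both operation count and peak intermediate size, which is the trade-off a pathfinder has to navigate.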


In order to fit a tensor network contraction into available device memory, as specified by workspaceSizeConstraint, it may be necessary to use slicing (also known as variable projection or bond cutting). Slicing means that we compute the contraction for only one particular position in a certain mode (or combination of modes), creating a number of slices equal to the product of the extents of the sliced modes. We can then sum over the individually computed values to reproduce the full tensor network contraction.

Each slice can be computed independently from the others. Slicing is therefore also well suited to parallel computation, as it creates independent work for each device, and a sliced contraction can similarly be used to create work for all available nodes. Such a technique is useful for large tensor networks, in particular quantum circuits, where the memory footprint to perform the contraction could exceed any existing memory storage. Taking the above \(T\) tensor as an example, if we slice over the mode \(i\) we obtain the following:

\[T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid} \longrightarrow \sum_{i_s} \left( A_{a {i_s} j} B_{bjk} C_{klc} D_{l {i_s} d} \right),\]

where the sliced mode \(i_s\) is no longer implicitly summed over as part of tensor contraction, but instead explicitly summed (potentially in parallel). Although slicing reduces the memory footprint, it usually worsens the flops count of the contraction, and there is no simple way to determine what set of sliced modes will yield the best performance.
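The identity above can be checked numerically. The sketch below slices the mode \(i\) of the example network with NumPy, contracts each slice independently, and sums the results; the tensor shapes are assumptions for illustration and the loop stands in for work that could be distributed across devices.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 6, 7))   # modes a, i, j
B = rng.standard_normal((3, 7, 8))   # modes b, j, k
C = rng.standard_normal((8, 9, 4))   # modes k, l, c
D = rng.standard_normal((9, 6, 5))   # modes l, i, d

# Reference: contract the full network, summing i implicitly.
full = np.einsum("aij,bjk,klc,lid->abcd", A, B, C, D)

# Sliced version: fix i to one position at a time (axis 1 of both A and D).
num_slices = A.shape[1]   # product of the extents of the sliced modes
sliced = sum(
    np.einsum("aj,bjk,klc,ld->abcd", A[:, s, :], B, C, D[:, s, :])
    for s in range(num_slices)
)

print(num_slices, np.allclose(full, sliced))
```

Each term of the sum only touches tensors with the sliced mode removed, so its intermediates are smaller than those of the unsliced contraction.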

The cuTensorNet library offers some controls to influence the slice-finding algorithm:


At the end of each slice-finding iteration, the quality of the contraction tree has been diminished by the slicing. We can improve the contraction tree at this stage by performing reconfiguration. Reconfiguration considers a number of small subtrees within the overall contraction tree and attempts to improve their quality. Although the process is computationally expensive, a non-reconfigured sliced contraction tree may be orders of magnitude more expensive to execute than expected. The cuTensorNet library offers some controls to influence the reconfiguration algorithm:

  • CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_ITERATIONS: Specifies the number of subtrees to consider during each reconfiguration. The amount of time spent in reconfiguration, which usually dominates the pathfinder run time, is linearly proportional to this value. Based on our experiments, values between 500 and 1000 provide very good results. Default is 500. Setting this to 0 will disable reconfiguration.

  • CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_LEAVES: Specifies the maximum number of leaf nodes in each subtree considered by reconfiguration. Since the time spent is exponential in this quantity for optimal subtree reconfiguration, selecting large values will invoke faster non-optimal algorithms. Nonetheless, the time spent by reconfiguration increases very rapidly as this quantity is increased. Default is 8. Must be at least 2. While using the default value usually produces the best flops, setting it to 6 will speed up the pathfinder execution without significant increase in the flops count for many problems.

Deferred rank simplification

Since the time taken by the path-finding algorithm increases quickly as the number of tensors increases, it is advantageous to minimize the number of tensors, if possible. Rank simplification removes trivial tensor contractions from the network in order to improve performance. These are contractions in which a tensor is contracted with at most two neighbors, which effectively amounts to a matrix multiplication. The contractions needed to perform the simplification are not carried out immediately; instead, they are prepended to the returned contraction path. If, for some reason, such simplification is not desired, it can be disabled:

While simplification helps lower the FLOP count in most cases, it may sometimes (depending on the network topology and other factors) lead to a path with higher FLOP count. We recommend that users experiment with the impact of turning simplification off (using the option listed above) on the computed path.
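The idea behind simplification can be illustrated with a toy three-tensor chain in NumPy. This is a sketch of the concept only, not of cuTensorNet's algorithm: the tensors X, M, and Y and their shapes are assumptions, with M playing the role of a tensor that has only two neighbors.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 3, 4))   # modes a, b, j
M = rng.standard_normal((4, 5))      # modes j, k (two neighbors: X and Y)
Y = rng.standard_normal((5, 6, 7))   # modes k, c, d

# Original three-tensor network.
reference = np.einsum("abj,jk,kcd->abcd", X, M, Y)

# Absorb M into X first; this step is effectively a matrix multiplication
# once the modes a and b of X are viewed as a single row index.
XM = np.einsum("abj,jk->abk", X, M)

# The pathfinder would then only have to handle a two-tensor network.
simplified = np.einsum("abk,kcd->abcd", XM, Y)

print(XM.shape, np.allclose(reference, simplified))
```

The absorbed contraction corresponds to a step that would be prepended to the path, leaving a smaller network for the more expensive path search.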


cuTensorNet provides a hyper-optimizer for the pathfinder that can automatically generate many instances of contraction paths and return the best of them in terms of total flops. The number of instances is controlled by the user via CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES and is set to 0 by default. The idea here is that the hyper-optimizer will create CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES instances, each with different parameters of the pathfinder algorithm. Each instance will run the full pathfinder algorithm, including reconfiguration and slicing (if requested). At the end of the hyper-optimizer loop, the best path (in terms of flops) is returned.

The hyper-optimizer runs its instances in parallel. The desired number of threads can be set using CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_THREADS and is chosen to be half of the available logical cores by default. The number of threads is limited to the number of the available logical cores to avoid the resource contention that is likely with an unnecessarily large number of threads.

Currently, hyper-optimizer multithreading is implemented via OpenMP and:

  • OpenMP environment variables (e.g., OMP_NUM_THREADS) are not used.

  • Internal OpenMP configuration/settings are not affected, i.e. no omp_set_*() functions are called.

The configuration parameters that are varied by the hyper-optimizer are:

Some of these parameters may be fixed to a given value (via cutensornetContractionOptimizerConfigSetAttribute()). When a parameter is fixed, the hyper-optimizer will not randomize it. The randomness can be fixed by setting the seed via the attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SEED.
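The sample-and-keep-the-best structure of a hyper-optimizer, together with the effect of fixing the seed, can be sketched in a few lines of Python. This toy deliberately substitutes uniformly random pairwise paths for the parameterized pathfinder instances described above, runs serially, and uses an assumed cost model; the functions path_flops, random_path, and hyper_optimize are hypothetical names and not part of cuTensorNet.

```python
import random
from math import prod

def path_flops(inputs, output, extent_of, path):
    """Assumed cost model: product of extents of all modes per pairwise step."""
    operands = [set(t) for t in inputs]
    total = 0
    for i, j in path:
        pair = operands[i] | operands[j]
        rest = [t for n, t in enumerate(operands) if n not in (i, j)]
        needed = set(output).union(*rest)
        total += prod(extent_of[m] for m in pair)
        operands = rest + [pair & needed]
    return total

def random_path(num_tensors, rng):
    """Draw one random pairwise path in the einsum_path index convention."""
    path = []
    for remaining in range(num_tensors, 1, -1):
        i, j = sorted(rng.sample(range(remaining), 2))
        path.append((i, j))
    return path

def hyper_optimize(inputs, output, extent_of, num_samples, seed):
    """Generate num_samples candidate paths and keep the cheapest one."""
    rng = random.Random(seed)   # fixing the seed makes the search repeatable
    best = None
    for _ in range(num_samples):
        path = random_path(len(inputs), rng)
        flops = path_flops(inputs, output, extent_of, path)
        if best is None or flops < best[0]:
            best = (flops, path)
    return best

inputs = ["aij", "bjk", "klc", "lid"]
extent_of = {"a": 2, "b": 3, "c": 4, "d": 5, "i": 6, "j": 7, "k": 8, "l": 9}

few = hyper_optimize(inputs, "abcd", extent_of, num_samples=5, seed=7)
many = hyper_optimize(inputs, "abcd", extent_of, num_samples=50, seed=7)
print(few, many)
```

With the same seed, the larger run revisits the same first candidates and then continues sampling, so its best cost can never be worse than that of the smaller run.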

Supported data types

The valid combinations of data and compute types for tensor network contractions are inherited directly from cuTENSOR. Please refer to cutensornetCreateNetworkDescriptor() and cuTENSOR’s User Guide for details.

