********
Overview
********

This section describes the basic working principles of the *cuTensorNet* library. For a general introduction to quantum circuits, please refer to :doc:`Introduction to quantum computing <../../overview>`.


Introduction to tensor networks
===============================

Tensor networks emerge in many mathematical and scientific domains, ranging from quantum circuit simulation, quantum
many-body physics, quantum chemistry, to machine learning. As network sizes scale up exponentially, there is
an increasing need for a high-performance tensor network library in order to perform tensor network contractions efficiently,
which cuTensorNet aims to serve.

A tensor network is a collection of tensors contracted together to form a tensor of arbitrary rank. The contractions
between the constituent tensors fully determines the network topology. For example, the tensor :math:`T` below is given by
contracting the tensors :math:`A`, :math:`B`, :math:`C`, and :math:`D`:

.. math::

    T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid},

where the modes with the same label are implicitly summed over following the `Einstein summation convention`_. In this example,
the mode label (index) :math:`i` connects the tensors :math:`D` and :math:`A`, the mode label :math:`j` connects the tensors :math:`A` and :math:`B`,
the mode label :math:`k` connects the tensors :math:`B` and :math:`C`, the mode label :math:`l` connects the tensors :math:`C` and :math:`D`.
The four uncontracted modes with labels :math:`a`, :math:`b`, :math:`c`, and :math:`d` refer to free modes (sometimes also referred to as external modes), indicating the resulting tensor
:math:`T` is of rank 4.

.. _Einstein summation convention: https://en.wikipedia.org/wiki/Einstein_notation

..
    TODO: add a plot for the above TN?

Description of tensor networks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the *cuTensorNet* library, we follow `cuTENSOR's nomenclature`_:

    * A rank (or *order*) :math:`N` tensor has :math:`N` *modes*
    * Each mode has an *extent* (the size of the mode), so a :math:`3\times 3` matrix has two modes, each of extent 3
    * Each mode has a *stride* accounting for the distance in physical memory between two logically consecutive elements along that mode, in unit of elements

.. note::

    For NumPy/CuPy users, rank/order translates to the ``ndim`` attribute, the sequence of extents translates to ``shape``, and  the sequence of strides
    has the identical meaning as ``strides``.

.. _cuTENSOR's nomenclature: https://docs.nvidia.com/cuda/cutensor/user_guide.html#nomenclature

A tensor network in the *cuTensorNet* library is represented by the
`cutensornetNetworkDescriptor_t` descriptor that effectively encodes the topology and data
type of the network. To be precise, this descriptor specifies the number of input tensors in ``numInputs``
and the number of modes for each tensor in the array ``numModesIn``, along with each tensor's modes, extents, and
strides in the arrays of pointers ``modesIn``, ``extentsIn``, and ``stridesIn``, respectively.

Likewise, it holds similar information about the output tensor (e.g., ``numModesOut``, ``modesOut``, ``extentsOut``, ``stridesOut``).
Note that there is only one output tensor per network, so there is no need to set ``numOutputs`` and
the corresponding arguments are just plain arrays.

It is possible for all these network metadata to live on the host, since when constructing a tensor network only
its topology and the data-access pattern matter; we do not need to know the actual content of the input tensors at network descriptor creation.

Internally, *cuTensorNet* utilizes cuTENSOR to create tensor objects and perform pairwise tensor contractions.
*cuTensorNet*'s APIs are designed such that users can just focus on creating the network description without having to manage such
"low-level" details by themselves.
The tensor contraction can be computed in a different precision from the data type, given by a `cutensornetComputeType_t` constant.

Once a valid tensor network is created, one can

    #. Find a low-cost contraction path, possibly with slicing and additional constraints
    #. Access information concerning the contraction path
    #. Get the needed workspace size to accommodate intermediate tensors
    #. Create a contraction plan according to the info collected above
    #. Auto tune the contraction plan to optimize the runtime of the network contraction
    #. Perform the actual contraction to retrieve the output tensor

It is the users' responsibility to manage device memory for the workspace (from Step 3) and input/output tensors (for Step 5).
See :ref:`cutensornet-api-reference-label` for the *cuTensorNet* APIs (section :ref:`cuTensorNet workspace management API`).
Alternatively, the user can provide a *stream-ordered* memory pool to the library to facilitate workspace memory allocations, see :ref:`cuTensorNet memory management API` for details.


Contraction pathfinder
======================

A contraction path is a sequence of pairwise contractions represented in the :func:`numpy.einsum_path` format.
The role of a path optimizer is to find a contraction path that minimizes the cost of contracting the tensor network.
The *cuTensorNet* pathfinder is based on a graph-partitioning approach (called phase 1), followed by slicing and reconfiguration (called phase 2).
Practically, experience indicates that finding an optimal contraction path can be sensitive to the choice of configuration parameters.
Therefore, many of these are available to be configured via `cutensornetContractionOptimizerConfigSetAttribute`.

Slicing
^^^^^^^

In order to fit a tensor network contraction into available device memory, as specified by ``workspaceSizeConstraint``, it may be necessary to use *slicing* (also known as *variable projection* or *bond cutting*).
Each slice can be computed independently from the others.
Thus, if we intend to run a parallel computation,
slicing is also one of the best techniques as it creates independent work for each device.
We may similarly use a sliced contraction in order to create work for all available nodes.
Slicing means that we compute the contraction for only one particular position in a certain mode (or combination of modes), creating a number of slices equal to the product of the extents of the sliced modes.
We can then sum over the individually computed values to reproduce the full tensor network contraction.
Such a technique is useful for large tensor networks, in particular quantum circuits, where the memory footprint to perform the contraction could exceed any existing memory storage.
Taking the above :math:`T` tensor as an example, if we slice over the mode ``i`` we obtain the following:

.. math::

    T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid} \longrightarrow \sum_{i_s} \left( A_{a {i_s} j} B_{bjk} C_{klc} D_{l {i_s} d} \right),

..
    TODO: add a plot for the above sliced TN?

where the sliced mode :math:`i_s` is no longer implicitly summed over as part of tensor contraction, but instead explicitly summed (potentially in parallel).
Although slicing reduces the memory footprint, it usually worsens the flops count of the contraction, and there is no simple way to determine what set of sliced modes will yield the best performance.

The *cuTensorNet* library offers some controls to influence the slice-finding algorithm:

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_DISABLE_SLICING`:
      If set to 1, slicing will not be considered, regardless of available memory.
      Default is 0.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MEMORY_MODEL`:
      Specifies the memory model used to determine the workspace size, see `cutensornetMemoryModel_t`.
      The default is to use a memory model compatible with cuTENSOR (`CUTENSORNET_MEMORY_MODEL_HEURISTIC`).

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MEMORY_FACTOR`:
+     The memory limit for the first slice-finding iteration as a percentage of the workspace size. The default is 80 when using `CUTENSORNET_MEMORY_MODEL_CUTENSOR` for the memory model and 100 when using `CUTENSORNET_MEMORY_MODEL_HEURISTIC`.                                                                

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES`:
      Specifies the minimum number of slices to produce (e.g. to create parallelizable work).
      Default is 1.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_SLICE_FACTOR`:
      Specifies the factor by which the number of slices will increase per slice-finding iteration.
      For example, if the previous iteration had :math:`N` slices and the slice factor is :math:`s`, the next iteration
      will produce at least :math:`sN` slices.
      Since reconfiguration (see below) occurs after each slice-finding iteration, increasing this value will decrease the
      amount of reconfiguration,
      which, in turn, will decrease the amount of time taken but probably worsen the quality of the contraction tree.
      Default is 32. Must be at least 2.

Reconfiguration
^^^^^^^^^^^^^^^

At the end of each slice-finding iteration, the quality of the contraction tree has been diminished by the slicing.
We can improve the contraction tree at this stage by performing *reconfiguration*.
Reconfiguration considers a number of small subtrees within the overall contraction tree and attempts to improve their quality.
Although the process is computationally expensive, a non-reconfigured sliced contraction tree may be orders of magnitude more expensive to execute than expected.
The *cuTensorNet* library offers some controls to influence the reconfiguration algorithm:

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_ITERATIONS`:
      Specifies the number of subtrees to consider during each reconfiguration.
      The amount of time spent in reconfiguration, which usually dominates the pathfinder run time, is linearly proportional to this value.
      Based on our experiments, values between 500 and 1000 provide very good results.
      Default is 500. Setting this to 0 will disable reconfiguration.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_LEAVES`:
      Specifies the maximum number of leaf nodes in each subtree considered by reconfiguration.
      Since the time spent is exponential in this quantity for optimal subtree reconfiguration, selecting large
      values will invoke faster non-optimal algorithms.
      Nonetheless, the time spent by reconfiguration increases very rapidly as this quantity is increased.
      Default is 8. Must be at least 2.
      While using the default value usually produces the best flops, setting it to 6 will speed up the pathfinder execution without significant increase in the flops count for many problems.

Deferred rank simplification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since the time taken by the path-finding algorithm increases quickly as the number of tensors increases, it is advantageous to minimize the number of tensors, if possible.
Rank simplification removes trivial tensor contractions from the network in order to improve performance.
These contractions are those where a tensor is only contracted with at most two neighbors, effectively making a matrix multiplication.
The necessary contractions to perform the simplification are not immediately performed but rather are prepended to the contraction path returned.
If, for some reason, such simplification is not desired, it can be disabled:

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SIMPLIFICATION_DISABLE_DR`:
      If set to 1, simplification will be skipped.
      This will run the path-finding algorithm on the full tensor network, increasing the time needed.
      Default is 0.

While simplification helps lower the FLOP count in most cases,  it may sometimes (depending on the 
network topology and other factors) lead to a path with higher FLOP count. 
We recommend that users experiment with the impact of turning simplification off 
(using the option listed above) on the computed path.

.. _hyperoptimizer:

Hyper-optimizer
===============

*cuTensorNet* provides a hyper-optimizer for the pathfinder that can automatically generate
many instances of contraction path and return the best of them in terms of total flops.
The number of instances is user controlled by `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES` and is set to 0 by default.
The idea here is that the hyper-optimizer will create
`CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES` instances, each with different parameters of the pathfinder algorithm.
Each instance will run the full pathfinder algorithm including reconfiguration and slicing (if requested).
At the end of the hyper-optimizer loop, the best path (in term of flops) is returned.

The hyper-optimizer runs its instances in parallel.
The desired number of threads can be set using `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_THREADS`
and is chosen to be half of the available logical cores by default. The number of threads
is limited to the number of the available logical cores to avoid the resource contention that is likely with an unnecessarily large number of threads.

Currently, hyper-optimizer multithreading is implemented via OpenMP and:

    * OpenMP environment variables (e.g., ``OMP_NUM_THREADS``) are not used.
    * Internal OpenMP configuration/settings are not affected, i.e. no ``omp_set_*()`` functions are called.

The configuration parameters that are varied by the hyper-optimizer are:

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_NUM_PARTITIONS`:
      A network of size larger than cutoff is split into ``num_partitions`` partitions. This process is repeated recursively until the size of each partition is less than or equal to the value of cutoff. The allowed range for ``num_partitions`` is [2, 30]. When the hyper-optimizer is disabled the default value is 8.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_CUTOFF_SIZE`:
      The network is recursively split over ``num_partitions`` until the size of each partition is less than or equal to this cutoff. The allowed range for ``cutoff_size`` is [4, 50]. When the hyper-optimizer is disabled the default value is 8.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_ALGORITHM`:
      The graph algorithm to be used in graph partitioning. The available choices are listed in the `cutensornetGraphAlgo_t` enum.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_IMBALANCE_FACTOR`:
      Specifies the maximum allowed size imbalance among the partitions. Allowed range [30, 2000]. When the hyper-optimizer is disabled the default value is 200.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_NUM_ITERATIONS`:
      Specifies the number of iterations for the refinement algorithms at each stage of the uncoarsening process of the graph partitioner. Allowed range [1, 500]. When the hyper-optimizer is disabled the default value is 60.

    * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_NUM_CUTS`:
      Specifies the number of different partitioning that the graph partitioner will compute. The final partitioning is the one that achieves the best edge-cut or communication volume. Allowed range [1, 40]. When the hyper-optimizer is disabled the default value is 10.

Some of these parameters may be fixed to a given value (via `cutensornetContractionOptimizerConfigSetAttribute`). When a parameter is fixed, the hyper-optimizer will not randomize it. The randomness can be fixed by setting the seed via the attribute `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SEED`.


Supported data types
====================

A valid combination of the data and compute types for tensor network contractions inherits straightforwardly from
that of cuTENSOR. Please refer to `cutensornetCreateNetworkDescriptor` and `cuTENSOR's User Guide`_ for detail.

.. _cuTENSOR's User Guide: https://docs.nvidia.com/cuda/cutensor/user_guide.html#user-guide

..
    TODO: Add sections for: CUDA Graph Support, *cuTensorNet* Logging, Environment Variables
    following cuTENSOR: https://docs.nvidia.com/cuda/cutensor/user_guide.html


Reference
=========

For further information about general tensor networks, please refer to the following:

  * `Tensor Network <https://tensornetwork.org>`_
  * `Hyper-optimized tensor network contraction <https://arxiv.org/abs/2002.01935>`_

For the application of tensor networks to quantum circuit simulations, please see:

  * `NVIDIA Blog: What Is Quantum Computing? <https://blogs.nvidia.com/blog/2021/04/12/what-is-quantum-computing/>`_
  * `Simulating Quantum Computation by Contracting Tensor Networks <https://arxiv.org/abs/quant-ph/0511069>`_