******** Overview ******** This section describes the basic working principles of the *cuTensorNet* library. For a general introduction to quantum circuits, please refer to :doc:`Introduction to quantum computing <../../overview>`. Introduction to tensor networks =============================== Tensor networks emerge in many mathematical and scientific domains, ranging from quantum circuit simulation, to quantum many-body physics, to quantum chemistry, and to machine learning. As the network size scales up exponentially, there is an increasing need for a high-performance tensor network library in order to perform tensor contractions efficiently, which cuTensorNet aims to serve. A tensor network is a collection of tensors contracted together to form a tensor of arbitrary rank. The contractions between the constituent tensors fully determines the network topology. For example, the tensor :math:`T` below is given by contracting the tensors :math:`A`, :math:`B`, :math:`C`, and :math:`D`: .. math:: T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid}, where the repeated modes are implicitly summed over following the `Einstein summation convention`_. In this example, the index :math:`i` connects the tensors :math:`D` and :math:`A`, the index :math:`j` connects the tensors :math:`A` and :math:`B`, the index :math:`k` connects the tensors :math:`B` and :math:`C`, the index :math:`l` connects the tensors :math:`C` and :math:`D`. The four uncontracted modes :math:`a`, :math:`b`, :math:`c`, and :math:`d` are free modes (sometimes also referred to as external modes), indicating the resulting tensor :math:`T` is of rank 4. .. _Einstein summation convention: https://en.wikipedia.org/wiki/Einstein_notation .. TODO: add a plot for the above TN? Description of tensor networks ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In the cuTensorNet library, we follow `cuTENSOR's nomenclature`_: * A rank (or *order*) :math:`N` tensor has :math:`N` *modes* * Each mode has an *extent* (the size of the mode), so a :math:`3\times 3` matrix has two modes of extent 3 * Each mode has a *stride* accounting for the distance in physical memory between two logically consecutive elements along that mode, in unit of elements .. note:: For NumPy/CuPy users, rank/order translates to the ``ndim`` attribute, extent translates to ``shape``, and stride has the identical meaning as ``strides``. .. _cuTENSOR's nomenclature: https://docs.nvidia.com/cuda/cutensor/user_guide.html#nomenclature A tensor network in the cuTensorNet library is represented by the `cutensornetNetworkDescriptor_t` descriptor that effectively encodes the topology and data type of the network. To be precise, this descriptor specifies the number of input tensors in ``numInputs`` and the number of modes for each tensor in the array ``numModesIn``, along with each tensor's modes, extents, and strides in the arrays of pointers ``modesIn``, ``extentsIn``, and ``stridesIn``, respectively. Likewise, it holds similar information about the output tensor (e.g., ``numModesOut``, ``modesOut``, ``extentsOut``, ``stridesOut``). Note that there is only one output tensor per network, so there is no need to set ``numOutputs`` and the corresponding arguments are just plain arrays. It is possible for all these network metadata to live on the host, since when constructing a tensor network only its topology and the data-access pattern matter, and we do not need to know the actual content of the input tensors. Internally, cuTensorNet utilizes cuTENSOR to create tensor objects and perform pairwise tensor contractions. cuTensorNet's APIs are designed such that users can just focus on creating the network description without having to manage such "low-level" details by themselves. The tensor contraction can be computed in a different precision from the data type, given by a `cutensornetComputeType_t` constant. Once a valid tensor network is created, one can #. Find a low-cost contraction path, possibly with slicing and additional constraints #. Access information concerning the contraction path #. Get the needed workspace size to accommodate intermediate tensors #. Create a contraction plan according to the info collected above #. Perform the actual contraction to retrieve the output tensor It is the users' responsibility to manage device memory for the workspace (from Step 3) and input/output tensors (for Step 5). See :ref:`cutensornet-api-reference-label` for the cuTensorNet APIs. Contraction pathfinder ====================== A contraction path is a sequence of pairwise contractions represented in the :func:`numpy.einsum_path` format. The role of a path optimizer is to find a contraction path that minimizes the cost of contracting the tensor network. The cuTensorNet pathfinder is based on a graph-partitioning approach (called phase 1), followed by slicing and reconfiguration (called phase 2). Practically, experience indicates that finding an optimal contraction path can be sensitive to the choice of configuration parameters. Therefore, many of these are available to be configured via `cutensornetContractionOptimizerConfigSetAttribute`. Slicing ^^^^^^^ In order to fit a tensor network contraction into available device memory, as specified by ``workspaceSizeConstraint``, it may be necessary to use *slicing* (also known as *variable projection* or *bond cutting*). Each slice can be computed independently from the others. Thus, if we intend to run a parallel computation, slicing is also one of the best techniques as it create independent work for each device. We may wish used a sliced contraction in order to create work for all available nodes. Slicing means that we compute the contraction for only one particular position in some mode (or combination of modes), creating a number of slices equal to the product of the extents of the sliced modes. We can then sum over the individually computed values to reproduce the full tensor network contraction. Such a technique is useful for large tensor networks, in particular quantum circuits, where the memory footprint to perform the contraction could easily exceed any existing memory storage. Taking the above :math:`T` tensor as an example, if we slice over the mode ``i`` we obtain the following: .. math:: T_{abcd} = A_{aij} B_{bjk} C_{klc} D_{lid} \longrightarrow \sum_{i_s} \left( A_{a {i_s} j} B_{bjk} C_{klc} D_{l {i_s} d} \right), .. TODO: add a plot for the above sliced TN? where the sliced mode :math:`i_s` is no longer implicitly summed over as part of tensor contraction, but instead explicitly summed (potentially in parallel). Although slicing reduces the memory footprint, it usually worsens the flops count of the contraction, and there is no simple way to determine what set of sliced modes will yield the best performance. The cuTensorNet library offers some handles to influence the slice-finding algorithm: * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_DISABLE_SLICING`: If set to 1, slicing will not be considered, regardless of available memory. Default is 0. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES`: Specifies the minimum number of slices to produce (e.g. to create parallelizable work). Default is 1. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_SLICE_FACTOR`: Specifies the factor by which the number of slices will increase per slice-finding iteration. For example, if the previous iteration had :math:`N` slices and the slice factor is :math:`s`, the next iteration will produce at least :math:`sN` slices. Since reconfiguration (see below) occurs after each slice-finding iteration, increasing this will decrease the amount of reconfiguration; this will decrease the amount of time taken but probably worsen the quality of the contraction tree. Default is 32. Must be at least 2. Reconfiguration ^^^^^^^^^^^^^^^ At the end of each slice-finding iteration, the quality of the contraction tree has been diminished by the slicing. We can improve the contraction tree at this stage by performing *reconfiguration*. Reconfiguration considers a number of small subtrees within the overall contraction tree and attempts to improve their quality. Although the process is computationally expensive, a non-reconfigured sliced contraction tree may be orders of magnitude more expensive to execute than expected. The cuTensorNet library offers some handles to influence the reconfiguration algorithm: * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_ITERATIONS`: Specifies the number of subtrees to consider during each reconfiguration. The amount of time spent by reconfiguration is linearly proportional to this quantity. Default is 500. Setting this to 0 will disable reconfiguration. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_LEAVES`: Specifies the maximum number of leaf nodes in each subtree considered by reconfiguration. Since the time spent is exponential in this quantity for optimal subtree reconfiguration, selecting large values will invoke faster non-optimal algorithms. Nonetheless, the time spent by reconfiguration increases very rapidly as this quantity is increased. Default is 8. Must be at least 2. Deferred rank simplification ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Since the time taken by the path-finding algorithm increases quickly as the number of tensors increases, it is advantageous to minimize the number of tensors, if possible. Rank simplification removes trivial tensor contractions from the network in order to improve performance. These contractions are those where a tensor is only contracted with at most two neighbors, effectively making a matrix multiplication. The necessary contractions to perform the simplification are not immediately performed but rather are prepended to the contraction path returned. If, for some reason, such simplification is not desired, it can be disabled: * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SIMPLIFICATION_DISABLE_DR`: If set to 1, simplification will be skipped. This will run the path-finding algorithm on the full tensor network, increasing the time needed. Default is 0. Hyper-optimizer =============== cuTensorNet provides a hyper-optimizer for the pathfinder that can automatically generate many instances of contraction path and return the best of them in terms of total flops. The number of instances is user controlled by `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES` and is set to 0 by default. The idea here, is that the hyper-optimizer will create `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES` instances, each with different parameters of the pathfinder algorithm. Each instance will run the full pathfinder algorithm including reconfiguration and slicing (if requested). At the end of the hyper-optimizer loop, the best path (in term of flops) is returned. The configuration parameters that are varied by the hyper-optimizer are: * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_NUM_PARTITIONS`: The network is recursively split over ``num_partitions`` until the size of each partition is less than or equal to the cutoff. The allowed range for ``num_partitions`` is [2, 30]. When the hyper-optimizer is disabled the default value is 8. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_CUTOFF_SIZE`: The network is recursively split over ``num_partitions`` until the size of each partition is less than or equal to this cutoff. The allowed range for ``cutoff_size`` is [4, 50]. When the hyper-optimizer is disabled the default value is 8. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_ALGORITHM`: The graph algorithm to be used in graph partitioning. Choices are given in the `cutensornetGraphAlgo_t` enum. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_IMBALANCE_FACTOR`: Specifies the maximum allowed size imbalance among the partitions. Allowed range [30, 2000]. When the hyper-optimizer is disabled the default value is 200. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_NUM_ITERATIONS`: Specifies the number of iterations for the refinement algorithms at each stage of the uncoarsening process of the graph partitioner. Allowed range [1, 500]. When the hyper-optimizer is disabled the default value is 60. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_NUM_CUTS`: Specifies the number of different partitioning that the graph partitioner will compute. The final partitioning is the one that achieves the best edge-cut or communication volume. Allowed range [1, 40]. When the hyper-optimizer is disabled the default value is 10. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_ITERATIONS`: Specifies the number of subtrees to be chosen for reconfiguration. A value of 0 disables reconfiguration. The default value is 500. The amount of time spent in reconfiguration, which usually dominates the pathfinder run time, is proportional to this. * `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_RECONFIG_NUM_LEAVES`: Specifies the maximum number of leaves in the subtree chosen for optimization in each reconfiguration iteration. The default value is 8. The default value is 500. The amount of time spent in reconfiguration, which usually dominates the pathfinder run time, is proportional to this. The hyper-optimizer may be used to generate the values of all these parameters, or some of these parameters may be fixed to a given value (via `cutensornetContractionOptimizerConfigSetAttribute`). When a parameter is fixed, the hyper-optimizer will not randomize it. Supported data types ==================== A valid combination of the data and compute types for tensor network contractions inherits straightforwardly from that of cuTENSOR. Please refer to `cuTENSOR's User Guide`_ for detail. .. _cuTENSOR's User Guide: https://docs.nvidia.com/cuda/cutensor/user_guide.html#user-guide .. TODO: Add sections for: CUDA Graph Support, cuTensorNet Logging, Environment Variables following cuTENSOR: https://docs.nvidia.com/cuda/cutensor/user_guide.html Reference ========= For further information about general tensor networks, please refer to the following: * `Tensor Network `_ * `Hyper-optimized tensor network contraction `_ For the application of tensor networks to quantum circuit simulations, please see: * `NVIDIA Blog: What Is Quantum Computing? `_ * `Simulating Quantum Computation by Contracting Tensor Networks `_