.. _python tensor network APIs:

.. currentmodule:: cuquantum.tensornet

*******************
Tensor Network APIs
*******************

The tensor network module, :mod:`cuquantum.tensornet`, provides a Python-friendly interface for users to leverage the cuTensorNet library. It supports NumPy, CuPy, and PyTorch ndarray-like objects, enabling functionalities such as tensor network contraction, tensor decomposition, circuit-to-Einsum expression conversion, and a Pythonic tensor network state simulator. The following sections introduce these features in detail.

.. _python contraction APIs:

Contraction
===========

Introduction
------------

The contraction APIs support ndarray-like objects from NumPy, CuPy, and PyTorch and support specification of the tensor network as an Einstein summation expression. These APIs can be further categorized into two levels:

* The "coarse-grained" level, where the user deals with Python functions like :func:`contract`, :func:`contract_path`, :func:`einsum`, and :func:`einsum_path`. The coarse-grained level is an abstraction layer that is typically meant for single contraction operations.
* The "fine-grained" level, where the interaction is through operations on a :class:`Network` object. The fine-grained level allows the user to invest significant resources into finding an optimal contraction path and autotuning the network, where repeated contractions on the same network object allow for amortization of the cost (see also :ref:`python resource management`).

The APIs also allow for interoperability between the cuTensorNet library and external packages. For example, the user can specify a contraction order obtained from a different package (perhaps a research project). Alternatively, the user can obtain the contraction order and the sliced modes from cuTensorNet for downstream use elsewhere.

Usage example
-------------

Contracting the same tensor network demonstrated in the :ref:`cuTensorNet C example ` is as simple as:

.. testcode::

    from cuquantum.tensornet import contract
    from numpy.random import rand

    a = rand(96,64,64,96)
    b = rand(96,64,64)
    c = rand(64,96,64)

    r = contract("mhkn,ukh,xuy->mxny", a, b, c)

If desired, various options can be provided for the contraction. For PyTorch tensors, starting with cuQuantum Python v23.10 the :func:`contract` function works like a native PyTorch operator that can be recorded in the autograd graph and supports backward-mode automatic differentiation. See :func:`contract` for more details and examples.

Starting with cuQuantum v22.11 / cuTensorNet v2.0.0, cuTensorNet supports automatic MPI parallelism if users bind an MPI communicator to the library handle, among other requirements as outlined :ref:`here `. To illustrate, assuming all processes hold the same set of input tensors (on distinct GPUs) and the same network expression, this should work out of the box:

.. code-block:: python

    import cupy as cp
    from cupy.cuda.runtime import getDeviceCount
    from mpi4py import MPI

    from cuquantum.bindings import cutensornet as cutn
    from cuquantum.tensornet import contract

    # bind comm to cuTensorNet handle
    handle = cutn.create()
    comm = MPI.COMM_WORLD
    cutn.distributed_reset_configuration(
        handle, *cutn.get_mpi_comm_pointer(comm))

    # make each process run on a different GPU
    rank = comm.Get_rank()
    device_id = rank % getDeviceCount()
    cp.cuda.Device(device_id).use()

    # 1. assuming input tensors a, b, and c are created on the right GPU
    # 2. passing the handle explicitly allows reusing it to reduce the handle creation overhead
    r = contract(
        "mhkn,ukh,xuy->mxny", a, b, c,
        options={'device_id': device_id, 'handle': handle})

An end-to-end Python example of such auto-MPI usage can be found at https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/contraction/coarse/example22_mpi_auto.py.

.. note::

    As of cuQuantum v22.11 / cuTensorNet v2.0.0, the Python wheel does not have the required MPI wrapper library included. Users need to either build it from source (included in the wheel), or use the Conda package from conda-forge instead.
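The notion of a contraction order that can be exchanged with external packages can be illustrated with a NumPy-only sketch, using ``np.einsum`` as a stand-in for :func:`contract` (the extents are reduced here so the snippet runs quickly, and the pairwise order is spelled out by hand):

```python
import numpy as np

# small analogues of the tensors from the example above
a = np.random.rand(6, 4, 4, 6)
b = np.random.rand(6, 4, 4)
c = np.random.rand(4, 6, 4)

# single-shot contraction of the full expression
r = np.einsum("mhkn,ukh,xuy->mxny", a, b, c)

# the same result via an explicit pairwise contraction order
# ((0, 1), (0, 1)): contract a with b first, then the result with c
ab = np.einsum("mhkn,ukh->mnu", a, b)
r_path = np.einsum("mnu,xuy->mxny", ab, c)

assert np.allclose(r, r_path)
```

A contraction path found by cuTensorNet (or by an external package) is exactly such a sequence of pairwise contractions; which order is chosen can change the cost by orders of magnitude without changing the result.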
Finally, for users seeking full control over the tensor network operations and parallelization, we offer fine-grained APIs, as illustrated by the examples in the documentation for :class:`Network`. A complete example illustrating a parallel implementation of tensor network contraction using the fine-grained API is shown below:

.. literalinclude:: ../../../python/samples/tensornet/contraction/fine/example2_mpi.py
    :language: python
    :start-after: Sphinx

This "manual" MPI Python example can be found in the `NVIDIA/cuQuantum `_ repository (`here `_).

.. _call blocking:

Call blocking behavior
----------------------

By default, calls to the execution APIs (:meth:`Network.autotune` and :meth:`Network.contract` on the :class:`Network` object, as well as the function :func:`contract`) block and do not return until the operation is completed. This behavior can be changed by setting :attr:`NetworkOptions.blocking` and passing the options to :class:`Network`. When :attr:`NetworkOptions.blocking` is set to ``'auto'``, calls to the execution APIs return immediately after the operation is launched on the GPU, without waiting for it to complete, *if the input tensors are on the device*. If the input tensors are on the host, the execution API calls always block, since the result of the contraction is a tensor that will also reside on the host.

APIs that execute on the host (such as the :meth:`Network.contract_path` method on the :class:`Network` object and the :func:`contract_path` and :func:`einsum_path` functions) always block.

.. _stream semantics:

Stream semantics
----------------

The stream semantics depend on whether the behavior of the execution APIs is chosen to be blocking or non-blocking (see :ref:`call blocking`).

For blocking behavior, stream ordering is automatically handled by the cuQuantum Python high-level APIs for *operations that are performed within the package*. A stream can be provided for two reasons:

1. When the computation that prepares the input tensors is not already complete by the time the execution APIs are called. This is a correctness requirement for user-provided data.
2. To enable parallel computations across multiple streams if the device has sufficient resources and the current stream (which is the default) has concomitant operations. This can be done for performance reasons.

For non-blocking behavior, it is the user's responsibility to ensure correct stream ordering between the execution API calls.

In any case, the execution APIs are launched on the provided stream.

.. _python resource management:

Resource management
-------------------

An important aspect of the fine-grained, stateful object APIs (e.g., :class:`~Network`) is resource management: the internal resources, including library resources, memory resources, and user-provided input operands, must be managed safely throughout the object's lifetime. As such, there are caveats that users should be aware of, due to their impact on the memory watermark.

The stateful object APIs allow users to prepare an object and reuse it for multiple contractions or gradient calculations to amortize the preparation cost. Depending on the specific problem, investment in preparations that lead to a shorter execution time may be the ideal solution. During the preparation step, an object inevitably holds references to device memory for later reuse. However, such problems oftentimes imply high memory usage, making it impossible to hold multiple objects at the same time. Interleaving contractions of multiple large tensor networks is an example of this.

To address this use case, starting with cuQuantum Python v24.03 two new features were added:

1. Every execution method now accepts a ``release_workspace`` option. When this option is set to ``True`` (default is ``False``), the memory needed to perform an operation is freed before the method returns, making this memory available for other tasks. The next time the same (or a different) method is called, memory is allocated on demand. Therefore, there is a small overhead associated with ``release_workspace=True``, as allocating/deallocating memory could take time depending on the implementation of the underlying memory allocator (see :ref:`next section `); however, making multiple :class:`Network` objects coexist becomes possible; see, e.g., the `example6_resource_mgmt_contraction.py `_ sample.
2. The :meth:`~Network.reset_operands` method now accepts ``operands=None`` to free the internal reference to the input operands after the execution. This reduces potential memory contention and thereby allows contracting multiple networks with large input tensors in an *interleaved* fashion. In such cases, before a subsequent execution on the same :class:`Network` object, the :meth:`~Network.reset_operands` method should be called again with the new operands; see, e.g., the `example8_reset_operand_none.py `_ sample.

These two features, used separately or jointly as the problem requires, make it possible to prepare and use a large number of :class:`Network` objects when the available device memory is not enough to fit all problems at once.

.. _python memory management:

External memory management
--------------------------

Starting with cuQuantum Python v22.03, we support an `EMM`_-like interface, as proposed and supported by Numba, for users to set their Python mempool. Users set the option :attr:`NetworkOptions.allocator` to a Python object complying with the :class:`cuquantum.BaseCUDAMemoryManager` protocol, and pass the options to the Pythonic APIs like :func:`contract` or :class:`Network`. Temporary memory allocations will then be done through this interface.
(Internally, we use the same interface to use CuPy's or PyTorch's mempool depending on the input tensor operands.)

.. note::

    cuQuantum's :class:`~cuquantum.BaseCUDAMemoryManager` protocol is slightly different from Numba's EMM interface (:class:`numba.cuda.BaseCUDAMemoryManager`), but duck typing with an existing EMM instance (not type!) at runtime should be possible.

.. _EMM: https://numba.readthedocs.io/en/stable/cuda/external-memory.html

.. _tensor decompose:

Decomposition
=============

Introduction
------------

Decomposition methods such as QR and SVD are prevalent in tensor network algorithms, as they allow one to exploit the sparsity of the network and thus reduce the computational cost. Starting with cuQuantum Python v23.03, we provide these functionalities at *both* the tensor *and* tensor network levels.

The tensor-level decomposition routines are implemented inside the module :mod:`cuquantum.tensornet.tensor` with the following features:

* QR decomposition can be performed using :func:`cuquantum.tensornet.tensor.decompose` with :class:`cuquantum.tensornet.tensor.QRMethod`.
* Both exact and truncated SVD can be performed using :func:`cuquantum.tensornet.tensor.decompose` with :class:`cuquantum.tensornet.tensor.SVDMethod`.
* Decomposition options can be specified by :class:`cuquantum.tensornet.tensor.DecompositionOptions`.

As of cuQuantum Python v23.03, the tensor network level decomposition routines are implemented in the experimental subpackage :mod:`cuquantum.tensornet.experimental` with the main API :func:`cuquantum.tensornet.experimental.contract_decompose`. Given an input tensor network, this function can perform a full contraction followed by a QR or SVD decomposition. This can be specified via :class:`cuquantum.tensornet.experimental.ContractDecomposeAlgorithm`.
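The contract-then-decompose idea can be sketched in plain NumPy, with ``np.einsum`` and ``np.linalg.svd`` standing in for the library routines (the mode labels and shapes here are purely illustrative, not the actual API):

```python
import numpy as np

# two rank-3 tensors sharing mode k: A[i,j,k], B[k,l,m]
a = np.random.rand(3, 4, 5)
b = np.random.rand(5, 3, 4)

# step 1: contract the network into a single tensor T[i,j,l,m]
t = np.einsum('ijk,klm->ijlm', a, b)

# step 2: decompose T back into two tensors,
# T[i,j,l,m] = sum_x U[i,j,x] V[x,l,m],
# by flattening into a matrix and taking an SVD
i, j, l, m = t.shape
u, s, vh = np.linalg.svd(t.reshape(i * j, l * m), full_matrices=False)
left = (u * s).reshape(i, j, -1)   # singular values absorbed into the left factor
right = vh.reshape(-1, l, m)

# contracting the factors reproduces the contracted tensor
assert np.allclose(np.einsum('ijx,xlm->ijlm', left, right), t)
```

Truncating the singular values ``s`` before reshaping is what turns this exact factorization into the reduced-bond (truncated SVD) variant controlled by :class:`cuquantum.tensornet.tensor.SVDMethod`.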
If the contract and decompose problem amounts to a **ternary-operand gate split problem**, commonly seen in quantum circuit simulation (see :ref:`Gate Split Algorithm` for details), the user can potentially leverage QR decompositions to speed up the execution of the contraction and SVD. This can be achieved by setting both :attr:`cuquantum.tensornet.experimental.ContractDecomposeAlgorithm.qr_method` and :attr:`cuquantum.tensornet.experimental.ContractDecomposeAlgorithm.svd_method`.

.. note::

    The APIs inside :mod:`cuquantum.tensornet.experimental` are subject to change and may be integrated into the main package :mod:`cuquantum.tensornet` in a future release. Users are encouraged to leave feedback on `NVIDIA/cuQuantum GitHub Discussions `_.

Usage example
-------------

.. testcode::

    import cupy

    from cuquantum.tensornet import contract
    from cuquantum.tensornet.tensor import decompose
    from cuquantum.tensornet.experimental import contract_decompose

    # create a random rank-4 tensor
    a = cupy.random.random((2,2,2,2)) + cupy.random.random((2,2,2,2)) * 1j

    # perform QR decomposition such that A[i,j,k,l] = \sum_{x} Q[i,x,k] R[x,j,l]
    q, r = decompose('ijkl->ixk,xjl', a)  # QR by default

    # check the unitary property of q
    identity = contract('ixk,iyk->xy', q, q.conj())
    identity_reference = cupy.eye(identity.shape[0])
    assert cupy.allclose(identity, identity_reference)

    # check that contracting the decomposition outputs yields the input
    a_reference = contract('ixk,xjl->ijkl', q, r)
    assert cupy.allclose(a, a_reference)

More examples of tensor decompositions are available in our `sample directory `_ to demonstrate the use of QR and SVD in different settings. For tensor network decompositions, please refer to `this directory `_ for more detailed examples. We have also provided a Jupyter notebook to demonstrate how to easily implement basic `MPS algorithms `_ using these new APIs.

.. _CircuitToEinsum converter:

CircuitToEinsum converter
=========================

Introduction
------------

Starting with cuQuantum Python v22.07, we provide a :class:`CircuitToEinsum` converter that takes either a :class:`qiskit.QuantumCircuit` or a :class:`cirq.Circuit` and generates the corresponding tensor network contraction for the target operation. The goal of the converter is to allow Qiskit and Cirq users to easily explore the functionalities of the cuTensorNet library.

As mentioned in the :ref:`tensor network introduction `, quantum circuits can be viewed as tensor networks. For any quantum circuit, :class:`CircuitToEinsum` can construct the corresponding tensor network to compute various quantities of interest. The output tensor network is returned as an Einstein summation expression with tensor operands. We support the following operations:

* :meth:`~CircuitToEinsum.state_vector`: The contraction of this Einstein summation expression yields the final state coefficients as an N-dimensional tensor, where N is the number of qubits in the circuit. The mode labels of the tensor correspond to :attr:`CircuitToEinsum.qubits`.
* :meth:`~CircuitToEinsum.amplitude`: The contraction of this Einstein summation expression yields the amplitude coefficient for a given bitstring.
* :meth:`~CircuitToEinsum.batched_amplitudes`: The contraction of this Einstein summation expression yields the amplitude coefficients for a subset of qubits while the others are fixed at certain states.
* :meth:`~CircuitToEinsum.reduced_density_matrix`: The contraction of this Einstein summation expression yields the reduced density matrix for a subset of qubits, optionally with another subset of qubits set to a fixed state.
* :meth:`~CircuitToEinsum.expectation`: The contraction of this Einstein summation expression yields the expectation value for a given Pauli string.
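The kind of Einstein summation expression that :meth:`~CircuitToEinsum.amplitude` produces can be illustrated with a NumPy-only sketch. For a hand-built two-qubit Bell-state circuit (H on qubit 0, then CNOT), the amplitude of a bitstring is a single einsum over the initial-state kets, the gate tensors, and the bra of the bitstring (this is a simplified illustration, not the converter's actual output expression):

```python
import numpy as np

ket0 = np.array([1.0, 0.0])                            # |0> for each qubit
h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # Hadamard, H[out, in]
# CNOT as a rank-4 tensor T[out0, out1, in0, in1]; qubit 0 is the control
cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float).reshape(2, 2, 2, 2)

# amplitude of bitstring '00': <00| CNOT (H x I) |00>
amp = np.einsum('a,b,ca,decb,d,e->', ket0, ket0, h, cnot, ket0, ket0)
assert np.isclose(amp, 1 / np.sqrt(2))

# the full state vector is the same network without the final bra tensors
psi = np.einsum('a,b,ca,decb->de', ket0, ket0, h, cnot)
assert np.isclose(psi[1, 1], 1 / np.sqrt(2))   # Bell state: |00> and |11>
```

Note how fixing the output qubits to a bitstring (the two trailing ``ket0`` operands) removes all free modes, so the contraction yields a scalar, whereas :meth:`~CircuitToEinsum.state_vector` leaves one free mode per qubit.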
The :class:`CircuitToEinsum` class also allows the user to specify a desired tensor backend (``cupy``, ``torch``, or ``numpy``) via the ``backend`` argument when constructing the converter object. The returned Einstein summation expression and tensor operands can then directly serve as the input arguments for :func:`cuquantum.contract` or the corresponding backend's ``einsum`` function.

Usage example
-------------

.. testcode::

    import cirq
    import cupy

    from cuquantum.tensornet import contract, CircuitToEinsum

    # create a random cirq.Circuit
    circuit = cirq.testing.random_circuit(qubits=4, n_moments=4, op_density=0.9, random_state=1)
    # the same task can be achieved with qiskit.circuit.random.random_circuit

    # construct the CircuitToEinsum converter targeting double precision and cupy operands
    converter = CircuitToEinsum(circuit, dtype='complex128', backend='cupy')

    # generate the Einstein summation expression and tensor operands for computing
    # the amplitude coefficient of bitstring 0000
    expression, operands = converter.amplitude(bitstring='0000')
    assert all([isinstance(op, cupy.ndarray) for op in operands])

    # contract the network to compute the amplitude
    amplitude = contract(expression, *operands)
    amplitude_cupy = cupy.einsum(expression, *operands)
    assert cupy.allclose(amplitude, amplitude_cupy)

Multiple Jupyter notebooks are `available `_ for Cirq and Qiskit users to easily build up their tensor network based simulations using cuTensorNet.

.. _TN simulator:

.. currentmodule:: cuquantum.tensornet.experimental

Tensor network simulator
========================

.. _TN simulator intro:

Introduction
------------

Starting from cuQuantum Python v24.08, we provide new APIs that enable Python users to easily leverage :ref:`cuTensorNet tensor network state APIs ` for tensor network simulation. These APIs are now available under the :mod:`cuquantum.tensornet.experimental` module and may **be subject to change** in future releases.
Please share your feedback with us on `NVIDIA/cuQuantum GitHub Discussions`_!

The new set of APIs is centered around the :class:`~NetworkState` class and is designed to support the following groups of users:

- **Quantum Computing Framework Users:** These users can directly initialize a tensor network state from quantum circuit objects such as :class:`cirq.Circuit` or :class:`qiskit.QuantumCircuit` via the :meth:`NetworkState.from_circuit` method.
- **Tensor Network Framework Developers and Researchers:** These users can build any state of interest by applying tensor operators and matrix product operators (MPOs), and by setting the initial state to a matrix product state (MPS). The methods involved here include :meth:`NetworkState.apply_tensor_operator`, :meth:`NetworkState.update_tensor_operator`, :meth:`NetworkState.set_initial_mps`, :meth:`NetworkState.apply_mpo`, and :meth:`NetworkState.apply_network_operator`. Ndarray-like objects from NumPy, CuPy, and PyTorch are all supported as input operands.

.. _NVIDIA/cuQuantum GitHub Discussions: https://github.com/NVIDIA/cuQuantum/discussions

.. note::

    - The :class:`~NetworkState` class supports arbitrary state dimensions beyond regular quantum circuit states with qubits (``d=2``). An example of simulating a complex state with non-uniform state dimensions can be found in the `arbitrary state example`_.
    - For both MPS and MPO, only open boundary conditions are supported.

.. _arbitrary state example: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/experimental/network_state/generic_states/example02_arbitrary_dimension_numpy.py

Users can further specify the tensor network simulation method as one of the following:

- **Contraction-Based Simulations:** Specify ``config`` as a :class:`~TNConfig` object.
- **MPS-Based Simulations:** Specify ``config`` as a :class:`~MPSConfig` object, offering detailed control over truncation extents, canonical centers, SVD algorithms, and normalization options.
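The truncation that :class:`~MPSConfig` controls can be illustrated with a NumPy-only sketch: splitting a two-site state by SVD and keeping only the largest singular values yields a lower-rank approximation whose error is exactly the norm of the discarded singular values (the shapes and the bond extent ``chi`` here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# a normalized two-site state with physical dimension 8 on each site
psi = rng.standard_normal((8, 8))
psi /= np.linalg.norm(psi)

# split the state into two MPS site tensors via SVD
u, s, vh = np.linalg.svd(psi, full_matrices=False)

# truncate: keep only the chi largest singular values (the bond extent)
chi = 4
left = u[:, :chi] * s[:chi]    # singular values absorbed into the left site
right = vh[:chi, :]
psi_truncated = left @ right

# the truncation error equals the norm of the discarded singular values
error = np.linalg.norm(psi - psi_truncated)
assert np.isclose(error, np.linalg.norm(s[chi:]))
```

This is why both a maximal bond extent and a value-based cutoff on the singular values are natural truncation knobs for MPS simulation.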
Once the problem is fully specified, users can take advantage of the following execution APIs to compute various properties:

- :meth:`NetworkState.compute_state_vector`: Computes the final state coefficients as an N-dimensional tensor with extents matching the specified state.
- :meth:`NetworkState.compute_amplitude`: Computes the amplitude coefficient for a given bitstring.
- :meth:`NetworkState.compute_batched_amplitudes`: Computes the batched amplitude coefficients for a subset of state dimensions while the others are fixed at certain states.
- :meth:`NetworkState.compute_reduced_density_matrix`: Computes the reduced density matrix for a subset of state dimensions, optionally fixing another subset to specific states.
- :meth:`NetworkState.compute_expectation`: Computes the expectation value for a given tensor network operator, which can be specified as a sum of tensor products (such as Pauli operators) or MPOs with coefficients.
- :meth:`NetworkState.compute_sampling`: Draws samples from the underlying state, with the option to sample just a subset of all state dimensions.
- :meth:`NetworkState.compute_norm`: Computes the norm of the tensor network state.

Additionally, the :class:`NetworkOperator` class allows users to create a network operator object as a sum of tensor products (via :meth:`NetworkOperator.append_product`) or MPOs (via :meth:`NetworkOperator.append_mpo`) with coefficients. This object can then interact with the :class:`NetworkState` class, enabling users to apply an MPO to the state or compute the expectation value of the operator on the state using methods like :meth:`NetworkState.apply_network_operator` and :meth:`NetworkState.compute_expectation`.

.. _simulator caching:

Caching feature
---------------

As of cuQuantum v24.08, :class:`NetworkState` offers preliminary caching support for all execution methods with a ``compute_`` prefix **when contraction-based tensor network simulation or MPS simulation without value-based truncation is used**.
During the first call to these methods, the underlying cuTensorNet C object for the requested property is created, prepared, cached, and then executed to compute the final output. On subsequent calls to the **same method with compatible parameters**, without updating the state via :meth:`NetworkState.apply_tensor_operator`, :meth:`NetworkState.apply_mpo`, :meth:`NetworkState.set_initial_mps`, or :meth:`NetworkState.apply_network_operator` (it is okay to call :meth:`NetworkState.update_tensor_operator`), the cached C object is reused to compute the final output, thus reducing the overhead of C object creation and preparation.

**Compatible parameters** mean different things for different execution methods:

- For :meth:`NetworkState.compute_state_vector`, :meth:`NetworkState.compute_amplitude`, and :meth:`NetworkState.compute_norm`, any parameters will result in using the same cached object.
- For :meth:`NetworkState.compute_batched_amplitudes`, the set of state dimensions specified by ``fixed`` must be identical, while the fixed state for each dimension may differ.
- For :meth:`NetworkState.compute_reduced_density_matrix`, the ``where`` parameter and the set of state dimensions specified by ``fixed`` must be identical, while the fixed state for each dimension may differ.
- For :meth:`NetworkState.compute_expectation`, the same :class:`NetworkOperator` object with unchanged underlying components must be used. Providing ``operators`` as a string of Pauli operators or as a dictionary mapping Pauli strings to coefficients will **not** activate the caching mechanism.
- For :meth:`NetworkState.compute_sampling`, the same ``modes`` parameter is required to activate the caching mechanism.

For more details, please refer to our `cirq caching example`_ and `qiskit caching example`_.
Additionally, users can leverage the caching feature along with the :meth:`NetworkState.update_tensor_operator` method to reduce the overhead of variational workflows, where the same computation needs to be performed on numerous states with identical topologies. For more details, please refer to our `variational workflow example`_.

.. _cirq caching example: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/experimental/network_state/circuits_cirq/example04_caching.py
.. _qiskit caching example: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/experimental/network_state/circuits_qiskit/example04_caching.py
.. _variational workflow example: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/experimental/network_state/generic_states/example04_variational_expectation.py

.. _simulator mpi:

MPI support
-----------

As of cuQuantum v24.08, :class:`NetworkState` offers preliminary distributed parallel support for all execution methods with a ``compute_`` prefix **when contraction-based tensor network simulation is used**, i.e., with :class:`TNConfig`. To activate distributed parallel execution, users must perform the following tasks:

1. Explicitly set the device ID to use in :attr:`cuquantum.tensornet.NetworkOptions.device_id` and provide it to :class:`NetworkState` via the ``options`` parameter.
2. Explicitly create the library handle on the corresponding device using :func:`cuquantum.bindings.cutensornet.create`, bind an MPI communicator to the library handle using :func:`cuquantum.bindings.cutensornet.distributed_reset_configuration`, and provide it to :class:`NetworkState` via the ``options`` parameter.

For more details, please refer to our `cirq mpi sampling example`_ and `qiskit mpi sampling example`_.

.. _cirq mpi sampling example: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/experimental/network_state/circuits_cirq/example07_mpi_sampling.py
.. _qiskit mpi sampling example: https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/tensornet/experimental/network_state/circuits_qiskit/example07_mpi_sampling.py

API reference
=============

.. module:: cuquantum.tensornet

Objects
-------

.. autosummary::
    :toctree: generated/

    Network
    CircuitToEinsum

.. autosummary::
    :toctree: generated/
    :template: dataclass.rst

    NetworkOptions
    OptimizerInfo
    OptimizerOptions
    PathFinderOptions
    ReconfigOptions
    SlicerOptions

Python functions
----------------

.. autosummary::
    :toctree: generated/

    contract
    contract_path
    einsum
    einsum_path
    get_mpi_comm_pointer

.. currentmodule:: cuquantum.tensornet.tensor

Tensor submodule
----------------

.. autosummary::
    :toctree: generated/

    decompose

.. autosummary::
    :toctree: generated/
    :template: dataclass.rst

    DecompositionOptions
    QRMethod
    SVDInfo
    SVDMethod

.. currentmodule:: cuquantum.tensornet.experimental

Experimental submodule
----------------------

.. autosummary::
    :toctree: generated/

    contract_decompose
    NetworkState
    NetworkOperator

.. autosummary::
    :toctree: generated/
    :template: dataclass.rst

    ContractDecomposeAlgorithm
    ContractDecomposeInfo
    MPSConfig
    TNConfig