.. _cuTensorNet C example:

********
Examples
********

In this section, we show how to contract a tensor network using *cuTensorNet*. First, we describe how to compile the sample code. Then, we present an example code that performs the common steps in *cuTensorNet*. In the example, we perform the following tensor contraction:

.. math::

   R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}

We build the code step by step, with each step adding code at the end. The steps are separated by succinct multi-line comment blocks.

It is recommended that the reader refer to :doc:`overview` and the `cuTENSOR documentation`_ for familiarity with the nomenclature and with cuTENSOR operations.

.. _cuTENSOR documentation: https://docs.nvidia.com/cuda/cutensor/index.html

==============
Compiling Code
==============

Assuming cuQuantum has been extracted in ``CUQUANTUM_ROOT`` and cuTENSOR in ``CUTENSOR_ROOT``, we update the library path as follows:

.. code-block:: bash

   export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ``${CUTENSOR_ROOT}/lib/11.0``).

The serial sample code discussed below (``tensornet_example.cu``) can be compiled via the following command:

.. code-block:: bash

   nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example

For static linking against the *cuTensorNet* library, use the following command (note that ``libmetis_static.a`` needs to be explicitly linked against, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through ``$LIBRARY_PATH``):

.. code-block:: bash

   nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example

In order to build the parallel (MPI) versions of the examples (``tensornet_example_mpi_auto.cu`` and ``tensornet_example_mpi.cu``), one will need to have an MPI library installed (e.g., a recent Open MPI, MVAPICH, or MPICH). In particular, the automatic parallel example requires a *CUDA-aware* MPI; see :ref:`automatic mpi sample` below. In this case, one will need to add ``-I${MPI_PATH}/include`` and ``-L${MPI_PATH}/lib -lmpi`` to the build command:

.. code-block:: bash

   nvcc tensornet_example_mpi_auto.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi_auto
   nvcc tensornet_example_mpi.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi

.. warning::

   When running ``tensornet_example_mpi_auto.cu`` without CUDA-aware MPI, the program will crash.

.. note::

   Depending on the source of the cuQuantum package, you may need to replace ``lib`` above by ``lib64``.

=====================
Code Example (Serial)
=====================

The following code example illustrates the common steps necessary to use *cuTensorNet* and also introduces typical tensor network operations. The full sample code (``tensornet_example.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

----------------------
Headers and data types
----------------------
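For orientation, the sketch below shows a minimal set of headers and error-checking macros that such a program might use. It is illustrative only: the macro names ``HANDLE_CUTN_ERROR`` and ``HANDLE_CUDA_ERROR`` are hypothetical, and the actual sample (included right after) defines its own helpers and data types.

.. code-block:: c++

   // Sketch only: minimal includes and error checks for a cuTensorNet host program.
   #include <stdio.h>
   #include <stdlib.h>

   #include <cuda_runtime.h>
   #include <cutensornet.h>

   // Abort on any cuTensorNet error, printing the library's error string.
   #define HANDLE_CUTN_ERROR(call)                                     \
      {                                                                \
         const cutensornetStatus_t status = (call);                    \
         if (status != CUTENSORNET_STATUS_SUCCESS) {                   \
            printf("cuTensorNet error %s in line %d\n",                \
                   cutensornetGetErrorString(status), __LINE__);       \
            exit(EXIT_FAILURE);                                        \
         }                                                             \
      }

   // Abort on any CUDA runtime error.
   #define HANDLE_CUDA_ERROR(call)                                     \
      {                                                                \
         const cudaError_t err = (call);                               \
         if (err != cudaSuccess) {                                     \
            printf("CUDA error %s in line %d\n",                       \
                   cudaGetErrorString(err), __LINE__);                 \
            exit(EXIT_FAILURE);                                        \
         }                                                             \
      }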
.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #1
   :end-before: Sphinx: #2

--------------------------------------
Define tensor network and tensor sizes
--------------------------------------

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #2
   :end-before: Sphinx: #3

-----------------------------------
Allocate memory and initialize data
-----------------------------------

Next, we allocate memory for the tensor network operands and initialize them to random values.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #3
   :end-before: Sphinx: #4

-----------------------------------------
cuTensorNet handle and network descriptor
-----------------------------------------

Next, we initialize the *cuTensorNet* library via `cutensornetCreate()` and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types. Note that the created library context will be associated with the currently active GPU.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #4
   :end-before: Sphinx: #5

-------------------------------------
Optimal contraction order and slicing
-------------------------------------

At this stage, we can deploy the *cuTensorNet* optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available memory resources, and provide it to the optimizer as a *constraint*. We then create an optimizer configuration object of type `cutensornetContractionOptimizerConfig_t` to specify various optimizer options and provide it to the optimizer, which is invoked via `cutensornetContractionOptimize()`. The results from the optimizer will be returned in an optimizer info object of type `cutensornetContractionOptimizerInfo_t`.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #5
   :end-before: Sphinx: #6

It is also possible to bypass the *cuTensorNet* optimizer and import a pre-determined contraction path, as well as slicing information, directly into the optimizer info object via `cutensornetContractionOptimizerInfoSetAttribute()`.
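For illustration, a pre-determined path for the four-tensor network above could be imported as in the following sketch. This is not part of the shipped sample; it assumes ``handle`` and ``optimizerInfo`` have been created as in the previous steps, and the pairwise ordering shown is just one possibility.

.. code-block:: c++

   // Sketch only: import a user-provided contraction path, bypassing the optimizer.
   // Assumes `handle` and `optimizerInfo` were created as shown in the steps above.
   #include <cutensornet.h>

   void setPredeterminedPath(cutensornetHandle_t handle,
                             cutensornetContractionOptimizerInfo_t optimizerInfo)
   {
      // Three pairwise contractions for the four input tensors A, B, C, D.
      // One possible ordering; see the cuTensorNet documentation for the exact
      // pair-index convention of cutensornetContractionPath_t.
      static cutensornetNodePair_t pairs[] = { {0, 1}, {0, 1}, {0, 1} };
      cutensornetContractionPath_t path;
      path.numContractions = 3;
      path.data            = pairs;

      cutensornetStatus_t status = cutensornetContractionOptimizerInfoSetAttribute(
          handle, optimizerInfo,
          CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
          &path, sizeof(path));
      // In real code, check `status` against CUTENSORNET_STATUS_SUCCESS.
      (void)status;
   }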
---------------------------------------------------------
Create workspace descriptor and allocate workspace memory
---------------------------------------------------------

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set it in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #6
   :end-before: Sphinx: #7

------------------------------
Contraction plan and auto-tune
------------------------------

We create a tensor network contraction plan holding all pairwise tensor contraction plans for cuTENSOR. Optionally, we can auto-tune the plan so that cuTENSOR selects the best kernel for each pairwise contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing the plan redundantly.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #7
   :end-before: Sphinx: #8

------------------------------------
Tensor network contraction execution
------------------------------------

Finally, we contract the tensor network as many times as needed, possibly with different input each time. Tensor network slices, captured as a `cutensornetSliceGroup_t` object, are computed using the same contraction plan. For convenience, `NULL` can be provided to the `cutensornetContractSlices()` function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free the allocated resources.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #8

Recall that the full sample code (``tensornet_example.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

.. _automatic mpi sample:

================================================================
Code Example (Automatic Slice-Based Distributed Parallelization)
================================================================

It is straightforward to adapt `Code Example (Serial)`_ and enable automatic parallel execution across multiple/many GPU devices (across multiple/many nodes). We illustrate this with an example that uses the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions needed to enable distributed parallel execution, without making any changes to the original serial source code. The full MPI-automatic sample code (``tensornet_example_mpi_auto.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository. To enable automatic parallelism, *cuTensorNet* requires that

* the environment variable ``$CUTENSORNET_COMM_LIB`` is set to the path to the wrapper shared library ``libcutensornet_distributed_interface_mpi.so``, and
* the executable is linked to a CUDA-aware MPI library.

The detailed instructions for setting these up are given in the installation guide above.

First, in addition to the headers and definitions mentioned in `Headers and data types`_, we include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and assign a *unique* GPU device to each MPI process; this device will later be associated with the *cuTensorNet* library handle created inside the MPI process.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #1 [begin]
   :end-before: Sphinx: MPI #1 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #2 [begin]
   :end-before: Sphinx: MPI #2 [end]

The MPI service initialization must precede the first `cutensornetCreate()` call, which creates a *cuTensorNet* library handle. An attempt to call `cutensornetCreate()` before initializing the MPI service will result in an error.
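The required ordering is summarized in the following sketch (illustrative only; error handling is omitted): the MPI service is initialized first, a GPU device is bound to the process, and only then is the library handle created.

.. code-block:: c++

   // Sketch of the required initialization order for the automatic-parallelism sample.
   #include <mpi.h>
   #include <cuda_runtime.h>
   #include <cutensornet.h>

   int main(int argc, char **argv)
   {
      // 1. Initialize the MPI service before any cuTensorNet call.
      MPI_Init(&argc, &argv);

      int rank = 0, numProcs = 1;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

      // 2. Bind each MPI process to its own GPU device (simple round-robin choice).
      int numDevices = 0;
      cudaGetDeviceCount(&numDevices);
      cudaSetDevice(rank % numDevices);

      // 3. Only now create the cuTensorNet library handle.
      cutensornetHandle_t handle;
      cutensornetCreate(&handle);

      // ... define and contract the tensor network as in the serial example ...

      cutensornetDestroy(handle);
      MPI_Finalize();
      return 0;
   }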
.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #3 [begin]
   :end-before: Sphinx: MPI #3 [end]

If multiple GPU devices located on the same node are visible to an MPI process, we need to pick an exclusive GPU device for each MPI process. If the ``mpirun`` (or ``mpiexec``) command provided by your MPI library implementation sets an environment variable that shows the rank of the respective MPI process during its invocation, you can use that environment variable to set ``CUDA_VISIBLE_DEVICES`` to a single specific GPU device assigned exclusively to that MPI process (for example, Open MPI provides ``${OMPI_COMM_WORLD_LOCAL_RANK}`` for this purpose). Otherwise, the GPU device can be set manually, as shown below.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #4 [begin]
   :end-before: Sphinx: MPI #4 [end]

Next, we define the tensor network as described in `Define tensor network and tensor sizes`_. In the one-GPU-device-per-process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #5 [begin]
   :end-before: Sphinx: MPI #5 [end]

Once the MPI service has been initialized and the *cuTensorNet* library handle has been created, one can activate distributed parallel execution by calling `cutensornetDistributedResetConfiguration`. Per standard practice, the user's code needs to create a duplicate MPI communicator via ``MPI_Comm_dup``. The duplicate MPI communicator is then associated with the *cuTensorNet* library handle by passing a pointer to it, together with its size (in bytes), to the `cutensornetDistributedResetConfiguration` call. The MPI communicator is stored inside the *cuTensorNet* library handle such that all subsequent calls to the tensor network contraction path finder and tensor network contraction executor will be parallelized across all participating MPI processes (each MPI process being associated with its own GPU).

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #6 [begin]
   :end-before: Sphinx: MPI #6 [end]

.. note::

   `cutensornetDistributedResetConfiguration` is a collective call that must be executed by all participating MPI processes.

The API of this distributed parallelization model makes it straightforward to run source code written for serial execution on multiple GPUs/nodes. Essentially, all MPI processes execute exactly the same (serial) source code while automatically performing distributed parallelization inside the tensor network contraction path finder and tensor network contraction executor calls. The parallelization of the tensor network contraction path finder only occurs when the number of requested hyper-samples is larger than zero. Regardless of that, however, activation of the distributed parallelization must precede the invocation of the tensor network contraction path finder. That is, the tensor network contraction path finder and tensor network contraction execution invocations must be made strictly after activating the distributed parallelization via `cutensornetDistributedResetConfiguration`.
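For example, a nonzero number of hyper-samples can be requested on the optimizer configuration object before the path finder is invoked. The following sketch is illustrative only; it assumes ``handle`` and ``optimizerConfig`` have been created as in the serial example, and the value ``8`` is an arbitrary choice.

.. code-block:: c++

   // Sketch: request hyper-optimizer samples so that the distributed path finder
   // has work to parallelize across MPI processes.
   // Assumes `handle` and `optimizerConfig` were created as in the serial example.
   #include <stdint.h>
   #include <cutensornet.h>

   void requestHyperSamples(cutensornetHandle_t handle,
                            cutensornetContractionOptimizerConfig_t optimizerConfig)
   {
      const int32_t numHyperSamples = 8;  // any value > 0 enables parallel path finding
      cutensornetStatus_t status = cutensornetContractionOptimizerConfigSetAttribute(
          handle, optimizerConfig,
          CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES,
          &numHyperSamples, sizeof(numHyperSamples));
      // In real code, check `status` against CUTENSORNET_STATUS_SUCCESS.
      (void)status;
   }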
When the distributed configuration is set to a parallel mode, the user is normally expected to invoke tensor network contraction execution by calling the `cutensornetContractSlices` function with the full range of tensor network slices, which is then automatically distributed across all MPI processes. Since a tensor network must be sufficiently large to benefit from distributed execution, smaller tensor networks (those consisting of only a single slice) can still be processed without distributed parallelization. This is achieved by calling `cutensornetDistributedResetConfiguration` with a ``NULL`` argument in place of the MPI communicator pointer (as before, this must be done prior to calling the tensor network contraction path finder). In other words, the switch between distributed parallelization and redundant serial execution can be made on a per-tensor-network basis: users can decide which (larger) tensor networks to process in parallel and which (smaller) ones to process serially and redundantly, by resetting the distributed configuration appropriately. In both cases, all MPI processes will produce the same output tensor (result) at the end of the tensor network execution.
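Such per-network switching could be organized as in the following sketch (illustrative only; ``isLargeNetwork`` stands for any user-defined criterion, and the communicator duplication mirrors the snippet shown earlier).

.. code-block:: c++

   // Sketch: choose, per tensor network, between distributed and redundant serial
   // execution. Must be called before the contraction path finder is invoked for
   // that network.
   #include <mpi.h>
   #include <cutensornet.h>

   void configureDistribution(cutensornetHandle_t handle,
                              bool isLargeNetwork /* user-defined criterion */)
   {
      if (isLargeNetwork) {
         // Activate distributed parallelization with a duplicated MPI communicator.
         static MPI_Comm cutnComm = MPI_COMM_NULL;
         if (cutnComm == MPI_COMM_NULL) {
            MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm);
         }
         cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm));
      } else {
         // Process this (small) network serially and redundantly on every process.
         cutensornetDistributedResetConfiguration(handle, NULL, 0);
      }
   }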
.. note::

   In the current version of the *cuTensorNet* library, the parallel tensor network contraction execution triggered by the `cutensornetContractSlices` call will block the provided CUDA stream as well as the calling CPU thread until the execution has completed on all MPI processes. This is a temporary limitation that will be lifted in future versions of the *cuTensorNet* library, where the call to `cutensornetContractSlices` will be fully asynchronous, similar to the serial execution case. Additionally, for an explicit synchronization of all MPI processes (a barrier), one can make a collective call to `cutensornetDistributedSynchronize`.

Before termination, the MPI service needs to be finalized.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi_auto.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #7 [begin]
   :end-before: Sphinx: MPI #7 [end]

The complete MPI-automatic sample (``tensornet_example_mpi_auto.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

=============================================================
Code Example (Manual Slice-Based Distributed Parallelization)
=============================================================

For advanced users, it is also possible (but more involved) to adapt `Code Example (Serial)`_ to explicitly parallelize the execution of the tensor network contraction on multiple GPU devices. Here we also use MPI as the communication layer. For brevity, we show only the changes that need to be made on top of the serial example. The full MPI-manual sample code (``tensornet_example_mpi.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository. Note that this sample does **NOT** require CUDA-aware MPI.

First, in addition to the headers and definitions mentioned in `Headers and data types`_, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #1 [begin]
   :end-before: Sphinx: MPI #1 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #2 [begin]
   :end-before: Sphinx: MPI #2 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #3 [begin]
   :end-before: Sphinx: MPI #3 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #4 [begin]
   :end-before: Sphinx: MPI #4 [end]

Next, we define the tensor network as described in `Define tensor network and tensor sizes`_. In the one-GPU-device-per-process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #5 [begin]
   :end-before: Sphinx: MPI #5 [end]

Then we create the library handle and tensor network descriptor on each process, as described in `cuTensorNet handle and network descriptor`_.

Next, we find the optimal contraction path and slicing combination for our tensor network. We run the *cuTensorNet* optimizer on all processes and determine which process has the best path in terms of the FLOP count. We then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all other processes. Each process now has the same optimizer info object, which we use to calculate each process' share of slices.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #6 [begin]
   :end-before: Sphinx: MPI #6 [end]

We now create the workspace descriptor and allocate memory as described in `Create workspace descriptor and allocate workspace memory`_, and create the contraction plan and auto-tune it as described in `Contraction plan and auto-tune`_.

Next, on each process, we create a slice group (see `cutensornetSliceGroup_t`) that corresponds to its share of the tensor network slices. We then provide this slice group object to the `cutensornetContractSlices()` function to obtain a partial contraction result on each process.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #7 [begin]
   :end-before: Sphinx: MPI #7 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #8 [begin]
   :end-before: Sphinx: MPI #8 [end]

Finally, we sum up the partial contributions to obtain the result of the tensor network contraction.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #9 [begin]
   :end-before: Sphinx: MPI #9 [end]

Before termination, the MPI service needs to be finalized.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #10 [begin]
   :end-before: Sphinx: MPI #10 [end]

The complete MPI-manual sample (``tensornet_example_mpi.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.
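The overall pattern of the manual approach can be summarized in the following sketch. It is illustrative only: it assumes the handle, optimizer info, contraction plan, workspace descriptor, and device buffers have already been created as in the serial example, assumes a real ``double`` data type, stages the reduction through host memory (so no CUDA-aware MPI is required), and omits error checking.

.. code-block:: c++

   // Sketch of the manual slice-distribution pattern: each process contracts its
   // own range of slice IDs and the partial results are summed over MPI.
   #include <vector>
   #include <mpi.h>
   #include <cuda_runtime.h>
   #include <cutensornet.h>

   void contractMySlices(cutensornetHandle_t handle,
                         cutensornetContractionOptimizerInfo_t optimizerInfo,
                         cutensornetContractionPlan_t plan,
                         cutensornetWorkspaceDescriptor_t workDesc,
                         const void * const rawDataIn[],  // device pointers to A, B, C, D
                         void *R_d,                       // device pointer to the output R
                         int64_t numElementsR,
                         cudaStream_t stream)
   {
      int rank = 0, numProcs = 1;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

      // Query the total number of slices determined by the optimizer.
      int64_t numSlices = 0;
      cutensornetContractionOptimizerInfoGetAttribute(
          handle, optimizerInfo,
          CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
          &numSlices, sizeof(numSlices));

      // Compute this process' contiguous share of slice IDs.
      const int64_t chunk   = numSlices / numProcs;
      const int64_t extra   = numSlices % numProcs;
      const int64_t myBegin = rank * chunk + (rank < extra ? rank : extra);
      const int64_t myEnd   = myBegin + chunk + (rank < extra ? 1 : 0);

      // Create a slice group for this range and contract it (no accumulation).
      cutensornetSliceGroup_t sliceGroup;
      cutensornetCreateSliceGroupFromIDRange(handle, myBegin, myEnd, 1, &sliceGroup);
      cutensornetContractSlices(handle, plan, rawDataIn, R_d,
                                /*accumulateOutput=*/0, workDesc, sliceGroup, stream);
      cudaStreamSynchronize(stream);
      cutensornetDestroySliceGroup(sliceGroup);

      // Sum the partial results over all processes via host staging.
      std::vector<double> partial(numElementsR), total(numElementsR);
      cudaMemcpy(partial.data(), R_d, numElementsR * sizeof(double), cudaMemcpyDeviceToHost);
      MPI_Allreduce(partial.data(), total.data(), static_cast<int>(numElementsR),
                    MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      cudaMemcpy(R_d, total.data(), numElementsR * sizeof(double), cudaMemcpyHostToDevice);
   }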
.. _QRexample:

=======================
Code Example (tensorQR)
=======================

.. toctree::
   :maxdepth: 2

   examples/qr

.. _SVDexample:

========================
Code Example (tensorSVD)
========================

.. toctree::
   :maxdepth: 2

   examples/svd

.. _Gateexample:

========================
Code Example (GateSplit)
========================

.. toctree::
   :maxdepth: 2

   examples/gate

.. _MPSexample:

================================
Code Example (MPS Factorization)
================================

.. toctree::
   :maxdepth: 2

   examples/mps

.. _reuse-example:

========================================
Code Example (Intermediate Tensor Reuse)
========================================

.. toctree::
   :maxdepth: 2

   examples/reuse

.. _gradients-example:

====================================
Code Example (Gradients computation)
====================================

.. toctree::
   :maxdepth: 2

   examples/gradients

===============================
Code Example (Amplitudes Slice)
===============================

.. toctree::
   :maxdepth: 2

   examples/amplitudes

================================
Code Example (Expectation Value)
================================

.. toctree::
   :maxdepth: 2

   examples/expectation

====================================
Code Example (Marginal Distribution)
====================================

.. toctree::
   :maxdepth: 2

   examples/marginal

======================================
Code Example (Tensor Network Sampling)
======================================

.. toctree::
   :maxdepth: 2

   examples/sampling

===================================
Code Example (MPS Amplitudes Slice)
===================================

.. toctree::
   :maxdepth: 2

   examples/mps-amplitudes

====================================
Code Example (MPS Expectation Value)
====================================

.. toctree::
   :maxdepth: 2

   examples/mps-expectation

========================================
Code Example (MPS Marginal Distribution)
========================================

.. toctree::
   :maxdepth: 2

   examples/mps-marginal

===========================
Code Example (MPS Sampling)
===========================

.. toctree::
   :maxdepth: 2

   examples/mps-sampling

===============================
Code Example (MPS Sampling QFT)
===============================

.. toctree::
   :maxdepth: 2

   examples/mps-sampling-qft

===============================
Code Example (MPS Sampling MPO)
===============================

.. toctree::
   :maxdepth: 2

   examples/mps-sampling-mpo

===========
Useful Tips
===========

* For debugging, one can set the environment variable ``CUTENSORNET_LOG_LEVEL=n``. The level ``n`` = 0, 1, ..., 5 corresponds to the logger level as described in and used by `cutensornetLoggerSetLevel`. The environment variable ``CUTENSORNET_LOG_FILE=<filepath>`` can be used to redirect the log output to a custom file at ``<filepath>`` instead of ``stdout``. A minimal programmatic alternative is sketched below.
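As a sketch, the logger level can also be set programmatically instead of via environment variables (the chosen level value is illustrative):

.. code-block:: c++

   // Sketch: enable verbose cuTensorNet logging from within the application.
   // Equivalent in spirit to exporting CUTENSORNET_LOG_LEVEL before launching.
   #include <cutensornet.h>

   void enableCutensornetLogging()
   {
      // Levels 0..5 increase verbosity, matching the description of
      // cutensornetLoggerSetLevel in the API reference.
      cutensornetLoggerSetLevel(5);
   }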