.. _cuTensorNet C example:

***************
Getting Started
***************

In this section, we show how to contract a tensor network using *cuTensorNet*. First, we describe how to install the library and how to compile a sample code. Then, we present the example code used to perform common steps in *cuTensorNet*. In this example, we perform the following tensor contraction:

.. math::

   D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}

We build the code up step by step, with each step adding code at the end. The steps are separated by succinct multi-line comment blocks.

It is recommended that the reader refer to :doc:`overview` and the `cuTENSOR documentation`_ for familiarity with the nomenclature and cuTENSOR operations.

.. _cuTENSOR documentation: https://docs.nvidia.com/cuda/cutensor/index.html

============================
Installation and Compilation
============================

Download the cuQuantum package (which *cuTensorNet* is part of) from https://developer.nvidia.com/cuQuantum-downloads, and the cuTENSOR package from https://developer.nvidia.com/cutensor.

-----
Linux
-----

Assuming cuQuantum has been extracted in ``CUQUANTUM_ROOT`` and cuTENSOR in ``CUTENSOR_ROOT``, we update the library path as follows:

.. code-block:: bash

   export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ``${CUTENSOR_ROOT}/lib/11.0``).

The sample code discussed below (``tensornet_example.cu``) can be compiled via the following command:

.. code-block:: bash

   nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example

For statically linking against the *cuTensorNet* library, use the following command (note that ``libmetis_static.a`` needs to be explicitly linked against, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through ``$LIBRARY_PATH``):

.. code-block:: bash

   nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example

.. note::

   Depending on the source of the cuQuantum package, you may need to replace ``lib`` above by ``lib64``.

..
   -------
   Windows
   -------

   Assuming *cuTensorNet* has been extracted in ``CUTENSORNET_ROOT``, we update the library path accordingly:

   .. code-block:: bash

      setx LD_LIBRARY_PATH "%CUTENSORNET_ROOT%\lib:%CUTENSOR_ROOT%\lib:%LD_LIBRARY_PATH%"

   We can compile the sample code we will discuss below (``tensornet_example.cu``) via the following command:

   .. code-block:: bash

      nvcc.exe tensornet_example.cu /I "%CUTENSORNET_ROOT%\include" "%CUTENSOR_ROOT%\include" cuTensorNet.lib cuTensor.lib /out:tensornet_example.exe

=====================
Code Example (Serial)
=====================

The following code example illustrates the common steps necessary to use *cuTensorNet* and also introduces typical tensor network operations. The full sample code (``tensornet_example.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

----------------------
Headers and data types
----------------------

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #1
   :end-before: Sphinx: #2
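
As a companion to the listing above, the following standalone sketch shows a typical error-checking pattern for *cuTensorNet* and CUDA runtime calls. The macro names are illustrative and are not taken from the sample:

.. code-block:: c++

   #include <cstdio>
   #include <cstdlib>
   #include <cuda_runtime.h>
   #include <cutensornet.h>

   // Abort with a readable message if a cuTensorNet call fails.
   #define CHECK_CUTENSORNET(call)                                      \
   do {                                                                 \
       const cutensornetStatus_t status = (call);                       \
       if (status != CUTENSORNET_STATUS_SUCCESS) {                      \
           printf("cuTensorNet error %s in line %d\n",                  \
                  cutensornetGetErrorString(status), __LINE__);         \
           exit(EXIT_FAILURE);                                          \
       }                                                                \
   } while (0)

   // Abort with a readable message if a CUDA runtime call fails.
   #define CHECK_CUDA(call)                                             \
   do {                                                                 \
       const cudaError_t status = (call);                               \
       if (status != cudaSuccess) {                                     \
           printf("CUDA error %s in line %d\n",                         \
                  cudaGetErrorString(status), __LINE__);                \
           exit(EXIT_FAILURE);                                          \
       }                                                                \
   } while (0)

   int main()
   {
       cutensornetHandle_t handle;
       CHECK_CUTENSORNET(cutensornetCreate(&handle));   // library initialization
       CHECK_CUDA(cudaDeviceSynchronize());             // CUDA runtime check
       CHECK_CUTENSORNET(cutensornetDestroy(handle));
       return 0;
   }
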
--------------------------------------
Define tensor network and tensor sizes
--------------------------------------

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #2
   :end-before: Sphinx: #3

-----------------------------------
Allocate memory and initialize data
-----------------------------------

Next, we allocate memory for the tensor network operands and initialize them to random values.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #3
   :end-before: Sphinx: #4

-----------------------------------------
cuTensorNet handle and network descriptor
-----------------------------------------

Next, we initialize the *cuTensorNet* library via `cutensornetCreate()` and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #4
   :end-before: Sphinx: #5

-------------------------------------
Optimal contraction order and slicing
-------------------------------------

At this stage, we can deploy the *cuTensorNet* optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available resources, and provide it to the optimizer as a *constraint*. We then create an optimizer config object of type `cutensornetContractionOptimizerConfig_t` to specify various optimizer options and provide it to the optimizer, which is called using `cutensornetContractionOptimize()`. The results from the optimizer are returned in an optimizer info object of type `cutensornetContractionOptimizerInfo_t`.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #5
   :end-before: Sphinx: #6

It is also possible to bypass the *cuTensorNet* optimizer and set a pre-determined path, as well as slicing information, directly into the optimizer info object via `cutensornetContractionOptimizerInfoSetAttribute`.
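
To make the optimizer-configuration step concrete, here is a minimal, standalone sketch that creates a library handle and an optimizer config object and sets a single attribute (the number of hyper-optimizer samples). The chosen value is only an example, and error checking is omitted for brevity:

.. code-block:: c++

   #include <cstdint>
   #include <cstdio>
   #include <cutensornet.h>

   int main()
   {
       cutensornetHandle_t handle;
       cutensornetCreate(&handle);

       // Optimizer options live in a cutensornetContractionOptimizerConfig_t object.
       cutensornetContractionOptimizerConfig_t optimizerConfig;
       cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig);

       // Example option: let the hyper-optimizer draw 8 samples (illustrative value).
       const int32_t numHypersamples = 8;
       cutensornetContractionOptimizerConfigSetAttribute(
           handle, optimizerConfig,
           CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES,
           &numHypersamples, sizeof(numHypersamples));

       // The configured object would then be passed to cutensornetContractionOptimize().
       cutensornetDestroyContractionOptimizerConfig(optimizerConfig);
       cutensornetDestroy(handle);
       printf("Optimizer config set\n");
       return 0;
   }
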
----------------------------------------------------------
Create workspace descriptor and allocate workspace memory
----------------------------------------------------------

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set it in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #6
   :end-before: Sphinx: #7

------------------------------
Contraction plan and auto-tune
------------------------------

We create a contraction plan holding pair-wise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan so that cuTENSOR selects the best kernel for each pair-wise contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing the plan redundantly.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #7
   :end-before: Sphinx: #8

-----------------------------
Network contraction execution
-----------------------------

Finally, we contract the network as many times as needed, possibly with different data. Network slices, captured as a `cutensornetSliceGroup_t` object, are executed using the same contraction plan. For convenience, `NULL` can be provided to the `cutensornetContractSlices()` function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free the allocated resources.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: #8

Recall that the full sample code (``tensornet_example.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

======================================
Code Example (Slice-based Parallelism)
======================================

It is straightforward to adapt `Code Example (Serial)`_ to enable parallel execution of the contraction operation on multiple devices. We will illustrate this with an example using MPI as the communication layer. In the interest of brevity, we show only the changes that need to be made; the full MPI sample code (``tensornet_example_mpi.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

First, in addition to the headers and definitions mentioned in `Headers and data types`_, we include the MPI header and define a macro to handle MPI errors. We also initialize MPI and map each process to a device in this section.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #1 [begin]
   :end-before: Sphinx: MPI #1 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #2 [begin]
   :end-before: Sphinx: MPI #2 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #3 [begin]
   :end-before: Sphinx: MPI #3 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #4 [begin]
   :end-before: Sphinx: MPI #4 [end]

Next, we define the tensor network as described in `Define tensor network and tensor sizes`_. In a one-process-per-device model, the tensor network, including the operand and result data, is replicated on each process. The root process initializes the operand data and broadcasts it to the other processes.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #5 [begin]
   :end-before: Sphinx: MPI #5 [end]

Then we create the library handle and tensor network descriptor on each process, as described in `cuTensorNet handle and network descriptor`_.

Next, we find the optimal contraction path and slicing combination for our network. We run the *cuTensorNet* optimizer on all processes and determine which process has the best path in terms of FLOP count. We then pack the optimizer info object on that process, broadcast the packed buffer, and unpack it on all processes. Each process now has the same optimizer info object, which we use to calculate each process's share of the slices.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #6 [begin]
   :end-before: Sphinx: MPI #6 [end]
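
The selection of the process with the cheapest path can be expressed as a single ``MPI_MINLOC`` reduction. The sketch below is a generic MPI pattern (function and variable names are illustrative, not taken from the sample); it assumes ``flopCount`` holds the FLOP count reported by each process's optimizer run (e.g., queried via the ``CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT`` attribute):

.. code-block:: c++

   #include <cstdio>
   #include <mpi.h>

   // Pair of (FLOP count, owning rank), as expected by MPI_DOUBLE_INT / MPI_MINLOC.
   struct FlopRank
   {
       double flops;
       int    rank;
   };

   // Returns the rank whose optimizer run found the cheapest contraction path.
   // Every process receives the same answer, so the winner can then broadcast
   // its packed optimizer info object to the others.
   int findBestPathOwner(double flopCount, MPI_Comm comm)
   {
       int rank = 0;
       MPI_Comm_rank(comm, &rank);

       FlopRank local{flopCount, rank};
       FlopRank best{};
       MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, comm);
       return best.rank;
   }

   int main(int argc, char** argv)
   {
       MPI_Init(&argc, &argv);

       int rank = 0;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       // Illustrative stand-in for the FLOP count obtained from the optimizer.
       double flopCount = 1.0e9 * (rank + 1);

       int owner = findBestPathOwner(flopCount, MPI_COMM_WORLD);
       if (rank == 0)
           printf("Rank %d owns the cheapest contraction path\n", owner);

       MPI_Finalize();
       return 0;
   }
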
We now create the workspace descriptor and allocate memory as described in `Create workspace descriptor and allocate workspace memory`_, and then create the contraction plan and auto-tune the network as described in `Contraction plan and auto-tune`_.

Next, on each process, we create a slice group (see `cutensornetSliceGroup_t`) that corresponds to its share of the network slices. We then provide this slice group object to the `cutensornetContractSlices()` function to get a partial contraction result on each process.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #7 [begin]
   :end-before: Sphinx: MPI #7 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #8 [begin]
   :end-before: Sphinx: MPI #8 [end]

Finally, we sum up the partial contributions to obtain the result of the contraction.

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #9 [begin]
   :end-before: Sphinx: MPI #9 [end]

.. literalinclude:: ../../../tensor_network/samples/tensornet_example_mpi.cu
   :language: c++
   :linenos:
   :lineno-match:
   :start-after: Sphinx: MPI #10 [begin]
   :end-before: Sphinx: MPI #10 [end]

The complete MPI sample (``tensornet_example_mpi.cu``) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.

===========
Useful Tips
===========

* For debugging, the environment variable ``CUTENSORNET_LOG_LEVEL=n`` can be set. The level ``n`` = 0, 1, ..., 5 corresponds to the logger level as described in and used by `cutensornetLoggerSetLevel`. The environment variable ``CUTENSORNET_LOG_FILE=<filepath>`` can be used to redirect the log output to a custom file at ``<filepath>`` instead of ``stdout``.
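
The same logging behavior can also be requested programmatically through `cutensornetLoggerSetLevel` (referenced above). A minimal sketch; the chosen level is just an example:

.. code-block:: c++

   #include <cutensornet.h>

   int main()
   {
       // Comparable in spirit to setting CUTENSORNET_LOG_LEVEL=5: request the most
       // verbose logger level before issuing any other cuTensorNet call.
       cutensornetLoggerSetLevel(5);

       cutensornetHandle_t handle;
       cutensornetCreate(&handle);   // handle creation is now traced in the log output
       cutensornetDestroy(handle);
       return 0;
   }
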