********
Overview
********

cuQuantum Python aims to bring the full functionalities of the NVIDIA cuQuantum SDK to Python. To do so, we adopt a two-layer approach:

1. Provide 1:1 Python wrappers of the corresponding C APIs in cuQuantum, including both cuStateVec and cuTensorNet.

2. Provide high-level, pythonic APIs for easier integration with Python applications.

Below we introduce each layer and show examples of the intended usage. Python code samples (such as those shown below) can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository.


Low-level Python bindings
=========================

Naming & calling convention
---------------------------

.. currentmodule:: cuquantum

All cuQuantum C APIs are exposed under the :mod:`cuquantum.custatevec` and :mod:`cuquantum.cutensornet` modules. In doing so, we follow the `PEP 8`_ style guide and adopt the following changes:

* All library name prefixes are stripped
* The function names are broken into words and follow the snake case
* The first letter of each word in the enum type names is capitalized
* Each enum's name prefix is stripped from its values' names
* Common enums that can be used in all submodules are placed in the parent module :mod:`cuquantum`
* Whenever applicable, the outputs are stripped away from the function arguments and returned directly as Python objects
* Pointers are passed as Python :class:`int`

Below is a non-exhaustive list of examples of such C-to-Python mappings:

- Function: ``custatevecGetDefaultWorkspaceSize`` -> :func:`custatevec.get_default_workspace_size`
- Function: ``cutensornetCreateNetworkDescriptor`` -> :func:`cutensornet.create_network_descriptor`
- Enum type: ``custatevecMatrixLayout_t`` -> :class:`custatevec.MatrixLayout`
- Enum type: ``cutensornetContractionOptimizerConfigAttributes_t`` -> :class:`cutensornet.ContractionOptimizerConfigAttribute`
- Enum value name: ``CUSTATEVEC_MATRIX_LAYOUT_COL`` -> :data:`custatevec.MatrixLayout.COL`
- Enum value name: ``CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES`` -> :data:`cutensornet.ContractionOptimizerConfigAttribute.HYPER_NUM_SAMPLES`
- Return: the outputs of ``custatevecSamplerCreate`` are the sampler descriptor and the required workspace size, which are wrapped as a 2-tuple in the corresponding :func:`custatevec.sampler_create` Python API
- Global enum: ``custatevecComputeType_t`` and ``cutensornetComputeType_t`` -> :class:`cuquantum.ComputeType`

There may be exceptions to the above rules, but they would be self-evident and properly documented. In the next section we discuss pointer passing in Python.

.. _PEP 8: https://www.python.org/dev/peps/pep-0008/
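To make the mapping concrete, below is a minimal sketch of the convention in practice. It only exercises APIs that appear elsewhere on this page (:func:`custatevec.create`, :func:`custatevec.destroy`, and the enums listed above), and it assumes cuQuantum Python is installed and a CUDA-capable GPU is visible:

.. code-block:: python

    # A minimal sketch of the naming convention in practice; it assumes cuQuantum
    # Python is installed and a CUDA-capable GPU is available.
    from cuquantum import custatevec as cusv
    from cuquantum import ComputeType

    # custatevecCreate() -> custatevec.create(); the handle is a plain Python int
    handle = cusv.create()

    # CUSTATEVEC_MATRIX_LAYOUT_COL -> custatevec.MatrixLayout.COL
    layout = cusv.MatrixLayout.COL

    # custatevecComputeType_t / cutensornetComputeType_t -> cuquantum.ComputeType
    compute_type = ComputeType.COMPUTE_64F

    # custatevecDestroy() -> custatevec.destroy()
    cusv.destroy(handle)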
Memory management
-----------------

Pointer and data lifetime
.........................

Unlike in C/C++, Python does not provide low-level primitives to allocate/deallocate host memory, not to mention device memory. In order to make the C APIs work with Python, it is important that memory management is properly done through Python proxy objects. In cuQuantum Python, we ask users to address such needs using NumPy (for host memory) and CuPy (for device memory).

.. note::

    It is also possible to use :class:`array.array` (plus :class:`memoryview` as needed) to manage host memory; however, it is more tedious compared to using :class:`numpy.ndarray`, especially when it comes to array manipulation and computation.

.. note::

    It is also possible to use `CUDA Python`_ to manage device memory, but as of CUDA 11 there is no simple, pythonic way to modify the contents stored on the GPU, which requires custom kernels. CuPy is a lightweight, NumPy-compatible array library that addresses this need.

To pass data from Python to C, the pointer addresses (as Python :class:`int`) of various objects are required. With NumPy/CuPy arrays as the proxy, this is as simple as follows:

.. code-block:: python

    # create a host buffer to hold 5 int
    buf = numpy.empty((5,), dtype=numpy.int32)
    # pass buf's pointer to the wrapper
    # buf could get modified in-place if the function writes to it
    my_func(..., buf.ctypes.data, ...)
    # examine/use buf's data
    print(buf)

    # create a device buffer to hold 10 double
    buf = cupy.empty((10,), dtype=cupy.float64)
    # pass buf's pointer to the wrapper
    # buf could get modified in-place if the function writes to it
    my_func(..., buf.data.ptr, ...)
    # examine/use buf's data
    print(buf)

    # create an untyped device buffer of 128 bytes
    buf = cupy.cuda.alloc(128)
    # pass buf's pointer to the wrapper
    # buf could get modified in-place if the function writes to it
    my_func(..., buf.ptr, ...)

    # buf is automatically destroyed when going out of scope

Please be aware that the underlying assumption is that the arrays must be contiguous in memory (unless the C interface allows for specifying the array strides). As a consequence, for example, as of cuQuantum Python v0.1.0 all C structs (including handles and descriptors) are *not exposed* as Python classes; that is, they do not have their own types and are simply cast to plain Python :class:`int` for passing around. Any downstream consumer should create a wrapper class to hold the pointer address if so desired (a sketch of such a wrapper is shown at the end of this subsection). In other words, users have full control (and responsibility) over managing the *pointer lifetime*.

However, in certain cases we are able to convert Python objects for users (if *readonly, host* arrays are needed) so as to alleviate users' burden. For example, in functions that require a sequence or a nested sequence, the following operations are equivalent:

.. code-block:: python

    # passing a host buffer of int type can be done like this
    buf = numpy.array([0, 1, 3, 5, 6], dtype=numpy.int32)
    my_func(..., buf.ctypes.data, ...)

    # or just this
    buf = [0, 1, 3, 5, 6]
    my_func(..., buf, ...)  # the underlying data type is determined by the C API

which is particularly useful when users need to pass a large amount of tensor metadata to C (e.g., :func:`cutensornet.create_network_descriptor`).

.. _CUDA Python: https://nvidia.github.io/cuda-python/index.html
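As an illustration of the point above about pointer lifetime, here is a minimal, hypothetical wrapper that owns a cuStateVec handle and releases it deterministically. Only :func:`custatevec.create` and :func:`custatevec.destroy` are actual cuQuantum Python APIs; the ``Handle`` class itself is not part of the library:

.. code-block:: python

    # A minimal, hypothetical wrapper owning a cuStateVec handle; only
    # custatevec.create()/destroy() are actual cuQuantum Python APIs.
    from cuquantum import custatevec as cusv

    class Handle:
        """Own a cuStateVec handle and destroy it on exit from a with-block."""

        def __init__(self):
            self.ptr = cusv.create()  # the handle is a plain Python int

        def __enter__(self):
            return self.ptr  # pass this int wherever a handle is expected

        def __exit__(self, *exc):
            cusv.destroy(self.ptr)

    with Handle() as handle:
        # call other custatevec APIs with `handle` here
        pass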
User-provided memory pools
..........................

Starting with cuQuantum v22.03, we offer an interface that allows users to bring in their own memory pool for the cuStateVec/cuTensorNet libraries to use. Once set, users are no longer required to manage any temporary workspace before calling an API; the library will draw memory from the user's pool (and return it once done). The only requirement for the memory pool is that it must be *stream-ordered*. See :ref:`Memory Management API ` for an introduction. Currently we only support *device* mempools.

In cuQuantum Python, this interface is exposed via the low-level APIs :func:`custatevec.set_device_mem_handler` and :func:`custatevec.get_device_mem_handler` (likewise for :mod:`~cuquantum.cutensornet`). Currently we offer three different ways to set the ``handler`` argument:

- if an :class:`int` is given, it is assumed to be the pointer address to a fully initialized ``custatevecDeviceMemHandler_t`` struct
- if a Python sequence of length 4 is given, it is assumed to be ``(ctx, device_alloc, device_free, name)``
- if a Python sequence of length 3 is given, it is assumed to be ``(malloc, free, name)``

See the API reference for further details.

Once set, the following calling convention, used wherever an API needs a workspace, notifies the library that it should draw memory from the user's mempool:

- setting the workspace (or workspace descriptor) pointer address to ``0``
- setting the workspace size to ``0``

`This example `_ demonstrates the usage of this API.
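The snippet below is a hedged sketch of the 3-sequence form, backed by a CuPy ``cupy.cuda.MemoryPool``. The callable signatures assumed here, ``malloc(size, stream)`` returning a pointer and ``free(ptr, size, stream)``, as well as the pool name, are our assumptions for illustration and should be verified against the API reference for :func:`custatevec.set_device_mem_handler` before use:

.. code-block:: python

    # A hedged sketch of registering a CuPy mempool via the (malloc, free, name)
    # form; the callable signatures assumed below should be double-checked
    # against the API reference.
    import cupy as cp
    from cuquantum import custatevec as cusv

    handle = cusv.create()
    pool = cp.cuda.MemoryPool()
    live = {}  # keep CuPy MemoryPointer objects alive until freed

    def my_malloc(size, stream):
        mem = pool.malloc(size)
        live[mem.ptr] = mem
        return mem.ptr

    def my_free(ptr, size, stream):
        # dropping the reference returns the block to the CuPy pool
        live.pop(ptr, None)

    cusv.set_device_mem_handler(handle, (my_malloc, my_free, "my_pool"))

    # From here on, passing a workspace pointer of 0 and a workspace size of 0
    # tells the library to draw temporary memory from the registered pool.
    cusv.destroy(handle)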
Usage example
-------------

The code below is a Python translation of the :ref:`corresponding cuStateVec example written in C `.

.. testcode::

    import numpy as np
    import cupy as cp

    from cuquantum import custatevec as cusv
    from cuquantum import cudaDataType as cudtype
    from cuquantum import ComputeType as ctype


    nIndexBits = 3
    nSvSize = (1 << nIndexBits)
    nTargets = 1
    nControls = 2
    adjoint = 0

    targets = (2,)
    controls = (0, 1)

    d_sv = cp.asarray([[0.0, 0.0], [0.0, 0.1], [0.1, 0.1], [0.1, 0.2],
                       [0.2, 0.2], [0.3, 0.3], [0.3, 0.4], [0.4, 0.5]], dtype=np.float64)
    d_sv = d_sv.view(np.complex128).reshape(-1)

    d_sv_result = cp.asarray([[0.0, 0.0], [0.0, 0.1], [0.1, 0.1], [0.4, 0.5],
                              [0.2, 0.2], [0.3, 0.3], [0.3, 0.4], [0.1, 0.2]], dtype=np.float64)
    d_sv_result = d_sv_result.view(np.complex128).reshape(-1)

    d_matrix = cp.asarray([[0.0, 0.0], [1.0, 0.0],
                           [1.0, 0.0], [0.0, 0.0]], dtype=np.float64)
    d_matrix = d_matrix.view(np.complex128).reshape(-1)

    # cuStateVec handle initialization
    handle = cusv.create()

    # check the size of external workspace
    extraWorkspaceSizeInBytes = cusv.apply_matrix_get_workspace_size(
        handle, cudtype.CUDA_C_64F, nIndexBits, d_matrix.data.ptr, cudtype.CUDA_C_64F,
        cusv.MatrixLayout.ROW, adjoint, nTargets, nControls, ctype.COMPUTE_64F)

    # allocate external workspace if necessary
    if extraWorkspaceSizeInBytes > 0:
        workspace = cp.cuda.alloc(extraWorkspaceSizeInBytes)
        workspace_ptr = workspace.ptr
    else:
        workspace_ptr = 0

    # apply gate
    cusv.apply_matrix(
        handle, d_sv.data.ptr, cudtype.CUDA_C_64F, nIndexBits,
        d_matrix.data.ptr, cudtype.CUDA_C_64F, cusv.MatrixLayout.ROW, adjoint,
        targets, len(targets), controls, 0, len(controls),
        ctype.COMPUTE_64F, workspace_ptr, extraWorkspaceSizeInBytes)

    # destroy handle
    cusv.destroy(handle)

    # --------------------------------------------------------------------------

    # check if d_sv holds the updated statevector
    correct = cp.allclose(d_sv, d_sv_result)
    if not correct:
        raise RuntimeError("example FAILED: wrong result")

    # if this is a standalone script, everything is cleaned up properly at exit


High-level pythonic APIs
========================

Introduction
------------

The goal behind the high-level APIs is to provide an interface to the cuTensorNet library that feels natural for Python programmers. The APIs support ndarray-like objects from NumPy, CuPy, and PyTorch, and allow the tensor network to be specified as an Einstein summation expression.

The high-level APIs can be further categorized into two levels:

* The "coarse-grained" level, where the user deals with Python functions like :func:`contract`, :func:`contract_path`, :func:`einsum`, and :func:`einsum_path`. The coarse-grained level is an abstraction layer that is typically meant for single contraction operations.

* The "fine-grained" level, where the interaction is through operations on a :class:`Network` object. The fine-grained level allows the user to invest significant resources into finding an optimal contraction path and autotuning the network, where repeated contractions on the same network object allow the cost to be amortized.

The APIs also allow for interoperability between the cuTensorNet library and external packages. For example, the user can specify a contraction order obtained from a different package (perhaps a research project). Alternatively, the user can obtain the contraction order and the sliced modes from cuTensorNet for downstream use elsewhere.

Usage example
-------------

Contracting the same tensor network demonstrated in the :ref:`cuTensorNet C example ` is as simple as:

.. testcode::

    from cuquantum import contract
    from numpy.random import rand

    a = rand(96, 64, 64, 96)
    b = rand(96, 64, 64)
    c = rand(64, 96, 64)

    r = contract("mhkn,ukh,xuy->mxny", a, b, c)

If desired, various options can be provided for the contraction. See :func:`contract` for more details and examples.

The fine-grained API allows for more control, as the examples in the documentation for :class:`Network` illustrate and as sketched after the MPI example below. A complete example illustrating a parallel implementation of tensor network contraction using the fine-grained API is shown below:

.. literalinclude:: ../../../python/samples/cutensornet/fine/example2_mpi.py
    :language: python
    :start-after: Sphinx

The complete MPI Python example can be found in the `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository (`here `_).
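For reference, a condensed, single-process sketch of the fine-grained workflow (path finding, optional autotuning, and repeated contraction on the same :class:`Network` object) might look as follows. The tensor shapes and the number of autotuning iterations are illustrative only; see :class:`Network` for the exact signatures and options:

.. code-block:: python

    # A minimal single-process sketch of the fine-grained API; shapes and
    # iteration counts are illustrative only.
    from cuquantum import Network
    from numpy.random import rand

    a, b, c = rand(8, 4, 4, 8), rand(8, 4, 4), rand(4, 8, 4)

    with Network("mhkn,ukh,xuy->mxny", a, b, c) as tn:
        # invest in finding a good contraction path once
        path, info = tn.contract_path()
        # optionally autotune the contraction kernels
        tn.autotune(iterations=5)
        # the path-finding/autotuning cost is amortized over repeated contractions
        r1 = tn.contract()
        r2 = tn.contract()
    # resources held by the network are released on exiting the with-block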
.. _high-level memory management:

Memory management
-----------------

Starting with cuQuantum Python v22.03, we support an `EMM`_-like interface, as proposed and supported by Numba, that allows users to set their own Python mempool. Users set the option :attr:`NetworkOptions.allocator` to a Python object complying with the :class:`cuquantum.BaseCUDAMemoryManager` protocol, and pass the options to the high-level APIs like :func:`contract` or :class:`Network`. Temporary memory allocations will then be done through this interface. (Internally, we use the same interface to use CuPy's or PyTorch's mempool depending on the input tensor operands.)

.. note::

    cuQuantum's :class:`~cuquantum.BaseCUDAMemoryManager` protocol is slightly different from Numba's EMM interface (:class:`numba.cuda.BaseCUDAMemoryManager`), but duck typing with an existing EMM *instance* (not type!) at runtime should be possible.

.. _EMM: https://numba.readthedocs.io/en/stable/cuda/external-memory.html


Circuit to tensor network converter
===================================

Introduction
------------

Starting with cuQuantum Python v22.07, we provide a :class:`CircuitToEinsum` converter that takes either a :class:`qiskit.QuantumCircuit` or a :class:`cirq.Circuit` and generates the corresponding tensor network contraction for the target operation. The goal of the converter is to allow Qiskit and Cirq users to easily explore the functionalities of the cuTensorNet library. As mentioned in the :ref:`tensor network introduction `, quantum circuits can be viewed as tensor networks. For any quantum circuit, :class:`CircuitToEinsum` can construct the corresponding tensor network to compute various quantities of interest. The output tensor network is returned as an Einstein summation expression with tensor operands.

We support the following operations:

* :meth:`~CircuitToEinsum.state_vector`: The contraction of this Einstein summation expression yields the final state coefficients as an N-dimensional tensor, where N is the number of qubits in the circuit. The mode labels of the tensor correspond to :attr:`CircuitToEinsum.qubits`.

* :meth:`~CircuitToEinsum.amplitude`: The contraction of this Einstein summation expression yields the amplitude coefficient for a given bitstring.

* :meth:`~CircuitToEinsum.reduced_density_matrix`: The contraction of this Einstein summation expression yields the reduced density matrix for a subset of qubits, optionally with another subset of qubits set to a fixed state.

The :class:`CircuitToEinsum` class also allows the user to specify a desired tensor backend (``cupy``, ``torch``, ``numpy``) via the ``backend`` argument when constructing the converter object. The returned Einstein summation expression and tensor operands can then directly serve as the input arguments for :func:`cuquantum.contract` or the corresponding backend's ``einsum`` function.

Usage example
-------------

.. testcode::

    import cirq
    import cupy

    from cuquantum import contract, CircuitToEinsum

    # create a random cirq.Circuit
    circuit = cirq.testing.random_circuit(qubits=4, n_moments=4, op_density=0.9, random_state=1)
    # the same task can be achieved with qiskit.circuit.random.random_circuit

    # construct the CircuitToEinsum converter targeting double precision and cupy operands
    converter = CircuitToEinsum(circuit, dtype='complex128', backend='cupy')

    # generate the Einstein summation expression and tensor operands for computing
    # the amplitude coefficient of bitstring 0000
    expression, operands = converter.amplitude(bitstring='0000')
    assert all(isinstance(op, cupy.ndarray) for op in operands)

    # contract the network to compute the amplitude
    amplitude = contract(expression, *operands)
    amplitude_cupy = cupy.einsum(expression, *operands)
    assert cupy.allclose(amplitude, amplitude_cupy)

Multiple Jupyter notebooks are `available `_ for Cirq and Qiskit users to easily build up their tensor-network-based simulations using cuTensorNet.
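The other supported operations follow the same pattern of returning an expression and operands. Below is a hedged sketch reusing the ``converter`` object from the example above; the exact argument handling (for example, how a fixed subset of qubits is specified) should be checked against the :class:`CircuitToEinsum` API reference:

.. code-block:: python

    # A hedged sketch reusing the converter from the example above; see the
    # CircuitToEinsum API reference for the exact signatures.

    # full state vector as an N-dimensional tensor (one mode per qubit)
    expression, operands = converter.state_vector()
    sv = contract(expression, *operands)

    # reduced density matrix of the first two qubits
    expression, operands = converter.reduced_density_matrix(converter.qubits[:2])
    rdm = contract(expression, *operands)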