********
Overview
********

This section describes the basic working principles of the *cuStateVec* library. For a general introduction to quantum circuits, please refer to `Introduction to quantum computing <../../overview.html>`_.

Description of state vectors
============================

..
    TODO: a lot of text here were copied from custatevec.h, we must find a way to keep them in sync

In the cuStateVec library, the state vector is always given as a device array, and its data type is specified by a `cudaDataType_t` constant.  It is the user's responsibility to manage memory for the state vector.
This version of the cuStateVec library supports 128-bit complex (complex128) and 64-bit complex (complex64) as data types of the state vector.
The size of the state vector is determined by the `nIndexBits` argument, which corresponds to the number of qubits in a circuit.  The state vector therefore contains :math:`2^{nIndexBits}` elements.
The `custatevecIndex_t` type, a typedef of the 64-bit signed integer, is provided to express state vector indices.
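
Because the element count doubles with every additional index bit, it can help to estimate the required device memory before allocating. The following is a minimal host-side sketch, not part of the cuStateVec API: ``Index``, ``stateVectorElements`` and ``stateVectorBytes`` are hypothetical names, and ``std::complex<float>`` / ``std::complex<double>`` are used as stand-ins for the complex64 and complex128 element types.

.. code-block:: cpp

   #include <complex>
   #include <cstddef>
   #include <cstdint>

   // Hypothetical stand-in for custatevecIndex_t (a 64-bit signed integer).
   using Index = int64_t;

   // Number of state vector elements for nIndexBits index bits: 2^nIndexBits.
   // Assumes 0 <= nIndexBits < 63 so that the shift does not overflow.
   constexpr Index stateVectorElements(int nIndexBits) {
       return Index{1} << nIndexBits;
   }

   // Bytes needed to hold the whole state vector with element type T,
   // e.g. std::complex<float> (complex64) or std::complex<double> (complex128).
   template <typename T>
   constexpr std::size_t stateVectorBytes(int nIndexBits) {
       return static_cast<std::size_t>(stateVectorElements(nIndexBits)) * sizeof(T);
   }

With 8-byte complex64 and 16-byte complex128 elements, a 30-qubit state vector occupies 8 GiB and 16 GiB respectively, which is the amount of device memory the caller would need to allocate.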

Bit ordering
^^^^^^^^^^^^

In the cuStateVec library, the bit ordering of the state vector index is little endian: the 0-th index bit is the least significant bit (LSB).
Most functions accept arguments that specify bit positions as integer arrays. These bit positions are also given in little endian order, and their values are in the range ``[0, nIndexBits)``.
Bit strings are represented by a pair of `bitString` and `bitOrdering` arguments. The `bitString` argument specifies the bit string values as an array of 0s and 1s.
The `bitOrdering` argument specifies the bit position of each `bitString` array element in little endian order.
In the following example, ``"10"`` is specified as a bit string. Its values are mapped to the 1st and 2nd index bits (0 to index bit 1 and 1 to index bit 2), so the pair can be used to specify the bit mask ``"*...*10*"``.
 
.. code-block:: cpp

   int32_t bitString[]   = {0, 1};
   int32_t bitOrdering[] = {1, 2};
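
To make the mapping concrete, the sketch below checks whether a given state vector index matches such a bit string. It is an illustration only and not a cuStateVec function: ``matchesBitString`` and ``Index`` are hypothetical names introduced here.

.. code-block:: cpp

   #include <cstdint>

   // Hypothetical stand-in for custatevecIndex_t (a 64-bit signed integer).
   using Index = int64_t;

   // Returns true if, for every i, index bit number bitOrdering[i] of `index`
   // equals bitString[i]. Index bit 0 is the least significant bit.
   bool matchesBitString(Index index, const int32_t* bitString,
                         const int32_t* bitOrdering, int len) {
       for (int i = 0; i < len; ++i) {
           const int32_t bit = static_cast<int32_t>((index >> bitOrdering[i]) & 1);
           if (bit != bitString[i]) {
               return false;
           }
       }
       return true;
   }

With the arrays from the example above, the indices ``0b100`` and ``0b101`` match the mask ``"*...*10*"`` because index bit 1 is 0 and index bit 2 is 1, whereas ``0b110`` does not because index bit 1 is 1.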

Supported data types
^^^^^^^^^^^^^^^^^^^^

By default, computation is executed in the precision corresponding to the state vector: double precision (FP64) for complex128 and single precision (FP32) for complex64.
The cuStateVec library also provides a compute type, which allows computation with reduced precision.
Some cuStateVec functions accept a compute type specified by `custatevecComputeType_t`.
The table below lists the combinations of state vector, matrix, and compute types available in the current version of the *cuStateVec* library.

.. csv-table::
    :header: State vector / `cudaDataType_t` , Matrix / `cudaDataType_t`  , Compute / `custatevecComputeType_t`
    :widths: 10, 10, 10
  
    Complex128 / CUDA_C_F64      , Complex128 / CUDA_C_F64 , FP64 / CUSTATEVEC_COMPUTE_64F
    Complex64  / CUDA_C_F32      , Complex128 / CUDA_C_F64 , FP32 / CUSTATEVEC_COMPUTE_32F
    Complex64  / CUDA_C_F32      , Complex64  / CUDA_C_F32 , FP32 / CUSTATEVEC_COMPUTE_32F

.. Note::
    CUSTATEVEC_COMPUTE_TF32 is not available in this version.
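
An application may want to validate a requested combination against the table above before calling into the library. The sketch below encodes the three listed rows. It is an assumption-laden illustration: ``DataType``, ``ComputeType`` and ``isSupportedCombination`` are hypothetical local names that mirror ``CUDA_C_F64`` / ``CUDA_C_F32`` and ``CUSTATEVEC_COMPUTE_64F`` / ``CUSTATEVEC_COMPUTE_32F`` so that the code compiles without the CUDA headers.

.. code-block:: cpp

   // Hypothetical stand-ins for the cudaDataType_t and
   // custatevecComputeType_t constants listed in the table.
   enum class DataType { C_F64, C_F32 };   // CUDA_C_F64, CUDA_C_F32
   enum class ComputeType { F64, F32 };    // CUSTATEVEC_COMPUTE_64F, CUSTATEVEC_COMPUTE_32F

   // Returns true only for the three combinations listed in the table.
   bool isSupportedCombination(DataType stateVector, DataType matrix,
                               ComputeType compute) {
       if (stateVector == DataType::C_F64) {
           // complex128 state vector: complex128 matrix with FP64 compute.
           return matrix == DataType::C_F64 && compute == ComputeType::F64;
       }
       // complex64 state vector: either matrix type, always with FP32 compute.
       return compute == ComputeType::F32;
   }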

Gate fusion
===========

Gate applications account for a large proportion of the computation cost in quantum simulators.
Fusing multiple gates into one larger gate can reduce the overall memory footprint required in gate applications.

.. figure:: ./figures/fusion.png    
    :width: 480px
    :align: center


The cuStateVec API *custatevecApplyMatrix* supports applying such general multi-qubit gates.
For details on availability, please refer to `custatevecApplyMatrix <_custatevecApplyMatrix-label>`.
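
As an illustration of what fusion means at the matrix level, the host-side sketch below builds fused gate matrices in two simple cases: two single-qubit gates applied in sequence to the same qubit (a matrix product), and two single-qubit gates acting on index bits 0 and 1 (a Kronecker product, ordered so that index bit 0 is the least significant bit). This is a sketch under stated assumptions rather than the library's fusion procedure: ``Gate1``, ``Gate2``, ``fuseSequential`` and ``fuseOnTwoQubits`` are hypothetical names, and the row-major layout is a choice made for this example.

.. code-block:: cpp

   #include <array>
   #include <complex>

   using Complex = std::complex<double>;
   using Gate1 = std::array<Complex, 4>;   // 2x2 matrix, row-major
   using Gate2 = std::array<Complex, 16>;  // 4x4 matrix, row-major

   // Fuse two gates applied in sequence to the same qubit.
   // Applying `first` and then `second` equals applying second * first.
   Gate1 fuseSequential(const Gate1& first, const Gate1& second) {
       Gate1 fused{};  // value-initialized to zeros
       for (int r = 0; r < 2; ++r)
           for (int c = 0; c < 2; ++c)
               for (int k = 0; k < 2; ++k)
                   fused[2 * r + c] += second[2 * r + k] * first[2 * k + c];
       return fused;
   }

   // Fuse gate `a` on index bit 0 and gate `b` on index bit 1 into one 4x4 gate.
   // Row and column numbers use little endian order: r = 2 * (bit 1) + (bit 0).
   Gate2 fuseOnTwoQubits(const Gate1& a, const Gate1& b) {
       Gate2 fused{};
       for (int r = 0; r < 4; ++r)
           for (int c = 0; c < 4; ++c)
               fused[4 * r + c] = b[2 * (r >> 1) + (c >> 1)] * a[2 * (r & 1) + (c & 1)];
       return fused;
   }

Applying one fused matrix updates the state vector once instead of once per original gate. The trade-off is that the fused matrix grows as :math:`2^n \times 2^n` with the number :math:`n` of qubits it acts on.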

..
  Qubit reordering
  ================
  The memory usage in quantum simulations increases exponentially with the number of qubits.
  To use many qubits, multiple GPUs are required.
  A typical approach is to divide all the qubits into global qubits and local qubits.
  Suppose we use :math:`M` qubits and each GPU can store :math:`2^N` state vector elements or :math:`N` qubits.
  Then :math:`2^{M-N}` GPUs are required to store the entire vector.
  The :math:`k`-th GPU handles the following elements: 
  .. math:: 
    \alpha_{i_{M-1}i_{M-2}\cdots i_{N} i_{N-1}\cdots i_{0}}
    \ s.t. \ k = (i_{M-1}i_{M-2}\cdots i_{N})_2, i_p \in \{0, 1\}, 0 \leq p \leq N-1.
  | For instance, 
  | GPU #0 handles from :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 0_{N} 0_{N-1} \cdots 0_{0}}` to :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 0_{N} 1_{N-1} \cdots 1_{0}}`,
  | GPU #1 handles from :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 1_{N} 0_{N-1} \cdots 0_{0}}` to :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 1_{N} 1_{N-1} \cdots 1_{0}}`, and so on.
  | Here, :math:`i_{M-1}, i_{M-2}, \cdots, i_{N}` belong to global qubits, and others belong to local qubits.
  Gate applications with global qubits requires data transfer of state vector elements between GPUs.
  However, it is known that this transfer can become the bottleneck of the performance. 
  To reduce the data transfer cost, cuStateVec provides an API to reorder the qubits.
  With qubit reordering, we can target only local qubits in the gate applications, which does not require any data transfer between GPUs.
  .. figure:: ./figures/reordering.png
    :width: 600px
    :align: center