********
Overview
********

This section describes the basic working principle of the *cuStateVec* library. For a general introduction to quantum circuits, please refer to :doc:`Introduction to quantum computing <../../overview>`.

API synchronization behavior
============================

The cuStateVec APIs are designed for asynchronous execution.  Their API synchronization behavior follows the description in `API synchronization behavior <https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior>`_ in `CUDA Runtime API <https://docs.nvidia.com/cuda/cuda-runtime-api/index.html>`_.  Developers are responsible for calling the appropriate CUDA APIs to synchronize with cuStateVec API calls.

Using CUDA Stream
=================

The execution of most cuStateVec APIs is serialized on the stream attached to the cuStateVec handle created by `custatevecCreate`.  The initial stream is the default stream.  Users can attach a user-created stream to the cuStateVec handle by calling `custatevecSetStream`.  All types of streams (default, blocking and non-blocking) are acceptable.  API calls are synchronized by appropriate CUDA API calls such as `cudaDeviceSynchronize`, `cudaStreamSynchronize` or `cudaStreamWaitEvent`.

There is one exception in :doc:`Distributed index bit swap API <distributed-index-bit-swap>`.  The `custatevecSVSwapWorkerCreate` API requires a user-created stream that is used only for data transfers.  As a result, other cuStateVec API calls, which run on the stream attached to the handle, can execute concurrently with these data transfers.
Also, `custatevecSVSwapWorkerExecute` blocks on the stream specified in the call to `custatevecSVSwapWorkerCreate` in order to synchronize with the peer of the data transfers.


Description of state vectors
============================
.. doxygenpage:: state_vector
   :content-only:

Bit ordering
============
.. doxygenpage:: bit_ordering
   :content-only:

Supported data types
====================
.. doxygenpage:: data_types
   :content-only:

.. _workspace-label:

Workspace
=========
.. doxygenpage:: workspace
   :content-only:

Gate fusion
===========

Gate applications account for a large proportion of the computation cost in quantum simulators.
We can reduce the overall memory footprint required in gate applications by fusing multiple gates into one larger gate.

.. figure:: ./figures/fusion.png    
    :width: 480px
    :align: center

The cuStateVec API supports such general gate applications on multiple qubits.
For detailed information, please refer to `custatevecApplyMatrix`.
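
As a minimal illustration of the idea, the following pure-Python sketch fuses two single-qubit gates that act on different qubits into one 4x4 gate and compares the result with applying the gates one after the other. It does not call cuStateVec; the helper names ``kron``, ``apply_1q`` and ``apply_matrix`` are hypothetical and exist only for this example, and index bit 0 is assumed to be the least significant bit.

.. code-block:: python

    import math

    def kron(a, b):
        """Kronecker product of two square matrices given as nested lists."""
        nb = len(b)
        n = len(a) * nb
        return [[a[i // nb][j // nb] * b[i % nb][j % nb] for j in range(n)]
                for i in range(n)]

    def apply_1q(gate, state, target):
        """Apply a 2x2 gate to qubit `target` of a state vector (bit 0 is the LSB)."""
        out = list(state)
        mask = 1 << target
        for i in range(len(state)):
            if (i & mask) == 0:
                j = i | mask
                out[i] = gate[0][0] * state[i] + gate[0][1] * state[j]
                out[j] = gate[1][0] * state[i] + gate[1][1] * state[j]
        return out

    def apply_matrix(matrix, state):
        """Apply a full gate matrix to the state vector in a single pass."""
        return [sum(m * s for m, s in zip(row, state)) for row in matrix]

    r = 1.0 / math.sqrt(2.0)
    H = [[r, r], [r, -r]]          # Hadamard gate, applied to qubit 0
    X = [[0.0, 1.0], [1.0, 0.0]]   # Pauli-X gate, applied to qubit 1

    state = [0.5, -0.5, 0.5j, 0.5]  # normalized 2-qubit state

    # Unfused: two gate applications, each traversing the whole state vector.
    sequential_state = apply_1q(X, apply_1q(H, state, 0), 1)

    # Fused: one 4x4 gate; the factor for the higher qubit comes first.
    fused_gate = kron(X, H)
    fused_state = apply_matrix(fused_gate, state)

Both paths give the same amplitudes up to rounding, while the fused path visits the state vector only once. In an actual simulation, a fused matrix of this kind is what would be passed to `custatevecApplyMatrix`.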

.. _multiGPUComputation-label:

Multi-GPU Computation
=====================

The memory usage in quantum circuit simulations increases exponentially with the number of qubits.
To simulate more qubits, multiple GPUs are required.
A typical approach is to divide the qubits into global and local ones.

For instance, a 3-qubit system with 8 state vector elements can be equally distributed to 4 GPUs as described in the following figure.

.. figure:: ./figures/globalQubits.png
    :width: 200px
    :align: center

When an index is assigned to each sub state vector, the bits of that index represent the higher-order qubits.
We refer to these qubits as global qubits and other qubits as local qubits.
In the example above, we have 2 global qubits and 1 local qubit.

In general, for an :math:`M`-qubit system, suppose each GPU can store :math:`2^N` state vector elements (for :math:`N` local qubits),
then :math:`2^{M-N}` GPUs (that is, :math:`M-N` global qubits) are required to store the entire state vector.
The :math:`k`-th GPU (:math:`k = (i_{M-1} i_{M-2} \cdots i_{N})_2`) stores the state vector elements
:math:`\alpha_{i_{M-1} i_{M-2} \cdots i_{N} i_{N-1} \cdots i_{0}}` with :math:`i_p \in \{0, 1\}, 0 \leq p \leq N-1`.

For instance,

  - GPU #0 handles elements from :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 0_{N} 0_{N-1} \cdots 0_{0}}` to :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 0_{N} 1_{N-1} \cdots 1_{0}}`,
  - GPU #1 handles elements from :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 1_{N} 0_{N-1} \cdots 0_{0}}` to :math:`\alpha_{0_{M-1} \cdots 0_{N+1} 1_{N} 1_{N-1} \cdots 1_{0}}`,
  - GPU #2 handles elements from :math:`\alpha_{0_{M-1} \cdots 1_{N+1} 0_{N} 0_{N-1} \cdots 0_{0}}` to :math:`\alpha_{0_{M-1} \cdots 1_{N+1} 0_{N} 1_{N-1} \cdots 1_{0}}`,
  - GPU #3 handles elements from :math:`\alpha_{0_{M-1} \cdots 1_{N+1} 1_{N} 0_{N-1} \cdots 0_{0}}` to :math:`\alpha_{0_{M-1} \cdots 1_{N+1} 1_{N} 1_{N-1} \cdots 1_{0}}`, and so on.

Here, the indices :math:`i_{M-1}, i_{M-2}, \cdots, i_{N}` belong to the global qubits, and others belong to the local qubits.
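
The index arithmetic above can be sketched in a few lines of plain Python. This is only an illustration of the global/local split, not part of the cuStateVec API; the function names ``num_gpus`` and ``split_index`` are hypothetical.

.. code-block:: python

    def num_gpus(n_qubits, n_local):
        """Number of sub state vectors (GPUs) for M = n_qubits and N = n_local."""
        return 1 << (n_qubits - n_local)

    def split_index(index, n_local):
        """Split a full state vector index into (GPU index k, local index)."""
        gpu = index >> n_local                 # global (higher-order) bits
        local = index & ((1 << n_local) - 1)   # local (lower-order) bits
        return gpu, local

    # The 3-qubit example above: M = 3 qubits, N = 1 local qubit.
    M, N = 3, 1
    gpus = num_gpus(M, N)
    layout = [split_index(i, N) for i in range(1 << M)]

Here ``gpus`` is 4 and ``layout`` pairs each of the 8 element indices with the GPU that stores it and its position within that GPU's sub state vector.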

cuStateVec provides APIs for multi-GPU qubit measurement, sampling, and qubit reordering.
Measurement and sampling APIs work on a single GPU, and users are required to gather/scatter the results of each GPU.
For details of each API, please refer to :ref:`batchMeasureSection-label`, :ref:`samplerSection-label`, and :ref:`qubitReorderingSection-label`, respectively.

For those who are interested in multi-GPU quantum simulations, :doc:`NVIDIA cuQuantum Appliance<../appliance/index>` is also available.

.. note::

   Each GPU requires its own cuStateVec handle. Users are also responsible for switching the CUDA device context.

.. _batchedStateVectors-label:

Batched state vectors simulation
================================

cuStateVec provides gate application and qubit measurement APIs for a group of state vectors.
When computing with many small state vectors, replacing API calls for each state vector with a single batched API call is expected to lead to improved performance.

These batched-version APIs assume that state vectors are allocated as one contiguous block of device memory and require three parameters to specify their locations:

  - ``nSVs``, the number of state vectors in the batch.
  - ``nIndexBits``, the number of qubits in each state vector. All the state vectors in a batch need to have the same number of qubits.
  - ``svStride``, the offset in number of elements between two state vectors.
    It should be equal to or larger than each state vector size, ``1 << nIndexBits``.

For instance, the following figure describes a group of state vectors with ``nSVs`` = 3, ``nIndexBits`` = 2, and ``svStride`` = 8.
Here, the elements of the 2-D array in the figure are ordered in column-major format.

.. figure:: ./figures/batchedStateVectors.png
    :width: 400px
    :align: center
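
The addressing implied by these three parameters can be sketched in plain Python as follows. The function ``element_offset`` is a hypothetical helper written for this illustration, not a cuStateVec API; it returns an offset counted in state vector elements, not bytes.

.. code-block:: python

    def element_offset(sv_index, element_index, n_index_bits, sv_stride):
        """Offset (in elements) of one amplitude inside the contiguous batch."""
        sv_size = 1 << n_index_bits
        if sv_stride < sv_size:
            raise ValueError("svStride must be at least 1 << nIndexBits")
        if not 0 <= element_index < sv_size:
            raise ValueError("element index out of range")
        return sv_index * sv_stride + element_index

    # Values from the example above: nSVs = 3, nIndexBits = 2, svStride = 8.
    n_svs, n_index_bits, sv_stride = 3, 2, 8
    starts = [element_offset(s, 0, n_index_bits, sv_stride) for s in range(n_svs)]
    last = element_offset(n_svs - 1, (1 << n_index_bits) - 1, n_index_bits, sv_stride)

With these values the three state vectors start at element offsets 0, 8 and 16, and the last amplitude of the last state vector sits at offset 19. Each state vector occupies only 4 of the 8 elements in its stride.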

For the details of each API, please refer to `custatevecApplyMatrixBatched`, `custatevecComputeExpectationBatched`, 
`custatevecAbs2SumArrayBatched`, `custatevecCollapseByBitStringBatched`, and `custatevecMeasureBatched`.

References
==========

For a technical introduction to cuStateVec, please refer to the NVIDIA blog:

  * `Accelerating Quantum Circuit Simulation with NVIDIA cuStateVec`_

.. _Accelerating Quantum Circuit Simulation with NVIDIA cuStateVec: https://developer.nvidia.com/blog/accelerating-quantum-circuit-simulation-with-nvidia-custatevec

Citing cuQuantum
================

* H. Bayraktar et al., "cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science," 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: `10.1109/QCE57702.2023.00119 <https://doi.org/10.1109/QCE57702.2023.00119>`_.