cuStateVec Functions
####################

.. role::  raw-html(raw)
    :format: html


Library Management
******************

Handle Management API
=====================

.. _custatevecCreate-label:

:code:`custatevecCreate`
------------------------

.. doxygenfunction:: custatevecCreate

----

.. _custatevecDestroy-label:

:code:`custatevecDestroy`
-------------------------

.. doxygenfunction:: custatevecDestroy

----

.. _custatevecGetDefaultWorkspaceSize-label:

:code:`custatevecGetDefaultWorkspaceSize`
-----------------------------------------

.. doxygenfunction:: custatevecGetDefaultWorkspaceSize

----

.. _custatevecSetWorkspace-label:

:code:`custatevecSetWorkspace`
-------------------------------------

.. doxygenfunction:: custatevecSetWorkspace


CUDA Stream Management API
==========================

.. _custatevecSetStream-label:

:code:`custatevecSetStream`
---------------------------

.. doxygenfunction:: custatevecSetStream

----

.. _custatevecGetStream-label:

:code:`custatevecGetStream`
---------------------------

.. doxygenfunction:: custatevecGetStream

Error Management API
====================

.. _custatevecGetErrorName-label:

:code:`custatevecGetErrorName`
------------------------------

.. doxygenfunction:: custatevecGetErrorName

----

.. _custatevecGetErrorString-label:

:code:`custatevecGetErrorString`
--------------------------------

.. doxygenfunction:: custatevecGetErrorString

Logger API
==========

.. _custatevecLoggerSetCallback-label:

:code:`custatevecLoggerSetCallback`
-----------------------------------

.. doxygenfunction:: custatevecLoggerSetCallback

----

.. _custatevecLoggerSetCallbackData-label:

:code:`custatevecLoggerSetCallbackData`
---------------------------------------

.. doxygenfunction:: custatevecLoggerSetCallbackData

----

.. _custatevecLoggerSetFile-label:

:code:`custatevecLoggerSetFile`
-------------------------------

.. doxygenfunction:: custatevecLoggerSetFile

----

.. _custatevecLoggerOpenFile-label:

:code:`custatevecLoggerOpenFile`
--------------------------------

.. doxygenfunction:: custatevecLoggerOpenFile

----

.. _custatevecLoggerSetLevel-label:

:code:`custatevecLoggerSetLevel`
--------------------------------

.. doxygenfunction:: custatevecLoggerSetLevel

----

.. _custatevecLoggerSetMask-label:

:code:`custatevecLoggerSetMask`
-------------------------------

.. doxygenfunction:: custatevecLoggerSetMask

----

.. _custatevecLoggerForceDisable-label:

:code:`custatevecLoggerForceDisable`
------------------------------------

.. doxygenfunction:: custatevecLoggerForceDisable


Versioning API
==============

.. _custatevecGetProperty-label:

:code:`custatevecGetProperty`
-----------------------------

.. doxygenfunction:: custatevecGetProperty

----

.. _custatevecGetVersion-label:

:code:`custatevecGetVersion`
-----------------------------

.. doxygenfunction:: custatevecGetVersion


.. _cuStateVec memory management API:

Memory Management API
=====================

A *stream-ordered* memory allocator (or mempool for short) allocates/deallocates memory *asynchronously* from/to a mempool in a
stream-ordered fashion, meaning memory operations and computations enqueued on the streams have a well-defined inter- and intra-
stream dependency. There are several well-implemented stream-ordered mempools available, such as ``cudaMemPool_t`` that is built-in
at the CUDA driver level since CUDA 11.2 (so that all CUDA applications in the same process can easily share the same pool,
see `here <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-memory-allocator>`_) and the RAPIDS
Memory Manager (`RMM`_). For a detailed introduction, see the `NVIDIA Developer Blog`_.

.. _RMM: https://github.com/rapidsai/rmm
.. _NVIDIA Developer Blog: https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/

The new device memory handler APIs allow users to bind a stream-ordered mempool to the library handle, such that cuStateVec
can take care of most of the memory management for users. Below is an illustration of what can be done:

.. code-block:: c++

    MyMemPool pool = MyMemPool();  // kept alive for the entire process in real apps

    int my_alloc(void* ctx, void** ptr, size_t size, cudaStream_t stream) {
      return reinterpret_cast<MyMemPool*>(ctx)->alloc(ptr, size, stream);
    }
    
    int my_dealloc(void* ctx, void* ptr, size_t size, cudaStream_t stream) {
      return reinterpret_cast<MyMemPool*>(ctx)->dealloc(ptr, size, stream);
    }

    // create a mem handler and fill in the required members for the library to use
    custatevecDeviceMemHandler_t handler;
    handler.ctx = reinterpret_cast<void*>(&pool);
    handler.device_alloc = my_alloc;
    handler.device_free = my_dealloc;
    memcpy(handler.name, std::string("my pool").c_str(), CUSTATEVEC_ALLOCATOR_NAME_LEN);

    // bind the handler to the library handle
    custatevecSetDeviceMemHandler(handle, &handler);

    /* ... use gate application as usual ... */

    // User doesn’t compute the required sizes
     
    // User doesn’t query the workspace size (but one can if desired)
     
    // User doesn’t allocate memory!
     
    // User sets null pointer to indicate the library should draw memory from the user's pool;
    void* extraWorkspace = nullptr;
    size_t extraWorkspaceInBytes = 0;
    custatevecApplyMatrix(
        handle, sv, svDataType, nIndexBits, matrix, matrixDataType, layout,
        adjoint, targets, nTargets, controls, nControls, controlBitValues,
        computeType, extraWorkspace, extraWorkspaceSizeInBytes);

    // User doesn’t deallocate memory!
 
As shown above, several calls to the workspace-related APIs can be skipped. Moreover, allowing the library to share
your memory pool not only can alleviate potential memory conflicts, but also enable possible optimizations.

In the current release, only a device mempool can be bound.

.. _custatevecSetDeviceMemHandler-label:

:code:`custatevecSetDeviceMemHandler`
-------------------------------------

.. doxygenfunction:: custatevecSetDeviceMemHandler

----

.. _custatevecGetDeviceMemHandler-label:

:code:`custatevecGetDeviceMemHandler`
-------------------------------------

.. doxygenfunction:: custatevecGetDeviceMemHandler


Gate Application
****************

General Matrices
================

cuStateVec API `custatevecApplyMatrix` can apply a matrix representing a gate to a state vector.
The API may require external workspace for large matrices, 
and `custatevecApplyMatrixGetWorkspaceSize` provides the size of workspace.
If a device memory handler is set, `custatevecApplyMatrixGetWorkspaceSize` can be skipped.

Use case
--------

.. code-block:: cpp

  // check the size of external workspace
  custatevecApplyMatrixGetWorkspaceSize(
      handle, svDataType, nIndexBits, matrix, matrixDataType, layout, adjoint, nTargets, 
      nControls, computeType, &extraWorkspaceSizeInBytes);

  // allocate external workspace if necessary
  void* extraWorkspace = nullptr;
  if (extraWorkspaceSizeInBytes > 0)
      cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

  // apply gate
  custatevecApplyMatrix(
      handle, sv, svDataType, nIndexBits, matrix, matrixDataType, layout,
      adjoint, targets, nTargets, controls, controlBitValues, nControls,
      computeType, extraWorkspace, extraWorkspaceSizeInBytes);

API reference
-------------

.. _custatevecApplyMatrixGetWorkspaceSize-label:

:code:`custatevecApplyMatrixGetWorkspaceSize`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecApplyMatrixGetWorkspaceSize

----

.. _custatevecApplyMatrix-label:

:code:`custatevecApplyMatrix`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecApplyMatrix


Pauli Matrices
==============

Exponential of a tensor product of Pauli matrices can be expressed as follows:

.. math::
  e^{i \theta \left( P_{target[0]} \otimes P_{target[1]} \otimes \cdots \otimes P_{target[nTargets-1]} \right)}.

Matrix :math:`P_{target[i]}` can be either of Pauli matrices :math:`I`, :math:`X`, :math:`Y`, and :math:`Z`, which are corresponding to the `custatevecPauli_t` enums `CUSTATEVEC_PAULI_I`, `CUSTATEVEC_PAULI_X`, `CUSTATEVEC_PAULI_Y`, and `CUSTATEVEC_PAULI_Z`, respectively. 
Also refer to :ref:`custatevecPauli_t<custatevecPauli_t-label>` for details.

Use case
--------

.. code-block:: cpp

  // apply exponential
  custatevecApplyPauliRotation(
      handle, sv, svDataType, nIndexBits, theta, paulis, targets, nTargets,
      controls, controlBitValues, nControls);

API reference
-------------

.. _custatevecApplyPauliRotation-label:

:code:`custatevecApplyPauliRotation`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. doxygenfunction:: custatevecApplyPauliRotation


Generalized Permutation Matrices
================================

A generalized permutation matrix can be expressed as the multiplication of a permutation matrix :math:`P` and a diagonal matrix :math:`D`.
For instance, we can decompose a 4 :math:`\times` 4 generalized permutation matrix :math:`A` as follows:

.. math::
  A = \left[ \begin{array}{cccc}  0 & 0 & a_0 & 0 \\ a_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & a_2 \\ 0 & a_3 & 0 & 0 \end{array}\right] = DP

, where

.. math::
  D = \left[ \begin{array}{cccc}  a_0 & 0 & 0 & 0 \\ 0 & a_1 & 0 & 0 \\ 0 & 0 & a_2 & 0 \\ 0 & 0 & 0 & a_3 \end{array}\right],
  P = \left[ \begin{array}{cccc}  0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{array}\right].

When :math:`P` is diagonal, the generalized permutation matrix is also diagonal.
Similarly, when :math:`D` is the identity matrix, the generalized permutation matrix becomes a permutation matrix.

The cuStateVec API `custatevecApplyGeneralizedPermutationMatrix` applies a generalized permutation matrix like :math:`A` to a state vector.
The API may require extra workspace for large matrices,
whose size can be queried using `custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize`.
If a device memory handler is set, `custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize` can be skipped.

Use case
--------

.. code-block:: cpp

    // check the size of external workspace
    custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(
        handle, CUDA_C_64F, nIndexBits, permutation, diagonals, CUDA_C_64F, targets,
        nTargets, nControls, &extraWorkspaceSizeInBytes);

    // allocate external workspace if necessary
    void* extraWorkspace = nullptr;
    if (extraWorkspaceSizeInBytes > 0)
        cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

    // apply a generalized permutation matrix
    custatevecApplyGeneralizedPermutationMatrix(
        handle, d_sv, CUDA_C_64F, nIndexBits, permutation, diagonals, CUDA_C_64F,
        adjoint, targets, nTargets, controls, controlBitValues, nControls,
        extraWorkspace, extraWorkspaceSizeInBytes);

The operation is equivalent to the following:

.. code-block:: cpp

  // sv, sv_temp: the state vector and temporary buffer.

  int64_t sv_size = int64_t{1} << nIndexBits;
  for (int64_t sv_idx = 0; sv_idx < sv_size; sv_idx++) {
      // The basis of sv_idx is converted to permutation basis to obtain perm_idx
      auto perm_idx = convertToPermutationBasis(sv_idx);
      // apply generalized permutation matrix
      if (adjoint == 0)
          sv_temp[sv_idx] = sv[permutation[perm_idx]] * diagonals[perm_idx];
      else
          sv_temp[permutation[perm_idx]] = sv[sv_idx] * conj(diagonals[perm_idx]);
  }

  for (int64_t sv_idx = 0; sv_idx < sv_size; sv_idx++)
      sv[sv_idx] = sv_temp[sv_idx];


API reference
-------------

.. _custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize-label:

:code:`custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize

----

.. _custatevecApplyGeneralizedPermutationMatrix-label:

:code:`custatevecApplyGeneralizedPermutationMatrix`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecApplyGeneralizedPermutationMatrix


Measurement
***********

Measurement on Z-bases
======================

Let us consider the measurement of an :math:`nIndexBits`-qubit state vector :math:`sv` on an :math:`nBasisBits`-bit Z product basis :math:`basisBits`.

The sums of squared absolute values of state vector elements on the Z product basis, :math:`abs2sum0` and :math:`abs2sum1`, are obtained by the followings:

.. math::

  abs2sum0 &= \Bra{sv} \left\{ \dfrac{1}{2} \left( 1 + Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv}, \\
  abs2sum1 &= \Bra{sv} \left\{ \dfrac{1}{2} \left( 1 - Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv}. 

Therefore, probabilities to obtain parity 0 and 1 are expressed in the following expression:

.. math::

  Pr(parity = 0) &= \dfrac{abs2sum0}{abs2sum0 + abs2sum1}, \\
  Pr(parity = 1) &= \dfrac{abs2sum1}{abs2sum0 + abs2sum1}.

Depending on the measurement result, the state vector is collapsed.
If parity is equal to 0, we obtain the following vector:

.. math::

  \ket{sv} = \dfrac{1}{\sqrt{norm}} \left\{ \dfrac{1}{2} \left( 1 + Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv},

and if parity is equal to 1, we obtain the following vector:

.. math::
  \ket{sv} = \dfrac{1}{\sqrt{norm}} \left\{ \dfrac{1}{2} \left( 1 - Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv},

where :math:`norm` is the normalization factor.

Use case
--------

We can measure by `custatevecMeasureOnZBasis` as follows:

.. code-block:: cpp
 
  // measure on a Z basis
  custatevecMeasureOnZBasis(
      handle, sv, svDataType, nIndexBits, &parity, basisBits, nBasisBits, 
      randnum, collapse);

The operation is equivalent to the following:

.. code-block:: cpp

  // compute the sums of squared absolute values of state vector elements 
  // on a Z product basis
  double abs2sum0, abs2sum1;
  custatevecAbs2SumOnZBasis(
      handle, sv, svDataType, nIndexBits, &abs2sum0, &abs2sum1, basisBits,
      nBasisBits);

  // [User] compute parity and norm
  double abs2sum = abs2sum0 + abs2sum1;
  int parity = (randnum * abs2sum < abs2sum0) ? 0 : 1;
  double norm = (parity == 0) ? abs2sum0 : abs2sum1;

  // collapse if necessary
  switch (collapse) {
  case CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO:
      custatevecCollapseOnZBasis(
          handle, sv, svDataType, nIndexBits, parity, basisBits, nBasisBits, 
          norm);
      break;  /* collapse */
  case CUSTATEVEC_COLLAPSE_NONE:
      break;  /* Do nothing */

API reference
-------------

.. _custatevecAbs2SumOnZBasis-label:

:code:`custatevecAbs2SumOnZBasis`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecAbs2SumOnZBasis

----

.. _custatevecCollapseOnZBasis-label:

:code:`custatevecCollapseOnZBasis`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecCollapseOnZBasis

----

.. _custatevecMeasureOnZBasis-label:

:code:`custatevecMeasureOnZBasis`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecMeasureOnZBasis

.. _batchMeasureSection-label:

Batched Single Qubit Measurement
================================

Assume that we measure an :math:`nIndexBits`-qubits state vector :math:`sv` with a :math:`bitOrderingLen`-bits bit string :math:`bitOrdering`.

The sums of squared absolute values of state vector elements are obtained by the following:

.. math::

  abs2sum[idx] = \braket{sv|i}\braket{i|sv},

where :math:`idx =  b_{BitOrderingLen-1}\cdots b_1 b_0`, :math:`i = b_{bitOrdering[BitOrderingLen-1]} \cdots b_{bitOrdering[1]} b_{bitOrdering[0]}`, :math:`b_p \in \{0, 1\}`.

Therefore, probability to obtain the :math:`idx`-th pattern of bits are expressed in the following expression:

.. math::
  Pr(idx) = \dfrac{abs2sum[idx]}{\sum_{k}abs2sum[k]}.

Depending on the measurement result, the state vector is collapsed.

If :math:`idx` satisfies :math:`(idx \ \& \ bitString) = idx`, we obtain :math:`sv[idx] = \dfrac{1}{\sqrt{norm}} sv[idx]`.
Otherwise, :math:`sv[idx] = 0`, where :math:`norm` is the normalization factor.

Use case
--------

We can measure by `custatevecBatchMeasure` as follows:

.. code-block:: cpp
 
  // measure with a bit string
  custatevecBatchMeasure(
      handle, sv, svDataType, nIndexBits, bitString, bitOrdering, bitStringLen,
      randnum, collapse);

The operation is equivalent to the following:

.. code-block:: cpp

  // compute the sums of squared absolute values of state vector elements 
  int maskLen = 0;
  int* maskBitString = nullptr;
  int* maskOrdering = nullptr;

  custatevecAbs2SumArray(
      handle, sv, svDataType, nIndexBits, abs2Sum, bitOrdering, bitOrderingLen, 
      maskBitString, maskOrdering, maskLen);

  // [User] compute a cumulative sum and choose bitString by a random number

  // collapse if necessary
  switch (collapse) {
  case CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO:
      custatevecCollapseByBitString(
          handle, sv, svDataType, nIndexBits, bitString, bitOrdering, 
          bitStringLen, norm);
      break;  /* collapse */
  case CUSTATEVEC_COLLAPSE_NONE:
      break;  /* Do nothing */

For multi-GPU computations, `custatevecBatchMeasureWithOffset` is available.
This function works on one device, and users are required to compute 
the cumulative array of squared absolute values of state vector elements beforehand.

.. code-block:: cpp

  // The state vector is divided to nSubSvs sub state vectors.
  // each state vector has its own ordinal and nLocalBits index bits. 
  // The ordinals of sub state vectors correspond to the extended index bits.
  // In this example, all the local qubits are measured and collapsed.

  // get abs2sum for each sub state vector
  double abs2SumArray[nSubSvs];
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecAbs2SumArray(
          handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &abs2SumArray[iSv], nullptr,
          0, nullptr, nullptr, 0);
  }

  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      cudaDeviceSynchronize();
  }

  // get cumulative array
  double cumulativeArray[nSubSvs + 1];
  cumulativeArray[0] = 0.0;
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cumulativeArray[iSv + 1] = cumulativeArray[iSv] + abs2SumArray[iSv];
  }

  // measurement
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      // detect which sub state vector will be used for measurement.
      if (cumulativeArray[iSv] <= randnum && randnum < cumulativeArray[iSv + 1]) {
          double norm = cumulativeArray[nSubSvs];
          double offset = cumulativeArray[iSv];
          cudaSetDevice(devices[iSv]);
          // measure local qubits. Here the state vector will not be collapsed.
          // Only local qubits can be included in bitOrdering and bitString arguments.
          // That is, bitOrdering = {0, 1, 2, ..., nLocalBits - 1} and
          // bitString will store values of local qubits as an array of integers.
          custatevecBatchMeasureWithOffset(
              handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, bitString, bitOrdering,
              bitStringLen, randnum, CUSTATEVEC_COLLAPSE_NONE, offset, norm);
      }
  }

  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      cudaDeviceSynchronize();
  }

  // get abs2Sum after collapse
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecAbs2SumArray(
          handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &abs2SumArray[iSv], nullptr,
          0, bitString, bitOrdering, bitStringLen);
  }

  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      cudaDeviceSynchronize();
  }

  // get norm after collapse
  double norm = 0.0;
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      norm += abs2SumArray[iSv];
  }

  // collapse sub state vectors
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecCollapseByBitString(
          handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, bitString, bitOrdering,
          bitStringLen, norm);
  }

  // destroy handle
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecDestroy(handle[iSv]);
  }

Please refer to `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository for further detail.

API reference
-------------

.. _custatevecAbs2SumArray-label:

:code:`custatevecAbs2SumArray`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecAbs2SumArray

----

.. _custatevecCollapseByBitString-label:

:code:`custatevecCollapseByBitString`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecCollapseByBitString

----

.. _custatevecBatchMeasure-label:

:code:`custatevecBatchMeasure`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecBatchMeasure

----

.. _custatevecBatchMeasureWithOffset-label:

:code:`custatevecBatchMeasureWithOffset`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecBatchMeasureWithOffset


Expectation
***********

Expectation via a Matrix
========================

Expectation performs the following operation:

.. math::
  \langle A \rangle = \bra{\phi}A\ket{\phi},

where :math:`\ket{\phi}` is a state vector and :math:`A` is a matrix or an observer.
The API for expectation `custatevecComputeExpectation` may require external workspace for large matrices, 
and `custatevecComputeExpectationGetWorkspaceSize` provides the size of workspace.
If a device memory handler is set, `custatevecComputeExpectationGetWorkspaceSize` can be skipped.

Use case
--------

.. code-block:: cpp

  // check the size of external workspace
  custatevecComputeExpectationGetWorkspaceSize(
      handle, svDataType, nIndexBits, matrix, matrixDataType, layout, nBasisBits, computeType,    
      &extraWorkspaceSizeInBytes);

  // allocate external workspace if necessary
  void* extraWorkspace = nullptr;
  if (extraWorkspaceSizeInBytes > 0)
      cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

  // perform expectation
  custatevecComputeExpectation(
      handle, sv, svDataType, nIndexBits, expect, expectDataType, residualNorm, 
      matrix, matrixDataType, layout, basisBits, nBasisBits, computeType,
      extraWorkspace, extraWorkspaceSizeInBytes);

API reference
-------------

.. _custatevecComputeExpectationGetWorkspaceSize-label:

:code:`custatevecComputeExpectationGetWorkspaceSize`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecComputeExpectationGetWorkspaceSize

----

.. _custatevecComputeExpectation-label:

:code:`custatevecComputeExpectation`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecComputeExpectation


Expectation on Pauli Basis
==========================

cuStateVec API `custatevecComputeExpectationsOnPauliBasis` computes expectation values for a batch of Pauli strings.
Each observable can be expressed as follows:

.. math::
  P_{\text{basisBits}[0]} \otimes P_{\text{basisBits}[1]} \otimes \cdots \otimes P_{\text{basisBits}[\text{nBasisBits}-1]}.

Each matrix :math:`P_{\text{basisBits}[i]}` can be one of the Pauli matrices :math:`I`, :math:`X`, :math:`Y`, and :math:`Z`, corresponding to the `custatevecPauli_t` enums `CUSTATEVEC_PAULI_I`, `CUSTATEVEC_PAULI_X`, `CUSTATEVEC_PAULI_Y`, and `CUSTATEVEC_PAULI_Z`, respectively. 
Also refer to :ref:`custatevecPauli_t<custatevecPauli_t-label>` for details.

Use case
--------

.. code-block:: cpp

  // calculate the norm and the expectations for Z(q1) and X(q0)Y(q2)

  uint32_t nPauliOperatorArrays = 3;
  custatevecPauli_t pauliOperators0[] = {};                                       // III
  int32_t           basisBits0[]      = {};
  custatevecPauli_t pauliOperators1[] = {CUSTATEVEC_PAULI_Z};                     // IZI
  int32_t           basisBits1[]      = {1};
  custatevecPauli_t pauliOperators2[] = {CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y}; // XIY
  int32_t           basisBits2[]      = {0, 2};
  const uint32_t nBasisBitsArray[] = {0, 1, 2};

  const custatevecPauli_t*
    pauliOperatorsArray[] = {pauliOperators0, pauliOperators1, pauliOperators2};
  const int32_t *basisBitsArray[] = { basisBits0, basisBits1, basisBits2};

  uint32_t nIndexBits = 3;
  double expectationValues[nPauliOperatorArrays];

  custatevecComputeExpectationsOnPauliBasis(
      handle, sv, svDataType, nIndexBits, expectationValues,
      pauliOperatorsArray, nPauliOperatorArrays,
      basisBitsArray, nBasisBitsArray);

API reference
-------------

.. _custatevecComputeExpectationsOnPauliBasis-label:

:code:`custatevecComputeExpectationsOnPauliBasis`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenfunction:: custatevecComputeExpectationsOnPauliBasis

.. _samplerSection-label:

Sampling
********

Sampling enables to obtain measurement results many times by using probability
calculated from quantum states.

Use case
========

.. code-block:: cpp

  // create sampler and check the size of external workspace
  custatevecSamplerCreate(
      handle, sv, svDataType, nIndexBits, &sampler, nMaxShots,
      &extraWorkspaceSizeInBytes);

  // allocate external workspace if necessary
  void extraWorkspace = nullptr;
  if (extraWorkspaceSizeInBytes > 0)
      cudaMalloc(extraWorkspace, extraWorkspaceSizeInBytes);
  
  // calculate cumulative abs2sum
  custatevecSamplerPreprocess(
      handle, sampler, extraWorkspace, extraWorkspaceSizeInBytes);

  // [User] generate randnums, array of random numbers [0, 1) for sampling
  ...

  // sample bit strings
  custatevecSamplerSample(
      handle, sampler, bitStrings, bitOrdering, bitStringLen, randnums, nShots,
      output);

  // deallocate the sampler
  custatevecSamplerDestroy(sampler);

For multi-GPU computations, cuStateVec provides `custatevecSamplerGetSquaredNorm` and 
`custatevecSamplerApplySubSVOffset`.
Users are required to calculate cumulative abs2sum array with the squared norm of 
each sub state vector via `custatevecSamplerGetSquaredNorm` and provide its values 
to the sampler descriptor via `custatevecSamplerApplySubSVOffset`.

.. code-block:: cpp

  // The state vector is divided to nSubSvs sub state vectors.
  // each state vector has its own ordinal and nLocalBits index bits. 
  // The ordinals of sub state vectors correspond to the extended index bits.

  // create sampler and check the size of external workspace
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecSamplerCreate(
          handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &sampler[iSv], nMaxShots,
          &extraWorkspaceSizeInBytes[iSv]);
  }

  // allocate external workspace if necessary
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      if (extraWorkspaceSizeInBytes[iSv] > 0) {
          cudaSetDevice(devices[iSv]);
          cudaMalloc(&extraWorkspace[iSv], extraWorkspaceSizeInBytes[iSv]);
      }
  }

  // sample preprocess
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]) );
      custatevecSampler_preprocess(
          handle[iSv], sampler[iSv], extraWorkspace[iSv],
          extraWorkspaceSizeInBytes[iSv]);
  }

  // get norm of the sub state vectors
  double subNorms[nSubSvs];
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecSamplerGetSquaredNorm(
          handle[iSv], sampler[iSv], &subNorms[iSv]);
  }

  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      cudaDeviceSynchronize();
  }

  // get cumulative array
  double cumulativeArray[nSubSvs + 1];
  cumulativeArray[0] = 0.0;
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cumulativeArray[iSv + 1] = cumulativeArray[iSv] + subNorms[iSv];
  }
  double norm = cumulativeArray[nSubSvs];

  // apply offset and norm
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]) );
      custatevecSamplerApplySubSVOffset(
          handle[iSv], sampler[iSv], iSv, nSubSvs, cumulativeArray[iSv], norm);
  }

  // divide randnum array. randnums must be sorted in the ascending order.
  int shotOffsets[nSubSvs + 1];
  shotOffsets[0] = 0;
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      double* pos = std::lower_bound(randnums, randnums + nShots, 
                                      cumulativeArray[iSv + 1] / norm);
      if (iSv == nSubSvs - 1) {
          pos = randnums + nShots;
      }
      shotOffsets[iSv + 1] = pos - randnums;
  }

  // sample bit strings
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      int shotOffset = shotOffsets[iSv];
      int nSubShots = shotOffsets[iSv + 1] - shotOffsets[iSv];
      if (nSubShots > 0) {
          cudaSetDevice(devices[iSv]);
          custatevecSamplerSample(
              handle[iSv], sampler[iSv], &bitStrings[shotOffset], bitOrdering,
              bitStringLen, &randnums[shotOffset], nSubShots,
              CUSTATEVEC_SAMPLER_OUTPUT_RANDNUM_ORDER);
      }
  }
  
  for (int iSv = 0; iSv < nSubSvs; iSv++) {
      cudaSetDevice(devices[iSv]);
      custatevecSamplerDestroy(sampler[iSv]);
  }

Please refer to `NVIDIA/cuQuantum <https://github.com/NVIDIA/cuQuantum>`_ repository for further detail.

API reference
=============

.. _custatevecSamplerCreate-label:

:code:`custatevecSamplerCreate`
-------------------------------

.. doxygenfunction:: custatevecSamplerCreate

----

.. _custatevecSamplerPreprocess-label:

:code:`custatevecSamplerPreprocess`
-----------------------------------

.. doxygenfunction:: custatevecSamplerPreprocess

----

.. _custatevecSamplerGetSquaredNorm-label:

:code:`custatevecSamplerGetSquaredNorm`
---------------------------------------

.. doxygenfunction:: custatevecSamplerGetSquaredNorm

----

.. _custatevecSamplerApplySubSVOffset-label:

:code:`custatevecSamplerApplySubSVOffset`
-----------------------------------------

.. doxygenfunction:: custatevecSamplerApplySubSVOffset

----

.. _custatevecSamplerSample-label:

:code:`custatevecSamplerSample`
-------------------------------

.. doxygenfunction:: custatevecSamplerSample

----

.. _custatevecSamplerDestroy-label:

:code:`custatevecSamplerDestroy`
--------------------------------

.. doxygenfunction:: custatevecSamplerDestroy


Accessor
********

An accessor extracts or updates state vector segments.

The APIs `custatevecAccessorCreate` and `custatevecAccessorCreateView` initialize an accessor and also return the size of an extra workspace (if needed by the APIs `custatevecAccessorGet` and `custatevecAccessorSet` to perform the copy).
The workspace must be bound to an accessor by `custatevecAccessorSetExtraWorkspace`, and the lifetime of the workspace must be as long as the accessor's to cover the entire duration of the copy operation.
If a device memory handler is set, it is not necessary to provide explicit workspace by users.

The ``begin`` and ``end`` arguments in the Get/Set APIs correspond to the state vector elements' indices such that elements within the specified range are copied.

Use case
========

Extraction
----------

.. code-block:: cpp

    // create accessor and check the size of external workspace
    custatevecAccessorCreateView(
        handle, d_sv, CUDA_C_64F, nIndexBits, &accessor, bitOrdering, bitOrderingLen,
        maskBitString, maskOrdering, maskLen, &extraWorkspaceSizeInBytes);

    // allocate external workspace if necessary
    if (extraWorkspaceSizeInBytes > 0)
        cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

    // set external workspace
    custatevecAccessorSetExtraWorkspace(
        handle, &accessor, extraWorkspace, extraWorkspaceSizeInBytes);

    // get state vector elements
    custatevecAccessorGet(
        handle, &accessor, buffer, accessBegin, accessEnd);

    // deallocate the accessor
    custatevecAccessorDestroy(accessor);

Update
------

.. code-block:: cpp

    // create accessor and check the size of external workspace
    custatevecAccessorCreate(
        handle, d_sv, CUDA_C_64F, nIndexBits, &accessor, bitOrdering, bitOrderingLen,
        maskBitString, maskOrdering, maskLen, &extraWorkspaceSizeInBytes);

    // allocate external workspace if necessary
    if (extraWorkspaceSizeInBytes > 0)
        cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

    // set external workspace
    custatevecAccessorSetExtraWorkspace(
        handle, &accessor, extraWorkspace, extraWorkspaceSizeInBytes);

    // set state vector elements
    custatevecAccessorSet(
        handle, &accessor, buffer, 0, nSvSize);

    // deallocate the accessor
    custatevecAccessorDestroy(accessor);

API reference
=============

.. _custatevecAccessorCreate-label:

:code:`custatevecAccessorCreate`
--------------------------------

.. doxygenfunction:: custatevecAccessorCreate

----

.. _custatevecAccessorCreateView-label:

:code:`custatevecAccessorCreateView`
------------------------------------

.. doxygenfunction:: custatevecAccessorCreateView

----

.. _custatevecAccessorDestroy-label:

:code:`custatevecAccessorDestroy`
---------------------------------

.. doxygenfunction:: custatevecAccessorDestroy

----

.. _custatevecAccessorSetExtraWorkspace-label:

:code:`custatevecAccessorSetExtraWorkspace`
-------------------------------------------

.. doxygenfunction:: custatevecAccessorSetExtraWorkspace

----

.. _custatevecAccessorGet-label:

:code:`custatevecAccessorGet`
-----------------------------

.. doxygenfunction:: custatevecAccessorGet

----

.. _custatevecAccessorSet-label:

:code:`custatevecAccessorSet`
-----------------------------

.. doxygenfunction:: custatevecAccessorSet


.. _qubitReorderingSection-label:

Qubit reordering
****************

cuStateVec provides `custatevecSwapIndexBits` API for single device and 
`custatevecMultiDeviceSwapIndexBits` for multiple devices to reorder state vector elements.

Use case
========

single device
-------------

.. code-block:: cpp

    // This example uses 3 qubits.
    const int nIndexBits = 3;

    // swap 0th and 2nd qubits
    const int nBitSwaps  = 1;
    const int2 bitSwaps[] = {{0, 2}}; // specify the qubit pairs

    // swap the state vector elements only if 1st qubit is 1
    const int maskLen = 1;
    int maskBitString[] = {1}; // specify the values of mask qubits
    int maskOrdering[] = {1};  // specify the mask qubits

    // Swap index bit pairs.
    // {|000>, |001>, |010>, |011>, |100>, |101>, |110>, |111>} will be permuted to
    // {|000>, |001>, |010>, |110>, |100>, |101>, |011>, |111>}.
    custatevecSwapIndexBits(handle, sv, svDataType, nIndexBits, bitSwaps, nBitSwaps,
        maskBitString, maskOrdering, maskLen);

multiple devices
----------------

.. code-block:: cpp

    // This example uses 2 GPUs and each GPU stores 2-qubit sub state vector.
    const int nGlobalIndexBits = 1;
    const int nLocalIndexBits = 2;
    const int nHandles = 1 << nGlobalIndexBits;

    // Users are required to enable direct access on a peer device prior to the swap API call.
    for (int i0 = 0; i0 < nHandles; i0++) {
      cudaSetDevice(i0);
      for (int i1 = 0; i1 < nHandles; i1++) {
        if (i0 == i1)
          continue;
        cudaDeviceEnablePeerAccess(i1, 0);
      }
    }
    cudaSetDevice(0);

    // specify the type of device network topology to optimize the data transfer sequence.
    // Here, devices are assumed to be connected via NVLink with an NVSwitch or
    // PCIe device network with a single PCIe switch.
    const custatevecDeviceNetworkType_t deviceNetworkType = CUSTATEVEC_DEVICE_NETWORK_TYPE_SWITCH;

    // swap 0th and 2nd qubits
    const int nIndexBitSwaps  = 1;
    const int2 indexBitSwaps[] = {{0, 2}}; // specify the qubit pairs

    // swap the state vector elements only if 1st qubit is 1
    const int maskLen = 1;
    int maskBitString[] = {1}; // specify the values of mask qubits
    int maskOrdering[] = {1};  // specify the mask qubits

    // Swap index bit pairs.
    // {|000>, |001>, |010>, |011>, |100>, |101>, |110>, |111>} will be permuted to
    // {|000>, |001>, |010>, |110>, |100>, |101>, |011>, |111>}.
    custatevecMultiDeviceSwapIndexBits(handles, nHandles, subSVs, svDataType,
        nGlobalIndexBits, nLocalIndexBits, indexBitSwaps, nIndexBitSwaps,
        maskBitString, maskOrdering, maskLen, deviceNetworkType);

API reference
=============

.. _custatevecSwapIndexBits-label:

:code:`custatevecSwapIndexBits`
-------------------------------

.. doxygenfunction:: custatevecSwapIndexBits

----

.. _custatevecMultiDeviceSwapIndexBits-label:

:code:`custatevecMultiDeviceSwapIndexBits`
------------------------------------------

.. doxygenfunction:: custatevecMultiDeviceSwapIndexBits


Matrix property testing
***********************

The API `custatevecTestMatrixType` is available to check the properties of matrices.

If a matrix :math:`A` is unitary, :math:`AA^{\dagger} = A^{\dagger}A = I`,
where :math:`A^{\dagger}` is the conjugate transpose of :math:`A` and :math:`I` is the identity matrix, respectively.

When `CUSTATEVEC_MATRIX_TYPE_UNITARY` is given for its argument, this API computes the 1-norm :math:`||R||_1 = \sum{|r_{ij}|}`, where 
:math:`R = AA^{\dagger} - I`.
This value will be approximately zero if :math:`A` is unitary.

If a matrix :math:`A` is Hermitian, :math:`A^{\dagger} = A`.

When `CUSTATEVEC_MATRIX_TYPE_HERMITIAN` is given for its argument, this API computes the 2-norm :math:`||R||_2 = \sum{|r_{ij}|^2}`, where
:math:`R = (A - A^{\dagger}) / 2`.
This value will be approximately zero if :math:`A` is Hermitian.

The API may require external workspace for large matrices, 
and `custatevecTestMatrixTypeGetWorkspaceSize` provides the size of workspace.
If a device memory handler is set, it is not necessary to provide explicit workspace by users.

Use case
========

.. code-block:: cpp

    double residualNorm;

    void* extraWorkspace = nullptr;
    size_t extraWorkspaceSizeInBytes = 0;

    // check the size of external workspace
    custatevecTestMatrixTypeGetWorkspaceSize(
        handle, matrixType, matrix, matrixDataType, layout,
        nTargets, adjoint, computeType, &extraWorkspaceSizeInBytes);

    // allocate external workspace if necessary
    if (extraWorkspaceSizeInBytes > 0)
        cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

    // execute testing
    custatevecTestMatrixType(
        handle, &residualNorm, matrixType, matrix, matrixDataType, layout,
        nTargets, adjoint, computeType, extraWorkspace, extraWorkspaceSizeInBytes);
 
API reference
=============

.. _custatevecTestMatrixTypeGetWorkspaceSize-label:

:code:`custatevecTestMatrixTypeGetWorkspaceSize`
-----------------------------------------------------

.. doxygenfunction:: custatevecTestMatrixTypeGetWorkspaceSize

----

.. _custatevecTestMatrixType-label:

:code:`custatevecTestMatrixType`
-----------------------------------------------------

.. doxygenfunction:: custatevecTestMatrixType