cusvaer
*******

Starting with cuQuantum Appliance 22.11, we have included cusvaer. cusvaer is designed as a Qiskit backend solver and is optimized for distributed state vector simulation.

Features
========

* Distributed simulation

  cusvaer distributes state vector simulations to multiple processes and nodes.

* Power-of-2 configuration for performance

  The numbers of GPUs, processes, and nodes should always be powers of 2. This design is chosen for optimal performance.

* Shipped with a validated set of libraries

  cusvaer is shipped with the latest cuStateVec and a validated set of MPI libraries. The performance has been validated on `NVIDIA SuperPOD `_.

Distributed state vector simulation
===================================

A popular method to distribute state vector simulation is to equally slice the state vector into multiple sub state vectors and distribute them to processes. Some examples of distributed state vector simulation are also described in :ref:`Multi-GPU Computation `.

In cusvaer, each equally-sized slice of the state vector is called a sub state vector.

The first release of cusvaer uses the "single device per single process" model. All sub state vectors are placed in GPUs. Each process owns a single GPU, and one sub state vector is allocated in the GPU assigned to the process. Thus, the number of processes is always identical to the number of GPUs. The size of the sub state vector is the same in all processes.

When slicing the state vector, the size of the sub state vector is calculated from the number of qubits in a circuit and the number of GPUs used for distribution. For example, a 40-qubit complex128 simulation can be computed using 32 DGX A100 nodes with the following configuration:

* The state vector size: 16 TiB (= 16 [bytes/elm] * (1 << 40 [qubits]))
* The number of GPUs per node: 8
* The number of nodes: 32
* The size of the sub state vector in each GPU: 64 [GiB/GPU] = 16 TiB / (32 [node] x 8 [GPU/node])

cusvaer calculates the size of the sub state vector from the number of qubits of a given quantum circuit and the number of processes. Thus, by specifying an appropriate number of processes, users are able to distribute state vector simulations to processes and nodes. The number of processes should be a power of 2.

For nodes with multiple GPUs, GPUs are assigned to processes by the following formula, which is used to configure affinity:

.. code-block:: python

    device_id = mpi_rank % number_of_processes_in_node
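The sizing above can be reproduced with a few lines of Python. The sketch below is only an illustration of the arithmetic described in this section; the variable names are illustrative and are not part of the cusvaer API.

.. code-block:: python

    # Illustrative sketch: reproduce the 40-qubit complex128 sizing example.
    n_qubits = 40
    bytes_per_element = 16            # complex128
    n_nodes = 32
    gpus_per_node = 8                 # one process per GPU

    n_processes = n_nodes * gpus_per_node             # 256 processes / GPUs
    state_vector_bytes = bytes_per_element * (1 << n_qubits)
    sub_state_vector_bytes = state_vector_bytes // n_processes

    print(state_vector_bytes / 2**40, "TiB total")    # 16.0 TiB total
    print(sub_state_vector_bytes / 2**30, "GiB/GPU")  # 64.0 GiB/GPU

    # Device affinity for a given MPI rank, following the formula above
    # (with one process per GPU, processes per node equals GPUs per node).
    mpi_rank = 11
    device_id = mpi_rank % gpus_per_node              # 3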
Using cusvaer
=============

cusvaer provides one class, ``cusvaer.backends.StatevectorSimulator``. This class has the same interface as `Qiskit Aer StatevectorSimulator `_.

Below is an example of using cusvaer.

.. code-block:: python

    from qiskit import QuantumCircuit, transpile
    from cusvaer.backends import StatevectorSimulator

    def create_ghz_circuit(n_qubits):
        ghz = QuantumCircuit(n_qubits)
        ghz.h(0)
        for qubit in range(n_qubits - 1):
            ghz.cx(qubit, qubit + 1)
        ghz.measure_all()
        return ghz

    circuit = create_ghz_circuit(20)

    simulator = StatevectorSimulator()
    simulator.set_options(precision='double')
    circuit = transpile(circuit, simulator)
    job = simulator.run(circuit, shots=1024)
    result = job.result()

    if result.mpi_rank == 0:
        print(result.get_counts())

The following console output will be obtained:

.. code-block:: bash

    $ python ghz_cusvaer.py
    {'00000000000000000000': 480, '11111111111111111111': 544}

To run a distributed simulation, users need to use MPI or an appropriate launcher to specify the number of processes. In the example below, a 20-qubit state vector simulation is distributed to two processes, and each process has a 19-qubit sub state vector. The number of processes should be a power of 2.

.. code-block:: shell

    $ mpirun -n 2 python ghz_cusvaer.py

.. note::
    The GHZ circuit is used here only as an example and is not intended to demonstrate performance. It is known that the GHZ circuit is too shallow to show good scaling when simulation is distributed to multiple GPUs and/or multiple nodes.

MPI libraries
=============

cusvaer provides built-in support for `Open MPI `_ and `MPICH `_. The versions shown below are validated or expected to work.

* Open MPI

  - Validated: v4.1.4 / UCX v1.13.1
  - Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x

* MPICH

  - Validated: v4.0.2

cuQuantum Appliance is shipped with Open MPI v4.1.4 and UCX v1.13.1 binaries compiled with PMI enabled. The build configuration is intended to work with Slurm.

Two options, ``cusvaer_comm_plugin_type`` and ``cusvaer_comm_plugin_soname``, are used to select the MPI library. Please refer to :ref:`CommPlugin <commPlugin-label>` for details.

For the use of other MPI libraries, users need to compile an external plugin. Please refer to `cuQuantum GitHub `_.

.. _cusvaerOptions-label:

cusvaer options
===============

cusvaer provides a subset of common Qiskit Aer options and cusvaer-specific options.

.. list-table:: **Common options provided by Qiskit Aer**
    :widths: 20 30 50
    :header-rows: 1

    * - Option
      - Value
      - Description
    * - ``shots``
      - non-negative integer
      - The number of shots
    * - ``precision``
      - ``"single"`` or ``"double"``
      - The precision of the state vector elements
    * - ``fusion_enable``
      - ``True`` or ``False``
      - Enable/disable gate fusion
    * - ``fusion_max_qubit``
      - positive integer
      - The maximum number of qubits used for gate fusion
    * - ``fusion_threshold``
      - positive integer
      - Qubit-count threshold for enabling gate fusion
    * - ``memory``
      - ``True`` or ``False``
      - If ``True``, classical register values are saved to the result
    * - ``seed_simulator``
      - integer
      - The seed for the random number generator

.. list-table:: **cusvaer options**
    :widths: 20 30 50
    :header-rows: 1

    * - Option
      - Value
      - Description
    * - ``cusvaer_comm_plugin_type``
      - ``cusvaer.CommPluginType``
      - Selects the CommPlugin for inter-process communication
    * - ``cusvaer_comm_plugin_soname``
      - string
      - The shared library name used for inter-process communication
    * - ``cusvaer_global_index_bits``
      - list of positive integers
      - The inter-node network structure
    * - ``cusvaer_p2p_device_bits``
      - non-negative integer
      - The number of GPUs that can communicate via GPUDirect P2P
    * - ``cusvaer_data_transfer_buffer_bits``
      - positive integer
      - The log2 size of the temporary buffer used for inter-node data transfers
    * - ``cusvaer_host_threads``
      - positive integer
      - The number of host threads used to process circuits on the host
    * - ``cusvaer_diagonal_fusion_max_qubit``
      - integer greater than or equal to -1
      - The maximum number of qubits used for diagonal gate fusion
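The common options above are set through ``set_options()``, in the same way as in the earlier example. The sketch below simply combines several of the documented common options; the particular values are illustrative only.

.. code-block:: python

    from cusvaer.backends import StatevectorSimulator

    simulator = StatevectorSimulator()
    # Illustrative values for a subset of the common Qiskit Aer options.
    simulator.set_options(
        shots=2048,             # number of shots
        precision='single',     # complex64 state vector elements
        fusion_enable=True,     # enable gate fusion
        fusion_max_qubit=5,     # fuse gates acting on up to 5 qubits
        seed_simulator=1234,    # fix the RNG seed for reproducible sampling
    )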
.. _commPlugin-label:

CommPlugin
----------

cusvaer dynamically links to a library that handles inter-process communication by using a CommPlugin. A CommPlugin is a small module that typically wraps an MPI library. The CommPlugin is selected by the ``cusvaer_comm_plugin_type`` and ``cusvaer_comm_plugin_soname`` options.

cusvaer_comm_plugin_type
++++++++++++++++++++++++

The ``cusvaer_comm_plugin_type`` option selects the CommPlugin. The value type is ``cusvaer.CommPluginType`` and the default value is ``cusvaer.CommPluginType.MPI_AUTO``.

.. code-block:: python

    # declared in cusvaer
    from enum import IntEnum

    class CommPluginType(IntEnum):
        SELF = 0             # Single process
        EXTERNAL = 1         # Use external plugin
        MPI_AUTO = 0x100     # Automatically select Open MPI or MPICH
        MPI_OPENMPI = 0x101  # Use Open MPI
        MPI_MPICH = 0x102    # Use MPICH

``SELF``
    The CommPlugin used for single-process simulation. The ``SELF`` CommPlugin does not have any external dependency.

``MPI_AUTO``, ``MPI_OPENMPI``, ``MPI_MPICH``
    By specifying ``MPI_OPENMPI`` or ``MPI_MPICH``, the specified library is dynamically loaded and used for inter-process data transfers. ``MPI_AUTO`` automatically selects one of them. With these options, cusvaer internally uses ``MPI_COMM_WORLD`` as the communicator.

``EXTERNAL``
    Option to use an external custom plugin that wraps an inter-process communication library. This option allows the use of libraries of the user's preference. Please refer to `cuQuantum GitHub `_ for details.

The value of ``CommPluginType`` should remain the same during the application lifetime if a value other than ``CommPluginType.SELF`` is specified. This reflects the fact that a process can use only one MPI library during the application lifetime.

cusvaer_comm_plugin_soname
++++++++++++++++++++++++++

The ``cusvaer_comm_plugin_soname`` option specifies the name of a shared library. When ``MPI_AUTO``, ``MPI_OPENMPI``, or ``MPI_MPICH`` is specified for ``cusvaer_comm_plugin_type``, the corresponding MPI library name in the search path should be specified. The default value is ``"libmpi.so"``.

If ``EXTERNAL`` is specified, ``cusvaer_comm_plugin_soname`` should contain the name of a custom CommPlugin.

.. _appliance-cusvaer-network-structure-label:

Specifying device network structure
-----------------------------------

Two options, ``cusvaer_global_index_bits`` and ``cusvaer_p2p_device_bits``, are available for specifying inter-node and intra-node network structures. By specifying these options, cusvaer schedules data transfers to utilize faster communication paths, which accelerates simulations.

cusvaer_global_index_bits
+++++++++++++++++++++++++

``cusvaer_global_index_bits`` is a list of positive integers that represents the inter-node network structure.

Assuming that groups of 8 nodes have a faster communication network within a cluster, and a simulation runs on 32 nodes, the value of ``cusvaer_global_index_bits`` is ``[3, 2]``. The first element, ``3``, is log2(8), representing the **8** nodes with fast communication, which corresponds to 3 qubits in the state vector. The second element, ``2``, means there are **4** such 8-node groups in the 32 nodes. The sum of the ``cusvaer_global_index_bits`` elements is 5, which means the number of nodes is 32 = 2^5.

The last element in ``cusvaer_global_index_bits`` can be omitted.

cusvaer_p2p_device_bits
+++++++++++++++++++++++

The ``cusvaer_p2p_device_bits`` option specifies the number of GPUs that can communicate by using GPUDirect P2P. For an 8-GPU node such as DGX A100, the value is log2(8) = 3.

The value of ``cusvaer_p2p_device_bits`` is typically the same as the first element of ``cusvaer_global_index_bits``, as the GPUDirect P2P network is typically the fastest in a cluster.

cusvaer_data_transfer_buffer_bits
+++++++++++++++++++++++++++++++++

``cusvaer_data_transfer_buffer_bits`` specifies the size of the buffer used for inter-node data transfers. Each process rank allocates (``1 << cusvaer_data_transfer_buffer_bits``) bytes of device memory. The default is 26 (64 MiB), and the minimum allowed value is 24 (16 MiB).

Depending on the system, setting a larger value for ``cusvaer_data_transfer_buffer_bits`` can accelerate inter-node data transfers.
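The options above can be read as a compact description of the cluster. The short sketch below simply decodes them for the 32-node DGX A100 example used earlier; it is illustrative arithmetic based on the descriptions in this section, not part of the cusvaer API.

.. code-block:: python

    # Illustrative sketch: decode the network structure options for 32 DGX A100 nodes.
    global_index_bits = [3, 2]        # 8-node groups with fast network, 4 such groups
    p2p_device_bits = 3               # 8 GPUs per node reachable via GPUDirect P2P
    data_transfer_buffer_bits = 26    # default buffer size

    n_nodes = 1 << sum(global_index_bits)          # 32 nodes
    gpus_per_node = 1 << p2p_device_bits           # 8 GPUs per node
    n_processes = n_nodes * gpus_per_node          # 256 processes, one per GPU
    buffer_bytes = 1 << data_transfer_buffer_bits  # 64 MiB of device memory per process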
Other options
-------------

cusvaer_host_threads
++++++++++++++++++++

``cusvaer_host_threads`` specifies the number of CPU threads used for circuit processing. The default value is ``8``, which works well in most cases; performance is not sensitive to this option.

cusvaer_diagonal_fusion_max_qubit
+++++++++++++++++++++++++++++++++

``cusvaer_diagonal_fusion_max_qubit`` specifies the maximum number of qubits allowed per fused diagonal gate. The default value is ``-1``, in which case the fusion size is automatically adjusted for better performance. If ``0`` is specified, gate fusion for diagonal gates is disabled.

Custom instructions
===================

``set_state_simple(state)``
---------------------------

This instruction sets the state vector. The ``state`` argument is a NumPy ndarray. If the dtype of the ``state`` argument differs from the precision of the state vector, the given ndarray is converted to the type of the state vector before being set.

When running a single-process simulation, the elements of the ``state`` argument are copied to the state vector. The ``state`` argument has the same size as the state vector.

If a simulation is distributed to multiple processes, this instruction sets the sub state vector allocated in the device assigned to a process. The size of the ``state`` argument should be the same as that of the sub state vector.

Unlike the ``set_statevector()`` instruction provided by Qiskit Aer, ``set_state_simple()`` does not check the norm of the ``state`` argument.

``save_state_simple()``
-----------------------

This instruction saves the state vector allocated in the GPU assigned to a process. The saved state vector is returned as a NumPy ndarray in a result object.

When running a single-process simulation, the saved ndarray is the state vector. If a simulation is distributed to multiple processes, the saved ndarray is the sub state vector owned by a process.

.. note::
    The above custom instructions are preliminary and are subject to change.

Example of cusvaer option configurations
========================================

.. code-block:: python

    import cusvaer

    options = {
        'cusvaer_comm_plugin_type': cusvaer.CommPluginType.MPI_AUTO,  # automatically select Open MPI or MPICH
        'cusvaer_comm_plugin_soname': 'libmpi.so',  # MPI library name is libmpi.so
        'cusvaer_global_index_bits': [3, 2],        # 8-node groups with fast network, 4 groups (32 nodes)
        'cusvaer_p2p_device_bits': 3,               # 8 GPUs in one node
        'precision': 'double'                       # use complex128
    }
    simulator = cusvaer.backends.StatevectorSimulator()
    simulator.set_options(**options)

Exception
=========

cusvaer uses the `qiskit.providers.basicaer.BasicAerError class `_ to raise exceptions.
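Since errors surface as ``BasicAerError``, they can be handled like any other Qiskit exception. The following is a minimal sketch, assuming the simulator and circuit objects from the earlier examples; the exact import path depends on the Qiskit version bundled in the container.

.. code-block:: python

    from qiskit.providers.basicaer import BasicAerError

    try:
        # simulator and circuit as created in the earlier examples
        result = simulator.run(circuit, shots=1024).result()
    except BasicAerError as err:
        # cusvaer reports failures through BasicAerError
        print(f"simulation failed: {err}")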
.. _mpi4py-label:

Interoperability with mpi4py
============================

`mpi4py `_ is a Python package that wraps MPI. By importing mpi4py before running simulations, cusvaer interoperates with mpi4py.

.. code-block:: python

    from qiskit import QuantumCircuit, transpile
    from cusvaer.backends import StatevectorSimulator
    # import mpi4py here to call MPI_Init()
    from mpi4py import MPI

    def create_ghz_circuit(n_qubits):
        ghz = QuantumCircuit(n_qubits)
        ghz.h(0)
        for qubit in range(n_qubits - 1):
            ghz.cx(qubit, qubit + 1)
        ghz.measure_all()
        return ghz

    print(f"mpi4py: rank: {MPI.COMM_WORLD.Get_rank()}, size: {MPI.COMM_WORLD.Get_size()}")

    circuit = create_ghz_circuit(20)

    # Create StatevectorSimulator instead of using Aer.get_backend()
    simulator = StatevectorSimulator()
    circuit = transpile(circuit, simulator)
    job = simulator.run(circuit)
    result = job.result()

    print(f"Result: rank: {result.mpi_rank}, size: {result.num_mpi_processes}")

cusvaer calls ``MPI_Init()`` right before executing the first simulation, and calls ``MPI_Finalize()`` when Python calls the functions registered via ``atexit``. cusvaer checks whether MPI is already initialized before calling ``MPI_Init()``. If ``MPI_Init()`` has already been called before the first simulation, the ``MPI_Init()`` and ``MPI_Finalize()`` calls are skipped.

For mpi4py, ``MPI_Init()`` is called when mpi4py is imported. Thus, mpi4py and cusvaer coexist when mpi4py is imported before calling ``simulator.run()``.