Starting with cuQuantum Appliance 22.11, we have included cusvaer. cusvaer is designed as a Qiskit backend solver and is optimized for distributed state vector simulation.


  • Distributed simulation

    cusvaer distributes state vector simulations to multiple processes and nodes.

  • Power-of-2 configuration for performance

    The number of GPUs, processes, and nodes should always be a power of two. This design is chosen for optimal performance.

  • Shipped with a validated set of libraries

    cusvaer is shipped with the latest cuStateVec and a validated set of MPI libraries. The performance has been validated on NVIDIA SuperPOD.

Distributed state vector simulation

A popular method to distribute a state vector simulation is to slice the state vector equally into multiple pieces and distribute them to processes. Some examples of distributed state vector simulation are also described in Multi-GPU Computation. In cusvaer, each equally-sliced piece is called a sub state vector.

The first release of cusvaer uses the “single device per single process” model. All sub state vectors are placed on GPUs. Each process owns a single GPU, and one sub state vector is allocated on the GPU assigned to that process. Thus, the number of processes is always identical to the number of GPUs, and the size of the sub state vector is the same in all processes.

When slicing the state vector, the size of the sub state vector is determined by the number of qubits in a circuit and the number of GPUs used for distribution.

For example, a 40-qubit complex128 (c128) simulation can be computed using 32 DGX A100 nodes with the following configuration.

  • The state vector size : 16 TiB (= 16 [bytes/elm] * (1 << 40 [qubits]))

  • The number of GPUs per node : 8

  • The number of nodes : 32

  • The size of sub state vector in each GPU: 64 [GiB/GPU] = 16 TiB / (32 [node] x 8 [GPU/node])
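The sizing arithmetic above can be sketched in a few lines of Python (the helper name is illustrative, not part of cusvaer):

```python
def sub_state_vector_bytes(n_qubits, n_nodes, gpus_per_node, bytes_per_elm=16):
    """Size in bytes of each sub state vector for a complex128 simulation."""
    total_bytes = bytes_per_elm * (1 << n_qubits)   # full state vector size
    n_gpus = n_nodes * gpus_per_node                # one process per GPU
    assert n_gpus & (n_gpus - 1) == 0, "GPU count must be a power of two"
    return total_bytes // n_gpus

# 40-qubit complex128 simulation on 32 DGX A100 nodes (8 GPUs each)
size = sub_state_vector_bytes(40, n_nodes=32, gpus_per_node=8)
print(size // (1 << 30), "GiB per GPU")   # 64 GiB per GPU
```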

cusvaer calculates the size of the sub state vector from the number of qubits of a given quantum circuit and the number of processes. Thus, by specifying an appropriate number of processes, users can distribute state vector simulations to processes and nodes. The number of processes should be a power of two.

For nodes with multiple GPUs, GPUs are assigned to processes using the following formula, which configures GPU affinity:

device_id = mpi_rank % number_of_processes_in_node
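As a minimal sketch of that affinity rule (assuming ranks are assigned node by node, with the function name chosen here for illustration):

```python
def assign_device(mpi_rank, procs_per_node):
    """Map an MPI rank to a local GPU index, as in the formula above."""
    return mpi_rank % procs_per_node

# 16 ranks on 2 nodes with 8 GPUs each: ranks 0-7 get GPUs 0-7 on the
# first node, ranks 8-15 get GPUs 0-7 on the second node.
print([assign_device(r, 8) for r in range(16)])
```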

Using cusvaer

cusvaer provides one class, cusvaer.backends.StatevectorSimulator. This class has the same interface as Qiskit Aer's StatevectorSimulator.

Below is example code for using cusvaer.

from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator

def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz

circuit = create_ghz_circuit(20)

simulator = StatevectorSimulator()
circuit = transpile(circuit, simulator)
job = simulator.run(circuit, shots=1024)
result = job.result()

if result.mpi_rank == 0:
    print(result.get_counts())

The following console output will be obtained:

$ python
{'00000000000000000000': 480, '11111111111111111111': 544}

To run a distributed simulation, users need to use MPI or an appropriate launcher to specify the number of processes. In the example below, a 20-qubit state vector simulation is distributed to two processes, and each process has a 19-qubit sub state vector. The number of processes should be a power of two.

$ mpirun -n 2


The GHZ circuit is used here only as an example; it is not meant to demonstrate performance. It is known that the GHZ circuit is too shallow to show good scaling when the simulation is distributed to multiple GPUs and/or nodes.

MPI libraries

cusvaer provides built-in support for Open MPI and MPICH.

The versions shown below are validated or expected to work.

  • Open MPI

    • Validated: v4.1.4 / UCX v1.13.1

    • Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x

  • MPICH

    • Validated: v4.0.2

cuQuantum Appliance is shipped with Open MPI v4.1.4 and UCX v1.13.1 binaries compiled with PMI enabled. The build configuration is intended to work with Slurm.

Two options, cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname, are used to select the MPI library. Please refer to CommPlugin for details.

To use other MPI libraries, users need to compile an external plugin. Please refer to the cuQuantum GitHub repository.

cusvaer options

cusvaer provides a subset of common Qiskit Aer options and cusvaer-specific options.

Common options provided by Qiskit Aer

  • shots (non-negative integer): the number of shots

  • precision ("single" or "double"): the precision of the state vector elements

  • fusion_enable (True or False): enable/disable gate fusion

  • fusion_max_qubit (positive integer): the max number of qubits used for gate fusion

  • fusion_threshold (positive integer): threshold to enable gate fusion

  • memory (True or False): if True is specified, classical register values are saved to the result

  • seed_simulator (integer): the seed for the random number generator

cusvaer options

  • cusvaer_comm_plugin_type (cusvaer.CommPluginType): selects the CommPlugin for inter-process communication

  • cusvaer_comm_plugin_soname (string): the shared library name for inter-process communication

  • cusvaer_global_index_bits (list of positive integers): network structure

  • cusvaer_p2p_device_bits (non-negative integer): the number of GPUs that can communicate by using GPUDirect P2P

  • cusvaer_data_transfer_buffer_bits (positive integer): the size used for a temporary buffer during data transfers

  • cusvaer_host_threads (positive integer): the number of host threads used to process circuits on the host

  • cusvaer_diagonal_fusion_max_qubit (integer greater than or equal to -1): the max number of qubits used for diagonal gate fusion

cusvaer dynamically links to a library that handles inter-process communication by using a CommPlugin. A CommPlugin is a small module that typically wraps an MPI library. The CommPlugin is selected by the cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname options.


cusvaer_comm_plugin_type

The cusvaer_comm_plugin_type option selects the CommPlugin. The value type is cusvaer.CommPluginType and the default value is cusvaer.CommPluginType.MPI_AUTO.

# declared in cusvaer
from enum import IntEnum

class CommPluginType(IntEnum):
    SELF        = 0       # Single process
    EXTERNAL    = 1       # Use external plugin
    MPI_AUTO    = 0x100   # Automatically select Open MPI or MPICH
    MPI_OPENMPI = 0x101   # Use Open MPI
    MPI_MPICH   = 0x102   # Use MPICH

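For illustration, the enum values can be used directly when building the backend options dict. This sketch redeclares the enum so it is self-contained; in practice the enum comes from the cusvaer package, and the soname 'libmpi.so' as well as passing the dict via set_options() are assumptions, not confirmed by this document:

```python
from enum import IntEnum

class CommPluginType(IntEnum):      # mirrors the declaration in cusvaer
    SELF        = 0
    EXTERNAL    = 1
    MPI_AUTO    = 0x100
    MPI_OPENMPI = 0x101
    MPI_MPICH   = 0x102

# Options selecting Open MPI explicitly; 'libmpi.so' is the conventional
# Open MPI library name (an assumption here). With a real cusvaer install
# this dict would be applied via StatevectorSimulator().set_options(**options).
options = {
    'cusvaer_comm_plugin_type': CommPluginType.MPI_OPENMPI,
    'cusvaer_comm_plugin_soname': 'libmpi.so',
}
```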

SELF

The SELF CommPlugin is used for single-process simulation and does not have any external dependency.


MPI_AUTO, MPI_OPENMPI, MPI_MPICH

By specifying MPI_OPENMPI or MPI_MPICH, the specified library will be dynamically loaded and used for inter-process data transfers. MPI_AUTO automatically selects one of them. With these options, cusvaer internally uses MPI_COMM_WORLD as the communicator.


EXTERNAL

This option uses an external custom plugin that wraps an inter-process communication library, allowing libraries of the user's preference. Please refer to the cuQuantum GitHub repository for details.

The value of cusvaer_comm_plugin_type should stay the same during the application lifetime if a value other than CommPluginType.SELF is specified. This reflects the fact that a process can use only one MPI library during the application lifetime.


cusvaer_comm_plugin_soname

The cusvaer_comm_plugin_soname option specifies the name of a shared library. When MPI_AUTO, MPI_OPENMPI, or MPI_MPICH is specified for cusvaer_comm_plugin_type, the corresponding MPI library name in the search path should be specified. The default value is "".

If EXTERNAL is specified, cusvaer_comm_plugin_soname should contain the name of a custom CommPlugin.

Specifying device network structure

Two options, cusvaer_global_index_bits and cusvaer_p2p_device_bits, are available for specifying inter-node and intra-node network structures. By specifying these options, cusvaer schedules data transfers to utilize faster communication paths to accelerate simulations.


cusvaer_global_index_bits

cusvaer_global_index_bits is a list of positive integers that represents the inter-node network structure.

Assume a cluster in which groups of 8 nodes share a faster communication network, and a simulation runs on 32 nodes. The value of cusvaer_global_index_bits is then [3, 2]. The first element, 3, is log2(8), representing the 8-node groups with fast communication, which corresponds to 3 qubits in the state vector. The second element, 2, means there are 4 such 8-node groups in the 32 nodes. The sum of the global_index_bits elements is 5, which means the number of nodes is 32 = 2^5.
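The relationship between the list elements and the node count can be checked directly (the helper name is illustrative):

```python
def node_count(global_index_bits):
    """Total number of nodes implied by a cusvaer_global_index_bits list."""
    return 1 << sum(global_index_bits)

# [3, 2]: groups of 8 fast-connected nodes, 4 such groups -> 32 nodes
print(node_count([3, 2]))  # 32
```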

The last element of cusvaer_global_index_bits can be omitted.


cusvaer_p2p_device_bits

The cusvaer_p2p_device_bits option specifies the number of GPUs that can communicate by using GPUDirect P2P.

For an 8-GPU node such as DGX A100, the value is log2(8) = 3.

The value of cusvaer_p2p_device_bits is typically the same as the first element of cusvaer_global_index_bits, as the GPUDirect P2P network is typically the fastest in a cluster.


cusvaer_data_transfer_buffer_bits

cusvaer_data_transfer_buffer_bits specifies the size of the buffer used for inter-node data transfers. Each process allocates (1 << cusvaer_data_transfer_buffer_bits) bytes of device memory. The default value is 26 (64 MiB); the minimum allowed value is 24 (16 MiB).

Depending on the system, setting a larger value for cusvaer_data_transfer_buffer_bits can accelerate inter-node data transfers.
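A quick check of the buffer sizes implied by this option (the helper name is illustrative):

```python
def buffer_bytes(bits):
    """Device memory allocated per process for the data-transfer buffer."""
    assert bits >= 24, "minimum allowed value is 24 (16 MiB)"
    return 1 << bits

print(buffer_bytes(26) // (1 << 20), "MiB")  # default: 64 MiB
print(buffer_bytes(24) // (1 << 20), "MiB")  # minimum: 16 MiB
```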

Other options


cusvaer_host_threads

cusvaer_host_threads specifies the number of CPU threads used for circuit processing. The default value is 8, which works well in most cases; performance is not sensitive to this option.


cusvaer_diagonal_fusion_max_qubit

cusvaer_diagonal_fusion_max_qubit specifies the maximum number of qubits allowed per fused diagonal gate. The default value is -1, in which case the fusion size is automatically adjusted for better performance. If 0 is specified, gate fusion for diagonal gates is disabled.

Custom instructions


set_state_simple

This instruction sets the state vector. The state argument is a NumPy ndarray. If the dtype of the state argument differs from the state vector precision, the given ndarray is converted to the type of the state vector before being set.

When running a single-process simulation, the elements of the state argument are copied to the state vector. The state argument has the same size as the state vector.

If a simulation is distributed to multiple processes, this instruction sets the sub state vector allocated on the device assigned to each process. The size of the state argument should be the same as that of the sub state vector.
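The per-process slicing can be sketched with NumPy. This is a conceptual illustration only, not the cusvaer API; the contiguous rank-to-slice layout is an assumption consistent with the equal-slice model described earlier:

```python
import numpy as np

def local_slice(full_state, rank, num_procs):
    """Return the contiguous sub state vector a given rank would set.

    Assumes the state vector is split equally and contiguously by rank
    (an assumption for illustration).
    """
    assert len(full_state) % num_procs == 0
    chunk = len(full_state) // num_procs
    return full_state[rank * chunk:(rank + 1) * chunk]

# 3-qubit state split across 2 processes: each rank holds 4 amplitudes
state = np.zeros(8, dtype=np.complex128)
state[0] = 1.0                      # amplitude of |000>
print(local_slice(state, 0, 2))    # first half, contains the amplitude
print(local_slice(state, 1, 2))    # second half, all zeros
```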

Unlike the set_statevector() instruction provided by Qiskit Aer, set_state_simple() does not check the absolute value of the state argument.


A second instruction saves the state vector allocated on the GPU assigned to a process. The saved state vector is returned as a NumPy ndarray in the result object.

When running a single-process simulation, the saved ndarray is the full state vector. If a simulation is distributed to multiple processes, the saved ndarray is the sub state vector owned by each process.


The above custom instructions are preliminary and are subject to change.

Example of cusvaer option configurations

import cusvaer

options = {
  'cusvaer_comm_plugin_type': cusvaer.CommPluginType.MPI_AUTO,  # automatically select Open MPI or MPICH
  'cusvaer_comm_plugin_soname': '',  # the default value is ""
  'cusvaer_global_index_bits': [3, 2],  # 8-node groups with a fast network, 4 groups (32 nodes)
  'cusvaer_p2p_device_bits': 3,         # 8 GPUs in one node
  'precision': 'double'                 # use complex128
}

simulator = cusvaer.backends.StatevectorSimulator()
simulator.set_options(**options)


Errors

cusvaer uses the qiskit.providers.basicaer.BasicAerError class to raise exceptions.

Interoperability with mpi4py

mpi4py is a Python package that wraps MPI. By importing mpi4py before running simulations, cusvaer interoperates with mpi4py.

from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator
# import mpi4py here to call MPI_Init()
from mpi4py import MPI

def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz

print(f"mpi4py: rank: {MPI.COMM_WORLD.Get_rank()}, size: {MPI.COMM_WORLD.Get_size()}")

circuit = create_ghz_circuit(20)

# Create StatevectorSimulator instead of using Aer.get_backend()
simulator = StatevectorSimulator()
circuit = transpile(circuit, simulator)
job = simulator.run(circuit)
result = job.result()

print(f"Result: rank: {result.mpi_rank}, size: {result.num_mpi_processes}")

cusvaer calls MPI_Init() right before executing the first simulation, and calls MPI_Finalize() when Python calls the functions registered with atexit.

cusvaer checks whether MPI is already initialized before calling MPI_Init(). If MPI_Init() has already been called before the first simulation, the MPI_Init() and MPI_Finalize() calls are skipped.

For mpi4py, MPI_Init() is called when mpi4py is imported. Thus, mpi4py and cusvaer coexist if mpi4py is imported before the first simulation is run.