cusvaer¶
cusvaer is a Qiskit backend of an optimized state vector simulator built on the cuStateVec library.
New features¶
- B200 GPU support: Floating-point emulation has been introduced to accelerate gate application. To enable or disable this feature, refer to the cusvaer_enable_floating_point_emulation option.
- GB200 system support: Multi-node NVLink is supported. Refer to Running simulations in GB200 and GH200 clusters.
Qiskit backend for distributed simulations¶
- Distributed simulation: cusvaer distributes state vector simulations across multiple devices (GPUs and CPUs).
- CPU memory as external storage for the state vector: cusvaer can allocate the state vector not only in GPU memory but also in CPU memory, which accommodates larger state vectors.
- Power-of-2 configuration for performance: The number of GPUs, the number of processes, and the number of nodes should always be a power of 2. This design was chosen for optimal performance.
- Shipped with a validated set of libraries: cusvaer ships with the latest cuStateVec and a validated set of MPI libraries. Performance has been validated on NVIDIA SuperPOD.
Distributed state vector simulation¶
A popular method for distributing a state vector simulation is to slice the state vector equally into multiple pieces and distribute them to processes. Some examples of distributed state vector simulation are also described in Multi-GPU Computation. In cusvaer, each of these equally sized slices is called a sub state vector.
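The equal slicing can be illustrated on a small state vector in plain Python. This is a minimal sketch, not cusvaer's implementation: the helper name `slice_state_vector` is hypothetical, and contiguous slicing (where the most significant index bits select the slice) is an assumption made for illustration.

```python
def slice_state_vector(state, n_procs):
    """Split a state vector into n_procs equally sized, contiguous slices."""
    if n_procs <= 0 or (n_procs & (n_procs - 1)) != 0:
        raise ValueError("the number of processes must be a power of 2")
    if len(state) % n_procs != 0:
        raise ValueError("the state vector cannot be sliced equally")
    size = len(state) // n_procs
    return [state[rank * size:(rank + 1) * size] for rank in range(n_procs)]


# 3-qubit GHZ state: equal amplitudes on |000> and |111>.
amplitude = complex(2 ** -0.5)
state = [0j] * 8
state[0] = amplitude
state[7] = amplitude

# Two processes: each one holds a 2-qubit (4-element) sub state vector.
subs = slice_state_vector(state, 2)
for rank, sub in enumerate(subs):
    print(rank, len(sub))
```

In this sketch, rank 0 holds the amplitude of |000> and rank 1 holds the amplitude of |111>.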
The current version supports two distributed simulation configurations: (1) multi-process simulation and (2) multi-GPU single-process simulation.
- Multi process simulation 
For this configuration, cusvaer uses the “single device per single process” model. Sub state vectors are placed in GPUs. Each process owns a single GPU, and one sub state vector is allocated in the GPU assigned to the process. Thus, the number of processes is always identical to the number of GPUs. The size of the sub state vector is the same in all processes.
When slicing the state vector, the size of each sub state vector is calculated from the number of qubits in the circuit and the number of GPUs used for distribution.
Example: a 40-qubit c128 simulation can be computed using 32 DGX A100 nodes with the following configuration.
The state vector size : 16 TiB (= 16 [bytes/elm] * (1 << 40 [qubits]))
The number of GPUs per node : 8
The number of nodes : 32
The size of sub state vector in each GPU: 64 [GiB/GPU] = 16 TiB / (32 [node] x 8 [GPU/node])
cusvaer calculates the size of the sub state vector from the number of qubits in a given quantum circuit and the number of processes. Thus, by specifying an appropriate number of processes, users can distribute state vector simulations across processes and nodes. The number of processes should be a power of 2.
For nodes with multiple GPUs, GPUs are assigned to processes by using the following formula that is used to configure affinity.
device_id = mpi_rank % number_of_processes_in_node
- Multi-GPU single process simulation 
For this configuration, cusvaer uses the “single process with multiple GPUs” model. All GPUs are expected to be installed in a single server or workstation. GPUDirect P2P is required; thus, all GPUs must be connected by NVLink/NVSwitch and/or a PCIe bus with a single PCIe switch.
When GPU device IDs are specified in cusvaer_device_ids, the simulator distributes the state vector allocation across the specified GPUs.
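For the multi-process configuration, the sub state vector size and the GPU assignment formula can be sketched in a few lines of Python. This is an illustration only, not cusvaer code: the helper names `sub_state_vector_bytes` and `device_id_for_rank` are hypothetical, and the assumption that consecutive MPI ranks fill one node before the next is mine.

```python
def is_power_of_2(n):
    return n > 0 and (n & (n - 1)) == 0


def sub_state_vector_bytes(n_qubits, n_procs, bytes_per_element=16):
    """Return (qubits per process, bytes per process).

    c128 (complex128) uses 16 bytes per state vector element.
    """
    if not is_power_of_2(n_procs):
        raise ValueError("the number of processes must be a power of 2")
    global_bits = n_procs.bit_length() - 1  # log2(n_procs)
    if global_bits > n_qubits:
        raise ValueError("more processes than state vector elements")
    local_qubits = n_qubits - global_bits
    return local_qubits, bytes_per_element << local_qubits


def device_id_for_rank(mpi_rank, processes_per_node):
    """Formula from the text: device_id = mpi_rank % number_of_processes_in_node."""
    return mpi_rank % processes_per_node


GIB = 1 << 30

# 40 qubits on 32 nodes x 8 GPUs: 32 local qubits, 64 GiB per GPU.
local_qubits, size = sub_state_vector_bytes(40, 32 * 8)
print(local_qubits, size // GIB)

# Device IDs for the first ten ranks on nodes with 8 processes each.
print([device_id_for_rank(rank, 8) for rank in range(10)])
```

The first result reproduces the 40-qubit example above; the same helper gives 19 local qubits for a 20-qubit circuit split across two processes.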
Using CPU and GPU memory to allocate state vector¶
From 24.11, cusvaer supports using CPU memory to allocate a part of the state vector. If the GPU memory capacity is not enough to accommodate a large state vector, CPU memory is additionally used to hold a part of it.
By default, CPU memory is used automatically if a state vector does not fit in GPU memory. Users can also explicitly specify the memory capacities available for state vector allocation with the options cusvaer_max_cpu_memory_mb and cusvaer_max_gpu_memory_mb. These options give the upper limits of the memory capacity available for state vector allocation.
By default, these options are None, which means cusvaer computes the max memory capacity from the physical memory installed in the system, as shown below:
- cusvaer_max_cpu_memory_mb is set to [Physical CPU memory amount] - 4 GiB.
- cusvaer_max_gpu_memory_mb is set to [Physical GPU memory amount] - 2 MiB.
Because the state vector size is always a power of 2, the max GPU memory consumption is also a power of 2. If a GPU has 80 GiB of memory, the max size of the state vector on the GPU is 64 GiB. Each added qubit doubles the state vector size, and the CPU memory usage is [state vector size] - [GPU memory usage]. Adding one qubit gives a 128 GiB state vector, so 64 GiB of CPU memory is used. Adding two qubits to a 64 GiB state vector on the GPU gives a 256 GiB state vector, so the CPU memory usage is 192 GiB. Adding three qubits gives a 512 GiB state vector, of which 448 GiB is held in CPU memory.
In the current implementation, CPU memory can add up to three qubits.
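The arithmetic above can be checked with a short calculation. This is a sketch only; the function name `cpu_memory_gib` is hypothetical and is not part of cusvaer.

```python
def cpu_memory_gib(gpu_state_vector_gib, added_qubits):
    """CPU memory needed when qubits are added beyond what fits on the GPU.

    CPU usage = [state vector size] - [GPU memory usage].
    """
    if not 0 <= added_qubits <= 3:
        raise ValueError("CPU memory can add up to three qubits")
    total_gib = gpu_state_vector_gib << added_qubits  # each qubit doubles the size
    return total_gib - gpu_state_vector_gib


# 64 GiB of the state vector stays on an 80 GiB GPU.
for added in range(1, 4):
    total = 64 << added
    print(added, total, cpu_memory_gib(64, added))
```

For one, two, and three added qubits the CPU usage is 64 GiB, 192 GiB, and 448 GiB, matching the figures in the text.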
Note
It is the user's responsibility to keep the CPU and GPU memory used for state vector allocation free. If the required amount of CPU and/or GPU memory is not available, cusvaer will raise an out-of-memory error.
This feature is provided for “single process with single GPU” simulation and “multi process simulation”. The “single process with multiple GPUs” configuration does not support CPU memory utilization.
Running simulations in GB200 and GH200 clusters¶
Multi-node NVLink (MNNVL) is supported by cusvaer. In GB200 NVL36/NVL72 systems, data transfers between compute nodes are executed via NVLink as peer-to-peer transfers between GPUs. The cusvaer_p2p_device_bits option works in the same way as it does for NVLink within a compute node. To use MNNVL for 32 GPUs in a GB200 NVL36 system, set cusvaer_p2p_device_bits to 5 (log2(32)).
cusvaer automatically detects and enables MNNVL support on the target system. Users can explicitly control MNNVL support using the UBACKEND_USE_FABRIC_HANDLE environment variable. Setting it to “0” disables MNNVL support, while setting it to “1” enables it.
Using cusvaer¶
cusvaer provides one class, cusvaer.backends.StatevectorSimulator. This class has the same interface as the Qiskit Aer StatevectorSimulator.
Below is example code that uses cusvaer.
from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator
def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz
circuit = create_ghz_circuit(20)
simulator = StatevectorSimulator()
simulator.set_options(precision='double')
circuit = transpile(circuit, simulator)
job = simulator.run(circuit, shots=1024)
result = job.result()
if result.mpi_rank == 0:
    print(result.get_counts())
The following console output will be obtained:
$ python ghz_cusvaer.py
{'00000000000000000000': 480, '11111111111111111111': 544}
To run a distributed simulation, users need to use MPI or an appropriate library to specify the number of processes. In the example below, a 20-qubit state vector simulation is distributed to two processes, and each process has a 19-qubit sub state vector. The number of processes should be a power of 2.
$ mpirun -n 2 python ghz_cusvaer.py
Note
The GHZ circuit is used here only as an example; it is not intended to demonstrate performance. The GHZ circuit is known to be too shallow to show good scaling when a simulation is distributed to multiple GPUs and/or multiple nodes.
Note
Starting with 24.11, cusvaer will use CUDA virtual memory management functions for state vector allocations in multi-process simulation when available.
To allow this feature in cuQuantum Appliance, extra capability for inter-process communications may need to be provided to the container by the option --cap-add=SYS_PTRACE as follows:
docker run --cap-add=SYS_PTRACE --gpus all -it --rm "$image_name"
MPI libraries¶
cusvaer provides built-in support for Open MPI and MPICH.
The versions shown below are validated or expected to work.
- Open MPI
  - Validated: v4.1.4 / UCX v1.13.1
  - Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x
- MPICH
  - Validated: v4.0.2
cuQuantum Appliance is shipped with Open MPI v4.1.4 and UCX v1.13.1 binaries compiled with PMI enabled. The build configuration is intended to work with Slurm.
Two options, cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname, are used to select the MPI library. Please refer to CommPlugin for details.
To use other MPI libraries, users need to compile an external plugin. Please refer to cuQuantum GitHub.
cusvaer options¶
cusvaer provides a subset of common Qiskit Aer options and cusvaer-specific options.
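| Option | Value | Description |
| --- | --- | --- |
| shots | non-negative integer | The number of shots |
| precision | 'single' or 'double' | The precision of the state vector elements |
| fusion_enable | True or False | Enable/disable gate fusion |
| fusion_max_qubit | positive integer | The max number of qubits used for gate fusion |
| fusion_threshold | positive integer | Threshold to enable gate fusion |
| memory | True or False | If True, per-shot measurement outcomes are stored in the result |
| seed_simulator | integer | The seed for the random number generator |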
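| Option | Value | Description |
| --- | --- | --- |
| cusvaer_comm_plugin_type | cusvaer.CommPluginType | Selecting CommPlugin for inter-process communication |
| cusvaer_comm_plugin_soname | string | The shared library name for inter-process communication |
| cusvaer_global_index_bits | list of positive integers | Inter-node network structure |
| cusvaer_p2p_device_bits | non-negative integer | log2 of the number of GPUs that can communicate by using GPUDirect P2P |
| cusvaer_data_transfer_buffer_bits | positive integer | The size used for a temporary buffer during data transfers |
| cusvaer_host_threads | positive integer | The number of host threads used to process circuits on host |
| cusvaer_enable_floating_point_emulation | Bool | If True, floating point emulation is enabled |
| cusvaer_diagonal_fusion_max_qubit | integer greater than or equal to -1 | The max number of qubits used for diagonal gate fusion |
| cusvaer_device_ids | list of device IDs as integers | The device IDs of the GPUs used for simulations |
| cusvaer_max_cpu_memory_mb | zero or positive integer | The max capacity of CPU memory in MiB for state vector allocation |
| cusvaer_max_gpu_memory_mb | positive integer | The max GPU memory capacity in MiB for state vector allocation |
| cusvaer_migration_level | zero or positive integer | The position at which the number of migration index bits is inserted into cusvaer_global_index_bits |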
Device selection¶
cusvaer_device_ids¶
cusvaer_device_ids specifies the device IDs of the GPUs that will be used for simulations.
For single-process simulations with a single GPU, this option should contain only one integer that represents a GPU. If None is specified, the current GPU is used.
For single-process simulations with multiple GPUs, this option should contain a list of integers that represents a series of device IDs. The number of device IDs should be a power of 2.
For multi-process simulations with a single GPU per process, this option should contain a single integer that represents the device ID. If None is specified, the GPU is selected automatically according to the rank of the process. None is the default value and is recommended for multi-process simulations.
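The three cases can be written as option dictionaries. The sketch below is illustrative: the helper `check_device_ids` is hypothetical (cusvaer performs its own validation), and passing a one-element list for the single-GPU case is an assumption based on the "list of device IDs" value type in the options table.

```python
def check_device_ids(device_ids):
    """Validate a cusvaer_device_ids value: None, or a power-of-2 count of unique IDs."""
    if device_ids is None:
        return None
    count = len(device_ids)
    if count == 0 or (count & (count - 1)) != 0:
        raise ValueError("the number of device IDs should be a power of 2")
    if len(set(device_ids)) != count:
        raise ValueError("device IDs must be unique")
    return list(device_ids)


# Single process, single GPU: one device ID.
single_gpu_options = {'cusvaer_device_ids': check_device_ids([0])}

# Single process, multiple GPUs: a power-of-2 number of device IDs.
multi_gpu_options = {'cusvaer_device_ids': check_device_ids([0, 1, 2, 3])}

# Multiple processes: None lets the GPU be selected from the process rank.
multi_process_options = {'cusvaer_device_ids': check_device_ids(None)}

print(single_gpu_options, multi_gpu_options, multi_process_options)
```

Each dictionary would then be passed to the simulator with simulator.set_options(**options).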
Multi-process simulation¶
CommPlugin options¶
cusvaer dynamically links to a library that handles inter-process communication at runtime by using CommPlugin.  CommPlugin is a small module that typically wraps MPI libraries.  The CommPlugin is selected by cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname options.
cusvaer_comm_plugin_type¶
The cusvaer_comm_plugin_type option selects the CommPlugin.  The value type is cusvaer.CommPluginType, and the default value is cusvaer.CommPluginType.MPI_AUTO.
# declared in cusvaer
from enum import IntEnum
class CommPluginType(IntEnum):
    SELF        = 0       # Single process
    EXTERNAL    = 1       # Use external plugin
    MPI_AUTO    = 0x100   # Automatically select Open MPI or MPICH
    MPI_OPENMPI = 0x101   # Use Open MPI
    MPI_MPICH   = 0x102   # Use MPICH
SELF
CommPlugin used for single-process simulation. The SELF CommPlugin does not have any external dependency.
MPI_AUTO, MPI_OPENMPI, MPI_MPICH
By specifying MPI_OPENMPI or MPI_MPICH, the specified library is dynamically loaded and used for inter-process data transfers. MPI_AUTO automatically selects one of them. With these options, cusvaer internally uses MPI_COMM_WORLD as the communicator.
EXTERNAL
Option to use an external custom plugin that wraps an inter-process communication library. This option allows the use of a library of the user's choice. Please refer to cuQuantum GitHub for details.
The value of CommPluginType should stay the same during the application lifetime if a value other than CommPluginType.SELF is specified.  This restriction reflects that a process can use only one MPI library during its lifetime.
cusvaer_comm_plugin_soname¶
The cusvaer_comm_plugin_soname option specifies the name of a shared library.  When MPI_AUTO, MPI_OPENMPI, or MPI_MPICH is specified for cusvaer_comm_plugin_type, the name of the corresponding MPI library in the search path should be specified.  The default value is "libmpi.so".
If EXTERNAL is specified, cusvaer_comm_plugin_soname should contain the name of a custom CommPlugin.
Specifying device network structure¶
Two options, cusvaer_global_index_bits and cusvaer_p2p_device_bits, are available for specifying inter-node and intra-node network structures.  By specifying these options, cusvaer schedules data transfers to utilize faster communication paths to accelerate simulations.
cusvaer_global_index_bits¶
cusvaer_global_index_bits is a list of positive integers that represents the inter-node network structure.
Assume a cluster in which groups of 8 nodes share a faster communication network, and a 32-node simulation. The value of cusvaer_global_index_bits is then [3, 2].  The first element, 3, is log2(8), representing the 8 nodes with fast communication, which correspond to 3 qubits in the state vector.  The second element, 2, is log2(4), representing the four 8-node groups in the 32 nodes.  The sum of the cusvaer_global_index_bits elements is 5, which means the number of nodes is 32 = 2^5.
The last element of cusvaer_global_index_bits can be omitted.
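The [3, 2] example can be derived from the group sizes at each network layer. This is a minimal sketch; the helper name `global_index_bits` is hypothetical and not a cusvaer function.

```python
def global_index_bits(group_sizes):
    """Convert group sizes, ordered from the fastest to the slowest network layer,
    into a list of index bits (log2 of each size)."""
    bits = []
    for size in group_sizes:
        if size <= 0 or (size & (size - 1)) != 0:
            raise ValueError("every group size must be a power of 2")
        bits.append(size.bit_length() - 1)  # log2(size)
    return bits


# 8 nodes per fast group, 4 such groups in a 32-node simulation.
bits = global_index_bits([8, 4])
n_nodes = 1 << sum(bits)
print(bits, n_nodes)
```

The sum of the returned bits gives log2 of the total node count, which is a quick consistency check for a configuration.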
cusvaer_p2p_device_bits¶
The cusvaer_p2p_device_bits option specifies, as a base-2 logarithm, the number of GPUs that can communicate by using GPUDirect P2P.
For an 8-GPU node such as DGX A100, the value is log2(8) = 3.
The value of cusvaer_p2p_device_bits is typically the same as the first element of cusvaer_global_index_bits, because the GPUDirect P2P network is typically the fastest in a cluster.
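As a quick check of the values quoted in this document, the option value can be computed from the number of GPUs in one P2P domain. The helper name `p2p_device_bits` is hypothetical and used only for illustration.

```python
def p2p_device_bits(n_p2p_gpus):
    """Return log2 of the number of GPUs that can reach each other via GPUDirect P2P."""
    if n_p2p_gpus <= 0 or (n_p2p_gpus & (n_p2p_gpus - 1)) != 0:
        raise ValueError("the number of GPUs must be a power of 2")
    return n_p2p_gpus.bit_length() - 1


print(p2p_device_bits(8))   # 8-GPU node such as DGX A100
print(p2p_device_bits(32))  # 32 GPUs connected by multi-node NVLink
```

The two calls give 3 and 5, the values used for DGX A100 and for 32 GPUs in a GB200 NVL36 system.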
cusvaer_data_transfer_buffer_bits¶
cusvaer_data_transfer_buffer_bits specifies the size of the buffer used for inter-node data transfers.  Each process (MPI rank) allocates (1 << cusvaer_data_transfer_buffer_bits) bytes of device memory.  The default is 26 (64 MiB).  The minimum allowed value is 24 (16 MiB).
Depending on the system, setting a larger value for cusvaer_data_transfer_buffer_bits can accelerate inter-node data transfers.
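The relation between the option value and the buffer size is a single shift. The sketch below uses a hypothetical helper name, `transfer_buffer_bytes`, to show the device memory each process would allocate for a few values.

```python
MIB = 1 << 20


def transfer_buffer_bytes(buffer_bits=26):
    """Bytes of device memory allocated per process for the transfer buffer."""
    if buffer_bits < 24:
        raise ValueError("the minimum allowed value is 24 (16 MiB)")
    return 1 << buffer_bits


for bits in (24, 26, 28):
    print(bits, transfer_buffer_bytes(bits) // MIB, "MiB")
```

Each increment of the option doubles the buffer, so the GPU memory left for the sub state vector shrinks accordingly.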
CPU memory utilization¶
cusvaer_max_cpu_memory_mb¶
The cusvaer_max_cpu_memory_mb option specifies the max capacity of CPU memory used for state vector allocation.  The unit is MiB.  The option value can be None (default), 0, or a positive integer.
If None is specified, the max CPU memory capacity is calculated from the physical memory size of the CPU.
If 0 is specified, CPU memory utilization is disabled, and the state vector is allocated only on GPUs.
If a positive integer is specified, the value is used as the max size of CPU memory for state vector allocation. The value should not exceed the amount of physical CPU memory.
cusvaer_max_gpu_memory_mb¶
The cusvaer_max_gpu_memory_mb option specifies the max capacity of GPU memory used for state vector allocation.  The unit is MiB.  The option value can be None or a positive integer.  The default value is None.
If None is specified, the max GPU memory capacity is calculated from the physical memory size of the GPU.
If a positive integer is specified, the value is used as the max size of GPU memory for state vector allocation. The value should not exceed the amount of physical GPU memory.
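Because both options take MiB, converting from GiB is a common source of mistakes. The following sketch builds the two option values; the helper `memory_options` is hypothetical, and only the option names come from this document.

```python
MIB_PER_GIB = 1024


def _to_mib(gib):
    return None if gib is None else gib * MIB_PER_GIB


def memory_options(max_gpu_gib=None, max_cpu_gib=None):
    """Build the two memory options; None keeps the default (physical memory based) limit."""
    if max_gpu_gib is not None and max_gpu_gib <= 0:
        raise ValueError("cusvaer_max_gpu_memory_mb must be None or a positive integer")
    if max_cpu_gib is not None and max_cpu_gib < 0:
        raise ValueError("cusvaer_max_cpu_memory_mb must be None, 0, or a positive integer")
    return {
        'cusvaer_max_gpu_memory_mb': _to_mib(max_gpu_gib),
        'cusvaer_max_cpu_memory_mb': _to_mib(max_cpu_gib),
    }


# Limit the GPU to 64 GiB and disable CPU memory utilization.
gpu_only = memory_options(max_gpu_gib=64, max_cpu_gib=0)
print(gpu_only)
```

The resulting dictionary would be passed to the simulator with simulator.set_options(**gpu_only).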
cusvaer_migration_level¶
Extending the state vector by using CPU memory works by inserting migration index bits, which represent the qubits allocated across both CPU and GPU memory. Each inserted migration index bit doubles the state vector size.
The cusvaer_migration_level option specifies the position at which the number of migration index bits is inserted into cusvaer_global_index_bits.
The numbers of index bits in cusvaer_global_index_bits are ordered from faster to slower network connections for the best performance.  Migrating the state vector between CPU and GPU also transfers state vector elements between them, so simulation performance depends on the bandwidth of CPU-GPU data transfers.  Choosing an appropriate position at which to insert the number of migration index bits into cusvaer_global_index_bits optimizes the simulation performance.
The default value is None, which means that the number of migration index bits is inserted at the end of cusvaer_global_index_bits.  The default setting assumes that the data transfer bandwidth between CPU and GPU is lower than that of inter-GPU connections such as NVLink and InfiniBand.
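The ordering idea can be pictured as a list insertion. This sketch only mirrors the description above and is not how cusvaer is implemented internally; the function `insert_migration_bits` and its arguments are hypothetical.

```python
def insert_migration_bits(global_index_bits, n_migration_bits, migration_level=None):
    """Return a new list with the number of migration index bits inserted.

    migration_level=None appends at the end, i.e. CPU-GPU transfers are
    treated as the slowest layer.
    """
    bits = list(global_index_bits)
    position = len(bits) if migration_level is None else migration_level
    if not 0 <= position <= len(bits):
        raise ValueError("migration_level is out of range")
    bits.insert(position, n_migration_bits)
    return bits


# One migration index bit (state vector doubled by CPU memory).
print(insert_migration_bits([3, 2], 1))      # default: slowest layer
print(insert_migration_bits([3, 2], 1, 1))   # between the two network layers
```

Placing the entry earlier in the list corresponds to treating CPU-GPU transfers as faster than the network layers that follow it.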
Other options¶
cusvaer_host_threads¶
cusvaer_host_threads specifies the number of CPU threads used for circuit processing.  The default value is 8, which is a reasonable choice because the performance is not sensitive to this option.
cusvaer_enable_floating_point_emulation¶
cusvaer_enable_floating_point_emulation specifies whether floating point emulation is enabled for B200 GPUs.  Floating point emulation is enabled when True is specified for this option; otherwise, it is disabled.  The default value is True.
cusvaer_diagonal_fusion_max_qubit¶
cusvaer_diagonal_fusion_max_qubit specifies the maximum number of qubits allowed per fused diagonal gate. The default value is -1, in which case the fusion size is adjusted automatically for better performance. If 0 is specified, gate fusion for diagonal gates is disabled.
Custom instruction¶
set_state_simple(state)¶
This instruction sets the state vector. The state argument is a numpy ndarray. If the dtype of the state argument differs from the state vector precision, the given ndarray is converted to the state vector type before being set.
When running a single-process simulation, the array elements of the state argument are set to the state vector. The state argument must have the same size as the state vector.
If a simulation is distributed to multiple processes, this instruction sets the sub state vector allocated in the device assigned to each process. The size of the state argument should be the same as that of the sub state vector.
Unlike the set_statevector() instruction provided by Qiskit Aer, set_state_simple() does not check the absolute value of the state argument.
save_state_simple()¶
This instruction saves the state vector allocated in the GPU assigned to a process. The saved state vector is returned as a numpy ndarray in the result object.
When running a single process simulation, the saved ndarray is a state vector. If a simulation is distributed to multiple processes, the saved ndarray is a sub state vector owned by a process.
Note
The above custom instructions are preliminary and are subject to change.
Example of cusvaer option configurations¶
import cusvaer
options = {
  'cusvaer_comm_plugin_type': cusvaer.CommPluginType.MPI_AUTO,  # automatically select Open MPI or MPICH
  'cusvaer_comm_plugin_soname': 'libmpi.so',  # MPI library name is libmpi.so
  'cusvaer_global_index_bits': [3, 2],  # 8 devices per node, 4 nodes
  'cusvaer_p2p_device_bits': 3,         # 8 GPUs in one node
  'precision': 'double'         # use complex128
}
simulator = cusvaer.backends.StatevectorSimulator()
simulator.set_options(**options)
Exception¶
cusvaer uses qiskit.providers.basicaer.BasicAerError class to raise exceptions.
cusvaer environment variable¶
UBACKEND_USE_FABRIC_HANDLE¶
The UBACKEND_USE_FABRIC_HANDLE environment variable controls Multi-node NVLink (MNNVL) support in cusvaer.
- Setting it to “0” disables MNNVL support. 
- Setting it to “1” enables MNNVL support. 
- When not set, cusvaer automatically detects and enables MNNVL support where applicable. 
$ export UBACKEND_USE_FABRIC_HANDLE="0"  # Disable MNNVL support
$ export UBACKEND_USE_FABRIC_HANDLE="1"  # Enable MNNVL support
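The same setting can be applied from Python instead of the shell. This is a minimal sketch; it assumes that cusvaer reads the variable only after this assignment (for example, when the simulator is created or the first simulation runs), so the cautious choice is to set it at the very top of the script, before importing cusvaer.

```python
import os

# Disable MNNVL support for this process; use "1" to enable it explicitly.
# Assumption: this runs before cusvaer inspects the environment.
os.environ["UBACKEND_USE_FABRIC_HANDLE"] = "0"

print(os.environ["UBACKEND_USE_FABRIC_HANDLE"])
```

Leaving the variable unset keeps the automatic detection described above.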
Interoperability with mpi4py¶
mpi4py is a Python package that wraps MPI. When mpi4py is imported before simulations are run, cusvaer interoperates with mpi4py.
from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator
# import mpi4py here to call MPI_Init()
from mpi4py import MPI
def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz
print(f"mpi4py: rank: {MPI.COMM_WORLD.Get_rank()}, size: {MPI.COMM_WORLD.Get_size()}")
n_qubits = 20  # number of qubits in the GHZ circuit
circuit = create_ghz_circuit(n_qubits)
# Create StatevectorSimulator instead of using Aer.get_backend()
simulator = StatevectorSimulator()
circuit = transpile(circuit, simulator)
job = simulator.run(circuit)
result = job.result()
print(f"Result: rank: {result.mpi_rank}, size: {result.num_mpi_processes}")
cusvaer calls MPI_Init() right before executing the first simulation, and calls MPI_Finalize() when Python runs the functions registered with atexit.
cusvaer checks whether MPI is initialized before calling MPI_Init().  If MPI_Init() has already been called before the first simulation, the MPI_Init() and MPI_Finalize() calls are skipped.
With mpi4py, MPI_Init() is called when mpi4py is loaded. Thus, mpi4py and cusvaer coexist when mpi4py is imported before simulator.run() is called.