cusvaer¶
Starting with cuQuantum Appliance 22.11, we have included cusvaer. cusvaer is designed as a Qiskit backend solver and is optimized for distributed state vector simulation.
Features¶
Distributed simulation
cusvaer distributes state vector simulations to multiple processes and nodes.
Power-of-2 configuration for performance
The numbers of GPUs, processes, and nodes should always be powers of 2. This design is chosen for optimal performance.
Shipped with a validated set of libraries
cusvaer is shipped with the latest cuStateVec and a validated set of MPI libraries. The performance has been validated on NVIDIA SuperPOD.
Distributed state vector simulation¶
A popular method to distribute state vector simulation is to equally slice the state vector into multiple sub state vectors and distribute them to processes. Some examples of distributed state vector simulation are also described in Multi-GPU Computation. In cusvaer, each equally-sized slice of the state vector is called a sub state vector.
The first release of cusvaer uses the "single device per single process" model. All sub state vectors are placed in GPUs. Each process owns a single GPU, and one sub state vector is allocated in the GPU assigned to that process. Thus, the number of processes is always identical to the number of GPUs, and the size of the sub state vector is the same in all processes.
When slicing the state vector, the size of each sub state vector is calculated from the number of qubits in a circuit and the number of GPUs used for distribution.
Example: a 40-qubit complex128 (c128) simulation can be computed using 32 DGX A100 nodes with the following configuration.
The state vector size: 16 TiB (= 16 [bytes/elm] * (1 << 40 [qubits]))
The number of GPUs per node: 8
The number of nodes: 32
The size of the sub state vector in each GPU: 64 GiB = 16 TiB / (32 [nodes] x 8 [GPUs/node])
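The arithmetic above can be sketched in plain Python. This is an illustrative helper, not part of cusvaer; it only reproduces the size calculation for the configuration in the example.

```python
# Sketch: compute the sub state vector size per GPU for a distributed
# simulation, assuming one process per GPU and complex128 (16 bytes) elements.

def sub_state_vector_bytes(n_qubits, n_nodes, gpus_per_node, bytes_per_element=16):
    """Size in bytes of the sub state vector held by each GPU/process."""
    state_vector_bytes = bytes_per_element * (1 << n_qubits)
    n_processes = n_nodes * gpus_per_node  # one process per GPU
    return state_vector_bytes // n_processes

size = sub_state_vector_bytes(n_qubits=40, n_nodes=32, gpus_per_node=8)
print(size // (1 << 30), "GiB per GPU")  # 64 GiB per GPU, matching the example
```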
cusvaer calculates the size of the sub state vector from the number of qubits of a given quantum circuit and the number of processes. Thus, by specifying an appropriate number of processes, users can distribute state vector simulations across processes and nodes. The number of processes should be a power of 2.
For nodes with multiple GPUs, GPUs are assigned to processes using the following formula, which configures the device affinity:
device_id = mpi_rank % number_of_processes_in_node
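The affinity formula can be illustrated with a small sketch. This assumes ranks are placed node by node (consecutive ranks fill one node before the next), which is the common MPI rank placement; the helper name is hypothetical.

```python
# Sketch of the affinity formula above: map each MPI rank to a local device id.

def assign_device(mpi_rank, number_of_processes_in_node):
    return mpi_rank % number_of_processes_in_node

# For two 8-GPU nodes (e.g. DGX A100), ranks 0..15 map to local devices 0..7
# on each node:
ids = [assign_device(rank, 8) for rank in range(16)]
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```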
Using cusvaer¶
cusvaer provides one class, cusvaer.backends.StatevectorSimulator. This class has the same interface as the Qiskit Aer StatevectorSimulator.
Below is example code for using cusvaer.
from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator

def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz

circuit = create_ghz_circuit(20)

simulator = StatevectorSimulator()
simulator.set_options(precision='double')

circuit = transpile(circuit, simulator)
job = simulator.run(circuit, shots=1024)
result = job.result()

if result.mpi_rank == 0:
    print(result.get_counts())
The following console output will be obtained:
$ python ghz_cusvaer.py
{'00000000000000000000': 480, '11111111111111111111': 544}
To run a distributed simulation, users need to use MPI or an appropriate launcher to specify the number of processes. In the example below, a 20-qubit state vector simulation is distributed to two processes, and each process owns a 19-qubit sub state vector. The number of processes should be a power of 2.
$ mpirun -n 2 python ghz_cusvaer.py
Note
The GHZ circuit is used here only as an example and is not meant to demonstrate performance. The GHZ circuit is known to be too shallow to show good scaling when the simulation is distributed to multiple GPUs and/or multiple nodes.
MPI libraries¶
cusvaer provides built-in support for Open MPI and MPICH.
The versions shown below are validated or expected to work.
Open MPI
Validated: v4.1.4 / UCX v1.13.1
Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x
MPICH
Validated: v4.0.2
cuQuantum Appliance is shipped with Open MPI v4.1.4 and UCX v1.13.1 binaries compiled with PMI enabled. The build configuration is intended to work with Slurm.
Two options, cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname, are used to select the MPI library. Please refer to CommPlugin for details.
To use other MPI libraries, users need to compile an external plugin. Please refer to the cuQuantum GitHub repository.
cusvaer options¶
cusvaer provides a subset of common Qiskit Aer options and cusvaer-specific options.
Option | Value | Description
shots | non-negative integer | the number of shots
precision | 'single' or 'double' | the precision of the state vector elements
fusion_enable | True or False | enable/disable gate fusion
fusion_max_qubit | positive integer | the max number of qubits used for gate fusion
fusion_threshold | positive integer | threshold to enable gate fusion
memory | True or False | if True, per-shot measurement outcomes are saved
seed_simulator | integer | the seed for the random number generator
Option | Value | Description
cusvaer_comm_plugin_type | cusvaer.CommPluginType | selecting the CommPlugin for inter-process communication
cusvaer_comm_plugin_soname | string | the shared library name for inter-process communication
cusvaer_global_index_bits | list of positive integers | network structure
cusvaer_p2p_device_bits | non-negative integer | the number of GPUs that can communicate via GPUDirect P2P
cusvaer_data_transfer_buffer_bits | positive integer | the size used for a temporary buffer during data transfers
cusvaer_host_threads | positive integer | the number of host threads used to process circuits
cusvaer_diagonal_fusion_max_qubit | integer greater than or equal to -1 | the max number of qubits used for diagonal gate fusion
CommPlugin¶
cusvaer dynamically links to a library that handles inter-process communication by using CommPlugin. CommPlugin is a small module that typically wraps MPI libraries. The CommPlugin is selected by cusvaer_comm_plugin_type
and cusvaer_comm_plugin_soname
options.
# declared in cusvaer
from enum import IntEnum

class CommPluginType(IntEnum):
    SELF        = 0      # Single process
    EXTERNAL    = 1      # Use external plugin
    MPI_AUTO    = 0x100  # Automatically select Open MPI or MPICH
    MPI_OPENMPI = 0x101  # Use Open MPI
    MPI_MPICH   = 0x102  # Use MPICH
SELF
The CommPlugin used for single-process simulation. The SELF CommPlugin does not have any external dependency.
MPI_AUTO, MPI_OPENMPI, MPI_MPICH
By specifying MPI_OPENMPI or MPI_MPICH, the specified library is dynamically loaded and used for inter-process data transfers. MPI_AUTO automatically selects one of them. With these options, cusvaer internally uses MPI_COMM_WORLD as the communicator.
EXTERNAL
Option to use an external custom plugin that wraps an inter-process communication library. This option allows users to use libraries of their preference. Please refer to the cuQuantum GitHub repository for details.
The value of CommPluginType should stay the same during the application lifetime if a value other than CommPluginType.SELF is specified. This reflects that a process can use only one MPI library during the application lifetime.
cusvaer_comm_plugin_soname¶
The cusvaer_comm_plugin_soname option specifies the name of a shared library. When MPI_AUTO, MPI_OPENMPI, or MPI_MPICH is specified for cusvaer_comm_plugin_type, the name of the corresponding MPI library in the search path should be specified. The default value is "libmpi.so".
If EXTERNAL is specified, cusvaer_comm_plugin_soname should contain the name of a custom CommPlugin.
Specifying device network structure¶
Two options, cusvaer_global_index_bits
and cusvaer_p2p_device_bits
, are available for specifying inter-node and intra-node network structures. By specifying these options, cusvaer schedules data transfers to utilize faster communication paths to accelerate simulations.
cusvaer_global_index_bits¶
cusvaer_global_index_bits is a list of positive integers that represents the inter-node network structure.
Assuming a cluster in which groups of 8 nodes share a faster network, and running a 32-node simulation, the value of cusvaer_global_index_bits is [3, 2]. The first element, 3, is log2(8), representing the 8 nodes with fast communication, which corresponds to 3 qubits in the state vector. The second element, 2, means there are four 8-node groups among the 32 nodes. The sum of the cusvaer_global_index_bits elements is 5, which means the number of nodes is 32 = 2^5.
The last element of cusvaer_global_index_bits can be omitted.
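The consistency condition described above can be expressed as a short sketch. This is an illustrative check, not a cusvaer API; the helper name is hypothetical.

```python
import math

# Sketch: check that a cusvaer_global_index_bits candidate is consistent with
# the total node count, i.e. the elements sum to log2(n_nodes).

def check_global_index_bits(global_index_bits, n_nodes):
    return sum(global_index_bits) == int(math.log2(n_nodes))

print(check_global_index_bits([3, 2], 32))  # True: 2^(3+2) == 32 nodes
print(check_global_index_bits([3, 2], 16))  # False
```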
cusvaer_p2p_device_bits¶
The cusvaer_p2p_device_bits option specifies the number of GPUs that can communicate by using GPUDirect P2P. For an 8-GPU node such as DGX A100, the number is log2(8) = 3.
The value of cusvaer_p2p_device_bits is typically the same as the first element of cusvaer_global_index_bits, as the GPUDirect P2P network is typically the fastest in a cluster.
cusvaer_data_transfer_buffer_bits¶
cusvaer_data_transfer_buffer_bits specifies the size of the buffer used for inter-node data transfers. Each process allocates (1 << cusvaer_data_transfer_buffer_bits) bytes of device memory. The default is 26 (64 MiB), and the minimum allowed value is 24 (16 MiB).
Depending on the system, setting a larger value of cusvaer_data_transfer_buffer_bits can accelerate inter-node data transfers.
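The buffer-size arithmetic can be sketched as follows. This is an illustrative helper only, mirroring the constraint stated above; it is not part of cusvaer.

```python
# Sketch: the device memory each process allocates for the transfer buffer,
# given cusvaer_data_transfer_buffer_bits.

def transfer_buffer_bytes(buffer_bits):
    if buffer_bits < 24:  # minimum allowed value per the text above
        raise ValueError("cusvaer_data_transfer_buffer_bits must be >= 24")
    return 1 << buffer_bits

print(transfer_buffer_bytes(26) // (1 << 20), "MiB")  # 64 MiB (the default)
print(transfer_buffer_bytes(24) // (1 << 20), "MiB")  # 16 MiB (the minimum)
```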
Other options¶
cusvaer_host_threads¶
cusvaer_host_threads specifies the number of CPU threads used for circuit processing. The default value is 8, which works well in most cases; performance is not sensitive to this option.
cusvaer_diagonal_fusion_max_qubit¶
cusvaer_diagonal_fusion_max_qubit specifies the maximum number of qubits allowed per fused diagonal gate. The default value is -1, in which case the fusion size is automatically adjusted for better performance. If 0, gate fusion for diagonal gates is disabled.
Custom instructions¶
set_state_simple(state)¶
This instruction sets the state vector. The state argument is a NumPy ndarray. If the dtype of the state argument differs from the state vector precision, the given ndarray is converted to the type of the state vector before being set.
When running a single-process simulation, the elements of the state argument are copied to the state vector; the state argument has the same size as the state vector.
If a simulation is distributed to multiple processes, this instruction sets the sub state vector allocated in the device assigned to each process. The size of the state argument should be the same as that of the sub state vector.
Unlike the set_statevector() instruction provided by Qiskit Aer, set_state_simple() does not validate the norm of the state argument.
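The expected size and dtype of the state argument on each process can be sketched in plain Python. This is a hypothetical helper for illustration, not a cusvaer API; it just encodes the size relationship described above.

```python
# Sketch: the length and dtype the state argument must have on each process
# when a simulation is distributed over a power-of-2 number of processes.

def expected_state_arg(n_qubits, n_processes, precision='double'):
    assert n_processes & (n_processes - 1) == 0, "process count must be a power of 2"
    # each doubling of the process count removes one qubit from the local slice
    sub_qubits = n_qubits - n_processes.bit_length() + 1
    dtype = 'complex128' if precision == 'double' else 'complex64'
    return (1 << sub_qubits), dtype

# 20-qubit circuit on 2 processes: each process holds a 19-qubit sub state vector
print(expected_state_arg(20, 2))  # (524288, 'complex128')
```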
save_state_simple()¶
This instruction saves the state vector allocated in the GPU assigned to a process. The saved state vector is returned as a NumPy ndarray in the result object.
When running a single-process simulation, the saved ndarray is the full state vector. If a simulation is distributed to multiple processes, the saved ndarray is the sub state vector owned by each process.
Note
The above custom instructions are preliminary and are subject to change.
Example of cusvaer option configurations¶
import cusvaer

options = {
    'cusvaer_comm_plugin_type': cusvaer.CommPluginType.MPI_AUTO,  # automatically select Open MPI or MPICH
    'cusvaer_comm_plugin_soname': 'libmpi.so',  # MPI library name is libmpi.so
    'cusvaer_global_index_bits': [3, 2],  # four 8-node groups, 32 nodes in total
    'cusvaer_p2p_device_bits': 3,         # 8 GPUs in one node
    'precision': 'double'                 # use complex128
}

simulator = cusvaer.backends.StatevectorSimulator()
simulator.set_options(**options)
Exception¶
cusvaer uses the qiskit.providers.basicaer.BasicAerError class to raise exceptions.
Interoperability with mpi4py¶
mpi4py is a Python package that wraps MPI. By importing mpi4py before running simulations, cusvaer interoperates with mpi4py.
from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator
# import mpi4py here to call MPI_Init()
from mpi4py import MPI

def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz

print(f"mpi4py: rank: {MPI.COMM_WORLD.Get_rank()}, size: {MPI.COMM_WORLD.Get_size()}")

n_qubits = 20
circuit = create_ghz_circuit(n_qubits)

# Create StatevectorSimulator instead of using Aer.get_backend()
simulator = StatevectorSimulator()

circuit = transpile(circuit, simulator)
job = simulator.run(circuit)
result = job.result()

print(f"Result: rank: {result.mpi_rank}, size: {result.num_mpi_processes}")
cusvaer calls MPI_Init() right before executing the first simulation, and calls MPI_Finalize() when Python calls the functions registered via atexit.
cusvaer checks whether MPI is already initialized before calling MPI_Init(). If MPI_Init() has already been called before the first simulation, the MPI_Init() and MPI_Finalize() calls are skipped.
For mpi4py, MPI_Init() is called when mpi4py is imported. Thus, mpi4py and cusvaer coexist as long as mpi4py is imported before calling simulator.run().