cusvaer

Starting with cuQuantum Appliance 22.11, we have included cusvaer. cusvaer is designed as a Qiskit backend solver and is optimized for distributed state vector simulation.

Features

  • Distributed simulation

    cusvaer distributes state vector simulations across multiple GPUs and CPUs.

  • CPU memory as external storage for the state vector

    cusvaer can allocate the state vector not only in GPU memory but also in CPU memory to accommodate larger state vectors.

  • Power-of-2 configuration for performance

    The number of GPUs, the number of processes, and the number of nodes should always be a power of 2. This design was chosen for optimal performance.

  • Shipped with a validated set of libraries

    cusvaer is shipped with the latest cuStateVec and a validated set of MPI libraries. The performance has been validated on NVIDIA SuperPOD.

Distributed state vector simulation

A popular method for distributing a state vector simulation is to slice the state vector equally into multiple sub state vectors and distribute them to processes. Some examples of distributed state vector simulation are also described in Multi-GPU Computation. In cusvaer, each equally sized slice is called a sub state vector.

The current version supports two distributed simulation configurations, (1) multi process simulation and (2) multi-GPU single process simulation.

  1. Multi process simulation

For this configuration, cusvaer uses the “single device per single process” model. Sub state vectors are placed in GPUs. Each process owns a single GPU, and one sub state vector is allocated in the GPU assigned to the process. Thus, the number of processes is always identical to the number of GPUs. The size of sub state vector is the same in all processes.

When slicing the state vector, the size of each sub state vector is calculated from the number of qubits in the circuit and the number of GPUs used for distribution.

For example, a 40-qubit c128 simulation can be computed using 32 DGX A100 nodes with the following configuration.

  • The state vector size : 16 TiB (= 16 [bytes/elm] * (1 << 40 [qubits]))

  • The number of GPUs per node : 8

  • The number of nodes : 32

  • The size of sub state vector in each GPU: 64 [GiB/GPU] = 16 TiB / (32 [node] x 8 [GPU/node])

cusvaer calculates the size of each sub state vector from the number of qubits of a given quantum circuit and the number of processes. Thus, by specifying an appropriate number of processes, users can distribute state vector simulations to processes and nodes. The number of processes should be a power of 2.
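
The arithmetic in the example above can be reproduced with a short sketch. The function name and its parameters are hypothetical and are not part of cusvaer; the sketch only divides the full state vector size equally among processes, assuming 16 bytes per element for c128.

```python
GIB = 1 << 30


def sub_state_vector_bytes(n_qubits: int, n_processes: int, bytes_per_element: int = 16) -> int:
    """Return the size in bytes of the sub state vector held by each process.

    Hypothetical helper for illustration; bytes_per_element=16 corresponds to c128.
    """
    # The number of processes must be a power of 2.
    if n_processes < 1 or n_processes & (n_processes - 1):
        raise ValueError("the number of processes must be a power of 2")
    if n_processes > (1 << n_qubits):
        raise ValueError("more processes than state vector elements")
    total_bytes = bytes_per_element * (1 << n_qubits)
    return total_bytes // n_processes


# 40-qubit c128 simulation on 32 nodes with 8 GPUs each (256 processes)
size = sub_state_vector_bytes(40, 32 * 8)
print(size // GIB, "GiB per GPU")
```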

For nodes with multiple GPUs, GPUs are assigned to processes by the following formula, which is used to configure affinity.

device_id = mpi_rank % number_of_processes_in_node
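
As a minimal sketch of this formula, the snippet below maps ranks to device IDs for two nodes with 8 processes each. The function name is hypothetical, and the sketch assumes ranks are numbered consecutively within each node.

```python
def assign_device_id(mpi_rank: int, number_of_processes_in_node: int) -> int:
    """Apply device_id = mpi_rank % number_of_processes_in_node."""
    return mpi_rank % number_of_processes_in_node


# 16 ranks, 8 processes per node: ranks 0-7 and ranks 8-15 each map to devices 0-7
device_ids = [assign_device_id(rank, 8) for rank in range(16)]
print(device_ids)
```
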

  2. Multi-GPU single process simulation

For this configuration, cusvaer uses the “single process with multiple GPUs” model. All GPUs are expected to be installed in a single server or workstation. GPUDirect P2P is required; thus, all GPUs must be connected by NVLink/NVSwitch and/or a PCIe bus with a single PCIe switch.

By specifying GPU device IDs in cusvaer_device_ids, the simulator uses the specified GPUs to distribute the state vector allocation.

Using CPU and GPU memory to allocate state vector

Starting with 24.11, cusvaer supports using CPU memory to allocate part of the state vector. If GPU memory is not large enough to accommodate a large state vector, CPU memory is additionally used to hold part of the state vector.

By default, CPU memory is used automatically if a state vector does not fit in GPU memory. Users can also explicitly specify the memory capacities available for state vector allocation with the options cusvaer_max_cpu_memory_mb and cusvaer_max_gpu_memory_mb. These options give the upper limits of the memory capacity available for state vector allocation.

By default, these options are None, which means cusvaer computes the maximum memory capacity from the physical memory installed in the system as shown below:

  • cusvaer_max_cpu_memory_mb is set to [Physical CPU memory amount] - 4 GiB.

  • cusvaer_max_gpu_memory_mb is set to [Physical GPU memory amount] - 2 MiB.

For GPU memory consumption, the state vector size is always a power of 2. If a GPU has 80 GiB of memory, the maximum size of the state vector on the GPU is 64 GiB. Adding one qubit doubles the state vector size to 128 GiB, so 64 GiB of CPU memory is used. The CPU memory usage is [state vector size] - [GPU memory usage]. Adding two qubits to a 64 GiB state vector on the GPU gives a 256 GiB state vector, so the CPU memory usage is 192 GiB. Adding three qubits gives a 512 GiB state vector, of which 448 GiB resides in CPU memory.

In the current implementation, CPU memory can add up to three qubits.
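
The numbers above can be checked with a short sketch. Both helper functions are hypothetical and only reproduce the arithmetic described in this section; they do not query or configure cusvaer.

```python
GIB = 1 << 30


def largest_power_of_two_at_most(n_bytes: int) -> int:
    """Largest power-of-2 size in bytes that fits in n_bytes (n_bytes > 0)."""
    return 1 << (n_bytes.bit_length() - 1)


def cpu_memory_usage(gpu_state_vector_bytes: int, extra_qubits: int) -> int:
    """CPU memory usage = [state vector size] - [GPU memory usage]."""
    # The current implementation adds up to three qubits with CPU memory.
    if not 0 <= extra_qubits <= 3:
        raise ValueError("extra_qubits must be between 0 and 3")
    state_vector_bytes = gpu_state_vector_bytes << extra_qubits
    return state_vector_bytes - gpu_state_vector_bytes


on_gpu = largest_power_of_two_at_most(80 * GIB)
print("on GPU:", on_gpu // GIB, "GiB")
for extra in (1, 2, 3):
    print(extra, "extra qubit(s):", cpu_memory_usage(on_gpu, extra) // GIB, "GiB on CPU")
```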

Note

It is the user's responsibility to keep the CPU and GPU memory used for state vector allocation free. If the required amount of CPU and/or GPU memory is not available, cusvaer will raise an out-of-memory error.

This feature is provided for “single process with single GPU” simulation and “multi process simulation”. The “single process with multiple GPUs” configuration does not support CPU memory utilization.

Using cusvaer

cusvaer provides one class, cusvaer.backends.StatevectorSimulator. This class has the same interface as the Qiskit Aer StatevectorSimulator.

Below is example code that uses cusvaer.

from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator

def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz

circuit = create_ghz_circuit(20)

simulator = StatevectorSimulator()
simulator.set_options(precision='double')
circuit = transpile(circuit, simulator)
job = simulator.run(circuit, shots=1024)
result = job.result()

if result.mpi_rank == 0:
    print(result.get_counts())

The following console output will be obtained:

$ python ghz_cusvaer.py
{'00000000000000000000': 480, '11111111111111111111': 544}

To run a distributed simulation, users need to use MPI or an appropriate library to specify the number of processes. In the example below, a 20-qubit state vector simulation is distributed to two processes, and each process holds a 19-qubit sub state vector. The number of processes should be a power of 2.

$ mpirun -n 2 python ghz_cusvaer.py

Note

The GHZ circuit is used here only as an example; it is not intended to demonstrate performance. The GHZ circuit is known to be too shallow to show good scaling when the simulation is distributed to multiple GPUs and/or multiple nodes.

Note

Starting with 24.11, cusvaer uses CUDA virtual memory management functions for state vector allocations in multi-process simulation when available. To enable this feature in cuQuantum Appliance, an extra capability for inter-process communication may need to be granted to the container with the option --cap-add=SYS_PTRACE as follows:

docker run --cap-add=SYS_PTRACE --gpus all -it --rm "$image_name"

MPI libraries

cusvaer provides built-in support for Open MPI and MPICH.

The versions shown below are validated or expected to work.

  • Open MPI

    • Validated: v4.1.4 / UCX v1.13.1

    • Expected to work: v3.0.x, v3.1.x, v4.0.x, v4.1.x

  • MPICH

    • Validated: v4.0.2

cuQuantum Appliance is shipped with Open MPI v4.1.4 and UCX v1.13.1 binaries compiled with PMI enabled. The build configuration is intended to work with Slurm.

Two options, cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname, are used to select the MPI library. Please refer to CommPlugin for details.

To use other MPI libraries, users need to compile an external plugin. Please refer to cuQuantum GitHub.

cusvaer options

cusvaer provides a subset of common Qiskit Aer options and cusvaer-specific options.

Common options provided by Qiskit Aer

Option            Value                 Description
shots             non-negative integer  The number of shots
precision         "single" or "double"  The precision of the state vector elements
fusion_enable     True or False         Enable/disable gate fusion
fusion_max_qubit  positive integer      The max number of qubits used for gate fusion
fusion_threshold  positive integer      Threshold to enable gate fusion
memory            True or False         If True is specified, classical register values are saved to the result
seed_simulator    integer               The seed for random number generator

cusvaer options

Option                             Value                                Description
cusvaer_comm_plugin_type           cusvaer.CommPluginType               Selecting CommPlugin for inter-process communication
cusvaer_comm_plugin_soname         string                               The shared library name for inter-process communication
cusvaer_global_index_bits          list of positive integers            Network structure
cusvaer_p2p_device_bits            non-negative integer                 Network structure
cusvaer_data_transfer_buffer_bits  positive integer                     The size used for a temporary buffer during transferring data
cusvaer_host_threads               positive integer                     The number of host threads used to process circuits on host
cusvaer_diagonal_fusion_max_qubit  integer greater than or equal to -1  The max number of qubits used for diagonal gate fusion
cusvaer_device_ids                 list of device ids as integers       The device ids of GPUs used for simulations
cusvaer_max_cpu_memory_mb          zero or positive integer             The max capacity of CPU memory in MiB for state vector allocation
cusvaer_max_gpu_memory_mb          positive integer                     The max GPU memory capacity in MiB for state vector allocation
cusvaer_migration_level            zero or positive integer             The position for the number of migration index bits inserted to cusvaer_global_index_bits

Device selection

cusvaer_device_ids

cusvaer_device_ids specifies the device IDs of the GPUs that will be used for simulations.

For simulations with a single process and a single GPU, this option should contain only one integer that represents a GPU. If None is specified, the current GPU will be used.

For simulations with a single process and multiple GPUs, this option should contain a list of integers that represents a series of device IDs. The number of device IDs should be a power of 2.

For simulations with multiple processes and a single GPU for each process, this option should contain a single integer that represents the device ID. If None is specified, the GPU is selected automatically according to the rank of the process. None is the default value and is recommended for multi-process simulations.
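
The power-of-2 requirement for the “single process with multiple GPUs” case can be checked before the option is passed to the simulator. The validation helper below is hypothetical and not part of cusvaer; it is only a sketch of a pre-flight check on the list of device IDs.

```python
def validate_device_ids(device_ids):
    """Check that the number of device IDs is a power of 2 (hypothetical helper)."""
    n = len(device_ids)
    if n == 0 or n & (n - 1):
        raise ValueError(f"number of device IDs must be a power of 2, got {n}")
    return list(device_ids)


# Options for a single-process simulation on 4 GPUs.
# In cuQuantum Appliance, these would be passed as simulator.set_options(**options).
options = {"cusvaer_device_ids": validate_device_ids([0, 1, 2, 3])}
print(options)
```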

Multi-process simulation

CommPlugin options

cusvaer dynamically links to a library that handles inter-process communication at runtime by using CommPlugin. CommPlugin is a small module that typically wraps MPI libraries. The CommPlugin is selected by cusvaer_comm_plugin_type and cusvaer_comm_plugin_soname options.

cusvaer_comm_plugin_type

The cusvaer_comm_plugin_type option selects the CommPlugin. The value type is cusvaer.CommPluginType and the default value is cusvaer.CommPluginType.MPI_AUTO.

# declared in cusvaer
from enum import IntEnum

class CommPluginType(IntEnum):
    SELF        = 0       # Single process
    EXTERNAL    = 1       # Use external plugin
    MPI_AUTO    = 0x100   # Automatically select Open MPI or MPICH
    MPI_OPENMPI = 0x101   # Use Open MPI
    MPI_MPICH   = 0x102   # Use MPICH

SELF

CommPlugin used for single process simulation. SELF CommPlugin does not have any external dependency.

MPI_AUTO, MPI_OPENMPI, MPI_MPICH

By specifying MPI_OPENMPI or MPI_MPICH, the specified library is dynamically loaded and used for inter-process data transfers. MPI_AUTO automatically selects one of them. With these options, cusvaer internally uses MPI_COMM_WORLD as the communicator.

EXTERNAL

Option to use an external custom plugin that wraps an inter-process communication library. This option allows users to use a library of their choice. Please refer to cuQuantum GitHub for details.

The value of CommPluginType should stay the same during the application lifetime if a value other than CommPluginType.SELF is specified. This reflects that a process can use only one MPI library during the application lifetime.

cusvaer_comm_plugin_soname

The cusvaer_comm_plugin_soname option specifies the name of a shared library. When MPI_AUTO, MPI_OPENMPI, or MPI_MPICH is specified for cusvaer_comm_plugin_type, the name of the corresponding MPI library in the search path should be specified. The default value is "libmpi.so".

If EXTERNAL is specified, cusvaer_comm_plugin_soname should contain the name of a custom CommPlugin.

Specifying device network structure

Two options, cusvaer_global_index_bits and cusvaer_p2p_device_bits, are available for specifying inter-node and intra-node network structures. By specifying these options, cusvaer schedules data transfers to utilize faster communication paths to accelerate simulations.

cusvaer_global_index_bits

cusvaer_global_index_bits is a list of positive integers that represents the inter-node network structure.

Assume that groups of 8 nodes in a cluster share a faster communication network and that a 32-node simulation is run. The value of cusvaer_global_index_bits is then [3, 2]. The first element, 3, is log2(8) and represents the 8 nodes with fast communication, which correspond to 3 qubits in the state vector. The second element, 2, means that there are 4 (= 2^2) 8-node groups in the 32 nodes. The sum of the cusvaer_global_index_bits elements is 5, which means the number of nodes is 32 = 2^5.

The last element in cusvaer_global_index_bits can be omitted.
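
The mapping from group sizes to cusvaer_global_index_bits in the example above is a log2 of each level, which can be sketched as follows. The function name and the group-size list are hypothetical and only illustrate the arithmetic; they are not part of cusvaer.

```python
def global_index_bits(group_sizes):
    """Convert power-of-2 group sizes, ordered from fastest to slowest network, to index bits."""
    bits = []
    for size in group_sizes:
        # Each level must be a power of 2 and contribute at least one bit.
        if size < 2 or size & (size - 1):
            raise ValueError(f"group size must be a power of 2 and at least 2, got {size}")
        bits.append(size.bit_length() - 1)  # log2(size)
    return bits


# 8 nodes share the fast network, and there are 4 such 8-node groups
bits = global_index_bits([8, 4])
n_nodes = 1 << sum(bits)
print(bits, n_nodes)
```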

cusvaer_p2p_device_bits

The cusvaer_p2p_device_bits option specifies the number of GPUs that can communicate by using GPUDirect P2P, expressed as log2 of the number of GPUs.

For 8 GPU node such as DGX A100, the number is log2(8) = 3.

The value of cusvaer_p2p_device_bits is typically the same as the first element of cusvaer_global_index_bits as the GPUDirect P2P network is typically the fastest in a cluster.

cusvaer_data_transfer_buffer_bits

cusvaer_data_transfer_buffer_bits specifies the size of the buffer used for inter-node data transfers. Each process allocates (1 << cusvaer_data_transfer_buffer_bits) bytes of device memory. The default value is 26 (64 MiB). The minimum allowed value is 24 (16 MiB).

Depending on the system, setting a larger value for cusvaer_data_transfer_buffer_bits can accelerate inter-node data transfers.
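
The relation between the option value and the buffer size can be checked with a short sketch. The function is hypothetical and only evaluates the expression (1 << cusvaer_data_transfer_buffer_bits) with the documented default and minimum.

```python
MIB = 1 << 20


def transfer_buffer_bytes(buffer_bits: int = 26) -> int:
    """Device memory in bytes allocated per process for the transfer buffer."""
    if buffer_bits < 24:
        raise ValueError("the minimum allowed value is 24 (16 MiB)")
    return 1 << buffer_bits


print(transfer_buffer_bytes() // MIB, "MiB")    # default value, 26
print(transfer_buffer_bytes(24) // MIB, "MiB")  # minimum allowed value
```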

CPU memory utilization

cusvaer_max_cpu_memory_mb

The cusvaer_max_cpu_memory_mb option specifies the max capacity of CPU memory used for state vector allocation. The unit is MiB. The option value can be None (default), 0, or a positive integer.

If None is specified, the max CPU memory capacity is calculated by using the physical memory size of CPU.

If 0 is specified, CPU memory utilization is disabled. The state vector is allocated only on GPUs.

If a positive integer is specified, the value is used as the max size of CPU memory for state vector allocation. The value should not exceed the amount of physical CPU memory.

cusvaer_max_gpu_memory_mb

The cusvaer_max_gpu_memory_mb option specifies the max capacity of GPU memory used for state vector allocation. The unit is MiB. The option value can be None or a positive integer. The default value is None.

If None is specified, the max GPU memory capacity is calculated by using the physical memory size of GPU.

If a positive integer is specified, the value is used as the max size of GPU memory for state vector allocation. The value should not exceed the amount of physical GPU memory.

cusvaer_migration_level

Extending the state vector with CPU memory works by inserting migration index bits, which represent the qubits allocated across both CPU and GPU. Each inserted migration index bit doubles the state vector size.

The cusvaer_migration_level option specifies the position in cusvaer_global_index_bits at which the number of migration index bits is inserted.

The numbers of index bits in cusvaer_global_index_bits are sorted from faster to slower network connections for the best performance. Migrating the state vector between CPU and GPU also transfers state vector elements between them, so simulation performance depends on the bandwidth of CPU-GPU data transfers. Choosing an appropriate position at which to insert the number of migration index bits into cusvaer_global_index_bits optimizes simulation performance.

The default value is None, which means that the number of migration index bits is inserted at the end of cusvaer_global_index_bits. The default setting assumes that the data transfer bandwidth between CPU and GPU is lower than that of inter-GPU connections such as NVLink, InfiniBand (IB), and so on.
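
One possible reading of this description is a plain list insertion, sketched below. This is only an interpretation for illustration, not cusvaer's actual implementation: the function is hypothetical, and treating the level as a zero-based list position is an assumption.

```python
def insert_migration_bits(global_index_bits, n_migration_bits, migration_level=None):
    """Insert the number of migration index bits into a copy of global_index_bits.

    Assumption: migration_level is a zero-based position in the list;
    None appends at the end, matching the documented default.
    """
    bits = list(global_index_bits)
    position = len(bits) if migration_level is None else migration_level
    bits.insert(position, n_migration_bits)
    return bits


# One migration index bit with global index bits [3, 2]
print(insert_migration_bits([3, 2], 1))     # default: CPU-GPU transfers treated as the slowest path
print(insert_migration_bits([3, 2], 1, 1))  # placed between the two network levels
```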

Other options

cusvaer_host_threads

cusvaer_host_threads specifies the number of CPU threads used for circuit processing. The default value is 8, which is generally a good value. The performance is not sensitive to this option.

cusvaer_diagonal_fusion_max_qubit

cusvaer_diagonal_fusion_max_qubit specifies the maximum number of qubits allowed per fused diagonal gate. The default value is -1, in which case the fusion size is automatically adjusted for better performance. If 0 is specified, gate fusion for diagonal gates is disabled.

Custom instruction

set_state_simple(state)

This instruction sets the state vector. The state argument is a numpy ndarray. If the dtype of the state argument differs from the precision of the state vector, the given ndarray is converted to the type of the state vector before being set.

When running a single process simulation, the array elements of the state argument are set to the state vector. The state argument has the same size as the state vector.

If a simulation is distributed to multiple processes, this instruction sets the sub state vector allocated in the device assigned to a process. The size of the state argument should be the same as that of the sub state vector.

Unlike the set_statevector() instruction provided by Qiskit Aer, set_state_simple() does not check the absolute value of the state argument.

save_state_simple()

This instruction saves the state vector allocated on the GPU assigned to a process. The saved state vector is returned as a numpy ndarray in a result object.

When running a single process simulation, the saved ndarray is a state vector. If a simulation is distributed to multiple processes, the saved ndarray is a sub state vector owned by a process.

Note

The above custom instructions are preliminary and are subject to change.

Example of cusvaer option configurations

import cusvaer

options = {
  'cusvaer_comm_plugin_type': cusvaer.CommPluginType.MPI_AUTO,  # automatically select Open MPI or MPICH
  'cusvaer_comm_plugin_soname': 'libmpi.so',  # MPI library name is libmpi.so
  'cusvaer_global_index_bits': [3, 2],  # 8 devices per node, 4 nodes
  'cusvaer_p2p_device_bits': 3,         # 8 GPUs in one node
  'precision': 'double'         # use complex128
}
simulator = cusvaer.backends.StatevectorSimulator()
simulator.set_options(**options)

Exception

cusvaer uses qiskit.providers.basicaer.BasicAerError class to raise exceptions.

Interoperability with mpi4py

mpi4py is a Python package that wraps MPI. By importing mpi4py before running simulations, cusvaer interoperates with mpi4py.

from qiskit import QuantumCircuit, transpile
from cusvaer.backends import StatevectorSimulator
# import mpi4py here to call MPI_Init()
from mpi4py import MPI

def create_ghz_circuit(n_qubits):
    ghz = QuantumCircuit(n_qubits)
    ghz.h(0)
    for qubit in range(n_qubits - 1):
        ghz.cx(qubit, qubit + 1)
    ghz.measure_all()
    return ghz

print(f"mpi4py: rank: {MPI.COMM_WORLD.Get_rank()}, size: {MPI.COMM_WORLD.Get_size()}")

n_qubits = 20  # example circuit size
circuit = create_ghz_circuit(n_qubits)

# Create StatevectorSimulator instead of using Aer.get_backend()
simulator = StatevectorSimulator()
circuit = transpile(circuit, simulator)
job = simulator.run(circuit)
result = job.result()

print(f"Result: rank: {result.mpi_rank}, size: {result.num_mpi_processes}")

cusvaer calls MPI_Init() right before executing the first simulation and calls MPI_Finalize() when Python calls the functions registered with atexit.

cusvaer checks whether MPI is initialized before calling MPI_Init(). If MPI_Init() has already been called before the first simulation, the MPI_Init() and MPI_Finalize() calls are skipped.

For mpi4py, MPI_Init() is called when mpi4py is loaded. Thus, mpi4py and cusvaer coexist when mpi4py is imported before simulator.run() is called.