The cuStateVec Ex API Technical Brief#

Requirements#

To use the cuStateVec Ex API, the following system and hardware requirements should be met:

Multi-device#

Same GPU generation: All specified GPUs must have the same major compute capability.
GPUDirect P2P: GPUDirect P2P communication must be available among all devices. GPUDirect P2P is enabled by inter-device connections via NVSwitch or NVLink. For PCIe-only systems, all GPUs need to be connected to the same PCIe root complex.

Multi-process#

For multi-process state vector simulations, the following requirements should be satisfied:

CUDA-aware MPI library: A CUDA-aware MPI or inter-process communication library is required. The cuStateVec library provides built-in communicator implementations for the following MPI libraries:
- Open MPI version 5.0.8 and ABI-compatible versions
- MPICH version 4.3.2 and ABI-compatible versions
For other MPI libraries or custom IPC implementations, users can build external communicators. See Communicator for details on custom communicator implementation.
Linux kernel: Linux kernel 5.6 or later is required to use GPUDirect P2P between processes.

cuStateVec Ex API Features#

Distributed state vector simulation#

The API set introduces its own state vector object, which is seamlessly distributed across multiple devices and multiple processes. The object manages memory allocation and memory layout. It also provides the fundamental data transfer capability between devices and processes, which allows arbitrary transposition of the tensor modes when the state vector is viewed as a tensor.
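The idea of splitting one state vector across devices or processes can be sketched in plain Python. The sketch below does not call cuStateVec. It assumes a common convention in which the high-order index bits select the sub state vector and the remaining bits give the offset inside it; the layout actually used by the library is not stated here, and the function names are hypothetical.

```python
def locate_amplitude(index, n_qubits, n_global):
    """Map a global amplitude index to (sub_vector_id, local_offset).

    Assumption: the top `n_global` index bits pick the sub state vector.
    """
    if not 0 <= n_global <= n_qubits:
        raise ValueError("n_global must be between 0 and n_qubits")
    if not 0 <= index < (1 << n_qubits):
        raise ValueError("index out of range")
    n_local = n_qubits - n_global
    return index >> n_local, index & ((1 << n_local) - 1)


def split_state_vector(amplitudes, n_global):
    """Split a full amplitude list into 2**n_global equal contiguous chunks."""
    n_sub = 1 << n_global
    chunk = len(amplitudes) // n_sub
    return [amplitudes[i * chunk:(i + 1) * chunk] for i in range(n_sub)]


# A 3-qubit state vector split over two hypothetical devices.
sub_vectors = split_state_vector(list(range(8)), n_global=1)
```

Under this convention, a gate acting only on local index bits touches a single chunk, while a gate on a global index bit needs data from more than one chunk. This is why data transfers between devices and processes are a fundamental capability of the state vector object.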

The following three configurations are supported in the current release:

Single device / single process
- One process owns a single device.
- Up to 35-qubit state vector simulations depending on the device memory capacity.
Multi device / single process
- One process owns multiple devices.
- Utilizes devices in a single hardware box connected via NVLink and/or PCIe.
- Up to 38-qubit state vector simulations depending on the number of devices and the device memory capacity.
Multi process
- Scales by running with multiple processes. One process owns a single device.
- Supports multi-node NVLink and MPI (or other IPC library).
- The maximum state vector size is determined by the number of nodes and GPUs.

Figure. Distribution models for state vectors: single device, multi device, and multi process.
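The dependence of the qubit limits on device memory follows from the size of the state vector: an n-qubit state vector holds 2^n complex amplitudes. The following back-of-the-envelope sketch assumes 16 bytes per amplitude for double-precision complex values (8 bytes for single precision), a power-of-two device count, and no memory reserved for workspace, so practical limits are lower. The 80 GiB figure is a hypothetical device size chosen only for illustration.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory needed for the amplitudes of an n-qubit state vector."""
    return (1 << n_qubits) * bytes_per_amplitude


def max_qubits(memory_bytes_per_device, n_devices=1, bytes_per_amplitude=16):
    """Largest qubit count whose amplitudes fit, ignoring any workspace.

    Assumption: the device count is a power of two and the state vector
    is split evenly across devices.
    """
    if n_devices < 1 or n_devices & (n_devices - 1):
        raise ValueError("n_devices must be a power of two")
    amplitudes_per_device = memory_bytes_per_device // bytes_per_amplitude
    if amplitudes_per_device < 1:
        raise ValueError("device memory is smaller than one amplitude")
    local_qubits = amplitudes_per_device.bit_length() - 1
    global_qubits = n_devices.bit_length() - 1
    return local_qubits + global_qubits


GIB = 1 << 30
single = max_qubits(80 * GIB)               # one hypothetical 80 GiB device
eight = max_qubits(80 * GIB, n_devices=8)   # eight such devices
```

Each doubling of the device count adds one qubit, and halving the amplitude size (single instead of double precision) adds one more.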

The state vector object supports these three distribution configurations while abstracting the distribution model. All differences between distribution models are absorbed when a state vector instance is created. Once created, the instance works with the other cuStateVec Ex APIs in the same way, whichever distribution model it uses.

For detailed information about configuration workflows, see State Vector. For detailed information about system and hardware requirements, see State Vector Distribution.

State vector operations#

cuStateVec Ex API provides standard operations for state vector simulations:

Gate application supporting dense, diagonal, and anti-diagonal matrices and Pauli rotations
Multiple single-qubit measurements in a single call with flexible collapse operations
Sampling with configurable output ordering
Probability calculations with masking support
Expectation value computation for Pauli strings

The APIs for those operations accept the state vector object as the first argument, regardless of the state vector distribution. Once the core simulation logic is developed using gate operations, measurements, and sampling functions, it works identically across all distribution models. Circuit simulations can start from a small-sized state vector on a single device and seamlessly scale to multi-device and multi-process configurations for large-scale simulations without modifying the core logic.
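To make the gate application and probability operations concrete, here is a minimal single-device sketch in plain Python. It does not use cuStateVec, and it assumes that qubit q corresponds to bit q of the amplitude index (least significant bit first). The function names are hypothetical.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a dense 2x2 gate (nested lists) to qubit `target`, in place."""
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue  # visit each amplitude pair once, from its bit=0 member
        a0, a1 = state[base], state[base | stride]
        state[base] = gate[0][0] * a0 + gate[0][1] * a1
        state[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return state


def probabilities(state):
    """Probability of each computational basis state."""
    return [abs(a) ** 2 for a in state]


def hadamard_all(state, n_qubits):
    """Core circuit logic: a Hadamard gate on every qubit."""
    h = 1.0 / math.sqrt(2.0)
    hadamard = [[h, h], [h, -h]]
    for qubit in range(n_qubits):
        apply_single_qubit_gate(state, hadamard, qubit)
    return state


state = hadamard_all([1.0, 0.0, 0.0, 0.0], n_qubits=2)
probs = probabilities(state)  # uniform distribution over four basis states
```

In this toy, `hadamard_all` plays the role of the core simulation logic: it only sees a state object and gate calls, so nothing in it would need to change if the storage behind `state` were distributed.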

Note

Operation APIs may internally change the wire ordering for better performance. The resulting wire ordering is implementation-defined. See State Vector Operations for wire ordering concepts and Wire ordering management for APIs to manipulate wire ordering.
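The following toy model illustrates why a changed wire ordering does not change the logical state. It is an assumption-laden sketch rather than the library's mechanism: `ordering[b]` is taken to be the wire stored at index bit `b`, and both function names are hypothetical.

```python
def swap_index_bits(state, ordering, bit_a, bit_b):
    """Swap two index bits; return the permuted state and updated ordering."""
    new_state = list(state)
    mask = (1 << bit_a) | (1 << bit_b)
    for i, amplitude in enumerate(state):
        differs = ((i >> bit_a) & 1) != ((i >> bit_b) & 1)
        j = i ^ mask if differs else i  # exchange the two bit values
        new_state[j] = amplitude
    new_ordering = list(ordering)
    new_ordering[bit_a], new_ordering[bit_b] = ordering[bit_b], ordering[bit_a]
    return new_state, new_ordering


def amplitude_of(state, ordering, wire_values):
    """Look up the amplitude for a logical assignment {wire: 0 or 1}."""
    index = 0
    for bit, wire in enumerate(ordering):
        index |= wire_values[wire] << bit
    return state[index]


original = [0, 1, 2, 3, 4, 5, 6, 7]   # labels standing in for amplitudes
ordering = [0, 1, 2]                   # wire w stored at index bit w
moved, moved_ordering = swap_index_bits(original, ordering, 0, 2)
```

After the swap, the amplitudes sit at different physical indices, but a lookup through the updated ordering returns the same value for every logical assignment. Code that reads raw amplitudes therefore needs to consult the current wire ordering first.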

Simulator workflow#

A simulator workflow using cuStateVec Ex consists of two main stages:

Stage 1: State Vector Configuration and Creation

Configure the state vector distribution model (single-device, multi-device, or multi-process) and create the state vector instance. All the differences in state vector distribution are absorbed in this stage.

Stage 2: Run simulation

Once created, the state vector instance can be used with various APIs to perform quantum operations:

Apply gates using custatevecExApplyMatrix() or State Vector Updater
Perform measurements with custatevecExMeasure()
Calculate expectation values using custatevecExComputeExpectationOnPauliBasis()
Generate samples with custatevecExSample()

The state vector object abstracts the distribution model, so the same code works identically across all configurations.

Figure. Simulator workflow: (1) Configure and create the state vector instance (set the distribution model and allocate resources); (2) Apply operations (use the state vector with gate application, measurement, sampling, and other APIs).
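As an illustration of one Stage 2 operation, the expectation value of a Pauli string can be computed directly from the amplitudes. The sketch below is plain Python and does not call custatevecExComputeExpectationOnPauliBasis(), whose signature is not reproduced here. It assumes that character q of the string acts on qubit q and that qubit q is bit q of the amplitude index.

```python
import math


def pauli_expectation(state, pauli_string):
    """Return <state| P |state> for a Pauli string such as "XZI"."""
    total = 0j
    for i, amplitude in enumerate(state):
        if amplitude == 0:
            continue
        j = i            # basis state that P maps |i> onto
        phase = 1 + 0j   # phase picked up by that mapping
        for qubit, pauli in enumerate(pauli_string):
            bit = (i >> qubit) & 1
            if pauli == "X":
                j ^= 1 << qubit
            elif pauli == "Y":
                j ^= 1 << qubit
                phase *= -1j if bit else 1j
            elif pauli == "Z":
                if bit:
                    phase = -phase
            elif pauli != "I":
                raise ValueError(f"unknown Pauli operator: {pauli!r}")
        total += complex(state[j]).conjugate() * phase * amplitude
    return total.real


h = 1.0 / math.sqrt(2.0)
bell = [h, 0.0, 0.0, h]  # (|00> + |11>) / sqrt(2)
zz = pauli_expectation(bell, "ZZ")
yy = pauli_expectation(bell, "YY")
```

For the Bell state used here, the ZZ and XX expectation values are +1 and the YY expectation value is -1, which is a convenient sanity check for the phase handling.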

State Vector Updater#

The State Vector Updater (SVUpdater) encapsulates a simulation pipeline for accelerated state vector updates. This component implements a queue-and-execute framework: operators in a circuit are queued into the SVUpdater, then the queued operators are applied to the state vector with the following optimizations.

Gate Fusion: Fuses gate matrices of queued gates, then applies the fused result to the specified state vector in a single scan. This reduces the number of state vector updates and improves performance.
Noise Channels: Supports both mixed unitary channels and general quantum channels using Kraus operators. Noise channels are applied stochastically using user-provided random numbers. When a channel applies a matrix, that matrix is fused through the same path as gate matrices.
Operator Scheduling: For distributed state vectors, the SVUpdater automatically schedules operator applications to minimize data transfer overhead between devices and processes, maintaining high simulation throughput.
Reusable Queue: The queued operators can be applied to multiple state vector instances or with different random number sequences, enabling efficient noise simulation studies.

Figure. SVUpdater workflow showing the two-step process: (1) Enqueue operators (gate matrices, mixed unitary channels, and general channels are queued into the SVUpdater); (2) Apply to state vector (the queued operations are applied to update the state vector with optimized fusion and execution).
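The benefit of gate fusion can be seen in a deliberately simplified queue. This sketch is not the SVUpdater algorithm: it only fuses consecutive single-qubit gates that act on the same qubit, and the names `matmul2` and `fuse_queue` are hypothetical.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def fuse_queue(queue):
    """Fuse adjacent gates on the same target qubit.

    `queue` is a list of (target, gate) pairs in application order.
    """
    fused = []
    for target, gate in queue:
        if fused and fused[-1][0] == target:
            earlier = fused[-1][1]
            # The later gate multiplies from the left: (G2 G1)|psi>.
            fused[-1] = (target, matmul2(gate, earlier))
        else:
            fused.append((target, gate))
    return fused


X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
queue = [(0, X), (0, Z), (1, X)]   # three queued gates
fused = fuse_queue(queue)          # two state vector scans instead of three
```

Each entry left in `fused` corresponds to one scan over the state vector, so fusing the two gates on qubit 0 saves one full pass. Because the queue itself is left unmodified, it can be fused and applied again to another state vector.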

For detailed information about the State Vector Updater, including API details, noise channel integration, and configuration options, see State Vector Updater.
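Stochastic application of a mixed unitary channel with a caller-supplied random number can be sketched as follows. This is an illustrative assumption about how one random number might select one unitary (by walking the cumulative probabilities); the source does not describe how the SVUpdater consumes random numbers, and all names are hypothetical.

```python
def select_unitary(probabilities, random_number):
    """Pick a unitary index from cumulative probabilities, r in [0, 1)."""
    if not 0.0 <= random_number < 1.0:
        raise ValueError("random_number must be in [0, 1)")
    cumulative = 0.0
    for index, probability in enumerate(probabilities):
        cumulative += probability
        if random_number < cumulative:
            return index
    return len(probabilities) - 1  # guard against rounding in the sum


def apply_single_qubit_gate(state, gate, target):
    """Return a new state with a 2x2 gate applied to qubit `target`."""
    new_state = list(state)
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue
        a0, a1 = state[base], state[base | stride]
        new_state[base] = gate[0][0] * a0 + gate[0][1] * a1
        new_state[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state


def apply_mixed_unitary(state, target, unitaries, probabilities, random_number):
    """Apply one unitary of the channel, chosen by the supplied number."""
    chosen = select_unitary(probabilities, random_number)
    return apply_single_qubit_gate(state, unitaries[chosen], target)


IDENTITY = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
bit_flip = ([IDENTITY, X], [0.9, 0.1])  # bit-flip channel with p = 0.1

# Same channel, two different user-provided random numbers.
kept = apply_mixed_unitary([1, 0], 0, *bit_flip, random_number=0.50)
flipped = apply_mixed_unitary([1, 0], 0, *bit_flip, random_number=0.95)
```

Supplying the random numbers from outside makes each noisy trajectory reproducible, and the same channel definition can be replayed with a different random sequence for each trajectory.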

Compute resource management#

cuStateVec Ex automatically manages the additional compute resources required to execute its APIs. It handles workspace allocation and other resource management internally, which simplifies simulator development.

For multi-process state vectors, the transfer workspace is allocated using the specified memory sharing method to facilitate efficient inter-process data transfers during global operations.