Overview#

This section describes the basic working principles of the cuStateVec library and cuStateVec Ex API:

cuStateVec API provides key operations for state vector quantum simulators accelerated on NVIDIA GPUs.
cuStateVec Ex API is a new set of APIs built on top of the cuStateVec API that provides scalable capabilities from single-GPU to multiple-node systems.

For a general introduction to quantum circuits, please refer to Introduction to quantum computing.

State Vector Representation and Core Concepts#

Description of state vectors#

In the cuStateVec library, the state vector is always given as a device array and its data type is specified by a cudaDataType_t constant. It’s the user’s responsibility to manage memory for the state vector.

This version of cuStateVec library supports 128-bit complex (complex128) and 64-bit complex (complex64) as datatypes of the state vector. The size of a state vector is represented by the nIndexBits argument which corresponds to the number of qubits in a circuit. Therefore, the state vector size is expressed as \(2^{\text{nIndexBits}}\).

The type custatevecIndex_t is provided to express the state vector index, which is a typedef of the 64-bit signed integer. It is also used to express the number of state vector elements.
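The size relationship above can be sketched in plain host code. This is an illustrative sketch only: `Index` is a local stand-in for custatevecIndex_t (described above as a 64-bit signed integer), and `stateVectorElements` and `stateVectorBytes` are hypothetical helper names, not part of the cuStateVec API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Local stand-in for custatevecIndex_t (a 64-bit signed integer per the text).
using Index = std::int64_t;

// Number of state vector elements for nIndexBits qubits: 2^nIndexBits.
// Hypothetical helper, not a cuStateVec function.
Index stateVectorElements(int nIndexBits) {
    // Keep the shift within the range of a signed 64-bit integer.
    assert(nIndexBits >= 0 && nIndexBits < 63);
    return Index{1} << nIndexBits;
}

// Bytes of device memory the user must allocate for the state vector.
// bytesPerElement is 16 for complex128 and 8 for complex64.
Index stateVectorBytes(int nIndexBits, Index bytesPerElement) {
    const Index nElements = stateVectorElements(nIndexBits);
    // Guard against overflow of the product.
    assert(bytesPerElement > 0 &&
           nElements <= std::numeric_limits<Index>::max() / bytesPerElement);
    return nElements * bytesPerElement;
}
```

For example, a 30-qubit complex128 state vector has 2^30 elements of 16 bytes each, so it needs 16 GiB of device memory; the complex64 equivalent needs 8 GiB.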

Bit ordering#

In the cuStateVec library, the bit ordering of the state vector index is defined in little endian order. The 0-th index bit is the least significant bit (LSB). Most functions accept arguments to specify bit positions as integer arrays. Those bit positions are specified in little endian order. Values in bit positions are in the range \([0, \text{nIndexBits})\).

The cuStateVec library represents bit strings in either of the following two ways:

One 32-bit signed integer array for one bit string:
Some APIs use a pair of 32-bit signed integer arrays bitString and bitOrdering arguments to specify one bit string. The bitString argument specifies bit string values as an array of 0s and 1s. The bitOrdering argument specifies the bit positions of the bitString array elements in little endian order. Both arrays are allocated on host memory.
In the following example, “10” is specified as a bit string. The bit string values are mapped to index bits 1 and 2 (the 2nd and 3rd bits counting from the LSB) and can be used to specify a bit mask, \(***\cdots *10*\).
```
int32_t bitString[]   = {0, 1};  // bit values, listed LSB first
int32_t bitOrdering[] = {1, 2};  // index bit positions of those values
```
One 64-bit signed integer array for multiple bit strings:
Some APIs introduce a pair of bitStrings and bitOrdering arguments to represent each bit string using custatevecIndex_t, which is a 64-bit signed integer, to handle multiple bit strings with small memory footprint. The bitOrdering argument is a 32-bit signed integer array and it specifies the bit positions of each bit string in the bitStrings argument in little endian order.
The following example describes the same bit string, as was used in the previous example:
```
custatevecIndex_t bitStrings[] = {0b10};  // one packed bit string per element
int32_t bitOrdering[] = {1, 2};           // index bit positions, LSB first
```
bitStrings are allocated on host memory but some APIs allow bitStrings to be allocated on device memory as well. For the detailed requirements, please refer to each API description.
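To make the two representations concrete, the following host-only sketch converts between them and tests whether a state vector index matches the resulting bit mask. It only illustrates the little endian convention described above; `Index`, `BitMask`, `packBitString`, `unpackBitString`, and `matches` are hypothetical names introduced here, not cuStateVec APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::int64_t;  // local stand-in for custatevecIndex_t

struct BitMask {
    Index mask;   // 1 at every index bit position listed in bitOrdering
    Index value;  // required bit values at those positions
};

// Place bitString[k] at index bit position bitOrdering[k].
BitMask packBitString(const std::vector<std::int32_t>& bitString,
                      const std::vector<std::int32_t>& bitOrdering) {
    assert(bitString.size() == bitOrdering.size());
    BitMask out{0, 0};
    for (std::size_t k = 0; k < bitString.size(); ++k) {
        assert(bitString[k] == 0 || bitString[k] == 1);
        assert(bitOrdering[k] >= 0 && bitOrdering[k] < 63);
        out.mask |= Index{1} << bitOrdering[k];
        out.value |= static_cast<Index>(bitString[k]) << bitOrdering[k];
    }
    return out;
}

// Expand the compact form: bit k of `packed` becomes element k of the array.
std::vector<std::int32_t> unpackBitString(Index packed, std::size_t nBits) {
    std::vector<std::int32_t> bits(nBits);
    for (std::size_t k = 0; k < nBits; ++k) {
        bits[k] = static_cast<std::int32_t>((packed >> k) & 1);
    }
    return bits;
}

// True when svIndex carries the required values at the masked positions.
bool matches(Index svIndex, const BitMask& m) {
    return (svIndex & m.mask) == m.value;
}
```

With bitString = {0, 1} and bitOrdering = {1, 2}, the mask is 0b110 and the required value is 0b100, so indices 0b100 and 0b101 match while 0b110 does not. Unpacking 0b10 with two bits yields {0, 1}, the array form of the same bit string.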

Supported data types#

By default, computation is executed in the precision that corresponds to the state vector data type: double precision (FP64) for complex128 and single precision (FP32) for complex64.

The cuStateVec library also provides the compute type, allowing computation with reduced precision. Some cuStateVec functions accept the compute type specified by using custatevecComputeType_t.

Below is the table of combinations of state vector and compute types available in the current version of the cuStateVec library.

State vector / cudaDataType_t	Matrix / cudaDataType_t	Compute / custatevecComputeType_t
Complex 128 / CUDA_C_64F	Complex 128 / CUDA_C_64F	FP64 / CUSTATEVEC_COMPUTE_64F
Complex 64 / CUDA_C_32F	Complex 128 / CUDA_C_64F	FP32 / CUSTATEVEC_COMPUTE_32F
Complex 64 / CUDA_C_32F	Complex 64 / CUDA_C_32F	FP32 / CUSTATEVEC_COMPUTE_32F

Note

CUSTATEVEC_COMPUTE_TF32 is not available in this version.

Math mode#

For B200 GPU (compute capability 10.0), floating point math emulation has been introduced to the cuStateVec library. For cuStateVec, this feature is disabled by default, and enabled by calling the custatevecSetMathMode() API with CUSTATEVEC_MATH_MODE_ALLOW_FP32_EMULATED_BF16X9. For cuStateVec Ex, the default math mode is CUSTATEVEC_MATH_MODE_ALLOW_FP32_EMULATED_BF16X9. To disable the use of BF16x9 floating point emulation, call custatevecExStateVectorSetMathMode() with CUSTATEVEC_MATH_MODE_DISALLOW_FP32_EMULATED_BF16X9.

Note

In the 25.03 release, performance improvement with CUSTATEVEC_MATH_MODE_ALLOW_FP32_EMULATED_BF16X9 is expected only on devices of compute capability 10.0.

The math mode CUSTATEVEC_MATH_MODE_ALLOW_FP32_EMULATED_BF16X9 allows FP32 emulation kernels using BFloat16 (BF16) whenever possible. For the detailed usage of each math mode API, please refer to custatevecSetMathMode() and custatevecGetMathMode() for cuStateVec, or custatevecExStateVectorSetMathMode() for cuStateVec Ex.

Features Common to cuStateVec and cuStateVec Ex API#

API synchronization behavior#

The cuStateVec APIs are designed for asynchronous execution. Their API synchronization behavior follows the description in API synchronization behavior in CUDA Runtime API. Developers are required to appropriately call CUDA APIs to synchronize API calls.

Using CUDA stream#

The execution of most cuStateVec APIs is serialized on the stream attached to the cuStateVec handle created by custatevecCreate(). The initial stream is the default stream. Users can attach a user-created stream to the cuStateVec handle by calling custatevecSetStream(). All types of streams (default, blocking and non-blocking) are acceptable. API calls are synchronized by appropriate CUDA API calls such as cudaDeviceSynchronize(), cudaStreamSynchronize() or cudaStreamWaitEvent().

cuStateVec API Features#

Using CUDA stream with distributed index bit swap API#

There is one exception in CUDA stream usage, which applies to the distributed index bit swap API. The custatevecSVSwapWorkerCreate() API requires a user-created stream that is used only for data transfers. As a result, other cuStateVec API calls, which run on the stream attached to the handle, can execute concurrently with these data transfers. Also, custatevecSVSwapWorkerExecute() blocks on the stream specified in the call to custatevecSVSwapWorkerCreate() in order to synchronize with the peer of the data transfers.

Workspace#

cuStateVec APIs require users to explicitly manage workspace memory for operations. Users need to query the workspace size using APIs such as custatevecApplyMatrixGetWorkspaceSize() and allocate the required memory before calling the corresponding operation APIs.

The cuStateVec library internally manages temporary device memory for executing functions, which is referred to as context workspace.

The context workspace is attached to the cuStateVec context and allocated when a cuStateVec context is created by calling custatevecCreate(). The default size of the context workspace is chosen to cover most typical use cases, obtained by calling custatevecGetDefaultWorkspaceSize().

When the context workspace cannot provide enough temporary memory, or when a device memory chunk is shared by two or more functions, users have two options:

Users can provide user-managed device memory for the extra workspace. Functions that need the extra workspace have sibling functions suffixed by GetWorkspaceSize(). If such a function returns a nonzero value via the extraBufferSizeInBytes argument, users must allocate device memory of at least that size and pass the pointer to the corresponding function. The extra workspace should be 256-byte aligned, which is automatically satisfied when the device memory is allocated with cudaMalloc(). If the extra workspace is too small, CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE is returned.
Users can also set a device memory handler. When a handler is set on the cuStateVec library context, the library can draw memory directly from the pool on the user’s behalf. In this case, users are not required to allocate device memory explicitly, and a null pointer (zero size) can be specified as the extra workspace (size) in the function. Please refer to custatevecDeviceMemHandler_t and custatevecSetDeviceMemHandler() for details.
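When one device allocation is shared by several functions, each extra workspace carved out of it still needs to start on a 256-byte boundary. The following host-only sketch shows one way to check and enforce that alignment. The names `kWorkspaceAlignment`, `isWorkspaceAligned`, and `alignUp` are hypothetical helpers for illustration, not cuStateVec APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Alignment required for the extra workspace, per the text above.
constexpr std::size_t kWorkspaceAlignment = 256;

// True when ptr starts on a 256-byte boundary.
bool isWorkspaceAligned(const void* ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % kWorkspaceAlignment == 0;
}

// Round a byte offset up to the next multiple of 256, for example when
// placing a second workspace after the first one inside a shared buffer.
std::size_t alignUp(std::size_t offset) {
    // Guard against overflow of the rounding arithmetic.
    assert(offset <= std::numeric_limits<std::size_t>::max() -
                         (kWorkspaceAlignment - 1));
    return (offset + kWorkspaceAlignment - 1) / kWorkspaceAlignment *
           kWorkspaceAlignment;
}
```

If the first function reports that it needs 1000 bytes, for instance, the second workspace would start at offset alignUp(1000) = 1024 from a base pointer returned by cudaMalloc().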

Batched state vectors simulation#

cuStateVec provides gate application and qubit measurement APIs for a group of state vectors. When computing with many small state vectors, replacing API calls for each state vector with a single batched API call is expected to lead to improved performance.

These batched-version APIs assume that state vectors are allocated as one contiguous block of device memory and require three parameters to specify their locations:

nSVs, the number of state vectors in the batch.

nIndexBits, the number of qubits in each state vector. All the state vectors in a batch need to have the same number of qubits.

svStride, the offset in number of elements between two state vectors. It should be equal to or larger than each state vector size, 1 << nIndexBits.

For instance, the following figure describes a group of state vectors with nSVs = 3, nIndexBits = 2, and svStride = 8. Here, the elements of the 2-D array in the figure are ordered in column-major format.
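The addressing implied by these three parameters can be sketched as follows. This is a host-only illustration, assuming state vector s begins at element offset s * svStride within the contiguous block; `Index` and `batchedElementOffset` are hypothetical names, not cuStateVec APIs.

```cpp
#include <cassert>
#include <cstdint>

using Index = std::int64_t;  // local stand-in for custatevecIndex_t

// Offset, in elements, of amplitude elemIndex of state vector svIndex
// within the contiguous batch allocation.
Index batchedElementOffset(Index svIndex, Index elemIndex,
                           int nIndexBits, Index svStride) {
    assert(nIndexBits >= 0 && nIndexBits < 63);
    const Index svSize = Index{1} << nIndexBits;
    assert(svStride >= svSize);  // stride must cover one whole state vector
    assert(svIndex >= 0);
    assert(elemIndex >= 0 && elemIndex < svSize);
    return svIndex * svStride + elemIndex;
}
```

With nIndexBits = 2 and svStride = 8, each state vector holds 4 elements followed by 4 unused elements, so the three state vectors start at element offsets 0, 8, and 16, and the last amplitude of the third state vector sits at offset 19.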

For the details of each API, please refer to custatevecApplyMatrixBatched(), custatevecComputeExpectationBatched(), custatevecAbs2SumArrayBatched(), custatevecCollapseByBitStringBatched(), and custatevecMeasureBatched().

Multi-GPU APIs#

cuStateVec provides APIs for multi-GPU qubit measurement, sampling, and qubit reordering. The measurement and sampling APIs work on a single GPU, and users are required to gather/scatter the results of each GPU. For details of each API, please refer to Qubit measurement, Sampling, and Single-process qubit reordering, respectively.

Note

In cuStateVec, each GPU requires its own library handle. Also, the users are responsible for switching the CUDA device context.

cuStateVec Ex API Features#

The cuStateVec Ex API is a natural extension of the existing cuStateVec API. Building upon the foundational cuStateVec library, cuStateVec Ex APIs provide enhanced capabilities for quantum circuit simulations with advanced performance and flexibility.

The cuStateVec Ex API targets distributed state vector simulations and acceleration by introducing a simulator pipeline. It adds two key features: (1) Distributed state vectors, which distribute simulations across multiple devices and multiple processes, enabling seamless scaling from a small state vector on a single device to large cluster-scale runs. (2) A state vector updater, which implements a simulator pipeline to apply gates and noise channels efficiently.

For detailed information about cuStateVec Ex features and capabilities, please refer to The cuStateVec Ex API Technical Brief.

In-Depth Guides#

The following pages provide detailed information on specific APIs and features:

State Vector Algorithms

cuStateVec API#

cuStateVec Ex API#

References#

For a technical introduction to cuStateVec, please refer to the NVIDIA blog:

Accelerating Quantum Circuit Simulation with NVIDIA cuStateVec

Citing cuQuantum#

1. Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.