State Vector Migration API#

About this document#

The cuStateVec library provides the custatevecSubSVMigrator API to enable users the ability to leverage Host CPU memory in conjunction with device GPU memory to increase the scale of their simulations. This document outlines the possible scenarios for leveraging this API.

custatevecSubSVMigrator API#

The custatevecSubSVMigrator API is a utility to migrate state vectors allocated on CPU (host) and additionally on GPU (device). This API allows utilizing CPU memory to accommodate state vector. One can also utilize both CPU and GPU memory to allocate a single state vector to maximize the number of qubits to be simulated.

Memory model of custatevecSubSVMigrator API#

The custatevecSubSVMigrator API assumes the memory model shown in Figure 1. When utilizing host memory in addition to device memory, a state vector is divided into sub state vectors, in the same manner as distribution across multiple devices. The number of sub state vectors is always a power of two. One sub state vector is placed on device (staged), and the remaining sub state vectors are placed on host (unstaged).

Each host (unstaged) sub state vector is further divided into sub state vector slices. A sub state vector slice is the unit of data migration between host and device. The number of slices per sub state vector is always a power of two. The staged sub state vector on device is stored as a single contiguous memory block, but can be viewed as consisting of slices.

A sub state vector slice represents a partial state. Each slice has the following index bits:

Global index bits: Index bits that identify which sub state vector the slice belongs to. These correspond to the sub state vector index.
Migration index bits: A subset of global index bits that correspond to sub state vectors migrating between host and device. For single-device state vectors with host memory, migration index bits are identical to global index bits.
Slice index bits: Index bits that identify the position (ordinal) of a slice within a sub state vector. The number of slice index bits equals log2(numSlicesPerSubSV).
Slice local index bits: Index bits within the slice that address individual elements. The number of slice local index bits determines the size of each slice, which is \(2^{nSliceLocalIndexBits}\).

Each element in the state vector is addressed by the combination of global (migration) index bits, slice index bits, and slice local index bits.

There is one requirement, that host slices should be directly accessible from the device. This means cudaHostAlloc() would be used to allocate CUDA pinned memory on x86 platform without HMM. For other systems such as GH200, memory chunks allocated by using malloc() are accessible from the device, thus, allocating CUDA pinned memory is not mandatory. Each host slice can have its own memory chunk or can be allocated as a single contiguous memory chunk. It’s a developer’s choice.

The device sub state vector is allocated as a single contiguous device memory block. Although it is stored contiguously, it can be viewed as consisting of device slices. Operations such as gate applications are applied to the device sub state vector directly. The number of device slices is always a power of two.

In the cuStateVec library, this model is expressed by custatevecSubSVMigratorDescriptor_t which is created by custatevecSubSVMigratorCreate() and destroyed by custatevecSubSVMigratorDestroy().

Index bit swap is the algorithm to localize index bits in cuStateVec as described in Qubit reordering and distributed index bit swap for distributed index bit swap API documentation. When migrating sub state vector slices, index bits are swapped by using custatevecSubSVMigrator API and custatevecSwapIndexBits() as described in later sections in this document.

../../_images/migrator_memory_model.png — Figure 1. Memory model of SubStateVectorMigrator API#

Possible scenarios#

There are two scenarios wherein one might allocate a state vector by using host memory.

Allocate state vector on host

If the amount of host memory is large enough to hold the entire state vector, one is able to allocate the state vector on host slices and use device slices to apply operations. During simulations, sub state vector slices are copied to device slices (checkout), and after operations applied, sub state vector slices are copied back to the host slices (check-in). This migration step is iterated for operations to be applied for all sub state vector slices.

Allocate state vector on both host and device

In order to utilize host and device memory as much as possible (in order to allocate the largest state vector), one is able to use both host and device memory to allocate the state vector. Operations are applied on device, and sub state vector slices are swapped between host and device to apply operations for all sub state vector slices.

1. Allocate state vector on host#

Figure 2 shows a simplified example of a sub state vector allocated on host. Four sub state vector slices are placed on host, and one device sub state vector is allocated and kept empty. There are a migration bit denoted as p and a slice index bit denoted as q.

../../_images/host_state_vector.png — Figure 2. State vector allocated on host#

When using NVIDIA H100 (80G) and allocating memory to device as shown in Figure 2, the size of the device sub state vector is 64 GB. The device sub state vector is viewed as having two slices, and the size of each slice is 32 GB. The host state vector size is twice as large, thus, the size of host state vector is 128 GB. The max state vector size with NVIDIA H100 (80G) is 33 qubits (c64) and 32 qubits (c128), respectively. By using 128 GB of host memory, the maximum state vector size increases by 1, reaching to 34 (complex64) and 33 (complex128) qubits, respectively.

With the SubStateVectorMigrator API, sub state vector slices migrate by using the following primitives. By combining these primitives, migration bits and local index bits are appropriately reordered.

Checkout a sub state vector slice in a host to a slice in device sub state vector.

Copy host sub state vector slice on host to a slice in device sub state vector. This operation is executed by passing a host slice pointer to the srcSubSVSlice argument of custatevecSubSVMigratorMigrate().

Check-in a slice in device sub state vector to a host sub state vector slice.

Copy back a device slice to a host slice. This operation is executed by passing a host slice pointer to the dstSubSVSlice argument of custatevecSubSVMigratorMigrate().

Swap index bits in device sub state vector

Move slice local index bits to slice index bit positions. This operation is executed by using the custatevecSwapIndexBits() API.

Slice index bits are moved as shown in Figure 3. Figure 3 (a-1) shows the first migration to localize a slice index bit, q. The 0th and 1st sub state vector slices are copied from host to device slices (checkout), and q moves to a device slice index bit. Then, gate applications and other operations are applied to the device sub state vector. At this point, q is localized as a slice index bit, and operations can act on q along with the slice local index bits. After operations complete, device slices are copied back to update the host slices (Figure 3 (a-2)), which updates the first half of the state vector (check-in). The same sequence of steps is executed for the second half of the state vector (Figure 3 (a-3, 4)).

In order to move a slice index bit, p, to the device sub state vector, the 0th and 2nd sub state vector slices are copied from host to device, which moves p to a device slice index bit (Figure 3 (b-1)). Similar steps shown in Figure 3 (b-2) - (b-4) are applied to complete operations.

../../_images/host_state_vector_migration.png — Figure 3. Host state vector migration to localize migration or slice index bit#

There is an optimization during state vector migration to overlap checkout and check-in sub state vector slices to utilize bidirectional transfer (on PCIe on x86 systems and on NVLink-C2C on GH200) between host and device. The left part of Figure 4 is a cut-out of Figure 3 (a-2) and (a-3). These two steps are fused to a single step as shown in the right part of Figure 4.

../../_images/host_state_vector_overlapped_checkout_and_check_in.png — Figure 4. Overlapped check-in and checkout#

In order to swap slice local index bits and slice index bits, index bit swap is applied during migration steps as shown in Figure 5. The left part of the figure shows the initial state vector allocation where p and q are slice index bits, and r denotes the LSB of slice local index bits being swapped with a slice index bit, q.

The first checkout step is identical to the migration step shown in Figure 3. Then, q and r are swapped to move q to be a slice local index bit and r to be a slice index bit. This swap is operated by a cuStateVec API, custatevecSwapIndexBits(). After swapping q and r, device slices are checked in to the host slices. By applying the same for the rest of host sub state vector slices, a slice index bit, q and a slice local index bit, r, are swapped.

../../_images/host_state_vector_swap_index_bits.png — Figure 5. Index bit swap of slice index bit and slice local index bit#

2. Allocate state vector on host and device#

The second scenario is to utilize host and device memory and allocate the state vector on them to maximize the size of the state vector. An example allocation is shown in Figure 6, which is the simplest case: one sub state vector on device and one on host as sub state vector slices. This example is used to describe the migration algorithm for simplicity. By increasing the number of sub state vectors on host, one is able to allocate larger state vectors.

Ex. When NVIDIA H100 (80G) is employed to allocate the device sub state vector according to Figure 6, the max size of the device sub state vector is 64 GB. The allocation size of state vector slices on host is identical, thus, the size of state vector is 128 GB (64 GB on host + 64 GB on device). By using 448 GB host memory, the state vector size grows to 512 GB.

../../_images/host_device_state_vector.png — Figure 6. State vector allocated on host and device#

For the state vector allocated on host and device, the primitive of state vector migration is swap, which is considered as an overlapped check-in and checkout to the same host state vector slice. custatevecSubSVMigrator swaps sub state vector slices by passing the same host sub state vector slice pointer to srcSubSVSlice and dstSubSVSlice arguments of custatevecSubSVMigratorMigrate().

State vector migration is executed as illustrated in Figure 7. The first step shown in Figure 7 (a) is the initial state where the index bit, q is localized on device. In Figure 7 (b), operations are applied for the sub state vector on device that contains the index bit q and slice local index bits. Then, sub state vector slices are swapped between host and device, and operations are applied for the second half of the state vector (Figure 7 (c), (d)).

The next migration sequence is aiming at localizing the index bit, p, and applying operations. The first migration is to swap the 0th device sub state vector slice and the 1st host sub state vector slice (Figure 7 (e)). Then, the operations are applied (Figure 7 (f)) for the first half of the state vector. The next migration is to swap sub state vector slices between host and device, and operations are applied for the second half of the state vector.

../../_images/host_device_state_vector_swaps.png — Figure 7. State vector migration of host-device state vector#

In order to swap slice index bits and slice local index bits, the cuStateVec API, custatevecSwapIndexBits() is applied in the same way as used for host state vector. Figure 8 (a) shows the same placement of host and device sub state vector slices where p and q are slice index bits and r is the LSB of slice local index bits. After applying operations (Figure 8 (b)), q and r are swapped (Figure 8 (c)). Then, sub state vector slices are swapped between host and device (Figure 8 (d)). Executing the same steps for the remaining sub state vector slices, q and r are swapped (Figure 8 (e)).

../../_images/host_device_state_vector_swap_index_bits.png — Figure 8. Swapping slice index bits and slice local index bits in host-device state vector#

For the cuStateVec Ex API’s host memory support, see Migration.