cuStateVec Functions

Library Management

Handle Management API

custatevecCreate

custatevecStatus_t custatevecCreate(custatevecHandle_t *handle)

This function initializes the cuStateVec library and creates a handle on the cuStateVec context. It must be called prior to any other cuStateVec API functions.

Parameters

handle[out] the pointer to the handle to the cuStateVec context


custatevecDestroy

custatevecStatus_t custatevecDestroy(custatevecHandle_t handle)

This function releases resources used by the cuStateVec library.

Parameters

handle[in] the handle to the cuStateVec context


custatevecGetDefaultWorkspaceSize

custatevecStatus_t custatevecGetDefaultWorkspaceSize(custatevecHandle_t handle, size_t *workspaceSizeInBytes)

This function returns the default workspace size defined by the cuStateVec library.


Parameters
  • handle[in] the handle to the cuStateVec context

  • workspaceSizeInBytes[out] default workspace size


custatevecSetWorkspace

custatevecStatus_t custatevecSetWorkspace(custatevecHandle_t handle, void *workspace, size_t workspaceSizeInBytes)

This function sets the workspace used by the cuStateVec library.

This function sets the workspace attached to the handle. The required size of the workspace is obtained by custatevecGetDefaultWorkspaceSize().

Setting a larger workspace allows some functions to run without requiring additional workspace to be passed to them.

If a device memory handler is set, the workspace pointer can be null; the workspace is then allocated from the user-defined memory pool.

Parameters
  • handle[in] the handle to the cuStateVec context

  • workspace[in] device pointer to workspace

  • workspaceSizeInBytes[in] workspace size

CUDA Stream Management API

custatevecSetStream

custatevecStatus_t custatevecSetStream(custatevecHandle_t handle, cudaStream_t streamId)

This function sets the stream used by the cuStateVec library to execute its routines.

Parameters
  • handle[in] the handle to the cuStateVec context

  • streamId[in] the stream to be used by the library


custatevecGetStream

custatevecStatus_t custatevecGetStream(custatevecHandle_t handle, cudaStream_t *streamId)

This function gets the stream used by the cuStateVec library to execute all of its routines.

Parameters
  • handle[in] the handle to the cuStateVec context

  • streamId[out] the stream used by the library

Error Management API

custatevecGetErrorName

const char *custatevecGetErrorName(custatevecStatus_t status)

This function returns the name string for the input error code. If the error code is not recognized, “unrecognized error code” is returned.

Parameters

status[in] Error code to convert to string


custatevecGetErrorString

const char *custatevecGetErrorString(custatevecStatus_t status)

This function returns the description string for an error code. If the error code is not recognized, “unrecognized error code” is returned.

Parameters

status[in] Error code to convert to string

Logger API

custatevecLoggerSetCallback

custatevecStatus_t custatevecLoggerSetCallback(custatevecLoggerCallback_t callback)

Experimental: This function sets the logging callback function.

Parameters

callback[in] Pointer to a callback function. See custatevecLoggerCallback_t.


custatevecLoggerSetCallbackData

custatevecStatus_t custatevecLoggerSetCallbackData(custatevecLoggerCallbackData_t callback, void *userData)

Experimental: This function sets the logging callback function with user data.

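Parameters
  • callback[in] Pointer to a callback function. See custatevecLoggerCallbackData_t.

  • userData[in] Pointer to user-provided data, passed to the callback.
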
custatevecLoggerSetFile

custatevecStatus_t custatevecLoggerSetFile(FILE *file)

Experimental: This function sets the logging output file.

Note

Once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Parameters

file[in] Pointer to an open file. File should have write permission.


custatevecLoggerOpenFile

custatevecStatus_t custatevecLoggerOpenFile(const char *logFile)

Experimental: This function opens a logging output file at the given path.

Parameters

logFile[in] Path of the logging output file.


custatevecLoggerSetLevel

custatevecStatus_t custatevecLoggerSetLevel(int32_t level)

Experimental: This function sets the value of the logging level.

Levels are defined as follows:

Level   Summary              Long description
0       Off                  Logging is disabled (default).
1       Errors               Only errors are logged.
2       Performance Trace    API calls that launch CUDA kernels log their parameters and important information.
3       Performance Hints    Hints that can potentially improve the application's performance.
4       Heuristics Trace     General information about the library execution; may contain details about heuristic status.
5       API Trace            API calls log their parameters and important information.

Parameters

level[in] Value of the logging level.


custatevecLoggerSetMask

custatevecStatus_t custatevecLoggerSetMask(int32_t mask)

Experimental: This function sets the value of the logging mask. The mask is a bitwise OR combination of the following values:

Mask    Description
0       Off
1       Errors
2       Performance Trace
4       Performance Hints
8       Heuristics Trace
16      API Trace

Refer to custatevecLoggerCallback_t for the details.

Parameters

mask[in] Value of the logging mask.
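Because each mask value above is a distinct power of two, masks combine with bitwise OR. The minimal sketch below names those values with hypothetical constants (kLogErrors and the others are not part of the cuStateVec headers; only the numbers come from the table) and checks which categories a combined mask enables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the mask values in the table; only the numbers
// come from the documentation.
constexpr int32_t kLogOff              = 0;
constexpr int32_t kLogErrors           = 1;
constexpr int32_t kLogPerformanceTrace = 2;
constexpr int32_t kLogPerformanceHints = 4;
constexpr int32_t kLogHeuristicsTrace  = 8;
constexpr int32_t kLogApiTrace         = 16;

// True if every bit of category is set in mask (Off is never "enabled").
constexpr bool isEnabled(int32_t mask, int32_t category) {
    return category != kLogOff && (mask & category) == category;
}

// Example: log errors and performance hints only.
constexpr int32_t kErrorsAndHints = kLogErrors | kLogPerformanceHints;
static_assert(kErrorsAndHints == 5, "1 | 4 == 5");
```

The combined integer, for example kLogErrors | kLogPerformanceHints, is what would be passed to custatevecLoggerSetMask().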


custatevecLoggerForceDisable

custatevecStatus_t custatevecLoggerForceDisable()

Experimental: This function disables logging for the entire run.

Versioning API

custatevecGetProperty

custatevecStatus_t custatevecGetProperty(libraryPropertyType type, int32_t *value)

This function returns the version information of the cuStateVec library.

Parameters
  • type[in] requested property (MAJOR_VERSION, MINOR_VERSION, or PATCH_LEVEL).

  • value[out] value of the requested property.


custatevecGetVersion

size_t custatevecGetVersion()

This function returns the version information of the cuStateVec library.

Memory Management API

A stream-ordered memory allocator (or mempool for short) allocates and deallocates memory asynchronously from/to a mempool in a stream-ordered fashion, meaning that memory operations and computations enqueued on the streams have well-defined inter- and intra-stream dependencies. Several well-implemented stream-ordered mempools are available, such as cudaMemPool_t, which has been built in at the CUDA driver level since CUDA 11.2 (so that all CUDA applications in the same process can easily share the same pool), and the RAPIDS Memory Manager (RMM). For a detailed introduction, see the NVIDIA Developer Blog.

The new device memory handler APIs allow users to bind a stream-ordered mempool to the library handle, such that cuStateVec can take care of most of the memory management for users. Below is an illustration of what can be done:

MyMemPool pool = MyMemPool();  // kept alive for the entire process in real apps

int my_alloc(void* ctx, void** ptr, size_t size, cudaStream_t stream) {
  return reinterpret_cast<MyMemPool*>(ctx)->alloc(ptr, size, stream);
}

int my_dealloc(void* ctx, void* ptr, size_t size, cudaStream_t stream) {
  return reinterpret_cast<MyMemPool*>(ctx)->dealloc(ptr, size, stream);
}

// create a mem handler and fill in the required members for the library to use
custatevecDeviceMemHandler_t handler;
handler.ctx = reinterpret_cast<void*>(&pool);
handler.device_alloc = my_alloc;
handler.device_free = my_dealloc;
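std::strncpy(handler.name, "my pool", CUSTATEVEC_ALLOCATOR_NAME_LEN - 1);  // <cstring>
handler.name[CUSTATEVEC_ALLOCATOR_NAME_LEN - 1] = '\0';  // strncpy does not guarantee termination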

// bind the handler to the library handle
custatevecSetDeviceMemHandler(handle, &handler);

/* ... use gate application as usual ... */

// User doesn’t compute the required sizes

// User doesn’t query the workspace size (but one can if desired)

// User doesn’t allocate memory!

// User passes a null pointer to indicate the library should draw memory from the user's pool.
void* extraWorkspace = nullptr;
size_t extraWorkspaceSizeInBytes = 0;
custatevecApplyMatrix(
    handle, sv, svDataType, nIndexBits, matrix, matrixDataType, layout,
    adjoint, targets, nTargets, controls, controlBitValues, nControls,
    computeType, extraWorkspace, extraWorkspaceSizeInBytes);

// User doesn’t deallocate memory!

As shown above, several calls to the workspace-related APIs can be skipped. Moreover, allowing the library to share your memory pool not only can alleviate potential memory conflicts but can also enable further optimizations.

In the current release, only a device mempool can be bound.
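The example above relies on ctx to route plain C function pointers back to a C++ pool object. The host-only mock below sketches that forwarding pattern so it can be compiled and tested without a GPU. MockPool, MockMemHandler, and FakeStream are hypothetical stand-ins for a user pool, custatevecDeviceMemHandler_t, and cudaStream_t; host memory from std::malloc replaces device memory, stream ordering is ignored, and the convention that a callback returns 0 on success is assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins (not cuStateVec types): FakeStream replaces cudaStream_t,
// and MockMemHandler mirrors the ctx/device_alloc/device_free/name members used above.
using FakeStream = void*;

struct MockMemHandler {
    void* ctx;
    int (*device_alloc)(void* ctx, void** ptr, std::size_t size, FakeStream stream);
    int (*device_free)(void* ctx, void* ptr, std::size_t size, FakeStream stream);
    char name[64];
};

// Toy pool backed by host memory; it only tracks the number of bytes in use.
struct MockPool {
    std::size_t bytesInUse = 0;
    int alloc(void** ptr, std::size_t size, FakeStream /*stream*/) {
        *ptr = std::malloc(size);
        if (*ptr == nullptr) return 1;  // nonzero: allocation failed (assumed convention)
        bytesInUse += size;
        return 0;
    }
    int dealloc(void* ptr, std::size_t size, FakeStream /*stream*/) {
        std::free(ptr);
        bytesInUse -= size;
        return 0;
    }
};

// C-style trampolines: recover the pool object from ctx and forward the call.
int mockAlloc(void* ctx, void** ptr, std::size_t size, FakeStream stream) {
    return static_cast<MockPool*>(ctx)->alloc(ptr, size, stream);
}
int mockFree(void* ctx, void* ptr, std::size_t size, FakeStream stream) {
    return static_cast<MockPool*>(ctx)->dealloc(ptr, size, stream);
}

MockMemHandler makeHandler(MockPool* pool) {
    MockMemHandler h{};  // value-initialized, so name starts zero-filled
    h.ctx = pool;
    h.device_alloc = mockAlloc;
    h.device_free = mockFree;
    std::strncpy(h.name, "mock pool", sizeof(h.name) - 1);  // last byte stays '\0'
    return h;
}
```

The pool object must outlive every handle that holds a pointer to it, which matches the lifetime requirement stated for real memory handlers below.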

custatevecSetDeviceMemHandler

custatevecStatus_t custatevecSetDeviceMemHandler(custatevecHandle_t handle, const custatevecDeviceMemHandler_t *handler)

Set the current device memory handler.

Once set, when cuStateVec needs device memory in various API calls it will allocate from the user-provided memory pool and deallocate at completion. See custatevecDeviceMemHandler_t and APIs that require extra workspace for further detail.

The internal stream order is established using the user-provided stream set via custatevecSetStream().

If the handler argument is set to nullptr, the library handle detaches its existing memory handler.

Warning

The behavior is undefined in the following scenarios:

  • the library handle is bound to a memory handler and subsequently to another handler

  • the library handle outlives the attached memory pool

  • the memory pool is not stream-ordered

Parameters
  • handle[in] Opaque handle holding cuStateVec’s library context.

  • handler[in] the device memory handler that encapsulates the user’s mempool. The struct content is copied internally.


custatevecGetDeviceMemHandler

custatevecStatus_t custatevecGetDeviceMemHandler(custatevecHandle_t handle, custatevecDeviceMemHandler_t *handler)

Get the current device memory handler.

Parameters
  • handle[in] Opaque handle holding cuStateVec’s library context.

  • handler[out] If previously set, the struct pointed to by handler is filled in, otherwise CUSTATEVEC_STATUS_NO_DEVICE_ALLOCATOR is returned.

Gate Application

General Matrices

The cuStateVec API custatevecApplyMatrix() applies a matrix representing a gate to a state vector. The API may require external workspace for large matrices; custatevecApplyMatrixGetWorkspaceSize() returns the required workspace size. If a device memory handler is set, the call to custatevecApplyMatrixGetWorkspaceSize() can be skipped.

Use case

// check the size of external workspace
size_t extraWorkspaceSizeInBytes = 0;
custatevecApplyMatrixGetWorkspaceSize(
    handle, svDataType, nIndexBits, matrix, matrixDataType, layout, adjoint, nTargets,
    nControls, computeType, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// apply gate
custatevecApplyMatrix(
    handle, sv, svDataType, nIndexBits, matrix, matrixDataType, layout,
    adjoint, targets, nTargets, controls, controlBitValues, nControls,
    computeType, extraWorkspace, extraWorkspaceSizeInBytes);

// release external workspace (cudaFree(nullptr) is a no-op)
cudaFree(extraWorkspace);

API reference

custatevecApplyMatrixGetWorkspaceSize
custatevecStatus_t custatevecApplyMatrixGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const uint32_t nTargets, const uint32_t nControls, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

This function gets the required workspace size for custatevecApplyMatrix().

This function returns the required extra workspace size to execute custatevecApplyMatrix(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required for a given set of arguments.

Parameters
  • handle[in] the handle to the cuStateVec context

  • svDataType[in] Data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • adjoint[in] apply adjoint of matrix

  • nTargets[in] the number of target bits

  • nControls[in] the number of control bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecApplyMatrix
custatevecStatus_t custatevecApplyMatrix(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Apply gate matrix.

Apply gate matrix to a state vector. The state vector size is \(2^\text{nIndexBits}\).

The matrix argument is a host or device pointer to a 2-dimensional array that holds a square matrix. The size of matrix is \(2^\text{nTargets} \times 2^\text{nTargets}\), and the value type is specified by the matrixDataType argument. The layout argument specifies the matrix layout, which can be either row-major or column-major. The targets and controls arguments specify the target and control bit positions in the state vector index.

The controlBitValues argument specifies bit values of control bits. The ordering of controlBitValues is specified by the controls argument. If a null pointer is specified to this argument, all control bit values are set to 1.

By definition, the bit positions in the targets and controls arguments must not overlap.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecApplyMatrixGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • matrix[in] host or device pointer to a square matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • adjoint[in] apply adjoint of matrix

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size
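To make the roles of targets, controls, controlBitValues, and adjoint concrete, here is a host-side reference sketch of what applying a gate matrix does to the state vector. It is not the library's implementation. It assumes a row-major layout and assumes that targets[k] maps to bit k of the matrix row and column index (targets[0] being the least significant bit); an empty controlBitValues vector stands in for the null pointer that selects all-1 control values.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// Host reference sketch (assumed conventions: row-major matrix, targets[k] <-> bit k
// of the matrix index). Not the cuStateVec implementation.
void applyMatrixReference(std::vector<cplx>& sv, int nIndexBits,
                          const std::vector<cplx>& matrix, bool adjoint,
                          const std::vector<int>& targets,
                          const std::vector<int>& controls,
                          const std::vector<int>& controlBitValues) {
    const int64_t dim = int64_t{1} << targets.size();
    assert(sv.size() == (size_t{1} << nIndexBits));
    assert(matrix.size() == static_cast<size_t>(dim * dim));

    int64_t targetMask = 0;
    for (int t : targets) targetMask |= int64_t{1} << t;

    std::vector<cplx> in(dim), out(dim);
    for (int64_t base = 0; base < static_cast<int64_t>(sv.size()); ++base) {
        if (base & targetMask) continue;  // visit each group once, via its all-zero target pattern
        // Skip groups whose control bits do not match (empty values => all 1).
        bool active = true;
        for (size_t c = 0; c < controls.size(); ++c) {
            int want = controlBitValues.empty() ? 1 : controlBitValues[c];
            if (((base >> controls[c]) & 1) != want) active = false;
        }
        if (!active) continue;
        // Map matrix index m to the state vector index inside this group.
        auto index = [&](int64_t m) {
            int64_t i = base;
            for (size_t k = 0; k < targets.size(); ++k)
                if ((m >> k) & 1) i |= int64_t{1} << targets[k];
            return i;
        };
        for (int64_t m = 0; m < dim; ++m) in[m] = sv[index(m)];
        for (int64_t r = 0; r < dim; ++r) {
            out[r] = 0.0;
            for (int64_t c = 0; c < dim; ++c) {
                cplx a = adjoint ? std::conj(matrix[c * dim + r]) : matrix[r * dim + c];
                out[r] += a * in[c];
            }
        }
        for (int64_t m = 0; m < dim; ++m) sv[index(m)] = out[m];
    }
}
```

For example, applying the X matrix {0, 1, 1, 0} with targets = {1} and controls = {0} acts as a CNOT and maps the basis state with index 1 to index 3. A reference like this can help check results on small state vectors once its conventions have been confirmed against the library.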

Pauli Matrices

The exponential of a tensor product of Pauli matrices can be expressed as follows:

\[e^{i \theta \left( P_{target[0]} \otimes P_{target[1]} \otimes \cdots \otimes P_{target[nTargets-1]} \right)}.\]

Each matrix \(P_{target[i]}\) is one of the Pauli matrices \(I\), \(X\), \(Y\), and \(Z\), which correspond to the custatevecPauli_t enums CUSTATEVEC_PAULI_I, CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y, and CUSTATEVEC_PAULI_Z, respectively. Also refer to custatevecPauli_t for details.

Use case

// apply exponential
custatevecApplyPauliRotation(
    handle, sv, svDataType, nIndexBits, theta, paulis, targets, nTargets,
    controls, controlBitValues, nControls);

API reference

custatevecApplyPauliRotation
custatevecStatus_t custatevecApplyPauliRotation(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double theta, const custatevecPauli_t *paulis, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls)

Apply the exponential of a multi-qubit Pauli operator.

Apply the exponential of a tensor product of Pauli bases specified by paulis, \( e^{i \theta P} \), where \(P\) is the product of Pauli bases. The paulis, targets, and nTargets arguments specify the Pauli bases and their bit positions in the state vector index.

At least one target and a corresponding Pauli basis should be specified.

The controls and nControls arguments specify the control bit positions in the state vector index.

The controlBitValues argument specifies bit values of control bits. The ordering of controlBitValues is specified by the controls argument. If a null pointer is specified to this argument, all control bit values are set to 1.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of bits in the state vector index

  • theta[in] the angle \(\theta\) in \(e^{i \theta P}\)

  • paulis[in] host pointer to custatevecPauli_t array

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits
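Because every Pauli string satisfies \(P^2 = I\), the exponential reduces to \(e^{i\theta P} = \cos\theta \, I + i \sin\theta \, P\). The host-side sketch below applies that identity directly. It is illustrative only, omits controls for brevity, and uses a hypothetical Pauli enum in place of custatevecPauli_t.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

enum class Pauli { I, X, Y, Z };  // hypothetical stand-in for custatevecPauli_t

// Host reference sketch of exp(i*theta*P) for a Pauli string P, using
// exp(i*theta*P) = cos(theta)*I + i*sin(theta)*P (valid because P*P = I).
std::vector<cplx> pauliRotationReference(const std::vector<cplx>& sv, double theta,
                                         const std::vector<Pauli>& paulis,
                                         const std::vector<int>& targets) {
    assert(!targets.empty() && paulis.size() == targets.size());
    const cplx i1(0.0, 1.0);
    std::vector<cplx> out(sv.size());
    for (size_t x = 0; x < sv.size(); ++x) out[x] = std::cos(theta) * sv[x];
    for (size_t x = 0; x < sv.size(); ++x) {
        // P|x> = phase * |y>: X and Y flip the target bit, Y and Z contribute phases.
        size_t y = x;
        cplx phase = 1.0;
        for (size_t k = 0; k < targets.size(); ++k) {
            const bool bit = (x >> targets[k]) & 1;
            switch (paulis[k]) {
            case Pauli::I: break;
            case Pauli::X: y ^= size_t{1} << targets[k]; break;
            case Pauli::Y: y ^= size_t{1} << targets[k]; phase *= bit ? -i1 : i1; break;
            case Pauli::Z: if (bit) phase = -phase; break;
            }
        }
        out[y] += i1 * std::sin(theta) * phase * sv[x];
    }
    return out;
}
```

With theta = π/2 and a single X on bit 0, the state |0⟩ maps to i|1⟩, since \(e^{i\pi X/2} = iX\).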

Generalized Permutation Matrices

A generalized permutation matrix can be expressed as the multiplication of a permutation matrix \(P\) and a diagonal matrix \(D\). For instance, we can decompose a 4 \(\times\) 4 generalized permutation matrix \(A\) as follows:

\[\begin{split}A = \left[ \begin{array}{cccc} 0 & 0 & a_0 & 0 \\ a_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & a_2 \\ 0 & a_3 & 0 & 0 \end{array}\right] = DP\end{split}\]

where

\[\begin{split}D = \left[ \begin{array}{cccc} a_0 & 0 & 0 & 0 \\ 0 & a_1 & 0 & 0 \\ 0 & 0 & a_2 & 0 \\ 0 & 0 & 0 & a_3 \end{array}\right], P = \left[ \begin{array}{cccc} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{array}\right].\end{split}\]

When \(P\) is the identity matrix, the generalized permutation matrix is diagonal. Similarly, when \(D\) is the identity matrix, the generalized permutation matrix becomes a permutation matrix.

The cuStateVec API custatevecApplyGeneralizedPermutationMatrix() applies a generalized permutation matrix like \(A\) to a state vector. The API may require extra workspace for large matrices, whose size can be queried using custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(). If a device memory handler is set, custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize() can be skipped.

Use case

// check the size of external workspace
size_t extraWorkspaceSizeInBytes = 0;
custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(
    handle, CUDA_C_64F, nIndexBits, permutation, diagonals, CUDA_C_64F, targets,
    nTargets, nControls, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// apply a generalized permutation matrix
custatevecApplyGeneralizedPermutationMatrix(
    handle, d_sv, CUDA_C_64F, nIndexBits, permutation, diagonals, CUDA_C_64F,
    adjoint, targets, nTargets, controls, controlBitValues, nControls,
    extraWorkspace, extraWorkspaceSizeInBytes);

// release external workspace (cudaFree(nullptr) is a no-op)
cudaFree(extraWorkspace);

The operation is equivalent to the following:

// sv, sv_temp: the state vector and temporary buffer.

int64_t sv_size = int64_t{1} << nIndexBits;
for (int64_t sv_idx = 0; sv_idx < sv_size; sv_idx++) {
    // The basis of sv_idx is converted to permutation basis to obtain perm_idx
    auto perm_idx = convertToPermutationBasis(sv_idx);
    // apply generalized permutation matrix
    if (adjoint == 0)
        sv_temp[sv_idx] = sv[permutation[perm_idx]] * diagonals[perm_idx];
    else
        sv_temp[permutation[perm_idx]] = sv[sv_idx] * conj(diagonals[perm_idx]);
}

for (int64_t sv_idx = 0; sv_idx < sv_size; sv_idx++)
    sv[sv_idx] = sv_temp[sv_idx];
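The loop above can be made concrete for the simplest case, which is an assumption of this sketch: the targets are all nIndexBits bits in ascending order and there are no controls, so perm_idx equals sv_idx. With the 4 × 4 example at the start of this section, row i of \(A\) has its nonzero entry in column permutation[i], which gives permutation = {2, 0, 3, 1}. Empty vectors play the role of null pointers (identity \(P\) or identity \(D\)), and int64_t is assumed as the element type of custatevecIndex_t.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// Host sketch of the equivalent loop, assuming targets = {0, ..., nIndexBits-1}
// and no controls (so perm_idx == sv_idx). Empty vectors act like null pointers.
std::vector<cplx> applyGeneralizedPermutationReference(
        const std::vector<cplx>& sv, const std::vector<int64_t>& permutation,
        const std::vector<cplx>& diagonals, bool adjoint) {
    assert(!(permutation.empty() && diagonals.empty()));
    auto perm = [&](size_t i) {
        return permutation.empty() ? i : static_cast<size_t>(permutation[i]);
    };
    auto diag = [&](size_t i) { return diagonals.empty() ? cplx(1.0) : diagonals[i]; };
    std::vector<cplx> out(sv.size());
    for (size_t i = 0; i < sv.size(); ++i) {
        if (!adjoint)
            out[i] = sv[perm(i)] * diag(i);              // (D P sv)[i] = d_i * sv[perm[i]]
        else
            out[perm(i)] = sv[i] * std::conj(diag(i));   // (P^T D^* sv)[perm[i]] = conj(d_i) * sv[i]
    }
    return out;
}
```

For instance, with diagonals = {10, 20, 30, 40} and sv = {1, 2, 3, 4}, the non-adjoint call returns {30, 20, 120, 80}.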

API reference

custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize
custatevecStatus_t custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const custatevecIndex_t *permutation, const void *diagonals, cudaDataType_t diagonalsDataType, const int32_t *targets, const uint32_t nTargets, const uint32_t nControls, size_t *extraWorkspaceSizeInBytes)

Get the extra workspace size required by custatevecApplyGeneralizedPermutationMatrix().

This function gets the extra workspace size required to execute custatevecApplyGeneralizedPermutationMatrix(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required for a given set of arguments.

Parameters
  • handle[in] the handle to the cuStateVec library

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • permutation[in] host or device pointer to a permutation table

  • diagonals[in] host or device pointer to diagonal elements

  • diagonalsDataType[in] data type of diagonals

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • nControls[in] the number of control bits

  • extraWorkspaceSizeInBytes[out] extra workspace size


custatevecApplyGeneralizedPermutationMatrix
custatevecStatus_t custatevecApplyGeneralizedPermutationMatrix(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecIndex_t *permutation, const void *diagonals, cudaDataType_t diagonalsDataType, const int32_t adjoint, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Apply generalized permutation matrix.

This function applies a generalized permutation matrix to a state vector.

The generalized permutation matrix, \(A\), is expressed as \(A = DP\), where \(D\) and \(P\) are diagonal and permutation matrices, respectively.

The permutation matrix, \(P\), is specified as a permutation table which is an array of custatevecIndex_t and passed to the permutation argument.

The diagonal matrix, \(D\), is specified as an array of diagonal elements. The length of both arrays is \( 2^\text{nTargets} \). The diagonalsDataType argument specifies the type of diagonal elements.

The following combinations of the svDataType and diagonalsDataType arguments are available in this version:

svDataType    diagonalsDataType
CUDA_C_64F    CUDA_C_64F
CUDA_C_32F    CUDA_C_64F
CUDA_C_32F    CUDA_C_32F

This function can also be used to only apply either the diagonal or the permutation matrix. By passing a null pointer to the permutation argument, \(P\) is treated as an identity matrix, thus, only the diagonal matrix \(D\) is applied. Likewise, if a null pointer is passed to the diagonals argument, \(D\) is treated as an identity matrix, and only the permutation matrix \(P\) is applied.

The permutation argument must hold integers in [0, \( 2^{nTargets} \)), and each integer must appear exactly once; otherwise, the behavior of this function is undefined.

The permutation and diagonals arguments must not both be null. If both are null, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets or nIndexBits. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize().

A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

In this version, custatevecApplyGeneralizedPermutationMatrix() does not return an error if an invalid permutation argument is specified.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • permutation[in] host or device pointer to a permutation table

  • diagonals[in] host or device pointer to diagonal elements

  • diagonalsDataType[in] data type of diagonals

  • adjoint[in] apply adjoint of generalized permutation matrix

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size
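Since this function does not validate the permutation argument in this version, a caller may want to check the table on the host before calling it. The helper below is a hypothetical sketch (not a cuStateVec API) that tests the stated requirement that every integer in [0, 2^nTargets) appears exactly once. It assumes custatevecIndex_t is a 64-bit signed integer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side check: true if permutation contains every integer in
// [0, 2^nTargets) exactly once, which is the documented requirement.
bool isValidPermutationTable(const std::vector<int64_t>& permutation, uint32_t nTargets) {
    const int64_t dim = int64_t{1} << nTargets;
    if (static_cast<int64_t>(permutation.size()) != dim) return false;
    std::vector<bool> seen(dim, false);
    for (int64_t p : permutation) {
        if (p < 0 || p >= dim || seen[p]) return false;  // out of range or duplicate
        seen[p] = true;
    }
    return true;
}
```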

Measurement

Measurement on Z-bases

Let us consider the measurement of an \(nIndexBits\)-qubit state vector \(sv\) on an \(nBasisBits\)-bit Z product basis \(basisBits\).

The sums of squared absolute values of state vector elements on the Z product basis, \(abs2sum0\) and \(abs2sum1\), are given by:

\[\begin{split}abs2sum0 &= \Bra{sv} \left\{ \dfrac{1}{2} \left( 1 + Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv}, \\ abs2sum1 &= \Bra{sv} \left\{ \dfrac{1}{2} \left( 1 - Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv}.\end{split}\]

The probabilities of obtaining parity 0 and 1 are therefore:

\[\begin{split}Pr(parity = 0) &= \dfrac{abs2sum0}{abs2sum0 + abs2sum1}, \\ Pr(parity = 1) &= \dfrac{abs2sum1}{abs2sum0 + abs2sum1}.\end{split}\]

Depending on the measurement result, the state vector is collapsed. If parity is equal to 0, we obtain the following vector:

\[\ket{sv} = \dfrac{1}{\sqrt{norm}} \left\{ \dfrac{1}{2} \left( 1 + Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv},\]

and if parity is equal to 1, we obtain the following vector:

\[\ket{sv} = \dfrac{1}{\sqrt{norm}} \left\{ \dfrac{1}{2} \left( 1 - Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv},\]

where \(norm\) is the normalization factor.
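The formulas above can be checked with a small host-side reference. The sketch below is illustrative rather than the library's implementation: it computes abs2sum0 and abs2sum1 from the parity of the bits listed in basisBits, selects parity 0 when randnum * (abs2sum0 + abs2sum1) < abs2sum0, and, if collapse is requested, zeroes the elements of the other parity and rescales the rest by 1/sqrt(norm).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// Parity of the bits of idx selected by basisBits.
int zParity(size_t idx, const std::vector<int>& basisBits) {
    int p = 0;
    for (int b : basisBits) p ^= static_cast<int>((idx >> b) & 1);
    return p;
}

// Host reference sketch of a measurement on a Z product basis.
int measureOnZBasisReference(std::vector<cplx>& sv, const std::vector<int>& basisBits,
                             double randnum, bool collapse) {
    assert(!basisBits.empty() && randnum >= 0.0 && randnum < 1.0);
    double abs2sum0 = 0.0, abs2sum1 = 0.0;
    for (size_t i = 0; i < sv.size(); ++i)
        (zParity(i, basisBits) == 0 ? abs2sum0 : abs2sum1) += std::norm(sv[i]);  // |sv[i]|^2
    const int parity = (randnum * (abs2sum0 + abs2sum1) < abs2sum0) ? 0 : 1;
    if (collapse) {
        const double norm = (parity == 0) ? abs2sum0 : abs2sum1;
        for (size_t i = 0; i < sv.size(); ++i)
            sv[i] = (zParity(i, basisBits) == parity) ? sv[i] / std::sqrt(norm) : cplx(0.0);
    }
    return parity;
}
```

For the Bell state \((\ket{00} + \ket{11})/\sqrt{2}\), measuring basisBits = {0} gives parity 0 or 1 with probability 1/2 each, while basisBits = {0, 1} always gives parity 0.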

Use case

A measurement can be performed with custatevecMeasureOnZBasis() as follows:

// measure on a Z basis
custatevecMeasureOnZBasis(
    handle, sv, svDataType, nIndexBits, &parity, basisBits, nBasisBits,
    randnum, collapse);

The operation is equivalent to the following:

// compute the sums of squared absolute values of state vector elements
// on a Z product basis
double abs2sum0, abs2sum1;
custatevecAbs2SumOnZBasis(
    handle, sv, svDataType, nIndexBits, &abs2sum0, &abs2sum1, basisBits,
    nBasisBits);

// [User] compute parity and norm
double abs2sum = abs2sum0 + abs2sum1;
int parity = (randnum * abs2sum < abs2sum0) ? 0 : 1;
double norm = (parity == 0) ? abs2sum0 : abs2sum1;

// collapse if necessary
switch (collapse) {
case CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO:
    custatevecCollapseOnZBasis(
        handle, sv, svDataType, nIndexBits, parity, basisBits, nBasisBits,
        norm);
    break;  /* collapse */
case CUSTATEVEC_COLLAPSE_NONE:
    break;  /* Do nothing */
}

API reference

custatevecAbs2SumOnZBasis
custatevecStatus_t custatevecAbs2SumOnZBasis(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double *abs2sum0, double *abs2sum1, const int32_t *basisBits, const uint32_t nBasisBits)

Calculates the sum of squared absolute values on a given Z product basis.

This function calculates the sums of squared absolute values on a given Z product basis. If a null pointer is specified for abs2sum0 or abs2sum1, the corresponding sum is not calculated. Since abs2sum0 + abs2sum1 equals the squared norm of the state vector, the probability of parity == 0 can be calculated as abs2sum0 / (abs2sum0 + abs2sum1).

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • abs2sum0[out] pointer to a host or device variable to store the sum of squared absolute values for parity == 0

  • abs2sum1[out] pointer to a host or device variable to store the sum of squared absolute values for parity == 1

  • basisBits[in] pointer to a host array of Z-basis index bits

  • nBasisBits[in] the number of basisBits


custatevecCollapseOnZBasis
custatevecStatus_t custatevecCollapseOnZBasis(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const int32_t parity, const int32_t *basisBits, const uint32_t nBasisBits, double norm)

Collapse state vector on a given Z product basis.

This function collapses the state vector on a given Z product basis. The state vector elements that match the parity argument are scaled by \(1/\sqrt{norm}\), where \(norm\) is the value of the norm argument. Other elements are set to zero.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • parity[in] parity, 0 or 1

  • basisBits[in] pointer to a host array of Z-basis index bits

  • nBasisBits[in] the number of Z basis bits

  • norm[in] normalization factor


custatevecMeasureOnZBasis
custatevecStatus_t custatevecMeasureOnZBasis(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, int32_t *parity, const int32_t *basisBits, const uint32_t nBasisBits, const double randnum, enum custatevecCollapseOp_t collapse)

Measurement on a given Z-product basis.

This function performs a measurement on a given Z product basis. The measurement result is the parity of the specified Z product basis.

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measurement result without collapsing the state vector. If CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses the state vector as custatevecCollapseOnZBasis() does.

If a random number is not in [0, 1), this function returns CUSTATEVEC_STATUS_INVALID_VALUE. At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • parity[out] parity, 0 or 1

  • basisBits[in] pointer to a host array of Z basis bits

  • nBasisBits[in] the number of Z basis bits

  • randnum[in] random number, [0, 1).

  • collapse[in] Collapse operation

Batched Single Qubit Measurement

Assume that we measure the \(bitOrderingLen\) qubits whose bit positions are listed in \(bitOrdering\) of an \(nIndexBits\)-qubit state vector \(sv\).

The sums of squared absolute values of state vector elements are obtained by the following:

\[abs2sum[idx] = \braket{sv|i}\braket{i|sv},\]

where \(idx = b_{bitOrderingLen-1}\cdots b_1 b_0\), \(i = b_{bitOrdering[bitOrderingLen-1]} \cdots b_{bitOrdering[1]} b_{bitOrdering[0]}\), \(b_p \in \{0, 1\}\).

The probability of obtaining the \(idx\)-th bit pattern is therefore:

\[Pr(idx) = \dfrac{abs2sum[idx]}{\sum_{k}abs2sum[k]}.\]

Depending on the measurement result, the state vector is collapsed.

If the bits of a state vector index \(j\) at positions \(bitOrdering[k]\) equal \(bitString[k]\) for all \(k\), we obtain \(sv[j] = \dfrac{1}{\sqrt{norm}} sv[j]\). Otherwise, \(sv[j] = 0\). Here, \(norm\) is the normalization factor.

Use case

A batched measurement can be performed with custatevecBatchMeasure() as follows:

// measure with a bit string
custatevecBatchMeasure(
    handle, sv, svDataType, nIndexBits, bitString, bitOrdering, bitStringLen,
    randnum, collapse);

The operation is equivalent to the following:

// compute the sums of squared absolute values of state vector elements
// (no mask: all state vector elements contribute)
const uint32_t maskLen = 0;
const int32_t* maskBitString = nullptr;
const int32_t* maskOrdering = nullptr;

// abs2Sum is an array of length 2^bitStringLen
custatevecAbs2SumArray(
    handle, sv, svDataType, nIndexBits, abs2Sum, bitOrdering, bitStringLen,
    maskBitString, maskOrdering, maskLen);

// [User] compute a cumulative sum of abs2Sum, choose bitString by a random number,
// and set norm to the abs2Sum value of the chosen bit pattern

// collapse if necessary
switch (collapse) {
case CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO:
    custatevecCollapseByBitString(
        handle, sv, svDataType, nIndexBits, bitString, bitOrdering,
        bitStringLen, norm);
    break;  /* collapse */
case CUSTATEVEC_COLLAPSE_NONE:
    break;  /* do nothing */
}

For multi-GPU computations, custatevecBatchMeasureWithOffset() is available. This function works on one device, and users are required to compute, beforehand, the cumulative array of the squared norms of the sub state vectors.

// The state vector is divided into nSubSvs sub state vectors.
// Each sub state vector has its own ordinal and nLocalBits index bits.
// The ordinals of sub state vectors correspond to the extended index bits.
// In this example, all the local qubits are measured and collapsed.

// get abs2sum for each sub state vector
double abs2SumArray[nSubSvs];
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecAbs2SumArray(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &abs2SumArray[iSv], nullptr,
        0, nullptr, nullptr, 0);
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get cumulative array
double cumulativeArray[nSubSvs + 1];
cumulativeArray[0] = 0.0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cumulativeArray[iSv + 1] = cumulativeArray[iSv] + abs2SumArray[iSv];
}

// measurement
const double totalNorm = cumulativeArray[nSubSvs];
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    // detect which sub state vector will be used for measurement.
    // randnum is in [0, 1), so it is scaled by the total squared norm.
    const double target = randnum * totalNorm;
    if (cumulativeArray[iSv] <= target && target < cumulativeArray[iSv + 1]) {
        double offset = cumulativeArray[iSv];
        cudaSetDevice(devices[iSv]);
        // measure local qubits. Here the state vector will not be collapsed.
        // Only local qubits can be included in bitOrdering and bitString arguments.
        // That is, bitOrdering = {0, 1, 2, ..., nLocalBits - 1} and
        // bitString will store values of local qubits as an array of integers.
        custatevecBatchMeasureWithOffset(
            handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, bitString, bitOrdering,
            bitStringLen, randnum, CUSTATEVEC_COLLAPSE_NONE, offset, totalNorm);
    }
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get abs2Sum after collapse
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecAbs2SumArray(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &abs2SumArray[iSv], nullptr,
        0, bitString, bitOrdering, bitStringLen);
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get norm after collapse
double norm = 0.0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    norm += abs2SumArray[iSv];
}

// collapse sub state vectors
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecCollapseByBitString(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, bitString, bitOrdering,
        bitStringLen, norm);
}

// destroy handle
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecDestroy(handle[iSv]);
}

Please refer to the NVIDIA/cuQuantum repository for further details.

API reference

custatevecAbs2SumArray
custatevecStatus_t custatevecAbs2SumArray(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double *abs2sum, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen)

Calculate abs2sum array for a given set of index bits.

Calculates an array of sums of squared absolute values of state vector elements. The abs2sum array can be on host or device. The index bit ordering of the abs2sum array is specified by the bitOrdering and bitOrderingLen arguments. Unspecified bits are folded (summed up).

The maskBitString, maskOrdering and maskLen arguments set a bit mask on the state vector index. The abs2sum array is calculated using only the state vector elements whose indices match the mask bit string. If the maskLen argument is 0, null pointers can be passed to the maskBitString and maskOrdering arguments, and all state vector elements are used for the calculation.

By definition, bit positions in bitOrdering and maskOrdering arguments should not overlap.

An empty bitOrdering can be specified to calculate the squared norm of the state vector. In this case, 0 is passed to the bitOrderingLen argument and the bitOrdering argument can be a null pointer.

Note

Since the size of the abs2sum array is proportional to \( 2^{bitOrderingLen} \), the maximum length of bitOrdering depends on the amount of available memory and maskLen.
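To make the folding and masking rules concrete, here is a host-side reference sketch of the abs2sum array. It is an illustration only, not how the library computes it on the GPU; `abs2SumArrayReference` is a hypothetical name, and the arguments are modeled as `std::vector` rather than raw pointers plus lengths.

```cpp
#include <complex>
#include <cstdint>
#include <vector>

// Hypothetical reference: bit k of the output index is bit bitOrdering[k] of the
// state vector index; elements whose bit at maskOrdering[m] differs from
// maskBitString[m] are skipped; all other index bits are folded (summed).
std::vector<double> abs2SumArrayReference(const std::vector<std::complex<double>>& sv,
                                          const std::vector<int32_t>& bitOrdering,
                                          const std::vector<int32_t>& maskBitString,
                                          const std::vector<int32_t>& maskOrdering) {
    std::vector<double> abs2sum(uint64_t{1} << bitOrdering.size(), 0.0);
    for (uint64_t i = 0; i < sv.size(); ++i) {
        bool matches = true;  // does index i match the mask bit string?
        for (size_t m = 0; m < maskOrdering.size(); ++m)
            if (static_cast<int32_t>((i >> maskOrdering[m]) & 1u) != maskBitString[m]) matches = false;
        if (!matches) continue;
        uint64_t idx = 0;  // gather the bits listed in bitOrdering
        for (size_t k = 0; k < bitOrdering.size(); ++k)
            idx |= ((i >> bitOrdering[k]) & 1u) << k;
        abs2sum[idx] += std::norm(sv[i]);  // |sv[i]|^2
    }
    return abs2sum;
}
```

For the unnormalized 2-qubit vector {1, 2, 3, 4}, bitOrdering = {1} gives {1 + 4, 9 + 16} = {5, 25}, and an empty bitOrdering gives {30}, the squared norm.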

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • abs2sum[out] pointer to a host or device array of sums of squared absolute values

  • bitOrdering[in] pointer to a host array of index bit ordering

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitString[in] pointer to a host array for a bit string to specify mask

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask


custatevecCollapseByBitString
custatevecStatus_t custatevecCollapseByBitString(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const int32_t *bitString, const int32_t *bitOrdering, const uint32_t bitStringLen, double norm)

Collapse the state vector to the state specified by a given bit string.

This function collapses the state vector to the state specified by a given bit string. The state vector elements specified by the bitString, bitOrdering and bitStringLen arguments are normalized by the norm argument. Other elements are set to zero.

At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.
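The collapse rule can be sketched on the host as follows, assuming (as in the batched measurement description) that kept elements are scaled by 1/sqrt(norm). The function name `collapseByBitStringReference` is hypothetical and the sketch is not the library's GPU implementation.

```cpp
#include <cmath>
#include <complex>
#include <cstdint>
#include <vector>

// Hypothetical reference: elements whose bits at bitOrdering[k] equal bitString[k]
// for every k are scaled by 1/sqrt(norm); all other elements are set to zero.
void collapseByBitStringReference(std::vector<std::complex<double>>& sv,
                                  const std::vector<int32_t>& bitString,
                                  const std::vector<int32_t>& bitOrdering, double norm) {
    const double scale = 1.0 / std::sqrt(norm);
    for (uint64_t i = 0; i < sv.size(); ++i) {
        bool keep = true;
        for (size_t k = 0; k < bitOrdering.size(); ++k)
            if (static_cast<int32_t>((i >> bitOrdering[k]) & 1u) != bitString[k]) keep = false;
        sv[i] = keep ? sv[i] * scale : std::complex<double>(0.0);
    }
}
```

Passing norm equal to the squared norm of the kept elements (obtainable with a masked abs2sum computation) leaves a normalized state.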

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • bitString[in] pointer to a host array of bit string

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bit string

  • norm[in] normalization constant


custatevecBatchMeasure
custatevecStatus_t custatevecBatchMeasure(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, int32_t *bitString, const int32_t *bitOrdering, const uint32_t bitStringLen, const double randnum, enum custatevecCollapseOp_t collapse)

Batched single qubit measurement.

This function does batched single qubit measurement and returns a bit string. The bitOrdering argument specifies index bits to be measured. The measurement result is stored in bitString in the ordering specified by the bitOrdering argument.

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measured bit string without collapsing the state vector. When CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses the state vector as custatevecCollapseByBitString() does.

If a random number is not in [0, 1), this function returns CUSTATEVEC_STATUS_INVALID_VALUE. At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits

  • bitString[out] pointer to a host array of measured bit string

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bitString

  • randnum[in] random number, [0, 1).

  • collapse[in] Collapse operation


custatevecBatchMeasureWithOffset
custatevecStatus_t custatevecBatchMeasureWithOffset(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, int32_t *bitString, const int32_t *bitOrdering, const uint32_t bitStringLen, const double randnum, enum custatevecCollapseOp_t collapse, const double offset, const double abs2sum)

Batched single qubit measurement for a partial state vector.

This function does batched single qubit measurement and returns a bit string. The bitOrdering argument specifies index bits to be measured. The measurement result is stored in bitString in the ordering specified by the bitOrdering argument.

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measured bit string without collapsing the state vector. When CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses the state vector as custatevecCollapseByBitString() does.

This function assumes that sv is a partial state vector obtained by dropping some of the most significant index bits. The prefix sum of squared absolute values for the lower indices and the sum for the entire state vector must be provided as offset and abs2sum, respectively. When offset == abs2sum == 0, this function behaves in the same way as custatevecBatchMeasure().

If a random number is not in [0, 1), this function returns CUSTATEVEC_STATUS_INVALID_VALUE. At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] partial state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits

  • bitString[out] pointer to a host array of measured bit string

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bitString

  • randnum[in] random number, [0, 1).

  • collapse[in] Collapse operation

  • offset[in] partial sum of squared absolute values

  • abs2sum[in] sum of squared absolute values for the entire state vector

Expectation

Expectation via a Matrix

Expectation performs the following operation:

\[\langle A \rangle = \bra{\phi}A\ket{\phi},\]

where \(\ket{\phi}\) is a state vector and \(A\) is a matrix or an observable. The expectation API custatevecComputeExpectation() may require an external workspace for large matrices, and custatevecComputeExpectationGetWorkspaceSize() provides the size of that workspace. If a device memory handler is set, the call to custatevecComputeExpectationGetWorkspaceSize() can be skipped.
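As a concrete reading of \(\bra{\phi}A\ket{\phi}\), the sketch below evaluates it on the host for a 2x2 matrix acting on one index bit. It assumes a row-major matrix; `expectation1Q` is a hypothetical name, and the sketch does not model the library's layout, computeType, or multi-bit support.

```cpp
#include <complex>
#include <cstdint>
#include <vector>

using Amp = std::complex<double>;

// Hypothetical sketch of <phi|A|phi> for a row-major 2x2 matrix A acting on index bit target.
Amp expectation1Q(const std::vector<Amp>& sv, const Amp (&A)[2][2], int32_t target) {
    Amp result = 0.0;
    const uint64_t bit = uint64_t{1} << target;
    for (uint64_t i = 0; i < sv.size(); ++i) {
        if (i & bit) continue;  // visit each amplitude pair (i0, i1) once
        const uint64_t i0 = i, i1 = i | bit;
        // components of A|phi> on the pair
        const Amp a0 = A[0][0] * sv[i0] + A[0][1] * sv[i1];
        const Amp a1 = A[1][0] * sv[i0] + A[1][1] * sv[i1];
        // accumulate <phi| (A|phi>)
        result += std::conj(sv[i0]) * a0 + std::conj(sv[i1]) * a1;
    }
    return result;
}
```

For the state (|0> + |1>)/sqrt(2), the sketch gives 1 for the Pauli X matrix and 0 for Pauli Z; a nonzero imaginary part in the result would indicate a non-Hermitian A.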

Use case

// check the size of external workspace
custatevecComputeExpectationGetWorkspaceSize(
    handle, svDataType, nIndexBits, matrix, matrixDataType, layout, nBasisBits, computeType,
    &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// perform expectation
custatevecComputeExpectation(
    handle, sv, svDataType, nIndexBits, expect, expectDataType, residualNorm,
    matrix, matrixDataType, layout, basisBits, nBasisBits, computeType,
    extraWorkspace, extraWorkspaceSizeInBytes);

API reference

custatevecComputeExpectationGetWorkspaceSize
custatevecStatus_t custatevecComputeExpectationGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nBasisBits, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

This function gets the required workspace size for custatevecComputeExpectation().

This function returns the size of the extra workspace required to execute custatevecComputeExpectation(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required.

Parameters
  • handle[in] the handle to the cuStateVec context

  • svDataType[in] Data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • nBasisBits[in] the number of target bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspaceSizeInBytes[out] size of the extra workspace


custatevecComputeExpectation
custatevecStatus_t custatevecComputeExpectation(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, void *expectationValue, cudaDataType_t expectationDataType, double *residualNorm, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t *basisBits, const uint32_t nBasisBits, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Compute expectation of matrix observable.

This function calculates expectation for a given matrix observable. The acceptable values for the expectationDataType argument are CUDA_R_64F and CUDA_C_64F.

The basisBits and nBasisBits arguments specify the basis to calculate expectation. For the computeType argument, the same combinations as for custatevecApplyMatrix() are available.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nBasisBits. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecComputeExpectationGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

The residualNorm argument is not available in this version. If the matrix given by the matrix argument may not be Hermitian, specify CUDA_C_64F for the expectationDataType argument and check the imaginary part of the calculated expectation value.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • expectationValue[out] host pointer to a variable to store an expectation value

  • expectationDataType[in] data type of expectationValue

  • residualNorm[out] result of matrix type test

  • matrix[in] observable as matrix

  • matrixDataType[in] data type of matrix

  • layout[in] matrix memory layout

  • basisBits[in] pointer to a host array of basis index bits

  • nBasisBits[in] the number of basis bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspace[in] pointer to an extra workspace

  • extraWorkspaceSizeInBytes[in] the size of extra workspace

Expectation on Pauli Basis

The cuStateVec API custatevecComputeExpectationsOnPauliBasis() computes expectation values for a batch of Pauli strings. Each observable can be expressed as follows:

\[P_{\text{basisBits}[0]} \otimes P_{\text{basisBits}[1]} \otimes \cdots \otimes P_{\text{basisBits}[\text{nBasisBits}-1]}.\]

Each matrix \(P_{\text{basisBits}[i]}\) can be one of the Pauli matrices \(I\), \(X\), \(Y\), and \(Z\), corresponding to the custatevecPauli_t enums CUSTATEVEC_PAULI_I, CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y, and CUSTATEVEC_PAULI_Z, respectively. Also refer to custatevecPauli_t for details.
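A host-side sketch can make the Pauli-string expectation concrete. Each basis state is mapped to \(P\ket{i} = phase \cdot \ket{i'}\), where X and Y flip the bit and Y and Z contribute phases. The enum `Pauli` is a hypothetical stand-in for custatevecPauli_t, and `pauliExpectation` is illustrative only, not the library's algorithm.

```cpp
#include <complex>
#include <cstdint>
#include <vector>

enum class Pauli { I, X, Y, Z };  // hypothetical stand-in for custatevecPauli_t

// Hypothetical sketch of <phi| P_{basisBits[0]} (x) ... (x) P_{basisBits[n-1]} |phi>.
double pauliExpectation(const std::vector<std::complex<double>>& sv,
                        const std::vector<Pauli>& paulis,
                        const std::vector<int32_t>& basisBits) {
    const std::complex<double> imag(0.0, 1.0);
    std::complex<double> sum = 0.0;
    for (uint64_t i = 0; i < sv.size(); ++i) {
        uint64_t flipped = i;              // P|i> = phase * |flipped>
        std::complex<double> phase = 1.0;
        for (size_t k = 0; k < paulis.size(); ++k) {
            const uint64_t bit = uint64_t{1} << basisBits[k];
            const bool one = (i & bit) != 0;
            switch (paulis[k]) {
            case Pauli::I: break;
            case Pauli::X: flipped ^= bit; break;                                    // X|b> = |1-b>
            case Pauli::Y: flipped ^= bit; phase *= one ? -imag : imag; break;       // Y|0> = i|1>, Y|1> = -i|0>
            case Pauli::Z: if (one) phase = -phase; break;                           // Z|b> = (-1)^b |b>
            }
        }
        sum += std::conj(sv[flipped]) * phase * sv[i];
    }
    return sum.real();  // Pauli strings are Hermitian, so the imaginary part is ~0
}
```

An empty Pauli sequence reduces the sum to the squared norm of the state vector.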

Use case

// calculate the norm and the expectations for Z(q1) and X(q0)Y(q2)

const uint32_t nPauliOperatorArrays = 3;
custatevecPauli_t pauliOperators0[] = {};                                       // III
int32_t           basisBits0[]      = {};
custatevecPauli_t pauliOperators1[] = {CUSTATEVEC_PAULI_Z};                     // IZI
int32_t           basisBits1[]      = {1};
custatevecPauli_t pauliOperators2[] = {CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y}; // XIY
int32_t           basisBits2[]      = {0, 2};
const uint32_t nBasisBitsArray[] = {0, 1, 2};

const custatevecPauli_t*
  pauliOperatorsArray[] = {pauliOperators0, pauliOperators1, pauliOperators2};
const int32_t *basisBitsArray[] = { basisBits0, basisBits1, basisBits2};

uint32_t nIndexBits = 3;
double expectationValues[nPauliOperatorArrays];

custatevecComputeExpectationsOnPauliBasis(
    handle, sv, svDataType, nIndexBits, expectationValues,
    pauliOperatorsArray, nPauliOperatorArrays,
    basisBitsArray, nBasisBitsArray);

API reference

custatevecComputeExpectationsOnPauliBasis
custatevecStatus_t custatevecComputeExpectationsOnPauliBasis(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double *expectationValues, const custatevecPauli_t **pauliOperatorsArray, const uint32_t nPauliOperatorArrays, const int32_t **basisBitsArray, const uint32_t *nBasisBitsArray)

Calculate expectation values for a batch of (multi-qubit) Pauli operators.

This function calculates multiple expectation values for given sequences of Pauli operators by a single call.

A single Pauli operator sequence, pauliOperators, is represented by using an array of custatevecPauli_t. The basis bits on which these Pauli operators are acting are represented by an array of index bit positions. If no Pauli operator is specified for an index bit, the identity operator (CUSTATEVEC_PAULI_I) is implicitly assumed.

The lengths of pauliOperators and basisBits are the same and specified by nBasisBits.

The number of Pauli operator sequences is specified by the nPauliOperatorArrays argument.

Multiple sequences of Pauli operators are represented in the form of arrays of arrays in the following manner:

  • The pauliOperatorsArray argument is an array of arrays of custatevecPauli_t.

  • The basisBitsArray is an array of the arrays of basis bit positions.

  • The nBasisBitsArray argument holds an array of the length of Pauli operator sequences and basis bit arrays.

Calculated expectation values are stored in a host buffer specified by the expectationValues argument of length nPauliOperatorArrays.

This function returns CUSTATEVEC_STATUS_INVALID_VALUE if the basis bits specified for a Pauli operator sequence contain duplicates and/or are out of the range [0, nIndexBits).

This function accepts an empty Pauli operator sequence to get the norm of the state vector.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • expectationValues[out] pointer to a host array to store expectation values

  • pauliOperatorsArray[in] pointer to a host array of Pauli operator arrays

  • nPauliOperatorArrays[in] the number of Pauli operator arrays

  • basisBitsArray[in] host array of basis bit arrays

  • nBasisBitsArray[in] host array of the number of basis bits

Sampling

Sampling obtains many measurement results (shots) from the probabilities calculated from a quantum state, without collapsing the state vector.
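The idea behind sampling can be sketched on the host: build the cumulative sum of an abs2sum array once, then map each random number to a bucket by binary search. This is an assumption-laden illustration, not the sampler's actual algorithm; `sampleReference` is a hypothetical name, and results are returned in the order of randnums.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: each r in [0, 1) selects the first bucket whose cumulative
// value exceeds r * total. abs2sum must be non-empty.
std::vector<uint64_t> sampleReference(const std::vector<double>& abs2sum,
                                      const std::vector<double>& randnums) {
    std::vector<double> cumulative(abs2sum.size());
    double running = 0.0;
    for (size_t k = 0; k < abs2sum.size(); ++k) cumulative[k] = (running += abs2sum[k]);
    const double total = running;
    std::vector<uint64_t> samples;
    samples.reserve(randnums.size());
    for (double r : randnums) {
        auto it = std::upper_bound(cumulative.begin(), cumulative.end(), r * total);
        if (it == cumulative.end()) --it;  // guard against rounding for r close to 1
        samples.push_back(static_cast<uint64_t>(it - cumulative.begin()));
    }
    return samples;
}
```

Because the cumulative array is built once, each shot costs only a logarithmic search, which is why preprocessing is separated from sampling. Buckets with zero probability, such as index 1 in {0.25, 0.0, 0.75}, are never selected.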

Use case

// create sampler and check the size of external workspace
custatevecSamplerCreate(
    handle, sv, svDataType, nIndexBits, &sampler, nMaxShots,
    &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// calculate cumulative abs2sum
custatevecSamplerPreprocess(
    handle, sampler, extraWorkspace, extraWorkspaceSizeInBytes);

// [User] generate randnums, array of random numbers [0, 1) for sampling
...

// sample bit strings
custatevecSamplerSample(
    handle, sampler, bitStrings, bitOrdering, bitStringLen, randnums, nShots,
    output);

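// deallocate the sampler, then release the workspace that was bound to it
custatevecSamplerDestroy(sampler);
if (extraWorkspace != nullptr)
    cudaFree(extraWorkspace);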

For multi-GPU computations, cuStateVec provides custatevecSamplerGetSquaredNorm() and custatevecSamplerApplySubSVOffset(). Users are required to compute a cumulative abs2sum array from the squared norms of the sub state vectors, obtained via custatevecSamplerGetSquaredNorm(), and to provide its values to each sampler descriptor via custatevecSamplerApplySubSVOffset().

// The state vector is divided into nSubSvs sub state vectors.
// Each sub state vector has its own ordinal and nLocalBits index bits.
// The ordinals of sub state vectors correspond to the extended index bits.

// create sampler and check the size of external workspace
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerCreate(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &sampler[iSv], nMaxShots,
        &extraWorkspaceSizeInBytes[iSv]);
}

// allocate external workspace if necessary
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    if (extraWorkspaceSizeInBytes[iSv] > 0) {
        cudaSetDevice(devices[iSv]);
        cudaMalloc(&extraWorkspace[iSv], extraWorkspaceSizeInBytes[iSv]);
    }
}

// sample preprocess
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]) );
    custatevecSampler_preprocess(
        handle[iSv], sampler[iSv], extraWorkspace[iSv],
        extraWorkspaceSizeInBytes[iSv]);
}

// get norm of the sub state vectors
double subNorms[nSubSvs];
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerGetSquaredNorm(
        handle[iSv], sampler[iSv], &subNorms[iSv]);
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get cumulative array
double cumulativeArray[nSubSvs + 1];
cumulativeArray[0] = 0.0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cumulativeArray[iSv + 1] = cumulativeArray[iSv] + subNorms[iSv];
}
double norm = cumulativeArray[nSubSvs];

// apply offset and norm
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerApplySubSVOffset(
        handle[iSv], sampler[iSv], iSv, nSubSvs, cumulativeArray[iSv], norm);
}

// divide the randnum array among sub state vectors. randnums must be sorted in ascending order.
int shotOffsets[nSubSvs + 1];
shotOffsets[0] = 0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    double* pos = std::lower_bound(randnums, randnums + nShots,
                                    cumulativeArray[iSv + 1] / norm);
    if (iSv == nSubSvs - 1) {
        pos = randnums + nShots;
    }
    shotOffsets[iSv + 1] = pos - randnums;
}

// sample bit strings
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    int shotOffset = shotOffsets[iSv];
    int nSubShots = shotOffsets[iSv + 1] - shotOffsets[iSv];
    if (nSubShots > 0) {
        cudaSetDevice(devices[iSv]);
        custatevecSamplerSample(
            handle[iSv], sampler[iSv], &bitStrings[shotOffset], bitOrdering,
            bitStringLen, &randnums[shotOffset], nSubShots,
            CUSTATEVEC_SAMPLER_OUTPUT_RANDNUM_ORDER);
    }
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerDestroy(sampler[iSv]);
}

Please refer to the NVIDIA/cuQuantum repository for further details.

API reference

custatevecSamplerCreate

custatevecStatus_t custatevecSamplerCreate(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecSamplerDescriptor_t *sampler, uint32_t nMaxShots, size_t *extraWorkspaceSizeInBytes)

Create sampler descriptor.

This function creates a sampler descriptor. If an extra workspace is required, its size is set to extraWorkspaceSizeInBytes.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] pointer to state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits of the state vector

  • sampler[out] pointer to a new sampler descriptor

  • nMaxShots[in] the max number of shots used for this sampler context

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecSamplerPreprocess

custatevecStatus_t custatevecSamplerPreprocess(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, void *extraWorkspace, const size_t extraWorkspaceSizeInBytes)

Preprocess the state vector for preparation of sampling.

This function prepares the internal states of the sampler descriptor. If a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0. Otherwise, the pointer passed to the extraWorkspace argument is associated with the sampler descriptor and must remain valid during its lifetime. The size of extraWorkspace is obtained when custatevecSamplerCreate() is called.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[inout] the sampler descriptor

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] size of the extra workspace


custatevecSamplerGetSquaredNorm

custatevecStatus_t custatevecSamplerGetSquaredNorm(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, double *norm)

Get the squared norm of the state vector.

This function returns the squared norm of the state vector. An intended use case is sampling with multiple devices. This API should be called after custatevecSamplerPreprocess(). Otherwise, the behavior of this function is undefined.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[in] the sampler descriptor

  • norm[out] the squared norm of the state vector


custatevecSamplerApplySubSVOffset

custatevecStatus_t custatevecSamplerApplySubSVOffset(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, int32_t subSVOrd, uint32_t nSubSVs, double offset, double norm)

Apply the partial norm and the norm of the whole state vector to the sampler descriptor.

This function applies offsets assuming the given state vector is a sub state vector. An intended use case is sampling with distributed state vectors. The nSubSVs argument should be a power of 2 and subSVOrd should be less than nSubSVs. Otherwise, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.
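The argument rule above can be written as a small host-side check. The sketch also shows how a sub state vector ordinal is assumed to supply the extended (most significant) index bits above the nLocalBits local bits, following the multi-GPU examples in this section. Both functions are hypothetical helpers; rejecting a negative ordinal is an additional assumption.

```cpp
#include <cstdint>

// Hypothetical check mirroring the documented rule: nSubSVs must be a power of 2 and
// subSVOrd must be less than nSubSVs (negative ordinals are also rejected here).
bool isValidSubSVOffsetArgs(int32_t subSVOrd, uint32_t nSubSVs) {
    const bool powerOfTwo = nSubSVs != 0 && (nSubSVs & (nSubSVs - 1)) == 0;
    return powerOfTwo && subSVOrd >= 0 && static_cast<uint32_t>(subSVOrd) < nSubSVs;
}

// Assumption: the ordinal forms the extended index bits above nLocalBits local bits.
uint64_t globalIndex(uint64_t localIndex, int32_t subSVOrd, uint32_t nLocalBits) {
    return (static_cast<uint64_t>(subSVOrd) << nLocalBits) | localIndex;
}
```

For instance, local index 5 in sub state vector 2 with 3 local bits corresponds to global index (2 << 3) | 5 = 21 under this assumption.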

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[in] the sampler descriptor

  • subSVOrd[in] sub state vector ordinal

  • nSubSVs[in] the number of sub state vectors

  • offset[in] cumulative sum offset for the sub state vector

  • norm[in] norm for all sub vectors


custatevecSamplerSample

custatevecStatus_t custatevecSamplerSample(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, custatevecIndex_t *bitStrings, const int32_t *bitOrdering, const uint32_t bitStringLen, const double *randnums, const uint32_t nShots, enum custatevecSamplerOutput_t output)

Sample bit strings from the state vector.

This function does sampling. The bitOrdering and bitStringLen arguments specify bits to be sampled. Sampled bit strings are represented as an array of custatevecIndex_t and are stored to the host memory buffer that the bitStrings argument points to.

The randnums argument is an array of user-generated random numbers whose length is nShots. Random numbers should be in [0, 1); a value outside this range is clipped to [0, 1).

The output argument specifies the order of the sampled bit strings; see custatevecSamplerOutput_t for the available values. For example, CUSTATEVEC_SAMPLER_OUTPUT_RANDNUM_ORDER returns the bit strings in the order of the randnums argument.

This API should be called after custatevecSamplerPreprocess(). Otherwise, the behavior of this function is undefined. By calling custatevecSamplerApplySubSVOffset() prior to this function, it is possible to sample bits corresponding to the ordinal of the sub state vector.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[in] the sampler descriptor

  • bitStrings[out] pointer to a host array to store sampled bit strings

  • bitOrdering[in] pointer to a host array of bit ordering for sampling

  • bitStringLen[in] the number of bits in bitOrdering

  • randnums[in] pointer to an array of random numbers

  • nShots[in] the number of shots

  • output[in] the order of sampled bit strings


custatevecSamplerDestroy

custatevecStatus_t custatevecSamplerDestroy(custatevecSamplerDescriptor_t sampler)

This function releases resources used by the sampler.

Parameters
  • sampler[in] the sampler descriptor

Accessor

An accessor extracts or updates state vector segments.

The APIs custatevecAccessorCreate() and custatevecAccessorCreateView() initialize an accessor and also return the size of an extra workspace (if needed by the APIs custatevecAccessorGet() and custatevecAccessorSet() to perform the copy). The workspace must be bound to an accessor by custatevecAccessorSetExtraWorkspace(), and it must remain valid as long as the accessor does, covering the entire duration of the copy operation. If a device memory handler is set, users do not need to provide an explicit workspace.

The begin and end arguments in the Get/Set APIs correspond to the state vector elements’ indices such that elements within the specified range are copied.

Use case

Extraction

// create accessor and check the size of external workspace
custatevecAccessorCreateView(
    handle, d_sv, CUDA_C_64F, nIndexBits, &accessor, bitOrdering, bitOrderingLen,
    maskBitString, maskOrdering, maskLen, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// set external workspace
custatevecAccessorSetExtraWorkspace(
    handle, &accessor, extraWorkspace, extraWorkspaceSizeInBytes);

// get state vector elements
custatevecAccessorGet(
    handle, &accessor, buffer, accessBegin, accessEnd);

// deallocate the accessor
custatevecAccessorDestroy(accessor);

Update

// create accessor and check the size of external workspace
custatevecAccessorCreate(
    handle, d_sv, CUDA_C_64F, nIndexBits, &accessor, bitOrdering, bitOrderingLen,
    maskBitString, maskOrdering, maskLen, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// set external workspace
custatevecAccessorSetExtraWorkspace(
    handle, &accessor, extraWorkspace, extraWorkspaceSizeInBytes);

// set state vector elements
custatevecAccessorSet(
    handle, &accessor, buffer, 0, nSvSize);

// deallocate the accessor
custatevecAccessorDestroy(accessor);

API reference

custatevecAccessorCreate

custatevecStatus_t custatevecAccessorCreate(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecAccessorDescriptor_t *accessor, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, size_t *extraWorkspaceSizeInBytes)

Create accessor to copy elements between the state vector and an external buffer.

Accessor copies state vector elements between the state vector and external buffers. During the copy, the state vector elements are reordered according to the bit ordering specified by the bitOrdering argument.

The state vector is assumed to have the default ordering: the LSB is the 0th index bit and the (N-1)th index bit is the MSB for an N index bit system. The bit ordering of the external buffer is specified by the bitOrdering argument. When 3 is given to the nIndexBits argument and [1, 2, 0] to the bitOrdering argument, the state vector index bits are permuted to specified bit positions. Thus, the state vector index is rearranged and mapped to the external buffer index as [0, 4, 1, 5, 2, 6, 3, 7].

The maskBitString, maskOrdering and maskLen arguments specify the bit mask for the state vector index being accessed. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

All bit positions in [0, nIndexBits) must appear exactly once, in either the bitOrdering or the maskOrdering argument. If a bit position is missing from both arguments or appears more than once, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.
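The exactly-once rule can be checked on the host before calling custatevecAccessorCreate(). The function below is a hypothetical sketch, not part of cuStateVec; it only mirrors the rule stated above and does not cover any other validation the library may perform.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pre-check, not a cuStateVec API: returns true when every bit
// position in [0, nIndexBits) appears exactly once across bitOrdering and
// maskOrdering.
bool isValidAccessorOrdering(uint32_t nIndexBits,
                             const std::vector<int32_t>& bitOrdering,
                             const std::vector<int32_t>& maskOrdering) {
    std::vector<int> seen(nIndexBits, 0);
    auto mark = [&](int32_t pos) {
        if (pos < 0 || static_cast<uint32_t>(pos) >= nIndexBits) return false;  // out of range
        return ++seen[pos] == 1;  // false when a position is listed twice
    };
    for (int32_t pos : bitOrdering)
        if (!mark(pos)) return false;
    for (int32_t pos : maskOrdering)
        if (!mark(pos)) return false;
    for (int count : seen)
        if (count != 1) return false;  // a position is missing from both arrays
    return true;
}
```

For a 3-bit system, ({1, 2, 0}, {}) and ({2, 0}, {1}) pass, while ({1, 2}, {}) fails because bit 0 is missing and ({1, 2, 0}, {1}) fails because bit 1 overlaps.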

The extra workspace improves performance if the accessor is called multiple times with small external buffers placed on the device. A null pointer can be passed as extraWorkspaceSizeInBytes if the extra workspace is not necessary.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] Data type of state vector

  • nIndexBits[in] the number of index bits of state vector

  • accessor[in] pointer to an accessor descriptor

  • bitOrdering[in] pointer to a host array to specify the basis bits of the external buffer

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitString[in] pointer to a host array to specify the mask values to limit access

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask

  • extraWorkspaceSizeInBytes[out] the required size of extra workspace


custatevecAccessorCreateView

custatevecStatus_t custatevecAccessorCreateView(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecAccessorDescriptor_t *accessor, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, size_t *extraWorkspaceSizeInBytes)

Create accessor for the constant state vector.

This function is the same as custatevecAccessorCreate(), except that it accepts a constant state vector and creates a read-only accessor.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] Data type of state vector

  • nIndexBits[in] the number of index bits of state vector

  • accessor[in] pointer to an accessor descriptor

  • bitOrdering[in] pointer to a host array to specify the basis bits of the external buffer

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitString[in] pointer to a host array to specify the mask values to limit access

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask

  • extraWorkspaceSizeInBytes[out] the required size of extra workspace


custatevecAccessorDestroy

custatevecStatus_t custatevecAccessorDestroy(custatevecAccessorDescriptor_t accessor)

This function releases resources used by the accessor.

Parameters

accessor[in] the accessor descriptor


custatevecAccessorSetExtraWorkspace

custatevecStatus_t custatevecAccessorSetExtraWorkspace(custatevecHandle_t handle, custatevecAccessorDescriptor_t accessor, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Set the extra workspace for the accessor.

This function attaches an extra workspace to the accessor. The required size of the extra workspace can be obtained by custatevecAccessorCreate() or custatevecAccessorCreateView(). If a device memory handler is set, extraWorkspace can be set to null and extraWorkspaceSizeInBytes can be set to 0.

Parameters
  • handle[in] the handle to the cuStateVec library

  • accessor[in] the accessor descriptor

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size


custatevecAccessorGet

custatevecStatus_t custatevecAccessorGet(custatevecHandle_t handle, custatevecAccessorDescriptor_t accessor, void *externalBuffer, const custatevecIndex_t begin, const custatevecIndex_t end)

Copy state vector elements to an external buffer.

This function copies state vector elements to an external buffer specified by the externalBuffer argument. During the copy, the index bits are permuted as specified by the bitOrdering argument in custatevecAccessorCreate() or custatevecAccessorCreateView().

The begin and end arguments specify the range of state vector elements being copied. Both are indices in the bit ordering specified by the bitOrdering argument.
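The following host-side sketch models which elements custatevecAccessorGet() copies for a given [begin, end) range when a mask is present. It is a hypothetical reference in plain C++, not the library's implementation, and it assumes the same bitOrdering convention as the [1, 2, 0] example: bit k of the external index is placed at state vector bit bitOrdering[k], and the bits listed in maskOrdering are fixed to the values in maskBitString.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not cuStateVec code: external index e in
// [begin, end) is expanded into a state vector index, then that element is copied.
std::vector<std::complex<double>> referenceAccessorGet(
        const std::vector<std::complex<double>>& sv,
        const std::vector<int32_t>& bitOrdering,
        const std::vector<int32_t>& maskBitString,
        const std::vector<int32_t>& maskOrdering,
        uint64_t begin, uint64_t end) {
    assert(maskBitString.size() == maskOrdering.size());
    assert(begin <= end && end <= (uint64_t{1} << bitOrdering.size()));
    // Fixed bits contributed by the mask.
    uint64_t maskBits = 0;
    for (std::size_t j = 0; j < maskOrdering.size(); ++j)
        maskBits |= static_cast<uint64_t>(maskBitString[j] & 1) << maskOrdering[j];
    std::vector<std::complex<double>> buffer;
    buffer.reserve(end - begin);
    for (uint64_t e = begin; e < end; ++e) {
        uint64_t s = maskBits;
        for (std::size_t k = 0; k < bitOrdering.size(); ++k)
            s |= ((e >> k) & 1u) << bitOrdering[k];  // scatter bit k of e
        buffer.push_back(sv[s]);
    }
    return buffer;
}
```

For example, with bitOrdering = [0, 2], maskBitString = [1] and maskOrdering = [1], external indices 0 to 3 read state vector elements 2, 3, 6 and 7, so the range [1, 3) returns elements 3 and 6.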

Parameters
  • handle[in] the handle to the cuStateVec library

  • accessor[in] the accessor descriptor

  • externalBuffer[out] pointer to a host or device buffer to receive copied elements

  • begin[in] index in the permuted bit ordering of the first element being copied from the state vector

  • end[in] index in the permuted bit ordering of the last element being copied from the state vector (non-inclusive)


custatevecAccessorSet

custatevecStatus_t custatevecAccessorSet(custatevecHandle_t handle, custatevecAccessorDescriptor_t accessor, const void *externalBuffer, const custatevecIndex_t begin, const custatevecIndex_t end)

Set state vector elements from an external buffer.

This function sets complex values in the state vector from an external buffer specified by the externalBuffer argument. During the copy, the index bits are permuted as specified by the bitOrdering argument in custatevecAccessorCreate().

The begin and end arguments specify the range of elements being written to the state vector. Both are indices in the bit ordering specified by the bitOrdering argument.

If a read-only accessor created by calling custatevecAccessorCreateView() is provided, this function returns CUSTATEVEC_STATUS_NOT_SUPPORTED.

Parameters
  • handle[in] the handle to the cuStateVec library

  • accessor[in] the accessor descriptor

  • externalBuffer[in] pointer to a host or device buffer of complex values being copied to the state vector

  • begin[in] index in the permuted bit ordering of the first element being copied to the state vector

  • end[in] index in the permuted bit ordering of the last element being copied to the state vector (non-inclusive)

Qubit reordering

cuStateVec provides the custatevecSwapIndexBits() API for a single device and custatevecMultiDeviceSwapIndexBits() for multiple devices to reorder state vector elements.

Use case

single device

// This example uses 3 qubits.
const int nIndexBits = 3;

// swap 0th and 2nd qubits
const int nBitSwaps  = 1;
const int2 bitSwaps[] = {{0, 2}}; // specify the qubit pairs

// swap the state vector elements only if 1st qubit is 1
const int maskLen = 1;
int maskBitString[] = {1}; // specify the values of mask qubits
int maskOrdering[] = {1};  // specify the mask qubits

// Swap index bit pairs.
// {|000>, |001>, |010>, |011>, |100>, |101>, |110>, |111>} will be permuted to
// {|000>, |001>, |010>, |110>, |100>, |101>, |011>, |111>}.
custatevecSwapIndexBits(handle, sv, svDataType, nIndexBits, bitSwaps, nBitSwaps,
    maskBitString, maskOrdering, maskLen);

multiple devices

// This example uses 2 GPUs, and each GPU stores a 2-qubit sub state vector.
const int nGlobalIndexBits = 1;
const int nLocalIndexBits = 2;
const int nHandles = 1 << nGlobalIndexBits;

// Users are required to enable direct access on a peer device prior to the swap API call.
for (int i0 = 0; i0 < nHandles; i0++) {
  cudaSetDevice(i0);
  for (int i1 = 0; i1 < nHandles; i1++) {
    if (i0 == i1)
      continue;
    cudaDeviceEnablePeerAccess(i1, 0);
  }
}
cudaSetDevice(0);

// specify the type of device network topology to optimize the data transfer sequence.
// Here, devices are assumed to be connected via NVLink with an NVSwitch or
// PCIe device network with a single PCIe switch.
const custatevecDeviceNetworkType_t deviceNetworkType = CUSTATEVEC_DEVICE_NETWORK_TYPE_SWITCH;

// swap 0th and 2nd qubits
const int nIndexBitSwaps  = 1;
const int2 indexBitSwaps[] = {{0, 2}}; // specify the qubit pairs

// swap the state vector elements only if 1st qubit is 1
const int maskLen = 1;
int maskBitString[] = {1}; // specify the values of mask qubits
int maskOrdering[] = {1};  // specify the mask qubits

// Swap index bit pairs.
// {|000>, |001>, |010>, |011>, |100>, |101>, |110>, |111>} will be permuted to
// {|000>, |001>, |010>, |110>, |100>, |101>, |011>, |111>}.
custatevecMultiDeviceSwapIndexBits(handles, nHandles, subSVs, svDataType,
    nGlobalIndexBits, nLocalIndexBits, indexBitSwaps, nIndexBitSwaps,
    maskBitString, maskOrdering, maskLen, deviceNetworkType);

API reference

custatevecSwapIndexBits

custatevecStatus_t custatevecSwapIndexBits(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const int2 *bitSwaps, const uint32_t nBitSwaps, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen)

Swap index bits and reorder state vector elements on a single device.

This function updates the bit ordering of the state vector by swapping the pairs of bit positions.

The state vector is assumed to have the default ordering: the LSB is the 0th index bit and the (N-1)th index bit is the MSB for an N index bit system. The bitSwaps argument specifies the swapped bit index pairs, whose values must be in the range [0, nIndexBits).

The maskBitString, maskOrdering and maskLen arguments specify the bit mask for the state vector index being permuted. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

A bit position can be included in both bitSwaps and maskOrdering. When a masked bit is swapped, state vector elements whose original indices match the mask bit string are written to the permuted indices while other elements are not copied.
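A host-side reference of the masked swap can help when validating results on small systems. The sketch below is hypothetical C++, not cuStateVec code: std::pair stands in for CUDA's int2, and the pairs are assumed to be disjoint. It implements the rule described above: the mask is tested against the original index, and each matching element is written to its bit-swapped index while all other elements keep their values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Swap each (a, b) bit pair of index; the pairs are assumed to be disjoint.
uint64_t swapBits(uint64_t index, const std::vector<std::pair<int, int>>& bitSwaps) {
    for (const auto& p : bitSwaps) {
        const uint64_t bitA = (index >> p.first) & 1u;
        const uint64_t bitB = (index >> p.second) & 1u;
        if (bitA != bitB)  // flipping both bits exchanges their values
            index ^= (uint64_t{1} << p.first) | (uint64_t{1} << p.second);
    }
    return index;
}

// Hypothetical CPU reference of a masked index-bit swap.
template <typename T>
std::vector<T> referenceSwapIndexBits(const std::vector<T>& sv,
                                      const std::vector<std::pair<int, int>>& bitSwaps,
                                      const std::vector<int32_t>& maskBitString,
                                      const std::vector<int32_t>& maskOrdering) {
    assert(maskBitString.size() == maskOrdering.size());
    std::vector<T> out = sv;  // elements that are not written keep their values
    for (uint64_t s = 0; s < sv.size(); ++s) {
        bool matches = true;  // mask is tested on the original index s
        for (std::size_t j = 0; j < maskOrdering.size(); ++j)
            if (((s >> maskOrdering[j]) & 1u) != static_cast<uint64_t>(maskBitString[j] & 1))
                matches = false;
        if (matches)
            out[swapBits(s, bitSwaps)] = sv[s];
    }
    return out;
}
```

Applied to the single-device use case (bitSwaps = {{0, 2}}, mask bit 1 set to 1) on {0, 1, ..., 7}, it returns {0, 1, 2, 6, 4, 5, 3, 7}, which matches that example's comment. Under this reading, swapping a masked bit copies rather than exchanges: for the swap {0, 1} with mask bit 0 set to 1 on a 2-qubit vector, index 2 receives the element from index 1, and index 1 is left unchanged.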

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] Data type of state vector

  • nIndexBits[in] the number of index bits of state vector

  • bitSwaps[in] pointer to a host array of swapping bit index pairs

  • nBitSwaps[in] the number of bit swaps

  • maskBitString[in] pointer to a host array specifying the mask values that limit the swapped elements

  • maskOrdering[in] pointer to a host array to specify the ordering of maskBitString

  • maskLen[in] the length of mask


custatevecMultiDeviceSwapIndexBits

custatevecStatus_t custatevecMultiDeviceSwapIndexBits(custatevecHandle_t *handles, const uint32_t nHandles, void **subSVs, const cudaDataType_t svDataType, const uint32_t nGlobalIndexBits, const uint32_t nLocalIndexBits, const int2 *indexBitSwaps, const uint32_t nIndexBitSwaps, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, const custatevecDeviceNetworkType_t deviceNetworkType)

Swap index bits and reorder state vector elements for multiple sub state vectors distributed to multiple devices.

This function updates the bit ordering of a state vector distributed across multiple devices by swapping pairs of bit positions.

This function assumes the state vector is split into multiple sub state vectors and distributed to multiple devices to represent a (nGlobalIndexBits + nLocalIndexBits) qubit system.

The handles argument should receive cuStateVec handles created for all devices where sub state vectors are allocated. If two or more cuStateVec handles created for the same device are given, this function will return an error, CUSTATEVEC_STATUS_INVALID_VALUE. The handles argument should contain a handle created on the current device, as all operations in this function will be ordered on the stream of the current device’s handle. Otherwise, this function returns an error, CUSTATEVEC_STATUS_INVALID_VALUE.

Sub state vectors are specified by the subSVs argument as an array of device pointers. All sub state vectors are assumed to hold the same number of index bits, specified by nLocalIndexBits; thus, each sub state vector holds (1 << nLocalIndexBits) state vector elements. The global index bits are identical to the index of the sub state vector in subSVs. The number of sub state vectors is (1 << nGlobalIndexBits). The maximum value of nGlobalIndexBits is 5, which corresponds to 32 sub state vectors.

The index bit of the distributed state vector has the default ordering: The index bits of the sub state vector are mapped from the 0th index bit to the (nLocalIndexBits-1)-th index bit. The global index bits are mapped from the (nLocalIndexBits)-th bit to the (nGlobalIndexBits + nLocalIndexBits - 1)-th bit.
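With this default ordering, an index into the full state vector splits into a sub state vector number (the high, global bits) and an offset inside that sub state vector (the low nLocalIndexBits bits). The helpers below are a hypothetical sketch of that split and its inverse; they are not cuStateVec functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type, not part of cuStateVec.
struct DistributedIndex {
    uint64_t subSVIndex;   // which entry of subSVs (the global index bits)
    uint64_t localOffset;  // element offset inside that sub state vector
};

// Split a full state vector index under the default ordering.
DistributedIndex splitIndex(uint64_t index, uint32_t nGlobalIndexBits, uint32_t nLocalIndexBits) {
    assert(index < (uint64_t{1} << (nGlobalIndexBits + nLocalIndexBits)));
    const uint64_t localMask = (uint64_t{1} << nLocalIndexBits) - 1;
    return {index >> nLocalIndexBits, index & localMask};
}

// Inverse of splitIndex.
uint64_t joinIndex(DistributedIndex d, uint32_t nLocalIndexBits) {
    return (d.subSVIndex << nLocalIndexBits) | d.localOffset;
}
```

For the 2-GPU use case (nGlobalIndexBits = 1, nLocalIndexBits = 2), index 6 (|110>) is element 2 of subSVs[1].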

The indexBitSwaps argument specifies the index bit pairs being swapped. Each pair can consist of two global index bits or of one global and one local index bit. Pairs of two local index bits are not accepted; use custatevecSwapIndexBits() to swap local index bits.
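Because local-local pairs are rejected, a caller may want to route each pair to the appropriate API before the call. The check below is a hypothetical sketch, not a cuStateVec function; the out-of-range test is an added assumption rather than a documented rule.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pre-check: true if (bitA, bitB) is a pair that
// custatevecMultiDeviceSwapIndexBits() is documented to accept, i.e. at least
// one bit is global (>= nLocalIndexBits).
bool isAcceptedMultiDeviceSwap(int bitA, int bitB,
                               uint32_t nGlobalIndexBits, uint32_t nLocalIndexBits) {
    const int nIndexBits = static_cast<int>(nGlobalIndexBits + nLocalIndexBits);
    const int nLocal = static_cast<int>(nLocalIndexBits);
    // Assumed: both bits must lie in [0, nGlobalIndexBits + nLocalIndexBits).
    if (bitA < 0 || bitB < 0 || bitA >= nIndexBits || bitB >= nIndexBits)
        return false;
    // Local-local pairs belong to custatevecSwapIndexBits() instead.
    return bitA >= nLocal || bitB >= nLocal;
}
```

The use-case pair {0, 2} with nGlobalIndexBits = 1 and nLocalIndexBits = 2 is accepted because bit 2 is global, while {0, 1} is rejected because both bits are local.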

The maskBitString, maskOrdering and maskLen arguments specify the bit string mask that limits the state vector elements swapped during the call. Bits in maskOrdering can overlap index bits specified in the indexBitSwaps argument. In such cases, the mask bit string is applied for the bit positions before index bit swaps. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

The deviceNetworkType argument specifies the device network topology to optimize the data transfer sequence. The following two network topologies are assumed:

  • Switch network: devices connected via NVLink with an NVSwitch (e.g., DGX A100 and DGX-2) or a PCIe device network with a single PCIe switch

  • Full mesh network: all devices are connected by full mesh connections (e.g., DGX Station V100/A100)

Note

Important notice: This function assumes that bidirectional GPUDirect P2P is supported and enabled by cudaDeviceEnablePeerAccess() between all devices where sub state vectors are allocated. If GPUDirect P2P is not enabled, a call to custatevecMultiDeviceSwapIndexBits() that accesses device memory allocated on other GPUs, which is otherwise inaccessible, would result in a segmentation fault.

For the best performance, use \(2^n\) devices and allocate one sub state vector on each device. To cover various hardware configurations, this function also allows using a non-\(2^n\) number of devices, allocating two or more sub state vectors on one device, or allocating all sub state vectors on a single device. However, performance is best when a single sub state vector is allocated on each of \(2^n\) devices.

The copy on each participating device is enqueued on the CUDA stream bound to the corresponding handle via custatevecSetStream(). All CUDA calls issued before this function are correctly ordered if they are issued on the streams set to the handles. This function executes asynchronously. Use cudaStreamSynchronize() (for synchronization) or cudaStreamWaitEvent() (for establishing the stream order) with the stream set to the handle of the current device.

Parameters
  • handles[in] pointer to a host array of custatevecHandle_t

  • nHandles[in] the number of handles specified in the handles argument

  • subSVs[inout] pointer to an array of sub state vectors

  • svDataType[in] the data type of the state vector specified by the subSVs argument

  • nGlobalIndexBits[in] the number of global index bits of distributed state vector

  • nLocalIndexBits[in] the number of local index bits in sub state vector

  • indexBitSwaps[in] pointer to a host array of index bit pairs being swapped

  • nIndexBitSwaps[in] the number of index bit swaps

  • maskBitString[in] pointer to a host array specifying the mask values that limit the swapped elements

  • maskOrdering[in] pointer to a host array to specify the ordering of maskBitString

  • maskLen[in] the length of mask

  • deviceNetworkType[in] the device network topology

Matrix property testing

The API custatevecTestMatrixType() is available to check the properties of matrices.

A matrix \(A\) is unitary if \(AA^{\dagger} = A^{\dagger}A = I\), where \(A^{\dagger}\) is the conjugate transpose of \(A\) and \(I\) is the identity matrix.

When CUSTATEVEC_MATRIX_TYPE_UNITARY is given as the matrix type, this API computes the entrywise 1-norm \(||R||_1 = \sum{|r_{ij}|}\), where \(R = AA^{\dagger} - I\). This value is approximately zero if \(A\) is unitary.

A matrix \(A\) is Hermitian if \(A^{\dagger} = A\).

When CUSTATEVEC_MATRIX_TYPE_HERMITIAN is given as the matrix type, this API computes the sum of squared absolute values \(\sum{|r_{ij}|^2}\) (the squared Frobenius norm of \(R\)), where \(R = (A - A^{\dagger}) / 2\). This value is approximately zero if \(A\) is Hermitian.
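Both residuals are easy to reproduce on the host for small matrices, which can help when interpreting the value returned in residualNorm. The sketch below is hypothetical plain C++ that follows the two formulas above; it assumes a row-major dim x dim matrix, ignores the adjoint and computeType arguments, and is not the library's implementation.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Unitary test: sum_ij |r_ij| with R = A A^dagger - I (A stored row-major).
double unitaryResidual(const std::vector<Complex>& a, std::size_t dim) {
    assert(a.size() == dim * dim);
    double norm = 0.0;
    for (std::size_t i = 0; i < dim; ++i)
        for (std::size_t j = 0; j < dim; ++j) {
            Complex r = 0.0;
            for (std::size_t k = 0; k < dim; ++k)
                r += a[i * dim + k] * std::conj(a[j * dim + k]);  // (A A^dagger)_ij
            if (i == j) r -= 1.0;                                 // subtract I
            norm += std::abs(r);
        }
    return norm;
}

// Hermitian test: sum_ij |r_ij|^2 with R = (A - A^dagger) / 2.
double hermitianResidual(const std::vector<Complex>& a, std::size_t dim) {
    assert(a.size() == dim * dim);
    double norm = 0.0;
    for (std::size_t i = 0; i < dim; ++i)
        for (std::size_t j = 0; j < dim; ++j) {
            const Complex r = (a[i * dim + j] - std::conj(a[j * dim + i])) / 2.0;
            norm += std::norm(r);  // std::norm returns |r|^2
        }
    return norm;
}
```

For Pauli X both residuals are 0; for the matrix [[0, 1], [0, 0]], which is neither Hermitian nor unitary, the Hermitian residual is 0.5 and the unitary residual is 1.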

The API may require an external workspace for large matrices; custatevecTestMatrixTypeGetWorkspaceSize() provides the required size. If a device memory handler is set, users do not need to provide an explicit workspace.

Use case

double residualNorm;

void* extraWorkspace = nullptr;
size_t extraWorkspaceSizeInBytes = 0;

// check the size of external workspace
custatevecTestMatrixTypeGetWorkspaceSize(
    handle, matrixType, matrix, matrixDataType, layout,
    nTargets, adjoint, computeType, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// execute testing
custatevecTestMatrixType(
    handle, &residualNorm, matrixType, matrix, matrixDataType, layout,
    nTargets, adjoint, computeType, extraWorkspace, extraWorkspaceSizeInBytes);

// release external workspace (cudaFree(nullptr) is a no-op)
cudaFree(extraWorkspace);

API reference

custatevecTestMatrixTypeGetWorkspaceSize

custatevecStatus_t custatevecTestMatrixTypeGetWorkspaceSize(custatevecHandle_t handle, custatevecMatrixType_t matrixType, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nTargets, const int32_t adjoint, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

Get extra workspace size for custatevecTestMatrixType()

This function gets the size of an extra workspace required to execute custatevecTestMatrixType(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required.

Parameters
  • handle[in] the handle to the cuStateVec library

  • matrixType[in] matrix type

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • nTargets[in] the number of target bits, up to 15

  • adjoint[in] flag to control whether the adjoint of matrix is tested

  • computeType[in] compute type

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecTestMatrixType

custatevecStatus_t custatevecTestMatrixType(custatevecHandle_t handle, double *residualNorm, custatevecMatrixType_t matrixType, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nTargets, const int32_t adjoint, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Test the deviation of a given matrix from a Hermitian (or unitary) matrix.

This function tests if the type of a given matrix matches the type given by the matrixType argument.

For the unitary type, \( R = AA^{\dagger} - I \) is calculated, where \( A \) is the given matrix, and the sum of the absolute values of the elements of \( R \) is returned.

For the Hermitian type, \( R = (A - A^{\dagger}) / 2 \) is calculated, and the sum of the squared absolute values of the elements of \( R \) is returned.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The required size of an extra workspace is obtained by calling custatevecTestMatrixTypeGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

The nTargets argument must be no more than 15 in this version. For larger nTargets, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • residualNorm[out] host pointer to store the deviation from the specified matrix type

  • matrixType[in] matrix type

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • nTargets[in] the number of target bits, up to 15

  • adjoint[in] flag to control whether the adjoint of matrix is tested

  • computeType[in] compute type

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size