cuStateVec Functions

Library Management

Handle Management API

custatevecCreate

custatevecStatus_t custatevecCreate(custatevecHandle_t *handle)

This function initializes the cuStateVec library and creates a handle on the cuStateVec context. It must be called prior to any other cuStateVec API functions.

Parameters

handle[in] the pointer to the handle to the cuStateVec context


custatevecDestroy

custatevecStatus_t custatevecDestroy(custatevecHandle_t handle)

This function releases resources used by the cuStateVec library.

Parameters

handle[in] the handle to the cuStateVec context


custatevecGetDefaultWorkspaceSize

custatevecStatus_t custatevecGetDefaultWorkspaceSize(custatevecHandle_t handle, size_t *workspaceSizeInBytes)

This function returns the default workspace size defined by the cuStateVec library.

Parameters
  • handle[in] the handle to the cuStateVec context

  • workspaceSizeInBytes[out] default workspace size


custatevecSetWorkspace

custatevecStatus_t custatevecSetWorkspace(custatevecHandle_t handle, void *workspace, size_t workspaceSizeInBytes)

This function sets the workspace attached to the handle for use by the cuStateVec library. The required size of the workspace is obtained by calling custatevecGetDefaultWorkspaceSize().

With a larger workspace, some functions can run without a separately allocated extra workspace.

If a device memory handler is set, the workspace can be set to null and the workspace is allocated using the user-defined memory pool.

Parameters
  • handle[in] the handle to the cuStateVec context

  • workspace[in] device pointer to workspace

  • workspaceSizeInBytes[in] workspace size

CUDA Stream Management API

custatevecSetStream

custatevecStatus_t custatevecSetStream(custatevecHandle_t handle, cudaStream_t streamId)

This function sets the stream to be used by the cuStateVec library to execute its routine.

Parameters
  • handle[in] the handle to the cuStateVec context

  • streamId[in] the stream to be used by the library


custatevecGetStream

custatevecStatus_t custatevecGetStream(custatevecHandle_t handle, cudaStream_t *streamId)

This function gets the cuStateVec library stream used to execute all calls from the cuStateVec library functions.

Parameters
  • handle[in] the handle to the cuStateVec context

  • streamId[out] the stream to be used by the library

Error Management API

custatevecGetErrorName

const char *custatevecGetErrorName(custatevecStatus_t status)

This function returns the name string for the input error code. If the error code is not recognized, “unrecognized error code” is returned.

Parameters

status[in] Error code to convert to string


custatevecGetErrorString

const char *custatevecGetErrorString(custatevecStatus_t status)

This function returns the description string for an error code. If the error code is not recognized, “unrecognized error code” is returned.

Parameters

status[in] Error code to convert to string

Logger API

custatevecLoggerSetCallback

custatevecStatus_t custatevecLoggerSetCallback(custatevecLoggerCallback_t callback)

Experimental: This function sets the logging callback function.

Parameters

callback[in] Pointer to a callback function. See custatevecLoggerCallback_t.


custatevecLoggerSetCallbackData

custatevecStatus_t custatevecLoggerSetCallbackData(custatevecLoggerCallbackData_t callback, void *userData)

Experimental: This function sets the logging callback function with user data.

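Parameters
  • callback[in] Pointer to a callback function. See custatevecLoggerCallbackData_t.

  • userData[in] Pointer to user-provided data.
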

custatevecLoggerSetFile

custatevecStatus_t custatevecLoggerSetFile(FILE *file)

Experimental: This function sets the logging output file.

Note

Once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Parameters

file[in] Pointer to an open file. File should have write permission.


custatevecLoggerOpenFile

custatevecStatus_t custatevecLoggerOpenFile(const char *logFile)

Experimental: This function opens a logging output file in the given path.

Parameters

logFile[in] Path of the logging output file.


custatevecLoggerSetLevel

custatevecStatus_t custatevecLoggerSetLevel(int32_t level)

Experimental: This function sets the value of the logging level.

Levels are defined as follows:

Level  Summary            Long Description
0      Off                logging is disabled (default)
1      Errors             only errors will be logged
2      Performance Trace  API calls that launch CUDA kernels will log their parameters and important information
3      Performance Hints  hints that can potentially improve the application’s performance
4      Heuristics Trace   provides general information about the library execution, may contain details about heuristic status
5      API Trace          API calls will log their parameters and important information

Parameters

level[in] Value of the logging level.


custatevecLoggerSetMask

custatevecStatus_t custatevecLoggerSetMask(int32_t mask)

Experimental: This function sets the value of the logging mask. A mask is a bitwise combination of the following values:

Mask  Description
0     Off
1     Errors
2     Performance Trace
4     Performance Hints
8     Heuristics Trace
16    API Trace

Refer to custatevecLoggerCallback_t for the details.

Parameters

mask[in] Value of the logging mask.
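
The two tables suggest a simple relationship: the category listed at level n (1 to 5) has the mask value 2^(n-1). The sketch below spells this out. The enum and function names are hypothetical and are not part of cuStateVec; only the numeric values come from the tables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the values are the mask bits from the table above.
enum LogMaskBit : int32_t {
    kLogMaskOff              = 0,
    kLogMaskErrors           = 1,
    kLogMaskPerformanceTrace = 2,
    kLogMaskPerformanceHints = 4,
    kLogMaskHeuristicsTrace  = 8,
    kLogMaskApiTrace         = 16,
};

// Mask bit of the category listed at `level` in the custatevecLoggerSetLevel
// table. Level 0 ("Off") maps to 0.
inline int32_t maskBitForLevel(int32_t level) {
    assert(level >= 0 && level <= 5);
    return level == 0 ? 0 : (int32_t{1} << (level - 1));
}
```

For example, passing kLogMaskErrors | kLogMaskApiTrace (that is, 17) to custatevecLoggerSetMask() would select the Errors and API Trace categories.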


custatevecLoggerForceDisable

custatevecStatus_t custatevecLoggerForceDisable()

Experimental: This function disables logging for the entire run.

Versioning API

custatevecGetProperty

custatevecStatus_t custatevecGetProperty(libraryPropertyType type, int32_t *value)

This function returns the version information of the cuStateVec library.

Parameters
  • type[in] requested property (MAJOR_VERSION, MINOR_VERSION, or PATCH_LEVEL).

  • value[out] value of the requested property.


custatevecGetVersion

size_t custatevecGetVersion()

This function returns the version information of the cuStateVec library.

Memory Management API

A stream-ordered memory allocator (or mempool for short) allocates/deallocates memory asynchronously from/to a mempool in a stream-ordered fashion, meaning memory operations and computations enqueued on the streams have a well-defined inter- and intra-stream dependency. There are several well-implemented stream-ordered mempools available, such as cudaMemPool_t, which has been built in at the CUDA driver level since CUDA 11.2 (so that all CUDA applications in the same process can easily share the same pool), and the RAPIDS Memory Manager (RMM). For a detailed introduction, see the NVIDIA Developer Blog.

The new device memory handler APIs allow users to bind a stream-ordered mempool to the library handle, such that cuStateVec can take care of most of the memory management for users. Below is an illustration of what can be done:

MyMemPool pool = MyMemPool();  // kept alive for the entire process in real apps

int my_alloc(void* ctx, void** ptr, size_t size, cudaStream_t stream) {
  return reinterpret_cast<MyMemPool*>(ctx)->alloc(ptr, size, stream);
}

int my_dealloc(void* ctx, void* ptr, size_t size, cudaStream_t stream) {
  return reinterpret_cast<MyMemPool*>(ctx)->dealloc(ptr, size, stream);
}

// create a mem handler and fill in the required members for the library to use
custatevecDeviceMemHandler_t handler;
handler.ctx = reinterpret_cast<void*>(&pool);
handler.device_alloc = my_alloc;
handler.device_free = my_dealloc;
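// strncpy copies the name and zero-pads the rest of the buffer; copying
// CUSTATEVEC_ALLOCATOR_NAME_LEN bytes with memcpy would read past the short string
strncpy(handler.name, "my pool", CUSTATEVEC_ALLOCATOR_NAME_LEN);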

// bind the handler to the library handle
custatevecSetDeviceMemHandler(handle, &handler);

/* ... use gate application as usual ... */

// User doesn’t compute the required sizes

// User doesn’t query the workspace size (but one can if desired)

// User doesn’t allocate memory!

// User sets a null pointer to indicate the library should draw memory from the user's pool
void* extraWorkspace = nullptr;
size_t extraWorkspaceSizeInBytes = 0;
custatevecApplyMatrix(
    handle, sv, svDataType, nIndexBits, matrix, matrixDataType, layout,
    adjoint, targets, nTargets, controls, controlBitValues, nControls,
    computeType, extraWorkspace, extraWorkspaceSizeInBytes);

// User doesn’t deallocate memory!

As shown above, several calls to the workspace-related APIs can be skipped. Moreover, allowing the library to share your memory pool can not only alleviate potential memory conflicts but also enable possible optimizations.

In the current release, only a device mempool can be bound.

custatevecSetDeviceMemHandler

custatevecStatus_t custatevecSetDeviceMemHandler(custatevecHandle_t handle, const custatevecDeviceMemHandler_t *handler)

Set the current device memory handler.

Once set, when cuStateVec needs device memory in various API calls it will allocate from the user-provided memory pool and deallocate at completion. See custatevecDeviceMemHandler_t and APIs that require extra workspace for further detail.

The internal stream order is established using the user-provided stream set via custatevecSetStream().

If the handler argument is set to nullptr, the library handle will detach its existing memory handler.

Warning

The behavior is undefined in the following scenarios:

  • the library handle is bound to a memory handler and subsequently to another handler

  • the library handle outlives the attached memory pool

  • the memory pool is not stream-ordered

Parameters
  • handle[in] Opaque handle holding cuStateVec’s library context.

  • handler[in] the device memory handler that encapsulates the user’s mempool. The struct content is copied internally.


custatevecGetDeviceMemHandler

custatevecStatus_t custatevecGetDeviceMemHandler(custatevecHandle_t handle, custatevecDeviceMemHandler_t *handler)

Get the current device memory handler.

Parameters
  • handle[in] Opaque handle holding cuStateVec’s library context.

  • handler[out] If previously set, the struct pointed to by handler is filled in, otherwise CUSTATEVEC_STATUS_NO_DEVICE_ALLOCATOR is returned.

Initialization

cuStateVec API custatevecInitializeStateVector() can be used to initialize a state vector to any of a set of prescribed states. Please refer to custatevecStateVectorType_t for details.

Use case

// initialize state vector
custatevecInitializeStateVector(handle, sv, svDataType, nIndexBits, svType);

API reference

custatevecInitializeStateVector

custatevecStatus_t custatevecInitializeStateVector(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecStateVectorType_t svType)

Initialize the state vector to a certain form.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • svType[in] the target quantum state
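
As an illustration of what a prescribed state looks like in memory, the sketch below builds two common states on the host. It assumes the prescribed set includes the all-zero computational basis state and the equal superposition; custatevecStateVectorType_t lists the states actually supported. The helper names are hypothetical and the code does not call cuStateVec.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// |00...0>: amplitude 1 at index 0 and 0 elsewhere.
std::vector<cplx> makeZeroState(uint32_t nIndexBits) {
    assert(nIndexBits < 63);
    std::vector<cplx> sv(std::size_t{1} << nIndexBits, cplx(0.0, 0.0));
    sv[0] = cplx(1.0, 0.0);
    return sv;
}

// Equal superposition: every amplitude is 1 / sqrt(2^nIndexBits).
std::vector<cplx> makeUniformState(uint32_t nIndexBits) {
    assert(nIndexBits < 63);
    const std::size_t size = std::size_t{1} << nIndexBits;
    const double amplitude = 1.0 / std::sqrt(static_cast<double>(size));
    return std::vector<cplx>(size, cplx(amplitude, 0.0));
}
```

Both helpers return 2^nIndexBits elements, which is the state vector size the library expects for a given nIndexBits.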

Gate Application

General Matrices

The cuStateVec API custatevecApplyMatrix() applies a matrix representing a gate to a state vector. The API may require extra workspace for large matrices; custatevecApplyMatrixGetWorkspaceSize() returns the required size. If a device memory handler is set, the call to custatevecApplyMatrixGetWorkspaceSize() can be skipped.

custatevecApplyMatrixBatchedGetWorkspaceSize() and custatevecApplyMatrixBatched() can apply matrices to batched state vectors. Please refer to batched state vectors for the overview of batched state vector simulations.

Use case

// check the size of external workspace
custatevecApplyMatrixGetWorkspaceSize(
    handle, svDataType, nIndexBits, matrix, matrixDataType, layout, adjoint, nTargets,
    nControls, computeType, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// apply gate
custatevecApplyMatrix(
    handle, sv, svDataType, nIndexBits, matrix, matrixDataType, layout,
    adjoint, targets, nTargets, controls, controlBitValues, nControls,
    computeType, extraWorkspace, extraWorkspaceSizeInBytes);

API reference

custatevecApplyMatrixGetWorkspaceSize
custatevecStatus_t custatevecApplyMatrixGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const uint32_t nTargets, const uint32_t nControls, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

This function returns the extra workspace size required to execute custatevecApplyMatrix(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required for a given set of arguments.

Parameters
  • handle[in] the handle to the cuStateVec context

  • svDataType[in] Data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • adjoint[in] apply adjoint of matrix

  • nTargets[in] the number of target bits

  • nControls[in] the number of control bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecApplyMatrix
custatevecStatus_t custatevecApplyMatrix(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

This function applies a gate matrix to a state vector. The state vector size is \(2^\text{nIndexBits}\).

The matrix argument is a host or device pointer to a 2-dimensional array holding a square matrix. The size of the matrix is ( \(2^\text{nTargets} \times 2^\text{nTargets}\) ) and the value type is specified by the matrixDataType argument. The layout argument specifies the matrix layout, which can be either row-major or column-major. The targets and controls arguments specify target and control bit positions in the state vector index.

The controlBitValues argument specifies bit values of control bits. The ordering of controlBitValues is specified by the controls argument. If a null pointer is passed for this argument, all control bit values are set to 1.

By definition, bit positions in the targets and controls arguments must not overlap.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecApplyMatrixGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • matrix[in] host or device pointer to a square matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • adjoint[in] apply adjoint of matrix

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size
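
To make the roles of targets, controls, controlBitValues and adjoint concrete, here is a host-side reference sketch of the operation. It is an illustration under stated assumptions, not the library's implementation: the matrix is row-major, targets[0] is taken as the least significant bit of the matrix row/column index, and an empty controlBitValues stands in for the null pointer (all control values 1). The function name is hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// CPU reference for gate application (illustration only).
void applyMatrixReference(std::vector<cplx>& sv, uint32_t nIndexBits,
                          const std::vector<cplx>& matrix, bool adjoint,
                          const std::vector<int32_t>& targets,
                          const std::vector<int32_t>& controls,
                          const std::vector<int32_t>& controlBitValues) {
    const int64_t svSize = int64_t{1} << nIndexBits;
    const int64_t dim = int64_t{1} << targets.size();
    assert(static_cast<int64_t>(sv.size()) == svSize);
    assert(static_cast<int64_t>(matrix.size()) == dim * dim);
    assert(controlBitValues.empty() || controlBitValues.size() == controls.size());

    // offsets[k] scatters the bits of matrix index k onto the target positions.
    std::vector<int64_t> offsets(static_cast<std::size_t>(dim), 0);
    int64_t targetMask = 0;
    for (std::size_t j = 0; j < targets.size(); ++j) {
        targetMask |= int64_t{1} << targets[j];
        for (int64_t k = 0; k < dim; ++k)
            if ((k >> j) & 1) offsets[k] |= int64_t{1} << targets[j];
    }

    std::vector<cplx> in(static_cast<std::size_t>(dim));
    for (int64_t base = 0; base < svSize; ++base) {
        if (base & targetMask) continue;  // visit each group of 2^nTargets amplitudes once
        bool active = true;
        for (std::size_t c = 0; c < controls.size(); ++c) {
            const int64_t want = controlBitValues.empty() ? 1 : controlBitValues[c];
            if (((base >> controls[c]) & 1) != want) active = false;
        }
        if (!active) continue;  // control condition not met: amplitudes stay unchanged

        for (int64_t k = 0; k < dim; ++k) in[k] = sv[base | offsets[k]];
        for (int64_t r = 0; r < dim; ++r) {
            cplx acc(0.0, 0.0);
            for (int64_t c = 0; c < dim; ++c) {
                // For the adjoint, use the conjugate-transposed element.
                const cplx m = adjoint ? std::conj(matrix[c * dim + r]) : matrix[r * dim + c];
                acc += m * in[c];
            }
            sv[base | offsets[r]] = acc;
        }
    }
}
```

With a 2-qubit state, applying the X matrix {0, 1, 1, 0} with targets = {1} and controls = {0} reproduces a CNOT whose control is bit 0.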


custatevecApplyMatrixBatchedGetWorkspaceSize
custatevecStatus_t custatevecApplyMatrixBatchedGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, const custatevecIndex_t svStride, custatevecMatrixMapType_t mapType, const int32_t *matrixIndices, const void *matrices, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const uint32_t nMatrices, const uint32_t nTargets, const uint32_t nControls, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

This function returns the extra workspace size required to execute custatevecApplyMatrixBatched(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required for a given set of arguments.

Parameters
  • handle[in] the handle to the cuStateVec context

  • svDataType[in] Data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • nSVs[in] the number of state vectors

  • svStride[in] distance of two consecutive state vectors

  • mapType[in] enumerator specifying the way to assign matrices

  • matrixIndices[in] pointer to a host or device array of matrix indices

  • matrices[in] pointer to allocated matrices in one contiguous memory chunk on host or device

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • adjoint[in] apply adjoint of matrix

  • nMatrices[in] the number of matrices

  • nTargets[in] the number of target bits

  • nControls[in] the number of control bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecApplyMatrixBatched
custatevecStatus_t custatevecApplyMatrixBatched(custatevecHandle_t handle, void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, custatevecIndex_t svStride, custatevecMatrixMapType_t mapType, const int32_t *matrixIndices, const void *matrices, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const uint32_t nMatrices, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

This function applies one gate matrix to each of the batched state vectors given by the batchedSv argument. Batched state vectors are allocated in a single device memory chunk with the stride specified by the svStride argument. Each state vector size is \(2^\text{nIndexBits}\) and the number of state vectors is specified by the nSVs argument.

The mapType argument specifies the way to assign matrices to the state vectors, and the matrixIndices argument specifies the matrix indices for the state vectors. When mapType is CUSTATEVEC_MATRIX_MAP_TYPE_MATRIX_INDEXED, the \(\text{matrixIndices[}i\text{]}\)-th matrix will be assigned to the \(i\)-th state vector. matrixIndices should contain nSVs integers when mapType is CUSTATEVEC_MATRIX_MAP_TYPE_MATRIX_INDEXED and it can be a null pointer when mapType is CUSTATEVEC_MATRIX_MAP_TYPE_BROADCAST.

The matrices argument is a host or device pointer to an array of nMatrices square matrices stored in one contiguous memory chunk. Each matrix has \(2^\text{nTargets} \times 2^\text{nTargets}\) elements, so the array holds \(\text{nMatrices} \times 2^\text{nTargets} \times 2^\text{nTargets}\) elements in total; the value type is specified by the matrixDataType argument. The layout argument specifies the matrix layout, which can be either row-major or column-major. The targets and controls arguments specify target and control bit positions in the state vector index. In this API, these bit positions are uniform for all the batched state vectors.

The controlBitValues argument specifies bit values of control bits. The ordering of controlBitValues is specified by the controls argument. If a null pointer is specified to this argument, all control bit values are set to 1.

By definition, bit positions in targets and controls arguments should not overlap.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecApplyMatrixBatchedGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

In this version, this API does not return an error even if the matrixIndices argument contains invalid matrix indices. However, when applicable, an error message is printed to stdout.

Parameters
  • handle[in] the handle to the cuStateVec library

  • batchedSv[inout] batched state vector allocated in one contiguous memory chunk on device

  • svDataType[in] data type of the state vectors

  • nIndexBits[in] the number of index bits of the state vectors

  • nSVs[in] the number of state vectors

  • svStride[in] distance of two consecutive state vectors

  • mapType[in] enumerator specifying the way to assign matrices

  • matrixIndices[in] pointer to a host or device array of matrix indices

  • matrices[in] pointer to allocated matrices in one contiguous memory chunk on host or device

  • matrixDataType[in] data type of matrices

  • layout[in] enumerator specifying the memory layout of matrix

  • adjoint[in] apply adjoint of matrix

  • nMatrices[in] the number of matrices

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size
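
The addressing rules for batched state vectors and matrices can be sketched on the host as follows. This is a minimal illustration with hypothetical names: the enum stands in for custatevecMatrixMapType_t, svStride is assumed to be counted in state vector elements, and broadcast mapping is assumed to apply matrix 0 to every state vector. Because the API does not report invalid matrix indices, the sketch also shows the kind of check a caller might run beforehand.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for custatevecMatrixMapType_t.
enum class MatrixMapType { kBroadcast, kMatrixIndexed };

// Element offset of the i-th state vector inside the batched buffer.
inline int64_t stateVectorOffset(uint32_t i, int64_t svStride) {
    return static_cast<int64_t>(i) * svStride;
}

// Index of the matrix applied to the i-th state vector.
// Assumption: with broadcast mapping, every state vector uses matrix 0.
inline int32_t matrixIndexFor(uint32_t i, MatrixMapType mapType,
                              const std::vector<int32_t>& matrixIndices,
                              uint32_t nMatrices) {
    if (mapType == MatrixMapType::kBroadcast) return 0;
    assert(i < matrixIndices.size());
    const int32_t m = matrixIndices[i];
    assert(m >= 0 && static_cast<uint32_t>(m) < nMatrices);  // reject invalid indices
    return m;
}

// Element offset of matrix m inside the contiguous matrices array.
inline int64_t matrixOffset(int32_t m, uint32_t nTargets) {
    const int64_t dim = int64_t{1} << nTargets;
    return static_cast<int64_t>(m) * dim * dim;
}
```

For instance, with matrixIndices = {1, 0, 1} and nTargets = 2, the third state vector uses matrix 1, which starts 16 elements into the matrices array.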

Pauli Matrices

The exponential of a tensor product of Pauli matrices can be expressed as follows:

\[e^{i \theta \left( P_{target[0]} \otimes P_{target[1]} \otimes \cdots \otimes P_{target[nTargets-1]} \right)}.\]

Matrix \(P_{target[i]}\) can be any of the Pauli matrices \(I\), \(X\), \(Y\), and \(Z\), which correspond to the custatevecPauli_t enums CUSTATEVEC_PAULI_I, CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y, and CUSTATEVEC_PAULI_Z, respectively. Refer to custatevecPauli_t for details.

Use case

// apply exponential
custatevecApplyPauliRotation(
    handle, sv, svDataType, nIndexBits, theta, paulis, targets, nTargets,
    controls, controlBitValues, nControls);

API reference

custatevecApplyPauliRotation
custatevecStatus_t custatevecApplyPauliRotation(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double theta, const custatevecPauli_t *paulis, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls)

Apply the exponential of a multi-qubit Pauli operator.

This function applies the exponential of a tensor product of Pauli bases specified by the paulis argument, \( e^{i \theta P} \), where \(P\) is the product of Pauli bases. The paulis, targets, and nTargets arguments specify Pauli bases and their bit positions in the state vector index.

At least one target and a corresponding Pauli basis should be specified.

The controls and nControls arguments specify the control bit positions in the state vector index.

The controlBitValues argument specifies bit values of control bits. The ordering of controlBitValues is specified by the controls argument. If a null pointer is specified to this argument, all control bit values are set to 1.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of bits in the state vector index

  • theta[in] the angle \(\theta\) in the exponent of \( e^{i \theta P} \)

  • paulis[in] host pointer to custatevecPauli_t array

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits
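
Because a tensor product of Pauli matrices squares to the identity, the exponential reduces to \( e^{i \theta P} = \cos\theta \, I + i \sin\theta \, P \). The host-side sketch below uses this identity to illustrate the operation without control bits. The function and enum names are hypothetical and the code is not the library's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;
enum class Pauli { I, X, Y, Z };  // hypothetical stand-in for custatevecPauli_t

// CPU reference for exp(i*theta*P) without control bits (illustration only).
void applyPauliRotationReference(std::vector<cplx>& sv, double theta,
                                 const std::vector<Pauli>& paulis,
                                 const std::vector<int32_t>& targets) {
    assert(!targets.empty() && paulis.size() == targets.size());
    const int64_t svSize = static_cast<int64_t>(sv.size());
    const cplx imag(0.0, 1.0);

    // First compute P|sv>, using P|x> = phase(x) |x ^ flipMask>.
    std::vector<cplx> pSv(sv.size(), cplx(0.0, 0.0));
    for (int64_t x = 0; x < svSize; ++x) {
        int64_t flipMask = 0;
        cplx phase(1.0, 0.0);
        for (std::size_t j = 0; j < targets.size(); ++j) {
            const int64_t bit = (x >> targets[j]) & 1;
            const double sign = bit ? -1.0 : 1.0;
            switch (paulis[j]) {
                case Pauli::I:
                    break;
                case Pauli::X:  // X|b> = |1-b>
                    flipMask |= int64_t{1} << targets[j];
                    break;
                case Pauli::Y:  // Y|b> = i (-1)^b |1-b>
                    flipMask |= int64_t{1} << targets[j];
                    phase *= imag * sign;
                    break;
                case Pauli::Z:  // Z|b> = (-1)^b |b>
                    phase *= sign;
                    break;
            }
        }
        pSv[x ^ flipMask] += phase * sv[x];
    }

    // exp(i*theta*P)|sv> = cos(theta)|sv> + i sin(theta) P|sv>
    const double c = std::cos(theta);
    const double s = std::sin(theta);
    for (int64_t x = 0; x < svSize; ++x)
        sv[x] = c * sv[x] + imag * s * pSv[x];
}
```

As a quick check, with \(\theta = \pi/2\) and a single Z on qubit 0, the state \(\ket{0}\) picks up the phase \(i\); with a single X it is mapped to \(i\ket{1}\).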

Generalized Permutation Matrices

A generalized permutation matrix can be expressed as the multiplication of a permutation matrix \(P\) and a diagonal matrix \(D\). For instance, we can decompose a 4 \(\times\) 4 generalized permutation matrix \(A\) as follows:

\[\begin{split}A = \left[ \begin{array}{cccc} 0 & 0 & a_0 & 0 \\ a_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & a_2 \\ 0 & a_3 & 0 & 0 \end{array}\right] = DP\end{split}\]

where

\[\begin{split}D = \left[ \begin{array}{cccc} a_0 & 0 & 0 & 0 \\ 0 & a_1 & 0 & 0 \\ 0 & 0 & a_2 & 0 \\ 0 & 0 & 0 & a_3 \end{array}\right], P = \left[ \begin{array}{cccc} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{array}\right].\end{split}\]

When \(P\) is diagonal (that is, the identity matrix), the generalized permutation matrix is also diagonal. Similarly, when \(D\) is the identity matrix, the generalized permutation matrix becomes a permutation matrix.

The cuStateVec API custatevecApplyGeneralizedPermutationMatrix() applies a generalized permutation matrix like \(A\) to a state vector. The API may require extra workspace for large matrices, whose size can be queried using custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(). If a device memory handler is set, custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize() can be skipped.

Use case

// check the size of external workspace
custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(
    handle, CUDA_C_64F, nIndexBits, permutation, diagonals, CUDA_C_64F, targets,
    nTargets, nControls, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// apply a generalized permutation matrix
custatevecApplyGeneralizedPermutationMatrix(
    handle, d_sv, CUDA_C_64F, nIndexBits, permutation, diagonals, CUDA_C_64F,
    adjoint, targets, nTargets, controls, controlBitValues, nControls,
    extraWorkspace, extraWorkspaceSizeInBytes);

The operation is equivalent to the following:

// sv, sv_temp: the state vector and temporary buffer.

int64_t sv_size = int64_t{1} << nIndexBits;
for (int64_t sv_idx = 0; sv_idx < sv_size; sv_idx++) {
    // The basis of sv_idx is converted to permutation basis to obtain perm_idx
    auto perm_idx = convertToPermutationBasis(sv_idx);
    // apply generalized permutation matrix
    if (adjoint == 0)
        sv_temp[sv_idx] = sv[permutation[perm_idx]] * diagonals[perm_idx];
    else
        sv_temp[permutation[perm_idx]] = sv[sv_idx] * conj(diagonals[perm_idx]);
}

for (int64_t sv_idx = 0; sv_idx < sv_size; sv_idx++)
    sv[sv_idx] = sv_temp[sv_idx];
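
The pseudocode above leaves convertToPermutationBasis() undefined. A self-contained host-side variant is sketched below for the case without control bits. It assumes targets[0] is the least significant bit of the permutation index and that the permutation acts only on the target bits, leaving the other bits of the state vector index unchanged. The function name is hypothetical, and int64_t stands in for custatevecIndex_t.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// CPU reference for A = D * P acting on the target bits (illustration only).
void applyGeneralizedPermutationReference(std::vector<cplx>& sv,
                                          const std::vector<int64_t>& permutation,
                                          const std::vector<cplx>& diagonals,
                                          bool adjoint,
                                          const std::vector<int32_t>& targets) {
    const int64_t dim = int64_t{1} << targets.size();
    assert(static_cast<int64_t>(permutation.size()) == dim);
    assert(static_cast<int64_t>(diagonals.size()) == dim);

    int64_t targetMask = 0;
    for (int32_t t : targets) targetMask |= int64_t{1} << t;

    std::vector<cplx> temp(sv.size(), cplx(0.0, 0.0));
    for (int64_t svIdx = 0; svIdx < static_cast<int64_t>(sv.size()); ++svIdx) {
        // Gather the target bits of svIdx into the permutation index k.
        int64_t k = 0;
        for (std::size_t j = 0; j < targets.size(); ++j)
            k |= ((svIdx >> targets[j]) & 1) << j;
        // Scatter permutation[k] back onto the target bit positions.
        int64_t permuted = svIdx & ~targetMask;
        for (std::size_t j = 0; j < targets.size(); ++j)
            permuted |= ((permutation[k] >> j) & 1) << targets[j];

        if (!adjoint)
            temp[svIdx] = sv[permuted] * diagonals[k];
        else
            temp[permuted] = sv[svIdx] * std::conj(diagonals[k]);
    }
    sv = temp;
}
```

For the 4 \(\times\) 4 example matrix \(A\) above, the permutation table is {2, 0, 3, 1}: row \(k\) of \(P\) has its 1 in column permutation[k].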

API reference

custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize
custatevecStatus_t custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const custatevecIndex_t *permutation, const void *diagonals, cudaDataType_t diagonalsDataType, const int32_t *targets, const uint32_t nTargets, const uint32_t nControls, size_t *extraWorkspaceSizeInBytes)

This function returns the size of the extra workspace required to execute custatevecApplyGeneralizedPermutationMatrix(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required for a given set of arguments.

Parameters
  • handle[in] the handle to the cuStateVec library

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • permutation[in] host or device pointer to a permutation table

  • diagonals[in] host or device pointer to diagonal elements

  • diagonalsDataType[in] data type of diagonals

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • nControls[in] the number of control bits

  • extraWorkspaceSizeInBytes[out] extra workspace size


custatevecApplyGeneralizedPermutationMatrix
custatevecStatus_t custatevecApplyGeneralizedPermutationMatrix(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecIndex_t *permutation, const void *diagonals, cudaDataType_t diagonalsDataType, const int32_t adjoint, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

This function applies a generalized permutation matrix to a state vector.

The generalized permutation matrix, \(A\), is expressed as \(A = DP\), where \(D\) and \(P\) are diagonal and permutation matrices, respectively.

The permutation matrix, \(P\), is specified as a permutation table which is an array of custatevecIndex_t and passed to the permutation argument.

The diagonal matrix, \(D\), is specified as an array of diagonal elements. The length of both arrays is \( 2^\text{nTargets} \). The diagonalsDataType argument specifies the type of diagonal elements.

Below is the table of combinations of svDataType and diagonalsDataType arguments available in this version.

svDataType  diagonalsDataType
CUDA_C_64F  CUDA_C_64F
CUDA_C_32F  CUDA_C_64F
CUDA_C_32F  CUDA_C_32F

This function can also be used to only apply either the diagonal or the permutation matrix. By passing a null pointer to the permutation argument, \(P\) is treated as an identity matrix, thus, only the diagonal matrix \(D\) is applied. Likewise, if a null pointer is passed to the diagonals argument, \(D\) is treated as an identity matrix, and only the permutation matrix \(P\) is applied.

The permutation argument should hold integers in [0, \( 2^\text{nTargets} \)). No integer may appear more than once; otherwise the behavior of this function is undefined.

The permutation and diagonals arguments should not both be null. If both are null, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets or nIndexBits. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize().

A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

In this version, custatevecApplyGeneralizedPermutationMatrix() does not return an error if an invalid permutation argument is specified.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • permutation[in] host or device pointer to a permutation table

  • diagonals[in] host or device pointer to diagonal elements

  • diagonalsDataType[in] data type of diagonals

  • adjoint[in] apply adjoint of generalized permutation matrix

  • targets[in] pointer to a host array of target bits

  • nTargets[in] the number of target bits

  • controls[in] pointer to a host array of control bits

  • controlBitValues[in] pointer to a host array of control bit values

  • nControls[in] the number of control bits

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size

Measurement

Measurement on Z-bases

Let us consider the measurement of an \(nIndexBits\)-qubit state vector \(sv\) on an \(nBasisBits\)-bit Z product basis \(basisBits\).

The sums of squared absolute values of state vector elements on the Z product basis, \(abs2sum0\) and \(abs2sum1\), are given by the following:

\[\begin{split}abs2sum0 &= \Bra{sv} \left\{ \dfrac{1}{2} \left( 1 + Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv}, \\ abs2sum1 &= \Bra{sv} \left\{ \dfrac{1}{2} \left( 1 - Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv}.\end{split}\]

Therefore, the probabilities of obtaining parity 0 and parity 1 are given by:

\[\begin{split}Pr(parity = 0) &= \dfrac{abs2sum0}{abs2sum0 + abs2sum1}, \\ Pr(parity = 1) &= \dfrac{abs2sum1}{abs2sum0 + abs2sum1}.\end{split}\]
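The two sums can be illustrated with a small host-side reference. This is only a sketch for checking results on tiny state vectors, not how the library computes them: abs2SumOnZBasisHost is a hypothetical name, the state vector is assumed to be a host array of double-precision complex values, and an element contributes to \(abs2sum0\) when the selected basis bits of its index have even parity and to \(abs2sum1\) otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host reference: returns {abs2sum0, abs2sum1} for the Z product
// basis given by basisBits.
std::pair<double, double> abs2SumOnZBasisHost(
    const std::vector<std::complex<double>>& sv,
    const std::vector<int32_t>& basisBits) {
    assert(!basisBits.empty());
    uint64_t mask = 0;
    for (int32_t b : basisBits) mask |= uint64_t{1} << b;
    double sums[2] = {0.0, 0.0};
    for (uint64_t idx = 0; idx < sv.size(); ++idx) {
        uint64_t bits = idx & mask;
        int parity = 0;
        while (bits) { parity ^= 1; bits &= bits - 1; }  // clear lowest set bit
        sums[parity] += std::norm(sv[idx]);              // |sv[idx]|^2
    }
    return {sums[0], sums[1]};
}
```

With the returned pair, \(Pr(parity = 0)\) is first / (first + second).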

Depending on the measurement result, the state vector is collapsed. If parity is equal to 0, we obtain the following vector:

\[\ket{sv} = \dfrac{1}{\sqrt{norm}} \left\{ \dfrac{1}{2} \left( 1 + Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv},\]

and if parity is equal to 1, we obtain the following vector:

\[\ket{sv} = \dfrac{1}{\sqrt{norm}} \left\{ \dfrac{1}{2} \left( 1 - Z_{basisBits[0]} \otimes Z_{basisBits[1]} \otimes \cdots \otimes Z_{basisBits[nBasisBits-1]} \right) \right\} \Ket{sv},\]

where \(norm\) is the normalization factor.
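The collapse expressions above can be sketched on the host as follows. The helper names are hypothetical and the code is a reference for small inputs only; it assumes that \(norm\) is the sum of squared absolute values for the chosen parity, so that surviving elements are scaled by \(1/\sqrt{norm}\).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstdint>
#include <vector>

// Parity (0 or 1) of the bits of idx selected by mask.
static int parityOfMaskedBits(uint64_t idx, uint64_t mask) {
    uint64_t bits = idx & mask;
    int parity = 0;
    while (bits) { parity ^= 1; bits &= bits - 1; }  // clear lowest set bit
    return parity;
}

// Hypothetical host reference of the collapse: elements whose parity matches
// are scaled by 1/sqrt(norm), all other elements are set to zero.
void collapseOnZBasisHost(std::vector<std::complex<double>>& sv, int32_t parity,
                          const std::vector<int32_t>& basisBits, double norm) {
    assert(parity == 0 || parity == 1);
    assert(norm > 0.0);
    uint64_t mask = 0;
    for (int32_t b : basisBits) mask |= uint64_t{1} << b;
    const double scale = 1.0 / std::sqrt(norm);
    for (uint64_t idx = 0; idx < sv.size(); ++idx) {
        if (parityOfMaskedBits(idx, mask) == parity) sv[idx] *= scale;
        else sv[idx] = 0.0;
    }
}
```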

Use case

We can measure by custatevecMeasureOnZBasis() as follows:

// measure on a Z basis
custatevecMeasureOnZBasis(
    handle, sv, svDataType, nIndexBits, &parity, basisBits, nBasisBits,
    randnum, collapse);

The operation is equivalent to the following:

// compute the sums of squared absolute values of state vector elements
// on a Z product basis
double abs2sum0, abs2sum1;
custatevecAbs2SumOnZBasis(
    handle, sv, svDataType, nIndexBits, &abs2sum0, &abs2sum1, basisBits,
    nBasisBits);

// [User] compute parity and norm
double abs2sum = abs2sum0 + abs2sum1;
int parity = (randnum * abs2sum < abs2sum0) ? 0 : 1;
double norm = (parity == 0) ? abs2sum0 : abs2sum1;

// collapse if necessary
switch (collapse) {
case CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO:
    custatevecCollapseOnZBasis(
        handle, sv, svDataType, nIndexBits, parity, basisBits, nBasisBits,
        norm);
    break;  /* collapse */
case CUSTATEVEC_COLLAPSE_NONE:
    break;  /* Do nothing */
}

API reference

custatevecAbs2SumOnZBasis
custatevecStatus_t custatevecAbs2SumOnZBasis(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double *abs2sum0, double *abs2sum1, const int32_t *basisBits, const uint32_t nBasisBits)

Calculates the sum of squared absolute values on a given Z product basis.

This function calculates sums of squared absolute values on a given Z product basis. If a null pointer is passed for abs2sum0 or abs2sum1, the corresponding sum is not calculated. Since abs2sum0 + abs2sum1 is identical to the norm of the state vector, one can calculate the probability of parity == 0 as (abs2sum0 / (abs2sum0 + abs2sum1)).

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • abs2sum0[out] pointer to a host or device variable to store the sum of squared absolute values for parity == 0

  • abs2sum1[out] pointer to a host or device variable to store the sum of squared absolute values for parity == 1

  • basisBits[in] pointer to a host array of Z-basis index bits

  • nBasisBits[in] the number of basisBits


custatevecCollapseOnZBasis
custatevecStatus_t custatevecCollapseOnZBasis(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const int32_t parity, const int32_t *basisBits, const uint32_t nBasisBits, double norm)

Collapse state vector on a given Z product basis.

This function collapses the state vector on a given Z product basis. The state vector elements that match the parity argument are normalized with the norm argument, that is, scaled by \(1/\sqrt{norm}\) as in the expressions above. All other elements are set to zero.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • parity[in] parity, 0 or 1

  • basisBits[in] pointer to a host array of Z-basis index bits

  • nBasisBits[in] the number of Z basis bits

  • norm[in] normalization factor


custatevecMeasureOnZBasis
custatevecStatus_t custatevecMeasureOnZBasis(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, int32_t *parity, const int32_t *basisBits, const uint32_t nBasisBits, const double randnum, enum custatevecCollapseOp_t collapse)

Measurement on a given Z-product basis.

This function performs a measurement on a given Z product basis. The measurement result is the parity of the specified Z product basis.

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measurement result without collapsing the state vector. If CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses the state vector as custatevecCollapseOnZBasis() does.

If a random number is not in [0, 1), this function returns CUSTATEVEC_STATUS_INVALID_VALUE. At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • parity[out] parity, 0 or 1

  • basisBits[in] pointer to a host array of Z basis bits

  • nBasisBits[in] the number of Z basis bits

  • randnum[in] random number, [0, 1).

  • collapse[in] Collapse operation

Qubit Measurement

Assume that we measure an \(nIndexBits\)-qubit state vector \(sv\) with a \(bitOrderingLen\)-bit bit string \(bitOrdering\).

The sums of squared absolute values of state vector elements are obtained by the following:

\[abs2sum[idx] = \braket{sv|i}\braket{i|sv},\]

where \(idx = b_{bitOrderingLen-1}\cdots b_1 b_0\), \(i = b_{bitOrdering[bitOrderingLen-1]} \cdots b_{bitOrdering[1]} b_{bitOrdering[0]}\), \(b_p \in \{0, 1\}\).

Therefore, the probability of obtaining the \(idx\)-th pattern of bits is given by:

\[Pr(idx) = \dfrac{abs2sum[idx]}{\sum_{k}abs2sum[k]}.\]
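The index mapping above can be made concrete with a host-side reference for small state vectors. This is a sketch under assumptions: abs2SumArrayHost is a hypothetical name, the state vector is a host array, and bit \(k\) of \(idx\) is taken from index bit bitOrdering[k] of the state vector index, with all unlisted bits summed over.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host reference: abs2sum[idx] accumulates |sv[i]|^2 over all i
// whose index bits listed in bitOrdering spell out idx.
std::vector<double> abs2SumArrayHost(
    const std::vector<std::complex<double>>& sv,
    const std::vector<int32_t>& bitOrdering) {
    std::vector<double> abs2sum(std::size_t{1} << bitOrdering.size(), 0.0);
    for (uint64_t i = 0; i < sv.size(); ++i) {
        uint64_t idx = 0;
        for (std::size_t k = 0; k < bitOrdering.size(); ++k)
            idx |= ((i >> bitOrdering[k]) & 1u) << k;  // gather bit k of idx
        abs2sum[idx] += std::norm(sv[i]);
    }
    return abs2sum;
}
```

Dividing each entry by the sum of all entries gives \(Pr(idx)\). An empty bitOrdering yields a single entry holding the sum over the whole state vector.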

Depending on the measurement result, the state vector is collapsed.

If \(idx\) satisfies \((idx \ \& \ bitString) = idx\), we obtain \(sv[idx] = \dfrac{1}{\sqrt{norm}} sv[idx]\). Otherwise, \(sv[idx] = 0\), where \(norm\) is the normalization factor.

Use case

We can measure by custatevecBatchMeasure() as follows:

// measure with a bit string
custatevecBatchMeasure(
    handle, sv, svDataType, nIndexBits, bitString, bitOrdering, bitStringLen,
    randnum, collapse);

The operation is equivalent to the following:

// compute the sums of squared absolute values of state vector elements
int maskLen = 0;
int* maskBitString = nullptr;
int* maskOrdering = nullptr;

custatevecAbs2SumArray(
    handle, sv, svDataType, nIndexBits, abs2Sum, bitOrdering, bitOrderingLen,
    maskBitString, maskOrdering, maskLen);

// [User] compute a cumulative sum and choose bitString by a random number

// collapse if necessary
switch (collapse) {
case CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO:
    custatevecCollapseByBitString(
        handle, sv, svDataType, nIndexBits, bitString, bitOrdering,
        bitStringLen, norm);
    break;  /* collapse */
case CUSTATEVEC_COLLAPSE_NONE:
    break;  /* Do nothing */
}
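The step marked [User] in the listing above is left to the caller. One possible way to do it on the host is sketched below; chooseBitString is a hypothetical helper, the abs2sum array is assumed to have been copied to host memory, and the selection rule (first pattern whose cumulative sum exceeds randnum times the total) is an assumption that is not confirmed to match the library's internal choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: picks a bit pattern from abs2sum with a random number in
// [0, 1) and returns it as an array of 0/1 values, where bitString[k] is the
// outcome for index bit bitOrdering[k]. *norm receives abs2sum of the pattern.
std::vector<int32_t> chooseBitString(const std::vector<double>& abs2sum,
                                     uint32_t bitStringLen, double randnum,
                                     double* norm) {
    assert(abs2sum.size() == (std::size_t{1} << bitStringLen));
    assert(randnum >= 0.0 && randnum < 1.0);
    double total = 0.0;
    for (double v : abs2sum) total += v;
    const double threshold = randnum * total;
    std::size_t chosen = 0;
    double cumulative = 0.0;
    for (std::size_t idx = 0; idx < abs2sum.size(); ++idx) {
        if (abs2sum[idx] <= 0.0) continue;  // never pick a zero-probability pattern
        chosen = idx;                       // fallback guards against rounding
        cumulative += abs2sum[idx];
        if (threshold < cumulative) break;
    }
    if (norm) *norm = abs2sum[chosen];
    std::vector<int32_t> bitString(bitStringLen);
    for (uint32_t k = 0; k < bitStringLen; ++k)
        bitString[k] = static_cast<int32_t>((chosen >> k) & 1u);
    return bitString;
}
```

The returned array and norm can then be passed as the bitString and norm arguments of custatevecCollapseByBitString().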

For batched state vectors, custatevecAbs2SumArrayBatched(), custatevecCollapseByBitStringBatched(), and custatevecMeasureBatched() are available. Please refer to batched state vectors for the overview of batched state vector simulations.

For multi-GPU computations, custatevecBatchMeasureWithOffset() is available. This function works on one device, and users are required to compute the cumulative array of squared absolute values of state vector elements beforehand.

// The state vector is divided into nSubSvs sub state vectors.
// Each sub state vector has its own ordinal and nLocalBits index bits.
// The ordinals of sub state vectors correspond to the extended index bits.
// In this example, all the local qubits are measured and collapsed.

// get abs2sum for each sub state vector
double abs2SumArray[nSubSvs];
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecAbs2SumArray(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &abs2SumArray[iSv], nullptr,
        0, nullptr, nullptr, 0);
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get cumulative array
double cumulativeArray[nSubSvs + 1];
cumulativeArray[0] = 0.0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cumulativeArray[iSv + 1] = cumulativeArray[iSv] + abs2SumArray[iSv];
}

// measurement
for (int iSv = 0; iSv < nSubSvs; iSv++) {
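    // detect which sub state vector will be used for measurement.
    // Comparing randnum directly with the cumulative sums assumes that the
    // whole state vector is normalized, i.e. cumulativeArray[nSubSvs] is 1.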
    if (cumulativeArray[iSv] <= randnum && randnum < cumulativeArray[iSv + 1]) {
        double norm = cumulativeArray[nSubSvs];
        double offset = cumulativeArray[iSv];
        cudaSetDevice(devices[iSv]);
        // measure local qubits. Here the state vector will not be collapsed.
        // Only local qubits can be included in bitOrdering and bitString arguments.
        // That is, bitOrdering = {0, 1, 2, ..., nLocalBits - 1} and
        // bitString will store values of local qubits as an array of integers.
        custatevecBatchMeasureWithOffset(
            handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, bitString, bitOrdering,
            bitStringLen, randnum, CUSTATEVEC_COLLAPSE_NONE, offset, norm);
    }
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get abs2Sum of the elements that match the measured bitString,
// i.e. the elements that remain after the collapse
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecAbs2SumArray(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &abs2SumArray[iSv], nullptr,
        0, bitString, bitOrdering, bitStringLen);
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get norm after collapse
double norm = 0.0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    norm += abs2SumArray[iSv];
}

// collapse sub state vectors
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecCollapseByBitString(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, bitString, bitOrdering,
        bitStringLen, norm);
}

// destroy handle
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecDestroy(handle[iSv]);
}

Please refer to the NVIDIA/cuQuantum repository for further details.
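The host-side bookkeeping in the multi-GPU listing (building the cumulative array and locating the sub state vector that contains the random number) can be isolated as a small function. The sketch below is illustrative: SubSvChoice and selectSubSv are hypothetical names, and unlike the listing it scales randnum by the total sum, which is an assumption made so that the selection also works when the state vector is not normalized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the selection: which sub state vector to measure, plus the values
// to pass as the offset and abs2sum arguments of custatevecBatchMeasureWithOffset().
struct SubSvChoice {
    int index;      // ordinal of the selected sub state vector, -1 if none
    double offset;  // sum of abs2sum over sub state vectors with lower ordinals
    double total;   // sum of abs2sum over all sub state vectors
};

// Hypothetical helper operating on per-sub-state-vector sums held on the host.
SubSvChoice selectSubSv(const std::vector<double>& abs2sumPerSubSv, double randnum) {
    assert(randnum >= 0.0 && randnum < 1.0);
    double total = 0.0;
    for (double w : abs2sumPerSubSv) total += w;
    const double threshold = randnum * total;
    SubSvChoice choice{-1, 0.0, total};
    double offset = 0.0;
    for (std::size_t i = 0; i < abs2sumPerSubSv.size(); ++i) {
        const double w = abs2sumPerSubSv[i];
        if (w > 0.0) {
            choice.index = static_cast<int>(i);  // fallback guards against rounding
            choice.offset = offset;
            if (threshold < offset + w) break;
        }
        offset += w;
    }
    return choice;
}
```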

API reference

custatevecAbs2SumArray
custatevecStatus_t custatevecAbs2SumArray(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double *abs2sum, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen)

Calculate abs2sum array for a given set of index bits.

Calculates an array of sums of squared absolute values of state vector elements. The abs2sum array can be on host or device. The index bit ordering of the abs2sum array is specified by the bitOrdering and bitOrderingLen arguments. Unspecified bits are folded (summed up).

The maskBitString, maskOrdering and maskLen arguments set a bit mask on the state vector index. The abs2sum array is calculated from the state vector elements whose indices match the mask bit string. If the maskLen argument is 0, null pointers can be passed for the maskBitString and maskOrdering arguments, and all state vector elements are used in the calculation.

By definition, bit positions in bitOrdering and maskOrdering arguments should not overlap.

The empty bitOrdering can be specified to calculate the norm of state vector. In this case, 0 is passed to the bitOrderingLen argument and the bitOrdering argument can be a null pointer.

Note

Since the size of the abs2sum array is proportional to \( 2^{bitOrderingLen} \), the maximum length of bitOrdering depends on the amount of available memory and maskLen.
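The effect of the mask arguments can be illustrated for the case of an empty bitOrdering, where the result is a single sum. The sketch below is a host-side reference for small inputs; maskedAbs2SumHost is a hypothetical name, and it assumes that an element matches the mask when index bit maskOrdering[k] equals maskBitString[k] for every k.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host reference: sum of |sv[i]|^2 over the indices i whose bits
// at positions maskOrdering[k] equal maskBitString[k]. An empty mask selects
// every element.
double maskedAbs2SumHost(const std::vector<std::complex<double>>& sv,
                         const std::vector<int32_t>& maskBitString,
                         const std::vector<int32_t>& maskOrdering) {
    assert(maskBitString.size() == maskOrdering.size());
    double sum = 0.0;
    for (uint64_t i = 0; i < sv.size(); ++i) {
        bool match = true;
        for (std::size_t k = 0; k < maskOrdering.size(); ++k) {
            const int32_t bit = static_cast<int32_t>((i >> maskOrdering[k]) & 1u);
            if (bit != maskBitString[k]) { match = false; break; }
        }
        if (match) sum += std::norm(sv[i]);
    }
    return sum;
}
```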

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • abs2sum[out] pointer to a host or device array of sums of squared absolute values

  • bitOrdering[in] pointer to a host array of index bit ordering

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitString[in] pointer to a host array for a bit string to specify mask

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask


custatevecCollapseByBitString
custatevecStatus_t custatevecCollapseByBitString(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const int32_t *bitString, const int32_t *bitOrdering, const uint32_t bitStringLen, double norm)

Collapse state vector to the state specified by a given bit string.

This function collapses state vector to the state specified by a given bit string. The state vector elements specified by the bitString, bitOrdering and bitStringLen arguments are normalized by the norm argument. Other elements are set to zero.

At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • bitString[in] pointer to a host array of bit string

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bit string

  • norm[in] normalization constant


custatevecBatchMeasure
custatevecStatus_t custatevecBatchMeasure(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, int32_t *bitString, const int32_t *bitOrdering, const uint32_t bitStringLen, const double randnum, enum custatevecCollapseOp_t collapse)

Batched single qubit measurement.

This function does batched single qubit measurement and returns a bit string. The bitOrdering argument specifies index bits to be measured. The measurement result is stored in bitString in the ordering specified by the bitOrdering argument.

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measured bit string without collapsing the state vector. When CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses the state vector as custatevecCollapseByBitString() does.

If a random number is not in [0, 1), this function returns CUSTATEVEC_STATUS_INVALID_VALUE. At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Note

This API is for measuring a single state vector. For measuring batched state vectors, please use custatevecMeasureBatched(), whose arguments are passed in a different convention.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits

  • bitString[out] pointer to a host array of measured bit string

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bitString

  • randnum[in] random number, [0, 1).

  • collapse[in] Collapse operation


custatevecAbs2SumArrayBatched
custatevecStatus_t custatevecAbs2SumArrayBatched(custatevecHandle_t handle, const void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, const custatevecIndex_t svStride, double *abs2sumArrays, const custatevecIndex_t abs2sumArrayStride, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const custatevecIndex_t *maskBitStrings, const int32_t *maskOrdering, const uint32_t maskLen)

Calculate batched abs2sum array for a given set of index bits.

The batched version of custatevecAbs2SumArray() that calculates a batch of arrays that holds sums of squared absolute values from batched state vectors.

State vectors are placed on a single contiguous device memory chunk. The svStride argument specifies the distance between two adjacent state vectors. Thus, svStride should be equal to or larger than the state vector size.
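The stride constraint can be checked on the host before launching batched calls. The following sketch uses hypothetical helper names and assumes that svStride is counted in state vector elements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: svStride must be at least the state vector size 2^nIndexBits.
bool isValidSvStride(uint32_t nIndexBits, int64_t svStride) {
    assert(nIndexBits < 63);
    return svStride >= (int64_t{1} << nIndexBits);
}

// Hypothetical helper: element offset of the iSv-th state vector from the start
// of the batched device buffer.
int64_t svElementOffset(int64_t svStride, uint32_t iSv) {
    return svStride * static_cast<int64_t>(iSv);
}
```

For example, with nIndexBits = 3 a stride of 8 packs the state vectors back to back, while a stride of 16 leaves 8 unused elements after each one.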

The computed sums of squared absolute values are output to abs2sumArrays, which is a contiguous memory chunk. The abs2sumArrayStride argument specifies the distance between two adjacent abs2sum arrays. The batched abs2sum arrays can be on host or device. The index bit ordering of the abs2sum arrays in the batch is specified by the bitOrdering and bitOrderingLen arguments. Unspecified bits are folded (summed up).

The maskBitStrings, maskOrdering and maskLen arguments specify the bit mask for the index bits of batched state vectors. The abs2sum array is calculated from the state vector elements whose indices match the specified mask bit strings. The maskBitStrings argument specifies an array of mask values as integer bit masks that are applied to the state vector index.

If the maskLen argument is 0, null pointers can be specified to the maskBitStrings and maskOrdering arguments. In this case, all state vector elements are used without masks to compute the squared sum of absolute values.

By definition, bit positions in bitOrdering and maskOrdering arguments should not overlap.

The empty bitOrdering can be specified to calculate the norm of state vector. In this case, 0 is passed to the bitOrderingLen argument and the bitOrdering argument can be a null pointer.

Note

In this version, this API does not return any errors even if the maskBitStrings argument contains invalid bit strings. However, when applicable an error message would be printed to stdout.

Parameters
  • handle[in] the handle to the cuStateVec library

  • batchedSv[in] batch of state vectors

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits

  • nSVs[in] the number of state vectors in a batch

  • svStride[in] the stride of state vector

  • abs2sumArrays[out] pointer to a host or device array of sums of squared absolute values

  • abs2sumArrayStride[in] the distance between consecutive abs2sum arrays

  • bitOrdering[in] pointer to a host array of index bit ordering

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitStrings[in] pointer to a host or device array of mask bit strings

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask


custatevecCollapseByBitStringBatchedGetWorkspaceSize
custatevecStatus_t custatevecCollapseByBitStringBatchedGetWorkspaceSize(custatevecHandle_t handle, const uint32_t nSVs, const custatevecIndex_t *bitStrings, const double *norms, size_t *extraWorkspaceSizeInBytes)

This function gets the required workspace size for custatevecCollapseByBitStringBatched().

This function returns the required extra workspace size to execute custatevecCollapseByBitStringBatched(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required.

Note

The bitStrings and norms arrays are of the same size nSVs and can reside on either the host or the device, but their locations must remain the same when invoking custatevecCollapseByBitStringBatched(), or the computed workspace size may become invalid and lead to undefined behavior.

Parameters
  • handle[in] the handle to the cuStateVec context

  • nSVs[in] the number of batched state vectors

  • bitStrings[in] pointer to an array of bit strings, on either host or device

  • norms[in] pointer to an array of normalization constants, on either host or device

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecCollapseByBitStringBatched
custatevecStatus_t custatevecCollapseByBitStringBatched(custatevecHandle_t handle, void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, const custatevecIndex_t svStride, const custatevecIndex_t *bitStrings, const int32_t *bitOrdering, const uint32_t bitStringLen, const double *norms, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Collapse the batched state vectors to the state specified by a given bit string.

This function collapses all of the state vectors in a batch to the state specified by a given bit string. Batched state vectors are allocated in a single device memory chunk with the stride specified by the svStride argument. Each state vector size is \(2^\text{nIndexBits}\) and the number of state vectors is specified by the nSVs argument.

The i-th state vector’s elements, as specified by the i-th bitStrings element and the bitOrdering and bitStringLen arguments, are normalized by the i-th norms element. Other state vector elements are set to zero.

At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Note that bitOrdering and bitStringLen are applicable to all state vectors in the batch, while the bitStrings and norms arrays are of the same size nSVs and can reside on either the host or the device.

The bitStrings argument should hold integers in [0, \( 2^\text{bitStringLen} \)).

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nSVs and/or nIndexBits. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecCollapseByBitStringBatchedGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

In this version, custatevecCollapseByBitStringBatched() does not return an error if an invalid bitStrings or norms argument is specified. However, when applicable an error message would be printed to stdout.

Note

Unlike the non-batched version (custatevecCollapseByBitString()), in this batched version bitStrings are stored as an array with element type custatevecIndex_t; that is, each element is an integer representing a bit string in the binary form. This usage is in line with the custatevecSamplerSample() API. See the Bit Ordering section for further detail.

The bitStrings and norms arrays are of the same size nSVs and can reside on either the host or the device, but their locations must remain the same when invoking custatevecCollapseByBitStringBatchedGetWorkspaceSize(), or the computed workspace size may become invalid and lead to undefined behavior.
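Converting between the two bit string conventions is a matter of packing and unpacking bits. The sketch below is an illustration with hypothetical helper names; it assumes that bit k of the packed integer corresponds to element k of the 0/1 array used by the non-batched API, and it uses int64_t in place of custatevecIndex_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs an array of 0/1 values into one integer, with
// bitString[k] stored in bit k.
int64_t packBitString(const std::vector<int32_t>& bitString) {
    assert(bitString.size() < 63);
    int64_t packed = 0;
    for (std::size_t k = 0; k < bitString.size(); ++k) {
        assert(bitString[k] == 0 || bitString[k] == 1);
        packed |= static_cast<int64_t>(bitString[k]) << k;
    }
    return packed;
}

// Hypothetical helper: inverse of packBitString for a given length.
std::vector<int32_t> unpackBitString(int64_t packed, uint32_t bitStringLen) {
    std::vector<int32_t> bitString(bitStringLen);
    for (uint32_t k = 0; k < bitStringLen; ++k)
        bitString[k] = static_cast<int32_t>((packed >> k) & 1);
    return bitString;
}
```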

Parameters
  • handle[in] the handle to the cuStateVec library

  • batchedSv[inout] batched state vector allocated in one continuous memory chunk on device

  • svDataType[in] data type of the state vectors

  • nIndexBits[in] the number of index bits of the state vectors

  • nSVs[in] the number of batched state vectors

  • svStride[in] distance of two consecutive state vectors

  • bitStrings[in] pointer to an array of bit strings, on either host or device

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bit string

  • norms[in] pointer to an array of normalization constants on either host or device

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] size of the extra workspace


custatevecMeasureBatched
custatevecStatus_t custatevecMeasureBatched(custatevecHandle_t handle, void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, const custatevecIndex_t svStride, custatevecIndex_t *bitStrings, const int32_t *bitOrdering, const uint32_t bitStringLen, const double *randnums, enum custatevecCollapseOp_t collapse)

Single qubit measurements for batched state vectors.

This function measures bit strings of batched state vectors. The bitOrdering and bitStringLen arguments specify an integer array of index bit positions to be measured. The measurement results are returned in bitStrings, which is an array of 64-bit integer bit masks.

For example, when bitOrdering = {3, 1} is specified, this function measures two index bits. Bit 0 of each bitStrings element holds the measurement outcome of index bit 3, and bit 1 holds that of index bit 1.
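Reading an outcome back for a particular index bit follows directly from the bitOrdering example. The helper below is a hypothetical host-side sketch, with int64_t standing in for custatevecIndex_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the measured value (0 or 1) of `indexBit` from
// one element of bitStrings, or -1 if that index bit was not measured.
int outcomeOfIndexBit(int64_t measured, const std::vector<int32_t>& bitOrdering,
                      int32_t indexBit) {
    for (std::size_t k = 0; k < bitOrdering.size(); ++k) {
        if (bitOrdering[k] == indexBit)
            return static_cast<int>((measured >> k) & 1);  // bit k <-> bitOrdering[k]
    }
    return -1;
}
```

With bitOrdering = {3, 1} and a measured value of 1 (binary 01), index bit 3 was measured as 1 and index bit 1 as 0.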

Batched state vectors are given in a single contiguous memory chunk where state vectors are placed at the distance specified by svStride. The svStride is expressed in the number of elements.

The randnums argument stores the random numbers used for measurements. The number of random numbers is identical to nSVs, and values should be in [0, 1). Any random number outside this range is clipped to [0, 1).

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measured bit strings without collapsing the state vector. When CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses state vectors. After collapse of state vectors, the norms of all state vectors will be 1.

Note

This API is for measuring batched state vectors. For measuring a single state vector, custatevecBatchMeasure() is also available, whose arguments are passed in a different convention.

Parameters
  • handle[in] the handle to the cuStateVec library

  • batchedSv[inout] batched state vectors

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits

  • nSVs[in] the number of state vectors in the batched state vector

  • svStride[in] the distance between state vectors in the batch

  • bitStrings[out] pointer to a host or device array of measured bit strings

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bitString

  • randnums[in] pointer to a host or device array of random numbers.

  • collapse[in] Collapse operation


custatevecBatchMeasureWithOffset
custatevecStatus_t custatevecBatchMeasureWithOffset(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, int32_t *bitString, const int32_t *bitOrdering, const uint32_t bitStringLen, const double randnum, enum custatevecCollapseOp_t collapse, const double offset, const double abs2sum)

Batched single qubit measurement for partial vector.

This function does batched single qubit measurement and returns a bit string. The bitOrdering argument specifies index bits to be measured. The measurement result is stored in bitString in the ordering specified by the bitOrdering argument.

If CUSTATEVEC_COLLAPSE_NONE is specified for the collapse argument, this function only returns the measured bit string without collapsing the state vector. When CUSTATEVEC_COLLAPSE_NORMALIZE_AND_ZERO is specified, this function collapses the state vector as custatevecCollapseByBitString() does.

This function assumes that sv is a partial state vector in which some of the most significant index bits are dropped. The prefix sum for lower indices and the sum for the entire state vector must be provided as offset and abs2sum, respectively. When offset == abs2sum == 0, this function behaves in the same way as custatevecBatchMeasure().

If a random number is not in [0, 1), this function returns CUSTATEVEC_STATUS_INVALID_VALUE. At least one basis bit should be specified, otherwise this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] partial state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits

  • bitString[out] pointer to a host array of measured bit string

  • bitOrdering[in] pointer to a host array of bit string ordering

  • bitStringLen[in] length of bitString

  • randnum[in] random number, [0, 1).

  • collapse[in] Collapse operation

  • offset[in] partial sum of squared absolute values

  • abs2sum[in] sum of squared absolute values for the entire state vector

Expectation

Expectation via a Matrix

Expectation performs the following operation:

\[\langle A \rangle = \bra{\phi}A\ket{\phi},\]

where \(\ket{\phi}\) is a state vector and \(A\) is a matrix or an observable. The expectation API, custatevecComputeExpectation(), may require external workspace for large matrices, and custatevecComputeExpectationGetWorkspaceSize() provides the size of the workspace. If a device memory handler is set, custatevecComputeExpectationGetWorkspaceSize() can be skipped.

custatevecComputeExpectationBatchedGetWorkspaceSize() and custatevecComputeExpectationBatched() can compute expectation values for batched state vectors. Please refer to batched state vectors for the overview of batched state vector simulations.
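For checking results on small inputs, \(\bra{\phi}A\ket{\phi}\) can be evaluated on the host. The sketch below handles only a 2x2 matrix acting on a single basis bit; expectation1qHost is a hypothetical name, and the matrix is assumed to be stored in row-major order in host memory.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

// Hypothetical host reference: <phi|A|phi> for a 2x2 row-major matrix A that
// acts on index bit `basisBit` of the state vector sv.
std::complex<double> expectation1qHost(
    const std::vector<std::complex<double>>& sv,
    const std::vector<std::complex<double>>& matrix, int32_t basisBit) {
    assert(matrix.size() == 4);
    const uint64_t stride = uint64_t{1} << basisBit;
    assert(stride < sv.size());
    std::complex<double> acc = 0.0;
    for (uint64_t i = 0; i < sv.size(); ++i) {
        if (i & stride) continue;  // visit each amplitude pair once
        const std::complex<double> a0 = sv[i];
        const std::complex<double> a1 = sv[i | stride];
        const std::complex<double> b0 = matrix[0] * a0 + matrix[1] * a1;
        const std::complex<double> b1 = matrix[2] * a0 + matrix[3] * a1;
        acc += std::conj(a0) * b0 + std::conj(a1) * b1;
    }
    return acc;
}
```

For a normalized state and a Hermitian matrix, the imaginary part of the result should be zero up to rounding.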

Use case

// check the size of external workspace
custatevecComputeExpectationGetWorkspaceSize(
    handle, svDataType, nIndexBits, matrix, matrixDataType, layout, nBasisBits, computeType,
    &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// perform expectation
custatevecComputeExpectation(
    handle, sv, svDataType, nIndexBits, expect, expectDataType, residualNorm,
    matrix, matrixDataType, layout, basisBits, nBasisBits, computeType,
    extraWorkspace, extraWorkspaceSizeInBytes);

API reference

custatevecComputeExpectationGetWorkspaceSize
custatevecStatus_t custatevecComputeExpectationGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nBasisBits, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

This function gets the required workspace size for custatevecComputeExpectation().

This function returns the size of the extra workspace required to execute custatevecComputeExpectation(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required.

Parameters
  • handle[in] the handle to the cuStateVec context

  • svDataType[in] Data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • nBasisBits[in] the number of target bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspaceSizeInBytes[out] size of the extra workspace


custatevecComputeExpectation
custatevecStatus_t custatevecComputeExpectation(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, void *expectationValue, cudaDataType_t expectationDataType, double *residualNorm, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t *basisBits, const uint32_t nBasisBits, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Compute expectation of matrix observable.

This function calculates expectation for a given matrix observable. The acceptable values for the expectationDataType argument are CUDA_R_64F and CUDA_C_64F.

The basisBits and nBasisBits arguments specify the basis to calculate expectation. For the computeType argument, the same combinations for custatevecApplyMatrix() are available.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nBasisBits. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecComputeExpectationGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

The residualNorm argument is not available in this version. If the matrix given by the matrix argument may not be Hermitian, specify CUDA_C_64F for the expectationDataType argument and check the imaginary part of the calculated expectation value.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • expectationValue[out] host pointer to a variable to store an expectation value

  • expectationDataType[in] data type of expectationValue

  • residualNorm[out] result of matrix type test

  • matrix[in] observable as matrix

  • matrixDataType[in] data type of matrix

  • layout[in] matrix memory layout

  • basisBits[in] pointer to a host array of basis index bits

  • nBasisBits[in] the number of basis bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspace[in] pointer to an extra workspace

  • extraWorkspaceSizeInBytes[in] the size of extra workspace

Note

This function might be asynchronous with respect to host depending on the arguments. Please use cudaStreamSynchronize (for synchronization) or cudaStreamWaitEvent (for establishing the stream order) with the stream set to the handle of the current device before using the results stored in expectationValue.
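For small systems, the value computed by custatevecComputeExpectation() can be cross-checked on the host. The following is a minimal sketch, not the library's implementation. The name referenceExpectation is hypothetical, the matrix is assumed to be row-major, and basisBits[0] is assumed to correspond to the least significant bit of the matrix index. It returns a complex value, so a non-zero imaginary part signals a non-Hermitian matrix, as described in the note above.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using Complex = std::complex<double>;

// Host reference for <psi|M|psi> (hypothetical helper, not a cuStateVec API).
// `matrix` is row-major with dimension 2^nBasisBits; basisBits[0] is assumed
// to map to the least significant bit of the matrix row/column index.
Complex referenceExpectation(const std::vector<Complex>& sv,
                             const std::vector<Complex>& matrix,
                             const std::vector<int32_t>& basisBits) {
    const std::size_t dim = std::size_t{1} << basisBits.size();
    assert(matrix.size() == dim * dim);

    Complex sum(0.0, 0.0);
    for (std::size_t i = 0; i < sv.size(); ++i) {
        // Gather the basis bits of i into the column index and clear them in base.
        std::size_t col = 0;
        std::size_t base = i;
        for (std::size_t k = 0; k < basisBits.size(); ++k) {
            const std::size_t bit = (i >> basisBits[k]) & 1u;
            col |= bit << k;
            base &= ~(std::size_t{1} << basisBits[k]);
        }
        for (std::size_t row = 0; row < dim; ++row) {
            // Scatter the row index back onto the basis bits to obtain index j.
            std::size_t j = base;
            for (std::size_t k = 0; k < basisBits.size(); ++k)
                j |= ((row >> k) & 1u) << basisBits[k];
            sum += std::conj(sv[j]) * matrix[row * dim + col] * sv[i];
        }
    }
    return sum;
}
```

For a Hermitian observable the imaginary part of the result is zero up to rounding, which is why CUDA_R_64F is sufficient in that case.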


custatevecComputeExpectationBatchedGetWorkspaceSize
custatevecStatus_t custatevecComputeExpectationBatchedGetWorkspaceSize(custatevecHandle_t handle, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, const custatevecIndex_t svStride, const void *matrices, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nMatrices, const uint32_t nBasisBits, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

This function gets the required workspace size for custatevecComputeExpectationBatched().

This function returns the size of the extra workspace required to execute custatevecComputeExpectationBatched(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required.

Parameters
  • handle[in] the handle to the cuStateVec context

  • svDataType[in] Data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • nSVs[in] the number of state vectors

  • svStride[in] distance of two consecutive state vectors

  • matrices[in] pointer to allocated matrices in one contiguous memory chunk on host or device

  • matrixDataType[in] data type of matrices

  • layout[in] enumerator specifying the memory layout of matrix

  • nMatrices[in] the number of matrices

  • nBasisBits[in] the number of basis bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspaceSizeInBytes[out] size of the extra workspace


custatevecComputeExpectationBatched
custatevecStatus_t custatevecComputeExpectationBatched(custatevecHandle_t handle, const void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, custatevecIndex_t svStride, double2 *expectationValues, const void *matrices, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nMatrices, const int32_t *basisBits, const uint32_t nBasisBits, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Compute the expectation values of matrix observables for each of the batched state vectors.

This function computes the expectation values of the given matrix observables for each of the batched state vectors given by the batchedSv argument. Batched state vectors are allocated in a single device memory chunk with the stride specified by the svStride argument. The size of each state vector is \(2^\text{nIndexBits}\), and the number of state vectors is specified by the nSVs argument.

The expectationValues argument points to a single memory chunk that receives the expectation values. This API returns values in double precision (complex128) regardless of the input data types. The output array size is ( \(\text{nMatrices} \times \text{nSVs}\) ) and its leading dimension is nMatrices.

The matrices argument is a host or device pointer to an array of square matrices stored contiguously. The size of matrices is ( \(\text{nMatrices} \times 2^\text{nBasisBits} \times 2^\text{nBasisBits}\) ) and the value type is specified by the matrixDataType argument. The layout argument specifies the matrix layout which can be in either row-major or column-major order.
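The layout described above can be summarized with a few index helpers. This is a sketch under stated assumptions: the helper names are hypothetical, offsets are counted in elements rather than bytes, and "leading dimension is nMatrices" is read as the matrix index varying fastest in expectationValues.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Offset (in elements) of the first amplitude of state vector iSv in batchedSv.
inline std::size_t stateVectorOffset(std::size_t iSv, std::size_t svStride) {
    return iSv * svStride;
}

// Offset (in elements) of the first entry of matrix iMatrix in matrices.
inline std::size_t matrixOffset(std::size_t iMatrix, uint32_t nBasisBits) {
    const std::size_t dim = std::size_t{1} << nBasisBits;
    return iMatrix * dim * dim;
}

// Position of the result for (iSv, iMatrix) in expectationValues, assuming
// the matrix index varies fastest (leading dimension nMatrices).
inline std::size_t expectationIndex(std::size_t iSv, std::size_t iMatrix,
                                    std::size_t nMatrices) {
    assert(iMatrix < nMatrices);
    return iSv * nMatrices + iMatrix;
}
```

With nMatrices = 3, for example, the result for state vector 2 and matrix 1 would be read from expectationValues[7] under this reading.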

The basisBits and nBasisBits arguments specify the basis on which the expectation values are calculated. For the computeType argument, the same combinations as for custatevecComputeExpectation() are available.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nBasisBits. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The size of required extra workspace is obtained by calling custatevecComputeExpectationBatchedGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Parameters
  • handle[in] the handle to the cuStateVec library

  • batchedSv[in] batched state vector allocated in one continuous memory chunk on device

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • nSVs[in] the number of state vectors

  • svStride[in] distance of two consecutive state vectors

  • expectationValues[out] pointer to a host array to store expectation values

  • matrices[in] pointer to allocated matrices in one contiguous memory chunk on host or device

  • matrixDataType[in] data type of matrices

  • layout[in] matrix memory layout

  • nMatrices[in] the number of matrices

  • basisBits[in] pointer to a host array of basis index bits

  • nBasisBits[in] the number of basis bits

  • computeType[in] computeType of matrix multiplication

  • extraWorkspace[in] pointer to an extra workspace

  • extraWorkspaceSizeInBytes[in] the size of extra workspace

Expectation on Pauli Basis

cuStateVec API custatevecComputeExpectationsOnPauliBasis() computes expectation values for a batch of Pauli strings. Each observable can be expressed as follows:

\[P_{\text{basisBits}[0]} \otimes P_{\text{basisBits}[1]} \otimes \cdots \otimes P_{\text{basisBits}[\text{nBasisBits}-1]}.\]

Each matrix \(P_{\text{basisBits}[i]}\) can be one of the Pauli matrices \(I\), \(X\), \(Y\), and \(Z\), corresponding to the custatevecPauli_t enums CUSTATEVEC_PAULI_I, CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y, and CUSTATEVEC_PAULI_Z, respectively. Also refer to custatevecPauli_t for details.

Use case

// calculate the norm and the expectations for Z(q1) and X(q0)Y(q2)

const uint32_t nPauliOperatorArrays = 3;  // const so it can size the result array below
custatevecPauli_t pauliOperators0[] = {};                                       // III
int32_t           basisBits0[]      = {};
custatevecPauli_t pauliOperators1[] = {CUSTATEVEC_PAULI_Z};                     // IZI
int32_t           basisBits1[]      = {1};
custatevecPauli_t pauliOperators2[] = {CUSTATEVEC_PAULI_X, CUSTATEVEC_PAULI_Y}; // XIY
int32_t           basisBits2[]      = {0, 2};
const uint32_t nBasisBitsArray[] = {0, 1, 2};

const custatevecPauli_t*
  pauliOperatorsArray[] = {pauliOperators0, pauliOperators1, pauliOperators2};
const int32_t *basisBitsArray[] = { basisBits0, basisBits1, basisBits2};

uint32_t nIndexBits = 3;
double expectationValues[nPauliOperatorArrays];

custatevecComputeExpectationsOnPauliBasis(
    handle, sv, svDataType, nIndexBits, expectationValues,
    pauliOperatorsArray, nPauliOperatorArrays,
    basisBitsArray, nBasisBitsArray);

API reference

custatevecComputeExpectationsOnPauliBasis
custatevecStatus_t custatevecComputeExpectationsOnPauliBasis(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, double *expectationValues, const custatevecPauli_t **pauliOperatorsArray, const uint32_t nPauliOperatorArrays, const int32_t **basisBitsArray, const uint32_t *nBasisBitsArray)

Calculate expectation values for a batch of (multi-qubit) Pauli operators.

This function calculates multiple expectation values for given sequences of Pauli operators by a single call.

A single Pauli operator sequence, pauliOperators, is represented by using an array of custatevecPauli_t. The basis bits on which these Pauli operators are acting are represented by an array of index bit positions. If no Pauli operator is specified for an index bit, the identity operator (CUSTATEVEC_PAULI_I) is implicitly assumed.

The pauliOperators and basisBits arrays have the same length, which is specified by nBasisBits.

The number of Pauli operator sequences is specified by the nPauliOperatorArrays argument.

Multiple sequences of Pauli operators are represented in the form of arrays of arrays in the following manners:

  • The pauliOperatorsArray argument is an array for arrays of custatevecPauli_t.

  • The basisBitsArray is an array of the arrays of basis bit positions.

  • The nBasisBitsArray argument holds an array of the length of Pauli operator sequences and basis bit arrays.

Calculated expectation values are stored in a host buffer specified by the expectationValues argument, whose length is nPauliOperatorArrays.

This function returns CUSTATEVEC_STATUS_INVALID_VALUE if the basis bits specified for a Pauli operator sequence contain duplicates and/or are out of the range [0, nIndexBits).

This function accepts an empty Pauli operator sequence to get the norm of the state vector.
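The semantics of a single Pauli operator sequence can be illustrated with a host-side reference. This is a minimal sketch, not the library's implementation. The Pauli enum is a hypothetical stand-in for custatevecPauli_t, and referencePauliExpectation is an illustrative name. Each basis state \(|i\rangle\) is mapped by the Pauli string to a single basis state \(|j\rangle\) with a phase, so the expectation value is a single pass over the state vector. With an empty sequence the function returns \(\langle\psi|\psi\rangle\), the sum of the squared amplitudes.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using Complex = std::complex<double>;

// Hypothetical stand-in for custatevecPauli_t, used only in this sketch.
enum class Pauli { I, X, Y, Z };

// Host reference for <psi|P|psi>, where P acts with paulis[k] on index bit
// basisBits[k] and with the identity on every other index bit.
double referencePauliExpectation(const std::vector<Complex>& sv,
                                 const std::vector<Pauli>& paulis,
                                 const std::vector<int32_t>& basisBits) {
    assert(paulis.size() == basisBits.size());
    const Complex imagUnit(0.0, 1.0);
    Complex sum(0.0, 0.0);
    for (std::size_t i = 0; i < sv.size(); ++i) {
        std::size_t j = i;          // P|i> = phase * |j>
        Complex phase(1.0, 0.0);
        for (std::size_t k = 0; k < paulis.size(); ++k) {
            const std::size_t mask = std::size_t{1} << basisBits[k];
            const bool bitSet = (i & mask) != 0;
            switch (paulis[k]) {
                case Pauli::I: break;
                case Pauli::X: j ^= mask; break;
                case Pauli::Y: j ^= mask; phase *= bitSet ? -imagUnit : imagUnit; break;
                case Pauli::Z: if (bitSet) phase = -phase; break;
            }
        }
        sum += std::conj(sv[j]) * phase * sv[i];
    }
    return sum.real();
}
```

Because a Pauli string is Hermitian, the accumulated sum is real up to rounding, so only the real part is returned.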

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] data type of the state vector

  • nIndexBits[in] the number of index bits of the state vector

  • expectationValues[out] pointer to a host array to store expectation values

  • pauliOperatorsArray[in] pointer to a host array of Pauli operator arrays

  • nPauliOperatorArrays[in] the number of Pauli operator arrays

  • basisBitsArray[in] host array of basis bit arrays

  • nBasisBitsArray[in] host array of the number of basis bits

Matrix property testing

The API custatevecTestMatrixType() is available to check the properties of matrices.

If a matrix \(A\) is unitary, \(AA^{\dagger} = A^{\dagger}A = I\), where \(A^{\dagger}\) is the conjugate transpose of \(A\) and \(I\) is the identity matrix.

When CUSTATEVEC_MATRIX_TYPE_UNITARY is given for its argument, this API computes the 1-norm \(||R||_1 = \sum{|r_{ij}|}\), where \(R = AA^{\dagger} - I\). This value will be approximately zero if \(A\) is unitary.

If a matrix \(A\) is Hermitian, \(A^{\dagger} = A\).

When CUSTATEVEC_MATRIX_TYPE_HERMITIAN is given for its argument, this API computes the squared 2-norm \(||R||_2^2 = \sum{|r_{ij}|^2}\), where \(R = (A - A^{\dagger}) / 2\). This value will be approximately zero if \(A\) is Hermitian.

The API may require an external workspace for large matrices, and custatevecTestMatrixTypeGetWorkspaceSize() provides the required workspace size. If a device memory handler is set, users do not need to provide an explicit workspace.
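The two residuals defined above can be reproduced on the host for small matrices, which helps when choosing a tolerance for residualNorm. The following is a sketch only: unitaryResidual and hermitianResidual are hypothetical names, the matrix is assumed to be row-major, and the library may accumulate in a different order or precision.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Sum of |r_ij| for R = A A^dagger - I, with A stored row-major (dim x dim).
double unitaryResidual(const std::vector<Complex>& a, std::size_t dim) {
    assert(a.size() == dim * dim);
    double residual = 0.0;
    for (std::size_t i = 0; i < dim; ++i) {
        for (std::size_t j = 0; j < dim; ++j) {
            // Start from -I and add (A A^dagger)_ij = sum_k A_ik * conj(A_jk).
            Complex r = (i == j) ? Complex(-1.0, 0.0) : Complex(0.0, 0.0);
            for (std::size_t k = 0; k < dim; ++k)
                r += a[i * dim + k] * std::conj(a[j * dim + k]);
            residual += std::abs(r);
        }
    }
    return residual;
}

// Sum of |r_ij|^2 for R = (A - A^dagger) / 2, with A stored row-major.
double hermitianResidual(const std::vector<Complex>& a, std::size_t dim) {
    assert(a.size() == dim * dim);
    double residual = 0.0;
    for (std::size_t i = 0; i < dim; ++i) {
        for (std::size_t j = 0; j < dim; ++j) {
            const Complex r = (a[i * dim + j] - std::conj(a[j * dim + i])) / 2.0;
            residual += std::norm(r);  // std::norm returns |r|^2
        }
    }
    return residual;
}
```

For the non-unitary, non-Hermitian matrix with a single 1 in the upper-right corner of a 2x2 matrix, these helpers give a unitary residual of 1 and a Hermitian residual of 0.5.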

Use case

double residualNorm;

void* extraWorkspace = nullptr;
size_t extraWorkspaceSizeInBytes = 0;

// check the size of external workspace
custatevecTestMatrixTypeGetWorkspaceSize(
    handle, matrixType, matrix, matrixDataType, layout,
    nTargets, adjoint, computeType, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// execute testing
custatevecTestMatrixType(
    handle, &residualNorm, matrixType, matrix, matrixDataType, layout,
    nTargets, adjoint, computeType, extraWorkspace, extraWorkspaceSizeInBytes);

API reference

custatevecTestMatrixTypeGetWorkspaceSize

custatevecStatus_t custatevecTestMatrixTypeGetWorkspaceSize(custatevecHandle_t handle, custatevecMatrixType_t matrixType, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nTargets, const int32_t adjoint, custatevecComputeType_t computeType, size_t *extraWorkspaceSizeInBytes)

Get extra workspace size for custatevecTestMatrixType()

This function gets the size of an extra workspace required to execute custatevecTestMatrixType(). extraWorkspaceSizeInBytes will be set to 0 if no extra buffer is required.

Parameters
  • handle[in] the handle to cuStateVec library

  • matrixType[in] matrix type

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • nTargets[in] the number of target bits, up to 15

  • adjoint[in] flag to control whether the adjoint of matrix is tested

  • computeType[in] compute type

  • extraWorkspaceSizeInBytes[out] workspace size


custatevecTestMatrixType

custatevecStatus_t custatevecTestMatrixType(custatevecHandle_t handle, double *residualNorm, custatevecMatrixType_t matrixType, const void *matrix, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const uint32_t nTargets, const int32_t adjoint, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Test the deviation of a given matrix from a Hermitian (or Unitary) matrix.

This function tests if the type of a given matrix matches the type given by the matrixType argument.

For tests for the unitary type, \( R = (AA^{\dagger} - I) \) is calculated where \( A \) is the given matrix. The sum of absolute values of \( R \) matrix elements is returned.

For tests for the Hermitian type, \( R = (A - A^{\dagger}) / 2 \) is calculated where \( A \) is the given matrix. The sum of squared absolute values of \( R \) matrix elements is returned.

This function may return CUSTATEVEC_STATUS_INSUFFICIENT_WORKSPACE for large nTargets. In such cases, the extraWorkspace and extraWorkspaceSizeInBytes arguments should be specified to provide extra workspace. The required size of an extra workspace is obtained by calling custatevecTestMatrixTypeGetWorkspaceSize(). A null pointer can be passed to the extraWorkspace argument if no extra workspace is required. Also, if a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Note

The nTargets argument must be no more than 15 in this version. For larger nTargets, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to cuStateVec library

  • residualNorm[out] host pointer, to store the deviation from certain matrix type

  • matrixType[in] matrix type

  • matrix[in] host or device pointer to a matrix

  • matrixDataType[in] data type of matrix

  • layout[in] enumerator specifying the memory layout of matrix

  • nTargets[in] the number of target bits, up to 15

  • adjoint[in] flag to control whether the adjoint of matrix is tested

  • computeType[in] compute type

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size

Sampling

Sampling makes it possible to obtain measurement results many times by using the probabilities calculated from a quantum state.

Use case

// create sampler and check the size of external workspace
custatevecSamplerCreate(
    handle, sv, svDataType, nIndexBits, &sampler, nMaxShots,
    &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
void* extraWorkspace = nullptr;
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// calculate cumulative abs2sum
custatevecSamplerPreprocess(
    handle, sampler, extraWorkspace, extraWorkspaceSizeInBytes);

// [User] generate randnums, array of random numbers [0, 1) for sampling
...

// sample bit strings
custatevecSamplerSample(
    handle, sampler, bitStrings, bitOrdering, bitStringLen, randnums, nShots,
    output);

// deallocate the sampler
custatevecSamplerDestroy(sampler);

For multi-GPU computations, cuStateVec provides custatevecSamplerGetSquaredNorm() and custatevecSamplerApplySubSVOffset(). Users are required to calculate a cumulative abs2sum array from the squared norm of each sub state vector, obtained via custatevecSamplerGetSquaredNorm(), and to provide its values to the sampler descriptor via custatevecSamplerApplySubSVOffset().

// The state vector is divided into nSubSvs sub state vectors.
// Each sub state vector has its own ordinal and nLocalBits index bits.
// The ordinals of sub state vectors correspond to the extended index bits.

// create sampler and check the size of external workspace
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerCreate(
        handle[iSv], d_sv[iSv], CUDA_C_64F, nLocalBits, &sampler[iSv], nMaxShots,
        &extraWorkspaceSizeInBytes[iSv]);
}

// allocate external workspace if necessary
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    if (extraWorkspaceSizeInBytes[iSv] > 0) {
        cudaSetDevice(devices[iSv]);
        cudaMalloc(&extraWorkspace[iSv], extraWorkspaceSizeInBytes[iSv]);
    }
}

// sample preprocess
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerPreprocess(
        handle[iSv], sampler[iSv], extraWorkspace[iSv],
        extraWorkspaceSizeInBytes[iSv]);
}

// get norm of the sub state vectors
double subNorms[nSubSvs];
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerGetSquaredNorm(
        handle[iSv], sampler[iSv], &subNorms[iSv]);
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    cudaDeviceSynchronize();
}

// get cumulative array
double cumulativeArray[nSubSvs + 1];
cumulativeArray[0] = 0.0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cumulativeArray[iSv + 1] = cumulativeArray[iSv] + subNorms[iSv];
}
double norm = cumulativeArray[nSubSvs];

// apply offset and norm
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerApplySubSVOffset(
        handle[iSv], sampler[iSv], iSv, nSubSvs, cumulativeArray[iSv], norm);
}

// divide randnum array. randnums must be sorted in the ascending order.
int shotOffsets[nSubSvs + 1];
shotOffsets[0] = 0;
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    double* pos = std::lower_bound(randnums, randnums + nShots,
                                    cumulativeArray[iSv + 1] / norm);
    if (iSv == nSubSvs - 1) {
        pos = randnums + nShots;
    }
    shotOffsets[iSv + 1] = pos - randnums;
}

// sample bit strings
for (int iSv = 0; iSv < nSubSvs; iSv++) {
    int shotOffset = shotOffsets[iSv];
    int nSubShots = shotOffsets[iSv + 1] - shotOffsets[iSv];
    if (nSubShots > 0) {
        cudaSetDevice(devices[iSv]);
        custatevecSamplerSample(
            handle[iSv], sampler[iSv], &bitStrings[shotOffset], bitOrdering,
            bitStringLen, &randnums[shotOffset], nSubShots,
            CUSTATEVEC_SAMPLER_OUTPUT_RANDNUM_ORDER);
    }
}

for (int iSv = 0; iSv < nSubSvs; iSv++) {
    cudaSetDevice(devices[iSv]);
    custatevecSamplerDestroy(sampler[iSv]);
}

Please refer to NVIDIA/cuQuantum repository for further detail.
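The step that divides the sorted random numbers among the sub state vectors can be isolated and checked on the host without any GPU. The sketch below mirrors the std::lower_bound logic of the listing above. The function name divideShots is hypothetical, and std::vector replaces the raw arrays for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns nSubSvs + 1 offsets. Sub state vector iSv handles the random
// numbers in the half-open range [offsets[iSv], offsets[iSv + 1]).
std::vector<std::size_t> divideShots(const std::vector<double>& subNorms,
                                     const std::vector<double>& sortedRandnums) {
    assert(!subNorms.empty());
    assert(std::is_sorted(sortedRandnums.begin(), sortedRandnums.end()));
    const std::size_t nSubSvs = subNorms.size();

    // Cumulative sum of the squared norms of the sub state vectors.
    std::vector<double> cumulative(nSubSvs + 1, 0.0);
    for (std::size_t iSv = 0; iSv < nSubSvs; ++iSv)
        cumulative[iSv + 1] = cumulative[iSv] + subNorms[iSv];
    const double norm = cumulative[nSubSvs];

    std::vector<std::size_t> offsets(nSubSvs + 1, 0);
    for (std::size_t iSv = 0; iSv < nSubSvs; ++iSv) {
        auto pos = std::lower_bound(sortedRandnums.begin(), sortedRandnums.end(),
                                    cumulative[iSv + 1] / norm);
        // The last sub state vector takes all remaining random numbers.
        if (iSv == nSubSvs - 1) pos = sortedRandnums.end();
        offsets[iSv + 1] = static_cast<std::size_t>(pos - sortedRandnums.begin());
    }
    return offsets;
}
```

Assigning the remainder to the last sub state vector guards against rounding in the normalized cumulative sum leaving a random number unassigned.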

API reference

custatevecSamplerCreate

custatevecStatus_t custatevecSamplerCreate(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecSamplerDescriptor_t *sampler, uint32_t nMaxShots, size_t *extraWorkspaceSizeInBytes)

Create sampler descriptor.

This function creates a sampler descriptor. If an extra workspace is required, its size is set to extraWorkspaceSizeInBytes.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] pointer to state vector

  • svDataType[in] data type of state vector

  • nIndexBits[in] the number of index bits of the state vector

  • sampler[out] pointer to a new sampler descriptor

  • nMaxShots[in] the max number of shots used for this sampler context

  • extraWorkspaceSizeInBytes[out] workspace size

Note

The max value of nMaxShots is \(2^{31} - 1\). If the value exceeds the limit, custatevecSamplerCreate() returns CUSTATEVEC_STATUS_INVALID_VALUE.


custatevecSamplerPreprocess

custatevecStatus_t custatevecSamplerPreprocess(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, void *extraWorkspace, const size_t extraWorkspaceSizeInBytes)

Preprocess the state vector for preparation of sampling.

This function prepares internal states of the sampler descriptor. If a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0. Otherwise, the pointer passed to the extraWorkspace argument is associated with the sampler descriptor and must remain valid during its lifetime. The size of extraWorkspace is obtained when custatevecSamplerCreate() is called.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[inout] the sampler descriptor

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] size of the extra workspace


custatevecSamplerGetSquaredNorm

custatevecStatus_t custatevecSamplerGetSquaredNorm(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, double *norm)

Get the squared norm of the state vector.

This function returns the squared norm of the state vector. An intended use case is sampling with multiple devices. This API should be called after custatevecSamplerPreprocess(). Otherwise, the behavior of this function is undefined.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[in] the sampler descriptor

  • norm[out] the squared norm of the state vector


custatevecSamplerApplySubSVOffset

custatevecStatus_t custatevecSamplerApplySubSVOffset(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, int32_t subSVOrd, uint32_t nSubSVs, double offset, double norm)

Apply the partial norm and the norm of the state vector to the sampler descriptor.

This function applies offsets assuming the given state vector is a sub state vector. An intended use case is sampling with distributed state vectors. The nSubSVs argument should be a power of 2 and subSVOrd should be less than nSubSVs. Otherwise, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[in] the sampler descriptor

  • subSVOrd[in] sub state vector ordinal

  • nSubSVs[in] the number of sub state vectors

  • offset[in] cumulative sum offset for the sub state vector

  • norm[in] norm for all sub vectors


custatevecSamplerSample

custatevecStatus_t custatevecSamplerSample(custatevecHandle_t handle, custatevecSamplerDescriptor_t sampler, custatevecIndex_t *bitStrings, const int32_t *bitOrdering, const uint32_t bitStringLen, const double *randnums, const uint32_t nShots, enum custatevecSamplerOutput_t output)

Sample bit strings from the state vector.

This function does sampling. The bitOrdering and bitStringLen arguments specify bits to be sampled. Sampled bit strings are represented as an array of custatevecIndex_t and are stored to the host memory buffer that the bitStrings argument points to.

The randnums argument is an array of user-generated random numbers whose length is nShots. Each random number should be in the range [0, 1). A random number that lies outside this range is clipped to [0, 1).

The output argument specifies the order of sampled bit strings:

If you don’t need a particular order, choose CUSTATEVEC_SAMPLER_OUTPUT_RANDNUM_ORDER by default. (It may offer slightly better performance.)

This API should be called after custatevecSamplerPreprocess(). Otherwise, the behavior of this function is undefined. By calling custatevecSamplerApplySubSVOffset() prior to this function, it is possible to sample bits corresponding to the ordinal of sub state vector.
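Conceptually, sampling maps each random number to a basis state through the cumulative sum of squared amplitudes and then extracts the requested bits. The host-side sketch below illustrates this idea for small state vectors. It is not the library's implementation: referenceSample is a hypothetical name, int64_t stands in for custatevecIndex_t, results are returned in the order of randnums, and bitOrdering[k] is assumed to select the index bit that becomes bit k of the output bit string.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using Complex = std::complex<double>;

std::vector<int64_t> referenceSample(const std::vector<Complex>& sv,
                                     const std::vector<int32_t>& bitOrdering,
                                     const std::vector<double>& randnums) {
    assert(!sv.empty());
    // Cumulative sum of squared amplitudes (the "cumulative abs2sum").
    std::vector<double> cumulative(sv.size());
    double running = 0.0;
    for (std::size_t i = 0; i < sv.size(); ++i) {
        running += std::norm(sv[i]);
        cumulative[i] = running;
    }
    const double norm = running;

    std::vector<int64_t> bitStrings;
    bitStrings.reserve(randnums.size());
    for (double r : randnums) {
        // Clip the random number to [0, 1).
        r = std::min(std::max(r, 0.0), std::nextafter(1.0, 0.0));
        // First basis state whose cumulative probability exceeds r.
        auto it = std::upper_bound(cumulative.begin(), cumulative.end(), r * norm);
        const std::size_t index = (it == cumulative.end())
            ? sv.size() - 1
            : static_cast<std::size_t>(it - cumulative.begin());
        // Bit k of the result is index bit bitOrdering[k].
        int64_t bits = 0;
        for (std::size_t k = 0; k < bitOrdering.size(); ++k)
            bits |= static_cast<int64_t>((index >> bitOrdering[k]) & 1u) << k;
        bitStrings.push_back(bits);
    }
    return bitStrings;
}
```

Using std::upper_bound rather than std::lower_bound means basis states with zero probability are never selected, because their cumulative value equals that of the preceding state.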

Parameters
  • handle[in] the handle to the cuStateVec library

  • sampler[in] the sampler descriptor

  • bitStrings[out] pointer to a host array to store sampled bit strings

  • bitOrdering[in] pointer to a host array of bit ordering for sampling

  • bitStringLen[in] the number of bits in bitOrdering

  • randnums[in] pointer to an array of random numbers

  • nShots[in] the number of shots

  • output[in] the order of sampled bit strings


custatevecSamplerDestroy

custatevecStatus_t custatevecSamplerDestroy(custatevecSamplerDescriptor_t sampler)

This function releases resources used by the sampler.

Parameters

sampler[in] the sampler descriptor

Accessor

An accessor extracts or updates state vector segments.

The APIs custatevecAccessorCreate() and custatevecAccessorCreateView() initialize an accessor and also return the size of an extra workspace (if needed by the APIs custatevecAccessorGet() and custatevecAccessorSet() to perform the copy). The workspace must be bound to an accessor by custatevecAccessorSetExtraWorkspace(), and the lifetime of the workspace must be as long as the accessor’s to cover the entire duration of the copy operation. If a device memory handler is set, users do not need to provide an explicit workspace.

The begin and end arguments in the Get/Set APIs correspond to the state vector elements’ indices such that elements within the specified range are copied.

Use case

Extraction

// create accessor and check the size of external workspace
custatevecAccessorCreateView(
    handle, d_sv, CUDA_C_64F, nIndexBits, &accessor, bitOrdering, bitOrderingLen,
    maskBitString, maskOrdering, maskLen, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// set external workspace
custatevecAccessorSetExtraWorkspace(
    handle, accessor, extraWorkspace, extraWorkspaceSizeInBytes);

// get state vector elements
custatevecAccessorGet(
    handle, accessor, buffer, accessBegin, accessEnd);

// deallocate the accessor
custatevecAccessorDestroy(accessor);

Update

// create accessor and check the size of external workspace
custatevecAccessorCreate(
    handle, d_sv, CUDA_C_64F, nIndexBits, &accessor, bitOrdering, bitOrderingLen,
    maskBitString, maskOrdering, maskLen, &extraWorkspaceSizeInBytes);

// allocate external workspace if necessary
if (extraWorkspaceSizeInBytes > 0)
    cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

// set external workspace
custatevecAccessorSetExtraWorkspace(
    handle, accessor, extraWorkspace, extraWorkspaceSizeInBytes);

// set state vector elements
custatevecAccessorSet(
    handle, accessor, buffer, 0, nSvSize);

// deallocate the accessor
custatevecAccessorDestroy(accessor);

API reference

custatevecAccessorCreate

custatevecStatus_t custatevecAccessorCreate(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecAccessorDescriptor_t *accessor, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, size_t *extraWorkspaceSizeInBytes)

Create accessor to copy elements between the state vector and an external buffer.

An accessor copies state vector elements between the state vector and external buffers. During the copy, the ordering of state vector elements is rearranged according to the bit ordering specified by the bitOrdering argument.

The state vector is assumed to have the default ordering: the LSB is the 0th index bit and the (N-1)th index bit is the MSB for an N index bit system. The bit ordering of the external buffer is specified by the bitOrdering argument. When 3 is given to the nIndexBits argument and [1, 2, 0] to the bitOrdering argument, the state vector index bits are permuted to specified bit positions. Thus, the state vector index is rearranged and mapped to the external buffer index as [0, 4, 1, 5, 2, 6, 3, 7].
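The index mapping in the example above can be reproduced with a few lines of host code. In this sketch the function name externalIndex is hypothetical, and bitOrdering[k] is interpreted as the state vector index bit that becomes bit k of the external buffer index, which is the reading consistent with the mapping [0, 4, 1, 5, 2, 6, 3, 7].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps a state vector index to the corresponding external buffer index.
// Bit k of the result is taken from state vector index bit bitOrdering[k].
std::size_t externalIndex(std::size_t svIndex,
                          const std::vector<int32_t>& bitOrdering) {
    std::size_t ext = 0;
    for (std::size_t k = 0; k < bitOrdering.size(); ++k) {
        assert(bitOrdering[k] >= 0);
        ext |= ((svIndex >> bitOrdering[k]) & 1u) << k;
    }
    return ext;
}
```

For example, state vector index 1 has only index bit 0 set. With bitOrdering [1, 2, 0] that bit becomes bit 2 of the external index, giving 4.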

The maskBitString, maskOrdering and maskLen arguments specify the bit mask for the state vector index being accessed. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

All bit positions [0, nIndexBits), should appear exactly once, either in the bitOrdering or the maskOrdering arguments. If a bit position does not appear in these arguments and/or there are overlaps of bit positions, this function returns CUSTATEVEC_STATUS_INVALID_VALUE.
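The coverage rule above can be checked on the host before creating an accessor. The following sketch uses a hypothetical helper name, coversAllIndexBits, and returns false in the cases where the text says the API reports CUSTATEVEC_STATUS_INVALID_VALUE. It does not call into the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true if every bit position in [0, nIndexBits) appears exactly once
// across bitOrdering and maskOrdering.
bool coversAllIndexBits(uint32_t nIndexBits,
                        const std::vector<int32_t>& bitOrdering,
                        const std::vector<int32_t>& maskOrdering) {
    assert(nIndexBits <= 64);  // index bits must fit in a 64-bit index
    std::vector<int> count(nIndexBits, 0);
    for (const auto* ordering : {&bitOrdering, &maskOrdering}) {
        for (int32_t pos : *ordering) {
            // Reject positions outside [0, nIndexBits).
            if (pos < 0 || static_cast<uint32_t>(pos) >= nIndexBits) return false;
            ++count[pos];
        }
    }
    // Missing positions have count 0; overlapping positions have count > 1.
    return std::all_of(count.begin(), count.end(), [](int c) { return c == 1; });
}
```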

The extra workspace improves performance if the accessor is called multiple times with small external buffers placed on device. A null pointer can be specified for the extraWorkspaceSizeInBytes argument if the extra workspace is not necessary.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] Data type of state vector

  • nIndexBits[in] the number of index bits of state vector

  • accessor[in] pointer to an accessor descriptor

  • bitOrdering[in] pointer to a host array to specify the basis bits of the external buffer

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitString[in] pointer to a host array to specify the mask values to limit access

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask

  • extraWorkspaceSizeInBytes[out] the required size of extra workspace


custatevecAccessorCreateView

custatevecStatus_t custatevecAccessorCreateView(custatevecHandle_t handle, const void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, custatevecAccessorDescriptor_t *accessor, const int32_t *bitOrdering, const uint32_t bitOrderingLen, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, size_t *extraWorkspaceSizeInBytes)

Create accessor for the constant state vector.

This function is the same as custatevecAccessorCreate(), but only accepts the constant state vector.

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[in] state vector

  • svDataType[in] Data type of state vector

  • nIndexBits[in] the number of index bits of state vector

  • accessor[in] pointer to an accessor descriptor

  • bitOrdering[in] pointer to a host array to specify the basis bits of the external buffer

  • bitOrderingLen[in] the length of bitOrdering

  • maskBitString[in] pointer to a host array to specify the mask values to limit access

  • maskOrdering[in] pointer to a host array for the mask ordering

  • maskLen[in] the length of mask

  • extraWorkspaceSizeInBytes[out] the required size of extra workspace


custatevecAccessorDestroy

custatevecStatus_t custatevecAccessorDestroy(custatevecAccessorDescriptor_t accessor)

This function releases resources used by the accessor.

Parameters

accessor[in] the accessor descriptor


custatevecAccessorSetExtraWorkspace

custatevecStatus_t custatevecAccessorSetExtraWorkspace(custatevecHandle_t handle, custatevecAccessorDescriptor_t accessor, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Set the external workspace to the accessor.

This function sets the extra workspace to the accessor. The required size for extra workspace can be obtained by custatevecAccessorCreate() or custatevecAccessorCreateView(). If a device memory handler is set, the extraWorkspace can be set to null, and the extraWorkspaceSizeInBytes can be set to 0.

Parameters
  • handle[in] the handle to the cuStateVec library

  • accessor[in] the accessor descriptor

  • extraWorkspace[in] extra workspace

  • extraWorkspaceSizeInBytes[in] extra workspace size


custatevecAccessorGet

custatevecStatus_t custatevecAccessorGet(custatevecHandle_t handle, custatevecAccessorDescriptor_t accessor, void *externalBuffer, const custatevecIndex_t begin, const custatevecIndex_t end)

Copy state vector elements to an external buffer.

This function copies state vector elements to an external buffer specified by the externalBuffer argument. During the copy, the index bits are permuted as specified by the bitOrdering argument in custatevecAccessorCreate() or custatevecAccessorCreateView().

The begin and end arguments specify the range of state vector elements being copied. Both arguments have the bit ordering specified by the bitOrdering argument.
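The permuted copy can be pictured with a small host-side model. This is only a sketch, not the library implementation: accessorGetModel is a hypothetical name, no mask is applied, bitOrdering is assumed to cover every index bit exactly once, and bitOrdering[k] is assumed to name the state vector index bit that supplies bit k of the external-buffer index.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference model (hypothetical helper, not a cuStateVec API).
// Assumption: bitOrdering[k] is the state vector index bit that supplies
// bit k of the permuted (external buffer) index.
std::vector<std::complex<double>> accessorGetModel(
    const std::vector<std::complex<double>>& sv,
    const std::vector<int32_t>& bitOrdering,
    int64_t begin, int64_t end)
{
    const int64_t svSize = static_cast<int64_t>(sv.size());
    assert((int64_t{1} << bitOrdering.size()) == svSize);
    assert(0 <= begin && begin <= end && end <= svSize);

    std::vector<std::complex<double>> externalBuffer;
    externalBuffer.reserve(static_cast<std::size_t>(end - begin));
    // Walk the requested range [begin, end) in the permuted bit ordering.
    for (int64_t permuted = begin; permuted < end; ++permuted) {
        int64_t svIndex = 0;
        for (std::size_t k = 0; k < bitOrdering.size(); ++k) {
            if ((permuted >> k) & 1)
                svIndex |= int64_t{1} << bitOrdering[k];
        }
        externalBuffer.push_back(sv[static_cast<std::size_t>(svIndex)]);
    }
    return externalBuffer;
}
```

Under these assumptions, a 3-bit state vector with bitOrdering = {2, 1, 0} and the range [1, 4) yields the elements stored at state vector indices 4, 2 and 6, in that order.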

Parameters
  • handle[in] the handle to the cuStateVec library

  • accessor[in] the accessor descriptor

  • externalBuffer[out] pointer to a host or device buffer to receive copied elements

  • begin[in] index in the permuted bit ordering of the first element being copied from the state vector

  • end[in] index in the permuted bit ordering of the last element being copied from the state vector (non-inclusive)


custatevecAccessorSet

custatevecStatus_t custatevecAccessorSet(custatevecHandle_t handle, custatevecAccessorDescriptor_t accessor, const void *externalBuffer, const custatevecIndex_t begin, const custatevecIndex_t end)

Set state vector elements from an external buffer.

This function sets complex numbers to the state vector by using an external buffer specified by the externalBuffer argument. During the copy, the index bits are permuted as specified by the bitOrdering argument in custatevecAccessorCreate().

The begin and end arguments specify the range of state vector elements being set to the state vector. Both arguments have the bit ordering specified by the bitOrdering argument.

If a read-only accessor created by calling custatevecAccessorCreateView() is provided, this function returns CUSTATEVEC_STATUS_NOT_SUPPORTED.

Parameters
  • handle[in] the handle to the cuStateVec library

  • accessor[in] the accessor descriptor

  • externalBuffer[in] pointer to a host or device buffer of complex values being copied to the state vector

  • begin[in] index in the permuted bit ordering of the first element being copied to the state vector

  • end[in] index in the permuted bit ordering of the last element being copied to the state vector (non-inclusive)

Single-process qubit reordering

For single-process computations, cuStateVec provides the custatevecSwapIndexBits() API for a single device and the custatevecMultiDeviceSwapIndexBits() API for multiple devices to reorder state vector elements.

Use case

single device

// This example uses 3 qubits.
const int nIndexBits = 3;

// swap 0th and 2nd qubits
const int nBitSwaps  = 1;
const int2 bitSwaps[] = {{0, 2}}; // specify the qubit pairs

// swap the state vector elements only if 1st qubit is 1
const int maskLen = 1;
int maskBitString[] = {1}; // specify the values of mask qubits
int maskOrdering[] = {1};  // specify the mask qubits

// Swap index bit pairs.
// {|000>, |001>, |010>, |011>, |100>, |101>, |110>, |111>} will be permuted to
// {|000>, |001>, |010>, |110>, |100>, |101>, |011>, |111>}.
custatevecSwapIndexBits(handle, sv, svDataType, nIndexBits, bitSwaps, nBitSwaps,
    maskBitString, maskOrdering, maskLen);

multiple devices

// This example uses 2 GPUs and each GPU stores a 2-qubit sub state vector.
const int nGlobalIndexBits = 1;
const int nLocalIndexBits = 2;
const int nHandles = 1 << nGlobalIndexBits;

// Users are required to enable direct access on a peer device prior to the swap API call.
for (int i0 = 0; i0 < nHandles; i0++) {
  cudaSetDevice(i0);
  for (int i1 = 0; i1 < nHandles; i1++) {
    if (i0 == i1)
      continue;
    cudaDeviceEnablePeerAccess(i1, 0);
  }
}
cudaSetDevice(0);

// specify the type of device network topology to optimize the data transfer sequence.
// Here, devices are assumed to be connected via NVLink with an NVSwitch or
// PCIe device network with a single PCIe switch.
const custatevecDeviceNetworkType_t deviceNetworkType = CUSTATEVEC_DEVICE_NETWORK_TYPE_SWITCH;

// swap 0th and 2nd qubits
const int nIndexBitSwaps  = 1;
const int2 indexBitSwaps[] = {{0, 2}}; // specify the qubit pairs

// swap the state vector elements only if 1st qubit is 1
const int maskLen = 1;
int maskBitString[] = {1}; // specify the values of mask qubits
int maskOrdering[] = {1};  // specify the mask qubits

// Swap index bit pairs.
// {|000>, |001>, |010>, |011>, |100>, |101>, |110>, |111>} will be permuted to
// {|000>, |001>, |010>, |110>, |100>, |101>, |011>, |111>}.
custatevecMultiDeviceSwapIndexBits(handles, nHandles, subSVs, svDataType,
    nGlobalIndexBits, nLocalIndexBits, indexBitSwaps, nIndexBitSwaps,
    maskBitString, maskOrdering, maskLen, deviceNetworkType);
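The permutation stated in the comment above can be checked on the host without any GPU. The following is a minimal reference model, not the library implementation: multiDeviceSwapModel, SubSVs and BitPair are hypothetical names, integer labels stand in for complex amplitudes, and the model assumes that the mask is tested on the original index and that swaps are applied in list order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Integer labels stand in for complex amplitudes so the permutation is easy to read.
using SubSVs = std::vector<std::vector<int>>;
using BitPair = std::pair<int32_t, int32_t>;

// Host-side reference model only; it performs no device or P2P transfers.
// Assumptions: the mask is tested on the original index, matching elements
// are written to the swapped index, and all other elements stay in place.
SubSVs multiDeviceSwapModel(const SubSVs& subSVs,
                            uint32_t nGlobalIndexBits, uint32_t nLocalIndexBits,
                            const std::vector<BitPair>& indexBitSwaps,
                            const std::vector<int32_t>& maskBitString,
                            const std::vector<int32_t>& maskOrdering)
{
    const std::size_t localSize = std::size_t{1} << nLocalIndexBits;
    const std::size_t nElements = localSize << nGlobalIndexBits;
    assert(subSVs.size() == (std::size_t{1} << nGlobalIndexBits));
    assert(maskBitString.size() == maskOrdering.size());
    for (const auto& sub : subSVs) {
        assert(sub.size() == localSize);
    }

    SubSVs out = subSVs;
    for (std::size_t idx = 0; idx < nElements; ++idx) {
        bool matches = true;
        for (std::size_t m = 0; m < maskOrdering.size(); ++m) {
            const std::size_t bit = (idx >> maskOrdering[m]) & 1;
            matches = matches && (bit == static_cast<std::size_t>(maskBitString[m]));
        }
        if (!matches) continue;

        std::size_t swapped = idx;
        for (const BitPair& s : indexBitSwaps) {
            const std::size_t b0 = (swapped >> s.first) & 1;
            const std::size_t b1 = (swapped >> s.second) & 1;
            if (b0 != b1)
                swapped ^= (std::size_t{1} << s.first) | (std::size_t{1} << s.second);
        }
        // Upper bits select the sub state vector, lower bits the local offset.
        out[swapped >> nLocalIndexBits][swapped & (localSize - 1)] =
            subSVs[idx >> nLocalIndexBits][idx & (localSize - 1)];
    }
    return out;
}
```

With two sub state vectors labelled {0, 1, 2, 3} and {4, 5, 6, 7}, swapping bits 0 and 2 under the mask "bit 1 is 1" gives {0, 1, 2, 6} and {4, 5, 3, 7} in this model, which matches the permutation listed in the example comment.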

API reference

custatevecSwapIndexBits

custatevecStatus_t custatevecSwapIndexBits(custatevecHandle_t handle, void *sv, cudaDataType_t svDataType, const uint32_t nIndexBits, const int2 *bitSwaps, const uint32_t nBitSwaps, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen)

Swap index bits and reorder state vector elements in one device.

This function updates the bit ordering of the state vector by swapping the pairs of bit positions.

The state vector is assumed to have the default ordering: the LSB is the 0th index bit and the (N-1)th index bit is the MSB for an N index bit system. The bitSwaps argument specifies the swapped bit index pairs, whose values must be in the range [0, nIndexBits).

The maskBitString, maskOrdering and maskLen arguments specify the bit mask for the state vector index being permuted. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

A bit position can be included in both bitSwaps and maskOrdering. When a masked bit is swapped, state vector elements whose original indices match the mask bit string are written to the permuted indices while other elements are not copied.
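A host-side model makes the masking rule concrete. This sketch is an interpretation of the paragraph above, not the library implementation: swapIndexBitsModel is a hypothetical name, the output starts as a copy of the input, the mask is tested on the original index, and multiple swaps are assumed to be applied in list order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side reference model (hypothetical helper, not a cuStateVec API).
// Elements whose original index matches the mask are written to the swapped
// index; elements that do not match are not copied anywhere.
template <typename T>
std::vector<T> swapIndexBitsModel(
    const std::vector<T>& sv, uint32_t nIndexBits,
    const std::vector<std::pair<int32_t, int32_t>>& bitSwaps,
    const std::vector<int32_t>& maskBitString,
    const std::vector<int32_t>& maskOrdering)
{
    assert(sv.size() == (std::size_t{1} << nIndexBits));
    assert(maskBitString.size() == maskOrdering.size());

    std::vector<T> out = sv;
    for (std::size_t idx = 0; idx < sv.size(); ++idx) {
        bool matches = true;
        for (std::size_t m = 0; m < maskOrdering.size(); ++m) {
            const std::size_t bit = (idx >> maskOrdering[m]) & 1;
            matches = matches && (bit == static_cast<std::size_t>(maskBitString[m]));
        }
        if (!matches) continue;

        std::size_t swapped = idx;
        for (const auto& s : bitSwaps) {
            const std::size_t b0 = (swapped >> s.first) & 1;
            const std::size_t b1 = (swapped >> s.second) & 1;
            // Exchanging two bits only changes the index when they differ.
            if (b0 != b1)
                swapped ^= (std::size_t{1} << s.first) | (std::size_t{1} << s.second);
        }
        out[swapped] = sv[idx];
    }
    return out;
}
```

For a 3-bit vector labelled 0 to 7, swapping bits 0 and 2 with the mask "bit 1 is 1" returns {0, 1, 2, 6, 4, 5, 3, 7} in this model. When a masked bit is itself swapped, non-matching source elements are skipped, so their slots may be overwritten by matching elements.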

Parameters
  • handle[in] the handle to the cuStateVec library

  • sv[inout] state vector

  • svDataType[in] Data type of state vector

  • nIndexBits[in] the number of index bits of state vector

  • bitSwaps[in] pointer to a host array of swapping bit index pairs

  • nBitSwaps[in] the number of bit swaps

  • maskBitString[in] pointer to a host array to mask output

  • maskOrdering[in] pointer to a host array to specify the ordering of maskBitString

  • maskLen[in] the length of mask


custatevecMultiDeviceSwapIndexBits

custatevecStatus_t custatevecMultiDeviceSwapIndexBits(custatevecHandle_t *handles, const uint32_t nHandles, void **subSVs, const cudaDataType_t svDataType, const uint32_t nGlobalIndexBits, const uint32_t nLocalIndexBits, const int2 *indexBitSwaps, const uint32_t nIndexBitSwaps, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, const custatevecDeviceNetworkType_t deviceNetworkType)

Swap index bits and reorder state vector elements for multiple sub state vectors distributed to multiple devices.

This function updates the bit ordering of the state vector distributed in multiple devices by swapping the pairs of bit positions.

This function assumes the state vector is split into multiple sub state vectors and distributed to multiple devices to represent a (nGlobalIndexBits + nLocalIndexBits) qubit system.

The handles argument should receive cuStateVec handles created for all devices where sub state vectors are allocated. If two or more cuStateVec handles created for the same device are given, this function will return an error, CUSTATEVEC_STATUS_INVALID_VALUE. The handles argument should contain a handle created on the current device, as all operations in this function will be ordered on the stream of the current device’s handle. Otherwise, this function returns an error, CUSTATEVEC_STATUS_INVALID_VALUE.

Sub state vectors are specified by the subSVs argument as an array of device pointers. All sub state vectors are assumed to hold the same number of index bits, specified by nLocalIndexBits. Thus, each sub state vector holds (1 << nLocalIndexBits) state vector elements. The value of the global index bits is identical to the index of the sub state vector. The number of sub state vectors is given as (1 << nGlobalIndexBits). The maximum value of nGlobalIndexBits is 5, which corresponds to 32 sub state vectors.
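The layout arithmetic in this paragraph can be written out as a few host-side helpers. They are hypothetical names used for illustration only, not cuStateVec APIs; the limit of 5 global index bits is taken from the paragraph above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a distributed state vector layout.
struct DistributedLayout {
    std::size_t nSubSVs;           // 1 << nGlobalIndexBits
    std::size_t elementsPerSubSV;  // 1 << nLocalIndexBits
};

// Returns false when nGlobalIndexBits exceeds the documented maximum of 5.
bool makeLayout(uint32_t nGlobalIndexBits, uint32_t nLocalIndexBits,
                DistributedLayout* layout)
{
    assert(layout != nullptr);
    if (nGlobalIndexBits > 5) return false;
    layout->nSubSVs = std::size_t{1} << nGlobalIndexBits;
    layout->elementsPerSubSV = std::size_t{1} << nLocalIndexBits;
    return true;
}

// The global index bits select the sub state vector.
std::size_t subSVIndexOf(std::size_t index, uint32_t nLocalIndexBits)
{
    return index >> nLocalIndexBits;
}

// The local index bits give the offset inside that sub state vector.
std::size_t localOffsetOf(std::size_t index, uint32_t nLocalIndexBits)
{
    return index & ((std::size_t{1} << nLocalIndexBits) - 1);
}
```

For example, with nGlobalIndexBits = 1 and nLocalIndexBits = 2, the element with index 6 (binary 110) lives in sub state vector 1 at local offset 2.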

The index bits of the distributed state vector have the default ordering: the index bits of the sub state vector are mapped from the 0th index bit to the (nLocalIndexBits-1)-th index bit. The global index bits are mapped from the (nLocalIndexBits)-th bit to the (nGlobalIndexBits + nLocalIndexBits - 1)-th bit.

The indexBitSwaps argument specifies the index bit pairs being swapped. Each index bit pair can be a pair of two global index bits or a pair of a global and a local index bit. A pair of two local index bits is not accepted. Please use custatevecSwapIndexBits() for swapping local index bits.
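A caller may want to sort swap pairs before choosing an API. The helper below is a hypothetical host-side sketch (classifyIndexBitSwap and SwapKind are not part of cuStateVec) that applies the rule above under the default ordering, where bits below nLocalIndexBits are local and the remaining bits are global.

```cpp
#include <cassert>
#include <cstdint>

enum class SwapKind { GlobalGlobal, GlobalLocal, LocalLocal };

// Classify one index bit pair under the default bit ordering.
SwapKind classifyIndexBitSwap(int32_t bit0, int32_t bit1,
                              uint32_t nGlobalIndexBits, uint32_t nLocalIndexBits)
{
    const int32_t nLocal = static_cast<int32_t>(nLocalIndexBits);
    const int32_t nBits = static_cast<int32_t>(nGlobalIndexBits + nLocalIndexBits);
    assert(0 <= bit0 && bit0 < nBits);
    assert(0 <= bit1 && bit1 < nBits);
    (void)nBits;  // only used by the assertions above

    const bool global0 = bit0 >= nLocal;
    const bool global1 = bit1 >= nLocal;
    if (global0 && global1) return SwapKind::GlobalGlobal;
    if (global0 || global1) return SwapKind::GlobalLocal;
    return SwapKind::LocalLocal;
}

// True when the pair involves at least one global bit; a purely local pair
// belongs to custatevecSwapIndexBits() instead.
bool isDistributedSwap(int32_t bit0, int32_t bit1,
                       uint32_t nGlobalIndexBits, uint32_t nLocalIndexBits)
{
    return classifyIndexBitSwap(bit0, bit1, nGlobalIndexBits, nLocalIndexBits)
           != SwapKind::LocalLocal;
}
```

With nGlobalIndexBits = 1 and nLocalIndexBits = 2, the pair (0, 2) is a global and local pair and is accepted, while (0, 1) is a local pair and should be routed to custatevecSwapIndexBits().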

The maskBitString, maskOrdering and maskLen arguments specify the bit string mask that limits the state vector elements swapped during the call. Bits in maskOrdering can overlap index bits specified in the indexBitSwaps argument. In such cases, the mask bit string is applied for the bit positions before index bit swaps. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

The deviceNetworkType argument specifies the device network topology to optimize the data transfer sequence. The following two network topologies are assumed:

  • Switch network: devices connected via NVLink with an NVSwitch (ex. DGX A100 and DGX-2) or PCIe device network with a single PCIe switch

  • Full mesh network: all devices are connected by full mesh connections (ex. DGX Station V100/A100)

Note

Important notice: This function assumes bidirectional GPUDirect P2P is supported and enabled by cudaDeviceEnablePeerAccess() between all devices where sub state vectors are allocated. If GPUDirect P2P is not enabled, a call to custatevecMultiDeviceSwapIndexBits() that accesses otherwise inaccessible device memory allocated on other GPUs would result in a segmentation fault.

For the best performance, please use \(2^n\) devices and allocate one sub state vector on each device. To cover various hardware configurations, this function also allows a non-\(2^n\) number of devices, two or more sub state vectors on one device, or all sub state vectors on a single device. However, performance is best when a single sub state vector is allocated on each of \(2^n\) devices.

The copy on each participating device is enqueued on the CUDA stream bound to the corresponding handle via custatevecSetStream(). All CUDA calls before the call of this function are correctly ordered if these calls are issued on the streams set to handles. This function is asynchronously executed. Please use cudaStreamSynchronize() (for synchronization) or cudaStreamWaitEvent() (for establishing the stream order) with the stream set to the handle of the current device.

Parameters
  • handles[in] pointer to a host array of custatevecHandle_t

  • nHandles[in] the number of handles specified in the handles argument

  • subSVs[inout] pointer to an array of sub state vectors

  • svDataType[in] the data type of the state vector specified by the subSVs argument

  • nGlobalIndexBits[in] the number of global index bits of distributed state vector

  • nLocalIndexBits[in] the number of local index bits in sub state vector

  • indexBitSwaps[in] pointer to a host array of index bit pairs being swapped

  • nIndexBitSwaps[in] the number of index bit swaps

  • maskBitString[in] pointer to a host array to mask output

  • maskOrdering[in] pointer to a host array to specify the ordering of maskBitString

  • maskLen[in] the length of mask

  • deviceNetworkType[in] the device network topology

Multi-process qubit reordering

For multi-process computations, cuStateVec provides APIs to schedule and reorder distributed state vector elements. In addition, cuStateVec has custatevecCommunicator_t, which wraps MPI libraries for inter-process communication. Please refer to Distributed Index Bit Swap API for the overview and detailed usage.

API reference

custatevecCommunicatorCreate

custatevecStatus_t custatevecCommunicatorCreate(custatevecHandle_t handle, custatevecCommunicatorDescriptor_t *communicator, custatevecCommunicatorType_t communicatorType, const char *soname)

Create communicator.

This function creates a communicator instance.

The type of the communicator is specified by the communicatorType argument. By specifying CUSTATEVEC_COMMUNICATOR_TYPE_OPENMPI or CUSTATEVEC_COMMUNICATOR_TYPE_MPICH, this function creates a communicator instance that internally uses Open MPI or MPICH, respectively. By specifying CUSTATEVEC_COMMUNICATOR_TYPE_EXTERNAL, this function loads a custom plugin that wraps an MPI library. The source code for the custom plugin is downloadable from NVIDIA/cuQuantum.

The soname argument specifies the name of the shared library that will be used by the communicator instance.

This function uses dlopen() to load the specified shared library. If the Open MPI or MPICH library is directly linked to an application and CUSTATEVEC_COMMUNICATOR_TYPE_OPENMPI or CUSTATEVEC_COMMUNICATOR_TYPE_MPICH is specified as the communicatorType argument, the soname argument should be set to NULL. In that case, function symbols are resolved by searching the functions loaded into the application at startup time.

Parameters
  • handle[in] the handle to cuStateVec library

  • communicator[out] a pointer to the communicator

  • communicatorType[in] the communicator type

  • soname[in] the shared object name


custatevecCommunicatorDestroy

custatevecStatus_t custatevecCommunicatorDestroy(custatevecHandle_t handle, custatevecCommunicatorDescriptor_t communicator)

This function releases the communicator.

Parameters
  • handle[in] the handle to cuStateVec library

  • communicator[in] the communicator descriptor


custatevecDistIndexBitSwapSchedulerCreate

custatevecStatus_t custatevecDistIndexBitSwapSchedulerCreate(custatevecHandle_t handle, custatevecDistIndexBitSwapSchedulerDescriptor_t *scheduler, const uint32_t nGlobalIndexBits, const uint32_t nLocalIndexBits)

Create distributed index bit swap scheduler.

This function creates a distributed index bit swap scheduler descriptor.

The local index bits are from the 0th index bit to the (nLocalIndexBits-1)-th index bit. The global index bits are mapped from the (nLocalIndexBits)-th bit to the (nGlobalIndexBits + nLocalIndexBits - 1)-th bit.

Parameters
  • handle[in] the handle to cuStateVec library

  • scheduler[out] a pointer to a batch swap scheduler

  • nGlobalIndexBits[in] the number of global index bits

  • nLocalIndexBits[in] the number of local index bits


custatevecDistIndexBitSwapSchedulerDestroy

custatevecStatus_t custatevecDistIndexBitSwapSchedulerDestroy(custatevecHandle_t handle, custatevecDistIndexBitSwapSchedulerDescriptor_t scheduler)

This function releases the distributed index bit swap scheduler.

Parameters
  • handle[in] the handle to cuStateVec library

  • scheduler[in] a pointer to the batch swap scheduler to destroy


custatevecDistIndexBitSwapSchedulerSetIndexBitSwaps

custatevecStatus_t custatevecDistIndexBitSwapSchedulerSetIndexBitSwaps(custatevecHandle_t handle, custatevecDistIndexBitSwapSchedulerDescriptor_t scheduler, const int2 *indexBitSwaps, const uint32_t nIndexBitSwaps, const int32_t *maskBitString, const int32_t *maskOrdering, const uint32_t maskLen, uint32_t *nSwapBatches)

Set index bit swaps to distributed index bit swap scheduler.

This function sets index bit swaps to the distributed index bit swap scheduler and computes the number of necessary batched data transfers for the given index bit swaps.

The index bits of the distributed state vector have the default ordering: the index bits of the sub state vector are mapped from the 0th index bit to the (nLocalIndexBits-1)-th index bit. The global index bits are mapped from the (nLocalIndexBits)-th bit to the (nGlobalIndexBits + nLocalIndexBits - 1)-th bit.

The indexBitSwaps argument specifies the index bit pairs being swapped. Each index bit pair can be a pair of two global index bits or a pair of a global and a local index bit. A pair of two local index bits is not accepted. Please use custatevecSwapIndexBits() for swapping local index bits.

The maskBitString, maskOrdering and maskLen arguments specify the bit string mask that limits the state vector elements swapped during the call. Bits in maskOrdering can overlap index bits specified in the indexBitSwaps argument. In such cases, the mask bit string is applied for the bit positions before index bit swaps. If the maskLen argument is 0, the maskBitString and/or maskOrdering arguments can be null.

The returned value by the nSwapBatches argument represents the number of loops required to complete index bit swaps and is used in later stages.

Parameters
  • handle[in] the handle to cuStateVec library

  • scheduler[in] a pointer to batch swap scheduler descriptor

  • indexBitSwaps[in] pointer to a host array of index bit pairs being swapped

  • nIndexBitSwaps[in] the number of index bit swaps

  • maskBitString[in] pointer to a host array to mask output

  • maskOrdering[in] pointer to a host array to specify the ordering of maskBitString

  • maskLen[in] the length of mask

  • nSwapBatches[out] the number of batched data transfers


custatevecDistIndexBitSwapSchedulerGetParameters

custatevecStatus_t custatevecDistIndexBitSwapSchedulerGetParameters(custatevecHandle_t handle, custatevecDistIndexBitSwapSchedulerDescriptor_t scheduler, const int32_t swapBatchIndex, const int32_t orgSubSVIndex, custatevecSVSwapParameters_t *parameters)

Get parameters to be set to the state vector swap worker.

This function computes parameters used for data transfers between sub state vectors. The value of the swapBatchIndex argument should be in the range [0, nSwapBatches), where nSwapBatches is the number of loops obtained by the call to custatevecDistIndexBitSwapSchedulerSetIndexBitSwaps().

The parameters argument returns the computed parameters for data transfer, which is set to custatevecSVSwapWorker by the call to custatevecSVSwapWorkerSetParameters().

Parameters
  • handle[in] the handle to cuStateVec library

  • scheduler[in] a pointer to batch swap scheduler descriptor

  • swapBatchIndex[in] swap batch index for state vector swap parameters

  • orgSubSVIndex[in] the index of the origin sub state vector to swap state vector segments

  • parameters[out] a pointer to data transfer parameters


custatevecSVSwapWorkerCreate

custatevecStatus_t custatevecSVSwapWorkerCreate(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t *svSwapWorker, custatevecCommunicatorDescriptor_t communicator, void *orgSubSV, int32_t orgSubSVIndex, cudaEvent_t orgEvent, cudaDataType_t svDataType, cudaStream_t stream, size_t *extraWorkspaceSizeInBytes, size_t *minTransferWorkspaceSizeInBytes)

Create state vector swap worker.

This function creates a custatevecSVSwapWorkerDescriptor_t that swaps/sends/receives state vector elements between multiple sub state vectors. The communicator specified as the communicator argument is used for inter-process communication, thus state vector elements are transferred between sub state vectors distributed to multiple processes and nodes.

The created descriptor works on the device where the handle is created. The origin sub state vector specified by the orgSubSV argument should be allocated on the same device. The same applies to the event and the stream specified by the orgEvent and stream arguments respectively.

There are two workspaces, extra workspace and data transfer workspace. The extra workspace has constant size and is used to keep the internal state of the descriptor. The data transfer workspace is used to stage state vector elements being transferred. Its minimum size is given by the minTransferWorkspaceSizeInBytes argument. Depending on the system, increasing the size of data transfer workspace can improve performance.

If all the destination sub state vectors are specified by using custatevecSVSwapWorkerSetSubSVsP2P(), the communicator argument can be null. In this case, the internal CUDA calls are not serialized on the stream specified by the stream argument. It is the user's responsibility to call cudaStreamSynchronize() and then a global barrier such as MPI_Barrier(), in this order, to complete all internal CUDA calls. This limitation will be fixed in a future version.

If sub state vectors are distributed to multiple processes, the event should be created with the cudaEventInterprocess flag. Please refer to the CUDA Toolkit documentation for the details.

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[out] state vector swap worker

  • communicator[in] a pointer to the MPI communicator

  • orgSubSV[in] a pointer to a sub state vector

  • orgSubSVIndex[in] the index of the sub state vector specified by the orgSubSV argument

  • orgEvent[in] the event for synchronization with the peer worker

  • svDataType[in] data type used by the state vector representation

  • stream[in] a stream that is used to locally execute kernels during data transfers

  • extraWorkspaceSizeInBytes[out] the size of the extra workspace needed

  • minTransferWorkspaceSizeInBytes[out] the minimum-required size of the transfer workspace


custatevecSVSwapWorkerDestroy

custatevecStatus_t custatevecSVSwapWorkerDestroy(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t svSwapWorker)

This function releases the state vector swap worker.

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[in] state vector swap worker


custatevecSVSwapWorkerSetExtraWorkspace

custatevecStatus_t custatevecSVSwapWorkerSetExtraWorkspace(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t svSwapWorker, void *extraWorkspace, size_t extraWorkspaceSizeInBytes)

Set extra workspace.

This function sets the extra workspace to the state vector swap worker. The required size for extra workspace can be obtained by custatevecSVSwapWorkerCreate().

The extra workspace should be set before calling custatevecSVSwapWorkerSetParameters().

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[in] state vector swap worker

  • extraWorkspace[in] pointer to the user-owned workspace

  • extraWorkspaceSizeInBytes[in] size of the user-provided workspace


custatevecSVSwapWorkerSetTransferWorkspace

custatevecStatus_t custatevecSVSwapWorkerSetTransferWorkspace(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t svSwapWorker, void *transferWorkspace, size_t transferWorkspaceSizeInBytes)

Set transfer workspace.

This function sets the transfer workspace to the state vector swap worker instance. The minimum size for transfer workspace can be obtained by custatevecSVSwapWorkerCreate().

Depending on the system hardware configuration, a larger transfer workspace can improve performance. The size specified by transferWorkspaceSizeInBytes should be a power of two and should be equal to or larger than the value of minTransferWorkspaceSizeInBytes returned by the call to custatevecSVSwapWorkerCreate().
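Both size constraints can be satisfied with a small host-side helper. chooseTransferWorkspaceSize is a hypothetical name, not a cuStateVec API; the sketch rounds the larger of the minimum and the desired size up to the next power of two, and it does not assume that the minimum returned by the library is itself a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: smallest power of two that is >= minBytes and >= desiredBytes.
std::size_t chooseTransferWorkspaceSize(std::size_t minBytes, std::size_t desiredBytes)
{
    const std::size_t target = desiredBytes > minBytes ? desiredBytes : minBytes;
    // Guard against overflow of the doubling loop below.
    assert(target <= (std::numeric_limits<std::size_t>::max() >> 1) + 1);

    std::size_t size = 1;
    while (size < target) {
        size <<= 1;
    }
    return size;
}
```

For instance, a minimum of 1000 bytes with no larger request gives 1024, and a minimum of 1024 bytes with a desired size of 5000 bytes gives 8192. The result would then be used both for the device allocation and as transferWorkspaceSizeInBytes.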

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[in] state vector swap worker

  • transferWorkspace[in] pointer to the user-owned workspace

  • transferWorkspaceSizeInBytes[in] size of the user-provided workspace


custatevecSVSwapWorkerSetSubSVsP2P

custatevecStatus_t custatevecSVSwapWorkerSetSubSVsP2P(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t svSwapWorker, void **dstSubSVsP2P, const int32_t *dstSubSVIndicesP2P, cudaEvent_t *dstEvents, const uint32_t nDstSubSVsP2P)

Set sub state vector pointers accessible via GPUDirect P2P.

This function sets sub state vector pointers that are accessible by GPUDirect P2P from the device where the state vector swap worker works. The sub state vector pointers should be specified together with the sub state vector indices and events which are passed to custatevecSVSwapWorkerCreate() to create peer SV swap worker instances.

If sub state vectors are allocated in different processes, the sub state vector pointers and the events should be retrieved by using CUDA IPC.

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[in] state vector swap worker

  • dstSubSVsP2P[in] an array of pointers to sub state vectors that are accessed by GPUDirect P2P

  • dstSubSVIndicesP2P[in] the sub state vector indices of sub state vector pointers specified by the dstSubSVsP2P argument

  • dstEvents[in] events used to create peer workers

  • nDstSubSVsP2P[in] the number of sub state vector pointers specified by the dstSubSVsP2P argument


custatevecSVSwapWorkerSetParameters

custatevecStatus_t custatevecSVSwapWorkerSetParameters(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t svSwapWorker, const custatevecSVSwapParameters_t *parameters, int peer)

Set state vector swap parameters.

This function sets the parameters to swap state vector elements. The value of the parameters argument is retrieved by calling custatevecDistIndexBitSwapSchedulerGetParameters().

The peer argument specifies the rank of the peer process that holds the destination sub state vector. The sub state vector index of the destination sub state vector is obtained from the dstSubSVIndex member defined in custatevecSVSwapParameters_t.

If all the sub state vectors are accessible by GPUDirect P2P and a null pointer is passed to the communicator argument when calling custatevecSVSwapWorkerCreate(), the peer argument is ignored.

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[in] state vector swap worker

  • parameters[in] data transfer parameters

  • peer[in] the peer process identifier of the data transfer


custatevecSVSwapWorkerExecute

custatevecStatus_t custatevecSVSwapWorkerExecute(custatevecHandle_t handle, custatevecSVSwapWorkerDescriptor_t svSwapWorker, custatevecIndex_t begin, custatevecIndex_t end)

Execute the data transfer.

This function executes the transfer of state vector elements. The number of elements being transferred is obtained from the transferSize member in custatevecSVSwapParameters_t. The begin and end arguments specify the range, [begin, end), for elements being transferred.
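If a transfer is issued in several pieces, the half-open ranges can be generated on the host. splitTransfer is a hypothetical helper, and int64_t stands in for custatevecIndex_t here; whether chunked execution is beneficial depends on the application and is not stated by the source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, transferSize) into consecutive half-open
// ranges [begin, end) of at most chunkSize elements each.
std::vector<std::pair<int64_t, int64_t>> splitTransfer(int64_t transferSize,
                                                       int64_t chunkSize)
{
    assert(transferSize >= 0);
    assert(chunkSize > 0);

    std::vector<std::pair<int64_t, int64_t>> ranges;
    for (int64_t begin = 0; begin < transferSize; begin += chunkSize) {
        const int64_t end = std::min(begin + chunkSize, transferSize);
        ranges.emplace_back(begin, end);
    }
    return ranges;
}
```

Each resulting pair would be passed as the begin and end arguments of one custatevecSVSwapWorkerExecute() call. A transfer of 10 elements with a chunk size of 4 produces [0, 4), [4, 8) and [8, 10).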

Parameters
  • handle[in] the handle to cuStateVec library

  • svSwapWorker[in] state vector swap worker

  • begin[in] the index to start transfer

  • end[in] the index to end transfer

Sub state vector migration

For distributed state vector simulations on host and device memory, cuStateVec provides APIs to migrate distributed state vector elements. Please refer to the Sub State Vector Migration API for the overview and detailed usage.

API reference

custatevecSubSVMigratorCreate

custatevecStatus_t custatevecSubSVMigratorCreate(custatevecHandle_t handle, custatevecSubSVMigratorDescriptor_t *migrator, void *deviceSlots, cudaDataType_t svDataType, int nDeviceSlots, int nLocalIndexBits)

Create sub state vector migrator descriptor.

This function creates a sub state vector migrator descriptor. Sub state vectors specified by the deviceSlots argument must be allocated in one contiguous memory array whose size is at least (\(\text{nDeviceSlots} \times 2^{\text{nLocalIndexBits}}\)).
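The allocation size for deviceSlots can be computed on the host before calling cudaMalloc(). The helpers below are hypothetical, and they assume that the size given above counts state vector elements, so the byte count is obtained by multiplying by the element size (8 bytes for CUDA_C_32F, 16 bytes for CUDA_C_64F).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of elements in the contiguous deviceSlots array.
// Assumption: the documented minimum size is expressed in elements.
std::size_t deviceSlotsElements(int nDeviceSlots, int nLocalIndexBits)
{
    assert(nDeviceSlots > 0);
    assert(nLocalIndexBits >= 0 && nLocalIndexBits < 48);
    return static_cast<std::size_t>(nDeviceSlots) << nLocalIndexBits;
}

// Hypothetical helper: byte size for a given element size
// (8 for single-precision complex, 16 for double-precision complex).
std::size_t deviceSlotsBytes(int nDeviceSlots, int nLocalIndexBits,
                             std::size_t bytesPerElement)
{
    assert(bytesPerElement == 8 || bytesPerElement == 16);
    return deviceSlotsElements(nDeviceSlots, nLocalIndexBits) * bytesPerElement;
}
```

Two device slots with nLocalIndexBits = 3 hold 16 elements in total, which is 256 bytes in double-precision complex.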

Parameters
  • handle[in] the handle to the cuStateVec library

  • migrator[out] pointer to a new migrator descriptor

  • deviceSlots[in] pointer to sub state vectors on device

  • svDataType[in] data type of state vector

  • nDeviceSlots[in] the number of sub state vectors in deviceSlots

  • nLocalIndexBits[in] the number of index bits of sub state vectors


custatevecSubSVMigratorDestroy

custatevecStatus_t custatevecSubSVMigratorDestroy(custatevecHandle_t handle, custatevecSubSVMigratorDescriptor_t migrator)

Destroy sub state vector migrator descriptor.

This function releases a sub state vector migrator descriptor.

Parameters
  • handle[in] the handle to the cuStateVec library

  • migrator[inout] the migrator descriptor


custatevecSubSVMigratorMigrate

custatevecStatus_t custatevecSubSVMigratorMigrate(custatevecHandle_t handle, custatevecSubSVMigratorDescriptor_t migrator, int deviceSlotIndex, const void *srcSubSV, void *dstSubSV, custatevecIndex_t begin, custatevecIndex_t end)

Sub state vector migration.

This function performs a sub state vector migration. The deviceSlotIndex argument specifies the index of the sub state vector to be transferred, and the srcSubSV and dstSubSV arguments specify sub state vectors to be transferred from/to the sub state vector on device. In the current version, srcSubSV and dstSubSV must be arrays allocated on host memory and accessible from the device. If either srcSubSV or dstSubSV is a null pointer, the corresponding data transfer will be skipped. The begin and end arguments specify the range, [begin, end), for elements being transferred.
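The data movement can be sketched with a host-only model. migrateModel is a hypothetical name and plain integers stand in for amplitudes. Two points are assumptions rather than documented behavior: the same [begin, end) offsets are applied to the host arrays and to the device slot, and when both pointers are given, dstSubSV receives the slot contents as they were before srcSubSV overwrites them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference model (hypothetical helper, not a cuStateVec API).
// deviceSlots models the contiguous device array of sub state vectors.
void migrateModel(std::vector<int>& deviceSlots, int nLocalIndexBits,
                  int deviceSlotIndex, const int* srcSubSV, int* dstSubSV,
                  int64_t begin, int64_t end)
{
    const int64_t slotSize = int64_t{1} << nLocalIndexBits;
    const int64_t base = static_cast<int64_t>(deviceSlotIndex) * slotSize;
    assert(deviceSlotIndex >= 0);
    assert(base + slotSize <= static_cast<int64_t>(deviceSlots.size()));
    assert(0 <= begin && begin <= end && end <= slotSize);

    for (int64_t i = begin; i < end; ++i) {
        int& slotElement = deviceSlots[static_cast<std::size_t>(base + i)];
        const int previous = slotElement;
        // A null pointer skips the corresponding transfer.
        if (srcSubSV != nullptr) slotElement = srcSubSV[i];  // host to device slot
        if (dstSubSV != nullptr) dstSubSV[i] = previous;     // device slot to host
    }
}
```

In this model, passing only srcSubSV loads part of a host sub state vector into the selected slot, and passing only dstSubSV copies the slot range back to the host.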

Parameters
  • handle[in] the handle to the cuStateVec library

  • migrator[in] the migrator descriptor

  • deviceSlotIndex[in] the index to specify sub state vector to migrate

  • srcSubSV[in] a pointer to a sub state vector that is migrated to deviceSlots

  • dstSubSV[out] a pointer to a sub state vector that is migrated from deviceSlots

  • begin[in] the index to start migration

  • end[in] the index to end migration