Overview

This section describes the concepts and working principles of the cuDensityMat library. Some familiarity with quantum mechanics may be required.

The general library usage workflow is as follows:

  1. Create the library context handle. This will initialize the library context. A library context is associated with the currently set GPU. All subsequent API calls and numerical operations must be executed on that same GPU.

  2. (Optional) To run on multiple GPUs in parallel, activate distributed execution explicitly by calling the C API function cudensitymatResetDistributedConfiguration() and passing it a pointer to a valid MPI communicator. Each parallel process keeps the GPU that was current during its library context creation.

  3. Store the elements of the necessary elementary tensor operators in GPU memory. The corresponding arrays must use the generalized column-wise storage layout.

  4. Define the necessary elementary tensor operators, operator terms, and full quantum many-body operators.

  5. Create the necessary quantum states, query required storage, and attach GPU buffers to the quantum states, making them ready for computation. The original content of the attached GPU buffer defines the initial value of the quantum state. For multi-GPU runs, the library provides special API functions for querying which particular tensor slice is stored at the current parallel process.

  6. Prepare the action of a desired quantum operator on a desired kind of quantum states (shape, purity, and data type). This only needs to be done once.

  7. Compute the action of a desired quantum operator on a concrete quantum state that will be accumulated into another quantum state of the same kind.

  8. One can also compute the expectation value of a given quantum operator over a given quantum state. This computation also needs to be prepared first (once), and then executed as many times as needed.

  9. If needed, one can perform basic operations on a quantum state, such as initialization to zero, scaling, computing the norm or trace, accumulating one quantum state into another, and computing the inner product.

The details are presented below.

Tensor Spaces and Quantum States

A tensor space is a composite linear space constructed as a tensor product of one or more vector spaces. Each constituting vector space represents a quantum degree of freedom whose dimension equals the dimension of that vector space. The shape of the composite tensor space is a tuple of the dimensions of the constituting vector spaces in their original order. Note that the shape of the tensor space stays the same irrespective of whether we deal with pure states (vectors living in the tensor space) or mixed states (matrices associated with the same tensor space).

A pure quantum state is represented by a tensor from the specified tensor space, thus the shape of a pure quantum state coincides with the tensor space shape. A mixed quantum state is represented by a tensor belonging to the tensor space constructed as a tensor product of the specified tensor space (primary space or ket space) and its dual tensor space (dual space or bra space). Thus, the shape of a mixed quantum state is a concatenation of two tensor space shapes, where the first half of the tensor modes corresponds to the ket space while the second half of the tensor modes corresponds to the bra space. For example, if tensor \(V[i0, i1, i2]\) represents a pure quantum state from a tensor space constructed from three vector spaces (quantum degrees of freedom), then tensor \(D[i0, i1, i2; j0, j1, j2]\) represents a mixed quantum state from the same tensor space, where modes \(i0, i1, i2\) are the ket modes while modes \(j0, j1, j2\) are the bra modes. The corresponding tensor space shape in both cases is \([range(i0), range(i1), range(i2)]\).
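As a concrete illustration of these shapes, here is a small NumPy sketch (plain arrays only, not the library's own storage; the mode dimensions are arbitrary):

```python
import numpy as np

# Hypothetical tensor space with three quantum degrees of freedom.
space_shape = (2, 3, 4)

# A pure state tensor has the same shape as the tensor space.
pure = np.zeros(space_shape, dtype=np.complex128)

# A mixed state tensor lives in the product of the space and its dual:
# its shape is the tensor space shape concatenated with itself
# (ket modes first, bra modes second).
mixed = np.zeros(space_shape + space_shape, dtype=np.complex128)

print(pure.shape)   # (2, 3, 4)
print(mixed.shape)  # (2, 3, 4, 2, 3, 4)
```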

In C API, a pure/mixed quantum state is represented by the opaque cudensitymatState_t type. The cudensitymatCreateState() API function is used to define a pure or mixed quantum state from a tensor space of a desired shape. Optionally, one can define a batch of quantum states of the same shape (batched quantum state) by specifying a batch size greater than 1. Note that cudensitymatCreateState() expects a tensor space shape as an input argument, not the quantum state shape (which is the same for pure states, different for mixed states). The created quantum state has no associated physical storage yet. In order to attach physical storage to a quantum state, one must first query the number of storage components expected by the quantum state representation by calling cudensitymatStateGetNumComponents(). A storage component is a memory buffer used to store a specific tensor from the quantum state representation. By default, pure and mixed quantum states are stored using a single storage component which stores the dense tensor representing the quantum state, using the generalized column-major storage layout. The batch dimension (mode), if greater than 1, is always the last (most significant) mode. In contrast, factorized representations of quantum states may use more than one storage component, where each storage component stores one (potentially batched) tensor associated with the requested tensor factorization scheme. The current release does not support factorized tensor representations yet.
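The generalized column-major layout and the placement of the batch mode can be illustrated with NumPy's Fortran-order arrays (an analogy only; the library manages its own GPU buffers, and the dimensions here are arbitrary):

```python
import numpy as np

space_shape = (2, 3)
batch_size = 4

# Dense batched pure-state storage: a single buffer in column-major
# (Fortran) order, with the batch mode appended as the last, most
# significant mode.
batched_pure = np.zeros(space_shape + (batch_size,), order="F",
                        dtype=np.complex128)

# In column-major order the first mode varies fastest in memory,
# so the element strides grow from left to right.
strides = tuple(s // batched_pure.itemsize for s in batched_pure.strides)
print(strides)  # (1, 2, 6)
```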

Once the number of required storage components has been determined, the user can query the required size of each storage component by calling the C API function cudensitymatStateGetComponentStorageSize(). Knowing these sizes, the user can allocate a GPU buffer for each storage component and attach the buffer(s) to the quantum state by calling the C API function cudensitymatStateAttachComponentStorage(). At this point, the quantum state is fully set and can be used in actual computations. The attached GPU buffers must stay allocated for the whole lifetime of the quantum state. Note that the GPU buffers used to store the quantum state representation must be 256-byte aligned, which is guaranteed when using cudaMalloc or cudaMallocManaged. For custom GPU memory pools, this alignment must be respected explicitly; otherwise, the behavior is undefined. The data stored in the attached GPU buffer associated with a given storage component defines the initial value of the tensor associated with that storage component.

For distributed multi-GPU runs, which can be activated by calling the C API function cudensitymatResetDistributedConfiguration() and giving it a duplicate of an MPI communicator, the tensor storage will be distributed across all GPUs. The distributed execution configuration assumes one GPU per parallel process, with the total number of parallel processes being a power of two (1, 2, 4, 8, 16, 32, 64, etc.). In this case, dense pure/mixed quantum states will still be stored using a single storage component (tensor), however each parallel process will locally store its own slice of the full tensor. To query which specific tensor slice is stored by the current parallel process, one can first call the C API function cudensitymatStateGetComponentNumModes() to retrieve the number of tensor slice modes, and then cudensitymatStateGetComponentInfo() to retrieve the tensor slice information (mode offsets, mode extents).

There are a number of basic operations one can perform on a quantum state:

  - cudensitymatStateInitializeZero() initializes the quantum state to zero (sets all tensor elements to zero).
  - cudensitymatStateComputeScaling() multiplies the quantum state by a given scalar (or, for batched quantum states, an array of scalars, one per instance of the batch).
  - cudensitymatStateComputeNorm() computes the squared Euclidean/Frobenius norm of a pure/mixed quantum state, respectively (taking the square root yields the norm). For a batched quantum state, it returns an array of the size of the batch.
  - cudensitymatStateComputeTrace() computes the trace of a mixed quantum state (represented by a density matrix), or the squared Euclidean norm of a pure quantum state. For a batched quantum state, it returns an array of the size of the batch.
  - cudensitymatStateComputeAccumulation() accumulates (adds) one quantum state into another quantum state, with an optional scaling coefficient applied to the former. For batched quantum states, it accepts an array of scaling coefficients of the size of the batch.
  - cudensitymatStateComputeInnerProduct() computes the inner product of two quantum states: the regular Euclidean inner product for pure quantum states and the Frobenius inner product for mixed quantum states. For batched quantum states, it returns an array of the size of the batch.

In all cases where the quantum state is batched and an array is returned, the GPU storage provided for the results must be sufficient to store an array of the size of the batch.
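The mathematics behind these basic operations can be sketched in NumPy (a conceptual illustration of the quantities being computed, not the library API; all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# A normalized pure state and the corresponding density matrix.
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())

# Squared Euclidean norm of a pure state.
norm2_pure = np.vdot(psi, psi).real          # 1.0 after normalization

# Squared Frobenius norm and trace of a mixed state.
norm2_mixed = np.vdot(rho, rho).real         # Tr(rho^2); 1.0 for a pure rho
trace = np.trace(rho).real                   # 1.0

# Accumulation with a scaling coefficient, and the Frobenius inner product.
sigma = np.eye(d) / d                        # maximally mixed state
sigma_acc = sigma + 0.5 * rho                # accumulate 0.5 * rho into sigma
inner = np.vdot(sigma, rho)                  # Tr(sigma^dagger rho) = 1/d here
```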

Note that all numerical operations accept a CUDA stream argument; that is, these operations generally execute asynchronously and require explicit synchronization via cudaStreamSynchronize or cudaDeviceSynchronize. Once the quantum state is no longer needed and all preceding operations on it have been synchronized via the CUDA synchronization API, it can be safely destroyed by calling the C API function cudensitymatDestroyState().

Quantum Many-Body Operators (Super-Operators)

A broad class of quantum Liouvillian superoperators, and quantum many-body operators in general, can be expressed as a sum of tensor products of elementary tensor operators with generally time-dependent coefficients. An elementary tensor operator is a tensor operator acting on a specific number of quantum degrees of freedom, either from the right, from the left, or both. Each elementary tensor operator has the same number of ket and bra modes, with the ket modes preceding the bra modes in its tensor. The bra modes of the elementary tensor operator act on the ket modes of the quantum state (action from the left), while the ket modes of the elementary tensor operator act on the bra modes of the quantum state (action from the right). In practice, many quantum Liouvillian operators are expressed in terms of tensor products of one-body elementary tensor operators, like creation/annihilation operators, the particle number operator, coordinate/momentum operators, individual spin operators, etc. For the sake of concreteness, let’s consider the following quantum Liouvillian operator as an example:

\[L = \omega_{c} b^{\dagger} b + \frac{1}{2} \omega_{a} \sigma_{z} + [ \frac{1}{2} g(t, \alpha, \beta) b^{\dagger} \sigma_{-} + \frac{1}{2} g(t, \alpha, \beta) b \sigma_{+} ] + \gamma \sigma_{-}\]

In the above Liouvillian operator, \(b\), \(b^{\dagger}\), \(\sigma_{z}\), \(\sigma_{-}\), \(\sigma_{+}\) are elementary tensor operators, each acting on a single quantum degree of freedom and represented by an order-2 tensor \(U[i0; j0]\), where \(i0\) is the ket mode and \(j0\) is the bra mode (an order-k elementary tensor operator is represented by a tensor with \(k\) ket modes and \(k\) bra modes). \(\omega_{c}\), \(\omega_{a}\), and \(\gamma\) are constant scalars. \(g(t, \alpha, \beta)\) is a scalar coefficient that depends on time \(t\) and two real parameters, \(\alpha\) and \(\beta\), which parameterize the Liouvillian operator.
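To make the structure concrete, the following NumPy sketch assembles the Hermitian part of this example Liouvillian from one-body elementary operators (the cavity truncation nc, the parameter values, and the concrete form of g are arbitrary assumptions; the dissipative term \(\gamma \sigma_{-}\) is a separate, non-Hermitian elementary operator and is omitted here):

```python
import numpy as np

nc = 4  # hypothetical truncation of the cavity (bosonic) mode

# One-body elementary operators.
b = np.diag(np.sqrt(np.arange(1.0, nc)), k=1)    # annihilation
bdag = b.conj().T                                 # creation
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sm = np.array([[0, 0], [1, 0]], dtype=complex)    # sigma_-
sp = sm.conj().T                                  # sigma_+

def g(t, alpha, beta):
    # Hypothetical real-valued time-dependent coupling g(t, alpha, beta).
    return alpha * np.cos(beta * t)

omega_c, omega_a = 1.0, 0.5
t, alpha, beta = 0.1, 0.2, 0.3

I_c, I_a = np.eye(nc), np.eye(2)
H = (omega_c * np.kron(bdag @ b, I_a)
     + 0.5 * omega_a * np.kron(I_c, sz)
     + 0.5 * g(t, alpha, beta) * (np.kron(bdag, sm) + np.kron(b, sp)))

# The coherent part assembled above is Hermitian.
assert np.allclose(H, H.conj().T)
```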

The C API function cudensitymatCreateElementaryOperator() creates an elementary tensor operator (cudensitymatElementaryOperator_t) by specifying the number of quantum degrees of freedom (space modes) it acts on, the extents of those space modes, the numerical data type (real/complex, FP32/FP64), a GPU buffer containing the tensor elements of the elementary tensor operator (generalized column-wise layout), and an optional user-defined tensor callback function. By registering a user-provided tensor callback function, we allow the cuDensityMat library to automatically update tensor elements of the elementary tensor operator during the computation of the operator action on a quantum state, based on the current value of time as well as an arbitrary set of real parameters (the signature of the tensor callback function pointer cudensitymatTensorCallback_t is provided in the public C header cudensitymat.h). Note that the user-provided tensor callback function updates the tensor elements inside a temporary CPU buffer passed to it; the library then automatically propagates the update to the original GPU buffer used during the definition of the elementary tensor operator. Once the elementary tensor operator is no longer needed and is not used by any operator term, it can be safely destroyed by calling the C API function cudensitymatDestroyElementaryOperator(). Note that the GPU buffer storing tensor elements of the elementary tensor operator must stay valid for the whole lifetime of the elementary tensor operator.

It is often convenient to represent a given quantum many-body operator as a sum of different terms, with each term having its own physical meaning, for example the system Hamiltonian, pulse/drive Hamiltonian, dissipation terms, etc. For example, the Liouvillian operator defined above can be thought of as composed of four operator terms (the operator term in the square brackets comprises two tensor products of elementary tensor operators). The cuDensityMat library provides the concept of the operator term for this purpose, represented by the cudensitymatOperatorTerm_t C type. The C API function cudensitymatCreateOperatorTerm() creates an empty operator term by only specifying the tensor space shape where the quantum states acted on by this operator term are residing. Subsequently, the user can construct the desired operator term by gradually appending tensor products of elementary tensor operators with generally time-dependent coefficients to it. The C API function cudensitymatOperatorTermAppendElementaryProduct() serves this purpose. It accepts an array of elementary tensor operators forming the desired tensor product, the list of quantum degrees of freedom (space modes) the tensor product is acting on (concatenated across all constituting elementary tensor operators in their respective order), and the duality of action on each space mode. If the duality is set to 0, the bra mode of the elementary tensor operator is picked to act on a ket mode of the quantum state (action from the left); otherwise, the ket mode of the elementary tensor operator is picked to act on a bra mode of the quantum state (action from the right). Finally, the C API function cudensitymatOperatorTermAppendElementaryProduct() also accepts a static coefficient associated with the appended tensor product as well as an optional user-defined coefficient callback function (cudensitymatScalarCallback_t) that will also be associated with the appended tensor product.
The cuDensityMat library will use the user-provided coefficient callback function for retrieving the time-dependent part of the total coefficient associated with the appended tensor product, where the total coefficient is a product of the static coefficient and the value returned by the coefficient callback function for a given value of time and an arbitrary set of real parameters during operator action computation. For example, the tensor product \(\frac{1}{2} g(t, \alpha, \beta) b \sigma_{+}\) comprises two elementary tensor operators, \(b\) and \(\sigma_{+}\), a constant (static) coefficient \(\frac{1}{2}\), and a time-dependent parameterized coefficient \(g(t, \alpha, \beta)\) which can be defined as a user-provided CPU callback function.
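The coefficient semantics can be illustrated with a tiny Python sketch (the concrete form of the callback g and all parameter values are hypothetical):

```python
import numpy as np

def g(t, alpha, beta):
    # Hypothetical user-provided coefficient callback.
    return alpha * np.exp(1j * beta * t)

static_coeff = 0.5  # the static coefficient of the appended tensor product
t, alpha, beta = 1.0, 2.0, 0.25

# The total coefficient applied to the tensor product at time t is the
# product of the static coefficient and the callback's return value.
total = static_coeff * g(t, alpha, beta)
```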

Once an operator term is no longer needed and is not used by any operator, it can be safely destroyed by calling the C API function cudensitymatDestroyOperatorTerm(). Note that the elementary tensor operators constituting the operator term must stay alive for the whole lifetime of the operator term.

Once all necessary operator terms have been defined, the user can proceed to defining the full quantum many-body operator (or superoperator), represented by the cudensitymatOperator_t C type, by calling the C API function cudensitymatCreateOperator(). Similarly to creating an operator term, this API function creates an empty operator by only taking the shape of the tensor space where the quantum states it will act on reside. Subsequently, the user can construct the desired operator by appending the constituting operator terms to it via the C API function cudensitymatOperatorAppendTerm(). Each appended operator term can be supplied with a static coefficient as well as an optional user-defined coefficient callback function such that the total coefficient associated with the appended operator term will be a product of the two. Additionally, the user can request to append the operator term in its dual form, where the duality of all its modes will be flipped (ket <–> bra). This is convenient for defining the von Neumann part of the master equation, where the Hamiltonian acts from different sides on the density matrix to form the commutator. Once the operator is no longer needed and all computations in which it participated have been synchronized, it can be safely destroyed by calling the C API function cudensitymatDestroyOperator(). Note that the constituting operator terms must stay alive for the whole lifetime of the operator.
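The effect of appending a term in its dual form can be illustrated for a single one-body operator acting on a density matrix (a NumPy analogy of the math, not the library API; all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = A + A.conj().T                    # a Hermitian "Hamiltonian"

psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())       # a density matrix

# Appending H normally gives action from the left (H @ rho); appending it
# in dual form flips ket <-> bra and gives action from the right (rho @ H).
# With coefficients -i and +i the two appends produce the von Neumann term.
von_neumann = -1j * (H @ rho) + 1j * (rho @ H)   # -i [H, rho]

# The commutator term is traceless and Hermitian, as expected.
assert np.isclose(np.trace(von_neumann), 0.0)
assert np.allclose(von_neumann, von_neumann.conj().T)
```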

Finally, in order to apply the defined quantum many-body operator to a quantum state, the user needs to first prepare its action (only once) and then compute its action as many times as needed. The C API function cudensitymatOperatorPrepareAction() prepares the action of the operator on a quantum state of a given kind. It will return the workspace buffer size required for the actual computation. The user can then allocate the GPU workspace buffer of the required size (or larger) and attach it to the workspace descriptor used in the call (cudensitymatWorkspaceDescriptor_t). Once the workspace is ready, calling the C API function cudensitymatOperatorComputeAction() will apply the operator to a given (input) quantum state, accumulating its action into another (output) quantum state. This API call takes the current value of time as well as an arbitrary set of real parameters that will trigger an update of the value of the operator via the registered callback functions before starting the computation. The requested computation can be inserted in a CUDA stream and executed asynchronously with respect to the host, in which case it is the user’s responsibility to synchronize the computation via cudaStreamSynchronize.

Workspace Descriptor

Each non-trivial numerical operation supported by the cuDensityMat library generally requires a workspace buffer allocated in GPU memory. The C API function cudensitymatCreateWorkspace() creates an empty workspace descriptor. The preparation stage for a desired numerical computation sets the required workspace buffer size inside the workspace descriptor. The user can then query the required size via cudensitymatWorkspaceGetMemorySize() and allocate a GPU workspace buffer of at least that size. Once the workspace buffer has been allocated (or taken from a pre-allocated memory pool), it can be attached to the workspace descriptor by calling the C API function cudensitymatWorkspaceSetMemory(). At this point, the workspace descriptor is ready to be passed to the actual computation call. The workspace descriptor can be reused after the computation call it was passed to has been synchronized, provided that the workspace buffer size is sufficient for the next computation call. The GPU buffer used by the workspace descriptor must stay allocated while the workspace descriptor is in use. Once the workspace descriptor is no longer needed, it can be safely destroyed via cudensitymatDestroyWorkspace().

Note that any GPU buffer attached to a workspace descriptor must be aligned to a 256-byte boundary, which is guaranteed when using cudaMalloc. For custom GPU memory pools, this alignment must be respected explicitly.

Coupled Quantum Dynamics Master Equations (System of ODE)

While the concept of a Liouvillian superoperator and its action on a quantum state is sufficient to specify the right-hand side of broadly used master equations, like the Lindblad equation, in a more general setting one may need to specify a system of coupled ordinary differential equations (ODE) that simultaneously evolves multiple (potentially batched) quantum states under their own Liouvillian operators, as is done in the hierarchical equations of motion (HEOM) method. In such cases, the user needs to define the so-called composite operator action, represented by a set of C objects of the cudensitymatOperatorAction_t type, constructed by calling the C API function cudensitymatCreateOperatorAction(). The composite operator action assumes a simultaneous evolution of \(M\) quantum states described by a system of \(M\) coupled ODEs. For each ODE of the system, one constructs a separate operator action object (cudensitymatOperatorAction_t) by calling cudensitymatCreateOperatorAction(), thus requiring \(M\) operator action objects for the full ODE system specification. The construction of the operator action object representing a specific ODE accepts \(M\) operators corresponding to the \(M\) evolved quantum states, where each operator acts on its corresponding quantum state and accumulates into the right-hand side of that specific ODE. In turn, the right-hand side of that ODE determines the time derivative of the corresponding quantum state from the set of \(M\) quantum states being evolved in time. If a specific quantum state is not present in a specific ODE of the coupled system, the corresponding operator is formally zero, which is specified by providing a NULL pointer in the corresponding position.
Once the \(M\) operator action (cudensitymatOperatorAction_t) objects have been constructed, that is, the coupled ODE system has been fully specified, one can prepare each of them for computation by calling the C API function cudensitymatOperatorActionPrepare(), which returns the size of the required workspace buffer inside the workspace descriptor. In order to prepare each operator action object, this API function needs to know the kind of quantum states to which the operators will be applied. Once prepared, the user can query the required workspace buffer size, allocate a GPU workspace buffer of the required (or larger) size, and attach it to the workspace descriptor (cudensitymatWorkspaceDescriptor_t) used in the call. Once the workspace is ready, calling the C API function cudensitymatOperatorActionCompute() will apply all operators to the given (input) quantum states, accumulating the result into the output quantum state. This API call takes the current value of time as well as an arbitrary set of real parameters that will trigger an update of the value of the operators via the registered callback functions before starting the computation. The requested computation can be inserted in a CUDA stream and executed asynchronously with respect to the host, in which case it is the user’s responsibility to synchronize the computation via cudaStreamSynchronize. Note that if all constructed and prepared operator actions (cudensitymatOperatorAction_t) are placed in the same CUDA stream, they will be computed in order, in which case the same workspace buffer can be reused for all these prepared operator actions as long as its size is sufficient for any of them.
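The structure of the composite operator action can be sketched in NumPy, with plain matrices standing in for operators and None standing in for the NULL pointer of a formally zero operator (all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = 2

# The M quantum states being co-evolved (plain vectors here).
states = [rng.normal(size=d) for _ in range(M)]

# ops[i][j] is the operator applied to state j inside ODE i;
# None plays the role of the NULL pointer for a formally zero operator.
ops = [
    [rng.normal(size=(d, d)), None],
    [None, rng.normal(size=(d, d))],
]

def rhs(i):
    # Right-hand side of ODE i: accumulate the action of each non-zero
    # operator on its corresponding quantum state.
    out = np.zeros(d)
    for op, state in zip(ops[i], states):
        if op is not None:
            out += op @ state
    return out

derivs = [rhs(i) for i in range(M)]  # time derivatives of the M states
```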

Note that the operator action object can also be constructed for a single (potentially batched) quantum state acted on by a single Liouvillian operator, in which case it is equivalent to preparing and computing the operator action directly via the operator API. When the operator action object is no longer needed, it can be safely destroyed by calling cudensitymatDestroyOperatorAction().

Properties of Quantum States

In order to compute different properties of a quantum state, the user can create, prepare, and compute the corresponding property calculation object. The C API function cudensitymatCreateExpectation() takes an arbitrary operator (cudensitymatOperator_t) and creates the expectation value calculation object for it (cudensitymatExpectation_t). The C API function cudensitymatExpectationPrepare(), which needs to be called only once, prepares the expectation value calculation object for computation for a given kind of quantum states. Similarly to other prepare functions, it sets the required workspace buffer size inside the workspace descriptor. The user then needs to allocate a GPU buffer of that (or larger) size and attach it to the workspace descriptor. Once prepared, the expectation value of the operator can be computed any number of times via cudensitymatExpectationCompute() for the same or different quantum states of the same kind (same shape, purity, and data type). If the quantum state is batched, cudensitymatExpectationCompute() will compute the expectation values for all instances of the batch, thus returning an array of expectation values, with the array size expected to be at least the batch size.
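The quantity being computed can be illustrated in NumPy for both purities (a mathematical sketch, not the library API; the observable and state are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
O = A + A.conj().T                    # a Hermitian observable

psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())       # the matching mixed state

exp_pure = np.vdot(psi, O @ psi)      # <psi| O |psi>
exp_mixed = np.trace(O @ rho)         # Tr(O rho)

# For rho = |psi><psi| the two expectation values coincide and are real.
assert np.allclose(exp_pure, exp_mixed)
assert abs(exp_pure.imag) < 1e-10
```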

Multi-GPU Multi-Node Execution

The cuDensityMat library C API works on both single and multiple GPUs. Running parallel jobs on multiple GPUs requires building an MPI plugin after downloading the library archive. After unpacking the archive, one can find the script called activate_mpi_cudm.sh inside the distributed_interfaces folder. By running the script inside that folder, the MPI interface plugin library libcudensitymat_distributed_interface_mpi.so will be built. The script requires environment variables CUDA_PATH and MPI_PATH to point to the CUDA and MPI installation directories, respectively. The provided MPI installation must point to a CUDA-aware MPI implementation. The script will also set environment variable CUDENSITYMAT_COMM_LIB to point to the MPI interface plugin library location. This variable should be exported globally to make sure the cuDensityMat library can find the MPI interface plugin library during run time.

Once the MPI interface plugin has been set up, the activation of parallel execution on multiple GPUs is achieved by calling the C API function cudensitymatResetDistributedConfiguration(). This function accepts a duplicate of an MPI communicator in a type-erased form, for example a duplicate of the MPI_COMM_WORLD communicator obtained via an MPI_Comm_dup call (consult the MPI library documentation). After that point, all numerical computations will run in parallel across all MPI ranks available inside the provided MPI communicator. The user can query the total number of parallel ranks and the rank of the current process by calling cudensitymatGetNumRanks() and cudensitymatGetProcRank(), respectively. The cuDensityMat library supports parallel execution only with a power-of-two number of parallel processes, one GPU per process.

Citing cuQuantum

    1. Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.