Overview

This section describes the concepts and working principles of the cuDensityMat library. Some familiarity with quantum mechanics may be required.

The general library usage workflow is as follows:

  1. Create the library context handle. This will initialize the library context. A library context is associated with the currently set GPU. All subsequent API calls and numerical operations must be executed on that same GPU.

  2. (Optional) To run on multiple GPUs in parallel, activate distributed execution explicitly by calling the C API function cudensitymatResetDistributedConfiguration() and passing it a pointer to a valid MPI communicator. Each parallel process keeps the GPU that was current during its library context creation.

  3. Store the elements of the necessary elementary tensor operators in GPU memory. The corresponding arrays must use the generalized column-wise storage layout.

  4. Define the necessary elementary tensor operators, operator terms, and full quantum many-body operators.

  5. Create the necessary quantum states, query required storage, and attach GPU buffers to the quantum states, making them ready for computation. The original content of the attached GPU buffer defines the initial value of the quantum state. For multi-GPU runs, the library provides special API functions for querying which particular tensor slice is stored at the current parallel process.

  6. Prepare the action of a desired quantum operator on a desired kind of quantum states (shape, purity, and data type). This only needs to be done once.

  7. Compute the action of a desired quantum operator on a concrete quantum state that will be accumulated into another quantum state of the same kind.

  8. One can also compute the expectation value of a given quantum operator over a given quantum state. This computation also needs to be prepared first (once), and then executed as many times as needed.

  9. If needed, one can perform basic operations on a quantum state, such as initialization to zero, scaling, computing the norm or trace, accumulating one quantum state into another, and computing the inner product.

The details are presented below.

Tensor Spaces and Quantum States

A tensor space is a composite linear space constructed as a tensor product of one or more vector spaces. Each constituting vector space represents a quantum degree of freedom whose dimension equals the dimension of that vector space. The shape of the composite tensor space is a tuple of the dimensions of the constituting vector spaces in their original order. Note that the shape of the tensor space stays the same irrespective of whether we deal with pure states (vectors living in the tensor space) or mixed states (matrices associated with the same tensor space).

A pure quantum state is represented by a tensor from the specified tensor space, thus the shape of a pure quantum state coincides with the tensor space shape. A mixed quantum state is represented by a tensor belonging to the tensor space constructed as a tensor product of the specified tensor space (primary space or ket space) and its dual tensor space (dual space or bra space). Thus, the shape of a mixed quantum state is a concatenation of two tensor space shapes, where the first half of the tensor modes corresponds to the ket space while the second half of the tensor modes corresponds to the bra space. For example, if tensor \(V[i0, i1, i2]\) represents a pure quantum state from a tensor space constructed from three vector spaces (quantum degrees of freedom), then tensor \(D[i0, i1, i2; j0, j1, j2]\) represents a mixed quantum state from the same tensor space, where modes \(i0, i1, i2\) are the ket modes while modes \(j0, j1, j2\) are the bra modes. The corresponding tensor space shape in both cases is \([range(i0), range(i1), range(i2)]\).
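As a concrete illustration of these shapes, here is a small NumPy sketch (plain arrays only, not the library's own storage; the mode dimensions are arbitrary):

```python
import numpy as np

# Hypothetical tensor space with three quantum degrees of freedom.
space_shape = (2, 3, 4)

# A pure state tensor has the same shape as the tensor space.
pure = np.zeros(space_shape, dtype=np.complex128)

# A mixed state tensor lives in the product of the space and its dual:
# its shape is the tensor space shape concatenated with itself
# (ket modes first, bra modes second).
mixed = np.zeros(space_shape + space_shape, dtype=np.complex128)

print(pure.shape)   # (2, 3, 4)
print(mixed.shape)  # (2, 3, 4, 2, 3, 4)
```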

In C API, a pure/mixed quantum state is represented by the opaque cudensitymatState_t type. The cudensitymatCreateState() API function is used to define a pure or mixed quantum state from a tensor space of a desired shape. Optionally, one can define a batch of quantum states of the same shape (batched quantum state) by specifying a batch size greater than 1. Note that cudensitymatCreateState() expects a tensor space shape as an input argument, not the quantum state shape (which is the same for pure states, different for mixed states). The created quantum state has no associated physical storage yet. In order to attach physical storage to a quantum state, one must first query the number of storage components expected by the quantum state representation by calling cudensitymatStateGetNumComponents(). A storage component is a memory buffer used to store a specific tensor from the quantum state representation. By default, pure and mixed quantum states are stored using a single storage component which stores the dense tensor representing the quantum state, using the generalized column-major storage layout. The batch dimension (mode), if greater than 1, is always the last (most significant) mode. In contrast, factorized representations of quantum states may use more than one storage component, where each storage component stores one (potentially batched) tensor associated with the requested tensor factorization scheme. The current release does not support factorized tensor representations yet.
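The generalized column-major layout and the placement of the batch mode can be illustrated with NumPy's Fortran-order arrays (an analogy only; the library manages its own GPU buffers, and the dimensions here are arbitrary):

```python
import numpy as np

space_shape = (2, 3)
batch_size = 4

# Dense batched pure-state storage: a single buffer in column-major
# (Fortran) order, with the batch mode appended as the last, most
# significant mode.
batched_pure = np.zeros(space_shape + (batch_size,), order="F",
                        dtype=np.complex128)

# In column-major order the first mode varies fastest in memory,
# so the element strides grow from left to right.
strides = tuple(s // batched_pure.itemsize for s in batched_pure.strides)
print(strides)  # (1, 2, 6)
```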

Once the number of required storage components has been determined, the user can query the required size of each storage component by calling the C API function cudensitymatStateGetComponentStorageSize(). Knowing these sizes, the user can allocate a GPU buffer for each storage component and attach the buffer(s) to the quantum state by calling the C API function cudensitymatStateAttachComponentStorage(). At this point, the quantum state is fully set and can be used in actual computations. The attached GPU buffers must stay allocated for the whole lifetime of the quantum state. Note that the GPU buffers used to store the quantum state representation must be 256-byte aligned, which is guaranteed when using cudaMalloc or cudaMallocManaged. For custom GPU memory pools, this alignment must be respected explicitly; otherwise, the behavior is undefined. The data stored in the attached GPU buffer associated with a given storage component defines the initial value of the tensor associated with that storage component.

For distributed multi-GPU runs, which can be activated by calling the C API function cudensitymatResetDistributedConfiguration() and giving it a duplicate of an MPI communicator, the tensor storage will be distributed across all GPUs. The distributed execution configuration assumes one GPU per parallel process, with the total number of parallel processes being a power of two (1, 2, 4, 8, 16, 32, 64, etc.). In this case, dense pure/mixed quantum states will still be stored using a single storage component (tensor), however each parallel process will locally store its own slice of the full tensor. To query which specific tensor slice is stored by the current parallel process, one can first call the C API function cudensitymatStateGetComponentNumModes() to retrieve the number of tensor slice modes, and then cudensitymatStateGetComponentInfo() to retrieve the tensor slice information (mode offsets, mode extents).

There are a number of basic operations one can perform on a quantum state:

  - cudensitymatStateInitializeZero() initializes the quantum state to zero (sets all tensor elements to zero).
  - cudensitymatStateComputeScaling() multiplies the quantum state by a given scalar (or, for batched quantum states, an array of scalars, one per instance of the batch).
  - cudensitymatStateComputeNorm() computes the squared Euclidean/Frobenius norm of a pure/mixed quantum state, respectively (taking the square root yields the norm). For a batched quantum state, it returns an array of the size of the batch.
  - cudensitymatStateComputeTrace() computes the trace of a mixed quantum state (represented by a density matrix), or the squared Euclidean norm of a pure quantum state. For a batched quantum state, it returns an array of the size of the batch.
  - cudensitymatStateComputeAccumulation() accumulates (adds) one quantum state into another quantum state, with an optional scaling coefficient applied to the former. For batched quantum states, it accepts an array of scaling coefficients of the size of the batch.
  - cudensitymatStateComputeInnerProduct() computes the inner product of two quantum states: the regular Euclidean inner product for pure quantum states and the Frobenius inner product for mixed quantum states. For batched quantum states, it returns an array of the size of the batch.

In all cases where the quantum state is batched and an array is returned, the GPU storage provided for the results must be sufficient to store an array of the size of the batch.
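The mathematics behind these basic operations can be sketched in NumPy (a conceptual illustration of the quantities being computed, not the library API; all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# A normalized pure state and the corresponding density matrix.
psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())

# Squared Euclidean norm of a pure state.
norm2_pure = np.vdot(psi, psi).real          # 1.0 after normalization

# Squared Frobenius norm and trace of a mixed state.
norm2_mixed = np.vdot(rho, rho).real         # Tr(rho^2); 1.0 for a pure rho
trace = np.trace(rho).real                   # 1.0

# Accumulation with a scaling coefficient, and the Frobenius inner product.
sigma = np.eye(d) / d                        # maximally mixed state
sigma_acc = sigma + 0.5 * rho                # accumulate 0.5 * rho into sigma
inner = np.vdot(sigma, rho)                  # Tr(sigma^dagger rho) = 1/d here
```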

Note that all numerical operations accept a CUDA stream argument; that is, these operations generally execute asynchronously and require explicit synchronization via cudaStreamSynchronize or cudaDeviceSynchronize. Once the quantum state is no longer needed and all preceding operations on it have been synchronized via the CUDA synchronization API, it can be safely destroyed by calling the C API function cudensitymatDestroyState().

Quantum Many-Body Operators (Super-Operators)

A broad class of quantum Liouvillian superoperators, and quantum many-body operators in general, can be expressed as a sum of tensor products of elementary tensor operators with generally time-dependent coefficients. An elementary tensor operator is a tensor operator acting on a specific number of quantum degrees of freedom, either from the right, from the left, or both. Each elementary tensor operator has the same number of ket and bra modes, with the ket modes preceding the bra modes in its tensor. The bra modes of the elementary tensor operator act on the ket modes of the quantum state (action from the left), while the ket modes of the elementary tensor operator act on the bra modes of the quantum state (action from the right). In practice, many quantum Liouvillian operators are expressed in terms of tensor products of one-body elementary tensor operators, like creation/annihilation operators, the particle number operator, coordinate/momentum operators, individual spin operators, etc. For the sake of concreteness, let’s consider the following quantum Liouvillian operator as an example:

\[L = \omega_{c} b^{\dagger} b + \frac{1}{2} \omega_{a} \sigma_{z} + [ \frac{1}{2} g(t, \alpha, \beta) b^{\dagger} \sigma_{-} + \frac{1}{2} g(t, \alpha, \beta) b \sigma_{+} ] + \gamma \sigma_{-}\]

In the above Liouvillian operator, \(b\), \(b^{\dagger}\), \(\sigma_{z}\), \(\sigma_{-}\), \(\sigma_{+}\) are elementary tensor operators, each acting on a single quantum degree of freedom and represented by an order-2 tensor \(U[i0; j0]\), where \(i0\) is the ket mode and \(j0\) is the bra mode (an order-k elementary tensor operator is represented by a tensor with \(k\) ket modes and \(k\) bra modes). \(\omega_{c}\), \(\omega_{a}\), and \(\gamma\) are constant scalars. \(g(t, \alpha, \beta)\) is a scalar coefficient that depends on time \(t\) and two real parameters, \(\alpha\) and \(\beta\), which parameterize the Liouvillian operator.
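To make the structure concrete, the following NumPy sketch assembles the Hermitian part of this example Liouvillian from one-body elementary operators (the cavity truncation nc, the parameter values, and the concrete form of g are arbitrary assumptions; the dissipative term \(\gamma \sigma_{-}\) is a separate, non-Hermitian elementary operator and is omitted here):

```python
import numpy as np

nc = 4  # hypothetical truncation of the cavity (bosonic) mode

# One-body elementary operators.
b = np.diag(np.sqrt(np.arange(1.0, nc)), k=1)    # annihilation
bdag = b.conj().T                                 # creation
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sm = np.array([[0, 0], [1, 0]], dtype=complex)    # sigma_-
sp = sm.conj().T                                  # sigma_+

def g(t, alpha, beta):
    # Hypothetical real-valued time-dependent coupling g(t, alpha, beta).
    return alpha * np.cos(beta * t)

omega_c, omega_a = 1.0, 0.5
t, alpha, beta = 0.1, 0.2, 0.3

I_c, I_a = np.eye(nc), np.eye(2)
H = (omega_c * np.kron(bdag @ b, I_a)
     + 0.5 * omega_a * np.kron(I_c, sz)
     + 0.5 * g(t, alpha, beta) * (np.kron(bdag, sm) + np.kron(b, sp)))

# The coherent part assembled above is Hermitian.
assert np.allclose(H, H.conj().T)
```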

The C API function cudensitymatCreateElementaryOperator() creates an elementary tensor operator (cudensitymatElementaryOperator_t) by specifying the number of quantum degrees of freedom (space modes) it acts on, the extents of those space modes, the numerical data type (real/complex, FP32/FP64), a GPU buffer containing the tensor elements of the elementary tensor operator (generalized column-wise layout), and an optional user-defined tensor callback function. By registering a user-provided tensor callback function, we allow the cuDensityMat library to automatically update tensor elements of the elementary tensor operator during the computation of the operator action on a quantum state, based on the current value of time as well as an arbitrary set of real parameters (the signature of the tensor callback function pointer cudensitymatTensorCallback_t is provided in the public C header cudensitymat.h). Note that the user-provided tensor callback function updates the tensor elements inside a temporary CPU buffer passed to it; the library then automatically propagates the update to the original GPU buffer used during the definition of the elementary tensor operator. Once the elementary tensor operator is no longer needed and is not used by any operator term, it can be safely destroyed by calling the C API function cudensitymatDestroyElementaryOperator(). Note that the GPU buffer storing tensor elements of the elementary tensor operator must stay valid for the whole lifetime of the elementary tensor operator.

It is often convenient to represent a given quantum many-body operator as a sum of different terms, with each term having its own physical meaning, for example the system Hamiltonian, pulse/drive Hamiltonian, dissipation terms, etc. For example, the Liouvillian operator defined above can be thought of as composed of four operator terms (the operator term in the square brackets comprises two tensor products of elementary tensor operators). The cuDensityMat library provides the concept of the operator term for this purpose, represented by the cudensitymatOperatorTerm_t C type. The C API function cudensitymatCreateOperatorTerm() creates an empty operator term by only specifying the tensor space shape where the quantum states acted on by this operator term are residing. Subsequently, the user can construct the desired operator term by gradually appending tensor products of elementary tensor operators with generally time-dependent coefficients to it. The C API function cudensitymatOperatorTermAppendElementaryProduct() serves this purpose. It accepts an array of elementary tensor operators forming the desired tensor product, the list of quantum degrees of freedom (space modes) the tensor product is acting on (concatenated across all constituting elementary tensor operators in their respective order), and the duality of action on each space mode. If the duality is set to 0, the bra mode of the elementary tensor operator is picked to act on a ket mode of the quantum state (action from the left); otherwise, the ket mode of the elementary tensor operator is picked to act on a bra mode of the quantum state (action from the right). Finally, the C API function cudensitymatOperatorTermAppendElementaryProduct() also accepts a static coefficient associated with the appended tensor product as well as an optional user-defined coefficient callback function (cudensitymatScalarCallback_t) that will also be associated with the appended tensor product.
The cuDensityMat library will use the user-provided coefficient callback function for retrieving the time-dependent part of the total coefficient associated with the appended tensor product, where the total coefficient is a product of the static coefficient and the value returned by the coefficient callback function for a given value of time and an arbitrary set of real parameters during operator action computation. For example, the tensor product \(\frac{1}{2} g(t, \alpha, \beta) b \sigma_{+}\) comprises two elementary tensor operators, \(b\) and \(\sigma_{+}\), a constant (static) coefficient \(\frac{1}{2}\), and a time-dependent parameterized coefficient \(g(t, \alpha, \beta)\) which can be defined as a user-provided CPU callback function.
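The coefficient semantics can be illustrated with a tiny Python sketch (the concrete form of the callback g and all parameter values are hypothetical):

```python
import numpy as np

def g(t, alpha, beta):
    # Hypothetical user-provided coefficient callback.
    return alpha * np.exp(1j * beta * t)

static_coeff = 0.5  # the static coefficient of the appended tensor product
t, alpha, beta = 1.0, 2.0, 0.25

# The total coefficient applied to the tensor product at time t is the
# product of the static coefficient and the callback's return value.
total = static_coeff * g(t, alpha, beta)
```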

Once an operator term is no longer needed and is not used by any operator, it can be safely destroyed by calling the C API function cudensitymatDestroyOperatorTerm(). Note that the elementary tensor operators constituting the operator term must stay alive for the whole lifetime of the operator term.

Once all necessary operator terms have been defined, the user can proceed to defining the full quantum many-body operator (or superoperator), represented by the cudensitymatOperator_t C type, by calling the C API function cudensitymatCreateOperator(). Similarly to creating an operator term, this API function creates an empty operator by only taking the shape of the tensor space where the quantum states it will act on reside. Subsequently, the user can construct the desired operator by appending the constituting operator terms to it via the C API function cudensitymatOperatorAppendTerm(). Each appended operator term can be supplied with a static coefficient as well as an optional user-defined coefficient callback function such that the total coefficient associated with the appended operator term will be a product of the two. Additionally, the user can request to append the operator term in its dual form, where the duality of all its modes will be flipped (ket <–> bra). This is convenient for defining the von Neumann part of the master equation, where the Hamiltonian acts from different sides on the density matrix to form the commutator. Once the operator is no longer needed and all computations in which it participated have been synchronized, it can be safely destroyed by calling the C API function cudensitymatDestroyOperator(). Note that the constituting operator terms must stay alive for the whole lifetime of the operator.
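The effect of appending a term in its dual form can be illustrated for a single one-body operator acting on a density matrix (a NumPy analogy of the math, not the library API; all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
H = A + A.conj().T                    # a Hermitian "Hamiltonian"

psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())       # a density matrix

# Appending H normally gives action from the left (H @ rho); appending it
# in dual form flips ket <-> bra and gives action from the right (rho @ H).
# With coefficients -i and +i the two appends produce the von Neumann term.
von_neumann = -1j * (H @ rho) + 1j * (rho @ H)   # -i [H, rho]

# The commutator term is traceless and Hermitian, as expected.
assert np.isclose(np.trace(von_neumann), 0.0)
assert np.allclose(von_neumann, von_neumann.conj().T)
```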

Finally, in order to apply the defined quantum many-body operator to a quantum state, the user needs to first prepare its action (only once) and then compute its action as many times as needed. The C API function cudensitymatOperatorPrepareAction() prepares the action of the operator on a quantum state of a given kind. It will return the workspace buffer size required for the actual computation. The user can then allocate the GPU workspace buffer of the required size (or larger) and attach it to the workspace descriptor used in the call (cudensitymatWorkspaceDescriptor_t). Once the workspace is ready, calling the C API function cudensitymatOperatorComputeAction() will apply the operator to a given (input) quantum state, accumulating its action into another (output) quantum state. This API call takes the current value of time as well as an arbitrary set of real parameters that will trigger an update of the value of the operator via the registered callback functions before starting the computation. The requested computation can be inserted in a CUDA stream and executed asynchronously with respect to the host, in which case it is the user’s responsibility to synchronize the computation via cudaStreamSynchronize.

Workspace Descriptor

Each non-trivial numerical operation supported by the cuDensityMat library generally requires a workspace buffer allocated in GPU memory. The C API function cudensitymatCreateWorkspace() creates an empty workspace descriptor. The preparation stage for a desired numerical computation sets the required workspace buffer size inside the workspace descriptor. The user can then query the required size via cudensitymatWorkspaceGetMemorySize() and allocate a GPU workspace buffer of at least that size. Once the workspace buffer has been allocated (or taken from a pre-allocated memory pool), it can be attached to the workspace descriptor by calling the C API function cudensitymatWorkspaceSetMemory(). At this point, the workspace descriptor is ready to be passed to the actual computation call. The workspace descriptor can be reused after the computation call it was passed to has been synchronized, provided that the workspace buffer size is sufficient for the next computation call. The GPU buffer used by the workspace descriptor must stay allocated while the workspace descriptor is in use. Once the workspace descriptor is no longer needed, it can be safely destroyed via cudensitymatDestroyWorkspace().

Note that any GPU buffer attached to a workspace descriptor must be aligned to a 256-byte boundary, which is guaranteed when using cudaMalloc. For custom GPU memory pools, this alignment must be respected explicitly.

Coupled Quantum Dynamics Master Equations (System of ODE)

While the concept of a Liouvillian superoperator and its action on a quantum state is sufficient to specify the right-hand side of broadly used master equations, like the Lindblad equation, in a more general setting one may need to specify a system of coupled ordinary differential equations (ODE) that simultaneously evolves multiple (potentially batched) quantum states under their own Liouvillian operators, as is done in the hierarchical equations of motion (HEOM) method. In such cases, the user needs to define the so-called composite operator action, represented by a set of C objects of the cudensitymatOperatorAction_t type, constructed by calling the C API function cudensitymatCreateOperatorAction(). The composite operator action assumes a simultaneous evolution of \(M\) quantum states described by a system of \(M\) coupled ODEs. For each ODE of the system, one constructs a separate operator action object (cudensitymatOperatorAction_t) by calling cudensitymatCreateOperatorAction(), thus requiring \(M\) operator action objects for the full ODE system specification. The construction of the operator action object representing a specific ODE accepts \(M\) operators corresponding to the \(M\) evolved quantum states, where each operator acts on its corresponding quantum state and accumulates into the right-hand side of that specific ODE. In turn, the right-hand side of that ODE determines the time derivative of the corresponding quantum state from the set of \(M\) quantum states being evolved in time. If a specific quantum state is not present in a specific ODE of the coupled system, the corresponding operator is formally zero, which is specified by providing a NULL pointer in the corresponding position.
Once the \(M\) operator action (cudensitymatOperatorAction_t) objects have been constructed, that is, the coupled ODE system has been fully specified, one can prepare each of them for computation by calling the C API function cudensitymatOperatorActionPrepare(), which returns the size of the required workspace buffer inside the workspace descriptor. In order to prepare each operator action object, this API function needs to know the kind of quantum states to which the operators will be applied. Once prepared, the user can query the required workspace buffer size, allocate a GPU workspace buffer of the required (or larger) size, and attach it to the workspace descriptor (cudensitymatWorkspaceDescriptor_t) used in the call. Once the workspace is ready, calling the C API function cudensitymatOperatorActionCompute() will apply all operators to the given (input) quantum states, accumulating the result into the output quantum state. This API call takes the current value of time as well as an arbitrary set of real parameters that will trigger an update of the value of the operators via the registered callback functions before starting the computation. The requested computation can be inserted in a CUDA stream and executed asynchronously with respect to the host, in which case it is the user’s responsibility to synchronize the computation via cudaStreamSynchronize. Note that if all constructed and prepared operator actions (cudensitymatOperatorAction_t) are placed in the same CUDA stream, they will be computed in order, in which case the same workspace buffer can be reused for all these prepared operator actions as long as its size is sufficient for any of them.
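The structure of the composite operator action can be sketched in NumPy, with plain matrices standing in for operators and None standing in for the NULL pointer of a formally zero operator (all names and dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = 2

# The M quantum states being co-evolved (plain vectors here).
states = [rng.normal(size=d) for _ in range(M)]

# ops[i][j] is the operator applied to state j inside ODE i;
# None plays the role of the NULL pointer for a formally zero operator.
ops = [
    [rng.normal(size=(d, d)), None],
    [None, rng.normal(size=(d, d))],
]

def rhs(i):
    # Right-hand side of ODE i: accumulate the action of each non-zero
    # operator on its corresponding quantum state.
    out = np.zeros(d)
    for op, state in zip(ops[i], states):
        if op is not None:
            out += op @ state
    return out

derivs = [rhs(i) for i in range(M)]  # time derivatives of the M states
```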

Note that the operator action object can also be constructed for a single (potentially batched) quantum state acted on by a single Liouvillian operator, in which case it is equivalent to preparing and computing the operator action directly via the operator API. When the operator action object is no longer needed, it can be safely destroyed by calling cudensitymatDestroyOperatorAction().

Properties of Quantum States

In order to compute different properties of a quantum state, the user can create, prepare, and compute the corresponding property calculation object. The C API function cudensitymatCreateExpectation() takes an arbitrary operator (cudensitymatOperator_t) and creates the expectation value calculation object for it (cudensitymatExpectation_t). The C API function cudensitymatExpectationPrepare(), which needs to be called only once, prepares the expectation value calculation object for computation for a given kind of quantum states. Similarly to other prepare functions, it sets the required workspace buffer size inside the workspace descriptor. The user then needs to allocate a GPU buffer of that (or larger) size and attach it to the workspace descriptor. Once prepared, the expectation value of the operator can be computed any number of times via cudensitymatExpectationCompute() for the same or different quantum states of the same kind (same shape, purity, and data type). If the quantum state is batched, cudensitymatExpectationCompute() will compute the expectation values for all instances of the batch, thus returning an array of expectation values, with the array size expected to be at least the batch size.
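The quantity being computed can be illustrated in NumPy for both purities (a mathematical sketch, not the library API; the observable and state are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
O = A + A.conj().T                    # a Hermitian observable

psi = rng.normal(size=d) + 1j * rng.normal(size=d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, psi.conj())       # the matching mixed state

exp_pure = np.vdot(psi, O @ psi)      # <psi| O |psi>
exp_mixed = np.trace(O @ rho)         # Tr(O rho)

# For rho = |psi><psi| the two expectation values coincide and are real.
assert np.allclose(exp_pure, exp_mixed)
assert abs(exp_pure.imag) < 1e-10
```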

Multi-GPU Multi-Node Execution

The cuDensityMat library C API works on both single and multiple GPUs. Running parallel jobs on multiple GPUs requires building an MPI plugin after downloading the library archive. After unpacking the archive, one can find the script called activate_mpi_cudm.sh inside the distributed_interfaces folder. By running the script inside that folder, the MPI interface plugin library libcudensitymat_distributed_interface_mpi.so will be built. The script requires environment variables CUDA_PATH and MPI_PATH to point to the CUDA and MPI installation directories, respectively. The provided MPI installation must point to a CUDA-aware MPI implementation. The script will also set environment variable CUDENSITYMAT_COMM_LIB to point to the MPI interface plugin library location. This variable should be exported globally to make sure the cuDensityMat library can find the MPI interface plugin library during run time.

Once the MPI interface plugin has been set up, the activation of parallel execution on multiple GPUs is achieved by calling the C API function cudensitymatResetDistributedConfiguration(). This function accepts a duplicate of an MPI communicator in a type-erased form, for example a duplicate of the MPI_COMM_WORLD communicator obtained via an MPI_Comm_dup call (consult the MPI library documentation). After that point, all numerical computations will run in parallel across all MPI ranks available inside the provided MPI communicator. The user can query the total number of parallel ranks and the rank of the current process by calling cudensitymatGetNumRanks() and cudensitymatGetProcRank(), respectively. The cuDensityMat library supports parallel execution only with a power-of-two number of parallel processes, one GPU per process.

Citing cuQuantum

    1. Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.