Overview¶
This section describes math concepts and usage workflow of the cuDensityMat library. Some familiarity with quantum mechanics may be required.
A general library usage workflow is the following:
- Create the library context handle. This will initialize the library context. A library context is associated with the currently set GPU. All subsequent API calls and numerical operations must be executed on that same GPU. 
- (Optional). In order to run on multiple GPUs in parallel, one will need to activate distributed execution explicitly by calling the C API function - cudensitymatResetDistributedConfiguration()and giving it a pointer to a valid MPI communicator. Each parallel process will be assigned its current GPU used during the library context creation. It is the responsibility of the user to assign GPUs to MPI processes, one GPU per MPI process.
- Define the elements of necessary elementary tensor operators and/or full matrix operators in GPU memory. The corresponding arrays must be stored using the generalized column-wise storage layout (F-order). For batched operators, one will need to add one more mode (batch mode) at the end of the array (the most significant mode in F-order). 
- Define the necessary elementary tensor operators, full matrix operators, operator terms, and final quantum many-body operators. Elementary tensor operators and full matrix operators can be batched, as can be the coefficient in front of any operator product or operator term. If an elementary tensor operator product has a batched elementary tensor operator and/or a batched coefficient in front of it, the operator product is considered batched. 
- Create the necessary quantum states, query required storage, and attach GPU buffers to the quantum states, making them ready for computation. The original content of the attached GPU buffer defines the initial value of the quantum state. For multi-GPU runs, the library provides special API functions for querying which particular state tensor slice is stored on the current parallel process. Quantum states can be batched, in which case the batch mode is the last mode of the quantum state tensor. 
- Prepare the action of a desired quantum many-body operator on a desired kind of quantum states (shape, purity, and data type). This only needs to be done once. Note that if any of the quantum state or quantum operator is batched, the batch sizes must agree. If the quantum operator is not batched, it can still act on a batched quantum state to produce another batched quantun state. 
- Compute the action of a desired quantum operator on a concrete quantum state that will be accumulated into another quantum state of the same kind. The operator action can be computed as many times as needed. 
- One can also compute the expectation value of a given quantum operator over a given quantum state. This computation also needs to be prepared first (once), and then executed as many times as needed. 
- If needed, one can perform different basic operations on a quantum state, like initialization to zero, scaling, computing the norm, trace, accumulating a quantum state into another quantum state, and computing the inner product. 
Further details are presented below.
Tensor spaces and quantum states¶
A tensor space is a composite linear space constructed as a tensor product of one or more vector spaces. Each constituting vector space represents a quantum degree of freedom of dimension equal to the dimension of the vector space. The shape of the composite tensor space is a tuple of the dimensions of the constituting vector spaces in their original order. Note that the shape of the tensor space stays the same irrespective of whether we deal with pure states (vectors living in that tensor space) or mixed states (matrices associated with the same tensor space).
A pure quantum state is represented by a tensor from the specified tensor space, thus the shape of a pure quantum state coincides with the tensor space shape. A mixed quantum state is represented by a tensor belonging to the tensor space constructed as a tensor product of the specified tensor space (primary space or ket space) and its dual tensor space (dual space or bra space). Thus, the shape of a mixed quantum state is a concatenation of two tensor space shapes, where the first half of the tensor modes corresponds to the ket space while the second half of the tensor modes corresponds to the bra space. For example, if tensor \(V[i0, i1, i2]\) represents a pure quantum state from a tensor space constructed from three vector spaces (quantum degrees of freedom), then tensor \(D[i0, i1, i2; j0, j1, j2]\) represents a mixed quantum state in the same tensor space, where modes \(i0, i1, i2\) are the ket modes while modes \(j0, j1, j2\) are the bra modes. The corresponding tensor space shape in both cases is \([range(i0), range(i1), range(i2)]\).
In C API, a pure/mixed quantum state is represented by the opaque cudensitymatState_t type.
The cudensitymatCreateState() API function is used to define a pure or mixed quantum state
from a tensor space of a desired shape. Optionally, one can define a batch of quantum
states of the same shape (batched quantum state) by specifying a batch size greater than 1.
Note that cudensitymatCreateState() expects a tensor space shape as an input argument,
not the quantum state shape (which is the same for pure states, but different for mixed states).
The created quantum state has no associated physical storage yet. In order to attach
physical storage to a quantum state, one must first query the number of storage components
expected by the quantum state representation by calling cudensitymatStateGetNumComponents().
A storage component is a memory buffer used to store a specific tensor from the quantum
state representation. By default, pure and mixed quantum states are stored using a single
storage component which stores the dense tensor representing the quantum state, using
the generalized column-major storage layout. The batch dimension (mode), if greater than 1,
is always the last (most significant) mode. In contrast, factorized representations of
quantum states may use more than one storage component, where each storage component stores
one (potentially batched) tensor associated with the requested tensor factorization scheme.
The current release does not support factorized tensor representations yet.
Once the number of required storage components has been determined, the user can
query the required size of each storage component by calling the C API function
cudensitymatStateGetComponentStorageSize(). Knowing the required size of each
storage component for a given quantum state, the user can allocate a GPU buffer
for each storage component and then attach the GPU buffer(s) to the quantum state
by calling the C API function cudensitymatStateAttachComponentStorage(). At this
point, the quantum state is fully set and can be used in actual computations.
The attached GPU buffers must stay allocated for the whole lifetime
of the quantum state. Note that the GPU buffers used to store the quantum
state representation must be 256-byte aligned, which is guaranteed when
using cudaMalloc or cudaMallocManaged. For custom GPU memory pools,
this alignment must be respected explicitly, otherwise it may result
in an undefined behavior. The data stored in the attached GPU buffer
associated with a given storage component defines the initial value
of the tensor associated with that storage component.
For distributed multi-GPU runs, which can be activated by calling the
C API function cudensitymatResetDistributedConfiguration() and giving it
a duplicate of an MPI communicator, the tensor storage will be distributed
across all GPUs. The distributed execution configuration assumes one GPU
per parallel process, with the total number of parallel processes being
a power of two (1, 2, 4, 8, 16, 32, 64, etc.). In this case, dense pure/mixed
quantum states will still be stored using a single storage component (tensor),
however each parallel process will locally store its own slice of the full tensor.
To query which specific tensor slice is stored by the current parallel process,
one can first call the C API function cudensitymatStateGetComponentNumModes() to
retrieve the number of tensor slice modes, and then call cudensitymatStateGetComponentInfo()
to retrieve the tensor slice information (mode offsets and mode extents).
There are a number of basic operations one can perform on a quantum state.
The C API function cudensitymatStateInitializeZero() initializes
the quantum state to zero (sets all tensor elements to zero). The C API
function cudensitymatStateComputeScaling() multiplies the quantum state
by a given scalar (or array of scalars for batched quantum states, one
scalar per instance of a batch). The C API function cudensitymatStateComputeNorm()
computes the squared Euclidean/Frobenius norm of a pure/mixed
quantum state, respectively (taking the square root will yield the norm).
For a batched quantum state, it will return an array of the size of the batch.
The C API function cudensitymatStateComputeTrace() computes the trace
of a mixed quantum state (represented by a density matrix), or
the squared Euclidean norm for a pure quantum state. For a batched quantum state,
it will return an array of the size of the batch. The C API function
cudensitymatStateComputeAccumulation() accumulates (adds) one quantum state
into another quantum state with an optional scaling coefficient applied
to the former. For batched quantum states, it accepts an array of scaling coefficients
of the size of the batch. The C API function cudensitymatStateComputeInnerProduct()
computes the inner product of two quantum states, which is the regular
Euclidean inner product for pure quantum states and the Frobenius
inner product for mixed quantum states. For batched quantum states,
it will return an array of the size of the batch. In all cases
where the quantum state is batched and an array is returned,
the provided GPU storage for the results must be sufficient
to store such an array of the size of the batch.
Note that all numerical operations accept the CUDA stream argument,
that is, the execution of these operations is generally asynchronous
and requires explicit synchronization via cudaStreamSynchronize or
cudaDeviceSynchronize. Once the quantum state is no longer needed and
all preceding operations on it have been synchronized via the CUDA
synchronization API, it can be safely destroyed by calling
the C API function cudensitymatDestroyState().
Quantum many-body operators (super-operators)¶
A broad class of quantum Liouvillian superoperators, or quantum many-body operators in general, can be expressed as a sum of tensor products of elementary tensor operators with some, generally time-dependent coefficients, where an elementary tensor operator is a tensor operator acting on a specific number of quantum degrees of freedom, either from the right or from the left, or both (pure quantum states only support operator action from the right). Each elementary tensor operator has the same number of the ket and bra modes, with the ket modes preceding the bra modes in its tensor representation. The bra modes of the elementary tensor operator act on the ket modes of the quantum state (action from the left) while the ket modes of the elementary tensor operator act on the bra modes of the quantum state (action from the right). In practice, many quantum Liouvillian operators are expressed in terms of tensor products of one-body elementary tensor operators, like creation/annihilation operators, particle number operator, coordinate/momentum operators, individual spin operators, etc. For the sake of concreteness, let’s consider the following quantum Liouvillian operator as an example:
In the above Liouvillian operator, \(b\), \(b^{\dagger}\), \(\sigma_{z}\),
\(\sigma_{-}\), \(\sigma_{+}\) are elementary tensor operators,
and each of them acts only on a single quantum degree of freedom,
represented by an order-2 tensor \(U[i0; j0]\), where \(i0\) is the ket
mode and \(j0\) is the bra mode (an order-k elementary tensor operator
is represented by a tensor with \(k\) ket modes and \(k\) bra modes).
\(\omega_{c}\), \(\omega_{a}\), and \(\gamma\) are constant scalars.
\(g(t, \alpha, \beta)\) is a time-dependent scalar coefficient that
depends on time and two user-provided real parameters, \(\alpha\) and \(\beta\),
which parameterize the Liouvillian operator (such that one can run parameter sweeps, for example).
Elementary tensor operators can also be batched. A batched elementary tensor operator
is represented by a batch of operator tensors stored contiguously in memory (effectively
in this case the operator tensor will have an additional batch mode at the last position).
When a batched elementary tensor operator is applied to a batched quantum state,
each operator instance from the batched operator acts on the corresponding instance
of the batched quantum state, thus the batch size must be consistent.
The C API function cudensitymatCreateElementaryOperator() creates
an elementary tensor operator (cudensitymatElementaryOperator_t)
by specifying the number of quantum degrees of freedom (space modes)
it acts on, the extents of those space modes, the numerical data
type (real/complex, FP32/FP64), a GPU buffer containing the tensor
elements of the elementary tensor operator (generalized column-wise layout),
and an optional user-defined tensor callback function. By registering
a user-provided tensor callback function, we allow the cuDensityMat library
to automatically update tensor elements of the elementary tensor operator
during the computation of the operator action on a quantum state,
based on the current value of time as well as an arbitrary set of user-provided
real parameters. The signature of the tensor callback function pointer
cudensitymatTensorCallback_t is provided in the public
C header cudensitymat.h, however the C API functions accept
tensor callback arguments in a wrapped form represented by the
cudensitymatWrappedTensorCallback_t structure in cudensitymat.h,
which contains the actual C function pointer for the tensor callback
as well as device type (CPU or GPU) where the callback function will
be executed. Note that a CPU-side user-provided tensor callback function
performs an update of the tensor elements inside a temporary CPU buffer
passed to the that callback function, and then automatically propagates
it to the original GPU buffer used during the definition of the elementary
tensor operator.
In order to create a batched elementary tensor operator, one must use
the batched version of the API cudensitymatCreateElementaryOperatorBatch().
If the tensor callback argument is provided, the corresponding tesor
callback function must update the entire batched elementary operator
tensor (the callback function takes batch size as an input argument).
Once the elementary tensor operator is no longer needed and is not
used by any operator term (see below), it can be safely destroyed
by calling the C API function cudensitymatDestroyElementaryOperator().
Note that the GPU buffer storing tensor elements of the elementary
tensor operator must stay valid for the whole life time of the
elementary tensor operator.
Sometimes a quantum many-body operator is defined by its full matrix
from the beginning, where the full matrix acts on all quantum degrees
of freedom. Such an operator is called the full matrix operator. It
can be created by calling the C API function cudensitymatCreateMatrixOperatorDenseLocal()
or its batched version cudensitymatCreateMatrixOperatorDenseLocalBatch(),
fully analogous to how an elementary tensor operator is created.
The C API function cudensitymatDestroyMatrixOperator is used to
destroy the full matrix operator when it is no longer needed and
is not used by any operator term.
Going one level higher, it is often convenient to represent a given quantum many-body
operator as a sum of different terms, with each term having its own physical meaning,
for example the system Hamiltonian, pulse/drive Hamiltonian, dissipation terms,
etc. For example, the Liouvillian operator defined above can be thought of
as composed of four operator terms (note that the operator term in the square
brackets comprises two tensor products of elementary tensor operators).
The cuDensityMat library provides the concept of the operator term
for this purpose, represented by the cudensitymatOperatorTerm_t C type.
The C API function cudensitymatCreateOperatorTerm() creates an empty
operator term by only specifying the tensor space shape where the quantum
states acted on by this operator term reside. Subsequently, the user can construct
the desired operator term by gradually appending to it elementary operator
products or full matrix operators with some, generally time-dependent coefficients.
The C API function cudensitymatOperatorTermAppendElementaryProduct() is used
for appending a tensor product of elementary tensor operators to the operator term.
It accepts an array of elementary tensor operators forming the desired
tensor product, the list of quantum degrees of freedom (space modes) the tensor product
is acting on (concatenated across all constituting elementary tensor operators
in their respective order), and the duality of action on each space mode.
If the duality is set to 0, the bra mode of the elementary tensor operator
is picked to act on a ket mode of the quantum state (action from the left);
otherwise, the ket mode of the elementary tensor operator is picked to act
on a bra mode of the quantum state (action from the right). Additionally,
cudensitymatOperatorTermAppendElementaryProduct() accepts a static
coefficient associated with the appended elementary operator product
as well as an optional user-defined coefficient callback function
(cudensitymatWrappedScalarCallback_t) that will also be associated with
the appended elementary operator product. The cuDensityMat library
will use the user-provided scalar callback function for retrieving
the time-dependent part of the total coefficient associated with
the appended elementary operator product, where the total coefficient
is a product of the static coefficient and the value returned by the
scalar callback function for a given value of time and an arbitrary set
of user-provided real parameters during operator action computation. For example,
the tensor product \(\frac{1}{2} g(t, \alpha, \beta) b \sigma_{+}\)
comprises two elementary tensor operators, \(b\) and \(\sigma_{+}\),
a constant (static) coefficient \(\frac{1}{2}\), and a time-dependent
parameterized coefficient \(g(t, \alpha, \beta)\) which can be defined
as a user-provided scalar callback function (either CPU-resident or GPU-resident).
Similarly, to cudensitymatOperatorTermAppendElementaryProduct(), the C API function
cudensitymatOperatorTermAppendMatrixProduct() is used for appending either a single
full matrix operator or a product of full matrix operators to the operator term
(note that in this case it is a matrix product, not tensor product). Both C API
functions used for appending tensor products of elementary operators or
matrix products of full matrices have their batched counterparts:
cudensitymatOperatorTermAppendElementaryProductBatch() and
cudensitymatOperatorTermAppendMatrixProductBatch(). The batched version
accepts the batch size, an array of static coefficients of that size,
and another array of the same size for storing the total coefficients,
where the total coefficients will be computed as an elementwise product
of the static coefficients and the values retured by the user-provided
scalar callback function which is supposed to compute the variable part
of the total coefficient for all instances of the batch in one call
(if the scalar callback function is not provided, the variable part will
be equal to identity). In general, an operator term is considered batched
if it was created via the batch API and/or if it contains at least one
batched elementary tensor operator or full matrix operator. A batched
operator term can only be applied to a batched quantum state of the same
batch size. In this case, each quantum state instance inside the quantum
state batch is acted on by the corresponding instance of the batched
operator term. A specific instance of the operator term is selected
by setting all batched elementary tensor operators, batched full matrix
operators, and batched coefficients inside the operator term to that instance.
Once an operator term is no longer needed and is not used by any operator,
it can be safely destroyed by calling the C API function cudensitymatDestroyOperatorTerm().
Note that the elementary tensor operators and full matrix operators constituting
the operator term must stay alive for the whole life time of the operator term.
Once all necessary operator terms have been defined, the user
can proceed to defining the full quantum many-body operator (or superoperator),
represented by the cudensitymatOperator_t C type, by calling the C API
function cudensitymatCreateOperator(). Similarly to creating an operator term,
this API function creates an empty operator by only taking the shape
of the tensor space where the quantum states it will act on reside.
Subsequently, the user can construct the desired operator by appending
the constituting operator terms into it via the C API function
cudensitymatOperatorAppendTerm(). Each appended operator term
can be supplied with a static coefficient as well as an optional
user-defined scalar callback function such that the total coefficient
associated with the appended operator term will be a product of the two.
Additionally, the user can request to append the operator term in its
dual form, where the duality of all its modes will be flipped (ket <–> bra).
This is convenient for defining the von Neumann part of the master equation,
where the Hamiltonian acts from different sides on the density matrix
to form the commutator. There is also a batched version of the API,
called cudensitymatOperatorAppendTermBatch(). The batched version
accepts the batch size, an array of static coefficients of that size,
and another array of the same size for storing the total coefficients,
where the total coefficients will be computed as an elementwise product
of the static coefficients and the values retured by the user-provided
scalar callback function which is supposed to compute the variable part
of the total coefficient for all instances of the batch in one call
(if the scalar callback function is not provided, the variable part will
be equal to identity). In general, an operator is considered batched
if it was created via the batch API and/or if it contains at least one
batched operator term. A batched operator can only be applied to a batched
quantum state of the same batch size. In this case, each quantum state
instance inside the quantum state batch is acted on by the corresponding
instance of the batched operator. A specific instance of the operator is
selected by setting all batched operator terms and batched coefficients
inside the operator to that instance. A batched operator can only
be applied to a batched quantum state of the same batch size. If the
operator is not batched, it can be applied to both simple and batched
quantum states.
Once the operator is no longer needed and all computations in which
it participated have been synchronized, it can be safely destroyed
by calling the C API function cudensitymatDestroyOperator(). Note that
the constituting operator terms must stay alive for the whole life time
of the operator.
Finally, in order to apply the defined (simple or batched) quantum many-body operator
to a (simple or batched) quantum state, the user needs to first prepare its
action (only once) and then compute its action as many times as needed.
The C API function cudensitymatOperatorPrepareAction() prepares the action
of the operator on a quantum state of a given kind. It will return the workspace
buffer size required for the actual computation. The user can then allocate
a GPU workspace buffer of the required size (or larger) and attach it
to the workspace descriptor (cudensitymatWorkspaceDescriptor_t) used in the compute call.
Once the workspace is ready, calling the C API function cudensitymatOperatorComputeAction()
will apply the operator to a given (input) quantum state, accumulating its action into
another (output) quantum state. This API call takes the current value of time as well as
an arbitrary set of real parameters that will trigger an update of the value of the operator
via registered callback functions before starting the computation. The requested computation
can be inserted in a CUDA stream and executed asynchronously with respect to Host, in which
case it is user’s responsibility to synchronize the computation via cudaStreamSynchronize.
Workspace descriptor¶
Each non-trivial numerical operation supported by the cuDensityMat
library generally requires a workspace buffer allocated in GPU memory.
The C API function cudensitymatCreateWorkspace() creates an empty
workspace descriptor. The preparation stage for a desired
numerical computation sets the required workspace buffer size
inside the workspace descriptor. The user can then query the
required size via cudensitymatWorkspaceGetMemorySize() and
allocate a GPU workspace buffer of at least that size. Once
the workspace buffer has been allocated (or taken from a
pre-allocated memory pool), it can be attached to the workspace
descriptor by calling the C API function cudensitymatWorkspaceSetMemory().
At this point, the workspace descriptor is ready to be passed
to the actual computation call. The workspace descriptor
can be reused after the computation call it was passed to
has been synchronized, provided that the workspace buffer
size is sufficient for the next computation call. The GPU
buffer used by the workspace descriptor must stay allocated
while the workspace descriptor is in use. Once the workspace
descriptor is no longer needed, it can be safely destroyed
via cudensitymatDestroyWorkspace() (note that the memory
buffer used by the workspace descriptor stays intact).
Importantly, any GPU buffer attached to a workspace descriptor
must be aligned to a 256-byte boundary, which is guaranteed
when using cudaMalloc or cudaMallocManaged. For custom
GPU memory pools, this alignment must be respected explicitly.
Providing unaligned workspace buffers may result in undefined
behavior.
Coupled quantum dynamics master equations (system of ODE)¶
While the concept of a Liouvillian superoperator and its action
on a quantum state is sufficient to specify the right-hand side
of the broadly used master equations, like the Lindblad equation,
in a more general setting one may need to be able to specify
a system of coupled ordinary differential equations (ODE)
to simultaneously evolve multiple (potentially batched) quantum
states under their own Liouvillian operators, like it is done
in the hierarchical equations of motion (HEOM) method. In such cases,
the user needs to define the so-called composite operator action,
represented by a set of C objects of the cudensitymatOperatorAction_t
type. They can be constructed by calling the C API function
cudensitymatCreateOperatorAction(). The composite operator action
assumes a simultaneous evolution of \(M\) quantum states described
by a system of \(M\) coupled ODEs. For each coupled ODE from
the ODE system, one needs to construct a separate operator action object
(cudensitymatOperatorAction_t) by calling cudensitymatCreateOperatorAction(),
thus requiring \(M\) operator action objects for the full ODE system
specification. Construction of each operator action object
representing a specific ODE from the coupled ODE system accepts
\(M\) operators corresponding to the \(M\) evolved quantum states,
where each operator will be acting on its corresponding quantum state,
accumulating into the right-hand side of that specific ODE.
In turn, the right-hand side of that specific ODE determines
the time derivative of the corresponding quantum state from
the set of \(M\) quantum states being evolved in time.
If a specific quantum state is not present in a specific ODE
in the coupled ODE system, the corresponding operator is formally zero,
which is specified by providing a NULL pointer in the corresponding position.
Once the \(M\) operator action (cudensitymatOperatorAction_t) objects
have been constructed, that is, the coupled ODE system has been fully
specified, one can prepare each of them for computation by calling the
C API function cudensitymatOperatorActionPrepare() which will return
the size of the required workspace buffer inside the workspace
descriptor. In order to prepare each operation action object,
this API function needs to know the kind of quantum states
to which the operators will be applied. Once prepared, the user
can then query the required workspace buffer size, allocate
the GPU workspace buffer of the required (or larger) size and attach
it to the workspace descriptor (cudensitymatWorkspaceDescriptor_t)
used in the compute call. Once the workspace is ready, calling the
C API function cudensitymatOperatorActionCompute() will apply all
operators to the given (input) quantum states, accumulating its action
into the provided output quantum state. This API call takes the current value
of time as well as an arbitrary set of user-provided real parameters that
will trigger an update of the value of the operator via registered callback
functions before starting the computation. The requested computation
can be inserted in a CUDA stream and executed asynchronously with respect to Host,
in which case it is user’s responsibility to synchronize the computation
via cudaStreamSynchronize. Note that if all constructed and prepared
operator actions (cudensitymatOperatorAction_t) are placed in the same
CUDA stream, they will be computed in-order, in which case the same
workspace buffer can be reused for all these prepared operator actions
as long as the buffer size is large enough for any of them.
Note that the operator action object can also be constructed for
a single (potentially batched) quantum state acted on by a single
(potentially batched) Liouvillian operator, in which case it is
equivalent to preparing and computing the operator action directly
via the operator API. When the operator action object is no longer needed,
it can be safely destroyed by calling cudensitymatDestroyOperatorAction().
Properties of quantum states¶
In order to compute different properties of a quantum state, the user can
create, prepare, and compute the corresponding property calculation object.
The C API function cudensitymatCreateExpectation() takes an arbitrary
operator (cudensitymatOperator_t) and creates the expectation value
calculation object for it (cudensitymatExpectation_t).
The C API function cudensitymatExpectationPrepare(), which needs to be called
only once, prepares the expectation value calculation object for computation
for a given kind of quantum states. Similarly to other prepare functions,
it sets the required workspace buffer size inside the workspace descriptor.
Then the user will need to allocate a GPU buffer of that (or larger) size
and attach it to the workspace descriptor. Once prepared, the expectation
value of the operator can be computed any number of times via
cudensitymatExpectationCompute() for the same or different quantum states
of the same kind (same shape, batch size, purity, and data type). If the quantum
state is batched, cudensitymatExpectationCompute() will compute the expectation
values for all instances of the batch, thus returning an array of expectation
values, with the array size expected to be at least the batch size.
If the operator is batched as well, the instances inside the operator batch
will be matched with the instances inside the quantum state batch.
Multi-GPU multi-node execution¶
The cuDensityMat library C API works on both single and multiple GPUs.
Running parallel jobs on multiple GPUs requires building an MPI plugin
after downloading the library archive. After unpacking the archive,
one can find the script called activate_mpi_cudm.sh inside
the distributed_interfaces folder. By running the script
inside that folder, the MPI interface plugin library
libcudensitymat_distributed_interface_mpi.so will be built.
The script requires environment variables CUDA_PATH and MPI_PATH
to point to the CUDA and MPI installation directories, respectively.
The provided MPI installation must point to a CUDA-aware MPI implementation
(all major MPI libraries, like OpenMPI, MPICH, and MVAPICH are CUDA-aware).
The script will also set environment variable CUDENSITYMAT_COMM_LIB
to point to the MPI interface plugin library location. This variable
should be exported globally to make sure the cuDensityMat library
can find the MPI interface plugin library during run time.
Once the MPI interface plugin has been set up, the activation of
parallel execution on multiple GPUs is achieved by calling the C API
function cudensitymatResetDistributedConfiguration(). This function
accepts a duplicate of an MPI communicator in a type-erased form,
for example a duplicate of the MPI_COMM_WORLD communicator
obtained via a MPI_Comm_duplicate call (consult MPI library).
After that point, all numerical computations will run in parallel
across all MPI ranks available inside the provided MPI communicator.
The user can query the total number of parallel ranks and the rank
of the current process by calling cudensitymatGetNumRanks() and
cudensitymatGetProcRank(), respectively. The cuDensityMat library
supports parallel execution only with a power of two number of parallel
processes, one GPU per process.
Citing cuQuantum¶
- Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.