Overview¶
This section describes the concepts and working principles of the cuDensityMat library. Some familiarity with quantum mechanics may be required.
A general library usage workflow is the following:
Create the library context handle. This will initialize the library context. A library context is associated with the currently set GPU. All subsequent API calls and numerical operations must be executed on that same GPU.
(Optional). In order to run on multiple GPUs in parallel, one needs to activate distributed execution explicitly by calling the C API function cudensitymatResetDistributedConfiguration() and giving it a pointer to a valid MPI communicator. Each parallel process will be assigned the GPU that was current during the library context creation.
Define the elements of the necessary elementary tensor operators in GPU memory. The corresponding arrays must be stored using the generalized column-major storage layout.
Define the necessary elementary tensor operators, operator terms, and full quantum many-body operators.
Create the necessary quantum states, query required storage, and attach GPU buffers to the quantum states, making them ready for computation. The original content of the attached GPU buffer defines the initial value of the quantum state. For multi-GPU runs, the library provides special API functions for querying which particular tensor slice is stored at the current parallel process.
Prepare the action of a desired quantum operator on a desired kind of quantum states (shape, purity, and data type). This only needs to be done once.
Compute the action of the quantum operator on a concrete quantum state, accumulating the result into another quantum state of the same kind.
One can also compute the expectation value of a given quantum operator over a given quantum state. This computation also needs to be prepared first (once), and then executed as many times as needed.
If needed, one can perform different basic operations on a quantum state, such as initializing it to zero, scaling it, computing its norm or trace, accumulating one quantum state into another, and computing the inner product of two quantum states.
The details are presented below.
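For illustration, here is a minimal sketch of the library-context part of this workflow in C (assuming the cudensitymatCreate() and cudensitymatDestroy() entry points declared in cudensitymat.h; error handling abbreviated):

```c
#include <cuda_runtime.h>
#include <cudensitymat.h>
#include <stdio.h>

int main(void)
{
    // The library context binds to the currently set GPU,
    // so select the device before creating the handle.
    cudaSetDevice(0);

    cudensitymatHandle_t handle;
    if (cudensitymatCreate(&handle) != CUDENSITYMAT_STATUS_SUCCESS) {
        fprintf(stderr, "Failed to create the cuDensityMat library context\n");
        return 1;
    }

    // ... define elementary operators, operator terms, and operators;
    //     create quantum states; prepare and compute operator actions ...

    cudensitymatDestroy(handle);
    return 0;
}
```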
Tensor Spaces and Quantum States¶
A tensor space is a composite linear space constructed as a tensor product of one or more vector spaces. Each constituting vector space represents a quantum degree of freedom of dimension equal to the dimension of the vector space. The shape of the composite tensor space is a tuple of the dimensions of the constituting vector spaces in their original order. Note that the shape of the tensor space stays the same irrespective of whether we deal with pure states (vectors living in the tensor space) or mixed states (matrices associated with the same tensor space).
A pure quantum state is represented by a tensor from the specified tensor space, thus the shape of a pure quantum state coincides with the tensor space shape. A mixed quantum state is represented by a tensor belonging to the tensor space constructed as a tensor product of the specified tensor space (primary space or ket space) and its dual tensor space (dual space or bra space). Thus, the shape of a mixed quantum state is a concatenation of two tensor space shapes, where the first half of the tensor modes corresponds to the ket space while the second half of the tensor modes corresponds to the bra space. For example, if tensor \(V[i0, i1, i2]\) represents a pure quantum state from a tensor space constructed from three vector spaces (quantum degrees of freedom), then tensor \(D[i0, i1, i2; j0, j1, j2]\) represents a mixed quantum state from the same tensor space, where modes \(i0, i1, i2\) are the ket modes while modes \(j0, j1, j2\) are the bra modes. The corresponding tensor space shape in both cases is \([range(i0), range(i1), range(i2)]\).
In the C API, a pure or mixed quantum state is represented by the opaque cudensitymatState_t type.
The cudensitymatCreateState()
API function is used to define a pure or mixed quantum state
from a tensor space of a desired shape. Optionally, one can define a batch of quantum
states of the same shape (batched quantum state) by specifying a batch size greater than 1.
Note that cudensitymatCreateState()
expects a tensor space shape as an input argument,
not the quantum state shape (which is the same for pure states, different for mixed states).
The created quantum state has no associated physical storage yet. In order to attach
physical storage to a quantum state, one must first query the number of storage components
expected by the quantum state representation by calling cudensitymatStateGetNumComponents().
A storage component is a memory buffer used to store a specific tensor from the quantum
state representation. By default, pure and mixed quantum states are stored using a single
storage component which stores the dense tensor representing the quantum state, using
the generalized column-major storage layout. The batch dimension (mode), if greater than 1,
is always the last (most significant) mode. In contrast, factorized representations of
quantum states may use more than one storage component, where each storage component stores
one (potentially batched) tensor associated with the requested tensor factorization scheme.
The current release does not support factorized tensor representations yet.
Once the number of required storage components has been determined, the user can
query the required size of each storage component by calling the C API function
cudensitymatStateGetComponentStorageSize(). Knowing the required size of each
storage component for a given quantum state, the user can allocate a GPU buffer
for each storage component and attach the GPU buffer(s) to the quantum state
by calling the C API function cudensitymatStateAttachComponentStorage(). At this
point, the quantum state is fully set and can be used in actual computations.
The attached GPU buffers must stay allocated for the whole lifetime
of the quantum state. Note that the GPU buffers used to store the quantum
state representation must be 256-byte aligned, which is guaranteed when
using cudaMalloc
or cudaMallocManaged. For custom GPU memory pools,
this alignment must be respected explicitly; otherwise, it may result
in undefined behavior. The data stored in the attached GPU buffer
associated with a given storage component defines the initial value
of the tensor associated with that storage component.
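Putting the above steps together, here is a sketch for a dense mixed state over three degrees of freedom (the argument lists follow the pattern of recent releases and should be checked against cudensitymat.h):

```c
#include <cuda_runtime.h>
#include <cudensitymat.h>

// Creates a mixed state (density matrix) and attaches GPU storage to it.
void create_mixed_state(cudensitymatHandle_t handle)
{
    const int32_t numSpaceModes = 3;
    const int64_t spaceModeExtents[3] = {2, 2, 10};  // tensor space shape
    const int64_t batchSize = 1;                     // single (non-batched) state

    cudensitymatState_t rho;
    cudensitymatCreateState(handle, CUDENSITYMAT_STATE_PURITY_MIXED,
                            numSpaceModes, spaceModeExtents, batchSize,
                            CUDA_C_64F, &rho);

    // Dense pure/mixed states currently use a single storage component.
    int32_t numComponents = 0;
    cudensitymatStateGetNumComponents(handle, rho, &numComponents);

    size_t componentSize[1] = {0};
    cudensitymatStateGetComponentStorageSize(handle, rho, numComponents, componentSize);

    // cudaMalloc guarantees the required 256-byte alignment.
    void *componentBuffer[1] = {NULL};
    cudaMalloc(&componentBuffer[0], componentSize[0]);

    // The buffer content at this point defines the initial state value;
    // the buffer must stay allocated for the whole lifetime of the state.
    cudensitymatStateAttachComponentStorage(handle, rho, numComponents,
                                            componentBuffer, componentSize);

    // ... use the state in computations ...

    cudensitymatDestroyState(rho);
    cudaFree(componentBuffer[0]);
}
```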
For distributed multi-GPU runs, which can be activated by calling the
C API function cudensitymatResetDistributedConfiguration()
and giving it
a duplicate of an MPI communicator, the tensor storage will be distributed
across all GPUs. The distributed execution configuration assumes one GPU
per parallel process, with the total number of parallel processes being
a power of two (1, 2, 4, 8, 16, 32, 64, etc.). In this case, dense pure/mixed
quantum states will still be stored using a single storage component (tensor),
however each parallel process will locally store its own slice of the full tensor.
To query which specific tensor slice is stored by the current parallel process,
one can first call the C API function cudensitymatStateGetComponentNumModes()
to
retrieve the number of tensor slice modes, and then cudensitymatStateGetComponentInfo()
to retrieve the tensor slice information (mode offsets, mode extents).
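A hedged sketch of this slice query follows (the exact parameter lists of these two functions are assumptions patterned on their names and the surrounding text; cudensitymat.h has the authoritative prototypes):

```c
#include <cudensitymat.h>
#include <stdio.h>

// Prints the local tensor slice (offsets and extents) of storage component 0.
void print_local_slice(cudensitymatHandle_t handle, cudensitymatState_t state)
{
    enum { MAX_MODES = 64 };        // hypothetical upper bound on slice modes
    int32_t componentGlobalId = 0;
    int32_t numSliceModes = 0;
    int32_t batchModeLocation = 0;

    // Number of modes in the tensor slice stored by this process:
    cudensitymatStateGetComponentNumModes(handle, state, /*componentLocalId=*/0,
        &componentGlobalId, &numSliceModes, &batchModeLocation);

    // Global offsets and extents of the local slice along each mode:
    int64_t modeExtents[MAX_MODES];
    int64_t modeOffsets[MAX_MODES];
    cudensitymatStateGetComponentInfo(handle, state, /*componentLocalId=*/0,
        &componentGlobalId, &numSliceModes, modeExtents, modeOffsets);

    for (int32_t m = 0; m < numSliceModes; ++m) {
        printf("mode %d: offset = %lld, extent = %lld\n", m,
               (long long)modeOffsets[m], (long long)modeExtents[m]);
    }
}
```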
There are a number of basic operations one can perform on a quantum state.
The C API function cudensitymatStateInitializeZero()
initializes
the quantum state to zero (sets all tensor elements to zero). The C API
function cudensitymatStateComputeScaling()
multiplies the quantum state
by a given scalar (or array of scalars for batched quantum states, one
scalar per instance of a batch). The C API function cudensitymatStateComputeNorm()
computes the squared Euclidean/Frobenius norm of a pure/mixed
quantum state, respectively (taking the square root will yield the norm).
For a batched quantum state, it will return an array of the size of the batch.
The C API function cudensitymatStateComputeTrace()
computes the trace
of a mixed quantum state (represented by a density matrix), or
the squared Euclidean norm of a pure quantum state. For a batched quantum state,
it will return an array of the size of the batch. The C API function
cudensitymatStateComputeAccumulation()
accumulates (adds) one quantum state
into another quantum state with an optional scaling coefficient applied
to the former. For batched quantum states, it accepts an array of scaling coefficients
of the size of the batch. The C API function cudensitymatStateComputeInnerProduct()
computes the inner product of two quantum states, which is the regular
Euclidean inner product for pure quantum states and the Frobenius
inner product for mixed quantum states. For batched quantum states,
it will return an array of the size of the batch. In all cases
where the quantum state is batched and an array is returned,
the provided GPU storage for the results must be sufficient
to store such an array of the size of the batch.
Note that all numerical operations accept a CUDA stream argument;
that is, the execution of these operations is generally asynchronous
and requires explicit synchronization via cudaStreamSynchronize
or
cudaDeviceSynchronize. Once the quantum state is no longer needed and
all preceding operations on it have been synchronized via the CUDA
synchronization API, it can be safely destroyed by calling
the C API function cudensitymatDestroyState().
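A sketch of these basic operations for a non-batched complex-double state (whether result arguments may reside in host memory for batch size 1 is an assumption to verify against the API reference):

```c
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cudensitymat.h>

void basic_state_ops(cudensitymatHandle_t handle,
                     cudensitymatState_t psi, cudensitymatState_t phi,
                     cudaStream_t stream)
{
    // Set all tensor elements of phi to zero.
    cudensitymatStateInitializeZero(handle, phi, stream);

    // Multiply psi by 0.5 (one scalar per batch instance; batch size 1 here).
    cuDoubleComplex factor = make_cuDoubleComplex(0.5, 0.0);
    cudensitymatStateComputeScaling(handle, psi, &factor, stream);

    // Squared norm of psi (take the square root to get the norm).
    double normSquared = 0.0;
    cudensitymatStateComputeNorm(handle, psi, &normSquared, stream);

    // phi += 1.0 * psi
    cuDoubleComplex coef = make_cuDoubleComplex(1.0, 0.0);
    cudensitymatStateComputeAccumulation(handle, psi, phi, &coef, stream);

    // Inner product of psi and phi (Euclidean for pure, Frobenius for mixed states).
    cuDoubleComplex overlap;
    cudensitymatStateComputeInnerProduct(handle, psi, phi, &overlap, stream);

    // All of the above execute asynchronously: synchronize before reading results.
    cudaStreamSynchronize(stream);
}
```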
Quantum Many-Body Operators (Super-Operators)¶
A broad class of quantum Liouvillian superoperators, and quantum many-body operators in general, can be expressed as a sum of tensor products of elementary tensor operators with generally time-dependent coefficients, where an elementary tensor operator is a tensor operator acting on a specific number of quantum degrees of freedom, either from the right, from the left, or from both sides. Each elementary tensor operator has the same number of ket and bra modes, with the ket modes preceding the bra modes in its tensor. The bra modes of the elementary tensor operator act on the ket modes of the quantum state (action from the left), while the ket modes of the elementary tensor operator act on the bra modes of the quantum state (action from the right). In practice, many quantum Liouvillian operators are expressed in terms of tensor products of one-body elementary tensor operators, like the creation/annihilation operators, the particle number operator, the coordinate/momentum operators, individual spin operators, etc. For the sake of concreteness, let's consider the following quantum Liouvillian operator as an example:
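One consistent way of writing such a Liouvillian, sketched here as a Jaynes-Cummings model of a two-level atom coupled to a cavity mode with atomic decay (the coefficient and dissipator conventions are assumptions consistent with the description below):

\[
\mathcal{L}(\rho) = -i\left[H, \rho\right]
  + \gamma \left( \sigma_{-}\, \rho\, \sigma_{+}
  - \frac{1}{2} \left\{ \sigma_{+} \sigma_{-}, \rho \right\} \right),
\qquad
H = \omega_{c}\, b^{\dagger} b + \frac{\omega_{a}}{2}\, \sigma_{z}
  + \left[ \frac{1}{2}\, g(t, \alpha, \beta)\, b\, \sigma_{+}
         + \frac{1}{2}\, g(t, \alpha, \beta)\, b^{\dagger} \sigma_{-} \right]
\]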
In the above Liouvillian operator, \(b\), \(b^{\dagger}\), \(\sigma_{z}\),
\(\sigma_{-}\), \(\sigma_{+}\) are elementary tensor operators,
and each of them acts only on a single quantum degree of freedom,
represented by an order-2 tensor \(U[i0; j0]\), where \(i0\) is the ket
mode and \(j0\) is the bra mode (an order-k elementary tensor operator
is represented by a tensor with \(k\) ket modes and \(k\) bra modes).
\(\omega_{c}\), \(\omega_{a}\), and \(\gamma\) are constant scalars.
\(g(t, \alpha, \beta)\) is a time-dependent scalar coefficient that
depends on time and two real parameters, \(\alpha\) and \(\beta\),
which parameterize the Liouvillian operator.
The C API function cudensitymatCreateElementaryOperator()
creates
an elementary tensor operator (cudensitymatElementaryOperator_t)
by specifying the number of quantum degrees of freedom (space modes)
it acts on, the extents of those space modes, the numerical data
type (real/complex, FP32/FP64), a GPU buffer containing the tensor
elements of the elementary tensor operator (generalized column-major layout),
and an optional user-defined tensor callback function. By registering
a user-provided tensor callback function, we allow the cuDensityMat library
to automatically update tensor elements of the elementary tensor operator
during the computation of the operator action on a quantum state,
based on the current value of time as well as an arbitrary set of real
parameters (the signature of the tensor callback function pointer
cudensitymatTensorCallback_t
is provided in the public
C header cudensitymat.h). Note that the user-provided
tensor callback function performs an update of the tensor elements
inside a temporary CPU buffer passed to the function callback,
and then automatically propagates it to the original GPU buffer
used during the definition of the elementary tensor operator.
Once the elementary tensor operator is no longer needed and
is not used by any operator term, it can be safely destroyed
by calling the C API function cudensitymatDestroyElementaryOperator().
Note that the GPU buffer storing the tensor elements of the elementary
tensor operator must stay valid for the whole lifetime of the
elementary tensor operator.
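As an illustration, here is a sketch that defines the annihilation operator \(b\) of an N-level cavity mode as a dense elementary tensor operator (the sparsity arguments and the zero-initialized "no callback" value are assumptions; consult cudensitymat.h for the exact prototypes):

```c
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cudensitymat.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

// Builds b[i; j] = sqrt(j) * delta(i, j-1) and wraps it as an elementary operator.
cudensitymatElementaryOperator_t make_annihilation_op(cudensitymatHandle_t handle,
                                                      int64_t numLevels)
{
    // Fill the operator tensor in generalized column-major order:
    // element (i, j) sits at linear index i + j * numLevels.
    const size_t numElems = (size_t)(numLevels * numLevels);
    cuDoubleComplex *hostElems = calloc(numElems, sizeof(cuDoubleComplex));
    for (int64_t j = 1; j < numLevels; ++j)
        hostElems[(j - 1) + j * numLevels] = make_cuDoubleComplex(sqrt((double)j), 0.0);

    // The GPU buffer must stay valid for the whole lifetime of the operator.
    void *devElems = NULL;
    cudaMalloc(&devElems, numElems * sizeof(cuDoubleComplex));
    cudaMemcpy(devElems, hostElems, numElems * sizeof(cuDoubleComplex),
               cudaMemcpyHostToDevice);
    free(hostElems);

    // No time dependence here: zero-initialized callback wrapper (assumption).
    cudensitymatWrappedTensorCallback_t noCallback;
    memset(&noCallback, 0, sizeof(noCallback));

    const int64_t modeExtents[1] = {numLevels};   // acts on one space mode
    cudensitymatElementaryOperator_t b;
    cudensitymatCreateElementaryOperator(handle, 1, modeExtents,
        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, 0, NULL,  // dense tensor
        CUDA_C_64F, devElems, noCallback, &b);
    return b;
}
```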
It is often convenient to represent a given quantum many-body operator
as a sum of different terms, with each term having its own physical meaning,
for example the system Hamiltonian, pulse/drive Hamiltonian, dissipation terms,
etc. For example, the Liouvillian operator defined above can be thought of
as composed of four operator terms (the operator term in the square
brackets comprises two tensor products of elementary tensor operators).
The cuDensityMat library provides the concept of the operator term
for this purpose, represented by the cudensitymatOperatorTerm_t
C type.
The C API function cudensitymatCreateOperatorTerm()
creates an empty
operator term by only specifying the tensor space shape where the quantum
states acted on by this operator term are residing. Subsequently, the user
can construct the desired operator term by gradually appending tensor products
of elementary tensor operators with some, generally time-dependent coefficients
to it. The C API function cudensitymatOperatorTermAppendElementaryProduct()
serves
this purpose. It accepts an array of elementary tensor operators forming the desired
tensor product, the list of quantum degrees of freedom (space modes) the tensor product
is acting on (concatenated across all constituting elementary tensor operators
in their respective order), and the duality of action on each space mode.
If the duality is set to 0, the bra mode of the elementary tensor operator
is picked to act on a ket mode of the quantum state (action from the left);
otherwise, the ket mode of the elementary tensor operator is picked to act
on a bra mode of the quantum state (action from the right). Finally,
the C API function cudensitymatOperatorTermAppendElementaryProduct()
also
accepts a static coefficient associated with the appended tensor product
as well as an optional user-defined coefficient callback function
(cudensitymatScalarCallback_t) that will also be associated with
the appended tensor product. The cuDensityMat library will use
the user-provided coefficient callback function for retrieving
the time-dependent part of the total coefficient associated with
the appended tensor product, where the total coefficient is a product
of the static coefficient and the value returned by the coefficient
callback function for a given value of time and an arbitrary set
of real parameters during operator action computation. For example,
the tensor product \(\frac{1}{2} g(t, \alpha, \beta) b \sigma_{+}\)
comprises two elementary tensor operators, \(b\) and \(\sigma_{+}\),
a constant (static) coefficient \(\frac{1}{2}\), and a time-dependent
parameterized coefficient \(g(t, \alpha, \beta)\) which can be defined
as a user-provided CPU callback function.
Once an operator term is no longer needed and is not used by any operator,
it can be safely destroyed by calling the C API function cudensitymatDestroyOperatorTerm().
Note that the elementary tensor operators constituting the operator term
must stay alive for the whole lifetime of the operator term.
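For the example Liouvillian, the tensor product \(\frac{1}{2} g(t, \alpha, \beta) b \sigma_{+}\) could be appended to an operator term roughly as follows (a sketch with the time-dependent callback omitted, i.e. zero-initialized; the exact argument lists are assumptions to check against cudensitymat.h):

```c
#include <cuComplex.h>
#include <cudensitymat.h>
#include <string.h>

// Appends (1/2) * b * sigma_+ acting from the left on modes 0 (cavity) and 1 (atom).
cudensitymatOperatorTerm_t make_drive_term(cudensitymatHandle_t handle,
                                           cudensitymatElementaryOperator_t b,
                                           cudensitymatElementaryOperator_t sigmaPlus)
{
    const int32_t numSpaceModes = 2;
    const int64_t spaceModeExtents[2] = {10, 2};   // cavity levels, two-level atom

    cudensitymatOperatorTerm_t term;
    cudensitymatCreateOperatorTerm(handle, numSpaceModes, spaceModeExtents, &term);

    cudensitymatElementaryOperator_t elemOps[2] = {b, sigmaPlus};
    const int32_t stateModes[2]  = {0, 1};  // space modes acted on, concatenated
    const int32_t modeDuality[2] = {0, 0};  // 0 => action from the left

    cuDoubleComplex staticCoef = make_cuDoubleComplex(0.5, 0.0);  // the 1/2 prefactor
    cudensitymatWrappedScalarCallback_t noCallback;
    memset(&noCallback, 0, sizeof(noCallback));   // g(t, alpha, beta) omitted here

    cudensitymatOperatorTermAppendElementaryProduct(handle, term,
        2, elemOps, stateModes, modeDuality, staticCoef, noCallback);
    return term;
}
```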
Once all necessary operator terms have been defined, the user
can proceed to defining the full quantum many-body operator (or superoperator),
represented by the cudensitymatOperator_t
C type, by calling the C API
function cudensitymatCreateOperator(). Similarly to creating an operator term,
this API function creates an empty operator by only taking the shape
of the tensor space where the quantum states it will act on reside.
Subsequently, the user can construct the desired operator by appending
the constituting operator terms into it via the C API function
cudensitymatOperatorAppendTerm(). Each appended operator term
can be supplied with a static coefficient as well as an optional
user-defined coefficient callback function such that the total coefficient
associated with the appended operator term will be a product of the two.
Additionally, the user can request to append the operator term in its
dual form, where the duality of all its modes will be flipped (ket <-> bra).
This is convenient for defining the von Neumann part of the master equation,
where the Hamiltonian acts from different sides on the density matrix
to form the commutator. Once the operator is no longer needed and all
computations in which it participated have been synchronized, it can be
safely destroyed by calling the C API function cudensitymatDestroyOperator().
Note that the constituting operator terms must stay alive for the whole
lifetime of the operator.
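A sketch of assembling a Liouvillian from a Hamiltonian term and a dissipation term, using the dual form to build the von Neumann commutator (the coefficient conventions and argument order are assumptions):

```c
#include <cuComplex.h>
#include <cudensitymat.h>
#include <string.h>

// Builds L(rho) = -i [H, rho] + (dissipation) from two prebuilt operator terms.
cudensitymatOperator_t make_liouvillian(cudensitymatHandle_t handle,
                                        cudensitymatOperatorTerm_t hamTerm,
                                        cudensitymatOperatorTerm_t dissipTerm)
{
    const int32_t numSpaceModes = 2;
    const int64_t spaceModeExtents[2] = {10, 2};

    cudensitymatOperator_t liouvillian;
    cudensitymatCreateOperator(handle, numSpaceModes, spaceModeExtents, &liouvillian);

    cudensitymatWrappedScalarCallback_t noCallback;
    memset(&noCallback, 0, sizeof(noCallback));

    // -i * H * rho : the Hamiltonian term acting from the left.
    cuDoubleComplex minusI = make_cuDoubleComplex(0.0, -1.0);
    cudensitymatOperatorAppendTerm(handle, liouvillian, hamTerm,
                                   /*duality=*/0, minusI, noCallback);

    // +i * rho * H : the same term in dual form (ket <-> bra flipped).
    cuDoubleComplex plusI = make_cuDoubleComplex(0.0, 1.0);
    cudensitymatOperatorAppendTerm(handle, liouvillian, hamTerm,
                                   /*duality=*/1, plusI, noCallback);

    // The dissipation term already encodes its two-sided action.
    cuDoubleComplex one = make_cuDoubleComplex(1.0, 0.0);
    cudensitymatOperatorAppendTerm(handle, liouvillian, dissipTerm,
                                   /*duality=*/0, one, noCallback);
    return liouvillian;
}
```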
Finally, in order to apply the defined quantum many-body operator to a quantum state,
the user needs to first prepare its action (only once) and then compute its action
as many times as needed. The C API function cudensitymatOperatorPrepareAction()
prepares
the action of the operator on a quantum state of a given kind. It will return
the workspace buffer size required for the actual computation. The user can then allocate
the GPU workspace buffer of the required size (or larger) and attach it
to the workspace descriptor used in the call (cudensitymatWorkspaceDescriptor_t).
Once the workspace is ready, calling the C API function cudensitymatOperatorComputeAction()
will apply the operator to a given (input) quantum state, accumulating its action into
another (output) quantum state. This API call takes the current value of time as well as
an arbitrary set of real parameters that will trigger an update of the value of the operator
via registered callback functions before starting the computation. The requested computation
can be inserted into a CUDA stream and executed asynchronously with respect to the host, in which
case it is the user's responsibility to synchronize the computation via cudaStreamSynchronize.
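A sketch of this prepare-then-compute sequence, including the workspace handling described in the next section (the compute-type enum, the workspace-size-limit argument, and the parameter-passing convention are assumptions modeled on recent releases):

```c
#include <cuda_runtime.h>
#include <cudensitymat.h>

// Prepares the action of op once, then applies it: rhoOut += L(t, params)(rhoIn).
void apply_operator(cudensitymatHandle_t handle, cudensitymatOperator_t op,
                    cudensitymatState_t rhoIn, cudensitymatState_t rhoOut,
                    cudaStream_t stream)
{
    cudensitymatWorkspaceDescriptor_t workspace;
    cudensitymatCreateWorkspace(handle, &workspace);

    // One-time preparation for this kind of quantum state,
    // capped by the currently available device memory.
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    cudensitymatOperatorPrepareAction(handle, op, rhoIn, rhoOut,
        CUDENSITYMAT_COMPUTE_64F, freeMem, workspace, stream);

    // Query, allocate, and attach the required scratch buffer.
    size_t scratchSize = 0;
    cudensitymatWorkspaceGetMemorySize(handle, workspace,
        CUDENSITYMAT_MEMSPACE_DEVICE, CUDENSITYMAT_WORKSPACE_SCRATCH, &scratchSize);
    void *scratch = NULL;
    cudaMalloc(&scratch, scratchSize);   // cudaMalloc is 256-byte aligned
    cudensitymatWorkspaceSetMemory(handle, workspace,
        CUDENSITYMAT_MEMSPACE_DEVICE, CUDENSITYMAT_WORKSPACE_SCRATCH,
        scratch, scratchSize);

    // Repeatable computation; time and (alpha, beta) feed the callbacks.
    const double time = 0.0;
    const double params[2] = {0.1, 0.2};   // hypothetical alpha, beta
    cudensitymatOperatorComputeAction(handle, op, time, 2, params,
                                      rhoIn, rhoOut, workspace, stream);

    cudaStreamSynchronize(stream);
    cudaFree(scratch);
    cudensitymatDestroyWorkspace(workspace);
}
```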
Workspace Descriptor¶
Each non-trivial numerical operation supported by the cuDensityMat
library generally requires a workspace buffer allocated in GPU memory.
The C API function cudensitymatCreateWorkspace()
creates an empty
workspace descriptor. The preparation stage for a desired
numerical computation sets the required workspace buffer size
inside the workspace descriptor. The user can then query the
required size via cudensitymatWorkspaceGetMemorySize()
and
allocate a GPU workspace buffer of at least that size. Once
the workspace buffer has been allocated (or taken from a
pre-allocated memory pool), it can be attached to the workspace
descriptor by calling the C API function cudensitymatWorkspaceSetMemory().
At this point, the workspace descriptor is ready to be passed
to the actual computation call. The workspace descriptor
can be reused after the computation call it was passed to
has been synchronized, provided that the workspace buffer
size is sufficient for the next computation call. The GPU
buffer used by the workspace descriptor must stay allocated
while the workspace descriptor is in use. Once the workspace
descriptor is no longer needed, it can be safely destroyed
via cudensitymatDestroyWorkspace().
Note that any GPU buffer attached to a workspace descriptor
must be aligned to a 256-byte boundary, which is guaranteed
when using cudaMalloc. For custom GPU memory pools, this
alignment must be respected explicitly.
Coupled Quantum Dynamics Master Equations (System of ODE)¶
While the concept of a Liouvillian superoperator and its action
on a quantum state is sufficient to specify the right-hand side
of the broadly used master equations, like the Lindblad equation,
in a more general setting one may need to be able to specify
a system of coupled ordinary differential equations (ODE)
to simultaneously evolve multiple (potentially batched) quantum
states under their own Liouvillian operators, as is done
in the hierarchical equations of motion (HEOM) method. In such cases,
the user needs to define the so-called composite operator action,
represented by a set of C objects of the cudensitymatOperatorAction_t
type. They can be constructed by calling the C API function
cudensitymatCreateOperatorAction(). The composite operator action
assumes a simultaneous evolution of \(M\) quantum states described
by a system of \(M\) coupled ODEs. For each coupled ODE from
the ODE system, one needs to construct a separate operator action object
(cudensitymatOperatorAction_t) by calling cudensitymatCreateOperatorAction(),
thus requiring \(M\) operator action objects for the full ODE system
specification. Construction of each operator action object
representing a specific ODE from the coupled ODE system accepts
\(M\) operators corresponding to the \(M\) evolved quantum states,
where each operator will be acting on its corresponding quantum state,
accumulating into the right-hand side of that specific ODE.
In turn, the right-hand side of that specific ODE determines
the time derivative of the corresponding quantum state from
the set of \(M\) quantum states being evolved in time.
If a specific quantum state is not present in a specific ODE
in the coupled ODE system, the corresponding operator is formally zero,
which is specified by providing a NULL pointer in the corresponding position.
Once the \(M\) operator action (cudensitymatOperatorAction_t) objects
have been constructed, that is, the coupled ODE system has been fully
specified, one can prepare each of them for computation by calling the
C API function cudensitymatOperatorActionPrepare(), which will return
the size of the required workspace buffer inside the workspace
descriptor. In order to prepare each operator action object,
this API function needs to know the kind of quantum states
to which the operators will be applied. Once prepared, the user
can then query the required workspace buffer size, allocate
the GPU workspace buffer of the required (or larger) size and attach
it to the workspace descriptor (cudensitymatWorkspaceDescriptor_t)
used in the call. Once the workspace is ready, calling the C API function
cudensitymatOperatorActionCompute()
will apply all operators
to the given (input) quantum states, accumulating their combined action
into the output quantum state. This API call takes the current value
of time as well as an arbitrary set of real parameters that will trigger
an update of the value of the operator via registered callback
functions before starting the computation. The requested computation
can be inserted into a CUDA stream and executed asynchronously with respect to the host,
in which case it is the user's responsibility to synchronize the computation
via cudaStreamSynchronize. Note that if all constructed and prepared
operator actions (cudensitymatOperatorAction_t) are placed in the same
CUDA stream, they will be computed in-order, in which case the same
workspace buffer can be reused for all these prepared operator actions
as long as the buffer size is large enough for any of them.
Note that the operator action object can also be constructed for
a single (potentially batched) quantum state acted on by a single
Liouvillian operator, in which case it is equivalent to preparing
and computing the operator action directly via the operator API.
When the operator action object is no longer needed, it can be
safely destroyed by calling cudensitymatDestroyOperatorAction().
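A sketch of specifying a two-equation coupled ODE system (the cudensitymatCreateOperatorAction() argument list is an assumption patterned on the description above):

```c
#include <stddef.h>
#include <cudensitymat.h>

// ODE system:  d rho0/dt = L00(rho0) + L01(rho1)
//              d rho1/dt = L11(rho1)            (rho0 absent: NULL operator)
void build_ode_system(cudensitymatHandle_t handle,
                      cudensitymatOperator_t L00, cudensitymatOperator_t L01,
                      cudensitymatOperator_t L11,
                      cudensitymatOperatorAction_t actions[2])
{
    // Right-hand side of the first ODE: one operator per evolved state.
    cudensitymatOperator_t rhs0[2] = {L00, L01};
    cudensitymatCreateOperatorAction(handle, 2, rhs0, &actions[0]);

    // Right-hand side of the second ODE.
    cudensitymatOperator_t rhs1[2] = {NULL, L11};
    cudensitymatCreateOperatorAction(handle, 2, rhs1, &actions[1]);

    // Each action is then prepared via cudensitymatOperatorActionPrepare()
    // and computed via cudensitymatOperatorActionCompute().
}
```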
Properties of Quantum States¶
In order to compute different properties of a quantum state, the user can
create, prepare, and compute the corresponding property calculation object.
The C API function cudensitymatCreateExpectation()
takes an arbitrary
operator (cudensitymatOperator_t) and creates the expectation value
calculation object for it (cudensitymatExpectation_t).
The C API function cudensitymatExpectationPrepare(), which needs to be called
only once, prepares the expectation value calculation object for computation
for a given kind of quantum states. Similarly to other prepare functions,
it sets the required workspace buffer size inside the workspace descriptor.
Then the user will need to allocate a GPU buffer of that (or larger) size
and attach it to the workspace descriptor. Once prepared, the expectation
value of the operator can be computed any number of times via
cudensitymatExpectationCompute()
for the same or different quantum states
of the same kind (same shape, purity, and data type). If the quantum state
is batched, cudensitymatExpectationCompute()
will compute the expectation
values for all instances of the batch, thus returning an array of expectation
values, with the array size expected to be at least the batch size.
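A sketch of this expectation-value sequence for a non-batched mixed state (workspace attachment proceeds as in the earlier sketches; the prepare/compute argument lists are assumptions):

```c
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cudensitymat.h>

// Prepares once, then computes the expectation value of obs over the state rho.
void measure(cudensitymatHandle_t handle, cudensitymatOperator_t obs,
             cudensitymatState_t rho,
             cudensitymatWorkspaceDescriptor_t workspace, cudaStream_t stream)
{
    cudensitymatExpectation_t expectation;
    cudensitymatCreateExpectation(handle, obs, &expectation);

    // One-time preparation for this kind of quantum state.
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    cudensitymatExpectationPrepare(handle, expectation, rho,
        CUDENSITYMAT_COMPUTE_64F, freeMem, workspace, stream);

    // ... query the workspace size and attach a GPU buffer as shown earlier ...

    // One expectation value per batch instance (batch size 1 here).
    cuDoubleComplex expVal;
    cudensitymatExpectationCompute(handle, expectation,
        /*time=*/0.0, /*numParams=*/0, /*params=*/NULL,
        rho, &expVal, workspace, stream);
    cudaStreamSynchronize(stream);

    cudensitymatDestroyExpectation(expectation);
}
```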
Multi-GPU Multi-Node Execution¶
The cuDensityMat library C API works on both single and multiple GPUs.
Running parallel jobs on multiple GPUs requires building an MPI plugin
after downloading the library archive. After unpacking the archive,
one can find the script called activate_mpi_cudm.sh
inside
the distributed_interfaces
folder. By running the script
inside that folder, the MPI interface plugin library
libcudensitymat_distributed_interface_mpi.so
will be built.
The script requires environment variables CUDA_PATH
and MPI_PATH
to point to the CUDA and MPI installation directories, respectively.
The provided MPI installation must point to a CUDA-aware MPI implementation.
The script will also set environment variable CUDENSITYMAT_COMM_LIB
to point to the MPI interface plugin library location. This variable
should be exported globally to make sure the cuDensityMat library
can find the MPI interface plugin library during run time.
Once the MPI interface plugin has been set up, the activation of
parallel execution on multiple GPUs is achieved by calling the C API
function cudensitymatResetDistributedConfiguration(). This function
accepts a duplicate of an MPI communicator in a type-erased form,
for example a duplicate of the MPI_COMM_WORLD
communicator
obtained via an MPI_Comm_dup call (consult the MPI library documentation).
After that point, all numerical computations will run in parallel
across all MPI ranks available inside the provided MPI communicator.
The user can query the total number of parallel ranks and the rank
of the current process by calling cudensitymatGetNumRanks()
and
cudensitymatGetProcRank(), respectively. The cuDensityMat library
supports parallel execution only with a power of two number of parallel
processes, one GPU per process.
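A sketch of the MPI setup described above (assumes a CUDA-aware MPI and the CUDENSITYMAT_DISTRIBUTED_PROVIDER_MPI provider tag; the reset-configuration argument list is an assumption):

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <cudensitymat.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // One GPU per process, e.g. round-robin over the visible devices.
    int rank = 0, numDevices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&numDevices);
    cudaSetDevice(rank % numDevices);

    cudensitymatHandle_t handle;
    cudensitymatCreate(&handle);

    // Hand the library a duplicate of the communicator in type-erased form.
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    cudensitymatResetDistributedConfiguration(handle,
        CUDENSITYMAT_DISTRIBUTED_PROVIDER_MPI, &comm, sizeof(comm));

    int32_t numRanks = 0, procRank = 0;
    cudensitymatGetNumRanks(handle, &numRanks);
    cudensitymatGetProcRank(handle, &procRank);
    printf("Process %d of %d\n", (int)procRank, (int)numRanks);

    // ... parallel computations ...

    cudensitymatDestroy(handle);
    MPI_Finalize();
    return 0;
}
```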
Citing cuQuantum¶
Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.