Overview#

This section describes the basic working principles of the cuPauliProp library. For a general introduction to quantum circuits, please refer to Introduction to quantum computing.

Introduction to Pauli propagation#

Pauli propagation is an approximate method to find expectation values and similar quantities of qubit states output from quantum circuits. Its key features include that the simulated state is sparsely represented in the Pauli basis via some dynamically sized data structure, which is regularly “truncated” to discard information which contributes insignificantly to the sought output quantity. Truncating the evolving state permits Pauli propagation to curtail the costs which make exact simulation techniques prohibitively expensive. As such, Pauli propagation can approximate the outputs of noisy quantum circuits far beyond the simulable regime of full-state methods.

Pauli basis#

The Pauli matrices \(I\), \(X\), \(Y\) and \(Z\) form a basis for all \(2 \times 2\) matrices. We refer to a tensor product of these matrices as a Pauli string which we notate in shorthand as \(X \otimes Y \otimes Z = XYZ\). Pauli strings of \(N\) sites form a basis for all \(2^N \times 2^N\) matrices, and so any \(N\)-qubit density matrix or observable operator can be expressed as a weighted sum of Pauli strings. For example,

\[\begin{split}\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \frac{1}{2}(a + d) I + \frac{1}{2}(b+c) X + \frac{i}{2}(b-c)Y + \frac{1}{2} (a-d) Z\end{split}\]

We refer to a weighted sum of Pauli strings as a “Pauli expansion”. Real coefficients are sufficient to express all Hermitian operators and physically valid states, as result from completely positive and trace preserving (CPTP) operations; like all physically valid quantum circuits and noise processes. Complex coefficients permit encoding any matrix in the Pauli basis, and simulation of even non-physical states and processes. The cuPauliProp library supports both real and complex coefficients, in single and double precision, indicated by passing the below cudaDataType_t enums to the API.

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

Pauli strings can be efficiently encoded using bitfields, and it is necessary to understand cuPauliProp’s convention in order to correctly initialise a Pauli expansion. Each Pauli string is instantiated as a sequence of “packed integers” (using the cupaulipropPackedIntegerType_t type) where a bit value of \(1\) at bit index \(i\) indicates a non-identity Pauli operating upon qubit \(i\). We employ two packed integers for each span of 64 qubits, which separately indicate the locations of \(X\) and \(Z\) Paulis, and where \(Y\) is indicated by a set bit in both packed integers. For example,

\[\begin{split}IXIZIY \equiv \begin{cases} \text{xInt}: 010001 \\ \text{zInt}: 000101 \end{cases}\end{split}\]

Pauli strings longer than 64 qubits are represented by a sequence of packed integers. For example, here notating a Pauli operator \(\sigma\) acting upon the \(i\)-th qubit as \(\sigma_i\) (with all unspecified Paulis being assumed identity),

\[\begin{split}X_0 X_{63} Z_{64} Y_{127} \equiv \begin{cases} \text{xInts}: \{ 10 \dots 01, \; 00 \dots 01 \} \\ \text{zInts}: \{ 00 \dots 00, \; 10 \dots 01 \} \end{cases}\end{split}\]

Note

The bitfields above are written with the zero-th bit on the left for clarity. The packed integers accepted by the cuPauliProp API expect the n-th qubit corresponds to the n-th bit, as accessed via (packedint >> n) & 1. In other words, the Pauli string \(X_0 Z_1\) is encoded into \(X\) and \(Z\) packed integers with respective decimal values of \(1\) and \(2\).

When a Pauli string has a length which is not a multiple of \(64\), its final packed integer(s) are padded with unset (i.e. \(0\)) significant bits. Finally, the multiple Pauli strings which make up a Pauli expansion are stored together as a single, contiguous sequence of packed integers, with the \(X\) component(s) of each string placed before the corresponding \(Z\) component(s). For example, using \(\vert\) to delimit the paadded bits,

\[ZZZ + XYI \equiv \{ 000 \vert 0\dots0, \; 111 \vert 0\dots0, \;\; 110 \vert 0\dots0, \; 010 \vert 0\dots0 \}\]

The coefficients of the Pauli strings are stored in a separate array. We refer to a Pauli string along with its corresponding coefficient as a “term” of a Pauli expansion. The memory requirements of a Pauli expansion grow linearly both with the number of qubits and terms, at a total cost of \(\mathcal{O}(\# \text{qubits} \times \# \text{terms})\).

The API function cupaulipropCreatePauliExpansion() creates a Pauli expansion from GPU device memory which is assumed to have been pre-initialised as per the encodings above.

Applying gates#

Operations like gates and decoherence channels can be performed upon a Pauli expansion, “evolving” the state. This can change some or all of the Pauli strings and coefficients in the expansion, and/or create new terms. Let \(s_{\text{in}}\) be an input Pauli expansion, and let \(\hat{U}\) be some unitary operator to be applied. The resulting expansion is given by

\[s_{\text{out}} = \hat{U} \, s_{\text{in}} \, \hat{U}^\dagger,\]

or we can alternatively apply the adjoint of \(\hat{U}\) (by passing adjoint=1 to the relevant cuPauliProp API function) to obtain

\[s_{\text{out}} = \hat{U}^\dagger \, s_{\text{in}} \, \hat{U}.\]

These respective forms typically appear in the Schrödinger and Heisenberg pictures of quantum evolution. For instance, letting \(\rho_{\text{in}}\) denote an input state to a circuit described by the channel \(\mathcal{D}\), and letting \(\hat{O}\) denote some observable operator, the expectation value of \(\hat{O}\) can be expressed in either of these pictures as

\[\langle O \rangle = \text{Tr}\left( \mathcal{D}(\rho_{\text{in}}) \, \hat{O} \right) = \text{Tr}\left( \rho_{\text{in}} \, \mathcal{D}^\dagger(\hat{O}) \right).\]

The adjoint circuit \(\mathcal{D}^\dagger\) is merely the reversed circuit with each operator therein adjointed. Pauli propagation is often used (though not exclusively) in the latter, Heisenberg picture, since observable operators \(\hat{O}\) of experimental interest are almost always sparse in the Pauli basis and are ideal initial Pauli expansions. In contrast, even the ubiquitous, fiducial \(N\)-qubit zero-state contains \(2^N\) terms when expressed in the Pauli basis. For example:

\[|0\rangle\langle 0|^{\otimes 3} = \frac{1}{2^3} \left( III + IIZ + IZI + IZZ + ZII + ZIZ + ZZI + ZZZ \right).\]

The action of an operator upon a state expressed in the Pauli basis is captured by the operator’s Pauli transfer matrix (PTM). Constructing and processing the PTM directly however is usually unnecessary, and cuPauliProp employs bespoke, optimised strategies for simulating specific operations in varying settings. While these are abstracted away from the user, it remains important to understand how different operations can principally change the number of terms stored in the Pauli expansion.

Clifford gates, such as Hadamard and controlled NOTs, map Pauli strings one-to-one and do not change the Pauli expansion size. For example,

\[\hat{H} Z \hat{H}^\dagger = X.\]

Pauli noise channels affect only the coefficients of Pauli strings, also leaving the Pauli expansion size unchanged. For example, two-qubit depolarising noise of probability \(p\) effects

\[\Lambda_p(XY) = \left(1 - \frac{16}{15}p \right) XY.\]

Pauli rotation gates, regardless of the number of targeted qubits, map Pauli strings one-to-one or one-to-two. In the worst case, they can double the size of a Pauli expansion. For example

\[e^{-i\theta/2 XYZ} \, \big( ZYZ \big) \, e^{i\theta/2 XYZ} = \cos(\theta) ZYZ - \sin(\theta) YII\]

So unlike the fixed-memory techniques employed by statevector simulation, Pauli propagation maintains a dynamically sized state. Circuits composed of many Pauli rotations can see the initial Pauli expansion rapidly grow, requiring greater memory capacity and gradually slowing simulation. This mechanism is responsible for the main computational expense of the Pauli propagation paradigm, and in many settings, necessitates the use of truncation (introduced below), to avoid the number of terms in the Pauli expansion exceeding the expansion’s underlying static buffer size.

The cuPauliProp library presently supports the gates and noise channels indicated by the below constructors.

cupaulipropCreatePauliRotationGateOperator()

cupaulipropCreatePauliNoiseChannelOperator()

cupaulipropCreateCliffordGateOperator()

These operators are effected upon a Pauli expansion via

cupaulipropPauliExpansionViewComputeOperatorApplication()

Traces#

The final output of a Pauli propagation simulation is typically a scalar produced via a trace or an inner product. For example, we might seek the expectation value of an observable operator \(\hat{O}\) as admitted by final state \(\rho_{\text{out}}\), i.e.

\[\langle O \rangle = \text{Tr}\left( \rho_{\text{out}} \, \hat{O} \right).\]

Had we worked in the Heisenberg picture and evolved the observable \(\hat{O}\) through the adjoint circuit to produce \(\hat{O}_{\text{out}}\), then the final expectation value could be expressed in terms of the initial state \(\rho_{\text{in}}\) as

\[\langle O \rangle = \text{Tr}\left( \rho_{\text{in}} \, \hat{O}_{\text{out}} \right).\]

Because distinct Pauli strings are trace-orthogonal (i.e. \(\sigma_1 \ne \sigma_2 \implies \text{Tr}(\sigma_1 \sigma_2)=0\)), these expectation values can be quickly evaluated when either of \(\rho_{\text{in}}\) or \(\hat{O}_{\text{out}}\) are sparse in the Pauli basis. Evaluation remains efficient even when \(\rho_{\text{in}} = |0\rangle\langle 0|^{\otimes N}\), i.e. when the circuit input state is the ubiquitous fiducial zero-state, despite that it contains \(2^N\) terms when expressed in the Pauli basis. This is through a bespoke evaluation of the trace which avoids explicitly instantiating \(|0\rangle\langle 0|^{\otimes N}\). We sometimes instead seek quantities of the more general form

\[\langle A,B\rangle = \text{Tr}\left( s_A \, s_B \right)\]

where \(s_A\) and \(s_B\) are both Pauli expansions resulting from separate evolutions, as capture e.g. out-of-time-order correlators.

The cuPauliProp library supports evaluation of the above quantities through functions

cupaulipropPauliExpansionViewComputeTraceWithExpansionView()

cupaulipropPauliExpansionViewComputeTraceWithZeroState()

Truncation#

We saw quantum operations upon a Pauli expansion can cause it to “grow”, increasing the number of weighted Pauli strings contained within. Simulation of some systems, such as near-Clifford circuits, see the expansion grow sufficiently slowly that the entire circuit can be applied at full accuracy. In more general settings however, the exponential growth of the expansion size with increasing gate depth thwarts full-state simulation. We instead perform approximate simulation by discarding Pauli strings on the fly which we predict to contribute insignificantly to our final output quantity (i.e. the traces above).

The cuPauliProp library so far supports two truncation strategies:

Coefficient truncation with cutoff \(c\) discards any Pauli string from the expansion for which the corresponding coefficient has an absolute value smaller or equal to \(c\). Since the quantities eventually output by simulation (as seen above) depend linearly upon the final Pauli expansion coefficients, discarding small coefficients is expected to induce only a small error upon the final output. In other words, we discard strings which we expect to contribute the least to the output scalar.

Pauli weight truncation with cutoff \(w\) discards any Pauli string of Pauli weight strictly exceeding \(w\). Pauli weight refers to the number of non-identity Pauli operators in the string. For example, string \(IIXIZ\) has a Pauli weight of \(2\) and \(XYZYX\) has a Pauli weight of \(5\). Pauli weight truncation is well-suited to simulation where the final sought output quantity is the trace of the truncated expansion with a low-weight (or local) operator, and where the simulated circuit increases average Pauli weight. Pauli weight truncation can work well when ultimately computing the trace with the zero state.

Ideal choices of the truncation parameters, and precisely when to perform truncations, can be hard to prior motivate and are often instead found through trial and error. Frequent truncation with low cutoffs may sabotage accuracy, while infrequent truncations whith high cutoffs can cause prohibitive expansion growth and exhausting of memory resources.

The cuPauliProp API permits truncating a Pauli expansion in isolation, or during application of an operator for potential runtime savings, through functions:

cupaulipropPauliExpansionViewExecuteTruncation()

cupaulipropPauliExpansionViewComputeOperatorApplication()

Deduplication and sorting#

In general, the Pauli strings in a Pauli expansion are arbitrarily ordered, and can be duplicated across terms. The cuPauliProp API permits forcing sortedness and Pauli string uniqueness as optional postconditions to most (if not all) functions which modify an expansion. This is helpful for users performing their own analyses on the simulated states.

Further, sorting the terms via a canonical ordering of the Pauli strings, and/or ensuring their uniqueness among terms, can have performance advantages. Some API operations can leverage when the input Pauli expansion has either precondition for accelerated treatment. Conversely however, gratuitously forcing either property as postconditions can harm performance. As such, when the user does not force either postcondition, the cuPauliProp library endeavours to internally choose ideal postconditions of the output expansion based on the preconditions of the input expansion, and the operation being performed.

The cuPauliProp API permits declaring whether an input expansion already has these properties, or whether an output expansion should gain/keep these properties, respectively through functions

cupaulipropCreatePauliExpansion()

cupaulipropPauliExpansionViewComputeOperatorApplication()

The properties can also be established in isolation via

cupaulipropPauliExpansionViewExecuteDeduplication()

cupaulipropPauliExpansionViewExecuteCanonicalSort()

and queried of a given expansion via

cupaulipropPauliExpansionIsSorted()

cupaulipropPauliExpansionIsDeduplicated()

API design#

We finally clarify some important nuances of the cuPauliProp API.

In lieu of accepting a full input Pauli expansion, most API functions accept a view thereof. A view restricts the functions to seeing and processing only a user-chosen subset of the terms in a Pauli expansion, and is useful when the expansion is otherwise too large (or will become too large by subsequent API calls) to be processed in its entirety. This enables hybridising depth-first and breadth-first evaluation of the Pauli paths, reducing total memory requirements at the cost of increased runtime, and expanding the set of simulable systems.

The API is predominantly out-of-place, where in addition to the immutable view of an input Pauli expansion, functions accept an output expansion to be overwritten. Typical simulation therefore involves the creation of two Pauli expansions, where both alternate serving as input and output to the API.

Quantum operators which can grow the Pauli expansion (such as Pauli rotations) do so typically by an unpredictable amount, though which can be upperbounded; Pauli rotations at most double the number of terms, for example. When applying operators, the cuPauliProp API requires upfront that the output Pauli expansion has sufficient memory capacity to store the worst case growth. Explicitly, applying a Pauli rotation to an input Pauli expansion view containing \(T\) terms requires that the passed output Pauli expansion has capacity to store at least \(2T\) terms.

Many API functions circumstantially require temporary memory, as is provided by a user-maintained “workspace”. Typical simulation involves maintaining a workspace (and an attached memory buffer) alongside the Pauli expansions, which has either been pre-allocated to be sufficiently large, or is regularly checked to have sufficient capacity.

A CUDA stream can be optionally set upon the cuPauliProp library handle. When no stream is set, all API functions are synchronous. When a stream is set, all functions should be assumed asynchronous, although select functions will remain synchronous as internally necessary. It is safe to use the cuPauliProp API without explicit synchronisation calls when respecting the usual stream semantics.

Citing cuQuantum#

1. Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.