Examples

In this section, we show a basic example of how to use the cuPauliProp API for an end-to-end quantum circuit simulation.

Building code

Assuming your cuPauliProp header files are located in CUPAULIPROP_INCLUDE_DIR and the shared library is located in CUPAULIPROP_LIB_DIR, the example code can be built via the following command:

nvcc kicked_ising_example.cpp -I${CUPAULIPROP_INCLUDE_DIR} -L${CUPAULIPROP_LIB_DIR} -lcupauliprop -o kicked_ising_example

When building statically, use the following command:

nvcc kicked_ising_example.cpp -I${CUPAULIPROP_INCLUDE_DIR} ${CUPAULIPROP_LIB_DIR}/libcupauliprop_static.a -o kicked_ising_example

When using cuQuantum downloaded from the NVIDIA devzone website and extracted in CUQUANTUM_ROOT, the header files are located in ${CUQUANTUM_ROOT}/include and the shared library is located in ${CUQUANTUM_ROOT}/lib. For convenience, we also provide a Makefile and CMakeLists.txt file in the same folder as the example code.
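For instance, a minimal shell session might look like the following; the installation path shown is hypothetical, so substitute your own extraction location:

```shell
# Hypothetical extraction path; adjust to where you unpacked cuQuantum
export CUQUANTUM_ROOT=/opt/nvidia/cuquantum
export CUPAULIPROP_INCLUDE_DIR=${CUQUANTUM_ROOT}/include
export CUPAULIPROP_LIB_DIR=${CUQUANTUM_ROOT}/lib

# Build against the shared library
nvcc kicked_ising_example.cpp -I${CUPAULIPROP_INCLUDE_DIR} -L${CUPAULIPROP_LIB_DIR} -lcupauliprop -o kicked_ising_example

# The runtime linker must also be able to find the shared library
export LD_LIBRARY_PATH=${CUPAULIPROP_LIB_DIR}:${LD_LIBRARY_PATH}
./kicked_ising_example
```

When linking statically, the LD_LIBRARY_PATH step is unnecessary.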

Kicked Ising example

In this example, we demonstrate use of the cuPauliProp API to perform an end-to-end simulation of IBM’s 127 qubit kicked Ising experiment, as presented in Nature volume 618, pages 500–505 (2023). We will specifically simulate a single datum of Figure 4 b, which is the expectation value of observable \(Z_{62}\) after a 20-repetition Trotter circuit. The full demonstration is available at kicked_ising_example.cpp.

Our example code will make use of the helper macros below, which handle errors returned by the CUDA and cuPauliProp APIs.

// ========================================================================
// CUDA and cuPauliProp error handling
// ========================================================================

#define HANDLE_CUPP_ERROR(x)                               \
{                                                          \
  const auto err = x;                                      \
  if (err != CUPAULIPROP_STATUS_SUCCESS)                   \
  {                                                        \
    printf("cuPauliProp error in line %d\n", __LINE__);    \
    fflush(stdout);                                        \
    std::abort();                                          \
  }                                                        \
};


#define HANDLE_CUDA_ERROR(x)                                \
{                                                           \
  const auto err = x;                                       \
  if (err != cudaSuccess)                                   \
  {                                                         \
    const char * error = cudaGetErrorString(err);           \
    printf("CUDA Error: %s in line %d\n", error, __LINE__); \
    fflush(stdout);                                         \
    std::abort();                                           \
  }                                                         \
};

We next decide the memory budget for our simulation. For our chosen combination of simulated circuit, output observable and truncation parameters, it turns out that only around 60 MiB of total device memory is needed. However, we also demonstrate how to dedicate all available device memory to the simulation, which this example optionally performs when USE_MAX_VRAM = true.

// ========================================================================
// Memory usage
// ========================================================================

// Each Pauli expansion has two pre-allocated GPU buffers, storing packed
// integers (which encode Pauli strings) and corresponding coefficients.
// As much memory can be dedicated as your hardware allows, while the
// minimum required is specific and very sensitive to the simulated circuit,
// studied observable, and the chosen truncation hyperparameters.
// Some operations also require additional workspace memory which is also
// ideally pre-allocated, and can be established using the API 'Prepare'
// functions (e.g. cupaulipropPauliExpansionViewPrepareTraceWithZeroState).
// In this demo, we dedicate either (a percentage of) the entirety of GPU
// memory uniformly between the required memory buffers, or instead use a
// fixed hardcoded amount which has been prior tested to be consistent with
// our other simulation parameters (like truncations); these choices are
// toggled via USE_MAX_VRAM below.

// =true to use MAX_VRAM_PERCENT of VRAM, and =false to use fixed memories below
bool USE_MAX_VRAM = false;
double MAX_VRAM_PERCENT = 90; // 0-100%

size_t FIXED_EXPANSION_PAULI_MEM = 16 * (1LLU << 20); // bytes = 16 MiB
size_t FIXED_EXPANSION_COEF_MEM  =  4 * (1LLU << 20); // bytes = 4  MiB
size_t FIXED_WORKSPACE_MEM       = 20 * (1LLU << 20); // bytes = 20 MiB

We here scaffold the quantum circuit we will subsequently instantiate. In short, it is a Trotter circuit encoding the dynamics of a kicked Ising system, with a heavy-hex topology matching the connectivity of IBM’s 127 qubit Eagle device.

// ========================================================================
// Circuit preparation (Trotterised Ising on IBM heavy hex topology)
// ========================================================================

// This demo simulates the circuits experimentally executed by IBM in article
// 'Nature volume 618, pages 500–505 (2023)'. This is a circuit Trotterising
// the evolution operator of a 2D transverse-field Ising model, but where the
// prescribed ZZ rotations have a fixed angle of -pi/2, and where the X angles
// are arbitrarily set/swept; later, we will fix the X angle to be pi/4. The
// Hamiltonian ZZ interactions are confined to a heavy-hex topology, matching
// the connectivity of the IBM Eagle processor 'ibm_kyiv', as we fix below.

const int NUM_CIRCUIT_QUBITS = 127;
const int NUM_ROTATIONS_PER_LAYER = 48;
const int NUM_PAULIS_PER_X_ROTATION = 1;
const int NUM_PAULIS_PER_Z_ROTATION = 2;

const double PI = 3.14159265358979323846;
const double ZZ_ROTATION_ANGLE = - PI / 2.0;

// Indices of ZZ-interacting qubits which undergo the first (red) Trotter round
const int32_t ZZ_QUBITS_RED[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION] = {
  {  2,   1},  { 33,  39}, { 59,  60}, { 66,  67}, { 72,  81}, {118, 119},
  { 21,  20},  { 26,  25}, { 13,  12}, { 31,  32}, { 70,  74}, {122, 123},
  { 96,  97},  { 57,  56}, { 63,  64}, {107, 108}, {103, 104}, { 46,  45},
  { 28,  35},  {  7,   6}, { 79,  78}, {  5,   4}, {109, 114}, { 62,  61},
  { 58,  71},  { 37,  52}, { 76,  77}, {  0,  14}, { 36,  51}, {106, 105},
  { 73,  85},  { 88,  87}, { 68,  55}, {116, 115}, { 94,  95}, {100, 110},
  { 17,  30},  { 92, 102}, { 50,  49}, { 83,  84}, { 48,  47}, { 98,  99},
  {  8,   9},  {121, 120}, { 23,  24}, { 44,  43}, { 22,  15}, { 53,  41}
};

// Indices of ZZ-interacting qubits which undergo the second (blue) Trotter round
const int32_t ZZ_QUBITS_BLUE[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION] = {
  { 53,  60}, {123, 124}, { 21,  22}, { 11,  12}, { 67,  68}, {  2,   3},
  { 66,  65}, {122, 121}, {110, 118}, {  6,   5}, { 94,  90}, { 28,  29},
  { 14,  18}, { 63,  62}, {111, 104}, {100,  99}, { 45,  44}, {  4,  15},
  { 20,  19}, { 57,  58}, { 77,  71}, { 76,  75}, { 26,  27}, { 16,   8},
  { 35,  47}, { 31,  30}, { 48,  49}, { 69,  70}, {125, 126}, { 89,  74},
  { 80,  79}, {116, 117}, {114, 113}, { 10,   9}, {106,  93}, {101, 102},
  { 92,  83}, { 98,  91}, { 82,  81}, { 54,  64}, { 96, 109}, { 85,  84},
  { 87,  86}, {108, 112}, { 34,  24}, { 42,  43}, { 40,  41}, { 39,  38}
};

// Indices of ZZ-interacting qubits which undergo the third (green) Trotter round
const int32_t ZZ_QUBITS_GREEN[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION] = {
  { 10,  11}, { 54,  45}, {111, 122}, { 64,  65}, { 60,  61}, {103, 102},
  { 72,  62}, {  4,   3}, { 33,  20}, { 58,  59}, { 26,  16}, { 28,  27},
  {  8,   7}, {104, 105}, { 73,  66}, { 87,  93}, { 85,  86}, { 55,  49},
  { 68,  69}, { 89,  88}, { 80,  81}, {117, 118}, {101, 100}, {114, 115},
  { 96,  95}, { 29,  30}, {106, 107}, { 83,  82}, { 91,  79}, {  0,   1},
  { 56,  52}, { 90,  75}, {126, 112}, { 36,  32}, { 46,  47}, { 77,  78},
  { 97,  98}, { 17,  12}, {119, 120}, { 22,  23}, { 24,  25}, { 43,  34},
  { 42,  41}, { 40,  39}, { 37,  38}, {125, 124}, { 50,  51}, { 18,  19}
};

We instantiate the circuit as a std::vector of cupaulipropQuantumOperator_t types. In this demonstration, all circuit gates are Pauli rotations of 1 or 2 qubits, namely

\[\begin{split}Rx(\theta_X) = \exp\left(-i\theta_X/2 \, \hat{X}\right), \\ Rzz(\theta_Z) = \exp\left(-i\theta_Z/2 \, \hat{Z}\otimes\hat{Z}\right).\end{split}\]

We fix \(\theta_Z\) to be \(-\pi/2\), and will later substitute a value for the X-rotation strength \(\theta_X\).

// ========================================================================
// Circuit construction
// ========================================================================

// Each 'step' of the Trotter circuit alternates a layer of single-qubit X
// rotations on every qubit, then a sequence of two-qubit Z rotations on the
// heavy-hex topology, upon qubit pairs in the red, blue and green lists
// above. Note that ZZ rotations about -pi/2 are actually Clifford, though
// we still here treat them like a generic Pauli rotation. The functions
// below construct a Trotter circuit with a variable number of steps.


std::vector<cupaulipropQuantumOperator_t> getXRotationLayer(
  cupaulipropHandle_t handle, double xRotationAngle
) {
  std::vector<cupaulipropQuantumOperator_t> layer(NUM_CIRCUIT_QUBITS);

  const cupaulipropPauliKind_t paulis[NUM_PAULIS_PER_X_ROTATION] = {CUPAULIPROP_PAULI_X};

  for (int32_t i=0; i<NUM_CIRCUIT_QUBITS; i++) {
    HANDLE_CUPP_ERROR(cupaulipropCreatePauliRotationGateOperator(
      handle, xRotationAngle, NUM_PAULIS_PER_X_ROTATION, &i, paulis, &layer[i]));
  }

  return layer;
}


std::vector<cupaulipropQuantumOperator_t> getZZRotationLayer(
  cupaulipropHandle_t handle,
  const int32_t topology[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION]
) {
  std::vector<cupaulipropQuantumOperator_t> layer(NUM_ROTATIONS_PER_LAYER);

  const cupaulipropPauliKind_t paulis[NUM_PAULIS_PER_Z_ROTATION] = {
    CUPAULIPROP_PAULI_Z, CUPAULIPROP_PAULI_Z};

  for (uint32_t i=0; i<NUM_ROTATIONS_PER_LAYER; i++) {
    HANDLE_CUPP_ERROR(cupaulipropCreatePauliRotationGateOperator(
      handle, ZZ_ROTATION_ANGLE, NUM_PAULIS_PER_Z_ROTATION, topology[i], paulis, &layer[i]));
  }

  return layer;
}


std::vector<cupaulipropQuantumOperator_t> getIBMHeavyHexIsingCircuit(
  cupaulipropHandle_t handle, double xRotationAngle, int numTrotterSteps
) {
  std::vector<cupaulipropQuantumOperator_t> circuit;

  for (int n=0; n<numTrotterSteps; n++) {
    auto layerX       = getXRotationLayer (handle, xRotationAngle);
    auto layerRedZZ   = getZZRotationLayer(handle, ZZ_QUBITS_RED);
    auto layerBlueZZ  = getZZRotationLayer(handle, ZZ_QUBITS_BLUE);
    auto layerGreenZZ = getZZRotationLayer(handle, ZZ_QUBITS_GREEN);

    circuit.insert(circuit.end(), layerX.begin(),       layerX.end());
    circuit.insert(circuit.end(), layerRedZZ.begin(),   layerRedZZ.end());
    circuit.insert(circuit.end(), layerBlueZZ.begin(),  layerBlueZZ.end());
    circuit.insert(circuit.end(), layerGreenZZ.begin(), layerGreenZZ.end());
  }

  return circuit;
}

This demonstration will simulate the circuit in the Heisenberg picture. As such, our input Pauli expansion will be initialized to an experimentally measured observable operator. We’ll later choose observable \(Z_{62}\), but we here define a function which encodes any Pauli string as a sequence of “packed integers” as accepted by the cuPauliProp API.

// ========================================================================
// Observable preparation
// ========================================================================

// This demo simulates the IBM circuit via back-propagating the measurement
// observable through the adjoint circuit. As such, we encode the measured
// observable into our initial Pauli expansion, in the format recognised by
// cuPauliProp. Pauli strings are represented with "packed integers" wherein
// every bit encodes a Pauli operator upon a corresponding qubit. We maintain
// two masks which separately encode the position of X and Z Pauli operators,
// indicated by a set bit at the qubit index, with a common set bit encoding
// a Y Pauli operator. Simulating more qubits than exist bits in the packed
// integer type (64) requires using multiple packed integers for each X and Z
// mask. We store a Pauli string's constituent X and Z masks contiguously in
// a single array, where the final mask of each string is padded with zero
// bits to be an integer multiple of the packed integer size (64 bits).

// The below function accepts a single Pauli string (i.e. a tensor product of
// the given Pauli operators at the specified qubit indices) and returns the
// sequence of packed integers which encode it as per the cuPauliProp API;
// this sequence can be copied directly to the GPU buffer of a Pauli expansion.

std::vector<cupaulipropPackedIntegerType_t> getPauliStringAsPackedIntegers(
  std::vector<cupaulipropPauliKind_t> paulis,
  std::vector<uint32_t> qubits
) {
  assert(paulis.size() == qubits.size());
  assert(*std::max_element(qubits.begin(), qubits.end()) < NUM_CIRCUIT_QUBITS);

  int32_t numPackedInts;
  HANDLE_CUPP_ERROR(cupaulipropGetNumPackedIntegers(NUM_CIRCUIT_QUBITS, &numPackedInts));

  // A single Pauli string is composed of separate X and Z masks, one after the other
  std::vector<cupaulipropPackedIntegerType_t> out(numPackedInts * 2, 0);
  auto xPtr = &out[0];
  auto zPtr = &out[numPackedInts];

  // Process one input (pauli, qubit) pair at a time
  for (auto i=0; i<qubits.size(); i++) {

    // The qubit corresponds to a specific bit of a specific packed integer
    auto numBitsPerPackedInt = 8 * sizeof(cupaulipropPackedIntegerType_t);
    auto intInd = qubits[i] / numBitsPerPackedInt;
    auto bitInd = qubits[i] % numBitsPerPackedInt;

    // Overwrite a bit of either the X or Z masks (or both when pauli==Y)
    if (paulis[i] == CUPAULIPROP_PAULI_X || paulis[i] == CUPAULIPROP_PAULI_Y)
      xPtr[intInd] = xPtr[intInd] | (1ULL << bitInd);
    if (paulis[i] == CUPAULIPROP_PAULI_Z || paulis[i] == CUPAULIPROP_PAULI_Y)
      zPtr[intInd] = zPtr[intInd] | (1ULL << bitInd);
  }

  return out;
}

With all convenience functions defined, we are ready to begin the main control flow.

// ========================================================================
// Main
// ========================================================================

// Simulation of the IBM utility experiment proceeds as follows. We set up the
// cuPauliProp library and a CUDA stream, then proceed to creating two
// Pauli expansions (since the API is out-of-place, as elaborated upon below).
// One expansion is initialised to the measured observable of the IBM circuit.
// We prepare workspace memory, fix truncation hyperparameters, then create
// the circuit as a list of cuPauliProp operators. We process the circuit in
// reverse, adjointing each operation, applied upon a newly prepared view of
// the input expansion, each time checking our dynamically growing memory costs
// have not exceeded our budgets. Thereafter we compute the overlap between the
// final back-propagated observable and the experimental initial state (the all-
// zero state), producing an estimate of the experimental expectation value.
// Finally, we free all allocated memory like good citizens.


int main(int argc, char** argv) {
  std::cout << "cuPauliProp IBM Heavy-hex Ising Example" << std::endl;
  std::cout << "=======================================" << std::endl << std::endl;

Our first chore is to set up the cuPauliProp library. For simplicity, we choose not to attach any stream, and therefore use the default stream.

  // ========================================================================
  // Library setup
  // ========================================================================

  int deviceId = 0;
  HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));

  // cuPauliProp operations accept a cudaStream_t argument for asynchronous usage
  // We use the default stream (0) in this example
  cudaStream_t stream = 0;
  cupaulipropHandle_t handle;
  HANDLE_CUPP_ERROR(cupaulipropCreate(&handle));

We here either use our pre-decided memory budget, or consult the available device memory and distribute it between the needed data structures. In total, we will reserve memory for two Pauli expansions (each containing a separate Pauli string and coefficient buffer), and a workspace buffer. The needed workspace buffer size is informed by both the capacity of the expansions and the API operations to be later performed. While it can be precomputed using the API “prepare” functions, such as cupaulipropPauliExpansionViewPrepareOperatorApplication(), we here merely reserve as much memory for the workspace as needed by a Pauli expansion, for simplicity.

  // ========================================================================
  // Decide memory usage
  // ========================================================================

  // As outlined in the 'Memory usage' section above, we here either uniformly
  // allocate all (or a high percentage of) available memory between the needed
  // memory buffers, or use the pre-decided fixed values. This demo will create
  // a total of two Pauli expansions (each of which accepts two separate buffers
  // to store Pauli strings and their corresponding coefficients; these have
  // different sizes) and one workspace, hence we arrange for an allocation of
  // five buffers in total.

  size_t expansionPauliMem;
  size_t expansionCoefMem;
  size_t workspaceMem;
  size_t totalUsedMem;

  if (USE_MAX_VRAM) {

    // Find usable device memory
    size_t totalFreeMem, totalGlobalMem;
    HANDLE_CUDA_ERROR(cudaMemGetInfo(&totalFreeMem, &totalGlobalMem));
    size_t totalUsableMem = static_cast<size_t>(totalFreeMem * MAX_VRAM_PERCENT/100);

    // Divide it between the three instances (two expansions, one workspace)
    size_t instanceMem = totalUsableMem / 3;

    // Determine the ideal ratio between an expansion's Pauli and coef buffers
    int32_t numPackedInts;
    HANDLE_CUPP_ERROR(cupaulipropGetNumPackedIntegers(NUM_CIRCUIT_QUBITS, &numPackedInts));
    size_t pauliMemPerTerm = 2 * numPackedInts * sizeof(cupaulipropPackedIntegerType_t);
    size_t coefMemPerTerm = sizeof(double);
    size_t totalMemPerTerm = pauliMemPerTerm + coefMemPerTerm;

    expansionPauliMem = (instanceMem * pauliMemPerTerm) / totalMemPerTerm;
    expansionCoefMem  = (instanceMem * coefMemPerTerm ) / totalMemPerTerm;
    workspaceMem      = instanceMem;

    totalUsedMem = 2*expansionPauliMem + 2*expansionCoefMem + workspaceMem;
    std::cout << "Dedicated memory: " << MAX_VRAM_PERCENT << "% of " << totalFreeMem;
    std::cout << " B free = " << totalUsedMem << " B, divided into..." << std::endl;

  } else {

    // Use pre-decided buffer sizes
    expansionPauliMem = FIXED_EXPANSION_PAULI_MEM;
    expansionCoefMem  = FIXED_EXPANSION_COEF_MEM;
    workspaceMem      = FIXED_WORKSPACE_MEM;

    totalUsedMem = 2*expansionPauliMem + 2*expansionCoefMem + workspaceMem;
    std::cout << "Dedicated memory: " << totalUsedMem << " B = 60 MiB, divided into..." << std::endl;
  }

  std::cout << "  expansion Pauli buffer: " << expansionPauliMem << " B" << std::endl;
  std::cout << "  expansion coef buffer:  " << expansionCoefMem << " B" << std::endl;
  std::cout << "  workspace buffer:       " << workspaceMem << " B\n" << std::endl;

It is our responsibility to allocate and initialise the device memory before passing it to the cuPauliProp Pauli expansion constructor. We create an “input” Pauli expansion initialised to the single Pauli string \(Z_{62}\), and an “output” Pauli expansion which is initially empty. Notice that we indicate to cupaulipropCreatePauliExpansion() that the pre-initialised input strings (of which there is one) happen to be sorted and unique, which cuPauliProp can internally later leverage for potential speedup.

  // ========================================================================
  // Pauli expansion preparation
  // ========================================================================

  // Create buffers for two Pauli expansions, which will serve as 'input' and
  // 'output' to the out-of-place cuPauliProp API. Note that the capacities of
  // these buffers constrain the maximum number of Pauli strings maintained
  // during simulation, and ergo inform the accuracy of the simulation. The
  // sufficient buffer sizes are specific to the simulated system, and we here
  // choose a surprisingly small capacity as admitted by the studied circuit.

  void * d_inExpansionPauliBuffer;
  void * d_outExpansionPauliBuffer;
  void * d_inExpansionCoefBuffer;
  void * d_outExpansionCoefBuffer;
  HANDLE_CUDA_ERROR(cudaMalloc(&d_inExpansionPauliBuffer,  expansionPauliMem));
  HANDLE_CUDA_ERROR(cudaMalloc(&d_inExpansionCoefBuffer,   expansionCoefMem));
  HANDLE_CUDA_ERROR(cudaMalloc(&d_outExpansionPauliBuffer, expansionPauliMem));
  HANDLE_CUDA_ERROR(cudaMalloc(&d_outExpansionCoefBuffer,  expansionCoefMem));

  // Prepare the X and Z masks which encode the experimental observable Z_62,
  // which has a coefficient of unity, as seen in Figure 4. b) of the IBM work.
  std::cout << "Observable: Z_62\n" << std::endl;
  int64_t numObservableTerms = 1;
  double observableCoef = 1.0;
  std::vector<cupaulipropPauliKind_t> observablePaulis = {CUPAULIPROP_PAULI_Z};
  std::vector<uint32_t> observableQubits = {62};

  // Overwrite the 'input' Pauli expansion buffers with the observable data
  auto observablePackedInts = getPauliStringAsPackedIntegers(observablePaulis, observableQubits);
  size_t numObservableBytes = observablePackedInts.size() * sizeof(observablePackedInts[0]);
  HANDLE_CUDA_ERROR(cudaMemcpy(
    d_inExpansionPauliBuffer, observablePackedInts.data(), numObservableBytes, cudaMemcpyHostToDevice));
  HANDLE_CUDA_ERROR(cudaMemcpy(
    d_inExpansionCoefBuffer, &observableCoef, sizeof(observableCoef), cudaMemcpyHostToDevice));

  // Create two Pauli expansions, which will serve as 'input' and 'output' to the API.
  // Because we begin from a real observable coefficient, and our circuit is completely
  // positive and trace preserving, it is sufficient to use strictly real coefficients
  // in our expansions, informing dataType below. We indicate that the single prepared
  // term in the input expansion is technically unique, and the terms sorted, which
  // permits cuPauliProp to use automatic optimisations during simulation.

  cupaulipropPauliExpansion_t inExpansion;
  cupaulipropPauliExpansion_t outExpansion;

  cupaulipropSortOrder_t sortOrder = CUPAULIPROP_SORT_ORDER_NONE;
  int32_t hasDuplicates = 0;  // isUnique = !hasDuplicates
  cudaDataType_t dataType = CUDA_R_64F;

  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion( // init to above
    handle, NUM_CIRCUIT_QUBITS,
    d_inExpansionPauliBuffer, expansionPauliMem,
    d_inExpansionCoefBuffer,  expansionCoefMem,
    dataType, numObservableTerms, sortOrder, hasDuplicates,
    &inExpansion));
  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion( // init to empty
    handle, NUM_CIRCUIT_QUBITS,
    d_outExpansionPauliBuffer, expansionPauliMem,
    d_outExpansionCoefBuffer,  expansionCoefMem,
    dataType, 0, CUPAULIPROP_SORT_ORDER_NONE, 0,
    &outExpansion));

We must likewise allocate the device memory for the workspace ourselves.

  // ========================================================================
  // Workspace preparation
  // ========================================================================

  // Some API functions require additional workspace memory which we bind to a
  // workspace descriptor. Ordinarily we use the 'Prepare' functions to precisely
  // bound upfront the needed workspace memory, but in this simple demo, we
  // instead use a workspace memory which we prior know to be sufficient.

  // Create a workspace
  cupaulipropWorkspaceDescriptor_t workspace;
  HANDLE_CUPP_ERROR(cupaulipropCreateWorkspaceDescriptor(handle, &workspace));

  void* d_workspaceBuffer;
  HANDLE_CUDA_ERROR(cudaMalloc(&d_workspaceBuffer, workspaceMem));
  HANDLE_CUPP_ERROR(cupaulipropWorkspaceSetMemory(
    handle, workspace,
    CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
    d_workspaceBuffer, workspaceMem));

  // Note that the 'prepare' functions which check the required workspace memory
  // will detach this buffer, requiring we re-call SetMemory above. We can avoid
  // this repeated re-attachment by use of a second, bufferless workspace
  // descriptor which we pass to the 'prepare' functions in lieu of this one.

During subsequent Heisenberg evolution of the observable operator \(Z_{62}\), the number of terms in the Pauli expansion will grow exponentially. To mitigate this growth and keep simulation tractable through the entire circuit, we employ truncation strategies. For the simulated system, it is sufficient to perform truncation after every \(10\) gates, wherein we discard all Pauli strings whose coefficient magnitudes are smaller than or equal to \(10^{-4}\), or whose Pauli “weight” (the number of non-identity Paulis therein) exceeds \(8\). This strategy keeps total memory costs below 60 MiB (and so within the capacities of our Pauli expansion and workspace buffers), while ultimately producing an expectation value within the error bars of the error-mitigated experimental value.

  // ========================================================================
  // Truncation parameter preparation
  // ========================================================================

  // The Pauli propagation simulation technique has memory and runtime costs which
  // (for generic circuits) grow exponentially with the circuit length, which we
  // curtail through "truncation": dynamic discarding of Pauli strings in our
  // expansion which are predicted to contribute negligibly to the final output
  // expectation value. This file demonstrates simultaneous usage of two truncation
  // techniques: discarding of Pauli strings with an absolute coefficient less than
  // 0.0001, or a "Pauli weight" (the number of non-identity operators in the string)
  // exceeding eight. For example, given a Pauli expansion containing:
  //    0.1 XYZXYZII + 1E-5 ZIIIIIII + 0.2 XXXXYYYY,
  // our truncation parameters below would see the latter two strings discarded due
  // to coefficient and weight truncation respectively.

  cupaulipropCoefficientTruncationParams_t coefTruncParams;
  coefTruncParams.cutoff = 1E-4;

  cupaulipropPauliWeightTruncationParams_t weightTruncParams;
  weightTruncParams.cutoff = 8;

  const uint32_t numTruncStrats = 2;
  cupaulipropTruncationStrategy_t truncStrats[] = {
    {
      CUPAULIPROP_TRUNCATION_STRATEGY_COEFFICIENT_BASED,
      &coefTruncParams
    },
    {
      CUPAULIPROP_TRUNCATION_STRATEGY_PAULI_WEIGHT_BASED,
      &weightTruncParams
    }
  };

  // It is not necessary to perform truncation after every gate, since the
  // Pauli expansion size may not have grown substantially, and attempting
  // to truncate may incur superfluous memory enumeration costs. In this
  // demo, we choose to truncate only after every tenth applied gate. Note
  // deferring truncation requires additional expansion memory; choosing to
  // truncate after every gate shrinks this demo's costs to 20 MiB total.
  const int numGatesBetweenTruncations = 10;

  std::cout << "Coefficient truncation threshold:  " << coefTruncParams.cutoff << std::endl;
  std::cout << "Pauli weight truncation threshold: " << weightTruncParams.cutoff << std::endl;
  std::cout << "Truncation performed after every:  " << numGatesBetweenTruncations << " gates\n" << std::endl;

With the stage set, we are ready to begin simulation! We fix the angle of all X-rotation gates in our circuit to \(\theta_X = \pi/4\), and use \(20\) Trotter repetitions. Our simulated system therefore corresponds to the antepenultimate experimental datum in Figure 4 b of Nature volume 618, pages 500–505 (2023). Let operator \(\hat{U}\) represent the action of this full, forward experimental circuit. The original experimentalists performed \(\hat{U}\) upon the fiducial zero-state \(|0\rangle^{\otimes 127}\), and estimated the expectation value of observable \(Z_{62}\) through repeated sampling.

In contrast, we will here work in the Heisenberg picture, and simulate the adjoint (ergo reversed) circuit upon the observable operator \(Z_{62}\). For each gate, we create a view of the entire input Pauli expansion; we confirm that the output Pauli expansion has sufficient capacity to store the (worst case) result of applying the gate upon the view; we reattach the workspace buffer which is detached by that query; we apply the gate and overwrite the output Pauli expansion; and finally, we swap the input and output Pauli expansions so that the updated expansion becomes the next input.

507  // ========================================================================
508  // Back-propagation of the observable through the circuit
509  // ========================================================================
510
511  // We now simulate the observable operator being back-propagated through the
512  // adjoint circuit, mapping the input expansion (initialised to Z_62) to a
513  // final output expansion containing many weighted Pauli strings. We use the
514  // heavy-hex fixed-angle Ising circuit with 20 total repetitions, fixing the
515  // angle of the X rotation gates to PI/4. Our simulation therefore corresponds
516  // to the middle datum of Fig. 4 b) of the IBM manuscript, for which MPS and
517  // isoTNS siulation techniques showed the greatest divergence from experiment.
518
519  double xRotationAngle = PI / 4.;
520  int numTrotterSteps = 20;
521  auto circuit = getIBMHeavyHexIsingCircuit(handle, xRotationAngle, numTrotterSteps);
522
523  std::cout << "Circuit: 127 qubit IBM heavy-hex Ising circuit, with..." << std::endl;
524  std::cout << "  Trotter steps: " << numTrotterSteps << std::endl;
525  std::cout << "  Total gates:   " << circuit.size() << std::endl;
526  std::cout << "  Rx angle:      " << xRotationAngle << " (i.e. PI/4)\n" << std::endl;
527
528  // Constrain that every intermediate output expansion contains unique Pauli
529  // strings (forbidding duplicates), but permit the retained strings to be
530  // unsorted. This combination gives cuPauliProp the best chance of automatically
531  // selecting optimal internal functions and postconditions for the simulation.
532  uint32_t adjoint = true;
533  sortOrder = CUPAULIPROP_SORT_ORDER_NONE;  // reuse variable from above
534  uint32_t keepDuplicates = false;
535
536  std::cout << "Imposed postconditions:" << std::endl;
537  if (sortOrder != CUPAULIPROP_SORT_ORDER_NONE) {
538    std::cout << "  Pauli strings will be sorted." << std::endl;
539  }
540  if (!keepDuplicates) {
541    std::cout << "  Pauli strings will be unique." << std::endl;
542  }
543  if (keepDuplicates && sortOrder == CUPAULIPROP_SORT_ORDER_NONE) {
544    std::cout << "No postconditions imposed on Pauli strings." << std::endl;
545  }
546  std::cout << std::endl;
547
548  // Begin timing before any gates are applied
549  auto startTime = std::chrono::high_resolution_clock::now();
550  int64_t maxNumTerms = 0;
551
552  // Iterate the circuit in reverse to effect the adjoint of the total circuit
553  for (int gateInd=circuit.size()-1; gateInd >= 0; --gateInd) {
554    cupaulipropQuantumOperator_t gate = circuit[gateInd];
555
556    // Create a view of the current input expansion, selecting all currently
557    // contained terms. For very large systems, we may have alternatively
558    // chosen a smaller view of the partial state to work around memory limits.
559    cupaulipropPauliExpansionView_t inView;
560    int64_t numExpansionTerms;
561    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &numExpansionTerms));
562    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(
563      handle, inExpansion, 0, numExpansionTerms, &inView));
564
565    // Track the intermediate expansion size, for our curiosity
566    if (numExpansionTerms > maxNumTerms)
567      maxNumTerms = numExpansionTerms;
568
569    // Choose whether or not to perform truncations after this gate
570    int numPassedTruncStrats = (gateInd % numGatesBetweenTruncations == 0)? numTruncStrats : 0;
571
572    // Check the expansion and workspace memories needed to apply the current gate
573    int64_t reqExpansionPauliMem;
574    int64_t reqExpansionCoefMem;
575    int64_t reqWorkspaceMem;
576    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareOperatorApplication(
577      handle, inView, gate, sortOrder, keepDuplicates,
578      numPassedTruncStrats, numPassedTruncStrats > 0 ? truncStrats : nullptr,
579      workspaceMem,
580      &reqExpansionPauliMem, &reqExpansionCoefMem, workspace));
581    HANDLE_CUPP_ERROR(cupaulipropWorkspaceGetMemorySize(
582      handle, workspace,
583      CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
584      &reqWorkspaceMem));
585
586    // Verify that our existing buffers and workspace have sufficient memory
587    assert(reqExpansionPauliMem <= expansionPauliMem);
588    assert(reqExpansionCoefMem  <= expansionCoefMem);
589    assert(reqWorkspaceMem      <= workspaceMem);
590
591    // Beware that cupaulipropPauliExpansionViewPrepareOperatorApplication() above
592    // detaches the memory buffer from the workspace, which we here re-attach.
593    HANDLE_CUPP_ERROR(cupaulipropWorkspaceSetMemory(
594      handle, workspace,
595      CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
596      d_workspaceBuffer, workspaceMem));
597
598    // Apply the gate upon the prepared view of the input expansion, evolving the
599    // Pauli strings pointed to within, truncating the result. The input expansion
600    // is unchanged while the output expansion is entirely overwritten.
601    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeOperatorApplication(
602      handle, inView, outExpansion, gate,
603      adjoint, sortOrder, keepDuplicates,
604      numPassedTruncStrats, numPassedTruncStrats > 0 ? truncStrats : nullptr,
605      workspace, stream));
606
607    // Free the temporary view since it points to the old input expansion, whereas
608    // we will subsequently treat the modified output expansion as the next input
609    HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(inView));
610
611    // Treat outExpansion as the input in the next gate application
612    std::swap(inExpansion, outExpansion);
613  }
614
615  // Restore outExpansion to being the final output for clarity
616  std::swap(inExpansion, outExpansion);

Our desired output quantity is the expectation value of \(Z_{62}\), as given by

\[\langle Z_{62} \rangle = \text{Tr}\left( \hat{U} |0\rangle\langle 0| \hat{U}^\dagger \; \hat{Z}_{62} \right).\]

This is expressible in terms of our final output Pauli expansion \(s_{\text{out}}\) (the Heisenberg-evolved observable) as

\[\langle Z_{62} \rangle = \text{Tr}\left( |0\rangle\langle 0 | \; \hat{U}^\dagger \hat{Z}_{62} \hat{U} \right) = \text{Tr}\left( |0\rangle\langle 0 | \; s_{\text{out}} \right).\]

Therefore we calculate the trace of our output Pauli expansion with the zero state, producing our estimate to the experimental expectation value of \(Z_{62}\).

620  // ========================================================================
621  // Evaluation of the expectation value of the observable
622  // ========================================================================
623
624  // The output expansion is now a proxy for the observable back-propagated
625  // through to the front of the circuit (though having discarded components
626  // which negligibly influence the subsequent overlap). The expectation value
627  // of the IBM experiment is the overlap of the output expansion with the
628  // zero state, i.e. Tr(outExpansion * |0><0|), as we now compute.
629
630  // Obtain a view of the full output expansion (we'll free it in 'Clean up')
631  cupaulipropPauliExpansionView_t outView;
632  int64_t numOutTerms;
633  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, outExpansion, &numOutTerms));
634  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(
635    handle, outExpansion, 0, numOutTerms, &outView));
636
637  // Check that the existing workspace memory is sufficient to compute the trace 
638  int64_t reqWorkspaceMem;
639  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareTraceWithZeroState(
640    handle, outView, workspaceMem, workspace));
641  HANDLE_CUPP_ERROR(cupaulipropWorkspaceGetMemorySize(
642    handle, workspace,
643    CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
644    &reqWorkspaceMem));
645  assert(reqWorkspaceMem <= workspaceMem);
646
647  // Beware that we must now reattach the buffer to the workspace
648  HANDLE_CUPP_ERROR(cupaulipropWorkspaceSetMemory(
649    handle, workspace,
650    CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
651    d_workspaceBuffer, workspaceMem));
652
653  // Compute the trace; the main and final output of this simulation!
654  double expecSignificand;
655  double expecExponent;
656  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeTraceWithZeroState(
657    handle, outView, &expecSignificand, &expecExponent, workspace, stream));
658  double expec = expecSignificand * std::pow(2.0, expecExponent);
659
660  // End timing after trace is evaluated
661  auto endTime = std::chrono::high_resolution_clock::now();
662  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime);
663  auto durationSecs = (duration.count() / 1e6);
664
665  std::cout << "Expectation value:       " << expec << std::endl;
666  std::cout << "Final number of terms:   " << numOutTerms << std::endl;
667  std::cout << "Maximum number of terms: " << maxNumTerms << std::endl;
668  std::cout << "Runtime:                 " << durationSecs << " seconds\n" << std::endl;

Finally, we clean up resources and free all device memory reserved during our simulation. This includes any outstanding views, all quantum operators, all expansions and their underlying buffers, and similarly for the workspace. Our final step is to finalise the cuPauliProp library, concluding simulation.

672  // ========================================================================
673  // Clean up
674  // ========================================================================
675
676  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(outView));
677 
678  for (auto & gate : circuit) {
679    HANDLE_CUPP_ERROR(cupaulipropDestroyOperator(gate));
680  }
681
682  HANDLE_CUPP_ERROR(cupaulipropDestroyWorkspaceDescriptor(workspace));
683  HANDLE_CUDA_ERROR(cudaFree(d_workspaceBuffer));
684
685  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansion(inExpansion));
686  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansion(outExpansion));
687
688  HANDLE_CUDA_ERROR(cudaFree(d_inExpansionPauliBuffer));
689  HANDLE_CUDA_ERROR(cudaFree(d_outExpansionPauliBuffer));
690  HANDLE_CUDA_ERROR(cudaFree(d_inExpansionCoefBuffer));
691  HANDLE_CUDA_ERROR(cudaFree(d_outExpansionCoefBuffer));
692
693  HANDLE_CUPP_ERROR(cupaulipropDestroy(handle));
694
695  return EXIT_SUCCESS;
696}

Kicked Ising backward differentiation example#

This example demonstrates the cuPauliProp facilities for calculating gradients of expectation values with respect to circuit parameters via reverse-mode automatic differentiation (AD). For simplicity, we employ the same circuit as in the above example, parameterised by the \(X\) and \(ZZ\) rotation angles which we label \(\theta_X\) and \(\theta_Z\) respectively. Where the above example computed the expectation value \(\langle Z_{62} \rangle\) for \((\theta_X,\theta_Z) = (\pi/4, -\pi/2)\), we will here compute the gradient of this value at the same coordinate. That is, we seek

\[\nabla_{\theta_X, \theta_Z} \langle Z_{62} \rangle = ( \frac{\partial}{\partial \theta_X} \langle Z_{62} \rangle, \frac{\partial}{\partial \theta_Z} \langle Z_{62} \rangle ).\]

We will skip all code which is similar to the above example, and focus instead on the additional or particular steps involved in use of the AD API. The full demonstration is available in kicked_ising_backward_diff_example.cpp.

Our first algorithmic step is to simulate the non-differentiated circuit and obtain the output expectation value, in an identical fashion to the example above. We evaluate the circuit in the Heisenberg picture, beginning from the \(Z_{62}\) observable and applying the adjoint of the full circuit. Somewhat confusingly, this is the “forward pass” of AD, which just so happens to be a backward enumeration of the experimental circuit. Below, inExpansion is initialised to observable operator \(Z_{62}\).

 1  // ========================================================================
 2  // Forward pass: back-propagate observable through adjoint circuit
 3  // ========================================================================
 4  auto startTime = std::chrono::high_resolution_clock::now();
 5  int64_t maxNumTerms = 0;
 6
 7  for (int i = static_cast<int>(circuit.size()) - 1; i >= 0; --i) {
 8    int64_t inTerms = 0, reqXZ = 0, reqCoef = 0, reqWs = 0;
 9    cupaulipropPauliExpansionView_t inView{};
10    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &inTerms));
11    if (inTerms > maxNumTerms) maxNumTerms = inTerms;
12    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(handle, inExpansion, 0, inTerms, &inView));
13    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareOperatorApplication(
14        handle, inView, circuit[i].op, sortOrder, 0, 2, truncStrats, WORKSPACE_MEM, &reqXZ, &reqCoef, workspace));
15    HANDLE_CUPP_ERROR(cupaulipropWorkspaceGetMemorySize(
16        handle, workspace, CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH, &reqWs));
17    assert(reqXZ <= static_cast<int64_t>(EXPANSION_PAULI_MEM));
18    assert(reqCoef <= static_cast<int64_t>(EXPANSION_COEF_MEM));
19    assert(reqWs <= static_cast<int64_t>(WORKSPACE_MEM));
20    reattachWorkspace(handle, workspace, dWorkspace);
21    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeOperatorApplication(
22        handle, inView, outExpansion, circuit[i].op, 1, sortOrder, 0, 2, truncStrats, workspace, stream));
23    HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(inView));
24    std::swap(inExpansion, outExpansion);
25  }

We then obtain the expectation value \(\langle Z_{62} \rangle\) at \((\theta_X,\theta_Z) = (\pi/4, -\pi/2)\). Notice below that the value is given to us as a separate significand and exponent, which we will here denote as \(s\) and \(p\) respectively, such that \(\langle Z_{62} \rangle = s \times 2^p\).

 1  // ========================================================================
 2  // Trace evaluation (expectation value)
 3  // ========================================================================
 4  int64_t finalTerms = 0;
 5  cupaulipropPauliExpansionView_t finalView{};
 6  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &finalTerms));
 7  if (finalTerms > maxNumTerms) maxNumTerms = finalTerms;
 8  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(handle, inExpansion, 0, finalTerms, &finalView));
 9  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareTraceWithZeroState(handle, finalView, WORKSPACE_MEM, workspace));
10  reattachWorkspace(handle, workspace, dWorkspace);
11  double s = 0.0, p = 0.0;
12  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeTraceWithZeroState(handle, finalView, &s, &p, workspace, stream));
13  const double expec = s * std::exp2(p);

We now proceed to evaluating the gradient of this quantity. While above we obtained the result of applying the full circuit, computing the gradient makes use of each of the intermediate states reached by evolving the observable through only a partial circuit. We did obtain these intermediate states in turn but did not store them, since doing so would necessitate as many Pauli expansions as there are operators in the circuit. Instead, we leverage that the kicked Ising circuit is unitary and therefore invertible, enabling us to re-obtain intermediate states by applying the inverse of the last applied operator. This trick is an instance of “tape replay” and means that, in addition to the input and output expansions used above, we need only maintain two additional Pauli expansions during gradient evaluation, independent of the circuit length. We prepare these initially empty Pauli expansions just like those used above.

 1  cupaulipropPauliExpansion_t inExpansion{}, outExpansion{}, cot0Expansion{}, cot1Expansion{};
 2  const auto sortOrder = CUPAULIPROP_SORT_ORDER_NONE;
 3  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion(
 4      handle, NUM_CIRCUIT_QUBITS, dInXZ, EXPANSION_PAULI_MEM, dInCoef, EXPANSION_COEF_MEM,
 5      CUDA_R_64F, 1, sortOrder, 0, &inExpansion));
 6  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion(
 7      handle, NUM_CIRCUIT_QUBITS, dOutXZ, EXPANSION_PAULI_MEM, dOutCoef, EXPANSION_COEF_MEM,
 8      CUDA_R_64F, 0, sortOrder, 0, &outExpansion));
 9  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion(
10      handle, NUM_CIRCUIT_QUBITS, dCot0XZ, EXPANSION_PAULI_MEM, dCot0Coef, EXPANSION_COEF_MEM,
11      CUDA_R_64F, 0, sortOrder, 0, &cot0Expansion));
12  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion(
13      handle, NUM_CIRCUIT_QUBITS, dCot1XZ, EXPANSION_PAULI_MEM, dCot1Coef, EXPANSION_COEF_MEM,
14      CUDA_R_64F, 0, sortOrder, 0, &cot1Expansion));

After the forward pass already seen, which computed the output expectation value \(\langle Z_{62} \rangle = s 2^p\), we are ready to proceed with backward differentiation. This involves computing the backward derivative of each API operation involved in evaluating \(\langle Z_{62} \rangle\), the last of which (and hence the first we now process) was the trace function, which output \(s\) and \(p\). We therefore “seed” our AD cotangent vector with the gradient of \(\langle Z_{62} \rangle\) (our “cost function”) with respect to these symbols.

\[\nabla_{s,p} \langle Z_{62} \rangle = ( \frac{\partial}{\partial s} \langle Z_{62} \rangle, \frac{\partial}{\partial p} \langle Z_{62} \rangle ) = ( 2^p, s 2^p \log_e(2) ).\]

And as always, it is prudent to use the prepare functions to ensure that our cotangent expansion has sufficient memory to store the output, which requires subsequently re-attaching the workspace buffer.

 1  // ========================================================================
 2  // Backward: seed cotangent from trace
 3  // ========================================================================
 4  const double cotS = std::exp2(p);
 5  const double cotP = s * std::exp2(p) * std::log(2.0);
 6  int64_t reqCotXZ = 0, reqCotCoef = 0;
 7  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareTraceWithZeroStateBackwardDiff(
 8      handle, finalView, WORKSPACE_MEM, &reqCotXZ, &reqCotCoef, workspace));
 9  assert(reqCotXZ <= static_cast<int64_t>(EXPANSION_PAULI_MEM));
10  assert(reqCotCoef <= static_cast<int64_t>(EXPANSION_COEF_MEM));
11  reattachWorkspace(handle, workspace, dWorkspace);
12  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeTraceWithZeroStateBackwardDiff(
13      handle, finalView, &cotS, &cotP, cot0Expansion, workspace, stream));
14  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(finalView));

This modifies cot0Expansion, our cotangent expansion. We now enumerate backward through the operators applied upon our original Pauli expansion, which happens to be the forward order of the experimental circuit, merely because we originally worked in the Heisenberg picture. Each processed operator will contribute to either the \(\theta_X\) or \(\theta_Z\) component of the gradient, which we accumulate separately.

1  // ========================================================================
2  // Backward pass: reverse-mode AD with tape replay
3  // ========================================================================
4  double gradX = 0.0, gradZZ = 0.0;
5  auto bwdStartTime = std::chrono::high_resolution_clock::now();
6
7  for (size_t i = 0; i < circuit.size(); ++i) {

For each operator, we apply its inverse (which for unitary operators is merely the adjoint) in order to restore the intermediate expansion from before that operator was applied (this is the tape replay).

 1    int64_t inTerms = 0;
 2    cupaulipropPauliExpansionView_t inView{};
 3    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &inTerms));
 4    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(handle, inExpansion, 0, inTerms, &inView));
 5    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareOperatorApplication(
 6        handle, inView, circuit[i].op, sortOrder, 0, 2, truncStrats, WORKSPACE_MEM, &reqCotXZ, &reqCotCoef, workspace));
 7    reattachWorkspace(handle, workspace, dWorkspace);
 8    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeOperatorApplication(
 9        handle, inView, outExpansion, circuit[i].op, 0, sortOrder, 0, 2, truncStrats, workspace, stream));
10    HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(inView));
11    std::swap(inExpansion, outExpansion);

We pass this intermediate expansion, along with our cotangent expansion, to the backward derivative function of the operator, obtaining a term of a derivative of \(\langle Z_{62} \rangle\).

 1    int64_t cotTerms = 0, reqXZ = 0, reqCoef = 0, reqWs = 0;
 2    cupaulipropPauliExpansionView_t viewIn{}, cotOutView{};
 3    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &inTerms));
 4    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(handle, inExpansion, 0, inTerms, &viewIn));
 5    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, cot0Expansion, &cotTerms));
 6    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(handle, cot0Expansion, 0, cotTerms, &cotOutView));
 7    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareOperatorApplicationBackwardDiff(
 8        handle, viewIn, cotOutView, circuit[i].op, sortOrder, 0, 2, truncStrats, WORKSPACE_MEM, &reqXZ, &reqCoef, workspace));
 9    HANDLE_CUPP_ERROR(cupaulipropWorkspaceGetMemorySize(
10        handle, workspace, CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH, &reqWs));
11    assert(reqXZ <= static_cast<int64_t>(EXPANSION_PAULI_MEM));
12    assert(reqCoef <= static_cast<int64_t>(EXPANSION_COEF_MEM));
13    assert(reqWs <= static_cast<int64_t>(WORKSPACE_MEM));
14    reattachWorkspace(handle, workspace, dWorkspace);
15
16    double gateGrad = 0.0;
17    HANDLE_CUPP_ERROR(cupaulipropQuantumOperatorAttachCotangentBuffer(
18        handle, circuit[i].op, &gateGrad, sizeof(double), CUDA_R_64F, CUPAULIPROP_MEMSPACE_HOST));
19    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeOperatorApplicationBackwardDiff(
20        handle, viewIn, cotOutView, cot1Expansion, circuit[i].op, 1, sortOrder, 0, 2, truncStrats, workspace, stream));
21    HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(viewIn));
22    HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(cotOutView));

If all our circuit operators were uniquely parameterised (e.g. \(\hat{U} = \hat{U}_2(\theta_2)\hat{U}_1(\theta_1)\)), this quantity would be an element of the gradient, corresponding to e.g.

\[\frac{\partial \hat{U}}{\partial \theta_2} = \frac{\partial\hat{U}_2}{\partial \theta_2} \hat{U}_1(\theta_1).\]

But because our \(\theta_X\) and \(\theta_Z\) parameters are repeated between gates (akin to \(\hat{U} = \hat{U}_2(\theta)\hat{U}_1(\theta)\)), the product-rule prescribes

\[\frac{\partial \hat{U}}{\partial \theta} = \frac{\partial\hat{U}_2}{\partial \theta} \hat{U}_1(\theta) + \hat{U}_2(\theta) \frac{\partial\hat{U}_1}{\partial \theta},\]

and so the output quantity (which is linear in \(\hat{U}\)) is one of the above summands, contributing to either the \(\theta_X\) or \(\theta_Z\) gradient element.

1    if (circuit[i].isX) gradX += gateGrad; else gradZZ += gateGrad;
2    std::swap(cot0Expansion, cot1Expansion);

After fully enumerating the circuit operators in this way, gradX and gradZZ contain the desired gradient.

 1  // ========================================================================
 2  // Results
 3  // ========================================================================
 4  std::cout << "Backward pass completed in " << bwdSecs << " seconds" << std::endl;
 5  std::cout << std::endl;
 6  std::cout << "Expectation value:       " << expec << std::endl;
 7  std::cout << "d<Z_62>/d(x_angle):      " << gradX << std::endl;
 8  std::cout << "d<Z_62>/d(zz_angle):     " << gradZZ << std::endl;
 9  std::cout << "Final number of terms:   " << finalTerms << std::endl;
10  std::cout << "Maximum number of terms: " << maxNumTerms << std::endl;
11  std::cout << "Total runtime:           " << totalSecs << " seconds" << std::endl;
12  std::cout << std::endl;

Useful tips#

  • For debugging, one can set the environment variable CUPAULIPROP_LOG_LEVEL=n. The level n = 0, 1, …, 5 corresponds to the logger level as described in the table below. The environment variable CUPAULIPROP_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout.

Level | Summary           | Long Description
------|-------------------|--------------------------------------------------
0     | Off               | Logging is disabled (default)
1     | Errors            | Only errors will be logged
2     | Performance Trace | API calls that launch CUDA kernels will log their parameters and important information
3     | Performance Hints | Hints that can potentially improve the application’s performance
4     | Heuristics Trace  | Provides general information about the library execution; may contain details about heuristic status
5     | API Trace         | API calls will log their parameters and important information
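For example, to enable performance hints and redirect the log before running the example (the binary name here matches the build commands above; the log file path is an arbitrary choice):

```shell
export CUPAULIPROP_LOG_LEVEL=3                  # level 3: performance hints
export CUPAULIPROP_LOG_FILE=/tmp/cupauliprop.log  # redirect output from stdout
# ./kicked_ising_example                        # then run the example as usual
```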