Examples#

In this section, we show a basic example on how to use the cuPauliProp API for an end-to-end quantum circuit simulation.

Building code#

Assuming your cuPauliprop header files are located in CUPAULIPROP_INCLUDE_DIR and the shared library is located in CUPAULIPROP_LIB_DIR, the example code can be built via the following command:

nvcc kicked_ising_example.cpp -I${CUPAULIPROP_INCLUDE_DIR} -L${CUPAULIPROP_LIB_DIR} -lcupauliprop -o kicked_ising_example

When building statically, use the following command:

nvcc kicked_ising_example.cpp -I${CUPAULIPROP_INCLUDE_DIR} ${CUPAULIPROP_LIB_DIR}/libcupauliprop_static.a -o kicked_ising_example

When using cuQuantum downloaded from the NVIDIA devzone website and extracted in CUQUANTUM_ROOT, the header files are located in ${CUQUANTUM_ROOT}/include and the shared library is located in ${CUQUANTUM_ROOT}/lib. For convenience, we also provide a Makefile and CMakeLists.txt file in the same folder as the example code.

Kicked Ising example#

In this example, we demonstrate use of the cuPauliProp API to perform an end-to-end simulation of IBM’s 127 qubit kicked Ising experiment, as presented in Nature volume 618, pages 500–505 (2023). We will specifically simulate a single datum of Figure 4 b, which is the expectation value of observable \(Z_{62}\) after a 20-repetition Trotter circuit. The full demonstration is available at kicked_ising_example.cpp.

Our example code will make use of the below helper macros which handle errors thrown by the CUDA or cuPauliProp APIs.

27// ========================================================================
28// CUDA and cuPauliProp error handling
29// ========================================================================
30
31#define HANDLE_CUPP_ERROR(x)                               \
32{                                                          \
33  const auto err = x;                                      \
34  if (err != CUPAULIPROP_STATUS_SUCCESS)                   \
35  {                                                        \
36    printf("cuPauliProp error in line %d\n", __LINE__);    \
37    fflush(stdout);                                        \
38    std::abort();                                          \
39  }                                                        \
40};
41
42
43#define HANDLE_CUDA_ERROR(x)                                \
44{                                                           \
45  const auto err = x;                                       \
46  if (err != cudaSuccess)                                   \
47  {                                                         \
48    const char * error = cudaGetErrorString(err);           \
49    printf("CUDA Error: %s in line %d\n", error, __LINE__); \
50    fflush(stdout);                                         \
51    std::abort();                                           \
52  }                                                         \
53};

We next decide the memory budget for our simulation. It will turn out that for our chosen combination of simulated circuit, output observable and truncation parameters, only around 60 MiB of total device memory is needed. We will demonstrate how to dedicate all available device memory for simulation however, which this example optionally performs when USE_MAX_VRAM = true.

57// ========================================================================
58// Memory usage
59// ========================================================================
60
61// Each Pauli expansion has two pre-allocated GPU buffers, storing packed
62// integers (which encode Pauli strings) and corresponding coefficients.
63// As much memory can be dedicated as your hardware allows, while the min-
64// imum required is specific and very sensitive to the simulated circuit,
65// studied observable, and the chosen truncation hyperparameters.
66// Some operations also require additional workspace memory which is also
67// ideally pre-allocated, and can be established using the API 'Prepare'
68// functions (e.g. cupaulipropPauliExpansionViewPrepareTraceWithZeroState).
69// In this demo, we dedicate either (a percentage of) the entirety of GPU
70// memory uniformly between the required memory buffers, or instead use a
71// fixed hardcoded amount which has been prior tested to be consistent with
72// our other simulation parameters (like truncations); these choices are
73// toggled via USE_MAX_VRAM below.
74
75// =true to use MAX_VRAM_PERCENT of VRAM, and =false to use fixed memories below
76bool USE_MAX_VRAM = false;
77double MAX_VRAM_PERCENT = 90; // 0-100%
78
79size_t FIXED_EXPANSION_PAULI_MEM = 16 * (1LLU << 20); // bytes = 16 MiB
80size_t FIXED_EXPANSION_COEF_MEM  =  4 * (1LLU << 20); // bytes = 4  MiB
81size_t FIXED_WORKSPACE_MEM       = 20 * (1LLU << 20); // bytes = 20 MiB

We here scaffold the quantum circuit we will subsequently instantiate. In short, it is a Trotter circuit encoding the dynamics of a kicked Ising system, with a heavy-hex topology matching the connectivity of IBM’s 127 qubit Eagle device.

 85// ========================================================================
 86// Circuit preparation (Trotterised Ising on IBM heavy hex topology)
 87// ========================================================================
 88
 89// This demo simulates the circuits experimentally executed by IBM in article
 90// 'Nature volume 618, pages 500–505 (2023)'. This is a circuit Trotterising
 91// the evolution operator of a 2D transverse-field Ising model, but where the
 92// prescribed ZZ rotations have a fixed angle of -pi/2, and where the X angles
 93// are arbitrarily set/swept; later, we will fix the X angle to be pi/4. The
 94// Hamiltonian ZZ interactions are confined to a heavy-hex topology, matching
 95// the connectivity of the IBM Eagle processor 'ibm_kyiv', as we fix below.
 96
 97const int NUM_CIRCUIT_QUBITS = 127;
 98const int NUM_ROTATIONS_PER_LAYER = 48;
 99const int NUM_PAULIS_PER_X_ROTATION = 1;
100const int NUM_PAULIS_PER_Z_ROTATION = 2;
101
102const double PI = 3.14159265358979323846;
103const double ZZ_ROTATION_ANGLE = - PI / 2.0;
104
105// Indices of ZZ-interacting qubits which undergo the first (red) Trotter round
106const int32_t ZZ_QUBITS_RED[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION] = {
107  {  2,   1},  { 33,  39}, { 59,  60}, { 66,  67}, { 72,  81}, {118, 119},
108  { 21,  20},  { 26,  25}, { 13,  12}, { 31,  32}, { 70,  74}, {122, 123},
109  { 96,  97},  { 57,  56}, { 63,  64}, {107, 108}, {103, 104}, { 46,  45},
110  { 28,  35},  {  7,   6}, { 79,  78}, {  5,   4}, {109, 114}, { 62,  61},
111  { 58,  71},  { 37,  52}, { 76,  77}, {  0,  14}, { 36,  51}, {106, 105},
112  { 73,  85},  { 88,  87}, { 68,  55}, {116, 115}, { 94,  95}, {100, 110},
113  { 17,  30},  { 92, 102}, { 50,  49}, { 83,  84}, { 48,  47}, { 98,  99},
114  {  8,   9},  {121, 120}, { 23,  24}, { 44,  43}, { 22,  15}, { 53,  41}
115};
116
117// Indices of ZZ-interacting qubits which undergo the second (blue) Trotter round
118const int32_t ZZ_QUBITS_BLUE[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION] = {
119  { 53,  60}, {123, 124}, { 21,  22}, { 11,  12}, { 67,  68}, {  2,   3},
120  { 66,  65}, {122, 121}, {110, 118}, {  6,   5}, { 94,  90}, { 28,  29},
121  { 14,  18}, { 63,  62}, {111, 104}, {100,  99}, { 45,  44}, {  4,  15},
122  { 20,  19}, { 57,  58}, { 77,  71}, { 76,  75}, { 26,  27}, { 16,   8},
123  { 35,  47}, { 31,  30}, { 48,  49}, { 69,  70}, {125, 126}, { 89,  74},
124  { 80,  79}, {116, 117}, {114, 113}, { 10,   9}, {106,  93}, {101, 102},
125  { 92,  83}, { 98,  91}, { 82,  81}, { 54,  64}, { 96, 109}, { 85,  84},
126  { 87,  86}, {108, 112}, { 34,  24}, { 42,  43}, { 40,  41}, { 39,  38}
127};
128
129// Indices of ZZ-interacting qubits which undergo the third (green) Trotter round
130const int32_t ZZ_QUBITS_GREEN[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION] = {
131  { 10,  11}, { 54,  45}, {111, 122}, { 64,  65}, { 60,  61}, {103, 102},
132  { 72,  62}, {  4,   3}, { 33,  20}, { 58,  59}, { 26,  16}, { 28,  27},
133  {  8,   7}, {104, 105}, { 73,  66}, { 87,  93}, { 85,  86}, { 55,  49},
134  { 68,  69}, { 89,  88}, { 80,  81}, {117, 118}, {101, 100}, {114, 115},
135  { 96,  95}, { 29,  30}, {106, 107}, { 83,  82}, { 91,  79}, {  0,   1},
136  { 56,  52}, { 90,  75}, {126, 112}, { 36,  32}, { 46,  47}, { 77,  78},
137  { 97,  98}, { 17,  12}, {119, 120}, { 22,  23}, { 24,  25}, { 43,  34},
138  { 42,  41}, { 40,  39}, { 37,  38}, {125, 124}, { 50,  51}, { 18,  19}
139};

We instantiate the circuit as a std::vector of cupaulipropQuantumOperator_t types. In this demonstration, all circuit gates are Pauli rotations of 1 or 2 qubits, chiefly

\[\begin{split}Rx(\theta_X) = \exp\left(-i\theta_X/2 \, \hat{X}\right), \\ Rzz(\theta_Z) = \exp\left(-i\theta_Z/2 \, \hat{Z}\otimes\hat{Z}\right).\end{split}\]

We fix \(\theta_Z\) to be \(-\pi/2\), and will later substitute a value for the X-rotation strength \(\theta_X\).

143// ========================================================================
144// Circuit construction
145// ========================================================================
146
147// Each 'step' of the Trotter circuit alternates a layer of single-qubit X
148// rotations on every qubit, then a sequence of two-qubit Z rotations on the
149// heavy-hex topology, upon qubit pairs in the red, blue and green lists 
150// above. Note that ZZ rotations about -pi/2 are actually Clifford, though
151// we still here treat them like a generic Pauli rotation. The functions
152// below construct a Trotter circuit with a variable number of steps.
153
154
155std::vector<cupaulipropQuantumOperator_t> getXRotationLayer(
156  cupaulipropHandle_t handle, double xRotationAngle
157) {  
158  std::vector<cupaulipropQuantumOperator_t> layer(NUM_CIRCUIT_QUBITS);
159
160  const cupaulipropPauliKind_t paulis[NUM_PAULIS_PER_X_ROTATION] = {CUPAULIPROP_PAULI_X};
161
162  for (int32_t i=0; i<NUM_CIRCUIT_QUBITS; i++) {
163    HANDLE_CUPP_ERROR(cupaulipropCreatePauliRotationGateOperator(
164      handle, xRotationAngle, NUM_PAULIS_PER_X_ROTATION, &i, paulis, &layer[i]));
165  }
166
167  return layer;
168}
169
170
171std::vector<cupaulipropQuantumOperator_t> getZZRotationLayer(
172  cupaulipropHandle_t handle,
173  const int32_t topology[NUM_ROTATIONS_PER_LAYER][NUM_PAULIS_PER_Z_ROTATION]
174) {
175  std::vector<cupaulipropQuantumOperator_t> layer(NUM_ROTATIONS_PER_LAYER);
176
177  const cupaulipropPauliKind_t paulis[NUM_PAULIS_PER_Z_ROTATION] = {
178    CUPAULIPROP_PAULI_Z, CUPAULIPROP_PAULI_Z};
179
180  for (uint32_t i=0; i<NUM_ROTATIONS_PER_LAYER; i++) {
181    HANDLE_CUPP_ERROR(cupaulipropCreatePauliRotationGateOperator(
182      handle, ZZ_ROTATION_ANGLE, NUM_PAULIS_PER_Z_ROTATION, topology[i], paulis, &layer[i]));
183  }
184
185  return layer;
186}
187
188
189std::vector<cupaulipropQuantumOperator_t> getIBMHeavyHexIsingCircuit(
190  cupaulipropHandle_t handle, double xRotationAngle, int numTrotterSteps
191) {
192  std::vector<cupaulipropQuantumOperator_t> circuit;
193
194  for (int n=0; n<numTrotterSteps; n++) {
195    auto layerX       = getXRotationLayer (handle, xRotationAngle);
196    auto layerRedZZ   = getZZRotationLayer(handle, ZZ_QUBITS_RED);
197    auto layerBlueZZ  = getZZRotationLayer(handle, ZZ_QUBITS_BLUE);
198    auto layerGreenZZ = getZZRotationLayer(handle, ZZ_QUBITS_GREEN);
199    
200    circuit.insert(circuit.end(), layerX.begin(),       layerX.end());
201    circuit.insert(circuit.end(), layerRedZZ.begin(),   layerRedZZ.end());
202    circuit.insert(circuit.end(), layerBlueZZ.begin(),  layerBlueZZ.end());
203    circuit.insert(circuit.end(), layerGreenZZ.begin(), layerGreenZZ.end());
204  }
205
206  return circuit;
207}

This demonstration will simulate the circuit in the Heisenberg picture. As such, our input Pauli expansion will be initialized to an experimentally measured observable operator. We’ll later choose observable \(Z_{62}\), but we here define a function which encodes any Pauli string as a sequence of “packed integers” as accepted by the cuPauliProp API.

211// ========================================================================
212// Observable preparation
213// ========================================================================
214
215// This demo simulates the IBM circuit via back-propagating the measurement
216// observable through the adjoint circuit. As such, we encode the measured
217// observable into our initial Pauli expansion, in the format recognised by
218// cuPauliProp. Pauli strings are represented with "packed integers" wherein
219// every bit encodes a Pauli operator upon a corresponding qubit. We maintain
220// two masks which separately encode the position of X and Z Pauli operators,
221// indicated by a set bit at the qubit index, with a common set bit encoding
222// a Y Pauli operator. Simulating more qubits than exist bits in the packed
223// integer type (64) requires using multiple packed integers for each X and Z
224// mask. We store a Pauli string's constituent X and Z masks contiguously in
225// a single array, where the final mask of each per-string is padded with zero
226// bits to be an integer multiple of the packed integer size (64 bits).
227
228// The below function accepts a single Pauli string (i.e. a tensor product of
229// the given Pauli operators at the specified qubit indices) and returns the
230// sequence of packed integers which encode it as per the cuPauliProp API;
231// this sequence can be copied directly to the GPU buffer of a Pauli expansion.
232
233std::vector<cupaulipropPackedIntegerType_t> getPauliStringAsPackedIntegers(
234  std::vector<cupaulipropPauliKind_t> paulis, 
235  std::vector<uint32_t> qubits
236) {
237  assert(paulis.size() == qubits.size());
238  assert(*std::max_element(qubits.begin(), qubits.end()) < NUM_CIRCUIT_QUBITS);
239
240  int32_t numPackedInts;
241  HANDLE_CUPP_ERROR(cupaulipropGetNumPackedIntegers(NUM_CIRCUIT_QUBITS, &numPackedInts));
242
243  // A single Pauli string is composed of separate X and Z masks, one after the other
244  std::vector<cupaulipropPackedIntegerType_t> out(numPackedInts * 2, 0);
245  auto xPtr = &out[0];
246  auto zPtr = &out[numPackedInts];
247
248  // Process one input (pauli, qubit) pair at a time
249  for (auto i=0; i<qubits.size(); i++) {
250
251    // The qubit corresponds to a specific bit of a specific packed integer
252    auto numBitsPerPackedInt = 8 * sizeof(cupaulipropPackedIntegerType_t);
253    auto intInd = qubits[i] / numBitsPerPackedInt;
254    auto bitInd = qubits[i] % numBitsPerPackedInt;
255
256    // Overwrite a bit of either the X or Z masks (or both when pauli==Y)
257    if (paulis[i] == CUPAULIPROP_PAULI_X || paulis[i] == CUPAULIPROP_PAULI_Y)
258      xPtr[intInd] = xPtr[intInd] | (1ULL << bitInd);
259    if (paulis[i] == CUPAULIPROP_PAULI_Z || paulis[i] == CUPAULIPROP_PAULI_Y)
260      zPtr[intInd] = zPtr[intInd] | (1ULL << bitInd);
261  }
262
263  return out;
264}

With all convenience functions defined, we are ready to begin the main control flow.

268// ========================================================================
269// Main
270// ========================================================================
271
272// Simulation of the IBM utility experiment proceeds as follows. We setup the
273// cuPauliProp library, attaching a new stream, then proceed to creating two
274// Pauli expansions (since the API is out-of-place, as elaborated upon below).
275// One expansion is initialised to the measured observable of the IBM circuit.
276// We prepare workspace memory, fix truncation hyperparameters, then create
277// the circuit as a list of cuPauliProp operators. We process the circuit in
278// reverse, adjointing each operation, applied upon a newly prepared view of
279// the input expansion, each time checking our dynamically growing memory costs
280// have not exceeded our budgets. Thereafter we compute the overlap between the
281// final back-propagated observable and the experimental initial state (the all-
282// zero state), producing an estimate of the experimental expectation value.
283// Finally, we free all allocated memory like good citizens.
284
285
286int main(int argc, char** argv) {
287  std::cout << "cuPauliProp IBM Heavy-hex Ising Example" << std::endl;
288  std::cout << "=======================================" << std::endl << std::endl;

Our first chore is to setup the cuPauliProp library. For simplicity, we choose not to attach any stream, and therefore use the default stream, such that all cuPauliProp API functions are synchronous.

292  // ========================================================================
293  // Library setup
294  // ========================================================================
295
296  int deviceId = 0;
297  HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
298
299  // cuPauliProp operations accept a cudaStream_t argument for asynchronous usage
300  // We use the default stream (0) in this example
301  cudaStream_t stream = 0;
302  cupaulipropHandle_t handle;
303  HANDLE_CUPP_ERROR(cupaulipropCreate(&handle));

We here choose to use either our pre-decided memory budget, or optionally consult the available device memory and distribute it between the needed data structures. In total, we will reserve memory for two Pauli expansions (each containing a separate Pauli string and coefficient buffer), and a workspace buffer. The needed workspace buffer size is informed by both the capacity of the expansions and the API operations to be later performed. While it can be precomputed using the API “prepare” functions, such as cupaulipropPauliExpansionViewPrepareOperatorApplication(), we here merely reserve as much memory for the workspace as needed by a Pauli expansion, for simplicity.

307  // ========================================================================
308  // Decide memory usage
309  // ========================================================================
310
311  // As outlined in the 'Memory usage' section above, we here either uniformly
312  // allocate all (or a high percentage of) available memory between the needed
313  // memory buffers, or use the pre-decided fixed values. This demo will create
314  // a total of two Pauli expansions (each of which accepts two separate buffers
315  // to store Pauli strings and their corresponding coefficients; these have
316  // different sizes) and one workspace, hence we arrange for an allocation of
317  // five buffers in total.
318
319  size_t expansionPauliMem;
320  size_t expansionCoefMem;
321  size_t workspaceMem;
322  size_t totalUsedMem;
323
324  if (USE_MAX_VRAM) {
325
326    // Find usable device memory
327    size_t totalFreeMem, totalGlobalMem;
328    HANDLE_CUDA_ERROR(cudaMemGetInfo(&totalFreeMem, &totalGlobalMem));
329    size_t totalUsableMem = static_cast<size_t>(totalFreeMem * MAX_VRAM_PERCENT/100);
330
331    // Divide it between the three instances (two expansions, one workspace)
332    size_t instanceMem = totalUsableMem / 3;
333
334    // Determine the ideal ratio between an expansion's Pauli and coef buffers
335    int32_t numPackedInts;
336    HANDLE_CUPP_ERROR(cupaulipropGetNumPackedIntegers(NUM_CIRCUIT_QUBITS, &numPackedInts));
337    size_t pauliMemPerTerm = 2 * numPackedInts * sizeof(cupaulipropPackedIntegerType_t);
338    size_t coefMemPerTerm = sizeof(double);
339    size_t totalMemPerTerm = pauliMemPerTerm + coefMemPerTerm;
340
341    expansionPauliMem = (instanceMem * pauliMemPerTerm) / totalMemPerTerm;
342    expansionCoefMem  = (instanceMem * coefMemPerTerm ) / totalMemPerTerm;
343    workspaceMem      = instanceMem;
344
345    totalUsedMem = 2*expansionPauliMem + 2*expansionCoefMem + workspaceMem;
346    std::cout << "Dedicated memory: " << MAX_VRAM_PERCENT << "% of " << totalFreeMem;
347    std::cout << " B free = " << totalUsedMem << " B, divided into..." << std::endl;
348
349  } else {
350
351    // Use pre-decided buffer sizes
352    expansionPauliMem = FIXED_EXPANSION_PAULI_MEM;
353    expansionCoefMem  = FIXED_EXPANSION_COEF_MEM;
354    workspaceMem      = FIXED_WORKSPACE_MEM;
355
356    totalUsedMem = 2*expansionPauliMem + 2*expansionCoefMem + workspaceMem;
357    std::cout << "Dedicated memory: " << totalUsedMem << " B = 60 MiB, divided into..." << std::endl;
358  }
359
360  std::cout << "  expansion Pauli buffer: " << expansionPauliMem << " B" << std::endl;
361  std::cout << "  expansion coef buffer:  " << expansionCoefMem << " B" << std::endl;
362  std::cout << "  workspace buffer:       " << workspaceMem << " B\n" << std::endl;

It is our responsibility to allocate and initialise the device memory before passing it to the cuPauliProp Pauli expansion constructor. We create an “input” Pauli expansion initialised to the single Pauli string \(Z_{62}\), and an “output” Pauli expansion which is initially empty. Notice that we indicate to cupaulipropCreatePauliExpansion() that the pre-initialised input strings (of which there is one) happen to be sorted and unique, which cuPauliProp can internally later leverage for potential speedup.

366  // ========================================================================
367  // Pauli expansion preparation
368  // ========================================================================
369
370  // Create buffers for two Pauli expansions, which will serve as 'input' and
371  // 'output' to the out-of-place cuPauliProp API. Note that the capacities of
372  // these buffers constrain the maximum number of Pauli strings maintained
373  // during simulation, and ergo inform the accuracy of the simulation. The
374  // sufficient buffer sizes are specific to the simulated system, and we here
375  // choose a surprisingly small capacity as admitted by the studied circuit.
376
377  void * d_inExpansionPauliBuffer;
378  void * d_outExpansionPauliBuffer;
379  void * d_inExpansionCoefBuffer;
380  void * d_outExpansionCoefBuffer;
381  HANDLE_CUDA_ERROR(cudaMalloc(&d_inExpansionPauliBuffer,  expansionPauliMem));
382  HANDLE_CUDA_ERROR(cudaMalloc(&d_inExpansionCoefBuffer,   expansionCoefMem));
383  HANDLE_CUDA_ERROR(cudaMalloc(&d_outExpansionPauliBuffer, expansionPauliMem));
384  HANDLE_CUDA_ERROR(cudaMalloc(&d_outExpansionCoefBuffer,  expansionCoefMem));
385
386  // Prepare the X and Z masks which encode the experimental observable Z_62,
387  // which has a coefficient of unity, as seen in Figure 4. b) of the IBM work.
388  std::cout << "Observable: Z_62\n" << std::endl;
389  int64_t numObservableTerms = 1;
390  double observableCoef = 1.0;
391  std::vector<cupaulipropPauliKind_t> observablePaulis = {CUPAULIPROP_PAULI_Z};
392  std::vector<uint32_t> observableQubits = {62};
393
394  // Overwrite the 'input' Pauli expansion buffers with the observable data
395  auto observablePackedInts = getPauliStringAsPackedIntegers(observablePaulis, observableQubits);
396  size_t numObservableBytes = observablePackedInts.size() * sizeof(observablePackedInts[0]);
397  HANDLE_CUDA_ERROR(cudaMemcpy(
398    d_inExpansionPauliBuffer, observablePackedInts.data(), numObservableBytes, cudaMemcpyHostToDevice));
399  HANDLE_CUDA_ERROR(cudaMemcpy(
400    d_inExpansionCoefBuffer, &observableCoef, sizeof(observableCoef), cudaMemcpyHostToDevice));
401
402  // Create two Pauli expansions, which will serve as 'input and 'output' to the API.
403  // Because we begin from a real observable coefficient, and our circuit is completely
404  // positive and trace preserving, it is sufficient to use strictly real coefficients
405  // in our expansions, informing dataType below. We indicate that the single prepared
406  // term in the input expansion is technically unique, and the terms sorted, which
407  // permits cuPauliProp to use automatic optimisations during simulation.
408
409  cupaulipropPauliExpansion_t inExpansion;
410  cupaulipropPauliExpansion_t outExpansion;
411
412  cupaulipropSortOrder_t sortOrder = CUPAULIPROP_SORT_ORDER_NONE;
413  int32_t hasDuplicates = 0;  // isUnique = !hasDuplicates
414  cudaDataType_t dataType = CUDA_R_64F;
415
416  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion( // init to above
417    handle, NUM_CIRCUIT_QUBITS,
418    d_inExpansionPauliBuffer, expansionPauliMem,
419    d_inExpansionCoefBuffer,  expansionCoefMem,
420    dataType, numObservableTerms, sortOrder, hasDuplicates, 
421    &inExpansion));
422  HANDLE_CUPP_ERROR(cupaulipropCreatePauliExpansion( // init to empty
423    handle, NUM_CIRCUIT_QUBITS,
424    d_outExpansionPauliBuffer, expansionPauliMem,
425    d_outExpansionCoefBuffer,  expansionCoefMem,
426    dataType, 0, CUPAULIPROP_SORT_ORDER_NONE, 0,
427    &outExpansion));
428
429  

We must also ourselves allocate device memory for the workspace.

431  // ========================================================================
432  // Workspace preparation
433  // ========================================================================
434
435  // Some API functions require additional workspace memory which we bind to a
436  // workspace descriptor. Ordinarily we use the 'Prepare' functions to precisely
437  // bound upfront the needed workspace memory, but in this simple demo, we
438  // instead use a workspace memory which we prior know to be sufficient.
439
440  // Create a workspace
441  cupaulipropWorkspaceDescriptor_t workspace;
442  HANDLE_CUPP_ERROR(cupaulipropCreateWorkspaceDescriptor(handle, &workspace));
443
444  void* d_workspaceBuffer;
445  HANDLE_CUDA_ERROR(cudaMalloc(&d_workspaceBuffer, workspaceMem));
446  HANDLE_CUPP_ERROR(cupaulipropWorkspaceSetMemory(
447    handle, workspace,
448    CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
449    d_workspaceBuffer, workspaceMem));
450  
451  // Note that the 'prepare' functions which check the required workspace memory
452  // will detach this buffer, requiring we re-call SetMemory above. We can avoid
453  // this repeated re-attachment by use of a second, bufferless workspace descri-
454  // ptor which we pass to the 'prepare' functions in lieu of this one.

During subsequent Heisenberg evolution of the observable operator \(Z_{62}\), the number of terms in the Pauli expansion will grow exponentially quickly. To mitigate this growth and keep simulation tractable through the entire circuit, we employ truncation strategies. For the simulated system, it is sufficient to perform truncation after every \(10\) gates, wherein we discard all Pauli strings with corresponding coefficient magnitudes smaller or equal to \(10^{-4}\), or which have a Pauli “weight” (the number of non-identity Paulis therein) exceeding \(8\). This strategy keeps total memory costs below 60 MiB (and so within the capacities of our Pauli expansion and workspace buffers), while ultimately producing an expectation value within the error bars of the error-mitigated experimental value.

458  // ========================================================================
459  // Truncation parameter preparation
460  // ========================================================================
461
462  // The Pauli propagation simulation technique has memory and runtime costs which
463  // (for generic circuits) grow exponentially with the circuit length, which we
464  // curtail through "truncation"; dynamic discarding of Pauli strings in our
465  // expansion which are predicted to contribute negligibly to the final output
466  // expectation value. This file demonstrates simultaneous usage of two truncation
467  // techniques; discarding of Pauli strings with an absolute coefficient less than
468  // 0.0001, or a "Pauli weight" (the number of non-identity operators in the string)
469  // exceeding eight. For example, given a Pauli expansion containing:
470  //    0.1 XYZXYZII + 1E-5 ZIIIIIII + 0.2 XXXXYYYY,
471  // our truncation parameters below would see the latter two strings discarded due
472  // to coefficient and weight truncation respectively.
473
474  cupaulipropCoefficientTruncationParams_t coefTruncParams;
475  coefTruncParams.cutoff = 1E-4;
476
477  cupaulipropPauliWeightTruncationParams_t weightTruncParams;
478  weightTruncParams.cutoff = 8;
479
480  const uint32_t numTruncStrats = 2;
481  cupaulipropTruncationStrategy_t truncStrats[] = {
482    {
483      CUPAULIPROP_TRUNCATION_STRATEGY_COEFFICIENT_BASED,
484      &coefTruncParams
485    },
486    {
487      CUPAULIPROP_TRUNCATION_STRATEGY_PAULI_WEIGHT_BASED,
488      &weightTruncParams
489    }
490  };
491
492  // It is not necessary to perform truncation after every gate, since the
493  // Pauli expansion size may not have grown substantially, and attempting
494  // to truncate may incur superfluous memory enumeration costs. In this
495  // demo, we choose to truncate only after every tenth applied gate. Note
496  // deferring truncation requires additional expansion memory; choosing to
497  // truncate after every gate shrinks this demo's costs to 20 MiB total.
498  const int numGatesBetweenTruncations = 10;
499
500  std::cout << "Coefficient truncation threshold:  " << coefTruncParams.cutoff << std::endl;
501  std::cout << "Pauli weight truncation threshold: " << weightTruncParams.cutoff << std::endl;
502  std::cout << "Truncation performed after every:  " << numGatesBetweenTruncations << " gates\n" << std::endl;

With the stage set, we are ready to begin simulation! We fix the angle of all X-rotation gates in our circuit to \(\theta_X = \pi/4\), and use \(20\) Trotter repetitions. Our simulated system therefore corresponds to the antepenultimate experimental datum in Figure 4. b of Nature volume 618, pages 500–505 (2023). Let operator \(\hat{U}\) represent the action of this full, forward experimental circuit. The original experimentalists performed \(\hat{U}\) upon the fiducial zero-state \(|0\rangle^{\otimes 127}\), and estimated the expectation value of observable \(Z_{62}\) through repeated sampling.

In contrast, we will here work in the Heisenberg picture, and simulate the adjoint (ergo reversed) circuit upon the observable operator \(Z_{62}\). For each gate, we create a view of the entire input Pauli expansion; we confirm that the output Pauli expansion has sufficient capacity to store the (worst case) result of applying the gate upon the view; we reattach the workspace buffer which is detached by that query; we apply the gate and overwrite the output Pauli expansion; and finally, we swap the input and output Pauli expansions so that the updated expansion becomes the next input.

506  // ========================================================================
507  // Back-propagation of the observable through the circuit
508  // ========================================================================
509
510  // We now simulate the observable operator being back-propagated through the
511  // adjoint circuit, mapping the input expansion (initialised to Z_62) to a
512  // final output expansion containing many weighted Pauli strings. We use the
513  // heavy-hex fixed-angle Ising circuit with 20 total repetitions, fixing the
514  // angle of the X rotation gates to PI/4. Our simulation therefore corresponds
515  // to the middle datum of Fig. 4 b) of the IBM manuscript, for which MPS and
516  // isoTNS siulation techniques showed the greatest divergence from experiment.
517
518  double xRotationAngle = PI / 4.;
519  int numTrotterSteps = 20;
520  auto circuit = getIBMHeavyHexIsingCircuit(handle, xRotationAngle, numTrotterSteps);
521
522  std::cout << "Circuit: 127 qubit IBM heavy-hex Ising circuit, with..." << std::endl;
523  std::cout << "  Trotter steps: " << numTrotterSteps << std::endl;
524  std::cout << "  Total gates:   " << circuit.size() << std::endl;
525  std::cout << "  Rx angle:      " << xRotationAngle << " (i.e. PI/4)\n" << std::endl;
526
527  // Constrain that every intermediate output expansion contains unique Pauli
528  // strings (forbidding duplicates), but permit the retained strings to be
529  // unsorted. This combination gives cuPauliProp the best chance of automatically
530  // selecting optimal internal functions and postconditions for the simulation.
531  uint32_t adjoint = true;
532  sortOrder = CUPAULIPROP_SORT_ORDER_NONE;  // reuse variable from above
533  uint32_t keepDuplicates = false;
534
535  std::cout << "Imposed postconditions:" << std::endl;
536  if (sortOrder != CUPAULIPROP_SORT_ORDER_NONE) {
537    std::cout << "  Pauli strings will be sorted." << std::endl;
538  }
539  if (!keepDuplicates) {
540    std::cout << "  Pauli strings will be unique." << std::endl;
541  }
542  if (keepDuplicates && sortOrder == CUPAULIPROP_SORT_ORDER_NONE) {
543    std::cout << "No postconditions imposed on Pauli strings." << std::endl;
544  }
545  std::cout << std::endl;
546
547  // Begin timing before any gates are applied
548  auto startTime = std::chrono::high_resolution_clock::now();
549  int64_t maxNumTerms = 0;
550
551  // Iterate the circuit in reverse to effect the adjoint of the total circuit
552  for (int gateInd=circuit.size()-1; gateInd >= 0; --gateInd) {
553    cupaulipropQuantumOperator_t gate = circuit[gateInd];
554
555    // Create a view of the current input expansion, selecting all currently
556    // contained terms. For very large systems, we may have alternatively
557    // chosen a smaller view of the partial state to work around memory limits.
558    cupaulipropPauliExpansionView_t inView;
559    int64_t numExpansionTerms;
560    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &numExpansionTerms));
561    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(
562      handle, inExpansion, 0, numExpansionTerms, &inView));
563
564    // Track the intermediate expansion size, for our curiousity
565    if (numExpansionTerms > maxNumTerms)
566      maxNumTerms = numExpansionTerms;
567
568    // Choose whether or not to perform truncations after this gate
569    int numPassedTruncStrats = (gateInd % numGatesBetweenTruncations == 0)? numTruncStrats : 0;
570
571    // Check the expansion and workspace memories needed to apply the current gate
572    int64_t reqExpansionPauliMem;
573    int64_t reqExpansionCoefMem;
574    int64_t reqWorkspaceMem;
575    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareOperatorApplication(
576      handle, inView, gate, sortOrder, keepDuplicates,
577      numPassedTruncStrats, numPassedTruncStrats > 0 ? truncStrats : nullptr,
578      workspaceMem,
579      &reqExpansionPauliMem, &reqExpansionCoefMem, workspace));
580    HANDLE_CUPP_ERROR(cupaulipropWorkspaceGetMemorySize(
581      handle, workspace,
582      CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
583      &reqWorkspaceMem));
584
585    // Verify that our existing buffers and workspace have sufficient memory
586    assert(reqExpansionPauliMem <= expansionPauliMem);
587    assert(reqExpansionCoefMem  <= expansionCoefMem);
588    assert(reqWorkspaceMem      <= workspaceMem);
589
590    // Beware that cupaulipropPauliExpansionViewPrepareOperatorApplication() above
591    // detaches the memory buffer from the workspace, which we here re-attach.
592    HANDLE_CUPP_ERROR(cupaulipropWorkspaceSetMemory(
593      handle, workspace,
594      CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
595      d_workspaceBuffer, workspaceMem));
596
597    // Apply the gate upon the prepared view of the input expansion, evolving the
598    // Pauli strings pointed to within, truncating the result. The input expansion
599    // is unchanged while the output expansion is entirely overwritten.
600    HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeOperatorApplication(
601      handle, inView, outExpansion, gate,
602      adjoint, sortOrder, keepDuplicates,
603      numPassedTruncStrats, numPassedTruncStrats > 0 ? truncStrats : nullptr,
604      workspace, stream));
605
606    // Free the temporary view since it points to the old input expansion, whereas
607    // we will subsequently treat the modified output expansion as the next input
608    HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(inView));
609
610    // Treat outExpansion as the input in the next gate application
611    std::swap(inExpansion, outExpansion);
612  }
613
614  // Restore outExpansion to being the final output for clarity
615  std::swap(inExpansion, outExpansion);

Our desired output quantity is the expectation value of \(Z_{62}\), as given by

\[\langle Z_{64} \rangle = \text{Tr}\left( \hat{U} |0\rangle\langle 0| \hat{U}^\dagger \; \hat{Z}_{62} \right).\]

This is expressible in terms of our final output Pauli expansion \(s_{\text{out}}\) (the Heisenberg-evolved observable) as

\[\langle Z_{64} \rangle = \text{Tr}\left( |0\rangle\langle 0 | \; \hat{U}^\dagger \hat{Z}_{62} \hat{U} \right) = \text{Tr}\left( |0\rangle\langle 0 | \; s_{\text{out}} \right).\]

Therefore we calculate the trace of our output Pauli expansion with the zero state, producing our estimate to the experimental expectation value of \(Z_{62}\).

619  // ========================================================================
620  // Evaluation of the the expectation value of observable
621  // ========================================================================
622
623  // The output expansion is now a proxy for the observable back-propagated
624  // through to the front of the circuit (though having discarded components
625  // which negligibly influence the subsequent overlap). The expectation value
626  // of the IBM experiment is the overlap of the output expansion with the
627  // zero state, i.e. Tr(outExpansion * |0><0|), as we now compute.
628
629  // Obtain a view of the full output expansion (we'll free it in 'Clean up')
630  cupaulipropPauliExpansionView_t outView;
631  int64_t numOutTerms;
632  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetNumTerms(handle, inExpansion, &numOutTerms));
633  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionGetContiguousRange(
634    handle, inExpansion, 0, numOutTerms, &outView));
635
636  // Check that the existing workspace memory is sufficient to compute the trace 
637  int64_t reqWorkspaceMem;
638  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewPrepareTraceWithZeroState(
639    handle, outView, workspaceMem, workspace));
640  HANDLE_CUPP_ERROR(cupaulipropWorkspaceGetMemorySize(
641    handle, workspace,
642    CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
643    &reqWorkspaceMem));
644  assert(reqWorkspaceMem <= workspaceMem);
645
646  // Beware that we must now reattach the buffer to the workspace
647  HANDLE_CUPP_ERROR(cupaulipropWorkspaceSetMemory(
648    handle, workspace,
649    CUPAULIPROP_MEMSPACE_DEVICE, CUPAULIPROP_WORKSPACE_SCRATCH,
650    d_workspaceBuffer, workspaceMem));
651
652  // Compute the trace; the main and final output of this simulation!
653  double expec;
654  HANDLE_CUPP_ERROR(cupaulipropPauliExpansionViewComputeTraceWithZeroState(
655    handle, outView, &expec, workspace, stream));
656
657  // End timing after trace is evaluated
658  auto endTime = std::chrono::high_resolution_clock::now();
659  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime);
660  auto durationSecs = (duration.count() / 1e6);
661
662  std::cout << "Expectation value:       " << expec << std::endl;
663  std::cout << "Final number of terms:   " << numOutTerms << std::endl;
664  std::cout << "Maximum number of terms: " << maxNumTerms << std::endl;
665  std::cout << "Runtime:                 " << durationSecs << " seconds\n" << std::endl;

Finally, we clean up resources and free all device memory reserved during our simulation. This includes any outstanding views, all quantum operators, all expansions and their underlying buffers, and similarly for the workspace. Our final step is to finalise the cuPauliProp library, concluding simulation.

669  // ========================================================================
670  // Clean up
671  // ========================================================================
672
673  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansionView(outView));
674 
675  for (auto & gate : circuit) {
676    HANDLE_CUPP_ERROR(cupaulipropDestroyOperator(gate));
677  }
678
679  HANDLE_CUPP_ERROR(cupaulipropDestroyWorkspaceDescriptor(workspace));
680  HANDLE_CUDA_ERROR(cudaFree(d_workspaceBuffer));
681
682  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansion(inExpansion));
683  HANDLE_CUPP_ERROR(cupaulipropDestroyPauliExpansion(outExpansion));
684
685  HANDLE_CUDA_ERROR(cudaFree(d_inExpansionPauliBuffer));
686  HANDLE_CUDA_ERROR(cudaFree(d_outExpansionPauliBuffer));
687  HANDLE_CUDA_ERROR(cudaFree(d_inExpansionCoefBuffer));
688  HANDLE_CUDA_ERROR(cudaFree(d_outExpansionCoefBuffer));
689
690  HANDLE_CUPP_ERROR(cupaulipropDestroy(handle));
691
692  return EXIT_SUCCESS;
693}

Useful tips#

  • For debugging, one can set the environment variable CUPAULIPROP_LOG_LEVEL=n. The level n = 0, 1, …, 5 corresponds to the logger level as described in the table below. The environment variable CUPAULIPROP_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout.

Level

Summary

Long Description

0

Off

Logging is disabled (default)

1

Errors

Only errors will be logged

2

Performance Trace

API calls that launch CUDA kernels will log their parameters and important information

3

Performance Hints

Hints that can potentially improve the application’s performance

4

Heuristics Trace

Provides general information about the library execution, may contain details about heuristic status

5

API Trace

API calls will log their parameter and important information