Examples¶
In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to compile the sample code. Then, we present an example code used to perform common steps in cuTensorNet. In the example, we perform the following tensor contraction:
R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}
We construct the code step by step, each step adding code at the end. The steps are separated by succinct multi-line comment blocks.
It is recommended that the reader refer to the Overview and the cuTENSOR documentation to become familiar with the nomenclature and cuTENSOR operations.
Compiling code¶
Assuming cuQuantum has been extracted in CUQUANTUM_ROOT and cuTENSOR in CUTENSOR_ROOT, we update the library path as follows:
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}
Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ${CUTENSOR_ROOT}/lib/11.0).
The serial sample code discussed below (tensornet_example.cu) can be compiled via the following command:
nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example
For static linking against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be explicitly linked against, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):
nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example
In order to build the parallel (MPI) versions of the examples (tensornet_example_mpi_auto.cu and tensornet_example_mpi.cu), one will need to have an MPI library installed (e.g., recent Open MPI, MVAPICH, or MPICH). In particular, the automatic parallel example requires CUDA-aware MPI; see Code example (automatic slice-based distributed parallelization) below. In this case, one will need to add -I${MPI_PATH}/include and -L${MPI_PATH}/lib -lmpi to the build command:
nvcc tensornet_example_mpi_auto.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi_auto
nvcc tensornet_example_mpi.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi
Warning
When running tensornet_example_mpi_auto.cu without CUDA-aware MPI, the program will crash.
Note
Depending on the source of the cuQuantum package, you may need to replace lib above by lib64.
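As a quick sanity check of the include and library paths, one can build and run a minimal program such as the sketch below (not part of the shipped samples) with the same nvcc command-line pattern shown above. It only calls the version-query functions of the two libraries, so a successful build and a sensible printout confirm that the headers and the libraries found at run time are consistent.
#include <stdio.h>
#include <cutensor.h>
#include <cutensornet.h>

int main()
{
    // Both functions return the version of the library loaded at run time.
    printf("cuTensorNet version: %zu\n", cutensornetGetVersion());
    printf("cuTENSOR version:    %zu\n", cutensorGetVersion());
    return 0;
}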
Code example (serial)¶
The following code example illustrates the common steps necessary to use cuTensorNet and also introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository (here).
Headers and data types¶
8#include <stdlib.h>
9#include <stdio.h>
10
11#include <unordered_map>
12#include <vector>
13#include <cassert>
14
15#include <cuda_runtime.h>
16#include <cutensornet.h>
17
18
19#define HANDLE_ERROR(x) \
20{ const auto err = x; \
21 if( err != CUTENSORNET_STATUS_SUCCESS ) \
22 { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \
23 fflush(stdout); \
24 } \
25};
26
27#define HANDLE_CUDA_ERROR(x) \
28{ const auto err = x; \
29 if( err != cudaSuccess ) \
30 { printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \
31 fflush(stdout); \
32 } \
33};
34
35
36struct GPUTimer
37{
38 GPUTimer(cudaStream_t stream): stream_(stream)
39 {
40 HANDLE_CUDA_ERROR(cudaEventCreate(&start_));
41 HANDLE_CUDA_ERROR(cudaEventCreate(&stop_));
42 }
43
44 ~GPUTimer()
45 {
46 HANDLE_CUDA_ERROR(cudaEventDestroy(start_));
47 HANDLE_CUDA_ERROR(cudaEventDestroy(stop_));
48 }
49
50 void start()
51 {
52 HANDLE_CUDA_ERROR(cudaEventRecord(start_, stream_));
53 }
54
55 float seconds()
56 {
57 HANDLE_CUDA_ERROR(cudaEventRecord(stop_, stream_));
58 HANDLE_CUDA_ERROR(cudaEventSynchronize(stop_));
59 float time;
60 HANDLE_CUDA_ERROR(cudaEventElapsedTime(&time, start_, stop_));
61 return time * 1e-3;
62 }
63
64 private:
65 cudaEvent_t start_, stop_;
66 cudaStream_t stream_;
67};
68
69
70int main()
71{
72 static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!");
73
74 bool verbose = true;
75
76 // Check cuTensorNet version
77 const size_t cuTensornetVersion = cutensornetGetVersion();
78 if(verbose)
79 printf("cuTensorNet version: %ld\n", cuTensornetVersion);
80
81 // Set GPU device
82 int numDevices {0};
83 HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
84 const int deviceId = 0;
85 HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
86 cudaDeviceProp prop;
87 HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
88
89 if(verbose) {
90 printf("===== device info ======\n");
91 printf("GPU-local-id:%d\n", deviceId);
92 printf("GPU-name:%s\n", prop.name);
93 printf("GPU-clock:%d\n", prop.clockRate);
94 printf("GPU-memoryClock:%d\n", prop.memoryClockRate);
95 printf("GPU-nSM:%d\n", prop.multiProcessorCount);
96 printf("GPU-major:%d\n", prop.major);
97 printf("GPU-minor:%d\n", prop.minor);
98 printf("========================\n");
99 }
100
101 typedef float floatType;
102 cudaDataType_t typeData = CUDA_R_32F;
103 cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;
104
105 if(verbose)
106 printf("Included headers and defined data types\n");
Define tensor network and tensor sizes¶
Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).
110 /**********************
111 * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}
112 **********************/
113
114 constexpr int32_t numInputs = 4;
115
116 // Create vectors of tensor modes
117 std::vector<int32_t> modesA{'a','b','c','d','e','f'};
118 std::vector<int32_t> modesB{'b','g','h','e','i','j'};
119 std::vector<int32_t> modesC{'m','a','g','f','i','k'};
120 std::vector<int32_t> modesD{'l','c','h','d','j','m'};
121 std::vector<int32_t> modesR{'k','l'};
122
123 // Set mode extents
124 std::unordered_map<int32_t, int64_t> extent;
125 extent['a'] = 16;
126 extent['b'] = 16;
127 extent['c'] = 16;
128 extent['d'] = 16;
129 extent['e'] = 16;
130 extent['f'] = 16;
131 extent['g'] = 16;
132 extent['h'] = 16;
133 extent['i'] = 16;
134 extent['j'] = 16;
135 extent['k'] = 16;
136 extent['l'] = 16;
137 extent['m'] = 16;
138
139 // Create a vector of extents for each tensor
140 std::vector<int64_t> extentA;
141 for (auto mode : modesA)
142 extentA.push_back(extent[mode]);
143 std::vector<int64_t> extentB;
144 for (auto mode : modesB)
145 extentB.push_back(extent[mode]);
146 std::vector<int64_t> extentC;
147 for (auto mode : modesC)
148 extentC.push_back(extent[mode]);
149 std::vector<int64_t> extentD;
150 for (auto mode : modesD)
151 extentD.push_back(extent[mode]);
152 std::vector<int64_t> extentR;
153 for (auto mode : modesR)
154 extentR.push_back(extent[mode]);
155
156 if(verbose)
157 printf("Defined tensor network, modes, and extents\n");
Allocate memory and initialize data¶
Next, we allocate memory for the tensor network operands and initialize them to random values.
160 /**********************
161 * Allocating data
162 **********************/
163
164 size_t elementsA = 1;
165 for (auto mode : modesA)
166 elementsA *= extent[mode];
167 size_t elementsB = 1;
168 for (auto mode : modesB)
169 elementsB *= extent[mode];
170 size_t elementsC = 1;
171 for (auto mode : modesC)
172 elementsC *= extent[mode];
173 size_t elementsD = 1;
174 for (auto mode : modesD)
175 elementsD *= extent[mode];
176 size_t elementsR = 1;
177 for (auto mode : modesR)
178 elementsR *= extent[mode];
179
180 size_t sizeA = sizeof(floatType) * elementsA;
181 size_t sizeB = sizeof(floatType) * elementsB;
182 size_t sizeC = sizeof(floatType) * elementsC;
183 size_t sizeD = sizeof(floatType) * elementsD;
184 size_t sizeR = sizeof(floatType) * elementsR;
185 if(verbose)
186 printf("Total GPU memory used for tensor storage: %.2f GiB\n",
187 (sizeA + sizeB + sizeC + sizeD + sizeR) / 1024. /1024. / 1024);
188
189 void* rawDataIn_d[numInputs];
190 void* R_d;
191 HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) );
192 HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) );
193 HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) );
194 HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[3], sizeD) );
195 HANDLE_CUDA_ERROR( cudaMalloc((void**) &R_d, sizeR));
196
197 floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA);
198 floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB);
199 floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC);
200 floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD);
201 floatType *R = (floatType*) malloc(sizeof(floatType) * elementsR);
202
203 if (A == NULL || B == NULL || C == NULL || D == NULL || R == NULL)
204 {
205 printf("Error: Host memory allocation failed!\n");
206 return -1;
207 }
208
209 /*******************
210 * Initialize data
211 *******************/
212
213 memset(R, 0, sizeof(floatType) * elementsR);
214 for (uint64_t i = 0; i < elementsA; i++)
215 A[i] = ((floatType) rand()) / RAND_MAX;
216 for (uint64_t i = 0; i < elementsB; i++)
217 B[i] = ((floatType) rand()) / RAND_MAX;
218 for (uint64_t i = 0; i < elementsC; i++)
219 C[i] = ((floatType) rand()) / RAND_MAX;
220 for (uint64_t i = 0; i < elementsD; i++)
221 D[i] = ((floatType) rand()) / RAND_MAX;
222
223 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
224 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
225 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
226 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
227
228 if(verbose)
229 printf("Allocated GPU memory for data, and initialize data\n");
cuTensorNet handle and network descriptor¶
Next, we initialize the cuTensorNet library via cutensornetCreate()
and
create the network descriptor with the desired tensor modes, extents, and
strides, as well as the data and compute types. Note that the created library
context will be associated with the currently active GPU.
232 /*************************
233 * cuTensorNet
234 *************************/
235
236 cudaStream_t stream;
237 HANDLE_CUDA_ERROR( cudaStreamCreate(&stream) );
238
239 cutensornetHandle_t handle;
240 HANDLE_ERROR( cutensornetCreate(&handle) );
241
242 const int32_t nmodeA = modesA.size();
243 const int32_t nmodeB = modesB.size();
244 const int32_t nmodeC = modesC.size();
245 const int32_t nmodeD = modesD.size();
246 const int32_t nmodeR = modesR.size();
247
248 /*******************************
249 * Create Network Descriptor
250 *******************************/
251
252 const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data(), modesD.data()};
253 int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC, nmodeD};
254 const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data(), extentD.data()};
255 const int64_t* stridesIn[] = {NULL, NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout
256
257 // Set up tensor network
258 cutensornetNetworkDescriptor_t descNet;
259 HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle,
260 numInputs, numModesIn, extentsIn, stridesIn, modesIn, NULL,
261 nmodeR, extentR.data(), /*stridesOut = */NULL, modesR.data(),
262 typeData, typeCompute,
263 &descNet) );
264
265 if(verbose)
266 printf("Initialized the cuTensorNet library and created a tensor network descriptor\n");
Optimal contraction order and slicing¶
At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available memory resources, and provide it to the optimizer as a constraint. We then create an optimizer configuration object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is invoked via cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.
269 /*******************************
270 * Choose workspace limit based on available resources.
271 *******************************/
272
273 size_t freeMem, totalMem;
274 HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) );
275 uint64_t workspaceLimit = (uint64_t)((double)freeMem * 0.9);
276 if(verbose)
277 printf("Workspace limit = %lu\n", workspaceLimit);
278
279 /*******************************
280 * Find "optimal" contraction order and slicing
281 *******************************/
282
283 cutensornetContractionOptimizerConfig_t optimizerConfig;
284 HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) );
285
286 // Set the desired number of hyper-samples (defaults to 0)
287 int32_t num_hypersamples = 8;
288 HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle,
289 optimizerConfig,
290 CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES,
291 &num_hypersamples,
292 sizeof(num_hypersamples)) );
293
294 // Create contraction optimizer info and find an optimized contraction path
295 cutensornetContractionOptimizerInfo_t optimizerInfo;
296 HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) );
297
298 HANDLE_ERROR( cutensornetContractionOptimize(handle,
299 descNet,
300 optimizerConfig,
301 workspaceLimit,
302 optimizerInfo) );
303
304 // Query the number of slices the tensor network execution will be split into
305 int64_t numSlices = 0;
306 HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
307 handle,
308 optimizerInfo,
309 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
310 &numSlices,
311 sizeof(numSlices)) );
312 assert(numSlices > 0);
313
314 if(verbose)
315 printf("Found an optimized contraction path using cuTensorNet optimizer\n");
It is also possible to bypass the cuTensorNet optimizer and import a pre-determined contraction path, as well as slicing information, directly to the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute().
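As a hedged sketch (not taken from the sample code), importing a user-defined path for the four-tensor network of this example could look as follows. The node pairs are assumed to follow cuTensorNet's path convention, in which each pairwise intermediate is appended to the end of the list of remaining tensors; please verify the convention and any slicing attributes against the cutensornetContractionOptimizerInfo_t documentation before relying on this.
// Sketch: provide a pre-determined contraction path instead of running the optimizer.
// Assumed convention: each contraction consumes two tensors from the current list and
// appends the resulting intermediate at the end (opt_einsum-style).
cutensornetNodePair_t nodePairs[] = { {0, 1},   // contract A and B
                                      {0, 1},   // contract C and D
                                      {0, 1} }; // contract the two intermediates
cutensornetContractionPath_t path;
path.numContractions = 3;
path.data = nodePairs;
HANDLE_ERROR( cutensornetContractionOptimizerInfoSetAttribute(handle,
                  optimizerInfo,
                  CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                  &path,
                  sizeof(path)) );
// Any slicing information would be supplied through the corresponding
// optimizer-info attributes before creating the contraction plan.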
Create workspace descriptor and allocate workspace memory¶
Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.
318 /*******************************
319 * Create workspace descriptor, allocate workspace, and set it.
320 *******************************/
321
322 cutensornetWorkspaceDescriptor_t workDesc;
323 HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) );
324
325 int64_t requiredWorkspaceSize = 0;
326 HANDLE_ERROR( cutensornetWorkspaceComputeContractionSizes(handle,
327 descNet,
328 optimizerInfo,
329 workDesc) );
330
331 HANDLE_ERROR( cutensornetWorkspaceGetMemorySize(handle,
332 workDesc,
333 CUTENSORNET_WORKSIZE_PREF_MIN,
334 CUTENSORNET_MEMSPACE_DEVICE,
335 CUTENSORNET_WORKSPACE_SCRATCH,
336 &requiredWorkspaceSize) );
337
338 void* work = nullptr;
339 HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) );
340
341 HANDLE_ERROR( cutensornetWorkspaceSetMemory(handle,
342 workDesc,
343 CUTENSORNET_MEMSPACE_DEVICE,
344 CUTENSORNET_WORKSPACE_SCRATCH,
345 work,
346 requiredWorkspaceSize) );
347
348 if(verbose)
349 printf("Allocated and set up the GPU workspace\n");
Contraction plan and auto-tune¶
We create a tensor network contraction plan holding all pairwise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan such that cuTENSOR selects the best kernel for each pairwise contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing this plan redundantly.
352 /*******************************
353 * Initialize the pairwise contraction plan (for cuTENSOR).
354 *******************************/
355
356 cutensornetContractionPlan_t plan;
357 HANDLE_ERROR( cutensornetCreateContractionPlan(handle,
358 descNet,
359 optimizerInfo,
360 workDesc,
361 &plan) );
362
363 /*******************************
364 * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel
365 * for each pairwise tensor contraction.
366 *******************************/
367 cutensornetContractionAutotunePreference_t autotunePref;
368 HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle,
369 &autotunePref) );
370
371 const int numAutotuningIterations = 5; // may be 0
372 HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute(
373 handle,
374 autotunePref,
375 CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS,
376 &numAutotuningIterations,
377 sizeof(numAutotuningIterations)) );
378
379 // Modify the plan again to find the best pair-wise contractions
380 HANDLE_ERROR( cutensornetContractionAutotune(handle,
381 plan,
382 rawDataIn_d,
383 R_d,
384 workDesc,
385 autotunePref,
386 stream) );
387
388 HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) );
389
390 if(verbose)
391 printf("Created a contraction plan for cuTensorNet and optionally auto-tuned it\n");
Tensor network contraction execution¶
Finally, we contract the tensor network as many times as needed, possibly with different input each time.
Tensor network slices, captured as a cutensornetSliceGroup_t
object, are computed using the same contraction plan.
For convenience, NULL
can be provided to the cutensornetContractSlices()
function instead of a slice group when
the goal is to contract all the slices in the network. We also clean up and free allocated resources.
394 /**********************
395 * Execute the tensor network contraction
396 **********************/
397
398 // Create a cutensornetSliceGroup_t object from a range of slice IDs
399 cutensornetSliceGroup_t sliceGroup{};
400 HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) );
401
402 GPUTimer timer {stream};
403 double minTimeCUTENSORNET = 1e100;
404 const int numRuns = 3; // number of repeats to get stable performance results
405 for (int i = 0; i < numRuns; ++i)
406 {
407 HANDLE_CUDA_ERROR( cudaMemcpy(R_d, R, sizeR, cudaMemcpyHostToDevice) ); // restore the output tensor on GPU
408 HANDLE_CUDA_ERROR( cudaDeviceSynchronize() );
409
410 /*
411 * Contract all slices of the tensor network
412 */
413 timer.start();
414
415 int32_t accumulateOutput = 0; // output tensor data will be overwritten
416 HANDLE_ERROR( cutensornetContractSlices(handle,
417 plan,
418 rawDataIn_d,
419 R_d,
420 accumulateOutput,
421 workDesc,
422 sliceGroup, // alternatively, NULL can also be used to contract over all slices instead of specifying a sliceGroup object
423 stream) );
424
425 // Synchronize and measure best timing
426 auto time = timer.seconds();
427 minTimeCUTENSORNET = (time > minTimeCUTENSORNET) ? minTimeCUTENSORNET : time;
428 }
429
430 if(verbose)
431 printf("Contracted the tensor network, each slice used the same contraction plan\n");
432
433 // Print the 1-norm of the output tensor (verification)
434 HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) );
435 HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host
436 double norm1 = 0.0;
437 for (int64_t i = 0; i < elementsR; ++i) {
438 norm1 += std::abs(R[i]);
439 }
440 if(verbose)
441 printf("Computed the 1-norm of the output tensor: %e\n", norm1);
442
443 /*************************/
444
445 // Query the total Flop count for the tensor network contraction
446 double flops {0.0};
447 HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
448 handle,
449 optimizerInfo,
450 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
451 &flops,
452 sizeof(flops)) );
453
454 if(verbose) {
455 printf("Number of tensor network slices = %ld\n", numSlices);
456 printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSORNET * 1000.f);
457 }
458
459 // Free cuTensorNet resources
460 HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) );
461 HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) );
462 HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) );
463 HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) );
464 HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) );
465 HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) );
466 HANDLE_ERROR( cutensornetDestroy(handle) );
467
468 // Free Host memory resources
469 if (R) free(R);
470 if (D) free(D);
471 if (C) free(C);
472 if (B) free(B);
473 if (A) free(A);
474
475 // Free GPU memory resources
476 if (work) cudaFree(work);
477 if (R_d) cudaFree(R_d);
478 if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]);
479 if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]);
480 if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]);
481 if (rawDataIn_d[3]) cudaFree(rawDataIn_d[3]);
482
483 if(verbose)
484 printf("Freed resources and exited\n");
485
486 return 0;
487}
Recall that the full sample code can be found in the NVIDIA/cuQuantum repository (here).
Code example (automatic slice-based distributed parallelization)¶
It is straightforward to adapt Code example (serial) and enable automatic parallel execution across multiple/many GPU devices (across multiple/many nodes). We will illustrate this with an example using the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions needed to enable distributed parallel execution without making any changes to the original serial source code. The full MPI-automatic sample code can be found in the NVIDIA/cuQuantum repository. To enable automatic parallelism, cuTensorNet requires that:
- the environment variable $CUTENSORNET_COMM_LIB is set to the path to the wrapper shared library libcutensornet_distributed_interface_mpi.so, and
- the executable is linked to a CUDA-aware MPI library.
Detailed instructions for setting these up are given in the installation guide above.
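As an optional, hypothetical safeguard (not part of the shipped sample), one could verify the first requirement at run time, after MPI_Init() and before the first cutensornetCreate() call, so that a missing setting fails with a clear message rather than a crash:
// Hypothetical sanity check (sketch): fail early if the wrapper library path is not set.
const char* commLib = getenv("CUTENSORNET_COMM_LIB");
if (commLib == NULL)
{
    printf("Error: CUTENSORNET_COMM_LIB is not set; "
           "point it to libcutensornet_distributed_interface_mpi.so\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
}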
First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and assign a unique GPU device to each MPI process that will later be associated with the cuTensorNet library handle created inside the MPI process.
20#include <mpi.h>
44#define HANDLE_MPI_ERROR(x) \
45{ const auto err = x; \
46 if( err != MPI_SUCCESS ) \
47 { char error[MPI_MAX_ERROR_STRING]; int len; \
48 MPI_Error_string(err, error, &len); \
49 printf("MPI Error: %s in line %d\n", error, __LINE__); \
50 fflush(stdout); \
51 MPI_Abort(MPI_COMM_WORLD, err); \
52 } \
53};
The MPI service initialization must precede the first cutensornetCreate() call, which creates a cuTensorNet library handle. An attempt to call cutensornetCreate() before initializing the MPI service will result in an error.
97 // Initialize MPI
98 HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) );
99 int rank {-1};
100 HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
101 int numProcs {0};
102 HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );
If multiple GPU devices located on the same node are visible to an MPI process,
we need to pick an exclusive GPU device for each MPI process. If the mpirun
(or mpiexec
)
command provided by your MPI library implementation sets up an environment variable
that shows the rank of the respective MPI process during its invocation, you can use
that environment variable to set CUDA_VISIBLE_DEVICES
to point to a specific single
GPU device assigned to the MPI process exclusively (for example, Open MPI provides
${OMPI_COMM_WORLD_LOCAL_RANK}
for this purpose). Otherwise, the GPU device can be
set manually, as shown below.
122 // Set GPU device based on ranks and nodes
123 int numDevices {0};
124 HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
125 const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
126 HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
127 cudaDeviceProp prop;
128 HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
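For reference, a sketch of the environment-variable approach mentioned above is shown below as an alternative to the manual assignment. It assumes Open MPI's OMPI_COMM_WORLD_LOCAL_RANK (other MPI implementations export a similar variable under a different name) and is not part of the shipped sample.
// Sketch: pick the GPU from the node-local rank exported by the MPI launcher
// (Open MPI is assumed here; adjust the variable name for other MPI libraries).
const char* localRankStr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
if (localRankStr != NULL)
{
    int numDevices {0};
    HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
    const int localRank = atoi(localRankStr);
    HANDLE_CUDA_ERROR( cudaSetDevice(localRank % numDevices) );
}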
Next we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.
258 /*******************
259 * Initialize data
260 *******************/
261
262 memset(R, 0, sizeof(floatType) * elementsR);
263 if(rank == 0)
264 {
265 for (uint64_t i = 0; i < elementsA; i++)
266 A[i] = ((floatType) rand()) / RAND_MAX;
267 for (uint64_t i = 0; i < elementsB; i++)
268 B[i] = ((floatType) rand()) / RAND_MAX;
269 for (uint64_t i = 0; i < elementsC; i++)
270 C[i] = ((floatType) rand()) / RAND_MAX;
271 for (uint64_t i = 0; i < elementsD; i++)
272 D[i] = ((floatType) rand()) / RAND_MAX;
273 }
274
275 // Broadcast input data to all ranks
276 HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) );
277 HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) );
278 HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) );
279 HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) );
280
281 // Copy data to GPU
282 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
283 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
284 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
285 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
286
287 if(verbose)
288 printf("Allocated GPU memory for data, and initialize data\n");
Once the MPI service has been initialized and the cuTensorNet library handle has subsequently been created, one can activate distributed parallel execution by calling cutensornetDistributedResetConfiguration(). Per standard practice, the user's code needs to create a duplicate MPI communicator via MPI_Comm_dup. The duplicate MPI communicator is then associated with the cuTensorNet library handle by passing a pointer to it, together with its size (in bytes), to the cutensornetDistributedResetConfiguration() call. The MPI communicator will be stored inside the cuTensorNet library handle such that all subsequent calls to the tensor network contraction path finder and tensor network contraction executor will be parallelized across all participating MPI processes (each MPI process is associated with its own GPU).
342 /*******************************
343 * Activate distributed (parallel) execution prior to
344 * calling contraction path finder and contraction executor
345 *******************************/
346 // HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) ); // resets back to serial execution
347 MPI_Comm cutnComm;
348 HANDLE_MPI_ERROR( MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm) ); // duplicate MPI communicator
349 HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) );
350 if(verbose)
351 printf("Reset distributed MPI configuration\n");
Note
cutensornetDistributedResetConfiguration() is a collective call that must be executed by all participating MPI processes.
The API of this distributed parallelization model makes it straightforward to run source code written for serial execution on multiple GPUs/nodes. Essentially, all MPI processes will execute exactly the same (serial) source code while automatically performing distributed parallelization inside the tensor network contraction path finder and tensor network contraction executor calls. The parallelization of the tensor network contraction path finder will only occur when the number of requested hyper-samples is larger than zero. Regardless of that, however, activation of the distributed parallelization must precede the invocation of the tensor network contraction path finder. That is, the tensor network contraction path finder and tensor network contraction execution invocations must be made strictly after activating the distributed parallelization via cutensornetDistributedResetConfiguration().
When the distributed configuration is set to a parallel mode, the user is normally expected to invoke tensor network contraction execution by calling the cutensornetContractSlices() function, providing it with the full range of tensor network slices, which will be automatically distributed across all MPI processes. Since a tensor network must be sufficiently large to benefit from distributed execution, smaller tensor networks (those consisting of only a single slice) can still be processed without distributed parallelization, which is achieved by calling cutensornetDistributedResetConfiguration() with a NULL argument in place of the MPI communicator pointer (as before, this should be done prior to calling the tensor network contraction path finder). That is, the switch between distributed parallelization and redundant serial execution can be made on a per-tensor-network basis. Users can decide which (larger) tensor networks to process in parallel and which (smaller) ones to process serially and redundantly, by resetting the distributed configuration appropriately. In both cases, all MPI processes will produce the same output tensor (result) at the end of the tensor network execution, as sketched below.
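A minimal sketch of such per-network switching, reusing the handle and the duplicated communicator from this example, might look as follows; both calls are collective and must precede the corresponding path-finder invocation.
// Attach the duplicated MPI communicator: subsequent path finding and contraction
// of the (large, multi-slice) network are distributed across all MPI processes.
HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) );
// ... find the contraction path and contract the large tensor network ...

// Reset back to serial mode: the next (small, single-slice) network is processed
// redundantly on every MPI process, producing the same result everywhere.
HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) );
// ... find the contraction path and contract the small tensor network ...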
Note
In the current version of the cuTensorNet library, the parallel tensor network contraction execution triggered by the cutensornetContractSlices() call will block the provided CUDA stream as well as the calling CPU thread until the execution has completed on all MPI processes. This is a temporary limitation that will be lifted in future versions of the cuTensorNet library, where the call to cutensornetContractSlices() will be fully asynchronous, similar to the serial execution case. Additionally, for an explicit synchronization of all MPI processes (barrier), one can make a collective call to cutensornetDistributedSynchronize().
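For instance, one might insert such a barrier before host-side post-processing of the result; a minimal sketch, reusing the handle from this example, is shown below (the call is collective, so every rank must reach it).
// Explicit barrier across all MPI processes attached to the library handle.
HANDLE_ERROR( cutensornetDistributedSynchronize(handle) );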
Before termination, the MPI service needs to be finalized.
561 // Shut down MPI service
562 HANDLE_MPI_ERROR( MPI_Finalize() );
The complete MPI-automatic sample can be found in the NVIDIA/cuQuantum repository.
Code example (manual slice-based distributed parallelization)¶
For advanced users, it is also possible (but more involved) to adapt Code example (serial) to explicitly parallelize execution of the tensor network contraction operation on multiple GPU devices. Here we will also use MPI as the communication layer. For brevity, we will show only the changes that need to be made on top of the serial example. The full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.
First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.
20#include <mpi.h>
44#define HANDLE_MPI_ERROR(x) \
45{ const auto err = x; \
46 if( err != MPI_SUCCESS ) \
47 { char error[MPI_MAX_ERROR_STRING]; int len; \
48 MPI_Error_string(err, error, &len); \
49 printf("MPI Error: %s in line %d\n", error, __LINE__); \
50 fflush(stdout); \
51 MPI_Abort(MPI_COMM_WORLD, err); \
52 } \
53};
97 // Initialize MPI
98 HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) );
99 int rank {-1};
100 HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
101 int numProcs {0};
102 HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );
122 // Set GPU device based on ranks and nodes
123 int numDevices {0};
124 HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
125 const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
126 HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
127 cudaDeviceProp prop;
128 HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.
258 /*******************
259 * Initialize data
260 *******************/
261
262 memset(R, 0, sizeof(floatType) * elementsR);
263 if(rank == 0)
264 {
265 for (uint64_t i = 0; i < elementsA; i++)
266 A[i] = ((floatType) rand()) / RAND_MAX;
267 for (uint64_t i = 0; i < elementsB; i++)
268 B[i] = ((floatType) rand()) / RAND_MAX;
269 for (uint64_t i = 0; i < elementsC; i++)
270 C[i] = ((floatType) rand()) / RAND_MAX;
271 for (uint64_t i = 0; i < elementsD; i++)
272 D[i] = ((floatType) rand()) / RAND_MAX;
273 }
274
275 // Broadcast input data to all ranks
276 HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) );
277 HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) );
278 HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) );
279 HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) );
280
281 // Copy data to GPU
282 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
283 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
284 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
285 HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
286
287 if(verbose)
288 printf("Allocated GPU memory for data, and initialize data\n");
Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.
Next, we find the optimal contraction path and slicing combination for our tensor network. We will run the cuTensorNet optimizer on all processes and determine which process has the best path in terms of the FLOP count. We will then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all other processes. Each process now has the same optimizer info object, which we use to calculate the share of slices for each process.
363 // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
364 // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
365 // tensor networks. For large networks, hyper-optimization does become beneficial.
366
367 // Enforce tensor network slicing (for parallelization)
368 const int32_t min_slices = numProcs;
369 HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle,
370 optimizerConfig,
371 CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
372 &min_slices,
373 sizeof(min_slices)) );
374
375 // Find an optimized tensor network contraction path on each MPI process
376 HANDLE_ERROR( cutensornetContractionOptimize(handle,
377 descNet,
378 optimizerConfig,
379 workspaceLimit,
380 optimizerInfo) );
381
382 // Query the obtained Flop count
383 double flops{-1.};
384 HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(handle,
385 optimizerInfo,
386 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
387 &flops,
388 sizeof(flops)) );
389
390 // Choose the contraction path with the lowest Flop cost
391 struct {
392 double value;
393 int rank;
394 } in{flops, rank}, out;
395 HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) );
396 const int sender = out.rank;
397 flops = out.value;
398
399 if (verbose)
400 printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);
401
402 // Get the buffer size for optimizerInfo and broadcast it
403 size_t bufSize {0};
404 if (rank == sender)
405 {
406 HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) );
407 }
408 HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) );
409
410 // Allocate a buffer
411 std::vector<char> buffer(bufSize);
412
413 // Pack optimizerInfo on sender and broadcast it
414 if (rank == sender)
415 {
416 HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) );
417 }
418 HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) );
419
420 // Unpack optimizerInfo from the buffer
421 if (rank != sender)
422 {
423 HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) );
424 }
425
426 // Query the number of slices the tensor network execution will be split into
427 int64_t numSlices = 0;
428 HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
429 handle,
430 optimizerInfo,
431 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
432 &numSlices,
433 sizeof(numSlices)) );
434 assert(numSlices > 0);
435
436 // Calculate each process's share of the slices
437 int64_t procChunk = numSlices / numProcs;
438 int extra = numSlices % numProcs;
439 int procSliceBegin = rank * procChunk + std::min(rank, extra);
440 int procSliceEnd = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);
We now create the workspace descriptor and allocate workspace memory as described in Create workspace descriptor and allocate workspace memory, and then create and auto-tune the contraction plan as described in Contraction plan and auto-tune.
Next, on each process, we create a slice group (see cutensornetSliceGroup_t
) that corresponds to its share of the tensor network slices.
We then provide this slice group object to the cutensornetContractSlices()
function to get a partial contraction result on each process.
530 // Create a cutensornetSliceGroup_t object from a range of slice IDs
531 cutensornetSliceGroup_t sliceGroup{};
532 HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) );
553 HANDLE_ERROR( cutensornetContractSlices(handle,
554 plan,
555 rawDataIn_d,
556 R_d,
557 accumulateOutput,
558 workDesc,
559 sliceGroup,
560 stream) );
Finally, we sum up the partial contributions to obtain the result of the tensor network contraction.
566 // Perform Allreduce operation on the output tensor
567 HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) );
568 HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host
569 HANDLE_MPI_ERROR( MPI_Allreduce(MPI_IN_PLACE, R, elementsR, floatTypeMPI, MPI_SUM, MPI_COMM_WORLD) );
Before termination, the MPI service needs to be finalized.
631 // Shut down MPI service
632 HANDLE_MPI_ERROR( MPI_Finalize() );
The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.
Code example (tensorQR)¶
Code example (tensorSVD)¶
Code example (GateSplit)¶
Code example (MPS factorization)¶
Code example (intermediate tensor reuse)¶
- Caching/Reusing constant intermediate tensors
- Headers and data types
- Define tensor network and tensor sizes
- Allocate memory, initialize data, initialize cuTensorNet handle
- Mark constant tensors and create the network descriptor
- Contraction order and slicing
- Create workspace descriptor and allocate workspace memory
- Contraction plan and auto-tune
- Tensor network contraction execution
- Free resources
Code example (gradients computation)¶
- Computing gradients via backward propagation
- Headers and data types
- Define tensor network and tensor sizes
- Allocate memory, initialize data, initialize cuTensorNet handle
- Create the network descriptor and set gradient tensor IDs
- Contraction order
- Create workspace descriptor and allocate workspace memory
- Contraction plan and auto-tune
- Tensor network contraction execution and gradient computation
- Free resources
Code example (amplitudes slice)¶
- Computing tensor network state amplitudes
- Headers and error handling
- Define the tensor network state and the desired slice of state amplitudes
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate the amplitudes slice tensor on GPU
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Create the state amplitudes accessor
- Configure the state amplitudes accessor
- Prepare the computation of the amplitudes slice tensor
- Set up the workspace
- Compute the specified slice of state amplitudes
- Free resources
Code example (expectation value)¶
- Computing tensor network state expectation value
- Headers and error handling
- Define the tensor network state
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Construct a tensor network operator
- Create the expectation value object
- Configure the expectation value calculation
- Prepare the expectation value calculation
- Set up the workspace
- Compute the requested expectation value
- Free resources
Code example (marginal distribution)¶
- Computing tensor network state marginal distribution tensor
- Headers and error handling
- Define the tensor network state and the desired marginal distribution tensor
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate the marginal distribution tensor on GPU
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Create the marginal distribution object
- Configure the marginal distribution object
- Prepare the computation of the marginal distribution tensor
- Set up the workspace
- Compute the marginal distribution tensor
- Free resources
Code example (tensor network sampling)¶
- Sampling the tensor network state
- Headers and error handling
- Define the tensor network state and the desired number of output samples to generate
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Create a pure tensor network state
- Apply quantum gates
- Create the tensor network state sampler
- Configure the tensor network state sampler
- Prepare the tensor network state sampler
- Allocate the workspace buffer on GPU and setup the workspace
- Perform sampling of the final quantum circuit state
- Free resources
Code example (MPS amplitudes slice using simple update)¶
- Computing Matrix Product State (MPS) Amplitudes
- Headers and error handling
- Define the tensor network state and the desired slice of state amplitudes
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate MPS tensors
- Allocate the amplitudes slice tensor on GPU
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Request MPS factorization for the final quantum circuit state
- Configure MPS factorization procedure
- Prepare the computation of MPS factorization
- Compute MPS factorization
- Create the state amplitudes accessor
- Configure the state amplitudes accessor
- Prepare the computation of the amplitudes slice tensor
- Set up the workspace
- Compute the specified slice of state amplitudes
- Free resources
Code example (MPS expectation value)¶
- Computing Matrix Product State expectation value
- Headers and error handling
- Define the tensor network state
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate MPS tensors
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Request MPS factorization for the final quantum circuit state
- Configure MPS factorization procedure
- Prepare the computation of MPS factorization
- Compute MPS factorization
- Construct a tensor network operator
- Create the expectation value object
- Configure the expectation value calculation
- Prepare the expectation value calculation
- Set up the workspace
- Compute the requested expectation value
- Free resources
Code example (MPS marginal distribution)¶
- Computing Matrix Product State marginal distribution tensor
- Headers and error handling
- Define the tensor network state and the desired marginal distribution tensor
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate MPS tensors
- Allocate the marginal distribution tensor on GPU
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Request MPS factorization for the final quantum circuit state
- Configure MPS factorization procedure
- Prepare the computation of MPS factorization
- Compute MPS factorization
- Create the marginal distribution object
- Configure the marginal distribution object
- Prepare the computation of the marginal distribution tensor
- Set up the workspace
- Compute the marginal distribution tensor
- Free resources
Code example (MPS sampling)¶
- Sampling the Matrix Product State
- Headers and error handling
- Define the tensor network state and the desired number of output samples to generate
- Initialize the cuTensorNet library handle
- Define quantum gates on GPU
- Allocate MPS tensors
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Request MPS factorization for the final quantum circuit state
- Configure MPS factorization procedure
- Prepare the computation of MPS factorization
- Compute MPS factorization
- Create the tensor network state sampler
- Configure the tensor network state sampler
- Prepare the tensor network state sampler
- Set up the workspace
- Perform sampling of the final quantum circuit state
- Free resources
Code example (MPS sampling QFT)¶
- Sampling the Matrix Product State (QFT Circuit)
- Headers and error handling
- Define the tensor network state and the desired number of output samples to generate
- Initialize the cuTensorNet library handle
- Define quantum gates in GPU memory
- Allocate MPS tensors in GPU memory
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Apply quantum gates
- Request MPS factorization for the final quantum circuit state
- Configure MPS factorization procedure
- Prepare the computation of MPS factorization
- Compute MPS factorization
- Create the tensor network state sampler
- Configure the tensor network state sampler
- Prepare the tensor network state sampler
- Set up the workspace
- Perform sampling of the final quantum circuit state
- Free resources
Code example (MPS sampling MPO)¶
- Sampling the Matrix Product State (circuit with Matrix Product Operators)
- Headers and error handling
- Define the tensor network state and the desired number of output samples to generate
- Initialize the cuTensorNet library handle
- Define and allocate MPO tensors
- Define and allocate MPS tensors
- Allocate the scratch buffer on GPU
- Create a pure tensor network state
- Construct necessary MPO tensor network operators
- Apply MPO tensor network operators to the quantum circuit state
- Request MPS factorization for the final quantum circuit state
- Configure MPS factorization procedure
- Prepare the computation of MPS factorization
- Compute MPS factorization
- Create the tensor network state sampler
- Configure the tensor network state sampler
- Prepare the tensor network state sampler
- Set up the workspace
- Perform sampling of the final quantum circuit state
- Free resources
Useful tips¶
For debugging, one can set the environment variable CUTENSORNET_LOG_LEVEL=n. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout.
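Equivalently, logging can be enabled programmatically. The short sketch below only uses cutensornetLoggerSetLevel(), which is referenced above; the meaning of each level value is described in that function's documentation.
#include <cutensornet.h>

// Sketch: enable verbose cuTensorNet logging from within the application
// instead of via the CUTENSORNET_LOG_LEVEL environment variable.
void enableCutensornetTrace()
{
    // Level 5 is the highest level listed above; lower values reduce verbosity.
    cutensornetLoggerSetLevel(5);
}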