Examples

In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to compile the sample code. Then, we present example code that walks through the common steps of using cuTensorNet. In the example, we perform the following tensor contraction:

\[R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}\]

We build up the code step by step, with each step appending code to the end of the previous one. The steps are separated by succinct multi-line comment blocks.

We recommend that readers refer to the Overview and the cuTENSOR documentation to become familiar with the nomenclature and with cuTENSOR operations.

Compiling Code

Assuming cuQuantum has been extracted in CUQUANTUM_ROOT and cuTENSOR in CUTENSOR_ROOT, we update the library path as follows:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ${CUTENSOR_ROOT}/lib/11.0).

The serial sample code discussed below (tensornet_example.cu) can be compiled via the following command:

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example

For static linking against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be explicitly linked against, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example

To build the parallel (MPI) versions of the examples (tensornet_example_mpi_auto.cu and tensornet_example_mpi.cu), an MPI library must be installed (e.g., a recent Open MPI, MVAPICH, or MPICH). In particular, the automatic parallel example requires CUDA-aware MPI; see Code Example (Automatic Slice-Based Distributed Parallelization) below. In either case, add -I${MPI_PATH}/include and -L${MPI_PATH}/lib -lmpi to the build command:

nvcc tensornet_example_mpi_auto.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi_auto
nvcc tensornet_example_mpi.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi
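
The resulting binaries can then be launched with the MPI launcher of your installation; for example (a sketch, with the launcher name and the number of processes depending on your setup):

mpiexec -n 2 ./tensornet_example_mpi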

Warning

When running tensornet_example_mpi_auto without a CUDA-aware MPI library, the program will crash.
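
For Open MPI, one way to check whether the installed library was built with CUDA support is the following (a hedged sketch; other MPI implementations provide their own mechanisms):

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value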

Note

Depending on the source of the cuQuantum package, you may need to replace lib above with lib64.

Code Example (Serial)

The following code example illustrates the common steps necessary to use cuTensorNet and introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository.

Headers and data types

  8#include <stdlib.h>
  9#include <stdio.h>
 10
 11#include <unordered_map>
 12#include <vector>
 13#include <cassert>
 14
 15#include <cuda_runtime.h>
 16#include <cutensornet.h>
 17
 18
 19#define HANDLE_ERROR(x)                                           \
 20{ const auto err = x;                                             \
 21  if( err != CUTENSORNET_STATUS_SUCCESS )                         \
 22  { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \
 23    fflush(stdout);                                               \
 24  }                                                               \
 25};
 26
 27#define HANDLE_CUDA_ERROR(x)                                      \
 28{ const auto err = x;                                             \
 29  if( err != cudaSuccess )                                        \
 30  { printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \
 31    fflush(stdout);                                               \
 32  }                                                               \
 33};
 34
 35
 36struct GPUTimer
 37{
 38    GPUTimer(cudaStream_t stream): stream_(stream)
 39    {
 40        HANDLE_CUDA_ERROR(cudaEventCreate(&start_));
 41        HANDLE_CUDA_ERROR(cudaEventCreate(&stop_));
 42    }
 43
 44    ~GPUTimer()
 45    {
 46        HANDLE_CUDA_ERROR(cudaEventDestroy(start_));
 47        HANDLE_CUDA_ERROR(cudaEventDestroy(stop_));
 48    }
 49
 50    void start()
 51    {
 52        HANDLE_CUDA_ERROR(cudaEventRecord(start_, stream_));
 53    }
 54
 55    float seconds()
 56    {
 57        HANDLE_CUDA_ERROR(cudaEventRecord(stop_, stream_));
 58        HANDLE_CUDA_ERROR(cudaEventSynchronize(stop_));
 59        float time;
 60        HANDLE_CUDA_ERROR(cudaEventElapsedTime(&time, start_, stop_));
 61        return time * 1e-3;
 62    }
 63
 64    private:
 65    cudaEvent_t start_, stop_;
 66    cudaStream_t stream_;
 67};
 68
 69
 70int main()
 71{
 72   static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!");
 73
 74   bool verbose = true;
 75
 76   // Check cuTensorNet version
 77   const size_t cuTensornetVersion = cutensornetGetVersion();
 78   if(verbose)
 79      printf("cuTensorNet version: %ld\n", cuTensornetVersion);
 80
 81   // Set GPU device
 82   int numDevices {0};
 83   HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
 84   const int deviceId = 0;
 85   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
 86   cudaDeviceProp prop;
 87   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
 88
 89   if(verbose) {
 90      printf("===== device info ======\n");
 91      printf("GPU-name:%s\n", prop.name);
 92      printf("GPU-clock:%d\n", prop.clockRate);
 93      printf("GPU-memoryClock:%d\n", prop.memoryClockRate);
 94      printf("GPU-nSM:%d\n", prop.multiProcessorCount);
 95      printf("GPU-major:%d\n", prop.major);
 96      printf("GPU-minor:%d\n", prop.minor);
 97      printf("========================\n");
 98   }
 99
100   typedef float floatType;
101   cudaDataType_t typeData = CUDA_R_32F;
102   cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;
103
104   if(verbose)
105      printf("Included headers and defined data types\n");

Define tensor network and tensor sizes

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

109   /**********************
110   * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}
111   **********************/
112
113   constexpr int32_t numInputs = 4;
114
115   // Create vectors of tensor modes
116   std::vector<int32_t> modesA{'a','b','c','d','e','f'};
117   std::vector<int32_t> modesB{'b','g','h','e','i','j'};
118   std::vector<int32_t> modesC{'m','a','g','f','i','k'};
119   std::vector<int32_t> modesD{'l','c','h','d','j','m'};
120   std::vector<int32_t> modesR{'k','l'};
121
122   // Set mode extents
123   std::unordered_map<int32_t, int64_t> extent;
124   extent['a'] = 16;
125   extent['b'] = 16;
126   extent['c'] = 16;
127   extent['d'] = 16;
128   extent['e'] = 16;
129   extent['f'] = 16;
130   extent['g'] = 16;
131   extent['h'] = 16;
132   extent['i'] = 16;
133   extent['j'] = 16;
134   extent['k'] = 16;
135   extent['l'] = 16;
136   extent['m'] = 16;
137
138   // Create a vector of extents for each tensor
139   std::vector<int64_t> extentA;
140   for (auto mode : modesA)
141      extentA.push_back(extent[mode]);
142   std::vector<int64_t> extentB;
143   for (auto mode : modesB)
144      extentB.push_back(extent[mode]);
145   std::vector<int64_t> extentC;
146   for (auto mode : modesC)
147      extentC.push_back(extent[mode]);
148   std::vector<int64_t> extentD;
149   for (auto mode : modesD)
150      extentD.push_back(extent[mode]);
151   std::vector<int64_t> extentR;
152   for (auto mode : modesR)
153      extentR.push_back(extent[mode]);
154
155   if(verbose)
156      printf("Defined tensor network, modes, and extents\n");

Allocate memory and initialize data

Next, we allocate memory for the tensor network operands and initialize them to random values.

159   /**********************
160   * Allocating data
161   **********************/
162
163   size_t elementsA = 1;
164   for (auto mode : modesA)
165      elementsA *= extent[mode];
166   size_t elementsB = 1;
167   for (auto mode : modesB)
168      elementsB *= extent[mode];
169   size_t elementsC = 1;
170   for (auto mode : modesC)
171      elementsC *= extent[mode];
172   size_t elementsD = 1;
173   for (auto mode : modesD)
174      elementsD *= extent[mode];
175   size_t elementsR = 1;
176   for (auto mode : modesR)
177      elementsR *= extent[mode];
178
179   size_t sizeA = sizeof(floatType) * elementsA;
180   size_t sizeB = sizeof(floatType) * elementsB;
181   size_t sizeC = sizeof(floatType) * elementsC;
182   size_t sizeD = sizeof(floatType) * elementsD;
183   size_t sizeR = sizeof(floatType) * elementsR;
184   if(verbose)
185      printf("Total GPU memory used for tensor storage: %.2f GiB\n",
186             (sizeA + sizeB + sizeC + sizeD + sizeR) / 1024. /1024. / 1024);
187
188   void* rawDataIn_d[numInputs];
189   void* R_d;
190   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) );
191   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) );
192   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) );
193   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[3], sizeD) );
194   HANDLE_CUDA_ERROR( cudaMalloc((void**) &R_d, sizeR));
195
196   floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA);
197   floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB);
198   floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC);
199   floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD);
200   floatType *R = (floatType*) malloc(sizeof(floatType) * elementsR);
201
202   if (A == NULL || B == NULL || C == NULL || D == NULL || R == NULL)
203   {
204      printf("Error: Host memory allocation failed!\n");
205      return -1;
206   }
207
208   /*******************
209   * Initialize data
210   *******************/
211
212   memset(R, 0, sizeof(floatType) * elementsR);
213   for (uint64_t i = 0; i < elementsA; i++)
214      A[i] = ((floatType) rand()) / RAND_MAX;
215   for (uint64_t i = 0; i < elementsB; i++)
216      B[i] = ((floatType) rand()) / RAND_MAX;
217   for (uint64_t i = 0; i < elementsC; i++)
218      C[i] = ((floatType) rand()) / RAND_MAX;
219   for (uint64_t i = 0; i < elementsD; i++)
220      D[i] = ((floatType) rand()) / RAND_MAX;
221
222   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
223   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
224   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
225   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
226
227   if(verbose)
228      printf("Allocated GPU memory for data, and initialized data\n");

cuTensorNet handle and network descriptor

Next, we initialize the cuTensorNet library via cutensornetCreate() and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types. Note that the created library context will be associated with the currently active GPU.

231   /*************************
232   * cuTensorNet
233   *************************/
234
235   cudaStream_t stream;
236   HANDLE_CUDA_ERROR( cudaStreamCreate(&stream) );
237
238   cutensornetHandle_t handle;
239   HANDLE_ERROR( cutensornetCreate(&handle) );
240
241   const int32_t nmodeA = modesA.size();
242   const int32_t nmodeB = modesB.size();
243   const int32_t nmodeC = modesC.size();
244   const int32_t nmodeD = modesD.size();
245   const int32_t nmodeR = modesR.size();
246
247   /*******************************
248   * Create Network Descriptor
249   *******************************/
250
251   const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data(), modesD.data()};
252   int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC, nmodeD};
253   const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data(), extentD.data()};
254   const int64_t* stridesIn[] = {NULL, NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout
255
256   // Set up tensor network
257   cutensornetNetworkDescriptor_t descNet;
258   HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle,
259                     numInputs, numModesIn, extentsIn, stridesIn, modesIn, NULL,
260                     nmodeR, extentR.data(), /*stridesOut = */NULL, modesR.data(),
261                     typeData, typeCompute,
262                     &descNet) );
263
264   if(verbose)
265      printf("Initialized the cuTensorNet library and created a tensor network descriptor\n");

Optimal contraction order and slicing

At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available memory resources, and provide it to the optimizer as a constraint. We then create an optimizer configuration object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is invoked via cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.

268   /*******************************
269   * Choose workspace limit based on available resources.
270   *******************************/
271
272   size_t freeMem, totalMem;
273   HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) );
274   uint64_t workspaceLimit = (uint64_t)((double)freeMem * 0.9);
275   if(verbose)
276      printf("Workspace limit = %lu\n", workspaceLimit);
277
278   /*******************************
279   * Find "optimal" contraction order and slicing
280   *******************************/
281
282   cutensornetContractionOptimizerConfig_t optimizerConfig;
283   HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) );
284
285   // Set the desired number of hyper-samples (defaults to 0)
286   int32_t num_hypersamples = 8;
287   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle,
288                     optimizerConfig,
289                     CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES,
290                     &num_hypersamples,
291                     sizeof(num_hypersamples)) );
292
293   // Create contraction optimizer info and find an optimized contraction path
294   cutensornetContractionOptimizerInfo_t optimizerInfo;
295   HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) );
296
297   HANDLE_ERROR( cutensornetContractionOptimize(handle,
298                                             descNet,
299                                             optimizerConfig,
300                                             workspaceLimit,
301                                             optimizerInfo) );
302
303   // Query the number of slices the tensor network execution will be split into
304   int64_t numSlices = 0;
305   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
306                  handle,
307                  optimizerInfo,
308                  CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
309                  &numSlices,
310                  sizeof(numSlices)) );
311   assert(numSlices > 0);
312
313   if(verbose)
314      printf("Found an optimized contraction path using cuTensorNet optimizer\n");

It is also possible to bypass the cuTensorNet optimizer and import a pre-determined contraction path, as well as slicing information, directly to the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute().
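
For illustration, below is a minimal sketch of importing a user-provided contraction path into the optimizer info object instead of calling cutensornetContractionOptimize(). The concrete pair values are hypothetical, and the pairs are assumed to follow the numpy.einsum_path-style convention described for cutensornetContractionPath_t; consult that documentation before relying on this sketch.

   // Hypothetical user-provided path for the 4-tensor network above.
   // Assumption: each pair contracts the two listed operands, which are then
   // removed from the operand list, and the intermediate is appended at its end
   // (numpy.einsum_path convention).
   cutensornetNodePair_t pairs[] = { {0, 1}, {0, 1}, {0, 1} };
   cutensornetContractionPath_t path;
   path.numContractions = numInputs - 1;
   path.data            = pairs;

   HANDLE_ERROR( cutensornetContractionOptimizerInfoSetAttribute(handle,
                     optimizerInfo,
                     CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                     &path,
                     sizeof(path)) );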

Create workspace descriptor and allocate workspace memory

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.

317   /*******************************
318   * Create workspace descriptor, allocate workspace, and set it.
319   *******************************/
320
321   cutensornetWorkspaceDescriptor_t workDesc;
322   HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) );
323
324   int64_t requiredWorkspaceSize = 0;
325   HANDLE_ERROR( cutensornetWorkspaceComputeContractionSizes(handle,
326                                                         descNet,
327                                                         optimizerInfo,
328                                                         workDesc) );
329
330   HANDLE_ERROR( cutensornetWorkspaceGetMemorySize(handle,
331                                                   workDesc,
332                                                   CUTENSORNET_WORKSIZE_PREF_MIN,
333                                                   CUTENSORNET_MEMSPACE_DEVICE,
334                                                   CUTENSORNET_WORKSPACE_SCRATCH,
335                                                   &requiredWorkspaceSize) );
336
337   void* work = nullptr;
338   HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) );
339
340   HANDLE_ERROR( cutensornetWorkspaceSetMemory(handle,
341                                               workDesc,
342                                               CUTENSORNET_MEMSPACE_DEVICE,
343                                               CUTENSORNET_WORKSPACE_SCRATCH,
344                                               work,
345                                               requiredWorkspaceSize) );
346
347   if(verbose)
348      printf("Allocated and set up the GPU workspace\n");

Contraction plan and auto-tune

We create a tensor network contraction plan holding all pairwise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan such that cuTENSOR selects the best kernel for each pairwise contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing this plan redundantly.

351   /*******************************
352   * Initialize the pairwise contraction plan (for cuTENSOR).
353   *******************************/
354
355   cutensornetContractionPlan_t plan;
356   HANDLE_ERROR( cutensornetCreateContractionPlan(handle,
357                                                descNet,
358                                                optimizerInfo,
359                                                workDesc,
360                                                &plan) );
361
362   /*******************************
363   * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel
364   *           for each pairwise tensor contraction.
365   *******************************/
366   cutensornetContractionAutotunePreference_t autotunePref;
367   HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle,
368                                                      &autotunePref) );
369
370   const int numAutotuningIterations = 5; // may be 0
371   HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute(
372                           handle,
373                           autotunePref,
374                           CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS,
375                           &numAutotuningIterations,
376                           sizeof(numAutotuningIterations)) );
377
378   // Modify the plan again to find the best pair-wise contractions
379   HANDLE_ERROR( cutensornetContractionAutotune(handle,
380                                                plan,
381                                                rawDataIn_d,
382                                                R_d,
383                                                workDesc,
384                                                autotunePref,
385                                                stream) );
386
387   HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) );
388
389   if(verbose)
390      printf("Created a contraction plan for cuTensorNet and optionally auto-tuned it\n");

Tensor network contraction execution

Finally, we contract the tensor network as many times as needed, possibly with different input each time. Tensor network slices, captured as a cutensornetSliceGroup_t object, are computed using the same contraction plan. For convenience, NULL can be provided to the cutensornetContractSlices() function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free allocated resources.

393   /**********************
394   * Execute the tensor network contraction
395   **********************/
396
397   // Create a cutensornetSliceGroup_t object from a range of slice IDs
398   cutensornetSliceGroup_t sliceGroup{};
399   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) );
400
401   GPUTimer timer {stream};
402   double minTimeCUTENSORNET = 1e100;
403   const int numRuns = 3; // number of repeats to get stable performance results
404   for (int i = 0; i < numRuns; ++i)
405   {
406      HANDLE_CUDA_ERROR( cudaMemcpy(R_d, R, sizeR, cudaMemcpyHostToDevice) ); // restore the output tensor on GPU
407      HANDLE_CUDA_ERROR( cudaDeviceSynchronize() );
408
409      /*
410      * Contract all slices of the tensor network
411      */
412      timer.start();
413
414      int32_t accumulateOutput = 0; // output tensor data will be overwritten
415      HANDLE_ERROR( cutensornetContractSlices(handle,
416                     plan,
417                     rawDataIn_d,
418                     R_d,
419                     accumulateOutput,
420                     workDesc,
421                     sliceGroup, // alternatively, NULL can also be used to contract over all slices instead of specifying a sliceGroup object
422                     stream) );
423
424      // Synchronize and measure best timing
425      auto time = timer.seconds();
426      minTimeCUTENSORNET = (time > minTimeCUTENSORNET) ? minTimeCUTENSORNET : time;
427   }
428
429   if(verbose)
430      printf("Contracted the tensor network, each slice used the same contraction plan\n");
431
432   // Print the 1-norm of the output tensor (verification)
433   HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) );
434   HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host
435   double norm1 = 0.0;
436   for (int64_t i = 0; i < elementsR; ++i) {
437      norm1 += std::abs(R[i]);
438   }
439   if(verbose)
440      printf("Computed the 1-norm of the output tensor: %e\n", norm1);
441
442   /*************************/
443
444   // Query the total Flop count for the tensor network contraction
445   double flops {0.0};
446   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
447                     handle,
448                     optimizerInfo,
449                     CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
450                     &flops,
451                     sizeof(flops)) );
452
453   if(verbose) {
454      printf("Number of tensor network slices = %ld\n", numSlices);
455      printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSORNET * 1000.f);
456   }
457
458   // Free cuTensorNet resources
459   HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) );
460   HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) );
461   HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) );
462   HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) );
463   HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) );
464   HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) );
465   HANDLE_ERROR( cutensornetDestroy(handle) );
466
467   // Free Host memory resources
468   if (R) free(R);
469   if (D) free(D);
470   if (C) free(C);
471   if (B) free(B);
472   if (A) free(A);
473
474   // Free GPU memory resources
475   if (work) cudaFree(work);
476   if (R_d) cudaFree(R_d);
477   if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]);
478   if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]);
479   if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]);
480   if (rawDataIn_d[3]) cudaFree(rawDataIn_d[3]);
481
482   if(verbose)
483      printf("Freed resources and exited\n");
484
485   return 0;
486}

Recall that the full sample code can be found in the NVIDIA/cuQuantum repository.

Code Example (Automatic Slice-Based Distributed Parallelization)

It is straightforward to adapt Code Example (Serial) and enable automatic parallel execution across multiple GPU devices, potentially spanning many nodes. We illustrate this with an example that uses the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions needed to enable distributed parallel execution without changing the original serial source code. The full MPI-automatic sample code can be found in the NVIDIA/cuQuantum repository. To enable automatic parallelism, cuTensorNet requires that

  • the environment variable $CUTENSORNET_COMM_LIB is set to the path to the wrapper shared library libcutensornet_distributed_interface_mpi.so, and

  • the executable is linked to a CUDA-aware MPI library

Detailed instructions for setting these up are given in the installation guide above.
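
For example, the environment variable could be set as follows (assuming the MPI wrapper library has been built under the distributed_interfaces directory of the cuQuantum installation; adjust the path to your setup):

export CUTENSORNET_COMM_LIB=${CUQUANTUM_ROOT}/distributed_interfaces/libcutensornet_distributed_interface_mpi.so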

First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and assign a unique GPU device to each MPI process; this device will later be associated with the cuTensorNet library handle created inside that MPI process.

20#include <mpi.h>
44#define HANDLE_MPI_ERROR(x)                                       \
45{ const auto err = x;                                             \
46  if( err != MPI_SUCCESS )                                        \
47  { char error[MPI_MAX_ERROR_STRING]; int len;                    \
48    MPI_Error_string(err, error, &len);                           \
49    printf("MPI Error: %s in line %d\n", error, __LINE__);        \
50    fflush(stdout);                                               \
51    MPI_Abort(MPI_COMM_WORLD, err);                               \
52  }                                                               \
53};

The MPI service initialization must precede the first cutensornetCreate() call which creates a cuTensorNet library handle. An attempt to call cutensornetCreate() before initializing the MPI service will result in an error.

 97   // Initialize MPI
 98   HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) );
 99   int rank {-1};
100   HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
101   int numProcs {0};
102   HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );

If multiple GPU devices on the same node are visible to an MPI process, we need to pick an exclusive GPU device for each MPI process. If the mpirun (or mpiexec) command of your MPI implementation exposes an environment variable holding the rank of the invoked MPI process, you can use it to set CUDA_VISIBLE_DEVICES to a single GPU device assigned exclusively to that process (for example, Open MPI provides ${OMPI_COMM_WORLD_LOCAL_RANK} for this purpose; a launch-command sketch is given after the code below). Otherwise, the GPU device can be set manually, as shown below.

122   // Set GPU device based on ranks and nodes
123   int numDevices {0};
124   HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
125   const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
126   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
127   cudaDeviceProp prop;
128   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
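
For reference, a launch-command sketch for the environment-variable approach mentioned above with Open MPI (assuming a bash-compatible shell and four processes; adjust to your launcher) could look like:

mpirun -np 4 bash -c 'CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK} ./tensornet_example_mpi_auto'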

Next we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

253   /*******************
254   * Initialize data
255   *******************/
256
257   memset(R, 0, sizeof(floatType) * elementsR);
258   if(rank == 0)
259   {
260      for (uint64_t i = 0; i < elementsA; i++)
261         A[i] = ((floatType) rand()) / RAND_MAX;
262      for (uint64_t i = 0; i < elementsB; i++)
263         B[i] = ((floatType) rand()) / RAND_MAX;
264      for (uint64_t i = 0; i < elementsC; i++)
265         C[i] = ((floatType) rand()) / RAND_MAX;
266      for (uint64_t i = 0; i < elementsD; i++)
267         D[i] = ((floatType) rand()) / RAND_MAX;
268   }
269
270   // Broadcast input data to all ranks
271   HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) );
272   HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) );
273   HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) );
274   HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) );
275
276   // Copy data to GPU
277   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
278   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
279   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
280   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
281
282   if(verbose)
283      printf("Allocated GPU memory for data, and initialized data\n");

Once the MPI service has been initialized and the cuTensorNet library handle has been created, distributed parallel execution can be activated by calling cutensornetDistributedResetConfiguration(). Per standard practice, the user's code creates a duplicate MPI communicator via MPI_Comm_dup. The duplicate communicator is then associated with the cuTensorNet library handle by passing a pointer to it, together with its size (in bytes), to cutensornetDistributedResetConfiguration(). The MPI communicator is stored inside the cuTensorNet library handle such that all subsequent calls to the tensor network contraction path finder and contraction executor are parallelized across all participating MPI processes (each MPI process is associated with its own GPU).

337   /*******************************
338   * Activate distributed (parallel) execution prior to
339   * calling contraction path finder and contraction executor
340   *******************************/
341   // HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) ); // resets back to serial execution
342   MPI_Comm cutnComm;
343   HANDLE_MPI_ERROR( MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm) ); // duplicate MPI communicator
344   HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) );
345   if(verbose)
346      printf("Reset distributed MPI configuration\n");

Note

cutensornetDistributedResetConfiguration() is a collective call that must be executed by all participating MPI processes.

The API of this distributed parallelization model makes it straightforward to run source code written for serial execution on multiple GPUs and nodes. Essentially, all MPI processes execute exactly the same (serial) source code, while distributed parallelization is performed automatically inside the tensor network contraction path finder and contraction executor calls. The path finder is parallelized only when the number of requested hyper-samples is larger than zero; regardless, distributed parallelization must be activated via cutensornetDistributedResetConfiguration() before the path finder is invoked, and the path finder and contraction execution calls must come strictly after that activation. When the distributed configuration is set to a parallel mode, the user is normally expected to invoke tensor network contraction execution by calling cutensornetContractSlices() with the full range of tensor network slices, which will then be distributed automatically across all MPI processes.

Since a tensor network must be sufficiently large to benefit from distributed execution, smaller tensor networks (for example, those consisting of a single slice) can still be processed without distributed parallelization by calling cutensornetDistributedResetConfiguration() with a NULL argument in place of the MPI communicator pointer (as before, this must be done prior to calling the tensor network contraction path finder). That is, the switch between distributed parallelization and redundant serial execution can be made on a per-tensor-network basis: users can process larger tensor networks in parallel and smaller ones serially (redundantly) by resetting the distributed configuration accordingly, as sketched below. In both cases, all MPI processes produce the same output tensor (result) at the end of the tensor network execution.
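
A minimal sketch of this per-network switching, reusing the cutnComm communicator duplicated earlier (the elided calls stand for the path finder and contraction invocations of each network):

   // Parallel mode: path finding and contraction of the next (large) network
   // are distributed across all MPI processes.
   HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) );
   // ... cutensornetContractionOptimize() and cutensornetContractSlices() for the large network ...

   // Serial mode: the next (small, single-slice) network is processed redundantly on every process.
   HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) );
   // ... cutensornetContractionOptimize() and cutensornetContractSlices() for the small network ...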

Note

In the current version of the cuTensorNet library, the parallel tensor network contraction execution triggered by the cutensornetContractSlices() call will block the provided CUDA stream as well as the calling CPU thread until the execution has completed on all MPI processes. This is a temporary limitation that will be lifted in future versions of the cuTensorNet library, where the call to cutensornetContractSlices() will be fully asynchronous, similar to the serial execution case. Additionally, for an explicit synchronization of all MPI processes (barrier) one can make a collective call to cutensornetDistributedSynchronize().
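
For example, an explicit synchronization of all MPI processes could look like this (a sketch; like the other distributed APIs, this is a collective call that must be executed by all participating processes):

   HANDLE_ERROR( cutensornetDistributedSynchronize(handle) );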

Before termination, the MPI service needs to be finalized.

556   // Shut down MPI service
557   HANDLE_MPI_ERROR( MPI_Finalize() );

The complete MPI-automatic sample can be found in the NVIDIA/cuQuantum repository.

Code Example (Manual Slice-Based Distributed Parallelization)

For advanced users, it is also possible (but more involved) to adapt Code Example (Serial) to explicitly parallelize execution of the tensor network contraction operation on multiple GPU devices. Here we will also use MPI as the communication layer. For brevity, we will show only the changes that need to be made on top of the serial example. The full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.

First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.

20#include <mpi.h>
44#define HANDLE_MPI_ERROR(x)                                       \
45{ const auto err = x;                                             \
46  if( err != MPI_SUCCESS )                                        \
47  { char error[MPI_MAX_ERROR_STRING]; int len;                    \
48    MPI_Error_string(err, error, &len);                           \
49    printf("MPI Error: %s in line %d\n", error, __LINE__);        \
50    fflush(stdout);                                               \
51    MPI_Abort(MPI_COMM_WORLD, err);                               \
52  }                                                               \
53};
 97   // Initialize MPI
 98   HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) );
 99   int rank {-1};
100   HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
101   int numProcs {0};
102   HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );
122   // Set GPU device based on ranks and nodes
123   int numDevices {0};
124   HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
125   const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
126   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
127   cudaDeviceProp prop;
128   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );

Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

253   /*******************
254   * Initialize data
255   *******************/
256
257   memset(R, 0, sizeof(floatType) * elementsR);
258   if(rank == 0)
259   {
260      for (uint64_t i = 0; i < elementsA; i++)
261         A[i] = ((floatType) rand()) / RAND_MAX;
262      for (uint64_t i = 0; i < elementsB; i++)
263         B[i] = ((floatType) rand()) / RAND_MAX;
264      for (uint64_t i = 0; i < elementsC; i++)
265         C[i] = ((floatType) rand()) / RAND_MAX;
266      for (uint64_t i = 0; i < elementsD; i++)
267         D[i] = ((floatType) rand()) / RAND_MAX;
268   }
269
270   // Broadcast input data to all ranks
271   HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) );
272   HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) );
273   HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) );
274   HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) );
275
276   // Copy data to GPU
277   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
278   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
279   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
280   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
281
282   if(verbose)
283      printf("Allocated GPU memory for data, and initialized data\n");

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.

Next, we find the optimal contraction path and slicing combination for our tensor network. We will run the cuTensorNet optimizer on all processes and determine which process has the best path in terms of the FLOP count. We will then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all other processes. Each process now has the same optimizer info object, which we use to calculate the share of slices for each process.

358   // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
359   // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
360   // tensor networks. For large networks, hyper-optimization does become beneficial.
361
362   // Enforce tensor network slicing (for parallelization)
363   const int32_t min_slices = numProcs;
364   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle,
365                  optimizerConfig,
366                  CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
367                  &min_slices,
368                  sizeof(min_slices)) );
369
370   // Find an optimized tensor network contraction path on each MPI process
371   HANDLE_ERROR( cutensornetContractionOptimize(handle,
372                                       descNet,
373                                       optimizerConfig,
374                                       workspaceLimit,
375                                       optimizerInfo) );
376
377   // Query the obtained Flop count
378   double flops{-1.};
379   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(handle,
380                     optimizerInfo,
381                     CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
382                     &flops,
383                     sizeof(flops)) );
384
385   // Choose the contraction path with the lowest Flop cost
386   struct {
387      double value;
388      int rank;
389   } in{flops, rank}, out;
390   HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) );
391   const int sender = out.rank;
392   flops = out.value;
393
394   if (verbose)
395      printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);
396
397   // Get the buffer size for optimizerInfo and broadcast it
398   size_t bufSize {0};
399   if (rank == sender)
400   {
401       HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) );
402   }
403   HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) );
404
405   // Allocate a buffer
406   std::vector<char> buffer(bufSize);
407
408   // Pack optimizerInfo on sender and broadcast it
409   if (rank == sender)
410   {
411       HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) );
412   }
413   HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) );
414
415   // Unpack optimizerInfo from the buffer
416   if (rank != sender)
417   {
418       HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) );
419   }
420
421   // Query the number of slices the tensor network execution will be split into
422   int64_t numSlices = 0;
423   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
424                  handle,
425                  optimizerInfo,
426                  CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
427                  &numSlices,
428                  sizeof(numSlices)) );
429   assert(numSlices > 0);
430
431   // Calculate each process's share of the slices
432   int64_t procChunk = numSlices / numProcs;
433   int extra = numSlices % numProcs;
434   int procSliceBegin = rank * procChunk + std::min(rank, extra);
435   int procSliceEnd = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);

We now create the workspace descriptor and allocate workspace memory as described in Create workspace descriptor and allocate workspace memory, and then create the contraction plan and auto-tune it as described in Contraction plan and auto-tune.

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the tensor network slices. We then provide this slice group object to the cutensornetContractSlices() function to get a partial contraction result on each process.

525   // Create a cutensornetSliceGroup_t object from a range of slice IDs
526   cutensornetSliceGroup_t sliceGroup{};
527   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) );
548      HANDLE_ERROR( cutensornetContractSlices(handle,
549                                 plan,
550                                 rawDataIn_d,
551                                 R_d,
552                                 accumulateOutput,
553                                 workDesc,
554                                 sliceGroup,
555                                 stream) );

Finally, we sum up the partial contributions to obtain the result of the tensor network contraction.

561      // Perform Allreduce operation on the output tensor
562      HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) );
563      HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host
564      HANDLE_MPI_ERROR( MPI_Allreduce(MPI_IN_PLACE, R, elementsR, floatTypeMPI, MPI_SUM, MPI_COMM_WORLD) );

Before termination, the MPI service needs to be finalized.

626   // Shut down MPI service
627   HANDLE_MPI_ERROR( MPI_Finalize() );

The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.

Code Example (GateSplit)

Useful Tips

  • For debugging, one can set the environment variable CUTENSORNET_LOG_LEVEL=n. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout.
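
For example (the level and file path below are arbitrary illustrative choices):

export CUTENSORNET_LOG_LEVEL=5
export CUTENSORNET_LOG_FILE=/tmp/cutensornet.log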