Examples

In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to compile the sample code. Then, we present example code that performs the common steps in cuTensorNet. In the example, we perform the following tensor contraction:

\[R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}\]

We construct the code step by step, with each step appending code to the end of the previous one. The steps are separated by succinct multi-line comment blocks.

It is recommended that the reader refer to the Overview and the cuTENSOR documentation to become familiar with the nomenclature and cuTENSOR operations.

Compiling Code

Assuming cuQuantum has been extracted in CUQUANTUM_ROOT and cuTENSOR in CUTENSOR_ROOT, we update the library path as follows:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ${CUTENSOR_ROOT}/lib/11.0).

The serial sample code discussed below (tensornet_example.cu) can be compiled via the following command:

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example

For static linking against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be explicitly linked against, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example

In order to build the parallel (MPI) versions of the examples (tensornet_example_mpi_auto.cu and tensornet_example_mpi.cu), one will need to have an MPI library installed (e.g., recent Open MPI, MVAPICH, or MPICH). In particular, the automatic parallel example requires a CUDA-aware MPI library; see Code Example (Automatic Slice-Based Distributed Parallelization) below. In this case, one will need to add -I${MPI_PATH}/include and -L${MPI_PATH}/lib -lmpi to the build command:

nvcc tensornet_example_mpi_auto.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi_auto
nvcc tensornet_example_mpi.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi

Warning

When the executable built from tensornet_example_mpi_auto.cu is run without CUDA-aware MPI, the program will crash.

Note

Depending on the source of the cuQuantum package, you may need to replace lib above with lib64.

Code Example (Serial)

The following code example illustrates the common steps necessary to use cuTensorNet and also introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository.

Headers and data types

  8#include <stdlib.h>
  9#include <stdio.h>
 10
 11#include <unordered_map>
 12#include <vector>
 13#include <cassert>
 14
 15#include <cuda_runtime.h>
 16#include <cutensornet.h>
 17
 18
 19#define HANDLE_ERROR(x)                                           \
 20{ const auto err = x;                                             \
 21  if( err != CUTENSORNET_STATUS_SUCCESS )                         \
 22  { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \
 23    fflush(stdout);                                               \
 24  }                                                               \
 25};
 26
 27#define HANDLE_CUDA_ERROR(x)                                      \
 28{ const auto err = x;                                             \
 29  if( err != cudaSuccess )                                        \
 30  { printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \
 31    fflush(stdout);                                               \
 32  }                                                               \
 33};
 34
 35
 36struct GPUTimer
 37{
 38    GPUTimer(cudaStream_t stream): stream_(stream)
 39    {
 40        HANDLE_CUDA_ERROR(cudaEventCreate(&start_));
 41        HANDLE_CUDA_ERROR(cudaEventCreate(&stop_));
 42    }
 43
 44    ~GPUTimer()
 45    {
 46        HANDLE_CUDA_ERROR(cudaEventDestroy(start_));
 47        HANDLE_CUDA_ERROR(cudaEventDestroy(stop_));
 48    }
 49
 50    void start()
 51    {
 52        HANDLE_CUDA_ERROR(cudaEventRecord(start_, stream_));
 53    }
 54
 55    float seconds()
 56    {
 57        HANDLE_CUDA_ERROR(cudaEventRecord(stop_, stream_));
 58        HANDLE_CUDA_ERROR(cudaEventSynchronize(stop_));
 59        float time;
 60        HANDLE_CUDA_ERROR(cudaEventElapsedTime(&time, start_, stop_));
 61        return time * 1e-3;
 62    }
 63
 64    private:
 65    cudaEvent_t start_, stop_;
 66    cudaStream_t stream_;
 67};
 68
 69
 70int main()
 71{
 72   static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!");
 73
 74   bool verbose = true;
 75
 76   // Check cuTensorNet version
 77   const size_t cuTensornetVersion = cutensornetGetVersion();
 78   if(verbose)
 79      printf("cuTensorNet version: %ld\n", cuTensornetVersion);
 80
 81   // Set GPU device
 82   int numDevices {0};
 83   HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
 84   const int deviceId = 0;
 85   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
 86   cudaDeviceProp prop;
 87   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
 88
 89   if(verbose) {
 90      printf("===== device info ======\n");
 91      printf("GPU-local-id:%d\n", deviceId);
 92      printf("GPU-name:%s\n", prop.name);
 93      printf("GPU-clock:%d\n", prop.clockRate);
 94      printf("GPU-memoryClock:%d\n", prop.memoryClockRate);
 95      printf("GPU-nSM:%d\n", prop.multiProcessorCount);
 96      printf("GPU-major:%d\n", prop.major);
 97      printf("GPU-minor:%d\n", prop.minor);
 98      printf("========================\n");
 99   }
100
101   typedef float floatType;
102   cudaDataType_t typeData = CUDA_R_32F;
103   cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;
104
105   if(verbose)
106      printf("Included headers and defined data types\n");

Define tensor network and tensor sizes

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

110   /**********************
111   * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}
112   **********************/
113
114   constexpr int32_t numInputs = 4;
115
116   // Create vectors of tensor modes
117   std::vector<int32_t> modesA{'a','b','c','d','e','f'};
118   std::vector<int32_t> modesB{'b','g','h','e','i','j'};
119   std::vector<int32_t> modesC{'m','a','g','f','i','k'};
120   std::vector<int32_t> modesD{'l','c','h','d','j','m'};
121   std::vector<int32_t> modesR{'k','l'};
122
123   // Set mode extents
124   std::unordered_map<int32_t, int64_t> extent;
125   extent['a'] = 16;
126   extent['b'] = 16;
127   extent['c'] = 16;
128   extent['d'] = 16;
129   extent['e'] = 16;
130   extent['f'] = 16;
131   extent['g'] = 16;
132   extent['h'] = 16;
133   extent['i'] = 16;
134   extent['j'] = 16;
135   extent['k'] = 16;
136   extent['l'] = 16;
137   extent['m'] = 16;
138
139   // Create a vector of extents for each tensor
140   std::vector<int64_t> extentA;
141   for (auto mode : modesA)
142      extentA.push_back(extent[mode]);
143   std::vector<int64_t> extentB;
144   for (auto mode : modesB)
145      extentB.push_back(extent[mode]);
146   std::vector<int64_t> extentC;
147   for (auto mode : modesC)
148      extentC.push_back(extent[mode]);
149   std::vector<int64_t> extentD;
150   for (auto mode : modesD)
151      extentD.push_back(extent[mode]);
152   std::vector<int64_t> extentR;
153   for (auto mode : modesR)
154      extentR.push_back(extent[mode]);
155
156   if(verbose)
157      printf("Defined tensor network, modes, and extents\n");

Allocate memory and initialize data

Next, we allocate memory for the tensor network operands and initialize them to random values.

160   /**********************
161   * Allocating data
162   **********************/
163
164   size_t elementsA = 1;
165   for (auto mode : modesA)
166      elementsA *= extent[mode];
167   size_t elementsB = 1;
168   for (auto mode : modesB)
169      elementsB *= extent[mode];
170   size_t elementsC = 1;
171   for (auto mode : modesC)
172      elementsC *= extent[mode];
173   size_t elementsD = 1;
174   for (auto mode : modesD)
175      elementsD *= extent[mode];
176   size_t elementsR = 1;
177   for (auto mode : modesR)
178      elementsR *= extent[mode];
179
180   size_t sizeA = sizeof(floatType) * elementsA;
181   size_t sizeB = sizeof(floatType) * elementsB;
182   size_t sizeC = sizeof(floatType) * elementsC;
183   size_t sizeD = sizeof(floatType) * elementsD;
184   size_t sizeR = sizeof(floatType) * elementsR;
185   if(verbose)
186      printf("Total GPU memory used for tensor storage: %.2f GiB\n",
187             (sizeA + sizeB + sizeC + sizeD + sizeR) / 1024. /1024. / 1024);
188
189   void* rawDataIn_d[numInputs];
190   void* R_d;
191   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) );
192   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) );
193   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) );
194   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[3], sizeD) );
195   HANDLE_CUDA_ERROR( cudaMalloc((void**) &R_d, sizeR));
196
197   floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA);
198   floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB);
199   floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC);
200   floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD);
201   floatType *R = (floatType*) malloc(sizeof(floatType) * elementsR);
202
203   if (A == NULL || B == NULL || C == NULL || D == NULL || R == NULL)
204   {
205      printf("Error: Host memory allocation failed!\n");
206      return -1;
207   }
208
209   /*******************
210   * Initialize data
211   *******************/
212
213   memset(R, 0, sizeof(floatType) * elementsR);
214   for (uint64_t i = 0; i < elementsA; i++)
215      A[i] = ((floatType) rand()) / RAND_MAX;
216   for (uint64_t i = 0; i < elementsB; i++)
217      B[i] = ((floatType) rand()) / RAND_MAX;
218   for (uint64_t i = 0; i < elementsC; i++)
219      C[i] = ((floatType) rand()) / RAND_MAX;
220   for (uint64_t i = 0; i < elementsD; i++)
221      D[i] = ((floatType) rand()) / RAND_MAX;
222
223   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
224   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
225   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
226   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
227
228   if(verbose)
229      printf("Allocated GPU memory for data, and initialize data\n");

cuTensorNet handle and network descriptor

Next, we initialize the cuTensorNet library via cutensornetCreate() and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types. Note that the created library context will be associated with the currently active GPU.

232   /*************************
233   * cuTensorNet
234   *************************/
235
236   cudaStream_t stream;
237   HANDLE_CUDA_ERROR( cudaStreamCreate(&stream) );
238
239   cutensornetHandle_t handle;
240   HANDLE_ERROR( cutensornetCreate(&handle) );
241
242   const int32_t nmodeA = modesA.size();
243   const int32_t nmodeB = modesB.size();
244   const int32_t nmodeC = modesC.size();
245   const int32_t nmodeD = modesD.size();
246   const int32_t nmodeR = modesR.size();
247
248   /*******************************
249   * Create Network Descriptor
250   *******************************/
251
252   const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data(), modesD.data()};
253   int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC, nmodeD};
254   const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data(), extentD.data()};
255   const int64_t* stridesIn[] = {NULL, NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout
256
257   // Set up tensor network
258   cutensornetNetworkDescriptor_t descNet;
259   HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle,
260                     numInputs, numModesIn, extentsIn, stridesIn, modesIn, NULL,
261                     nmodeR, extentR.data(), /*stridesOut = */NULL, modesR.data(),
262                     typeData, typeCompute,
263                     &descNet) );
264
265   if(verbose)
266      printf("Initialized the cuTensorNet library and created a tensor network descriptor\n");

Optimal contraction order and slicing

At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available memory resources, and provide it to the optimizer as a constraint. We then create an optimizer configuration object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is invoked via cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.

269   /*******************************
270   * Choose workspace limit based on available resources.
271   *******************************/
272
273   size_t freeMem, totalMem;
274   HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) );
275   uint64_t workspaceLimit = (uint64_t)((double)freeMem * 0.9);
276   if(verbose)
277      printf("Workspace limit = %lu\n", workspaceLimit);
278
279   /*******************************
280   * Find "optimal" contraction order and slicing
281   *******************************/
282
283   cutensornetContractionOptimizerConfig_t optimizerConfig;
284   HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) );
285
286   // Set the desired number of hyper-samples (defaults to 0)
287   int32_t num_hypersamples = 8;
288   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle,
289                     optimizerConfig,
290                     CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES,
291                     &num_hypersamples,
292                     sizeof(num_hypersamples)) );
293
294   // Create contraction optimizer info and find an optimized contraction path
295   cutensornetContractionOptimizerInfo_t optimizerInfo;
296   HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) );
297
298   HANDLE_ERROR( cutensornetContractionOptimize(handle,
299                                             descNet,
300                                             optimizerConfig,
301                                             workspaceLimit,
302                                             optimizerInfo) );
303
304   // Query the number of slices the tensor network execution will be split into
305   int64_t numSlices = 0;
306   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
307                  handle,
308                  optimizerInfo,
309                  CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
310                  &numSlices,
311                  sizeof(numSlices)) );
312   assert(numSlices > 0);
313
314   if(verbose)
315      printf("Found an optimized contraction path using cuTensorNet optimizer\n");

It is also possible to bypass the cuTensorNet optimizer and import a pre-determined contraction path, as well as slicing information, directly to the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute().
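As an illustrative sketch (not part of the sample), a user-defined path consisting of (first, second) tensor-pair indices could be supplied as follows. The particular pair order below is arbitrary and assumes the usual convention in which each contracted pair is removed from the tensor list and replaced by an intermediate tensor appended at the end:

   // Sketch: bypass the path optimizer by supplying a user-defined contraction path.
   // The pair order below is arbitrary and only illustrates the mechanism.
   cutensornetNodePair_t pairs[] = { {0, 1}, {0, 1}, {0, 1} }; // numInputs - 1 pairwise contractions
   cutensornetContractionPath_t path;
   path.numContractions = numInputs - 1;
   path.data = pairs;
   HANDLE_ERROR( cutensornetContractionOptimizerInfoSetAttribute(handle,
                     optimizerInfo,
                     CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                     &path,
                     sizeof(path)) );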

Create workspace descriptor and allocate workspace memory

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.

318   /*******************************
319   * Create workspace descriptor, allocate workspace, and set it.
320   *******************************/
321
322   cutensornetWorkspaceDescriptor_t workDesc;
323   HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) );
324
325   int64_t requiredWorkspaceSize = 0;
326   HANDLE_ERROR( cutensornetWorkspaceComputeContractionSizes(handle,
327                                                         descNet,
328                                                         optimizerInfo,
329                                                         workDesc) );
330
331   HANDLE_ERROR( cutensornetWorkspaceGetMemorySize(handle,
332                                                   workDesc,
333                                                   CUTENSORNET_WORKSIZE_PREF_MIN,
334                                                   CUTENSORNET_MEMSPACE_DEVICE,
335                                                   CUTENSORNET_WORKSPACE_SCRATCH,
336                                                   &requiredWorkspaceSize) );
337
338   void* work = nullptr;
339   HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) );
340
341   HANDLE_ERROR( cutensornetWorkspaceSetMemory(handle,
342                                               workDesc,
343                                               CUTENSORNET_MEMSPACE_DEVICE,
344                                               CUTENSORNET_WORKSPACE_SCRATCH,
345                                               work,
346                                               requiredWorkspaceSize) );
347
348   if(verbose)
349      printf("Allocated and set up the GPU workspace\n");

Contraction plan and auto-tune

We create a tensor network contraction plan holding all pairwise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan such that cuTENSOR selects the best kernel for each pairwise contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing this plan redundantly.

352   /*******************************
353   * Initialize the pairwise contraction plan (for cuTENSOR).
354   *******************************/
355
356   cutensornetContractionPlan_t plan;
357   HANDLE_ERROR( cutensornetCreateContractionPlan(handle,
358                                                descNet,
359                                                optimizerInfo,
360                                                workDesc,
361                                                &plan) );
362
363   /*******************************
364   * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel
365   *           for each pairwise tensor contraction.
366   *******************************/
367   cutensornetContractionAutotunePreference_t autotunePref;
368   HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle,
369                                                      &autotunePref) );
370
371   const int numAutotuningIterations = 5; // may be 0
372   HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute(
373                           handle,
374                           autotunePref,
375                           CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS,
376                           &numAutotuningIterations,
377                           sizeof(numAutotuningIterations)) );
378
379   // Modify the plan again to find the best pair-wise contractions
380   HANDLE_ERROR( cutensornetContractionAutotune(handle,
381                                                plan,
382                                                rawDataIn_d,
383                                                R_d,
384                                                workDesc,
385                                                autotunePref,
386                                                stream) );
387
388   HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) );
389
390   if(verbose)
391      printf("Created a contraction plan for cuTensorNet and optionally auto-tuned it\n");

Tensor network contraction execution

Finally, we contract the tensor network as many times as needed, possibly with different input each time. Tensor network slices, captured as a cutensornetSliceGroup_t object, are computed using the same contraction plan. For convenience, NULL can be provided to the cutensornetContractSlices() function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free allocated resources.

394   /**********************
395   * Execute the tensor network contraction
396   **********************/
397
398   // Create a cutensornetSliceGroup_t object from a range of slice IDs
399   cutensornetSliceGroup_t sliceGroup{};
400   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) );
401
402   GPUTimer timer {stream};
403   double minTimeCUTENSORNET = 1e100;
404   const int numRuns = 3; // number of repeats to get stable performance results
405   for (int i = 0; i < numRuns; ++i)
406   {
407      HANDLE_CUDA_ERROR( cudaMemcpy(R_d, R, sizeR, cudaMemcpyHostToDevice) ); // restore the output tensor on GPU
408      HANDLE_CUDA_ERROR( cudaDeviceSynchronize() );
409
410      /*
411      * Contract all slices of the tensor network
412      */
413      timer.start();
414
415      int32_t accumulateOutput = 0; // output tensor data will be overwritten
416      HANDLE_ERROR( cutensornetContractSlices(handle,
417                     plan,
418                     rawDataIn_d,
419                     R_d,
420                     accumulateOutput,
421                     workDesc,
422                     sliceGroup, // alternatively, NULL can also be used to contract over all slices instead of specifying a sliceGroup object
423                     stream) );
424
425      // Synchronize and measure best timing
426      auto time = timer.seconds();
427      minTimeCUTENSORNET = (time > minTimeCUTENSORNET) ? minTimeCUTENSORNET : time;
428   }
429
430   if(verbose)
431      printf("Contracted the tensor network, each slice used the same contraction plan\n");
432
433   // Print the 1-norm of the output tensor (verification)
434   HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) );
435   HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host
436   double norm1 = 0.0;
437   for (int64_t i = 0; i < elementsR; ++i) {
438      norm1 += std::abs(R[i]);
439   }
440   if(verbose)
441      printf("Computed the 1-norm of the output tensor: %e\n", norm1);
442
443   /*************************/
444
445   // Query the total Flop count for the tensor network contraction
446   double flops {0.0};
447   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
448                     handle,
449                     optimizerInfo,
450                     CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
451                     &flops,
452                     sizeof(flops)) );
453
454   if(verbose) {
455      printf("Number of tensor network slices = %ld\n", numSlices);
456      printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSORNET * 1000.f);
457   }
458
459   // Free cuTensorNet resources
460   HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) );
461   HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) );
462   HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) );
463   HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) );
464   HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) );
465   HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) );
466   HANDLE_ERROR( cutensornetDestroy(handle) );
467
468   // Free Host memory resources
469   if (R) free(R);
470   if (D) free(D);
471   if (C) free(C);
472   if (B) free(B);
473   if (A) free(A);
474
475   // Free GPU memory resources
476   if (work) cudaFree(work);
477   if (R_d) cudaFree(R_d);
478   if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]);
479   if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]);
480   if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]);
481   if (rawDataIn_d[3]) cudaFree(rawDataIn_d[3]);
482
483   if(verbose)
484      printf("Freed resources and exited\n");
485
486   return 0;
487}

Recall that the full sample code can be found in the NVIDIA/cuQuantum repository.

Code Example (Automatic Slice-Based Distributed Parallelization)

It is straightforward to adapt Code Example (Serial) and enable automatic parallel execution across multiple GPU devices, potentially spanning many nodes. We will illustrate this with an example that uses the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions needed to enable distributed parallel execution without modifying the original serial source code. The full MPI-automatic sample code can be found in the NVIDIA/cuQuantum repository. To enable automatic parallelism, cuTensorNet requires that

  • the environment variable $CUTENSORNET_COMM_LIB is set to the path to the wrapper shared library libcutensornet_distributed_interface_mpi.so, and

  • the executable is linked to a CUDA-aware MPI library.

Detailed instructions for setting these up are given in the installation guide above.

First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and assign a unique GPU device to each MPI process that will later be associated with the cuTensorNet library handle created inside the MPI process.

20#include <mpi.h>
44#define HANDLE_MPI_ERROR(x)                                       \
45{ const auto err = x;                                             \
46  if( err != MPI_SUCCESS )                                        \
47  { char error[MPI_MAX_ERROR_STRING]; int len;                    \
48    MPI_Error_string(err, error, &len);                           \
49    printf("MPI Error: %s in line %d\n", error, __LINE__);        \
50    fflush(stdout);                                               \
51    MPI_Abort(MPI_COMM_WORLD, err);                               \
52  }                                                               \
53};

The MPI service initialization must precede the first cutensornetCreate() call, which creates the cuTensorNet library handle. An attempt to call cutensornetCreate() before initializing the MPI service will result in an error.

 97   // Initialize MPI
 98   HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) );
 99   int rank {-1};
100   HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
101   int numProcs {0};
102   HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );

If multiple GPU devices located on the same node are visible to an MPI process, we need to pick an exclusive GPU device for each MPI process. If the mpirun (or mpiexec) command provided by your MPI library exposes the local rank of each MPI process via an environment variable (for example, Open MPI provides ${OMPI_COMM_WORLD_LOCAL_RANK} for this purpose), you can use that variable to set CUDA_VISIBLE_DEVICES to a single GPU device assigned exclusively to that process. Otherwise, the GPU device can be set manually, as shown below.

122   // Set GPU device based on ranks and nodes
123   int numDevices {0};
124   HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
125   const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
126   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
127   cudaDeviceProp prop;
128   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );

Next we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

258   /*******************
259   * Initialize data
260   *******************/
261
262   memset(R, 0, sizeof(floatType) * elementsR);
263   if(rank == 0)
264   {
265      for (uint64_t i = 0; i < elementsA; i++)
266         A[i] = ((floatType) rand()) / RAND_MAX;
267      for (uint64_t i = 0; i < elementsB; i++)
268         B[i] = ((floatType) rand()) / RAND_MAX;
269      for (uint64_t i = 0; i < elementsC; i++)
270         C[i] = ((floatType) rand()) / RAND_MAX;
271      for (uint64_t i = 0; i < elementsD; i++)
272         D[i] = ((floatType) rand()) / RAND_MAX;
273   }
274
275   // Broadcast input data to all ranks
276   HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) );
277   HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) );
278   HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) );
279   HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) );
280
281   // Copy data to GPU
282   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
283   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
284   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
285   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
286
287   if(verbose)
288      printf("Allocated GPU memory for data, and initialize data\n");

Once the MPI service has been initialized and the cuTensorNet library handle has been created, one can activate distributed parallel execution by calling cutensornetDistributedResetConfiguration(). Per standard practice, the user's code first creates a duplicate MPI communicator via MPI_Comm_dup. The duplicate MPI communicator is then associated with the cuTensorNet library handle by passing a pointer to it, together with its size (in bytes), to cutensornetDistributedResetConfiguration(). The MPI communicator is stored inside the cuTensorNet library handle such that all subsequent calls to the tensor network contraction path finder and tensor network contraction executor are parallelized across all participating MPI processes (each MPI process is associated with its own GPU).

342   /*******************************
343   * Activate distributed (parallel) execution prior to
344   * calling contraction path finder and contraction executor
345   *******************************/
346   // HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) ); // resets back to serial execution
347   MPI_Comm cutnComm;
348   HANDLE_MPI_ERROR( MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm) ); // duplicate MPI communicator
349   HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) );
350   if(verbose)
351      printf("Reset distributed MPI configuration\n");

Note

cutensornetDistributedResetConfiguration() is a collective call that must be executed by all participating MPI processes.

The API of this distributed parallelization model makes it straightforward to run source code written for serial execution on multiple GPUs/nodes. Essentially, all MPI processes execute exactly the same (serial) source code while the tensor network contraction path finder and tensor network contraction executor calls are automatically parallelized internally. The path finder is parallelized only when the number of requested hyper-samples is larger than zero; regardless, distributed parallelization must be activated via cutensornetDistributedResetConfiguration() strictly before invoking the path finder and the contraction executor. When the distributed configuration is set to a parallel mode, the user is normally expected to invoke the tensor network contraction by calling cutensornetContractSlices() with the full range of tensor network slices, which will then be automatically distributed across all MPI processes.

Since a tensor network must be sufficiently large to benefit from distributed execution, smaller tensor networks (those which consist of only a single slice) can still be processed without distributed parallelization. This is achieved by calling cutensornetDistributedResetConfiguration() with a NULL argument in place of the MPI communicator pointer (as before, prior to invoking the tensor network contraction path finder). The switch between distributed parallelization and redundant serial execution can therefore be made on a per-tensor-network basis: users can process larger tensor networks in parallel and smaller ones serially (redundantly on each process) by resetting the distributed configuration accordingly, as sketched below. In either case, all MPI processes produce the same output tensor (result) at the end of the tensor network execution.
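A minimal sketch of such a per-network switch is shown below; the size-based criterion and its threshold are hypothetical user-defined choices, and cutnComm is the MPI communicator duplicated earlier:

   // Sketch: choose, per tensor network, between distributed and redundant serial processing.
   // The criterion (number of input tensors) and the threshold are hypothetical user choices;
   // since they evaluate identically on every rank, this collective call stays consistent.
   if (numInputs >= 16) {
      // Larger network: parallelize the path finder and contraction across all MPI processes
      HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)) );
   } else {
      // Smaller network: process it redundantly (serially) on every rank
      HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) );
   }
   // The chosen mode must be set before calling the contraction path finder for this network.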

Note

In the current version of the cuTensorNet library, the parallel tensor network contraction execution triggered by the cutensornetContractSlices() call will block the provided CUDA stream as well as the calling CPU thread until the execution has completed on all MPI processes. This is a temporary limitation that will be lifted in future versions of the cuTensorNet library, where the call to cutensornetContractSlices() will be fully asynchronous, similar to the serial execution case. Additionally, for an explicit synchronization of all MPI processes (barrier) one can make a collective call to cutensornetDistributedSynchronize().
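As a minimal sketch (reusing the error-handling macro defined earlier), such an explicit barrier can be requested as follows:

   // Collective barrier across all MPI processes associated with the cuTensorNet library handle
   HANDLE_ERROR( cutensornetDistributedSynchronize(handle) );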

Before termination, the MPI service needs to be finalized.

561   // Shut down MPI service
562   HANDLE_MPI_ERROR( MPI_Finalize() );

The complete MPI-automatic sample can be found in the NVIDIA/cuQuantum repository.

Code Example (Manual Slice-Based Distributed Parallelization)

For advanced users, it is also possible (but more involved) to adapt Code Example (Serial) to explicitly parallelize execution of the tensor network contraction operation on multiple GPU devices. Here we will also use MPI as the communication layer. For brevity, we will show only the changes that need to be made on top of the serial example. The full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.

First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.

20#include <mpi.h>
44#define HANDLE_MPI_ERROR(x)                                       \
45{ const auto err = x;                                             \
46  if( err != MPI_SUCCESS )                                        \
47  { char error[MPI_MAX_ERROR_STRING]; int len;                    \
48    MPI_Error_string(err, error, &len);                           \
49    printf("MPI Error: %s in line %d\n", error, __LINE__);        \
50    fflush(stdout);                                               \
51    MPI_Abort(MPI_COMM_WORLD, err);                               \
52  }                                                               \
53};
 97   // Initialize MPI
 98   HANDLE_MPI_ERROR( MPI_Init(&argc, &argv) );
 99   int rank {-1};
100   HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
101   int numProcs {0};
102   HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );
122   // Set GPU device based on ranks and nodes
123   int numDevices {0};
124   HANDLE_CUDA_ERROR( cudaGetDeviceCount(&numDevices) );
125   const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
126   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
127   cudaDeviceProp prop;
128   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );

Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

258   /*******************
259   * Initialize data
260   *******************/
261
262   memset(R, 0, sizeof(floatType) * elementsR);
263   if(rank == 0)
264   {
265      for (uint64_t i = 0; i < elementsA; i++)
266         A[i] = ((floatType) rand()) / RAND_MAX;
267      for (uint64_t i = 0; i < elementsB; i++)
268         B[i] = ((floatType) rand()) / RAND_MAX;
269      for (uint64_t i = 0; i < elementsC; i++)
270         C[i] = ((floatType) rand()) / RAND_MAX;
271      for (uint64_t i = 0; i < elementsD; i++)
272         D[i] = ((floatType) rand()) / RAND_MAX;
273   }
274
275   // Broadcast input data to all ranks
276   HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, 0, MPI_COMM_WORLD) );
277   HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, 0, MPI_COMM_WORLD) );
278   HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, 0, MPI_COMM_WORLD) );
279   HANDLE_MPI_ERROR( MPI_Bcast(D, elementsD, floatTypeMPI, 0, MPI_COMM_WORLD) );
280
281   // Copy data to GPU
282   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
283   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
284   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
285   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[3], D, sizeD, cudaMemcpyHostToDevice) );
286
287   if(verbose)
288      printf("Allocated GPU memory for data, and initialize data\n");

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.

Next, we find the optimal contraction path and slicing combination for our tensor network. We will run the cuTensorNet optimizer on all processes and determine which process has the best path in terms of the FLOP count. We will then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all other processes. Each process now has the same optimizer info object, which we use to calculate the share of slices for each process.

363   // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
364   // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
365   // tensor networks. For large networks, hyper-optimization does become beneficial.
366
367   // Enforce tensor network slicing (for parallelization)
368   const int32_t min_slices = numProcs;
369   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(handle,
370                  optimizerConfig,
371                  CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
372                  &min_slices,
373                  sizeof(min_slices)) );
374
375   // Find an optimized tensor network contraction path on each MPI process
376   HANDLE_ERROR( cutensornetContractionOptimize(handle,
377                                       descNet,
378                                       optimizerConfig,
379                                       workspaceLimit,
380                                       optimizerInfo) );
381
382   // Query the obtained Flop count
383   double flops{-1.};
384   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(handle,
385                     optimizerInfo,
386                     CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
387                     &flops,
388                     sizeof(flops)) );
389
390   // Choose the contraction path with the lowest Flop cost
391   struct {
392      double value;
393      int rank;
394   } in{flops, rank}, out;
395   HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) );
396   const int sender = out.rank;
397   flops = out.value;
398
399   if (verbose)
400      printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);
401
402   // Get the buffer size for optimizerInfo and broadcast it
403   size_t bufSize {0};
404   if (rank == sender)
405   {
406       HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) );
407   }
408   HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) );
409
410   // Allocate a buffer
411   std::vector<char> buffer(bufSize);
412
413   // Pack optimizerInfo on sender and broadcast it
414   if (rank == sender)
415   {
416       HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) );
417   }
418   HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) );
419
420   // Unpack optimizerInfo from the buffer
421   if (rank != sender)
422   {
423       HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) );
424   }
425
426   // Query the number of slices the tensor network execution will be split into
427   int64_t numSlices = 0;
428   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
429                  handle,
430                  optimizerInfo,
431                  CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
432                  &numSlices,
433                  sizeof(numSlices)) );
434   assert(numSlices > 0);
435
436   // Calculate each process's share of the slices
437   int64_t procChunk = numSlices / numProcs;
438   int extra = numSlices % numProcs;
439   int procSliceBegin = rank * procChunk + std::min(rank, extra);
440   int procSliceEnd = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);

We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and create the Contraction plan and auto-tune the tensor network.

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the tensor network slices. We then provide this slice group object to the cutensornetContractSlices() function to get a partial contraction result on each process.

530   // Create a cutensornetSliceGroup_t object from a range of slice IDs
531   cutensornetSliceGroup_t sliceGroup{};
532   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) );
553      HANDLE_ERROR( cutensornetContractSlices(handle,
554                                 plan,
555                                 rawDataIn_d,
556                                 R_d,
557                                 accumulateOutput,
558                                 workDesc,
559                                 sliceGroup,
560                                 stream) );

Finally, we sum up the partial contributions to obtain the result of the tensor network contraction.

566      // Perform Allreduce operation on the output tensor
567      HANDLE_CUDA_ERROR( cudaStreamSynchronize(stream) );
568      HANDLE_CUDA_ERROR( cudaMemcpy(R, R_d, sizeR, cudaMemcpyDeviceToHost) ); // restore the output tensor on Host
569      HANDLE_MPI_ERROR( MPI_Allreduce(MPI_IN_PLACE, R, elementsR, floatTypeMPI, MPI_SUM, MPI_COMM_WORLD) );

Before termination, the MPI service needs to be finalized.

631   // Shut down MPI service
632   HANDLE_MPI_ERROR( MPI_Finalize() );

The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.

Code Example (GateSplit)

Useful Tips

  • For debugging, one can set the environment variable CUTENSORNET_LOG_LEVEL=n. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout.
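As an illustration, logging can also be enabled programmatically via cutensornetLoggerSetLevel(); the standalone sketch below (which assumes only this routine and the usual handle creation) requests the most verbose level before any other library call:

#include <cstdio>
#include <cutensornet.h>

int main()
{
   // Programmatic alternative to setting CUTENSORNET_LOG_LEVEL in the environment:
   // request logger level 5 (most verbose) before any other cuTensorNet call.
   if (cutensornetLoggerSetLevel(5) != CUTENSORNET_STATUS_SUCCESS) {
      printf("Failed to set the cuTensorNet logger level\n");
      return 1;
   }

   cutensornetHandle_t handle;
   if (cutensornetCreate(&handle) != CUTENSORNET_STATUS_SUCCESS) {
      printf("Failed to create the cuTensorNet library handle\n");
      return 1;
   }
   // ... library calls made here are traced by the logger ...
   cutensornetDestroy(handle);
   return 0;
}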