Getting Started

In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to install the library and compile the sample code. Then, we present example code that performs the common steps of a cuTensorNet workflow. In this example, we perform the following tensor contraction:

\[D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}\]

We build up the code step by step, with each step appending code to the end of the previous one. The steps are separated by succinct multi-line comment blocks.

It is recommended that the reader refer to the Overview and the cuTENSOR documentation to become familiar with the nomenclature and with cuTENSOR operations.

Installation and Compilation

Download the cuQuantum package (which cuTensorNet is part of) from https://developer.nvidia.com/cuQuantum-downloads, and the cuTENSOR package from https://developer.nvidia.com/cutensor.

Linux

Assuming cuQuantum has been extracted into CUQUANTUM_ROOT and cuTENSOR into CUTENSOR_ROOT, we update the library path as follows:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ${CUTENSOR_ROOT}/lib/11.0).

The sample code discussed below (tensornet_example.cu) can be compiled via the following command:

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example

To link statically against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be linked against explicitly, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example

Note

Depending on the source of the cuQuantum package, you may need to replace lib above with lib64.

Code Example (Serial)

The following code example illustrates the common steps necessary to use cuTensorNet and also introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository (here).

Headers and data types

 8#include <stdlib.h>
 9#include <stdio.h>
10
11#include <unordered_map>
12#include <vector>
13#include <cassert>
14
15#include <cuda_runtime.h>
16#include <cutensornet.h>
17#include <cutensor.h>
18
19#define HANDLE_ERROR(x)                                           \
20{ const auto err = x;                                             \
21if( err != CUTENSORNET_STATUS_SUCCESS )                           \
22{ printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \
23};
24
25#define HANDLE_CUDA_ERROR(x)                                      \
26{  const auto err = x;                                            \
27   if( err != cudaSuccess )                                       \
28   { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \
29};
30
31struct GPUTimer
32{
33   GPUTimer(cudaStream_t stream): stream_(stream)
34   {
35      cudaEventCreate(&start_);
36      cudaEventCreate(&stop_);
37   }
38
39   ~GPUTimer()
40   {
41      cudaEventDestroy(start_);
42      cudaEventDestroy(stop_);
43   }
44
45   void start()
46   {
47      cudaEventRecord(start_, stream_);
48   }
49
50   float seconds()
51   {
52      cudaEventRecord(stop_, stream_);
53      cudaEventSynchronize(stop_);
54      float time;
55      cudaEventElapsedTime(&time, start_, stop_);
56      return time * 1e-3;
57   }
58
59   private:
60   cudaEvent_t start_, stop_;
61   cudaStream_t stream_;
62};
63
64
65int main()
66{
67   const size_t cuTensornetVersion = cutensornetGetVersion();
68   printf("cuTensorNet-vers:%ld\n",cuTensornetVersion);
69
70   cudaDeviceProp prop;
71   int deviceId{-1};
72   HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) );
73   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
74
75   printf("===== device info ======\n");
76   printf("GPU-name:%s\n", prop.name);
77   printf("GPU-clock:%d\n", prop.clockRate);
78   printf("GPU-memoryClock:%d\n", prop.memoryClockRate);
79   printf("GPU-nSM:%d\n", prop.multiProcessorCount);
80   printf("GPU-major:%d\n", prop.major);
81   printf("GPU-minor:%d\n", prop.minor);
82   printf("========================\n");
83
84   typedef float floatType;
85   cudaDataType_t typeData = CUDA_R_32F;
86   cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;
87
88   printf("Include headers and define data types\n");

Define tensor network and tensor sizes

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

 91   /**********************
 92   * Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}
 93   **********************/
 94
 95   constexpr int32_t numInputs = 3;
 96
 97   // Create vector of modes
 98   std::vector<int32_t> modesA{'m','h','k','n'};
 99   std::vector<int32_t> modesB{'u','k','h'};
100   std::vector<int32_t> modesC{'x','u','y'};
101   std::vector<int32_t> modesD{'m','x','n','y'};
102
103   // Extents
104   std::unordered_map<int32_t, int64_t> extent;
105   extent['m'] = 96;
106   extent['n'] = 96;
107   extent['u'] = 96;
108   extent['h'] = 64;
109   extent['k'] = 64;
110   extent['x'] = 64;
111   extent['y'] = 64;
112
113   // Create a vector of extents for each tensor
114   std::vector<int64_t> extentA;
115   for (auto mode : modesA)
116      extentA.push_back(extent[mode]);
117   std::vector<int64_t> extentB;
118   for (auto mode : modesB)
119      extentB.push_back(extent[mode]);
120   std::vector<int64_t> extentC;
121   for (auto mode : modesC)
122      extentC.push_back(extent[mode]);
123   std::vector<int64_t> extentD;
124   for (auto mode : modesD)
125      extentD.push_back(extent[mode]);
126
127   printf("Define network, modes, and extents\n");

Allocate memory and initialize data

Next, we allocate memory for the tensor network operands and initialize the input data to random values.

130   /**********************
131   * Allocating data
132   **********************/
133
134   size_t elementsA = 1;
135   for (auto mode : modesA)
136      elementsA *= extent[mode];
137   size_t elementsB = 1;
138   for (auto mode : modesB)
139      elementsB *= extent[mode];
140   size_t elementsC = 1;
141   for (auto mode : modesC)
142      elementsC *= extent[mode];
143   size_t elementsD = 1;
144   for (auto mode : modesD)
145      elementsD *= extent[mode];
146
147   size_t sizeA = sizeof(floatType) * elementsA;
148   size_t sizeB = sizeof(floatType) * elementsB;
149   size_t sizeC = sizeof(floatType) * elementsC;
150   size_t sizeD = sizeof(floatType) * elementsD;
151   printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC + sizeD)/1024./1024./1024);
152
153   void* rawDataIn_d[numInputs];
154   void* D_d;
155   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) );
156   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) );
157   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) );
158   HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_d, sizeD));
159
160   floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA);
161   floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB);
162   floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC);
163   floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD);
164
165   if (A == NULL || B == NULL || C == NULL || D == NULL)
166   {
167      printf("Error: Host allocation of A, B, C, or D failed.\n");
168      return -1;
169   }
170
171   /*******************
172   * Initialize data
173   *******************/
174
175   for (uint64_t i = 0; i < elementsA; i++)
176      A[i] = ((floatType) rand())/RAND_MAX;
177   for (uint64_t i = 0; i < elementsB; i++)
178      B[i] = ((floatType) rand())/RAND_MAX;
179   for (uint64_t i = 0; i < elementsC; i++)
180      C[i] = ((floatType) rand())/RAND_MAX;
181   memset(D, 0, sizeof(floatType) * elementsD);
182
183   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
184   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
185   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
186
187   printf("Allocate memory for data, and initialize data.\n");

cuTensorNet handle and network descriptor

Next, we initialize the cuTensorNet library via cutensornetCreate() and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types.

190   /*************************
191   * cuTensorNet
192   *************************/
193
194   cudaStream_t stream;
195   cudaStreamCreate(&stream);
196
197   cutensornetHandle_t handle;
198   HANDLE_ERROR( cutensornetCreate(&handle) );
199
200   const int32_t nmodeA = modesA.size();
201   const int32_t nmodeB = modesB.size();
202   const int32_t nmodeC = modesC.size();
203   const int32_t nmodeD = modesD.size();
204
205   /*******************************
206   * Create Network Descriptor
207   *******************************/
208
209   const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data()};
210   int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC};
211   const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data()};
212   const int64_t* stridesIn[] = {NULL, NULL, NULL}; // strides are optional; if no stride is provided, then cuTensorNet assumes a generalized column-major data layout
213
214   // Note that pointers allocated via cudaMalloc are aligned to 256-byte
215   // boundaries by default; however, here we check the pointer alignment explicitly
216   // to demonstrate how one would check the alignment of arbitrary pointers.
217
218   auto getMaximalPointerAlignment = [](const void* ptr) {
219      const uint64_t ptrAddr  = reinterpret_cast<uint64_t>(ptr);
220      uint32_t alignment = 1;
221      while(ptrAddr % alignment == 0 &&
222            alignment < 256) // terminate once the alignment reaches 256 bytes (we could keep going, but any alignment of 256 bytes or more is equally fine)
223      {
224         alignment *= 2;
225      }
226      return alignment;
227   };
228   const uint32_t alignmentsIn[] = {getMaximalPointerAlignment(rawDataIn_d[0]),
229                                    getMaximalPointerAlignment(rawDataIn_d[1]),
230                                    getMaximalPointerAlignment(rawDataIn_d[2])};
231   const uint32_t alignmentOut = getMaximalPointerAlignment(D_d);
232
233   // setup tensor network
234   cutensornetNetworkDescriptor_t descNet;
235   HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle,
236                                                numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn,
237                                                nmodeD, extentD.data(), /*stridesOut = */NULL, modesD.data(), alignmentOut,
238                                                typeData, typeCompute,
239                                                &descNet) );
240
241   printf("Initialize the cuTensorNet library and create a network descriptor.\n");

Optimal contraction order and slicing

At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available resources, and provide it to the optimizer as a constraint. We then create an optimizer config object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is called using cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.

244   /*******************************
245   * Choose workspace limit based on available resources.
246   *******************************/
247
248   size_t freeMem, totalMem;
249   HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) );
250   uint64_t workspaceLimit = (uint64_t)((double) freeMem * 0.9);
251
252   /*******************************
253   * Find "optimal" contraction order and slicing
254   *******************************/
255
256   cutensornetContractionOptimizerConfig_t optimizerConfig;
257   HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) );
258
259   // Set the value of the partitioner imbalance factor, if desired
260   int32_t imbalance_factor = 30;
261   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(
262                                                               handle,
263                                                               optimizerConfig,
264                                                               CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_IMBALANCE_FACTOR,
265                                                               &imbalance_factor,
266                                                               sizeof(imbalance_factor)) );
267
268
269   cutensornetContractionOptimizerInfo_t optimizerInfo;
270   HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) );
271
272   HANDLE_ERROR( cutensornetContractionOptimize(handle,
273                                             descNet,
274                                             optimizerConfig,
275                                             workspaceLimit,
276                                             optimizerInfo) );
277
278   int64_t numSlices = 0;
279   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
280               handle,
281               optimizerInfo,
282               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
283               &numSlices,
284               sizeof(numSlices)) );
285
286   assert(numSlices > 0);
287
288   printf("Find an optimized contraction path with cuTensorNet optimizer.\n");

It is also possible to bypass the cuTensorNet optimizer and set a pre-determined path, as well as slicing information, directly into the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute().
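
As a minimal sketch of this alternative (not part of the sample above), one could fill in a contraction path and set it on the optimizer info object before the contraction plan is created. The snippet assumes the CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH attribute and the cutensornetContractionPath_t / cutensornetNodePair_t structures declared in cutensornet.h; the node-pair values are purely illustrative, and the exact indexing convention for intermediate tensors should be taken from the cutensornetContractionPath_t documentation:

   // Sketch: bypass the optimizer by setting a pre-determined contraction path.
   // For this 3-tensor network there are numInputs - 1 = 2 pairwise contractions;
   // the pair indices below are illustrative and must follow the path indexing
   // convention documented for cutensornetContractionPath_t.
   cutensornetNodePair_t nodePairs[] = { {0, 1}, {0, 1} };

   cutensornetContractionPath_t path;
   path.numContractions = numInputs - 1;
   path.data            = nodePairs;

   HANDLE_ERROR( cutensornetContractionOptimizerInfoSetAttribute(
                           handle,
                           optimizerInfo,
                           CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                           &path,
                           sizeof(path)) );

With a path provided this way, the call to cutensornetContractionOptimize() above would be omitted.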

Create workspace descriptor and allocate workspace memory

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.

291   /*******************************
292   * Create workspace descriptor, allocate workspace, and set it.
293   *******************************/
294
295   cutensornetWorkspaceDescriptor_t workDesc;
296   HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) );
297
298   uint64_t requiredWorkspaceSize = 0;
299   HANDLE_ERROR( cutensornetWorkspaceComputeSizes(handle,
300                                          descNet,
301                                          optimizerInfo,
302                                          workDesc) );
303
304   HANDLE_ERROR( cutensornetWorkspaceGetSize(handle,
305                                         workDesc,
306                                         CUTENSORNET_WORKSIZE_PREF_MIN,
307                                         CUTENSORNET_MEMSPACE_DEVICE,
308                                         &requiredWorkspaceSize) );
309
310   void *work = nullptr;
311   HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) );
312
313   HANDLE_ERROR( cutensornetWorkspaceSet(handle,
314                                         workDesc,
315                                         CUTENSORNET_MEMSPACE_DEVICE,
316                                         work,
317                                         requiredWorkspaceSize) );
318
319   printf("Allocate workspace.\n");

Contraction plan and auto-tune

We create a contraction plan holding pair-wise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan such that cuTENSOR selects the best kernel for each contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing this plan redundantly.

322   /*******************************
323   * Initialize all pair-wise contraction plans (for cuTENSOR).
324   *******************************/
325
326   cutensornetContractionPlan_t plan;
327
328   HANDLE_ERROR( cutensornetCreateContractionPlan(handle,
329                                                  descNet,
330                                                  optimizerInfo,
331                                                  workDesc,
332                                                  &plan) );
333
334
335   /*******************************
336   * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel
337   *           for each pairwise contraction.
338   *******************************/
339   cutensornetContractionAutotunePreference_t autotunePref;
340   HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle,
341                           &autotunePref) );
342
343   const int numAutotuningIterations = 5; // may be 0
344   HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute(
345                           handle,
346                           autotunePref,
347                           CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS,
348                           &numAutotuningIterations,
349                           sizeof(numAutotuningIterations)) );
350
351   // modify the plan again to find the best pair-wise contractions
352   HANDLE_ERROR( cutensornetContractionAutotune(handle,
353                           plan,
354                           rawDataIn_d,
355                           D_d,
356                           workDesc,
357                           autotunePref,
358                           stream) );
359
360   HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) );
361
362   printf("Create a contraction plan for cuTensorNet and optionally auto-tune it.\n");

Network contraction execution

Finally, we contract the network as many times as needed, possibly with different data. Network slices, captured as a cutensornetSliceGroup_t object, are executed using the same contraction plan. For convenience, NULL can be provided to the cutensornetContractSlices() function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free allocated resources.

365   /**********************
366   * Run
367   **********************/
368   cutensornetSliceGroup_t sliceGroup{};
369
370   // Create a cutensornetSliceGroup_t object from a range of slice IDs.
371   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) );
372
373   GPUTimer timer{stream};
374   double minTimeCUTENSOR = 1e100;
375   const int numRuns = 3; // to get stable perf results
376   for (int i=0; i < numRuns; ++i)
377   {
378      cudaMemcpy(D_d, D, sizeD, cudaMemcpyHostToDevice); // restore output
379      cudaDeviceSynchronize();
380
381      /*
382      * Contract over all slices.
383      *
384      * A user may choose to parallelize over the slices across multiple devices.
385      */
386      timer.start();
387
388      int32_t accumulateOutput = 0;
389      HANDLE_ERROR( cutensornetContractSlices(handle,
390                                 plan,
391                                 rawDataIn_d,
392                                 D_d,
393                                 accumulateOutput,
394                                 workDesc,
395                                 sliceGroup,    // Alternatively, NULL can also be used to contract over all the slices instead of specifying a sliceGroup object.
396                                 stream) );
397
398      // Synchronize and measure timing
399      auto time = timer.seconds();
400      minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
401   }
402
403   printf("Contract the network, each slice uses the same contraction plan.\n");
404
405
406   /*************************/
407
408   double flops{0.};
409   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
410               handle,
411               optimizerInfo,
412               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
413               &flops,
414               sizeof(flops)) );
415
416   printf("numSlices: %ld\n", numSlices);
417   printf("%.2f ms / slice\n", minTimeCUTENSOR * 1000.f / numSlices);
418   printf("%.2f GFLOPS/s\n", flops/1e9/minTimeCUTENSOR );
419
420   HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) );
421   HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) );
422   HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) );
423   HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) );
424   HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) );
425   HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) );
426   HANDLE_ERROR( cutensornetDestroy(handle) );
427
428   if (A) free(A);
429   if (B) free(B);
430   if (C) free(C);
431   if (D) free(D);
432   if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]);
433   if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]);
434   if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]);
435   if (D_d) cudaFree(D_d);
436   if (work) cudaFree(work);
437
438   printf("Free resource and exit.\n");
439
440   return 0;
441}

Recall that the full sample code can be found in the NVIDIA/cuQuantum repository (here).

Code Example (Slice-based Parallelism)

It is straightforward to adapt Code Example (Serial) to enable parallel execution of the contraction operation on multiple devices. We will illustrate this with an example using MPI as the communication layer. In the interests of brevity, we will show only the changes that need to be made; the full sample code can be found in the NVIDIA/cuQuantum repository.

First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also initialize MPI and map each process to a device in this section.

10#include <mpi.h>
42#define HANDLE_MPI_ERROR(x)                                       \
43{ const auto err = x;                                             \
44  if( err != MPI_SUCCESS )                                        \
45  { char error[MPI_MAX_ERROR_STRING]; int len;                    \
46    MPI_Error_string(err, error, &len);                           \
47    printf("[Process %d] MPI Error: %s in line %d\n", rank, error, __LINE__); \
48    fflush(stdout);                                               \
49    MPI_Abort(MPI_COMM_WORLD, err);                               \
50  }                                                               \
51};
 96   // Initialize MPI.
 97   int errorCode = MPI_Init(&argc, &argv);
 98   if (errorCode != MPI_SUCCESS)
 99   {
100      printf("Error initializing MPI.\n");
101      MPI_Abort(MPI_COMM_WORLD, errorCode);
102   }
103
104   const int root{0};
105   int rank{};
106   HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
107
108   int numProcs{};
109   HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );
131   // Set deviceId based on ranks and nodes.
132   int deviceId = rank % numDevices;    // We assume that the processes are mapped to nodes in contiguous chunks.
133   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
134   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );

Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one-process-per-device model, the tensor network, including operand and result data, is replicated on each process. The root process initializes the operand data and broadcasts it to the other processes.

245   /*******************
246   * Initialize data
247   *******************/
248
249   // Rank root creates the tensor data.
250   if (rank == root)
251   {
252      for (uint64_t i = 0; i < elementsA; i++)
253         A[i] = ((floatType) rand())/RAND_MAX;
254      for (uint64_t i = 0; i < elementsB; i++)
255         B[i] = ((floatType) rand())/RAND_MAX;
256      for (uint64_t i = 0; i < elementsC; i++)
257         C[i] = ((floatType) rand())/RAND_MAX;
258   }
259
260   // Broadcast data to all ranks.
261   HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, root, MPI_COMM_WORLD) );
262   HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, root, MPI_COMM_WORLD) );
263   HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, root, MPI_COMM_WORLD) );
264
265   // Copy data onto the device on all ranks.
266   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
267   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
268   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.

Next, we find the optimal contraction path and slicing combination for our network. We will run the cuTensorNet optimizer on all processes and determine which process has the best path in terms of FLOP count. We will then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all processes. Each process now has the same optimizer info object, which we use to calculate the share of slices for each process.

351   // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
352   //   example with 3 operands, all processes will compute the same globally optimal path. This is not the case for large
353   //   tensor networks. For large networks, hyperoptimization is also beneficial and can be enabled by setting the
354   //   optimizer config attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES.
355
356   // Force slicing.
357   int32_t min_slices = numProcs;
358   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(
359                                                               handle,
360                                                               optimizerConfig,
361                                                               CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
362                                                               &min_slices,
363                                                               sizeof(min_slices)) );
364
365   HANDLE_ERROR( cutensornetContractionOptimize(handle,
366                                             descNet,
367                                             optimizerConfig,
368                                             workspaceLimit,
369                                             optimizerInfo) );
370
371   double flops{-1.};
372   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
373                                                               handle,
374                                                               optimizerInfo,
375                                                               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
376                                                               &flops,
377                                                               sizeof(flops)) );
378
379   // Choose the path with the lowest cost.
380   struct {
381       double value;
382       int rank;
383   } in{flops, rank}, out;
384
385   HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) );
386
387   int sender = out.rank;
388   flops = out.value;
389   if (rank == root)
390   {
391       printf("Process %d has the path with the lowest FLOP count %lf.\n", sender, flops);
392   }
393
394   size_t bufSize;
395
396   // Get buffer size for optimizerInfo and broadcast it.
397   if (rank == sender)
398   {
399       HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) );
400   }
401
402   HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) );
403
404   // Allocate buffer.
405   std::vector<char> buffer(bufSize);
406
407   // Pack optimizerInfo on sender and broadcast it.
408   if (rank == sender)
409   {
410       HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) );
411   }
412
413   HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) );
414
415   // Unpack optimizerInfo from buffer.
416   if (rank != sender)
417   {
418       HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) );
419   }
420
421   int64_t numSlices = 0;
422   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
423               handle,
424               optimizerInfo,
425               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
426               &numSlices,
427               sizeof(numSlices)) );
428
429   assert(numSlices > 0);
430
431   // Calculate each process's share of the slices.
432
433   int64_t procChunk = numSlices / numProcs;
434   int extra = numSlices % numProcs;
435   int procSliceBegin = rank * procChunk + std::min(rank, extra);
436   int procSliceEnd =  rank == numProcs - 1 ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);

We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and create and auto-tune the contraction plan as described in Contraction plan and auto-tune.

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the network slices. We then provide this slice group object to the cutensornetContractSlices() function to get a partial contraction result on each process.

525   cutensornetSliceGroup_t sliceGroup{};
526   // Create a cutensornetSliceGroup_t object from a range of slice IDs.
527   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) );
548      HANDLE_ERROR( cutensornetContractSlices(handle,
549                                 plan,
550                                 rawDataIn_d,
551                                 D_d,
552                                 accumulateOutput,
553                                 workDesc,
554                                 sliceGroup,
555                                 stream) );

Finally, we sum up the partial contributions to obtain the result of the contraction.

584   // Reduce on root process.
585   if (rank == root)
586   {
587      HANDLE_MPI_ERROR( MPI_Reduce(MPI_IN_PLACE, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) );
588   }
589   else
590   {
591      HANDLE_MPI_ERROR( MPI_Reduce(D, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) );
592   }
659   HANDLE_MPI_ERROR( MPI_Finalize() );

The complete MPI sample can be found in the NVIDIA/cuQuantum repository.

Useful Tips

  • For debugging, the environment variable CUTENSORNET_LOG_LEVEL=n can be set. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to direct the log output to a custom file at <filepath> instead of stdout.
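
For example, to enable the most detailed logging level and redirect the output to a file (the file path below is arbitrary) while running the serial sample built above:

   export CUTENSORNET_LOG_LEVEL=5
   export CUTENSORNET_LOG_FILE=/tmp/cutensornet_log.txt
   ./tensornet_example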