Getting Started

In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to install the library and compile the sample code. Then, we present example code that performs the common steps of a cuTensorNet workflow. In this example, we perform the following tensor contraction:

\[D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}\]

We build up the code step by step, with each step appending code to the end of the previous one. The steps are separated by succinct multi-line comment blocks.

It is recommended that the reader refer to the Overview and the cuTENSOR documentation to become familiar with the nomenclature and with cuTENSOR operations.

Installation and Compilation

Download the cuQuantum package (which cuTensorNet is part of) from https://developer.nvidia.com/cuQuantum-downloads, and the cuTENSOR package from https://developer.nvidia.com/cutensor.

Linux

Assuming cuQuantum has been extracted into CUQUANTUM_ROOT and cuTENSOR into CUTENSOR_ROOT, we update the library path as follows:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ${CUTENSOR_ROOT}/lib/11.0).

The sample code discussed below (tensornet_example.cu) can be compiled via the following command:

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example

To link statically against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be linked against explicitly, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example

Note

Depending on the source of the cuQuantum package, you may need to replace lib above with lib64.

Code Example (Serial)

The following code example illustrates the common steps necessary to use cuTensorNet and also introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository (here).

Headers and data types

 8#include <stdlib.h>
 9#include <stdio.h>
10
11#include <unordered_map>
12#include <vector>
13#include <cassert>
14
15#include <cuda_runtime.h>
16#include <cutensornet.h>
17#include <cutensor.h>
18
19#define HANDLE_ERROR(x)                                           \
20{ const auto err = x;                                             \
21if( err != CUTENSORNET_STATUS_SUCCESS )                           \
22{ printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \
23};
24
25#define HANDLE_CUDA_ERROR(x)                                      \
26{  const auto err = x;                                            \
27   if( err != cudaSuccess )                                       \
28   { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \
29};
30
31struct GPUTimer
32{
33   GPUTimer(cudaStream_t stream): stream_(stream)
34   {
35      cudaEventCreate(&start_);
36      cudaEventCreate(&stop_);
37   }
38
39   ~GPUTimer()
40   {
41      cudaEventDestroy(start_);
42      cudaEventDestroy(stop_);
43   }
44
45   void start()
46   {
47      cudaEventRecord(start_, stream_);
48   }
49
50   float seconds()
51   {
52      cudaEventRecord(stop_, stream_);
53      cudaEventSynchronize(stop_);
54      float time;
55      cudaEventElapsedTime(&time, start_, stop_);
56      return time * 1e-3;
57   }
58
59   private:
60   cudaEvent_t start_, stop_;
61   cudaStream_t stream_;
62};
63
64
65int main()
66{
67   const size_t cuTensornetVersion = cutensornetGetVersion();
68   printf("cuTensorNet-vers:%ld\n",cuTensornetVersion);
69
70   cudaDeviceProp prop;
71   int deviceId{-1};
72   HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) );
73   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
74
75   printf("===== device info ======\n");
76   printf("GPU-name:%s\n", prop.name);
77   printf("GPU-clock:%d\n", prop.clockRate);
78   printf("GPU-memoryClock:%d\n", prop.memoryClockRate);
79   printf("GPU-nSM:%d\n", prop.multiProcessorCount);
80   printf("GPU-major:%d\n", prop.major);
81   printf("GPU-minor:%d\n", prop.minor);
82   printf("========================\n");
83
84   typedef float floatType;
85   cudaDataType_t typeData = CUDA_R_32F;
86   cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;
87
88   printf("Include headers and define data types\n");

Define tensor network and tensor sizes

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

 91   /**********************
 92   * Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}
 93   **********************/
 94
 95   constexpr int32_t numInputs = 3;
 96
 97   // Create vector of modes
 98   std::vector<int32_t> modesA{'m','h','k','n'};
 99   std::vector<int32_t> modesB{'u','k','h'};
100   std::vector<int32_t> modesC{'x','u','y'};
101   std::vector<int32_t> modesD{'m','x','n','y'};
102
103   // Extents
104   std::unordered_map<int32_t, int64_t> extent;
105   extent['m'] = 96;
106   extent['n'] = 96;
107   extent['u'] = 96;
108   extent['h'] = 64;
109   extent['k'] = 64;
110   extent['x'] = 64;
111   extent['y'] = 64;
112
113   // Create a vector of extents for each tensor
114   std::vector<int64_t> extentA;
115   for (auto mode : modesA)
116      extentA.push_back(extent[mode]);
117   std::vector<int64_t> extentB;
118   for (auto mode : modesB)
119      extentB.push_back(extent[mode]);
120   std::vector<int64_t> extentC;
121   for (auto mode : modesC)
122      extentC.push_back(extent[mode]);
123   std::vector<int64_t> extentD;
124   for (auto mode : modesD)
125      extentD.push_back(extent[mode]);
126
127   printf("Define network, modes, and extents\n");

Allocate memory and initialize data

Next, we allocate memory for the tensor network operands and initialize the input data to random values.

130   /**********************
131   * Allocating data
132   **********************/
133
134   size_t elementsA = 1;
135   for (auto mode : modesA)
136      elementsA *= extent[mode];
137   size_t elementsB = 1;
138   for (auto mode : modesB)
139      elementsB *= extent[mode];
140   size_t elementsC = 1;
141   for (auto mode : modesC)
142      elementsC *= extent[mode];
143   size_t elementsD = 1;
144   for (auto mode : modesD)
145      elementsD *= extent[mode];
146
147   size_t sizeA = sizeof(floatType) * elementsA;
148   size_t sizeB = sizeof(floatType) * elementsB;
149   size_t sizeC = sizeof(floatType) * elementsC;
150   size_t sizeD = sizeof(floatType) * elementsD;
151   printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC + sizeD)/1024./1024./1024);
152
153   void* rawDataIn_d[numInputs];
154   void* D_d;
155   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) );
156   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) );
157   HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) );
158   HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_d, sizeD));
159
160   floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA);
161   floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB);
162   floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC);
163   floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD);
164
165   if (A == NULL || B == NULL || C == NULL || D == NULL)
166   {
167      printf("Error: Host allocation of A, B, C, or D failed.\n");
168      return -1;
169   }
170
171   /*******************
172   * Initialize data
173   *******************/
174
175   for (uint64_t i = 0; i < elementsA; i++)
176      A[i] = ((floatType) rand())/RAND_MAX;
177   for (uint64_t i = 0; i < elementsB; i++)
178      B[i] = ((floatType) rand())/RAND_MAX;
179   for (uint64_t i = 0; i < elementsC; i++)
180      C[i] = ((floatType) rand())/RAND_MAX;
181   memset(D, 0, sizeof(floatType) * elementsD);
182
183   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
184   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
185   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
186
187   printf("Allocate memory for data, and initialize data.\n");

cuTensorNet handle and network descriptor

Next, we initialize the cuTensorNet library via cutensornetCreate() and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types.

190   /*************************
191   * cuTensorNet
192   *************************/
193
194   cudaStream_t stream;
195   cudaStreamCreate(&stream);
196
197   cutensornetHandle_t handle;
198   HANDLE_ERROR( cutensornetCreate(&handle) );
199
200   const int32_t nmodeA = modesA.size();
201   const int32_t nmodeB = modesB.size();
202   const int32_t nmodeC = modesC.size();
203   const int32_t nmodeD = modesD.size();
204
205   /*******************************
206   * Create Network Descriptor
207   *******************************/
208
209   const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data()};
210   int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC};
211   const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data()};
212   const int64_t* stridesIn[] = {NULL, NULL, NULL}; // strides are optional; if no stride is provided, then cuTensorNet assumes a generalized column-major data layout
213
214   // Note that pointers allocated via cudaMalloc are aligned to 256-byte
215   // boundaries by default; however, here we check the pointer alignment explicitly
216   // to demonstrate how one would check the alignment of arbitrary pointers.
217
218   auto getMaximalPointerAlignment = [](const void* ptr) {
219      const uint64_t ptrAddr  = reinterpret_cast<uint64_t>(ptr);
220      uint32_t alignment = 1;
221      while(ptrAddr % alignment == 0 &&
222            alignment < 256) // terminate once the alignment reaches 256 bytes (we could keep going, but any alignment of 256 bytes or more is equally fine)
223      {
224         alignment *= 2;
225      }
226      return alignment;
227   };
228   const uint32_t alignmentsIn[] = {getMaximalPointerAlignment(rawDataIn_d[0]),
229                                    getMaximalPointerAlignment(rawDataIn_d[1]),
230                                    getMaximalPointerAlignment(rawDataIn_d[2])};
231   const uint32_t alignmentOut = getMaximalPointerAlignment(D_d);
232
233   // setup tensor network
234   cutensornetNetworkDescriptor_t descNet;
235   HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle,
236                                                numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn,
237                                                nmodeD, extentD.data(), /*stridesOut = */NULL, modesD.data(), alignmentOut,
238                                                typeData, typeCompute,
239                                                &descNet) );
240
241   printf("Initialize the cuTensorNet library and create a network descriptor.\n");

Optimal contraction order and slicing

At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available resources, and provide it to the optimizer as a constraint. We then create an optimizer config object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is called using cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.

244   /*******************************
245   * Choose workspace limit based on available resources.
246   *******************************/
247
248   size_t freeMem, totalMem;
249   HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) );
250   uint64_t workspaceLimit = (uint64_t)((double) freeMem * 0.9);
251
252   /*******************************
253   * Find "optimal" contraction order and slicing
254   *******************************/
255
256   cutensornetContractionOptimizerConfig_t optimizerConfig;
257   HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) );
258
259   // Set the value of the partitioner imbalance factor, if desired
260   int32_t imbalance_factor = 30;
261   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(
262                                                               handle,
263                                                               optimizerConfig,
264                                                               CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_IMBALANCE_FACTOR,
265                                                               &imbalance_factor,
266                                                               sizeof(imbalance_factor)) );
267
268
269   cutensornetContractionOptimizerInfo_t optimizerInfo;
270   HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) );
271
272   HANDLE_ERROR( cutensornetContractionOptimize(handle,
273                                             descNet,
274                                             optimizerConfig,
275                                             workspaceLimit,
276                                             optimizerInfo) );
277
278   int64_t numSlices = 0;
279   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
280               handle,
281               optimizerInfo,
282               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
283               &numSlices,
284               sizeof(numSlices)) );
285
286   assert(numSlices > 0);
287
288   printf("Find an optimized contraction path with cuTensorNet optimizer.\n");

It is also possible to bypass the cuTensorNet optimizer and set a pre-determined path, as well as slicing information, directly into the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute().
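
As a minimal sketch of this alternative (not part of the sample above), one could fill in a contraction path and set it on the optimizer info object before the contraction plan is created. The snippet assumes the CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH attribute and the cutensornetContractionPath_t / cutensornetNodePair_t structures declared in cutensornet.h; the node-pair values are purely illustrative, and the exact indexing convention for intermediate tensors should be taken from the cutensornetContractionPath_t documentation:

   // Sketch: bypass the optimizer by setting a pre-determined contraction path.
   // For this 3-tensor network there are numInputs - 1 = 2 pairwise contractions;
   // the pair indices below are illustrative and must follow the path indexing
   // convention documented for cutensornetContractionPath_t.
   cutensornetNodePair_t nodePairs[] = { {0, 1}, {0, 1} };

   cutensornetContractionPath_t path;
   path.numContractions = numInputs - 1;
   path.data            = nodePairs;

   HANDLE_ERROR( cutensornetContractionOptimizerInfoSetAttribute(
                           handle,
                           optimizerInfo,
                           CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                           &path,
                           sizeof(path)) );

With a path provided this way, the call to cutensornetContractionOptimize() above would be omitted.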

Create workspace descriptor and allocate workspace memory

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.

291   /*******************************
292   * Create workspace descriptor, allocate workspace, and set it.
293   *******************************/
294
295   cutensornetWorkspaceDescriptor_t workDesc;
296   HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) );
297
298   uint64_t requiredWorkspaceSize = 0;
299   HANDLE_ERROR( cutensornetWorkspaceComputeSizes(handle,
300                                          descNet,
301                                          optimizerInfo,
302                                          workDesc) );
303
304   HANDLE_ERROR( cutensornetWorkspaceGetSize(handle,
305                                         workDesc,
306                                         CUTENSORNET_WORKSIZE_PREF_MIN,
307                                         CUTENSORNET_MEMSPACE_DEVICE,
308                                         &requiredWorkspaceSize) );
309
310   void *work = nullptr;
311   HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) );
312
313   HANDLE_ERROR( cutensornetWorkspaceSet(handle,
314                                         workDesc,
315                                         CUTENSORNET_MEMSPACE_DEVICE,
316                                         work,
317                                         requiredWorkspaceSize) );
318
319   printf("Allocate workspace.\n");

Contraction plan and auto-tune

We create a contraction plan holding pair-wise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan such that cuTENSOR selects the best kernel for each contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing this plan redundantly.

322   /*******************************
323   * Initialize all pair-wise contraction plans (for cuTENSOR).
324   *******************************/
325
326   cutensornetContractionPlan_t plan;
327
328   HANDLE_ERROR( cutensornetCreateContractionPlan(handle,
329                                                  descNet,
330                                                  optimizerInfo,
331                                                  workDesc,
332                                                  &plan) );
333
334
335   /*******************************
336   * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel
337   *           for each pairwise contraction.
338   *******************************/
339   cutensornetContractionAutotunePreference_t autotunePref;
340   HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle,
341                           &autotunePref) );
342
343   const int numAutotuningIterations = 5; // may be 0
344   HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute(
345                           handle,
346                           autotunePref,
347                           CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS,
348                           &numAutotuningIterations,
349                           sizeof(numAutotuningIterations)) );
350
351   // modify the plan again to find the best pair-wise contractions
352   HANDLE_ERROR( cutensornetContractionAutotune(handle,
353                           plan,
354                           rawDataIn_d,
355                           D_d,
356                           workDesc,
357                           autotunePref,
358                           stream) );
359
360   HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) );
361
362   printf("Create a contraction plan for cuTensorNet and optionally auto-tune it.\n");

Network contraction execution

Finally, we contract the network as many times as needed, possibly with different data. Network slices, captured as a cutensornetSliceGroup_t object, are executed using the same contraction plan. For convenience, NULL can be provided to the cutensornetContractSlices() function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free allocated resources.

365   /**********************
366   * Run
367   **********************/
368   cutensornetSliceGroup_t sliceGroup{};
369
370   // Create a cutensornetSliceGroup_t object from a range of slice IDs.
371   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) );
372
373   GPUTimer timer{stream};
374   double minTimeCUTENSOR = 1e100;
375   const int numRuns = 3; // to get stable perf results
376   for (int i=0; i < numRuns; ++i)
377   {
378      cudaMemcpy(D_d, D, sizeD, cudaMemcpyHostToDevice); // restore output
379      cudaDeviceSynchronize();
380
381      /*
382      * Contract over all slices.
383      *
384      * A user may choose to parallelize over the slices across multiple devices.
385      */
386      timer.start();
387
388      int32_t accumulateOutput = 0;
389      HANDLE_ERROR( cutensornetContractSlices(handle,
390                                 plan,
391                                 rawDataIn_d,
392                                 D_d,
393                                 accumulateOutput,
394                                 workDesc,
395                                 sliceGroup,    // Alternatively, NULL can also be used to contract over all the slices instead of specifying a sliceGroup object.
396                                 stream) );
397
398      // Synchronize and measure timing
399      auto time = timer.seconds();
400      minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
401   }
402
403   printf("Contract the network, each slice uses the same contraction plan.\n");
404
405
406   /*************************/
407
408   double flops{0.};
409   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
410               handle,
411               optimizerInfo,
412               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
413               &flops,
414               sizeof(flops)) );
415
416   printf("numSlices: %ld\n", numSlices);
417   printf("%.2f ms / slice\n", minTimeCUTENSOR * 1000.f / numSlices);
418   printf("%.2f GFLOPS/s\n", flops/1e9/minTimeCUTENSOR );
419
420   HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) );
421   HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) );
422   HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) );
423   HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) );
424   HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) );
425   HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) );
426   HANDLE_ERROR( cutensornetDestroy(handle) );
427
428   if (A) free(A);
429   if (B) free(B);
430   if (C) free(C);
431   if (D) free(D);
432   if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]);
433   if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]);
434   if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]);
435   if (D_d) cudaFree(D_d);
436   if (work) cudaFree(work);
437
438   printf("Free resource and exit.\n");
439
440   return 0;
441}

Recall that the full sample code can be found in the NVIDIA/cuQuantum repository (here).

Code Example (Slice-based Parallelism)

It is straightforward to adapt Code Example (Serial) to enable parallel execution of the contraction operation on multiple devices. We will illustrate this with an example using MPI as the communication layer. In the interests of brevity, we will show only the changes that need to be made; the full sample code can be found in the NVIDIA/cuQuantum repository.

First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also initialize MPI and map each process to a device in this section.

10#include <mpi.h>
42#define HANDLE_MPI_ERROR(x)                                       \
43{ const auto err = x;                                             \
44  if( err != MPI_SUCCESS )                                        \
45  { char error[MPI_MAX_ERROR_STRING]; int len;                    \
46    MPI_Error_string(err, error, &len);                           \
47    printf("[Process %d] MPI Error: %s in line %d\n", rank, error, __LINE__); \
48    fflush(stdout);                                               \
49    MPI_Abort(MPI_COMM_WORLD, err);                               \
50  }                                                               \
51};
 96   // Initialize MPI.
 97   int errorCode = MPI_Init(&argc, &argv);
 98   if (errorCode != MPI_SUCCESS)
 99   {
100      printf("Error initializing MPI.\n");
101      MPI_Abort(MPI_COMM_WORLD, errorCode);
102   }
103
104   const int root{0};
105   int rank{};
106   HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );
107
108   int numProcs{};
109   HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );
131   // Set deviceId based on ranks and nodes.
132   int deviceId = rank % numDevices;    // We assume that the processes are mapped to nodes in contiguous chunks.
133   HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
134   HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );

Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one-process-per-device model, the tensor network, including operand and result data, is replicated on each process. The root process initializes the operand data and broadcasts it to the other processes.

245   /*******************
246   * Initialize data
247   *******************/
248
249   // Rank root creates the tensor data.
250   if (rank == root)
251   {
252      for (uint64_t i = 0; i < elementsA; i++)
253         A[i] = ((floatType) rand())/RAND_MAX;
254      for (uint64_t i = 0; i < elementsB; i++)
255         B[i] = ((floatType) rand())/RAND_MAX;
256      for (uint64_t i = 0; i < elementsC; i++)
257         C[i] = ((floatType) rand())/RAND_MAX;
258   }
259
260   // Broadcast data to all ranks.
261   HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, root, MPI_COMM_WORLD) );
262   HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, root, MPI_COMM_WORLD) );
263   HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, root, MPI_COMM_WORLD) );
264
265   // Copy data onto the device on all ranks.
266   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
267   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
268   HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.

Next, we find the optimal contraction path and slicing combination for our network. We will run the cuTensorNet optimizer on all processes and determine which process has the best path in terms of FLOP count. We will then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all processes. Each process now has the same optimizer info object, which we use to calculate the share of slices for each process.

351   // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
352   //   example with 3 operands, all processes will compute the same globally optimal path. This is not the case for large
353   //   tensor networks. For large networks, hyperoptimization is also beneficial and can be enabled by setting the
354   //   optimizer config attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES.
355
356   // Force slicing.
357   int32_t min_slices = numProcs;
358   HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(
359                                                               handle,
360                                                               optimizerConfig,
361                                                               CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
362                                                               &min_slices,
363                                                               sizeof(min_slices)) );
364
365   HANDLE_ERROR( cutensornetContractionOptimize(handle,
366                                             descNet,
367                                             optimizerConfig,
368                                             workspaceLimit,
369                                             optimizerInfo) );
370
371   double flops{-1.};
372   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
373                                                               handle,
374                                                               optimizerInfo,
375                                                               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
376                                                               &flops,
377                                                               sizeof(flops)) );
378
379   // Choose the path with the lowest cost.
380   struct {
381       double value;
382       int rank;
383   } in{flops, rank}, out;
384
385   HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) );
386
387   int sender = out.rank;
388   flops = out.value;
389   if (rank == root)
390   {
391       printf("Process %d has the path with the lowest FLOP count %lf.\n", sender, flops);
392   }
393
394   size_t bufSize;
395
396   // Get buffer size for optimizerInfo and broadcast it.
397   if (rank == sender)
398   {
399       HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) );
400   }
401
402   HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) );
403
404   // Allocate buffer.
405   std::vector<char> buffer(bufSize);
406
407   // Pack optimizerInfo on sender and broadcast it.
408   if (rank == sender)
409   {
410       HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) );
411   }
412
413   HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) );
414
415   // Unpack optimizerInfo from buffer.
416   if (rank != sender)
417   {
418       HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) );
419   }
420
421   int64_t numSlices = 0;
422   HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
423               handle,
424               optimizerInfo,
425               CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
426               &numSlices,
427               sizeof(numSlices)) );
428
429   assert(numSlices > 0);
430
431   // Calculate each process's share of the slices.
432
433   int64_t procChunk = numSlices / numProcs;
434   int extra = numSlices % numProcs;
435   int procSliceBegin = rank * procChunk + std::min(rank, extra);
436   int procSliceEnd =  rank == numProcs - 1 ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);

We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and create and auto-tune the contraction plan as described in Contraction plan and auto-tune.

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the network slices. We then provide this slice group object to the cutensornetContractSlices() function to get a partial contraction result on each process.

525   cutensornetSliceGroup_t sliceGroup{};
526   // Create a cutensornetSliceGroup_t object from a range of slice IDs.
527   HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) );
548      HANDLE_ERROR( cutensornetContractSlices(handle,
549                                 plan,
550                                 rawDataIn_d,
551                                 D_d,
552                                 accumulateOutput,
553                                 workDesc,
554                                 sliceGroup,
555                                 stream) );

Finally, we sum up the partial contributions to obtain the result of the contraction.

584   // Reduce on root process.
585   if (rank == root)
586   {
587      HANDLE_MPI_ERROR( MPI_Reduce(MPI_IN_PLACE, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) );
588   }
589   else
590   {
591      HANDLE_MPI_ERROR( MPI_Reduce(D, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) );
592   }
659   HANDLE_MPI_ERROR( MPI_Finalize() );

The complete MPI sample can be found in the NVIDIA/cuQuantum repository.

Useful Tips

  • For debugging, the environment variable CUTENSORNET_LOG_LEVEL=n can be set. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to direct the log output to a custom file at <filepath> instead of stdout.
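
For example, to enable the most detailed logging level and redirect the output to a file (the file path below is arbitrary) while running the serial sample built above:

   export CUTENSORNET_LOG_LEVEL=5
   export CUTENSORNET_LOG_FILE=/tmp/cutensornet_log.txt
   ./tensornet_example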