Code example (Contraction with manual distributed slicing)#

For advanced users, it is also possible (but more involved) to adapt Code example (serial contraction) to explicitly parallelize execution of the tensor network contraction operation on multiple GPU devices. Here we will also use MPI as the communication layer. For brevity, we will show only the changes that need to be made on top of the serial example. The full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.

First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.

20#include <mpi.h>
48#define HANDLE_MPI_ERROR(x)                                        \
49    do {                                                           \
50        const auto err = x;                                        \
51        if (err != MPI_SUCCESS)                                    \
52        {                                                          \
53            char error[MPI_MAX_ERROR_STRING];                      \
54            int len;                                               \
55            MPI_Error_string(err, error, &len);                    \
56            printf("MPI Error: %s in line %d\n", error, __LINE__); \
57            fflush(stdout);                                        \
58            MPI_Abort(MPI_COMM_WORLD, err);                        \
59        }                                                          \
60    } while (0)
106    // Initialize MPI
107    HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
108    int rank{-1};
109    HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
110    int numProcs{0};
111    HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));
129    // Set GPU device based on ranks and nodes
130    int numDevices{0};
131    HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
132    const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
133    HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
134    cudaDeviceProp prop;
135    HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));
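The modulo-based device selection above relies on the assumption stated in the comment, namely that MPI ranks are mapped to nodes in contiguous chunks. If the MPI launcher instead assigns ranks to nodes in a round-robin fashion, a common alternative (shown here only as a sketch, not part of the sample) is to derive a node-local rank via MPI_Comm_split_type and use it to select the device:

    // Sketch: pick the GPU based on a node-local rank instead of the global rank
    MPI_Comm localComm;
    HANDLE_MPI_ERROR(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &localComm));
    int localRank{-1};
    HANDLE_MPI_ERROR(MPI_Comm_rank(localComm, &localRank));
    HANDLE_CUDA_ERROR(cudaSetDevice(localRank % numDevices));
    HANDLE_MPI_ERROR(MPI_Comm_free(&localComm));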

Next, we define the tensor network as described in Define tensor network and tensor sizes. Since we use one GPU device per process, the tensor network, including the operand and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

243    /*******************
244     * Initialize data
245     *******************/
246
247    // init output tensor to all 0s
248    memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
249    if (rank == 0)
250    {
251        // init input tensors to random values
252        for (int32_t t = 0; t < numInputs; ++t)
253        {
254            for (uint64_t i = 0; i < tensorElements[t]; ++i) tensorData_h[t][i] = ((floatType)rand()) / RAND_MAX;
255        }
256    }
257
258    // Broadcast input data to all ranks
259    for (int32_t t = 0; t < numInputs; ++t)
260    {
261        HANDLE_MPI_ERROR(MPI_Bcast(tensorData_h[t], tensorElements[t], floatTypeMPI, 0, MPI_COMM_WORLD));
262    }
263
264    // copy input data to device buffers
265    for (int32_t t = 0; t < numInputs; ++t)
266    {
267        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
268    }
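Note that the broadcast count is specified in elements (tensorElements[t]), not bytes, so floatTypeMPI must be the MPI datatype that matches the element type floatType (for example, MPI_FLOAT for float or MPI_DOUBLE for double).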

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.
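These calls are identical on every process. A minimal sketch (with the tensor network specification arguments elided; see the serial example for the full argument list) looks as follows:

    // Each process creates its own library handle ...
    cutensornetHandle_t handle;
    HANDLE_ERROR(cutensornetCreate(&handle));

    // ... and its own (identical) tensor network descriptor
    cutensornetNetworkDescriptor_t networkDesc;
    HANDLE_ERROR(cutensornetCreateNetworkDescriptor(handle,
                                                    /* tensor network specification, identical to the serial example */
                                                    &networkDesc));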

Next, we find the optimal contraction path and slicing combination for our tensor network. We run the cuTensorNet optimizer on all processes and determine which process holds the path with the lowest FLOP count. We then pack the optimizer info object on that process, broadcast the packed buffer, and unpack it on all other processes. Each process now holds an identical optimizer info object, which we use to calculate each process's share of the tensor network slices. Because the optimizer info object is modified when it is unified across processes, we also need to update the network with the modified optimizer info object via cutensornetNetworkSetOptimizerInfo().

353    // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
354    // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
355    // tensor networks. For large networks, hyper-optimization does become beneficial.
356
357    // Enforce tensor network slicing (for parallelization)
358    const int32_t min_slices = numProcs;
359    HANDLE_ERROR(cutensornetContractionOptimizerConfigSetAttribute(handle,
360                                                                   optimizerConfig,
361                                                                   CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
362                                                                   &min_slices,
363                                                                   sizeof(min_slices)));
364
365    // Find an optimized tensor network contraction path on each MPI process
366    HANDLE_ERROR(cutensornetContractionOptimize(handle,
367                                                networkDesc,
368                                                optimizerConfig,
369                                                workspaceLimit,
370                                                optimizerInfo));
371
372    // Query the obtained Flop count
373    double flops{-1.};
374    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
375                                                                 optimizerInfo,
376                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
377                                                                 &flops,
378                                                                 sizeof(flops)));
379
380    // Choose the contraction path with the lowest Flop cost
381    struct
382    {
383        double value;
384        int rank;
385    } in{flops, rank}, out;
386
387    HANDLE_MPI_ERROR(MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD));
388    const int sender = out.rank;
389    flops            = out.value;
390
391    if (verbose) printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);
392
393    // Get the buffer size for optimizerInfo and broadcast it
394    size_t bufSize{0};
395    if (rank == sender)
396    {
397        HANDLE_ERROR(cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize));
398    }
399    HANDLE_MPI_ERROR(MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD));
400
401    // Allocate a buffer
402    std::vector<char> buffer(bufSize);
403
404    // Pack optimizerInfo on sender and broadcast it
405    if (rank == sender)
406    {
407        HANDLE_ERROR(cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize));
408    }
409    HANDLE_MPI_ERROR(MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD));
410
411    // Unpack optimizerInfo from the buffer
412    if (rank != sender)
413    {
414        HANDLE_ERROR(
415            cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo));
416    }
417
418    // Update the network with the modified optimizer info
419    HANDLE_ERROR(cutensornetNetworkSetOptimizerInfo(handle, networkDesc, optimizerInfo));
420
421    // Query the number of slices the tensor network execution will be split into
422    int64_t numSlices = 0;
423    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
424                                                                 optimizerInfo,
425                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
426                                                                 &numSlices,
427                                                                 sizeof(numSlices)));
428    assert(numSlices > 0);
429
430    // Calculate each process's share of the slices
431    int64_t procChunk  = numSlices / numProcs;
432    int extra          = numSlices % numProcs;
433    int procSliceBegin = rank * procChunk + std::min(rank, extra);
434    int procSliceEnd   = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);
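For illustration, suppose numSlices = 10 and numProcs = 4: then procChunk is 2 and extra is 2, so ranks 0 and 1 are assigned three slices each (IDs 0-2 and 3-5, respectively), while ranks 2 and 3 are assigned two slices each (IDs 6-7 and 8-9), covering every slice exactly once.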

We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and then prepare and auto-tune the tensor network contraction as described in Contraction preparation and auto-tuning of the tensor network.
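The workspace handling itself is identical on every process. A minimal sketch of the device-scratch setup (assuming the workspace sizes have already been computed during the preparation step, as in the serial example) might look as follows:

    // Create a workspace descriptor on each process
    cutensornetWorkspaceDescriptor_t workDesc;
    HANDLE_ERROR(cutensornetCreateWorkspaceDescriptor(handle, &workDesc));

    // Query the required device scratch size (valid once the contraction has been prepared)
    int64_t requiredWorkspaceSize{0};
    HANDLE_ERROR(cutensornetWorkspaceGetMemorySize(handle,
                                                   workDesc,
                                                   CUTENSORNET_WORKSIZE_PREF_MIN,
                                                   CUTENSORNET_MEMSPACE_DEVICE,
                                                   CUTENSORNET_WORKSPACE_SCRATCH,
                                                   &requiredWorkspaceSize));

    // Allocate device scratch memory and attach it to the workspace descriptor
    void* work{nullptr};
    HANDLE_CUDA_ERROR(cudaMalloc(&work, requiredWorkspaceSize));
    HANDLE_ERROR(cutensornetWorkspaceSetMemory(handle,
                                               workDesc,
                                               CUTENSORNET_MEMSPACE_DEVICE,
                                               CUTENSORNET_WORKSPACE_SCRATCH,
                                               work,
                                               requiredWorkspaceSize));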

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the tensor network slices. We then provide this slice group object to the cutensornetNetworkContract() function to get a partial contraction result on each process.

525    // Create a cutensornetSliceGroup_t object from a range of slice IDs
526    cutensornetSliceGroup_t sliceGroup{};
527    HANDLE_ERROR(cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup));
548        HANDLE_ERROR(cutensornetNetworkContract(handle,
549                                                networkDesc,
550                                                accumulateOutput,
551                                                workDesc,
552                                                sliceGroup,
553                                                stream));

Finally, we sum up the partial contributions to obtain the full result of the tensor network contraction. Since this sample does not rely on CUDA-aware MPI, each process first copies its partial result from the device back to the host, and the reduction is then performed on the host buffers with MPI_Allreduce.

559        // Perform Allreduce operation on the output tensor
560        HANDLE_CUDA_ERROR(cudaStreamSynchronize(stream));
561        // restore the output tensor on Host
562        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_h[numInputs], tensorData_d[numInputs], tensorSizes[numInputs], cudaMemcpyDeviceToHost));
563        HANDLE_MPI_ERROR(MPI_Allreduce(MPI_IN_PLACE, tensorData_h[numInputs], tensorElements[numInputs], floatTypeMPI, MPI_SUM, MPI_COMM_WORLD));

Before termination, the MPI service needs to be finalized.

624    // Shut down MPI service
625    HANDLE_MPI_ERROR(MPI_Finalize());

The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.