Code example (Contraction with manual distributed slicing)#

For advanced users, it is also possible (but more involved) to adapt Code example (serial contraction) to explicitly parallelize execution of the tensor network contraction operation on multiple GPU devices. Here we will also use MPI as the communication layer. For brevity, we will show only the changes that need to be made on top of the serial example. The full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.

First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.

20#include <mpi.h>
48#define HANDLE_MPI_ERROR(x)                                        \
49    do {                                                           \
50        const auto err = x;                                        \
51        if (err != MPI_SUCCESS)                                    \
52        {                                                          \
53            char error[MPI_MAX_ERROR_STRING];                      \
54            int len;                                               \
55            MPI_Error_string(err, error, &len);                    \
56            printf("MPI Error: %s in line %d\n", error, __LINE__); \
57            fflush(stdout);                                        \
58            MPI_Abort(MPI_COMM_WORLD, err);                        \
59        }                                                          \
60    } while (0)
106    // Initialize MPI
107    HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
108    int rank{-1};
109    HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
110    int numProcs{0};
111    HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));
129    // Set GPU device based on ranks and nodes
130    int numDevices{0};
131    HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
132    const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
133    HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
134    cudaDeviceProp prop;
135    HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));
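The modulo-based device selection above relies on the assumption stated in the comment, namely that MPI ranks are mapped to nodes in contiguous chunks. If the MPI launcher instead assigns ranks to nodes in a round-robin fashion, a common alternative (shown here only as a sketch, not part of the sample) is to derive a node-local rank via MPI_Comm_split_type and use it to select the device:

    // Sketch: pick the GPU based on a node-local rank instead of the global rank
    MPI_Comm localComm;
    HANDLE_MPI_ERROR(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &localComm));
    int localRank{-1};
    HANDLE_MPI_ERROR(MPI_Comm_rank(localComm, &localRank));
    HANDLE_CUDA_ERROR(cudaSetDevice(localRank % numDevices));
    HANDLE_MPI_ERROR(MPI_Comm_free(&localComm));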

Next, we define the tensor network as described in Define tensor network and tensor sizes. Since we use one GPU device per process, the tensor network, including the operand and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

243    /*******************
244     * Initialize data
245     *******************/
246
247    // init output tensor to all 0s
248    memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
249    if (rank == 0)
250    {
251        // init input tensors to random values
252        for (int32_t t = 0; t < numInputs; ++t)
253        {
254            for (uint64_t i = 0; i < tensorElements[t]; ++i) tensorData_h[t][i] = ((floatType)rand()) / RAND_MAX;
255        }
256    }
257
258    // Broadcast input data to all ranks
259    for (int32_t t = 0; t < numInputs; ++t)
260    {
261        HANDLE_MPI_ERROR(MPI_Bcast(tensorData_h[t], tensorElements[t], floatTypeMPI, 0, MPI_COMM_WORLD));
262    }
263
264    // copy input data to device buffers
265    for (int32_t t = 0; t < numInputs; ++t)
266    {
267        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
268    }
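Note that the broadcast count is specified in elements (tensorElements[t]), not bytes, so floatTypeMPI must be the MPI datatype that matches the element type floatType (for example, MPI_FLOAT for float or MPI_DOUBLE for double).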

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.
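These calls are identical on every process. A minimal sketch (with the tensor network specification arguments elided; see the serial example for the full argument list) looks as follows:

    // Each process creates its own library handle ...
    cutensornetHandle_t handle;
    HANDLE_ERROR(cutensornetCreate(&handle));

    // ... and its own (identical) tensor network descriptor
    cutensornetNetworkDescriptor_t networkDesc;
    HANDLE_ERROR(cutensornetCreateNetworkDescriptor(handle,
                                                    /* tensor network specification, identical to the serial example */
                                                    &networkDesc));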

Next, we find the optimal contraction path and slicing combination for our tensor network. We run the cuTensorNet optimizer on all processes and determine which process holds the path with the lowest FLOP count. We then pack the optimizer info object on that process, broadcast the packed buffer, and unpack it on all other processes. Each process now holds an identical optimizer info object, which we use to calculate each process's share of the tensor network slices. Because the optimizer info object is modified when it is unified across processes, we also need to update the network with the modified optimizer info object via cutensornetNetworkSetOptimizerInfo().

353    // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
354    // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
355    // tensor networks. For large networks, hyper-optimization does become beneficial.
356
357    // Enforce tensor network slicing (for parallelization)
358    const int32_t min_slices = numProcs;
359    HANDLE_ERROR(cutensornetContractionOptimizerConfigSetAttribute(handle,
360                                                                   optimizerConfig,
361                                                                   CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
362                                                                   &min_slices,
363                                                                   sizeof(min_slices)));
364
365    // Find an optimized tensor network contraction path on each MPI process
366    HANDLE_ERROR(cutensornetContractionOptimize(handle,
367                                                networkDesc,
368                                                optimizerConfig,
369                                                workspaceLimit,
370                                                optimizerInfo));
371
372    // Query the obtained Flop count
373    double flops{-1.};
374    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
375                                                                 optimizerInfo,
376                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
377                                                                 &flops,
378                                                                 sizeof(flops)));
379
380    // Choose the contraction path with the lowest Flop cost
381    struct
382    {
383        double value;
384        int rank;
385    } in{flops, rank}, out;
386
387    HANDLE_MPI_ERROR(MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD));
388    const int sender = out.rank;
389    flops            = out.value;
390
391    if (verbose) printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);
392
393    // Get the buffer size for optimizerInfo and broadcast it
394    size_t bufSize{0};
395    if (rank == sender)
396    {
397        HANDLE_ERROR(cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize));
398    }
399    HANDLE_MPI_ERROR(MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD));
400
401    // Allocate a buffer
402    std::vector<char> buffer(bufSize);
403
404    // Pack optimizerInfo on sender and broadcast it
405    if (rank == sender)
406    {
407        HANDLE_ERROR(cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize));
408    }
409    HANDLE_MPI_ERROR(MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD));
410
411    // Unpack optimizerInfo from the buffer
412    if (rank != sender)
413    {
414        HANDLE_ERROR(
415            cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo));
416    }
417
418    // Update the network with the modified optimizer info
419    HANDLE_ERROR(cutensornetNetworkSetOptimizerInfo(handle, networkDesc, optimizerInfo));
420
421    // Query the number of slices the tensor network execution will be split into
422    int64_t numSlices = 0;
423    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
424                                                                 optimizerInfo,
425                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
426                                                                 &numSlices,
427                                                                 sizeof(numSlices)));
428    assert(numSlices > 0);
429
430    // Calculate each process's share of the slices
431    int64_t procChunk  = numSlices / numProcs;
432    int extra          = numSlices % numProcs;
433    int procSliceBegin = rank * procChunk + std::min(rank, extra);
434    int procSliceEnd   = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);
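For illustration, suppose numSlices = 10 and numProcs = 4: then procChunk is 2 and extra is 2, so ranks 0 and 1 are assigned three slices each (IDs 0-2 and 3-5, respectively), while ranks 2 and 3 are assigned two slices each (IDs 6-7 and 8-9), covering every slice exactly once.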

We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and then prepare and auto-tune the tensor network contraction as described in Contraction preparation and auto-tuning of the tensor network.
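The workspace handling itself is identical on every process. A minimal sketch of the device-scratch setup (assuming the workspace sizes have already been computed during the preparation step, as in the serial example) might look as follows:

    // Create a workspace descriptor on each process
    cutensornetWorkspaceDescriptor_t workDesc;
    HANDLE_ERROR(cutensornetCreateWorkspaceDescriptor(handle, &workDesc));

    // Query the required device scratch size (valid once the contraction has been prepared)
    int64_t requiredWorkspaceSize{0};
    HANDLE_ERROR(cutensornetWorkspaceGetMemorySize(handle,
                                                   workDesc,
                                                   CUTENSORNET_WORKSIZE_PREF_MIN,
                                                   CUTENSORNET_MEMSPACE_DEVICE,
                                                   CUTENSORNET_WORKSPACE_SCRATCH,
                                                   &requiredWorkspaceSize));

    // Allocate device scratch memory and attach it to the workspace descriptor
    void* work{nullptr};
    HANDLE_CUDA_ERROR(cudaMalloc(&work, requiredWorkspaceSize));
    HANDLE_ERROR(cutensornetWorkspaceSetMemory(handle,
                                               workDesc,
                                               CUTENSORNET_MEMSPACE_DEVICE,
                                               CUTENSORNET_WORKSPACE_SCRATCH,
                                               work,
                                               requiredWorkspaceSize));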

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the tensor network slices. We then provide this slice group object to the cutensornetNetworkContract() function to get a partial contraction result on each process.

525    // Create a cutensornetSliceGroup_t object from a range of slice IDs
526    cutensornetSliceGroup_t sliceGroup{};
527    HANDLE_ERROR(cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup));
548        HANDLE_ERROR(cutensornetNetworkContract(handle,
549                                                networkDesc,
550                                                accumulateOutput,
551                                                workDesc,
552                                                sliceGroup,
553                                                stream));

Finally, we sum up the partial contributions to obtain the full result of the tensor network contraction. Since this sample does not rely on CUDA-aware MPI, each process first copies its partial result from the device back to the host, and the reduction is then performed on the host buffers with MPI_Allreduce.

559        // Perform Allreduce operation on the output tensor
560        HANDLE_CUDA_ERROR(cudaStreamSynchronize(stream));
561        // restore the output tensor on Host
562        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_h[numInputs], tensorData_d[numInputs], tensorSizes[numInputs], cudaMemcpyDeviceToHost));
563        HANDLE_MPI_ERROR(MPI_Allreduce(MPI_IN_PLACE, tensorData_h[numInputs], tensorElements[numInputs], floatTypeMPI, MPI_SUM, MPI_COMM_WORLD));

Before termination, the MPI service needs to be finalized.

624    // Shut down MPI service
625    HANDLE_MPI_ERROR(MPI_Finalize());

The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.