Examples

In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to compile the sample code. Then, we present example code that performs the common steps in cuTensorNet. In the example, we perform the following tensor contraction:

\[R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}\]

We construct the code step by step, with each step appending code to the end of the previous one. The steps are separated by succinct multi-line comment blocks.

It is recommended that the reader refer to the Overview and the cuTENSOR documentation for familiarity with the nomenclature and cuTENSOR operations.

Compiling code

Assuming cuQuantum has been extracted to CUQUANTUM_ROOT and cuTENSOR to CUTENSOR_ROOT, we update the library path as follows:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib:${LD_LIBRARY_PATH}

Depending on your CUDA Toolkit, you might have to choose a different library version (e.g., ${CUTENSOR_ROOT}/lib/13.0).
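For instance, with a cuTENSOR build targeting CUDA 13 the export above could instead read (an illustration only; adjust the subdirectory to your installed package layout):

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/13.0:${LD_LIBRARY_PATH}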

The serial sample code discussed below (tensornet_example.cu) can be compiled via the following command:

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -lcutensornet -lcutensor -o tensornet_example

For static linking against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be explicitly linked against, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):

nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib -lcutensor libmetis_static.a -o tensornet_example

To build the parallel (MPI) versions of the examples (tensornet_example_mpi_auto.cu and tensornet_example_mpi.cu), an MPI library needs to be installed (e.g., a recent Open MPI, MVAPICH, or MPICH). In particular, the automatic parallel example requires CUDA-aware MPI; see Code example (automatic slice-based distributed parallelization) below. In this case, add -I${MPI_PATH}/include and -L${MPI_PATH}/lib -lmpi to the build command:

nvcc tensornet_example_mpi_auto.cu -I${CUQUANTUM_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi_auto
nvcc tensornet_example_mpi.cu -I${CUQUANTUM_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -lcutensornet -lcutensor -L${MPI_PATH}/lib -lmpi -o tensornet_example_mpi
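The resulting executables can then be started through the MPI launcher of your installation, for example (an illustration only; the launcher name and process count depend on your setup):

mpiexec -n 4 ./tensornet_example_mpi_auto
mpiexec -n 4 ./tensornet_example_mpi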

Warning

Running tensornet_example_mpi_auto without a CUDA-aware MPI library will cause the program to crash.

Note

Depending on the source of the cuQuantum package, you may need to replace lib above with lib64.

Code example (serial)

The following code example illustrates the common steps necessary to use cuTensorNet and also introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository (here).

Headers and data types

  8#include <stdlib.h>
  9#include <stdio.h>
 10
 11#include <unordered_map>
 12#include <vector>
 13#include <cassert>
 14
 15#include <cuda_runtime.h>
 16#include <cutensornet.h>
 17
 18#define HANDLE_ERROR(x)                                                                 \
 19    do {                                                                                \
 20        const auto err = x;                                                             \
 21        if (err != CUTENSORNET_STATUS_SUCCESS)                                          \
 22        {                                                                               \
 23            printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); \
 24            fflush(stdout);                                                             \
 25            exit(EXIT_FAILURE);                                                         \
 26        }                                                                               \
 27    } while (0)
 28
 29#define HANDLE_CUDA_ERROR(x)                                                          \
 30    do {                                                                              \
 31        const auto err = x;                                                           \
 32        if (err != cudaSuccess)                                                       \
 33        {                                                                             \
 34            printf("CUDA Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); \
 35            fflush(stdout);                                                           \
 36            exit(EXIT_FAILURE);                                                       \
 37        }                                                                             \
 38    } while (0)
 39
 40// Usage: DEV_ATTR(cudaDevAttrClockRate, deviceId)
 41#define DEV_ATTR(ENUMCONST, DID)                                                   \
 42    ({ int v;                                                                       \
 43       HANDLE_CUDA_ERROR(cudaDeviceGetAttribute(&v, ENUMCONST, DID));               \
 44       v; })
 45
 46
 47struct GPUTimer
 48{
 49    GPUTimer(cudaStream_t stream) : stream_(stream)
 50    {
 51        HANDLE_CUDA_ERROR(cudaEventCreate(&start_));
 52        HANDLE_CUDA_ERROR(cudaEventCreate(&stop_));
 53    }
 54
 55    ~GPUTimer()
 56    {
 57        HANDLE_CUDA_ERROR(cudaEventDestroy(start_));
 58        HANDLE_CUDA_ERROR(cudaEventDestroy(stop_));
 59    }
 60
 61    void start() { HANDLE_CUDA_ERROR(cudaEventRecord(start_, stream_)); }
 62
 63    float seconds()
 64    {
 65        HANDLE_CUDA_ERROR(cudaEventRecord(stop_, stream_));
 66        HANDLE_CUDA_ERROR(cudaEventSynchronize(stop_));
 67        float time;
 68        HANDLE_CUDA_ERROR(cudaEventElapsedTime(&time, start_, stop_));
 69        return time * 1e-3;
 70    }
 71
 72private:
 73    cudaEvent_t start_, stop_;
 74    cudaStream_t stream_;
 75};
 76
 77int main()
 78{
 79    static_assert(sizeof(size_t) == sizeof(int64_t), "Please build this sample on a 64-bit architecture!");
 80
 81    bool verbose = true;
 82
 83    // Check cuTensorNet version
 84    const size_t cuTensornetVersion = cutensornetGetVersion();
 85    if (verbose) printf("cuTensorNet version: %ld\n", cuTensornetVersion);
 86
 87    // Set GPU device
 88    int numDevices{0};
 89    HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
 90    const int deviceId = 0;
 91    HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
 92    cudaDeviceProp prop;
 93    HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));
 94
 95    if (verbose)
 96    {
 97        printf("===== device info ======\n");
 98        printf("GPU-local-id:%d\n", deviceId);
 99        printf("GPU-name:%s\n", prop.name);
100        printf("GPU-clock:%d\n", DEV_ATTR(cudaDevAttrClockRate, deviceId));
101        printf("GPU-memoryClock:%d\n", DEV_ATTR(cudaDevAttrMemoryClockRate, deviceId));
102        printf("GPU-nSM:%d\n", prop.multiProcessorCount);
103        printf("GPU-major:%d\n", prop.major);
104        printf("GPU-minor:%d\n", prop.minor);
105        printf("========================\n");
106    }
107
108    typedef float floatType;
109    cudaDataType_t typeData              = CUDA_R_32F;
110    cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;
111
112    if (verbose) printf("Included headers and defined data types\n");

Define tensor network and tensor sizes

Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).

116    /**************************************************************************************
117     * Computing: R_{k,l} = A_{a,b,c,d,e,f} B_{b,g,h,e,i,j} C_{m,a,g,f,i,k} D_{l,c,h,d,j,m}
118     **************************************************************************************/
119
120    constexpr int32_t numInputs = 4;
121
122    // Create vectors of tensor modes
123    std::vector<std::vector<int32_t>> tensorModes{ // for input tensors & output tensor
124        // input tensors
125        {'a', 'b', 'c', 'd', 'e', 'f'}, // tensor A
126        {'b', 'g', 'h', 'e', 'i', 'j'}, // tensor B
127        {'m', 'a', 'g', 'f', 'i', 'k'}, // tensor C
128        {'l', 'c', 'h', 'd', 'j', 'm'}, // tensor D
129        // output tensor
130        {'k', 'l'}, // tensor R
131    };
132
133    // Set mode extents
134    int64_t sameExtent = 16; // setting same extent for simplicity. In principle extents can differ.
135    std::unordered_map<int32_t, int64_t> extent;
136    for (auto& vec : tensorModes)
137    {
138        for (auto& mode : vec)
139        {
140            extent[mode] = sameExtent;
141        }
142    }
143
144    // Create a vector of extents for each tensor
145    std::vector<std::vector<int64_t>> tensorExtents; // for input tensors & output tensor
146    tensorExtents.resize(numInputs + 1);             // hold inputs + output tensors
147    for (int32_t t = 0; t < numInputs + 1; ++t)
148    {
149        for (auto& mode : tensorModes[t]) tensorExtents[t].push_back(extent[mode]);
150    }
151
152    if (verbose) printf("Defined tensor network, modes, and extents\n");

Allocate memory and initialize data

Next, we allocate memory for the tensor network operands and initialize them to random values.

155    /*****************
156     * Allocating data
157     *****************/
158
159    std::vector<size_t> tensorElements(numInputs + 1); // for input tensors & output tensor
160    std::vector<size_t> tensorSizes(numInputs + 1);    // for input tensors & output tensor
161    size_t totalSize = 0;
162    for (int32_t t = 0; t < numInputs + 1; ++t)
163    {
164        size_t numElements = 1;
165        for (auto& mode : tensorModes[t]) numElements *= extent[mode];
166        tensorElements[t] = numElements;
167
168        tensorSizes[t] = sizeof(floatType) * numElements;
169        totalSize += tensorSizes[t];
170    }
171
172    if (verbose) printf("Total GPU memory used for tensor storage: %.2f GiB\n", (totalSize) / 1024. / 1024. / 1024);
173
174    void* tensorData_d[numInputs + 1]; // for input tensors & output tensor
175    for (int32_t t = 0; t < numInputs + 1; ++t)
176    {
177        HANDLE_CUDA_ERROR(cudaMalloc((void**)&tensorData_d[t], tensorSizes[t]));
178    }
179
180    floatType* tensorData_h[numInputs + 1]; // for input tensors & output tensor
181    for (int32_t t = 0; t < numInputs + 1; ++t)
182    {
183        tensorData_h[t] = (floatType*)malloc(tensorSizes[t]);
184        if (tensorData_h[t] == NULL)
185        {
186            printf("Error: Host memory allocation failed!\n");
187            return -1;
188        }
189    }
190
191    /*****************
192     * Initialize data
193     *****************/
194
195    // init output tensor to all 0s
196    memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
197    // init input tensors to random values
198    for (int32_t t = 0; t < numInputs; ++t)
199    {
200        for (size_t e = 0; e < tensorElements[t]; ++e) tensorData_h[t][e] = ((floatType)rand()) / RAND_MAX;
201    }
202    // copy input data to device buffers
203    for (int32_t t = 0; t < numInputs; ++t)
204    {
205        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
206    }

cuTensorNet handle and network descriptor

Next, we initialize the cuTensorNet library via cutensornetCreate(). Note that the created library context will be associated with the currently active GPU. We create the network descriptor and append the input tensors with the desired tensor modes and extents, as well as the data type. We can optionally set the output tensor modes (if skipped, the output modes will be inferred). We can also optionally set the compute type on the network (if skipped, a default compute type corresponding to the data type will be used; refer to cutensornetCreateNetwork()).

209    /*************
210     * cuTensorNet
211     *************/
212
213    cudaStream_t stream;
214    HANDLE_CUDA_ERROR(cudaStreamCreate(&stream));
215
216    cutensornetHandle_t handle;
217    HANDLE_ERROR(cutensornetCreate(&handle));
218
219    if (verbose) printf("Allocated GPU memory for data, initialized data, and created library handle\n");
220
221    /****************
222     * Create Network
223     ****************/
224
225    // Set up tensor network
226    cutensornetNetworkDescriptor_t networkDesc;
227    HANDLE_ERROR(cutensornetCreateNetwork(handle, &networkDesc));
228
229    int64_t tensorIDs[numInputs]; // for input tensors
230    // attach the input tensors to the network
231    for (int32_t t = 0; t < numInputs; ++t)
232    {
233        HANDLE_ERROR(cutensornetNetworkAppendTensor(handle,
234                                                    networkDesc,
235                                                    tensorModes[t].size(),
236                                                    tensorExtents[t].data(),
237                                                    tensorModes[t].data(),
238                                                    NULL,
239                                                    typeData,
240                                                    &tensorIDs[t]));
241    }
242
243    // set the output tensor
244    HANDLE_ERROR(cutensornetNetworkSetOutputTensor(handle,
245                                                   networkDesc,
246                                                   tensorModes[numInputs].size(),
247                                                   tensorModes[numInputs].data(),
248                                                   typeData));
249
250    // set the network compute type
251    HANDLE_ERROR(cutensornetNetworkSetAttribute(handle,
252                                                networkDesc,
253                                                CUTENSORNET_NETWORK_COMPUTE_TYPE,
254                                                &typeCompute,
255                                                sizeof(typeCompute)));
256    if (verbose) printf("Initialized the cuTensorNet library and created a tensor network descriptor\n");

Optimal contraction order and slicing

At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available memory resources, and provide it to the optimizer as a constraint. We then create an optimizer configuration object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is invoked via cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.

259    /******************************************************
260     * Choose workspace limit based on available resources.
261     ******************************************************/
262
263    size_t freeMem, totalMem;
264    HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
265    uint64_t workspaceLimit = (uint64_t)((double)freeMem * 0.9);
266    if (verbose) printf("Workspace limit = %lu\n", workspaceLimit);
267
268    /*******************************
269     * Find "optimal" contraction order and slicing
270     *******************************/
271
272    cutensornetContractionOptimizerConfig_t optimizerConfig;
273    HANDLE_ERROR(cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig));
274
275    // Set the desired number of hyper-samples (defaults to 0)
276    int32_t num_hypersamples = 8;
277    HANDLE_ERROR(cutensornetContractionOptimizerConfigSetAttribute(handle,
278                                                                   optimizerConfig,
279                                                                   CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES,
280                                                                   &num_hypersamples,
281                                                                   sizeof(num_hypersamples)));
282
283    // Create contraction optimizer info and find an optimized contraction path
284    cutensornetContractionOptimizerInfo_t optimizerInfo;
285    HANDLE_ERROR(cutensornetCreateContractionOptimizerInfo(handle, networkDesc, &optimizerInfo));
286
287    HANDLE_ERROR(cutensornetContractionOptimize(handle,
288                                                networkDesc,
289                                                optimizerConfig,
290                                                workspaceLimit,
291                                                optimizerInfo));
292
293    // Query the number of slices the tensor network execution will be split into
294    int64_t numSlices = 0;
295    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
296                                                                 optimizerInfo,
297                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
298                                                                 &numSlices,
299                                                                 sizeof(numSlices)));
300    assert(numSlices > 0);
301
302    if (verbose) printf("Found an optimized contraction path using cuTensorNet optimizer\n");

It is also possible to bypass the cuTensorNet optimizer and import a pre-determined contraction path, as well as slicing information, directly to the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute(), then attach it to the network via cutensornetNetworkSetOptimizerInfo().
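For instance, a predetermined path could be injected as sketched below, reusing the handle, networkDesc, and optimizerInfo objects created above (a sketch only; the node pairs are illustrative and not an optimized path for this network):

    // Sketch: bypass the optimizer by importing a predetermined contraction path
    cutensornetNodePair_t pairs[] = {{0, 1}, {0, 1}, {0, 1}}; // illustrative pairwise contractions
    cutensornetContractionPath_t path{/*numContractions=*/3, pairs};
    HANDLE_ERROR(cutensornetContractionOptimizerInfoSetAttribute(handle,
                                                                 optimizerInfo,
                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                                                                 &path,
                                                                 sizeof(path)));
    // Attach the manually populated optimizer info to the network
    HANDLE_ERROR(cutensornetNetworkSetOptimizerInfo(handle, networkDesc, optimizerInfo));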

Create workspace descriptor and allocate workspace memory

Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction preparation and computation calls.

305    /*******************************
306     * Create workspace descriptor, allocate workspace, and set it.
307     *******************************/
308
309    cutensornetWorkspaceDescriptor_t workDesc;
310    HANDLE_ERROR(cutensornetCreateWorkspaceDescriptor(handle, &workDesc));
311
312    int64_t requiredWorkspaceSize = 0;
313    HANDLE_ERROR(cutensornetWorkspaceComputeContractionSizes(handle,
314                                                             networkDesc,
315                                                             optimizerInfo,
316                                                             workDesc));
317
318    HANDLE_ERROR(cutensornetWorkspaceGetMemorySize(handle,
319                                                   workDesc,
320                                                   CUTENSORNET_WORKSIZE_PREF_MIN,
321                                                   CUTENSORNET_MEMSPACE_DEVICE,
322                                                   CUTENSORNET_WORKSPACE_SCRATCH,
323                                                   &requiredWorkspaceSize));
324
325    void* work = nullptr;
326    HANDLE_CUDA_ERROR(cudaMalloc(&work, requiredWorkspaceSize));
327
328    HANDLE_ERROR(cutensornetWorkspaceSetMemory(handle,
329                                               workDesc,
330                                               CUTENSORNET_MEMSPACE_DEVICE,
331                                               CUTENSORNET_WORKSPACE_SCRATCH,
332                                               work,
333                                               requiredWorkspaceSize));
334
335    if (verbose) printf("Allocated and set up the GPU workspace\n");

Contraction preparation and auto-tuning

We prepare the tensor network contraction via cutensornetNetworkPrepareContraction(), and set the tensors' data buffers and strides via cutensornetNetworkSetInputTensorMemory() and cutensornetNetworkSetOutputTensorMemory(). Optionally, we can auto-tune the contraction via cutensornetNetworkAutotuneContraction(), such that cuTENSOR selects the best kernel for each pairwise contraction. The prepared network contraction can be reused for many (possibly different) data inputs, avoiding the cost of redundant re-initialization.

338    /**************************
339     * Prepare the contraction.
340     **************************/
341
342    HANDLE_ERROR(cutensornetNetworkPrepareContraction(handle,
343                                                      networkDesc,
344                                                      workDesc));
345
346    // set tensor's data buffers and strides
347    for (int32_t t = 0; t < numInputs; ++t)
348    {
349        HANDLE_ERROR(cutensornetNetworkSetInputTensorMemory(handle,
350                                                            networkDesc,
351                                                            tensorIDs[t],
352                                                            tensorData_d[t],
353                                                            NULL));
354    }
355    HANDLE_ERROR(cutensornetNetworkSetOutputTensorMemory(handle,
356                                                         networkDesc,
357                                                         tensorData_d[numInputs],
358                                                         NULL));
359    /****************************************************************
360     * Optional: Auto-tune the contraction to pick the fastest kernel
361     *           for each pairwise tensor contraction.
362     ****************************************************************/
363    cutensornetNetworkAutotunePreference_t autotunePref;
364    HANDLE_ERROR(cutensornetCreateNetworkAutotunePreference(handle,
365                                                            &autotunePref));
366
367    const int numAutotuningIterations = 5; // may be 0
368    HANDLE_ERROR(cutensornetNetworkAutotunePreferenceSetAttribute(handle,
369                                                                  autotunePref,
370                                                                  CUTENSORNET_NETWORK_AUTOTUNE_MAX_ITERATIONS,
371                                                                  &numAutotuningIterations,
372                                                                  sizeof(numAutotuningIterations)));
373
374    // Modify the network again to find the best pair-wise contractions
375    HANDLE_ERROR(cutensornetNetworkAutotuneContraction(handle,
376                                                       networkDesc,
377                                                       workDesc,
378                                                       autotunePref,
379                                                       stream));
380
381    HANDLE_ERROR(cutensornetDestroyNetworkAutotunePreference(autotunePref));
382
383    if (verbose) printf("Prepared the network contraction for cuTensorNet and optionally auto-tuned it\n");

Tensor network contraction execution

Finally, we contract the tensor network as many times as needed via cutensornetNetworkContract(), possibly with different input data each time (reset the data buffers and strides via cutensornetNetworkSetInputTensorMemory()). Tensor network slices, captured as a cutensornetSliceGroup_t object, are computed using the same prepared network contraction. For convenience, NULL can be passed to cutensornetNetworkContract() instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free the allocated resources.

386    /****************************************
387     * Execute the tensor network contraction
388     ****************************************/
389
390    // Create a cutensornetSliceGroup_t object from a range of slice IDs
391    cutensornetSliceGroup_t sliceGroup{};
392    HANDLE_ERROR(cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup));
393
394    GPUTimer timer{stream};
395    double minTimeCUTENSORNET = 1e100;
396    const int numRuns         = 3; // number of repeats to get stable performance results
397    for (int i = 0; i < numRuns; ++i)
398    {
399        // reset the output tensor data
400        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[numInputs], tensorData_h[numInputs], tensorSizes[numInputs], cudaMemcpyHostToDevice));
401        HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
402
403        /*
404         * Contract all slices of the tensor network
405         */
406        timer.start();
407
408        int32_t accumulateOutput = 0; // output tensor data will be overwritten
409        HANDLE_ERROR(cutensornetNetworkContract(handle,
410                                                networkDesc,
411                                                accumulateOutput,
412                                                workDesc,
413                                                sliceGroup, // alternatively, NULL can also be used to contract over all slices instead of specifying a sliceGroup object
414                                                stream));
415
416        // Synchronize and measure best timing
417        auto time          = timer.seconds();
418        minTimeCUTENSORNET = (time > minTimeCUTENSORNET) ? minTimeCUTENSORNET : time;
419    }
420
421    if (verbose) printf("Contracted the tensor network, each slice used the same prepared contraction\n");
422
423    // Print the 1-norm of the output tensor (verification)
424    HANDLE_CUDA_ERROR(cudaStreamSynchronize(stream));
425    // restore the output tensor on Host
426    HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_h[numInputs], tensorData_d[numInputs], tensorSizes[numInputs], cudaMemcpyDeviceToHost));
427    double norm1 = 0.0;
428    for (int64_t i = 0; i < tensorElements[numInputs]; ++i)
429    {
430        norm1 += std::abs(tensorData_h[numInputs][i]);
431    }
432    if (verbose) printf("Computed the 1-norm of the output tensor: %e\n", norm1);
433
434    /*************************/
435
436    // Query the total Flop count for the tensor network contraction
437    double flops{0.0};
438    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
439                                                                 optimizerInfo,
440                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
441                                                                 &flops,
442                                                                 sizeof(flops)));
443
444    if (verbose)
445    {
446        printf("Number of tensor network slices = %ld\n", numSlices);
447        printf("Tensor network contraction time (ms) = %.3f\n", minTimeCUTENSORNET * 1000.f);
448    }
449
451    /****************
452     * Free resources
453     ****************/
454
455    // Free cuTensorNet resources
456    HANDLE_ERROR(cutensornetDestroySliceGroup(sliceGroup));
457    HANDLE_ERROR(cutensornetDestroyWorkspaceDescriptor(workDesc));
458    HANDLE_ERROR(cutensornetDestroyContractionOptimizerInfo(optimizerInfo));
459    HANDLE_ERROR(cutensornetDestroyContractionOptimizerConfig(optimizerConfig));
460    HANDLE_ERROR(cutensornetDestroyNetwork(networkDesc));
461    HANDLE_ERROR(cutensornetDestroy(handle));
462
463    HANDLE_CUDA_ERROR(cudaStreamDestroy(stream));
464
465    // Free Host and GPU memory resources
466    for (int32_t t = 0; t < numInputs + 1; ++t)
467    {
468        if (tensorData_h[t]) free(tensorData_h[t]);
469        if (tensorData_d[t]) cudaFree(tensorData_d[t]);
470    }
471    if (work) cudaFree(work);
472
473    if (verbose) printf("Freed resources and exited\n");
474
475    return 0;
476}

Recall that the full sample code can be found in the NVIDIA/cuQuantum repository (here).
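As noted above, the prepared contraction can be reused with new input data. A minimal sketch of contracting again after rebinding a hypothetical device buffer newTensorData_d (same extents and data type as tensor A) to the first input, reusing the objects from the example:

    // Rebind new device data to input tensor A and contract again
    HANDLE_ERROR(cutensornetNetworkSetInputTensorMemory(handle,
                                                        networkDesc,
                                                        tensorIDs[0],
                                                        newTensorData_d,
                                                        NULL));
    HANDLE_ERROR(cutensornetNetworkContract(handle,
                                            networkDesc,
                                            0,    // overwrite the output tensor
                                            workDesc,
                                            NULL, // contract all slices
                                            stream));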

Code example (automatic slice-based distributed parallelization)

It is straightforward to adapt Code example (serial) and enable automatic parallel execution across multiple GPU devices (on one or many nodes). We illustrate this with an example that uses the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions needed to enable distributed parallel execution, without making any changes to the original serial source code. The full MPI-automatic sample code can be found in the NVIDIA/cuQuantum repository. To enable automatic parallelism, cuTensorNet requires that

  • the environment variable $CUTENSORNET_COMM_LIB is set to the path to the wrapper shared library libcutensornet_distributed_interface_mpi.so, and

  • the executable is linked to a CUDA-aware MPI library.

Detailed instructions for setting these up are given in the installation guide above.
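For example, one could point the environment variable at the wrapper library built from the distributed_interfaces directory shipped with cuQuantum (a sketch; the exact path depends on how and where the wrapper was built):

export CUTENSORNET_COMM_LIB=${CUQUANTUM_ROOT}/distributed_interfaces/libcutensornet_distributed_interface_mpi.so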

First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and assign a unique GPU device to each MPI process that will later be associated with the cuTensorNet library handle created inside the MPI process.

20#include <mpi.h>
48#define HANDLE_MPI_ERROR(x)                                        \
49    do {                                                           \
50        const auto err = x;                                        \
51        if (err != MPI_SUCCESS)                                    \
52        {                                                          \
53            char error[MPI_MAX_ERROR_STRING];                      \
54            int len;                                               \
55            MPI_Error_string(err, error, &len);                    \
56            printf("MPI Error: %s in line %d\n", error, __LINE__); \
57            fflush(stdout);                                        \
58            MPI_Abort(MPI_COMM_WORLD, err);                        \
59        }                                                          \
60    } while (0)

The MPI service initialization must precede the first cutensornetCreate() call which creates a cuTensorNet library handle. An attempt to call cutensornetCreate() before initializing the MPI service will result in an error.

106    // Initialize MPI
107    HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
108    int rank{-1};
109    HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
110    int numProcs{0};
111    HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));

If multiple GPU devices located on the same node are visible to an MPI process, we need to pick an exclusive GPU device for each MPI process. If the mpirun (or mpiexec) command provided by your MPI implementation exposes an environment variable containing the rank of the invoked MPI process, you can use it to set CUDA_VISIBLE_DEVICES to a specific single GPU device assigned exclusively to that MPI process (for example, Open MPI provides ${OMPI_COMM_WORLD_LOCAL_RANK} for this purpose). Otherwise, the GPU device can be set manually, as shown below.

129    // Set GPU device based on ranks and nodes
130    int numDevices{0};
131    HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
132    const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
133    HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
134    cudaDeviceProp prop;
135    HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));
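With Open MPI, for instance, the environment-variable approach mentioned above could look like the following (an illustration only; the launcher flags and the local-rank variable depend on your MPI implementation):

mpiexec -n 4 sh -c 'CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK} ./tensornet_example_mpi_auto'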

Next we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

243    /*******************
244     * Initialize data
245     *******************/
246
247    // init output tensor to all 0s
248    memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
249    if (rank == 0)
250    {
251        // init input tensors to random values
252        for (int32_t t = 0; t < numInputs; ++t)
253        {
254            for (uint64_t i = 0; i < tensorElements[t]; ++i) tensorData_h[t][i] = ((floatType)rand()) / RAND_MAX;
255        }
256    }
257
258    // Broadcast input data to all ranks
259    for (int32_t t = 0; t < numInputs; ++t)
260    {
261        HANDLE_MPI_ERROR(MPI_Bcast(tensorData_h[t], tensorElements[t], floatTypeMPI, 0, MPI_COMM_WORLD));
262    }
263
264    // copy input data to device buffers
265    for (int32_t t = 0; t < numInputs; ++t)
266    {
267        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
268    }

Once the MPI service has been initialized and the cuTensorNet library handle has been created afterwards, one can activate the distributed parallel execution by calling cutensornetDistributedResetConfiguration(). Per standard practice, the user’s code needs to create a duplicate MPI communicator via MPI_Comm_dup. Then, the duplicate MPI communicator is associated with the cuTensorNet library handle by passing the pointer to the duplicate MPI communicator together with its size (in bytes) to the cutensornetDistributedResetConfiguration() call. The MPI communicator will be stored inside the cuTensorNet library handle such that all subsequent calls to the tensor network contraction path finder and tensor network contraction executor will be parallelized across all participating MPI processes (each MPI process is associated with its own GPU).

333    /*******************************
334     * Activate distributed (parallel) execution prior to
335     * calling contraction path finder and contraction executor
336     *******************************/
337    // HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) ); // resets back to serial execution
338    MPI_Comm cutnComm;
339    HANDLE_MPI_ERROR(MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm)); // duplicate MPI communicator
340    HANDLE_ERROR(cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)));
341    if (verbose) printf("Reset distributed MPI configuration\n");

Note

cutensornetDistributedResetConfiguration() is a collective call that must be executed by all participating MPI processes.

The API of this distributed parallelization model makes it straightforward to run source code written for serial execution on multiple GPUs/nodes. Essentially, all MPI processes execute exactly the same (serial) source code, while distributed parallelization is performed automatically inside the tensor network contraction path finder and tensor network contraction executor calls. The path finder is parallelized only when the number of requested hyper-samples is larger than zero; regardless, the distributed parallelization must be activated before the path finder is invoked. That is, the tensor network contraction path finder and the tensor network contraction execution must be invoked strictly after activating the distributed parallelization via cutensornetDistributedResetConfiguration(). When the distributed configuration is set to a parallel mode, the user is normally expected to invoke the tensor network contraction execution by calling cutensornetNetworkContract() with the full range of tensor network slices, which will be automatically distributed across all MPI processes.

Since the tensor network must be sufficiently large to benefit from distributed execution, smaller tensor networks (those consisting of only a single slice) can still be processed without distributed parallelization by calling cutensornetDistributedResetConfiguration() with a NULL argument in place of the MPI communicator pointer (as before, prior to calling the tensor network contraction path finder). That is, the switch between distributed parallelization and redundant serial execution can be made on a per-tensor-network basis: users can decide which (larger) tensor networks to process in parallel and which (smaller) ones to process serially and redundantly, by resetting the distributed configuration appropriately, as sketched below. In both cases, all MPI processes will hold the same output tensor (result) at the end of the tensor network execution.
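A sketch of this per-network switching, reusing the handle and the duplicated communicator from above:

    // Process a small (single-slice) network redundantly on every process
    HANDLE_ERROR(cutensornetDistributedResetConfiguration(handle, NULL, 0));
    // ... path finding and contraction of the small network ...

    // Re-activate distributed execution for the next (larger) network
    HANDLE_ERROR(cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)));
    // ... path finding and contraction of the large network ...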

Note

In the current version of the cuTensorNet library, the parallel tensor network contraction execution triggered by the cutensornetNetworkContract() call will block the provided CUDA stream as well as the calling CPU thread until the execution has completed on all MPI processes. This is a temporary limitation that will be lifted in future versions of the cuTensorNet library, where the call to cutensornetNetworkContract() will be fully asynchronous, similar to the serial execution case. Additionally, for an explicit synchronization of all MPI processes (barrier) one can make a collective call to cutensornetDistributedSynchronize().

Before termination, the MPI service needs to be finalized.

552    // Shut down MPI service
553    HANDLE_MPI_ERROR(MPI_Finalize());

The complete MPI-automatic sample can be found in the NVIDIA/cuQuantum repository.

Code example (manual slice-based distributed parallelization)

For advanced users, it is also possible (but more involved) to adapt Code example (serial) to explicitly parallelize the execution of the tensor network contraction across multiple GPU devices. Here we also use MPI as the communication layer. For brevity, we show only the changes that need to be made on top of the serial example. The full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.

First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.

20#include <mpi.h>
48#define HANDLE_MPI_ERROR(x)                                        \
49    do {                                                           \
50        const auto err = x;                                        \
51        if (err != MPI_SUCCESS)                                    \
52        {                                                          \
53            char error[MPI_MAX_ERROR_STRING];                      \
54            int len;                                               \
55            MPI_Error_string(err, error, &len);                    \
56            printf("MPI Error: %s in line %d\n", error, __LINE__); \
57            fflush(stdout);                                        \
58            MPI_Abort(MPI_COMM_WORLD, err);                        \
59        }                                                          \
60    } while (0)
106    // Initialize MPI
107    HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
108    int rank{-1};
109    HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
110    int numProcs{0};
111    HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));
129    // Set GPU device based on ranks and nodes
130    int numDevices{0};
131    HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
132    const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
133    HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
134    cudaDeviceProp prop;
135    HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));

Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one GPU device per process model, the tensor network, including operands and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.

243    /*******************
244     * Initialize data
245     *******************/
246
247    // init output tensor to all 0s
248    memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
249    if (rank == 0)
250    {
251        // init input tensors to random values
252        for (int32_t t = 0; t < numInputs; ++t)
253        {
254            for (uint64_t i = 0; i < tensorElements[t]; ++i) tensorData_h[t][i] = ((floatType)rand()) / RAND_MAX;
255        }
256    }
257
258    // Broadcast input data to all ranks
259    for (int32_t t = 0; t < numInputs; ++t)
260    {
261        HANDLE_MPI_ERROR(MPI_Bcast(tensorData_h[t], tensorElements[t], floatTypeMPI, 0, MPI_COMM_WORLD));
262    }
263
264    // copy input data to device buffers
265    for (int32_t t = 0; t < numInputs; ++t)
266    {
267        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
268    }

Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.

Next, we find the optimal contraction path and slicing combination for our tensor network. We run the cuTensorNet optimizer on all processes and determine which process has the path with the lowest FLOP count. We then pack the optimizer info object on that process, broadcast the packed buffer, and unpack it on all other processes. Each process now holds the same optimizer info object, which we use to calculate each process's share of the slices. Because the optimizer info object is modified as a result of unifying it across processes, we need to update the network with the modified optimizer info object via cutensornetNetworkSetOptimizerInfo().

353    // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
354    // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
355    // tensor networks. For large networks, hyper-optimization does become beneficial.
356
357    // Enforce tensor network slicing (for parallelization)
358    const int32_t min_slices = numProcs;
359    HANDLE_ERROR(cutensornetContractionOptimizerConfigSetAttribute(handle,
360                                                                   optimizerConfig,
361                                                                   CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
362                                                                   &min_slices,
363                                                                   sizeof(min_slices)));
364
365    // Find an optimized tensor network contraction path on each MPI process
366    HANDLE_ERROR(cutensornetContractionOptimize(handle,
367                                                networkDesc,
368                                                optimizerConfig,
369                                                workspaceLimit,
370                                                optimizerInfo));
371
372    // Query the obtained Flop count
373    double flops{-1.};
374    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
375                                                                 optimizerInfo,
376                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
377                                                                 &flops,
378                                                                 sizeof(flops)));
379
380    // Choose the contraction path with the lowest Flop cost
381    struct
382    {
383        double value;
384        int rank;
385    } in{flops, rank}, out;
386
387    HANDLE_MPI_ERROR(MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD));
388    const int sender = out.rank;
389    flops            = out.value;
390
391    if (verbose) printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);
392
393    // Get the buffer size for optimizerInfo and broadcast it
394    size_t bufSize{0};
395    if (rank == sender)
396    {
397        HANDLE_ERROR(cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize));
398    }
399    HANDLE_MPI_ERROR(MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD));
400
401    // Allocate a buffer
402    std::vector<char> buffer(bufSize);
403
404    // Pack optimizerInfo on sender and broadcast it
405    if (rank == sender)
406    {
407        HANDLE_ERROR(cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize));
408    }
409    HANDLE_MPI_ERROR(MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD));
410
411    // Unpack optimizerInfo from the buffer
412    if (rank != sender)
413    {
414        HANDLE_ERROR(
415            cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo));
416    }
417
418    // Update the network with the modified optimizer info
419    HANDLE_ERROR(cutensornetNetworkSetOptimizerInfo(handle, networkDesc, optimizerInfo));
420
421    // Query the number of slices the tensor network execution will be split into
422    int64_t numSlices = 0;
423    HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
424                                                                 optimizerInfo,
425                                                                 CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
426                                                                 &numSlices,
427                                                                 sizeof(numSlices)));
428    assert(numSlices > 0);
429
430    // Calculate each process's share of the slices
431    int64_t procChunk  = numSlices / numProcs;
432    int extra          = numSlices % numProcs;
433    int procSliceBegin = rank * procChunk + std::min(rank, extra);
434    int procSliceEnd   = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);

We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and prepare (and optionally auto-tune) the tensor network contraction as described in Contraction preparation and auto-tuning.

Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the tensor network slices. We then provide this slice group object to the cutensornetNetworkContract() function to get a partial contraction result on each process.

525    // Create a cutensornetSliceGroup_t object from a range of slice IDs
526    cutensornetSliceGroup_t sliceGroup{};
527    HANDLE_ERROR(cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup));
548        HANDLE_ERROR(cutensornetNetworkContract(handle,
549                                                networkDesc,
550                                                accumulateOutput,
551                                                workDesc,
552                                                sliceGroup,
553                                                stream));

Finally, we sum up the partial contributions to obtain the result of the tensor network contraction.

559        // Perform Allreduce operation on the output tensor
560        HANDLE_CUDA_ERROR(cudaStreamSynchronize(stream));
561        // restore the output tensor on Host
562        HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_h[numInputs], tensorData_d[numInputs], tensorSizes[numInputs], cudaMemcpyDeviceToHost));
563        HANDLE_MPI_ERROR(MPI_Allreduce(MPI_IN_PLACE, tensorData_h[numInputs], tensorElements[numInputs], floatTypeMPI, MPI_SUM, MPI_COMM_WORLD));

Before termination, the MPI service needs to be finalized.

624    // Shut down MPI service
625    HANDLE_MPI_ERROR(MPI_Finalize());

The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.

Code example (GateSplit)

Useful tips

  • For debugging, one can set the environment variable CUTENSORNET_LOG_LEVEL=n. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout, as illustrated below.
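For example (an illustration only; the chosen level and file path are arbitrary):

export CUTENSORNET_LOG_LEVEL=5
export CUTENSORNET_LOG_FILE=/tmp/cutensornet_run.log
./tensornet_example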