Code example (Contraction with manual distributed slicing)#
For advanced users, it is also possible (but more involved) to adapt Code example (serial contraction) to explicitly parallelize execution of the tensor network contraction across multiple GPU devices. Here we again use MPI as the communication layer. For brevity, we show only the changes that need to be made on top of the serial example; the full MPI-manual sample code can be found in the NVIDIA/cuQuantum repository. Note that this sample does NOT require CUDA-aware MPI.
First, in addition to the headers and definitions mentioned in Headers and data types, we need to include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and associate each MPI process with its own GPU device, as explained previously.
#include <mpi.h>

#define HANDLE_MPI_ERROR(x) \
   do { \
      const auto err = x; \
      if (err != MPI_SUCCESS) \
      { \
         char error[MPI_MAX_ERROR_STRING]; \
         int len; \
         MPI_Error_string(err, error, &len); \
         printf("MPI Error: %s in line %d\n", error, __LINE__); \
         fflush(stdout); \
         MPI_Abort(MPI_COMM_WORLD, err); \
      } \
   } while (0)
   // Initialize MPI
   HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
   int rank{-1};
   HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
   int numProcs{0};
   HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));

   // Set GPU device based on ranks and nodes
   int numDevices{0};
   HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
   const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
   HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
   cudaDeviceProp prop;
   HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));
Next, we define the tensor network as described in Define tensor network and tensor sizes. With one GPU device per process, the tensor network, including the operand and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.
   /*******************
    * Initialize data
    *******************/

   // init output tensor to all 0s
   memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
   if (rank == 0)
   {
      // init input tensors to random values
      for (int32_t t = 0; t < numInputs; ++t)
      {
         for (uint64_t i = 0; i < tensorElements[t]; ++i) tensorData_h[t][i] = ((floatType)rand()) / RAND_MAX;
      }
   }

   // Broadcast input data to all ranks
   for (int32_t t = 0; t < numInputs; ++t)
   {
      HANDLE_MPI_ERROR(MPI_Bcast(tensorData_h[t], tensorElements[t], floatTypeMPI, 0, MPI_COMM_WORLD));
   }

   // copy input data to device buffers
   for (int32_t t = 0; t < numInputs; ++t)
   {
      HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
   }
Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.
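For completeness, a minimal sketch of this step is given below. It is not part of the MPI-specific changes; it simply mirrors the serial example, assuming the variable names used there (numInputs, numModesIn, extentsIn, modesIn, numModesOut, extentsOut, modesOut, typeData, typeCompute) and passing NULL strides and qualifiers for dense, packed tensors with default options. Refer to the linked section for the exact code used by the sample.

   // Sketch only: create the cuTensorNet library handle on each process
   cutensornetHandle_t handle;
   HANDLE_ERROR(cutensornetCreate(&handle));

   // Sketch only: create the tensor network descriptor (replicated on every process);
   // NULL strides select a dense, packed layout, and NULL qualifiers use the defaults
   cutensornetNetworkDescriptor_t networkDesc;
   HANDLE_ERROR(cutensornetCreateNetworkDescriptor(handle,
                                                   numInputs, numModesIn, extentsIn, NULL, modesIn, NULL,
                                                   numModesOut, extentsOut, NULL, modesOut,
                                                   typeData, typeCompute, &networkDesc));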
Next, we find the optimal contraction path and slicing combination for our tensor network. We run the cuTensorNet optimizer on all processes and determine which process found the path with the lowest FLOP count. We then pack the optimizer info object on that process, broadcast the packed buffer, and unpack it on all other processes, so that every process holds the same optimizer info object. From it, we compute each process's share of the tensor network slices. Because the optimizer info object is modified as a result of unifying the contraction path across processes, we need to update the network with the modified object via cutensornetNetworkSetOptimizerInfo().
   // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
   // example with 4 operands, all processes will compute the same globally optimal path. This is not the case for large
   // tensor networks. For large networks, hyper-optimization does become beneficial.

   // Enforce tensor network slicing (for parallelization)
   const int32_t min_slices = numProcs;
   HANDLE_ERROR(cutensornetContractionOptimizerConfigSetAttribute(handle,
                                                                  optimizerConfig,
                                                                  CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
                                                                  &min_slices,
                                                                  sizeof(min_slices)));

   // Find an optimized tensor network contraction path on each MPI process
   HANDLE_ERROR(cutensornetContractionOptimize(handle,
                                               networkDesc,
                                               optimizerConfig,
                                               workspaceLimit,
                                               optimizerInfo));

   // Query the obtained Flop count
   double flops{-1.};
   HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
                                                                optimizerInfo,
                                                                CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
                                                                &flops,
                                                                sizeof(flops)));

   // Choose the contraction path with the lowest Flop cost
   struct
   {
      double value;
      int rank;
   } in{flops, rank}, out;

   HANDLE_MPI_ERROR(MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD));
   const int sender = out.rank;
   flops = out.value;

   if (verbose) printf("Process %d has the path with the lowest FLOP count %lf\n", sender, flops);

   // Get the buffer size for optimizerInfo and broadcast it
   size_t bufSize{0};
   if (rank == sender)
   {
      HANDLE_ERROR(cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize));
   }
   HANDLE_MPI_ERROR(MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD));

   // Allocate a buffer
   std::vector<char> buffer(bufSize);

   // Pack optimizerInfo on sender and broadcast it
   if (rank == sender)
   {
      HANDLE_ERROR(cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize));
   }
   HANDLE_MPI_ERROR(MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD));

   // Unpack optimizerInfo from the buffer
   if (rank != sender)
   {
      HANDLE_ERROR(
         cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo));
   }

   // Update the network with the modified optimizer info
   HANDLE_ERROR(cutensornetNetworkSetOptimizerInfo(handle, networkDesc, optimizerInfo));

   // Query the number of slices the tensor network execution will be split into
   int64_t numSlices = 0;
   HANDLE_ERROR(cutensornetContractionOptimizerInfoGetAttribute(handle,
                                                                optimizerInfo,
                                                                CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
                                                                &numSlices,
                                                                sizeof(numSlices)));
   assert(numSlices > 0);

   // Calculate each process's share of the slices
   int64_t procChunk = numSlices / numProcs;
   int extra = numSlices % numProcs;
   int procSliceBegin = rank * procChunk + std::min(rank, extra);
   int procSliceEnd = (rank == numProcs - 1) ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);
We now create the workspace descriptor and allocate workspace memory as described in Create workspace descriptor and allocate workspace memory, and then prepare and auto-tune the tensor network contraction as described in Contraction preparation and auto-tuning.
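For orientation, a rough sketch of the per-process workspace setup is shown below, using the generic cuTensorNet workspace calls (handle, networkDesc, and optimizerInfo are the objects created above); the exact sequence, together with the preparation and auto-tuning calls, is given in the linked sections and is not repeated here.

   // Sketch only: create the workspace descriptor and query the required scratch size
   cutensornetWorkspaceDescriptor_t workDesc;
   HANDLE_ERROR(cutensornetCreateWorkspaceDescriptor(handle, &workDesc));
   HANDLE_ERROR(cutensornetWorkspaceComputeContractionSizes(handle, networkDesc, optimizerInfo, workDesc));
   int64_t requiredWorkspaceSize = 0;
   HANDLE_ERROR(cutensornetWorkspaceGetMemorySize(handle,
                                                  workDesc,
                                                  CUTENSORNET_WORKSIZE_PREF_MIN,
                                                  CUTENSORNET_MEMSPACE_DEVICE,
                                                  CUTENSORNET_WORKSPACE_SCRATCH,
                                                  &requiredWorkspaceSize));

   // Sketch only: allocate device scratch memory on this process's GPU and attach it to the descriptor
   void* work = nullptr;
   HANDLE_CUDA_ERROR(cudaMalloc(&work, requiredWorkspaceSize));
   HANDLE_ERROR(cutensornetWorkspaceSetMemory(handle,
                                              workDesc,
                                              CUTENSORNET_MEMSPACE_DEVICE,
                                              CUTENSORNET_WORKSPACE_SCRATCH,
                                              work,
                                              requiredWorkspaceSize));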
Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the tensor network slices.
We then provide this slice group object to the cutensornetNetworkContract() function to get a partial contraction result on each process.
   // Create a cutensornetSliceGroup_t object from a range of slice IDs
   cutensornetSliceGroup_t sliceGroup{};
   HANDLE_ERROR(cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup));

   HANDLE_ERROR(cutensornetNetworkContract(handle,
                                           networkDesc,
                                           accumulateOutput,
                                           workDesc,
                                           sliceGroup,
                                           stream));
Finally, we sum up the partial contributions to obtain the result of the tensor network contraction.
   // Perform Allreduce operation on the output tensor
   HANDLE_CUDA_ERROR(cudaStreamSynchronize(stream));
   // Copy the output tensor back to the host
   HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_h[numInputs], tensorData_d[numInputs], tensorSizes[numInputs], cudaMemcpyDeviceToHost));
   HANDLE_MPI_ERROR(MPI_Allreduce(MPI_IN_PLACE, tensorData_h[numInputs], tensorElements[numInputs], floatTypeMPI, MPI_SUM, MPI_COMM_WORLD));
Before termination, the MPI service needs to be finalized.
   // Shut down MPI service
   HANDLE_MPI_ERROR(MPI_Finalize());
The complete MPI-manual sample can be found in the NVIDIA/cuQuantum repository.