Code example (Contraction with automatic distributed slicing)
It is straightforward to adapt Code example (serial contraction) and enable automatic parallel execution across multiple GPU devices, potentially spanning multiple nodes. We will illustrate this with an example using the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions that need to be made in order to enable distributed parallel execution without making any changes to the original serial source code. The full MPI-automatic sample code can be found in the NVIDIA/cuQuantum repository. To enable automatic parallelism, cuTensorNet requires that
- the environment variable $CUTENSORNET_COMM_LIB is set to the path to the wrapper shared library libcutensornet_distributed_interface_mpi.so, and
- the executable is linked to a CUDA-aware MPI library.
Detailed instructions for setting these up are given in the installation guide above.
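As a quick sanity check, the program can also verify at startup that the wrapper library path is actually visible to the process. The following is a minimal, optional sketch (not part of the shipped sample) that relies only on the standard getenv call:

#include <cstdio>
#include <cstdlib>

// Optional sanity check (not part of the sample): warn early if the path to the
// MPI wrapper library has not been exported into the environment.
static void checkDistributedEnv()
{
    if (std::getenv("CUTENSORNET_COMM_LIB") == nullptr)
    {
        std::printf("Warning: CUTENSORNET_COMM_LIB is not set; "
                    "automatic distributed parallelization will not be available\n");
    }
}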
First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also need to initialize the MPI service and assign a unique GPU device to each MPI process; this device will later be associated with the cuTensorNet library handle created inside that process.
#include <mpi.h>

#define HANDLE_MPI_ERROR(x) \
    do { \
        const auto err = x; \
        if (err != MPI_SUCCESS) \
        { \
            char error[MPI_MAX_ERROR_STRING]; \
            int len; \
            MPI_Error_string(err, error, &len); \
            printf("MPI Error: %s in line %d\n", error, __LINE__); \
            fflush(stdout); \
            MPI_Abort(MPI_COMM_WORLD, err); \
        } \
    } while (0)
The MPI service initialization must precede the first cutensornetCreate() call, which creates a cuTensorNet library handle. An attempt to call cutensornetCreate() before initializing the MPI service will result in an error.
// Initialize MPI
HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
int rank{-1};
HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
int numProcs{0};
HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));
If multiple GPU devices located on the same node are visible to an MPI process,
we need to pick an exclusive GPU device for each MPI process. If the mpirun (or mpiexec)
command provided by your MPI library implementation exposes an environment variable
with the local rank of each MPI process at launch time, you can use that environment
variable to set CUDA_VISIBLE_DEVICES so that each MPI process sees exactly one
GPU device assigned to it exclusively (for example, Open MPI provides
${OMPI_COMM_WORLD_LOCAL_RANK} for this purpose). Otherwise, the GPU device can be
set manually, as shown below.
// Set GPU device based on ranks and nodes
int numDevices{0};
HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
const int deviceId = rank % numDevices; // we assume that the processes are mapped to nodes in contiguous chunks
HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
cudaDeviceProp prop;
HANDLE_CUDA_ERROR(cudaGetDeviceProperties(&prop, deviceId));
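With MPI initialized and a GPU device selected, the cuTensorNet library handle can be created for that device. The following sketch only illustrates the required ordering; in the full sample the handle is created at a later point with the same cutensornetCreate() call:

// Create the cuTensorNet library handle only after MPI_Init() and after this
// process has selected its GPU device (ordering sketch).
cutensornetHandle_t handle;
HANDLE_ERROR(cutensornetCreate(&handle));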
Next we define the tensor network as described in Define tensor network and tensor sizes. In a one-GPU-device-per-process model, the tensor network, including the operand and result data, is replicated on each process. The root process initializes the input data and broadcasts it to the other processes.
/*******************
 * Initialize data
 *******************/

// init output tensor to all 0s
memset(tensorData_h[numInputs], 0, tensorSizes[numInputs]);
if (rank == 0)
{
    // init input tensors to random values
    for (int32_t t = 0; t < numInputs; ++t)
    {
        for (uint64_t i = 0; i < tensorElements[t]; ++i) tensorData_h[t][i] = ((floatType)rand()) / RAND_MAX;
    }
}

// Broadcast input data to all ranks
for (int32_t t = 0; t < numInputs; ++t)
{
    HANDLE_MPI_ERROR(MPI_Bcast(tensorData_h[t], tensorElements[t], floatTypeMPI, 0, MPI_COMM_WORLD));
}

// copy input data to device buffers
for (int32_t t = 0; t < numInputs; ++t)
{
    HANDLE_CUDA_ERROR(cudaMemcpy(tensorData_d[t], tensorData_h[t], tensorSizes[t], cudaMemcpyHostToDevice));
}
Once the MPI service has been initialized and the cuTensorNet library handle has been created,
one can activate distributed parallel execution by calling cutensornetDistributedResetConfiguration().
Per standard practice, the user’s code needs to create a duplicate MPI communicator via MPI_Comm_dup.
The duplicate MPI communicator is then associated with the cuTensorNet library handle
by passing a pointer to it, together with its size (in bytes),
to the cutensornetDistributedResetConfiguration() call. The MPI communicator is stored
inside the cuTensorNet library handle so that all subsequent calls to the
tensor network contraction path finder and tensor network contraction executor are
parallelized across all participating MPI processes (each MPI process is associated with its own GPU).
/*******************************
 * Activate distributed (parallel) execution prior to
 * calling contraction path finder and contraction executor
 *******************************/
// HANDLE_ERROR( cutensornetDistributedResetConfiguration(handle, NULL, 0) ); // resets back to serial execution
MPI_Comm cutnComm;
HANDLE_MPI_ERROR(MPI_Comm_dup(MPI_COMM_WORLD, &cutnComm)); // duplicate MPI communicator
HANDLE_ERROR(cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)));
if (verbose) printf("Reset distributed MPI configuration\n");
Note
cutensornetDistributedResetConfiguration() is a collective call that must be executed
by all participating MPI processes.
The API of this distributed parallelization model makes it straightforward to run source code
written for serial execution on multiple GPUs and nodes. Essentially, all MPI processes
execute exactly the same (serial) source code while the distributed parallelization is performed
automatically inside the tensor network contraction path finder and tensor network contraction executor calls.
The parallelization of the tensor network contraction path finder will only occur when the number
of requested hyper-samples is larger than zero. Regardless of that, however, the distributed
parallelization must be activated before the tensor network contraction path finder is invoked.
That is, both the tensor network contraction path finder and the tensor network contraction execution
calls must come strictly after activating the distributed parallelization via cutensornetDistributedResetConfiguration().
When the distributed configuration is set to a parallel mode, the user is normally expected to invoke
tensor network contraction execution by calling the cutensornetNetworkContract() function with
the full range of tensor network slices, which will be automatically distributed across all MPI processes.
Since a tensor network must be sufficiently large to benefit from distributed execution,
smaller tensor networks (those consisting of only a single slice)
can still be processed without distributed parallelization by calling
cutensornetDistributedResetConfiguration() with a NULL argument in place of the MPI communicator
pointer (as before, this must be done prior to calling the tensor network contraction path finder).
That is, the switch between distributed parallelization and redundant serial execution can be made
on a per-tensor-network basis: users can decide which (larger) tensor networks to process
in parallel and which (smaller) ones to process redundantly in serial,
by resetting the distributed configuration appropriately. In both cases, all MPI processes
will produce the same output tensor (result) at the end of the tensor network execution.
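For illustration, the sketch below (not part of the sample; it assumes the library handle and the duplicated communicator cutnComm created above) switches to redundant serial execution for a small, single-slice tensor network and then re-activates distributed parallelization for a larger one:

// Process a small (single-slice) tensor network redundantly on every process:
// switch back to serial execution before invoking its path finder.
HANDLE_ERROR(cutensornetDistributedResetConfiguration(handle, NULL, 0));
// ... path finder + contraction of the small tensor network (serial on each process) ...

// Re-activate distributed parallelization before the path finder of a larger network.
HANDLE_ERROR(cutensornetDistributedResetConfiguration(handle, &cutnComm, sizeof(cutnComm)));
// ... path finder + contraction of the large tensor network (slices distributed across processes) ...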
Note
In the current version of the cuTensorNet library, the parallel tensor network contraction
execution triggered by the cutensornetNetworkContract() call will block the provided CUDA stream as well
as the calling CPU thread until the execution has completed on all MPI processes. This is a temporary
limitation that will be lifted in future versions of the cuTensorNet library, where the call to
cutensornetNetworkContract() will be fully asynchronous, similar to the serial execution case.
Additionally, for an explicit synchronization of all MPI processes (a barrier), one can make
a collective call to cutensornetDistributedSynchronize().
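For example, such a barrier can be inserted as follows (a minimal sketch assuming the library handle created above):

// Explicit barrier: returns once all MPI processes attached to the handle's
// communicator have reached this point.
HANDLE_ERROR(cutensornetDistributedSynchronize(handle));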
Before termination, the MPI service needs to be finalized.
// Shut down MPI service
HANDLE_MPI_ERROR(MPI_Finalize());
The complete MPI-automatic sample can be found in the NVIDIA/cuQuantum repository.