Getting Started¶
In this section, we show how to contract a tensor network using cuTensorNet. First, we describe how to install the library and how to compile the sample code. Then, we present example code that performs common steps in cuTensorNet. In this example, we perform the following tensor contraction:
D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}
We build the code up step by step, with each step adding code at the end. The steps are separated by succinct multi-line comment blocks.
It is recommended that the reader refer to the Overview and the cuTENSOR documentation to become familiar with the nomenclature and with cuTENSOR operations.
Installation and Compilation¶
Download the cuQuantum package (which cuTensorNet is part of) from https://developer.nvidia.com/cuQuantum-downloads, and the cuTENSOR package from https://developer.nvidia.com/cutensor.
Linux¶
Assuming cuQuantum has been extracted into CUQUANTUM_ROOT and cuTENSOR into CUTENSOR_ROOT, we update the library path as follows:
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib/11:${LD_LIBRARY_PATH}
Depending on your CUDA Toolkit version, you might have to choose a different library directory (e.g., ${CUTENSOR_ROOT}/lib/11.0).
The sample code discussed below (tensornet_example.cu) can be compiled via the following command:
nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib/11 -lcutensornet -lcutensor -o tensornet_example
To link statically against the cuTensorNet library, use the following command (note that libmetis_static.a needs to be linked against explicitly, assuming it is installed through the NVIDIA CUDA Toolkit and accessible through $LIBRARY_PATH):
nvcc tensornet_example.cu -I${CUQUANTUM_ROOT}/include -I${CUTENSOR_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib/11 -lcutensor libmetis_static.a -o tensornet_example
Note
Depending on the source of the cuQuantum package, you may need to replace lib above with lib64.
Code Example (Serial)¶
The following code example illustrates the common steps necessary to use cuTensorNet and also introduces typical tensor network operations. The full sample code can be found in the NVIDIA/cuQuantum repository (here).
Headers and data types¶
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // for memset

#include <unordered_map>
#include <vector>
#include <cassert>

#include <cuda_runtime.h>
#include <cutensornet.h>
#include <cutensor.h>

#define HANDLE_ERROR(x) \
{ const auto err = x; \
  if( err != CUTENSORNET_STATUS_SUCCESS ) \
  { printf("Error: %s in line %d\n", cutensornetGetErrorString(err), __LINE__); return err; } \
};

#define HANDLE_CUDA_ERROR(x) \
{ const auto err = x; \
  if( err != cudaSuccess ) \
  { printf("Error: %s in line %d\n", cudaGetErrorString(err), __LINE__); return err; } \
};

struct GPUTimer
{
    GPUTimer(cudaStream_t stream): stream_(stream)
    {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
    }

    ~GPUTimer()
    {
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
    }

    void start()
    {
        cudaEventRecord(start_, stream_);
    }

    float seconds()
    {
        cudaEventRecord(stop_, stream_);
        cudaEventSynchronize(stop_);
        float time;
        cudaEventElapsedTime(&time, start_, stop_);
        return time * 1e-3;
    }

    private:
    cudaEvent_t start_, stop_;
    cudaStream_t stream_;
};


int main()
{
    const size_t cuTensornetVersion = cutensornetGetVersion();
    printf("cuTensorNet-vers:%ld\n", cuTensornetVersion);

    cudaDeviceProp prop;
    int deviceId{-1};
    HANDLE_CUDA_ERROR( cudaGetDevice(&deviceId) );
    HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );

    printf("===== device info ======\n");
    printf("GPU-name:%s\n", prop.name);
    printf("GPU-clock:%d\n", prop.clockRate);
    printf("GPU-memoryClock:%d\n", prop.memoryClockRate);
    printf("GPU-nSM:%d\n", prop.multiProcessorCount);
    printf("GPU-major:%d\n", prop.major);
    printf("GPU-minor:%d\n", prop.minor);
    printf("========================\n");

    typedef float floatType;
    cudaDataType_t typeData = CUDA_R_32F;
    cutensornetComputeType_t typeCompute = CUTENSORNET_COMPUTE_32F;

    printf("Include headers and define data types\n");
Define tensor network and tensor sizes¶
Next, we define the topology of the tensor network (i.e., the modes of the tensors, their extents, and their connectivity).
    /**********************
    * Computing: D_{m,x,n,y} = A_{m,h,k,n} B_{u,k,h} C_{x,u,y}
    **********************/

    constexpr int32_t numInputs = 3;

    // Create vector of modes
    std::vector<int32_t> modesA{'m','h','k','n'};
    std::vector<int32_t> modesB{'u','k','h'};
    std::vector<int32_t> modesC{'x','u','y'};
    std::vector<int32_t> modesD{'m','x','n','y'};

    // Extents
    std::unordered_map<int32_t, int64_t> extent;
    extent['m'] = 96;
    extent['n'] = 96;
    extent['u'] = 96;
    extent['h'] = 64;
    extent['k'] = 64;
    extent['x'] = 64;
    extent['y'] = 64;

    // Create a vector of extents for each tensor
    std::vector<int64_t> extentA;
    for (auto mode : modesA)
        extentA.push_back(extent[mode]);
    std::vector<int64_t> extentB;
    for (auto mode : modesB)
        extentB.push_back(extent[mode]);
    std::vector<int64_t> extentC;
    for (auto mode : modesC)
        extentC.push_back(extent[mode]);
    std::vector<int64_t> extentD;
    for (auto mode : modesD)
        extentD.push_back(extent[mode]);

    printf("Define network, modes, and extents\n");
Allocate memory and initialize data¶
Next, we allocate memory for the tensor network operands and initialize the input data with random values.
    /**********************
    * Allocating data
    **********************/

    size_t elementsA = 1;
    for (auto mode : modesA)
        elementsA *= extent[mode];
    size_t elementsB = 1;
    for (auto mode : modesB)
        elementsB *= extent[mode];
    size_t elementsC = 1;
    for (auto mode : modesC)
        elementsC *= extent[mode];
    size_t elementsD = 1;
    for (auto mode : modesD)
        elementsD *= extent[mode];

    size_t sizeA = sizeof(floatType) * elementsA;
    size_t sizeB = sizeof(floatType) * elementsB;
    size_t sizeC = sizeof(floatType) * elementsC;
    size_t sizeD = sizeof(floatType) * elementsD;
    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC + sizeD)/1024./1024./1024);

    void* rawDataIn_d[numInputs];
    void* D_d;
    HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[0], sizeA) );
    HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[1], sizeB) );
    HANDLE_CUDA_ERROR( cudaMalloc((void**) &rawDataIn_d[2], sizeC) );
    HANDLE_CUDA_ERROR( cudaMalloc((void**) &D_d, sizeD) );

    floatType *A = (floatType*) malloc(sizeof(floatType) * elementsA);
    floatType *B = (floatType*) malloc(sizeof(floatType) * elementsB);
    floatType *C = (floatType*) malloc(sizeof(floatType) * elementsC);
    floatType *D = (floatType*) malloc(sizeof(floatType) * elementsD);

    if (A == NULL || B == NULL || C == NULL || D == NULL)
    {
        printf("Error: Host allocation of A, B, C, or D.\n");
        return -1;
    }

    /*******************
    * Initialize data
    *******************/

    for (uint64_t i = 0; i < elementsA; i++)
        A[i] = ((floatType) rand())/RAND_MAX;
    for (uint64_t i = 0; i < elementsB; i++)
        B[i] = ((floatType) rand())/RAND_MAX;
    for (uint64_t i = 0; i < elementsC; i++)
        C[i] = ((floatType) rand())/RAND_MAX;
    memset(D, 0, sizeof(floatType) * elementsD);

    HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
    HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
    HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );

    printf("Allocate memory for data, and initialize data.\n");
cuTensorNet handle and network descriptor¶
Next, we initialize the cuTensorNet library via cutensornetCreate() and create the network descriptor with the desired tensor modes, extents, and strides, as well as the data and compute types.
    /*************************
    * cuTensorNet
    *************************/

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cutensornetHandle_t handle;
    HANDLE_ERROR( cutensornetCreate(&handle) );

    const int32_t nmodeA = modesA.size();
    const int32_t nmodeB = modesB.size();
    const int32_t nmodeC = modesC.size();
    const int32_t nmodeD = modesD.size();

    /*******************************
    * Create Network Descriptor
    *******************************/

    const int32_t* modesIn[] = {modesA.data(), modesB.data(), modesC.data()};
    int32_t const numModesIn[] = {nmodeA, nmodeB, nmodeC};
    const int64_t* extentsIn[] = {extentA.data(), extentB.data(), extentC.data()};
    const int64_t* stridesIn[] = {NULL, NULL, NULL}; // strides are optional; if no stride is provided, cuTensorNet assumes a generalized column-major data layout

    // Note that pointers allocated via cudaMalloc are aligned to 256-byte boundaries
    // by default; here, however, we check the pointer alignment explicitly to
    // demonstrate how one would check the alignment of arbitrary pointers.

    auto getMaximalPointerAlignment = [](const void* ptr) {
        const uint64_t ptrAddr = reinterpret_cast<uint64_t>(ptr);
        uint32_t alignment = 1;
        while(ptrAddr % alignment == 0 &&
              alignment < 256) // at the latest, we terminate once the alignment reaches 256 bytes (we could keep going, but any alignment of 256 bytes or larger is equally fine)
        {
            alignment *= 2;
        }
        return alignment;
    };
    const uint32_t alignmentsIn[] = {getMaximalPointerAlignment(rawDataIn_d[0]),
                                     getMaximalPointerAlignment(rawDataIn_d[1]),
                                     getMaximalPointerAlignment(rawDataIn_d[2])};
    const uint32_t alignmentOut = getMaximalPointerAlignment(D_d);

    // setup tensor network
    cutensornetNetworkDescriptor_t descNet;
    HANDLE_ERROR( cutensornetCreateNetworkDescriptor(handle,
                      numInputs, numModesIn, extentsIn, stridesIn, modesIn, alignmentsIn,
                      nmodeD, extentD.data(), /*stridesOut = */NULL, modesD.data(), alignmentOut,
                      typeData, typeCompute,
                      &descNet) );

    printf("Initialize the cuTensorNet library and create a network descriptor.\n");
Optimal contraction order and slicing¶
At this stage, we can deploy the cuTensorNet optimizer to find an optimized contraction path and slicing combination. We choose a limit for the workspace needed to perform the contraction based on the available resources, and provide it to the optimizer as a constraint. We then create an optimizer config object of type cutensornetContractionOptimizerConfig_t to specify various optimizer options and provide it to the optimizer, which is called using cutensornetContractionOptimize(). The results from the optimizer will be returned in an optimizer info object of type cutensornetContractionOptimizerInfo_t.
    /*******************************
    * Choose workspace limit based on available resources.
    *******************************/

    size_t freeMem, totalMem;
    HANDLE_CUDA_ERROR( cudaMemGetInfo(&freeMem, &totalMem) );
    uint64_t workspaceLimit = freeMem * 0.9; // use at most 90% of the currently free device memory

    /*******************************
    * Find "optimal" contraction order and slicing
    *******************************/

    cutensornetContractionOptimizerConfig_t optimizerConfig;
    HANDLE_ERROR( cutensornetCreateContractionOptimizerConfig(handle, &optimizerConfig) );

    // Set the value of the partitioner imbalance factor, if desired
    int32_t imbalance_factor = 30;
    HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(
                      handle,
                      optimizerConfig,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_GRAPH_IMBALANCE_FACTOR,
                      &imbalance_factor,
                      sizeof(imbalance_factor)) );


    cutensornetContractionOptimizerInfo_t optimizerInfo;
    HANDLE_ERROR( cutensornetCreateContractionOptimizerInfo(handle, descNet, &optimizerInfo) );

    HANDLE_ERROR( cutensornetContractionOptimize(handle,
                      descNet,
                      optimizerConfig,
                      workspaceLimit,
                      optimizerInfo) );

    int64_t numSlices = 0;
    HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
                      handle,
                      optimizerInfo,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
                      &numSlices,
                      sizeof(numSlices)) );

    assert(numSlices > 0);

    printf("Find an optimized contraction path with cuTensorNet optimizer.\n");
It is also possible to bypass the cuTensorNet optimizer and set a pre-determined path, as well as slicing information, directly into the optimizer info object via cutensornetContractionOptimizerInfoSetAttribute(), as sketched below.
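The following minimal sketch, which is not part of the sample code, illustrates how such a pre-determined path could be supplied for the three-tensor network above. It assumes the cutensornetContractionPath_t and cutensornetNodePair_t types and the CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH attribute provided by the installed cuTensorNet version.
    // Sketch only (not from the sample): supply a user-defined contraction path.
    // Path convention: each pair refers to positions in the current list of tensors;
    // the two tensors are removed and their intermediate is appended to the list.
    // For inputs {A, B, C}: contract A with B first, then C with that intermediate.
    cutensornetNodePair_t nodePairs[] = { {0, 1}, {0, 1} };
    cutensornetContractionPath_t path{ /*numContractions =*/ 2, nodePairs };
    HANDLE_ERROR( cutensornetContractionOptimizerInfoSetAttribute(
                      handle,
                      optimizerInfo,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_PATH,
                      &path,
                      sizeof(path)) );
Once set this way, the optimizer info object can be passed to the subsequent plan-creation step as before, without calling cutensornetContractionOptimize().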
Create workspace descriptor and allocate workspace memory¶
Next, we create a workspace descriptor, compute the workspace sizes, and query the minimum workspace size needed to contract the network. We then allocate device memory for the workspace and set this in the workspace descriptor. The workspace descriptor will be provided to the contraction plan.
    /*******************************
    * Create workspace descriptor, allocate workspace, and set it.
    *******************************/

    cutensornetWorkspaceDescriptor_t workDesc;
    HANDLE_ERROR( cutensornetCreateWorkspaceDescriptor(handle, &workDesc) );

    uint64_t requiredWorkspaceSize = 0;
    HANDLE_ERROR( cutensornetWorkspaceComputeSizes(handle,
                      descNet,
                      optimizerInfo,
                      workDesc) );

    HANDLE_ERROR( cutensornetWorkspaceGetSize(handle,
                      workDesc,
                      CUTENSORNET_WORKSIZE_PREF_MIN,
                      CUTENSORNET_MEMSPACE_DEVICE,
                      &requiredWorkspaceSize) );

    void *work = nullptr;
    HANDLE_CUDA_ERROR( cudaMalloc(&work, requiredWorkspaceSize) );

    HANDLE_ERROR( cutensornetWorkspaceSet(handle,
                      workDesc,
                      CUTENSORNET_MEMSPACE_DEVICE,
                      work,
                      requiredWorkspaceSize) );

    printf("Allocate workspace.\n");
Contraction plan and auto-tune¶
We create a contraction plan holding pair-wise contraction plans for cuTENSOR. Optionally, we can auto-tune the plan such that cuTENSOR selects the best kernel for each contraction. This contraction plan can be reused for many (possibly different) data inputs, avoiding the cost of initializing this plan redundantly.
    /*******************************
    * Initialize all pair-wise contraction plans (for cuTENSOR).
    *******************************/

    cutensornetContractionPlan_t plan;

    HANDLE_ERROR( cutensornetCreateContractionPlan(handle,
                      descNet,
                      optimizerInfo,
                      workDesc,
                      &plan) );


    /*******************************
    * Optional: Auto-tune cuTENSOR's cutensorContractionPlan to pick the fastest kernel
    *           for each pairwise contraction.
    *******************************/
    cutensornetContractionAutotunePreference_t autotunePref;
    HANDLE_ERROR( cutensornetCreateContractionAutotunePreference(handle,
                      &autotunePref) );

    const int numAutotuningIterations = 5; // may be 0
    HANDLE_ERROR( cutensornetContractionAutotunePreferenceSetAttribute(
                      handle,
                      autotunePref,
                      CUTENSORNET_CONTRACTION_AUTOTUNE_MAX_ITERATIONS,
                      &numAutotuningIterations,
                      sizeof(numAutotuningIterations)) );

    // modify the plan again to find the best pair-wise contractions
    HANDLE_ERROR( cutensornetContractionAutotune(handle,
                      plan,
                      rawDataIn_d,
                      D_d,
                      workDesc,
                      autotunePref,
                      stream) );

    HANDLE_ERROR( cutensornetDestroyContractionAutotunePreference(autotunePref) );

    printf("Create a contraction plan for cuTensorNet and optionally auto-tune it.\n");
Network contraction execution¶
Finally, we contract the network as many times as needed, possibly with different data. Network slices, captured as a cutensornetSliceGroup_t object, are executed using the same contraction plan. For convenience, NULL can be provided to the cutensornetContractSlices() function instead of a slice group when the goal is to contract all the slices in the network. We also clean up and free allocated resources.
    /**********************
    * Run
    **********************/
    cutensornetSliceGroup_t sliceGroup{};

    // Create a cutensornetSliceGroup_t object from a range of slice IDs.
    HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, 0, numSlices, 1, &sliceGroup) );

    GPUTimer timer{stream};
    double minTimeCUTENSOR = 1e100;
    const int numRuns = 3; // to get stable perf results
    for (int i=0; i < numRuns; ++i)
    {
        cudaMemcpy(D_d, D, sizeD, cudaMemcpyHostToDevice); // restore output
        cudaDeviceSynchronize();

        /*
        * Contract over all slices.
        *
        * A user may choose to parallelize over the slices across multiple devices.
        */
        timer.start();

        int32_t accumulateOutput = 0;
        HANDLE_ERROR( cutensornetContractSlices(handle,
                          plan,
                          rawDataIn_d,
                          D_d,
                          accumulateOutput,
                          workDesc,
                          sliceGroup, // alternatively, NULL can be provided to contract over all the slices instead of specifying a sliceGroup object
                          stream) );

        // Synchronize and measure timing
        auto time = timer.seconds();
        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
    }

    printf("Contract the network, each slice uses the same contraction plan.\n");


    /*************************/

    double flops{0.};
    HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
                      handle,
                      optimizerInfo,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
                      &flops,
                      sizeof(flops)) );

    printf("numSlices: %ld\n", numSlices);
    printf("%.2f ms / slice\n", minTimeCUTENSOR * 1000.f / numSlices);
    printf("%.2f GFLOPS/s\n", flops/1e9/minTimeCUTENSOR);

    HANDLE_ERROR( cutensornetDestroySliceGroup(sliceGroup) );
    HANDLE_ERROR( cutensornetDestroy(handle) );
    HANDLE_ERROR( cutensornetDestroyNetworkDescriptor(descNet) );
    HANDLE_ERROR( cutensornetDestroyContractionPlan(plan) );
    HANDLE_ERROR( cutensornetDestroyContractionOptimizerConfig(optimizerConfig) );
    HANDLE_ERROR( cutensornetDestroyContractionOptimizerInfo(optimizerInfo) );
    HANDLE_ERROR( cutensornetDestroyWorkspaceDescriptor(workDesc) );

    if (A) free(A);
    if (B) free(B);
    if (C) free(C);
    if (D) free(D);
    if (rawDataIn_d[0]) cudaFree(rawDataIn_d[0]);
    if (rawDataIn_d[1]) cudaFree(rawDataIn_d[1]);
    if (rawDataIn_d[2]) cudaFree(rawDataIn_d[2]);
    if (D_d) cudaFree(D_d);
    if (work) cudaFree(work);

    printf("Free resources and exit.\n");

    return 0;
}
Recall that the full sample code can be found in the NVIDIA/cuQuantum repository (here).
Code Example (Slice-based Parallelism)¶
It is straightforward to adapt Code Example (Serial) to enable parallel execution of the contraction operation on multiple devices. We will illustrate this with an example using MPI as the communication layer. In the interests of brevity, we will show only the changes that need to be made; the full sample code can be found in the NVIDIA/cuQuantum repository.
First, in addition to the headers and definitions mentioned in Headers and data types, we include the MPI header and define a macro to handle MPI errors. We also initialize MPI and map each process to a device in this section.
#include <mpi.h>

#define HANDLE_MPI_ERROR(x) \
{ const auto err = x; \
  if( err != MPI_SUCCESS ) \
  { char error[MPI_MAX_ERROR_STRING]; int len; \
    MPI_Error_string(err, error, &len); \
    printf("[Process %d] MPI Error: %s in line %d\n", rank, error, __LINE__); \
    fflush(stdout); \
    MPI_Abort(MPI_COMM_WORLD, err); \
  } \
};

    // Initialize MPI.
    int errorCode = MPI_Init(&argc, &argv);
    if (errorCode != MPI_SUCCESS)
    {
        printf("Error initializing MPI.\n");
        MPI_Abort(MPI_COMM_WORLD, errorCode);
    }

    const int root{0};
    int rank{};
    HANDLE_MPI_ERROR( MPI_Comm_rank(MPI_COMM_WORLD, &rank) );

    int numProcs{};
    HANDLE_MPI_ERROR( MPI_Comm_size(MPI_COMM_WORLD, &numProcs) );

    // ... (numDevices is obtained in the elided portion of the full sample, e.g., via cudaGetDeviceCount)

    // Set deviceId based on ranks and nodes.
    int deviceId = rank % numDevices; // We assume that the processes are mapped to nodes in contiguous chunks.
    HANDLE_CUDA_ERROR( cudaSetDevice(deviceId) );
    HANDLE_CUDA_ERROR( cudaGetDeviceProperties(&prop, deviceId) );
Next, we define the tensor network as described in Define tensor network and tensor sizes. In a one-process-per-device model, the tensor network, including operand and result data, is replicated on each process. The root process initializes the operand data and broadcasts it to the other processes.
    /*******************
    * Initialize data
    *******************/

    // Rank root creates the tensor data.
    if (rank == root)
    {
        for (uint64_t i = 0; i < elementsA; i++)
            A[i] = ((floatType) rand())/RAND_MAX;
        for (uint64_t i = 0; i < elementsB; i++)
            B[i] = ((floatType) rand())/RAND_MAX;
        for (uint64_t i = 0; i < elementsC; i++)
            C[i] = ((floatType) rand())/RAND_MAX;
    }

    // Broadcast data to all ranks.
    HANDLE_MPI_ERROR( MPI_Bcast(A, elementsA, floatTypeMPI, root, MPI_COMM_WORLD) );
    HANDLE_MPI_ERROR( MPI_Bcast(B, elementsB, floatTypeMPI, root, MPI_COMM_WORLD) );
    HANDLE_MPI_ERROR( MPI_Bcast(C, elementsC, floatTypeMPI, root, MPI_COMM_WORLD) );

    // Copy data onto the device on all ranks.
    HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[0], A, sizeA, cudaMemcpyHostToDevice) );
    HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[1], B, sizeB, cudaMemcpyHostToDevice) );
    HANDLE_CUDA_ERROR( cudaMemcpy(rawDataIn_d[2], C, sizeC, cudaMemcpyHostToDevice) );
Then we create the library handle and tensor network descriptor on each process, as described in cuTensorNet handle and network descriptor.
Next, we find the optimal contraction path and slicing combination for our network. We will run the cuTensorNet optimizer on all processes and determine which process has the best path in terms of FLOP count. We will then pack the optimizer info object on this process, broadcast the packed buffer, and unpack it on all processes. Each process now has the same optimizer info object, which we use to calculate the share of slices for each process.
    // Compute the path on all ranks so that we can choose the path with the lowest cost. Note that since this is a tiny
    // example with 3 operands, all processes will compute the same globally optimal path. This is not the case for large
    // tensor networks. For large networks, hyperoptimization is also beneficial and can be enabled by setting the
    // optimizer config attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES.

    // Force slicing.
    int32_t min_slices = numProcs;
    HANDLE_ERROR( cutensornetContractionOptimizerConfigSetAttribute(
                      handle,
                      optimizerConfig,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SLICER_MIN_SLICES,
                      &min_slices,
                      sizeof(min_slices)) );

    HANDLE_ERROR( cutensornetContractionOptimize(handle,
                      descNet,
                      optimizerConfig,
                      workspaceLimit,
                      optimizerInfo) );

    double flops{-1.};
    HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
                      handle,
                      optimizerInfo,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_FLOP_COUNT,
                      &flops,
                      sizeof(flops)) );

    // Choose the path with the lowest cost.
    struct {
        double value;
        int rank;
    } in{flops, rank}, out;

    HANDLE_MPI_ERROR( MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD) );

    int sender = out.rank;
    flops = out.value;
    if (rank == root)
    {
        printf("Process %d has the path with the lowest FLOP count %lf.\n", sender, flops);
    }

    size_t bufSize;

    // Get buffer size for optimizerInfo and broadcast it.
    if (rank == sender)
    {
        HANDLE_ERROR( cutensornetContractionOptimizerInfoGetPackedSize(handle, optimizerInfo, &bufSize) );
    }

    HANDLE_MPI_ERROR( MPI_Bcast(&bufSize, 1, MPI_INT64_T, sender, MPI_COMM_WORLD) );

    // Allocate buffer.
    std::vector<char> buffer(bufSize);

    // Pack optimizerInfo on sender and broadcast it.
    if (rank == sender)
    {
        HANDLE_ERROR( cutensornetContractionOptimizerInfoPackData(handle, optimizerInfo, buffer.data(), bufSize) );
    }

    HANDLE_MPI_ERROR( MPI_Bcast(buffer.data(), bufSize, MPI_CHAR, sender, MPI_COMM_WORLD) );

    // Unpack optimizerInfo from buffer.
    if (rank != sender)
    {
        HANDLE_ERROR( cutensornetUpdateContractionOptimizerInfoFromPackedData(handle, buffer.data(), bufSize, optimizerInfo) );
    }

    int64_t numSlices = 0;
    HANDLE_ERROR( cutensornetContractionOptimizerInfoGetAttribute(
                      handle,
                      optimizerInfo,
                      CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_NUM_SLICES,
                      &numSlices,
                      sizeof(numSlices)) );

    assert(numSlices > 0);

    // Calculate each process's share of the slices.

    int64_t procChunk = numSlices / numProcs;
    int extra = numSlices % numProcs;
    int procSliceBegin = rank * procChunk + std::min(rank, extra);
    int procSliceEnd = rank == numProcs - 1 ? numSlices : (rank + 1) * procChunk + std::min(rank + 1, extra);
We now create the workspace descriptor and allocate memory as described in Create workspace descriptor and allocate workspace memory, and we create and auto-tune the contraction plan as described in Contraction plan and auto-tune.
Next, on each process, we create a slice group (see cutensornetSliceGroup_t) that corresponds to its share of the network slices. We then provide this slice group object to the cutensornetContractSlices() function to get a partial contraction result on each process.
    cutensornetSliceGroup_t sliceGroup{};
    // Create a cutensornetSliceGroup_t object from a range of slice IDs.
    HANDLE_ERROR( cutensornetCreateSliceGroupFromIDRange(handle, procSliceBegin, procSliceEnd, 1, &sliceGroup) );

    // ...

    HANDLE_ERROR( cutensornetContractSlices(handle,
                      plan,
                      rawDataIn_d,
                      D_d,
                      accumulateOutput,
                      workDesc,
                      sliceGroup,
                      stream) );
Finally, we sum up the partial contributions to obtain the result of the contraction.
    // Reduce on root process.
    if (rank == root)
    {
        HANDLE_MPI_ERROR( MPI_Reduce(MPI_IN_PLACE, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) );
    }
    else
    {
        HANDLE_MPI_ERROR( MPI_Reduce(D, D, elementsD, floatTypeMPI, MPI_SUM, root, MPI_COMM_WORLD) );
    }
    HANDLE_MPI_ERROR( MPI_Finalize() );
The complete MPI sample can be found in the NVIDIA/cuQuantum repository.
Useful Tips¶
For debugging, the environment variable CUTENSORNET_LOG_LEVEL=n can be set. The level n = 0, 1, ..., 5 corresponds to the logger level as described in and used by cutensornetLoggerSetLevel(). The environment variable CUTENSORNET_LOG_FILE=<filepath> can be used to direct the log output to a custom file at <filepath> instead of stdout.
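For example, to request the most verbose logging and direct it to a file (the file path below is only illustrative), one could set:
export CUTENSORNET_LOG_LEVEL=5
export CUTENSORNET_LOG_FILE=/tmp/cutensornet.log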