Plan Cache (beta)¶
This section introduces the software-managed plan cache, which has the following key features:
- Minimize launch-related overhead (e.g., due to kernel selection).
- Overhead-free autotuning (a.k.a. incremental autotuning): this feature enables users to automatically find the best implementation for the given problem and thereby increases performance.
- The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.
- Serialize and deserialize the cache: this allows users to store the state of the cache to disc and reuse it later.
In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorContractionDescriptor_t) to an actual implementation (encoded by cutensorContractionPlan_t).
The remainder of this section assumes familiarity with Getting Started.
Note
The cache can always be deactivated via the CUTENSOR_DISABLE_PLAN_CACHE
environment variable (see Environment Variables).
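For instance, on a POSIX system the variable can also be set programmatically before the library is initialized; the snippet below is only a sketch and assumes that the variable is set before cuTENSOR reads it (it is equivalent to setting the variable in the shell when launching the application):
#include <cstdlib> // setenv (POSIX)

// Equivalent to launching the application as: CUTENSOR_DISABLE_PLAN_CACHE=1 ./my_app
// Set the variable before cutensorInit() so that cuTENSOR can pick it up.
setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", /*overwrite=*/ 1);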
Incremental Autotuning¶
The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given problem.
When using the cache with the incremental autotuning feature (CUTENSOR_AUTOTUNE_INCREMENTAL), successive invocations of the same contraction problem (albeit with potentially different data pointers) will be performed by different candidates; the timing for those candidates is automatically measured, and the fastest candidate is stored in the plan cache. The number of different candidates to explore is configurable by the user (via CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT); all subsequent calls to the same problem will then be mapped to the fastest candidate (stored in the cache), thus taking advantage of the fastest (real-world) candidate.
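Conceptually, the behavior can be pictured as follows; this is only an illustrative, self-contained C++ sketch with made-up timings, not the library's actual implementation:
// Illustrative sketch only (not cuTENSOR code): how a plan cache entry behaves
// under incremental autotuning for a single, repeatedly issued problem.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Cacheline { int bestCandidate = 0; float bestTime = 0.f; int explored = 0; };

int main()
{
    const int incrementalCount = 4;               // candidates to explore before reusing the best one
    const std::vector<float> candidateTimes       // made-up timings, ordered by a performance model
        {1.9f, 1.5f, 1.7f, 2.1f, 2.5f};

    std::unordered_map<int, Cacheline> cache;     // problem key -> cacheline
    const int problemKey = 42;                    // stands in for a contraction descriptor

    for (int run = 0; run < 8; ++run)
    {
        Cacheline &line = cache[problemKey];
        int candidate;
        if (line.explored < incrementalCount)
        {
            candidate = line.explored++;          // explore the next candidate ...
            const float t = candidateTimes[candidate];
            if (line.explored == 1 || t < line.bestTime)
            {                                     // ... and remember the fastest one
                line.bestTime = t;
                line.bestCandidate = candidate;
            }
        }
        else
        {
            candidate = line.bestCandidate;       // all later runs reuse the fastest candidate
        }
        printf("run %d uses candidate %d\n", run, candidate);
    }
    return 0;
}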
This autotuning approach has some key advantages:
- Candidates are evaluated at a point in time where hardware caches are in a production-environment state (i.e., the hardware cache state reflects the real-world situation).
- Overhead is minimized (i.e., no timing loop, no synchronization).
- Moreover, the candidates are evaluated in the order given by our performance model (from fastest to slowest).
Incremental autotuning is especially powerful if combined with cuTENSOR's cache serialization feature (via cutensorHandleWriteCacheToFile and cutensorHandleReadCacheFromFile), i.e., by writing the tuned cache to disc.
Note
We recommend warming up the GPU (i.e., reaching steady-state performance) before auto-tuning is used to minimize fluctuations in measured performance.
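One way to do this is to issue a few throw-away contractions whose find has caching disabled, so that the warm-up runs do not consume autotuning candidates. The following is only a sketch; it reuses HANDLE_ERROR as well as the handle, descriptor, scalars, buffers, and workspace that are set up in the examples below:
// Warm-up sketch: run a few uncached contractions so that GPU clocks and
// caches reach steady state before the autotuned iterations start.
cutensorContractionFind_t warmupFind;
HANDLE_ERROR( cutensorInitContractionFind(&handle, &warmupFind, CUTENSOR_ALGO_DEFAULT) );

const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE; // bypass the plan cache
HANDLE_ERROR( cutensorContractionFindSetAttribute(
    &handle, &warmupFind,
    CUTENSOR_CONTRACTION_FIND_CACHE_MODE,
    &cacheMode, sizeof(cutensorCacheMode_t)) );

cutensorContractionPlan_t warmupPlan;
HANDLE_ERROR( cutensorInitContractionPlan(&handle, &warmupPlan, &desc, &warmupFind, worksize) );

for (int i = 0; i < 3; ++i) // a handful of untimed runs is usually sufficient
{
    HANDLE_ERROR( cutensorContraction(&handle, &warmupPlan,
                                      (void*)&alpha, A_d, B_d,
                                      (void*)&beta,  C_d, C_d,
                                      work, worksize, 0 /* stream */) );
}
cudaDeviceSynchronize(); // make sure the warm-up has completed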
Introductory Example¶
This subsection provides a basic overview of the cache-related API calls and features. In addition to the steps outlined in Getting Started, in this example we also:
- Attach cachelines to the handle (via cutensorHandleAttachPlanCachelines).
- Configure the cache behavior on a contraction-by-contraction basis (via cutensorContractionFindSetAttribute).
Let’s start from the same example outlined in Getting Started and enhance it with the steps required to use the plan cache. First we have to add calls to attach and detach cachelines:
// Set number of cache lines
constexpr int32_t numCachelines = 1024;
// Set cache size and allocate
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
// Attach cache
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
// ...
// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
Note that the number of cachelines is configurable by the user; ideally, we would like to supply as many cachelines as the application has distinct contraction calls. Since this might not always be possible (due to memory constraints), cuTENSOR's plan cache evicts cache entries using a least-recently-used (LRU) policy. Users can also choose to disable caching on a contraction-by-contraction basis (via CUTENSOR_CACHE_MODE_NONE, as shown below).
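For instance, the cacheline count could be derived from the number of distinct contractions the application is expected to issue, capped by a memory budget; numDistinctContractions and memoryBudget below are hypothetical, application-specific values used only for illustration:
// Hypothetical sizing sketch: cover all distinct contractions if possible,
// but cap the cache at roughly 1 MiB.
const size_t memoryBudget = 1024 * 1024;                                  // bytes (example value)
const size_t numDistinctContractions = 600;                               // application-specific (example value)
const size_t maxCachelines = memoryBudget / sizeof(cutensorPlanCacheline_t);
const int32_t numCachelines = (int32_t) (numDistinctContractions < maxCachelines ?
                                         numDistinctContractions : maxCachelines);

cutensorPlanCacheline_t* cachelines =
    (cutensorPlanCacheline_t*) malloc(numCachelines * sizeof(cutensorPlanCacheline_t));
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );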
Note
Until the cachelines are detached via cutensorHandleDetachPlanCachelines, the user-allocated cachelines must not be freed. Moreover, cutensorHandleDetachPlanCachelines also deallocates other plan-cache-related resources; it should be called once the plan cache is no longer needed in order to avoid resource leaks.
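Since the cachelines must outlive the attach/detach pair, a small RAII helper can tie allocation, attachment, detachment, and deallocation together. The helper below is not part of cuTENSOR; it is merely a sketch that reuses the HANDLE_ERROR macro from the examples:
// Hypothetical convenience wrapper (not a cuTENSOR API): attaches cachelines
// on construction and detaches/frees them on destruction.
struct ScopedPlanCache
{
    ScopedPlanCache(cutensorHandle_t* handle, int32_t numCachelines)
        : handle_(handle),
          cachelines_((cutensorPlanCacheline_t*) malloc(numCachelines * sizeof(cutensorPlanCacheline_t)))
    {
        HANDLE_ERROR( cutensorHandleAttachPlanCachelines(handle_, cachelines_, numCachelines) );
    }
    ~ScopedPlanCache()
    {
        cutensorHandleDetachPlanCachelines(handle_); // also releases other plan cache resources
        free(cachelines_);                           // free the cachelines only after detaching
    }
    cutensorHandle_t* handle_;
    cutensorPlanCacheline_t* cachelines_;
};

// Usage (assuming an initialized handle): ScopedPlanCache cache(&handle, 1024);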
With the above-mentioned changes (and a loop around the contraction call), our example now looks as follows:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <cuda_runtime.h>
5#include <cutensor.h>
6
7#include <unordered_map>
8#include <vector>
9
10// Handle cuTENSOR errors
11#define HANDLE_ERROR(x) { \
12 const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
15}
16
17int main(int argc, char** argv)
18{
19 // Host element type definition
20 typedef float floatTypeA;
21 typedef float floatTypeB;
22 typedef float floatTypeC;
23 typedef float floatTypeCompute;
24
25 // CUDA types
26 cudaDataType_t typeA = CUDA_R_32F;
27 cudaDataType_t typeB = CUDA_R_32F;
28 cudaDataType_t typeC = CUDA_R_32F;
29 cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
30
31 floatTypeCompute alpha = (floatTypeCompute)1.1f;
32 floatTypeCompute beta = (floatTypeCompute)0.9f;
33
34 printf("Include headers and define data types\n");
35
36 /* ***************************** */
37
38 // Create vector of modes
39 std::vector<int> modeC{'m','u','n','v'};
40 std::vector<int> modeA{'m','h','k','n'};
41 std::vector<int> modeB{'u','k','v','h'};
42 int nmodeA = modeA.size();
43 int nmodeB = modeB.size();
44 int nmodeC = modeC.size();
45
46 // Extents
47 std::unordered_map<int, int64_t> extent;
48 extent['m'] = 96;
49 extent['n'] = 96;
50 extent['u'] = 96;
51 extent['v'] = 64;
52 extent['h'] = 64;
53 extent['k'] = 64;
54
55 // Create a vector of extents for each tensor
56 std::vector<int64_t> extentC;
57 for(auto mode : modeC)
58 extentC.push_back(extent[mode]);
59 std::vector<int64_t> extentA;
60 for(auto mode : modeA)
61 extentA.push_back(extent[mode]);
62 std::vector<int64_t> extentB;
63 for(auto mode : modeB)
64 extentB.push_back(extent[mode]);
65
66 printf("Define modes and extents\n");
67
68 /* ***************************** */
69
70 // Number of elements of each tensor
71 size_t elementsA = 1;
72 for(auto mode : modeA)
73 elementsA *= extent[mode];
74 size_t elementsB = 1;
75 for(auto mode : modeB)
76 elementsB *= extent[mode];
77 size_t elementsC = 1;
78 for(auto mode : modeC)
79 elementsC *= extent[mode];
80
81 // Size in bytes
82 size_t sizeA = sizeof(floatTypeA) * elementsA;
83 size_t sizeB = sizeof(floatTypeB) * elementsB;
84 size_t sizeC = sizeof(floatTypeC) * elementsC;
85
86 // Allocate on device
87 void *A_d, *B_d, *C_d;
88 cudaMalloc((void**)&A_d, sizeA);
89 cudaMalloc((void**)&B_d, sizeB);
90 cudaMalloc((void**)&C_d, sizeC);
91
92 // Allocate on host
93 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
94 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
95 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
96
97 // Initialize data on host
98 for(int64_t i = 0; i < elementsA; i++)
99 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100 for(int64_t i = 0; i < elementsB; i++)
101 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102 for(int64_t i = 0; i < elementsC; i++)
103 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105 // Copy to device
106 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107 cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108 cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110 printf("Allocate, initialize and transfer tensors\n");
111
112 /* ***************************** */
113
114 // Initialize cuTENSOR library
115 cutensorHandle_t handle;
116 cutensorInit(&handle);
117
118 /**********************
119 * Setup plan cache
120 **********************/
121 printf("Attach cachelines\n");
122
123 constexpr int32_t numCachelines = 1024;
124 const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125 printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126 cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127 HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129 // Create Tensor Descriptors
130 cutensorTensorDescriptor_t descA;
131 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
132 &descA,
133 nmodeA,
134 extentA.data(),
135 NULL,/*stride*/
136 typeA, CUTENSOR_OP_IDENTITY ) );
137
138 cutensorTensorDescriptor_t descB;
139 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
140 &descB,
141 nmodeB,
142 extentB.data(),
143 NULL,/*stride*/
144 typeB, CUTENSOR_OP_IDENTITY ) );
145
146 cutensorTensorDescriptor_t descC;
147 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
148 &descC,
149 nmodeC,
150 extentC.data(),
151 NULL,/*stride*/
152 typeC, CUTENSOR_OP_IDENTITY ) );
153
154 printf("Initialize cuTENSOR and tensor descriptors\n");
155
156 /* ***************************** */
157
158 //Retrieve the memory alignment for each tensor
159 uint32_t alignmentRequirementA;
160 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
161 A_d,
162 &descA,
163 &alignmentRequirementA) );
164
165 uint32_t alignmentRequirementB;
166 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
167 B_d,
168 &descB,
169 &alignmentRequirementB) );
170
171 uint32_t alignmentRequirementC;
172 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
173 C_d,
174 &descC,
175 &alignmentRequirementC) );
176
177 printf("Query best alignment requirement for our pointers\n");
178
179 /* ***************************** */
180
181 // Create the Contraction Descriptor
182 cutensorContractionDescriptor_t desc;
183 HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
184 &desc,
185 &descA, modeA.data(), alignmentRequirementA,
186 &descB, modeB.data(), alignmentRequirementB,
187 &descC, modeC.data(), alignmentRequirementC,
188 &descC, modeC.data(), alignmentRequirementC,
189 typeCompute) );
190
191 printf("Initialize contraction descriptor\n");
192
193 /* ***************************** */
194
195 // Set the algorithm to use
196 cutensorContractionFind_t find;
197 HANDLE_ERROR( cutensorInitContractionFind(
198 &handle, &find,
199 CUTENSOR_ALGO_DEFAULT) );
200
201 printf("Initialize settings to find algorithm\n");
202
203 /* ***************************** */
204
205 // Query workspace
206 size_t worksize = 0;
207 HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
208 &desc,
209 &find,
210 CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
211
212 // Allocate workspace
213 void *work = nullptr;
214 if(worksize > 0)
215 {
216 if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
217 {
218 work = nullptr;
219 worksize = 0;
220 }
221 }
222
223 printf("Query recommended workspace size and allocate it\n");
224
225 /* ***************************** */
226 printf("Execute contraction from plan\n");
227
228 int numRuns = 5;
229 for(int i=0; i < numRuns; ++i)
230 {
231 // Create Contraction Plan && look-up cache (if attached)
232 cutensorContractionPlan_t plan;
233 HANDLE_ERROR( cutensorInitContractionPlan(&handle,
234 &plan,
235 &desc,
236 &find,
237 worksize) );
238
239 printf("Create plan for contraction\n");
240
241 /* ***************************** */
242
243 cutensorStatus_t err;
244
245 // Execute the tensor contraction
246 err = cutensorContraction(&handle,
247 &plan,
248 (void*)&alpha, A_d,
249 B_d,
250 (void*)&beta, C_d,
251 C_d,
252 work, worksize, 0 /* stream */);
253 cudaDeviceSynchronize();
254
255 // Check for errors
256 if(err != CUTENSOR_STATUS_SUCCESS)
257 {
258 printf("ERROR: %s\n", cutensorGetErrorString(err));
259 }
260 }
261
262 /* ***************************** */
263
264 // Detach cache and free-up resources
265 HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
266
267 if ( A ) free( A );
268 if ( B ) free( B );
269 if ( C ) free( C );
270 if ( cachelines ) free(cachelines);
271 if ( A_d ) cudaFree( A_d );
272 if ( B_d ) cudaFree( B_d );
273 if ( C_d ) cudaFree( C_d );
274 if ( work ) cudaFree( work );
275
276 printf("Successful completion\n");
277
278 return 0;
279}
This minimal change suffices to cache the plan for the tensor contraction calls in lines 233 and 246.
It is important to note that the call to cutensorInitContractionPlan is inside the loop; the lookup from the cache happens there.
To disable caching for a certain contraction, the
corresponding cutensorContractionFind_t
needs to be modified accordingly:
const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_CACHE_MODE,
&cacheMode,
sizeof(cutensorCacheMode_t)));
This concludes the introductory example.
Advanced Example¶
This example augments the previous one and explains how to:
- Take advantage of incremental autotuning. It is recommended to warm up the GPU (i.e., reach steady-state performance) before using autotuning (to avoid large fluctuations in measured performance).
- Use tags to distinguish two otherwise identical tensor contractions from each other. This is useful if the hardware cache of the GPU is (likely) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call) and the state of the cache is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions).
- Write the plan cache state to a file and read it back in.
Let us start by enabling incremental autotuning.
To do so, we modify cutensorContractionFind_t
as follows:
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
&autotuneMode ,
sizeof(cutensorAutotuneMode_t)));
const uint32_t incCount = 4;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
&incCount,
sizeof(uint32_t)));
The first call to cutensorContractionFindSetAttribute enables incremental autotuning, while the second call sets CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls are served from the plan cache.
Higher values of incCount explore more candidates and thus incur a larger initial overhead, but they can also result in better performance, provided the initial overhead can be amortized (e.g., when writing the cache to disc).
We feel that a CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT of four is a good default value.
The following code incorporates those changes:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <cuda_runtime.h>
5#include <cutensor.h>
6
7#include <unordered_map>
8#include <vector>
9
10// Handle cuTENSOR errors
11#define HANDLE_ERROR(x) { \
12 const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
15}
16
17int main(int argc, char** argv)
18{
19 // Host element type definition
20 typedef float floatTypeA;
21 typedef float floatTypeB;
22 typedef float floatTypeC;
23 typedef float floatTypeCompute;
24
25 // CUDA types
26 cudaDataType_t typeA = CUDA_R_32F;
27 cudaDataType_t typeB = CUDA_R_32F;
28 cudaDataType_t typeC = CUDA_R_32F;
29 cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
30
31 floatTypeCompute alpha = (floatTypeCompute)1.1f;
32 floatTypeCompute beta = (floatTypeCompute)0.9f;
33
34 printf("Include headers and define data types\n");
35
36 /* ***************************** */
37
38 // Create vector of modes
39 std::vector<int> modeC{'m','u','n','v'};
40 std::vector<int> modeA{'m','h','k','n'};
41 std::vector<int> modeB{'u','k','v','h'};
42 int nmodeA = modeA.size();
43 int nmodeB = modeB.size();
44 int nmodeC = modeC.size();
45
46 // Extents
47 std::unordered_map<int, int64_t> extent;
48 extent['m'] = 96;
49 extent['n'] = 96;
50 extent['u'] = 96;
51 extent['v'] = 64;
52 extent['h'] = 64;
53 extent['k'] = 64;
54
55 // Create a vector of extents for each tensor
56 std::vector<int64_t> extentC;
57 for(auto mode : modeC)
58 extentC.push_back(extent[mode]);
59 std::vector<int64_t> extentA;
60 for(auto mode : modeA)
61 extentA.push_back(extent[mode]);
62 std::vector<int64_t> extentB;
63 for(auto mode : modeB)
64 extentB.push_back(extent[mode]);
65
66 printf("Define modes and extents\n");
67
68 /* ***************************** */
69
70 // Number of elements of each tensor
71 size_t elementsA = 1;
72 for(auto mode : modeA)
73 elementsA *= extent[mode];
74 size_t elementsB = 1;
75 for(auto mode : modeB)
76 elementsB *= extent[mode];
77 size_t elementsC = 1;
78 for(auto mode : modeC)
79 elementsC *= extent[mode];
80
81 // Size in bytes
82 size_t sizeA = sizeof(floatTypeA) * elementsA;
83 size_t sizeB = sizeof(floatTypeB) * elementsB;
84 size_t sizeC = sizeof(floatTypeC) * elementsC;
85
86 // Allocate on device
87 void *A_d, *B_d, *C_d;
88 cudaMalloc((void**)&A_d, sizeA);
89 cudaMalloc((void**)&B_d, sizeB);
90 cudaMalloc((void**)&C_d, sizeC);
91
92 // Allocate on host
93 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
94 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
95 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
96
97 // Initialize data on host
98 for(int64_t i = 0; i < elementsA; i++)
99 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100 for(int64_t i = 0; i < elementsB; i++)
101 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102 for(int64_t i = 0; i < elementsC; i++)
103 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105 // Copy to device
106 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107 cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108 cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110 printf("Allocate, initialize and transfer tensors\n");
111
112 /* ***************************** */
113
114 // Initialize cuTENSOR library
115 cutensorHandle_t handle;
116 cutensorInit(&handle);
117
118 /**********************
119 * Setup plan cache
120 **********************/
121 printf("Attach cachelines\n");
122
123 constexpr int32_t numCachelines = 1024;
124 const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125 printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126 cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127 HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129 // Create Tensor Descriptors
130 cutensorTensorDescriptor_t descA;
131 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
132 &descA,
133 nmodeA,
134 extentA.data(),
135 NULL,/*stride*/
136 typeA, CUTENSOR_OP_IDENTITY ) );
137
138 cutensorTensorDescriptor_t descB;
139 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
140 &descB,
141 nmodeB,
142 extentB.data(),
143 NULL,/*stride*/
144 typeB, CUTENSOR_OP_IDENTITY ) );
145
146 cutensorTensorDescriptor_t descC;
147 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
148 &descC,
149 nmodeC,
150 extentC.data(),
151 NULL,/*stride*/
152 typeC, CUTENSOR_OP_IDENTITY ) );
153
154 printf("Initialize cuTENSOR and tensor descriptors\n");
155
156 /* ***************************** */
157
158 //Retrieve the memory alignment for each tensor
159 uint32_t alignmentRequirementA;
160 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
161 A_d,
162 &descA,
163 &alignmentRequirementA) );
164
165 uint32_t alignmentRequirementB;
166 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
167 B_d,
168 &descB,
169 &alignmentRequirementB) );
170
171 uint32_t alignmentRequirementC;
172 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
173 C_d,
174 &descC,
175 &alignmentRequirementC) );
176
177 printf("Query best alignment requirement for our pointers\n");
178
179 /* ***************************** */
180
181 // Create the Contraction Descriptor
182 cutensorContractionDescriptor_t desc;
183 HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
184 &desc,
185 &descA, modeA.data(), alignmentRequirementA,
186 &descB, modeB.data(), alignmentRequirementB,
187 &descC, modeC.data(), alignmentRequirementC,
188 &descC, modeC.data(), alignmentRequirementC,
189 typeCompute) );
190
191 printf("Initialize contraction descriptor\n");
192
193 /* ***************************** */
194
195 // Set the algorithm to use
196 cutensorContractionFind_t find;
197 HANDLE_ERROR( cutensorInitContractionFind(
198 &handle, &find,
199 CUTENSOR_ALGO_DEFAULT) );
200
201 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
202 HANDLE_ERROR(cutensorContractionFindSetAttribute(
203 &handle,
204 &find,
205 CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
206 &autotuneMode ,
207 sizeof(cutensorAutotuneMode_t)));
208
209 const uint32_t incCount = 4;
210 HANDLE_ERROR(cutensorContractionFindSetAttribute(
211 &handle,
212 &find,
213 CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
214 &incCount,
215 sizeof(uint32_t)));
216
217 printf("Initialize settings to find algorithm\n");
218
219 /* ***************************** */
220
221 // Query workspace
222 size_t worksize = 0;
223 HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
224 &desc,
225 &find,
226 CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
227
228 // Allocate workspace
229 void *work = nullptr;
230 if(worksize > 0)
231 {
232 if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
233 {
234 work = nullptr;
235 worksize = 0;
236 }
237 }
238
239 printf("Query recommended workspace size and allocate it\n");
240
241 /* ***************************** */
242 printf("Execute contraction from plan\n");
243
244 int numRuns = 5;
245 for(int i=0; i < numRuns; ++i)
246 {
247 // Create Contraction Plan && look-up cache (if attached)
248 cutensorContractionPlan_t plan;
249 HANDLE_ERROR( cutensorInitContractionPlan(&handle,
250 &plan,
251 &desc,
252 &find,
253 worksize) );
254
255 printf("Create plan for contraction\n");
256
257 /* ***************************** */
258
259 cutensorStatus_t err;
260
261 // Execute the tensor contraction
262 err = cutensorContraction(&handle,
263 &plan,
264 (void*)&alpha, A_d,
265 B_d,
266 (void*)&beta, C_d,
267 C_d,
268 work, worksize, 0 /* stream */);
269 cudaDeviceSynchronize();
270
271 // Check for errors
272 if(err != CUTENSOR_STATUS_SUCCESS)
273 {
274 printf("ERROR: %s\n", cutensorGetErrorString(err));
275 }
276 }
277
278 /* ***************************** */
279
280 // Detach cache and free-up resources
281 HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
282
283 if ( A ) free( A );
284 if ( B ) free( B );
285 if ( C ) free( C );
286 if ( cachelines ) free(cachelines);
287 if ( A_d ) cudaFree( A_d );
288 if ( B_d ) cudaFree( B_d );
289 if ( C_d ) cudaFree( C_d );
290 if ( work ) cudaFree( work );
291
292 printf("Successful completion\n");
293
294 return 0;
295}
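To observe the incremental exploration in practice, each iteration of the contraction loop can be timed, for instance with CUDA events; typically the first incCount iterations vary in runtime, while later iterations reuse the fastest candidate. The following sketch assumes the variables (handle, desc, find, scalars, buffers, and workspace) from the listing above:
// Timing sketch: measure each contraction iteration with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int i = 0; i < numRuns; ++i)
{
    cutensorContractionPlan_t plan;
    HANDLE_ERROR( cutensorInitContractionPlan(&handle, &plan, &desc, &find, worksize) );

    cudaEventRecord(start, 0);
    HANDLE_ERROR( cutensorContraction(&handle, &plan,
                                      (void*)&alpha, A_d, B_d,
                                      (void*)&beta,  C_d, C_d,
                                      work, worksize, 0 /* stream */) );
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("iteration %d took %.3f ms\n", i, ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);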
Let us further augment the above example by writing the plan cache to a file and reading it in (provided it was previously written):
const char cacheFilename[] = "./cache.bin";
uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_SUCCESS)
{
printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
}
else if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", cacheFilename);
}
else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
}
// ...
const char filename[] = "./cache.bin";
HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, filename) );
printf("Cache has been successfully written to file (%s).\n", filename);
Warning
cutensorHandleReadCacheFromFile only succeeds if the number of attached cachelines is sufficient to read all cachelines stored in the file; otherwise, CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is stored in numCachelinesRead.
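If that happens, the cache can be detached and re-attached with the reported number of cachelines before retrying the read; a sketch based on the error-handling branch above:
// Retry sketch: the file holds more cachelines than are currently attached, so
// re-attach using the count reported in numCachelinesRead and read again.
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
    free(cachelines);

    cachelines = (cutensorPlanCacheline_t*) malloc(numCachelinesRead * sizeof(cutensorPlanCacheline_t));
    HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelinesRead) );
    HANDLE_ERROR( cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead) );
}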
With these changes the example now looks as follows:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <cuda_runtime.h>
5#include <cutensor.h>
6
7#include <unordered_map>
8#include <vector>
9
10// Handle cuTENSOR errors
11#define HANDLE_ERROR(x) { \
12 const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
15}
16
17int main(int argc, char** argv)
18{
19 // Host element type definition
20 typedef float floatTypeA;
21 typedef float floatTypeB;
22 typedef float floatTypeC;
23 typedef float floatTypeCompute;
24
25 // CUDA types
26 cudaDataType_t typeA = CUDA_R_32F;
27 cudaDataType_t typeB = CUDA_R_32F;
28 cudaDataType_t typeC = CUDA_R_32F;
29 cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
30
31 floatTypeCompute alpha = (floatTypeCompute)1.1f;
32 floatTypeCompute beta = (floatTypeCompute)0.9f;
33
34 printf("Include headers and define data types\n");
35
36 /* ***************************** */
37
38 // Create vector of modes
39 std::vector<int> modeC{'m','u','n','v'};
40 std::vector<int> modeA{'m','h','k','n'};
41 std::vector<int> modeB{'u','k','v','h'};
42 int nmodeA = modeA.size();
43 int nmodeB = modeB.size();
44 int nmodeC = modeC.size();
45
46 // Extents
47 std::unordered_map<int, int64_t> extent;
48 extent['m'] = 96;
49 extent['n'] = 96;
50 extent['u'] = 96;
51 extent['v'] = 64;
52 extent['h'] = 64;
53 extent['k'] = 64;
54
55 // Create a vector of extents for each tensor
56 std::vector<int64_t> extentC;
57 for(auto mode : modeC)
58 extentC.push_back(extent[mode]);
59 std::vector<int64_t> extentA;
60 for(auto mode : modeA)
61 extentA.push_back(extent[mode]);
62 std::vector<int64_t> extentB;
63 for(auto mode : modeB)
64 extentB.push_back(extent[mode]);
65
66 printf("Define modes and extents\n");
67
68 /* ***************************** */
69
70 // Number of elements of each tensor
71 size_t elementsA = 1;
72 for(auto mode : modeA)
73 elementsA *= extent[mode];
74 size_t elementsB = 1;
75 for(auto mode : modeB)
76 elementsB *= extent[mode];
77 size_t elementsC = 1;
78 for(auto mode : modeC)
79 elementsC *= extent[mode];
80
81 // Size in bytes
82 size_t sizeA = sizeof(floatTypeA) * elementsA;
83 size_t sizeB = sizeof(floatTypeB) * elementsB;
84 size_t sizeC = sizeof(floatTypeC) * elementsC;
85
86 // Allocate on device
87 void *A_d, *B_d, *C_d;
88 cudaMalloc((void**)&A_d, sizeA);
89 cudaMalloc((void**)&B_d, sizeB);
90 cudaMalloc((void**)&C_d, sizeC);
91
92 // Allocate on host
93 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
94 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
95 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
96
97 // Initialize data on host
98 for(int64_t i = 0; i < elementsA; i++)
99 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100 for(int64_t i = 0; i < elementsB; i++)
101 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102 for(int64_t i = 0; i < elementsC; i++)
103 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105 // Copy to device
106 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107 cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108 cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110 printf("Allocate, initialize and transfer tensors\n");
111
112 /* ***************************** */
113
114 // Initialize cuTENSOR library
115 cutensorHandle_t handle;
116 cutensorInit(&handle);
117
118 /**********************
119 * Setup plan cache
120 **********************/
121 printf("Attach cachelines\n");
122
123 constexpr int32_t numCachelines = 1024;
124 const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125 printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126 cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127 HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129
130 const char cacheFilename[] = "./cache.bin";
131 uint32_t numCachelinesRead = 0;
132 cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
133 if (status == CUTENSOR_STATUS_SUCCESS)
134 {
135 printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
136 }
137 else if (status == CUTENSOR_STATUS_IO_ERROR)
138 {
139 printf("File (%s) doesn't seem to exist.\n", cacheFilename);
140 }
141 else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
142 {
143 printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
144 }
145
146 // Create Tensor Descriptors
147 cutensorTensorDescriptor_t descA;
148 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
149 &descA,
150 nmodeA,
151 extentA.data(),
152 NULL,/*stride*/
153 typeA, CUTENSOR_OP_IDENTITY ) );
154
155 cutensorTensorDescriptor_t descB;
156 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
157 &descB,
158 nmodeB,
159 extentB.data(),
160 NULL,/*stride*/
161 typeB, CUTENSOR_OP_IDENTITY ) );
162
163 cutensorTensorDescriptor_t descC;
164 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
165 &descC,
166 nmodeC,
167 extentC.data(),
168 NULL,/*stride*/
169 typeC, CUTENSOR_OP_IDENTITY ) );
170
171 printf("Initialize cuTENSOR and tensor descriptors\n");
172
173 /* ***************************** */
174
175 //Retrieve the memory alignment for each tensor
176 uint32_t alignmentRequirementA;
177 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
178 A_d,
179 &descA,
180 &alignmentRequirementA) );
181
182 uint32_t alignmentRequirementB;
183 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
184 B_d,
185 &descB,
186 &alignmentRequirementB) );
187
188 uint32_t alignmentRequirementC;
189 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
190 C_d,
191 &descC,
192 &alignmentRequirementC) );
193
194 printf("Query best alignment requirement for our pointers\n");
195
196 /* ***************************** */
197
198 // Create the Contraction Descriptor
199 cutensorContractionDescriptor_t desc;
200 HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
201 &desc,
202 &descA, modeA.data(), alignmentRequirementA,
203 &descB, modeB.data(), alignmentRequirementB,
204 &descC, modeC.data(), alignmentRequirementC,
205 &descC, modeC.data(), alignmentRequirementC,
206 typeCompute) );
207
208 printf("Initialize contraction descriptor\n");
209
210 /* ***************************** */
211
212 // Set the algorithm to use
213 cutensorContractionFind_t find;
214 HANDLE_ERROR( cutensorInitContractionFind(
215 &handle, &find,
216 CUTENSOR_ALGO_DEFAULT) );
217
218 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
219 HANDLE_ERROR(cutensorContractionFindSetAttribute(
220 &handle,
221 &find,
222 CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
223 &autotuneMode ,
224 sizeof(cutensorAutotuneMode_t)));
225
226 const uint32_t incCount = 4;
227 HANDLE_ERROR(cutensorContractionFindSetAttribute(
228 &handle,
229 &find,
230 CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
231 &incCount,
232 sizeof(uint32_t)));
233
234 printf("Initialize settings to find algorithm\n");
235
236 /* ***************************** */
237
238 // Query workspace
239 size_t worksize = 0;
240 HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
241 &desc,
242 &find,
243 CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
244
245 // Allocate workspace
246 void *work = nullptr;
247 if(worksize > 0)
248 {
249 if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
250 {
251 work = nullptr;
252 worksize = 0;
253 }
254 }
255
256 printf("Query recommended workspace size and allocate it\n");
257
258 /* ***************************** */
259 printf("Execute contraction from plan\n");
260
261 int numRuns = 5;
262 for(int i=0; i < numRuns; ++i)
263 {
264 // Create Contraction Plan && look-up cache (if attached)
265 cutensorContractionPlan_t plan;
266 HANDLE_ERROR( cutensorInitContractionPlan(&handle,
267 &plan,
268 &desc,
269 &find,
270 worksize) );
271
272 printf("Create plan for contraction\n");
273
274 /* ***************************** */
275
276 cutensorStatus_t err;
277
278 // Execute the tensor contraction
279 err = cutensorContraction(&handle,
280 &plan,
281 (void*)&alpha, A_d,
282 B_d,
283 (void*)&beta, C_d,
284 C_d,
285 work, worksize, 0 /* stream */);
286 cudaDeviceSynchronize();
287
288 // Check for errors
289 if(err != CUTENSOR_STATUS_SUCCESS)
290 {
291 printf("ERROR: %s\n", cutensorGetErrorString(err));
292 }
293 }
294
295
296 /* ***************************** */
297 HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
298 printf("Cache has been successfully written to file (%s).\n", cacheFilename);
299
300 // Detach cache and free-up resources
301 HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
302
303 if ( A ) free( A );
304 if ( B ) free( B );
305 if ( C ) free( C );
306 if ( cachelines ) free(cachelines);
307 if ( A_d ) cudaFree( A_d );
308 if ( B_d ) cudaFree( B_d );
309 if ( C_d ) cudaFree( C_d );
310 if ( work ) cudaFree( work );
311
312 printf("Successful completion\n");
313
314 return 0;
315}
Finally, let us add a second contraction loop, but this time we want the otherwise identical contraction to be cached using a different cacheline. This can be useful if the state of the hardware cache differs substantially between the two calls (i.e., it affects the measured runtime of the kernels). To that end, we use the CUTENSOR_CONTRACTION_DESCRIPTOR_TAG attribute:
uint32_t tag = 1;
HANDLE_ERROR( cutensorContractionDescriptorSetAttribute(
&handle,
&desc,
CUTENSOR_CONTRACTION_DESCRIPTOR_TAG,
&tag,
sizeof(uint32_t)));
With this change, the example code now looks as follows:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <cuda_runtime.h>
5#include <cutensor.h>
6
7#include <unordered_map>
8#include <vector>
9
10// Handle cuTENSOR errors
11#define HANDLE_ERROR(x) { \
12 const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
15}
16
17int main(int argc, char** argv)
18{
19 // Host element type definition
20 typedef float floatTypeA;
21 typedef float floatTypeB;
22 typedef float floatTypeC;
23 typedef float floatTypeCompute;
24
25 // CUDA types
26 cudaDataType_t typeA = CUDA_R_32F;
27 cudaDataType_t typeB = CUDA_R_32F;
28 cudaDataType_t typeC = CUDA_R_32F;
29 cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
30
31 floatTypeCompute alpha = (floatTypeCompute)1.1f;
32 floatTypeCompute beta = (floatTypeCompute)0.9f;
33
34 printf("Include headers and define data types\n");
35
36 /* ***************************** */
37
38 // Create vector of modes
39 std::vector<int> modeC{'m','u','n','v'};
40 std::vector<int> modeA{'m','h','k','n'};
41 std::vector<int> modeB{'u','k','v','h'};
42 int nmodeA = modeA.size();
43 int nmodeB = modeB.size();
44 int nmodeC = modeC.size();
45
46 // Extents
47 std::unordered_map<int, int64_t> extent;
48 extent['m'] = 96;
49 extent['n'] = 96;
50 extent['u'] = 96;
51 extent['v'] = 64;
52 extent['h'] = 64;
53 extent['k'] = 64;
54
55 // Create a vector of extents for each tensor
56 std::vector<int64_t> extentC;
57 for(auto mode : modeC)
58 extentC.push_back(extent[mode]);
59 std::vector<int64_t> extentA;
60 for(auto mode : modeA)
61 extentA.push_back(extent[mode]);
62 std::vector<int64_t> extentB;
63 for(auto mode : modeB)
64 extentB.push_back(extent[mode]);
65
66 printf("Define modes and extents\n");
67
68 /* ***************************** */
69
70 // Number of elements of each tensor
71 size_t elementsA = 1;
72 for(auto mode : modeA)
73 elementsA *= extent[mode];
74 size_t elementsB = 1;
75 for(auto mode : modeB)
76 elementsB *= extent[mode];
77 size_t elementsC = 1;
78 for(auto mode : modeC)
79 elementsC *= extent[mode];
80
81 // Size in bytes
82 size_t sizeA = sizeof(floatTypeA) * elementsA;
83 size_t sizeB = sizeof(floatTypeB) * elementsB;
84 size_t sizeC = sizeof(floatTypeC) * elementsC;
85
86 // Allocate on device
87 void *A_d, *B_d, *C_d;
88 cudaMalloc((void**)&A_d, sizeA);
89 cudaMalloc((void**)&B_d, sizeB);
90 cudaMalloc((void**)&C_d, sizeC);
91
92 // Allocate on host
93 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
94 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
95 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
96
97 // Initialize data on host
98 for(int64_t i = 0; i < elementsA; i++)
99 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100 for(int64_t i = 0; i < elementsB; i++)
101 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102 for(int64_t i = 0; i < elementsC; i++)
103 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105 // Copy to device
106 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107 cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108 cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110 printf("Allocate, initialize and transfer tensors\n");
111
112 /* ***************************** */
113
114 // Initialize cuTENSOR library
115 cutensorHandle_t handle;
116 cutensorInit(&handle);
117
118 /**********************
119 * Setup plan cache
120 **********************/
121 printf("Attach cachelines\n");
122
123 constexpr int32_t numCachelines = 1024;
124 const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125 printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126 cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127 HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129
130 const char cacheFilename[] = "./cache.bin";
131 uint32_t numCachelinesRead = 0;
132 cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
133 if (status == CUTENSOR_STATUS_SUCCESS)
134 {
135 printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
136 }
137 else if (status == CUTENSOR_STATUS_IO_ERROR)
138 {
139 printf("File (%s) doesn't seem to exist.\n", cacheFilename);
140 }
141 else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
142 {
143 printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
144 }
145
146 // Create Tensor Descriptors
147 cutensorTensorDescriptor_t descA;
148 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
149 &descA,
150 nmodeA,
151 extentA.data(),
152 NULL,/*stride*/
153 typeA, CUTENSOR_OP_IDENTITY ) );
154
155 cutensorTensorDescriptor_t descB;
156 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
157 &descB,
158 nmodeB,
159 extentB.data(),
160 NULL,/*stride*/
161 typeB, CUTENSOR_OP_IDENTITY ) );
162
163 cutensorTensorDescriptor_t descC;
164 HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
165 &descC,
166 nmodeC,
167 extentC.data(),
168 NULL,/*stride*/
169 typeC, CUTENSOR_OP_IDENTITY ) );
170
171 printf("Initialize cuTENSOR and tensor descriptors\n");
172
173 /* ***************************** */
174
175 //Retrieve the memory alignment for each tensor
176 uint32_t alignmentRequirementA;
177 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
178 A_d,
179 &descA,
180 &alignmentRequirementA) );
181
182 uint32_t alignmentRequirementB;
183 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
184 B_d,
185 &descB,
186 &alignmentRequirementB) );
187
188 uint32_t alignmentRequirementC;
189 HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
190 C_d,
191 &descC,
192 &alignmentRequirementC) );
193
194 printf("Query best alignment requirement for our pointers\n");
195
196 /* ***************************** */
197
198 // Create the Contraction Descriptor
199 cutensorContractionDescriptor_t desc;
200 HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
201 &desc,
202 &descA, modeA.data(), alignmentRequirementA,
203 &descB, modeB.data(), alignmentRequirementB,
204 &descC, modeC.data(), alignmentRequirementC,
205 &descC, modeC.data(), alignmentRequirementC,
206 typeCompute) );
207
208 printf("Initialize contraction descriptor\n");
209
210 /* ***************************** */
211
212 // Set the algorithm to use
213 cutensorContractionFind_t find;
214 HANDLE_ERROR( cutensorInitContractionFind(
215 &handle, &find,
216 CUTENSOR_ALGO_DEFAULT) );
217
218 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
219 HANDLE_ERROR(cutensorContractionFindSetAttribute(
220 &handle,
221 &find,
222 CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
223 &autotuneMode ,
224 sizeof(cutensorAutotuneMode_t)));
225
226 const uint32_t incCount = 4;
227 HANDLE_ERROR(cutensorContractionFindSetAttribute(
228 &handle,
229 &find,
230 CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
231 &incCount,
232 sizeof(uint32_t)));
233
234 printf("Initialize settings to find algorithm\n");
235
236 /* ***************************** */
237
238 // Query workspace
239 size_t worksize = 0;
240 HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
241 &desc,
242 &find,
243 CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
244
245 // Allocate workspace
246 void *work = nullptr;
247 if(worksize > 0)
248 {
249 if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
250 {
251 work = nullptr;
252 worksize = 0;
253 }
254 }
255
256 printf("Query recommended workspace size and allocate it\n");
257
258 /* ***************************** */
259 printf("Execute contraction from plan\n");
260
261 int numRuns = 5;
262 for(int i=0; i < numRuns; ++i)
263 {
264 // Create Contraction Plan && look-up cache (if attached)
265 cutensorContractionPlan_t plan;
266 HANDLE_ERROR( cutensorInitContractionPlan(&handle,
267 &plan,
268 &desc,
269 &find,
270 worksize) );
271
272 printf("Create plan for contraction\n");
273
274 /* ***************************** */
275
276 cutensorStatus_t err;
277
278 // Execute the tensor contraction
279 err = cutensorContraction(&handle,
280 &plan,
281 (void*)&alpha, A_d,
282 B_d,
283 (void*)&beta, C_d,
284 C_d,
285 work, worksize, 0 /* stream */);
286 cudaDeviceSynchronize();
287
288 // Check for errors
289 if(err != CUTENSOR_STATUS_SUCCESS)
290 {
291 printf("ERROR: %s\n", cutensorGetErrorString(err));
292 }
293 }
294
295 uint32_t tag = 1;
296 HANDLE_ERROR( cutensorContractionDescriptorSetAttribute(
297 &handle,
298 &desc,
299 CUTENSOR_CONTRACTION_DESCRIPTOR_TAG,
300 &tag,
301 sizeof(uint32_t)));
302
303 for(int i=0; i < numRuns; ++i)
304 {
305 // Create Contraction Plan && look-up cache (if attached)
306 cutensorContractionPlan_t plan;
307 HANDLE_ERROR( cutensorInitContractionPlan(&handle,
308 &plan,
309 &desc,
310 &find,
311 worksize) );
312
313 printf("Create plan for contraction\n");
314
315 /* ***************************** */
316
317 cutensorStatus_t err;
318
319 // Execute the tensor contraction
320 err = cutensorContraction(&handle,
321 &plan,
322 (void*)&alpha, A_d,
323 B_d,
324 (void*)&beta, C_d,
325 C_d,
326 work, worksize, 0 /* stream */);
327 cudaDeviceSynchronize();
328
329 // Check for errors
330 if(err != CUTENSOR_STATUS_SUCCESS)
331 {
332 printf("ERROR: %s\n", cutensorGetErrorString(err));
333 }
334 }
335
336
337 /* ***************************** */
338 HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
339 printf("Cache has been successfully written to file (%s).\n", cacheFilename);
340
341 // Detach cache and free-up resources
342 HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
343
344 if ( A ) free( A );
345 if ( B ) free( B );
346 if ( C ) free( C );
347 if ( cachelines ) free(cachelines);
348 if ( A_d ) cudaFree( A_d );
349 if ( B_d ) cudaFree( B_d );
350 if ( C_d ) cudaFree( C_d );
351 if ( work ) cudaFree( work );
352
353 printf("Successful completion\n");
354
355 return 0;
356}
You can confirm that the cache has two entries now by invoking the binary once again; this time it should report that “2 cachelines have been successfully read from file (./cache.bin)”.
This concludes our example of the plan cache; you can find these examples (including timings and warm-up runs) in the samples repository.
If you have any further questions or suggestions, please do not hesitate to reach out.