Plan Cache#
This section introduces the software-managed plan cache, which has the following key features:

- Minimizes launch-related overhead (e.g., due to kernel selection).
- Provides overhead-free autotuning (a.k.a. incremental autotuning), which enables users to automatically find the best implementation for a given problem and thereby increases performance.
- Is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.
- Can be serialized and deserialized, allowing users to store the state of the cache to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (i.e., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).
The remainder of this section assumes familiarity with Getting Started.
Note
The cache is activated by default and can be deactivated via the CUTENSOR_DISABLE_PLAN_CACHE
environment variable (see Environment Variables).
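Setting the variable in the launch environment before the application starts is the usual approach. Purely as an illustration, the sketch below sets it programmatically before the handle is created; the variable name comes from this documentation, but the value "1" and the assumption that the library reads the variable at handle creation are not confirmed here (see Environment Variables for the accepted values).

#include <cstdlib>   // setenv (POSIX)
#include <cstdio>
#include <cutensor.h>

int main()
{
    // Illustrative only: prefer setting CUTENSOR_DISABLE_PLAN_CACHE in the
    // launch environment. Here we assume a non-empty value disables the cache
    // and that the variable is evaluated when the handle is created.
    setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", /*overwrite=*/1);

    cutensorHandle_t handle;
    if (cutensorCreate(&handle) != CUTENSOR_STATUS_SUCCESS)
    {
        printf("Failed to create cuTENSOR handle.\n");
        return -1;
    }

    // ... use cuTENSOR as usual; plans are not cached in this run ...

    cutensorDestroy(handle);
    return 0;
}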
Incremental Autotuning#
The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given operation.

When the cache is used with incremental autotuning (CUTENSOR_AUTOTUNE_MODE_INCREMENTAL), successive invocations of the same operation (albeit with potentially different data pointers) are performed by different candidates; the timings of those candidates are measured automatically, and the fastest candidate is stored in the plan cache. The number of candidates to explore is configurable by the user (via CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT); all subsequent calls for the same problem are then mapped to the fastest (measured) candidate stored in the cache.
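As a sketch of the resulting calling pattern (assuming a handle, a plan created with incremental autotuning and an incremental count of incCount, plus data pointers and a workspace set up as in the full listing later in this section):

#include <cstdio>
#include <cuda_runtime.h>
#include <cutensor.h>

// Sketch only: `plan` is assumed to have been created with
// CUTENSOR_AUTOTUNE_MODE_INCREMENTAL and an incremental count of `incCount`;
// data pointers and workspace are assumed to be set up as in the full example.
void runWithIncrementalAutotuning(cutensorHandle_t handle, cutensorPlan_t plan,
                                  const void* alpha, const void* A_d, const void* B_d,
                                  const void* beta, void* C_d,
                                  void* work, uint64_t workspaceSize,
                                  uint32_t incCount, cudaStream_t stream)
{
    // The first `incCount` executions each run (and time) a different candidate;
    // all later executions reuse the fastest candidate stored in the plan cache.
    for (uint32_t i = 0; i < incCount + 1; ++i)
    {
        cutensorStatus_t err = cutensorContract(handle, plan,
                                                alpha, A_d, B_d,
                                                beta, C_d, C_d,
                                                work, workspaceSize, stream);
        if (err != CUTENSOR_STATUS_SUCCESS)
            printf("Error: %s\n", cutensorGetErrorString(err));
    }
}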
This autotuning approach has some key advantages:

- Candidates are evaluated at a point in time when the hardware caches are in a production-environment state (i.e., the hardware cache state reflects the real-world situation).
- Overhead is minimized (i.e., no timing loop, no synchronization).

Moreover, the candidates are evaluated in the order given by our performance model (from fastest to slowest).
Incremental autotuning is especially powerful when combined with cuTENSOR’s cache serialization feature (via cutensorHandleWritePlanCacheToFile() and cutensorHandleReadPlanCacheFromFile()), which writes the tuned cache to disk.
Note
We recommend warming up the GPU (i.e., reaching steady-state performance) before auto-tuning to minimize fluctuations in measured performance.
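As a minimal sketch of such a warm-up: the hypothetical warmupPlan below is assumed to have been created with CUTENSOR_CACHE_MODE_NONE so that these runs do not interact with the plan cache, and the data pointers and workspace are assumed to be set up as in the examples of this section.

#include <cuda_runtime.h>
#include <cutensor.h>

// Sketch only: executes a representative contraction a few times to bring the
// GPU to steady-state clocks before any timed/autotuned runs.
void warmUpGpu(cutensorHandle_t handle, cutensorPlan_t warmupPlan,
               const void* alpha, const void* A_d, const void* B_d,
               const void* beta, void* C_d,
               void* work, uint64_t workspaceSize, cudaStream_t stream)
{
    const int numWarmupRuns = 3; // a handful of runs is typically sufficient
    for (int i = 0; i < numWarmupRuns; ++i)
    {
        cutensorContract(handle, warmupPlan, alpha, A_d, B_d, beta, C_d, C_d,
                         work, workspaceSize, stream);
    }
    cudaStreamSynchronize(stream); // ensure the warm-up has completed before timing
}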
Introductory Example#
This subsection provides a basic overview of the cache-related API calls and features. In addition to the steps outlined in Getting Started, in this example we also:
- Set a suitable cache size.
- Configure the cache behavior on a contraction-by-contraction basis (via cutensorPlanPreferenceSetAttribute()).
Let’s first look at the same example outlined in Getting Started: since cuTENSOR 2.x has the cache enabled by default, it is already taking advantage of caching. While optional, the following code demonstrates how to resize the cache from its implementation-defined initial size.
// Set cache size (number of cachelines)
constexpr uint32_t numEntries = 128;
HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
// ...
Note that the number of entries is configurable by the user; ideally, the cache is large enough to hold an entry for each of the application’s distinct contraction calls. Since this might not always be possible (due to memory constraints), cuTENSOR’s plan cache evicts entries using a least-recently-used (LRU) policy. Users can also choose to disable caching on a contraction-by-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note that the cache lookup occurs when the plan is created. As such, caching is particularly helpful if the same contraction is planned multiple times in the same application.
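The sketch below illustrates this behavior; it assumes the handle, desc, planPref, and workspaceSizeEstimate objects as well as the HANDLE_ERROR macro from the Getting Started example.

// Sketch only: `handle`, `desc`, `planPref`, and `workspaceSizeEstimate` are
// assumed to exist as in the Getting Started example.

// First plan creation for this problem: performs kernel selection and
// inserts the result into the plan cache.
cutensorPlan_t plan1;
HANDLE_ERROR(cutensorCreatePlan(handle, &plan1, desc, planPref, workspaceSizeEstimate));

// Second plan creation for the *same* descriptor and preferences: the lookup
// happens here, so this call is served from the plan cache and is much cheaper.
cutensorPlan_t plan2;
HANDLE_ERROR(cutensorCreatePlan(handle, &plan2, desc, planPref, workspaceSizeEstimate));

HANDLE_ERROR(cutensorDestroyPlan(plan2));
HANDLE_ERROR(cutensorDestroyPlan(plan1));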
To disable caching for a certain contraction (i.e., to opt out), the following attribute of cutensorPlanPreference_t needs to be set:
const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
    &cacheMode,
    sizeof(cutensorCacheMode_t)));
This concludes the introductory example.
Advanced Example#
This example augments the previous one and explains how to:

- Take advantage of incremental autotuning.
  - It is recommended to warm up the GPU (i.e., to reach steady-state performance) before autotuning to avoid large fluctuations in measured performance.
- Use tags to distinguish two otherwise identical tensor contractions from each other.
  - This is useful if the hardware cache of the GPU is (likely) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call) and the state of the cache is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions).
- Write the plan cache state to a file and read it back in.
Let us start by enabling incremental autotuning.
To do so, we modify cutensorPlanPreference_t
as follows:
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
    &autotuneMode,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));
The first call to cutensorPlanPreferenceSetAttribute() enables incremental autotuning, while the second call sets CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls are looked up from the plan cache. Higher values of incCount explore more candidates and thus cause a larger initial overhead, but they can also result in better performance, provided the initial overhead can be amortized (e.g., when writing the cache to disk). We consider a CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT of four a good default value.
The following code incorporates those changes:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <unordered_map>
5#include <vector>
6#include <cassert>
7
8#include <cuda_runtime.h>
9#include <cutensor.h>
10
11#define HANDLE_ERROR(x) \
12{ const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15};
16
17#define HANDLE_CUDA_ERROR(x) \
18{ const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21};
22
23struct GPUTimer
24{
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53};
54
55int main()
56{
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
164 **********************/
165 uint32_t numEntries = 128;
166 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
167
168 /**********************
169 * Create Tensor Descriptors
170 **********************/
171 cutensorTensorDescriptor_t descA;
172 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
173 &descA,
174 nmodeA,
175 extentA.data(),
176 NULL,/*stride*/
177 typeA, kAlignment));
178
179 cutensorTensorDescriptor_t descB;
180 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
181 &descB,
182 nmodeB,
183 extentB.data(),
184 NULL,/*stride*/
185 typeB, kAlignment));
186
187 cutensorTensorDescriptor_t descC;
188 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
189 &descC,
190 nmodeC,
191 extentC.data(),
192 NULL,/*stride*/
193 typeC, kAlignment));
194
195 /*******************************
196 * Create Contraction Descriptor
197 *******************************/
198
199 cutensorOperationDescriptor_t desc;
200 HANDLE_ERROR(cutensorCreateContraction(handle,
201 &desc,
202 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
203 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
204 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
205 descC, modeC.data(),
206 descCompute));
207
208 /**************************
209 * PlanPreference: Set the algorithm to use and enable incremental autotuning
210 ***************************/
211
212 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
213
214 cutensorPlanPreference_t planPref;
215 HANDLE_ERROR(cutensorCreatePlanPreference(
216 handle,
217 &planPref,
218 algo,
219 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
220
221 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
222 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
223 handle,
224 planPref,
225 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
226 &cacheMode,
227 sizeof(cutensorCacheMode_t)));
228
229 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
230 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
231 handle,
232 planPref,
233 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
234 &autotuneMode ,
235 sizeof(cutensorAutotuneMode_t)));
236
237 const uint32_t incCount = 4;
238 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
239 handle,
240 planPref,
241 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
242 &incCount,
243 sizeof(uint32_t)));
244
245 /**********************
246 * Query workspace estimate
247 **********************/
248
249 uint64_t workspaceSizeEstimate = 0;
250 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
251 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
252 desc,
253 planPref,
254 workspacePref,
255 &workspaceSizeEstimate));
256
257 /**************************
258 * Create Contraction Plan
259 **************************/
260
261 cutensorPlan_t plan;
262 HANDLE_ERROR(cutensorCreatePlan(handle,
263 &plan,
264 desc,
265 planPref,
266 workspaceSizeEstimate));
267
268 /**************************
269 * Optional: Query information about the created plan
270 **************************/
271
272 // query actually used workspace
273 uint64_t actualWorkspaceSize = 0;
274 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
275 plan,
276 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
277 &actualWorkspaceSize,
278 sizeof(actualWorkspaceSize)));
279
280 // At this point the user knows exactly how much memory is needed by the operation and
281 // only the smaller actual workspace needs to be allocated
282 assert(actualWorkspaceSize <= workspaceSizeEstimate);
283
284 void *work = nullptr;
285 if (actualWorkspaceSize > 0)
286 {
287 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
288 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
289 }
290
291 /**********************
292 * Run
293 **********************/
294
295 double minTimeCUTENSOR = 1e100;
296 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
297 {
298 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
299 cudaDeviceSynchronize();
300
301 // Set up timing
302 GPUTimer timer;
303 timer.start();
304
305 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
306 HANDLE_ERROR(cutensorContract(handle,
307 plan,
308 (void*) &alpha, A_d, B_d,
309 (void*) &beta, C_d, C_d,
310 work, actualWorkspaceSize, 0 /* stream */));
311
312 // Synchronize and measure timing
313 auto time = timer.seconds();
314
315 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
316 }
317
318 /*************************/
319
320 double transferedBytes = sizeC + sizeA + sizeB;
321 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
322 transferedBytes /= 1e9;
323 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
324
325 HANDLE_ERROR(cutensorDestroy(handle));
326 HANDLE_ERROR(cutensorDestroyPlan(plan));
327 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
328 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
329 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
330 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
331
332 if (A) free(A);
333 if (B) free(B);
334 if (C) free(C);
335 if (A_d) cudaFree(A_d);
336 if (B_d) cudaFree(B_d);
337 if (C_d) cudaFree(C_d);
338 if (work) cudaFree(work);
339
340 return 0;
341}
Let us further augment the above example by writing the plan cache to a file and reading it in (provided it was previously written):
const char planCacheFilename[] = "./planCache.bin";
uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
}
else
{
printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
numCachelines);
}
// ...
status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) couldn't be written to.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
cutensorGetErrorString(status));
}
else
{
printf("Plan cache successfully stored to %s.\n", planCacheFilename);
}
Warning
cutensorHandleReadPlanCacheFromFile() only succeeds if the plan cache is large enough to hold all cachelines stored in the file; otherwise CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is stored in numCachelinesRead.
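One way to handle this case is to resize the plan cache to the reported number of cachelines and retry the read. A minimal sketch, reusing handle, planCacheFilename, numCachelines, status, and the HANDLE_ERROR macro from the snippet above:

// Sketch only: if the current plan cache is too small to hold the file's
// cachelines, resize it to the reported number and retry the read.
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // numCachelines was set by the failed read to the required number of cachelines
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numCachelines));
    HANDLE_ERROR(cutensorHandleReadPlanCacheFromFile(handle,
                 planCacheFilename, &numCachelines));
}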
With these changes the example now looks as follows:
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 #include <unordered_map>
5 #include <vector>
6 #include <cassert>
7
8 #include <cuda_runtime.h>
9 #include <cutensor.h>
10
11 #define HANDLE_ERROR(x) \
12 { const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15 };
16
17 #define HANDLE_CUDA_ERROR(x) \
18 { const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21 };
22
23 struct GPUTimer
24 {
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53 };
54
55 int main()
56 {
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Load plan cache
164 **********************/
165
166 // File that persists the per-handle plan cache across program runs
167 const char planCacheFilename[] = "./planCache.bin";
168 uint32_t numCachelines = 0;
169 cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170 planCacheFilename, &numCachelines);
171 if (status == CUTENSOR_STATUS_IO_ERROR)
172 {
173 printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174 }
175 else if (status != CUTENSOR_STATUS_SUCCESS)
176 {
177 printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178 }
179 else
180 {
181 printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182 numCachelines);
183 }
184
185 /**********************
186 * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
187 **********************/
188 uint32_t numEntries = 128;
189 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191 /**********************
192 * Create Tensor Descriptors
193 **********************/
194 cutensorTensorDescriptor_t descA;
195 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196 &descA,
197 nmodeA,
198 extentA.data(),
199 NULL,/*stride*/
200 typeA, kAlignment));
201
202 cutensorTensorDescriptor_t descB;
203 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204 &descB,
205 nmodeB,
206 extentB.data(),
207 NULL,/*stride*/
208 typeB, kAlignment));
209
210 cutensorTensorDescriptor_t descC;
211 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212 &descC,
213 nmodeC,
214 extentC.data(),
215 NULL,/*stride*/
216 typeC, kAlignment));
217
218 /*******************************
219 * Create Contraction Descriptor
220 *******************************/
221
222 cutensorOperationDescriptor_t desc;
223 HANDLE_ERROR(cutensorCreateContraction(handle,
224 &desc,
225 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228 descC, modeC.data(),
229 descCompute));
230
231 /**************************
232 * PlanPreference: Set the algorithm to use and enable incremental autotuning
233 ***************************/
234
235 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237 cutensorPlanPreference_t planPref;
238 HANDLE_ERROR(cutensorCreatePlanPreference(
239 handle,
240 &planPref,
241 algo,
242 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246 handle,
247 planPref,
248 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249 &cacheMode,
250 sizeof(cutensorCacheMode_t)));
251
252 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254 handle,
255 planPref,
256 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257 &autotuneMode ,
258 sizeof(cutensorAutotuneMode_t)));
259
260 const uint32_t incCount = 4;
261 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262 handle,
263 planPref,
264 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265 &incCount,
266 sizeof(uint32_t)));
267
268 /**********************
269 * Query workspace estimate
270 **********************/
271
272 uint64_t workspaceSizeEstimate = 0;
273 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275 desc,
276 planPref,
277 workspacePref,
278 &workspaceSizeEstimate));
279
280 /**************************
281 * Create Contraction Plan
282 **************************/
283
284 cutensorPlan_t plan;
285 HANDLE_ERROR(cutensorCreatePlan(handle,
286 &plan,
287 desc,
288 planPref,
289 workspaceSizeEstimate));
290
291 /**************************
292 * Optional: Query information about the created plan
293 **************************/
294
295 // query actually used workspace
296 uint64_t actualWorkspaceSize = 0;
297 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298 plan,
299 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300 &actualWorkspaceSize,
301 sizeof(actualWorkspaceSize)));
302
303 // At this point the user knows exactly how much memory is needed by the operation and
304 // only the smaller actual workspace needs to be allocated
305 assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307 void *work = nullptr;
308 if (actualWorkspaceSize > 0)
309 {
310 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312 }
313
314 /**********************
315 * Run
316 **********************/
317
318 double minTimeCUTENSOR = 1e100;
319 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320 {
321 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322 cudaDeviceSynchronize();
323
324 // Set up timing
325 GPUTimer timer;
326 timer.start();
327
328 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329 HANDLE_ERROR(cutensorContract(handle,
330 plan,
331 (void*) &alpha, A_d, B_d,
332 (void*) &beta, C_d, C_d,
333 work, actualWorkspaceSize, 0 /* stream */));
334
335 // Synchronize and measure timing
336 auto time = timer.seconds();
337
338 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339 }
340
341 /*************************/
342
343 double transferedBytes = sizeC + sizeA + sizeB;
344 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345 transferedBytes /= 1e9;
346 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348 status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
349 if (status == CUTENSOR_STATUS_IO_ERROR)
350 {
351 printf("File (%s) couldn't be written to.\n", planCacheFilename);
352 }
353 else if (status != CUTENSOR_STATUS_SUCCESS)
354 {
355 printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
356 cutensorGetErrorString(status));
357 }
358 else
359 {
360 printf("Plan cache successfully stored to %s.\n", planCacheFilename);
361 }
362
363
364 HANDLE_ERROR(cutensorDestroy(handle));
365 HANDLE_ERROR(cutensorDestroyPlan(plan));
366 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
367 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
368 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
369 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
370
371 if (A) free(A);
372 if (B) free(B);
373 if (C) free(C);
374 if (A_d) cudaFree(A_d);
375 if (B_d) cudaFree(B_d);
376 if (C_d) cudaFree(C_d);
377 if (work) cudaFree(work);
378
379 return 0;
380 }
Finally, let us add a second contraction loop, but this time we want the otherwise identical contraction to be cached using a different cacheline. This can be useful if the state of the hardware cache differs substantially between the two calls (i.e., affecting the measured runtime of the kernels). To that end, we use the CUTENSOR_OPERATION_DESCRIPTOR_TAG attribute:
uint32_t tag = 1;
HANDLE_ERROR(cutensorOperationDescriptorSetAttribute(
    handle,
    desc,
    CUTENSOR_OPERATION_DESCRIPTOR_TAG,
    &tag,
    sizeof(uint32_t)));
With this change, the example code now looks as follows:
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 #include <unordered_map>
5 #include <vector>
6 #include <cassert>
7
8 #include <cuda_runtime.h>
9 #include <cutensor.h>
10
11 #define HANDLE_ERROR(x) \
12 { const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15 };
16
17 #define HANDLE_CUDA_ERROR(x) \
18 { const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21 };
22
23 struct GPUTimer
24 {
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53 };
54
55 int main()
56 {
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Load plan cache
164 **********************/
165
166 // File that persists the per-handle plan cache across program runs
167 const char planCacheFilename[] = "./planCache.bin";
168 uint32_t numCachelines = 0;
169 cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170 planCacheFilename, &numCachelines);
171 if (status == CUTENSOR_STATUS_IO_ERROR)
172 {
173 printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174 }
175 else if (status != CUTENSOR_STATUS_SUCCESS)
176 {
177 printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178 }
179 else
180 {
181 printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182 numCachelines);
183 }
184
185 /**********************
186 * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
187 **********************/
188 uint32_t numEntries = 128;
189 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191 /**********************
192 * Create Tensor Descriptors
193 **********************/
194 cutensorTensorDescriptor_t descA;
195 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196 &descA,
197 nmodeA,
198 extentA.data(),
199 NULL,/*stride*/
200 typeA, kAlignment));
201
202 cutensorTensorDescriptor_t descB;
203 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204 &descB,
205 nmodeB,
206 extentB.data(),
207 NULL,/*stride*/
208 typeB, kAlignment));
209
210 cutensorTensorDescriptor_t descC;
211 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212 &descC,
213 nmodeC,
214 extentC.data(),
215 NULL,/*stride*/
216 typeC, kAlignment));
217
218 /*******************************
219 * Create Contraction Descriptor
220 *******************************/
221
222 cutensorOperationDescriptor_t desc;
223 HANDLE_ERROR(cutensorCreateContraction(handle,
224 &desc,
225 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228 descC, modeC.data(),
229 descCompute));
230
231 /**************************
232 * PlanPreference: Set the algorithm to use and enable incremental autotuning
233 ***************************/
234
235 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237 cutensorPlanPreference_t planPref;
238 HANDLE_ERROR(cutensorCreatePlanPreference(
239 handle,
240 &planPref,
241 algo,
242 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246 handle,
247 planPref,
248 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249 &cacheMode,
250 sizeof(cutensorCacheMode_t)));
251
252 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254 handle,
255 planPref,
256 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257 &autotuneMode ,
258 sizeof(cutensorAutotuneMode_t)));
259
260 const uint32_t incCount = 4;
261 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262 handle,
263 planPref,
264 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265 &incCount,
266 sizeof(uint32_t)));
267
268 /**********************
269 * Query workspace estimate
270 **********************/
271
272 uint64_t workspaceSizeEstimate = 0;
273 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275 desc,
276 planPref,
277 workspacePref,
278 &workspaceSizeEstimate));
279
280 /**************************
281 * Create Contraction Plan
282 **************************/
283
284 cutensorPlan_t plan;
285 HANDLE_ERROR(cutensorCreatePlan(handle,
286 &plan,
287 desc,
288 planPref,
289 workspaceSizeEstimate));
290
291 /**************************
292 * Optional: Query information about the created plan
293 **************************/
294
295 // query actually used workspace
296 uint64_t actualWorkspaceSize = 0;
297 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298 plan,
299 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300 &actualWorkspaceSize,
301 sizeof(actualWorkspaceSize)));
302
303 // At this point the user knows exactly how much memory is needed by the operation and
304 // only the smaller actual workspace needs to be allocated
305 assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307 void *work = nullptr;
308 if (actualWorkspaceSize > 0)
309 {
310 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312 }
313
314 /**********************
315 * Run
316 **********************/
317
318 double minTimeCUTENSOR = 1e100;
319 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320 {
321 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322 cudaDeviceSynchronize();
323
324 // Set up timing
325 GPUTimer timer;
326 timer.start();
327
328 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329 HANDLE_ERROR(cutensorContract(handle,
330 plan,
331 (void*) &alpha, A_d, B_d,
332 (void*) &beta, C_d, C_d,
333 work, actualWorkspaceSize, 0 /* stream */));
334
335 // Synchronize and measure timing
336 auto time = timer.seconds();
337
338 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339 }
340
341 /*************************/
342
343 double transferedBytes = sizeC + sizeA + sizeB;
344 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345 transferedBytes /= 1e9;
346 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348 uint32_t tag = 1;
349 HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
350 handle,
351 desc,
352 CUTENSOR_OPERATION_DESCRIPTOR_TAG,
353 &tag,
354 sizeof(uint32_t)));
355
356 /**************************
357 * Create Contraction Plan (with a different tag)
358 **************************/
359
360 HANDLE_ERROR(cutensorDestroyPlan(plan)); // destroy the previous plan; the variable is reused for the tagged plan
361 HANDLE_ERROR(cutensorCreatePlan(handle,
362 &plan,
363 desc,
364 planPref,
365 workspaceSizeEstimate));
366
367 /**************************
368 * Optional: Query information about the created plan
369 **************************/
370
371 // query actually used workspace
372 actualWorkspaceSize = 0;
373 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
374 plan,
375 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
376 &actualWorkspaceSize,
377 sizeof(actualWorkspaceSize)));
378
379 // At this point the user knows exactly how much memory is needed by the operation and
380 // only the smaller actual workspace needs to be allocated
381 assert(actualWorkspaceSize <= workspaceSizeEstimate);
382
383 if (work) { HANDLE_CUDA_ERROR(cudaFree(work)); work = nullptr; } // free the previous workspace; the variable is reused
384 if (actualWorkspaceSize > 0)
385 {
386 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
387 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
388 }
389
390 /**********************
391 * Run
392 **********************/
393
394 minTimeCUTENSOR = 1e100;
395 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
396 {
397 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
398 cudaDeviceSynchronize();
399
400 // Set up timing
401 GPUTimer timer;
402 timer.start();
403
404 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
405 HANDLE_ERROR(cutensorContract(handle,
406 plan,
407 (void*) &alpha, A_d, B_d,
408 (void*) &beta, C_d, C_d,
409 work, actualWorkspaceSize, 0 /* stream */));
410
411 // Synchronize and measure timing
412 auto time = timer.seconds();
413
414 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
415 }
416
417 /*************************/
418
419 double transferedBytes = sizeC + sizeA + sizeB;
420 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
421 transferedBytes /= 1e9;
422 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
423
424 status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
425 if (status == CUTENSOR_STATUS_IO_ERROR)
426 {
427 printf("File (%s) couldn't be written to.\n", planCacheFilename);
428 }
429 else if (status != CUTENSOR_STATUS_SUCCESS)
430 {
431 printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
432 cutensorGetErrorString(status));
433 }
434 else
435 {
436 printf("Plan cache successfully stored to %s.\n", planCacheFilename);
437 }
438
439
440 HANDLE_ERROR(cutensorDestroy(handle));
441 HANDLE_ERROR(cutensorDestroyPlan(plan));
442 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
443 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
444 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
445 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
446
447 if (A) free(A);
448 if (B) free(B);
449 if (C) free(C);
450 if (A_d) cudaFree(A_d);
451 if (B_d) cudaFree(B_d);
452 if (C_d) cudaFree(C_d);
453 if (work) cudaFree(work);
454
455 return 0;
456 }
You can confirm that the cache now has two entries by invoking the binary once again; this time it should report that two cachelines were read from the file (./planCache.bin).
This concludes our example of the plan cache; you can find these examples (including timings and warm-up runs) in the samples repository.
If you have any further questions or suggestions, please do not hesitate to reach out.