Plan Cache (beta)

This section introduces the software-managed plan cache, which has the following key features:

  • Minimizes launch-related overhead (e.g., due to kernel selection).

  • Overhead-free autotuning (a.k.a. incremental autotuning).

    • This feature automatically finds the best implementation for a given problem and thereby increases performance.

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.

  • Serialization and deserialization of the cache:

    • Allows users to store the state of the cache to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorContractionDescriptor_t) to an actual implementation (encoded by cutensorContractionPlan_t).

The remainder of this section assumes familiarity with Getting Started.

Note

The cache can always be deactivated via the CUTENSOR_DISABLE_PLAN_CACHE environment variable (see Environment Variables).
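
For instance, the variable can be set in the shell that launches the application (e.g., CUTENSOR_DISABLE_PLAN_CACHE=1 ./my_app). As a sketch, it could also be set programmatically before the library is initialized (this assumes the variable is read at initialization time):

// Sketch: disable the plan cache for this process. This assumes the environment
// variable is read when the handle is initialized, so it must be set before cutensorInit().
setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", 1 /* overwrite */);

cutensorHandle_t handle;
cutensorInit(&handle);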

Incremental Autotuning

The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given problem.

When the cache is used with the incremental autotuning feature (CUTENSOR_AUTOTUNE_INCREMENTAL), successive invocations of the same contraction problem (albeit with potentially different data pointers) are performed by different candidates; the timing of each candidate is measured automatically, and the fastest candidate is stored in the plan cache. The number of candidates to explore is configurable by the user (via CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT); all subsequent calls for the same problem are then mapped to the fastest candidate stored in the cache, thus taking advantage of the fastest (real-world) candidate.

This autotuning approach has some key advantages:

  • Candidates are evaluated at a point in time when hardware caches are in a production-environment state (i.e., the hardware cache state reflects the real-world situation).

  • Overhead is minimized (i.e., no timing loop, no synchronization).

    • Moreover, the candidates are evaluated in the order given by our performance model (from fastest to slowest).

Incremental autotuning is especially powerful when combined with cuTENSOR's cache serialization feature (via cutensorHandleWriteCacheToFile and cutensorHandleReadCacheFromFile), which allows the tuned cache to be written to disk and reused later.

Note

We recommend warming up the GPU (i.e., reaching steady-state performance) before autotuning is used, in order to minimize fluctuations in the measured performance.
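
A minimal warm-up sketch could look as follows (assuming that handle, plan, alpha, beta, A_d, B_d, C_d, work, and worksize are set up as in the examples below; the number of warm-up iterations is arbitrary):

// Warm-up sketch: execute the contraction a few times so that the GPU reaches
// steady-state clocks before any timing-relevant (autotuned) runs are issued.
for(int i = 0; i < 3; ++i)
{
    HANDLE_ERROR( cutensorContraction(&handle, &plan,
                       (void*)&alpha, A_d, B_d,
                       (void*)&beta,  C_d, C_d,
                       work, worksize, 0 /* stream */) );
}
cudaDeviceSynchronize(); // make sure the warm-up runs have finished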

Introductory Example

This subsection provides a basic overview of the cache-related API calls and features. In addition to the steps outlined in Getting Started, in this example we also:

  • Attach cachelines to the handle (via cutensorHandleAttachPlanCachelines).

  • Configure the cache behavior on a contraction-by-contraction basis (via cutensorContractionFindSetAttribute).

Let’s start from the same example outlined in Getting Started and enhance it with the steps required to use the plan cache. First we have to add calls to attach and detach cachelines:

// Set number of cache lines
constexpr int32_t numCachelines = 1024;
// Set cache size and allocate
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
// Attach cache
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );

// ...

// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );

Note that the number of cachelines is configurable by the user; ideally we would supply as many cachelines as the application has distinct contraction calls. Since this might not always be possible (due to memory constraints), cuTENSOR's plan cache evicts cache entries using a least-recently-used (LRU) policy. Users can also disable caching on a contraction-by-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note

Until the cachelines are detached using cutensorHandleDetachPlanCachelines, the user-allocated cachelines must not be freed.

Moreover, cutensorHandleDetachPlanCachelines also deallocates other plan-cache-related resources; it should be called once the plan cache is no longer needed in order to avoid resource leaks.
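
If attaching and detaching should be tied to a scope (e.g., in exception-prone code), a small RAII helper can pair the two calls. The following is merely an illustrative sketch and not part of the cuTENSOR API:

// Illustrative RAII helper (not part of cuTENSOR): attaches user-owned cachelines
// on construction and detaches them on destruction (which also frees the internal
// plan cache resources), ensuring the cachelines outlive all uses of the cache.
struct PlanCacheGuard
{
    PlanCacheGuard(cutensorHandle_t* handle, int32_t numCachelines)
        : handle_(handle), cachelines_(numCachelines)
    {
        HANDLE_ERROR( cutensorHandleAttachPlanCachelines(handle_,
                          cachelines_.data(), numCachelines) );
    }
    ~PlanCacheGuard()
    {
        cutensorHandleDetachPlanCachelines(handle_);
    }
    cutensorHandle_t* handle_;
    std::vector<cutensorPlanCacheline_t> cachelines_;
};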

With the above-mentioned changes (and a loop around the contraction call), our example now looks as follows:

  1#include <stdlib.h>
  2#include <stdio.h>
  3
  4#include <cuda_runtime.h>
  5#include <cutensor.h>
  6
  7#include <unordered_map>
  8#include <vector>
  9
 10// Handle cuTENSOR errors
 11#define HANDLE_ERROR(x) {                                                              \
 12  const auto err = x;                                                                  \
 13  if( err != CUTENSOR_STATUS_SUCCESS )                                                   \
 14  { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
 15}
 16
 17int main(int argc, char** argv)
 18{
 19  // Host element type definition
 20  typedef float floatTypeA;
 21  typedef float floatTypeB;
 22  typedef float floatTypeC;
 23  typedef float floatTypeCompute;
 24
 25  // CUDA types
 26  cudaDataType_t typeA = CUDA_R_32F;
 27  cudaDataType_t typeB = CUDA_R_32F;
 28  cudaDataType_t typeC = CUDA_R_32F;
 29  cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
 30
 31  floatTypeCompute alpha = (floatTypeCompute)1.1f;
 32  floatTypeCompute beta  = (floatTypeCompute)0.9f;
 33
 34  printf("Include headers and define data types\n");
 35
 36  /* ***************************** */
 37
 38  // Create vector of modes
 39  std::vector<int> modeC{'m','u','n','v'};
 40  std::vector<int> modeA{'m','h','k','n'};
 41  std::vector<int> modeB{'u','k','v','h'};
 42  int nmodeA = modeA.size();
 43  int nmodeB = modeB.size();
 44  int nmodeC = modeC.size();
 45
 46  // Extents
 47  std::unordered_map<int, int64_t> extent;
 48  extent['m'] = 96;
 49  extent['n'] = 96;
 50  extent['u'] = 96;
 51  extent['v'] = 64;
 52  extent['h'] = 64;
 53  extent['k'] = 64;
 54
 55  // Create a vector of extents for each tensor
 56  std::vector<int64_t> extentC;
 57  for(auto mode : modeC)
 58      extentC.push_back(extent[mode]);
 59  std::vector<int64_t> extentA;
 60  for(auto mode : modeA)
 61      extentA.push_back(extent[mode]);
 62  std::vector<int64_t> extentB;
 63  for(auto mode : modeB)
 64      extentB.push_back(extent[mode]);
 65
 66  printf("Define modes and extents\n");
 67
 68  /* ***************************** */
 69
 70  // Number of elements of each tensor
 71  size_t elementsA = 1;
 72  for(auto mode : modeA)
 73      elementsA *= extent[mode];
 74  size_t elementsB = 1;
 75  for(auto mode : modeB)
 76      elementsB *= extent[mode];
 77  size_t elementsC = 1;
 78  for(auto mode : modeC)
 79      elementsC *= extent[mode];
 80
 81  // Size in bytes
 82  size_t sizeA = sizeof(floatTypeA) * elementsA;
 83  size_t sizeB = sizeof(floatTypeB) * elementsB;
 84  size_t sizeC = sizeof(floatTypeC) * elementsC;
 85
 86  // Allocate on device
 87  void *A_d, *B_d, *C_d;
 88  cudaMalloc((void**)&A_d, sizeA);
 89  cudaMalloc((void**)&B_d, sizeB);
 90  cudaMalloc((void**)&C_d, sizeC);
 91
 92  // Allocate on host
 93  floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
 94  floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
 95  floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
 96
 97  // Initialize data on host
 98  for(int64_t i = 0; i < elementsA; i++)
 99      A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100  for(int64_t i = 0; i < elementsB; i++)
101      B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102  for(int64_t i = 0; i < elementsC; i++)
103      C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105  // Copy to device
106  cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107  cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108  cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110  printf("Allocate, initialize and transfer tensors\n");
111
112  /* ***************************** */
113
114  // Initialize cuTENSOR library
115  cutensorHandle_t handle;
116  cutensorInit(&handle);
117
118  /**********************
119   * Setup plan cache
120   **********************/
121  printf("Attach cachelines\n");
122
123  constexpr int32_t numCachelines = 1024;
124  const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125  printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126  cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127  HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129  // Create Tensor Descriptors
130  cutensorTensorDescriptor_t descA;
131  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
132              &descA,
133              nmodeA,
134              extentA.data(),
135              NULL,/*stride*/
136              typeA, CUTENSOR_OP_IDENTITY ) );
137
138  cutensorTensorDescriptor_t descB;
139  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
140              &descB,
141              nmodeB,
142              extentB.data(),
143              NULL,/*stride*/
144              typeB, CUTENSOR_OP_IDENTITY ) );
145
146  cutensorTensorDescriptor_t descC;
147  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
148              &descC,
149              nmodeC,
150              extentC.data(),
151              NULL,/*stride*/
152              typeC, CUTENSOR_OP_IDENTITY ) );
153
154  printf("Initialize cuTENSOR and tensor descriptors\n");
155
156  /* ***************************** */
157
158  //Retrieve the memory alignment for each tensor
159  uint32_t alignmentRequirementA;
160  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
161             A_d,
162             &descA,
163             &alignmentRequirementA) );
164
165  uint32_t alignmentRequirementB;
166  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
167             B_d,
168             &descB,
169             &alignmentRequirementB) );
170
171  uint32_t alignmentRequirementC;
172  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
173             C_d,
174             &descC,
175             &alignmentRequirementC) );
176
177  printf("Query best alignment requirement for our pointers\n");
178
179  /* ***************************** */
180
181  // Create the Contraction Descriptor
182  cutensorContractionDescriptor_t desc;
183  HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
184              &desc,
185              &descA, modeA.data(), alignmentRequirementA,
186              &descB, modeB.data(), alignmentRequirementB,
187              &descC, modeC.data(), alignmentRequirementC,
188              &descC, modeC.data(), alignmentRequirementC,
189              typeCompute) );
190
191  printf("Initialize contraction descriptor\n");
192
193  /* ***************************** */
194
195  // Set the algorithm to use
196  cutensorContractionFind_t find;
197  HANDLE_ERROR( cutensorInitContractionFind(
198              &handle, &find,
199              CUTENSOR_ALGO_DEFAULT) );
200
201  printf("Initialize settings to find algorithm\n");
202
203  /* ***************************** */
204
205  // Query workspace
206  size_t worksize = 0;
207  HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
208              &desc,
209              &find,
210              CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
211
212  // Allocate workspace
213  void *work = nullptr;
214  if(worksize > 0)
215  {
216      if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
217      {
218          work = nullptr;
219          worksize = 0;
220      }
221  }
222
223  printf("Query recommended workspace size and allocate it\n");
224
225  /* ***************************** */
226  printf("Execute contraction from plan\n");
227
228  int numRuns = 5;
229  for(int i=0; i < numRuns; ++i)
230  {
231     // Create Contraction Plan && look-up cache (if attached)
232     cutensorContractionPlan_t plan;
233     HANDLE_ERROR( cutensorInitContractionPlan(&handle,
234                                               &plan,
235                                               &desc,
236                                               &find,
237                                               worksize) );
238
239     printf("Create plan for contraction\n");
240
241     /* ***************************** */
242
243     cutensorStatus_t err;
244
245     // Execute the tensor contraction
246     err = cutensorContraction(&handle,
247                               &plan,
248                        (void*)&alpha, A_d,
249                                       B_d,
250                        (void*)&beta,  C_d,
251                                       C_d,
252                               work, worksize, 0 /* stream */);
253     cudaDeviceSynchronize();
254
255     // Check for errors
256     if(err != CUTENSOR_STATUS_SUCCESS)
257     {
258         printf("ERROR: %s\n", cutensorGetErrorString(err));
259     }
260  }
261
262  /* ***************************** */
263
264  // Detach cache and free-up resources
265  HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
266
267  if ( A ) free( A );
268  if ( B ) free( B );
269  if ( C ) free( C );
270  if ( cachelines ) free(cachelines);
271  if ( A_d ) cudaFree( A_d );
272  if ( B_d ) cudaFree( B_d );
273  if ( C_d ) cudaFree( C_d );
274  if ( work ) cudaFree( work );
275
276  printf("Successful completion\n");
277
278  return 0;
279}

This minimal change suffices to cache the plan for the tensor contraction calls in lines 233 and 246. It is important to note that the call to cutensorInitContractionPlan is inside the loop; the cache lookup happens at this point.

To disable caching for a certain contraction, the corresponding cutensorContractionFind_t needs to be modified accordingly:

const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
     &handle,
     &find,
     CUTENSOR_CONTRACTION_FIND_CACHE_MODE,
     &cacheMode,
     sizeof(cutensorCacheMode_t)));

This concludes the introductory example.

Advanced Example

This example augments the previous one and explains how to:

  • Take advantage of incremental auto-tuning

    • It is recommended to warm up the GPU (i.e., reach steady-state performance) before using autotuning (to avoid large fluctuations in the measured performance)

  • Use tags to distinguish two otherwise identical tensor contractions from each other

    • This is useful if the hardware cache of the GPU is (likely) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call) and the state of the cache is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions)

  • Write the plan cache state to a file and read it back in

Let us start by enabling incremental autotuning. To do so, we modify cutensorContractionFind_t as follows:

const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
    &handle,
    &find,
    CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
    &autotuneMode ,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
    &handle,
    &find,
    CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));

The first call to cutensorContractionFindSetAttribute enables incremental autotuning, while the second sets CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that are explored via incremental autotuning before subsequent calls are served from the plan cache. Higher values of incCount explore more candidates and thus incur a larger initial overhead, but they can also result in better performance, provided the initial overhead can be amortized (e.g., by writing the cache to disk). A CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT of four is a good default value.

The following code incorporates those changes:

  1#include <stdlib.h>
  2#include <stdio.h>
  3
  4#include <cuda_runtime.h>
  5#include <cutensor.h>
  6
  7#include <unordered_map>
  8#include <vector>
  9
 10// Handle cuTENSOR errors
 11#define HANDLE_ERROR(x) {                                                              \
 12  const auto err = x;                                                                  \
 13  if( err != CUTENSOR_STATUS_SUCCESS )                                                   \
 14  { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
 15}
 16
 17int main(int argc, char** argv)
 18{
 19  // Host element type definition
 20  typedef float floatTypeA;
 21  typedef float floatTypeB;
 22  typedef float floatTypeC;
 23  typedef float floatTypeCompute;
 24
 25  // CUDA types
 26  cudaDataType_t typeA = CUDA_R_32F;
 27  cudaDataType_t typeB = CUDA_R_32F;
 28  cudaDataType_t typeC = CUDA_R_32F;
 29  cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
 30
 31  floatTypeCompute alpha = (floatTypeCompute)1.1f;
 32  floatTypeCompute beta  = (floatTypeCompute)0.9f;
 33
 34  printf("Include headers and define data types\n");
 35
 36  /* ***************************** */
 37
 38  // Create vector of modes
 39  std::vector<int> modeC{'m','u','n','v'};
 40  std::vector<int> modeA{'m','h','k','n'};
 41  std::vector<int> modeB{'u','k','v','h'};
 42  int nmodeA = modeA.size();
 43  int nmodeB = modeB.size();
 44  int nmodeC = modeC.size();
 45
 46  // Extents
 47  std::unordered_map<int, int64_t> extent;
 48  extent['m'] = 96;
 49  extent['n'] = 96;
 50  extent['u'] = 96;
 51  extent['v'] = 64;
 52  extent['h'] = 64;
 53  extent['k'] = 64;
 54
 55  // Create a vector of extents for each tensor
 56  std::vector<int64_t> extentC;
 57  for(auto mode : modeC)
 58      extentC.push_back(extent[mode]);
 59  std::vector<int64_t> extentA;
 60  for(auto mode : modeA)
 61      extentA.push_back(extent[mode]);
 62  std::vector<int64_t> extentB;
 63  for(auto mode : modeB)
 64      extentB.push_back(extent[mode]);
 65
 66  printf("Define modes and extents\n");
 67
 68  /* ***************************** */
 69
 70  // Number of elements of each tensor
 71  size_t elementsA = 1;
 72  for(auto mode : modeA)
 73      elementsA *= extent[mode];
 74  size_t elementsB = 1;
 75  for(auto mode : modeB)
 76      elementsB *= extent[mode];
 77  size_t elementsC = 1;
 78  for(auto mode : modeC)
 79      elementsC *= extent[mode];
 80
 81  // Size in bytes
 82  size_t sizeA = sizeof(floatTypeA) * elementsA;
 83  size_t sizeB = sizeof(floatTypeB) * elementsB;
 84  size_t sizeC = sizeof(floatTypeC) * elementsC;
 85
 86  // Allocate on device
 87  void *A_d, *B_d, *C_d;
 88  cudaMalloc((void**)&A_d, sizeA);
 89  cudaMalloc((void**)&B_d, sizeB);
 90  cudaMalloc((void**)&C_d, sizeC);
 91
 92  // Allocate on host
 93  floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
 94  floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
 95  floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
 96
 97  // Initialize data on host
 98  for(int64_t i = 0; i < elementsA; i++)
 99      A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100  for(int64_t i = 0; i < elementsB; i++)
101      B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102  for(int64_t i = 0; i < elementsC; i++)
103      C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105  // Copy to device
106  cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107  cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108  cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110  printf("Allocate, initialize and transfer tensors\n");
111
112  /* ***************************** */
113
114  // Initialize cuTENSOR library
115  cutensorHandle_t handle;
116  cutensorInit(&handle);
117
118  /**********************
119   * Setup plan cache
120   **********************/
121  printf("Attach cachelines\n");
122
123  constexpr int32_t numCachelines = 1024;
124  const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125  printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126  cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127  HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129  // Create Tensor Descriptors
130  cutensorTensorDescriptor_t descA;
131  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
132              &descA,
133              nmodeA,
134              extentA.data(),
135              NULL,/*stride*/
136              typeA, CUTENSOR_OP_IDENTITY ) );
137
138  cutensorTensorDescriptor_t descB;
139  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
140              &descB,
141              nmodeB,
142              extentB.data(),
143              NULL,/*stride*/
144              typeB, CUTENSOR_OP_IDENTITY ) );
145
146  cutensorTensorDescriptor_t descC;
147  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
148              &descC,
149              nmodeC,
150              extentC.data(),
151              NULL,/*stride*/
152              typeC, CUTENSOR_OP_IDENTITY ) );
153
154  printf("Initialize cuTENSOR and tensor descriptors\n");
155
156  /* ***************************** */
157
158  //Retrieve the memory alignment for each tensor
159  uint32_t alignmentRequirementA;
160  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
161             A_d,
162             &descA,
163             &alignmentRequirementA) );
164
165  uint32_t alignmentRequirementB;
166  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
167             B_d,
168             &descB,
169             &alignmentRequirementB) );
170
171  uint32_t alignmentRequirementC;
172  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
173             C_d,
174             &descC,
175             &alignmentRequirementC) );
176
177  printf("Query best alignment requirement for our pointers\n");
178
179  /* ***************************** */
180
181  // Create the Contraction Descriptor
182  cutensorContractionDescriptor_t desc;
183  HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
184              &desc,
185              &descA, modeA.data(), alignmentRequirementA,
186              &descB, modeB.data(), alignmentRequirementB,
187              &descC, modeC.data(), alignmentRequirementC,
188              &descC, modeC.data(), alignmentRequirementC,
189              typeCompute) );
190
191  printf("Initialize contraction descriptor\n");
192
193  /* ***************************** */
194
195  // Set the algorithm to use
196  cutensorContractionFind_t find;
197  HANDLE_ERROR( cutensorInitContractionFind(
198              &handle, &find,
199              CUTENSOR_ALGO_DEFAULT) );
200
201 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
202 HANDLE_ERROR(cutensorContractionFindSetAttribute(
203     &handle,
204     &find,
205     CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
206     &autotuneMode ,
207     sizeof(cutensorAutotuneMode_t)));
208
209 const uint32_t incCount = 4;
210 HANDLE_ERROR(cutensorContractionFindSetAttribute(
211     &handle,
212     &find,
213     CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
214     &incCount,
215     sizeof(uint32_t)));
216
217  printf("Initialize settings to find algorithm\n");
218
219  /* ***************************** */
220
221  // Query workspace
222  size_t worksize = 0;
223  HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
224              &desc,
225              &find,
226              CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
227
228  // Allocate workspace
229  void *work = nullptr;
230  if(worksize > 0)
231  {
232      if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
233      {
234          work = nullptr;
235          worksize = 0;
236      }
237  }
238
239  printf("Query recommended workspace size and allocate it\n");
240
241  /* ***************************** */
242  printf("Execute contraction from plan\n");
243
244  int numRuns = 5;
245  for(int i=0; i < numRuns; ++i)
246  {
247     // Create Contraction Plan && look-up cache (if attached)
248     cutensorContractionPlan_t plan;
249     HANDLE_ERROR( cutensorInitContractionPlan(&handle,
250                                               &plan,
251                                               &desc,
252                                               &find,
253                                               worksize) );
254
255     printf("Create plan for contraction\n");
256
257     /* ***************************** */
258
259     cutensorStatus_t err;
260
261     // Execute the tensor contraction
262     err = cutensorContraction(&handle,
263                               &plan,
264                        (void*)&alpha, A_d,
265                                       B_d,
266                        (void*)&beta,  C_d,
267                                       C_d,
268                               work, worksize, 0 /* stream */);
269     cudaDeviceSynchronize();
270
271     // Check for errors
272     if(err != CUTENSOR_STATUS_SUCCESS)
273     {
274         printf("ERROR: %s\n", cutensorGetErrorString(err));
275     }
276  }
277
278  /* ***************************** */
279
280  // Detach cache and free-up resources
281  HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
282
283  if ( A ) free( A );
284  if ( B ) free( B );
285  if ( C ) free( C );
286  if ( cachelines ) free(cachelines);
287  if ( A_d ) cudaFree( A_d );
288  if ( B_d ) cudaFree( B_d );
289  if ( C_d ) cudaFree( C_d );
290  if ( work ) cudaFree( work );
291
292  printf("Successful completion\n");
293
294  return 0;
295}

Let us further augment the above example by writing the plan cache to a file and reading it in (provided it was previously written):

const char cacheFilename[] = "./cache.bin";
uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_SUCCESS)
{
    printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
}
else if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) doesn't seem to exist.\n", cacheFilename);
}
else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
}

// ...

const char filename[] = "./cache.bin";
HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, filename) );
printf("Cache has been successfully written to file (%s).\n", filename);

Warning

cutensorHandleReadCacheFromFile only succeeds if the number of attached cachelines is sufficient to read all cachelines stored in the file; otherwise, CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is stored in numCachelinesRead.
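
A sketch of how such a retry could be implemented (the control flow is illustrative; it assumes handle, cachelines, and cacheFilename as in the surrounding example):

// Sketch: if the attached cache is too small, numCachelinesRead holds the required
// number of cachelines after the failed read; attach a sufficiently large cache and retry.
uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
    free(cachelines);
    cachelines = (cutensorPlanCacheline_t*) malloc(numCachelinesRead * sizeof(cutensorPlanCacheline_t));
    HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelinesRead) );
    HANDLE_ERROR( cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead) );
}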

With these changes the example now looks as follows:

  1#include <stdlib.h>
  2#include <stdio.h>
  3
  4#include <cuda_runtime.h>
  5#include <cutensor.h>
  6
  7#include <unordered_map>
  8#include <vector>
  9
 10// Handle cuTENSOR errors
 11#define HANDLE_ERROR(x) {                                                              \
 12  const auto err = x;                                                                  \
 13  if( err != CUTENSOR_STATUS_SUCCESS )                                                   \
 14  { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
 15}
 16
 17int main(int argc, char** argv)
 18{
 19  // Host element type definition
 20  typedef float floatTypeA;
 21  typedef float floatTypeB;
 22  typedef float floatTypeC;
 23  typedef float floatTypeCompute;
 24
 25  // CUDA types
 26  cudaDataType_t typeA = CUDA_R_32F;
 27  cudaDataType_t typeB = CUDA_R_32F;
 28  cudaDataType_t typeC = CUDA_R_32F;
 29  cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
 30
 31  floatTypeCompute alpha = (floatTypeCompute)1.1f;
 32  floatTypeCompute beta  = (floatTypeCompute)0.9f;
 33
 34  printf("Include headers and define data types\n");
 35
 36  /* ***************************** */
 37
 38  // Create vector of modes
 39  std::vector<int> modeC{'m','u','n','v'};
 40  std::vector<int> modeA{'m','h','k','n'};
 41  std::vector<int> modeB{'u','k','v','h'};
 42  int nmodeA = modeA.size();
 43  int nmodeB = modeB.size();
 44  int nmodeC = modeC.size();
 45
 46  // Extents
 47  std::unordered_map<int, int64_t> extent;
 48  extent['m'] = 96;
 49  extent['n'] = 96;
 50  extent['u'] = 96;
 51  extent['v'] = 64;
 52  extent['h'] = 64;
 53  extent['k'] = 64;
 54
 55  // Create a vector of extents for each tensor
 56  std::vector<int64_t> extentC;
 57  for(auto mode : modeC)
 58      extentC.push_back(extent[mode]);
 59  std::vector<int64_t> extentA;
 60  for(auto mode : modeA)
 61      extentA.push_back(extent[mode]);
 62  std::vector<int64_t> extentB;
 63  for(auto mode : modeB)
 64      extentB.push_back(extent[mode]);
 65
 66  printf("Define modes and extents\n");
 67
 68  /* ***************************** */
 69
 70  // Number of elements of each tensor
 71  size_t elementsA = 1;
 72  for(auto mode : modeA)
 73      elementsA *= extent[mode];
 74  size_t elementsB = 1;
 75  for(auto mode : modeB)
 76      elementsB *= extent[mode];
 77  size_t elementsC = 1;
 78  for(auto mode : modeC)
 79      elementsC *= extent[mode];
 80
 81  // Size in bytes
 82  size_t sizeA = sizeof(floatTypeA) * elementsA;
 83  size_t sizeB = sizeof(floatTypeB) * elementsB;
 84  size_t sizeC = sizeof(floatTypeC) * elementsC;
 85
 86  // Allocate on device
 87  void *A_d, *B_d, *C_d;
 88  cudaMalloc((void**)&A_d, sizeA);
 89  cudaMalloc((void**)&B_d, sizeB);
 90  cudaMalloc((void**)&C_d, sizeC);
 91
 92  // Allocate on host
 93  floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
 94  floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
 95  floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
 96
 97  // Initialize data on host
 98  for(int64_t i = 0; i < elementsA; i++)
 99      A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100  for(int64_t i = 0; i < elementsB; i++)
101      B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102  for(int64_t i = 0; i < elementsC; i++)
103      C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105  // Copy to device
106  cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107  cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108  cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110  printf("Allocate, initialize and transfer tensors\n");
111
112  /* ***************************** */
113
114  // Initialize cuTENSOR library
115  cutensorHandle_t handle;
116  cutensorInit(&handle);
117
118  /**********************
119   * Setup plan cache
120   **********************/
121  printf("Attach cachelines\n");
122
123  constexpr int32_t numCachelines = 1024;
124  const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125  printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126  cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127  HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129
130  const char cacheFilename[] = "./cache.bin";
131  uint32_t numCachelinesRead = 0;
132  cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
133  if (status == CUTENSOR_STATUS_SUCCESS)
134  {
135      printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
136  }
137  else if (status == CUTENSOR_STATUS_IO_ERROR)
138  {
139      printf("File (%s) doesn't seem to exist.\n", cacheFilename);
140  }
141  else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
142  {
143      printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
144  }
145
146  // Create Tensor Descriptors
147  cutensorTensorDescriptor_t descA;
148  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
149              &descA,
150              nmodeA,
151              extentA.data(),
152              NULL,/*stride*/
153              typeA, CUTENSOR_OP_IDENTITY ) );
154
155  cutensorTensorDescriptor_t descB;
156  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
157              &descB,
158              nmodeB,
159              extentB.data(),
160              NULL,/*stride*/
161              typeB, CUTENSOR_OP_IDENTITY ) );
162
163  cutensorTensorDescriptor_t descC;
164  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
165              &descC,
166              nmodeC,
167              extentC.data(),
168              NULL,/*stride*/
169              typeC, CUTENSOR_OP_IDENTITY ) );
170
171  printf("Initialize cuTENSOR and tensor descriptors\n");
172
173  /* ***************************** */
174
175  //Retrieve the memory alignment for each tensor
176  uint32_t alignmentRequirementA;
177  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
178             A_d,
179             &descA,
180             &alignmentRequirementA) );
181
182  uint32_t alignmentRequirementB;
183  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
184             B_d,
185             &descB,
186             &alignmentRequirementB) );
187
188  uint32_t alignmentRequirementC;
189  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
190             C_d,
191             &descC,
192             &alignmentRequirementC) );
193
194  printf("Query best alignment requirement for our pointers\n");
195
196  /* ***************************** */
197
198  // Create the Contraction Descriptor
199  cutensorContractionDescriptor_t desc;
200  HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
201              &desc,
202              &descA, modeA.data(), alignmentRequirementA,
203              &descB, modeB.data(), alignmentRequirementB,
204              &descC, modeC.data(), alignmentRequirementC,
205              &descC, modeC.data(), alignmentRequirementC,
206              typeCompute) );
207
208  printf("Initialize contraction descriptor\n");
209
210  /* ***************************** */
211
212  // Set the algorithm to use
213  cutensorContractionFind_t find;
214  HANDLE_ERROR( cutensorInitContractionFind(
215              &handle, &find,
216              CUTENSOR_ALGO_DEFAULT) );
217
218  const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
219  HANDLE_ERROR(cutensorContractionFindSetAttribute(
220      &handle,
221      &find,
222      CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
223      &autotuneMode ,
224      sizeof(cutensorAutotuneMode_t)));
225
226  const uint32_t incCount = 4;
227  HANDLE_ERROR(cutensorContractionFindSetAttribute(
228      &handle,
229      &find,
230      CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
231      &incCount,
232      sizeof(uint32_t)));
233
234  printf("Initialize settings to find algorithm\n");
235
236  /* ***************************** */
237
238  // Query workspace
239  size_t worksize = 0;
240  HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
241              &desc,
242              &find,
243              CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
244
245  // Allocate workspace
246  void *work = nullptr;
247  if(worksize > 0)
248  {
249      if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
250      {
251          work = nullptr;
252          worksize = 0;
253      }
254  }
255
256  printf("Query recommended workspace size and allocate it\n");
257
258  /* ***************************** */
259  printf("Execute contraction from plan\n");
260
261  int numRuns = 5;
262  for(int i=0; i < numRuns; ++i)
263  {
264     // Create Contraction Plan && look-up cache (if attached)
265     cutensorContractionPlan_t plan;
266     HANDLE_ERROR( cutensorInitContractionPlan(&handle,
267                                               &plan,
268                                               &desc,
269                                               &find,
270                                               worksize) );
271
272     printf("Create plan for contraction\n");
273
274     /* ***************************** */
275
276     cutensorStatus_t err;
277
278     // Execute the tensor contraction
279     err = cutensorContraction(&handle,
280                               &plan,
281                        (void*)&alpha, A_d,
282                                       B_d,
283                        (void*)&beta,  C_d,
284                                       C_d,
285                               work, worksize, 0 /* stream */);
286     cudaDeviceSynchronize();
287
288     // Check for errors
289     if(err != CUTENSOR_STATUS_SUCCESS)
290     {
291         printf("ERROR: %s\n", cutensorGetErrorString(err));
292     }
293  }
294
295
296  /* ***************************** */
297  HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
298  printf("Cache has been successfully written to file (%s).\n", cacheFilename);
299
300  // Detach cache and free-up resources
301  HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
302
303  if ( A ) free( A );
304  if ( B ) free( B );
305  if ( C ) free( C );
306  if ( cachelines ) free(cachelines);
307  if ( A_d ) cudaFree( A_d );
308  if ( B_d ) cudaFree( B_d );
309  if ( C_d ) cudaFree( C_d );
310  if ( work ) cudaFree( work );
311
312  printf("Successful completion\n");
313
314  return 0;
315}

Finally, let us add a second contraction loop, but this time we want the (otherwise identical) contraction to be cached using a different cacheline. This can be useful if the state of the hardware cache differs substantially between the two calls (i.e., affecting the measured runtime of the kernels). To that end, we use the CUTENSOR_CONTRACTION_DESCRIPTOR_TAG attribute:

uint32_t tag = 1;
HANDLE_ERROR( cutensorContractionDescriptorSetAttribute(
     &handle,
     &desc,
     CUTENSOR_CONTRACTION_DESCRIPTOR_TAG,
     &tag,
     sizeof(uint32_t)));

With this change, the example code now looks as follows:

  1#include <stdlib.h>
  2#include <stdio.h>
  3
  4#include <cuda_runtime.h>
  5#include <cutensor.h>
  6
  7#include <unordered_map>
  8#include <vector>
  9
 10// Handle cuTENSOR errors
 11#define HANDLE_ERROR(x) {                                                              \
 12  const auto err = x;                                                                  \
 13  if( err != CUTENSOR_STATUS_SUCCESS )                                                   \
 14  { printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
 15}
 16
 17int main(int argc, char** argv)
 18{
 19  // Host element type definition
 20  typedef float floatTypeA;
 21  typedef float floatTypeB;
 22  typedef float floatTypeC;
 23  typedef float floatTypeCompute;
 24
 25  // CUDA types
 26  cudaDataType_t typeA = CUDA_R_32F;
 27  cudaDataType_t typeB = CUDA_R_32F;
 28  cudaDataType_t typeC = CUDA_R_32F;
 29  cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
 30
 31  floatTypeCompute alpha = (floatTypeCompute)1.1f;
 32  floatTypeCompute beta  = (floatTypeCompute)0.9f;
 33
 34  printf("Include headers and define data types\n");
 35
 36  /* ***************************** */
 37
 38  // Create vector of modes
 39  std::vector<int> modeC{'m','u','n','v'};
 40  std::vector<int> modeA{'m','h','k','n'};
 41  std::vector<int> modeB{'u','k','v','h'};
 42  int nmodeA = modeA.size();
 43  int nmodeB = modeB.size();
 44  int nmodeC = modeC.size();
 45
 46  // Extents
 47  std::unordered_map<int, int64_t> extent;
 48  extent['m'] = 96;
 49  extent['n'] = 96;
 50  extent['u'] = 96;
 51  extent['v'] = 64;
 52  extent['h'] = 64;
 53  extent['k'] = 64;
 54
 55  // Create a vector of extents for each tensor
 56  std::vector<int64_t> extentC;
 57  for(auto mode : modeC)
 58      extentC.push_back(extent[mode]);
 59  std::vector<int64_t> extentA;
 60  for(auto mode : modeA)
 61      extentA.push_back(extent[mode]);
 62  std::vector<int64_t> extentB;
 63  for(auto mode : modeB)
 64      extentB.push_back(extent[mode]);
 65
 66  printf("Define modes and extents\n");
 67
 68  /* ***************************** */
 69
 70  // Number of elements of each tensor
 71  size_t elementsA = 1;
 72  for(auto mode : modeA)
 73      elementsA *= extent[mode];
 74  size_t elementsB = 1;
 75  for(auto mode : modeB)
 76      elementsB *= extent[mode];
 77  size_t elementsC = 1;
 78  for(auto mode : modeC)
 79      elementsC *= extent[mode];
 80
 81  // Size in bytes
 82  size_t sizeA = sizeof(floatTypeA) * elementsA;
 83  size_t sizeB = sizeof(floatTypeB) * elementsB;
 84  size_t sizeC = sizeof(floatTypeC) * elementsC;
 85
 86  // Allocate on device
 87  void *A_d, *B_d, *C_d;
 88  cudaMalloc((void**)&A_d, sizeA);
 89  cudaMalloc((void**)&B_d, sizeB);
 90  cudaMalloc((void**)&C_d, sizeC);
 91
 92  // Allocate on host
 93  floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
 94  floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
 95  floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
 96
 97  // Initialize data on host
 98  for(int64_t i = 0; i < elementsA; i++)
 99      A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
100  for(int64_t i = 0; i < elementsB; i++)
101      B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
102  for(int64_t i = 0; i < elementsC; i++)
103      C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
104
105  // Copy to device
106  cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
107  cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
108  cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
109
110  printf("Allocate, initialize and transfer tensors\n");
111
112  /* ***************************** */
113
114  // Initialize cuTENSOR library
115  cutensorHandle_t handle;
116  cutensorInit(&handle);
117
118  /**********************
119   * Setup plan cache
120   **********************/
121  printf("Attach cachelines\n");
122
123  constexpr int32_t numCachelines = 1024;
124  const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
125  printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
126  cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
127  HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
128
129
130  const char cacheFilename[] = "./cache.bin";
131  uint32_t numCachelinesRead = 0;
132  cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
133  if (status == CUTENSOR_STATUS_SUCCESS)
134  {
135      printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
136  }
137  else if (status == CUTENSOR_STATUS_IO_ERROR)
138  {
139      printf("File (%s) doesn't seem to exist.\n", cacheFilename);
140  }
141  else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
142  {
143      printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
144  }
145
146  // Create Tensor Descriptors
147  cutensorTensorDescriptor_t descA;
148  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
149              &descA,
150              nmodeA,
151              extentA.data(),
152              NULL,/*stride*/
153              typeA, CUTENSOR_OP_IDENTITY ) );
154
155  cutensorTensorDescriptor_t descB;
156  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
157              &descB,
158              nmodeB,
159              extentB.data(),
160              NULL,/*stride*/
161              typeB, CUTENSOR_OP_IDENTITY ) );
162
163  cutensorTensorDescriptor_t descC;
164  HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
165              &descC,
166              nmodeC,
167              extentC.data(),
168              NULL,/*stride*/
169              typeC, CUTENSOR_OP_IDENTITY ) );
170
171  printf("Initialize cuTENSOR and tensor descriptors\n");
172
173  /* ***************************** */
174
175  //Retrieve the memory alignment for each tensor
176  uint32_t alignmentRequirementA;
177  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
178             A_d,
179             &descA,
180             &alignmentRequirementA) );
181
182  uint32_t alignmentRequirementB;
183  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
184             B_d,
185             &descB,
186             &alignmentRequirementB) );
187
188  uint32_t alignmentRequirementC;
189  HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
190             C_d,
191             &descC,
192             &alignmentRequirementC) );
193
194  printf("Query best alignment requirement for our pointers\n");
195
196  /* ***************************** */
197
198  // Create the Contraction Descriptor
199  cutensorContractionDescriptor_t desc;
200  HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
201              &desc,
202              &descA, modeA.data(), alignmentRequirementA,
203              &descB, modeB.data(), alignmentRequirementB,
204              &descC, modeC.data(), alignmentRequirementC,
205              &descC, modeC.data(), alignmentRequirementC,
206              typeCompute) );
207
208  printf("Initialize contraction descriptor\n");
209
210  /* ***************************** */
211
212  // Set the algorithm to use
213  cutensorContractionFind_t find;
214  HANDLE_ERROR( cutensorInitContractionFind(
215              &handle, &find,
216              CUTENSOR_ALGO_DEFAULT) );
217
218  const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
219  HANDLE_ERROR(cutensorContractionFindSetAttribute(
220      &handle,
221      &find,
222      CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
223      &autotuneMode ,
224      sizeof(cutensorAutotuneMode_t)));
225
226  const uint32_t incCount = 4;
227  HANDLE_ERROR(cutensorContractionFindSetAttribute(
228      &handle,
229      &find,
230      CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
231      &incCount,
232      sizeof(uint32_t)));
233
234  printf("Initialize settings to find algorithm\n");
235
236  /* ***************************** */
237
238  // Query workspace
239  size_t worksize = 0;
240  HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
241              &desc,
242              &find,
243              CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
244
245  // Allocate workspace
246  void *work = nullptr;
247  if(worksize > 0)
248  {
249      if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
250      {
251          work = nullptr;
252          worksize = 0;
253      }
254  }
255
256  printf("Query recommended workspace size and allocate it\n");
257
258  /* ***************************** */
259  printf("Execute contraction from plan\n");
260
261  int numRuns = 5;
262  for(int i=0; i < numRuns; ++i)
263  {
264     // Create Contraction Plan && look-up cache (if attached)
265     cutensorContractionPlan_t plan;
266     HANDLE_ERROR( cutensorInitContractionPlan(&handle,
267                                               &plan,
268                                               &desc,
269                                               &find,
270                                               worksize) );
271
272     printf("Create plan for contraction\n");
273
274     /* ***************************** */
275
276     cutensorStatus_t err;
277
278     // Execute the tensor contraction
279     err = cutensorContraction(&handle,
280                               &plan,
281                        (void*)&alpha, A_d,
282                                       B_d,
283                        (void*)&beta,  C_d,
284                                       C_d,
285                               work, worksize, 0 /* stream */);
286     cudaDeviceSynchronize();
287
288     // Check for errors
289     if(err != CUTENSOR_STATUS_SUCCESS)
290     {
291         printf("ERROR: %s\n", cutensorGetErrorString(err));
292     }
293  }
294
295  uint32_t tag = 1;
296  HANDLE_ERROR( cutensorContractionDescriptorSetAttribute(
297      &handle,
298      &desc,
299      CUTENSOR_CONTRACTION_DESCRIPTOR_TAG,
300      &tag,
301      sizeof(uint32_t)));
302
303  for(int i=0; i < numRuns; ++i)
304  {
305     // Create Contraction Plan && look-up cache (if attached)
306     cutensorContractionPlan_t plan;
307     HANDLE_ERROR( cutensorInitContractionPlan(&handle,
308                                               &plan,
309                                               &desc,
310                                               &find,
311                                               worksize) );
312
313     printf("Create plan for contraction\n");
314
315     /* ***************************** */
316
317     cutensorStatus_t err;
318
319     // Execute the tensor contraction
320     err = cutensorContraction(&handle,
321                               &plan,
322                        (void*)&alpha, A_d,
323                                       B_d,
324                        (void*)&beta,  C_d,
325                                       C_d,
326                               work, worksize, 0 /* stream */);
327     cudaDeviceSynchronize();
328
329     // Check for errors
330     if(err != CUTENSOR_STATUS_SUCCESS)
331     {
332         printf("ERROR: %s\n", cutensorGetErrorString(err));
333     }
334  }
335
336
337  /* ***************************** */
338  HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
339  printf("Cache has been successfully written to file (%s).\n", cacheFilename);
340
341  // Detach cache and free-up resources
342  HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
343
344  if ( A ) free( A );
345  if ( B ) free( B );
346  if ( C ) free( C );
347  if ( cachelines ) free(cachelines);
348  if ( A_d ) cudaFree( A_d );
349  if ( B_d ) cudaFree( B_d );
350  if ( C_d ) cudaFree( C_d );
351  if ( work ) cudaFree( work );
352
353  printf("Successful completion\n");
354
355  return 0;
356}

You can confirm that the cache now holds two entries by invoking the binary once again; this time it should report that “2 cachelines have been successfully read from file (./cache.bin)”.

This concludes our example of the plan cache; you can find these examples (including timings and warm-up runs) in the samples repository.

If you have any further questions or suggestions, please do not hesitate to reach out.