Plan Cache

This section introduces the software-managed plan cache, which has the following key features:

  • Minimized launch-related overhead (e.g., due to kernel selection).

  • Overhead-free autotuning (a.k.a. incremental autotuning).

    • This feature enables users to automatically find the best implementation for a given problem, thereby increasing performance.

  • Thread safety: the cache is shared across all threads that use the same cutensorHandle_t.

  • Serialization and deserialization:

    • This allows users to store the state of the cache to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (i.e., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).

The remainder of this section assumes familiarity with Getting Started.

Note

The cache is activated by default and can be deactivated via the CUTENSOR_DISABLE_PLAN_CACHE environment variable (see Environment Variables).

Incremental Autotuning

The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given operation.

When the cache is used with incremental autotuning (CUTENSOR_AUTOTUNE_MODE_INCREMENTAL), successive invocations of the same operation (albeit with potentially different data pointers) are performed by different candidates; the candidates are timed automatically, and the fastest one is stored in the plan cache. The number of candidates to explore is configurable by the user (via CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT); all subsequent calls for the same problem are then mapped to the fastest measured candidate stored in the cache.
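To make the invocation pattern concrete, the following schematic sketch (not part of the original example) mirrors the loop used in the full examples later in this section; it assumes a plan created with incremental autotuning enabled as well as the handle, scalars, data pointers, and workspace variables set up there. The first incCount calls each execute and time a different candidate, and every call after that reuses the fastest measured candidate from the plan cache.

// Schematic sketch only: 'plan' was created with CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
// handle, alpha, beta, A_d, B_d, C_d, work, and actualWorkspaceSize are assumed
// to be set up as in the full examples below.
for (uint32_t i = 0; i < incCount + 1; ++i)
{
    // Iterations 0 .. incCount-1 each run a different candidate (timed internally);
    // iteration incCount (and any later call) is served by the fastest cached candidate.
    HANDLE_ERROR(cutensorContract(handle,
                                  plan,
                                  (void*) &alpha, A_d, B_d,
                                  (void*) &beta,  C_d, C_d,
                                  work, actualWorkspaceSize, 0 /* stream */));
}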

This autotuning approach has some key advantages:

  • Candidates are evaluated at a point in time when the hardware caches are in a production-environment state (i.e., the hardware cache state reflects the real-world situation).

  • Overhead is minimized (i.e., no timing loop, no synchronization).

    • Moreover, the candidates are evaluated in the order given by our performance model (from fastest to slowest).

Incremental autotuning is especially powerful when combined with cuTENSOR’s cache serialization feature (via cutensorHandleWritePlanCacheToFile() and cutensorHandleReadPlanCacheFromFile()), which allows the tuned cache to be written to disk and reused.

Note

We recommend warming up the GPU (i.e., reaching steady-state performance) before auto-tuning to minimize fluctuations in measured performance.
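One simple way to do this, sketched below, is to issue some throwaway device work before the first autotuned contraction; the snippet assumes the C_d buffer and sizeC from the examples later in this section, but any representative GPU workload will do. Note that if the warm-up runs the very plan that is being autotuned, those calls already count toward the incremental candidates.

// Hedged warm-up sketch: clear the output buffer a few times so that GPU clocks
// reach a steady state before the autotuned (timed) contractions start.
// C_d and sizeC are assumed to be allocated as in the examples below.
for (int i = 0; i < 10; ++i)
{
    HANDLE_CUDA_ERROR(cudaMemsetAsync(C_d, 0, sizeC, 0 /* stream */));
}
HANDLE_CUDA_ERROR(cudaDeviceSynchronize());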

Introductory Example

This subsection provides a basic overview of the cache-related API calls and features; it builds on the steps outlined in Getting Started.

Let’s first look at the same example outlined in Getting Started. Since cuTENSOR 2.x has the cache enabled by default, that example already takes advantage of caching. While optional, the following code demonstrates how to resize the cache from its implementation-defined initial size.

// Set cache size
constexpr uint32_t numEntries = 128;
HANDLE_ERROR( cutensorHandleResizePlanCache(handle, numEntries) );

// ...

Note that the number of entries is configurable by the user; ideally, the cache should be large enough to hold an entry for each of the application’s distinct contraction calls. Since this might not always be possible (due to memory constraints), cuTENSOR’s plan cache evicts cache entries using a least-recently-used (LRU) policy. Users can also choose to disable caching on a contraction-by-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note that the cache lookup occurs when the plan is created. As such, caching is particularly helpful if the same contraction is planned multiple times in the same application.
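As a minimal sketch of this behavior (the names planFirst and planSecond are ours; handle, desc, planPref, and workspaceSizeEstimate are assumed to be set up as in the full examples below), a second plan created for the same operation descriptor and plan preference is served from the plan cache and is therefore cheap to create:

cutensorPlan_t planFirst, planSecond;
// The first creation performs the (potentially expensive) kernel selection and
// populates the plan cache.
HANDLE_ERROR(cutensorCreatePlan(handle, &planFirst, desc, planPref,
                                workspaceSizeEstimate));
// The second creation for the same problem is served from the plan cache.
HANDLE_ERROR(cutensorCreatePlan(handle, &planSecond, desc, planPref,
                                workspaceSizeEstimate));
HANDLE_ERROR(cutensorDestroyPlan(planFirst));
HANDLE_ERROR(cutensorDestroyPlan(planSecond));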

To disable caching for a specific contraction (i.e., to opt out), the cache mode of the corresponding cutensorPlanPreference_t needs to be set as follows:

const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
     handle,
     planPref,
     CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
     &cacheMode,
     sizeof(cutensorCacheMode_t)));

This concludes the introductory example.

Advanced Example

This example augments the previous one and explains how to:

  • Take advantage of incremental auto-tuning

    • It is recommended to warm up the GPU (i.e., reach steady-state performance) before using auto-tuning to avoid large fluctuations in measured performance

  • Use tags to distinguish two otherwise identical tensor contractions from each other

    • This is useful if the hardware cache of the GPU is (likely) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call) and the state of the cache is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions)

  • Write the plan cache state to a file and read it back in

Let us start by enabling incremental autotuning. To do so, we modify cutensorPlanPreference_t as follows:

const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
    &autotuneMode,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));

The first call to cutensorPlanPreferenceSetAttribute() enables incremental auto-tuning, while the second call sets CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT, i.e., the number of candidates that are explored via incremental autotuning before subsequent calls are served from the plan cache. Higher values of incCount explore more candidates and therefore incur a larger one-time overhead, but they can also result in better performance if that overhead can be amortized (e.g., when the tuned cache is written to disk). We find that an incremental count of four is a good default value.

The following code incorporates those changes:

  1#include <stdlib.h>
  2#include <stdio.h>
  3
  4#include <unordered_map>
  5#include <vector>
  6#include <cassert>
  7
  8#include <cuda_runtime.h>
  9#include <cutensor.h>
 10
 11#define HANDLE_ERROR(x)                                               \
 12{ const auto err = x;                                                 \
 13  if( err != CUTENSOR_STATUS_SUCCESS )                                \
 14  { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
 15};
 16
 17#define HANDLE_CUDA_ERROR(x)                                      \
 18{ const auto err = x;                                             \
 19  if( err != cudaSuccess )                                        \
 20  { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
 21};
 22
 23struct GPUTimer
 24{
 25    GPUTimer()
 26    {
 27        cudaEventCreate(&start_);
 28        cudaEventCreate(&stop_);
 29        cudaEventRecord(start_, 0);
 30    }
 31
 32    ~GPUTimer()
 33    {
 34        cudaEventDestroy(start_);
 35        cudaEventDestroy(stop_);
 36    }
 37
 38    void start()
 39    {
 40        cudaEventRecord(start_, 0);
 41    }
 42
 43    float seconds()
 44    {
 45        cudaEventRecord(stop_, 0);
 46        cudaEventSynchronize(stop_);
 47        float time;
 48        cudaEventElapsedTime(&time, start_, stop_);
 49        return time * 1e-3;
 50    }
 51    private:
 52    cudaEvent_t start_, stop_;
 53};
 54
 55int main()
 56{
 57    typedef float floatTypeA;
 58    typedef float floatTypeB;
 59    typedef float floatTypeC;
 60    typedef float floatTypeCompute;
 61
 62    cutensorDataType_t typeA = CUTENSOR_R_32F;
 63    cutensorDataType_t typeB = CUTENSOR_R_32F;
 64    cutensorDataType_t typeC = CUTENSOR_R_32F;
 65    const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
 66
 67    floatTypeCompute alpha = (floatTypeCompute)1.1f;
 68    floatTypeCompute beta  = (floatTypeCompute)0.f;
 69
 70    /**********************
 71     * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
 72     **********************/
 73
 74    std::vector<int> modeC{'m','u','n','v'};
 75    std::vector<int> modeA{'m','h','k','n'};
 76    std::vector<int> modeB{'u','k','v','h'};
 77    int nmodeA = modeA.size();
 78    int nmodeB = modeB.size();
 79    int nmodeC = modeC.size();
 80
 81    std::unordered_map<int, int64_t> extent;
 82    extent['m'] = 96;
 83    extent['n'] = 96;
 84    extent['u'] = 96;
 85    extent['v'] = 64;
 86    extent['h'] = 64;
 87    extent['k'] = 64;
 88
 89    double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
 90
 91    std::vector<int64_t> extentC;
 92    for (auto mode : modeC)
 93        extentC.push_back(extent[mode]);
 94    std::vector<int64_t> extentA;
 95    for (auto mode : modeA)
 96        extentA.push_back(extent[mode]);
 97    std::vector<int64_t> extentB;
 98    for (auto mode : modeB)
 99        extentB.push_back(extent[mode]);
100
101    /**********************
102     * Allocating data
103     **********************/
104
105    size_t elementsA = 1;
106    for (auto mode : modeA)
107        elementsA *= extent[mode];
108    size_t elementsB = 1;
109    for (auto mode : modeB)
110        elementsB *= extent[mode];
111    size_t elementsC = 1;
112    for (auto mode : modeC)
113        elementsC *= extent[mode];
114
115    size_t sizeA = sizeof(floatTypeA) * elementsA;
116    size_t sizeB = sizeof(floatTypeB) * elementsB;
117    size_t sizeC = sizeof(floatTypeC) * elementsC;
118    printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120    void *A_d, *B_d, *C_d;
121    HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122    HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123    HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125    const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126    assert(uintptr_t(A_d) % kAlignment == 0);
127    assert(uintptr_t(B_d) % kAlignment == 0);
128    assert(uintptr_t(C_d) % kAlignment == 0);
129
130    floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131    floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132    floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134    if (A == NULL || B == NULL || C == NULL)
135    {
 136        printf("Error: Host allocation of A, B, or C.\n");
137        return -1;
138    }
139
140    /*******************
141     * Initialize data
142     *******************/
143
144    for (int64_t i = 0; i < elementsA; i++)
145        A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146    for (int64_t i = 0; i < elementsB; i++)
147        B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148    for (int64_t i = 0; i < elementsC; i++)
149        C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151    HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152    HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153    HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155    /*************************
156     * cuTENSOR
157     *************************/
158
159    cutensorHandle_t handle;
160    HANDLE_ERROR(cutensorCreate(&handle));
161
162    /**********************
 163     * Optional: Resize the cache in case you expect the default size to be insufficient for your use case
164     **********************/
165    uint32_t numEntries = 128;
166    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
167
168    /**********************
169     * Create Tensor Descriptors
170     **********************/
171    cutensorTensorDescriptor_t descA;
172    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
173                 &descA,
174                 nmodeA,
175                 extentA.data(),
176                 NULL,/*stride*/
177                 typeA, kAlignment));
178
179    cutensorTensorDescriptor_t descB;
180    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
181                 &descB,
182                 nmodeB,
183                 extentB.data(),
184                 NULL,/*stride*/
185                 typeB, kAlignment));
186
187    cutensorTensorDescriptor_t descC;
188    HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
189                 &descC,
190                 nmodeC,
191                 extentC.data(),
192                 NULL,/*stride*/
193                 typeC, kAlignment));
194
195    /*******************************
196     * Create Contraction Descriptor
197     *******************************/
198
199    cutensorOperationDescriptor_t desc;
200    HANDLE_ERROR(cutensorCreateContraction(handle,
201                 &desc,
202                 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
203                 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
204                 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
205                 descC, modeC.data(),
206                 descCompute));
207
208    /**************************
209     * PlanPreference: Set the algorithm to use and enable incremental autotuning
210     ***************************/
211
212    const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
213
214    cutensorPlanPreference_t planPref;
215    HANDLE_ERROR(cutensorCreatePlanPreference(
216                               handle,
217                               &planPref,
218                               algo,
219                               CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
220
221    const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
222    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
223        handle,
224        planPref,
225        CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
226        &cacheMode,
227        sizeof(cutensorCacheMode_t)));
228
229    const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
230    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
231        handle,
232        planPref,
233        CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
234        &autotuneMode ,
235        sizeof(cutensorAutotuneMode_t)));
236
237    const uint32_t incCount = 4;
238    HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
239        handle,
240        planPref,
241        CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
242        &incCount,
243        sizeof(uint32_t)));
244
245    /**********************
246     * Query workspace estimate
247     **********************/
248
249    uint64_t workspaceSizeEstimate = 0;
250    const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
251    HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
252                                          desc,
253                                          planPref,
254                                          workspacePref,
255                                          &workspaceSizeEstimate));
256
257    /**************************
258     * Create Contraction Plan
259     **************************/
260
261    cutensorPlan_t plan;
262    HANDLE_ERROR(cutensorCreatePlan(handle,
263                 &plan,
264                 desc,
265                 planPref,
266                 workspaceSizeEstimate));
267
268    /**************************
269     * Optional: Query information about the created plan
270     **************************/
271
272    // query actually used workspace
273    uint64_t actualWorkspaceSize = 0;
274    HANDLE_ERROR(cutensorPlanGetAttribute(handle,
275        plan,
276        CUTENSOR_PLAN_REQUIRED_WORKSPACE,
277        &actualWorkspaceSize,
278        sizeof(actualWorkspaceSize)));
279
280    // At this point the user knows exactly how much memory is needed by the operation and
281    // only the smaller actual workspace needs to be allocated
282    assert(actualWorkspaceSize <= workspaceSizeEstimate);
283
284    void *work = nullptr;
285    if (actualWorkspaceSize > 0)
286    {
287        HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
288        assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
289    }
290
291    /**********************
292     * Run
293     **********************/
294
295    double minTimeCUTENSOR = 1e100;
296    for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
297    {
298        cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
299        cudaDeviceSynchronize();
300
301        // Set up timing
302        GPUTimer timer;
303        timer.start();
304
305        // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
306        HANDLE_ERROR(cutensorContract(handle,
307                                  plan,
308                                  (void*) &alpha, A_d, B_d,
309                                  (void*) &beta,  C_d, C_d,
310                                  work, actualWorkspaceSize, 0 /* stream */));
311
312        // Synchronize and measure timing
313        auto time = timer.seconds();
314
315        minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
316    }
317
318    /*************************/
319
320    double transferedBytes = sizeC + sizeA + sizeB;
321    transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
322    transferedBytes /= 1e9;
323    printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
324
325    HANDLE_ERROR(cutensorDestroy(handle));
326    HANDLE_ERROR(cutensorDestroyPlan(plan));
327    HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
328    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
329    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
330    HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
331
332    if (A) free(A);
333    if (B) free(B);
334    if (C) free(C);
335    if (A_d) cudaFree(A_d);
336    if (B_d) cudaFree(B_d);
337    if (C_d) cudaFree(C_d);
338    if (work) cudaFree(work);
339
340    return 0;
341}

Let us further augment the above example by writing the plan cache to a file and reading it in (provided it was previously written):

const char planCacheFilename[] = "./planCache.bin";
uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
        planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
    printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
}
else
{
    printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
            numCachelines);
}

// ...

status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
    printf("File (%s) couldn't be written to.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
    printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
            cutensorGetErrorString(status));
}
else
{
    printf("Plan cache successfully stored to %s.\n", planCacheFilename);
}

Warning

cutensorHandleReadPlanCacheFromFile() only succeeds if the plan cache is large enough to hold all cachelines stored in the file; otherwise, CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is stored in the numCachelinesRead output argument (numCachelines in the snippet above).
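If you want the read to succeed regardless, one possible pattern (a sketch only, reusing planCacheFilename and numCachelines from the snippet above) is to resize the plan cache to the reported number of cachelines and retry:

cutensorStatus_t readStatus = cutensorHandleReadPlanCacheFromFile(handle,
        planCacheFilename, &numCachelines);
if (readStatus == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // numCachelines now holds the number of cachelines required to read the file
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numCachelines));
    HANDLE_ERROR(cutensorHandleReadPlanCacheFromFile(handle,
            planCacheFilename, &numCachelines));
}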

With these changes the example now looks as follows:

  1 #include <stdlib.h>
  2 #include <stdio.h>
  3
  4 #include <unordered_map>
  5 #include <vector>
  6 #include <cassert>
  7
  8 #include <cuda_runtime.h>
  9 #include <cutensor.h>
 10
 11 #define HANDLE_ERROR(x)                                               \
 12 { const auto err = x;                                                 \
 13   if( err != CUTENSOR_STATUS_SUCCESS )                                \
 14   { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
 15 };
 16
 17 #define HANDLE_CUDA_ERROR(x)                                      \
 18 { const auto err = x;                                             \
 19   if( err != cudaSuccess )                                        \
 20   { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
 21 };
 22
 23 struct GPUTimer
 24 {
 25     GPUTimer()
 26     {
 27         cudaEventCreate(&start_);
 28         cudaEventCreate(&stop_);
 29         cudaEventRecord(start_, 0);
 30     }
 31
 32     ~GPUTimer()
 33     {
 34         cudaEventDestroy(start_);
 35         cudaEventDestroy(stop_);
 36     }
 37
 38     void start()
 39     {
 40         cudaEventRecord(start_, 0);
 41     }
 42
 43     float seconds()
 44     {
 45         cudaEventRecord(stop_, 0);
 46         cudaEventSynchronize(stop_);
 47         float time;
 48         cudaEventElapsedTime(&time, start_, stop_);
 49         return time * 1e-3;
 50     }
 51     private:
 52     cudaEvent_t start_, stop_;
 53 };
 54
 55 int main()
 56 {
 57     typedef float floatTypeA;
 58     typedef float floatTypeB;
 59     typedef float floatTypeC;
 60     typedef float floatTypeCompute;
 61
 62     cutensorDataType_t typeA = CUTENSOR_R_32F;
 63     cutensorDataType_t typeB = CUTENSOR_R_32F;
 64     cutensorDataType_t typeC = CUTENSOR_R_32F;
 65     const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
 66
 67     floatTypeCompute alpha = (floatTypeCompute)1.1f;
 68     floatTypeCompute beta  = (floatTypeCompute)0.f;
 69
 70     /**********************
 71      * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
 72      **********************/
 73
 74     std::vector<int> modeC{'m','u','n','v'};
 75     std::vector<int> modeA{'m','h','k','n'};
 76     std::vector<int> modeB{'u','k','v','h'};
 77     int nmodeA = modeA.size();
 78     int nmodeB = modeB.size();
 79     int nmodeC = modeC.size();
 80
 81     std::unordered_map<int, int64_t> extent;
 82     extent['m'] = 96;
 83     extent['n'] = 96;
 84     extent['u'] = 96;
 85     extent['v'] = 64;
 86     extent['h'] = 64;
 87     extent['k'] = 64;
 88
 89     double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
 90
 91     std::vector<int64_t> extentC;
 92     for (auto mode : modeC)
 93         extentC.push_back(extent[mode]);
 94     std::vector<int64_t> extentA;
 95     for (auto mode : modeA)
 96         extentA.push_back(extent[mode]);
 97     std::vector<int64_t> extentB;
 98     for (auto mode : modeB)
 99         extentB.push_back(extent[mode]);
100
101     /**********************
102      * Allocating data
103      **********************/
104
105     size_t elementsA = 1;
106     for (auto mode : modeA)
107         elementsA *= extent[mode];
108     size_t elementsB = 1;
109     for (auto mode : modeB)
110         elementsB *= extent[mode];
111     size_t elementsC = 1;
112     for (auto mode : modeC)
113         elementsC *= extent[mode];
114
115     size_t sizeA = sizeof(floatTypeA) * elementsA;
116     size_t sizeB = sizeof(floatTypeB) * elementsB;
117     size_t sizeC = sizeof(floatTypeC) * elementsC;
118     printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120     void *A_d, *B_d, *C_d;
121     HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122     HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123     HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125     const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126     assert(uintptr_t(A_d) % kAlignment == 0);
127     assert(uintptr_t(B_d) % kAlignment == 0);
128     assert(uintptr_t(C_d) % kAlignment == 0);
129
130     floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131     floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132     floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134     if (A == NULL || B == NULL || C == NULL)
135     {
136         printf("Error: Host allocation of A, B, or C.\n");
137         return -1;
138     }
139
140     /*******************
141      * Initialize data
142      *******************/
143
144     for (int64_t i = 0; i < elementsA; i++)
145         A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146     for (int64_t i = 0; i < elementsB; i++)
147         B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148     for (int64_t i = 0; i < elementsC; i++)
149         C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151     HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152     HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153     HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155     /*************************
156      * cuTENSOR
157      *************************/
158
159     cutensorHandle_t handle;
160     HANDLE_ERROR(cutensorCreate(&handle));
161
162     /**********************
163      * Load plan cache
164      **********************/
165
166     // holds information about the per-handle plan cache
167     const char planCacheFilename[] = "./planCache.bin";
168     uint32_t numCachelines = 0;
169     cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170             planCacheFilename, &numCachelines);
171     if (status == CUTENSOR_STATUS_IO_ERROR)
172     {
173         printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174     }
175     else if (status != CUTENSOR_STATUS_SUCCESS)
176     {
177         printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178     }
179     else
180     {
181         printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182                 numCachelines);
183     }
184
185     /**********************
186      * Optional: Resize the cache in case you expect the default size to be insufficient for your use case
187      **********************/
188     uint32_t numEntries = 128;
189     HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191     /**********************
192      * Create Tensor Descriptors
193      **********************/
194     cutensorTensorDescriptor_t descA;
195     HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196                  &descA,
197                  nmodeA,
198                  extentA.data(),
199                  NULL,/*stride*/
200                  typeA, kAlignment));
201
202     cutensorTensorDescriptor_t descB;
203     HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204                  &descB,
205                  nmodeB,
206                  extentB.data(),
207                  NULL,/*stride*/
208                  typeB, kAlignment));
209
210     cutensorTensorDescriptor_t descC;
211     HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212                  &descC,
213                  nmodeC,
214                  extentC.data(),
215                  NULL,/*stride*/
216                  typeC, kAlignment));
217
218     /*******************************
219      * Create Contraction Descriptor
220      *******************************/
221
222     cutensorOperationDescriptor_t desc;
223     HANDLE_ERROR(cutensorCreateContraction(handle,
224                  &desc,
225                  descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226                  descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227                  descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228                  descC, modeC.data(),
229                  descCompute));
230
231     /**************************
232      * PlanPreference: Set the algorithm to use and enable incremental autotuning
233      ***************************/
234
235     const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237     cutensorPlanPreference_t planPref;
238     HANDLE_ERROR(cutensorCreatePlanPreference(
239                                handle,
240                                &planPref,
241                                algo,
242                                CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244     const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245     HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246         handle,
247         planPref,
248         CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249         &cacheMode,
250         sizeof(cutensorCacheMode_t)));
251
252     const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253     HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254         handle,
255         planPref,
256         CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257         &autotuneMode ,
258         sizeof(cutensorAutotuneMode_t)));
259
260     const uint32_t incCount = 4;
261     HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262         handle,
263         planPref,
264         CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265         &incCount,
266         sizeof(uint32_t)));
267
268     /**********************
269      * Query workspace estimate
270      **********************/
271
272     uint64_t workspaceSizeEstimate = 0;
273     const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274     HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275                                           desc,
276                                           planPref,
277                                           workspacePref,
278                                           &workspaceSizeEstimate));
279
280     /**************************
281      * Create Contraction Plan
282      **************************/
283
284     cutensorPlan_t plan;
285     HANDLE_ERROR(cutensorCreatePlan(handle,
286                  &plan,
287                  desc,
288                  planPref,
289                  workspaceSizeEstimate));
290
291     /**************************
292      * Optional: Query information about the created plan
293      **************************/
294
295     // query actually used workspace
296     uint64_t actualWorkspaceSize = 0;
297     HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298         plan,
299         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300         &actualWorkspaceSize,
301         sizeof(actualWorkspaceSize)));
302
303     // At this point the user knows exactly how much memory is needed by the operation and
304     // only the smaller actual workspace needs to be allocated
305     assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307     void *work = nullptr;
308     if (actualWorkspaceSize > 0)
309     {
310         HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311         assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312     }
313
314     /**********************
315      * Run
316      **********************/
317
318     double minTimeCUTENSOR = 1e100;
319     for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320     {
321         cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322         cudaDeviceSynchronize();
323
324         // Set up timing
325         GPUTimer timer;
326         timer.start();
327
328         // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329         HANDLE_ERROR(cutensorContract(handle,
330                                   plan,
331                                   (void*) &alpha, A_d, B_d,
332                                   (void*) &beta,  C_d, C_d,
333                                   work, actualWorkspaceSize, 0 /* stream */));
334
335         // Synchronize and measure timing
336         auto time = timer.seconds();
337
338         minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339     }
340
341     /*************************/
342
343     double transferedBytes = sizeC + sizeA + sizeB;
344     transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345     transferedBytes /= 1e9;
346     printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348     status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
349     if (status == CUTENSOR_STATUS_IO_ERROR)
350     {
351         printf("File (%s) couldn't be written to.\n", planCacheFilename);
352     }
353     else if (status != CUTENSOR_STATUS_SUCCESS)
354     {
355         printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
356                 cutensorGetErrorString(status));
357     }
358     else
359     {
360         printf("Plan cache successfully stored to %s.\n", planCacheFilename);
361     }
362
363
364     HANDLE_ERROR(cutensorDestroy(handle));
365     HANDLE_ERROR(cutensorDestroyPlan(plan));
366     HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
367     HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
368     HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
369     HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
370
371     if (A) free(A);
372     if (B) free(B);
373     if (C) free(C);
374     if (A_d) cudaFree(A_d);
375     if (B_d) cudaFree(B_d);
376     if (C_d) cudaFree(C_d);
377     if (work) cudaFree(work);
378
379     return 0;
380 }

Finally, let us add a second contraction loop, but this time we want the otherwise identical contraction to be cached in a different cacheline. This can be useful if the state of the hardware cache differs substantially between the two calls (i.e., in a way that affects the measured runtime of the kernels). To that end, we use the CUTENSOR_OPERATION_DESCRIPTOR_TAG attribute:

uint32_t tag = 1;
HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
     handle,
     desc,
     CUTENSOR_OPERATION_DESCRIPTOR_TAG,
     &tag,
     sizeof(uint32_t)));

With these changes, the example code now looks as follows:

  1 #include <stdlib.h>
  2 #include <stdio.h>
  3
  4 #include <unordered_map>
  5 #include <vector>
  6 #include <cassert>
  7
  8 #include <cuda_runtime.h>
  9 #include <cutensor.h>
 10
 11 #define HANDLE_ERROR(x)                                               \
 12 { const auto err = x;                                                 \
 13   if( err != CUTENSOR_STATUS_SUCCESS )                                \
 14   { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
 15 };
 16
 17 #define HANDLE_CUDA_ERROR(x)                                      \
 18 { const auto err = x;                                             \
 19   if( err != cudaSuccess )                                        \
 20   { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
 21 };
 22
 23 struct GPUTimer
 24 {
 25     GPUTimer()
 26     {
 27         cudaEventCreate(&start_);
 28         cudaEventCreate(&stop_);
 29         cudaEventRecord(start_, 0);
 30     }
 31
 32     ~GPUTimer()
 33     {
 34         cudaEventDestroy(start_);
 35         cudaEventDestroy(stop_);
 36     }
 37
 38     void start()
 39     {
 40         cudaEventRecord(start_, 0);
 41     }
 42
 43     float seconds()
 44     {
 45         cudaEventRecord(stop_, 0);
 46         cudaEventSynchronize(stop_);
 47         float time;
 48         cudaEventElapsedTime(&time, start_, stop_);
 49         return time * 1e-3;
 50     }
 51     private:
 52     cudaEvent_t start_, stop_;
 53 };
 54
 55 int main()
 56 {
 57     typedef float floatTypeA;
 58     typedef float floatTypeB;
 59     typedef float floatTypeC;
 60     typedef float floatTypeCompute;
 61
 62     cutensorDataType_t typeA = CUTENSOR_R_32F;
 63     cutensorDataType_t typeB = CUTENSOR_R_32F;
 64     cutensorDataType_t typeC = CUTENSOR_R_32F;
 65     const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
 66
 67     floatTypeCompute alpha = (floatTypeCompute)1.1f;
 68     floatTypeCompute beta  = (floatTypeCompute)0.f;
 69
 70     /**********************
 71      * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
 72      **********************/
 73
 74     std::vector<int> modeC{'m','u','n','v'};
 75     std::vector<int> modeA{'m','h','k','n'};
 76     std::vector<int> modeB{'u','k','v','h'};
 77     int nmodeA = modeA.size();
 78     int nmodeB = modeB.size();
 79     int nmodeC = modeC.size();
 80
 81     std::unordered_map<int, int64_t> extent;
 82     extent['m'] = 96;
 83     extent['n'] = 96;
 84     extent['u'] = 96;
 85     extent['v'] = 64;
 86     extent['h'] = 64;
 87     extent['k'] = 64;
 88
 89     double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
 90
 91     std::vector<int64_t> extentC;
 92     for (auto mode : modeC)
 93         extentC.push_back(extent[mode]);
 94     std::vector<int64_t> extentA;
 95     for (auto mode : modeA)
 96         extentA.push_back(extent[mode]);
 97     std::vector<int64_t> extentB;
 98     for (auto mode : modeB)
 99         extentB.push_back(extent[mode]);
100
101     /**********************
102      * Allocating data
103      **********************/
104
105     size_t elementsA = 1;
106     for (auto mode : modeA)
107         elementsA *= extent[mode];
108     size_t elementsB = 1;
109     for (auto mode : modeB)
110         elementsB *= extent[mode];
111     size_t elementsC = 1;
112     for (auto mode : modeC)
113         elementsC *= extent[mode];
114
115     size_t sizeA = sizeof(floatTypeA) * elementsA;
116     size_t sizeB = sizeof(floatTypeB) * elementsB;
117     size_t sizeC = sizeof(floatTypeC) * elementsC;
118     printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120     void *A_d, *B_d, *C_d;
121     HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122     HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123     HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125     const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126     assert(uintptr_t(A_d) % kAlignment == 0);
127     assert(uintptr_t(B_d) % kAlignment == 0);
128     assert(uintptr_t(C_d) % kAlignment == 0);
129
130     floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131     floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132     floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134     if (A == NULL || B == NULL || C == NULL)
135     {
136         printf("Error: Host allocation of A, B, or C.\n");
137         return -1;
138     }
139
140     /*******************
141      * Initialize data
142      *******************/
143
144     for (int64_t i = 0; i < elementsA; i++)
145         A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146     for (int64_t i = 0; i < elementsB; i++)
147         B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148     for (int64_t i = 0; i < elementsC; i++)
149         C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151     HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152     HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153     HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155     /*************************
156      * cuTENSOR
157      *************************/
158
159     cutensorHandle_t handle;
160     HANDLE_ERROR(cutensorCreate(&handle));
161
162     /**********************
163      * Load plan cache
164      **********************/
165
166     // holds information about the per-handle plan cache
167     const char planCacheFilename[] = "./planCache.bin";
168     uint32_t numCachelines = 0;
169     cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170             planCacheFilename, &numCachelines);
171     if (status == CUTENSOR_STATUS_IO_ERROR)
172     {
173         printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174     }
175     else if (status != CUTENSOR_STATUS_SUCCESS)
176     {
177         printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178     }
179     else
180     {
181         printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182                 numCachelines);
183     }
184
185     /**********************
186      * Optional: Resize the cache in case you expect the default size to be insufficient for your use case
187      **********************/
188     uint32_t numEntries = 128;
189     HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191     /**********************
192      * Create Tensor Descriptors
193      **********************/
194     cutensorTensorDescriptor_t descA;
195     HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196                  &descA,
197                  nmodeA,
198                  extentA.data(),
199                  NULL,/*stride*/
200                  typeA, kAlignment));
201
202     cutensorTensorDescriptor_t descB;
203     HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204                  &descB,
205                  nmodeB,
206                  extentB.data(),
207                  NULL,/*stride*/
208                  typeB, kAlignment));
209
210     cutensorTensorDescriptor_t descC;
211     HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212                  &descC,
213                  nmodeC,
214                  extentC.data(),
215                  NULL,/*stride*/
216                  typeC, kAlignment));
217
218     /*******************************
219      * Create Contraction Descriptor
220      *******************************/
221
222     cutensorOperationDescriptor_t desc;
223     HANDLE_ERROR(cutensorCreateContraction(handle,
224                  &desc,
225                  descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226                  descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227                  descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228                  descC, modeC.data(),
229                  descCompute));
230
231     /**************************
232      * PlanPreference: Set the algorithm to use and enable incremental autotuning
233      ***************************/
234
235     const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237     cutensorPlanPreference_t planPref;
238     HANDLE_ERROR(cutensorCreatePlanPreference(
239                                handle,
240                                &planPref,
241                                algo,
242                                CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244     const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245     HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246         handle,
247         planPref,
248         CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249         &cacheMode,
250         sizeof(cutensorCacheMode_t)));
251
252     const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253     HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254         handle,
255         planPref,
256         CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257         &autotuneMode ,
258         sizeof(cutensorAutotuneMode_t)));
259
260     const uint32_t incCount = 4;
261     HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262         handle,
263         planPref,
264         CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265         &incCount,
266         sizeof(uint32_t)));
267
268     /**********************
269      * Query workspace estimate
270      **********************/
271
272     uint64_t workspaceSizeEstimate = 0;
273     const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274     HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275                                           desc,
276                                           planPref,
277                                           workspacePref,
278                                           &workspaceSizeEstimate));
279
280     /**************************
281      * Create Contraction Plan
282      **************************/
283
284     cutensorPlan_t plan;
285     HANDLE_ERROR(cutensorCreatePlan(handle,
286                  &plan,
287                  desc,
288                  planPref,
289                  workspaceSizeEstimate));
290
291     /**************************
292      * Optional: Query information about the created plan
293      **************************/
294
295     // query actually used workspace
296     uint64_t actualWorkspaceSize = 0;
297     HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298         plan,
299         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300         &actualWorkspaceSize,
301         sizeof(actualWorkspaceSize)));
302
303     // At this point the user knows exactly how much memory is needed by the operation and
304     // only the smaller actual workspace needs to be allocated
305     assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307     void *work = nullptr;
308     if (actualWorkspaceSize > 0)
309     {
310         HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311         assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312     }
313
314     /**********************
315      * Run
316      **********************/
317
318     double minTimeCUTENSOR = 1e100;
319     for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320     {
321         cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322         cudaDeviceSynchronize();
323
324         // Set up timing
325         GPUTimer timer;
326         timer.start();
327
328         // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329         HANDLE_ERROR(cutensorContract(handle,
330                                   plan,
331                                   (void*) &alpha, A_d, B_d,
332                                   (void*) &beta,  C_d, C_d,
333                                   work, actualWorkspaceSize, 0 /* stream */));
334
335         // Synchronize and measure timing
336         auto time = timer.seconds();
337
338         minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339     }
340
341     /*************************/
342
343     double transferedBytes = sizeC + sizeA + sizeB;
344     transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345     transferedBytes /= 1e9;
346     printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348     uint32_t tag = 1;
349     HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
350          handle,
351          desc,
352          CUTENSOR_OPERATION_DESCRIPTOR_TAG,
353          &tag,
354          sizeof(uint32_t)));
355
356     /**************************
357      * Create Contraction Plan (with a different tag)
358      **************************/
359
360     HANDLE_ERROR(cutensorDestroyPlan(plan)); // destroy the previous plan before creating the tagged one
361     HANDLE_ERROR(cutensorCreatePlan(handle,
362                  &plan,
363                  desc,
364                  planPref,
365                  workspaceSizeEstimate));
366
367     /**************************
368      * Optional: Query information about the created plan
369      **************************/
370
371     // query actually used workspace
372     actualWorkspaceSize = 0; // reuse the variable declared for the first plan
373     HANDLE_ERROR(cutensorPlanGetAttribute(handle,
374         plan,
375         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
376         &actualWorkspaceSize,
377         sizeof(actualWorkspaceSize)));
378
379     // At this point the user knows exactly how much memory is need by the operation and
380     // only the smaller actual workspace needs to be allocated
381     assert(actualWorkspaceSize <= workspaceSizeEstimate);
382
383     if (work) { HANDLE_CUDA_ERROR(cudaFree(work)); work = nullptr; } // release the previous workspace before reallocating
384     if (actualWorkspaceSize > 0)
385     {
386         HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
387         assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
388     }
389
390     /**********************
391      * Run
392      **********************/
393
394     minTimeCUTENSOR = 1e100; // reset the minimum time for the tagged contraction
395     for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
396     {
397         cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
398         cudaDeviceSynchronize();
399
400         // Set up timing
401         GPUTimer timer;
402         timer.start();
403
404         // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
405         HANDLE_ERROR(cutensorContract(handle,
406                                   plan,
407                                   (void*) &alpha, A_d, B_d,
408                                   (void*) &beta,  C_d, C_d,
409                                   work, actualWorkspaceSize, 0 /* stream */));
410
411         // Synchronize and measure timing
412         auto time = timer.seconds();
413
414         minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
415     }
416
417     /*************************/
418
419     double transferedBytes = sizeC + sizeA + sizeB;
420     transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
421     transferedBytes /= 1e9;
422     printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
423
424     status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
425     if (status == CUTENSOR_STATUS_IO_ERROR)
426     {
427         printf("File (%s) couldn't be written to.\n", planCacheFilename);
428     }
429     else if (status != CUTENSOR_STATUS_SUCCESS)
430     {
431         printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
432                 cutensorGetErrorString(status));
433     }
434     else
435     {
436         printf("Plan cache successfully stored to %s.\n", planCacheFilename);
437     }
438
439
440     HANDLE_ERROR(cutensorDestroy(handle));
441     HANDLE_ERROR(cutensorDestroyPlan(plan));
442     HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
443     HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
444     HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
445     HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
446
447     if (A) free(A);
448     if (B) free(B);
449     if (C) free(C);
450     if (A_d) cudaFree(A_d);
451     if (B_d) cudaFree(B_d);
452     if (C_d) cudaFree(C_d);
453     if (work) cudaFree(work);
454
455     return 0;
456 }

You can confirm that the cache now holds two entries by invoking the binary once again; this time it should report that two cachelines were read from the file (./planCache.bin).

This concludes our example of the plan cache; you can find these examples (including timings and warm-up runs) in the samples repository.

If you have any further questions or suggestions, please do not hesitate to reach out.