Plan Cache#
This section introduces the software-managed plan cache, which has the following key features:

- Minimizes launch-related overhead (e.g., due to kernel selection).
- Provides overhead-free autotuning (a.k.a. incremental autotuning), which enables users to automatically find the best implementation for a given problem and thereby increases performance.
- Is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.
- Can be serialized and deserialized, allowing users to store the state of the cache to disk and reuse it later.

In essence, the plan cache can be seen as a lookup table from a specific problem instance (i.e., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).
The remainder of this section assumes familiarity with Getting Started.
Note
The cache is activated by default and can be deactivated via the CUTENSOR_DISABLE_PLAN_CACHE
environment variable (see Environment Variables).
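Setting the variable in the launch environment before the application starts is the usual approach. Purely as an illustration, the sketch below sets it programmatically before the handle is created; the variable name comes from this documentation, but the value "1" and the assumption that the library reads the variable at handle creation are not confirmed here (see Environment Variables for the accepted values).

#include <cstdlib>   // setenv (POSIX)
#include <cstdio>
#include <cutensor.h>

int main()
{
    // Illustrative only: prefer setting CUTENSOR_DISABLE_PLAN_CACHE in the
    // launch environment. Here we assume a non-empty value disables the cache
    // and that the variable is evaluated when the handle is created.
    setenv("CUTENSOR_DISABLE_PLAN_CACHE", "1", /*overwrite=*/1);

    cutensorHandle_t handle;
    if (cutensorCreate(&handle) != CUTENSOR_STATUS_SUCCESS)
    {
        printf("Failed to create cuTENSOR handle.\n");
        return -1;
    }

    // ... use cuTENSOR as usual; plans are not cached in this run ...

    cutensorDestroy(handle);
    return 0;
}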
Incremental Autotuning#
The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given operation.

When the cache is used with incremental autotuning (CUTENSOR_AUTOTUNE_MODE_INCREMENTAL), successive invocations of the same operation (albeit with potentially different data pointers) are performed by different candidates; the timings of those candidates are measured automatically, and the fastest candidate is stored in the plan cache. The number of candidates to explore is configurable by the user (via CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT); all subsequent calls for the same problem are then mapped to the fastest (measured) candidate stored in the cache.
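As a sketch of the resulting calling pattern (assuming a handle, a plan created with incremental autotuning and an incremental count of incCount, plus data pointers and a workspace set up as in the full listing later in this section):

#include <cstdio>
#include <cuda_runtime.h>
#include <cutensor.h>

// Sketch only: `plan` is assumed to have been created with
// CUTENSOR_AUTOTUNE_MODE_INCREMENTAL and an incremental count of `incCount`;
// data pointers and workspace are assumed to be set up as in the full example.
void runWithIncrementalAutotuning(cutensorHandle_t handle, cutensorPlan_t plan,
                                  const void* alpha, const void* A_d, const void* B_d,
                                  const void* beta, void* C_d,
                                  void* work, uint64_t workspaceSize,
                                  uint32_t incCount, cudaStream_t stream)
{
    // The first `incCount` executions each run (and time) a different candidate;
    // all later executions reuse the fastest candidate stored in the plan cache.
    for (uint32_t i = 0; i < incCount + 1; ++i)
    {
        cutensorStatus_t err = cutensorContract(handle, plan,
                                                alpha, A_d, B_d,
                                                beta, C_d, C_d,
                                                work, workspaceSize, stream);
        if (err != CUTENSOR_STATUS_SUCCESS)
            printf("Error: %s\n", cutensorGetErrorString(err));
    }
}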
This autotuning approach has some key advantages:

- Candidates are evaluated at a point in time when the hardware caches are in a production-environment state (i.e., the hardware cache state reflects the real-world situation).
- Overhead is minimized (i.e., no timing loop, no synchronization).

Moreover, the candidates are evaluated in the order given by our performance model (from fastest to slowest).
Incremental autotuning is especially powerful when combined with cuTENSOR’s cache serialization feature (via cutensorHandleWritePlanCacheToFile() and cutensorHandleReadPlanCacheFromFile()), which writes the tuned cache to disk.
Note
We recommend warming up the GPU (i.e., reaching steady-state performance) before auto-tuning to minimize fluctuations in measured performance.
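As a minimal sketch of such a warm-up: the hypothetical warmupPlan below is assumed to have been created with CUTENSOR_CACHE_MODE_NONE so that these runs do not interact with the plan cache, and the data pointers and workspace are assumed to be set up as in the examples of this section.

#include <cuda_runtime.h>
#include <cutensor.h>

// Sketch only: executes a representative contraction a few times to bring the
// GPU to steady-state clocks before any timed/autotuned runs.
void warmUpGpu(cutensorHandle_t handle, cutensorPlan_t warmupPlan,
               const void* alpha, const void* A_d, const void* B_d,
               const void* beta, void* C_d,
               void* work, uint64_t workspaceSize, cudaStream_t stream)
{
    const int numWarmupRuns = 3; // a handful of runs is typically sufficient
    for (int i = 0; i < numWarmupRuns; ++i)
    {
        cutensorContract(handle, warmupPlan, alpha, A_d, B_d, beta, C_d, C_d,
                         work, workspaceSize, stream);
    }
    cudaStreamSynchronize(stream); // ensure the warm-up has completed before timing
}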
Introductory Example#
This subsection provides a basic overview of the cache-related API calls and features. In addition to the steps outlined in Getting Started, in this example we also:
- Set a suitable cache size.
- Configure the cache behavior on a contraction-by-contraction basis (via cutensorPlanPreferenceSetAttribute()).
Let’s first look at the same example outlined in Getting Started: since cuTENSOR 2.x has the cache enabled by default, it is already taking advantage of caching. While optional, the following code demonstrates how to resize the cache from its implementation-defined initial size.
// Set cache size (number of cachelines)
constexpr uint32_t numEntries = 128;
HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
// ...
Note that the number of entries is configurable by the user; ideally, the cache is large enough to hold an entry for each of the application’s distinct contraction calls. Since this might not always be possible (due to memory constraints), cuTENSOR’s plan cache evicts entries using a least-recently-used (LRU) policy. Users can also choose to disable caching on a contraction-by-contraction basis (via cutensorCacheMode_t::CUTENSOR_CACHE_MODE_NONE).

Note that the cache lookup occurs when the plan is created. As such, caching is particularly helpful if the same contraction is planned multiple times in the same application.
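The sketch below illustrates this behavior; it assumes the handle, desc, planPref, and workspaceSizeEstimate objects as well as the HANDLE_ERROR macro from the Getting Started example.

// Sketch only: `handle`, `desc`, `planPref`, and `workspaceSizeEstimate` are
// assumed to exist as in the Getting Started example.

// First plan creation for this problem: performs kernel selection and
// inserts the result into the plan cache.
cutensorPlan_t plan1;
HANDLE_ERROR(cutensorCreatePlan(handle, &plan1, desc, planPref, workspaceSizeEstimate));

// Second plan creation for the *same* descriptor and preferences: the lookup
// happens here, so this call is served from the plan cache and is much cheaper.
cutensorPlan_t plan2;
HANDLE_ERROR(cutensorCreatePlan(handle, &plan2, desc, planPref, workspaceSizeEstimate));

HANDLE_ERROR(cutensorDestroyPlan(plan2));
HANDLE_ERROR(cutensorDestroyPlan(plan1));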
To disable caching for a certain contraction (i.e., to opt out), the following attribute of cutensorPlanPreference_t needs to be set:
const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
    &cacheMode,
    sizeof(cutensorCacheMode_t)));
This concludes the introductory example.
Advanced Example#
This example augments the previous one and explains how to:

- Take advantage of incremental autotuning.
  - It is recommended to warm up the GPU (i.e., to reach steady-state performance) before autotuning to avoid large fluctuations in measured performance.
- Use tags to distinguish two otherwise identical tensor contractions from each other.
  - This is useful if the hardware cache of the GPU is (likely) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call) and the state of the cache is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions).
- Write the plan cache state to a file and read it back in.
Let us start by enabling incremental autotuning.
To do so, we modify cutensorPlanPreference_t
as follows:
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
    &autotuneMode,
    sizeof(cutensorAutotuneMode_t)));

const uint32_t incCount = 4;
HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
    handle,
    planPref,
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
    &incCount,
    sizeof(uint32_t)));
The first call to cutensorPlanPreferenceSetAttribute() enables incremental autotuning, while the second call sets CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls are looked up from the plan cache. Higher values of incCount explore more candidates and thus cause a larger initial overhead, but they can also result in better performance, provided the initial overhead can be amortized (e.g., when writing the cache to disk). We consider a CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT of four a good default value.
The following code incorporates those changes:
1#include <stdlib.h>
2#include <stdio.h>
3
4#include <unordered_map>
5#include <vector>
6#include <cassert>
7
8#include <cuda_runtime.h>
9#include <cutensor.h>
10
11#define HANDLE_ERROR(x) \
12{ const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15};
16
17#define HANDLE_CUDA_ERROR(x) \
18{ const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21};
22
23struct GPUTimer
24{
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53};
54
55int main()
56{
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
164 **********************/
165 uint32_t numEntries = 128;
166 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
167
168 /**********************
169 * Create Tensor Descriptors
170 **********************/
171 cutensorTensorDescriptor_t descA;
172 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
173 &descA,
174 nmodeA,
175 extentA.data(),
176 NULL,/*stride*/
177 typeA, kAlignment));
178
179 cutensorTensorDescriptor_t descB;
180 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
181 &descB,
182 nmodeB,
183 extentB.data(),
184 NULL,/*stride*/
185 typeB, kAlignment));
186
187 cutensorTensorDescriptor_t descC;
188 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
189 &descC,
190 nmodeC,
191 extentC.data(),
192 NULL,/*stride*/
193 typeC, kAlignment));
194
195 /*******************************
196 * Create Contraction Descriptor
197 *******************************/
198
199 cutensorOperationDescriptor_t desc;
200 HANDLE_ERROR(cutensorCreateContraction(handle,
201 &desc,
202 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
203 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
204 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
205 descC, modeC.data(),
206 descCompute));
207
208 /**************************
209 * PlanPreference: Set the algorithm to use and enable incremental autotuning
210 ***************************/
211
212 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
213
214 cutensorPlanPreference_t planPref;
215 HANDLE_ERROR(cutensorCreatePlanPreference(
216 handle,
217 &planPref,
218 algo,
219 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
220
221 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
222 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
223 handle,
224 planPref,
225 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
226 &cacheMode,
227 sizeof(cutensorCacheMode_t)));
228
229 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
230 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
231 handle,
232 planPref,
233 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
234 &autotuneMode ,
235 sizeof(cutensorAutotuneMode_t)));
236
237 const uint32_t incCount = 4;
238 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
239 handle,
240 planPref,
241 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
242 &incCount,
243 sizeof(uint32_t)));
244
245 /**********************
246 * Query workspace estimate
247 **********************/
248
249 uint64_t workspaceSizeEstimate = 0;
250 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
251 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
252 desc,
253 planPref,
254 workspacePref,
255 &workspaceSizeEstimate));
256
257 /**************************
258 * Create Contraction Plan
259 **************************/
260
261 cutensorPlan_t plan;
262 HANDLE_ERROR(cutensorCreatePlan(handle,
263 &plan,
264 desc,
265 planPref,
266 workspaceSizeEstimate));
267
268 /**************************
269 * Optional: Query information about the created plan
270 **************************/
271
272 // query actually used workspace
273 uint64_t actualWorkspaceSize = 0;
274 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
275 plan,
276 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
277 &actualWorkspaceSize,
278 sizeof(actualWorkspaceSize)));
279
280 // At this point the user knows exactly how much memory is needed by the operation and
281 // only the smaller actual workspace needs to be allocated
282 assert(actualWorkspaceSize <= workspaceSizeEstimate);
283
284 void *work = nullptr;
285 if (actualWorkspaceSize > 0)
286 {
287 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
288 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
289 }
290
291 /**********************
292 * Run
293 **********************/
294
295 double minTimeCUTENSOR = 1e100;
296 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
297 {
298 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
299 cudaDeviceSynchronize();
300
301 // Set up timing
302 GPUTimer timer;
303 timer.start();
304
305 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
306 HANDLE_ERROR(cutensorContract(handle,
307 plan,
308 (void*) &alpha, A_d, B_d,
309 (void*) &beta, C_d, C_d,
310 work, actualWorkspaceSize, 0 /* stream */));
311
312 // Synchronize and measure timing
313 auto time = timer.seconds();
314
315 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
316 }
317
318 /*************************/
319
320 double transferedBytes = sizeC + sizeA + sizeB;
321 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
322 transferedBytes /= 1e9;
323 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
324
325 HANDLE_ERROR(cutensorDestroy(handle));
326 HANDLE_ERROR(cutensorDestroyPlan(plan));
327 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
328 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
329 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
330 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
331
332 if (A) free(A);
333 if (B) free(B);
334 if (C) free(C);
335 if (A_d) cudaFree(A_d);
336 if (B_d) cudaFree(B_d);
337 if (C_d) cudaFree(C_d);
338 if (work) cudaFree(work);
339
340 return 0;
341}
Let us further augment the above example by writing the plan cache to a file and reading it in (provided it was previously written):
const char planCacheFilename[] = "./planCache.bin";
uint32_t numCachelines = 0;
cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
planCacheFilename, &numCachelines);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
}
else
{
printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
numCachelines);
}
// ...
status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) couldn't be written to.\n", planCacheFilename);
}
else if (status != CUTENSOR_STATUS_SUCCESS)
{
printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
cutensorGetErrorString(status));
}
else
{
printf("Plan cache successfully stored to %s.\n", planCacheFilename);
}
Warning
cutensorHandleReadPlanCacheFromFile() only succeeds if the plan cache is large enough to hold all cachelines stored in the file; otherwise CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is stored in numCachelinesRead.
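One way to handle this case is to resize the plan cache to the reported number of cachelines and retry the read. A minimal sketch, reusing handle, planCacheFilename, numCachelines, status, and the HANDLE_ERROR macro from the snippet above:

// Sketch only: if the current plan cache is too small to hold the file's
// cachelines, resize it to the reported number and retry the read.
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // numCachelines was set by the failed read to the required number of cachelines
    HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numCachelines));
    HANDLE_ERROR(cutensorHandleReadPlanCacheFromFile(handle,
                 planCacheFilename, &numCachelines));
}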
With these changes the example now looks as follows:
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 #include <unordered_map>
5 #include <vector>
6 #include <cassert>
7
8 #include <cuda_runtime.h>
9 #include <cutensor.h>
10
11 #define HANDLE_ERROR(x) \
12 { const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15 };
16
17 #define HANDLE_CUDA_ERROR(x) \
18 { const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21 };
22
23 struct GPUTimer
24 {
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53 };
54
55 int main()
56 {
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Load plan cache
164 **********************/
165
166 // File that persists the per-handle plan cache across program runs
167 const char planCacheFilename[] = "./planCache.bin";
168 uint32_t numCachelines = 0;
169 cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170 planCacheFilename, &numCachelines);
171 if (status == CUTENSOR_STATUS_IO_ERROR)
172 {
173 printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174 }
175 else if (status != CUTENSOR_STATUS_SUCCESS)
176 {
177 printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178 }
179 else
180 {
181 printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182 numCachelines);
183 }
184
185 /**********************
186 * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
187 **********************/
188 uint32_t numEntries = 128;
189 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191 /**********************
192 * Create Tensor Descriptors
193 **********************/
194 cutensorTensorDescriptor_t descA;
195 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196 &descA,
197 nmodeA,
198 extentA.data(),
199 NULL,/*stride*/
200 typeA, kAlignment));
201
202 cutensorTensorDescriptor_t descB;
203 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204 &descB,
205 nmodeB,
206 extentB.data(),
207 NULL,/*stride*/
208 typeB, kAlignment));
209
210 cutensorTensorDescriptor_t descC;
211 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212 &descC,
213 nmodeC,
214 extentC.data(),
215 NULL,/*stride*/
216 typeC, kAlignment));
217
218 /*******************************
219 * Create Contraction Descriptor
220 *******************************/
221
222 cutensorOperationDescriptor_t desc;
223 HANDLE_ERROR(cutensorCreateContraction(handle,
224 &desc,
225 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228 descC, modeC.data(),
229 descCompute));
230
231 /**************************
232 * PlanPreference: Set the algorithm to use and enable incremental autotuning
233 ***************************/
234
235 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237 cutensorPlanPreference_t planPref;
238 HANDLE_ERROR(cutensorCreatePlanPreference(
239 handle,
240 &planPref,
241 algo,
242 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246 handle,
247 planPref,
248 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249 &cacheMode,
250 sizeof(cutensorCacheMode_t)));
251
252 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254 handle,
255 planPref,
256 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257 &autotuneMode ,
258 sizeof(cutensorAutotuneMode_t)));
259
260 const uint32_t incCount = 4;
261 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262 handle,
263 planPref,
264 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265 &incCount,
266 sizeof(uint32_t)));
267
268 /**********************
269 * Query workspace estimate
270 **********************/
271
272 uint64_t workspaceSizeEstimate = 0;
273 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275 desc,
276 planPref,
277 workspacePref,
278 &workspaceSizeEstimate));
279
280 /**************************
281 * Create Contraction Plan
282 **************************/
283
284 cutensorPlan_t plan;
285 HANDLE_ERROR(cutensorCreatePlan(handle,
286 &plan,
287 desc,
288 planPref,
289 workspaceSizeEstimate));
290
291 /**************************
292 * Optional: Query information about the created plan
293 **************************/
294
295 // query actually used workspace
296 uint64_t actualWorkspaceSize = 0;
297 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298 plan,
299 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300 &actualWorkspaceSize,
301 sizeof(actualWorkspaceSize)));
302
303 // At this point the user knows exactly how much memory is needed by the operation and
304 // only the smaller actual workspace needs to be allocated
305 assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307 void *work = nullptr;
308 if (actualWorkspaceSize > 0)
309 {
310 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312 }
313
314 /**********************
315 * Run
316 **********************/
317
318 double minTimeCUTENSOR = 1e100;
319 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320 {
321 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322 cudaDeviceSynchronize();
323
324 // Set up timing
325 GPUTimer timer;
326 timer.start();
327
328 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329 HANDLE_ERROR(cutensorContract(handle,
330 plan,
331 (void*) &alpha, A_d, B_d,
332 (void*) &beta, C_d, C_d,
333 work, actualWorkspaceSize, 0 /* stream */));
334
335 // Synchronize and measure timing
336 auto time = timer.seconds();
337
338 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339 }
340
341 /*************************/
342
343 double transferedBytes = sizeC + sizeA + sizeB;
344 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345 transferedBytes /= 1e9;
346 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348 status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
349 if (status == CUTENSOR_STATUS_IO_ERROR)
350 {
351 printf("File (%s) couldn't be written to.\n", planCacheFilename);
352 }
353 else if (status != CUTENSOR_STATUS_SUCCESS)
354 {
355 printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
356 cutensorGetErrorString(status));
357 }
358 else
359 {
360 printf("Plan cache successfully stored to %s.\n", planCacheFilename);
361 }
362
363
364 HANDLE_ERROR(cutensorDestroy(handle));
365 HANDLE_ERROR(cutensorDestroyPlan(plan));
366 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
367 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
368 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
369 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
370
371 if (A) free(A);
372 if (B) free(B);
373 if (C) free(C);
374 if (A_d) cudaFree(A_d);
375 if (B_d) cudaFree(B_d);
376 if (C_d) cudaFree(C_d);
377 if (work) cudaFree(work);
378
379 return 0;
380 }
Finally, let us add a second contraction loop, but this time we want the otherwise identical contraction to be cached using a different cacheline. This can be useful if the state of the hardware cache differs substantially between the two calls (i.e., affecting the measured runtime of the kernels). To that end, we use the CUTENSOR_OPERATION_DESCRIPTOR_TAG attribute:
uint32_t tag = 1;
HANDLE_ERROR(cutensorOperationDescriptorSetAttribute(
    handle,
    desc,
    CUTENSOR_OPERATION_DESCRIPTOR_TAG,
    &tag,
    sizeof(uint32_t)));
With this change, the example code now looks as follows:
1 #include <stdlib.h>
2 #include <stdio.h>
3
4 #include <unordered_map>
5 #include <vector>
6 #include <cassert>
7
8 #include <cuda_runtime.h>
9 #include <cutensor.h>
10
11 #define HANDLE_ERROR(x) \
12 { const auto err = x; \
13 if( err != CUTENSOR_STATUS_SUCCESS ) \
14 { printf("Error: %s\n", cutensorGetErrorString(err)); exit(-1); } \
15 };
16
17 #define HANDLE_CUDA_ERROR(x) \
18 { const auto err = x; \
19 if( err != cudaSuccess ) \
20 { printf("Error: %s\n", cudaGetErrorString(err)); exit(-1); } \
21 };
22
23 struct GPUTimer
24 {
25 GPUTimer()
26 {
27 cudaEventCreate(&start_);
28 cudaEventCreate(&stop_);
29 cudaEventRecord(start_, 0);
30 }
31
32 ~GPUTimer()
33 {
34 cudaEventDestroy(start_);
35 cudaEventDestroy(stop_);
36 }
37
38 void start()
39 {
40 cudaEventRecord(start_, 0);
41 }
42
43 float seconds()
44 {
45 cudaEventRecord(stop_, 0);
46 cudaEventSynchronize(stop_);
47 float time;
48 cudaEventElapsedTime(&time, start_, stop_);
49 return time * 1e-3;
50 }
51 private:
52 cudaEvent_t start_, stop_;
53 };
54
55 int main()
56 {
57 typedef float floatTypeA;
58 typedef float floatTypeB;
59 typedef float floatTypeC;
60 typedef float floatTypeCompute;
61
62 cutensorDataType_t typeA = CUTENSOR_R_32F;
63 cutensorDataType_t typeB = CUTENSOR_R_32F;
64 cutensorDataType_t typeC = CUTENSOR_R_32F;
65 const cutensorComputeDescriptor_t descCompute = CUTENSOR_COMPUTE_DESC_32F;
66
67 floatTypeCompute alpha = (floatTypeCompute)1.1f;
68 floatTypeCompute beta = (floatTypeCompute)0.f;
69
70 /**********************
71 * Computing: C_{m,u,n,v} = alpha * A_{m,h,k,n} B_{u,k,v,h} + beta * C_{m,u,n,v}
72 **********************/
73
74 std::vector<int> modeC{'m','u','n','v'};
75 std::vector<int> modeA{'m','h','k','n'};
76 std::vector<int> modeB{'u','k','v','h'};
77 int nmodeA = modeA.size();
78 int nmodeB = modeB.size();
79 int nmodeC = modeC.size();
80
81 std::unordered_map<int, int64_t> extent;
82 extent['m'] = 96;
83 extent['n'] = 96;
84 extent['u'] = 96;
85 extent['v'] = 64;
86 extent['h'] = 64;
87 extent['k'] = 64;
88
89 double gflops = (2.0 * extent['m'] * extent['n'] * extent['u'] * extent['v'] * extent['k'] * extent['h']) /1e9;
90
91 std::vector<int64_t> extentC;
92 for (auto mode : modeC)
93 extentC.push_back(extent[mode]);
94 std::vector<int64_t> extentA;
95 for (auto mode : modeA)
96 extentA.push_back(extent[mode]);
97 std::vector<int64_t> extentB;
98 for (auto mode : modeB)
99 extentB.push_back(extent[mode]);
100
101 /**********************
102 * Allocating data
103 **********************/
104
105 size_t elementsA = 1;
106 for (auto mode : modeA)
107 elementsA *= extent[mode];
108 size_t elementsB = 1;
109 for (auto mode : modeB)
110 elementsB *= extent[mode];
111 size_t elementsC = 1;
112 for (auto mode : modeC)
113 elementsC *= extent[mode];
114
115 size_t sizeA = sizeof(floatTypeA) * elementsA;
116 size_t sizeB = sizeof(floatTypeB) * elementsB;
117 size_t sizeC = sizeof(floatTypeC) * elementsC;
118 printf("Total memory: %.2f GiB\n", (sizeA + sizeB + sizeC)/1024./1024./1024);
119
120 void *A_d, *B_d, *C_d;
121 HANDLE_CUDA_ERROR(cudaMalloc((void**) &A_d, sizeA));
122 HANDLE_CUDA_ERROR(cudaMalloc((void**) &B_d, sizeB));
123 HANDLE_CUDA_ERROR(cudaMalloc((void**) &C_d, sizeC));
124
125 const uint32_t kAlignment = 128; // Alignment of the global-memory device pointers (bytes)
126 assert(uintptr_t(A_d) % kAlignment == 0);
127 assert(uintptr_t(B_d) % kAlignment == 0);
128 assert(uintptr_t(C_d) % kAlignment == 0);
129
130 floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
131 floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
132 floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
133
134 if (A == NULL || B == NULL || C == NULL)
135 {
136 printf("Error: Host allocation of A or C.\n");
137 return -1;
138 }
139
140 /*******************
141 * Initialize data
142 *******************/
143
144 for (int64_t i = 0; i < elementsA; i++)
145 A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
146 for (int64_t i = 0; i < elementsB; i++)
147 B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
148 for (int64_t i = 0; i < elementsC; i++)
149 C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
150
151 HANDLE_CUDA_ERROR(cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice));
152 HANDLE_CUDA_ERROR(cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice));
153 HANDLE_CUDA_ERROR(cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice));
154
155 /*************************
156 * cuTENSOR
157 *************************/
158
159 cutensorHandle_t handle;
160 HANDLE_ERROR(cutensorCreate(&handle));
161
162 /**********************
163 * Load plan cache
164 **********************/
165
166 // File that persists the per-handle plan cache across program runs
167 const char planCacheFilename[] = "./planCache.bin";
168 uint32_t numCachelines = 0;
169 cutensorStatus_t status = cutensorHandleReadPlanCacheFromFile(handle,
170 planCacheFilename, &numCachelines);
171 if (status == CUTENSOR_STATUS_IO_ERROR)
172 {
173 printf("File (%s) doesn't seem to exist.\n", planCacheFilename);
174 }
175 else if (status != CUTENSOR_STATUS_SUCCESS)
176 {
177 printf("cutensorHandleReadPlanCacheFromFile reports error: %s\n", cutensorGetErrorString(status));
178 }
179 else
180 {
181 printf("cutensorHandleReadPlanCacheFromFile read %d cachelines from file.\n",
182 numCachelines);
183 }
184
185 /**********************
186 * Optional: Resize the cache in case you expect the default option to be insufficient for your use case
187 **********************/
188 uint32_t numEntries = 128;
189 HANDLE_ERROR(cutensorHandleResizePlanCache(handle, numEntries));
190
191 /**********************
192 * Create Tensor Descriptors
193 **********************/
194 cutensorTensorDescriptor_t descA;
195 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
196 &descA,
197 nmodeA,
198 extentA.data(),
199 NULL,/*stride*/
200 typeA, kAlignment));
201
202 cutensorTensorDescriptor_t descB;
203 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
204 &descB,
205 nmodeB,
206 extentB.data(),
207 NULL,/*stride*/
208 typeB, kAlignment));
209
210 cutensorTensorDescriptor_t descC;
211 HANDLE_ERROR(cutensorCreateTensorDescriptor(handle,
212 &descC,
213 nmodeC,
214 extentC.data(),
215 NULL,/*stride*/
216 typeC, kAlignment));
217
218 /*******************************
219 * Create Contraction Descriptor
220 *******************************/
221
222 cutensorOperationDescriptor_t desc;
223 HANDLE_ERROR(cutensorCreateContraction(handle,
224 &desc,
225 descA, modeA.data(), /* unary operator A*/CUTENSOR_OP_IDENTITY,
226 descB, modeB.data(), /* unary operator B*/CUTENSOR_OP_IDENTITY,
227 descC, modeC.data(), /* unary operator C*/CUTENSOR_OP_IDENTITY,
228 descC, modeC.data(),
229 descCompute));
230
231 /**************************
232 * PlanPreference: Set the algorithm to use and enable incremental autotuning
233 ***************************/
234
235 const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;
236
237 cutensorPlanPreference_t planPref;
238 HANDLE_ERROR(cutensorCreatePlanPreference(
239 handle,
240 &planPref,
241 algo,
242 CUTENSOR_JIT_MODE_NONE)); // disable just-in-time compilation
243
244 const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_PEDANTIC;
245 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
246 handle,
247 planPref,
248 CUTENSOR_PLAN_PREFERENCE_CACHE_MODE,
249 &cacheMode,
250 sizeof(cutensorCacheMode_t)));
251
252 const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL;
253 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
254 handle,
255 planPref,
256 CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE,
257 &autotuneMode ,
258 sizeof(cutensorAutotuneMode_t)));
259
260 const uint32_t incCount = 4;
261 HANDLE_ERROR(cutensorPlanPreferenceSetAttribute(
262 handle,
263 planPref,
264 CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT,
265 &incCount,
266 sizeof(uint32_t)));
267
268 /**********************
269 * Query workspace estimate
270 **********************/
271
272 uint64_t workspaceSizeEstimate = 0;
273 const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
274 HANDLE_ERROR(cutensorEstimateWorkspaceSize(handle,
275 desc,
276 planPref,
277 workspacePref,
278 &workspaceSizeEstimate));
279
280 /**************************
281 * Create Contraction Plan
282 **************************/
283
284 cutensorPlan_t plan;
285 HANDLE_ERROR(cutensorCreatePlan(handle,
286 &plan,
287 desc,
288 planPref,
289 workspaceSizeEstimate));
290
291 /**************************
292 * Optional: Query information about the created plan
293 **************************/
294
295 // query actually used workspace
296 uint64_t actualWorkspaceSize = 0;
297 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
298 plan,
299 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
300 &actualWorkspaceSize,
301 sizeof(actualWorkspaceSize)));
302
303 // At this point the user knows exactly how much memory is needed by the operation and
304 // only the smaller actual workspace needs to be allocated
305 assert(actualWorkspaceSize <= workspaceSizeEstimate);
306
307 void *work = nullptr;
308 if (actualWorkspaceSize > 0)
309 {
310 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
311 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
312 }
313
314 /**********************
315 * Run
316 **********************/
317
318 double minTimeCUTENSOR = 1e100;
319 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
320 {
321 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
322 cudaDeviceSynchronize();
323
324 // Set up timing
325 GPUTimer timer;
326 timer.start();
327
328 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
329 HANDLE_ERROR(cutensorContract(handle,
330 plan,
331 (void*) &alpha, A_d, B_d,
332 (void*) &beta, C_d, C_d,
333 work, actualWorkspaceSize, 0 /* stream */));
334
335 // Synchronize and measure timing
336 auto time = timer.seconds();
337
338 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
339 }
340
341 /*************************/
342
343 double transferedBytes = sizeC + sizeA + sizeB;
344 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
345 transferedBytes /= 1e9;
346 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
347
348 uint32_t tag = 1;
349 HANDLE_ERROR( cutensorOperationDescriptorSetAttribute(
350 handle,
351 desc,
352 CUTENSOR_OPERATION_DESCRIPTOR_TAG,
353 &tag,
354 sizeof(uint32_t)));
355
356 /**************************
357 * Create Contraction Plan (with a different tag)
358 **************************/
359
360 HANDLE_ERROR(cutensorDestroyPlan(plan)); // destroy the previous plan; the variable is reused for the tagged plan
361 HANDLE_ERROR(cutensorCreatePlan(handle,
362 &plan,
363 desc,
364 planPref,
365 workspaceSizeEstimate));
366
367 /**************************
368 * Optional: Query information about the created plan
369 **************************/
370
371 // query actually used workspace
372 actualWorkspaceSize = 0;
373 HANDLE_ERROR(cutensorPlanGetAttribute(handle,
374 plan,
375 CUTENSOR_PLAN_REQUIRED_WORKSPACE,
376 &actualWorkspaceSize,
377 sizeof(actualWorkspaceSize)));
378
379 // At this point the user knows exactly how much memory is needed by the operation and
380 // only the smaller actual workspace needs to be allocated
381 assert(actualWorkspaceSize <= workspaceSizeEstimate);
382
383 if (work) { HANDLE_CUDA_ERROR(cudaFree(work)); work = nullptr; } // free the previous workspace; the variable is reused
384 if (actualWorkspaceSize > 0)
385 {
386 HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
387 assert(uintptr_t(work) % 128 == 0); // workspace must be aligned to 128 byte-boundary
388 }
389
390 /**********************
391 * Run
392 **********************/
393
394 minTimeCUTENSOR = 1e100;
395 for (int i=0; i < incCount + 1; ++i) // last iteration will hit the cache
396 {
397 cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
398 cudaDeviceSynchronize();
399
400 // Set up timing
401 GPUTimer timer;
402 timer.start();
403
404 // Automatically takes advantage of the incremental-autotuning (and updates the cache inside the context)
405 HANDLE_ERROR(cutensorContract(handle,
406 plan,
407 (void*) &alpha, A_d, B_d,
408 (void*) &beta, C_d, C_d,
409 work, actualWorkspaceSize, 0 /* stream */));
410
411 // Synchronize and measure timing
412 auto time = timer.seconds();
413
414 minTimeCUTENSOR = (minTimeCUTENSOR < time) ? minTimeCUTENSOR : time;
415 }
416
417 /*************************/
418
419 double transferedBytes = sizeC + sizeA + sizeB;
420 transferedBytes += ((float) beta != 0.f) ? sizeC : 0;
421 transferedBytes /= 1e9;
422 printf("cuTensor: %.2f GFLOPs/s %.2f GB/s\n", gflops / minTimeCUTENSOR, transferedBytes/ minTimeCUTENSOR);
423
424 status = cutensorHandleWritePlanCacheToFile(handle, planCacheFilename);
425 if (status == CUTENSOR_STATUS_IO_ERROR)
426 {
427 printf("File (%s) couldn't be written to.\n", planCacheFilename);
428 }
429 else if (status != CUTENSOR_STATUS_SUCCESS)
430 {
431 printf("cutensorHandleWritePlanCacheToFile reports error: %s\n",
432 cutensorGetErrorString(status));
433 }
434 else
435 {
436 printf("Plan cache successfully stored to %s.\n", planCacheFilename);
437 }
438
439
440 HANDLE_ERROR(cutensorDestroy(handle));
441 HANDLE_ERROR(cutensorDestroyPlan(plan));
442 HANDLE_ERROR(cutensorDestroyOperationDescriptor(desc));
443 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descA));
444 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descB));
445 HANDLE_ERROR(cutensorDestroyTensorDescriptor(descC));
446
447 if (A) free(A);
448 if (B) free(B);
449 if (C) free(C);
450 if (A_d) cudaFree(A_d);
451 if (B_d) cudaFree(B_d);
452 if (C_d) cudaFree(C_d);
453 if (work) cudaFree(work);
454
455 return 0;
456 }
You can confirm that the cache now has two entries by invoking the binary once again; this time it should report that two cachelines were read from the file (./planCache.bin).
This concludes our example of the plan cache; you can find these examples (including timings and warm-up runs) in the samples repository.
If you have any further questions or suggestions, please do not hesitate to reach out.