Plan Cache (beta)
This section introduces the software-managed plan cache, which has the following key features:
- Minimizes launch-related overhead (e.g., due to kernel selection).
- Overhead-free autotuning (a.k.a. Incremental Autotuning): enables users to automatically find the best implementation for a given problem and thereby increases performance.
- Thread safety: the cache is shared across all threads that use the same cutensorHandle_t.
- Serialization and deserialization: allows users to store the state of the cache to disk and reuse it later.
In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorContractionDescriptor_t) to an actual implementation (encoded by cutensorContractionPlan_t).
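For illustration only, the following minimal C++ sketch (not the actual cuTENSOR implementation; all names are hypothetical) mimics that behavior: a mutex-protected map from an encoded problem instance to a plan, which is also what makes sharing the cache across threads safe.
// Conceptual sketch only -- NOT the cuTENSOR implementation; all names are hypothetical.
// It merely illustrates "lookup table from problem instance to plan" plus thread safety.
#include <mutex>
#include <string>
#include <unordered_map>
struct PlanSketch { int kernelId; };   // stands in for cutensorContractionPlan_t
class PlanCacheSketch
{
public:
    // Returns true on a cache hit and copies the stored plan into 'plan'.
    bool lookup(const std::string& key, PlanSketch& plan)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        auto it = map_.find(key);
        if (it == map_.end()) return false;   // miss: caller creates and stores a plan
        plan = it->second;                    // hit: reuse the stored plan
        return true;
    }
    void store(const std::string& key, const PlanSketch& plan)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        map_[key] = plan;                     // the real cache additionally applies LRU eviction
    }
private:
    std::mutex mtx_;                                   // shared across threads -> lock per access
    std::unordered_map<std::string, PlanSketch> map_;  // key: encoded problem instance
};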
The remainder of this section assumes familiarity with Getting Started.
Note
The cache can always be deactivated via the CUTENSOR_DISABLE_PLAN_CACHE
environment variable (see Environment Variables).
Incremental Autotuning
The incremental autotuning feature enables users to automatically explore different implementations, referred to as candidates, for a given problem.
When using the cache with the incremental autotuning feature (CUTENSOR_AUTOTUNE_INCREMENTAL), successive invocations of the same contraction problem (albeit with potentially different data pointers) will be performed by different candidates; the timing for those candidates is automatically measured and the fastest candidate is stored in the plan cache. The number of different candidates to explore is configurable by the user (via CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT); all subsequent calls to the same problem will then be mapped to the fastest candidate (stored in the cache), thus taking advantage of the fastest (real-world) candidate.
This autotuning approach has some key advantages:
- Candidates are evaluated at a point in time where hardware caches are in a production-environment state (i.e., the hardware cache state reflects the real-world situation).
- Overhead is minimized (i.e., no timing loop, no synchronization).
Moreover, the candidates are evaluated in the order given by our performance model (from fastest to slowest).
Incremental autotuning is especially powerful if combined with cuTENSOR's cache serialization feature (via cutensorHandleWriteCacheToFile and cutensorHandleReadCacheFromFile), i.e., by writing the tuned cache to disk.
Note
We recommend warming up the GPU (i.e., reaching steady-state performance) before auto-tuning is used to minimize fluctuations in measured performance.
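As a hedged illustration only (this exact warm-up strategy is our suggestion, not something prescribed by cuTENSOR), a few untimed contractions before the tuned runs are usually sufficient; the snippet assumes the handle, plan, data pointers, workspace, alpha, and beta from the examples below.
// Warm-up sketch (suggestion, not an API requirement): execute a few untimed
// contractions so that GPU clocks and caches reach steady state before
// incremental autotuning starts measuring candidates.
// Assumes handle, plan, alpha, beta, A_d, B_d, C_d, work and worksize from the
// examples below; ideally the warm-up plan is created with caching disabled
// (CUTENSOR_CACHE_MODE_NONE) so these runs do not count toward autotuning.
for (int i = 0; i < 3; ++i)
{
    HANDLE_ERROR( cutensorContraction(&handle, &plan,
                                      (void*)&alpha, A_d, B_d,
                                      (void*)&beta,  C_d, C_d,
                                      work, worksize, 0 /* stream */) );
}
cudaDeviceSynchronize(); // ensure the warm-up runs have completed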
Introductory Example
This subsection provides a basic overview of the cache-related API calls and features. In addition to the steps outlined in Getting Started, in this example we also:
- Attach cachelines to the handle (via cutensorHandleAttachPlanCachelines).
- Configure the cache behavior on a contraction-by-contraction basis (via cutensorContractionFindSetAttribute).
Let’s start from the same example outlined in Getting Started and enhance it with the steps required to use the plan cache. First we have to add calls to attach and detach cachelines:
// Set number of cache lines
constexpr int32_t numCachelines = 1024;
// Set cache size and allocate
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
// Attach cache
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
// ...
// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
Note that the number of cachelines is configurable by the user; ideally we would like to supply as many cachelines as the application has distinct contraction calls. Since this might not always be possible (due to memory constraints),
cuTENSOR's plan cache will evict cache entries using a least-recently-used (LRU) policy. Users can also choose to disable caching on a contraction-by-contraction basis (via the CUTENSOR_CACHE_MODE_NONE value of cutensorCacheMode_t).
Note
Until the cachelines are detached using cutensorHandleDetachPlanCachelines, the user-allocated cachelines must not be freed.
Moreover, cutensorHandleDetachPlanCachelines also deallocates other plan-cache-related resources.
It should be called once the plan cache is no longer needed in order to avoid resource leaks.
With the above-mentioned changes (and a loop around the contraction call), our example now looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutensor.h>
#include <unordered_map>
#include <vector>
// Handle cuTENSOR errors
#define HANDLE_ERROR(x) { \
const auto err = x; \
if( err != CUTENSOR_STATUS_SUCCESS ) \
{ printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
}
int main(int argc, char** argv)
{
// Host element type definition
typedef float floatTypeA;
typedef float floatTypeB;
typedef float floatTypeC;
typedef float floatTypeCompute;
// CUDA types
cudaDataType_t typeA = CUDA_R_32F;
cudaDataType_t typeB = CUDA_R_32F;
cudaDataType_t typeC = CUDA_R_32F;
cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
floatTypeCompute alpha = (floatTypeCompute)1.1f;
floatTypeCompute beta = (floatTypeCompute)0.9f;
printf("Include headers and define data types\n");
/* ***************************** */
// Create vector of modes
std::vector<int> modeC{'m','u','n','v'};
std::vector<int> modeA{'m','h','k','n'};
std::vector<int> modeB{'u','k','v','h'};
int nmodeA = modeA.size();
int nmodeB = modeB.size();
int nmodeC = modeC.size();
// Extents
std::unordered_map<int, int64_t> extent;
extent['m'] = 96;
extent['n'] = 96;
extent['u'] = 96;
extent['v'] = 64;
extent['h'] = 64;
extent['k'] = 64;
// Create a vector of extents for each tensor
std::vector<int64_t> extentC;
for(auto mode : modeC)
extentC.push_back(extent[mode]);
std::vector<int64_t> extentA;
for(auto mode : modeA)
extentA.push_back(extent[mode]);
std::vector<int64_t> extentB;
for(auto mode : modeB)
extentB.push_back(extent[mode]);
printf("Define modes and extents\n");
/* ***************************** */
// Number of elements of each tensor
size_t elementsA = 1;
for(auto mode : modeA)
elementsA *= extent[mode];
size_t elementsB = 1;
for(auto mode : modeB)
elementsB *= extent[mode];
size_t elementsC = 1;
for(auto mode : modeC)
elementsC *= extent[mode];
// Size in bytes
size_t sizeA = sizeof(floatTypeA) * elementsA;
size_t sizeB = sizeof(floatTypeB) * elementsB;
size_t sizeC = sizeof(floatTypeC) * elementsC;
// Allocate on device
void *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, sizeA);
cudaMalloc((void**)&B_d, sizeB);
cudaMalloc((void**)&C_d, sizeC);
// Allocate on host
floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
// Initialize data on host
for(int64_t i = 0; i < elementsA; i++)
A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsB; i++)
B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsC; i++)
C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
// Copy to device
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
printf("Allocate, initialize and transfer tensors\n");
/* ***************************** */
// Initialize cuTENSOR library
cutensorHandle_t handle;
cutensorInit(&handle);
/**********************
* Setup plan cache
**********************/
printf("Attach cachelines\n");
constexpr int32_t numCachelines = 1024;
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
// Create Tensor Descriptors
cutensorTensorDescriptor_t descA;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descA,
nmodeA,
extentA.data(),
NULL,/*stride*/
typeA, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descB;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descB,
nmodeB,
extentB.data(),
NULL,/*stride*/
typeB, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descC;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descC,
nmodeC,
extentC.data(),
NULL,/*stride*/
typeC, CUTENSOR_OP_IDENTITY ) );
printf("Initialize cuTENSOR and tensor descriptors\n");
/* ***************************** */
//Retrieve the memory alignment for each tensor
uint32_t alignmentRequirementA;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
A_d,
&descA,
&alignmentRequirementA) );
uint32_t alignmentRequirementB;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
B_d,
&descB,
&alignmentRequirementB) );
uint32_t alignmentRequirementC;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
C_d,
&descC,
&alignmentRequirementC) );
printf("Query best alignment requirement for our pointers\n");
/* ***************************** */
// Create the Contraction Descriptor
cutensorContractionDescriptor_t desc;
HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
&desc,
&descA, modeA.data(), alignmentRequirementA,
&descB, modeB.data(), alignmentRequirementB,
&descC, modeC.data(), alignmentRequirementC,
&descC, modeC.data(), alignmentRequirementC,
typeCompute) );
printf("Initialize contraction descriptor\n");
/* ***************************** */
// Set the algorithm to use
cutensorContractionFind_t find;
HANDLE_ERROR( cutensorInitContractionFind(
&handle, &find,
CUTENSOR_ALGO_DEFAULT) );
printf("Initialize settings to find algorithm\n");
/* ***************************** */
// Query workspace
size_t worksize = 0;
HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
&desc,
&find,
CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
// Allocate workspace
void *work = nullptr;
if(worksize > 0)
{
if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
{
work = nullptr;
worksize = 0;
}
}
printf("Query recommended workspace size and allocate it\n");
/* ***************************** */
printf("Execute contraction from plan\n");
int numRuns = 5;
for(int i=0; i < numRuns; ++i)
{
// Create Contraction Plan && look-up cache (if attached)
cutensorContractionPlan_t plan;
HANDLE_ERROR( cutensorInitContractionPlan(&handle,
&plan,
&desc,
&find,
worksize) );
printf("Create plan for contraction\n");
/* ***************************** */
cutensorStatus_t err;
// Execute the tensor contraction
err = cutensorContraction(&handle,
&plan,
(void*)&alpha, A_d,
B_d,
(void*)&beta, C_d,
C_d,
work, worksize, 0 /* stream */);
cudaDeviceSynchronize();
// Check for errors
if(err != CUTENSOR_STATUS_SUCCESS)
{
printf("ERROR: %s\n", cutensorGetErrorString(err));
}
}
/* ***************************** */
// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
if ( A ) free( A );
if ( B ) free( B );
if ( C ) free( C );
if ( cachelines ) free(cachelines);
if ( A_d ) cudaFree( A_d );
if ( B_d ) cudaFree( B_d );
if ( C_d ) cudaFree( C_d );
if ( work ) cudaFree( work );
printf("Successful completion\n");
return 0;
}
This minimal change suffices to cache the plan for the tensor contraction call (i.e., the cutensorInitContractionPlan and cutensorContraction calls inside the loop).
It's important to note that the call to cutensorInitContractionPlan is inside the loop; the lookup from the cache happens there.
To disable caching for a certain contraction, the corresponding cutensorContractionFind_t needs to be modified accordingly:
const cutensorCacheMode_t cacheMode = CUTENSOR_CACHE_MODE_NONE;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_CACHE_MODE,
&cacheMode,
sizeof(cutensorCacheMode_t)));
This concludes the introductory example.
Advanced Example
This example augments the previous example and explains how to:
- Take advantage of incremental autotuning. It is recommended to warm up the GPU (i.e., reach steady-state performance) before using autotuning, to avoid large fluctuations in measured performance.
- Use tags to distinguish two otherwise identical tensor contractions from each other. This is useful if the hardware cache of the GPU is (likely) substantially different between the two calls (e.g., if one of the operands was just read or written by a preceding call) and the state of the cache is expected to have a significant impact on performance (e.g., for bandwidth-bound contractions).
- Write the plan cache state to a file and read it back in.
Let us start by enabling incremental autotuning.
To do so, we modify cutensorContractionFind_t
as follows:
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
&autotuneMode ,
sizeof(cutensorAutotuneMode_t)));
const uint32_t incCount = 4;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
&incCount,
sizeof(uint32_t)));
The first call to cutensorContractionFindSetAttribute enables incremental autotuning, while the second call sets CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT; this value corresponds to the number of different candidates that should be explored via incremental autotuning before subsequent calls are looked up from the plan cache.
Higher values of incCount explore more candidates and thus cause a larger overhead initially, but they can also result in better performance, provided the initial overhead can be amortized (e.g., when writing the cache to disk).
We feel that a CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT of four is a good default value.
The following code incorporates those changes:
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutensor.h>
#include <unordered_map>
#include <vector>
// Handle cuTENSOR errors
#define HANDLE_ERROR(x) { \
const auto err = x; \
if( err != CUTENSOR_STATUS_SUCCESS ) \
{ printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
}
int main(int argc, char** argv)
{
// Host element type definition
typedef float floatTypeA;
typedef float floatTypeB;
typedef float floatTypeC;
typedef float floatTypeCompute;
// CUDA types
cudaDataType_t typeA = CUDA_R_32F;
cudaDataType_t typeB = CUDA_R_32F;
cudaDataType_t typeC = CUDA_R_32F;
cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
floatTypeCompute alpha = (floatTypeCompute)1.1f;
floatTypeCompute beta = (floatTypeCompute)0.9f;
printf("Include headers and define data types\n");
/* ***************************** */
// Create vector of modes
std::vector<int> modeC{'m','u','n','v'};
std::vector<int> modeA{'m','h','k','n'};
std::vector<int> modeB{'u','k','v','h'};
int nmodeA = modeA.size();
int nmodeB = modeB.size();
int nmodeC = modeC.size();
// Extents
std::unordered_map<int, int64_t> extent;
extent['m'] = 96;
extent['n'] = 96;
extent['u'] = 96;
extent['v'] = 64;
extent['h'] = 64;
extent['k'] = 64;
// Create a vector of extents for each tensor
std::vector<int64_t> extentC;
for(auto mode : modeC)
extentC.push_back(extent[mode]);
std::vector<int64_t> extentA;
for(auto mode : modeA)
extentA.push_back(extent[mode]);
std::vector<int64_t> extentB;
for(auto mode : modeB)
extentB.push_back(extent[mode]);
printf("Define modes and extents\n");
/* ***************************** */
// Number of elements of each tensor
size_t elementsA = 1;
for(auto mode : modeA)
elementsA *= extent[mode];
size_t elementsB = 1;
for(auto mode : modeB)
elementsB *= extent[mode];
size_t elementsC = 1;
for(auto mode : modeC)
elementsC *= extent[mode];
// Size in bytes
size_t sizeA = sizeof(floatTypeA) * elementsA;
size_t sizeB = sizeof(floatTypeB) * elementsB;
size_t sizeC = sizeof(floatTypeC) * elementsC;
// Allocate on device
void *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, sizeA);
cudaMalloc((void**)&B_d, sizeB);
cudaMalloc((void**)&C_d, sizeC);
// Allocate on host
floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
// Initialize data on host
for(int64_t i = 0; i < elementsA; i++)
A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsB; i++)
B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsC; i++)
C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
// Copy to device
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
printf("Allocate, initialize and transfer tensors\n");
/* ***************************** */
// Initialize cuTENSOR library
cutensorHandle_t handle;
cutensorInit(&handle);
/**********************
* Setup plan cache
**********************/
printf("Attach cachelines\n");
constexpr int32_t numCachelines = 1024;
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
// Create Tensor Descriptors
cutensorTensorDescriptor_t descA;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descA,
nmodeA,
extentA.data(),
NULL,/*stride*/
typeA, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descB;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descB,
nmodeB,
extentB.data(),
NULL,/*stride*/
typeB, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descC;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descC,
nmodeC,
extentC.data(),
NULL,/*stride*/
typeC, CUTENSOR_OP_IDENTITY ) );
printf("Initialize cuTENSOR and tensor descriptors\n");
/* ***************************** */
//Retrieve the memory alignment for each tensor
uint32_t alignmentRequirementA;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
A_d,
&descA,
&alignmentRequirementA) );
uint32_t alignmentRequirementB;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
B_d,
&descB,
&alignmentRequirementB) );
uint32_t alignmentRequirementC;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
C_d,
&descC,
&alignmentRequirementC) );
printf("Query best alignment requirement for our pointers\n");
/* ***************************** */
// Create the Contraction Descriptor
cutensorContractionDescriptor_t desc;
HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
&desc,
&descA, modeA.data(), alignmentRequirementA,
&descB, modeB.data(), alignmentRequirementB,
&descC, modeC.data(), alignmentRequirementC,
&descC, modeC.data(), alignmentRequirementC,
typeCompute) );
printf("Initialize contraction descriptor\n");
/* ***************************** */
// Set the algorithm to use
cutensorContractionFind_t find;
HANDLE_ERROR( cutensorInitContractionFind(
&handle, &find,
CUTENSOR_ALGO_DEFAULT) );
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
&autotuneMode ,
sizeof(cutensorAutotuneMode_t)));
const uint32_t incCount = 4;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
&incCount,
sizeof(uint32_t)));
printf("Initialize settings to find algorithm\n");
/* ***************************** */
// Query workspace
size_t worksize = 0;
HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
&desc,
&find,
CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
// Allocate workspace
void *work = nullptr;
if(worksize > 0)
{
if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
{
work = nullptr;
worksize = 0;
}
}
printf("Query recommended workspace size and allocate it\n");
/* ***************************** */
printf("Execute contraction from plan\n");
int numRuns = 5;
for(int i=0; i < numRuns; ++i)
{
// Create Contraction Plan && look-up cache (if attached)
cutensorContractionPlan_t plan;
HANDLE_ERROR( cutensorInitContractionPlan(&handle,
&plan,
&desc,
&find,
worksize) );
printf("Create plan for contraction\n");
/* ***************************** */
cutensorStatus_t err;
// Execute the tensor contraction
err = cutensorContraction(&handle,
&plan,
(void*)&alpha, A_d,
B_d,
(void*)&beta, C_d,
C_d,
work, worksize, 0 /* stream */);
cudaDeviceSynchronize();
// Check for errors
if(err != CUTENSOR_STATUS_SUCCESS)
{
printf("ERROR: %s\n", cutensorGetErrorString(err));
}
}
/* ***************************** */
// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
if ( A ) free( A );
if ( B ) free( B );
if ( C ) free( C );
if ( cachelines ) free(cachelines);
if ( A_d ) cudaFree( A_d );
if ( B_d ) cudaFree( B_d );
if ( C_d ) cudaFree( C_d );
if ( work ) cudaFree( work );
printf("Successful completion\n");
return 0;
}
Let us further augment the above example by writing the plan cache to a file and reading it in (provided it was previously written):
const char cacheFilename[] = "./cache.bin";
uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_SUCCESS)
{
printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
}
else if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", cacheFilename);
}
else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
}
// ...
HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
printf("Cache has been successfully written to file (%s).\n", cacheFilename);
Warning
cutensorHandleReadCacheFromFile only succeeds if the number of attached cachelines is sufficient to read all cachelines stored in the file; otherwise CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned and the required number of cachelines is stored in numCachelinesRead.
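One way to handle that case, shown here only as a sketch (the grow-and-retry strategy is our suggestion, not mandated by cuTENSOR), is to attach a sufficiently large cache and read the file again; it reuses the handle and cachelines variables from the example and keeps error handling minimal.
// Sketch (suggestion only): if the file holds more cachelines than are attached,
// re-attach a larger cache and retry the read.
uint32_t required = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &required);
if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
    // 'required' now holds the number of cachelines stored in the file.
    HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
    free(cachelines);
    cachelines = (cutensorPlanCacheline_t*) malloc(required * sizeof(cutensorPlanCacheline_t));
    HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, required) );
    HANDLE_ERROR( cutensorHandleReadCacheFromFile(&handle, cacheFilename, &required) );
}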
With these changes the example now looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutensor.h>
#include <unordered_map>
#include <vector>
// Handle cuTENSOR errors
#define HANDLE_ERROR(x) { \
const auto err = x; \
if( err != CUTENSOR_STATUS_SUCCESS ) \
{ printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
}
int main(int argc, char** argv)
{
// Host element type definition
typedef float floatTypeA;
typedef float floatTypeB;
typedef float floatTypeC;
typedef float floatTypeCompute;
// CUDA types
cudaDataType_t typeA = CUDA_R_32F;
cudaDataType_t typeB = CUDA_R_32F;
cudaDataType_t typeC = CUDA_R_32F;
cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
floatTypeCompute alpha = (floatTypeCompute)1.1f;
floatTypeCompute beta = (floatTypeCompute)0.9f;
printf("Include headers and define data types\n");
/* ***************************** */
// Create vector of modes
std::vector<int> modeC{'m','u','n','v'};
std::vector<int> modeA{'m','h','k','n'};
std::vector<int> modeB{'u','k','v','h'};
int nmodeA = modeA.size();
int nmodeB = modeB.size();
int nmodeC = modeC.size();
// Extents
std::unordered_map<int, int64_t> extent;
extent['m'] = 96;
extent['n'] = 96;
extent['u'] = 96;
extent['v'] = 64;
extent['h'] = 64;
extent['k'] = 64;
// Create a vector of extents for each tensor
std::vector<int64_t> extentC;
for(auto mode : modeC)
extentC.push_back(extent[mode]);
std::vector<int64_t> extentA;
for(auto mode : modeA)
extentA.push_back(extent[mode]);
std::vector<int64_t> extentB;
for(auto mode : modeB)
extentB.push_back(extent[mode]);
printf("Define modes and extents\n");
/* ***************************** */
// Number of elements of each tensor
size_t elementsA = 1;
for(auto mode : modeA)
elementsA *= extent[mode];
size_t elementsB = 1;
for(auto mode : modeB)
elementsB *= extent[mode];
size_t elementsC = 1;
for(auto mode : modeC)
elementsC *= extent[mode];
// Size in bytes
size_t sizeA = sizeof(floatTypeA) * elementsA;
size_t sizeB = sizeof(floatTypeB) * elementsB;
size_t sizeC = sizeof(floatTypeC) * elementsC;
// Allocate on device
void *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, sizeA);
cudaMalloc((void**)&B_d, sizeB);
cudaMalloc((void**)&C_d, sizeC);
// Allocate on host
floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
// Initialize data on host
for(int64_t i = 0; i < elementsA; i++)
A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsB; i++)
B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsC; i++)
C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
// Copy to device
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
printf("Allocate, initialize and transfer tensors\n");
/* ***************************** */
// Initialize cuTENSOR library
cutensorHandle_t handle;
cutensorInit(&handle);
/**********************
* Setup plan cache
**********************/
printf("Attach cachelines\n");
constexpr int32_t numCachelines = 1024;
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
const char cacheFilename[] = "./cache.bin";
uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_SUCCESS)
{
printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
}
else if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", cacheFilename);
}
else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
}
// Create Tensor Descriptors
cutensorTensorDescriptor_t descA;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descA,
nmodeA,
extentA.data(),
NULL,/*stride*/
typeA, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descB;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descB,
nmodeB,
extentB.data(),
NULL,/*stride*/
typeB, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descC;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descC,
nmodeC,
extentC.data(),
NULL,/*stride*/
typeC, CUTENSOR_OP_IDENTITY ) );
printf("Initialize cuTENSOR and tensor descriptors\n");
/* ***************************** */
//Retrieve the memory alignment for each tensor
uint32_t alignmentRequirementA;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
A_d,
&descA,
&alignmentRequirementA) );
uint32_t alignmentRequirementB;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
B_d,
&descB,
&alignmentRequirementB) );
uint32_t alignmentRequirementC;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
C_d,
&descC,
&alignmentRequirementC) );
printf("Query best alignment requirement for our pointers\n");
/* ***************************** */
// Create the Contraction Descriptor
cutensorContractionDescriptor_t desc;
HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
&desc,
&descA, modeA.data(), alignmentRequirementA,
&descB, modeB.data(), alignmentRequirementB,
&descC, modeC.data(), alignmentRequirementC,
&descC, modeC.data(), alignmentRequirementC,
typeCompute) );
printf("Initialize contraction descriptor\n");
/* ***************************** */
// Set the algorithm to use
cutensorContractionFind_t find;
HANDLE_ERROR( cutensorInitContractionFind(
&handle, &find,
CUTENSOR_ALGO_DEFAULT) );
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
&autotuneMode ,
sizeof(cutensorAutotuneMode_t)));
const uint32_t incCount = 4;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
&incCount,
sizeof(uint32_t)));
printf("Initialize settings to find algorithm\n");
/* ***************************** */
// Query workspace
size_t worksize = 0;
HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
&desc,
&find,
CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
// Allocate workspace
void *work = nullptr;
if(worksize > 0)
{
if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
{
work = nullptr;
worksize = 0;
}
}
printf("Query recommended workspace size and allocate it\n");
/* ***************************** */
printf("Execute contraction from plan\n");
int numRuns = 5;
for(int i=0; i < numRuns; ++i)
{
// Create Contraction Plan && look-up cache (if attached)
cutensorContractionPlan_t plan;
HANDLE_ERROR( cutensorInitContractionPlan(&handle,
&plan,
&desc,
&find,
worksize) );
printf("Create plan for contraction\n");
/* ***************************** */
cutensorStatus_t err;
// Execute the tensor contraction
err = cutensorContraction(&handle,
&plan,
(void*)&alpha, A_d,
B_d,
(void*)&beta, C_d,
C_d,
work, worksize, 0 /* stream */);
cudaDeviceSynchronize();
// Check for errors
if(err != CUTENSOR_STATUS_SUCCESS)
{
printf("ERROR: %s\n", cutensorGetErrorString(err));
}
}
/* ***************************** */
HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
printf("Cache has been successfully written to file (%s).\n", cacheFilename);
// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
if ( A ) free( A );
if ( B ) free( B );
if ( C ) free( C );
if ( cachelines ) free(cachelines);
if ( A_d ) cudaFree( A_d );
if ( B_d ) cudaFree( B_d );
if ( C_d ) cudaFree( C_d );
if ( work ) cudaFree( work );
printf("Successful completion\n");
return 0;
}
Finally, let us add a second contraction loop, but this time we want the (otherwise identical) contraction to be cached using a different cacheline.
This can be useful if the state of the hardware cache is substantially different between the two calls (i.e., it affects the measured runtime of the kernels).
To that end, we use the CUTENSOR_CONTRACTION_DESCRIPTOR_TAG attribute:
uint32_t tag = 1;
HANDLE_ERROR( cutensorContractionDescriptorSetAttribute(
&handle,
&desc,
CUTENSOR_CONTRACTION_DESCRIPTOR_TAG,
&tag,
sizeof(uint32_t)));
With this change, the example code now looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cutensor.h>
#include <unordered_map>
#include <vector>
// Handle cuTENSOR errors
#define HANDLE_ERROR(x) { \
const auto err = x; \
if( err != CUTENSOR_STATUS_SUCCESS ) \
{ printf("Error: %s in line %d\n", cutensorGetErrorString(err), __LINE__); exit(-1); } \
}
int main(int argc, char** argv)
{
// Host element type definition
typedef float floatTypeA;
typedef float floatTypeB;
typedef float floatTypeC;
typedef float floatTypeCompute;
// CUDA types
cudaDataType_t typeA = CUDA_R_32F;
cudaDataType_t typeB = CUDA_R_32F;
cudaDataType_t typeC = CUDA_R_32F;
cutensorComputeType_t typeCompute = CUTENSOR_COMPUTE_32F;
floatTypeCompute alpha = (floatTypeCompute)1.1f;
floatTypeCompute beta = (floatTypeCompute)0.9f;
printf("Include headers and define data types\n");
/* ***************************** */
// Create vector of modes
std::vector<int> modeC{'m','u','n','v'};
std::vector<int> modeA{'m','h','k','n'};
std::vector<int> modeB{'u','k','v','h'};
int nmodeA = modeA.size();
int nmodeB = modeB.size();
int nmodeC = modeC.size();
// Extents
std::unordered_map<int, int64_t> extent;
extent['m'] = 96;
extent['n'] = 96;
extent['u'] = 96;
extent['v'] = 64;
extent['h'] = 64;
extent['k'] = 64;
// Create a vector of extents for each tensor
std::vector<int64_t> extentC;
for(auto mode : modeC)
extentC.push_back(extent[mode]);
std::vector<int64_t> extentA;
for(auto mode : modeA)
extentA.push_back(extent[mode]);
std::vector<int64_t> extentB;
for(auto mode : modeB)
extentB.push_back(extent[mode]);
printf("Define modes and extents\n");
/* ***************************** */
// Number of elements of each tensor
size_t elementsA = 1;
for(auto mode : modeA)
elementsA *= extent[mode];
size_t elementsB = 1;
for(auto mode : modeB)
elementsB *= extent[mode];
size_t elementsC = 1;
for(auto mode : modeC)
elementsC *= extent[mode];
// Size in bytes
size_t sizeA = sizeof(floatTypeA) * elementsA;
size_t sizeB = sizeof(floatTypeB) * elementsB;
size_t sizeC = sizeof(floatTypeC) * elementsC;
// Allocate on device
void *A_d, *B_d, *C_d;
cudaMalloc((void**)&A_d, sizeA);
cudaMalloc((void**)&B_d, sizeB);
cudaMalloc((void**)&C_d, sizeC);
// Allocate on host
floatTypeA *A = (floatTypeA*) malloc(sizeof(floatTypeA) * elementsA);
floatTypeB *B = (floatTypeB*) malloc(sizeof(floatTypeB) * elementsB);
floatTypeC *C = (floatTypeC*) malloc(sizeof(floatTypeC) * elementsC);
// Initialize data on host
for(int64_t i = 0; i < elementsA; i++)
A[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsB; i++)
B[i] = (((float) rand())/RAND_MAX - 0.5)*100;
for(int64_t i = 0; i < elementsC; i++)
C[i] = (((float) rand())/RAND_MAX - 0.5)*100;
// Copy to device
cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
cudaMemcpy(A_d, A, sizeA, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
printf("Allocate, initialize and transfer tensors\n");
/* ***************************** */
// Initialize cuTENSOR library
cutensorHandle_t handle;
cutensorInit(&handle);
/**********************
* Setup plan cache
**********************/
printf("Attach cachelines\n");
constexpr int32_t numCachelines = 1024;
const size_t sizeCache = numCachelines * sizeof(cutensorPlanCacheline_t);
printf("Allocating: %.2f kB for the cache\n", sizeCache / 1000.);
cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*) malloc(sizeCache);
HANDLE_ERROR( cutensorHandleAttachPlanCachelines(&handle, cachelines, numCachelines) );
const char cacheFilename[] = "./cache.bin";
uint32_t numCachelinesRead = 0;
cutensorStatus_t status = cutensorHandleReadCacheFromFile(&handle, cacheFilename, &numCachelinesRead);
if (status == CUTENSOR_STATUS_SUCCESS)
{
printf("%d cachelines have been successfully read from file (%s).\n", numCachelinesRead, cacheFilename);
}
else if (status == CUTENSOR_STATUS_IO_ERROR)
{
printf("File (%s) doesn't seem to exist.\n", cacheFilename);
}
else if (status == CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE)
{
printf("Cannot read cache: Please attach at least %d cachelines to the handle.\n", numCachelines);
}
// Create Tensor Descriptors
cutensorTensorDescriptor_t descA;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descA,
nmodeA,
extentA.data(),
NULL,/*stride*/
typeA, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descB;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descB,
nmodeB,
extentB.data(),
NULL,/*stride*/
typeB, CUTENSOR_OP_IDENTITY ) );
cutensorTensorDescriptor_t descC;
HANDLE_ERROR( cutensorInitTensorDescriptor( &handle,
&descC,
nmodeC,
extentC.data(),
NULL,/*stride*/
typeC, CUTENSOR_OP_IDENTITY ) );
printf("Initialize cuTENSOR and tensor descriptors\n");
/* ***************************** */
//Retrieve the memory alignment for each tensor
uint32_t alignmentRequirementA;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
A_d,
&descA,
&alignmentRequirementA) );
uint32_t alignmentRequirementB;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
B_d,
&descB,
&alignmentRequirementB) );
uint32_t alignmentRequirementC;
HANDLE_ERROR( cutensorGetAlignmentRequirement( &handle,
C_d,
&descC,
&alignmentRequirementC) );
printf("Query best alignment requirement for our pointers\n");
/* ***************************** */
// Create the Contraction Descriptor
cutensorContractionDescriptor_t desc;
HANDLE_ERROR( cutensorInitContractionDescriptor( &handle,
&desc,
&descA, modeA.data(), alignmentRequirementA,
&descB, modeB.data(), alignmentRequirementB,
&descC, modeC.data(), alignmentRequirementC,
&descC, modeC.data(), alignmentRequirementC,
typeCompute) );
printf("Initialize contraction descriptor\n");
/* ***************************** */
// Set the algorithm to use
cutensorContractionFind_t find;
HANDLE_ERROR( cutensorInitContractionFind(
&handle, &find,
CUTENSOR_ALGO_DEFAULT) );
const cutensorAutotuneMode_t autotuneMode = CUTENSOR_AUTOTUNE_INCREMENTAL;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
&autotuneMode ,
sizeof(cutensorAutotuneMode_t)));
const uint32_t incCount = 4;
HANDLE_ERROR(cutensorContractionFindSetAttribute(
&handle,
&find,
CUTENSOR_CONTRACTION_FIND_INCREMENTAL_COUNT,
&incCount,
sizeof(uint32_t)));
printf("Initialize settings to find algorithm\n");
/* ***************************** */
// Query workspace
size_t worksize = 0;
HANDLE_ERROR( cutensorContractionGetWorkspace(&handle,
&desc,
&find,
CUTENSOR_WORKSPACE_RECOMMENDED, &worksize ) );
// Allocate workspace
void *work = nullptr;
if(worksize > 0)
{
if( cudaSuccess != cudaMalloc(&work, worksize) ) // This is optional!
{
work = nullptr;
worksize = 0;
}
}
printf("Query recommended workspace size and allocate it\n");
/* ***************************** */
printf("Execute contraction from plan\n");
int numRuns = 5;
for(int i=0; i < numRuns; ++i)
{
// Create Contraction Plan && look-up cache (if attached)
cutensorContractionPlan_t plan;
HANDLE_ERROR( cutensorInitContractionPlan(&handle,
&plan,
&desc,
&find,
worksize) );
printf("Create plan for contraction\n");
/* ***************************** */
cutensorStatus_t err;
// Execute the tensor contraction
err = cutensorContraction(&handle,
&plan,
(void*)&alpha, A_d,
B_d,
(void*)&beta, C_d,
C_d,
work, worksize, 0 /* stream */);
cudaDeviceSynchronize();
// Check for errors
if(err != CUTENSOR_STATUS_SUCCESS)
{
printf("ERROR: %s\n", cutensorGetErrorString(err));
}
}
uint32_t tag = 1;
HANDLE_ERROR( cutensorContractionDescriptorSetAttribute(
&handle,
&desc,
CUTENSOR_CONTRACTION_DESCRIPTOR_TAG,
&tag,
sizeof(uint32_t)));
for(int i=0; i < numRuns; ++i)
{
// Create Contraction Plan && look-up cache (if attached)
cutensorContractionPlan_t plan;
HANDLE_ERROR( cutensorInitContractionPlan(&handle,
&plan,
&desc,
&find,
worksize) );
printf("Create plan for contraction\n");
/* ***************************** */
cutensorStatus_t err;
// Execute the tensor contraction
err = cutensorContraction(&handle,
&plan,
(void*)&alpha, A_d,
B_d,
(void*)&beta, C_d,
C_d,
work, worksize, 0 /* stream */);
cudaDeviceSynchronize();
// Check for errors
if(err != CUTENSOR_STATUS_SUCCESS)
{
printf("ERROR: %s\n", cutensorGetErrorString(err));
}
}
/* ***************************** */
HANDLE_ERROR( cutensorHandleWriteCacheToFile(&handle, cacheFilename) );
printf("Cache has been successfully written to file (%s).\n", cacheFilename);
// Detach cache and free-up resources
HANDLE_ERROR( cutensorHandleDetachPlanCachelines(&handle) );
if ( A ) free( A );
if ( B ) free( B );
if ( C ) free( C );
if ( cachelines ) free(cachelines);
if ( A_d ) cudaFree( A_d );
if ( B_d ) cudaFree( B_d );
if ( C_d ) cudaFree( C_d );
if ( work ) cudaFree( work );
printf("Successful completion\n");
return 0;
}
You can confirm that the cache has two entries now by invoking the binary once again; this time it should report that “2 cachelines have been successfully read from file (./cache.bin)”.
This concludes our example of the plan cache; you can find these examples (including timings and warm-up runs) in the samples repository.
If you have any further questions or suggestions, please do not hesitate to reach out.