# Transition to cuTENSOR 2.x
cuTENSOR 2.x brings significant enhancements and benefits over its predecessor (see Release Notes). This guide provides an overview of the structure and main features of the new and improved API, outlines the key differences between the old and the new API, and demonstrates how to migrate existing code to the new API.
## Overview
cuTENSOR 2.x is designed to deliver higher performance, more functionality, and easier integration for your projects. We introduce a multi-stage API for all supported operations; Figure 1 illustrates the key stages (optional stages are grey): Creating an operation descriptor, restricting the kernel space, planning (i.e., selecting a kernel), and execution.
Figure 1: Multi-stage API introduced by cuTENSOR 2.x
This structure allows us to reduce the API footprint by consolidating all supported operations (contractions, reductions, elementwise operations, and permutations) into a single cutensorOperationDescriptor_t object that shares the same steps to reach execution. Additionally, the new API supports querying the plan to determine the exact workspace size that is required by the operation (via cutensorPlanGetAttribute()), thus reducing the memory requirements of applications to the bare minimum.
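In terms of concrete API calls, the stages map onto the functions used throughout this guide as follows: an operation descriptor is created via cutensorCreateContraction(), cutensorCreateReduction(), or one of the other per-operation Create functions; the kernel space is (optionally) restricted via cutensorCreatePlanPreference() and the workspace bounded via cutensorEstimateWorkspaceSize(); a kernel is selected via cutensorCreatePlan(); and the operation is executed via cutensorContract(), cutensorReduce(), cutensorPermute(), or cutensorElementwiseBinaryExecute().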
## Differences at a glance
The key differences between the old and the new API are listed below:
- cudaDataType -> cutensorDataType_t (e.g., CUDA_R_32F -> CUTENSOR_R_32F)
- cutensorComputeType_t -> cutensorComputeDescriptor_t (e.g., CUTENSOR_COMPUTE_32F -> CUTENSOR_COMPUTE_DESC_32F)
- The previously deprecated compute types CUTENSOR_R_MIN... and CUTENSOR_C_MIN... have been removed.
- cutensorInitTensorDescriptor -> cutensorCreateTensorDescriptor()
- cutensorContractionDescriptor_t -> cutensorOperationDescriptor_t
- cutensorInitContractionDescriptor -> cutensorCreateContraction()
- cutensorInitContractionFind -> cutensorCreatePlanPreference()
- cutensorContractionGetWorkspaceSize -> cutensorEstimateWorkspaceSize()
- cutensorInitContractionPlan -> cutensorCreatePlan()
- cutensorContraction -> cutensorContract()
- cutensorElementwiseBinary -> cutensorElementwiseBinaryExecute()
- cutensorElementwiseTrinary -> cutensorElementwiseTrinaryExecute()
- cutensorPermutation -> cutensorPermute()
- cutensorReduction -> cutensorReduce()
- "Init" functions have become "Create/Destroy" pairs:
  - cutensorInit -> cutensorCreate() and cutensorDestroy()
  - cutensorInitTensorDescriptor -> cutensorCreateTensorDescriptor() and cutensorDestroyTensorDescriptor()
  - cutensorInitContractionDescriptor -> cutensorCreateContraction() and cutensorDestroyOperationDescriptor()
  - cutensorInitContractionFind -> cutensorCreatePlanPreference() and cutensorDestroyPlanPreference()
  - cutensorInitContractionPlan -> cutensorCreatePlan() and cutensorDestroyPlan()
- The cutensorOperator_t (e.g., CUTENSOR_OP_IDENTITY) is no longer part of the cutensorTensorDescriptor_t; it has moved to the creation of each operation (e.g., cutensorCreateContraction()).
- Similarly, the alignment is no longer part of each operation but has moved to cutensorCreateTensorDescriptor(), the intention being that a cutensorTensorDescriptor_t object describes all aspects related to the physical layout of the tensor in memory.
## Example 1: Migrating a contraction from 1.x to 2.x
Data type names have to be adjusted, since their prefix has changed from CUDA_ to CUTENSOR_ (e.g., CUDA_R_32F becomes CUTENSOR_R_32F).
cuTENSOR 1.x:

```cpp
cudaDataType_t typeA = CUDA_R_32F;
cudaDataType_t typeB = CUDA_R_32F;
cudaDataType_t typeC = CUDA_R_32F;
```

cuTENSOR 2.x:

```cpp
cutensorDataType_t typeA = CUTENSOR_R_32F;
cutensorDataType_t typeB = CUTENSOR_R_32F;
cutensorDataType_t typeC = CUTENSOR_R_32F;
```
Since the initialization functions have been replaced by Create/Destroy pairs, it is no longer necessary to use pointers to cuTENSOR objects; users now declare the handle by value and let cutensorCreate() allocate its internal resources. Any memory allocated by cutensorCreate() can be safely released using cutensorDestroy().
cuTENSOR 1.x:

```cpp
cutensorHandle_t* handle;
```

cuTENSOR 2.x:

```cpp
cutensorHandle_t handle;
```
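A minimal sketch of the resulting lifecycle (cutensorCreate() allocates the opaque handle; cutensorDestroy() releases it):

```cpp
cutensorHandle_t handle;
cutensorCreate(&handle);   // allocate and initialize the library handle

// ... use the handle with any cuTENSOR operation ...

cutensorDestroy(handle);   // release all resources associated with the handle
```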
cutensorInitTensorDescriptor is replaced by cutensorCreateTensorDescriptor(). This is not just a name change; the last argument has been changed: instead of specifying the cutensorOperator_t (which is now part of the operation descriptor), the user specifies the alignment of the tensor pointer in bytes.
cuTENSOR 1.x:

```cpp
cutensorTensorDescriptor_t descA;
cutensorInitTensorDescriptor(handle,
                             &descA,
                             nmodeA,
                             extentA.data(),
                             NULL, /* stride */
                             typeA, CUTENSOR_OP_IDENTITY);
```

cuTENSOR 2.x:

```cpp
cutensorTensorDescriptor_t descA;
cutensorCreateTensorDescriptor(handle,
                               &descA,
                               nmodeA,
                               extentA.data(),
                               NULL, /* stride */
                               typeA, kAlignment);
```
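The kAlignment value above is an assumption of this example: it must not exceed the actual alignment (in bytes) of the device pointer that will later be bound to this descriptor. One simple, safe choice, given that cudaMalloc() returns memory aligned to at least 256 bytes:

```cpp
// Safe for any pointer returned directly by cudaMalloc (>= 256-byte aligned);
// pointers offset into such an allocation may require a smaller value.
const uint32_t kAlignment = 128;
```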
In the new API, a contraction is represented by a cutensorOperationDescriptor_t initialized using cutensorCreateContraction().
cuTENSOR 1.x:

```cpp
cutensorContractionDescriptor_t desc;
cutensorInitContractionDescriptor(handle,
                                  &desc,
                                  &descA, modeA.data(), alignmentRequirementA,
                                  &descB, modeB.data(), alignmentRequirementB,
                                  &descC, modeC.data(), alignmentRequirementC,
                                  &descC, modeC.data(), alignmentRequirementC,
                                  typeCompute);
```

cuTENSOR 2.x:

```cpp
cutensorOperationDescriptor_t desc;
cutensorCreateContraction(handle,
                          &desc,
                          descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                          descB, modeB.data(), CUTENSOR_OP_IDENTITY,
                          descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                          descC, modeC.data(),
                          descCompute);
```
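The mode arrays passed here are plain int32_t labels, one per tensor dimension; modes shared between A and B but absent from C are contracted. As a sketch, the vectors modeA, modeB, and modeC might be defined as follows for the matrix-multiplication-like contraction C[m,n] = A[m,k] * B[k,n] (the character-literal convention is borrowed from typical cuTENSOR samples):

```cpp
#include <cstdint>
#include <vector>

// One label per dimension; 'k' appears only in A and B and is thus contracted.
std::vector<int32_t> modeA{'m', 'k'};
std::vector<int32_t> modeB{'k', 'n'};
std::vector<int32_t> modeC{'m', 'n'};
const uint32_t nmodeA = modeA.size();
```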
cutensorContractionFind_t has been renamed to cutensorPlanPreference_t to indicate that it is no longer limited to contractions but applies to all operations. Its functionality remains essentially the same: it configures how cutensorCreatePlan() selects a kernel.
cuTENSOR 1.x:

```cpp
cutensorContractionFind_t find;
cutensorInitContractionFind(handle,
                            &find,
                            CUTENSOR_ALGO_DEFAULT);
```

cuTENSOR 2.x:

```cpp
cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             CUTENSOR_ALGO_DEFAULT,
                             CUTENSOR_JIT_MODE_NONE);
```
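The last argument is new in 2.x and controls just-in-time compilation; passing CUTENSOR_JIT_MODE_NONE (as above) keeps the 1.x behavior of selecting among pre-built kernels. As a sketch, opting into JIT compilation only requires changing this argument:

```cpp
cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             CUTENSOR_ALGO_DEFAULT,
                             CUTENSOR_JIT_MODE_DEFAULT); // JIT-compile a dedicated kernel
```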
cutensorContractionGetWorkspaceSize has been renamed to cutensorEstimateWorkspaceSize(). Three values of cutensorWorksizePreference_t are available (CUTENSOR_WORKSPACE_MIN, CUTENSOR_WORKSPACE_DEFAULT, and CUTENSOR_WORKSPACE_MAX); note that CUTENSOR_WORKSPACE_RECOMMENDED has been renamed to CUTENSOR_WORKSPACE_DEFAULT.
cuTENSOR 1.x:

```cpp
uint64_t worksize = 0;
cutensorContractionGetWorkspaceSize(handle,
                                    &desc,
                                    &find,
                                    CUTENSOR_WORKSPACE_RECOMMENDED,
                                    &worksize);
```

cuTENSOR 2.x:

```cpp
uint64_t workspaceSizeEstimate = 0;
cutensorEstimateWorkspaceSize(handle,
                              desc,
                              planPref,
                              CUTENSOR_WORKSPACE_DEFAULT,
                              &workspaceSizeEstimate);
```
cutensorInitContractionPlan has been renamed to cutensorCreatePlan().
cuTENSOR 1.x:

```cpp
cutensorContractionPlan_t plan;
cutensorInitContractionPlan(handle,
                            &plan,
                            &desc,
                            &find,
                            worksize);
```

cuTENSOR 2.x:

```cpp
cutensorPlan_t plan;
cutensorCreatePlan(handle,
                   &plan,
                   desc,
                   planPref,
                   workspaceSizeEstimate);
```
After planning, users can query the created plan for the actual workspace size required to execute the operation. The actualWorkspaceSize is guaranteed to be smaller than or equal to the workspaceSizeEstimate used above to create the plan.
cuTENSOR 1.x:

```cpp
void *work = nullptr;
if (worksize > 0)
    if (cudaSuccess != cudaMalloc(&work, worksize))
    {
        work = nullptr;
        worksize = 0;
    }
```

cuTENSOR 2.x:

```cpp
uint64_t actualWorkspaceSize = 0;
cutensorPlanGetAttribute(handle,
                         plan,
                         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
                         &actualWorkspaceSize,
                         sizeof(actualWorkspaceSize));

void *work = nullptr;
if (actualWorkspaceSize > 0)
    HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));
```
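HANDLE_CUDA_ERROR is not part of cuTENSOR; it stands in for whatever error handling an application uses. A typical definition, shown here for completeness, might be:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define HANDLE_CUDA_ERROR(x)                                      \
    do {                                                          \
        const cudaError_t err = (x);                              \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)
```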
cutensorContraction has been renamed to cutensorContract().
cuTENSOR 1.x:

```cpp
cutensorContraction(handle,
                    &plan,
                    (void*) &alpha, A_d, B_d,
                    (void*) &beta, C_d, C_d,
                    work, worksize, 0 /* stream */);
```

cuTENSOR 2.x:

```cpp
cudaStream_t stream;
HANDLE_CUDA_ERROR(cudaStreamCreate(&stream));

cutensorContract(handle,
                 plan,
                 (void*) &alpha, A_d, B_d,
                 (void*) &beta, C_d, C_d,
                 work, actualWorkspaceSize, stream);
```
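Finally, once the contraction is no longer needed, each Create call from the steps above is paired with its Destroy counterpart, as listed in the differences above. A sketch (the descriptor and preference objects may be destroyed as soon as the plan has been created if they are not reused):

```cpp
// Release cuTENSOR objects.
cutensorDestroyPlan(plan);
cutensorDestroyPlanPreference(planPref);
cutensorDestroyOperationDescriptor(desc);
cutensorDestroyTensorDescriptor(descA);
cutensorDestroyTensorDescriptor(descB);
cutensorDestroyTensorDescriptor(descC);
cutensorDestroy(handle);

// Release CUDA resources.
HANDLE_CUDA_ERROR(cudaFree(work));
HANDLE_CUDA_ERROR(cudaStreamDestroy(stream));
```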
## Example 2: Migrating a reduction operation from 1.x to 2.x
Reductions (along with permutations and elementwise operations, see Example 3: Migrating a permutation/elementwise operation from 1.x to 2.x) were previously only exposed through a single execution function (i.e., a single-stage API); with cuTENSOR 2.x, by contrast, reductions utilize the same multi-stage API that applies to all other operations. The steps necessary to compute a reduction using the new API closely mirror those of Example 1: Migrating a contraction from 1.x to 2.x and are illustrated below.
cuTENSOR 1.x:

```cpp
uint64_t worksize = 0;
cutensorReductionGetWorkspaceSize(handle,
                                  A_d, &descA, modeA.data(),
                                  C_d, &descC, modeC.data(),
                                  C_d, &descC, modeC.data(),
                                  opReduce, typeCompute, &worksize);

void *work = nullptr;
if (worksize > 0)
{
    cudaMalloc(&work, worksize);
}
```

cuTENSOR 2.x:

```cpp
const cutensorOperator_t opReduce = CUTENSOR_OP_ADD;

cutensorOperationDescriptor_t desc;
cutensorCreateReduction(handle, &desc,
                        descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                        descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                        descC, modeC.data(), opReduce,
                        descCompute);

const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             algo,
                             CUTENSOR_JIT_MODE_NONE);

uint64_t workspaceSizeEstimate = 0;
const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
cutensorEstimateWorkspaceSize(handle,
                              desc,
                              planPref,
                              workspacePref,
                              &workspaceSizeEstimate);

cutensorPlan_t plan;
cutensorCreatePlan(handle,
                   &plan,
                   desc,
                   planPref,
                   workspaceSizeEstimate);

uint64_t actualWorkspaceSize = 0;
cutensorPlanGetAttribute(handle,
                         plan,
                         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
                         &actualWorkspaceSize,
                         sizeof(actualWorkspaceSize));

void *work = nullptr;
if (actualWorkspaceSize > 0)
{
    cudaMalloc(&work, actualWorkspaceSize);
}
```
cuTENSOR 1.x:

```cpp
const cutensorOperator_t opReduce = CUTENSOR_OP_ADD;
cutensorReduction(handle,
                  (const void*)&alpha, A_d, &descA, modeA.data(),
                  (const void*)&beta, C_d, &descC, modeC.data(),
                  C_d, &descC, modeC.data(),
                  opReduce, typeCompute, work, worksize, 0 /* stream */);
```

cuTENSOR 2.x:

```cpp
cutensorReduce(handle, plan,
               (const void*)&alpha, A_d,
               (const void*)&beta, C_d,
               C_d, work, actualWorkspaceSize, stream);
```
## Example 3: Migrating a permutation/elementwise operation from 1.x to 2.x
In the new API, permutations and elementwise operations also utilize the same multi-stage API. The steps necessary to compute an elementwise binary operation using the new API are illustrated below. Note that elementwise operations require no workspace, so the plan can be created directly with a workspace limit of zero and the workspace estimation and query steps can be skipped.
cuTENSOR 1.x:

(No counterpart: the 1.x API exposed elementwise operations through a single execution call, so there was no setup stage.)

cuTENSOR 2.x:

```cpp
cutensorOperationDescriptor_t desc;
cutensorCreateElementwiseBinary(handle,
                                &desc,
                                descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                                descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                                descC, modeC.data(), CUTENSOR_OP_ADD,
                                descCompute);

const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             algo,
                             CUTENSOR_JIT_MODE_NONE);

cutensorPlan_t plan;
cutensorCreatePlan(handle,
                   &plan,
                   desc,
                   planPref,
                   0 /* workspaceSizeLimit */);
```
cuTENSOR 1.x:

```cpp
cutensorElementwiseBinary(handle,
                          (void*)&alpha, A_d, &descA, modeA.data(),
                          (void*)&gamma, C_d, &descC, modeC.data(),
                          C_d, &descC, modeC.data(),
                          CUTENSOR_OP_ADD, typeCompute, 0 /* stream */);
```

cuTENSOR 2.x:

```cpp
cutensorElementwiseBinaryExecute(handle,
                                 plan,
                                 (void*)&alpha, A_d,
                                 (void*)&gamma, C_d,
                                 C_d, 0 /* stream */);
```
We omit a full example for cutensorPermute() since it closely mirrors the elementwise example above; a brief sketch of the 2.x calls follows.
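This sketch assumes a handle, descriptors descA and descB, and mode vectors created as in the previous examples; cutensorCreatePermutation() is the operation-descriptor counterpart for permutations (consult the API reference for the exact signature):

```cpp
// Describe the permutation B = alpha * op(A).
cutensorOperationDescriptor_t desc;
cutensorCreatePermutation(handle, &desc,
                          descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                          descB, modeB.data(),
                          descCompute);

// Plan; like elementwise operations, permutations require no workspace.
cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle, &planPref,
                             CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE);
cutensorPlan_t plan;
cutensorCreatePlan(handle, &plan, desc, planPref, 0 /* workspaceSizeLimit */);

// Execute.
cutensorPermute(handle, plan, (void*)&alpha, A_d, B_d, 0 /* stream */);
```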