Transition to cuTENSOR 2.x

cuTENSOR 2.x brings significant enhancements and benefits over its predecessor (see Release Notes). This guide provides an overview of the structure and main features of the new and improved API, outlines the key differences between the old and the new API, and demonstrates how to migrate existing code to the new API.

Overview

cuTENSOR 2.x is designed to deliver higher performance, more functionality, and easier integration for your projects. We introduce a multi-stage API for all supported operations; Figure 1 illustrates the key stages (optional stages are grey): Creating an operation descriptor, restricting the kernel space, planning (i.e., selecting a kernel), and execution.

Illustration of the API of cuTENSOR 2.x

Figure 1: Multi-stage API introduced by cuTENSOR 2.x

This structure allows us to reduce our API footprint by consolidating all supported operations (contractions, reductions, elementwise, and permutation) into a single cutensorOperationDescriptor_t object that shares the same steps to reach execution. Additionally, the new API supports querying the plan for the exact workspace size required by the operation (via cutensorPlanGetAttribute()), thus reducing applications' memory requirements to the bare minimum.

Differences at a glance

The key differences between the old and the new API are listed below:

Example 1: Migrating a contraction from 1.x to 2.x

DataType names have to be adjusted, since their prefix has been changed from CUDA_ to CUTENSOR_ (e.g., CUDA_R_32F to CUTENSOR_R_32F)

cuTENSOR 1.x

cuTENSOR 2.x

cudaDataType_t typeA = CUDA_R_32F;
cudaDataType_t typeB = CUDA_R_32F;
cudaDataType_t typeC = CUDA_R_32F;
cutensorDataType_t typeA = CUTENSOR_R_32F;
cutensorDataType_t typeB = CUTENSOR_R_32F;
cutensorDataType_t typeC = CUTENSOR_R_32F;

Since our initialization functions have been replaced by Create/Destroy pairs, it is no longer necessary to use pointers to cuTENSOR objects; users can now directly allocate the structure using cutensorCreate(). Any memory allocated by cutensorCreate() can be safely released using cutensorDestroy().

cuTENSOR 1.x

cuTENSOR 2.x

cutensorHandle_t* handle;
cutensorHandle_t handle;

cutensorInitTensorDescriptor has been replaced by cutensorCreateTensorDescriptor(). This is not just a name change; the last argument has changed as well: instead of specifying a cutensorOperator_t (which is now part of the operation descriptor), the user specifies the alignment (in bytes) of the pointer to the tensor data.

cuTENSOR 1.x

cuTENSOR 2.x

cutensorTensorDescriptor_t descA;
cutensorInitTensorDescriptor(handle,
                             &descA,
                             nmodeA,
                             extentA.data(),
                             NULL,/*stride*/
                             typeA, CUTENSOR_OP_IDENTITY);
cutensorTensorDescriptor_t descA;
cutensorCreateTensorDescriptor(handle,
                               &descA,
                               nmodeA,
                               extentA.data(),
                               NULL,/*stride*/
                               typeA, kAlignment);

In the new API, a contraction is represented by a cutensorOperationDescriptor_t initialized using cutensorCreateContraction().

cuTENSOR 1.x

cuTENSOR 2.x

cutensorContractionDescriptor_t desc;
cutensorInitContractionDescriptor(handle,
                                  &desc,
                                  &descA, modeA.data(), alignmentRequirementA,
                                  &descB, modeB.data(), alignmentRequirementB,
                                  &descC, modeC.data(), alignmentRequirementC,
                                  &descC, modeC.data(), alignmentRequirementC,
                                  typeCompute);
cutensorOperationDescriptor_t desc;
cutensorCreateContraction(handle,
                          &desc,
                          descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                          descB, modeB.data(), CUTENSOR_OP_IDENTITY,
                          descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                          descC, modeC.data(),
                          descCompute);

cutensorContractionFind_t has been renamed to cutensorPlanPreference_t to reflect that it is no longer limited to contractions but applies to all operations. Its functionality remains essentially the same: it configures how cutensorCreatePlan() selects a kernel.

cuTENSOR 1.x

cuTENSOR 2.x

cutensorContractionFind_t find;
cutensorInitContractionFind(handle,
                            &find,
                            CUTENSOR_ALGO_DEFAULT);
cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             CUTENSOR_ALGO_DEFAULT,
                             CUTENSOR_JIT_MODE_NONE);

cutensorContractionGetWorkspaceSize has been renamed to cutensorEstimateWorkspaceSize(). Three values of cutensorWorksizePreference_t are available; note that CUTENSOR_WORKSPACE_RECOMMENDED has been renamed to CUTENSOR_WORKSPACE_DEFAULT.

cuTENSOR 1.x

cuTENSOR 2.x

uint64_t worksize = 0;
cutensorContractionGetWorkspaceSize(handle,
                                    &desc,
                                    &find,
                                    CUTENSOR_WORKSPACE_RECOMMENDED,
                                    &worksize);
uint64_t workspaceSizeEstimate = 0;
cutensorEstimateWorkspaceSize(handle,
                              desc,
                              planPref,
                              CUTENSOR_WORKSPACE_DEFAULT,
                              &workspaceSizeEstimate);

cutensorInitContractionPlan has been renamed to cutensorCreatePlan().

cuTENSOR 1.x

cuTENSOR 2.x

cutensorContractionPlan_t plan;
cutensorInitContractionPlan(handle,
                            &plan,
                            &desc,
                            &find,
                            worksize);
cutensorPlan_t plan;
cutensorCreatePlan(handle,
                   &plan,
                   desc,
                   planPref,
                   workspaceSizeEstimate);

After planning, users can query the created plan for the actual workspace size required to execute the operation. The actualWorkspaceSize is guaranteed to be less than or equal to the workspaceSizeEstimate used above to create the plan.

cuTENSOR 1.x

cuTENSOR 2.x

void *work = nullptr;
if (worksize > 0)
  if (cudaSuccess != cudaMalloc(&work, worksize))
  {
    work = nullptr;
    worksize = 0;
  }
uint64_t actualWorkspaceSize = 0;
cutensorPlanGetAttribute(handle,
                         plan,
                         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
                         &actualWorkspaceSize,
                         sizeof(actualWorkspaceSize));

void *work = nullptr;
if (actualWorkspaceSize > 0)
  HANDLE_CUDA_ERROR(cudaMalloc(&work, actualWorkspaceSize));

cutensorContraction has been renamed to cutensorContract().

cuTENSOR 1.x

cuTENSOR 2.x

cutensorContraction(handle,
                    &plan,
                    (void*) &alpha, A_d, B_d,
                    (void*) &beta,  C_d, C_d,
                    work, worksize, 0 /* stream */);
cudaStream_t stream;
HANDLE_CUDA_ERROR(cudaStreamCreate(&stream));

cutensorContract(handle,
                 plan,
                 (void*) &alpha, A_d, B_d,
                 (void*) &beta,  C_d, C_d,
                 work, actualWorkspaceSize, stream);
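Because every 1.x Init function has become a 2.x Create/Destroy pair, migrated code should also release the objects it created once they are no longer needed. A minimal cleanup sketch for the 2.x example above (error handling omitted; it assumes the handle, descriptors, plan preference, plan, workspace, and stream from the preceding snippets):

```c
/* Release everything created in the 2.x example above, in reverse order. */
cutensorDestroyPlan(plan);
cutensorDestroyPlanPreference(planPref);
cutensorDestroyOperationDescriptor(desc);
cutensorDestroyTensorDescriptor(descA);
cutensorDestroyTensorDescriptor(descB);
cutensorDestroyTensorDescriptor(descC);
cudaFree(work);
cudaStreamDestroy(stream);
cutensorDestroy(handle);
```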

Example 2: Migrating a reduction operation from 1.x to 2.x

Reductions (along with permutations and elementwise operations, see Example 3: Migrating a permutation/elementwise operation from 1.x to 2.x) were previously exposed only through a single execution function (i.e., a single-stage API); in contrast, with cuTENSOR 2.x, reductions use the same multi-stage API as all other operations. The steps necessary to compute a reduction using the new API closely mirror Example 1: Migrating a contraction from 1.x to 2.x and are illustrated below.

cuTENSOR 1.x

cuTENSOR 2.x

uint64_t worksize = 0;
cutensorReductionGetWorkspaceSize(handle,
                                  A_d, &descA, modeA.data(),
                                  C_d, &descC, modeC.data(),
                                  C_d, &descC, modeC.data(),
                                  opReduce, typeCompute, &worksize);
void *work = nullptr;
if (worksize > 0)
{
    cudaMalloc(&work, worksize);
}
const cutensorOperator_t opReduce = CUTENSOR_OP_ADD;
cutensorOperationDescriptor_t desc;
cutensorCreateReduction(handle, &desc,
                        descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                        descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                        descC, modeC.data(), opReduce,
                        descCompute);

const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             algo,
                             CUTENSOR_JIT_MODE_NONE);

uint64_t workspaceSizeEstimate = 0;
const cutensorWorksizePreference_t workspacePref = CUTENSOR_WORKSPACE_DEFAULT;
cutensorEstimateWorkspaceSize(handle,
                              desc,
                              planPref,
                              workspacePref,
                              &workspaceSizeEstimate);

cutensorPlan_t plan;
cutensorCreatePlan(handle,
                   &plan,
                   desc,
                   planPref,
                   workspaceSizeEstimate);

uint64_t actualWorkspaceSize = 0;
cutensorPlanGetAttribute(handle,
                         plan,
                         CUTENSOR_PLAN_REQUIRED_WORKSPACE,
                         &actualWorkspaceSize,
                         sizeof(actualWorkspaceSize));

void *work = nullptr;
if (actualWorkspaceSize > 0)
{
    cudaMalloc(&work, actualWorkspaceSize);
}

cuTENSOR 1.x

cuTENSOR 2.x

const cutensorOperator_t opReduce = CUTENSOR_OP_ADD;
cutensorReduction(handle,
                  (const void*)&alpha, A_d, &descA, modeA.data(),
                  (const void*)&beta,  C_d, &descC, modeC.data(),
                  C_d, &descC, modeC.data(),
                  opReduce, typeCompute, work, worksize, 0 /* stream */);
cutensorReduce(handle, plan,
               (const void*)&alpha, A_d,
               (const void*)&beta,  C_d,
               C_d, work, actualWorkspaceSize, stream);

Example 3: Migrating a permutation/elementwise operation from 1.x to 2.x

In the new API, permutations and elementwise operations also utilize the same multi-stage API. The steps necessary to compute an elementwise binary operation using the new API are illustrated below.

cuTENSOR 1.x

cuTENSOR 2.x

cutensorOperationDescriptor_t  desc;
cutensorCreateElementwiseBinary(handle,
                                &desc,
                                descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                                descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                                descC, modeC.data(), CUTENSOR_OP_ADD,
                                descCompute);

const cutensorAlgo_t algo = CUTENSOR_ALGO_DEFAULT;

cutensorPlanPreference_t  planPref;
cutensorCreatePlanPreference(handle,
                             &planPref,
                             algo,
                             CUTENSOR_JIT_MODE_NONE);

cutensorPlan_t  plan;
cutensorCreatePlan(handle,
                   &plan,
                   desc,
                   planPref,
                   0 /* workspaceSizeLimit */);

cuTENSOR 1.x

cuTENSOR 2.x

cutensorElementwiseBinary(handle,
                          (void*)&alpha, A_d, &descA, modeA.data(),
                          (void*)&gamma, C_d, &descC, modeC.data(),
                          C_d, &descC, modeC.data(),
                          CUTENSOR_OP_ADD, typeCompute, 0 /* stream */);
cutensorElementwiseBinaryExecute(handle,
                                 plan,
                                 (void*)&alpha, A_d,
                                 (void*)&gamma, C_d,
                                 C_d, 0 /* stream */);

We omit an example for cutensorPermute() since it closely follows the example above.