cuTENSOR Functions

Helper Functions

The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.


cutensorInit()

cutensorStatus_t cutensorInit(cutensorHandle_t *handle)

Brief

Initializes the cuTENSOR library and the opaque handle that holds its context.

Parameters:
  • handle[out] Pointer to cutensorHandle_t


cutensorInitTensorDescriptor()

cutensorStatus_t cutensorInitTensorDescriptor(const cutensorHandle_t *handle, cutensorTensorDescriptor_t *desc, const uint32_t numModes, const int64_t extent[], const int64_t stride[], cudaDataType_t dataType, cutensorOperator_t unaryOp)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Initializes a tensor descriptor

Precondition

extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] Pointer to the address where the allocated tensor descriptor object is stored.

  • numModes[in] Number of modes.

  • extent[in] Extent of each mode (must be larger than zero).

  • stride[in] stride[i] denotes the displacement (stride) between two consecutive elements in the ith mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero; a stride of zero can instead be achieved by omitting that mode entirely. For instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, write C[a,b] = A[b] directly; cuTENSOR will then automatically infer that the a-mode of A should be broadcast.

  • dataType[in] Data type of the stored entries.

  • unaryOp[in] Unary operator that will be applied to each element of the corresponding tensor in a lazy fashion (i.e., the algorithm uses this tensor as its operand only once). The original data of this tensor remains unchanged.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor is not supported (e.g., due to non-supported data type).

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
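
The following is a minimal C++ sketch of how a handle and a descriptor might be created (error handling omitted). The extent values, the packed layout (stride == NULL), and the helper-function name are illustrative assumptions, not part of the API.

#include <cstdint>
#include <cutensor.h>

void createDescriptorExample()
{
    cutensorHandle_t handle;
    cutensorInit(&handle);                        // initialize the library context

    const int64_t extentA[] = {32, 64, 128};      // extents of modes a, b, c (arbitrary)
    cutensorTensorDescriptor_t descA;
    cutensorInitTensorDescriptor(&handle, &descA,
                                 3,               // numModes
                                 extentA,
                                 NULL,            // NULL => packed column-major layout
                                 CUDA_R_32F,      // data type of the stored entries
                                 CUTENSOR_OP_IDENTITY); // no unary operator
}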


cutensorGetAlignmentRequirement()

cutensorStatus_t cutensorGetAlignmentRequirement(const cutensorHandle_t *handle, const void *ptr, const cutensorTensorDescriptor_t *desc, uint32_t *alignmentRequirement)

Brief

Computes the minimal alignment requirement for a given pointer and descriptor

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • ptr[in] Raw pointer to the data of the respective tensor.

  • desc[in] Tensor descriptor for ptr.

  • alignmentRequirement[out] Largest alignment requirement that ptr can fulfill (in bytes).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
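
A short sketch of how the alignment of a freshly allocated device buffer might be queried; the allocation size and the helper name are illustrative assumptions.

#include <cstdint>
#include <cuda_runtime.h>
#include <cutensor.h>

uint32_t queryAlignment(const cutensorHandle_t *handle,
                        const cutensorTensorDescriptor_t *descA,
                        size_t sizeInBytes)
{
    void *A_d = nullptr;
    cudaMalloc(&A_d, sizeInBytes);                 // device allocation for tensor A

    uint32_t alignmentA = 0;
    cutensorGetAlignmentRequirement(handle, A_d, descA, &alignmentA);
    // alignmentA would later be passed to cutensorInitContractionDescriptor().

    cudaFree(A_d);
    return alignmentA;
}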


cutensorGetErrorString()

const char *cutensorGetErrorString(const cutensorStatus_t error)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Returns the description string for an error code

Returns

the error string

Parameters:

error[in] Error code to convert to string.
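
A common pattern is to wrap every cuTENSOR call in a macro that prints the error string on failure; the macro below is a sketch of such a wrapper, not part of the library.

#include <cstdio>
#include <cstdlib>
#include <cutensor.h>

// Abort with a readable message whenever a cuTENSOR call does not succeed.
#define HANDLE_CUTENSOR_ERROR(x)                                          \
    do {                                                                  \
        cutensorStatus_t err = (x);                                       \
        if (err != CUTENSOR_STATUS_SUCCESS) {                             \
            printf("cuTENSOR error: %s in %s:%d\n",                       \
                   cutensorGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)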


cutensorGetVersion()

size_t cutensorGetVersion()

Brief

Returns the version number of the cuTENSOR library


cutensorGetCudartVersion()

size_t cutensorGetCudartVersion()

Brief

Returns version number of the CUDA runtime that cuTENSOR was compiled against

Details

Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
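
A small sketch that prints both version numbers side by side; the helper name is illustrative.

#include <cstdio>
#include <cuda_runtime.h>
#include <cutensor.h>

void printVersions()
{
    printf("cuTENSOR version: %zu\n", cutensorGetVersion());

    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);        // CUDA runtime actually loaded
    printf("CUDA runtime (compiled against): %zu, (loaded): %d\n",
           cutensorGetCudartVersion(), runtimeVersion);
}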

Element-wise Operations

The following functions perform element-wise operations between tensors.


cutensorElementwiseTrinary()

cutensorStatus_t cutensorElementwiseTrinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAB, cutensorOperator_t opABC, cudaDataType_t typeScalar, const cudaStream_t stream)

Brief

Element-wise tensor operation with three inputs

Details

This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta \Psi_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Where

  • A, B, C, and D are multi-mode tensors (of arbitrary data types),

  • \(\Pi^A, \Pi^B, \Pi^C\) are permutation operators that permute the modes of A, B, and C respectively,

  • \(\Psi_{A},\Psi_{B},\Psi_{C}\) are unary element-wise operators (e.g., IDENTITY, CONJUGATE), and

  • \(\Phi_{ABC}, \Phi_{AB}\) are binary element-wise operators (e.g., ADD, MUL, MAX, MIN).

Notice that broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:

  • modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted and such an operation would be covered by either cutensorContraction or cutensorReduction.

  • each mode may appear in each tensor at most once.

Input tensors may be read even if the value of the corresponding scalar is zero.

Examples:

  • \( D_{a,b,c,d} = A_{b,d,a,c}\)

  • \( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)

  • \( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)

  • \( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)

Supported data-type combinations are:

typeA       | typeB       | typeC       | typeScalar
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F  | CUDA_R_32F
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F
CUDA_R_64F  | CUDA_R_64F  | CUDA_R_64F  | CUDA_R_64F
CUDA_C_32F  | CUDA_C_32F  | CUDA_C_32F  | CUDA_C_32F
CUDA_C_64F  | CUDA_C_64F  | CUDA_C_64F  | CUDA_C_64F
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_16F  | CUDA_R_32F
CUDA_R_64F  | CUDA_R_64F  | CUDA_R_32F  | CUDA_R_64F
CUDA_C_64F  | CUDA_C_64F  | CUDA_C_32F  | CUDA_C_64F

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • alpha[in] Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.

  • A[in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.

  • descA[in] A descriptor that holds the information about the data type, modes, and strides of A.

  • modeA[in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’}). The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • beta[in] Scaling factor for B (see equation above) of the type typeScalar. Pointer to the host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.

  • B[in] Multi-mode tensor of type typeB with nmodeB many modes. Pointer to the GPU-accessible memory.

  • descB[in] The B descriptor that holds information about the data type, modes, and strides of B.

  • modeB[in] Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] of the cutensorInitTensorDescriptor

  • gamma[in] Scaling factor for C (see equation above) of type typeScalar. Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.

  • C[in] Multi-mode tensor of type typeC with nmodeC many modes. Pointer to the GPU-accessible memory.

  • descC[in] The C descriptor that holds information about the data type, modes, and strides of C.

  • modeC[in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. The modeC[i] corresponds to extent[i] and stride[i] of the cutensorInitTensorDescriptor.

  • D[out] Multi-mode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to the GPU-accessible memory. Notice that D may alias any input tensor if they share the same memory layout (i.e., same tensor descriptor).

  • descD[in] The D descriptor that holds information about the data type, modes, and strides of D. Notice that descD and descC are currently required to be identical.

  • modeD[in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. The modeD[i] corresponds to extent[i] and stride[i] of the cutensorInitTensorDescriptor.

  • opAB[in] Element-wise binary operator (see \(\Phi_{AB}\) above).

  • opABC[in] Element-wise binary operator (see \(\Phi_{ABC}\) above).

  • typeScalar[in] Denotes the data type for the scalars alpha, beta, and gamma. Moreover, typeScalar determines the data type that is used throughout the computation.

  • stream[in] The cuda stream.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_ARCH_MISMATCH – if the device is either not ready, or the target architecture is not supported.
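
The sketch below shows how an operation analogous to the last example above (an element-wise minimum of a scaled sum with C) might be issued for three-mode tensors. The descriptors, device pointers, mode names, and the helper name are illustrative assumptions; error checks are omitted.

#include <cuda_runtime.h>
#include <cutensor.h>

// D_{a,b,c} = min(alpha * A_{c,a,b} + beta * B_{b,c,a}, gamma * C_{a,b,c})
void trinaryExample(const cutensorHandle_t *handle,
                    const cutensorTensorDescriptor_t *descA, const void *A_d,
                    const cutensorTensorDescriptor_t *descB, const void *B_d,
                    const cutensorTensorDescriptor_t *descC, const void *C_d,
                    void *D_d, cudaStream_t stream)
{
    const int32_t modeA[] = {'c', 'a', 'b'};
    const int32_t modeB[] = {'b', 'c', 'a'};
    const int32_t modeC[] = {'a', 'b', 'c'};
    const float alpha = 2.2f, beta = 1.3f, gamma = 1.0f;

    cutensorElementwiseTrinary(handle,
                               &alpha, A_d, descA, modeA,
                               &beta,  B_d, descB, modeB,
                               &gamma, C_d, descC, modeC,
                                       D_d, descC, modeC,  // descD == descC, modeD == modeC
                               CUTENSOR_OP_ADD,            // opAB:  alpha*A + beta*B
                               CUTENSOR_OP_MIN,            // opABC: min(., gamma*C)
                               CUDA_R_32F, stream);
}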


cutensorElementwiseBinary()

cutensorStatus_t cutensorElementwiseBinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAC, cudaDataType_t typeScalar, cudaStream_t stream)

See cutensorElementwiseTrinary() for details.

Brief

Element-wise tensor operation for two input tensors

Details

This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • alpha[in] Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.

  • A[in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.

  • descA[in] A descriptor that holds the information about the data type, modes, and strides of A.

  • modeA[in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’}). The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • gamma[in] Scaling factor for C (see equation above) of type typeScalar. Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.

  • C[in] Multi-mode tensor of type typeC with nmodeC many modes. Pointer to the GPU-accessible memory.

  • descC[in] The C descriptor that holds information about the data type, modes, and strides of C.

  • modeC[in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. The modeC[i] corresponds to extent[i] and stride[i] of the cutensorInitTensorDescriptor.

  • D[out] Multi-mode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to the GPU-accessible memory. Notice that D may alias any input tensor if they share the same memory layout (i.e., same tensor descriptor).

  • descD[in] The D descriptor that holds information about the data type, modes, and strides of D. Notice that descD and descC are currently required to be identical.

  • modeD[in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. The modeD[i] corresponds to extent[i] and stride[i] of the cutensorInitTensorDescriptor.

  • opAC[in] Element-wise binary operator (see \(\Phi_{AC}\) above).

  • typeScalar[in] Scalar type for the intermediate computation.

  • stream[in] The cuda stream.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
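
A sketch of a permuted accumulation D_{a,b,c} = alpha * A_{c,b,a} + gamma * C_{a,b,c}; the descriptors, device pointers, and the helper name are assumed to be set up beforehand and are illustrative.

#include <cuda_runtime.h>
#include <cutensor.h>

void binaryExample(const cutensorHandle_t *handle,
                   const cutensorTensorDescriptor_t *descA, const void *A_d,
                   const cutensorTensorDescriptor_t *descC, const void *C_d,
                   void *D_d, cudaStream_t stream)
{
    const int32_t modeA[] = {'c', 'b', 'a'};
    const int32_t modeC[] = {'a', 'b', 'c'};
    const float alpha = 1.0f, gamma = 1.0f;

    cutensorElementwiseBinary(handle,
                              &alpha, A_d, descA, modeA,
                              &gamma, C_d, descC, modeC,
                                      D_d, descC, modeC,  // descD == descC, modeD == modeC
                              CUTENSOR_OP_ADD,            // opAC
                              CUDA_R_32F, stream);
}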


cutensorPermutation()

cutensorStatus_t cutensorPermutation(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const cudaDataType_t typeScalar, const cudaStream_t stream)

This function performs an out-of-place tensor permutation; it is a specialization of the element-wise operations above.

Brief

Tensor permutation

Details

This function performs an element-wise tensor operation of the form:

\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]

Where

  • A and B are multi-mode tensors (of arbitrary data types),

  • \(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,

  • \(\Psi\) is a unary element-wise operator (e.g., IDENTITY, SQR, CONJUGATE), which is specified via the tensor descriptor descA.

Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Modes may appear in any order. The only restrictions are:

  • modes that appear in A must also appear in the output tensor.

  • each mode may appear in each tensor at most once.

Supported data-type combinations are:

typeA       | typeB       | typeScalar
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_32F
CUDA_R_16F  | CUDA_R_32F  | CUDA_R_32F
CUDA_R_32F  | CUDA_R_16F  | CUDA_R_32F
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F
CUDA_R_64F  | CUDA_R_64F  | CUDA_R_64F
CUDA_R_32F  | CUDA_R_64F  | CUDA_R_64F
CUDA_R_64F  | CUDA_R_32F  | CUDA_R_64F
CUDA_C_32F  | CUDA_C_32F  | CUDA_C_32F
CUDA_C_64F  | CUDA_C_64F  | CUDA_C_64F
CUDA_C_32F  | CUDA_C_64F  | CUDA_C_64F
CUDA_C_64F  | CUDA_C_32F  | CUDA_C_64F

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • alpha[in] Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.

  • A[in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.

  • descA[in] A descriptor that holds information about the data type, modes, and strides of A.

  • modeA[in] Array of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’})

  • B[inout] Multi-mode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.

  • descB[in] A descriptor that holds information about the data type, modes, and strides of B.

  • modeB[in] Array of size descB->numModes that holds the names of the modes of B

  • typeScalar[in] data type of alpha

  • stream[in] The CUDA stream.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
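
A sketch of an out-of-place permutation B_{c,a,b} = alpha * A_{a,b,c}; descA and descB are assumed to describe tensors with matching per-mode extents, and the helper name is illustrative.

#include <cuda_runtime.h>
#include <cutensor.h>

void permuteExample(const cutensorHandle_t *handle,
                    const cutensorTensorDescriptor_t *descA, const void *A_d,
                    const cutensorTensorDescriptor_t *descB, void *B_d,
                    cudaStream_t stream)
{
    const int32_t modeA[] = {'a', 'b', 'c'};
    const int32_t modeB[] = {'c', 'a', 'b'};
    const float alpha = 1.0f;

    cutensorPermutation(handle, &alpha,
                        A_d, descA, modeA,
                        B_d, descB, modeB,
                        CUDA_R_32F, stream);
}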

Contraction Operations

The following functions perform contractions between tensors.


cutensorInitContractionDescriptor()

cutensorStatus_t cutensorInitContractionDescriptor(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const uint32_t alignmentRequirementA, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const uint32_t alignmentRequirementB, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const uint32_t alignmentRequirementC, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], const uint32_t alignmentRequirementD, cutensorComputeType_t typeCompute)

Brief

Describes the tensor contraction problem of the form:

\[ D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]

Details

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] This opaque struct gets filled with the information that encodes the tensor contraction problem.

  • descA[in] A descriptor that holds the information about the data type, modes and strides of A.

  • modeA[in] Array with ‘nmodeA’ entries that represent the modes of A. The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • alignmentRequirementA[in] Alignment that cuTENSOR may require for A’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.

  • descB[in] The B descriptor that holds information about the data type, modes, and strides of B.

  • modeB[in] Array with ‘nmodeB’ entries that represent the modes of B. The modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • alignmentRequirementB[in] Alignment that cuTENSOR may require for B’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.

  • descC[in] The C descriptor that holds information about the data type, modes, and strides of C.

  • modeC[in] Array with ‘nmodeC’ entries that represent the modes of C. The modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • alignmentRequirementC[in] Alignment that cuTENSOR may require for C’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.

  • modeD[in] Array with ‘nmodeD’ entries that represent the modes of D (must be identical to modeC for now). The modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • descD[in] The D descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).

  • alignmentRequirementD[in] Alignment that cuTENSOR may require for D’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.

  • typeCompute[in] Datatype for the intermediate computation T = A * B.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
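
A sketch that encodes the contraction D_{m,u,n,v} = alpha * A_{m,h,k,n} * B_{u,k,v,h} + beta * C_{m,u,n,v}. The tensor descriptors and alignment values are assumed to come from cutensorInitTensorDescriptor() and cutensorGetAlignmentRequirement(); the mode names and the helper name are arbitrary.

#include <cstdint>
#include <cutensor.h>

void initContractionDescExample(const cutensorHandle_t *handle,
                                cutensorContractionDescriptor_t *desc,
                                const cutensorTensorDescriptor_t *descA, uint32_t alignA,
                                const cutensorTensorDescriptor_t *descB, uint32_t alignB,
                                const cutensorTensorDescriptor_t *descC, uint32_t alignC)
{
    const int32_t modeA[] = {'m', 'h', 'k', 'n'};
    const int32_t modeB[] = {'u', 'k', 'v', 'h'};
    const int32_t modeC[] = {'m', 'u', 'n', 'v'};

    cutensorInitContractionDescriptor(handle, desc,
                                      descA, modeA, alignA,
                                      descB, modeB, alignB,
                                      descC, modeC, alignC,
                                      descC, modeC, alignC,  // D must match C for now
                                      CUTENSOR_COMPUTE_32F);
}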


cutensorContractionDescriptorSetAttribute()

cutensorStatus_t cutensorContractionDescriptorSetAttribute(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, cutensorContractionDescriptorAttributes_t attr, const void *buf, size_t sizeInBytes)

Brief

Set attribute for cutensorContractionDescriptor

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[inout] Contraction descriptor that will be modified.

  • attr[in] Specifies the attribute that will be set.

  • buf[in] This buffer (of size sizeInBytes) determines the value to which attr will be set.

  • sizeInBytes[in] Size of buf (in bytes).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorInitContractionFind()

cutensorStatus_t cutensorInitContractionFind(const cutensorHandle_t *handle, cutensorContractionFind_t *find, const cutensorAlgo_t algo)

Brief

Limits the search space of viable candidates (a.k.a. algorithms)

Details

This function gives the user finer control over the candidates that the subsequent call to cutensorInitContractionPlan is allowed to evaluate.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • find[out] This opaque struct is initialized; it restricts the search space of viable candidates.

  • algo[in] Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If a specified algorithm is not supported, CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_NOT_SUPPORTED – if the specified algorithm is not supported.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.


cutensorContractionFindSetAttribute()

cutensorStatus_t cutensorContractionFindSetAttribute(const cutensorHandle_t *handle, cutensorContractionFind_t *find, cutensorContractionFindAttributes_t attr, const void *buf, size_t sizeInBytes)

Brief

Set attribute for cutensorContractionFind

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • find[inout] This opaque struct restricts the search space of viable candidates.

  • attr[in] Specifies the attribute that will be set.

  • buf[in] This buffer (of size sizeInBytes) determines the value to which attr will be set.

  • sizeInBytes[in] Size of buf (in bytes).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorContractionGetWorkspaceSize()

cutensorStatus_t cutensorContractionGetWorkspaceSize(const cutensorHandle_t *handle, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const cutensorWorksizePreference_t pref, uint64_t *workspaceSize)

Brief

Determines the required workspaceSize for a given tensor contraction (see cutensorContraction)

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[in] This opaque struct encodes the tensor contraction problem.

  • find[in] This opaque struct restricts the search space of viable candidates.

  • pref[in] This parameter influences the size of the workspace; see cutensorWorksizePreference_t for details.

  • workspaceSize[out] The workspace size (in bytes) that is required for the given tensor contraction.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorContractionGetWorkspace()

Deprecated. Use cutensorContractionGetWorkspaceSize instead.


cutensorInitContractionPlan()

cutensorStatus_t cutensorInitContractionPlan(const cutensorHandle_t *handle, cutensorContractionPlan_t *plan, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const uint64_t workspaceSize)

The plan is created for the active CUDA device.

Brief

Initializes the contraction plan for a given tensor contraction problem

Details

This function applies cuTENSOR’s heuristic to select a candidate for a given tensor contraction problem (encoded by desc). The resulting plan can be reused multiple times as long as the tensor contraction problem remains the same.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[out] Opaque handle holding the contraction execution plan (i.e., the candidate that will be executed as well as all of its runtime parameters for the given tensor contraction problem).

  • desc[in] This opaque struct encodes the given tensor contraction problem.

  • find[in] This opaque struct is used to restrict the search space of viable candidates.

  • workspaceSize[in] Available workspace size (in bytes).

Return values:
  • CUTENSOR_STATUS_SUCCESS – If a viable candidate has been found.

  • CUTENSOR_STATUS_NOT_SUPPORTED – If no viable candidate could be found.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace is insufficient.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorContraction()

cutensorStatus_t cutensorContraction(const cutensorHandle_t *handle, const cutensorContractionPlan_t *plan, const void *alpha, const void *A, const void *B, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)

The currently active CUDA device must match the CUDA device that was active at the time at which the plan was created.

Brief

This routine computes the tensor contraction

\[ D = \alpha A B + \beta C \]

Details

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]

Supported data-type combinations are:

typeA       | typeB       | typeC       | typeCompute            | Tensor Core
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F  | CUTENSOR_COMPUTE_32F   | Volta+
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_32F   | Ampere+
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F  | CUTENSOR_COMPUTE_32F   | No
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F  | CUTENSOR_COMPUTE_TF32  | Ampere+
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F  | CUTENSOR_COMPUTE_16BF  | Ampere+
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F  | CUTENSOR_COMPUTE_16F   | Volta+
CUDA_R_64F  | CUDA_R_64F  | CUDA_R_64F  | CUTENSOR_COMPUTE_64F   | Ampere+
CUDA_R_64F  | CUDA_R_64F  | CUDA_R_64F  | CUTENSOR_COMPUTE_32F   | No
CUDA_C_32F  | CUDA_C_32F  | CUDA_C_32F  | CUTENSOR_COMPUTE_32F   | No
CUDA_C_32F  | CUDA_C_32F  | CUDA_C_32F  | CUTENSOR_COMPUTE_TF32  | Ampere+
CUDA_C_64F  | CUDA_C_64F  | CUDA_C_64F  | CUTENSOR_COMPUTE_64F   | Ampere+
CUDA_C_64F  | CUDA_C_64F  | CUDA_C_64F  | CUTENSOR_COMPUTE_32F   | No
CUDA_R_64F  | CUDA_C_64F  | CUDA_C_64F  | CUTENSOR_COMPUTE_64F   | No
CUDA_C_64F  | CUDA_R_64F  | CUDA_C_64F  | CUTENSOR_COMPUTE_64F   | No

[Example]

See https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR/contraction.cu for a concrete example.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Opaque handle holding the contraction execution plan.

  • alpha[in] Scaling for A*B. Its data type is determined by ‘typeCompute’. Pointer to the host memory.

  • A[in] Pointer to the data corresponding to A in device memory. Pointer to the GPU-accessible memory.

  • B[in] Pointer to the data corresponding to B. Pointer to the GPU-accessible memory.

  • beta[in] Scaling for C. Its data type is determined by ‘typeCompute’. Pointer to the host memory.

  • C[in] Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.

  • D[out] Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.

  • workspace[out] Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 256 bytes.

  • workspaceSize[in] Size of the workspace array in bytes; please refer to cutensorContractionGetWorkspaceSize() to query the required workspace. While cutensorContraction() does not strictly require a workspace, it is still recommended to provide some small workspace (e.g., 128 MB).

  • stream[in] The CUDA stream in which all the computation is performed.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.

  • CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.

  • CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).
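
A sketch of the remaining steps needed to execute a contraction descriptor created as shown above (candidate selection, workspace query, plan creation, launch); error checks are omitted and the helper name is illustrative.

#include <cstdint>
#include <cuda_runtime.h>
#include <cutensor.h>

void runContraction(const cutensorHandle_t *handle,
                    const cutensorContractionDescriptor_t *desc,
                    const void *A_d, const void *B_d, const void *C_d, void *D_d,
                    cudaStream_t stream)
{
    // 1. Restrict the candidate search (here: let cuTENSOR's heuristic decide).
    cutensorContractionFind_t find;
    cutensorInitContractionFind(handle, &find, CUTENSOR_ALGO_DEFAULT);

    // 2. Query and allocate the recommended workspace.
    uint64_t workspaceSize = 0;
    cutensorContractionGetWorkspaceSize(handle, desc, &find,
                                        CUTENSOR_WORKSPACE_RECOMMENDED, &workspaceSize);
    void *workspace = nullptr;
    if (workspaceSize > 0)
        cudaMalloc(&workspace, workspaceSize);

    // 3. Create the execution plan (tied to the currently active device).
    cutensorContractionPlan_t plan;
    cutensorInitContractionPlan(handle, &plan, desc, &find, workspaceSize);

    // 4. Launch D = alpha * A * B + beta * C on the given stream.
    const float alpha = 1.0f, beta = 0.0f;   // assumes typeCompute == CUTENSOR_COMPUTE_32F
    cutensorContraction(handle, &plan, &alpha, A_d, B_d, &beta, C_d, D_d,
                        workspace, workspaceSize, stream);

    cudaStreamSynchronize(stream);           // ensure the workspace is no longer in use
    cudaFree(workspace);
}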


cutensorContractionMaxAlgos()

cutensorStatus_t cutensorContractionMaxAlgos(int32_t *maxNumAlgos)

Brief

This routine returns the maximum number of algorithms available to compute tensor contractions

[NOTE] Not all algorithms might be applicable to your specific problem. cutensorContraction() will return CUTENSOR_STATUS_NOT_SUPPORTED if an algorithm is not applicable.

Parameters:

maxNumAlgos[out] This value will hold the maximum number of algorithms available for cutensorContraction(). You can use the returned integer for auto-tuning purposes (i.e., iterate over all algorithms up to the returned value).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
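
A sketch of such an auto-tuning loop; the timing scheme (single run, no warm-up) and the helper name are illustrative simplifications. The descriptor, data pointers, and workspace are assumed to be set up as in the examples above.

#include <cfloat>
#include <cstdint>
#include <cuda_runtime.h>
#include <cutensor.h>

cutensorAlgo_t pickBestAlgo(const cutensorHandle_t *handle,
                            const cutensorContractionDescriptor_t *desc,
                            const void *A_d, const void *B_d, const void *C_d, void *D_d,
                            void *workspace, uint64_t workspaceSize)
{
    int32_t maxAlgos = 0;
    cutensorContractionMaxAlgos(&maxAlgos);

    const float alpha = 1.0f, beta = 0.0f;
    cutensorAlgo_t best = CUTENSOR_ALGO_DEFAULT;
    float bestMs = FLT_MAX;

    for (int32_t algo = 0; algo < maxAlgos; ++algo) {
        cutensorContractionFind_t find;
        cutensorInitContractionFind(handle, &find, (cutensorAlgo_t)algo);

        cutensorContractionPlan_t plan;
        if (cutensorInitContractionPlan(handle, &plan, desc, &find, workspaceSize)
            != CUTENSOR_STATUS_SUCCESS)
            continue;   // this algorithm is not applicable to the given problem

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        cutensorStatus_t status = cutensorContraction(handle, &plan, &alpha, A_d, B_d,
                                                      &beta, C_d, D_d,
                                                      workspace, workspaceSize, 0);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (status == CUTENSOR_STATUS_SUCCESS && ms < bestMs) {
            bestMs = ms;
            best = (cutensorAlgo_t)algo;
        }
        cudaEventDestroy(start); cudaEventDestroy(stop);
    }
    return best;
}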

Reduction Operations

The following functions perform tensor reductions.


cutensorReduction()

cutensorStatus_t cutensorReduction(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, void *workspace, uint64_t workspaceSize, cudaStream_t stream)

This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.

Brief

Implements a tensor reduction of the form

\[ D = \alpha \, opReduce(opA(A)) + \beta \, opC(C) \]

Details

For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k].

The binary opReduce operator provides extra control over the kind of reduction to be performed. For instance, opReduce == CUTENSOR_OP_ADD reduces the elements of A via a summation, while CUTENSOR_OP_MAX finds the largest element of A.

Supported data-type combinations are:

typeA       | typeB       | typeC       | typeCompute
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F  | CUTENSOR_COMPUTE_16F
CUDA_R_16F  | CUDA_R_16F  | CUDA_R_16F  | CUTENSOR_COMPUTE_32F
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_16BF
CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_32F
CUDA_R_32F  | CUDA_R_32F  | CUDA_R_32F  | CUTENSOR_COMPUTE_32F
CUDA_R_64F  | CUDA_R_64F  | CUDA_R_64F  | CUTENSOR_COMPUTE_64F
CUDA_C_32F  | CUDA_C_32F  | CUDA_C_32F  | CUTENSOR_COMPUTE_32F
CUDA_C_64F  | CUDA_C_64F  | CUDA_C_64F  | CUTENSOR_COMPUTE_64F

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • alpha[in] Scaling for A; its data type is determined by ‘typeCompute’. Pointer to the host memory.

  • A[in] Pointer to the data corresponding to A in device memory. Pointer to the GPU-accessible memory.

  • descA[in] A descriptor that holds the information about the data type, modes and strides of A.

  • modeA[in] Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).

  • beta[in] Scaling for C; its data type is determined by ‘typeCompute’. Pointer to the host memory.

  • C[in] Pointer to the data corresponding to C in device memory. Pointer to the GPU-accessible memory.

  • descC[in] A descriptor that holds the information about the data type, modes and strides of C.

  • modeC[in] Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.

  • D[out] Pointer to the data corresponding to D in device memory. Pointer to the GPU-accessible memory.

  • descD[in] Must be identical to descC for now.

  • modeD[in] Must be identical to modeC for now.

  • opReduce[in] binary operator used to reduce elements of A.

  • typeCompute[in] All arithmetic is performed using this data type (i.e., it affects the accuracy and performance).

  • workspace[out] Scratchpad (device) memory; the workspace must be aligned to 128 bytes.

  • workspaceSize[in] Please use cutensorReductionGetWorkspaceSize() to query the required workspace. While lower values, including zero, are valid, they may lead to grossly suboptimal performance.

  • stream[in] The CUDA stream in which all the computation is performed.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
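
A sketch of a partial reduction D_{m} = alpha * sum_{k,v} A_{m,k,v}, including the workspace query via cutensorReductionGetWorkspaceSize() (described below); the descriptors, device pointers, mode names, and the helper name are illustrative assumptions.

#include <cstdint>
#include <cuda_runtime.h>
#include <cutensor.h>

void reduceExample(const cutensorHandle_t *handle,
                   const cutensorTensorDescriptor_t *descA, const void *A_d,
                   const cutensorTensorDescriptor_t *descD, void *D_d,
                   cudaStream_t stream)
{
    const int32_t modeA[] = {'m', 'k', 'v'};
    const int32_t modeD[] = {'m'};           // k and v only appear in A => reduced
    const float alpha = 1.0f, beta = 0.0f;

    uint64_t workspaceSize = 0;
    cutensorReductionGetWorkspaceSize(handle,
                                      A_d, descA, modeA,
                                      D_d, descD, modeD,   // C (input)
                                      D_d, descD, modeD,   // D (output)
                                      CUTENSOR_OP_ADD, CUTENSOR_COMPUTE_32F,
                                      &workspaceSize);
    void *workspace = nullptr;
    if (workspaceSize > 0)
        cudaMalloc(&workspace, workspaceSize);

    cutensorReduction(handle,
                      &alpha, A_d, descA, modeA,
                      &beta,  D_d, descD, modeD,   // C; not read since beta == 0
                              D_d, descD, modeD,   // D
                      CUTENSOR_OP_ADD, CUTENSOR_COMPUTE_32F,
                      workspace, workspaceSize, stream);

    cudaStreamSynchronize(stream);
    cudaFree(workspace);
}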


cutensorReductionGetWorkspaceSize()

cutensorStatus_t cutensorReductionGetWorkspaceSize(const cutensorHandle_t *handle, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, uint64_t *workspaceSize)

Brief

Determines the required workspaceSize for a given tensor reduction (see cutensorReduction)

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • A[in] same as in cutensorReduction

  • descA[in] same as in cutensorReduction

  • modeA[in] same as in cutensorReduction

  • C[in] same as in cutensorReduction

  • descC[in] same as in cutensorReduction

  • modeC[in] same as in cutensorReduction

  • D[in] same as in cutensorReduction

  • descD[in] same as in cutensorReduction

  • modeD[in] same as in cutensorReduction

  • opReduce[in] same as in cutensorReduction

  • typeCompute[in] same as in cutensorReduction

  • workspaceSize[out] The workspace size (in bytes) that is required for the given tensor reduction.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorReductionGetWorkspace()

Deprecated. Use cutensorReductionGetWorkspaceSize instead.



Logger Functions

cutensorLoggerSetCallback()

cutensorStatus_t cutensorLoggerSetCallback(cutensorLoggerCallback_t callback)

Brief

This function sets the logging callback routine.

Parameters:

callback[in] Pointer to a callback function. Check cutensorLoggerCallback_t.


cutensorLoggerSetFile()

cutensorStatus_t cutensorLoggerSetFile(FILE *file)

Brief

This function sets the logging output file.

Parameters:

file[in] An open file with write permission.


cutensorLoggerOpenFile()

cutensorStatus_t cutensorLoggerOpenFile(const char *logFile)

Brief

This function opens a logging output file in the given path.

Parameters:

logFile[in] Path to the logging output file.


cutensorLoggerSetLevel()

cutensorStatus_t cutensorLoggerSetLevel(int32_t level)

Brief

This function sets the value of the logging level.

Parameters:

level[in] Log level, should be one of the following: 0. Off

  1. Errors

  2. Performance Trace

  3. Performance Hints

  4. Heuristics Trace

  5. API Trace
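
A minimal sketch of a logging setup that writes API-trace output to a file; the file name and the helper name are arbitrary.

#include <cutensor.h>

void enableLogging()
{
    cutensorLoggerOpenFile("cutensor_log.txt");  // open the logging output file
    cutensorLoggerSetLevel(5);                   // 5 == API Trace (most verbose level listed above)
}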


cutensorLoggerSetMask()

cutensorStatus_t cutensorLoggerSetMask(int32_t mask)

Brief

This function sets the value of the log mask.

Parameters:

mask[in] Log mask, the bitwise OR of the following values: 0. Off

  1. Errors

  2. Performance Trace

  4. Performance Hints

  8. Heuristics Trace

  16. API Trace


cutensorLoggerForceDisable()

cutensorStatus_t cutensorLoggerForceDisable()

Brief

This function disables logging for the entire run.