cuTENSOR Functions

Helper Functions

The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.


cutensorCreate()

cutensorStatus_t cutensorCreate(cutensorHandle_t *handle)

Initializes the cuTENSOR library and allocates the memory for the library context.

The device associated with a particular cuTENSOR handle is assumed to remain unchanged after the cutensorCreate call. In order for the cuTENSOR library to use a different device, the application must set the new device to be used by calling cudaSetDevice and then create another cuTENSOR handle, which will be associated with the new device, by calling cutensorCreate.

Moreover, each handle by default has a plan cache that stores the least recently used cutensorPlan_t objects; its default capacity is 64 plans, and it can be enlarged via cutensorHandleResizePlanCache if that is insufficient. See the Plan Cache Guide for more information.

The user is responsible for calling cutensorDestroy to free the resources associated with the handle.

Remark

blocking, non-reentrant, and thread-safe

Parameters:

handle[out] Pointer to cutensorHandle_t

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
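
For illustration, a minimal sketch of the handle lifecycle (assuming the cuTENSOR 2.x header; requires a CUDA-capable device to run):

```c
#include <cutensor.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cutensorHandle_t handle;
    cutensorStatus_t err = cutensorCreate(&handle);  /* allocates the library context */
    if (err != CUTENSOR_STATUS_SUCCESS) {
        fprintf(stderr, "cutensorCreate failed: %s\n", cutensorGetErrorString(err));
        return EXIT_FAILURE;
    }
    /* ... use the handle with the functions below ... */
    cutensorDestroy(handle);  /* release all resources tied to the handle */
    return EXIT_SUCCESS;
}
```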


cutensorDestroy()

cutensorStatus_t cutensorDestroy(cutensorHandle_t handle)

Frees all resources related to the provided library handle.

Remark

blocking, non-reentrant, and thread-safe

Parameters:

handle[inout] The cutensorHandle_t object whose resources will be released.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorCreateTensorDescriptor()

cutensorStatus_t cutensorCreateTensorDescriptor(const cutensorHandle_t handle, cutensorTensorDescriptor_t *desc, const uint32_t numModes, const int64_t extent[], const int64_t stride[], cutensorDataType_t dataType, uint32_t alignmentRequirement)

Creates a tensor descriptor.

This allocates a small amount of host memory.

The user is responsible for calling cutensorDestroyTensorDescriptor() to free the associated resources once the tensor descriptor is no longer used.

Remark

non-blocking, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] Pointer to the address where the allocated tensor descriptor object will be stored.

  • numModes[in] Number of modes.

  • extent[in] Extent of each mode (must be larger than zero).

  • stride[in] stride[i] denotes the displacement (stride) between two consecutive elements in the ith mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero; the effect of a zero stride can instead be achieved by omitting the mode entirely. For instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, write C[a,b] = A[b]; cuTENSOR will then automatically infer that the a-mode in A should be broadcast.

  • dataType[in] Data type of the stored entries.

  • alignmentRequirement[in] Alignment (in bytes) to the base pointer that will be used in conjunction with this tensor descriptor (e.g., cudaMalloc has a default alignment of 256 bytes).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor is not supported (e.g., due to non-supported data type).

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

Pre:

The extent array (and stride, if provided) must hold at least numModes entries, i.e., sizeof(int64_t) * numModes bytes each
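
As a sketch (assuming a valid handle; the variable names are illustrative), a packed FP32 tensor with three modes could be described as:

```c
/* Describe a packed FP32 tensor A[a,b,c] with extents 64 x 32 x 128.
 * `handle` is assumed to be a valid cutensorHandle_t. */
int64_t extentA[] = {64, 32, 128};
cutensorTensorDescriptor_t descA;
cutensorStatus_t err = cutensorCreateTensorDescriptor(
        handle, &descA,
        /*numModes=*/3, extentA, /*stride=*/NULL,  /* NULL => packed column-major */
        CUTENSOR_R_32F,
        /*alignmentRequirement=*/256);             /* matches cudaMalloc's alignment */
/* ... use descA ... */
cutensorDestroyTensorDescriptor(descA);
```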


cutensorDestroyTensorDescriptor()

cutensorStatus_t cutensorDestroyTensorDescriptor(cutensorTensorDescriptor_t desc)

Frees all resources related to the provided tensor descriptor.

Remark

blocking, non-reentrant, and thread-safe

Parameters:

desc[inout] The cutensorTensorDescriptor_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorGetErrorString()

const char *cutensorGetErrorString(const cutensorStatus_t error)

Returns the description string for an error code.

Remark

non-blocking, non-reentrant, and thread-safe

Parameters:

error[in] Error code to convert to string.

Returns:

The null-terminated error string.
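
A common pattern (a sketch; the macro name is illustrative) is to wrap every cuTENSOR call in an error-checking macro built on cutensorGetErrorString:

```c
/* Abort with a readable message on any cuTENSOR error. */
#define CUTENSOR_CHECK(call)                                        \
    do {                                                            \
        const cutensorStatus_t s_ = (call);                         \
        if (s_ != CUTENSOR_STATUS_SUCCESS) {                        \
            fprintf(stderr, "cuTENSOR error %s:%d: %s\n",           \
                    __FILE__, __LINE__, cutensorGetErrorString(s_));\
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

/* Usage: */
CUTENSOR_CHECK(cutensorCreate(&handle));
```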


cutensorGetVersion()

size_t cutensorGetVersion()

Returns the version number of the cuTENSOR library.


cutensorGetCudartVersion()

size_t cutensorGetCudartVersion()

Returns version number of the CUDA runtime that cuTENSOR was compiled against.

Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
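
For example, a small sketch that reports all three version numbers side by side (assuming the CUDA runtime headers are available):

```c
/* Report library and CUDA runtime versions. */
int runtimeVersion = 0;
cudaRuntimeGetVersion(&runtimeVersion);  /* runtime the application links against */
printf("cuTENSOR version:             %zu\n", cutensorGetVersion());
printf("CUDA runtime (compiled with): %zu\n", cutensorGetCudartVersion());
printf("CUDA runtime (linked):        %d\n",  runtimeVersion);
```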


Element-wise Operations

The following functions perform element-wise operations between tensors.


cutensorCreateElementwiseTrinary()

cutensorStatus_t cutensorCreateElementwiseTrinary(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descB, const int32_t modeB[], cutensorOperator_t opB, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], cutensorOperator_t opAB, cutensorOperator_t opABC, const cutensorComputeDescriptor_t descCompute)

This function creates an operation descriptor that encodes an elementwise trinary operation.

Said trinary operation has the following general form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Where

  • A,B,C,D are multi-mode tensors (of arbitrary data types).

  • \(\Pi^A, \Pi^B, \Pi^C \) are permutation operators that permute the modes of A, B, and C respectively.

  • \(op_{A},op_{B},op_{C}\) are unary element-wise operators (e.g., IDENTITY, CONJUGATE).

  • \(\Phi_{ABC}, \Phi_{AB}\) are binary element-wise operators (e.g., ADD, MUL, MAX, MIN).

Notice that the broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:

  • modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted and such an operation would be covered by either cutensorContract or cutensorReduce.

  • each mode may appear in each tensor at most once.

Input tensors may be read even if the value of the corresponding scalar is zero.

Examples:

  • \( D_{a,b,c,d} = A_{b,d,a,c}\)

  • \( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)

  • \( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)

  • \( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)

Call cutensorElementwiseTrinaryExecute to perform the actual operation.

Please use cutensorDestroyOperationDescriptor to deallocate the descriptor once it is no longer needed.

Supported data-type combinations are:

typeA            typeB            typeC            descCompute
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_16F
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_16BF
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_64F

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] This opaque struct gets allocated and filled with the information that encodes the requested elementwise operation.

  • descA[in] A descriptor that holds the information about the data type, modes, and strides of A.

  • modeA[in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if \(A_{a,b,c}\) then modeA = {‘a’,’b’,’c’}). The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opA[in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.

  • descB[in] A descriptor that holds information about the data type, modes, and strides of B.

  • modeB[in] Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opB[in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.

  • descC[in] A descriptor that holds information about the data type, modes, and strides of C.

  • modeC[in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. The modeC[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.

  • opC[in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.

  • descD[in] A descriptor that holds information about the data type, modes, and strides of D. Notice that we currently request descD and descC to be identical.

  • modeD[in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. The modeD[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.

  • opAB[in] Element-wise binary operator (see \(\Phi_{AB}\) above).

  • opABC[in] Element-wise binary operator (see \(\Phi_{ABC}\) above).

  • descCompute[in] Determines the precision in which this operation is performed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_ARCH_MISMATCH – if the device is either not ready, or the target architecture is not supported.


cutensorElementwiseTrinaryExecute()

cutensorStatus_t cutensorElementwiseTrinaryExecute(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *beta, const void *B, const void *gamma, const void *C, void *D, cudaStream_t stream)

Performs an element-wise tensor operation for three input tensors (see cutensorCreateElementwiseTrinary)

This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

See cutensorCreateElementwiseTrinary() for details.

Remark

calls asynchronous functions, no reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Opaque handle holding all information about the desired elementwise operation (created by cutensorCreateElementwiseTrinary followed by cutensorCreatePlan).

  • alpha[in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.

  • A[in] Multi-mode tensor (described by descA as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory.

  • beta[in] Scaling factor for B (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.

  • B[in] Multi-mode tensor (described by descB as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory.

  • gamma[in] Scaling factor for C (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.

  • C[in] Multi-mode tensor (described by descC as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory.

  • D[out] Multi-mode tensor (described by descD as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory (C and D may be identical, if and only if descC == descD).

  • stream[in] The CUDA stream used to perform the operation.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
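
The full create/plan/execute sequence can be sketched as follows (a sketch only: descA..descC, the mode arrays, the device pointers A_d..D_d, the stream, and a CHECK macro are assumed to exist; plan creation follows the generic cutensorCreatePlanPreference/cutensorCreatePlan flow):

```c
/* D = 2.2*A + 1.3*B + C, all CUTENSOR_R_32F. */
cutensorOperationDescriptor_t op;
CHECK(cutensorCreateElementwiseTrinary(handle, &op,
        descA, modeA, CUTENSOR_OP_IDENTITY,
        descB, modeB, CUTENSOR_OP_IDENTITY,
        descC, modeC, CUTENSOR_OP_IDENTITY,
        descC, modeC,                     /* descD/modeD must equal descC/modeC */
        CUTENSOR_OP_ADD, CUTENSOR_OP_ADD, /* opAB, opABC */
        CUTENSOR_COMPUTE_DESC_32F));

cutensorPlanPreference_t pref;
CHECK(cutensorCreatePlanPreference(handle, &pref,
        CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE));
cutensorPlan_t plan;
CHECK(cutensorCreatePlan(handle, &plan, op, pref, /*workspaceSizeLimit=*/0));

float alpha = 2.2f, beta = 1.3f, gamma = 1.0f;
CHECK(cutensorElementwiseTrinaryExecute(handle, plan,
        &alpha, A_d, &beta, B_d, &gamma, C_d, D_d, stream));

cutensorDestroyPlan(plan);
cutensorDestroyPlanPreference(pref);
cutensorDestroyOperationDescriptor(op);
```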


cutensorCreateElementwiseBinary()

cutensorStatus_t cutensorCreateElementwiseBinary(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], cutensorOperator_t opAC, const cutensorComputeDescriptor_t descCompute)

This function creates an operation descriptor for an elementwise binary operation.

The binary operation has the following general form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Call cutensorElementwiseBinaryExecute to perform the actual operation.

Supported data-type combinations are:

typeA            typeC            descCompute
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_16F
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_16BF
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_R_32F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_64F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_64F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_64F

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] This opaque struct gets allocated and filled with the information that encodes the requested elementwise operation.

  • descA[in] The descriptor that holds the information about the data type, modes, and strides of A.

  • modeA[in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’}). The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opA[in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.

  • descC[in] The descriptor that holds information about the data type, modes, and strides of C.

  • modeC[in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. The modeC[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.

  • opC[in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.

  • descD[in] The descriptor that holds information about the data type, modes, and strides of D. Notice that we currently request descD and descC to be identical.

  • modeD[in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. The modeD[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.

  • opAC[in] Element-wise binary operator (see \(\Phi_{AC}\) above).

  • descCompute[in] Determines the precision in which this operation is performed.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.


cutensorElementwiseBinaryExecute()

cutensorStatus_t cutensorElementwiseBinaryExecute(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *gamma, const void *C, void *D, cudaStream_t stream)

Performs an element-wise tensor operation for two input tensors (see cutensorCreateElementwiseBinary)

This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

See cutensorCreateElementwiseBinary() for details.

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Opaque handle holding all information about the desired elementwise operation (created by cutensorCreateElementwiseBinary followed by cutensorCreatePlan).

  • alpha[in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.

  • A[in] Multi-mode tensor (described by descA as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory.

  • gamma[in] Scaling factor for C (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.

  • C[in] Multi-mode tensor (described by descC as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory.

  • D[out] Multi-mode tensor (described by descD as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory (C and D may be identical, if and only if descC == descD).

  • stream[in] The CUDA stream used to perform the operation.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
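
Given a plan built from cutensorCreateElementwiseBinary with opAC = CUTENSOR_OP_ADD, the execute call can be sketched as (variable names are illustrative):

```c
/* D = alpha*A + gamma*C; with gamma = -1 this computes D = A - C. */
float alpha = 1.0f, gamma = -1.0f;
CHECK(cutensorElementwiseBinaryExecute(handle, plan,
        &alpha, A_d, &gamma, C_d, D_d, stream));
```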


cutensorCreatePermutation()

cutensorStatus_t cutensorCreatePermutation(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descB, const int32_t modeB[], const cutensorComputeDescriptor_t descCompute)

This function creates an operation descriptor for a tensor permutation.

The tensor permutation has the following general form:

\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}) \]

Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorCreateElementwiseBinary.

Where

  • A and B are multi-mode tensors (of arbitrary data types),

  • \(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,

  • \(op_A\) is a unary element-wise operator (e.g., IDENTITY, SQR, CONJUGATE), supplied via the opA argument.

Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Modes may appear in any order. The only restrictions are:

  • modes that appear in A must also appear in the output tensor.

  • each mode may appear in each tensor at most once.

Supported data-type combinations are:

typeA            typeB            descCompute
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_16F
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_16F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_32F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_16BF
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_R_32F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_R_64F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_32F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_C_64F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_64F

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] This opaque struct gets allocated and filled with the information that encodes the requested permutation.

  • descA[in] The descriptor that holds information about the data type, modes, and strides of A.

  • modeA[in] Array of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’})

  • opA[in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.

  • descB[in] The descriptor that holds information about the data type, modes, and strides of B.

  • modeB[in] Array of size descB->numModes that holds the names of the modes of B

  • descCompute[in] Determines the precision in which this operation is performed.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.


cutensorPermute()

cutensorStatus_t cutensorPermute(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, void *B, const cudaStream_t stream)

Performs the tensor permutation that is encoded by plan (see cutensorCreatePermutation).

This function performs an elementwise tensor operation of the form:

\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]

Consequently, this function performs an out-of-place tensor permutation.

Where

  • A and B are multi-mode tensors (of arbitrary data types),

  • \(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,

  • \(\Psi\) is a unary element-wise operator (e.g., IDENTITY, SQR, CONJUGATE), as specified by the opA argument of cutensorCreatePermutation.

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Opaque handle holding all information about the desired tensor permutation (created by cutensorCreatePermutation followed by cutensorCreatePlan).

  • alpha[in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.

  • A[in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.

  • B[inout] Multi-mode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.

  • stream[in] The CUDA stream.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
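
A sketch of a plain FP32 transpose (descA/descB, the device pointers, the plan-creation step, and the CHECK macro are assumed as in the earlier examples):

```c
/* B[c,a,b] = alpha * A[a,b,c] */
int32_t modeA[] = {'a', 'b', 'c'};
int32_t modeB[] = {'c', 'a', 'b'};
cutensorOperationDescriptor_t op;
CHECK(cutensorCreatePermutation(handle, &op,
        descA, modeA, CUTENSOR_OP_IDENTITY,
        descB, modeB, CUTENSOR_COMPUTE_DESC_32F));
/* ... create a plan preference and plan as for any other operation ... */
float alpha = 1.0f;
CHECK(cutensorPermute(handle, plan, &alpha, A_d, B_d, stream));
```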


Contraction Operations

The following functions perform contractions between tensors.


cutensorCreateContraction()

cutensorStatus_t cutensorCreateContraction(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descB, const int32_t modeB[], cutensorOperator_t opB, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], const cutensorComputeDescriptor_t descCompute)

This function allocates a cutensorOperationDescriptor_t object that encodes a tensor contraction of the form \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \).

Allocates data for desc to be used to perform a tensor contraction of the form

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(B_{{modes}_\mathcal{B}}) + \beta op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}). \]

See cutensorCreatePlan (or cutensorCreatePlanAutotuned) to create the plan (i.e., to select the kernel) followed by a call to cutensorContract to perform the actual contraction.

The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.

Supported data-type combinations are:

typeA            typeB            typeC            descCompute                    typeScalar       Tensor Core
CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F      CUTENSOR_R_32F   Volta+
CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_32F      CUTENSOR_R_32F   Ampere+
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_32F      CUTENSOR_R_32F   No
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_TF32     CUTENSOR_R_32F   Ampere+
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_3XTF32   CUTENSOR_R_32F   Ampere+
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_16BF     CUTENSOR_R_32F   Ampere+
CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_16F      CUTENSOR_R_32F   Volta+
CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_64F      CUTENSOR_R_64F   Ampere+
CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_32F      CUTENSOR_R_64F   No
CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_32F      CUTENSOR_C_32F   No
CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_TF32     CUTENSOR_C_32F   Ampere+
CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_3XTF32   CUTENSOR_C_32F   Ampere+
CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F      CUTENSOR_C_64F   Ampere+
CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_32F      CUTENSOR_C_64F   No
CUTENSOR_R_64F   CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F      CUTENSOR_C_64F   No
CUTENSOR_C_64F   CUTENSOR_R_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F      CUTENSOR_C_64F   No

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] This opaque struct gets allocated and filled with the information that encodes the tensor contraction operation.

  • descA[in] The descriptor that holds the information about the data type, modes, and strides of A.

  • modeA[in] Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opA[in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.

  • descB[in] The descriptor that holds information about the data type, modes, and strides of B.

  • modeB[in] Array with ‘nmodeB’ entries that represent the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opB[in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.

  • descC[in] The descriptor that holds information about the data type, modes, and strides of C.

  • modeC[in] Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opC[in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.

  • descD[in] The descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).

  • modeD[in] Array with ‘nmodeD’ entries that represent the modes of D (must be identical to modeC for now). modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • descCompute[in] Datatype for the intermediate computation T = A * B.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported

  • CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.


cutensorContract()

cutensorStatus_t cutensorContract(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *B, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)

This routine computes the tensor contraction \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \).

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha * \mathcal{A}_{{modes}_\mathcal{A}} B_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]

The active CUDA device must match the CUDA device that was active at the time at which the plan was created.

Example

See https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR/contraction.cu for a concrete example.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Opaque handle holding the contraction execution plan (created by cutensorCreateContraction followed by cutensorCreatePlan).

  • alpha[in] Scaling for A*B. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.

  • A[in] Pointer to the data corresponding to A. Pointer to the GPU-accessible memory.

  • B[in] Pointer to the data corresponding to B. Pointer to the GPU-accessible memory.

  • beta[in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.

  • C[in] Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.

  • D[out] Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.

  • workspace[out] Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).

  • workspaceSize[in] Size of the workspace array in bytes; please refer to cutensorEstimateWorkspaceSize to query the required workspace. While cutensorContract does not strictly require a workspace for the contraction, it is still recommended to provide some small workspace (e.g., 128 MB).

  • stream[in] The CUDA stream in which all the computation is performed.

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.

  • CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.

  • CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).


Reduction Operations

The following functions perform tensor reductions.


cutensorCreateReduction()

cutensorStatus_t cutensorCreateReduction(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], cutensorOperator_t opReduce, const cutensorComputeDescriptor_t descCompute)

Creates a cutensorOperationDescriptor_t object that encodes a tensor reduction of the form \( D = alpha * opReduce(opA(A)) + beta * opC(C) \).

For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k];

This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.

The binary opReduce operator provides extra control over the kind of reduction to be performed. For instance, setting opReduce to CUTENSOR_OP_ADD reduces the elements of A via a summation, while CUTENSOR_OP_MAX finds the largest element of A.

Supported data-type combinations are:

  typeA            typeB            typeC            typeCompute
  ---------------  ---------------  ---------------  --------------------------
  CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_16F
  CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_R_16F   CUTENSOR_COMPUTE_DESC_32F
  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_16BF
  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_R_16BF  CUTENSOR_COMPUTE_DESC_32F
  CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_R_32F   CUTENSOR_COMPUTE_DESC_32F
  CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_R_64F   CUTENSOR_COMPUTE_DESC_64F
  CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_C_32F   CUTENSOR_COMPUTE_DESC_32F
  CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_C_64F   CUTENSOR_COMPUTE_DESC_64F

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[out] This opaque struct gets allocated and filled with the information that encodes the requested tensor reduction operation.

  • descA[in] The descriptor that holds the information about the data type, modes and strides of A.

  • modeA[in] Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).

  • opA[in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.

  • descC[in] The descriptor that holds the information about the data type, modes and strides of C.

  • modeC[in] Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.

  • opC[in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.

  • descD[in] Must be identical to descC for now.

  • modeD[in] Must be identical to modeC for now.

  • opReduce[in] Binary operator used to reduce the elements of A.

  • descCompute[in] All arithmetic is performed using this compute descriptor (i.e., it affects the accuracy and performance).

Return values:
  • CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).

  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.


cutensorReduce()

cutensorStatus_t cutensorReduce(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)

Performs the tensor reduction that is encoded by plan (see cutensorCreateReduction).

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Opaque handle holding the reduction execution plan (created by cutensorCreateReduction followed by cutensorCreatePlan).
  • alpha[in] Scaling for A. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.

  • A[in] Pointer to the data corresponding to A in device memory. Pointer to the GPU-accessible memory.

  • beta[in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.

  • C[in] Pointer to the data corresponding to C in device memory. Pointer to the GPU-accessible memory.

  • D[out] Pointer to the data corresponding to D in device memory. Pointer to the GPU-accessible memory.

  • workspace[out] Scratchpad (device) memory of size —at least— workspaceSize bytes; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).

  • workspaceSize[in] Please use cutensorEstimateWorkspaceSize() to query the required workspace.

  • stream[in] The CUDA stream in which all the computation is performed.

Return values:

CUTENSOR_STATUS_SUCCESS – The operation completed successfully.


Generic Operation Functions

The following functions are generic and work with all the different operations.


cutensorDestroyOperationDescriptor()

cutensorStatus_t cutensorDestroyOperationDescriptor(cutensorOperationDescriptor_t desc)

Frees all resources related to the provided descriptor.

Remark

blocking, no reentrant, and thread-safe

Parameters:

desc[inout] The cutensorOperationDescriptor_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorOperationDescriptorGetAttribute()

cutensorStatus_t cutensorOperationDescriptorGetAttribute(const cutensorHandle_t handle, cutensorOperationDescriptor_t desc, cutensorOperationDescriptorAttribute_t attr, void *buf, size_t sizeInBytes)

This function retrieves an attribute of the provided cutensorOperationDescriptor_t object (see cutensorOperationDescriptorAttribute_t).

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[in] The cutensorOperationDescriptor_t object whose attribute is queried.

  • attr[in] Specifies the attribute that will be retrieved.

  • buf[out] This buffer (of size sizeInBytes) will hold the requested attribute of the provided cutensorOperationDescriptor_t object.

  • sizeInBytes[in] Size of buf (in bytes); see cutensorOperationDescriptorAttribute_t for the exact size.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorOperationDescriptorSetAttribute()

cutensorStatus_t cutensorOperationDescriptorSetAttribute(const cutensorHandle_t handle, cutensorOperationDescriptor_t desc, cutensorOperationDescriptorAttribute_t attr, const void *buf, size_t sizeInBytes)

Set attribute of a cutensorOperationDescriptor_t object.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[inout] Operation descriptor that will be modified.

  • attr[in] Specifies the attribute that will be set.

  • buf[in] This buffer (of size sizeInBytes) determines the value to which attr will be set.

  • sizeInBytes[in] Size of buf (in bytes).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorCreatePlanPreference()

cutensorStatus_t cutensorCreatePlanPreference(const cutensorHandle_t handle, cutensorPlanPreference_t *pref, cutensorAlgo_t algo, cutensorJitMode_t jitMode)

Allocates the cutensorPlanPreference_t, enabling users to limit the applicable kernels for a given plan/operation.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • pref[out] Pointer to the structure holding the cutensorPlanPreference_t allocated by this function. See cutensorPlanPreference_t.

  • algo[in] Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If a specified algorithm is not supported, CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.

  • jitMode[in] Determines if cuTENSOR is allowed to use JIT-compiled kernels (leading to a longer plan-creation phase); see cutensorJitMode_t.


cutensorDestroyPlanPreference()

cutensorStatus_t cutensorDestroyPlanPreference(cutensorPlanPreference_t pref)

Frees all resources related to the provided preference.

Remark

blocking, no reentrant, and thread-safe

Parameters:

pref[inout] The cutensorPlanPreference_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorPlanPreferenceSetAttribute()

cutensorStatus_t cutensorPlanPreferenceSetAttribute(const cutensorHandle_t handle, cutensorPlanPreference_t pref, cutensorPlanPreferenceAttribute_t attr, const void *buf, size_t sizeInBytes)

Set attribute of a cutensorPlanPreference_t object.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • pref[inout] This opaque struct restricts the search space of viable candidates.

  • attr[in] Specifies the attribute that will be set.

  • buf[in] This buffer (of size sizeInBytes) determines the value to which attr will be set.

  • sizeInBytes[in] Size of buf (in bytes); see cutensorPlanPreferenceAttribute_t for the exact size.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorEstimateWorkspaceSize()

cutensorStatus_t cutensorEstimateWorkspaceSize(const cutensorHandle_t handle, const cutensorOperationDescriptor_t desc, const cutensorPlanPreference_t planPref, const cutensorWorksizePreference_t workspacePref, uint64_t *workspaceSizeEstimate)

Determines the required workspaceSize for the given operation encoded by desc.

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • desc[in] This opaque struct encodes the operation.

  • planPref[in] This opaque struct restricts the space of viable candidates.

  • workspacePref[in] This parameter influences the size of the workspace; see cutensorWorksizePreference_t for details.

  • workspaceSizeEstimate[out] The workspace size (in bytes) that is required for the given operation.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorCreatePlan()

cutensorStatus_t cutensorCreatePlan(const cutensorHandle_t handle, cutensorPlan_t *plan, const cutensorOperationDescriptor_t desc, const cutensorPlanPreference_t pref, uint64_t workspaceSizeLimit)

This function allocates a cutensorPlan_t object, selects an appropriate kernel for a given operation (encoded by desc) and prepares a plan that encodes the execution.

This function applies cuTENSOR’s heuristic to select a candidate/kernel for a given operation (created by either cutensorCreateContraction, cutensorCreateReduction, cutensorCreatePermutation, cutensorCreateElementwiseBinary, or cutensorCreateElementwiseTrinary). The created plan can then be passed to either cutensorContract, cutensorReduce, cutensorPermute, cutensorElementwiseBinaryExecute, or cutensorElementwiseTrinaryExecute to perform the actual operation.

The plan is created for the active CUDA device.

Note: cutensorCreatePlan must not be captured via CUDA graphs if Just-In-Time compilation is enabled (i.e., cutensorJitMode_t is not CUTENSOR_JIT_MODE_NONE).

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[out] Pointer to the cutensorPlan_t object that is allocated by this function.

  • desc[in] This opaque struct encodes the requested operation.

  • pref[in] This opaque struct restricts the search space of viable candidates.

  • workspaceSizeLimit[in] Denotes the maximal workspace (in bytes) that the plan is allowed to use (see cutensorEstimateWorkspaceSize).
Return values:
  • CUTENSOR_STATUS_SUCCESS – If a viable candidate has been found.

  • CUTENSOR_STATUS_NOT_SUPPORTED – If no viable candidate could be found.

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.

  • CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace was insufficient.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).


cutensorDestroyPlan()

cutensorStatus_t cutensorDestroyPlan(cutensorPlan_t plan)

Frees all resources related to the provided plan.

Remark

blocking, no reentrant, and thread-safe

Parameters:

plan[inout] The cutensorPlan_t object that will be deallocated.

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorPlanGetAttribute()

cutensorStatus_t cutensorPlanGetAttribute(const cutensorHandle_t handle, const cutensorPlan_t plan, cutensorPlanAttribute_t attr, void *buf, size_t sizeInBytes)

Retrieves information about an already-created plan (see cutensorPlanAttribute_t).

Parameters:
  • handle[in] Opaque handle holding cuTENSOR’s library context.

  • plan[in] Denotes an already-created plan (e.g., via cutensorCreatePlan or cutensorCreatePlanAutotuned)

  • attr[in] Requested attribute.

  • buf[out] On successful exit: Holds the information of the requested attribute.

  • sizeInBytes[in] size of buf in bytes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).




Logger Functions

cutensorLoggerSetCallback()

cutensorStatus_t cutensorLoggerSetCallback(cutensorLoggerCallback_t callback)

This function sets the logging callback routine.

Parameters:

callback[in] Pointer to a callback function. Check cutensorLoggerCallback_t.


cutensorLoggerSetFile()

cutensorStatus_t cutensorLoggerSetFile(FILE *file)

This function sets the logging output file.

Parameters:

file[in] An open file with write permission.


cutensorLoggerOpenFile()

cutensorStatus_t cutensorLoggerOpenFile(const char *logFile)

This function opens a logging output file in the given path.

Parameters:

logFile[in] Path to the logging output file.


cutensorLoggerSetLevel()

cutensorStatus_t cutensorLoggerSetLevel(int32_t level)

This function sets the value of the logging level.

Parameters:

level[in] Log level; should be one of the following:

  0. Off

  1. Errors

  2. Performance Trace

  3. Performance Hints

  4. Heuristics Trace

  5. API Trace


cutensorLoggerSetMask()

cutensorStatus_t cutensorLoggerSetMask(int32_t mask)

This function sets the value of the log mask.

Parameters:

mask[in] Log mask; the bitwise OR of the following:

  0. Off

  1. Errors

  2. Performance Trace

  4. Performance Hints

  8. Heuristics Trace

  16. API Trace


cutensorLoggerForceDisable()

cutensorStatus_t cutensorLoggerForceDisable()

This function disables logging for the entire run.