cuTENSOR Functions
Helper Functions
The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.
cutensorCreate()
cutensorStatus_t cutensorCreate(cutensorHandle_t *handle)
Initializes the cuTENSOR library and allocates the memory for the library context.
The device associated with a particular cuTENSOR handle is assumed to remain unchanged after the cutensorCreate call. In order for the cuTENSOR library to use a different device, the application must set the new device to be used by calling cudaSetDevice and then create another cuTENSOR handle, which will be associated with the new device, by calling cutensorCreate.
Moreover, each handle by default has a plan cache that can store the least recently used cutensorPlan_t; its default capacity is 64, but it can be changed via cutensorHandleResizePlanCache if this is too little storage space. See the Plan Cache Guide for more information.
The user is responsible for calling cutensorDestroy to free the resources associated with the handle.
Remark
blocking, not reentrant, and thread-safe
 Parameters:
handle – [out] Pointer to cutensorHandle_t
 Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorDestroy()
cutensorStatus_t cutensorDestroy(cutensorHandle_t handle)
Frees all resources related to the provided library handle.
Remark
blocking, not reentrant, and thread-safe
 Parameters:
handle – [inout] Pointer to cutensorHandle_t
 Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorCreateTensorDescriptor()
cutensorStatus_t cutensorCreateTensorDescriptor(const cutensorHandle_t handle, cutensorTensorDescriptor_t *desc, const uint32_t numModes, const int64_t extent[], const int64_t stride[], cutensorDataType_t dataType, uint32_t alignmentRequirement)
Creates a tensor descriptor.
This allocates a small amount of host memory.
The user is responsible for calling cutensorDestroyTensorDescriptor() to free the associated resources once the tensor descriptor is no longer used.
Remark
non-blocking, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] Pointer to the address where the allocated tensor descriptor object will be stored.
numModes – [in] Number of modes.
extent – [in] Extent of each mode (must be larger than zero).
stride – [in] stride[i] denotes the displacement (a.k.a. stride), in elements of the base type, between two consecutive elements in the i-th mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero. The effect of a zero stride can instead be achieved by omitting the mode entirely; for instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, you can write C[a,b] = A[b] directly; cuTENSOR will then automatically infer that the a-mode in A should be broadcast.
dataType – [in] Data type of the stored entries.
alignmentRequirement – [in] Alignment (in bytes) of the base pointer that will be used in conjunction with this tensor descriptor (e.g., cudaMalloc has a default alignment of 256 bytes).
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor is not supported (e.g., due to an unsupported data type).
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
 Pre:
extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes
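To make the packed generalized column-major layout concrete, here is a minimal pure-Python sketch of the strides one would expect when stride is NULL. It assumes the usual convention that the first mode has stride 1 and each later stride is the product of all preceding extents. It is a CPU-side illustration only and makes no cuTENSOR calls; the helper names packed_column_major_strides and element_offset are hypothetical.

```python
def packed_column_major_strides(extent):
    """Strides (in elements) of a packed generalized column-major layout.

    Assumption: stride[0] == 1 and stride[i] == stride[i-1] * extent[i-1].
    """
    if any(e <= 0 for e in extent):
        raise ValueError("each extent must be larger than zero")
    strides = []
    running = 1
    for e in extent:
        strides.append(running)
        running *= e
    return strides


def element_offset(index, stride):
    """Offset (in elements) of a multi-index from the base pointer."""
    return sum(i * s for i, s in zip(index, stride))


extent = [4, 3, 2]
stride = packed_column_major_strides(extent)
print(stride)                             # [1, 4, 12]
print(element_offset((1, 2, 1), stride))  # 21
```

The strides increase monotonically from left to right, which matches the description of the default layout above.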
cutensorDestroyTensorDescriptor()
cutensorStatus_t cutensorDestroyTensorDescriptor(cutensorTensorDescriptor_t desc)
Frees all resources related to the provided tensor descriptor.
Remark
blocking, not reentrant, and thread-safe
 Parameters:
desc – [inout] The cutensorTensorDescriptor_t object that will be deallocated.
 Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorGetErrorString()
const char *cutensorGetErrorString(const cutensorStatus_t error)
Returns the description string for an error code.
Remark
non-blocking, not reentrant, and thread-safe
 Parameters:
error – [in] Error code to convert to string.
 Return values:
The null-terminated error string.
cutensorGetVersion()
size_t cutensorGetVersion()
Returns the version number of the cuTENSOR library.
cutensorGetCudartVersion()
size_t cutensorGetCudartVersion()
Returns the version number of the CUDA runtime that cuTENSOR was compiled against.
Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
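The comparison mentioned above can be sketched in pure Python as follows. The integer encodings are assumptions not stated in this section: the cuTENSOR version is taken to be MAJOR * 10000 + MINOR * 100 + PATCH, and the CUDA runtime version to be MAJOR * 1000 + MINOR * 10. Check both against the headers you build with. The helper names are hypothetical, and the sample values are made up for illustration.

```python
def decode_cutensor_version(version):
    """Split a cuTENSOR version integer into (major, minor, patch).

    Assumption: version == major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def decode_cudart_version(version):
    """Split a CUDA runtime version integer into (major, minor).

    Assumption: version == major * 1000 + minor * 10.
    """
    major, rest = divmod(version, 1000)
    return major, rest // 10


def same_cuda_major(compiled_against, runtime):
    """True if both CUDA runtime version integers share a major version."""
    return decode_cudart_version(compiled_against)[0] == decode_cudart_version(runtime)[0]


print(decode_cutensor_version(20001))   # (2, 0, 1)
print(decode_cudart_version(12040))     # (12, 4)
print(same_cuda_major(12040, 12020))    # True
```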
Elementwise Operations
The following functions perform elementwise operations between tensors.
cutensorCreateElementwiseTrinary()
cutensorStatus_t cutensorCreateElementwiseTrinary(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descB, const int32_t modeB[], cutensorOperator_t opB, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], cutensorOperator_t opAB, cutensorOperator_t opABC, const cutensorComputeDescriptor_t descCompute)
This function creates an operation descriptor that encodes an elementwise trinary operation.
Said trinary operation has the following general form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Where
A,B,C,D are multi-mode tensors (of arbitrary data types).
\(\Pi^A, \Pi^B, \Pi^C \) are permutation operators that permute the modes of A, B, and C respectively.
\(op_{A},op_{B},op_{C}\) are unary elementwise operators (e.g., IDENTITY, CONJUGATE).
\(\Phi_{ABC}, \Phi_{AB}\) are binary elementwise operators (e.g., ADD, MUL, MAX, MIN).
Notice that the broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:
modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted and such an operation would be covered by either cutensorContract or cutensorReduce.
each mode may appear in each tensor at most once.
Input tensors may be read even if the value of the corresponding scalar is zero.
Examples:
\( D_{a,b,c,d} = A_{b,d,a,c}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)
\( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)
Call cutensorElementwiseTrinaryExecute to perform the actual operation.
Please use cutensorDestroyOperationDescriptor to deallocate the descriptor once it is no longer used.
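The general form and the broadcasting rule can be modeled on the CPU with a few lines of pure Python. This is only a reference sketch of the semantics under stated assumptions, not cuTENSOR code: tensors are plain dicts keyed by index tuples, mode names are strings, and the helpers make_tensor and elementwise_trinary are hypothetical.

```python
import itertools
import operator


def make_tensor(extent, modes, fn):
    """Build a tensor as a dict keyed by index tuples in the order given by modes."""
    ranges = [range(extent[m]) for m in modes]
    return {idx: fn(*idx) for idx in itertools.product(*ranges)}


def elementwise_trinary(extent, mode_d, alpha, a, mode_a, beta, b, mode_b,
                        gamma, c, mode_c, op_ab=operator.add, op_abc=operator.add,
                        op_a=lambda x: x, op_b=lambda x: x, op_c=lambda x: x):
    """D = op_abc(op_ab(alpha*op_a(A), beta*op_b(B)), gamma*op_c(C)), mode by mode."""
    d = {}
    for idx in itertools.product(*(range(extent[m]) for m in mode_d)):
        pos = dict(zip(mode_d, idx))  # mode name -> index value

        def at(tensor, modes):
            # A mode missing from `modes` is simply ignored, which broadcasts it.
            return tensor[tuple(pos[m] for m in modes)]

        ab = op_ab(alpha * op_a(at(a, mode_a)), beta * op_b(at(b, mode_b)))
        d[idx] = op_abc(ab, gamma * op_c(at(c, mode_c)))
    return d


extent = {"a": 2, "b": 3}
A = make_tensor(extent, ("b", "a"), lambda b, a: 10 * b + a)
B = make_tensor(extent, ("a",), lambda a: a + 1)      # no b-mode: broadcast along b
C = make_tensor(extent, ("a", "b"), lambda a, b: 25)

# D[a,b] = min(2 * A[b,a] + 1 * B[a], C[a,b])
D = elementwise_trinary(extent, ("a", "b"), 2, A, ("b", "a"), 1, B, ("a",),
                        1, C, ("a", "b"), op_ab=operator.add, op_abc=min)
print(D[(1, 1)], D[(1, 2)])  # 24 25
```

Because B omits the b-mode, the same B element is reused for every value of b, which is the broadcasting behavior described above.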
Supported datatype combinations are:
| typeA | typeB | typeC | descCompute |
|---|---|---|---|
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_64F |
Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested elementwise operation.
descA – [in] A descriptor that holds the information about the data type, modes, and strides of A.
modeA – [in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if \(A_{a,b,c}\) then modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] A descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.
descC – [in] A descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] A descriptor that holds information about the data type, modes, and strides of D. Notice that descD and descC are currently required to be identical.
modeD – [in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opAB – [in] Elementwise binary operator (see \(\Phi_{AB}\) above).
opABC – [in] Elementwise binary operator (see \(\Phi_{ABC}\) above).
descCompute – [in] Determines the precision in which this operation is performed.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_ARCH_MISMATCH – if the device is either not ready, or the target architecture is not supported.
cutensorElementwiseTrinaryExecute()
cutensorStatus_t cutensorElementwiseTrinaryExecute(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *beta, const void *B, const void *gamma, const void *C, void *D, cudaStream_t stream)
Performs an elementwise tensor operation for three input tensors (see cutensorCreateElementwiseTrinary).
This function performs an elementwise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
See cutensorCreateElementwiseTrinary() for details.
Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding all information about the desired elementwise operation (created by cutensorCreateElementwiseTrinary followed by cutensorCreatePlan).
alpha – [in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor (described by descA as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
beta – [in] Scaling factor for B (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.
B – [in] Multi-mode tensor (described by descB as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
gamma – [in] Scaling factor for C (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
C – [in] Multi-mode tensor (described by descC as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
D – [out] Multi-mode tensor (described by descD as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory (C and D may be identical, if and only if descC == descD).
stream – [in] The CUDA stream used to perform the operation.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorCreateElementwiseBinary()
cutensorStatus_t cutensorCreateElementwiseBinary(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], cutensorOperator_t opAC, const cutensorComputeDescriptor_t descCompute)
This function creates an operation descriptor for an elementwise binary operation.
The binary operation has the following general form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Call cutensorElementwiseBinaryExecute to perform the actual operation.
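As a worked instance of the general form (an illustrative choice of operators and values, not taken from the cuTENSOR documentation), take \(\Phi_{AC}\) = ADD, \(\Psi_A = \Psi_C\) = IDENTITY, \(\alpha = 2\) and \(\gamma = 1\), with A holding the modes (b,a), C and D holding the modes (a,b), and every extent equal to 2. Writing each matrix with its first mode as the row index:
\[ D_{a,b} = 2 A_{b,a} + C_{a,b}, \quad A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad C = \begin{pmatrix} 10 & 20 \\ 30 & 40 \end{pmatrix} \quad \Rightarrow \quad D = \begin{pmatrix} 12 & 26 \\ 34 & 48 \end{pmatrix} \]
For example, \(D_{0,1} = 2 A_{1,0} + C_{0,1} = 2 \cdot 3 + 20 = 26\), so A enters the sum transposed.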
Supported datatype combinations are:
| typeA | typeC | descCompute |
|---|---|---|
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_R_32F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_64F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_64F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_64F |
Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested elementwise operation.
descA – [in] The descriptor that holds the information about the data type, modes, and strides of A.
modeA – [in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if \(A_{a,b,c}\) then modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descC – [in] The descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] The descriptor that holds information about the data type, modes, and strides of D. Notice that descD and descC are currently required to be identical.
modeD – [in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opAC – [in] Elementwise binary operator (see \(\Phi_{AC}\) above).
descCompute – [in] Determines the precision in which this operation is performed.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorElementwiseBinaryExecute()
cutensorStatus_t cutensorElementwiseBinaryExecute(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *gamma, const void *C, void *D, cudaStream_t stream)
Performs an elementwise tensor operation for two input tensors (see cutensorCreateElementwiseBinary).
This function performs an elementwise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
See cutensorCreateElementwiseBinary() for details.
Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding all information about the desired elementwise operation (created by cutensorCreateElementwiseBinary followed by cutensorCreatePlan).
alpha – [in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor (described by descA as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
gamma – [in] Scaling factor for C (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
C – [in] Multi-mode tensor (described by descC as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
D – [out] Multi-mode tensor (described by descD as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory (C and D may be identical, if and only if descC == descD).
stream – [in] The CUDA stream used to perform the operation.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorCreatePermutation()
cutensorStatus_t cutensorCreatePermutation(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descB, const int32_t modeB[], const cutensorComputeDescriptor_t descCompute)
This function creates an operation descriptor for a tensor permutation.
The tensor permutation has the following general form:
\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}) \]
Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorCreateElementwiseBinary.
Where
A and B are multi-mode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively, and
\(op_A\) is a unary elementwise operator (e.g., IDENTITY, SQR, CONJUGATE), specified via the opA argument.
Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Modes may appear in any order. The only restrictions are:
modes that appear in A must also appear in the output tensor.
each mode may appear in each tensor at most once.
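The following pure-Python sketch models an out-of-place permutation on flat buffers, including the broadcasting case. It is a CPU reference for the semantics only, not cuTENSOR code, and it assumes both tensors use the packed column-major layout (first mode has stride 1). The helpers packed_strides and permute are hypothetical names.

```python
import itertools


def packed_strides(extent):
    """Packed column-major strides: stride[0] = 1, then running products."""
    strides, running = [], 1
    for e in extent:
        strides.append(running)
        running *= e
    return strides


def permute(alpha, a, mode_a, mode_b, extent, op_a=lambda x: x):
    """Return flat B with B[mode_b] = alpha * op_a(A[mode_a])."""
    ext_a = [extent[m] for m in mode_a]
    ext_b = [extent[m] for m in mode_b]
    stride_a = dict(zip(mode_a, packed_strides(ext_a)))
    stride_b = dict(zip(mode_b, packed_strides(ext_b)))
    size_b = 1
    for e in ext_b:
        size_b *= e
    b = [None] * size_b
    for idx in itertools.product(*(range(e) for e in ext_b)):
        pos = dict(zip(mode_b, idx))
        # Modes of B that A lacks contribute nothing to off_a: A is broadcast.
        off_a = sum(pos[m] * stride_a[m] for m in mode_a)
        off_b = sum(pos[m] * stride_b[m] for m in mode_b)
        b[off_b] = alpha * op_a(a[off_a])
    return b


extent = {"r": 2, "c": 3}
A = [0, 1, 2, 3, 4, 5]                                 # A[r,c] stored at r + 2*c
print(permute(1, A, ("r", "c"), ("c", "r"), extent))   # [0, 2, 4, 1, 3, 5]
print(permute(1, [7, 8, 9], ("c",), ("r", "c"), extent))  # [7, 7, 8, 8, 9, 9]
```

The second call omits the r-mode from A, so each A element is replicated along r in B.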
Supported datatype combinations are:
| typeA | typeB | descCompute |
|---|---|---|
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_16F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_32F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_R_32F | CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_R_64F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_32F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUTENSOR_C_64F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_64F |
Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested permutation.
descA – [in] The descriptor that holds information about the data type, modes, and strides of A.
modeA – [in] Array of size descA->numModes that holds the names of the modes of A (e.g., if \(A_{a,b,c}\) then modeA = {'a','b','c'}).
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] The descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array of size descB->numModes that holds the names of the modes of B.
descCompute – [in] Determines the precision in which this operation is performed.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorPermute()
cutensorStatus_t cutensorPermute(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, void *B, const cudaStream_t stream)
Performs the tensor permutation that is encoded by plan (see cutensorCreatePermutation).
This function performs an elementwise tensor operation of the form:
\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]
Consequently, this function performs an out-of-place tensor permutation.
Where
A and B are multi-mode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively, and
\(\Psi\) is a unary elementwise operator (e.g., IDENTITY, SQR, CONJUGATE), given by the opA argument of cutensorCreatePermutation.
Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding all information about the desired tensor permutation (created by cutensorCreatePermutation followed by cutensorCreatePlan).
alpha – [in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to B.
B – [inout] Multi-mode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.
stream – [in] The CUDA stream.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
Contraction Operations
The following functions perform contractions between tensors.
cutensorCreateContraction()
cutensorStatus_t cutensorCreateContraction(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descB, const int32_t modeB[], cutensorOperator_t opB, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], const cutensorComputeDescriptor_t descCompute)
This function allocates a cutensorOperationDescriptor_t object that encodes a tensor contraction of the form \( \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \).
Allocates data for desc to be used to perform a tensor contraction of the form
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(\mathcal{B}_{{modes}_\mathcal{B}}) + \beta op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}). \]
See cutensorCreatePlan (or cutensorCreatePlanAutotuned) to create the plan (i.e., to select the kernel) followed by a call to cutensorContract to perform the actual contraction.
The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.
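The contraction formula can be checked against a small CPU reference. The pure-Python sketch below assumes the usual convention that a mode appearing in both A and B but not in the output is summed over, and it takes all unary operators to be the identity. It is not cuTENSOR code; tensors are dicts keyed by index tuples and the function name contract is hypothetical.

```python
import itertools


def contract(alpha, a, mode_a, b, mode_b, beta, c, mode_c, extent):
    """D[mode_c] = alpha * sum_k A[mode_a] * B[mode_b] + beta * C[mode_c]."""
    # Assumption: contracted modes are those shared by A and B but absent from C/D.
    contracted = [m for m in mode_a if m in mode_b and m not in mode_c]
    d = {}
    for idx in itertools.product(*(range(extent[m]) for m in mode_c)):
        pos = dict(zip(mode_c, idx))
        acc = 0
        for kidx in itertools.product(*(range(extent[m]) for m in contracted)):
            pos.update(zip(contracted, kidx))
            acc += (a[tuple(pos[m] for m in mode_a)]
                    * b[tuple(pos[m] for m in mode_b)])
        d[idx] = alpha * acc + beta * c[idx]
    return d


# Matrix multiplication as a special case: D[m,n] = 2 * sum_k A[m,k] B[k,n] + C[m,n]
extent = {"m": 2, "n": 2, "k": 3}
A = {(m, k): m + k + 1 for m in range(2) for k in range(3)}
B = {(k, n): 2 * k + n for k in range(3) for n in range(2)}
C = {(m, n): 10 for m in range(2) for n in range(2)}

D = contract(2, A, ("m", "k"), B, ("k", "n"), 1, C, ("m", "n"), extent)
print(D[(0, 0)], D[(0, 1)], D[(1, 0)], D[(1, 1)])  # 42 54 54 72
```

A reference like this is mainly useful for validating small test cases before running the GPU path.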
Supported datatype combinations are:
| typeA | typeB | typeC | descCompute | typeScalar | Tensor Core |
|---|---|---|---|---|---|
| CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_R_16F | CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_R_32F | Volta+ |
| CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_R_16BF | CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_R_32F | Ampere+ |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_R_32F | No |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_TF32 | CUTENSOR_R_32F | Ampere+ |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_3XTF32 | CUTENSOR_R_32F | Ampere+ |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_16BF | CUTENSOR_R_32F | Ampere+ |
| CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_16F | CUTENSOR_R_32F | Volta+ |
| CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_64F | CUTENSOR_R_64F | Ampere+ |
| CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_R_64F | No |
| CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_C_32F | No |
| CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_TF32 | CUTENSOR_C_32F | Ampere+ |
| CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_3XTF32 | CUTENSOR_C_32F | Ampere+ |
| CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUTENSOR_C_64F | Ampere+ |
| CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_C_64F | No |
| CUTENSOR_R_64F | CUTENSOR_C_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUTENSOR_C_64F | No |
| CUTENSOR_C_64F | CUTENSOR_R_64F | CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUTENSOR_C_64F | No |
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the tensor contraction operation.
descA – [in] The descriptor that holds the information about the data type, modes, and strides of A.
modeA – [in] Array with 'nmodeA' entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] The descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array with 'nmodeB' entries that represent the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.
descC – [in] The descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array with 'nmodeC' entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] The descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).
modeD – [in] Array with 'nmodeD' entries that represent the modes of D (must be identical to modeC for now). modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
descCompute – [in] Data type for the intermediate computation T = A * B.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorContract()
cutensorStatus_t cutensorContract(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *B, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)
This routine computes the tensor contraction \( D = \alpha * A * B + \beta * C \).
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha * \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
The active CUDA device must match the CUDA device that was active at the time at which the plan was created.
 [Example]
See https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR/contraction.cu for a concrete example.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding the contraction execution plan (created by cutensorCreateContraction followed by cutensorCreatePlan).
alpha – [in] Scaling for A*B. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
A – [in] Pointer to the data corresponding to A. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
B – [in] Pointer to the data corresponding to B. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
beta – [in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
C – [in] Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.
D – [out] Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.
workspace – [out] Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).
workspaceSize – [in] Size of the workspace array in bytes; please refer to cutensorEstimateWorkspaceSize to query the required workspace. While cutensorContract does not strictly require a workspace for the contraction, it is still recommended to provide some small workspace (e.g., 128 MB).
stream – [in] The CUDA stream in which all the computation is performed.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.
CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).
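To make the formula \( D = alpha * A * B + beta * C \) concrete, the following is a minimal CPU reference in plain C for the matrix-like case D[m,n] = alpha * sum over k of A[m,k] * B[k,n] + beta * C[m,n]. It is a sketch for checking results, not cuTENSOR code: the function name contract_reference is hypothetical, and the dense layout with the first mode having stride 1 is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics only (hypothetical helper, not part of cuTENSOR).
 * Assumed layout: A is M x K, B is K x N, C and D are M x N,
 * all dense with the first mode contiguous (stride 1). */
void contract_reference(size_t M, size_t N, size_t K,
                        float alpha, const float *A, const float *B,
                        float beta, const float *C, float *D)
{
    /* Mirrors the documented rule that A and B must not overlap D. */
    assert(A != D && B != D);

    for (size_t n = 0; n < N; ++n) {
        for (size_t m = 0; m < M; ++m) {
            float acc = 0.0f;
            /* k appears in A and B but not in D, so it is contracted. */
            for (size_t k = 0; k < K; ++k) {
                acc += A[m + k * M] * B[k + n * K];
            }
            D[m + n * M] = alpha * acc + beta * C[m + n * M];
        }
    }
}
```

Comparing the output of a GPU contraction against such a reference on small extents is one way to validate mode ordering and strides before moving to large tensors.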
Reduction Operations¶
The following functions perform tensor reductions.
cutensorCreateReduction()
¶

cutensorStatus_t cutensorCreateReduction(const cutensorHandle_t handle, cutensorOperationDescriptor_t *desc, const cutensorTensorDescriptor_t descA, const int32_t modeA[], cutensorOperator_t opA, const cutensorTensorDescriptor_t descC, const int32_t modeC[], cutensorOperator_t opC, const cutensorTensorDescriptor_t descD, const int32_t modeD[], cutensorOperator_t opReduce, const cutensorComputeDescriptor_t descCompute)¶
Creates a cutensorOperationDescriptor_t object that encodes a tensor reduction of the form \( D = alpha * opReduce(opA(A)) + beta * opC(C) \).
For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k];
This function can also perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.
The binary opReduce operator provides extra control over what kind of reduction is performed. For instance, setting opReduce to CUTENSOR_OP_ADD reduces the elements of A via a summation, while CUTENSOR_OP_MAX would find the largest element in A.
Supported data-type combinations are:
typeA
typeB
typeC
typeCompute
CUTENSOR_COMPUTE_DESC_16F
CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_COMPUTE_DESC_16BF
CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_COMPUTE_DESC_64F
CUTENSOR_COMPUTE_DESC_32F
CUTENSOR_COMPUTE_DESC_64F
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested tensor reduction operation.
descA – [in] The descriptor that holds the information about the data type, modes and strides of A.
modeA – [in] Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descC – [in] The descriptor that holds the information about the data type, modes and strides of C.
modeC – [in] Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] Must be identical to descC for now.
modeD – [in] Must be identical to modeC for now.
opReduce – [in] Binary operator used to reduce elements of A.
descCompute – [in] Compute descriptor; all arithmetic is performed using this data type (i.e., it affects the accuracy and performance).
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
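The partial reduction C[i,j] = alpha * A[k,j,i] described above can be sketched as a plain-C CPU reference. This is an illustration of the semantics \( D = alpha * opReduce(opA(A)) + beta * opC(C) \), not cuTENSOR code: reduce_reference and reduce_op_t are hypothetical names, opA and opC are assumed to be the identity, and all tensors are assumed dense with the first mode contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ADD and MAX reduction operators. */
typedef enum { REDUCE_ADD, REDUCE_MAX } reduce_op_t;

/* Reference semantics only (not part of cuTENSOR).
 * Assumed layout: A has modes (k, j, i) with extents K, J, I;
 * C and D have modes (i, j); the first mode has stride 1. */
void reduce_reference(size_t K, size_t J, size_t I, reduce_op_t op,
                      float alpha, const float *A,
                      float beta, const float *C, float *D)
{
    assert(K > 0); /* the reduced mode must have at least one element */

    for (size_t j = 0; j < J; ++j) {
        for (size_t i = 0; i < I; ++i) {
            const float *col = A + j * K + i * K * J;
            float acc = col[0];
            /* k appears only in A, so it is the reduced mode. */
            for (size_t k = 1; k < K; ++k) {
                if (op == REDUCE_ADD) {
                    acc += col[k];
                } else if (col[k] > acc) {
                    acc = col[k];
                }
            }
            D[i + j * I] = alpha * acc + beta * C[i + j * I];
        }
    }
}
```

Switching the operator from REDUCE_ADD to REDUCE_MAX changes only how elements along k are combined; the scaling by alpha and the accumulation of beta * C are applied afterwards in both cases.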
cutensorReduce()
¶

cutensorStatus_t cutensorReduce(const cutensorHandle_t handle, const cutensorPlan_t plan, const void *alpha, const void *A, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)¶
Performs the tensor reduction that is encoded by plan (see cutensorCreateReduction).
 Parameters:
alpha – [in] Scaling for A. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
A – [in] Pointer to the data corresponding to A in GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
beta – [in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
C – [in] Pointer to the data corresponding to C in GPU-accessible memory.
D – [out] Pointer to the data corresponding to D in GPU-accessible memory.
workspace – [out] Scratchpad (device) memory of at least workspaceSize bytes; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).
workspaceSize – [in] Please use cutensorEstimateWorkspaceSize() to query the required workspace.
stream – [in] The CUDA stream in which all the computation is performed.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
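Both cutensorContract and cutensorReduce expect the workspace pointer to be aligned to 256 bytes. A pointer returned directly by cudaMalloc satisfies this, but a workspace carved out of a larger pool may not. The following plain-C sketch shows one way to check a pointer and to round a byte offset up to the next 256-byte boundary; the helper names are hypothetical and not part of cuTENSOR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Alignment required for the workspace, as stated in the parameter docs. */
#define WORKSPACE_ALIGNMENT ((size_t)256)

/* Returns nonzero if ptr is a multiple of 256 bytes (hypothetical helper). */
int workspace_is_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % WORKSPACE_ALIGNMENT) == 0;
}

/* Rounds a size or pool offset up to the next multiple of 256 bytes. */
size_t workspace_round_up(size_t bytes)
{
    assert(bytes <= SIZE_MAX - (WORKSPACE_ALIGNMENT - 1)); /* no overflow */
    return (bytes + WORKSPACE_ALIGNMENT - 1) / WORKSPACE_ALIGNMENT
           * WORKSPACE_ALIGNMENT;
}
```

When sub-allocating several workspaces from one device buffer, rounding each offset with workspace_round_up keeps every sub-buffer aligned as long as the base pointer itself is aligned.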
Generic Operation Functions¶
The following functions are generic and work with all the different operations.
cutensorDestroyOperationDescriptor()
¶

cutensorStatus_t cutensorDestroyOperationDescriptor(cutensorOperationDescriptor_t desc)¶
Frees all resources related to the provided descriptor.
Remark
blocking, not reentrant, and thread-safe
 Parameters:
desc – [inout] The cutensorOperationDescriptor_t object that will be deallocated.
 Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorOperationDescriptorGetAttribute()
¶

cutensorStatus_t cutensorOperationDescriptorGetAttribute(const cutensorHandle_t handle, cutensorOperationDescriptor_t desc, cutensorOperationDescriptorAttribute_t attr, void *buf, size_t sizeInBytes)¶
This function retrieves an attribute of the provided cutensorOperationDescriptor_t object (see cutensorOperationDescriptorAttribute_t).
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [in] The cutensorOperationDescriptor_t object whose attribute is queried.
attr – [in] Specifies the attribute that will be retrieved.
buf – [out] This buffer (of size sizeInBytes) will hold the requested attribute of the provided cutensorOperationDescriptor_t object.
sizeInBytes – [in] Size of buf (in bytes); see cutensorOperationDescriptorAttribute_t for the exact size.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorOperationDescriptorSetAttribute()
¶

cutensorStatus_t cutensorOperationDescriptorSetAttribute(const cutensorHandle_t handle, cutensorOperationDescriptor_t desc, cutensorOperationDescriptorAttribute_t attr, const void *buf, size_t sizeInBytes)¶
Sets an attribute of a cutensorOperationDescriptor_t object.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [inout] Operation descriptor that will be modified.
attr – [in] Specifies the attribute that will be set.
buf – [in] This buffer (of size sizeInBytes) determines the value to which attr will be set.
sizeInBytes – [in] Size of buf (in bytes).
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorCreatePlanPreference()
¶

cutensorStatus_t cutensorCreatePlanPreference(const cutensorHandle_t handle, cutensorPlanPreference_t *pref, cutensorAlgo_t algo, cutensorJitMode_t jitMode)¶
Allocates the cutensorPlanPreference_t, enabling users to limit the applicable kernels for a given plan/operation.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
pref – [out] Pointer to the structure holding the cutensorPlanPreference_t allocated by this function. See cutensorPlanPreference_t.
algo – [in] Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If a specified algorithm is not supported, CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.
jitMode – [in] Determines if cuTENSOR is allowed to use JIT-compiled kernels (leading to a longer plan-creation phase); see cutensorJitMode_t.
cutensorDestroyPlanPreference()
¶

cutensorStatus_t cutensorDestroyPlanPreference(cutensorPlanPreference_t pref)¶
Frees all resources related to the provided preference.
Remark
blocking, not reentrant, and thread-safe
 Parameters:
pref – [inout] The cutensorPlanPreference_t object that will be deallocated.
 Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorPlanPreferenceSetAttribute()
¶

cutensorStatus_t cutensorPlanPreferenceSetAttribute(const cutensorHandle_t handle, cutensorPlanPreference_t pref, cutensorPlanPreferenceAttribute_t attr, const void *buf, size_t sizeInBytes)¶
Sets an attribute of a cutensorPlanPreference_t object.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
pref – [inout] This opaque struct restricts the search space of viable candidates.
attr – [in] Specifies the attribute that will be set.
buf – [in] This buffer (of size sizeInBytes) determines the value to which attr will be set.
sizeInBytes – [in] Size of buf (in bytes); see cutensorPlanPreferenceAttribute_t for the exact size.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorEstimateWorkspaceSize()
¶

cutensorStatus_t cutensorEstimateWorkspaceSize(const cutensorHandle_t handle, const cutensorOperationDescriptor_t desc, const cutensorPlanPreference_t planPref, const cutensorWorksizePreference_t workspacePref, uint64_t *workspaceSizeEstimate)¶
Determines the required workspaceSize for the given operation encoded by desc.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [in] This opaque struct encodes the operation.
planPref – [in] This opaque struct restricts the space of viable candidates.
workspacePref – [in] This parameter influences the size of the workspace; see cutensorWorksizePreference_t for details.
workspaceSizeEstimate – [out] The workspace size (in bytes) that is required for the given operation.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorCreatePlan()
¶

cutensorStatus_t cutensorCreatePlan(const cutensorHandle_t handle, cutensorPlan_t *plan, const cutensorOperationDescriptor_t desc, const cutensorPlanPreference_t pref, uint64_t workspaceSizeLimit)¶
This function allocates a cutensorPlan_t object, selects an appropriate kernel for a given operation (encoded by desc) and prepares a plan that encodes the execution.
This function applies cuTENSOR’s heuristic to select a candidate/kernel for a given operation (created by either cutensorCreateContraction, cutensorCreateReduction, cutensorCreatePermutation, cutensorCreateElementwiseBinary, or cutensorCreateElementwiseTrinary). The created plan can then be passed to either cutensorContract, cutensorReduce, cutensorPermute, cutensorElementwiseBinaryExecute, or cutensorElementwiseTrinaryExecute to perform the actual operation.
The plan is created for the active CUDA device.
Note: cutensorCreatePlan must not be captured via CUDA graphs if just-in-time compilation is enabled (i.e., cutensorJitMode_t is not CUTENSOR_JIT_MODE_NONE).
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [out] Pointer to the data structure created by this function that holds all information (e.g., selected kernel) necessary to perform the desired operation.
desc – [in] This opaque struct encodes the given operation (see cutensorCreateContraction, cutensorCreateReduction, cutensorCreatePermutation, cutensorCreateElementwiseBinary, or cutensorCreateElementwiseTrinary).
pref – [in] This opaque struct is used to restrict the space of applicable candidates/kernels (see cutensorCreatePlanPreference or cutensorPlanPreferenceAttribute_t). May be nullptr, in which case default choices are assumed.
workspaceSizeLimit – [in] Denotes the maximal workspace that the corresponding operation is allowed to use (see cutensorEstimateWorkspaceSize).
 Return values:
CUTENSOR_STATUS_SUCCESS – If a viable candidate has been found.
CUTENSOR_STATUS_NOT_SUPPORTED – If no viable candidate could be found.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace was insufficient.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorDestroyPlan()
¶

cutensorStatus_t cutensorDestroyPlan(cutensorPlan_t plan)¶
Frees all resources related to the provided plan.
Remark
blocking, not reentrant, and thread-safe
 Parameters:
plan – [inout] The cutensorPlan_t object that will be deallocated.
 Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorPlanGetAttribute()
¶

cutensorStatus_t cutensorPlanGetAttribute(const cutensorHandle_t handle, const cutensorPlan_t plan, cutensorPlanAttribute_t attr, void *buf, size_t sizeInBytes)¶
Retrieves information about an already-created plan (see cutensorPlanAttribute_t).
 Parameters:
plan – [in] Denotes an already-created plan (e.g., via cutensorCreatePlan or cutensorCreatePlanAutotuned)
attr – [in] Requested attribute.
buf – [out] On successful exit: Holds the information of the requested attribute.
sizeInBytes – [in] Size of buf in bytes.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
Logger Functions¶
cutensorLoggerSetCallback()
¶

cutensorStatus_t cutensorLoggerSetCallback(cutensorLoggerCallback_t callback)¶
This function sets the logging callback routine.
 Parameters:
callback – [in] Pointer to a callback function. Check cutensorLoggerCallback_t.
cutensorLoggerSetFile()
¶

cutensorStatus_t cutensorLoggerSetFile(FILE *file)¶
This function sets the logging output file.
 Parameters:
file – [in] An open file with write permission.
cutensorLoggerOpenFile()
¶

cutensorStatus_t cutensorLoggerOpenFile(const char *logFile)¶
This function opens a logging output file in the given path.
 Parameters:
logFile – [in] Path to the logging output file.
cutensorLoggerSetLevel()
¶

cutensorStatus_t cutensorLoggerSetLevel(int32_t level)¶
This function sets the value of the logging level.
 Parameters:
level – [in] Log level, should be one of the following:
0. Off
1. Errors
2. Performance Trace
3. Performance Hints
4. Heuristics Trace
5. API Trace
cutensorLoggerSetMask()
¶

cutensorStatus_t cutensorLoggerSetMask(int32_t mask)¶
This function sets the value of the log mask.
 Parameters:
mask – [in] Log mask, the bitwise OR of the following:
0. Off
1. Errors
2. Performance Trace
4. Performance Hints
8. Heuristics Trace
16. API Trace
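Because the mask is a bitwise OR of categories, several categories can be enabled at once. The following plain-C sketch shows how such a mask might be composed and queried. The constant names are hypothetical, and the bit values (1, 2, 4, 8, 16 in the order the categories are listed) are an assumption that should be checked against the installed cuTENSOR documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values assume one power-of-two bit per
 * category, in the order the categories are listed. */
enum {
    LOG_MASK_OFF              = 0,
    LOG_MASK_ERRORS           = 1,
    LOG_MASK_PERF_TRACE       = 2,
    LOG_MASK_PERF_HINTS       = 4,
    LOG_MASK_HEURISTICS_TRACE = 8,
    LOG_MASK_API_TRACE        = 16
};

/* Combines `count` categories into a single mask via bitwise OR. */
int32_t log_mask_combine(const int32_t *categories, int count)
{
    assert(count >= 0);
    int32_t mask = LOG_MASK_OFF;
    for (int i = 0; i < count; ++i) {
        mask |= categories[i];
    }
    return mask;
}

/* Returns nonzero if `category` is enabled in `mask`. */
int log_mask_contains(int32_t mask, int32_t category)
{
    return (mask & category) != 0;
}
```

A mask built this way would then be passed as the mask argument of cutensorLoggerSetMask.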
cutensorLoggerForceDisable()
¶

cutensorStatus_t cutensorLoggerForceDisable()¶
This function disables logging for the entire run.