cuTENSOR Functions
Helper Functions
The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.
cutensorInit()
cutensorInitTensorDescriptor()
cutensorStatus_t cutensorInitTensorDescriptor(const cutensorHandle_t *handle, cutensorTensorDescriptor_t *desc, const uint32_t numModes, const int64_t extent[], const int64_t stride[], cudaDataType_t dataType, cutensorOperator_t unaryOp)
Remark
non-blocking, non-reentrant, and thread-safe
 Brief
Initializes a tensor descriptor
 Precondition
extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] Pointer to the address where the allocated tensor descriptor object is stored.
numModes – [in] Number of modes.
extent – [in] Extent of each mode (must be larger than zero).
stride – [in] stride[i] denotes the displacement (stride) between two consecutive elements in the i-th mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero. A stride of zero can instead be achieved by omitting the mode entirely: for instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, you can write C[a,b] = A[b] directly; cuTENSOR will then automatically infer that the a-mode in A should be broadcast.
dataType – [in] Data type of the stored entries.
unaryOp – [in] Unary operator that will be applied to each element of the corresponding tensor in a lazy fashion (i.e., the algorithm uses this tensor as its operand only once). The original data of this tensor remains unchanged.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor is not supported (e.g., due to an unsupported data type).
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
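The packed generalized column-major layout that is assumed when stride is NULL can be illustrated on the host. The following is a minimal sketch, not cuTENSOR code: the helper name packed_column_major_strides is hypothetical, and it assumes that "packed" means stride[0] = 1 with every further stride being the product of the preceding extents.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of cuTENSOR): fills stride[] with packed
 * generalized column-major strides, i.e. stride[0] = 1 and
 * stride[i] = stride[i-1] * extent[i-1]. */
static void packed_column_major_strides(uint32_t num_modes,
                                        const int64_t extent[],
                                        int64_t stride[])
{
    int64_t running = 1;
    for (uint32_t i = 0; i < num_modes; ++i) {
        assert(extent[i] > 0); /* extents must be larger than zero */
        stride[i] = running;
        running *= extent[i];
    }
}
```

For extents {4, 3, 2} this yields the strides {1, 4, 12}, which increase monotonically from left to right as described for the stride parameter.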
cutensorGetAlignmentRequirement()
cutensorStatus_t cutensorGetAlignmentRequirement(const cutensorHandle_t *handle, const void *ptr, const cutensorTensorDescriptor_t *desc, uint32_t *alignmentRequirement)
 Brief
Computes the minimal alignment requirement for a given pointer and descriptor
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
ptr – [in] Raw pointer to the data of the respective tensor.
desc – [in] Tensor descriptor for ptr.
alignmentRequirement – [out] Largest alignment requirement that ptr can fulfill (in bytes).
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
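To give an intuition for "largest alignment requirement that ptr can fulfill", the sketch below computes the largest power-of-two alignment of a raw address. It is an illustration only: the helper largest_pointer_alignment and its max_alignment cap are hypothetical, and the sketch ignores the tensor descriptor, which the library function also takes into account.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of cuTENSOR): largest power-of-two alignment
 * in bytes, at most max_alignment, that the address of ptr is a multiple of.
 * max_alignment must itself be a power of two. */
static uint32_t largest_pointer_alignment(const void *ptr, uint32_t max_alignment)
{
    assert(max_alignment != 0 && (max_alignment & (max_alignment - 1)) == 0);
    uintptr_t address = (uintptr_t)ptr;
    uint32_t alignment = max_alignment;
    /* Halve the candidate until the address is a multiple of it. */
    while (alignment > 1 && (address % alignment) != 0) {
        alignment /= 2;
    }
    return alignment;
}
```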
cutensorGetErrorString()
const char *cutensorGetErrorString(const cutensorStatus_t error)
Remark
non-blocking, non-reentrant, and thread-safe
 Brief
Returns the description string for an error code
 Returns
the error string
 Parameters:
error – [in] Error code to convert to string.
cutensorGetVersion()
size_t cutensorGetVersion()
 Brief
Returns the version number of the cuTENSOR library
cutensorGetCudartVersion()
size_t cutensorGetCudartVersion()
 Brief
Returns version number of the CUDA runtime that cuTENSOR was compiled against
 Details
Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
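A comparison of the two version numbers could look like the sketch below. It assumes that both values use the CUDART_VERSION encoding (1000 * major + 10 * minor); this encoding is not stated on this page, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encoding (as in CUDART_VERSION): 1000 * major + 10 * minor. */
static int cudart_major(size_t version) { return (int)(version / 1000); }
static int cudart_minor(size_t version) { return (int)((version % 1000) / 10); }

/* Returns 1 if both version numbers share the same major version. */
static int same_cudart_major(size_t compiled_against, size_t runtime)
{
    return cudart_major(compiled_against) == cudart_major(runtime);
}
```

Here compiled_against would come from cutensorGetCudartVersion() and runtime from cudaRuntimeGetVersion().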
Elementwise Operations
The following functions perform elementwise operations between tensors.
cutensorElementwiseTrinary()
cutensorStatus_t cutensorElementwiseTrinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAB, cutensorOperator_t opABC, cudaDataType_t typeScalar, const cudaStream_t stream)
 Brief
Elementwise tensor operation with three inputs
 Details
This function performs an elementwise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta \Psi_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Where
A, B, C, D are multi-mode tensors (of arbitrary data types).
\(\Pi^A, \Pi^B, \Pi^C \) are permutation operators that permute the modes of A, B, and C respectively.
\(\Psi_{A},\Psi_{B},\Psi_{C}\) are unary elementwise operators (e.g., IDENTITY, CONJUGATE).
\(\Phi_{ABC}, \Phi_{AB}\) are binary elementwise operators (e.g., ADD, MUL, MAX, MIN).
Notice that broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:
modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted, and such an operation would be covered by either cutensorContraction or cutensorReduction.
each mode may appear in each tensor at most once.
Input tensors may be read even if the value of the corresponding scalar is zero.
Examples:
\( D_{a,b,c,d} = A_{b,d,a,c}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)
\( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)
Supported datatype combinations are:
typeA        typeB        typeC        typeScalar
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F   CUDA_R_16F
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F   CUDA_R_32F
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF  CUDA_R_32F
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUDA_R_32F
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F   CUDA_R_64F
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F   CUDA_C_32F
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F   CUDA_C_64F
CUDA_R_32F   CUDA_R_32F   CUDA_R_16F   CUDA_R_32F
CUDA_R_64F   CUDA_R_64F   CUDA_R_32F   CUDA_R_64F
CUDA_C_64F   CUDA_C_64F   CUDA_C_32F   CUDA_C_64F
Remark
calls asynchronous functions, non-reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
alpha – [in] Scaling factor for A (see equation above) of type typeScalar. Pointer to host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to GPU-accessible memory.
descA – [in] Descriptor that holds information about the data type, modes, and strides of A.
modeA – [in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c}, then modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
beta – [in] Scaling factor for B (see equation above) of type typeScalar. Pointer to host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.
B – [in] Multi-mode tensor of type typeB with nmodeB modes. Pointer to GPU-accessible memory.
descB – [in] Descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
gamma – [in] Scaling factor for C (see equation above) of type typeScalar. Pointer to host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
C – [in] Multi-mode tensor of type typeC with nmodeC modes. Pointer to GPU-accessible memory.
descC – [in] Descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
D – [out] Multi-mode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to GPU-accessible memory. D may alias any input tensor if they share the same memory layout (i.e., the same tensor descriptor).
descD – [in] Descriptor that holds information about the data type, modes, and strides of D. Currently, descD and descC must be identical.
modeD – [in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
opAB – [in] Elementwise binary operator (see \(\Phi_{AB}\) above).
opABC – [in] Elementwise binary operator (see \(\Phi_{ABC}\) above).
typeScalar – [in] Denotes the data type for the scalars alpha, beta, and gamma. Moreover, typeScalar determines the data type that is used throughout the computation.
stream – [in] The CUDA stream.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_ARCH_MISMATCH – if the device is either not ready, or the target architecture is not supported.
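The semantics of the mode arrays, including broadcasting by omitting a mode, can be sketched with a host-side reference loop. This is an illustration under simplifying assumptions, not the library's implementation: all tensors are float, both binary operators are ADD, all unary operators are the identity, and the names tensor_view, offset_of, and reference_trinary_add are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side view of a strided float tensor; mode[i] labels stride[i]. */
typedef struct {
    float         *data;
    int            num_modes;
    const int32_t *mode;
    const int64_t *stride;
} tensor_view;

/* Offset of the element of t addressed by the output index idx (labelled by
 * mode_d). Modes of D that t omits add nothing, which broadcasts t. */
static int64_t offset_of(const tensor_view *t, int num_modes_d,
                         const int32_t mode_d[], const int64_t idx[])
{
    int64_t offset = 0;
    for (int i = 0; i < t->num_modes; ++i)
        for (int j = 0; j < num_modes_d; ++j)
            if (t->mode[i] == mode_d[j])
                offset += idx[j] * t->stride[i];
    return offset;
}

/* D = (alpha * A + beta * B) + gamma * C over all output indices. */
static void reference_trinary_add(float alpha, const tensor_view *A,
                                  float beta, const tensor_view *B,
                                  float gamma, const tensor_view *C,
                                  const tensor_view *D, const int64_t extent_d[])
{
    int64_t idx[8] = {0};
    const int32_t *md = D->mode;
    int n = D->num_modes;
    assert(n > 0 && n <= 8);
    for (;;) {
        float ab = alpha * A->data[offset_of(A, n, md, idx)]
                 + beta  * B->data[offset_of(B, n, md, idx)];
        D->data[offset_of(D, n, md, idx)] =
            ab + gamma * C->data[offset_of(C, n, md, idx)];
        /* Advance idx like an odometer; stop once every mode has wrapped. */
        int m = 0;
        while (m < n && ++idx[m] == extent_d[m]) idx[m++] = 0;
        if (m == n) break;
    }
}
```

For example, with modeD = {'a','b'}, modeA = {'b','a'}, and modeB = {'a'}, the loop computes D_{a,b} = alpha * A_{b,a} + beta * B_{a} + gamma * C_{a,b}, where B is broadcast along the b-mode.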
cutensorElementwiseBinary()
cutensorStatus_t cutensorElementwiseBinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAC, cudaDataType_t typeScalar, cudaStream_t stream)
See cutensorElementwiseTrinary() for details.
 Brief
Elementwise tensor operation for two input tensors
 Details
This function performs an elementwise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Remark
calls asynchronous functions, non-reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
alpha – [in] Scaling factor for A (see equation above) of type typeScalar. Pointer to host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to GPU-accessible memory.
descA – [in] Descriptor that holds information about the data type, modes, and strides of A.
modeA – [in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c}, then modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
gamma – [in] Scaling factor for C (see equation above) of type typeScalar. Pointer to host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
C – [in] Multi-mode tensor of type typeC with nmodeC modes. Pointer to GPU-accessible memory.
descC – [in] Descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
D – [out] Multi-mode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to GPU-accessible memory. D may alias any input tensor if they share the same memory layout (i.e., the same tensor descriptor).
descD – [in] Descriptor that holds information about the data type, modes, and strides of D. Currently, descD and descC must be identical.
modeD – [in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
opAC – [in] Elementwise binary operator (see \(\Phi_{AC}\) above).
typeScalar – [in] Scalar type for the intermediate computation.
stream – [in] The CUDA stream.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorPermutation()
cutensorStatus_t cutensorPermutation(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const cudaDataType_t typeScalar, const cudaStream_t stream)
 Brief
Tensor permutation
 Details
This function performs an elementwise tensor operation of the form:
\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]
Where
A and B are multi-mode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A and B, respectively, and
\(\Psi\) is a unary elementwise operator (e.g., IDENTITY, SQR, CONJUGATE) that is specified in the tensor descriptor descA.
Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorElementwise.
Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Modes may appear in any order. The only restrictions are:
modes that appear in A must also appear in the output tensor.
each mode may appear in each tensor at most once.
Supported datatype combinations are:
typeA        typeB        typeScalar
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F
CUDA_R_16F   CUDA_R_16F   CUDA_R_32F
CUDA_R_16F   CUDA_R_32F   CUDA_R_32F
CUDA_R_32F   CUDA_R_16F   CUDA_R_32F
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_32F
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F
CUDA_R_32F   CUDA_R_64F   CUDA_R_64F
CUDA_R_64F   CUDA_R_32F   CUDA_R_64F
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F
CUDA_C_32F   CUDA_C_64F   CUDA_C_64F
CUDA_C_64F   CUDA_C_32F   CUDA_C_64F
Remark
calls asynchronous functions, non-reentrant, and thread-safe
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
alpha – [in] Scaling factor for A (see equation above) of type typeScalar. Pointer to host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to GPU-accessible memory.
descA – [in] Descriptor that holds information about the data type, modes, and strides of A.
modeA – [in] Array of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c}, then modeA = {'a','b','c'}).
B – [inout] Multi-mode tensor of type typeB with nmodeB modes. Pointer to GPU-accessible memory.
descB – [in] Descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array of size descB->numModes that holds the names of the modes of B.
typeScalar – [in] Data type of alpha.
stream – [in] The CUDA stream.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
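As a host-side illustration of an out-of-place permutation, the sketch below computes B_{c,a,b} = alpha * A_{a,b,c} for packed generalized column-major float tensors. It is a reference loop for checking results on small inputs, not how cuTENSOR implements the operation; the function name reference_permute_cab is hypothetical and the unary operator is assumed to be the identity.

```c
#include <assert.h>
#include <stdint.h>

/* Out-of-place permutation of a packed 3-mode tensor:
 * B_{c,a,b} = alpha * A_{a,b,c}, both in generalized column-major layout. */
static void reference_permute_cab(float alpha, const float *A, float *B,
                                  int64_t ext_a, int64_t ext_b, int64_t ext_c)
{
    assert(A != B); /* the permutation is out-of-place */
    for (int64_t c = 0; c < ext_c; ++c)
        for (int64_t b = 0; b < ext_b; ++b)
            for (int64_t a = 0; a < ext_a; ++a) {
                /* A: a is the fastest mode; B: c is the fastest mode. */
                int64_t off_a = a + ext_a * (b + ext_b * c);
                int64_t off_b = c + ext_c * (a + ext_a * b);
                B[off_b] = alpha * A[off_a];
            }
}
```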
Contraction Operations
The following functions perform contractions between tensors.
cutensorInitContractionDescriptor()
cutensorStatus_t cutensorInitContractionDescriptor(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const uint32_t alignmentRequirementA, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const uint32_t alignmentRequirementB, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const uint32_t alignmentRequirementC, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], const uint32_t alignmentRequirementD, cutensorComputeType_t typeCompute)
 Brief
Describes the tensor contraction problem of the form:
\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]
 Details
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets filled with the information that encodes the tensor contraction problem.
descA – [in] Descriptor that holds information about the data type, modes, and strides of A.
modeA – [in] Array with nmodeA entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
alignmentRequirementA – [in] Alignment that cuTENSOR may require for A’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement() to determine the best value for a given pointer.
descB – [in] Descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array with nmodeB entries that represent the modes of B. modeB[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
alignmentRequirementB – [in] Alignment that cuTENSOR may require for B’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement() to determine the best value for a given pointer.
descC – [in] Descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array with nmodeC entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
alignmentRequirementC – [in] Alignment that cuTENSOR may require for C’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement() to determine the best value for a given pointer.
descD – [in] Descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).
modeD – [in] Array with nmodeD entries that represent the modes of D (must be identical to modeC for now). modeD[i] corresponds to extent[i] and stride[i] as passed to cutensorInitTensorDescriptor().
alignmentRequirementD – [in] Alignment that cuTENSOR may require for D’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement() to determine the best value for a given pointer.
typeCompute – [in] Data type for the intermediate computation T = A * B.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorContractionDescriptorSetAttribute()
cutensorStatus_t cutensorContractionDescriptorSetAttribute(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, cutensorContractionDescriptorAttributes_t attr, const void *buf, size_t sizeInBytes)
 Brief
Sets an attribute of the contraction descriptor
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [inout] Contraction descriptor that will be modified.
attr – [in] Specifies the attribute that will be set.
buf – [in] This buffer (of size sizeInBytes) determines the value to which attr will be set.
sizeInBytes – [in] Size of buf (in bytes).
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorInitContractionFind()
cutensorStatus_t cutensorInitContractionFind(const cutensorHandle_t *handle, cutensorContractionFind_t *find, const cutensorAlgo_t algo)
 Brief
Limits the search space of viable candidates (a.k.a. algorithms)
 Details
This function gives the user finer control over the candidates that the subsequent call to cutensorInitContractionPlan is allowed to evaluate.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
find – [out] This opaque struct gets filled with the restrictions on the search space of viable candidates.
algo – [in] Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If the specified algorithm is not supported, CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_NOT_SUPPORTED – if the specified algorithm is not supported.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorContractionFindSetAttribute()
cutensorStatus_t cutensorContractionFindSetAttribute(const cutensorHandle_t *handle, cutensorContractionFind_t *find, cutensorContractionFindAttributes_t attr, const void *buf, size_t sizeInBytes)
 Brief
Sets an attribute of cutensorContractionFind
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
find – [inout] This opaque struct restricts the search space of viable candidates.
attr – [in] Specifies the attribute that will be set.
buf – [in] This buffer (of size sizeInBytes) determines the value to which attr will be set.
sizeInBytes – [in] Size of buf (in bytes).
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorContractionGetWorkspaceSize()
cutensorStatus_t cutensorContractionGetWorkspaceSize(const cutensorHandle_t *handle, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const cutensorWorksizePreference_t pref, uint64_t *workspaceSize)
 Brief
Determines the required workspaceSize for a given tensor contraction (see cutensorContraction)
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [in] This opaque struct encodes the tensor contraction problem.
find – [in] This opaque struct restricts the search space of viable candidates.
pref – [in] This parameter influences the size of the workspace; see cutensorWorksizePreference_t for details.
workspaceSize – [out] The workspace size (in bytes) that is required for the given tensor contraction.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorContractionGetWorkspace()
Deprecated. Use cutensorContractionGetWorkspaceSize() instead.
cutensorInitContractionPlan()
cutensorStatus_t cutensorInitContractionPlan(const cutensorHandle_t *handle, cutensorContractionPlan_t *plan, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const uint64_t workspaceSize)
The plan is created for the active CUDA device.
 Brief
Initializes the contraction plan for a given tensor contraction problem
 Details
This function applies cuTENSOR’s heuristic to select a candidate for a given tensor contraction problem (encoded by desc). The resulting plan can be reused multiple times as long as the tensor contraction problem remains the same.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [out] Opaque handle holding the contraction execution plan (i.e., the candidate that will be executed as well as all its runtime parameters for the given tensor contraction problem).
desc – [in] This opaque struct encodes the given tensor contraction problem.
find – [in] This opaque struct is used to restrict the search space of viable candidates.
workspaceSize – [in] Available workspace size (in bytes).
 Return values:
CUTENSOR_STATUS_SUCCESS – If a viable candidate has been found.
CUTENSOR_STATUS_NOT_SUPPORTED – If no viable candidate could be found.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace was insufficient.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorContraction()
cutensorStatus_t cutensorContraction(const cutensorHandle_t *handle, const cutensorContractionPlan_t *plan, const void *alpha, const void *A, const void *B, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)
 Brief
This routine computes the tensor contraction
\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]
 Details
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
The currently active CUDA device must match the CUDA device that was active at the time at which the plan was created.
Supported datatype combinations are:
typeA        typeB        typeC        typeCompute            Tensor Core
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F   CUTENSOR_COMPUTE_32F   Volta+
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF  CUTENSOR_COMPUTE_32F   Ampere+
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_32F   No
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_TF32  Ampere+
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_16BF  Ampere+
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_16F   Volta+
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F   CUTENSOR_COMPUTE_64F   Ampere+
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F   CUTENSOR_COMPUTE_32F   No
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F   CUTENSOR_COMPUTE_32F   No
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F   CUTENSOR_COMPUTE_TF32  Ampere+
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F   CUTENSOR_COMPUTE_64F   Ampere+
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F   CUTENSOR_COMPUTE_32F   No
CUDA_R_64F   CUDA_C_64F   CUDA_C_64F   CUTENSOR_COMPUTE_64F   No
CUDA_C_64F   CUDA_R_64F   CUDA_C_64F   CUTENSOR_COMPUTE_64F   No
 [Example]
See https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR/contraction.cu for a concrete example.
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding the contraction execution plan.
alpha – [in] Scaling for A*B. Its data type is determined by typeCompute. Pointer to host memory.
A – [in] Pointer to the data corresponding to A. Pointer to GPU-accessible memory.
B – [in] Pointer to the data corresponding to B. Pointer to GPU-accessible memory.
beta – [in] Scaling for C. Its data type is determined by typeCompute. Pointer to host memory.
C – [in] Pointer to the data corresponding to C. Pointer to GPU-accessible memory.
D – [out] Pointer to the data corresponding to D. Pointer to GPU-accessible memory.
workspace – [out] Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 256 bytes.
workspaceSize – [in] Size of the workspace array in bytes; use cutensorContractionGetWorkspaceSize() to query the required workspace. While cutensorContraction() does not strictly require a workspace for the reduction, it is still recommended to provide some small workspace (e.g., 128 MB).
stream – [in] The CUDA stream in which all the computation is performed.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.
CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).
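For checking results on small inputs, the contraction formula can be evaluated with a plain host loop. The sketch below covers only the simplest case, D_{m,n} = alpha * sum_k A_{m,k} B_{k,n} + beta * C_{m,n}, with packed column-major float tensors; the function name reference_contraction is hypothetical and the float accumulator merely stands in for the compute type.

```c
#include <assert.h>
#include <stdint.h>

/* D_{m,n} = alpha * sum_k A_{m,k} * B_{k,n} + beta * C_{m,n}
 * for packed column-major tensors; k is the contracted mode. */
static void reference_contraction(float alpha, const float *A, const float *B,
                                  float beta, const float *C, float *D,
                                  int64_t ext_m, int64_t ext_n, int64_t ext_k)
{
    assert(ext_m > 0 && ext_n > 0 && ext_k > 0);
    for (int64_t n = 0; n < ext_n; ++n)
        for (int64_t m = 0; m < ext_m; ++m) {
            float acc = 0.0f; /* plays the role of the compute type */
            for (int64_t k = 0; k < ext_k; ++k)
                acc += A[m + ext_m * k] * B[k + ext_k * n];
            D[m + ext_m * n] = alpha * acc + beta * C[m + ext_m * n];
        }
}
```

Modes that appear in both A and B but not in D (here k) are summed over; all remaining modes index the output.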
cutensorContractionMaxAlgos()
cutensorStatus_t cutensorContractionMaxAlgos(int32_t *maxNumAlgos)
 Brief
This routine returns the maximum number of algorithms available to compute tensor contractions
 [NOTE] Not all algorithms might be applicable to your specific problem. cutensorContraction() will return CUTENSOR_STATUS_NOT_SUPPORTED if an algorithm is not applicable.
 Parameters:
maxNumAlgos – [out] This value will hold the maximum number of algorithms available for cutensorContraction(). You can use the returned integer for autotuning purposes (i.e., iterate over all algorithms up to the returned value).
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
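The autotuning loop mentioned for maxNumAlgos can be sketched independently of the library. In the sketch below, the callback type time_algo_fn is a hypothetical stand-in for "create a plan with this algorithm, run the contraction, and measure the time"; it is assumed to return a negative value when the algorithm is not applicable (for example, when CUTENSOR_STATUS_NOT_SUPPORTED is returned).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback: runs the contraction with the given algorithm and
 * returns the elapsed time, or a negative value if it is not applicable. */
typedef double (*time_algo_fn)(int32_t algo, void *context);

/* Tries every algorithm in [0, max_num_algos) and returns the fastest
 * applicable one, or -1 if none is applicable. */
static int32_t pick_fastest_algo(int32_t max_num_algos, time_algo_fn time_algo,
                                 void *context)
{
    int32_t best_algo = -1;
    double best_time = 0.0;
    assert(time_algo != 0);
    for (int32_t algo = 0; algo < max_num_algos; ++algo) {
        double elapsed = time_algo(algo, context);
        if (elapsed < 0.0) continue; /* algorithm not applicable, skip it */
        if (best_algo < 0 || elapsed < best_time) {
            best_algo = algo;
            best_time = elapsed;
        }
    }
    return best_algo;
}

/* Example callback for illustration: looks up a pre-recorded time. */
static double lookup_time(int32_t algo, void *context)
{
    const double *times = (const double *)context;
    return times[algo];
}
```

In a real application the callback would wrap the cuTENSOR calls and a timer; lookup_time only exists to make the sketch self-contained.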
Reduction Operations
The following functions perform tensor reductions.
cutensorReduction()
cutensorStatus_t cutensorReduction(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, void *workspace, uint64_t workspaceSize, cudaStream_t stream)
 Brief
Implements a tensor reduction of the form
\[ D = \alpha \cdot \mathrm{opReduce}(\mathrm{opA}(A)) + \beta \cdot \mathrm{opC}(C) \]
 Details
For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k];
This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.
The binary opReduce operator provides extra control over what kind of reduction is performed. For instance, opReduce == CUTENSOR_OP_ADD reduces the elements of A via a summation, while CUTENSOR_OP_MAX would find the largest element in A.
Supported datatype combinations are:
typeA
typeB
typeC
typeCompute
CUDA_R_16F
CUDA_R_16F
CUDA_R_16F
CUDA_R_16F
CUDA_R_16F
CUDA_R_16F
CUDA_R_16BF
CUDA_R_16BF
CUDA_R_16BF
CUDA_R_16BF
CUDA_R_16BF
CUDA_R_16BF
CUDA_R_32F
CUDA_R_32F
CUDA_R_32F
CUDA_R_64F
CUDA_R_64F
CUDA_R_64F
CUDA_C_32F
CUDA_C_32F
CUDA_C_32F
CUDA_C_64F
CUDA_C_64F
CUDA_C_64F
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
alpha – [in] Scaling for A; its data type is determined by ‘typeCompute’. Pointer to the host memory.
A – [in] Pointer to the data corresponding to A in device memory. Pointer to the GPUaccessible memory.
descA – [in] A descriptor that holds the information about the data type, modes and strides of A.
modeA – [in] Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).
beta – [in] Scaling for C; its data type is determined by ‘typeCompute’. Pointer to the host memory.
C – [in] Pointer to the data corresponding to C in device memory. Pointer to the GPUaccessible memory.
descC – [in] A descriptor that holds the information about the data type, modes and strides of C.
modeC – [in] Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
D – [out] Pointer to the data of D in GPU-accessible memory.
descD – [in] Must be identical to descC for now.
modeD – [in] Must be identical to modeC for now.
opReduce – [in] binary operator used to reduce elements of A.
typeCompute – [in] All arithmetic is performed using this data type (i.e., it affects the accuracy and performance).
workspace – [out] Scratchpad (device) memory; the workspace must be aligned to 128 bytes.
workspaceSize – [in] Size of the workspace in bytes. Use cutensorReductionGetWorkspaceSize() to query the required size. Lower values, including zero, are valid but may lead to grossly suboptimal performance.
stream – [in] The CUDA stream in which all the computation is performed.
 Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
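The 128-byte alignment requirement on workspace matters mainly when the workspace is carved out of a larger memory pool rather than allocated on its own (pointers returned directly by cudaMalloc are aligned to at least 256 bytes). A minimal sketch of the two checks involved follows; `WORKSPACE_ALIGNMENT`, `workspace_is_aligned`, and `align_up_128` are hypothetical helper names, not cuTENSOR API.

```c
#include <stddef.h>
#include <stdint.h>

/* Alignment required for the workspace pointer, per the parameter list above. */
#define WORKSPACE_ALIGNMENT 128u

/* Returns nonzero if ptr satisfies the 128-byte alignment requirement. */
int workspace_is_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % WORKSPACE_ALIGNMENT) == 0;
}

/* Rounds a byte offset into a pool up to the next multiple of 128. */
uint64_t align_up_128(uint64_t offset)
{
    return (offset + WORKSPACE_ALIGNMENT - 1) / WORKSPACE_ALIGNMENT
           * WORKSPACE_ALIGNMENT;
}
```

When sub-allocating, the offset of the workspace within the pool would be passed through `align_up_128` before the pointer is handed to cutensorReduction, assuming the pool base itself is at least 128-byte aligned.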
cutensorReductionGetWorkspaceSize()
¶

cutensorStatus_t cutensorReductionGetWorkspaceSize(const cutensorHandle_t *handle, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, uint64_t *workspaceSize)¶
 Brief
Determines the required workspaceSize for a given tensor reduction (see cutensorReduction)
 Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
A – [in] same as in cutensorReduction
descA – [in] same as in cutensorReduction
modeA – [in] same as in cutensorReduction
C – [in] same as in cutensorReduction
descC – [in] same as in cutensorReduction
modeC – [in] same as in cutensorReduction
D – [in] same as in cutensorReduction
descD – [in] same as in cutensorReduction
modeD – [in] same as in cutensorReduction
opReduce – [in] same as in cutensorReduction
typeCompute – [in] same as in cutensorReduction
workspaceSize – [out] The workspace size (in bytes) that is required for the given tensor reduction.
 Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
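Since cutensorReduction accepts a workspace smaller than the queried size, including none at all, a caller can treat a failed workspace allocation as a performance problem rather than a fatal error. The sketch below shows that fallback logic in isolation. `acquire_workspace`, `workspace_alloc_fn`, and `failing_alloc` are hypothetical names; in a real program the allocator would wrap a device allocation such as cudaMalloc, and `required` would come from cutensorReductionGetWorkspaceSize().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocator signature; a real one would wrap cudaMalloc. */
typedef void *(*workspace_alloc_fn)(size_t bytes);

/* Stub allocator that always fails, for exercising the fallback path. */
void *failing_alloc(size_t bytes)
{
    (void)bytes;
    return NULL;
}

/*
 * Tries to allocate the queried workspace size. On failure (or if no
 * workspace is required) returns NULL and reports a granted size of zero,
 * which is still a valid workspaceSize for the reduction.
 */
void *acquire_workspace(uint64_t required, workspace_alloc_fn alloc,
                        uint64_t *granted)
{
    void *ptr = NULL;
    if (required > 0 && required <= SIZE_MAX) {
        ptr = alloc((size_t)required);
    }
    *granted = (ptr != NULL) ? required : 0;
    return ptr;
}
```

The returned pointer and the granted size would then be passed as workspace and workspaceSize to cutensorReduction.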
cutensorReductionGetWorkspace()
¶
Deprecated. Use cutensorReductionGetWorkspaceSize instead.
Logger Functions¶
cutensorLoggerSetCallback()
¶

cutensorStatus_t cutensorLoggerSetCallback(cutensorLoggerCallback_t callback)¶
 Brief
This function sets the logging callback routine.
 Parameters:
callback – [in] Pointer to a callback function. Check cutensorLoggerCallback_t.
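As an illustration, a callback might simply record the most recent message so that it can be inspected later. The sketch below is self-contained and does not include the cuTENSOR headers: the local typedef `logger_callback_t` is an assumption about the shape of cutensorLoggerCallback_t (log level, function name, message) and should be checked against the cutensorLoggerCallback_t reference. `record_log`, `log_seen`, and the two globals are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed shape of cutensorLoggerCallback_t; verify against the type reference. */
typedef void (*logger_callback_t)(int32_t logLevel, const char *functionName,
                                  const char *message);

/* Most recent log record, kept for later inspection. */
int32_t last_log_level = -1;
char last_log_line[256];

/* Callback body: stores the level and a "function: message" line. */
void record_log(int32_t logLevel, const char *functionName, const char *message)
{
    last_log_level = logLevel;
    snprintf(last_log_line, sizeof last_log_line, "%s: %s",
             functionName, message);
}

/* Returns nonzero if the last recorded line contains the given text. */
int log_seen(const char *needle)
{
    return strstr(last_log_line, needle) != NULL;
}
```

If the typedef matches, `record_log` could be registered with cutensorLoggerSetCallback(record_log).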
cutensorLoggerSetFile()
¶

cutensorStatus_t cutensorLoggerSetFile(FILE *file)¶
 Brief
This function sets the logging output file.
 Parameters:
file – [in] An open file with write permission.
cutensorLoggerOpenFile()
¶

cutensorStatus_t cutensorLoggerOpenFile(const char *logFile)¶
 Brief
This function opens a logging output file in the given path.
 Parameters:
logFile – [in] Path to the logging output file.
cutensorLoggerSetLevel()
¶

cutensorStatus_t cutensorLoggerSetLevel(int32_t level)¶
 Brief
This function sets the value of the logging level.
 Parameters:
level – [in] Log level, should be one of the following:
0. Off
1. Errors
2. Performance Trace
3. Performance Hints
4. Heuristics Trace
5. API Trace
cutensorLoggerSetMask()
¶

cutensorStatus_t cutensorLoggerSetMask(int32_t mask)¶
 Brief
This function sets the value of the log mask.
 Parameters:
mask – [in] Log mask, the bitwise OR of the following:
0. Off
1. Errors
2. Performance Trace
4. Performance Hints
8. Heuristics Trace
16. API Trace
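A mask is built by OR-ing together the categories of interest. The sketch below assumes each category occupies one bit in list order (1, 2, 4, 8, 16); the `LOG_MASK_*` constants and `mask_enables` are hypothetical names introduced for illustration, since the library takes a plain int32_t.

```c
#include <stdint.h>

/* Hypothetical names; values assume one bit per category, in list order. */
enum {
    LOG_MASK_OFF               = 0,
    LOG_MASK_ERRORS            = 1,
    LOG_MASK_PERFORMANCE_TRACE = 2,
    LOG_MASK_PERFORMANCE_HINTS = 4,
    LOG_MASK_HEURISTICS_TRACE  = 8,
    LOG_MASK_API_TRACE         = 16
};

/* Returns nonzero if the given category bit is set in mask. */
int mask_enables(int32_t mask, int32_t category)
{
    return (mask & category) != 0;
}
```

For example, `LOG_MASK_ERRORS | LOG_MASK_API_TRACE` selects errors and API traces only, and the resulting value would be passed to cutensorLoggerSetMask().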
cutensorLoggerForceDisable()
¶

cutensorStatus_t cutensorLoggerForceDisable()¶
 Brief
This function disables logging for the entire run.