cuTENSOR Functions
Helper Functions
The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.
cutensorInit()

cutensorStatus_t cutensorInit(cutensorHandle_t *handle)
 Brief
Initializes the cuTENSOR library
 Details
The device associated with a particular cuTENSOR handle is assumed to remain unchanged after the cutensorInit() call. In order for the cuTENSOR library to use a different device, the application must set the new device to be used by calling cudaSetDevice() and then create another cuTENSOR handle, which will be associated with the new device, by calling cutensorInit().
 Returns
CUTENSOR_STATUS_SUCCESS on success and an error code otherwise
 Remark
blocking, not reentrant, and thread-safe
 Parameters
[out] handle
: Pointer to cutensorHandle_t
cutensorInitTensorDescriptor()

cutensorStatus_t cutensorInitTensorDescriptor(const cutensorHandle_t *handle, cutensorTensorDescriptor_t *desc, const uint32_t numModes, const int64_t extent[], const int64_t stride[], cudaDataType_t dataType, cutensorOperator_t unaryOp)
 Brief
Initializes a tensor descriptor
 Precondition
extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes
 Remark
non-blocking, not reentrant, and thread-safe
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[out] desc
: Pointer to the address where the allocated tensor descriptor object is stored.
[in] numModes
: Number of modes.
[in] extent
: Extent of each mode (must be larger than zero).
[in] stride
: stride[i] denotes the displacement (stride) between two consecutive elements in the i-th mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero. A stride of zero can instead be achieved by omitting the mode entirely: for instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, you can write C[a,b] = A[b] directly; cuTENSOR will then automatically infer that the a-mode in A should be broadcast.
[in] dataType
: Data type of the stored entries.
[in] unaryOp
: Unary operator that will be applied to each element of the corresponding tensor in a lazy fashion (i.e., the algorithm uses this tensor as its operand only once). The original data of this tensor remains unchanged.
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
CUTENSOR_STATUS_NOT_SUPPORTED
: if the requested descriptor is not supported (e.g., due to a non-supported data type).
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
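The packed generalized column-major layout that is assumed when stride is NULL can be illustrated on the host. The following is a minimal sketch in plain C, not cuTENSOR code: packed_strides and element_offset are hypothetical helper names, and strides are measured in elements. The leftmost mode gets stride 1 and each following stride is the product of all extents to its left.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of cuTENSOR): fills stride[] with the packed,
 * generalized column-major strides for the given extents, in elements. */
void packed_strides(uint32_t numModes, const int64_t extent[], int64_t stride[])
{
    int64_t running = 1;
    for (uint32_t i = 0; i < numModes; ++i) {
        assert(extent[i] > 0);  /* extents must be larger than zero */
        stride[i] = running;    /* stride of mode i = product of extents 0..i-1 */
        running *= extent[i];
    }
}

/* Hypothetical helper: linear element offset of a multi-index. */
int64_t element_offset(uint32_t numModes, const int64_t index[], const int64_t stride[])
{
    int64_t offset = 0;
    for (uint32_t i = 0; i < numModes; ++i) {
        offset += index[i] * stride[i];
    }
    return offset;
}
```

For extents {4, 3, 2} this yields strides {1, 4, 12}, so the strides grow from left to right as described for the stride parameter.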
cutensorGetAlignmentRequirement()

cutensorStatus_t cutensorGetAlignmentRequirement(const cutensorHandle_t *handle, const void *ptr, const cutensorTensorDescriptor_t *desc, uint32_t *alignmentRequirement)
 Brief
Computes the minimal alignment requirement for a given pointer and descriptor
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[in] ptr
: Raw pointer to the data of the respective tensor.
[in] desc
: Tensor descriptor for ptr.
[out] alignmentRequirement
: Largest alignment requirement that ptr can fulfill (in bytes).
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
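To make "largest alignment that ptr can fulfill" concrete, here is a minimal host-side sketch in plain C. It is an illustration only and not cuTENSOR's implementation: largest_alignment is a hypothetical name, the cap maxAlignment is an assumed upper bound, and the sketch ignores the tensor descriptor that the real function also takes into account.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration (not cuTENSOR's implementation): returns the
 * largest power-of-two alignment, in bytes, that the address satisfies,
 * capped at maxAlignment (an assumed upper bound, itself a power of two). */
uint32_t largest_alignment(const void *ptr, uint32_t maxAlignment)
{
    assert(maxAlignment != 0 && (maxAlignment & (maxAlignment - 1)) == 0);
    uintptr_t address = (uintptr_t)ptr;
    uint32_t alignment = 1;
    /* Double the candidate while the address is still a multiple of it. */
    while (alignment < maxAlignment && (address % (2u * alignment)) == 0) {
        alignment *= 2;
    }
    return alignment;
}
```

An address such as 264 (256 + 8) is a multiple of 8 but not of 16, so the sketch reports 8 bytes for it.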
cutensorGetErrorString()

const char *cutensorGetErrorString(const cutensorStatus_t error)
 Brief
Returns the description string for an error code
 Returns
the error string
 Remark
non-blocking, not reentrant, and thread-safe
 Parameters
[in] error
: Error code to convert to string.
cutensorGetVersion()

size_t cutensorGetVersion()
 Brief
Returns the version number of the cuTENSOR library
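The version is returned as a single integer. This page does not state how it is encoded, so the following plain-C sketch rests on an assumption that should be checked against the installed header: that the value is major * 10000 + minor * 100 + patch. The names version_parts and decode_version are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the three version components. */
typedef struct {
    size_t major;
    size_t minor;
    size_t patch;
} version_parts;

/* Assumption: version == major * 10000 + minor * 100 + patch.
 * Verify this encoding against the cuTENSOR header before relying on it. */
version_parts decode_version(size_t version)
{
    version_parts parts;
    assert(version > 0);
    parts.major = version / 10000;
    parts.minor = (version / 100) % 100;
    parts.patch = version % 100;
    return parts;
}
```

Under that assumption, a caller would pass the result of cutensorGetVersion() to decode_version and compare the components against the version it was built for.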
Elementwise Operations
The following functions perform elementwise operations between tensors.
cutensorElementwiseTrinary()

cutensorStatus_t cutensorElementwiseTrinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAB, cutensorOperator_t opABC, cudaDataType_t typeScalar, const cudaStream_t stream)
 Brief
Elementwise tensor operation with three inputs
 Details
This function performs an elementwise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta \Psi_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Where
A, B, C, D are multimode tensors (of arbitrary data types).
\(\Pi^A, \Pi^B, \Pi^C \) are permutation operators that permute the modes of A, B, and C respectively.
\(\Psi_{A},\Psi_{B},\Psi_{C}\) are unary elementwise operators (e.g., IDENTITY, CONJUGATE).
\(\Phi_{ABC}, \Phi_{AB}\) are binary elementwise operators (e.g., ADD, MUL, MAX, MIN).
Notice that the broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Moreover, modes may appear in any order, giving users a greater flexibility. The only restrictions are:
modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted and such an operation would be covered by either cutensorContraction or cutensorReduction.
each mode may appear in each tensor at most once.
Input tensors may be read even if the value of the corresponding scalar is zero.
Examples:
\( D_{a,b,c,d} = A_{b,d,a,c}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)
\( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)
Supported datatype combinations are:
typeA        typeB        typeC        typeScalar
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F   CUDA_R_16F
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F   CUDA_R_32F
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF  CUDA_R_32F
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUDA_R_32F
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F   CUDA_R_64F
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F   CUDA_C_32F
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F   CUDA_C_64F
CUDA_R_32F   CUDA_R_32F   CUDA_R_16F   CUDA_R_32F
CUDA_R_64F   CUDA_R_64F   CUDA_R_32F   CUDA_R_64F
CUDA_C_64F   CUDA_C_64F   CUDA_C_32F   CUDA_C_64F
 Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[in] alpha
: Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
[in] A
: Multimode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.
[in] descA
: A descriptor that holds the information about the data type, modes, and strides of A.
[in] modeA
: Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] beta
: Scaling factor for B (see equation above) of the type typeScalar. Pointer to the host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.
[in] B
: Multimode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.
[in] descB
: The B descriptor that holds information about the data type, modes, and strides of B.
[in] modeB
: Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] gamma
: Scaling factor for C (see equation above) of type typeScalar. Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
[in] C
: Multimode tensor of type typeC with nmodeC modes. Pointer to the GPU-accessible memory.
[in] descC
: The C descriptor that holds information about the data type, modes, and strides of C.
[in] modeC
: Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[out] D
: Multimode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to the GPU-accessible memory. Notice that D may alias any input tensor if they share the same memory layout (i.e., same tensor descriptor).
[in] descD
: The D descriptor that holds information about the data type, modes, and strides of D. Notice that descD and descC are currently required to be identical.
[in] modeD
: Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] opAB
: Elementwise binary operator (see \(\Phi_{AB}\) above).
[in] opABC
: Elementwise binary operator (see \(\Phi_{ABC}\) above).
[in] typeScalar
: Denotes the data type for the scalars alpha, beta, and gamma. Moreover, typeScalar determines the data type that is used throughout the computation.
[in] stream
: The CUDA stream.
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_ARCH_MISMATCH
: if the device is either not ready, or the target architecture is not supported.
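The trinary formula can be checked against a small host-side reference. The sketch below is plain C rather than a cuTENSOR call. It assumes two modes a and b, packed column-major float tensors, identity unary operators, and the mode orders A_{b,a}, B_{a,b}, C_{a,b}, D_{a,b}. The names binary_op, apply_binary and trinary_reference are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the binary operators named in the text. */
typedef enum { OP_ADD, OP_MUL, OP_MAX, OP_MIN } binary_op;

float apply_binary(binary_op op, float x, float y)
{
    switch (op) {
    case OP_ADD: return x + y;
    case OP_MUL: return x * y;
    case OP_MAX: return x > y ? x : y;
    case OP_MIN: return x < y ? x : y;
    }
    return 0.0f; /* not reached for valid operators */
}

/* Host reference for
 *   D[a,b] = opABC(opAB(alpha * A[b,a], beta * B[a,b]), gamma * C[a,b])
 * with packed column-major tensors and identity unary operators. */
void trinary_reference(int64_t ext_a, int64_t ext_b,
                       float alpha, const float *A,
                       float beta, const float *B,
                       float gamma, const float *C,
                       float *D, binary_op opAB, binary_op opABC)
{
    assert(ext_a > 0 && ext_b > 0);
    for (int64_t b = 0; b < ext_b; ++b) {
        for (int64_t a = 0; a < ext_a; ++a) {
            float valueA = alpha * A[b + a * ext_b]; /* A has modes {b,a} */
            float valueB = beta * B[a + b * ext_a];  /* B has modes {a,b} */
            float valueC = gamma * C[a + b * ext_a]; /* C has modes {a,b} */
            D[a + b * ext_a] =
                apply_binary(opABC, apply_binary(opAB, valueA, valueB), valueC);
        }
    }
}
```

With opAB = OP_ADD and opABC = OP_MIN this mirrors the last example in the list above, where the scaled sum of A and B is clamped by C.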
cutensorElementwiseBinary()

cutensorStatus_t cutensorElementwiseBinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAC, cudaDataType_t typeScalar, cudaStream_t stream)
 Brief
Elementwise tensor operation for two input tensors
 Details
This function performs an elementwise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
See cutensorElementwiseTrinary() for details.
 Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[in] alpha
: Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
[in] A
: Multimode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.
[in] descA
: A descriptor that holds the information about the data type, modes, and strides of A.
[in] modeA
: Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] gamma
: Scaling factor for C (see equation above) of type typeScalar. Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
[in] C
: Multimode tensor of type typeC with nmodeC modes. Pointer to the GPU-accessible memory.
[in] descC
: The C descriptor that holds information about the data type, modes, and strides of C.
[in] modeC
: Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[out] D
: Multimode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to the GPU-accessible memory. Notice that D may alias any input tensor if they share the same memory layout (i.e., same tensor descriptor).
[in] descD
: The D descriptor that holds information about the data type, modes, and strides of D. Notice that descD and descC are currently required to be identical.
[in] modeD
: Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] opAC
: Elementwise binary operator (see \(\Phi_{AC}\) above).
[in] typeScalar
: Scalar type for the intermediate computation.
[in] stream
: The CUDA stream.
 Return Value
CUTENSOR_STATUS_NOT_SUPPORTED
: if the combination of data types or operations is not supported.
CUTENSOR_STATUS_INVALID_VALUE
: if tensor dimensions or modes have an illegal value.
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully without error.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
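Broadcasting by omitting a mode is easiest to see in the two-input case. The following plain-C host reference is a sketch, not cuTENSOR code: it assumes A carries only the mode b while C and D carry {a,b} in packed column-major order, opAC is addition, and the unary operators are the identity. The function name binary_broadcast_add is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Host reference for D[a,b] = alpha * A[b] + gamma * C[a,b].
 * A omits the a-mode, so each A[b] is reused (broadcast) for every a. */
void binary_broadcast_add(int64_t ext_a, int64_t ext_b,
                          float alpha, const float *A,
                          float gamma, const float *C, float *D)
{
    assert(ext_a > 0 && ext_b > 0);
    for (int64_t b = 0; b < ext_b; ++b) {
        for (int64_t a = 0; a < ext_a; ++a) {
            int64_t offset = a + b * ext_a; /* packed column-major {a,b} */
            D[offset] = alpha * A[b] + gamma * C[offset];
        }
    }
}
```

Because C and D share the same layout in this sketch, passing the same buffer for both updates C in place, which matches the aliasing note for D.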
cutensorPermutation()

cutensorStatus_t cutensorPermutation(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const cudaDataType_t typeScalar, const cudaStream_t stream)
 Brief
Tensor permutation
 Details
This function performs an elementwise tensor operation of the form:
\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]
Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorElementwise.
Where
A and B are multimode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,
\(\Psi\) is a unary elementwise operator (e.g., IDENTITY, SQR, CONJUGATE), and
\(\Psi\) is specified in the tensor descriptor descA.
Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Modes may appear in any order. The only restrictions are:
modes that appear in A must also appear in the output tensor.
each mode may appear in each tensor at most once.
 Remark
calls asynchronous functions, not reentrant, and thread-safe
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[in] alpha
: Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
[in] A
: Multimode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.
[in] descA
: A descriptor that holds information about the data type, modes, and strides of A.
[in] modeA
: Array of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {'a','b','c'}).
[inout] B
: Multimode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.
[in] descB
: A descriptor that holds information about the data type, modes, and strides of B.
[in] modeB
: Array of size descB->numModes that holds the names of the modes of B.
[in] typeScalar
: Data type of alpha.
[in] stream
: The CUDA stream.
 Return Value
CUTENSOR_STATUS_NOT_SUPPORTED
: if the combination of data types or operations is not supported.
CUTENSOR_STATUS_INVALID_VALUE
: if tensor dimensions or modes have an illegal value.
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully without error.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
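As a reference for what the permutation computes, here is a minimal host-side sketch in plain C for one fixed case, B_{c,a,b} = alpha * A_{a,b,c}. It is not a cuTENSOR call: both tensors are assumed to be packed column-major floats, the unary operator is the identity, and permute_abc_to_cab is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Host reference for B[c,a,b] = alpha * A[a,b,c].
 * extent[] holds the extents of the modes a, b, c in that order. */
void permute_abc_to_cab(const int64_t extent[3], float alpha,
                        const float *A, float *B)
{
    const int64_t na = extent[0];
    const int64_t nb = extent[1];
    const int64_t nc = extent[2];
    assert(na > 0 && nb > 0 && nc > 0);
    assert(A != B); /* the permutation is out-of-place */
    for (int64_t c = 0; c < nc; ++c) {
        for (int64_t b = 0; b < nb; ++b) {
            for (int64_t a = 0; a < na; ++a) {
                int64_t offsetA = a + b * na + c * na * nb; /* modes {a,b,c} */
                int64_t offsetB = c + a * nc + b * nc * na; /* modes {c,a,b} */
                B[offsetB] = alpha * A[offsetA];
            }
        }
    }
}
```

Every mode of A also appears in B and each mode appears once per tensor, so the case satisfies both restrictions listed above.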
Contraction Operations
The following functions perform contractions between tensors.
cutensorInitContractionDescriptor()

cutensorStatus_t cutensorInitContractionDescriptor(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const uint32_t alignmentRequirementA, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const uint32_t alignmentRequirementB, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const uint32_t alignmentRequirementC, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], const uint32_t alignmentRequirementD, cutensorComputeType_t typeCompute)
 Brief
Describes the tensor contraction problem of the form:
\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]
 Details
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[out] desc
: This opaque struct gets filled with the information that encodes the tensor contraction problem.
[in] descA
: A descriptor that holds the information about the data type, modes and strides of A.
[in] modeA
: Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementA
: Alignment that cuTENSOR may require for A’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] descB
: The B descriptor that holds information about the data type, modes, and strides of B.
[in] modeB
: Array with ‘nmodeB’ entries that represent the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementB
: Alignment that cuTENSOR may require for B’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] descC
: The C descriptor that holds information about the data type, modes, and strides of C.
[in] modeC
: Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementC
: Alignment that cuTENSOR may require for C’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] descD
: The D descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).
[in] modeD
: Array with ‘nmodeD’ entries that represent the modes of D (must be identical to modeC for now). modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementD
: Alignment that cuTENSOR may require for D’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] typeCompute
: Data type for the intermediate computation T = A * B.
 Return Value
CUTENSOR_STATUS_NOT_SUPPORTED
: if the combination of data types or operations is not supported.
CUTENSOR_STATUS_INVALID_VALUE
: if tensor dimensions or modes have an illegal value.
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully without error.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
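The mode arrays fully determine which indices are summed over. As a hedged illustration in plain C (not cuTENSOR code), the sketch below treats a mode as contracted when it appears in both modeA and modeB but not in modeC, which is the usual reading of the contraction formula above. The helper names contains_mode and contracted_modes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if mode occurs in modes[], else 0. */
int contains_mode(const int32_t modes[], uint32_t numModes, int32_t mode)
{
    for (uint32_t i = 0; i < numModes; ++i) {
        if (modes[i] == mode) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: writes every mode that appears in A and B but not in C
 * to contracted[] (capacity of at least nmodeA entries) and returns the count. */
uint32_t contracted_modes(const int32_t modeA[], uint32_t nmodeA,
                          const int32_t modeB[], uint32_t nmodeB,
                          const int32_t modeC[], uint32_t nmodeC,
                          int32_t contracted[])
{
    uint32_t count = 0;
    assert(modeA && modeB && modeC && contracted);
    for (uint32_t i = 0; i < nmodeA; ++i) {
        if (contains_mode(modeB, nmodeB, modeA[i]) &&
            !contains_mode(modeC, nmodeC, modeA[i])) {
            contracted[count++] = modeA[i];
        }
    }
    return count;
}
```

For C_{m,u,n,v} = A_{m,h,k,n} B_{u,k,v,h}, the sketch reports h and k as the contracted modes.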
cutensorContractionDescriptorSetAttribute()

cutensorStatus_t cutensorContractionDescriptorSetAttribute(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, cutensorContractionDescriptorAttributes_t attr, const void *buf, size_t sizeInBytes)
 Brief
Sets an attribute of a contraction descriptor
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[inout] desc
: Contraction descriptor that will be modified.
[in] attr
: Specifies the attribute that will be set.
[in] buf
: This buffer (of size sizeInBytes) determines the value to which attr will be set.
[in] sizeInBytes
: Size of buf (in bytes).
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
cutensorInitContractionFind()

cutensorStatus_t cutensorInitContractionFind(const cutensorHandle_t *handle, cutensorContractionFind_t *find, const cutensorAlgo_t algo)
 Brief
Limits the search space of viable candidates (a.k.a. algorithms)
 Details
This function gives the user finer control over the candidates that the subsequent call to cutensorInitContractionPlan is allowed to evaluate.
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[out] find
:
[in] algo
: Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If a specified algorithm is not supported, CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_NOT_SUPPORTED
: if the specified algorithm is not supported.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
cutensorContractionFindSetAttribute()

cutensorStatus_t cutensorContractionFindSetAttribute(const cutensorHandle_t *handle, cutensorContractionFind_t *find, cutensorContractionFindAttributes_t attr, const void *buf, size_t sizeInBytes)
 Brief
Sets an attribute of cutensorContractionFind
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[inout] find
: This opaque struct restricts the search space of viable candidates.
[in] attr
: Specifies the attribute that will be set.
[in] buf
: This buffer (of size sizeInBytes) determines the value to which attr will be set.
[in] sizeInBytes
: Size of buf (in bytes).
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
cutensorContractionGetWorkspace()
cutensorInitContractionPlan()

cutensorStatus_t cutensorInitContractionPlan(const cutensorHandle_t *handle, cutensorContractionPlan_t *plan, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const uint64_t workspaceSize)
 Brief
Initializes the contraction plan for a given tensor contraction problem
 Details
This function applies cuTENSOR’s heuristic to select a candidate for a given tensor contraction problem (encoded by desc). The resulting plan can be reused multiple times as long as the tensor contraction problem remains the same.
The plan is created for the active CUDA device.
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[out] plan
: Opaque handle holding the contraction execution plan (i.e., the candidate that will be executed as well as all its runtime parameters for the given tensor contraction problem).
[in] desc
: This opaque struct encodes the given tensor contraction problem.
[in] find
: This opaque struct is used to restrict the search space of viable candidates.
[in] workspaceSize
: Available workspace size (in bytes).
 Return Value
CUTENSOR_STATUS_SUCCESS
: if a viable candidate has been found.
CUTENSOR_STATUS_NOT_SUPPORTED
: if no viable candidate could be found.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE
: if the provided workspace was insufficient.
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
cutensorContraction()

cutensorStatus_t cutensorContraction(const cutensorHandle_t *handle, const cutensorContractionPlan_t *plan, const void *alpha, const void *A, const void *B, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)
 Brief
This routine computes the tensor contraction
\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]
 Details
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
The currently active CUDA device must match the CUDA device that was active at the time at which the plan was created.
Supported datatype combinations are:
typeA        typeB        typeC        typeCompute            Tensor Core
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F   CUTENSOR_COMPUTE_32F   Volta+
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF  CUTENSOR_COMPUTE_32F   Ampere+
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_32F   No
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_TF32  Ampere+
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_16BF  Ampere+
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F   CUTENSOR_COMPUTE_16F   Volta+
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F   CUTENSOR_COMPUTE_64F   Ampere+
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F   CUTENSOR_COMPUTE_32F   No
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F   CUTENSOR_COMPUTE_32F   No
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F   CUTENSOR_COMPUTE_TF32  Ampere+
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F   CUTENSOR_COMPUTE_64F   Ampere+
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F   CUTENSOR_COMPUTE_32F   No
CUDA_R_64F   CUDA_C_64F   CUDA_C_64F   CUTENSOR_COMPUTE_64F   No
CUDA_C_64F   CUDA_R_64F   CUDA_C_64F   CUTENSOR_COMPUTE_64F   No
 [Example]
See https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR/contraction.cu for a concrete example.
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[in] plan
: Opaque handle holding the contraction execution plan.
[in] alpha
: Scaling for A*B. Its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] A
: Pointer to the data corresponding to A. Pointer to the GPU-accessible memory.
[in] B
: Pointer to the data corresponding to B. Pointer to the GPU-accessible memory.
[in] beta
: Scaling for C. Its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] C
: Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.
[out] D
: Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.
[out] workspace
: Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 128 bytes.
[in] workspaceSize
: Size of the workspace array in bytes; please refer to cutensorContractionGetWorkspace() to query the required workspace. While cutensorContraction() does not strictly require a workspace for the reduction, it is still recommended to provide some small workspace (e.g., 128 MB).
[in] stream
: The CUDA stream in which all the computation is performed.
 Return Value
CUTENSOR_STATUS_NOT_SUPPORTED
: if the operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
CUTENSOR_STATUS_ARCH_MISMATCH
: if the plan was created for a different device than the currently active device.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER
: if the driver is insufficient.
CUTENSOR_STATUS_CUDA_ERROR
: if some unknown CUDA error has occurred (e.g., out of memory).
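For checking results on small inputs, the contraction formula can be written as a host-side reference. The sketch below is plain C and not a cuTENSOR call: it covers only the matrix-like case D_{m,n} = alpha * sum_k A_{m,k} B_{k,n} + beta * C_{m,n} with packed column-major float tensors, and the name contraction_reference is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Host reference for D[m,n] = alpha * sum_k A[m,k] * B[k,n] + beta * C[m,n].
 * The k-mode appears in A and B only, so it is summed over. */
void contraction_reference(int64_t ext_m, int64_t ext_n, int64_t ext_k,
                           float alpha, const float *A, const float *B,
                           float beta, const float *C, float *D)
{
    assert(ext_m > 0 && ext_n > 0 && ext_k > 0);
    for (int64_t n = 0; n < ext_n; ++n) {
        for (int64_t m = 0; m < ext_m; ++m) {
            float sum = 0.0f;
            for (int64_t k = 0; k < ext_k; ++k) {
                /* A has modes {m,k}, B has modes {k,n} */
                sum += A[m + k * ext_m] * B[k + n * ext_k];
            }
            D[m + n * ext_m] = alpha * sum + beta * C[m + n * ext_m];
        }
    }
}
```

A reference like this is only practical for tiny extents, but it gives a direct way to validate the mode arrays and scalars passed to the library.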
cutensorContractionMaxAlgos()

cutensorStatus_t cutensorContractionMaxAlgos(int32_t *maxNumAlgos)
 Brief
This routine returns the maximum number of algorithms available to compute tensor contractions
 [NOTE] Not all algorithms might be applicable to your specific problem. cutensorContraction() will return CUTENSOR_STATUS_NOT_SUPPORTED if an algorithm is not applicable.
 Parameters
[out] maxNumAlgos
: This value will hold the maximum number of algorithms available for cutensorContraction(). You can use the returned integer for autotuning purposes (i.e., iterate over all algorithms up to the returned value).
 Return Value
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
Reduction Operations
The following functions perform tensor reductions.
cutensorReduction()

cutensorStatus_t cutensorReduction(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, void *workspace, uint64_t workspaceSize, cudaStream_t stream)
 Brief
Implements a tensor reduction of the form
\[ D = \alpha * \mathrm{opReduce}(\mathrm{opA}(A)) + \beta * \mathrm{opC}(C) \]
 Details
For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k];
This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.
The binary opReduce operator provides extra control over what kind of reduction ought to be performed. For instance, opReduce == CUTENSOR_OP_ADD reduces the elements of A via a summation while CUTENSOR_OP_MAX would find the largest element in A.
Supported datatype combinations are:
typeA        typeB        typeC        typeCompute
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F
CUDA_R_16F   CUDA_R_16F   CUDA_R_16F
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF
CUDA_R_16BF  CUDA_R_16BF  CUDA_R_16BF
CUDA_R_32F   CUDA_R_32F   CUDA_R_32F
CUDA_R_64F   CUDA_R_64F   CUDA_R_64F
CUDA_C_32F   CUDA_C_32F   CUDA_C_32F
CUDA_C_64F   CUDA_C_64F   CUDA_C_64F
 Parameters
[in] handle
: Opaque handle holding cuTENSOR’s library context.
[in] alpha
: Scaling for A; its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] A
: Pointer to the data corresponding to A in device memory. Pointer to the GPU-accessible memory.
[in] descA
: A descriptor that holds the information about the data type, modes and strides of A.
[in] modeA
: Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).
[in] beta
: Scaling for C; its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] C
: Pointer to the data corresponding to C in device memory. Pointer to the GPU-accessible memory.
[in] descC
: A descriptor that holds the information about the data type, modes and strides of C.
[in] modeC
: Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[out] D
: Pointer to the data corresponding to D in device memory. Pointer to the GPU-accessible memory.
[in] descD
: Must be identical to descC for now.
[in] modeD
: Must be identical to modeC for now.
[in] opReduce
: Binary operator used to reduce elements of A.
[in] typeCompute
: All arithmetic is performed using this data type (i.e., it affects the accuracy and performance).
[out] workspace
: Scratchpad (device) memory; the workspace must be aligned to 128 bytes.
[in] workspaceSize
: Please use cutensorReductionGetWorkspaceSize() to query the required workspace. While lower values, including zero, are valid, they may lead to grossly suboptimal performance.
[in] stream
: The CUDA stream in which all the computation is performed.
 Return Value
CUTENSOR_STATUS_NOT_SUPPORTED
: if the operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE
: if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS
: The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED
: if the handle is not initialized.
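The partial reduction mentioned in the details can be sketched on the host as follows. This is plain C, not cuTENSOR code: it assumes A has modes {k,j,i} and C and D have modes {i,j}, all packed column-major floats with identity unary operators, and it supports only two reduction operators. The names reduce_op and partial_reduction_reference are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the opReduce choices discussed in the text. */
typedef enum { REDUCE_ADD, REDUCE_MAX } reduce_op;

/* Host reference for D[i,j] = alpha * reduce_k(A[k,j,i]) + beta * C[i,j].
 * Only the k-mode is reduced because it appears in A but not in C. */
void partial_reduction_reference(int64_t ext_i, int64_t ext_j, int64_t ext_k,
                                 float alpha, const float *A,
                                 float beta, const float *C,
                                 float *D, reduce_op op)
{
    assert(ext_i > 0 && ext_j > 0 && ext_k > 0);
    for (int64_t j = 0; j < ext_j; ++j) {
        for (int64_t i = 0; i < ext_i; ++i) {
            const float *column = A + j * ext_k + i * ext_k * ext_j;
            float acc = column[0]; /* start from the first element along k */
            for (int64_t k = 1; k < ext_k; ++k) {
                if (op == REDUCE_ADD) {
                    acc += column[k];
                } else if (column[k] > acc) {
                    acc = column[k];
                }
            }
            D[i + j * ext_i] = alpha * acc + beta * C[i + j * ext_i];
        }
    }
}
```

Setting ext_i and ext_j to 1 turns the same loop into a full reduction of A to a single value.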
cutensorReductionGetWorkspace()