cuTENSOR Functions#
Helper Functions#
The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.
cutensorCreate()#
cutensorStatus_t cutensorCreate(cutensorHandle_t *handle)#
Initializes the cuTENSOR library and allocates the memory for the library context.
The device associated with a particular cuTENSOR handle is assumed to remain unchanged after the cutensorCreate call. In order for the cuTENSOR library to use a different device, the application must set the new device to be used by calling cudaSetDevice and then create another cuTENSOR handle, which will be associated with the new device, by calling cutensorCreate.
Moreover, each handle by default has a plan cache that stores the most recently used cutensorPlan_t objects (evicting the least recently used ones); its default capacity is 64 plans, but it can be changed via cutensorHandleResizePlanCache if more storage is needed. See the Plan Cache Guide for more information.
The user is responsible for calling cutensorDestroy to free the resources associated with the handle.
Remark
blocking, not reentrant, and thread-safe
- Parameters:
handle – [out] Pointer to cutensorHandle_t
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorDestroy()#
cutensorStatus_t cutensorDestroy(cutensorHandle_t handle)#
Frees all resources related to the provided library handle.
Remark
blocking, not reentrant, and thread-safe
- Parameters:
handle – [inout] The cutensorHandle_t object whose resources will be freed.
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorCreateTensorDescriptor()#
- cutensorStatus_t cutensorCreateTensorDescriptor(
- const cutensorHandle_t handle,
- cutensorTensorDescriptor_t *desc,
- const uint32_t numModes,
- const int64_t extent[],
- const int64_t stride[],
- cudaDataType_t dataType,
- uint32_t alignmentRequirement)#
Creates a tensor descriptor.
This allocates a small amount of host-memory.
The user is responsible for calling cutensorDestroyTensorDescriptor() to free the associated resources once the tensor descriptor is no longer used.
Remark
non-blocking, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] Pointer to the address where the allocated tensor descriptor object will be stored.
numModes – [in] Number of modes.
extent – [in] Extent of each mode (must be larger than zero).
stride – [in] stride[i] denotes the displacement (a.k.a. stride), in elements of the base type, between two consecutive elements of the i-th mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero; the effect of a zero stride can instead be achieved by omitting that mode entirely. For instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, you can write C[a,b] = A[b] directly; cuTENSOR will then automatically infer that the a-mode in A should be broadcast.
dataType – [in] Data type of the stored entries.
alignmentRequirement – [in] Alignment (in bytes) of the base pointer that will be used in conjunction with this tensor descriptor (e.g., cudaMalloc has a default alignment of 256 bytes).
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor is not supported (e.g., due to non-supported data type).
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
- Pre:
extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes
cutensorDestroyTensorDescriptor()#
cutensorStatus_t cutensorDestroyTensorDescriptor(cutensorTensorDescriptor_t desc)#
Frees all resources related to the provided tensor descriptor.
Remark
blocking, not reentrant, and thread-safe
- Parameters:
desc – [inout] The cutensorTensorDescriptor_t object that will be deallocated.
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorGetErrorString()#
const char *cutensorGetErrorString(const cutensorStatus_t error)#
Returns the description string for an error code.
Remark
non-blocking, not reentrant, and thread-safe
- Parameters:
error – [in] Error code to convert to string.
- Return values:
The null-terminated error string.
cutensorGetVersion()#
size_t cutensorGetVersion()#
Returns the version number of the cuTENSOR library.
cutensorGetCudartVersion()#
size_t cutensorGetCudartVersion()#
Returns version number of the CUDA runtime that cuTENSOR was compiled against.
Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
Element-wise Operations#
The following functions perform element-wise operations between tensors.
cutensorCreateElementwiseTrinary()#
- cutensorStatus_t cutensorCreateElementwiseTrinary(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorTensorDescriptor_t descB,
- const int32_t modeB[],
- cutensorOperator_t opB,
- const cutensorTensorDescriptor_t descC,
- const int32_t modeC[],
- cutensorOperator_t opC,
- const cutensorTensorDescriptor_t descD,
- const int32_t modeD[],
- cutensorOperator_t opAB,
- cutensorOperator_t opABC,
- const cutensorComputeDescriptor_t descCompute)#
This function creates an operation descriptor that encodes an elementwise trinary operation.
Said trinary operation has the following general form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Where
A,B,C,D are multi-mode tensors (of arbitrary data types).
\(\Pi^A, \Pi^B, \Pi^C \) are permutation operators that permute the modes of A, B, and C respectively.
\(op_{A},op_{B},op_{C}\) are unary element-wise operators (e.g., IDENTITY, CONJUGATE).
\(\Phi_{ABC}, \Phi_{AB}\) are binary element-wise operators (e.g., ADD, MUL, MAX, MIN).
Notice that the broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:
modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted and such an operation would be covered by either cutensorContract or cutensorReduce.
each mode may appear in each tensor at most once.
Input tensors may be read even if the value of the corresponding scalar is zero.
Examples:
\( D_{a,b,c,d} = A_{b,d,a,c}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)
\( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)
Call cutensorElementwiseTrinaryExecute to perform the actual operation.
Please use cutensorDestroyOperationDescriptor to deallocate the descriptor once it is no longer used.
Supported data-type combinations are:
| typeA | typeB | typeC | descCompute |
|---|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_64F |
Remark
calls asynchronous functions, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested elementwise operation.
descA – [in] A descriptor that holds the information about the data type, modes, and strides of A.
modeA – [in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if \(A_{a,b,c}\) then modeA = {‘a’,’b’,’c’}). The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] A descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.
descC – [in] A descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. The modeC[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] A descriptor that holds information about the data type, modes, and strides of D. Notice that we currently require descD and descC to be identical.
modeD – [in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. The modeD[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.
opAB – [in] Element-wise binary operator (see \(\Phi_{AB}\) above).
opABC – [in] Element-wise binary operator (see \(\Phi_{ABC}\) above).
descCompute – [in] Determines the precision in which this operation is performed.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_ARCH_MISMATCH – if the device is either not ready, or the target architecture is not supported.
cutensorElementwiseTrinaryExecute()#
- cutensorStatus_t cutensorElementwiseTrinaryExecute(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *A,
- const void *beta,
- const void *B,
- const void *gamma,
- const void *C,
- void *D,
- cudaStream_t stream)#
Performs an element-wise tensor operation for three input tensors (see cutensorCreateElementwiseTrinary).
This function performs an element-wise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta op_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma op_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
See cutensorCreateElementwiseTrinary() for details.
Remark
calls asynchronous functions, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding all information about the desired elementwise operation (created by cutensorCreateElementwiseTrinary followed by cutensorCreatePlan).
alpha – [in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor (described by descA as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
beta – [in] Scaling factor for B (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.
B – [in] Multi-mode tensor (described by descB as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
gamma – [in] Scaling factor for C (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
C – [in] Multi-mode tensor (described by descC as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
D – [out] Multi-mode tensor (described by descD as part of cutensorCreateElementwiseTrinary). Pointer to the GPU-accessible memory (C and D may be identical, if and only if descC == descD).
stream – [in] The CUDA stream used to perform the operation.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorCreateElementwiseBinary()#
- cutensorStatus_t cutensorCreateElementwiseBinary(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorTensorDescriptor_t descC,
- const int32_t modeC[],
- cutensorOperator_t opC,
- const cutensorTensorDescriptor_t descD,
- const int32_t modeD[],
- cutensorOperator_t opAC,
- const cutensorComputeDescriptor_t descCompute)#
This function creates an operation descriptor for an elementwise binary operation.
The binary operation has the following general form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
Call cutensorElementwiseBinaryExecute to perform the actual operation.
Supported data-type combinations are:
| typeA | typeC | descCompute |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_R_32F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_64F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_64F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_64F |
Remark
calls asynchronous functions, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested elementwise operation.
descA – [in] The descriptor that holds the information about the data type, modes, and strides of A.
modeA – [in] Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’}). The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descC – [in] The descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array (in host memory) of size descC->numModes that holds the names of the modes of C. The modeC[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] The descriptor that holds information about the data type, modes, and strides of D. Notice that we currently require descD and descC to be identical.
modeD – [in] Array (in host memory) of size descD->numModes that holds the names of the modes of D. The modeD[i] corresponds to extent[i] and stride[i] of the cutensorCreateTensorDescriptor.
opAC – [in] Element-wise binary operator (see \(\Phi_{AC}\) above).
descCompute – [in] Determines the precision in which this operation is performed.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorElementwiseBinaryExecute()#
- cutensorStatus_t cutensorElementwiseBinaryExecute(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *A,
- const void *gamma,
- const void *C,
- void *D,
- cudaStream_t stream)#
Performs an element-wise tensor operation for two input tensors (see cutensorCreateElementwiseBinary).
This function performs an element-wise tensor operation of the form:
\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]
See cutensorCreateElementwiseBinary() for details.
Remark
calls asynchronous functions, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding all information about the desired elementwise operation (created by cutensorCreateElementwiseBinary followed by cutensorCreatePlan).
alpha – [in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor (described by descA as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
gamma – [in] Scaling factor for C (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE) to query the expected data type). Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
C – [in] Multi-mode tensor (described by descC as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
D – [out] Multi-mode tensor (described by descD as part of cutensorCreateElementwiseBinary). Pointer to the GPU-accessible memory (C and D may be identical, if and only if descC == descD).
stream – [in] The CUDA stream used to perform the operation.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorCreatePermutation()#
- cutensorStatus_t cutensorCreatePermutation(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorTensorDescriptor_t descB,
- const int32_t modeB[],
- const cutensorComputeDescriptor_t descCompute)#
This function creates an operation descriptor for a tensor permutation.
The tensor permutation has the following general form:
\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha op_A(A_{\Pi^A(i_0,i_1,...,i_n)}) \]
Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorCreateElementwiseBinary.
Where
A and B are multi-mode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,
\(op_A\) is a unary element-wise operator (e.g., IDENTITY, SQR, CONJUGATE), specified via the opA argument.
Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.
Modes may appear in any order. The only restrictions are:
modes that appear in A must also appear in the output tensor.
each mode may appear in each tensor at most once.
Supported data-type combinations are:
| typeA | typeB | descCompute |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_16F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_32F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_R_32F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_R_64F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_32F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_64F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_64F |
Remark
calls asynchronous functions, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested permutation.
descA – [in] The descriptor that holds information about the data type, modes, and strides of A.
modeA – [in] Array of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {‘a’,’b’,’c’})
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] The descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array of size descB->numModes that holds the names of the modes of B
descCompute – [in] Determines the precision in which this operation is performed.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorPermute()#
- cutensorStatus_t cutensorPermute(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *A,
- void *B,
- const cudaStream_t stream)#
Performs the tensor permutation that is encoded by plan (see cutensorCreatePermutation).
This function performs an elementwise tensor operation of the form:
\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]
Consequently, this function performs an out-of-place tensor permutation.
Where
A and B are multi-mode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,
\(\Psi\) is a unary element-wise operator (e.g., IDENTITY, SQR, CONJUGATE), specified via the opA argument of cutensorCreatePermutation.
Remark
calls asynchronous functions, not reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding all information about the desired tensor permutation (created by cutensorCreatePermutation followed by cutensorCreatePlan).
alpha – [in] Scaling factor for A (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
A – [in] Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to B.
B – [inout] Multi-mode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.
stream – [in] The CUDA stream.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
Contraction Operations#
The following functions perform contractions between tensors.
cutensorCreateContraction()#
- cutensorStatus_t cutensorCreateContraction(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorTensorDescriptor_t descB,
- const int32_t modeB[],
- cutensorOperator_t opB,
- const cutensorTensorDescriptor_t descC,
- const int32_t modeC[],
- cutensorOperator_t opC,
- const cutensorTensorDescriptor_t descD,
- const int32_t modeD[],
- const cutensorComputeDescriptor_t descCompute)#
This function allocates a cutensorOperationDescriptor_t object that encodes a tensor contraction of the form \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \).
Allocates data for desc to be used to perform a tensor contraction of the form
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(\mathcal{B}_{{modes}_\mathcal{B}}) + \beta op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}). \]
See cutensorCreatePlan (or cutensorCreatePlanAutotuned) to create the plan (i.e., to select the kernel) followed by a call to cutensorContract to perform the actual contraction.
The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.
Supported data-type combinations are:
| typeA | typeB | typeC | descCompute | typeScalar | Tensor Core |
|---|---|---|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F | Volta+ |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F | No |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_TF32 | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_3XTF32 | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_16BF | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_16F | CUDA_R_32F | Volta+ |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_R_64F | Ampere+ |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_64F | No |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F | CUDA_C_32F | No |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_TF32 | CUDA_C_32F | Ampere+ |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_3XTF32 | CUDA_C_32F | Ampere+ |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_C_64F | Ampere+ |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_32F | CUDA_C_64F | No |
| CUDA_R_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_C_64F | No |
| CUDA_C_64F | CUDA_R_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_C_64F | No |
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the tensor contraction operation.
descA – [in] The descriptor that holds the information about the data type, modes and strides of A.
modeA – [in] Array with ‘nmodeA’ entries that represent the modes of A. The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] The descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array with ‘nmodeB’ entries that represent the modes of B. The modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.
modeC – [in] Array with ‘nmodeC’ entries that represent the modes of C. The modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
descC – [in] The descriptor that holds information about the data type, modes, and strides of C.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
modeD – [in] Array with ‘nmodeD’ entries that represent the modes of D (must be identical to modeC for now). The modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
descD – [in] The descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).
descCompute – [in] Determines the precision in which this operation is performed.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorContract()#
- cutensorStatus_t cutensorContract(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *A,
- const void *B,
- const void *beta,
- const void *C,
- void *D,
- void *workspace,
- uint64_t workspaceSize,
- cudaStream_t stream)#
This routine computes the tensor contraction \( D = \alpha A B + \beta C \).
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
The active CUDA device must match the CUDA device that was active at the time at which the plan was created.
- [Example]
See NVIDIA/CUDALibrarySamples for a concrete example.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding the contraction execution plan (created by cutensorCreateContraction followed by cutensorCreatePlan).
alpha – [in] Scaling for A*B. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
A – [in] Pointer to the data corresponding to A. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
B – [in] Pointer to the data corresponding to B. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
beta – [in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
C – [in] Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.
D – [out] Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.
workspace – [out] Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).
workspaceSize – [in] Size of the workspace array in bytes; please refer to cutensorEstimateWorkspaceSize to query the required workspace. While cutensorContract does not strictly require a workspace for the contraction, it is still recommended to provide a small workspace (e.g., 128 MB).
stream – [in] The CUDA stream in which all the computation is performed.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.
CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).
cutensorCreateContractionTrinary()
#
- cutensorStatus_t cutensorCreateContractionTrinary(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorTensorDescriptor_t descB,
- const int32_t modeB[],
- cutensorOperator_t opB,
- const cutensorTensorDescriptor_t descC,
- const int32_t modeC[],
- cutensorOperator_t opC,
- const cutensorTensorDescriptor_t descD,
- const int32_t modeD[],
- cutensorOperator_t opD,
- const cutensorTensorDescriptor_t descE,
- const int32_t modeE[],
- const cutensorComputeDescriptor_t descCompute,
This function allocates a cutensorOperationDescriptor_t object that encodes a tensor contraction of the form \( \mathcal{E} = \alpha \mathcal{A} \mathcal{B} \mathcal{C} + \beta \mathcal{D} \).
Allocates data for desc to be used to perform a tensor contraction of the form
\[ \mathcal{E}_{{modes}_\mathcal{E}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(\mathcal{B}_{{modes}_\mathcal{B}}) op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}) + \beta op_\mathcal{D}(\mathcal{D}_{{modes}_\mathcal{D}}). \]
See cutensorCreatePlan (or cutensorCreatePlanAutotuned) to create the plan (i.e., to select the kernel), followed by a call to cutensorContractTrinary to perform the actual contraction.
The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.
The performance benefits of this API are currently most pronounced when the data resides in host memory (i.e., out-of-core), in particular on Grace-based systems.
Supported data-type combinations are:
| typeA | typeB | typeC | typeD | descCompute | typeScalar | Tensor Core |
|---|---|---|---|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F | Volta+ |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F | No |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_TF32 | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_3XTF32 | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_16BF | CUDA_R_32F | Ampere+ |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_16F | CUDA_R_32F | Volta+ |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_R_64F | Ampere+ |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_64F | No |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F | CUDA_C_32F | No |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_TF32 | CUDA_C_32F | Ampere+ |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_3XTF32 | CUDA_C_32F | Ampere+ |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_C_64F | Ampere+ |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_32F | CUDA_C_64F | No |
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the tensor contraction operation.
descA – [in] The descriptor that holds the information about the data type, modes and strides of A.
modeA – [in] Array with ‘nmodeA’ entries that represent the modes of A. The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] The descriptor that holds information about the data type, modes, and strides of B.
modeB – [in] Array with ‘nmodeB’ entries that represent the modes of B. The modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.
descC – [in] The descriptor that holds information about the data type, modes, and strides of C.
modeC – [in] Array with ‘nmodeC’ entries that represent the modes of C. The modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] The descriptor that holds information about the data type, modes, and strides of D.
modeD – [in] Array with ‘nmodeD’ entries that represent the modes of D. The modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
opD – [in] Unary operator that will be applied to each element of D before it is further processed. The original data of this tensor remains unchanged.
descE – [in] The descriptor that holds information about the data type, modes, and strides of E (must be identical to descD for now).
modeE – [in] Array with ‘nmodeE’ entries that represent the modes of E (must be identical to modeD for now). The modeE[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
descCompute – [in] Determines the precision in which this operation is performed.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorContractTrinary()
#
- cutensorStatus_t cutensorContractTrinary(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *A,
- const void *B,
- const void *C,
- const void *beta,
- const void *D,
- void *E,
- void *workspace,
- uint64_t workspaceSize,
- cudaStream_t stream,
This routine computes the tensor contraction \( \mathcal{E} = \alpha \mathcal{A} \mathcal{B} \mathcal{C} + \beta \mathcal{D} \).
\[ \mathcal{E}_{{modes}_\mathcal{E}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} \mathcal{C}_{{modes}_\mathcal{C}} + \beta \mathcal{D}_{{modes}_\mathcal{D}} \]
The active CUDA device must match the CUDA device that was active at the time at which the plan was created.
- [Example]
See NVIDIA/CUDALibrarySamples for a concrete example.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding the contraction execution plan (created by cutensorCreateContractionTrinary followed by cutensorCreatePlan).
alpha – [in] Scaling for A*B*C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
A – [in] Pointer to the data corresponding to A. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to E.
B – [in] Pointer to the data corresponding to B. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to E.
C – [in] Pointer to the data corresponding to C. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to E.
beta – [in] Scaling for D. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
D – [in] Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.
E – [out] Pointer to the data corresponding to E. Pointer to the GPU-accessible memory.
workspace – [out] Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).
workspaceSize – [in] Size of the workspace array in bytes; please refer to cutensorEstimateWorkspaceSize to query the required workspace. While cutensorContractTrinary does not strictly require a workspace for the contraction, it is still recommended to provide some small workspace (e.g., 128 MB).
stream – [in] The CUDA stream in which all the computation is performed.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.
CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).
Reduction Operations#
The following functions perform tensor reductions.
cutensorCreateReduction()
#
- cutensorStatus_t cutensorCreateReduction(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorTensorDescriptor_t descC,
- const int32_t modeC[],
- cutensorOperator_t opC,
- const cutensorTensorDescriptor_t descD,
- const int32_t modeD[],
- cutensorOperator_t opReduce,
- const cutensorComputeDescriptor_t descCompute,
Creates a cutensorOperationDescriptor_t object that encodes a tensor reduction of the form \( D = \alpha \, opReduce(opA(A)) + \beta \, opC(C) \).
For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k].
This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case, only elements along the k-mode are contracted.
The binary opReduce operator provides extra control over the kind of reduction to be performed. For instance, setting opReduce to CUTENSOR_OP_ADD reduces the elements of A via a summation, while CUTENSOR_OP_MAX finds the largest element of A.
Supported data-type combinations are:
| typeA | typeB | typeC | typeCompute |
|---|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_16F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_16BF |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F |
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the requested tensor reduction operation.
descA – [in] The descriptor that holds the information about the data type, modes and strides of A.
modeA – [in] Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descC – [in] The descriptor that holds the information about the data type, modes and strides of C.
modeC – [in] Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorCreateTensorDescriptor.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] Must be identical to descC for now.
modeD – [in] Must be identical to modeC for now.
opReduce – [in] binary operator used to reduce elements of A.
descCompute – [in] All arithmetic is performed using this compute type (i.e., it affects the accuracy and performance).
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
cutensorReduce()
#
- cutensorStatus_t cutensorReduce(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *A,
- const void *beta,
- const void *C,
- void *D,
- void *workspace,
- uint64_t workspaceSize,
- cudaStream_t stream,
Performs the tensor reduction that is encoded by plan (see cutensorCreateReduction).
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding the reduction execution plan (created by cutensorCreateReduction followed by cutensorCreatePlan).
alpha – [in] Scaling for A. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
A – [in] Pointer to the data corresponding to A in device memory. Pointer to the GPU-accessible memory. The data accessed via this pointer must not overlap with the elements written to D.
beta – [in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorOperationDescriptorGetAttribute(desc, CUTENSOR_OPERATION_SCALAR_TYPE)). Pointer to the host memory.
C – [in] Pointer to the data corresponding to C in device memory. Pointer to the GPU-accessible memory.
D – [out] Pointer to the data corresponding to D in device memory. Pointer to the GPU-accessible memory.
workspace – [out] Scratchpad (device) memory of size at least workspaceSize bytes; the workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).
workspaceSize – [in] Please use cutensorEstimateWorkspaceSize() to query the required workspace.
stream – [in] The CUDA stream in which all the computation is performed.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
Generic Operation Functions#
The following functions are generic and work with all the different operations.
cutensorDestroyOperationDescriptor()
#
- cutensorStatus_t cutensorDestroyOperationDescriptor(cutensorOperationDescriptor_t desc)#
Frees all resources related to the provided descriptor.
Remark
blocking, no reentrant, and thread-safe
- Parameters:
desc – [inout] The cutensorOperationDescriptor_t object that will be deallocated.
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorOperationDescriptorGetAttribute()
#
- cutensorStatus_t cutensorOperationDescriptorGetAttribute(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t desc,
- cutensorOperationDescriptorAttribute_t attr,
- void *buf,
- size_t sizeInBytes,
This function retrieves an attribute of the provided cutensorOperationDescriptor_t object (see cutensorOperationDescriptorAttribute_t).
Block-sparse contraction descriptors only support the attributes CUTENSOR_OPERATION_DESCRIPTOR_SCALAR_TYPE and CUTENSOR_OPERATION_DESCRIPTOR_MOVED_BYTES.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [in] The cutensorOperationDescriptor_t object whose attribute is queried.
attr – [in] Specifies the attribute that will be retrieved.
buf – [out] This buffer (of size sizeInBytes) will hold the requested attribute of the provided cutensorOperationDescriptor_t object.
sizeInBytes – [in] Size of buf (in bytes); see cutensorOperationDescriptorAttribute_t for the exact size.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorOperationDescriptorSetAttribute()
#
- cutensorStatus_t cutensorOperationDescriptorSetAttribute(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t desc,
- cutensorOperationDescriptorAttribute_t attr,
- const void *buf,
- size_t sizeInBytes,
Set attribute of a cutensorOperationDescriptor_t object.
Currently not supported for blocksparse contraction descriptors.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [inout] Operation descriptor that will be modified.
attr – [in] Specifies the attribute that will be set.
buf – [in] This buffer (of size sizeInBytes) determines the value to which attr will be set.
sizeInBytes – [in] Size of buf (in bytes).
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorCreatePlanPreference()
#
- cutensorStatus_t cutensorCreatePlanPreference(
- const cutensorHandle_t handle,
- cutensorPlanPreference_t *pref,
- cutensorAlgo_t algo,
- cutensorJitMode_t jitMode,
Allocates the cutensorPlanPreference_t, enabling users to limit the applicable kernels for a given plan/operation.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
pref – [out] Pointer to the structure holding the cutensorPlanPreference_t allocated by this function. See cutensorPlanPreference_t.
algo – [in] Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If a specified algorithm is not supported CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.
jitMode – [in] Determines if cuTENSOR is allowed to use JIT-compiled kernels (leading to a longer plan-creation phase); see cutensorJitMode_t.
cutensorDestroyPlanPreference()
#
- cutensorStatus_t cutensorDestroyPlanPreference(
- cutensorPlanPreference_t pref,
Frees all resources related to the provided preference.
Remark
blocking, no reentrant, and thread-safe
- Parameters:
pref – [inout] The cutensorPlanPreference_t object that will be deallocated.
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorPlanPreferenceSetAttribute()
#
- cutensorStatus_t cutensorPlanPreferenceSetAttribute(
- const cutensorHandle_t handle,
- cutensorPlanPreference_t pref,
- cutensorPlanPreferenceAttribute_t attr,
- const void *buf,
- size_t sizeInBytes,
Set attribute of a cutensorPlanPreference_t object.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
pref – [inout] This opaque struct restricts the search space of viable candidates.
attr – [in] Specifies the attribute that will be set.
buf – [in] This buffer (of size sizeInBytes) determines the value to which attr will be set.
sizeInBytes – [in] Size of buf (in bytes); see cutensorPlanPreferenceAttribute_t for the exact size.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorEstimateWorkspaceSize()
#
- cutensorStatus_t cutensorEstimateWorkspaceSize(
- const cutensorHandle_t handle,
- const cutensorOperationDescriptor_t desc,
- const cutensorPlanPreference_t planPref,
- const cutensorWorksizePreference_t workspacePref,
- uint64_t *workspaceSizeEstimate,
Determines the required workspaceSize for the given operation encoded by desc.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [in] This opaque struct encodes the operation.
planPref – [in] This opaque struct restricts the space of viable candidates.
workspacePref – [in] This parameter influences the size of the workspace; see cutensorWorksizePreference_t for details.
workspaceSizeEstimate – [out] The workspace size (in bytes) that is required for the given operation.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorCreatePlan()
#
- cutensorStatus_t cutensorCreatePlan(
- const cutensorHandle_t handle,
- cutensorPlan_t *plan,
- const cutensorOperationDescriptor_t desc,
- const cutensorPlanPreference_t pref,
- uint64_t workspaceSizeLimit,
This function allocates a cutensorPlan_t object, selects an appropriate kernel for a given operation (encoded by desc), and prepares a plan that encodes the execution.
This function applies cuTENSOR’s heuristic to select a candidate/kernel for a given operation (created by either cutensorCreateContraction, cutensorCreateReduction, cutensorCreatePermutation, cutensorCreateElementwiseBinary, cutensorCreateElementwiseTrinary, cutensorCreateContractionTrinary, or cutensorCreateBlockSparseContraction). The created plan can then be passed to either cutensorContract, cutensorReduce, cutensorPermute, cutensorElementwiseBinaryExecute, cutensorElementwiseTrinaryExecute, or cutensorContractTrinary to perform the actual operation.
The plan is created for the active CUDA device.
Note: cutensorCreatePlan must not be captured via CUDA graphs if Just-In-Time compilation is enabled (i.e., cutensorJitMode_t is not CUTENSOR_JIT_MODE_NONE).
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [out] Pointer to the data structure created by this function that holds all information (e.g., selected kernel) necessary to perform the desired operation.
desc – [in] This opaque struct encodes the given operation (see cutensorCreateContraction, cutensorCreateReduction, cutensorCreatePermutation, cutensorCreateElementwiseBinary, cutensorCreateElementwiseTrinary, or cutensorCreateContractionTrinary).
pref – [in] This opaque struct is used to restrict the space of applicable candidates/kernels (see cutensorCreatePlanPreference or cutensorPlanPreferenceAttribute_t). May be nullptr; in that case, default choices are assumed. Block-sparse contractions currently only support these default settings and ignore other supplied preferences.
workspaceSizeLimit – [in] Denotes the maximal workspace that the corresponding operation is allowed to use (see cutensorEstimateWorkspaceSize).
- Return values:
CUTENSOR_STATUS_SUCCESS – If a viable candidate has been found.
CUTENSOR_STATUS_NOT_SUPPORTED – If no viable candidate could be found.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace was insufficient.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
cutensorDestroyPlan()
#
-
cutensorStatus_t cutensorDestroyPlan(cutensorPlan_t plan)#
Frees all resources related to the provided plan.
Remark
blocking, no reentrant, and thread-safe
- Parameters:
plan – [inout] The cutensorPlan_t object that will be deallocated.
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorPlanGetAttribute()
#
- cutensorStatus_t cutensorPlanGetAttribute(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- cutensorPlanAttribute_t attr,
- void *buf,
- size_t sizeInBytes,
Retrieves information about an already-created plan (see cutensorPlanAttribute_t)
Block-sparse contraction plans only support CUTENSOR_PLAN_REQUIRED_WORKSPACE.
- Parameters:
plan – [in] Denotes an already-created plan (e.g., via cutensorCreatePlan or cutensorCreatePlanAutotuned)
attr – [in] Requested attribute.
buf – [out] On successful exit: Holds the information of the requested attribute.
sizeInBytes – [in] Size of buf in bytes.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
Block-sparse Functionality#
cuTENSOR’s block-sparse functionality is a public beta feature. The current block-sparse backend is not as feature-rich as cuTENSOR’s dense counterpart (e.g., it only supports a limited set of data types, offers no conjugate support, and only supports tensors up to 8D), and we are aware of some performance improvements that we are planning to provide as part of follow-up releases. While we intend to keep the API stable from one release to the next, we cannot guarantee this at this point due to the experimental nature of this API.
We would much appreciate your feedback to guide this development process and steer these future optimizations in the right direction.
Beta
This function is in beta and may change in future releases.
cutensorCreateBlockSparseTensorDescriptor()
#
- cutensorStatus_t cutensorCreateBlockSparseTensorDescriptor(
- cutensorHandle_t handle,
- cutensorBlockSparseTensorDescriptor_t *desc,
- const uint32_t numModes,
- const uint64_t numNonZeroBlocks,
- const uint32_t numSectionsPerMode[],
- const int64_t extent[],
- const int32_t nonZeroCoordinates[],
- const int64_t stride[],
- cudaDataType_t dataType,
Create a block-sparse tensor descriptor.
A block-sparse tensor descriptor fully specifies the data layout of a block-sparse tensor (currently limited to up to 8 modes).
Let us consider an example for a block-sparse tensor of order 2, i.e., a block-sparse matrix A. Its first mode (rows) is subdivided into 5 sections (with extents 4, 2, 3, 4, 5, respectively), and its second mode (columns) is subdivided into 3 sections (with extents 2, 3, 7). The matrix has 8 non-zero blocks, with extents 4 x 2, 4 x 7, 2 x 3, 3 x 3, 3 x 7, 4 x 2, 5 x 3, and 5 x 7.
Notice that we require the same extent for each section across the entire mode, i.e., every block within the same section of a mode has the same extent. For example, every block in the last column section has 7 columns, and every block in the last row section has 5 rows.
Moreover, we only store the non-zero blocks; blocks that are entirely zero are omitted.
To be precise, the above block-sparse tensor could be created via:
uint32_t numModes = 2;
uint64_t numNonZeroBlocks = 8;
const uint32_t sectionsPerMode[] = {5, 3}; // Array of size numModes
const int64_t extent[] = // Array of size Sum(sectionsPerMode)
{
    4, 2, 3, 4, 5, // extents of sections of first mode
    2, 3, 7        // extents of sections of second mode
};
const int32_t nonZeroCoordinates[] = // Array of size numModes x numNonZeroBlocks
{
    0, 0, // Block 0. List of non-zero blocks in the block-sparse tensor.
    3, 0, // Block 1. Blocks may be listed in an arbitrary order, however,
    1, 1, // Block 2. this order must remain the same for every subsequent
    2, 1, // Block 3. operation.
    4, 1, // Block 4. For example, when calling cutensorBlockSparseContract,
    0, 2, // Block 5. pointers to the blocks need to be given in this order.
    2, 2, // Block 6. The strides of the blocks in the stride[] array also
    4, 2  // Block 7. need to be given in the same order as in this list.
};
//  ^  ^
//  |  |
//  |  \--- Section number of second mode, block-column number
//  \------ Section number of first mode, block-row number
const int64_t stride[] = // Either nullptr or an array of size numModes x numNonZeroBlocks
{
    1, 4, // Block 0. Strides for each block, given in the same order as in
    1, 4, // Block 1. the array nonZeroCoordinates.
    1, 2, // Block 2.
    1, 3, // Block 3. In this example, each block is stored contiguously in
    1, 5, // Block 4. column-major order; this is equivalent to passing stride=nullptr.
    1, 4, // Block 5.
    1, 3, // Block 6. However, other memory layouts are also allowed, see documentation
    1, 5  // Block 7. below.
};
cudaDataType_t dataType = CUDA_C_64F; // For example: complex-valued double-precision.
As an example of the stride-compatibility requirement (described for the stride parameter below):
strides of block #0: 5, 1, 10, 20. The sorted strides would be 1, 5, 10, 20; the permutation that sorts them swaps the first two elements.
strides of block #1: 10, 1, 30, 60. Applying the permutation results in 1, 10, 30, 60. The result is sorted in ascending order; this is allowed.
strides of block #2: 1, 5, 50, 100. Applying the permutation results in 5, 1, 50, 100. The result is not sorted in ascending order; this is not allowed.
Remark
non-blocking, no reentrant, and thread-safe
- Parameters:
dataType – [in] Data type of the stored entries. We assume the same datatype for each block. Currently, the only supported values are CUDA_C_64F, CUDA_C_32F, CUDA_R_64F, and CUDA_R_32F.
handle – [in] The library handle.
desc – [out] The resulting block-sparse tensor descriptor.
numModes – [in] The number of modes. Currently, a maximum of 8 modes is supported.
numNonZeroBlocks – [in] The number of non-zero blocks in the block-sparse tensor.
numSectionsPerMode – [in] The number of sections of each mode (host array of size numModes).
extent – [in] The extents of the sections of each mode (host array of size \(\sum_{i}^{numModes} numSectionsPerMode[i]\)). First come the extents of the sections of the first mode, then the extents of the sections of the second mode, and so forth.
nonZeroCoordinates – [in] Block-coordinates of each non-zero block (host array of size numModes x numNonZeroBlocks). Blocks can be specified in any order; however, that order must be consistent with the stride and alignmentRequirement arrays.
stride – [in] The strides of each dense block (either nullptr or a host array of size numModes x numNonZeroBlocks). First come the strides of the first block, then the strides of the second block, with the blocks in the same order as in nonZeroCoordinates. Passing nullptr means contiguous column-major order for each block. Moreover, the strides need to be compatible in the following sense: suppose you sort the strides of the first block such that they are ascending; this sorting results in a permutation. If you apply this permutation to the strides of any other block, the result needs to be sorted as well.
Beta
This function is in beta and may change in future releases.
cutensorDestroyBlockSparseTensorDescriptor()
#
- cutensorStatus_t cutensorDestroyBlockSparseTensorDescriptor(cutensorBlockSparseTensorDescriptor_t desc)#
Frees all resources related to the provided block-sparse tensor descriptor.
Remark
blocking, no reentrant, and thread-safe.
- Parameters:
desc – [inout] The cutensorBlockSparseTensorDescriptor_t object that will be deallocated.
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise.
Beta
This function is in beta and may change in future releases.
cutensorCreateBlockSparseContraction()
#
- cutensorStatus_t cutensorCreateBlockSparseContraction(
- const cutensorHandle_t handle,
- cutensorOperationDescriptor_t *desc,
- const cutensorBlockSparseTensorDescriptor_t descA,
- const int32_t modeA[],
- cutensorOperator_t opA,
- const cutensorBlockSparseTensorDescriptor_t descB,
- const int32_t modeB[],
- cutensorOperator_t opB,
- const cutensorBlockSparseTensorDescriptor_t descC,
- const int32_t modeC[],
- cutensorOperator_t opC,
- const cutensorBlockSparseTensorDescriptor_t descD,
- const int32_t modeD[],
- const cutensorComputeDescriptor_t descCompute,
This function allocates a cutensorOperationDescriptor_t object that encodes a block-sparse tensor contraction of the form \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \).
Allocates data for desc to be used to perform a block-sparse tensor contraction of the form
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha op_\mathcal{A}(\mathcal{A}_{{modes}_\mathcal{A}}) op_\mathcal{B}(\mathcal{B}_{{modes}_\mathcal{B}}) + \beta op_\mathcal{C}(\mathcal{C}_{{modes}_\mathcal{C}}). \]
Only the predefined non-zero blocks of \(\mathcal{D}\) that were specified in cutensorCreateBlockSparseTensorDescriptor() are actually computed. The other blocks are omitted, even if the true result of the contraction would be non-zero. Conversely, if a predefined non-zero block of \(\mathcal{D}\) is present but the result of the contraction is zero for this block, explicit zeros are stored.
Currently, the data types for the tensors A, B, C, and D, as well as the scalars \( \alpha \) and \( \beta \), must all be identical, and the only supported types are CUDA_C_64F, CUDA_C_32F, CUDA_R_64F, and CUDA_R_32F. The compute type needs to match as well; that is, we currently only support:

| typeA      | typeB      | typeC      | typeD      | descCompute               | typeScalar |
|------------|------------|------------|------------|---------------------------|------------|
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUTENSOR_COMPUTE_DESC_32F | CUDA_R_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_R_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUTENSOR_COMPUTE_DESC_32F | CUDA_C_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUTENSOR_COMPUTE_DESC_64F | CUDA_C_64F |
For every mode, the segmentation of that mode must be identical in all tensors that it occurs in. For example, if modeA[i] = modeB[j] for some i and j, then numSectionsPerMode[i] of tensor A must have the same value as numSectionsPerMode[j] of tensor B, and the corresponding section extents must be identical.
For example, let A, B, and C be block matrices and consider the ordinary matrix-matrix product \(C_{mn}=A_{mk}B_{kn}\). Then:
Mode ‘m’: C and A need to have the same number of block-rows, and each block-row of C needs to contain the same number of rows as the corresponding block-row of A.
Mode ‘n’: C and B need to have the same number of block-columns, and corresponding block-columns need to have matching sizes.
Mode ‘k’: A needs to have the same number of block-columns as B has block-rows, and each block-column of A needs to consist of the same number of columns as the number of rows in the corresponding block-row of B.
At the moment, descC and descD must be identical, i.e., the same opaque pointer needs to be passed and the layouts of the C and the D tensors need to be identical.
See cutensorCreatePlan to create the plan, cutensorEstimateWorkspaceSize to compute the required workspace, and finally cutensorBlockSparseContract to perform the actual contraction.
The user is responsible for calling cutensorDestroyOperationDescriptor to free the resources associated with the descriptor.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
desc – [out] This opaque struct gets allocated and filled with the information that encodes the tensor contraction operation.
descA – [in] The descriptor that holds the information about the data type, modes, sections, section extents, strides, and non-zero blocks of A.
modeA – [in] Array with ‘nmodeA’ entries that represent the modes of A. Sections, i.e., block-sizes, must match among the involved block-sparse tensors.
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged. Currently, only CUTENSOR_OP_IDENTITY is supported.
descB – [in] The descriptor that holds information about the data type, modes, sections, section extents, strides, and non-zero blocks of B.
modeB – [in] Array with ‘nmodeB’ entries that represent the modes of B. Sections, i.e., block-sizes, must match among the involved block-sparse tensors.
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged. Currently, only CUTENSOR_OP_IDENTITY is supported.
descC – [in] The descriptor that holds information about the data type, modes, sections, section extents, strides, and non-zero blocks of C. Note that the block-sparsity pattern of C (the nonZeroCoordinates[] array used to create the descriptor) must be identical to that of D; it is this block-sparsity pattern that determines which parts of the result are computed; no fill-in is allocated or computed.
modeC – [in] Array with ‘nmodeC’ entries that represent the modes of C. Sections, i.e., block-sizes, must match among the involved block-sparse tensors.
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged. Currently, only CUTENSOR_OP_IDENTITY is supported.
descD – [in] For now, this must be the same opaque pointer as descC, and the layouts of C and D must be identical.
modeD – [in] Array with ‘nmodeD’ entries that represent the modes of D (must be identical to modeC for now).
descCompute – [in] Datatype used for the intermediate computation T = A * B.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE – if tensor dimensions or modes or section sizes have an illegal value
CUTENSOR_STATUS_SUCCESS – The operation completed successfully without error
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
Beta
This function is in beta and may change in future releases.
cutensorBlockSparseContract()
#
- cutensorStatus_t cutensorBlockSparseContract(
- const cutensorHandle_t handle,
- const cutensorPlan_t plan,
- const void *alpha,
- const void *const A[],
- const void *const B[],
- const void *beta,
- const void *const C[],
- void *const D[],
- void *workspace,
- uint64_t workspaceSize,
- cudaStream_t stream,
This routine computes the block-sparse tensor contraction \( D = \alpha A B + \beta C \):
\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]
The active CUDA device must match the CUDA device that was active at the time at which the plan was created.
The array-parameters A, B, C, and D are host-arrays containing pointers to GPU-accessible memory. For example, A is a host-array whose size equals the number of non-zero blocks in tensor \(\mathcal{A}\); A[i] is a pointer to the GPU-accessible memory location of block number i of \(\mathcal{A}\). The blocks are numbered in the same way as in the construction of \(\mathcal{A}\)’s block-sparse tensor descriptor. The same holds analogously for the other array-parameters B, C, and D.
- Parameters:
handle – [in] Opaque handle holding cuTENSOR’s library context.
plan – [in] Opaque handle holding the contraction execution plan (created by cutensorCreateBlockSparseContraction followed by cutensorCreatePlan).
alpha – [in] Scaling for A*B. Its data type is determined by ‘descCompute’ (see cutensorCreateBlockSparseContraction). Pointer to host memory.
A – [in] Host-array of size numNonZeroBlocks(A), containing pointers to GPU-accessible memory corresponding to the blocks of A. The data accessed via these pointers must not overlap with the elements written to D.
B – [in] Host-array of size numNonZeroBlocks(B), containing pointers to GPU-accessible memory corresponding to the blocks of B. The data accessed via these pointers must not overlap with the elements written to D.
beta – [in] Scaling for C. Its data type is determined by ‘descCompute’ (see cutensorCreateBlockSparseContraction). Pointer to host memory.
C – [in] Host-array of size numNonZeroBlocks(C), containing pointers to GPU-accessible memory corresponding to the blocks of C.
D – [out] Host-array of size numNonZeroBlocks(D), containing pointers to GPU-accessible memory corresponding to the blocks of D.
workspace – [out] This pointer provides the required workspace in device memory. The workspace must be aligned to 256 bytes (i.e., the default alignment of cudaMalloc).
workspaceSize – [in] Size of the workspace array in bytes; please refer to cutensorEstimateWorkspaceSize to query the required workspace. For block-sparse contractions, this estimate is exact.
stream – [in] The CUDA stream to which all of the computation is synchronized.
- Return values:
CUTENSOR_STATUS_NOT_SUPPORTED – if operation is not supported.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized.
CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device than the currently active device.
CUTENSOR_STATUS_INSUFFICIENT_DRIVER – if the driver is insufficient.
CUTENSOR_STATUS_CUDA_ERROR – if some unknown CUDA error has occurred (e.g., out of memory).
Logger Functions#
cutensorLoggerSetCallback()
#
-
cutensorStatus_t cutensorLoggerSetCallback(cutensorLoggerCallback_t callback)#
This function sets the logging callback routine.
- Parameters:
callback – [in] Pointer to a callback function. Check cutensorLoggerCallback_t.
cutensorLoggerSetFile()
#
-
cutensorStatus_t cutensorLoggerSetFile(FILE *file)#
This function sets the logging output file.
- Parameters:
file – [in] An open file with write permission.
cutensorLoggerOpenFile()
#
-
cutensorStatus_t cutensorLoggerOpenFile(const char *logFile)#
This function opens a logging output file in the given path.
- Parameters:
logFile – [in] Path to the logging output file.
cutensorLoggerSetLevel()
#
-
cutensorStatus_t cutensorLoggerSetLevel(int32_t level)#
This function sets the value of the logging level.
- Parameters:
level – [in] Log level, should be one of the following:
0. Off
1. Errors
2. Performance Trace
3. Performance Hints
4. Heuristics Trace
5. API Trace
cutensorLoggerSetMask()
#
-
cutensorStatus_t cutensorLoggerSetMask(int32_t mask)#
This function sets the value of the log mask.
- Parameters:
mask – [in] Log mask, the bitwise OR of the following:
0. Off
1. Errors
2. Performance Trace
4. Performance Hints
8. Heuristics Trace
16. API Trace
cutensorLoggerForceDisable()
#
-
cutensorStatus_t cutensorLoggerForceDisable()#
This function disables logging for the entire run.