cuTENSOR Functions¶

Helper Functions¶

The helper functions initialize cuTENSOR, create tensor descriptors, check error codes, and retrieve library and CUDA runtime versions.

`cutensorInit()`¶

cutensorStatus_t cutensorInit(cutensorHandle_t *handle)¶

Brief

Initializes the cuTENSOR library

Details

The device associated with a particular cuTENSOR handle is assumed to remain unchanged after the cutensorInit() call. In order for the cuTENSOR library to use a different device, the application must set the new device to be used by calling cudaSetDevice() and then create another cuTENSOR handle, which will be associated with the new device, by calling cutensorInit().

Returns

CUTENSOR_STATUS_SUCCESS on success and an error code otherwise

Remark

blocking, non-reentrant, and thread-safe

Parameters

[out] handle: Pointer to cutensorHandle_t

`cutensorInitTensorDescriptor()`¶

cutensorStatus_t cutensorInitTensorDescriptor(const cutensorHandle_t *handle, cutensorTensorDescriptor_t *desc, const uint32_t numModes, const int64_t extent[], const int64_t stride[], cudaDataType_t dataType, cutensorOperator_t unaryOp)¶

Brief

Initializes a tensor descriptor

Precondition

extent and stride arrays must each contain at least sizeof(int64_t) * numModes bytes

Returns

CUTENSOR_STATUS_SUCCESS on success and an error code otherwise

Remark

non-blocking, non-reentrant, and thread-safe

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[out] desc: Pointer to the address where the allocated tensor descriptor object is stored.
[in] numModes: Number of modes.
[in] extent: Extent of each mode (must be larger than zero).
[in] stride: stride[i] denotes the displacement (stride) between two consecutive elements in the i-th mode. If stride is NULL, a packed generalized column-major memory layout is assumed (i.e., the strides increase monotonically from left to right). Each stride must be larger than zero. A stride of zero can instead be achieved by omitting the mode entirely; for instance, instead of writing C[a,b] = A[b,a] with strideA(a) = 0, you can write C[a,b] = A[b] directly, and cuTENSOR will automatically infer that the a-mode in A should be broadcast.
[in] dataType: Data type of the stored entries.
[in] unaryOp: Unary operator that will be applied to each element of the corresponding tensor in a lazy fashion (i.e., the algorithm uses this tensor as its operand only once). The original data of this tensor remains unchanged.
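To make the packed generalized column-major layout concrete, the following host-side sketch computes the strides that correspond to passing stride = NULL. The helper name packed_strides is hypothetical and not part of cuTENSOR; it only illustrates the layout rule (first mode has stride 1, each later stride is the product of the preceding extents).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a cuTENSOR function).
 * Fills stride[] with the packed generalized column-major layout:
 * stride[0] = 1 and stride[i] = stride[i-1] * extent[i-1].
 * Returns the total number of elements of the tensor. */
int64_t packed_strides(uint32_t numModes, const int64_t extent[], int64_t stride[])
{
    int64_t running = 1;
    for (uint32_t i = 0; i < numModes; ++i) {
        assert(extent[i] > 0);  /* extents must be larger than zero */
        stride[i] = running;    /* displacement between neighbors in mode i */
        running *= extent[i];
    }
    return running;
}
```

For extents {4, 8, 12} this yields strides {1, 4, 32}, so the strides increase monotonically from left to right as described above.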

`cutensorGetAlignmentRequirement()`¶

cutensorStatus_t cutensorGetAlignmentRequirement(const cutensorHandle_t *handle, const void *ptr, const cutensorTensorDescriptor_t *desc, uint32_t *alignmentRequirement)¶

Brief

Computes the minimal alignment requirement for a given pointer and descriptor

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] ptr: Raw pointer to the data of the respective tensor.
[in] desc: Tensor descriptor for ptr.
[out] alignmentRequirement: Largest alignment requirement that ptr can fulfill (in bytes).
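As a rough illustration of what "largest alignment requirement that ptr can fulfill" means, the sketch below finds the largest power of two that divides an address. This is an assumption-laden model, not the library's implementation: the helper name largest_alignment and the upper bound maxAlign are hypothetical, and the real function additionally takes the tensor descriptor into account.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a cuTENSOR function).
 * Returns the largest power of two, capped at maxAlign, that divides
 * the given address. maxAlign must itself be a power of two. */
uint32_t largest_alignment(uintptr_t address, uint32_t maxAlign)
{
    assert(maxAlign > 0 && (maxAlign & (maxAlign - 1)) == 0);
    uint32_t align = 1;
    while (align < maxAlign && (address % (align * 2u)) == 0) {
        align *= 2u;  /* the address is also a multiple of the next power of two */
    }
    return align;
}
```

With a cap of 256, an address of 0x1000 reports 256 bytes, while 0x1004 reports only 4 bytes.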

`cutensorGetErrorString()`¶

const char *cutensorGetErrorString(const cutensorStatus_t error)¶

Brief

Returns the description string for an error code

Returns

the error string

Remark

non-blocking, non-reentrant, and thread-safe

Parameters

[in] error: Error code to convert to string.

`cutensorGetVersion()`¶

size_t cutensorGetVersion()¶

Brief: Returns the version number of the cuTENSOR library

`cutensorGetCudartVersion()`¶

size_t cutensorGetCudartVersion()¶

Brief: Returns version number of the CUDA runtime that cuTENSOR was compiled against
Details: Can be compared against the CUDA runtime version from cudaRuntimeGetVersion().
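A minimal sketch of such a comparison follows. It assumes the usual CUDA runtime version encoding of 1000 * major + 10 * minor (for example, 11020 for CUDA 11.2); the helper names are hypothetical, and the values would come from cutensorGetCudartVersion() and cudaRuntimeGetVersion() in a real program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits a CUDA runtime version number, assumed to be
 * encoded as 1000 * major + 10 * minor, into its two components. */
void cudart_version_split(size_t version, int *major, int *minor)
{
    assert(major != NULL && minor != NULL);
    *major = (int)(version / 1000);
    *minor = (int)((version % 1000) / 10);
}

/* Hypothetical helper: returns 1 if the installed runtime is older than the
 * runtime the library was compiled against, and 0 otherwise. */
int cudart_runtime_is_older(size_t compiledAgainst, size_t runtime)
{
    return runtime < compiledAgainst;
}
```

What an application does when the installed runtime is older (warn, abort, or continue) is a policy decision that this sketch leaves open.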

Element-wise Operations¶

The following functions perform element-wise operations between tensors.

`cutensorElementwiseTrinary()`¶

cutensorStatus_t cutensorElementwiseTrinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAB, cutensorOperator_t opABC, cudaDataType_t typeScalar, const cudaStream_t stream)¶

Brief: Element-wise tensor operation with three inputs
Details: This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{ABC}(\Phi_{AB}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \beta \Psi_B(B_{\Pi^B(i_0,i_1,...,i_n)})), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

Where

A, B, C, D are multi-mode tensors (of arbitrary data types).
\(\Pi^A, \Pi^B, \Pi^C\) are permutation operators that permute the modes of A, B, and C respectively.
\(\Psi_{A},\Psi_{B},\Psi_{C}\) are unary element-wise operators (e.g., IDENTITY, CONJUGATE).
\(\Phi_{ABC}, \Phi_{AB}\) are binary element-wise operators (e.g., ADD, MUL, MAX, MIN).

Broadcasting of a mode can be achieved by omitting that mode from the respective tensor.

Moreover, modes may appear in any order, giving users greater flexibility. The only restrictions are:

modes that appear in A or B must also appear in the output tensor; a mode that only appears in the input would be contracted and such an operation would be covered by either cutensorContraction or cutensorReduction.
each mode may appear in each tensor at most once.

Input tensors may be read even if the value of the corresponding scalar is zero.

Examples:

\( D_{a,b,c,d} = A_{b,d,a,c}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}\)
\( D_{a,b,c,d} = 2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a} + C_{a,b,c,d}\)
\( D_{a,b,c,d} = min((2.2 * A_{b,d,a,c} + 1.3 * B_{c,b,d,a}), C_{a,b,c,d})\)
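The examples above can be mimicked on the host for a small two-mode case. The sketch below is a CPU reference, not cuTENSOR code: it evaluates D[a,b] = opABC(opAB(alpha * A[b,a], beta * B[a,b]), gamma * C[a,b]) for packed column-major tensors, with all unary operators taken to be the identity. The names trinary_ref, binary_op, op_add, and op_min are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the binary element-wise operators. */
typedef double (*binary_op)(double, double);

double op_add(double x, double y) { return x + y; }
double op_min(double x, double y) { return x < y ? x : y; }

/* CPU reference for D[a,b] = opABC(opAB(alpha*A[b,a], beta*B[a,b]), gamma*C[a,b]).
 * na and nb are the extents of modes a and b; all tensors are packed
 * column-major, so A[b,a] lives at b + nb*a and B[a,b] at a + na*b. */
void trinary_ref(double alpha, const double *A, double beta, const double *B,
                 double gamma, const double *C, double *D,
                 int64_t na, int64_t nb, binary_op opAB, binary_op opABC)
{
    assert(opAB != NULL && opABC != NULL);
    for (int64_t b = 0; b < nb; ++b) {
        for (int64_t a = 0; a < na; ++a) {
            double ab = opAB(alpha * A[b + nb * a], beta * B[a + na * b]);
            D[a + na * b] = opABC(ab, gamma * C[a + na * b]);
        }
    }
}
```

Passing op_add for opAB and op_min for opABC corresponds to the last example in the list, reduced to two modes.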

Supported data-type combinations are:

Returns

CUTENSOR_STATUS_SUCCESS on success and an error code otherwise

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] alpha: Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
[in] A: Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.
[in] descA: A descriptor that holds the information about the data type, modes, and strides of A.
[in] modeA: Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] beta: Scaling factor for B (see equation above) of the type typeScalar. Pointer to the host memory. If beta is zero, B is not read and the corresponding unary operator is not applied.
[in] B: Multi-mode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.
[in] descB: The B descriptor that holds information about the data type, modes, and strides of B.
[in] modeB: Array (in host memory) of size descB->numModes that holds the names of the modes of B. modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] gamma: Scaling factor for C (see equation above) of type typeScalar. Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
[in] C: Multi-mode tensor of type typeC with nmodeC modes. Pointer to the GPU-accessible memory.
[in] descC: The C descriptor that holds information about the data type, modes, and strides of C.
[in] modeC: Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[out] D: Multi-mode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to the GPU-accessible memory. Notice that D may alias any input tensor if they share the same memory layout (i.e., same tensor descriptor).
[in] descD: The D descriptor that holds information about the data type, modes, and strides of D. Currently, descD and descC must be identical.
[in] modeD: Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] opAB: Element-wise binary operator (see \(\Phi_{AB}\) above).
[in] opABC: Element-wise binary operator (see \(\Phi_{ABC}\) above).
[in] typeScalar: Scalar type for the intermediate computation.
[in] stream: The CUDA stream.

`cutensorElementwiseBinary()`¶

cutensorStatus_t cutensorElementwiseBinary(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *gamma, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opAC, cudaDataType_t typeScalar, cudaStream_t stream)¶

Brief: Element-wise tensor operation for two input tensors
Details: This function performs an element-wise tensor operation of the form:

\[ D_{\Pi^C(i_0,i_1,...,i_n)} = \Phi_{AC}(\alpha \Psi_A(A_{\Pi^A(i_0,i_1,...,i_n)}), \gamma \Psi_C(C_{\Pi^C(i_0,i_1,...,i_n)})) \]

See cutensorElementwiseTrinary() for details.

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] alpha: Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
[in] A: Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.
[in] descA: A descriptor that holds the information about the data type, modes, and strides of A.
[in] modeA: Array (in host memory) of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {'a','b','c'}). modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] gamma: Scaling factor for C (see equation above) of type typeScalar. Pointer to the host memory. If gamma is zero, C is not read and the corresponding unary operator is not applied.
[in] C: Multi-mode tensor of type typeC with nmodeC modes. Pointer to the GPU-accessible memory.
[in] descC: The C descriptor that holds information about the data type, modes, and strides of C.
[in] modeC: Array (in host memory) of size descC->numModes that holds the names of the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[out] D: Multi-mode output tensor of type typeC with nmodeC modes that are ordered according to modeD. Pointer to the GPU-accessible memory. Notice that D may alias any input tensor if they share the same memory layout (i.e., same tensor descriptor).
[in] descD: The D descriptor that holds information about the data type, modes, and strides of D. Currently, descD and descC must be identical.
[in] modeD: Array (in host memory) of size descD->numModes that holds the names of the modes of D. modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] opAC: Element-wise binary operator (see \(\Phi_{AC}\) above).
[in] typeScalar: Scalar type for the intermediate computation.
[in] stream: The CUDA stream.

Return Value

CUTENSOR_STATUS_NOT_SUPPORTED: if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE: if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS: if execution succeeded without error

`cutensorPermutation()`¶

cutensorStatus_t cutensorPermutation(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], void *B, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const cudaDataType_t typeScalar, const cudaStream_t stream)¶

Brief: Tensor permutation
Details: This function performs an element-wise tensor operation of the form:

\[ B_{\Pi^B(i_0,i_1,...,i_n)} = \alpha \Psi(A_{\Pi^A(i_0,i_1,...,i_n)}) \]

Where

A and B are multi-mode tensors (of arbitrary data types),
\(\Pi^A, \Pi^B\) are permutation operators that permute the modes of A, B respectively,
\(\Psi\) is a unary element-wise operator (e.g., IDENTITY, SQR, CONJUGATE), and
\(\Psi\) is specified in the tensor descriptor descA.

Consequently, this function performs an out-of-place tensor permutation and is a specialization of cutensorElementwise.

Broadcasting (of a mode) can be achieved by simply omitting that mode from the respective tensor.

Modes may appear in any order. The only restrictions are:

modes that appear in A must also appear in the output tensor.
each mode may appear in each tensor at most once.
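The mode-matching rule and broadcasting can be illustrated with a small host-side reference. This is a sketch under stated assumptions, not the library's algorithm: permute_ref and MAX_MODES are hypothetical names, both tensors are packed column-major, the unary operator is the identity, and the extents are given in the order of modeB.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MODES 8  /* hypothetical limit for this sketch */

/* CPU reference for B[modeB] = alpha * A[modeA] with packed column-major
 * tensors. Modes of B that are missing from A are broadcast. */
void permute_ref(float alpha,
                 const float *A, uint32_t nmodeA, const int32_t modeA[],
                 float *B, uint32_t nmodeB, const int32_t modeB[],
                 const int64_t extentB[])
{
    assert(nmodeA <= nmodeB && nmodeB <= MAX_MODES);

    /* For every mode of B, the stride of that mode in A (0 means broadcast). */
    int64_t strideAinB[MAX_MODES] = {0};
    int64_t strideA = 1;
    for (uint32_t i = 0; i < nmodeA; ++i) {
        int found = 0;
        for (uint32_t j = 0; j < nmodeB; ++j) {
            if (modeB[j] == modeA[i]) {
                strideAinB[j] = strideA;
                strideA *= extentB[j];
                found = 1;
                break;
            }
        }
        assert(found);  /* every mode of A must also appear in B */
        (void)found;
    }

    int64_t total = 1;
    for (uint32_t j = 0; j < nmodeB; ++j) {
        total *= extentB[j];
    }

    /* Walk B linearly while maintaining its multi-index (first mode fastest). */
    int64_t idx[MAX_MODES] = {0};
    for (int64_t lin = 0; lin < total; ++lin) {
        int64_t offA = 0;
        for (uint32_t j = 0; j < nmodeB; ++j) {
            offA += idx[j] * strideAinB[j];
        }
        B[lin] = alpha * A[offA];
        for (uint32_t j = 0; j < nmodeB; ++j) {
            if (++idx[j] < extentB[j]) break;
            idx[j] = 0;
        }
    }
}
```

Calling it with modeA = {'b','a'} and modeB = {'a','b'} transposes a matrix; calling it with modeA = {'b'} and modeB = {'a','b'} replicates A along the a-mode.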

Remark

calls asynchronous functions, non-reentrant, and thread-safe

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] alpha: Scaling factor for A (see equation above) of the type typeScalar. Pointer to the host memory. If alpha is zero, A is not read and the corresponding unary operator is not applied.
[in] A: Multi-mode tensor of type typeA with nmodeA modes. Pointer to the GPU-accessible memory.
[in] descA: A descriptor that holds information about the data type, modes, and strides of A.
[in] modeA: Array of size descA->numModes that holds the names of the modes of A (e.g., if A_{a,b,c} => modeA = {'a','b','c'}).
[inout] B: Multi-mode tensor of type typeB with nmodeB modes. Pointer to the GPU-accessible memory.
[in] descB: A descriptor that holds information about the data type, modes, and strides of B.
[in] modeB: Array of size descB->numModes that holds the names of the modes of B.
[in] typeScalar: Data type of alpha.
[in] stream: The CUDA stream.

Return Value

CUTENSOR_STATUS_NOT_SUPPORTED: if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE: if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS: if execution succeeded without error

Contraction Operations¶

The following functions perform contractions between tensors.

`cutensorInitContractionDescriptor()`¶

cutensorStatus_t cutensorInitContractionDescriptor(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const uint32_t alignmentRequirementA, const cutensorTensorDescriptor_t *descB, const int32_t modeB[], const uint32_t alignmentRequirementB, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const uint32_t alignmentRequirementC, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], const uint32_t alignmentRequirementD, cutensorComputeType_t typeCompute)¶

Brief

Describes the tensor contraction problem of the form:

\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]

Details

\[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[out] desc: This opaque struct gets filled with the information that encodes the tensor contraction problem.
[in] descA: A descriptor that holds the information about the data type, modes and strides of A.
[in] modeA: Array with ‘nmodeA’ entries that represent the modes of A. The modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementA: Alignment that cuTENSOR may require for A’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] descB: The B descriptor that holds information about the data type, modes, and strides of B.
[in] modeB: Array with ‘nmodeB’ entries that represent the modes of B. The modeB[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementB: Alignment that cuTENSOR may require for B’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] descC: The C descriptor that holds information about the data type, modes, and strides of C.
[in] modeC: Array with ‘nmodeC’ entries that represent the modes of C. The modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementC: Alignment that cuTENSOR may require for C’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] descD: The D descriptor that holds information about the data type, modes, and strides of D (must be identical to descC for now).
[in] modeD: Array with ‘nmodeD’ entries that represent the modes of D (must be identical to modeC for now). The modeD[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[in] alignmentRequirementD: Alignment that cuTENSOR may require for D’s pointer (in bytes); you can use the helper function cutensorGetAlignmentRequirement to determine the best value for a given pointer.
[in] typeCompute: Data type for the intermediate computation T = A * B.

Return Value

CUTENSOR_STATUS_NOT_SUPPORTED: if the combination of data types or operations is not supported
CUTENSOR_STATUS_INVALID_VALUE: if tensor dimensions or modes have an illegal value
CUTENSOR_STATUS_SUCCESS: if execution succeeded without error

`cutensorContractionDescriptorSetAttribute()`¶

cutensorStatus_t cutensorContractionDescriptorSetAttribute(const cutensorHandle_t *handle, cutensorContractionDescriptor_t *desc, cutensorContractionDescriptorAttributes_t attr, const void *buf, size_t sizeInBytes)¶

Brief

Sets an attribute of the contraction descriptor

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[inout] desc: Contraction descriptor that will be modified.
[in] attr: Specifies the attribute that will be set.
[in] buf: This buffer (of size sizeInBytes) determines the value to which attr will be set.
[in] sizeInBytes: Size of buf (in bytes).

`cutensorInitContractionFind()`¶

cutensorStatus_t cutensorInitContractionFind(const cutensorHandle_t *handle, cutensorContractionFind_t *find, const cutensorAlgo_t algo)¶

Brief

Limits the search space of viable candidates (a.k.a. algorithms)

Details

This function gives the user finer control over the candidates that the subsequent call to cutensorInitContractionPlan is allowed to evaluate.

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[out] find: Opaque struct that restricts the search space of viable candidates.
[in] algo: Allows users to select a specific algorithm. CUTENSOR_ALGO_DEFAULT lets the heuristic choose the algorithm. Any value >= 0 selects a specific GEMM-like algorithm and deactivates the heuristic. If a specified algorithm is not supported CUTENSOR_STATUS_NOT_SUPPORTED is returned. See cutensorAlgo_t for additional choices.

Return Value

CUTENSOR_STATUS_SUCCESS:
CUTENSOR_STATUS_INVALID_VALUE:
CUTENSOR_STATUS_NOT_SUPPORTED:

`cutensorContractionFindSetAttribute()`¶

cutensorStatus_t cutensorContractionFindSetAttribute(const cutensorHandle_t *handle, cutensorContractionFind_t *find, cutensorContractionFindAttributes_t attr, const void *buf, size_t sizeInBytes)¶

Brief

Sets an attribute of cutensorContractionFind_t

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[inout] find: This opaque struct restricts the search space of viable candidates.
[in] attr: Specifies the attribute that will be set.
[in] buf: This buffer (of size sizeInBytes) determines the value to which attr will be set.
[in] sizeInBytes: Size of buf (in bytes).

`cutensorContractionGetWorkspace()`¶

cutensorStatus_t cutensorContractionGetWorkspace(const cutensorHandle_t *handle, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const cutensorWorksizePreference_t pref, uint64_t *workspaceSize)¶

Brief

Determines the required workspaceSize for a given tensor contraction (see cutensorContraction)

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] desc: This opaque struct encodes the tensor contraction problem.
[in] find: This opaque struct restricts the search space of viable candidates.
[in] pref: This parameter influences the size of the workspace; see cutensorWorksizePreference_t for details.
[out] workspaceSize: The workspace size (in bytes) that is required for the given tensor contraction.

`cutensorInitContractionPlan()`¶

cutensorStatus_t cutensorInitContractionPlan(const cutensorHandle_t *handle, cutensorContractionPlan_t *plan, const cutensorContractionDescriptor_t *desc, const cutensorContractionFind_t *find, const uint64_t workspaceSize)¶

Brief: Initializes the contraction plan for a given tensor contraction problem
Details: This function applies cuTENSOR’s heuristic to select a candidate for a given tensor contraction problem (encoded by desc). The resulting plan can be reused multiple times as long as the tensor contraction problem remains the same.

The plan is created for the active CUDA device.

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[out] plan: Opaque handle holding the contraction execution plan (i.e., the candidate that will be executed as well as all its runtime parameters for the given tensor contraction problem).
[in] desc: This opaque struct encodes the given tensor contraction problem.
[in] find: This opaque struct is used to restrict the search space of viable candidates.
[in] workspaceSize: Available workspace size (in bytes).

Return Value

CUTENSOR_STATUS_SUCCESS: If a viable candidate has been found.
CUTENSOR_STATUS_NOT_SUPPORTED: If no viable candidate could be found.

`cutensorContraction()`¶

cutensorStatus_t cutensorContraction(const cutensorHandle_t *handle, const cutensorContractionPlan_t *plan, const void *alpha, const void *A, const void *B, const void *beta, const void *C, void *D, void *workspace, uint64_t workspaceSize, cudaStream_t stream)¶

Brief: This routine computes the tensor contraction

\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]
Details: \[ \mathcal{D}_{{modes}_\mathcal{D}} \gets \alpha \mathcal{A}_{{modes}_\mathcal{A}} \mathcal{B}_{{modes}_\mathcal{B}} + \beta \mathcal{C}_{{modes}_\mathcal{C}} \]

The currently active CUDA device must match the CUDA device that was active at the time at which the plan was created.

Supported data-type combinations are:

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] plan: Opaque handle holding the contraction execution plan.
[in] alpha: Scaling for A*B. Its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] A: Pointer to the data corresponding to A. Pointer to the GPU-accessible memory.
[in] B: Pointer to the data corresponding to B. Pointer to the GPU-accessible memory.
[in] beta: Scaling for C. Its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] C: Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.
[out] D: Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.
[out] workspace: Optional parameter that may be NULL. This pointer provides additional workspace, in device memory, to the library for additional optimizations; the workspace must be aligned to 128 bytes.
[in] workspaceSize: Size of the workspace array in bytes; please refer to cutensorContractionGetWorkspace() to query the required workspace. While cutensorContraction() does not strictly require a workspace for the reduction, it is still recommended to provide some small workspace (e.g., 128 MB).
[in] stream: The CUDA stream in which all the computation is performed.
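Because the workspace must be aligned to 128 bytes, an application that carves the workspace out of a larger buffer may need to round the pointer up first. The following sketch shows one way to do that on the host; align_up and align_workspace are hypothetical helpers, not cuTENSOR functions, and they only perform address arithmetic without touching the memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rounds an address up to the next multiple of
 * alignment, which must be a power of two. */
uintptr_t align_up(uintptr_t address, uintptr_t alignment)
{
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    return (address + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: returns a 128-byte aligned pointer inside the raw
 * buffer and shrinks *size by the number of skipped bytes. Returns NULL if
 * the buffer is NULL or too small to contain an aligned address. */
void *align_workspace(void *raw, uint64_t *size)
{
    assert(size != NULL);
    if (raw == NULL) {
        return NULL;
    }
    uintptr_t start = (uintptr_t)raw;
    uintptr_t aligned = align_up(start, 128);
    uint64_t skipped = (uint64_t)(aligned - start);
    if (skipped > *size) {
        return NULL;
    }
    *size -= skipped;
    return (void *)aligned;
}
```

The adjusted size, not the original one, is what should then be passed as workspaceSize.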

See https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuTENSOR/contraction.cu for a concrete example.

[Example]

D[a,b,c,d] = 1.3 * A[b,e,d,f] * B[f,e,a,c]

Return: CUTENSOR_STATUS_NOT_SUPPORTED, CUTENSOR_STATUS_INVALID_VALUE, CUTENSOR_STATUS_SUCCESS
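To clarify the index semantics of the example D[a,b,c,d] = 1.3 * A[b,e,d,f] * B[f,e,a,c], here is a plain host-side reference that sums over the modes e and f, which appear in both inputs but not in the output. It is a sketch for checking results on small problems, assuming packed column-major tensors and beta = 0; contraction_ref is a hypothetical name and is unrelated to how cuTENSOR computes the result.

```c
#include <assert.h>
#include <stdint.h>

/* CPU reference for D[a,b,c,d] = alpha * sum_{e,f} A[b,e,d,f] * B[f,e,a,c].
 * na..nf are the extents of modes a..f; all tensors are packed column-major
 * (the leftmost mode has stride 1). */
void contraction_ref(double alpha, const double *A, const double *B, double *D,
                     int64_t na, int64_t nb, int64_t nc, int64_t nd,
                     int64_t ne, int64_t nf)
{
    assert(na > 0 && nb > 0 && nc > 0 && nd > 0 && ne > 0 && nf > 0);
    for (int64_t d = 0; d < nd; ++d)
    for (int64_t c = 0; c < nc; ++c)
    for (int64_t b = 0; b < nb; ++b)
    for (int64_t a = 0; a < na; ++a) {
        double acc = 0.0;
        for (int64_t f = 0; f < nf; ++f)
        for (int64_t e = 0; e < ne; ++e) {
            double x = A[b + nb * (e + ne * (d + nd * f))]; /* A[b,e,d,f] */
            double y = B[f + nf * (e + ne * (a + na * c))]; /* B[f,e,a,c] */
            acc += x * y;
        }
        D[a + na * (b + nb * (c + nc * d))] = alpha * acc;   /* D[a,b,c,d] */
    }
}
```

The six nested loops make the cost grow with the product of all extents, so this is only suitable for validating small test cases.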

`cutensorContractionMaxAlgos()`¶

cutensorStatus_t cutensorContractionMaxAlgos(int32_t *maxNumAlgos)¶

Brief

This routine returns the maximum number of algorithms available to compute tensor contractions

[NOTE] Not all algorithms might be applicable to your specific problem. cutensorContraction() will return CUTENSOR_STATUS_NOT_SUPPORTED if an algorithm is not applicable.

Parameters

[out] maxNumAlgos: This value will hold the maximum number of algorithms available for cutensorContraction(). You can use the returned integer for auto-tuning purposes (i.e., iterate over all algorithms up to the returned value).
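The auto-tuning loop mentioned above could be structured as in the following sketch. It deliberately avoids real cuTENSOR calls: try_algo_fn is a hypothetical callback that would build a plan for the given algorithm, run and time the contraction, and report a nonzero value when the algorithm is not applicable. table_try_algo is a stand-in used only to make the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: runs the contraction with the given algorithm and
 * stores the elapsed time in *elapsedMs. Returns 0 on success and nonzero
 * if the algorithm is not applicable to the problem. */
typedef int (*try_algo_fn)(int32_t algo, double *elapsedMs, void *userData);

/* Iterates over algorithms 0 .. maxNumAlgos-1 and returns the fastest
 * applicable one, or -1 if none is applicable. */
int32_t pick_fastest_algo(int32_t maxNumAlgos, try_algo_fn tryAlgo, void *userData)
{
    assert(tryAlgo != NULL);
    int32_t best = -1;
    double bestMs = 0.0;
    for (int32_t algo = 0; algo < maxNumAlgos; ++algo) {
        double ms = 0.0;
        if (tryAlgo(algo, &ms, userData) != 0) {
            continue;  /* skip algorithms that are not applicable */
        }
        if (best < 0 || ms < bestMs) {
            best = algo;
            bestMs = ms;
        }
    }
    return best;
}

/* Stand-in for real timing: userData points at a table of times in which a
 * negative entry marks an algorithm that is not applicable. */
int table_try_algo(int32_t algo, double *elapsedMs, void *userData)
{
    const double *table = (const double *)userData;
    if (table[algo] < 0.0) {
        return 1;
    }
    *elapsedMs = table[algo];
    return 0;
}
```

In a real application the callback would typically time several repetitions per algorithm, since a single measurement can be noisy.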

Reduction Operations¶

The following functions perform tensor reductions.

`cutensorReduction()`¶

cutensorStatus_t cutensorReduction(const cutensorHandle_t *handle, const void *alpha, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *beta, const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, void *workspace, uint64_t workspaceSize, cudaStream_t stream)¶

Brief: Implements a tensor reduction of the form

\[ D = \alpha \, \mathrm{opReduce}(\mathrm{opA}(A)) + \beta \, \mathrm{opC}(C) \]
Details: For example, this function enables users to reduce an entire tensor to a scalar: C[] = alpha * A[i,j,k];

This function is also able to perform partial reductions; for instance: C[i,j] = alpha * A[k,j,i]; in this case only elements along the k-mode are contracted.

The binary opReduce operator provides extra control over what kind of reduction is performed. For instance, opReduce == CUTENSOR_OP_ADD reduces the elements of A via a summation, while CUTENSOR_OP_MAX finds the largest element in A.
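The partial reduction over the k-mode can be written as a short host-side reference. The sketch below evaluates D[i,j] = alpha * opReduce_k(A[k,j,i]) + beta * C[i,j] for packed column-major tensors with identity unary operators; reduce_ref and the reduce_op enum are hypothetical stand-ins (the enum values merely mirror the roles of CUTENSOR_OP_ADD and CUTENSOR_OP_MAX).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the reduction operators used in this sketch. */
typedef enum { REDUCE_ADD, REDUCE_MAX } reduce_op;

/* CPU reference for D[i,j] = alpha * opReduce_k(A[k,j,i]) + beta * C[i,j].
 * nk, nj, ni are the extents of modes k, j, i; A[k,j,i] lives at
 * k + nk*(j + nj*i) and C[i,j], D[i,j] live at i + ni*j. */
void reduce_ref(double alpha, const double *A, double beta, const double *C,
                double *D, int64_t nk, int64_t nj, int64_t ni, reduce_op op)
{
    assert(nk > 0 && nj > 0 && ni > 0);
    for (int64_t j = 0; j < nj; ++j) {
        for (int64_t i = 0; i < ni; ++i) {
            const double *col = A + nk * (j + nj * i); /* all k for fixed (j,i) */
            double acc = col[0];
            for (int64_t k = 1; k < nk; ++k) {
                if (op == REDUCE_ADD) {
                    acc += col[k];
                } else if (col[k] > acc) {
                    acc = col[k];
                }
            }
            D[i + ni * j] = alpha * acc + beta * C[i + ni * j];
        }
    }
}
```

Because each output element reads C at the same index before D is written, the sketch also works when D and C point to the same buffer.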

Supported data-type combinations are:

Return

CUTENSOR_STATUS_NOT_SUPPORTED, CUTENSOR_STATUS_INVALID_VALUE, CUTENSOR_STATUS_SUCCESS

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] alpha: Scaling for A; its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] A: Pointer to the data corresponding to A. Pointer to the GPU-accessible memory.
[in] descA: A descriptor that holds the information about the data type, modes and strides of A.
[in] modeA: Array with ‘nmodeA’ entries that represent the modes of A. modeA[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor. Modes that only appear in modeA but not in modeC are reduced (contracted).
[in] beta: Scaling for C; its data type is determined by ‘typeCompute’. Pointer to the host memory.
[in] C: Pointer to the data corresponding to C. Pointer to the GPU-accessible memory.
[in] descC: A descriptor that holds the information about the data type, modes and strides of C.
[in] modeC: Array with ‘nmodeC’ entries that represent the modes of C. modeC[i] corresponds to extent[i] and stride[i] w.r.t. the arguments provided to cutensorInitTensorDescriptor.
[out] D: Pointer to the data corresponding to D. Pointer to the GPU-accessible memory.
[in] descD: Must be identical to descC for now.
[in] modeD: Must be identical to modeC for now.
[in] opReduce: Binary operator used to reduce elements of A.
[in] typeCompute: All arithmetic is performed using this data type (i.e., it affects the accuracy and performance).
[out] workspace: Scratchpad (device) memory; the workspace must be aligned to 128 bytes.
[in] workspaceSize: Please use cutensorReductionGetWorkspace() to query the required workspace. While lower values, including zero, are valid, they may lead to grossly suboptimal performance.
[in] stream: The CUDA stream in which all the computation is performed.

`cutensorReductionGetWorkspace()`¶

cutensorStatus_t cutensorReductionGetWorkspace(const cutensorHandle_t *handle, const void *A, const cutensorTensorDescriptor_t *descA, const int32_t modeA[], const void *C, const cutensorTensorDescriptor_t *descC, const int32_t modeC[], const void *D, const cutensorTensorDescriptor_t *descD, const int32_t modeD[], cutensorOperator_t opReduce, cutensorComputeType_t typeCompute, uint64_t *workspaceSize)¶

Brief

Determines the required workspaceSize for a given tensor reduction (see cutensorReduction)

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] A: same as in cutensorReduction
[in] descA: same as in cutensorReduction
[in] modeA: same as in cutensorReduction
[in] C: same as in cutensorReduction
[in] descC: same as in cutensorReduction
[in] modeC: same as in cutensorReduction
[in] D: same as in cutensorReduction
[in] descD: same as in cutensorReduction
[in] modeD: same as in cutensorReduction
[in] opReduce: same as in cutensorReduction
[in] typeCompute: same as in cutensorReduction
[out] workspaceSize: The workspace size (in bytes) that is required for the given tensor reduction.

Cache-related Operations (beta)¶

`cutensorHandleDetachPlanCachelines()`¶

cutensorStatus_t cutensorHandleDetachPlanCachelines(cutensorHandle_t *handle)¶

Brief: Detaches cachelines from the cache (beta feature).

Details: Releases ownership of the attached cachelines back to the caller and deallocates any data structures that were allocated as part of cutensorHandleAttachPlanCachelines().

This function is not thread-safe.

Remark

non-blocking, non-reentrant, and not thread-safe

Parameters

[inout] handle: Opaque handle holding cuTENSOR’s library context. The cachelines corresponding to this cache will be detached; after this call the user again takes full ownership of the cacheline buffer.

Return Value

CUTENSOR_STATUS_SUCCESS: on success
CUTENSOR_STATUS_NOT_SUPPORTED: e.g., if no cachelines had been attached

`cutensorHandleAttachPlanCachelines()`¶

cutensorStatus_t cutensorHandleAttachPlanCachelines(cutensorHandle_t *handle, cutensorPlanCacheline_t cachelines[], const uint32_t numCachelines)¶

Brief: Attaches cachelines to the plan cache (beta feature).

Details: This function attaches the cachelines to the handle and allocates some internal data structures required for the cache; hence, it is critical that users also call cutensorHandleDetachPlanCachelines() to free those resources again.

The handle assumes ownership of the attached cachelines, which must stay valid at least until cutensorHandleDetachPlanCachelines() has been called. Moreover, the attached cachelines must not be shared with other handles (i.e., after this call they are assumed to be exclusively used by the handle).

While this function is not thread-safe, the resulting cache can be shared across different threads in a thread-safe manner.

Remark

non-blocking, non-reentrant, and not thread-safe

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context. The cachelines will be attached to the handle; the cachelines must remain valid until they have been detached (see cutensorHandleDetachPlanCachelines).
[in] cachelines: Array of user-allocated cachelines (host memory).
[in] numCachelines: Number of provided cachelines.

Return Value

CUTENSOR_STATUS_SUCCESS: on success

`cutensorHandleReadCacheFromFile()`¶

cutensorStatus_t cutensorHandleReadCacheFromFile(cutensorHandle_t *handle, const char filename[], uint32_t *numCachelinesRead)¶

Brief: Reads a Plan-Cache from file and overwrites the attached cachelines (beta feature).

Details: A cache is only valid for the same cuTENSOR version and CUDA version; moreover, the GPU architecture (incl. multiprocessor count) must match, otherwise CUTENSOR_STATUS_INVALID_VALUE will be returned.

It is important that the user has already attached sufficient cachelines (via cutensorHandleAttachPlanCachelines), otherwise CUTENSOR_STATUS_INVALID_VALUE will be returned.

This function is thread-safe.

Remark

non-blocking, non-reentrant, and thread-safe

Parameters

[inout] handle: Opaque handle holding cuTENSOR’s library context.
[in] filename: Specifies the filename (including the absolute path) of the file that holds all the cache information that has previously been written by cutensorHandleWriteCacheToFile().
[out] numCachelinesRead: On exit, this variable will hold the number of successfully-read cachelines, if CUTENSOR_STATUS_SUCCESS is returned. Otherwise, this variable will hold the number of cachelines that are required to read all cachelines associated to the cache pointed to by filename; in that case CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE is returned.

Return Value

CUTENSOR_STATUS_SUCCESS: on success
CUTENSOR_STATUS_INVALID_VALUE: if the stored cache was created by a different cuTENSOR- or CUDA-version or if the GPU architecture (incl. multiprocessor count) doesn’t match
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE: if the stored cache requires more cachelines than those that are currently attached to the handle
CUTENSOR_STATUS_IO_ERROR: if the file cannot be read

`cutensorHandleWriteCacheToFile()`¶

cutensorStatus_t cutensorHandleWriteCacheToFile(const cutensorHandle_t *handle, const char filename[])¶

Brief: Writes the attached Plan-Cache to file (beta feature).

This function is thread-safe.

Remark

non-blocking, non-reentrant, and thread-safe

Parameters

[in] handle: Opaque handle holding cuTENSOR’s library context.
[in] filename: Specifies the filename (including the absolute path) to the file that should hold all the cache information. Warning: an existing file will be overwritten.

Return Value

CUTENSOR_STATUS_SUCCESS: on success
CUTENSOR_STATUS_INVALID_VALUE: if no cache has been attached
CUTENSOR_STATUS_IO_ERROR: if the file cannot be written to
