User Guide

Nomenclature

The term tensor refers to an order-n (a.k.a. n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\). This stride describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.
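As an illustration (using NumPy rather than the cuTENSOR API), the sketch below shows element strides for a column-major matrix and for a sub-tensor view of it; note that NumPy reports strides in bytes, so they are divided by the element size here to obtain element strides as used above:

```python
import numpy as np

# A 4x8 column-major (Fortran-order) matrix; strides are converted from
# bytes (NumPy's convention) to elements (the convention used in this guide).
A = np.zeros((4, 8), order="F")
elem_strides = tuple(s // A.itemsize for s in A.strides)
print(elem_strides)  # (1, 4): consecutive rows are adjacent in memory

# A sub-tensor view keeps the parent's strides, much like a BLAS routine
# operating on a sub-matrix via the leading dimension.
sub = A[1:3, 2:5]
sub_strides = tuple(s // sub.itemsize for s in sub.strides)
print(sub_strides)  # still (1, 4)
```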

cuTENSOR, by default, adheres to a generalized column-major data layout. For example: \(A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}\) is an order-3 tensor with the extent of the a-mode, b-mode, and c-mode respectively being 4, 8, and 12. If not explicitly specified, the strides are assumed to be:

  • \(stride(a) = 1\)

  • \(stride(b) = extent(a)\)

  • \(stride(c) = extent(a) * extent(b)\).
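The default stride rule above can be sketched as follows (an illustrative helper, not part of the cuTENSOR API): the stride of each mode is the product of the extents of all preceding, faster-varying modes.

```python
def default_strides(extents):
    """Generalized column-major strides for the given mode extents."""
    strides, s = [], 1
    for e in extents:
        strides.append(s)
        s *= e  # next mode's stride = product of all previous extents
    return strides

print(default_strides([4, 8, 12]))  # [1, 4, 32]
```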

For a general order-n tensor \(A_{i_1,i_2,...,i_n}\) we require that the strides do not lead to overlapping memory accesses; for instance, \(stride(i_1) \geq 1\) and \(stride(i_l) \geq stride(i_{l-1}) \cdot extent(i_{l-1})\) for \(1 < l \leq n\).

We say that a tensor is packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) \cdot extent(i_{l-1})\) for \(1 < l \leq n\).
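A small sketch of the packed condition (illustrative only, not a cuTENSOR function): a tensor is packed exactly when each stride equals the previous stride times the previous extent, starting from 1.

```python
def is_packed(extents, strides):
    """True iff the layout is contiguous (packed) along all modes."""
    expected = 1
    for e, s in zip(extents, strides):
        if s != expected:
            return False
        expected *= e
    return True

print(is_packed([4, 8, 12], [1, 4, 32]))  # True: the default layout is packed
print(is_packed([4, 8, 12], [1, 8, 64]))  # False: padded, but non-overlapping
```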

Einstein Notation

We adhere to the “Einstein notation”: modes that appear in the input tensors and not in the output tensor are implicitly contracted.
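NumPy's einsum follows the same convention and can serve as an illustration: in the contraction below, mode k appears in both inputs but not in the output, so it is summed over implicitly.

```python
import numpy as np

# C_{a,b,c} = sum_k A_{a,k,c} * B_{k,b}: mode k is contracted because it
# does not appear in the output subscripts.
A = np.random.rand(4, 5, 6)   # modes a, k, c
B = np.random.rand(5, 8)      # modes k, b
C = np.einsum("akc,kb->abc", A, B)
print(C.shape)  # (4, 8, 6)
```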

Performance Guidelines

In this section we assume a generalized column-major data layout (i.e., the modes on the left have the smallest stride). Most of the following performance guidelines aim to facilitate more regular memory access patterns:

  • Try to order the modes (with respect to increasing strides) similarly across all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).

  • Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).

  • Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
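The first guideline above can be checked numerically: the two mode orderings below compute the same values and differ only in memory layout and access pattern (illustrated here with NumPy, not the cuTENSOR API).

```python
import numpy as np

# Same contraction, two mode orderings for A. Results are identical; only
# the memory access pattern (and thus performance) differs.
A1 = np.random.rand(4, 5, 6)           # A_{a,k,c}: preferred ordering
B = np.random.rand(5, 8)               # B_{k,b}
C1 = np.einsum("akc,kb->abc", A1, B)

A2 = np.transpose(A1, (2, 1, 0))       # A_{c,k,a}: less favorable ordering
C2 = np.einsum("cka,kb->abc", A2, B)

print(np.allclose(C1, C2))  # True
```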

Accuracy Guarantees

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeType_t specifies the minimal accuracy that is guaranteed throughout the computation. Because it is only a guarantee of minimal accuracy, the library may choose a higher accuracy than that requested by the user (e.g., if the requested compute type is not supported for a given problem, or to have more kernels available to choose from).

For instance, consider a tensor contraction for which all tensors are of type CUDA_R_32F while the cutensorComputeType_t is CUTENSOR_R_MIN_16F. In that case cuTENSOR would use NVIDIA's Tensor Cores with an accumulation type of CUDA_R_32F (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type CUDA_R_16F and the compute type is CUTENSOR_R_MIN_32F: in this case the parallel reduction (if required for performance) would have to be performed in CUDA_R_32F and would thus require auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would then be performed in CUDA_R_16F, which offers lower accuracy than the cutensorComputeType_t requested by the user.

Scalar Types

The scalar type used for element-wise operations is a function of the output type and the compute type:

Output type    cutensorComputeType_t    Scalar type
-----------    ---------------------    -----------
CUDA_R_16F     CUTENSOR_R_MIN_16F       CUDA_R_32F
CUDA_R_32F     CUTENSOR_R_MIN_16F       CUDA_R_32F
CUDA_R_32F     CUTENSOR_R_MIN_32F       CUDA_R_32F
CUDA_R_64F     CUTENSOR_R_MIN_64F       CUDA_R_64F
CUDA_R_64F     CUTENSOR_R_MIN_32F       CUDA_R_64F
CUDA_C_32F     CUTENSOR_C_MIN_32F       CUDA_C_32F
CUDA_C_64F     CUTENSOR_C_MIN_64F       CUDA_C_64F
CUDA_C_64F     CUTENSOR_C_MIN_32F       CUDA_C_64F

Supported GPUs

cuTENSOR supports any NVIDIA GPU with a compute capability of 7.0 or higher.

Error Messages

The library can print additional error diagnostics if it encounters an error. These can be enabled by setting the CUTENSOR_LOGINFO_DBG environment variable to 1:

export CUTENSOR_LOGINFO_DBG=1