User Guide¶
Nomenclature¶
The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.
An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\) that describes how far apart, in physical memory, two logically consecutive elements along that mode are. Strides serve a purpose similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.
cuTENSOR, by default, adheres to a generalized column-major data layout. For example: \(A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}\) is an order-3 tensor with the extents of the a-mode, b-mode, and c-mode being 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:
\(stride(a) = 1\)
\(stride(b) = extent(a)\)
\(stride(c) = extent(a) * extent(b)\).
For a general order-n tensor \(A_{i_1,i_2,...,i_n}\) we require that the strides do not lead to overlapping memory accesses; that is, \(stride(i_1) \geq 1\) and \(stride(i_{l}) \geq stride(i_{l-1}) \cdot extent(i_{l-1})\) for \(1 < l \leq n\).
We say that a tensor is packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\).
Einstein Notation¶
We adhere to the “Einstein notation”: modes that appear in the input tensors and not in the output tensor are implicitly contracted.
Performance Guidelines¶
In this section we assume a generalized column-major data layout (i.e., the modes on the left have the smallest strides). Most of the following guidelines aim to facilitate more regular memory access patterns:
Try to arrange the modes (w.r.t. increasing strides) of the tensor similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).
Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).
Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
Accuracy Guarantees¶
cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeType_t denotes the minimal accuracy that is guaranteed throughout the computation. Because it is only a guarantee of minimal accuracy, the library may choose a higher accuracy than that requested by the user (e.g., if the requested compute type is not supported for a given problem, or to have more kernels available to choose from).
For instance, consider a tensor contraction for which all tensors are of type CUDA_R_32F but the cutensorComputeType_t is CUTENSOR_R_MIN_16F: in that case cuTENSOR would use Nvidia's Tensor Cores with an accumulation type of CUDA_R_32F (i.e., providing higher precision than requested by the user).
Another illustrative example is a tensor contraction for which all tensors are of type CUDA_R_16F and the compute type is CUTENSOR_R_MIN_32F: in this case a parallel reduction (if required for performance) would have to be performed in CUDA_R_32F and would thus require auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would then be performed in CUDA_R_16F, which is lower than the cutensorComputeType_t requested by the user.
Scalar Types¶
The scalar type used for element-wise operations is a function of the output type and the compute type.
Supported GPUs¶
cuTENSOR supports any Nvidia GPU with a compute capability greater than or equal to 7.0.
Error messages¶
The library can print additional error diagnostics if it encounters an error.
These can be enabled by setting the CUTENSOR_LOGINFO_DBG environment variable to 1:
export CUTENSOR_LOGINFO_DBG=1