User Guide

Nomenclature

The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\); this stride describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a purpose similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

cuTENSOR, by default, adheres to a generalized column-major data layout. For example: \(A_{a,b,c} \in \mathbb{R}^{4 \times 8 \times 12}\) is an order-3 tensor with the extent of the a-mode, b-mode, and c-mode respectively being 4, 8, and 12. If not explicitly specified, the strides are assumed to be:

  • \(stride(a) = 1\)

  • \(stride(b) = extent(a)\)

  • \(stride(c) = extent(a) * extent(b)\).

For a general order-n tensor \(A_{i_1,i_2,...,i_n}\) we require that the strides do not lead to overlapping memory accesses; in particular, \(stride(i_1) \geq 1\) and \(stride(i_l) \geq stride(i_{l-1}) * extent(i_{l-1})\).

We say that a tensor is packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\).
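
As a concrete illustration, the following minimal C sketch computes the default packed strides for the \(4 \times 8 \times 12\) example above and then describes the same tensor to the library via the cuTENSOR 1.x API; passing NULL for the stride array is assumed to select this packed layout.

    #include <stdint.h>
    #include <stdio.h>
    #include <cutensor.h>

    int main(void)
    {
        // Extents of the order-3 tensor A_{a,b,c} from the example above.
        const uint32_t numModes  = 3;
        const int64_t  extent[3] = {4, 8, 12};   // extent(a), extent(b), extent(c)

        // Default generalized column-major (packed) strides:
        // stride(i_1) = 1, stride(i_l) = stride(i_{l-1}) * extent(i_{l-1}).
        int64_t stride[3];
        stride[0] = 1;
        for (uint32_t l = 1; l < numModes; ++l)
            stride[l] = stride[l - 1] * extent[l - 1];
        printf("strides: %lld %lld %lld\n",
               (long long)stride[0], (long long)stride[1], (long long)stride[2]);

        // Describe the same tensor to cuTENSOR (1.x API); passing NULL instead
        // of `stride` is assumed to yield this packed layout by default.
        cutensorHandle_t handle;
        cutensorInit(&handle);

        cutensorTensorDescriptor_t descA;
        cutensorInitTensorDescriptor(&handle, &descA, numModes, extent, stride,
                                     CUDA_R_32F, CUTENSOR_OP_IDENTITY);
        return 0;
    }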

Einstein Notation

We adhere to the “Einstein notation”: modes that appear in the input tensors but not in the output tensor are implicitly contracted (i.e., summed over).
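
For example, in the contraction \(C_{m,u,n,v} = A_{m,h,k,n} B_{u,k,v,h}\), the modes \(h\) and \(k\) appear in the inputs but not in the output and are therefore contracted: \(C_{m,u,n,v} = \sum_{h}\sum_{k} A_{m,h,k,n} B_{u,k,v,h}\).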

Performance Guidelines

In this section we assume a generalized column-major data layout (i.e., the modes on the left have the smallest strides). Most of the following performance guidelines are aimed at facilitating more regular memory access patterns:

  • Try to arrange the modes (w.r.t. increasing strides) of the tensor similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).

  • Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).

  • Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.

Software-managed Plan Cache (beta)

This section introduces the software-managed plan cache. Its key features are:

  • Minimize launch-related overhead (e.g., due to kernel selection)

  • Overhead-free autotuning (a.k.a. incremental autotuning): this feature enables users to automatically find the best implementation for a given problem and thereby increase the attained performance

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t

  • Store/read to/from file: allows users to store the state of the cache to disk and reuse it at a later stage

In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorContractionDescriptor_t) to an actual implementation (encoded by cutensorContractionPlan_t).

The plan cache is an experimental feature at this point; future changes to the API are possible.

Please refer to Plan Cache (beta) for a detailed description.
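
To make this concrete, the sketch below attaches a caller-owned cache to a handle so that subsequently created contraction plans can be looked up and reused. It is written against the cuTENSOR 1.x API and assumes the plan-cache entry points cutensorHandleAttachPlanCachelines / cutensorHandleDetachPlanCachelines and the cutensorPlanCacheline_t type documented in Plan Cache (beta).

    #include <stdlib.h>
    #include <cutensor.h>

    // Sketch: attach a user-owned plan cache to a handle (cuTENSOR 1.x).
    // The cacheline count (64) is an arbitrary choice for illustration.
    static cutensorPlanCacheline_t* attachPlanCache(cutensorHandle_t* handle)
    {
        const uint32_t numCachelines = 64;
        cutensorPlanCacheline_t* cachelines = (cutensorPlanCacheline_t*)
            malloc(numCachelines * sizeof(cutensorPlanCacheline_t));
        if (cachelines == NULL)
            return NULL;

        // The storage is owned by the caller; cuTENSOR only uses it as cache space.
        if (cutensorHandleAttachPlanCachelines(handle, cachelines, numCachelines)
            != CUTENSOR_STATUS_SUCCESS)
        {
            free(cachelines);
            return NULL;
        }
        return cachelines;   // keep the pointer; detach and free it before exit
    }

    // Counterpart: detach the cache and release its storage before the handle goes away.
    static void detachPlanCache(cutensorHandle_t* handle, cutensorPlanCacheline_t* cachelines)
    {
        cutensorHandleDetachPlanCachelines(handle);
        free(cachelines);
    }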

Accuracy Guarantees

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeType_t refers to the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).

For instance, consider a tensor contraction for which all tensors are of type CUDA_R_32F but the cutensorComputeType_t is CUTENSOR_COMPUTE_16F: in that case, cuTENSOR would use NVIDIA’s Tensor Cores with an accumulation type of CUDA_R_32F (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type CUDA_R_16F and the cutensorComputeType_t is CUTENSOR_COMPUTE_32F: in this case the parallel reduction (if required for performance) would have to be performed in CUDA_R_32F and thus requires auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would then be performed in CUDA_R_16F, which is lower precision than the compute type requested by the user.

cuTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the scaled tensor expression are ignored, i.e., a zero scalar takes precedence over a NaN from a tensor. NaNs originating from a tensor otherwise follow normal IEEE 754 behavior.

To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\). Then \(\alpha A_{i,j} B_{i,j} = \textrm{NaN}\), \(\beta A_{i,j} B_{i,j} = 0\), \(\alpha A'_{i,j} B_{i,j} = \textrm{NaN}\), and \(\beta A'_{i,j} B_{i,j} = 0\).
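
The difference between the two conventions can be mimicked on the host (this is ordinary C, not cuTENSOR code): IEEE arithmetic propagates the NaN even when multiplying by zero, whereas the BLAS-style convention drops the scaled term entirely when its scalar is zero.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const float alpha = 1.0f, beta = 0.0f;
        const float a = 1.0f, b = NAN;

        // Plain IEEE 754: multiplying by zero does not silence the NaN.
        printf("%f\n", alpha * a * b);   // nan
        printf("%f\n", beta  * a * b);   // nan as well, under IEEE semantics

        // BLAS/cuTENSOR convention: a zero scalar takes precedence, so the whole
        // scaled term is treated as zero and the NaN in the tensor is ignored.
        const float term = (beta == 0.0f) ? 0.0f : beta * a * b;
        printf("%f\n", term);            // 0.000000
        return 0;
    }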

Scalar Types

The scalar type used for element-wise operations is a function of the output type and the compute type:

  Output type                             | cutensorComputeType_t                                                                        | Scalar type
  ----------------------------------------+----------------------------------------------------------------------------------------------+------------
  CUDA_R_16F, CUDA_R_16BF, or CUDA_R_32F  | CUTENSOR_COMPUTE_16F, CUTENSOR_COMPUTE_16BF, CUTENSOR_COMPUTE_TF32, or CUTENSOR_COMPUTE_32F  | CUDA_R_32F
  CUDA_R_64F                              | CUTENSOR_COMPUTE_32F or CUTENSOR_COMPUTE_64F                                                 | CUDA_R_64F
  CUDA_C_16F, CUDA_C_16BF, or CUDA_C_32F  | CUTENSOR_COMPUTE_16F, CUTENSOR_COMPUTE_16BF, CUTENSOR_COMPUTE_TF32, or CUTENSOR_COMPUTE_32F  | CUDA_C_32F
  CUDA_C_32F                              | CUTENSOR_C_MIN_TF32                                                                          | CUDA_C_32F
  CUDA_C_64F                              | CUTENSOR_COMPUTE_32F or CUTENSOR_COMPUTE_64F                                                 | CUDA_C_64F

As of cuTENSOR 1.2.0, cutensorComputeType_t no longer distinguishes between real- and complex-valued compute types; the corresponding enumerators (e.g., CUTENSOR_R_MIN_32F and CUTENSOR_C_MIN_32F) have been deprecated.
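
For example, per the table above, a contraction whose tensors are CUDA_R_16F with compute type CUTENSOR_COMPUTE_32F uses CUDA_R_32F scalars, so alpha and beta are declared as plain float. The hypothetical helper below is only a sketch of the declarations; the contraction call itself is omitted.

    #include <cuda_fp16.h>   // __half: the element type corresponding to CUDA_R_16F

    // Sketch: half-precision tensors (CUDA_R_16F) with CUTENSOR_COMPUTE_32F.
    // Per the table above, the scalar type is CUDA_R_32F, i.e., plain float.
    void scalarTypeExample(const __half* A, const __half* B, __half* C)
    {
        float alpha = 1.0f;   // scalar type CUDA_R_32F
        float beta  = 0.0f;   // scalar type CUDA_R_32F

        // &alpha and &beta would be passed (as const void*) to the contraction
        // routine alongside the half-precision tensor data A, B, and C.
        (void)alpha; (void)beta; (void)A; (void)B; (void)C;
    }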

Supported GPUs

cuTENSOR supports any NVIDIA GPU with a compute capability of 6.0 or higher.

Environment Variables

The environment variables in this section modify cuTENSOR’s runtime behavior. Note that these environment variables are only read when the handle is initialized (i.e., during cutensorInit()); hence, changes to the environment variables only take effect for a newly initialized handle.

CUTENSOR_LOGINFO_DBG, when set to 1, enables additional error diagnostics if an error is encountered. These error diagnostics are printed to the standard output.

export CUTENSOR_LOGINFO_DBG=1

NVIDIA_TF32_OVERRIDE, when set to 0, will override any defaults or programmatic configuration of NVIDIA libraries, and never accelerate FP32 computations with TF32 tensor cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change behavior based on this environment variable. Any other setting besides 0 is reserved for future use.

export NVIDIA_TF32_OVERRIDE=0

CUTENSOR_DISABLE_PLAN_CACHE, when set to 1, disables the plan cache (see Software-managed Plan Cache (beta)).

export CUTENSOR_DISABLE_PLAN_CACHE=1