User Guide

Nomenclature

The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\); this stride describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

cuTENSOR, by default, adheres to a generalized column-major data layout. For example: \(A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}\) is an order-3 tensor with the extent of the a-mode, b-mode, and c-mode respectively being 4, 8, and 12. If not explicitly specified, the strides are assumed to be:

  • \(stride(a) = 1\)

  • \(stride(b) = extent(a)\)

  • \(stride(c) = extent(a) * extent(b)\).

For a general order-n tensor \(A_{i_1,i_2,...,i_n}\) we require that the strides do not lead to overlapping memory accesses; that is, \(stride(i_1) >= 1\) and \(stride(i_{l}) >= stride(i_{l-1}) * extent(i_{l-1})\).

We say that a tensor is packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\).
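These two definitions can be sketched in a few lines of plain Python (an illustration of the layout rules only, not cuTENSOR API):

```python
# Illustrative sketch (not cuTENSOR API): computes the default generalized
# column-major strides that cuTENSOR assumes when none are given, and checks
# whether a given stride list describes a packed tensor.

def default_strides(extents):
    """stride(i_1) = 1, stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})."""
    strides = []
    s = 1
    for e in extents:
        strides.append(s)
        s *= e
    return strides

def is_packed(extents, strides):
    return strides == default_strides(extents)

# The order-3 example from the text: A in R^{4 x 8 x 12}
extents = [4, 8, 12]
print(default_strides(extents))          # [1, 4, 32]
print(is_packed(extents, [1, 4, 32]))    # True
# A strided view (e.g., every other element along the a-mode) is not packed:
print(is_packed(extents, [2, 8, 64]))    # False
```

The non-packed case corresponds to operating on a sub-tensor, which is exactly what user-specified strides enable.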

Einstein Notation

We adhere to Einstein notation: modes that appear in the input tensors but not in the output tensor are implicitly contracted (summed over).
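For example, in \(C_{a,b} = A_{a,k} B_{k,b}\) the mode k appears in both inputs but not in the output, so it is contracted. A plain-Python sketch of this implicit summation (illustrative only; cuTENSOR performs this on the GPU):

```python
# Illustrative sketch of Einstein notation: C_{a,b} = A_{a,k} B_{k,b}.
# The mode k appears only in the inputs, so it is implicitly summed over.

def contract_ab_ak_kb(A, B):
    na, nk = len(A), len(A[0])
    nb = len(B[0])
    C = [[0.0] * nb for _ in range(na)]
    for a in range(na):
        for b in range(nb):
            for k in range(nk):  # implicit contraction over k
                C[a][b] += A[a][k] * B[k][b]
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
print(contract_ab_ak_kb(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```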

Performance Guidelines

In this section we assume a generalized column-major data layout (i.e., the modes on the left have the smallest stride). Most of the following performance guidelines aim to facilitate more regular memory access patterns:

  • Try to arrange the modes similarly (w.r.t. increasing strides) in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).

  • Try to keep batched modes as the slowest-varying modes (i.e., those with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).

  • Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.

Software-managed Plan Cache (beta)

This section introduces the software-managed plan cache; its key features are:

  • Minimize launch-related overhead (e.g., due to kernel selection)

  • Overhead-free autotuning (a.k.a. incremental autotuning): enables users to automatically find the best implementation for a given problem, thereby increasing the attained performance

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t

  • Store/read to/from file: allows users to store the state of the cache to disk and reuse it at a later stage

In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorContractionDescriptor_t) to an actual implementation (encoded by cutensorContractionPlan_t).
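This lookup-table view can be modeled in a few lines of Python (a conceptual model only; the real cache, cutensorContractionDescriptor_t, and cutensorContractionPlan_t are C objects, and the real cache is thread-safe):

```python
# Conceptual model of a plan cache: map a hashable problem description to a
# previously selected "plan" so that kernel selection runs only once per
# distinct problem. This is NOT the cuTENSOR API, just a sketch of the idea.

plan_cache = {}
selection_runs = 0

def select_plan(problem):
    """Stand-in for expensive kernel selection / autotuning."""
    global selection_runs
    selection_runs += 1
    return f"plan-for-{problem}"

def get_plan(problem):
    if problem not in plan_cache:      # cache miss: pay the selection cost once
        plan_cache[problem] = select_plan(problem)
    return plan_cache[problem]         # cache hit: no selection overhead

p = ("contraction", ("a", "k"), ("k", "b"), ("a", "b"))
get_plan(p)
get_plan(p)
print(selection_runs)  # 1  (the second lookup was a cache hit)
```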

The plan cache is an experimental feature at this point – future changes to the API are possible.

Please refer to Plan Cache (beta) for a detailed description.

Accuracy Guarantees

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeType_t refers to the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).

For instance, let us consider a tensor contraction for which all tensors are of type CUDA_R_32F but the cutensorComputeType_t is CUTENSOR_COMPUTE_16F; in that case, cuTENSOR would use NVIDIA's Tensor Cores with an accumulation type of CUDA_R_32F (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type CUDA_R_16F while the cutensorComputeType_t is CUTENSOR_COMPUTE_32F: in this case a parallel reduction (if required for performance) would have to be performed in CUDA_R_32F and would thus require auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would then be performed in CUDA_R_16F, which is lower than the accuracy requested by the user.

cuTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the scaled tensor expression are ignored, i.e., a zero scalar takes precedence over a NaN from a tensor. NaNs from tensors otherwise follow normal IEEE 754 behavior.

To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).
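The convention can be sketched in plain Python (an illustration of the semantics, not cuTENSOR code):

```python
import math

# BLAS-style scaling: a zero scalar short-circuits the multiply, so a NaN in
# the tensor term is ignored. Plain IEEE 754 would give 0 * NaN = NaN instead.

def blas_scale(scalar, term):
    if scalar == 0.0:  # zero scalar takes precedence over NaN in the term
        return 0.0
    return scalar * term

nan = float("nan")
print(blas_scale(0.0, nan))              # 0.0   (BLAS convention)
print(math.isnan(0.0 * nan))             # True  (plain IEEE 754)
print(math.isnan(blas_scale(1.0, nan)))  # True  (nonzero scalar: NaN propagates)
```

Note that with a nonzero scalar the NaN still propagates normally, matching the \(\alpha A'_{i,j} B_{i,j} = \textrm{NaN}\) case above.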

Scalar Types

Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:

Output type                            | cutensorComputeType_t                                                                       | Scalar type
---------------------------------------|---------------------------------------------------------------------------------------------|------------
CUDA_R_16F, CUDA_R_16BF, or CUDA_R_32F | CUTENSOR_COMPUTE_16F, CUTENSOR_COMPUTE_16BF, CUTENSOR_COMPUTE_TF32, or CUTENSOR_COMPUTE_32F | CUDA_R_32F
CUDA_R_64F                             | CUTENSOR_COMPUTE_32F or CUTENSOR_COMPUTE_64F                                                | CUDA_R_64F
CUDA_C_16F, CUDA_C_16BF, or CUDA_C_32F | CUTENSOR_COMPUTE_16F, CUTENSOR_COMPUTE_16BF, CUTENSOR_COMPUTE_TF32, or CUTENSOR_COMPUTE_32F | CUDA_C_32F
CUDA_C_32F                             | CUTENSOR_COMPUTE_TF32                                                                       | CUDA_C_32F
CUDA_C_64F                             | CUTENSOR_COMPUTE_32F or CUTENSOR_COMPUTE_64F                                                | CUDA_C_64F
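The mapping can be modeled as a small lookup, here sketched in Python. The enum identifiers are the real cuTENSOR/CUDA names, but the dictionary itself is purely illustrative, not part of the cuTENSOR API:

```python
# Scalar type as a function of (output type, compute type), mirroring the
# table above. Illustrative only; not part of the cuTENSOR API.

SCALAR_TYPE = {}

for out in ("CUDA_R_16F", "CUDA_R_16BF", "CUDA_R_32F"):
    for comp in ("CUTENSOR_COMPUTE_16F", "CUTENSOR_COMPUTE_16BF",
                 "CUTENSOR_COMPUTE_TF32", "CUTENSOR_COMPUTE_32F"):
        SCALAR_TYPE[(out, comp)] = "CUDA_R_32F"

for comp in ("CUTENSOR_COMPUTE_32F", "CUTENSOR_COMPUTE_64F"):
    SCALAR_TYPE[("CUDA_R_64F", comp)] = "CUDA_R_64F"
    SCALAR_TYPE[("CUDA_C_64F", comp)] = "CUDA_C_64F"

for out in ("CUDA_C_16F", "CUDA_C_16BF", "CUDA_C_32F"):
    for comp in ("CUTENSOR_COMPUTE_16F", "CUTENSOR_COMPUTE_16BF",
                 "CUTENSOR_COMPUTE_TF32", "CUTENSOR_COMPUTE_32F"):
        SCALAR_TYPE[(out, comp)] = "CUDA_C_32F"

print(SCALAR_TYPE[("CUDA_R_16F", "CUTENSOR_COMPUTE_16F")])  # CUDA_R_32F
print(SCALAR_TYPE[("CUDA_C_64F", "CUTENSOR_COMPUTE_32F")])  # CUDA_C_64F
```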

As of cuTENSOR 1.2.0, cutensorComputeType_t no longer distinguishes between real- and complex-valued compute types; e.g., CUTENSOR_R_MIN_32F and CUTENSOR_C_MIN_32F have been deprecated.

Supported Unary Operators

cuTENSOR supports unary operators for Element-wise Operations. The supported operators for each type are:

  • CUDA_R_32F, CUDA_R_64F, CUDA_R_16F, or CUDA_R_16BF: CUTENSOR_OP_IDENTITY, CUTENSOR_OP_SQRT, CUTENSOR_OP_RCP, CUTENSOR_OP_RELU, CUTENSOR_OP_SIGMOID, CUTENSOR_OP_TANH, CUTENSOR_OP_EXP, CUTENSOR_OP_LOG, CUTENSOR_OP_ABS, CUTENSOR_OP_NEG, CUTENSOR_OP_SIN, CUTENSOR_OP_COS, CUTENSOR_OP_TAN, CUTENSOR_OP_SINH, CUTENSOR_OP_COSH, CUTENSOR_OP_ASIN, CUTENSOR_OP_ACOS, CUTENSOR_OP_ATAN, CUTENSOR_OP_ASINH, CUTENSOR_OP_ACOSH, CUTENSOR_OP_ATANH, CUTENSOR_OP_CEIL, CUTENSOR_OP_FLOOR

  • CUDA_C_32F, CUDA_C_64F, or CUDA_C_16F: CUTENSOR_OP_IDENTITY, CUTENSOR_OP_CONJ

  • CUDA_R_8I, CUDA_R_8U, CUDA_R_32I, or CUDA_R_32U: CUTENSOR_OP_IDENTITY, CUTENSOR_OP_RELU

Supported GPUs

cuTENSOR supports any NVIDIA GPU with a compute capability greater than or equal to 6.0.

CUDA Graph Support

All operations in cuTENSOR can be captured using CUDA graphs.

The only mode of operation that is not supported for graph capture is a contraction (cutensorContraction) whose plan is actively being autotuned (see Software-managed Plan Cache (beta)). This restriction exists because, during autotuning, cuTENSOR iterates through different kernels; while graph capture still works in that case, it is not recommended, as it may capture a suboptimal kernel.

cuTENSOR Logging

The cuTENSOR logging mechanism can be enabled by setting the following environment variables before launching the target application:

CUTENSOR_LOG_LEVEL=<level> - where level is one of the following:

  • “0” - Off - logging is disabled (default)

  • “1” - Error - only errors will be logged

  • “2” - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • “3” - Hints - hints that can potentially improve the application’s performance

  • “4” - Info - provides general information about the library execution, may contain details about heuristic status

  • “5” - API Trace - API calls will log their parameter and important information

CUTENSOR_LOG_MASK=<mask> - where mask is a bitwise OR of the following values (e.g., CUTENSOR_LOG_MASK=5 logs errors and hints):

  • “0” - Off

  • “1” - Error

  • “2” - Trace

  • “4” - Hints

  • “8” - Info

  • “16” - API Trace

CUTENSOR_LOG_FILE=<file_name> - where file_name is the path to a logging file. The file name may contain %i, which will be replaced with the process id, e.g., “<file_name>_%i.log”.

If CUTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.

cuTENSOR can also automatically add NVTX markers for profiling purposes when CUTENSOR_NVTX_LEVEL=<level> is set - where level is one of the following:

  • “0” - Off - NVTX markers are disabled (default)

  • “1” - ON - Selected API calls add NVTX markers

Environment Variables

The environment variables in this section modify cuTENSOR’s runtime behavior. Note that these environment variables are read only when the handle is initialized (i.e., cutensorInit()); hence, changes to the environment variables will only take effect for a newly-initialized handle.

NVIDIA_TF32_OVERRIDE, when set to 0, will override any defaults or programmatic configuration of NVIDIA libraries, and never accelerate FP32 computations with TF32 tensor cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change behavior based on this environment variable. Any other setting besides 0 is reserved for future use.

export NVIDIA_TF32_OVERRIDE=0

CUTENSOR_DISABLE_PLAN_CACHE, when set to 1, disables the plan cache (see Software-managed Plan Cache (beta))

export CUTENSOR_DISABLE_PLAN_CACHE=1