User Guide

Nomenclature

The term tensor refers to an order-n (a.k.a. n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\), which describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a purpose similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

cuTENSOR, by default, adheres to a generalized column-major data layout. For example, \(A_{a,b,c} \in \mathbb{R}^{4 \times 8 \times 12}\) is an order-3 tensor with the extent of the a-mode, b-mode, and c-mode being 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:

  • \(stride(a) = 1\)

  • \(stride(b) = extent(a)\)

  • \(stride(c) = extent(a) * extent(b)\).

A tensor is considered packed if it is stored contiguously in memory along all modes, that is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) \cdot extent(i_{l-1})\) for \(l = 2, \ldots, n\).
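For illustration, packed strides can be computed directly from the extents. The following C sketch does exactly that; the function name is ours and not part of the cuTENSOR API:

  #include <stdint.h>

  /* Compute packed, generalized column-major strides:
     stride[0] = 1; stride[l] = stride[l-1] * extent[l-1]. */
  static void packedStrides(int numModes, const int64_t extent[], int64_t stride[])
  {
      stride[0] = 1;
      for (int l = 1; l < numModes; ++l)
          stride[l] = stride[l - 1] * extent[l - 1];
  }

For the tensor above (extents 4, 8, and 12) this yields strides 1, 4, and 32.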

Einstein Notation

cuTENSOR adheres to the "Einstein notation": modes that appear in the input tensors but not in the output tensor are implicitly contracted (summed over).
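For example, in \(C_{m,n} = A_{m,k} B_{k,n}\) the k-mode appears in both inputs but not in the output and is therefore summed over. In cuTENSOR's C API, modes are labeled by 32-bit integers (character literals are a convenient choice), so a sketch of the corresponding mode arrays could look as follows:

  /* C_{m,n} = sum_k A_{m,k} * B_{k,n}: 'k' is contracted because it
     does not appear among the output modes. */
  int32_t modeA[] = {'m', 'k'};
  int32_t modeB[] = {'k', 'n'};
  int32_t modeC[] = {'m', 'n'};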

Performance Guidelines

This section assumes a generalized column-major data layout (i.e., the modes on the left have the smallest strides). Most of the following performance guidelines aim to facilitate more regular memory access patterns; the first two are illustrated in the sketch after this list:

  • Try to arrange the modes (w.r.t. increasing strides) of the tensor similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).

  • Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).

  • Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
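To illustrate the first two guidelines with the mode-array notation from above (a sketch; the mode labels are arbitrary):

  /* Preferable: modes appear in the same stride-increasing order in all
     tensors, and the batched mode 'l' has the largest stride. */
  int32_t modeC_good[] = {'a', 'b', 'c', 'l'};
  int32_t modeA_good[] = {'a', 'k', 'c', 'l'};

  /* Less favorable: the mode orders differ between the tensors, and the
     batched mode 'l' is among the faster-varying modes. */
  int32_t modeC_bad[] = {'a', 'l', 'b', 'c'};
  int32_t modeA_bad[] = {'l', 'a', 'k', 'c'};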

Software-managed Plan Cache

This section introduces the software-managed plan cache. Its key features are:

  • Minimize launch-related overhead (e.g., due to kernel selection)

  • Overhead-free autotuning (a.k.a. incremental autotuning)

    • This feature enables users to automatically find the best implementation for a given problem, thereby increasing the attained performance

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.

  • Store/read to/from file

    • Allows users to store the state of the cache to disk and reuse it at a later stage (see the sketch at the end of this section)

In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).

The plan cache is an experimental feature at this point; future changes to the API are possible.

Please refer to Plan Cache for a detailed description.
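As a minimal sketch of the store/read feature, assuming the file I/O entry points documented in the Plan Cache section (cutensorReadPlanCacheFromFile and cutensorWritePlanCacheToFile; please verify the exact signatures there):

  cutensorHandle_t handle;
  cutensorCreate(&handle);

  /* Assumed API (see Plan Cache): warm-start the cache from a previous
     run; on the very first run the file may simply not exist yet. */
  uint32_t numCachelinesRead = 0;
  cutensorReadPlanCacheFromFile(handle, "./cache.bin", &numCachelinesRead);

  /* ... create descriptors and plans, run contractions ... */

  /* Assumed API (see Plan Cache): persist the cache state to disk. */
  cutensorWritePlanCacheToFile(handle, "./cache.bin");
  cutensorDestroy(handle);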

Accuracy Guarantees

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeDescriptor_t encodes the minimal accuracy that is guaranteed throughout the computation. Because this is only a guarantee of minimal accuracy, the library may choose a higher accuracy than the one requested by the user (e.g., if the requested compute type is not supported for a given problem, or to have more kernels available to choose from).

For instance, let us consider a tensor contraction for which all tensors are of type CUTENSOR_R_32F but the compute descriptor is CUTENSOR_COMPUTE_DESC_16F: in that case cuTENSOR can use Nvidia’s Tensor Cores with an accumulation type of CUTENSOR_R_32F (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type CUTENSOR_R_16F while the compute descriptor is CUTENSOR_COMPUTE_DESC_32F: in this case a parallel reduction (if required for performance) would have to be performed in CUTENSOR_R_32F and would thus require auxiliary workspace. To be precise, cuTENSOR would not choose a serial reduction (via atomics) through the output tensor in this case, since part of the final reduction would then be performed in CUTENSOR_R_16F, i.e., at lower accuracy than the user requested via the compute descriptor.
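In code, the requested accuracy is expressed by passing the compute descriptor when the operation is created. A sketch for the first example above (tensor descriptor and mode setup elided):

  /* All tensors hold CUTENSOR_R_32F data, but only half-precision
     accuracy is requested; cuTENSOR is free to accumulate in FP32. */
  cutensorOperationDescriptor_t desc;
  cutensorCreateContraction(handle, &desc,
                            descA, modeA, CUTENSOR_OP_IDENTITY,
                            descB, modeB, CUTENSOR_OP_IDENTITY,
                            descC, modeC, CUTENSOR_OP_IDENTITY,
                            descC, modeC, /* D is identical to C */
                            CUTENSOR_COMPUTE_DESC_16F);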

cuTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the scaled tensor expression are ignored, i.e., a zero scalar takes precedence over a NaN from a tensor. Otherwise, NaNs from tensors follow normal IEEE 754 behavior.

To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).

Scalar Types

Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:

  • Output type: CUTENSOR_R_16F, CUTENSOR_R_16BF, or CUTENSOR_R_32F
    Compute descriptor: CUTENSOR_COMPUTE_DESC_16F, CUTENSOR_COMPUTE_DESC_16BF, CUTENSOR_COMPUTE_DESC_TF32, CUTENSOR_COMPUTE_DESC_3XTF32, or CUTENSOR_COMPUTE_DESC_32F
    Scalar type: CUTENSOR_R_32F

  • Output type: CUTENSOR_R_64F
    Compute descriptor: CUTENSOR_COMPUTE_DESC_32F or CUTENSOR_COMPUTE_DESC_64F
    Scalar type: CUTENSOR_R_64F

  • Output type: CUTENSOR_C_16F, CUTENSOR_C_16BF, or CUTENSOR_C_32F
    Compute descriptor: CUTENSOR_COMPUTE_DESC_16F, CUTENSOR_COMPUTE_DESC_16BF, CUTENSOR_COMPUTE_DESC_TF32, CUTENSOR_COMPUTE_DESC_3XTF32, or CUTENSOR_COMPUTE_DESC_32F
    Scalar type: CUTENSOR_C_32F

  • Output type: CUTENSOR_C_64F
    Compute descriptor: CUTENSOR_COMPUTE_DESC_32F or CUTENSOR_COMPUTE_DESC_64F
    Scalar type: CUTENSOR_C_64F
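For example, for CUTENSOR_R_16F output combined with CUTENSOR_COMPUTE_DESC_32F, the table above prescribes CUTENSOR_R_32F scalars, i.e., alpha and beta must be passed as single-precision floats (a sketch; plan and device buffers assumed to be set up):

  /* The scalars follow the table above (FP32), even though the
     tensor data itself is FP16. */
  float alpha = 1.0f;
  float beta  = 0.0f;
  cutensorContract(handle, plan, &alpha, A_d, B_d,
                   &beta, C_d, C_d, work, workspaceSize, stream);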

Supported Unary Operators

cuTENSOR supports unary operators for Element-wise Operations. The following table lists the supported data types and the corresponding operators:

  • Data type: CUTENSOR_R_32F, CUTENSOR_R_64F, CUTENSOR_R_16F, or CUTENSOR_R_16BF
    Unary operators: CUTENSOR_OP_IDENTITY, CUTENSOR_OP_SQRT, CUTENSOR_OP_RCP, CUTENSOR_OP_RELU, CUTENSOR_OP_SIGMOID, CUTENSOR_OP_TANH, CUTENSOR_OP_EXP, CUTENSOR_OP_LOG, CUTENSOR_OP_ABS, CUTENSOR_OP_NEG, CUTENSOR_OP_SIN, CUTENSOR_OP_COS, CUTENSOR_OP_TAN, CUTENSOR_OP_SINH, CUTENSOR_OP_COSH, CUTENSOR_OP_ASIN, CUTENSOR_OP_ACOS, CUTENSOR_OP_ATAN, CUTENSOR_OP_ASINH, CUTENSOR_OP_ACOSH, CUTENSOR_OP_ATANH, CUTENSOR_OP_CEIL, or CUTENSOR_OP_FLOOR

  • Data type: CUTENSOR_C_32F, CUTENSOR_C_64F, or CUTENSOR_C_16F
    Unary operators: CUTENSOR_OP_IDENTITY or CUTENSOR_OP_CONJ

  • Data type: CUTENSOR_R_8I, CUTENSOR_R_8U, CUTENSOR_R_32I, or CUTENSOR_R_32U
    Unary operators: CUTENSOR_OP_IDENTITY or CUTENSOR_OP_RELU

Supported GPUs

cuTENSOR supports any Nvidia GPU with compute capability greater than or equal to 6.0.

CUDA Graph Support

All operations in cuTENSOR can be captured using CUDA graphs.

The only exceptions are operations that are actively being autotuned (see Software-managed Plan Cache). That restriction exists because, during autotuning, cuTENSOR iterates through different kernel candidates; while graph capture still works in that case, it is not recommended, since the graph may capture a suboptimal kernel.
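A minimal sketch of capturing a contraction into a CUDA graph (plan, scalars, device buffers, and stream assumed to be set up, with no active autotuning):

  cudaGraph_t graph;
  cudaGraphExec_t graphExec;

  /* Record the cuTENSOR call into a graph instead of executing it. */
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  cutensorContract(handle, plan, &alpha, A_d, B_d,
                   &beta, C_d, C_d, work, workspaceSize, stream);
  cudaStreamEndCapture(stream, &graph);

  /* Instantiate once, then launch the captured work repeatedly. */
  cudaGraphInstantiateWithFlags(&graphExec, graph, 0);
  cudaGraphLaunch(graphExec, stream);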

Logging

The logging mechanism in cuTENSOR can be enabled by setting the following environment variables before launching the target application:

CUTENSOR_LOG_LEVEL=<level> - where <level> is one of the following levels:

  • “0” - Off - logging is disabled (default)

  • “1” - Error - only errors will be logged

  • “2” - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • “3” - Hints - hints that can potentially improve the application’s performance

  • “4” - Info - provides general information about the library execution and may contain details about the heuristic status

  • “5” - API Trace - API calls will log their parameters and important information

CUTENSOR_LOG_MASK=<mask> - where <mask> is a combination (bitwise OR) of the following masks:

  • “0” - Off

  • “1” - Error

  • “2” - Trace

  • “4” - Hints

  • “8” - Info

  • “16” - API Trace

CUTENSOR_LOG_FILE=<file_name> - where <file_name> is a path to a logging file. The file name may contain %i, which will be replaced with the process ID, e.g., “<file_name>_%i.log”.

If CUTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.
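For example, to log the parameters of kernel-launching API calls (level 2) to a per-process file:

export CUTENSOR_LOG_LEVEL=2
export CUTENSOR_LOG_FILE=cutensor_%i.log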

cuTENSOR can automatically add NVTX markers for profiling purposes by setting CUTENSOR_NVTX_LEVEL=<level> to one of the following values:

  • “0” - Off - NVTX markers are disabled (default)

  • “1” - On - selected API calls add NVTX markers
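For example:

export CUTENSOR_NVTX_LEVEL=1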

Environment Variables

The environment variables in this section modify cuTENSOR’s runtime behavior. Note that these environment variables are read only when the handle is initialized (i.e., during cutensorCreate()); hence, changes to the environment variables will only take effect for a newly-initialized handle.

NVIDIA_TF32_OVERRIDE, when set to 0, will override any defaults or programmatic configuration of NVIDIA libraries, and never accelerate FP32 computations with TF32 tensor cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change behavior based on this environment variable. Any other setting besides 0 is reserved for future use.

export NVIDIA_TF32_OVERRIDE=0

CUTENSOR_DISABLE_PLAN_CACHE, when set to 1, disables the plan cache (see Software-managed Plan Cache).

export CUTENSOR_DISABLE_PLAN_CACHE=1