User Guide#

Nomenclature#

The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has \(n\) modes, and each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\), which describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a purpose similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

cuTENSOR, by default, adheres to a generalized column-major data layout. For example, \(A_{a,b,c} \in \mathbb{R}^{4\times 8\times 12}\) is an order-3 tensor whose a-mode, b-mode, and c-mode have extents 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:

  • \(stride(a) = 1\)

  • \(stride(b) = extent(a)\)

  • \(stride(c) = extent(a) * extent(b)\).

A tensor is considered to be packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\).
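
As a concrete illustration, the packed strides for the \(4\times 8\times 12\) example above follow from the recurrence just given. The following minimal C sketch (illustrative only; not part of the cuTENSOR API) prints the resulting strides:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Extents of the order-3 tensor A[a,b,c] from the example above. */
        const int64_t extent[3] = {4, 8, 12};
        int64_t stride[3];

        /* Packed, generalized column-major layout:
           stride(i_1) = 1, stride(i_l) = stride(i_{l-1}) * extent(i_{l-1}). */
        stride[0] = 1;
        for (int i = 1; i < 3; ++i)
            stride[i] = stride[i - 1] * extent[i - 1];

        printf("strides: %lld %lld %lld\n", (long long)stride[0],
               (long long)stride[1], (long long)stride[2]); /* 1 4 32 */
        return 0;
    }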

Einstein Notation#

cuTENSOR adheres to the Einstein notation: modes that appear in the input tensors but not in the output tensor are implicitly contracted (i.e., summed over).
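
For example, in \(C_{m,n} = A_{m,k} B_{k,n}\) the mode \(k\) appears in both inputs but not in the output; the expression is therefore shorthand for \(C_{m,n} = \sum_{k} A_{m,k} B_{k,n}\), i.e., an ordinary matrix-matrix multiplication.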

Performance Guidelines#

This section assumes a generalized column-major data layout (i.e., the modes on the left have the smallest strides). Most of the following performance guidelines aim to facilitate more regular memory access patterns:

  • Try to arrange the modes (w.r.t. increasing strides) similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\) (see the code sketch after this list).

  • Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).

  • Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
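
In code, modes are plain 32-bit integers, and character literals are a convenient way to label them. A hypothetical fragment expressing the preferred ordering from the first guideline:

    #include <stdint.h>

    /* Mode labels for C[a,b,c] = A[a,k,c] * B[k,b]; in all three tensors
       the shared modes appear in the same order w.r.t. increasing strides. */
    int32_t modeC[] = {'a', 'b', 'c'};
    int32_t modeA[] = {'a', 'k', 'c'};
    int32_t modeB[] = {'k', 'b'};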

Software-managed Plan Cache#

This section introduces the software-managed plan cache. Its key features are:

  • Minimize launch-related overhead (e.g., due to kernel selection)

  • Overhead-free autotuning (a.k.a. incremental autotuning)

    • This feature enables users to automatically find the best implementation for a given problem, thereby increasing the attained performance

  • The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.

  • Store/read to/from file

    • Allows users to store the state of the cache to disk and reuse it at a later stage

In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).

The plan cache is an experimental feature at this point; future changes to the API are possible.

Please refer to Plan Cache for a detailed description.
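
The following hedged C sketch shows how the store/read feature might be used. The function names follow the Plan Cache section of the cuTENSOR 2.x documentation and should be treated as assumptions to verify against your installed headers:

    #include <stdint.h>
    #include <cutensor.h>

    void persist_plan_cache(void)
    {
        cutensorHandle_t handle;
        cutensorCreate(&handle);

        /* Warm start: load a previously stored cache, if present
           (assumed entry point; see the Plan Cache section). */
        uint32_t numCachelinesRead = 0;
        cutensorHandleReadPlanCacheFromFile(handle, "cutensor_cache.bin",
                                            &numCachelinesRead);

        /* ... create descriptors/plans and run operations as usual; plans
           are looked up in and inserted into the cache automatically ... */

        /* Persist the cache for the next run (assumed entry point). */
        cutensorHandleWritePlanCacheToFile(handle, "cutensor_cache.bin");

        cutensorDestroy(handle);
    }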

Accuracy Guarantees#

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeDescriptor_t encodes the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).

For instance, let us consider a tensor contraction for which all tensors are of type CUTENSOR_R_32F but the compute descriptor is CUTENSOR_COMPUTE_DESC_16F: in that case cuTENSOR can use NVIDIA's Tensor Cores with an accumulation type of CUTENSOR_R_32F (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type CUTENSOR_R_16F and the compute descriptor is CUTENSOR_COMPUTE_DESC_32F: in this case a parallel reduction (if required for performance) would have to be performed in CUTENSOR_R_32F and thus require auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would then be performed in CUTENSOR_R_16F, which is lower precision than what the user requested via the compute descriptor.

cuTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the tensor expression it scales are ignored; that is, a zero scalar takes precedence over a NaN from a tensor. NaNs originating from a tensor otherwise follow normal IEEE 754 behavior.

To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).
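
The convention can be summarized by the following host-side C sketch (illustrative semantics only; not cuTENSOR code):

    /* A zero scalar takes precedence over a NaN in the tensor term it
       scales; otherwise normal IEEE 754 propagation applies, so a zero
       coming from a tensor times NaN still yields NaN. */
    double scale(double scalar, double tensor_value)
    {
        if (scalar == 0.0)
            return 0.0;               /* zero scalar short-circuits the NaN */
        return scalar * tensor_value; /* IEEE 754: NaN propagates */
    }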

Scalar Types#

Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:

| Output type | cutensorComputeDescriptor_t | Scalar type |
| --- | --- | --- |
| CUTENSOR_R_16F, CUTENSOR_R_16BF, or CUTENSOR_R_32F | CUTENSOR_COMPUTE_DESC_16F, CUTENSOR_COMPUTE_DESC_16BF, CUTENSOR_COMPUTE_DESC_TF32, CUTENSOR_COMPUTE_DESC_3XTF32, or CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_R_32F |
| CUTENSOR_R_64F | CUTENSOR_COMPUTE_DESC_32F or CUTENSOR_COMPUTE_DESC_64F | CUTENSOR_R_64F |
| CUTENSOR_C_16F, CUTENSOR_C_16BF, or CUTENSOR_C_32F | CUTENSOR_COMPUTE_DESC_16F, CUTENSOR_COMPUTE_DESC_16BF, CUTENSOR_COMPUTE_DESC_TF32, CUTENSOR_COMPUTE_DESC_3XTF32, or CUTENSOR_COMPUTE_DESC_32F | CUTENSOR_C_32F |
| CUTENSOR_C_64F | CUTENSOR_COMPUTE_DESC_32F or CUTENSOR_COMPUTE_DESC_64F | CUTENSOR_C_64F |
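
For instance, when all tensors are of type CUTENSOR_R_16F and the compute descriptor is CUTENSOR_COMPUTE_DESC_32F, the table above dictates the scalar type CUTENSOR_R_32F, so alpha and beta must be passed as 32-bit floats. A hedged fragment (the commented call site assumes a previously created handle, plan, device buffers, and workspace):

    /* Scalars are 32-bit floats even though the tensor data is FP16. */
    float alpha = 1.0f;
    float beta  = 0.0f;

    /* Hypothetical call site:
       cutensorContract(handle, plan, &alpha, A_d, B_d, &beta, C_d, C_d,
                        workspace, workspaceSize, stream); */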

Supported Unary Operators#

cuTENSOR supports unary operators for Element-wise Operations. The following table lists the supported data types and their corresponding unary operators:

| cutensorDataType_t | Unary operator |
| --- | --- |
| CUTENSOR_R_32F, CUTENSOR_R_64F, CUTENSOR_R_16F, or CUTENSOR_R_16BF | CUTENSOR_OP_IDENTITY, CUTENSOR_OP_SQRT, CUTENSOR_OP_RCP, CUTENSOR_OP_RELU, CUTENSOR_OP_SIGMOID, CUTENSOR_OP_TANH, CUTENSOR_OP_EXP, CUTENSOR_OP_LOG, CUTENSOR_OP_ABS, CUTENSOR_OP_NEG, CUTENSOR_OP_SIN, CUTENSOR_OP_COS, CUTENSOR_OP_TAN, CUTENSOR_OP_SINH, CUTENSOR_OP_COSH, CUTENSOR_OP_ASIN, CUTENSOR_OP_ACOS, CUTENSOR_OP_ATAN, CUTENSOR_OP_ASINH, CUTENSOR_OP_ACOSH, CUTENSOR_OP_ATANH, CUTENSOR_OP_CEIL, or CUTENSOR_OP_FLOOR |
| CUTENSOR_C_32F, CUTENSOR_C_64F, or CUTENSOR_C_16F | CUTENSOR_OP_IDENTITY or CUTENSOR_OP_CONJ |
| CUTENSOR_R_8I, CUTENSOR_R_8U, CUTENSOR_R_32I, or CUTENSOR_R_32U | CUTENSOR_OP_IDENTITY or CUTENSOR_OP_RELU |
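
For reference, the elementwise semantics of a few of these operators are sketched below in plain C (an illustration of what each operator computes; not the cuTENSOR API):

    #include <math.h>

    float op_identity(float x) { return x; }
    float op_relu(float x)     { return x > 0.0f ? x : 0.0f; }
    float op_rcp(float x)      { return 1.0f / x; /* reciprocal */ }
    float op_sigmoid(float x)  { return 1.0f / (1.0f + expf(-x)); }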

Supported GPUs#

cuTENSOR supports any NVIDIA GPU with compute capability greater than or equal to 6.0.

CUDA Graph Support#

All operations in cuTENSOR can be captured using CUDA graphs, with two exceptions:

  • cutensorCreatePlan() must not be captured via CUDA graphs if Just-In-Time compilation is enabled (i.e., cutensorJitMode_t is not CUTENSOR_JIT_MODE_NONE). A workaround for this is to make sure that the capturing stream is created with the cudaStreamNonBlocking flag.

  • Operations that are actively being autotuned (see Software-managed Plan Cache) should not be captured via CUDA graphs. This restriction exists because, during autotuning, cuTENSOR iterates through different kernels. While graph capture still works in that case, it is not recommended, as it may capture a suboptimal kernel.
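
A minimal capture sketch using the CUDA 12 runtime API is shown below; the commented cutensorContract call site is an assumption about a handle, plan, buffers, and workspace created beforehand (and, per the first bullet above, without JIT):

    #include <cuda_runtime.h>

    void capture_and_launch(void)
    {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        /* ... enqueue work here, e.g.
           cutensorContract(handle, plan, &alpha, A_d, B_d, &beta, C_d, C_d,
                            workspace, workspaceSize, stream); ... */
        cudaGraph_t graph;
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t graphExec;
        cudaGraphInstantiate(&graphExec, graph, 0); /* CUDA 12 signature */
        cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
    }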

Logging#

The logging mechanism in cuTENSOR can be enabled by setting the following environment variables before launching the target application:

CUTENSOR_LOG_LEVEL=<level> - where <level> is one of the following:

  • “0” - Off - logging is disabled (default)

  • “1” - Error - only errors will be logged

  • “2” - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • “3” - Hints - hints that can potentially improve the application’s performance

  • “4” - Info - provides general information about the library execution, may contain details about heuristic status

  • “5” - API Trace - API calls will log their parameters and important information

CUTENSOR_LOG_MASK=<mask> - where <mask> is a combination (bitwise OR) of the following masks:

  • “0” - Off

  • “1” - Error

  • “2” - Trace

  • “4” - Hints

  • “8” - Info

  • “16” - API Trace

CUTENSOR_LOG_FILE=<file_name> - where <file_name> is a path to a log file. The file name may contain %i, which will be replaced with the process id, e.g., “<file_name>_%i.log”.

If CUTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.
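
For example, to log kernel-launching API calls (level 2) to a per-process file:

export CUTENSOR_LOG_LEVEL=2
export CUTENSOR_LOG_FILE=cutensor_%i.log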

cuTENSOR can automatically add NVTX markers for profiling purposes by setting CUTENSOR_NVTX_LEVEL=<level> to one of the following values:

  • “0” - Off - NVTX markers are disabled (default)

  • “1” - On - selected API calls add NVTX markers

Environment Variables#

The environment variables in this section modify cuTENSOR’s runtime behavior. Note that these environment variables are read only when the handle is initialized (i.e., during cutensorCreate()); hence, changes to the environment variables will only take effect for a newly-initialized handle.

NVIDIA_TF32_OVERRIDE, when set to 0, will override any defaults or programmatic configuration of NVIDIA libraries, and never accelerate FP32 computations with TF32 tensor cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change behavior based on this environment variable. Any other setting besides 0 is reserved for future use.

export NVIDIA_TF32_OVERRIDE=0

CUTENSOR_DISABLE_PLAN_CACHE, when set to 1, disables the plan cache (see Software-managed Plan Cache).

export CUTENSOR_DISABLE_PLAN_CACHE=1