User Guide

Nomenclature

The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\). This stride describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a purpose similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

nvplTENSOR, by default, adheres to a generalized column-major data layout. For example, \(A_{a,b,c} \in \mathbb{R}^{4 \times 8 \times 12}\) is an order-3 tensor whose a-mode, b-mode, and c-mode have extents 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:

  • \(stride(a) = 1\)

  • \(stride(b) = extent(a)\)

  • \(stride(c) = extent(a) * extent(b)\).

A tensor is considered to be packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\) for every \(l > 1\).
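The default strides and the packed condition can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the helper names packed_strides, is_packed, and offset are hypothetical and are not part of the nvplTENSOR API.

```python
def packed_strides(extents):
    """Return generalized column-major strides for the given extents."""
    strides = []
    step = 1
    for extent in extents:
        strides.append(step)  # stride of this mode
        step *= extent        # the next mode starts after a full sweep of this one
    return strides


def is_packed(extents, strides):
    """Check stride(i_1) = 1 and stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})."""
    return list(strides) == packed_strides(extents)


def offset(index, strides):
    """Linear element offset of a multi-index, in units of elements."""
    return sum(i * s for i, s in zip(index, strides))


extents = [4, 8, 12]                   # extents of modes a, b, c
strides = packed_strides(extents)
print(strides)                         # [1, 4, 32]
print(offset((1, 2, 3), strides))      # 1*1 + 2*4 + 3*32 = 105
print(is_packed(extents, [1, 8, 64]))  # False: padded strides, e.g. a sub-tensor view
```

The last call shows a tensor with the same extents that is not packed: its strides skip over elements, as they would for a sub-tensor of a larger allocation.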

Einstein Notation

nvplTENSOR adheres to the “Einstein notation”: modes that appear in the input tensors and not in the output tensor are implicitly contracted.
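As a minimal sketch of what implicit contraction means, the following pure-Python reference loop sums over every mode that appears in the inputs but not in the output. It is a conceptual model only, not how the library computes contractions; the function contract and the dictionary-based tensor representation are assumptions made for illustration.

```python
from itertools import product


def contract(modes_a, a, modes_b, b, modes_c, extent):
    """Compute C[modes_c] = sum of A[modes_a] * B[modes_b] over contracted modes.

    Tensors are dicts mapping index tuples to values; `extent` maps each
    mode label to its extent.
    """
    # Unique mode labels in order of first appearance.
    all_modes = list(dict.fromkeys(modes_a + modes_b))
    c = {}
    for values in product(*(range(extent[m]) for m in all_modes)):
        idx = dict(zip(all_modes, values))
        key_a = tuple(idx[m] for m in modes_a)
        key_b = tuple(idx[m] for m in modes_b)
        key_c = tuple(idx[m] for m in modes_c)
        # Modes absent from modes_c (here: k) are summed over implicitly.
        c[key_c] = c.get(key_c, 0) + a[key_a] * b[key_b]
    return c


extent = {"a": 2, "k": 2, "b": 2}
A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}  # A_{a,k}
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}  # B_{k,b}

C = contract("ak", A, "kb", B, "ab", extent)      # C_{a,b} = A_{a,k} B_{k,b}
print(C[(0, 0)], C[(0, 1)], C[(1, 0)], C[(1, 1)])  # 19 22 43 50
```

Here the mode k appears in both inputs but not in the output, so it is contracted, and the result matches an ordinary matrix product.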

Performance Guidelines

This section assumes a generalized column-major data layout (i.e., the modes on the left have the smallest strides). Most of the following performance guidelines aim to facilitate more regular memory access patterns:

  • Try to arrange the modes (w.r.t. increasing strides) of the tensor similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).

  • Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).

  • Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.

Accuracy Guarantees

nvplTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The nvpltensorComputeDescriptor_t encodes the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).

nvplTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, any NaN in the scaled tensor expression is ignored, i.e., a zero from a scalar takes precedence over a NaN from a tensor. In all other cases, a NaN from a tensor follows normal IEEE 754 behavior.

To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).
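The four cases above can be reproduced with a small Python model of the convention. This is a sketch of the rule as stated, not of the library's internals; the function scale is a hypothetical helper.

```python
import math


def scale(scalar, value):
    """Scale `value` by `scalar`; a zero scalar takes precedence over NaN."""
    if scalar == 0:
        return 0.0         # BLAS convention: the scaled term is ignored entirely
    return scalar * value  # otherwise ordinary IEEE 754 arithmetic applies


nan = float("nan")
alpha, beta = 1.0, 0.0
a, a_prime, b = 1.0, 0.0, nan

print(math.isnan(scale(alpha, a * b)))        # True
print(scale(beta, a * b))                     # 0.0
print(math.isnan(scale(alpha, a_prime * b)))  # True: a tensor zero times NaN is NaN
print(scale(beta, a_prime * b))               # 0.0
```

Only a zero scalar suppresses the NaN. A zero stored in a tensor, such as \(A'_{i,j}\), does not.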

Scalar Types

Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:

Output type         nvpltensorComputeDescriptor_t    Scalar type

NVPLTENSOR_R_32F    NVPLTENSOR_COMPUTE_DESC_32F      NVPLTENSOR_R_32F

NVPLTENSOR_R_64F    NVPLTENSOR_COMPUTE_DESC_64F      NVPLTENSOR_R_64F

NVPLTENSOR_C_32F    NVPLTENSOR_COMPUTE_DESC_32F      NVPLTENSOR_C_32F

NVPLTENSOR_C_64F    NVPLTENSOR_COMPUTE_DESC_64F      NVPLTENSOR_C_64F

Supported Operators

nvplTENSOR supports only NVPLTENSOR_OP_IDENTITY for unary operations and NVPLTENSOR_OP_ADD for binary operations.

Logging

The logging mechanism in nvplTENSOR can be enabled by setting the following environment variables before launching the target application:

NVPLTENSOR_LOG_LEVEL=<level> - where <level> is one of the following:

  • “0” - Off - logging is disabled (default)

  • “1” - Error - only errors will be logged

  • “2” - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • “3” - Hints - hints that can potentially improve the application’s performance

  • “4” - Info - provides general information about the library execution, may contain details about heuristic status

  • “5” - API Trace - API calls will log their parameters and important information

NVPLTENSOR_LOG_MASK=<mask> - where <mask> is a combination of the following values:

  • “0” - Off

  • “1” - Error

  • “2” - Trace

  • “4” - Hints

  • “8” - Info

  • “16” - API Trace

NVPLTENSOR_LOG_FILE=<file_name> - where <file_name> is the path to a log file. The file name may contain %i, which is replaced with the process ID, e.g., “<file_name>_%i.log”.

If NVPLTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.
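Since the mask values are powers of two, several categories can presumably be combined with a bitwise OR (for example, Error and Hints give 1 | 4 = 5). The sketch below builds such a mask and an environment for a child process. The environment variable names come from this section; the helper names, the category keys, and the log file name are hypothetical.

```python
import os

# Mask values as listed above; the dictionary keys are illustrative labels.
LOG_MASK_BITS = {"error": 1, "trace": 2, "hints": 4, "info": 8, "api_trace": 16}


def log_mask(*categories):
    """Combine the selected logging categories into a single mask value."""
    mask = 0
    for name in categories:
        mask |= LOG_MASK_BITS[name]
    return mask


def logging_env(mask, log_file=None, base_env=None):
    """Return a copy of the environment with the logging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NVPLTENSOR_LOG_MASK"] = str(mask)
    if log_file is not None:
        env["NVPLTENSOR_LOG_FILE"] = log_file  # %i is expanded by the library
    return env


mask = log_mask("error", "hints")
env = logging_env(mask, "nvpltensor_%i.log", base_env={})
print(env)  # {'NVPLTENSOR_LOG_MASK': '5', 'NVPLTENSOR_LOG_FILE': 'nvpltensor_%i.log'}
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching the target application, so that the variables are set before the library is loaded.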