User Guide¶
Nomenclature¶
The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.
An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\). This stride describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.
nvplTENSOR, by default, adheres to a generalized column-major data layout. For example, \(A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}\) is an order-3 tensor whose a-mode, b-mode, and c-mode have extents 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:
\(stride(a) = 1\)
\(stride(b) = extent(a)\)
\(stride(c) = extent(a) * extent(b)\).
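The default strides can be computed mechanically from the extents. The following is a minimal Python sketch of that rule, not library code; the helper names `default_strides` and `element_offset` are hypothetical and exist only for this illustration.

```python
def default_strides(extents):
    """Generalized column-major strides: the first mode gets stride 1 and
    every later mode gets the product of all preceding extents."""
    strides = []
    running = 1
    for extent in extents:
        strides.append(running)
        running *= extent
    return strides


def element_offset(index, strides):
    """Offset (in elements) of a multi-index from the first element."""
    return sum(i * s for i, s in zip(index, strides))


extents = [4, 8, 12]                 # extents of modes a, b, c
strides = default_strides(extents)
print(strides)                       # [1, 4, 32]
print(element_offset((1, 2, 3), strides))  # 1*1 + 2*4 + 3*32 = 105
```

For the example tensor, consecutive elements along the a-mode are adjacent in memory, while stepping along the c-mode skips 32 elements.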
A tensor is considered to be packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\) for every \(l > 1\).
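The packed condition can be checked directly from the extents and strides. The sketch below is illustrative only: `is_packed` is a hypothetical helper, and it assumes the modes are listed in the order \(i_1, \dots, i_n\) used in the definition.

```python
def is_packed(extents, strides):
    """Return True if stride(i_1) == 1 and each following stride equals
    the previous stride times the previous extent."""
    expected = 1
    for extent, stride in zip(extents, strides):
        if stride != expected:
            return False
        expected *= extent
    return True


# The full 4 x 8 x 12 tensor with default strides is packed.
print(is_packed([4, 8, 12], [1, 4, 32]))   # True

# A 2 x 8 x 12 sub-tensor that reuses the parent's strides is not packed:
# stride(b) is 4, but a packed layout would require extent(a) = 2.
print(is_packed([2, 8, 12], [1, 4, 32]))   # False
```

The second case is the sub-tensor scenario that strides make possible: the data is addressable without copying, but it is not contiguous.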
Einstein Notation¶
nvplTENSOR adheres to the “Einstein notation”: modes that appear in the input tensors and not in the output tensor are implicitly contracted.
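As an illustration of this rule, the sketch below uses plain Python rather than the library. It first derives the contracted modes from the mode labels, then evaluates \(C_{m,n} = A_{m,k} B_{k,n}\) with explicit loops. The function names and the mode labels are assumptions made for this example.

```python
def contracted_modes(modes_a, modes_b, modes_c):
    """Modes that appear in an input but not in the output are summed over."""
    seen = []
    for mode in modes_a + modes_b:
        if mode not in modes_c and mode not in seen:
            seen.append(mode)
    return seen


def contract_mk_kn(A, B):
    """C[m][n] = sum over k of A[m][k] * B[k][n] (nested-list tensors)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):      # k is absent from C, so it is contracted
                C[m][n] += A[m][k] * B[k][n]
    return C


print(contracted_modes("mk", "kn", "mn"))   # ['k']

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(contract_mk_kn(A, B))                 # [[19.0, 22.0], [43.0, 50.0]]
```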
Performance Guidelines¶
This section assumes a generalized column-major data layout (i.e., the modes on the left have the smallest stride). Most of the following performance guidelines aim to facilitate more regular memory access patterns:
Try to arrange the modes (w.r.t. increasing strides) of the tensor similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).
Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).
Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
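The first two guidelines can be checked from the mode labels alone. The sketch below is a hypothetical helper written for this guide, not part of nvplTENSOR. It assumes each tensor is described by a string of single-character mode labels ordered by increasing stride, and it treats a mode as batched when it appears in every tensor.

```python
def shared_modes_in_same_order(modes_x, modes_y):
    """True if the modes common to both tensors appear in the same
    relative order (w.r.t. increasing strides) in each of them."""
    shared_x = [m for m in modes_x if m in modes_y]
    shared_y = [m for m in modes_y if m in modes_x]
    return shared_x == shared_y


def batched_modes_are_slowest(*tensors):
    """True if the modes present in every tensor occupy the trailing
    (largest-stride) positions of each tensor."""
    batched = set(tensors[0]).intersection(*tensors[1:])
    n = len(batched)
    return all(set(t[len(t) - n:]) == batched for t in tensors)


# C_{a,b,c} = A_{a,k,c} B_{k,b}: a and c keep their order in C and A.
print(shared_modes_in_same_order("abc", "akc"))            # True
# C_{a,b,c} = A_{c,k,a} B_{k,b}: a and c are swapped in A.
print(shared_modes_in_same_order("abc", "cka"))            # False

# The batched mode l is last everywhere in the preferred form.
print(batched_modes_are_slowest("abcl", "akcl", "kbl"))    # True
print(batched_modes_are_slowest("albc", "lakc", "klb"))    # False
```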
Accuracy Guarantees¶
nvplTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The nvpltensorComputeDescriptor_t encodes the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).
nvplTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the scaled tensor expression are ignored, i.e., a zero from a scalar takes precedence over a NaN from a tensor. When the scalar is nonzero, a NaN from a tensor follows normal IEEE 754 behavior.
To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).
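The four cases above can be reproduced with a small Python model of the convention. This is only a sketch of the rule as stated here, not the library's implementation; `scaled` is a hypothetical helper.

```python
import math


def scaled(scalar, value):
    """Model of the stated convention: a zero scalar wins over NaN,
    otherwise the product follows ordinary IEEE 754 arithmetic."""
    if scalar == 0:
        return 0.0
    return scalar * value


nan = float("nan")
alpha, beta = 1.0, 0.0
A, A_prime, B = 1.0, 0.0, nan

print(scaled(alpha, A * B))        # nan
print(scaled(beta, A * B))         # 0.0
print(scaled(alpha, A_prime * B))  # nan (0 * NaN from tensors stays NaN)
print(scaled(beta, A_prime * B))   # 0.0

# For contrast, plain IEEE 754 multiplication would not give zero here:
print(math.isnan(beta * (A * B)))  # True
```

Note the asymmetry: a zero stored in a tensor (\(A'\)) does not suppress a NaN, whereas a zero scalar does.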
Scalar Types¶
Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:
Output type | Scalar type | |
---|---|---|
Supported Operators¶
nvplTENSOR supports only NVPLTENSOR_OP_IDENTITY for unary operations and NVPLTENSOR_OP_ADD for binary operations.
Logging¶
The logging mechanism in nvplTENSOR can be enabled by setting the following environment variables before launching the target application:
NVPLTENSOR_LOG_LEVEL=<level> - where level is one of the following:
“0” - Off - logging is disabled (default)
“1” - Error - only errors are logged
“2” - Trace - API calls that launch kernels log their parameters and important information
“3” - Hints - hints that can potentially improve the application’s performance
“4” - Info - general information about the library execution, which may include details about heuristic status
“5” - API Trace - API calls log their parameters and important information
NVPLTENSOR_LOG_MASK=<mask> - where mask is a combination of the following masks:
“0” - Off
“1” - Error
“2” - Trace
“4” - Hints
“8” - Info
“16” - API Trace
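Because the mask values are distinct powers of two, a combination is presumably formed by a bitwise OR (equivalently, a sum) of the individual values; that reading is an assumption based on the values listed above. The sketch below computes such a mask; the dictionary and the `log_mask` helper are hypothetical names used only for illustration.

```python
# Mask values as listed in this section.
LOG_MASK = {
    "error": 1,
    "trace": 2,
    "hints": 4,
    "info": 8,
    "api_trace": 16,
}


def log_mask(*names):
    """Combine the named categories into a single integer mask."""
    mask = 0
    for name in names:
        mask |= LOG_MASK[name]
    return mask


print(log_mask("error", "hints"))   # 5
print(log_mask(*LOG_MASK))          # 31, every category enabled
print(log_mask())                   # 0, i.e. Off
```

Under this reading, `NVPLTENSOR_LOG_MASK=5` would request error messages together with hints.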
NVPLTENSOR_LOG_FILE=<file_name> - where file_name is the path to a log file. The file name may contain %i, which is replaced with the process ID, e.g. “<file_name>_%i.log”.
If NVPLTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.
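Since the variables must be set before the target application starts, one option is to build the environment in a launcher script. The sketch below is a hypothetical Python helper (`logging_env` is not part of nvplTENSOR) that sets only the variables it is given; the application path in the usage comment is a placeholder.

```python
import os


def logging_env(level=None, mask=None, log_file=None, base=None):
    """Return a copy of the environment with the requested nvplTENSOR
    logging variables added. Variables left as None are not touched."""
    env = dict(os.environ if base is None else base)
    if level is not None:
        env["NVPLTENSOR_LOG_LEVEL"] = str(level)
    if mask is not None:
        env["NVPLTENSOR_LOG_MASK"] = str(mask)
    if log_file is not None:
        env["NVPLTENSOR_LOG_FILE"] = log_file
    return env


# Request hints (level 3) and one log file per process via %i.
env = logging_env(level=3, log_file="nvpltensor_%i.log", base={})
print(env)

# Usage (placeholder application name, not executed here):
#   import subprocess
#   subprocess.run(["./my_app"], env=logging_env(level=3))
```

Omitting `log_file` leaves NVPLTENSOR_LOG_FILE unset, in which case the messages go to stdout as described above.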