User Guide¶
Nomenclature¶
The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.
An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\), which describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.
cuTENSOR, by default, adheres to a generalized column-major data layout. For example, \(A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}\) is an order-3 tensor whose a-, b-, and c-modes have extents 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:
\(stride(a) = 1\)
\(stride(b) = extent(a)\)
\(stride(c) = extent(a) * extent(b)\).
A tensor is considered to be packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\) for every \(l > 1\).
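The default strides and the packed condition can be illustrated with a few lines of Python. This is a minimal sketch of the arithmetic only; the helper names `packed_strides`, `is_packed`, and `offset` are hypothetical and are not part of the cuTENSOR API.

```python
def packed_strides(extents):
    """Return generalized column-major strides: the first mode has stride 1."""
    strides = []
    stride = 1
    for extent in extents:
        strides.append(stride)
        stride *= extent
    return strides


def is_packed(extents, strides):
    """A tensor is packed if its strides equal the default packed strides."""
    return list(strides) == packed_strides(extents)


def offset(index, strides):
    """Linear element offset of a multi-index for the given strides."""
    return sum(i * s for i, s in zip(index, strides))


extents = [4, 8, 12]                    # extents of modes a, b, c
strides = packed_strides(extents)
print(strides)                          # [1, 4, 32]
print(offset((1, 2, 3), strides))       # 1*1 + 2*4 + 3*32 = 105
print(is_packed(extents, [1, 8, 64]))   # False: larger strides, e.g. a sub-tensor
```

Strides such as `[1, 8, 64]` with the same extents describe a 4 x 8 x 12 sub-tensor of a larger 8 x 8 x 12 parent, which is the use case that the comparison with the BLAS leading dimension refers to.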
Einstein Notation¶
cuTENSOR adheres to the “Einstein notation”: modes that appear in the input tensors and not in the output tensor are implicitly contracted.
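As a small illustration of this rule, the following Python sketch determines which modes are contracted and evaluates a matrix-matrix product written in Einstein notation. It is plain Python for explanation only; the helper `contracted_modes` is a hypothetical name, not a cuTENSOR function.

```python
def contracted_modes(modes_a, modes_b, modes_c):
    """Modes that appear in an input but not in the output are contracted."""
    contracted = []
    for mode in list(modes_a) + list(modes_b):
        if mode not in modes_c and mode not in contracted:
            contracted.append(mode)
    return contracted


# C_{a,b,c} = A_{a,k,c} B_{k,b}: only k is missing from the output.
print(contracted_modes("akc", "kb", "abc"))   # ['k']

# C_{m,n} = A_{m,k} B_{k,n}: k is absent from C, so it is summed over.
A = [[1, 2], [3, 4]]   # indexed as A[m][k]
B = [[5, 6], [7, 8]]   # indexed as B[k][n]
C = [[sum(A[m][k] * B[k][n] for k in range(2)) for n in range(2)]
     for m in range(2)]
print(C)               # [[19, 22], [43, 50]]
```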
Performance Guidelines¶
This section assumes a generalized column-major data layout (i.e., the modes on the left have the smallest stride). Most of the following performance guidelines aim to facilitate more regular memory access patterns:
Try to arrange the modes (w.r.t. increasing strides) of the tensor similarly in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\).
Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance, \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).
Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
Software-managed Plan Cache¶
This section introduces the software-managed plan cache. Its key features are:
Minimize launch-related overhead (e.g., due to kernel selection)
Overhead-free autotuning (a.k.a. incremental autotuning)
This feature enables users to automatically find the best implementation for the given problem and thereby increase the attained performance
The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.
Store/read to/from file
Allows users to store the state of the cache to disk and reuse it at a later stage
In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).
The plan cache is an experimental feature at this point – future changes to the API are possible.
Please refer to Plan Cache for a detailed description.
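The lookup-table view can be pictured with a toy model in Python. This is a conceptual sketch only and does not reflect how cuTENSOR implements its cache: the class `ToyPlanCache`, the `select_plan` callback, and the tuple used as a problem key are all hypothetical stand-ins.

```python
import threading


class ToyPlanCache:
    """Conceptual model only: maps a problem key to a previously selected plan."""

    def __init__(self, select_plan):
        self._select_plan = select_plan   # hypothetical stand-in for kernel selection
        self._plans = {}
        self._lock = threading.Lock()     # one cache shared safely by all threads
        self.misses = 0

    def get(self, problem_key):
        with self._lock:
            if problem_key not in self._plans:
                # Selection cost is paid only on the first request for this problem.
                self.misses += 1
                self._plans[problem_key] = self._select_plan(problem_key)
            return self._plans[problem_key]


cache = ToyPlanCache(lambda key: "plan-for-" + str(key))
key = ("abc", "akc", "kb")    # hypothetical description of one contraction
first = cache.get(key)
second = cache.get(key)       # served from the table, no second selection
print(first == second, cache.misses)   # True 1
```

The point of the sketch is that repeated requests for the same problem skip the selection step, which is the source of the reduced launch-related overhead described above.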
Accuracy Guarantees¶
cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeDescriptor_t encodes the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).
For instance, consider a tensor contraction for which all tensors are of type CUTENSOR_R_32F but the compute descriptor is CUTENSOR_COMPUTE_DESC_16F: in that case cuTENSOR can use NVIDIA’s Tensor Cores with an accumulation type of CUTENSOR_R_32F (i.e., providing higher precision than requested by the user).
Another illustrative example is a tensor contraction for which all tensors are of type CUTENSOR_R_16F and the compute descriptor is CUTENSOR_COMPUTE_DESC_32F: in this case the parallel reduction (if required for performance) would have to be performed in CUTENSOR_R_32F and thus requires auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would then be performed in CUTENSOR_R_16F, which is lower than what the user requested via the compute descriptor.
cuTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the scaled tensor expression are ignored, i.e., a zero scalar takes precedence over a NaN from a tensor. Otherwise, a NaN from a tensor follows normal IEEE 754 behavior.
To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).
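The four cases can be reproduced with a short Python sketch. It models only the scaling rule described here; the helper `scale` is a hypothetical name and not a cuTENSOR function.

```python
import math


def scale(scalar, value):
    """BLAS-style scaling: a zero scalar takes precedence over NaN in the tensor."""
    if scalar == 0:
        return 0.0
    return scalar * value


nan = float("nan")
alpha, beta = 1.0, 0.0
A, A_prime, B = 1.0, 0.0, nan

print(scale(alpha, A * B))         # nan
print(scale(beta, A * B))          # 0.0
print(scale(alpha, A_prime * B))   # nan, since 0 * NaN is NaN under IEEE 754
print(scale(beta, A_prime * B))    # 0.0
print(beta * (A * B))              # nan: plain IEEE 754 multiplication, for contrast
```

The last line shows why the convention matters: without the special case for a zero scalar, a NaN in the tensor would survive the multiplication by \(\beta = 0\).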
Scalar Types¶
Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:
Output type |
Scalar type |
|
---|---|---|
|
||
|
||
|
||
|
Supported Unary Operators¶
cuTENSOR supports unary operators for Element-wise Operations. The following table lists the corresponding compute types and operators:
Unary operator |
|
---|---|
|
|
|
Supported GPUs¶
cuTENSOR supports any NVIDIA GPU with a compute capability greater than or equal to 6.0.
CUDA Graph Support¶
All operations in cuTENSOR can be captured using CUDA graphs.
The one case in which graph capture is not recommended is an operation that is actively being autotuned (see Software-managed Plan Cache). During autotuning, cuTENSOR iterates through different kernels, so although graph capture still works in that case, the graph may capture a suboptimal kernel.
Logging¶
The logging mechanism in cuTENSOR can be enabled by setting the following environment variables before launching the target application:
CUTENSOR_LOG_LEVEL=<level> - where level is one of the following levels:
“0” - Off - logging is disabled (default)
“1” - Error - only errors will be logged
“2” - Trace - API calls that launch CUDA kernels will log their parameters and important information
“3” - Hints - hints that can potentially improve the application’s performance
“4” - Info - provides general information about the library execution, may contain details about heuristic status
“5” - API Trace - API calls will log their parameters and important information
CUTENSOR_LOG_MASK=<mask> - where mask is a combination of the following masks:
“0” - Off
“1” - Error
“2” - Trace
“4” - Hints
“8” - Info
“16” - API Trace
CUTENSOR_LOG_FILE=<file_name> - where file_name is the path to a log file. The file name may contain %i, which is replaced with the process ID, e.g., “<file_name>_%i.log”.
If CUTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.
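Because the mask values are powers of two, a combination can be formed by adding or bitwise OR-ing them; that reading of "combination" is an assumption based on the values listed above. The following Python sketch builds such a mask and an environment for a child process. The helper names `log_mask` and `logging_env` and the file name `cutensor_%i.log` are hypothetical.

```python
import os

# Mask bits as listed above.
LOG_MASK_BITS = {"error": 1, "trace": 2, "hints": 4, "info": 8, "api_trace": 16}


def log_mask(*names):
    """Combine the named mask bits with bitwise OR (assumed combination rule)."""
    mask = 0
    for name in names:
        mask |= LOG_MASK_BITS[name]
    return mask


def logging_env(mask, log_file=None, base_env=None):
    """Return a copy of the environment with the logging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUTENSOR_LOG_MASK"] = str(mask)
    if log_file is not None:
        env["CUTENSOR_LOG_FILE"] = log_file
    return env


env = logging_env(log_mask("error", "hints"), "cutensor_%i.log", base_env={})
print(env)   # {'CUTENSOR_LOG_MASK': '5', 'CUTENSOR_LOG_FILE': 'cutensor_%i.log'}
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the target application, so that the variables are set before the application starts.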
To have cuTENSOR automatically add NVTX markers for profiling purposes, set CUTENSOR_NVTX_LEVEL=<level> to one of the following values:
“0” - Off - NVTX markers are disabled (default)
“1” - On - selected API calls add NVTX markers
Environment Variables¶
The environment variables in this section modify cuTENSOR’s runtime behavior. Note that these environment variables are read only when the handle is initialized (i.e., during cutensorCreate()); hence, changes to the environment variables will only take effect for a newly-initialized handle.
NVIDIA_TF32_OVERRIDE, when set to 0, overrides any defaults or programmatic configuration of NVIDIA libraries and ensures that FP32 computations are never accelerated with TF32 tensor cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change behavior based on this environment variable. Any setting other than 0 is reserved for future use.
export NVIDIA_TF32_OVERRIDE=0
CUTENSOR_DISABLE_PLAN_CACHE, when set to 1, disables the plan cache (see Software-managed Plan Cache).
export CUTENSOR_DISABLE_PLAN_CACHE=1