User Guide#
Nomenclature#
The term tensor refers to an order-n (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher orders. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.
An order-n tensor has \(n\) modes. Each mode has an extent (a.k.a. size). For each mode you can specify a stride \(s > 0\), which describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.
cuTENSOR, by default, adheres to a generalized column-major data layout. For example, \(A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}\) is an order-3 tensor with the extents of the a-mode, b-mode, and c-mode being 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:
\(stride(a) = 1\)
\(stride(b) = extent(a)\)
\(stride(c) = extent(a) * extent(b)\).
A tensor is considered to be packed if it is contiguously stored in memory along all modes. That is, \(stride(i_1) = 1\) and \(stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})\).
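To make the layout concrete, here is a small, self-contained C sketch (an illustration only, not part of the cuTENSOR API) that derives the default packed strides from the extents of the order-3 example above and computes the linear memory offset of a single element:

```c
#include <stdint.h>
#include <stdio.h>

/* Linear offset of a multi-index: offset = sum_k idx[k] * stride[k]. */
static int64_t linearOffset(int order, const int64_t idx[], const int64_t stride[])
{
    int64_t offset = 0;
    for (int k = 0; k < order; ++k)
        offset += idx[k] * stride[k];
    return offset;
}

int main(void)
{
    /* A_{a,b,c} with extents 4 x 8 x 12 (the example above). */
    const int64_t extent[3] = {4, 8, 12};

    /* Default generalized column-major (packed) strides:
     * stride(a) = 1, stride(b) = 4, stride(c) = 32. */
    int64_t stride[3] = {1, 0, 0};
    for (int k = 1; k < 3; ++k)
        stride[k] = stride[k - 1] * extent[k - 1];

    const int64_t idx[3] = {1, 2, 3};          /* element A_{1,2,3} (0-based) */
    printf("offset = %lld\n",                  /* 1*1 + 2*4 + 3*32 = 105 */
           (long long)linearOffset(3, idx, stride));
    return 0;
}
```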
Einstein Notation#
cuTENSOR adheres to the “Einstein notation”: modes that appear in the input tensors but not in the output tensor are implicitly contracted (i.e., summed over). For example, \(C_{a,b} = A_{a,k} B_{k,b}\) denotes a matrix-matrix multiplication in which the mode k is contracted.
Performance Guidelines#
This section assumes a generalized column-major data layout (i.e., the modes on the left have the smallest stride). Most of the following performance guidelines aim to facilitate more regular memory access patterns:
Try to arrange the modes similarly (w.r.t. increasing strides) in all tensors. For instance, \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) is preferable to \(C_{a,b,c} = A_{c,k,a} B_{k,b}\) (see the sketch after this list).
Try to keep batched modes as the slowest-varying modes (i.e., with the largest strides). For instance \(C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}\) is preferable to \(C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}\).
Try to keep the extent of the fastest-varying mode (a.k.a. stride-one mode) as large as possible.
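In the cuTENSOR API, mode orderings like those above are expressed through per-tensor mode arrays. As a small sketch of the preferred variant \(C_{a,b,c} = A_{a,k,c} B_{k,b}\) (mode labels are arbitrary 32-bit identifiers; character literals are merely a readable convention):

```c
#include <stdint.h>

/* Mode arrays for C_{a,b,c} = A_{a,k,c} B_{k,b}; 'k' appears only in the
 * inputs and is therefore contracted (see Einstein Notation above). */
static const int32_t modeC[] = {'a', 'b', 'c'};
static const int32_t modeA[] = {'a', 'k', 'c'};
static const int32_t modeB[] = {'k', 'b'};
```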
Software-managed Plan Cache#
This section introduces the software-managed plan cache. Its key features are:
Minimize launch-related overhead (e.g., due to kernel selection)
Overhead-free autotuning (a.k.a. incremental autotuning)
This feature enables users to automatically find the best implementation for a given problem and thereby increase the attained performance
The cache is implemented in a thread-safe manner and is shared across all threads that use the same cutensorHandle_t.
Store/read to/from file
Allows users to store the state of the cache to disk and reuse it at a later stage
In essence, the plan cache can be seen as a lookup table from a specific problem instance (e.g., cutensorOperationDescriptor_t) to an actual implementation (encoded by cutensorPlan_t).
The plan cache is an experimental feature at this point; future changes to the API are possible.
Please refer to Plan Cache for a detailed description.
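As a hedged sketch of the store/restore workflow, assuming the cutensorHandleReadPlanCacheFromFile and cutensorHandleWritePlanCacheToFile entry points (see Plan Cache for the authoritative names and signatures):

```c
#include <stdint.h>
#include <stdio.h>
#include <cutensor.h>

int main(void)
{
    cutensorHandle_t handle;
    if (cutensorCreate(&handle) != CUTENSOR_STATUS_SUCCESS)
        return 1;

    /* Warm-start the plan cache from a previous run; a missing file is not fatal. */
    uint32_t numCachelinesRead = 0;
    if (cutensorHandleReadPlanCacheFromFile(handle, "cutensor_cache.bin",
                                            &numCachelinesRead) == CUTENSOR_STATUS_SUCCESS)
        printf("restored %u cache lines\n", numCachelinesRead);

    /* ... create plans and launch operations; the cache fills up ... */

    /* Persist the cache so that a later process can skip (re-)tuning. */
    cutensorHandleWritePlanCacheToFile(handle, "cutensor_cache.bin");

    cutensorDestroy(handle);
    return 0;
}
```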
Accuracy Guarantees#
cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The cutensorComputeDescriptor_t encodes the minimal accuracy that is guaranteed throughout computations. Because it is only a guarantee of minimal accuracy, it is possible that the library chooses a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or to have more kernels available to choose from).
For instance, let us consider a tensor contraction for which all tensors are of type CUTENSOR_R_32F, but the compute descriptor is CUTENSOR_COMPUTE_DESC_16F: in that case cuTENSOR can use NVIDIA’s Tensor Cores with an accumulation type of CUTENSOR_R_32F (i.e., providing higher precision than requested by the user).
Another illustrative example is a tensor contraction for which all tensors are of type CUTENSOR_R_16F and the compute descriptor is CUTENSOR_COMPUTE_DESC_32F: in this case a parallel reduction (if required for performance) would have to be performed in CUTENSOR_R_32F and would thus require auxiliary workspace. To be precise, in this case cuTENSOR would not choose a serial reduction (via atomics) through the output tensor, since part of the final reduction would be performed in CUTENSOR_R_16F, which is lower than what the user requested via the compute descriptor.
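Following the pattern of cuTENSOR’s public samples, here is a hedged fragment that requests at least FP32 accuracy for a contraction \(D = \alpha A B + \beta C\) (with D aliasing C); the handle, tensor descriptors descA/descB/descC, and mode arrays modeA/modeB/modeC are assumed to have been created beforehand:

```c
/* Request CUTENSOR_COMPUTE_DESC_32F as the minimal guaranteed accuracy. */
cutensorOperationDescriptor_t desc;
cutensorStatus_t status = cutensorCreateContraction(handle, &desc,
        descA, modeA, CUTENSOR_OP_IDENTITY,  /* A and its modes */
        descB, modeB, CUTENSOR_OP_IDENTITY,  /* B and its modes */
        descC, modeC, CUTENSOR_OP_IDENTITY,  /* C and its modes */
        descC, modeC, CUTENSOR_OP_IDENTITY,  /* D (aliases C here) */
        CUTENSOR_COMPUTE_DESC_32F);
```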
cuTENSOR follows the BLAS convention for NaN propagation: whenever a scalar (alpha, beta, gamma) is set to zero, NaNs in the scaled tensor expression are ignored, i.e., a zero from a scalar takes precedence over a NaN from a tensor. Otherwise, NaNs from a tensor follow normal IEEE 754 behavior.
To illustrate, let \(\alpha = 1; \beta = 0; A_{i, j} = 1; A'_{i, j} = 0; B_{i, j} = \textrm{NaN}\), then \(\alpha A_{i,j} B_{i, j} = \textrm{NaN}\), \(\beta A_{i, j} B_{i, j} = 0\), \(\alpha A'_{i,j} B_{i, j} = \textrm{NaN}\), and \(\beta A'_{i, j} B_{i, j} = 0\).
Scalar Types#
Many operations support multiplication of arguments by a scalar. The type of that scalar is a function of the output type and the compute type. The following table lists the corresponding types:
| Output type | Scalar type |
|---|---|
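Scalars are passed to the library by pointer and are reinterpreted according to the scalar type. As a hedged fragment, assuming an all-CUTENSOR_R_32F contraction (whose scalar type is CUTENSOR_R_32F, i.e., a host-side float) and a previously created plan, device buffers A_d/B_d/C_d, workspace, and stream:

```c
/* alpha and beta must match the scalar type dictated by the table above. */
float alpha = 1.0f;
float beta  = 0.0f;
cutensorStatus_t status = cutensorContract(handle, plan,
        &alpha, A_d, B_d,   /* alpha * A * B          */
        &beta,  C_d, C_d,   /* + beta * C, into D = C */
        work, workSize, stream);
```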
Supported Unary Operators#
cuTENSOR supports unary operators for Element-wise Operations. The following table lists the corresponding compute types and operators:
| Unary operator | Compute type |
|---|---|
Supported GPUs#
cuTENSOR supports any NVIDIA GPU with a compute capability greater than or equal to 6.0.
CUDA Graph Support#
All operations in cuTENSOR can be captured using CUDA graphs, with two exceptions:
cutensorCreatePlan() must not be captured via CUDA graphs if Just-In-Time compilation is enabled (i.e., cutensorJitMode_t is not CUTENSOR_JIT_MODE_NONE). A workaround for this is to make sure that the capturing stream is created with the cudaStreamNonBlocking flag.
Operations that are actively being autotuned (see Software-managed Plan Cache) should not be captured via CUDA graphs. That restriction exists because, during autotuning, cuTENSOR iterates through different kernels. While graph capture still works in that case, it is not recommended, as it may capture a suboptimal kernel.
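Below is a hedged sketch of capturing a contraction into a CUDA graph, assuming the restrictions above are respected (no JIT compilation, no active autotuning) and that the plan, scalars, device buffers, and workspace already exist:

```c
/* Create the capturing stream with cudaStreamNonBlocking (see above). */
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

/* Recorded into the graph rather than executed immediately. */
cutensorContract(handle, plan, &alpha, A_d, B_d, &beta, C_d, C_d,
                 work, workSize, stream);

cudaStreamEndCapture(stream, &graph);

/* Instantiate once (CUDA 12-style signature), then launch repeatedly. */
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, 0);
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```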
Logging#
The logging mechanism in cuTENSOR can be enabled by setting the following environment variables before launching the target application:
CUTENSOR_LOG_LEVEL=<level> - where <level> is one of the following levels:
“0” - Off - logging is disabled (default)
“1” - Error - only errors will be logged
“2” - Trace - API calls that launch CUDA kernels will log their parameters and important information
“3” - Hints - hints that can potentially improve the application’s performance
“4” - Info - provides general information about the library execution, may contain details about heuristic status
“5” - API Trace - API calls will log their parameters and important information
CUTENSOR_LOG_MASK=<mask> - where <mask> is a combination of the following masks:
“0” - Off
“1” - Error
“2” - Trace
“4” - Hints
“8” - Info
“16” - API Trace
CUTENSOR_LOG_FILE=<file_name> - where <file_name> is a path to a logging file. The file name may contain %i, which will be replaced with the process id, e.g., “<file_name>_%i.log”.
If CUTENSOR_LOG_FILE is not defined, the log messages are printed to stdout.
cuTENSOR can automatically add NVTX markers for profiling purposes by setting CUTENSOR_NVTX_LEVEL=<level> to one of the following values:
“0” - Off - NVTX markers are disabled (default)
“1” - On - selected API calls add NVTX markers
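For example, to write the most verbose API trace to a per-process log file and enable NVTX markers:
export CUTENSOR_LOG_LEVEL=5
export CUTENSOR_LOG_FILE=cutensor_%i.log
export CUTENSOR_NVTX_LEVEL=1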
Environment Variables#
The environment variables in this section modify cuTENSOR’s runtime behavior. Note that these environment variables are read only when the handle is initialized (i.e., during cutensorCreate()); hence, changes to the environment variables will only take effect for a newly-initialized handle.
NVIDIA_TF32_OVERRIDE, when set to 0, will override any defaults or programmatic configuration of NVIDIA libraries, and never accelerate FP32 computations with TF32 tensor cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change behavior based on this environment variable. Any other setting besides 0 is reserved for future use.
export NVIDIA_TF32_OVERRIDE=0
CUTENSOR_DISABLE_PLAN_CACHE, when set to 1, disables the plan cache (see Software-managed Plan Cache).
export CUTENSOR_DISABLE_PLAN_CACHE=1