.. _user-guide-label:

User Guide
==========

.. _nomenclature-label:

Nomenclature
------------

The term tensor refers to an **order-n** (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher **orders**. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has :math:`n` **modes**. Each mode has an **extent** (a.k.a. size). For each mode you can specify a **stride** :math:`s > 0`. This **stride** describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

cuTENSOR, by default, adheres to a generalized **column-major** data layout. For example, :math:`A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}` is an order-3 tensor with the extents of the a-mode, b-mode, and c-mode being 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:

* :math:`stride(a) = 1`
* :math:`stride(b) = extent(a)`
* :math:`stride(c) = extent(a) * extent(b)`

For a general order-n tensor :math:`A_{i_1,i_2,...,i_n}` we require that the strides do not lead to overlapping memory accesses; for instance, :math:`stride(i_1) \geq 1` and :math:`stride(i_{l}) \geq stride(i_{l-1}) * extent(i_{l-1})`.

We say that a tensor is **packed** if it is contiguously stored in memory along all modes; that is, :math:`stride(i_1) = 1` and :math:`stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})`.

.. _einstein-notation-label:

Einstein Notation
-----------------

We adhere to the "Einstein notation": modes that appear in the input tensors but not in the output tensor are implicitly contracted.

.. _performance-guidlines-label:

Performance Guidelines
----------------------

In this section we assume a generalized column-major data layout (i.e., the modes on the left have the smallest strides).
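As a point of reference, the default packed, generalized column-major strides described in the Nomenclature section can be computed with a few lines of Python (a minimal sketch of the layout rule, not of the cuTENSOR API):

.. code-block:: python

    def column_major_strides(extents):
        """Packed, generalized column-major strides:
        stride(i_1) = 1, stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})."""
        strides = []
        s = 1
        for e in extents:
            strides.append(s)
            s *= e
        return strides

    # Order-3 tensor A_{a,b,c} with extents 4, 8, 12:
    print(column_major_strides([4, 8, 12]))  # [1, 4, 32]

Any strides that are elementwise greater than or equal to these packed strides satisfy the non-overlap requirement above.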
Most of the following performance guidelines are aimed at facilitating more regular memory access patterns:

* Try to arrange the modes (w.r.t. increasing strides) similarly in all tensors. For instance, :math:`C_{a,b,c} = A_{a,k,c} B_{k,b}` is preferable to :math:`C_{a,b,c} = A_{c,k,a} B_{k,b}`.
* Try to keep batched modes as the slowest-varying modes (i.e., the modes with the largest strides). For instance, :math:`C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}` is preferable to :math:`C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}`.
* Try to keep the extent of the fastest-varying mode (a.k.a. the stride-one mode) as large as possible.

.. _accuracy-guarantees-label:

Accuracy Guarantees
-------------------

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The :ref:`cutensorComputeType-label` refers to the minimal accuracy that is guaranteed throughout the computation. Because it is only a guarantee of minimal accuracy, the library may choose a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or in order to have more kernels available to choose from).

For instance, consider a tensor contraction for which all tensors are of type `CUDA_R_32F` but the :ref:`cutensorComputeType-label` is `CUTENSOR_R_MIN_16F`; in that case cuTENSOR would use NVIDIA's Tensor Cores with an accumulation type of `CUDA_R_32F` (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type `CUDA_R_16F` and the compute type is `CUTENSOR_R_MIN_32F`: in this case the parallel reduction (if required for performance) would have to be performed in `CUDA_R_32F` and would thus require auxiliary workspace.
To be precise, in this case cuTENSOR would not choose a serial reduction --via atomics-- through the output tensor, since part of the final reduction would then be performed in `CUDA_R_16F`, which is lower than the :ref:`cutensorComputeType-label` requested by the user.

Scalar Types
------------

The scalar type used for element-wise operations is a function of the output type and the compute type:

.. list-table::
   :header-rows: 1
   :align: center

   * - Output type
     - :ref:`cutensorComputeType-label`
     - Scalar type
   * - `CUDA_R_16F`
     - `CUTENSOR_R_MIN_16F`
     - `CUDA_R_32F`
   * - `CUDA_R_32F`
     - `CUTENSOR_R_MIN_16F`
     - `CUDA_R_32F`
   * - `CUDA_R_32F`
     - `CUTENSOR_R_MIN_32F`
     - `CUDA_R_32F`
   * - `CUDA_R_64F`
     - `CUTENSOR_R_MIN_64F`
     - `CUDA_R_64F`
   * - `CUDA_R_64F`
     - `CUTENSOR_R_MIN_32F`
     - `CUDA_R_64F`
   * - `CUDA_C_32F`
     - `CUTENSOR_C_MIN_32F`
     - `CUDA_C_32F`
   * - `CUDA_C_64F`
     - `CUTENSOR_C_MIN_64F`
     - `CUDA_C_64F`
   * - `CUDA_C_64F`
     - `CUTENSOR_C_MIN_32F`
     - `CUDA_C_64F`

Supported GPUs
--------------

cuTENSOR supports any NVIDIA GPU with a compute capability greater than or equal to 7.0.

Error messages
--------------

The library can print additional error diagnostics if it encounters an error. These can be enabled by setting the `CUTENSOR_LOGINFO_DBG` environment variable to `1`:

.. code-block:: bash

   export CUTENSOR_LOGINFO_DBG=1
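As an aside, the Einstein-notation contraction semantics used throughout this guide can be illustrated on the CPU with NumPy's ``einsum``. The sketch below reproduces the contraction :math:`C_{a,b,c} = A_{a,k,c} B_{k,b}` from the Performance Guidelines section; it illustrates the notation only and makes no use of the cuTENSOR API:

.. code-block:: python

    import numpy as np

    # Modes a, b, c appear in the output; mode k appears only in the
    # inputs and is therefore implicitly contracted (summed over).
    a, b, c, k = 4, 8, 12, 16
    A = np.random.rand(a, k, c)
    B = np.random.rand(k, b)

    # C_{a,b,c} = sum_k A_{a,k,c} * B_{k,b}
    C = np.einsum('akc,kb->abc', A, B)
    assert C.shape == (4, 8, 12)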