.. _user-guide-label:

User Guide
==========

.. _nomenclature-label:

Nomenclature
------------

The term tensor refers to an **order-n** (a.k.a., n-dimensional) array. One can think of tensors as a generalization of matrices to higher **orders**. For example, scalars, vectors, and matrices are order-0, order-1, and order-2 tensors, respectively.

An order-n tensor has :math:`n` **modes**. Each mode has an **extent** (a.k.a. size). For each mode you can specify a **stride** :math:`s > 0`. This **stride** describes how far apart two logically consecutive elements along that mode are in physical memory. Strides serve a function similar to the leading dimension in BLAS and allow, for example, operating on sub-tensors.

cuTENSOR, by default, adheres to a generalized **column-major** data layout. For example, :math:`A_{a,b,c} \in \mathbb{R}^{4\times 8 \times 12}` is an order-3 tensor with the extents of the a-mode, b-mode, and c-mode being 4, 8, and 12, respectively. If not explicitly specified, the strides are assumed to be:

* :math:`stride(a) = 1`
* :math:`stride(b) = extent(a)`
* :math:`stride(c) = extent(a) * extent(b)`

For a general order-n tensor :math:`A_{i_1,i_2,...,i_n}` we require that the strides do not lead to overlapping memory accesses; for instance, :math:`stride(i_1) \geq 1` and :math:`stride(i_{l}) \geq stride(i_{l-1}) * extent(i_{l-1})`.

We say that a tensor is **packed** if it is contiguously stored in memory along all modes; that is, :math:`stride(i_1) = 1` and :math:`stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})`.

.. _einstein-notation-label:

Einstein Notation
-----------------

We adhere to the "Einstein notation": modes that appear in the input tensors but not in the output tensor are implicitly contracted.

.. _performance-guidlines-label:

Performance Guidelines
----------------------

In this section we assume a generalized column-major data layout (i.e., the modes on the left have the smallest strides).
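As a point of reference, the default packed, generalized column-major strides described in the Nomenclature section can be computed with a few lines of Python (a minimal sketch of the layout rule, not of the cuTENSOR API):

.. code-block:: python

    def column_major_strides(extents):
        """Packed, generalized column-major strides:
        stride(i_1) = 1, stride(i_l) = stride(i_{l-1}) * extent(i_{l-1})."""
        strides = []
        s = 1
        for e in extents:
            strides.append(s)
            s *= e
        return strides

    # Order-3 tensor A_{a,b,c} with extents 4, 8, 12:
    print(column_major_strides([4, 8, 12]))  # [1, 4, 32]

Any strides that are elementwise greater than or equal to these packed strides satisfy the non-overlap requirement above.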
Most of the following performance guidelines are aimed at facilitating more regular memory access patterns:

* Try to arrange the modes (w.r.t. increasing strides) similarly in all tensors. For instance, :math:`C_{a,b,c} = A_{a,k,c} B_{k,b}` is preferable to :math:`C_{a,b,c} = A_{c,k,a} B_{k,b}`.
* Try to keep batched modes as the slowest-varying modes (i.e., the modes with the largest strides). For instance, :math:`C_{a,b,c,l} = A_{a,k,c,l} B_{k,b,l}` is preferable to :math:`C_{a,l,b,c} = A_{l,a,k,c} B_{k,l,b}`.
* Try to keep the extent of the fastest-varying mode (a.k.a. the stride-one mode) as large as possible.

.. _accuracy-guarantees-label:

Accuracy Guarantees
-------------------

cuTENSOR uses its own compute type to set the floating-point accuracy across tensor operations. The :ref:`cutensorComputeType-label` refers to the minimal accuracy that is guaranteed throughout the computation. Because it is only a guarantee of minimal accuracy, the library may choose a higher accuracy than that requested by the user (e.g., if that compute type is not supported for a given problem, or in order to have more kernels available to choose from).

For instance, consider a tensor contraction for which all tensors are of type `CUDA_R_32F` but the :ref:`cutensorComputeType-label` is `CUTENSOR_R_MIN_16F`; in that case cuTENSOR would use NVIDIA's Tensor Cores with an accumulation type of `CUDA_R_32F` (i.e., providing higher precision than requested by the user).

Another illustrative example is a tensor contraction for which all tensors are of type `CUDA_R_16F` and the compute type is `CUTENSOR_R_MIN_32F`: in this case the parallel reduction (if required for performance) would have to be performed in `CUDA_R_32F` and would thus require auxiliary workspace.
To be precise, in this case cuTENSOR would not choose a serial reduction --via atomics-- through the output tensor, since part of the final reduction would then be performed in `CUDA_R_16F`, which is lower than the :ref:`cutensorComputeType-label` requested by the user.

Scalar Types
------------

The scalar type used for element-wise operations is a function of the output type and the compute type:

.. list-table::
   :header-rows: 1
   :align: center

   * - Output type
     - :ref:`cutensorComputeType-label`
     - Scalar type
   * - `CUDA_R_16F`
     - `CUTENSOR_R_MIN_16F`
     - `CUDA_R_32F`
   * - `CUDA_R_32F`
     - `CUTENSOR_R_MIN_16F`
     - `CUDA_R_32F`
   * - `CUDA_R_32F`
     - `CUTENSOR_R_MIN_32F`
     - `CUDA_R_32F`
   * - `CUDA_R_64F`
     - `CUTENSOR_R_MIN_64F`
     - `CUDA_R_64F`
   * - `CUDA_R_64F`
     - `CUTENSOR_R_MIN_32F`
     - `CUDA_R_64F`
   * - `CUDA_C_32F`
     - `CUTENSOR_C_MIN_32F`
     - `CUDA_C_32F`
   * - `CUDA_C_64F`
     - `CUTENSOR_C_MIN_64F`
     - `CUDA_C_64F`
   * - `CUDA_C_64F`
     - `CUTENSOR_C_MIN_32F`
     - `CUDA_C_64F`

Supported GPUs
--------------

cuTENSOR supports any NVIDIA GPU with a compute capability greater than or equal to 7.0.

Error messages
--------------

The library can print additional error diagnostics if it encounters an error. These can be enabled by setting the `CUTENSOR_LOGINFO_DBG` environment variable to `1`:

.. code-block:: bash

   export CUTENSOR_LOGINFO_DBG=1
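As an aside, the Einstein-notation contraction semantics used throughout this guide can be illustrated on the CPU with NumPy's ``einsum``. The sketch below reproduces the contraction :math:`C_{a,b,c} = A_{a,k,c} B_{k,b}` from the Performance Guidelines section; it illustrates the notation only and makes no use of the cuTENSOR API:

.. code-block:: python

    import numpy as np

    # Modes a, b, c appear in the output; mode k appears only in the
    # inputs and is therefore implicitly contracted (summed over).
    a, b, c, k = 4, 8, 12, 16
    A = np.random.rand(a, k, c)
    B = np.random.rand(k, b)

    # C_{a,b,c} = sum_k A_{a,k,c} * B_{k,b}
    C = np.einsum('akc,kb->abc', A, B)
    assert C.shape == (4, 8, 12)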