Tensor Primitives Runtime Library APIs

This section describes the Fortran interfaces to the CUDA cuTENSOR library. The cuTENSOR functions are only accessible from host code. Most of the runtime API routines, other than some utilities, are functions that return an error code; they return a value of CUTENSOR_STATUS_SUCCESS if the call was successful, or another cuTENSOR status return value if there was an error. Unlike earlier Fortran modules, we have created a cutensorStatus derived type for the return values. We have also overloaded the .eq. and .ne. logical operators for testing the return status.
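
For instance, a minimal status-checking sketch using the overloaded .ne. operator and the cutensorGetErrorString() helper documented later in this section (assumes use cutensor_v2 is in effect):

type(cutensorHandle) :: handle
type(cutensorStatus) :: istat
istat = cutensorCreate(handle)
if (istat .ne. CUTENSOR_STATUS_SUCCESS) &
  print *, trim(cutensorGetErrorString(istat))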

Currently we provide two levels of Fortran interfaces to the cuTENSOR library: a low-level module which maps one-to-one to the C interfaces in cuTENSOR v2.0.0, and an experimental high-level module which maps several standard Fortran intrinsic functions to the functionality contained within the cuTENSOR library.

The cuTENSOR documentation for version 2.0 explains how to migrate your low-level code from version 1.x to 2.x. Briefly, the create functions cutensorCreateElementwiseBinary(), cutensorCreateElementwiseTrinary(), cutensorCreatePermutation(), cutensorCreateReduction(), and cutensorCreateContraction() each create an operation descriptor. You then create a plan preference and a plan using cutensorCreatePlanPreference() and cutensorCreatePlan(), respectively. Some operations require a workspace, whose size you can query using cutensorEstimateWorkspaceSize(). At this point, you have a cuTENSOR plan which can be passed into one of the execute functions cutensorElementwiseBinaryExecute(), cutensorElementwiseTrinaryExecute(), cutensorPermute(), cutensorReduce(), or cutensorContract(), along with the actual pointers to the device data and the scaling factors.
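
For orientation, here is a minimal sketch of that sequence for a contraction D = alpha*(AxB) + beta*C, using the interfaces documented below. The handle, descriptors, mode arrays, device arrays, and workspace names (desc, descA, modeA, dA, workbuf, wssize, and so on) are illustrative and assumed to have been declared and filled in; status checking is omitted.

istat = cutensorCreate(handle)
! ... create descA, descB, descC, descD with cutensorCreateTensorDescriptor() ...
istat = cutensorCreateContraction(handle, desc, &
          descA, modeA, CUTENSOR_OP_IDENTITY, descB, modeB, CUTENSOR_OP_IDENTITY, &
          descC, modeC, CUTENSOR_OP_IDENTITY, descD, modeD, cutensor_Compute_Desc_32F)
istat = cutensorCreatePlanPreference(handle, pref, CUTENSOR_ALGO_DEFAULT, &
          CUTENSOR_JIT_MODE_NONE)
istat = cutensorEstimateWorkspaceSize(handle, desc, pref, &
          CUTENSOR_WORKSPACE_RECOMMENDED, wssize)
istat = cutensorCreatePlan(handle, plan, desc, pref, wssize)
! Allocate a device workspace buffer, workbuf, of at least wssize bytes, then:
istat = cutensorContract(handle, plan, alpha, dA, dB, beta, dC, dD, &
          workbuf, wssize, stream)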

Chapter 12 contains examples of accessing the cuTENSOR library routines from OpenACC and CUDA Fortran. In both cases, the interfaces to the library can be exposed by adding the line

use cutensor_v2

to your program unit.

Unless a specific kind is provided, the plain integer type used in the interfaces implies integer(4) and the plain real type implies real(4).

CUTENSOR Definitions and Helper Functions

This section contains definitions and data types used in the cuTENSOR library and interfaces to the cuTENSOR helper functions.

Beginning with the NVHPC 24.1 release, the cuTENSOR version 2.x library is bundled in the release package. The version 2 API is significantly different from the previous 1.x releases, and the module name has been changed to avoid confusion. The new library interfaces can be exposed by adding the line

use cutensor_v2

to your program unit.

This cuTENSOR module contains the following derived type definitions:

! Definitions from cutensor.h
integer, parameter :: CUTENSOR_MAJOR = 2
integer, parameter :: CUTENSOR_MINOR = 0
integer, parameter :: CUTENSOR_PATCH = 0
! Types from cutensor/types.h
! Algorithm Control
type, bind(c) :: cutensorAlgo
    integer(4) :: algo
end type
type(cutensorAlgo), parameter :: &
    CUTENSOR_ALGO_DEFAULT_PATIENT = cutensorAlgo(-6), &
    CUTENSOR_ALGO_GETT    = cutensorAlgo(-4), &
    CUTENSOR_ALGO_TGETT   = cutensorAlgo(-3), &
    CUTENSOR_ALGO_TTGT    = cutensorAlgo(-2), &
    CUTENSOR_ALGO_DEFAULT = cutensorAlgo(-1)
! Workspace Control
type, bind(c) :: cutensorWorksizePreference
    integer(4) :: wksp
end type
type(cutensorWorksizePreference), parameter :: &
    CUTENSOR_WORKSPACE_MIN         = cutensorWorksizePreference(1), &
    CUTENSOR_WORKSPACE_RECOMMENDED = cutensorWorksizePreference(2), &
    CUTENSOR_WORKSPACE_MAX         = cutensorWorksizePreference(3)
! Unary and Binary Element-wise Operations
type, bind(c) :: cutensorOperator
    integer(4) :: opno
end type
type(cutensorOperator), parameter :: &
    ! Unary
    CUTENSOR_OP_IDENTITY = cutensorOperator(1), & ! Identity operator
    CUTENSOR_OP_SQRT    = cutensorOperator(2), &  ! Square root
    CUTENSOR_OP_RELU    = cutensorOperator(8), &  ! Rectified linear unit
    CUTENSOR_OP_CONJ    = cutensorOperator(9), &  ! Complex conjugate
    CUTENSOR_OP_RCP     = cutensorOperator(10), & ! Reciprocal
    CUTENSOR_OP_SIGMOID = cutensorOperator(11), & ! y=1/(1+exp(-x))
    CUTENSOR_OP_TANH    = cutensorOperator(12), & ! y=tanh(x)
    CUTENSOR_OP_EXP     = cutensorOperator(22), & ! Exponentiation.
    CUTENSOR_OP_LOG     = cutensorOperator(23), & ! Log (base e).
    CUTENSOR_OP_ABS     = cutensorOperator(24), & ! Absolute value.
    CUTENSOR_OP_NEG     = cutensorOperator(25), & ! Negation.
    CUTENSOR_OP_SIN     = cutensorOperator(26), & ! Sine.
    CUTENSOR_OP_COS     = cutensorOperator(27), & ! Cosine.
    CUTENSOR_OP_TAN     = cutensorOperator(28), & ! Tangent.
    CUTENSOR_OP_SINH    = cutensorOperator(29), & ! Hyperbolic sine.
    CUTENSOR_OP_COSH    = cutensorOperator(30), & ! Hyperbolic cosine.
    CUTENSOR_OP_ASIN    = cutensorOperator(31), & ! Inverse sine.
    CUTENSOR_OP_ACOS    = cutensorOperator(32), & ! Inverse cosine.
    CUTENSOR_OP_ATAN    = cutensorOperator(33), & ! Inverse tangent.
    CUTENSOR_OP_ASINH   = cutensorOperator(34), & ! Inverse hyperbolic sine.
    CUTENSOR_OP_ACOSH   = cutensorOperator(35), & ! Inverse hyperbolic cosine.
    CUTENSOR_OP_ATANH   = cutensorOperator(36), & ! Inverse hyperbolic tangent.
    CUTENSOR_OP_CEIL    = cutensorOperator(37), & ! Ceiling.
    CUTENSOR_OP_FLOOR   = cutensorOperator(38), & ! Floor.
    CUTENSOR_OP_MISH    = cutensorOperator(39), & ! Mish y=x*tanh(softplus(x)).
    CUTENSOR_OP_SWISH   = cutensorOperator(40), & ! Swish y=x*sigmoid(x).
    CUTENSOR_OP_SOFT_PLUS = cutensorOperator(41), & ! Softplus y=log(exp(x)+1).
    CUTENSOR_OP_SOFT_SIGN = cutensorOperator(42), & ! Softsign y=x/(abs(x)+1).

    ! Binary
    CUTENSOR_OP_ADD     = cutensorOperator(3), &  ! Addition of two elements
    CUTENSOR_OP_MUL     = cutensorOperator(5), &  ! Multiplication of 2 elements
    CUTENSOR_OP_MAX     = cutensorOperator(6), &  ! Maximum of two elements
    CUTENSOR_OP_MIN     = cutensorOperator(7), &  ! Minimum of two elements
    CUTENSOR_OP_UNKNOWN = cutensorOperator(126)   ! reserved for internal use only
! Status Return Values
type, bind(c) :: cutensorStatus
    integer(4) :: stat
end type
type(cutensorStatus), parameter :: &
    ! The operation completed successfully.
    CUTENSOR_STATUS_SUCCESS                = cutensorStatus(0), &
    ! The cuTENSOR library was not initialized.
    CUTENSOR_STATUS_NOT_INITIALIZED        = cutensorStatus(1), &
    ! Resource allocation failed inside the cuTENSOR library.
    CUTENSOR_STATUS_ALLOC_FAILED           = cutensorStatus(3), &
    ! An unsupported value or parameter was passed to the function.
    CUTENSOR_STATUS_INVALID_VALUE          = cutensorStatus(7), &
    ! Indicates that the device is either not ready,
    ! or the target architecture is not supported.
    CUTENSOR_STATUS_ARCH_MISMATCH          = cutensorStatus(8), &
    ! An access to GPU memory space failed, which is usually caused
    ! by a failure to bind a texture.
    CUTENSOR_STATUS_MAPPING_ERROR          = cutensorStatus(11), &
    ! The GPU program failed to execute. This is often caused by a
    ! launch failure of the kernel on the GPU, which can be caused by
    ! multiple reasons.
    CUTENSOR_STATUS_EXECUTION_FAILED       = cutensorStatus(13), &
    ! An internal cuTENSOR error has occurred.
    CUTENSOR_STATUS_INTERNAL_ERROR         = cutensorStatus(14), &
    ! The requested operation is not supported.
    CUTENSOR_STATUS_NOT_SUPPORTED          = cutensorStatus(15), &
    ! The functionality requested requires some license and an error
    ! was detected when trying to check the current licensing.
    CUTENSOR_STATUS_LICENSE_ERROR          = cutensorStatus(16), &
    ! A call to CUBLAS did not succeed.
    CUTENSOR_STATUS_CUBLAS_ERROR           = cutensorStatus(17), &
    ! Some unknown CUDA error has occurred.
    CUTENSOR_STATUS_CUDA_ERROR             = cutensorStatus(18), &
    ! The provided workspace was insufficient.
    CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE = cutensorStatus(19), &
    ! Indicates that the driver version is insufficient.
    CUTENSOR_STATUS_INSUFFICIENT_DRIVER    = cutensorStatus(20), &
    ! Indicates an error related to file I/O
    CUTENSOR_STATUS_IO_ERROR               = cutensorStatus(21)
! Data Type
type, bind(c) :: cutensorDataType
    integer(4) :: cudaDataType
end type
type(cutensorDataType), parameter :: &
    CUTENSOR_R_16F  =  cutensorDataType(2), & ! real as a half
    CUTENSOR_C_16F  =  cutensorDataType(6), & ! complex as a pair of half numbers
    CUTENSOR_R_16BF = cutensorDataType(14), & ! real as a nv_bfloat16
    CUTENSOR_C_16BF = cutensorDataType(15), & ! complex as a pair of nv_bfloat16 numbers
    CUTENSOR_R_32F  =  cutensorDataType(0), & ! real as a float
    CUTENSOR_C_32F  =  cutensorDataType(4), & ! complex as a pair of float numbers
    CUTENSOR_R_64F  =  cutensorDataType(1), & ! real as a double
    CUTENSOR_C_64F  =  cutensorDataType(5), & ! complex as a pair of double numbers
    CUTENSOR_R_4I   = cutensorDataType(16), & ! real as a signed 4-bit int
    CUTENSOR_C_4I   = cutensorDataType(17), & ! complex as a pair of signed 4-bit int numbers
    CUTENSOR_R_4U   = cutensorDataType(18), & ! real as an unsigned 4-bit int
    CUTENSOR_C_4U   = cutensorDataType(19), & ! complex as a pair of unsigned 4-bit int numbers
    CUTENSOR_R_8I   =  cutensorDataType(3), & ! real as a signed 8-bit int
    CUTENSOR_C_8I   =  cutensorDataType(7), & ! complex as a pair of signed 8-bit int numbers
    CUTENSOR_R_8U   =  cutensorDataType(8), & ! real as an unsigned 8-bit int
    CUTENSOR_C_8U   =  cutensorDataType(9), & ! complex as a pair of unsigned 8-bit int numbers
    CUTENSOR_R_16I  = cutensorDataType(20), & ! real as a signed 16-bit int
    CUTENSOR_C_16I  = cutensorDataType(21), & ! complex as a pair of signed 16-bit int numbers
    CUTENSOR_R_16U  = cutensorDataType(22), & ! real as an unsigned 16-bit int
    CUTENSOR_C_16U  = cutensorDataType(23), & ! complex as a pair of unsigned 16-bit int numbers
    CUTENSOR_R_32I  = cutensorDataType(10), & ! real as a signed 32-bit int
    CUTENSOR_C_32I  = cutensorDataType(11), & ! complex as a pair of signed 32-bit int numbers
    CUTENSOR_R_32U  = cutensorDataType(12), & ! real as an unsigned 32-bit int
    CUTENSOR_C_32U  = cutensorDataType(13), & ! complex as a pair of unsigned 32-bit int numbers
    CUTENSOR_R_64I  = cutensorDataType(24), & ! real as a signed 64-bit int
    CUTENSOR_C_64I  = cutensorDataType(25), & ! complex as a pair of signed 64-bit int numbers
    CUTENSOR_R_64U  = cutensorDataType(26), & ! real as an unsigned 64-bit int
    CUTENSOR_C_64U  = cutensorDataType(27)        ! complex as a pair of unsigned 64-bit int numbers
! Compute Type, no longer used, will be removed
type, bind(c) :: cutensorComputeType
    integer(4) :: ctyp
end type
type(cutensorComputeType), parameter :: &
    CUTENSOR_R_MIN_16F = cutensorComputeType(2**0), & ! real as a half
    CUTENSOR_C_MIN_16F = cutensorComputeType(2**1), & ! complex as a half
    CUTENSOR_R_MIN_32F = cutensorComputeType(2**2), & ! real as a float
    CUTENSOR_C_MIN_32F = cutensorComputeType(2**3), & ! complex as a float
    CUTENSOR_R_MIN_64F = cutensorComputeType(2**4), & ! real as a double
    CUTENSOR_C_MIN_64F = cutensorComputeType(2**5), & ! complex as a double
    CUTENSOR_R_MIN_8U  = cutensorComputeType(2**6), & ! real as a uint8
    CUTENSOR_R_MIN_32U = cutensorComputeType(2**7), & ! real as a uint32
    CUTENSOR_R_MIN_8I  = cutensorComputeType(2**8), & ! real as a int8
    CUTENSOR_R_MIN_32I = cutensorComputeType(2**9), & ! real as a int32
    CUTENSOR_R_MIN_16BF = cutensorComputeType(2**10), & ! real as a bfloat16
    CUTENSOR_R_MIN_TF32 = cutensorComputeType(2**11), & ! real as a tf32
    CUTENSOR_C_MIN_TF32 = cutensorComputeType(2**12) ! complex as a tf32
! Operation Descriptor attribute
type, bind(c) :: cutensorOperationDescriptorAttribute
    integer(4) :: attr
end type
type(cutensorOperationDescriptorAttribute), parameter :: &
    CUTENSOR_OPERATION_DESCRIPTOR_TAG = &
      cutensorOperationDescriptorAttribute(0), &  ! integer(4)
    CUTENSOR_OPERATION_DESCRIPTOR_SCALAR_TYPE = &
      cutensorOperationDescriptorAttribute(1), &  ! cutensorDataType
    CUTENSOR_OPERATION_DESCRIPTOR_FLOPS = &
      cutensorOperationDescriptorAttribute(2), &  ! real(4)
    CUTENSOR_OPERATION_DESCRIPTOR_MOVED_BYTES = &
      cutensorOperationDescriptorAttribute(3), &  ! real(4)
    CUTENSOR_OPERATION_DESCRIPTOR_PADDING_LEFT = &
      cutensorOperationDescriptorAttribute(4), &  ! integer(4)
    CUTENSOR_OPERATION_DESCRIPTOR_PADDING_RIGHT = &
      cutensorOperationDescriptorAttribute(5), &  ! integer(4)
    CUTENSOR_OPERATION_DESCRIPTOR_PADDING_VALUE = &
      cutensorOperationDescriptorAttribute(6)     ! type(c_ptr)
! Plan Preference attribute
type, bind(c) :: cutensorPlanPreferenceAttribute
    integer(4) :: attr
end type
type(cutensorPlanPreferenceAttribute), parameter :: &
    CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE = &
      cutensorPlanPreferenceAttribute(0), &  ! cutensorAutotuneMode type
    CUTENSOR_PLAN_PREFERENCE_CACHE_MODE = &
      cutensorPlanPreferenceAttribute(1), &  ! cutensorCacheMode type
    CUTENSOR_PLAN_PREFERENCE_INCREMENTAL_COUNT = &
      cutensorPlanPreferenceAttribute(2), &  ! integer(4)
    CUTENSOR_PLAN_PREFERENCE_ALGO = &
      cutensorPlanPreferenceAttribute(3), &  ! cutensorAlgo type
    CUTENSOR_PLAN_PREFERENCE_KERNEL_RANK = &
      cutensorPlanPreferenceAttribute(4), &  ! integer(4)
    CUTENSOR_PLAN_PREFERENCE_JIT = &
      cutensorPlanPreferenceAttribute(5)     ! cutensorJitMode type
! Plan Info attribute
type, bind(c) :: cutensorPlanInfoAttribute
    integer(4) :: attr
end type
type(cutensorPlanInfoAttribute), parameter :: &
    ! Pass an integer(8); returns the exact required workspace in bytes
    CUTENSOR_PLAN_REQUIRED_WORKSPACE = cutensorPlanInfoAttribute(0)
! Autotune Mode
type, bind(c) :: cutensorAutotuneMode
    integer(4) :: mode
end type
type(cutensorAutotuneMode), parameter :: &
    CUTENSOR_AUTOTUNE_MODE_NONE        = cutensorAutotuneMode(0), &
    CUTENSOR_AUTOTUNE_MODE_INCREMENTAL = cutensorAutotuneMode(1)
! JIT Mode
type, bind(c) :: cutensorJitMode
    integer(4) :: mode
end type
type(cutensorJitMode), parameter :: &
    CUTENSOR_JIT_MODE_NONE        = cutensorJitMode(0), &
    CUTENSOR_JIT_MODE_DEFAULT     = cutensorJitMode(1)
! Cache Mode
type, bind(c) :: cutensorCacheMode
    integer(4) :: mode
end type
type(cutensorCacheMode), parameter :: &
    CUTENSOR_CACHE_MODE_NONE        = cutensorCacheMode(0), &
    CUTENSOR_CACHE_MODE_PEDANTIC    = cutensorCacheMode(1)
! New 2.0 handle types
type cutensorHandle
  TYPE(C_PTR) :: handle
end type

type cutensorTensorDescriptor
  TYPE(C_PTR) :: desc
end type

type cutensorOperationDescriptor
  TYPE(C_PTR) :: desc
end type

type cutensorComputeDescriptor
  TYPE(C_PTR) :: desc
end type

type cutensorPlan
  TYPE(C_PTR) :: desc
end type

type cutensorPlanPreference
  TYPE(C_PTR) :: desc
end type
! These are global C symbols in libcutensor
type(cutensorComputeDescriptor), &
  bind(C, name='CUTENSOR_COMPUTE_DESC_16F') ::  cutensor_Compute_Desc_16F
type(cutensorComputeDescriptor), &
  bind(C, name='CUTENSOR_COMPUTE_DESC_16BF') :: cutensor_Compute_Desc_16BF
type(cutensorComputeDescriptor), &
  bind(C, name='CUTENSOR_COMPUTE_DESC_TF32') :: cutensor_Compute_Desc_TF32
type(cutensorComputeDescriptor), &
  bind(C, name='CUTENSOR_COMPUTE_DESC_3XTF32') :: cutensor_Compute_Desc_3XTF32
type(cutensorComputeDescriptor), &
  bind(C, name='CUTENSOR_COMPUTE_DESC_32F') :: cutensor_Compute_Desc_32F
type(cutensorComputeDescriptor), &
  bind(C, name='CUTENSOR_COMPUTE_DESC_64F') :: cutensor_Compute_Desc_64F

cutensorCreate

This function initializes the cuTENSOR library and returns a handle for subsequent cuTENSOR calls.

type(cutensorStatus) function cutensorCreate(handle)
   type(cutensorHandle) :: handle

cutensorDestroy

This function releases the resources used by the cuTENSOR library. This function is usually the last call with a particular handle to the cuTENSOR API.

type(cutensorStatus) function cutensorDestroy(handle)
   type(cutensorHandle) :: handle

cutensorCreateTensorDescriptor

This function creates and initializes a cuTENSOR descriptor, given the number of modes, extents, strides, and type of the data.

type(cutensorStatus) function cutensorCreateTensorDescriptor(handle, desc, &
        numModes, extent, stride, dataType, alignmentRequirement)
    type(cutensorHandle) :: handle
    type(cutensorTensorDescriptor) :: desc
    integer(4) :: numModes
    integer(8), dimension(*) :: extent
    integer(8), dimension(*) :: stride
    type(cutensorDataType) :: dataType
    integer(4) :: alignmentRequirement
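
For example, a sketch describing a contiguous real(4) array A(ni,nj,nk) stored in column-major order; the handle, extent, and stride declarations, and the alignment value of 256 bytes, are illustrative assumptions:

extent(1:3) = [ int(ni,8), int(nj,8), int(nk,8) ]
stride(1:3) = [ 1_8, int(ni,8), int(ni,8)*int(nj,8) ]
istat = cutensorCreateTensorDescriptor(handle, descA, 3, extent, stride, &
          CUTENSOR_R_32F, 256)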

cutensorDestroyTensorDescriptor

This function deletes the cuTENSOR tensor descriptor and frees the resources associated with it.

type(cutensorStatus) function cutensorDestroyTensorDescriptor(desc)
    type(cutensorTensorDescriptor) :: desc

cutensorOperationDescriptorGetAttribute

This function retrieves an attribute of an Operation Descriptor type.

type(cutensorStatus) function cutensorOperationDescriptorGetAttribute(handle, &
    desc, attr, buf, sizeInBytes)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorOperationDescriptorAttribute) :: attr
    integer(4) :: buf(*)  ! Any type, of size sizeInBytes
    integer(8) :: sizeInBytes

cutensorOperationDescriptorSetAttribute

This function sets an attribute of an Operation Descriptor type.

type(cutensorStatus) function cutensorOperationDescriptorSetAttribute(handle, &
    desc, attr, buf, sizeInBytes)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorOperationDescriptorAttribute) :: attr
    integer(4) :: buf(*)  ! Any type, of size sizeInBytes
    integer(8) :: sizeInBytes

cutensorDestroyOperationDescriptor

This function frees the resources related to the Operation Descriptor type.

type(cutensorStatus) function cutensorDestroyOperationDescriptor(desc)
    type(cutensorOperationDescriptor) :: desc

cutensorCreatePlanPreference

This function allocates and sets a cuTENSOR plan preference type.

type(cutensorStatus) function cutensorCreatePlanPreference(handle, pref, algo, &
        jitMode)
    type(cutensorHandle) :: handle
    type(cutensorPlanPreference) :: pref
    type(cutensorAlgo) :: algo
    type(cutensorJitMode) :: jitMode

cutensorDestroyPlanPreference

This function frees the resources associated with a plan preference type.

type(cutensorStatus) function cutensorDestroyPlanPreference(pref)
    type(cutensorPlanPreference) :: pref

cutensorPlanPreferenceSetAttribute

This function sets an attribute of a cuTENSOR plan preference type.

type(cutensorStatus) function cutensorPlanPreferenceSetAttribute(handle, pref, &
    attr, buf, sizeInBytes)
    type(cutensorHandle) :: handle
    type(cutensorPlanPreference) :: pref
    type(cutensorPlanPreferenceAttribute) :: attr
    integer(4) :: buf(*)  ! Any type, of size sizeInBytes
    integer(8) :: sizeInBytes
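
For example, a sketch of enabling incremental autotuning on a plan preference. Since the buf dummy is declared integer(4) and documented as accepting any type of the given size, the integer component of the mode constant is passed here; the handle and pref are assumed to have been created already.

integer(4) :: amode(1)
amode(1) = CUTENSOR_AUTOTUNE_MODE_INCREMENTAL%mode
istat = cutensorPlanPreferenceSetAttribute(handle, pref, &
          CUTENSOR_PLAN_PREFERENCE_AUTOTUNE_MODE, amode, 4_8)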

cutensorEstimateWorkspaceSize

This function determines the size of the required workspace for a given operation and preferences.

type(cutensorStatus) function cutensorEstimateWorkspaceSize(handle, desc, &
        pref, workspacePref, workspaceSizeEstimate)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorPlanPreference) :: pref
    type(cutensorWorksizePreference) :: workspacePref
    integer(8) :: workspaceSizeEstimate

cutensorCreatePlan

This function allocates and sets a cuTENSOR plan type.

type(cutensorStatus) function cutensorCreatePlan(handle, plan, &
        desc, pref, workspaceSizeLimit)
    type(cutensorHandle) :: handle
    type(cutensorPlan), intent(out) :: plan
    type(cutensorOperationDescriptor) :: desc
    type(cutensorPlanPreference) :: pref
    integer(8) :: workspaceSizeLimit

cutensorDestroyPlan

This function frees the resources associated with a cuTENSOR plan type.

type(cutensorStatus) function cutensorDestroyPlan(plan)
    type(cutensorPlan) :: plan

cutensorGetErrorString

This function returns the description string for an error code.

character*128 function cutensorGetErrorString(ierr)
   type(cutensorStatus) :: ierr

cutensorGetVersion

This function returns the version number of the cuTENSOR library.

integer(8) function cutensorGetVersion()

cutensorGetCudartVersion

This function returns the version of the CUDA runtime that the cuTENSOR library was compiled against.

integer(8) function cutensorGetCudartVersion()

CUTENSOR Element-wise Operations

This section contains interfaces for the cuTENSOR functions that perform element-wise operations between tensors.

cutensorCreatePermutation

This function creates an out-of-place tensor permutation operator of the form

B = alpha * opA(perm(A))

The permutation operation information is stored in the intent(out) desc argument.

The arrays A, B can be of any supported type, kind, and rank. The permutations of A, B are set up using the mode arguments.

type(cutensorStatus) function cutensorCreatePermutation(handle, desc, &
    descA, modeA, opA, descB, modeB, descCompute)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor), intent(out) :: desc
    type(cutensorTensorDescriptor) :: descA, descB
    integer(4), dimension(*) :: modeA, modeB
    type(cutensorOperator) :: opA
    type(cutensorComputeDescriptor) :: descCompute

cutensorPermute

This function executes an out-of-place tensor permutation of the form

B = alpha * opA(perm(A))

The arrays A, B can be of any supported type, kind, and rank. The allowable type and kind of the alpha scalar are determined by the compute descriptor in the call to cutensorCreatePermutation(), which also sets up the permutations of A, B and the unary operator opA.

type(cutensorStatus) function cutensorPermute(handle, plan, &
    alpha, A, B, stream)
    type(cutensorHandle) :: handle
    type(cutensorPlan) :: plan
    real :: alpha
    real, device, dimension(*) :: A, B
    integer(kind=cuda_stream_kind) :: stream
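
A sketch combining the create, plan, and execute steps for B = alpha * abs(perm(A)), permuting modes (1,2,3) to (1,3,2); the tensor descriptors are assumed to have matching extents, the remaining names are illustrative, and status checking is omitted:

modeA(1:3) = [ 1, 2, 3 ]
modeB(1:3) = [ 1, 3, 2 ]
istat = cutensorCreatePermutation(handle, desc, descA, modeA, CUTENSOR_OP_ABS, &
          descB, modeB, cutensor_Compute_Desc_32F)
istat = cutensorCreatePlanPreference(handle, pref, CUTENSOR_ALGO_DEFAULT, &
          CUTENSOR_JIT_MODE_NONE)
istat = cutensorEstimateWorkspaceSize(handle, desc, pref, &
          CUTENSOR_WORKSPACE_MIN, wssize)
istat = cutensorCreatePlan(handle, plan, desc, pref, wssize)
istat = cutensorPermute(handle, plan, alpha, dA, dB, stream)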

cutensorCreateElementwiseBinary

This function creates an element-wise tensor operation on two inputs of the form

D = opAC(alpha * opA(perm(A)), gamma * opC(perm(C)))

The opA and opC arguments are element-wise unary operators. The opAC argument is an element-wise binary operator. The arrays A, C, D can be of any supported type, kind, and rank. The permutations of A, C, D are set up using the mode arguments.

type(cutensorStatus) function cutensorCreateElementwiseBinary(handle, desc, &
        descA, modeA, opA, descC, modeC, opC, &
        descD, modeD, opAC, descCompute)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorTensorDescriptor) :: descA, descC, descD
    integer(4), dimension(*) :: modeA, modeC, modeD
    type(cutensorOperator) :: opA, opC, opAC
    type(cutensorComputeDescriptor) :: descCompute

cutensorElementwiseBinaryExecute

This function executes an element-wise tensor operation on two inputs of the form

D = opAC(alpha * opA(perm(A)), gamma * opC(perm(C)))

The specified plan is created by a call to cutensorCreatePlan(). The arrays A, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, gamma scalars are determined by the compute descriptor in the call to cutensorCreateElementwiseBinary().

type(cutensorStatus) function cutensorElementwiseBinaryExecute(handle, &
    plan, alpha, A, gamma, C, D, stream)
    type(cutensorHandle) :: handle
    type(cutensorPlan) :: plan
    real :: alpha, gamma  ! Any compatible type and kind
    real, device, dimension(*) :: A, C, D
    integer(kind=cuda_stream_kind) :: stream
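
A sketch of the create and execute calls for D = alpha*perm(A) + gamma*C, using CUTENSOR_OP_ADD as the binary operator; the plan is built as in the earlier examples and the remaining names are illustrative:

istat = cutensorCreateElementwiseBinary(handle, desc, &
          descA, modeA, CUTENSOR_OP_IDENTITY, descC, modeC, CUTENSOR_OP_IDENTITY, &
          descD, modeD, CUTENSOR_OP_ADD, cutensor_Compute_Desc_32F)
! ... create the plan preference and plan as shown above, then ...
istat = cutensorElementwiseBinaryExecute(handle, plan, alpha, dA, gamma, dC, dD, stream)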

cutensorCreateElementwiseTrinary

This function creates an element-wise tensor operation on three inputs of the form

D = opABC(opAB(alpha * opA(perm(A)), beta * opB(perm(B))), gamma * opC(perm(C)))

The opA, opB, and opC arguments are element-wise unary operators. The opAB argument is an element-wise binary operator. The arrays A, B, C, D can be of any supported type, kind, and rank. The permutations of A, B, C, D are set up using the mode arguments.

type(cutensorStatus) function cutensorCreateElementwiseTrinary(handle, desc, &
        descA, modeA, opA, descB, modeB, opB, descC, modeC, opC, &
        descD, modeD, opAB, descCompute)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorTensorDescriptor) :: descA, descB, descC, descD
    integer(4), dimension(*) :: modeA, modeB, modeC, modeD
    type(cutensorOperator) :: opA, opB, opC, opAB
    type(cutensorComputeDescriptor) :: descCompute

cutensorElementwiseTrinaryExecute

This function executes an element-wise tensor operation on three inputs of the form

D = opABC(opAB(alpha * opA(perm(A)), beta * opB(perm(B))), gamma * opC(perm(C)))

The specified plan is created by a call to cutensorCreatePlan(). The arrays A, B, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, beta, gamma scalars are determined by the compute descriptor in the call to cutensorCreateElementwiseTrinary().

type(cutensorStatus) function cutensorElementwiseTrinaryExecute(handle, plan,  &
    alpha, A, beta, B, gamma, C, D, stream)
    type(cutensorHandle) :: handle
    type(cutensorPlan) :: plan
    real :: alpha, beta, gamma  ! Any compatible type and kind
    real, device, dimension(*) :: A, B, C, D
    integer(kind=cuda_stream_kind) :: stream

CUTENSOR Reduction Operations

This section contains interfaces for the cuTENSOR functions that perform reduction operations on tensors.

cutensorCreateReduction

This function creates a reduction tensor operation of the form:

D = alpha * opReduce(opA(A)) + beta * opC(C)

The opA and opC arguments are element-wise unary operators. The opReduce argument is a binary operator. The arrays A, C, D can be of any supported type, kind, and rank. The permutations of A, C are set up using the mode arguments.

type(cutensorStatus) function cutensorCreateReduction(handle, desc, &
        descA, modeA, opA, descC, modeC, opC, &
        descD, modeD, opReduce, descCompute)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorTensorDescriptor) :: descA, descC, descD
    integer(4), dimension(*) :: modeA, modeC, modeD
    type(cutensorOperator) :: opA, opC, opReduce
    type(cutensorComputeDescriptor) :: descCompute

cutensorReduce

This function executes a reduction of the form:

D = alpha * opReduce(opA(A)) + beta * opC(C)

The specified plan is created by a call to cutensorCreatePlan(). The arrays A, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, beta scalars are determined by the compute descriptor in the call to cutensorCreateReduction().

type(cutensorStatus) function cutensorReduce(handle, plan, &
    alpha, A, beta, C, D, workspace, workspaceSize, stream)
    type(cutensorHandle) :: handle
    type(cutensorPlan) :: plan
    real :: alpha, beta  ! Any compatible type and kind
    real, device, dimension(*) :: A, C, D
    real, device, dimension(*) :: workspace  ! Any type
    integer(8) :: workspaceSize
    integer(kind=cuda_stream_kind) :: stream
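
A sketch following the pattern D = alpha * opReduce(opA(A)) + beta * opC(C), summing the absolute values of A (opA = ABS, opReduce = ADD); workbuf is assumed to be a device array of at least wssize bytes, the remaining names are illustrative, and status checking is omitted:

istat = cutensorCreateReduction(handle, desc, descA, modeA, CUTENSOR_OP_ABS, &
          descC, modeC, CUTENSOR_OP_IDENTITY, descD, modeD, CUTENSOR_OP_ADD, &
          cutensor_Compute_Desc_32F)
istat = cutensorCreatePlanPreference(handle, pref, CUTENSOR_ALGO_DEFAULT, &
          CUTENSOR_JIT_MODE_NONE)
istat = cutensorEstimateWorkspaceSize(handle, desc, pref, &
          CUTENSOR_WORKSPACE_RECOMMENDED, wssize)
istat = cutensorCreatePlan(handle, plan, desc, pref, wssize)
istat = cutensorReduce(handle, plan, alpha, dA, beta, dC, dD, &
          workbuf, wssize, stream)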

CUTENSOR Contraction Operations

This section contains interfaces for the cuTENSOR functions that perform contraction operations between tensors.

cutensorCreateContraction

This function creates a contraction tensor operation of the form:

D = alpha*(AxB) + beta*C

The opA, opB, and opC arguments are element-wise unary operators. The arrays A, B, C, D can be of any supported type, kind, and rank. The permutations of A, B, C are set up using the mode arguments.

type(cutensorStatus) function cutensorCreateContraction(handle, desc, &
        descA, modeA, opA, descB, modeB, opB, &
        descC, modeC, opC, descD, modeD, descCompute)
    type(cutensorHandle) :: handle
    type(cutensorOperationDescriptor) :: desc
    type(cutensorTensorDescriptor) :: descA, descB, descC, descD
    integer(4), dimension(*) :: modeA, modeB, modeC, modeD
    type(cutensorOperator) :: opA, opB, opC
    type(cutensorComputeDescriptor) :: descCompute

cutensorContract

This function executes a contraction of the form:

D = alpha*(AxB) + beta*C

The specified plan is created by a call to cutensorCreatePlan(). The arrays A, B, C, D can be of any supported type, kind, and rank. The allowable type and kind of the alpha, beta scalars are determined by the compute descriptor in the call to cutensorCreateContraction().

type(cutensorStatus) function cutensorContract(handle, plan, &
    alpha, A, B, beta, C, D, workspace, workspaceSize, stream)
    type(cutensorHandle) :: handle
    type(cutensorPlan) :: plan
    real :: alpha, beta  ! Any compatible type and kind
    real, device, dimension(*) :: A, B, C, D
    real, device, dimension(*) :: workspace  ! Any type
    integer(8) :: workspaceSize
    integer(kind=cuda_stream_kind) :: stream

CUTENSOR Fortran Extensions

This section contains extensions to the cuTENSOR interfaces for Fortran array intrinsic function operations, array expressions, and array assignment, built upon the cuTENSOR library. In CUDA Fortran, these operations take data with the device or managed attribute. In OpenACC, they can be invoked with a host_data construct. The interfaces to these extensions are exposed by adding the line use cutensorEx to your program unit, or can be applied to specific statements using the Fortran block feature, as shown in the next example.

block; use cutensorEx
  D = reshape(A,shape=[ni,nk,nj],order=[1,3,2])
end block

To enable the operations to take place in one underlying kernel invocation, evaluation of the RHS expression is deferred until the overloaded assignment operation has both the LHS and RHS available. This keeps performance on par with the low-level cuTENSOR API whenever possible. The cost is generality: only specific forms of Fortran statements are currently supported, and those are documented in the subsequent sections of this chapter.

Since the cuTENSOR library operations take a stream argument, we have added a way to set a cuTENSOR default stream, which our runtime maintains as one copy per CPU thread:

integer function cutensorexSetStream(stream)
integer(kind=cuda_stream_kind) :: stream
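
For example, a sketch of directing subsequent cutensorEx operations to a user-created stream, assuming cudaStreamCreate from the cudafor module is available:

integer(kind=cuda_stream_kind) :: mystream
istat = cudaStreamCreate(mystream)
istat = cutensorexSetStream(mystream)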

Fortran Reshape

This Fortran function changes the shape of an array and possibly permutes the dimensions and layout. It is invoked as:

D = alpha * func(reshape(A, shape=[...], order=[...]))

The arrays A and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A and D can be from 1 to 7. The alpha value is expected to be the same type as A, or as func(reshape(A)), if that differs. Accepted functions which can be applied to the result of reshape are listed at the end of this section. The pad argument to the F90 reshape function is not currently supported. This Fortran call, besides initialization and setting up cuTENSOR descriptors, maps to cutensorCreatePermutation() and cutensorPermute().

! Example to switch the 2nd and 3rd dimension layout
D = reshape(a,shape=[ni,nk,nj], order=[1,3,2])
! Same example, take the absolute value and scale by 2.5
D = 2.5 * abs(reshape(a,shape=[ni,nk,nj], order=[1,3,2]))

Fortran Transpose

This Fortran function transposes a matrix (a 2-dimensional array). It is invoked as:

D = alpha * func(transpose(A))

The arrays A and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A and D is 2. Applying scaling (the alpha argument) or applying a function to the transpose result is optional. The alpha value is expected to be the same type as A, or as func(transpose(A)), if that differs. Accepted functions which can be applied to the result of the transpose are listed at the end of this section. This Fortran call, besides initialization and setting up cuTENSOR descriptors, maps to cutensorCreatePermutation() and cutensorPermute().

! Example of transpose
D = transpose(A)
! Same example, take the absolute value and scale by 2.5
D = 2.5 * abs(transpose(A))

Fortran Spread

This Fortran function increases the rank of an array by one across the specified dimension and broadcasts the values over the new dimension. It is invoked as:

D = alpha * func(spread(A, dim=i, ncopies=n))

The arrays A and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A and D can be from 1 to 7. The alpha value is expected to be the same type as A. Accepted functions which can be applied to the result of spread are listed at the end of this section. This Fortran call, besides initialization and setting up cuTENSOR descriptors, maps to cutensorCreatePermutation() and cutensorPermute().

! Example to add a new first dimension and broadcast values over it
D = spread(A, dim=1, ncopies=n1)
! Same example, take the absolute value and scale by 2.5
D = 2.5 * abs(spread(A, dim=1, ncopies=n1))

Fortran Element-wise Expressions

There is some limited support for converting expressions involving two or three source arrays into cuTENSOR calls. The first one or two operands can be a permuted array, the result of a call to reshape(), transpose(), or spread(). An elemental function can be applied to the array operands, permuted or not, and they can also be scaled. Here are some supported forms:

D = A + B

D = permute(A) + B

D = A + permute(B)

D = permute(A) - B

D = A - permute(B)

D = A + func(permute(B))

D = func(permute(A)) + permute(B)

D = alpha * func(permute(A)) + beta * permute(B) + gamma * C

The arrays A, B, C, and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A, B, C, and D can be from 1 to 7. For the three-operand case, arrays C and D must have the same shape, strides, and type. The alpha value is expected to be the same type as A. The same applies for beta and B, and gamma and C. The Fortran wrapper does no type conversion, though cuTENSOR may. Compile-time checking of array conformance is limited. Other runtime checks for unsupported combinations may come from either the Fortran wrapper or from cuTENSOR. Accepted functions which can be applied to permuted or unpermuted arrays are listed at the end of this section. These Fortran expressions, besides initialization and setting up cuTENSOR descriptors, map to either cutensorCreateElementwiseBinary() and cutensorElementwiseBinaryExecute(), or cutensorCreateElementwiseTrinary() and cutensorElementwiseTrinaryExecute().

! Example to scale and add two arrays together
D = alpha * A + beta * B
! Same example, take the absolute value of A and B and add to C
D = alpha * abs(A) + beta * abs(B) + C
! Transpose the first array before adding to the second
D = alpha * abs(transpose(A)) + beta * abs(B) + C

Fortran Matmul Operations

Matrix multiplication is one instance of tensor contraction. Either operand to matmul can be a permuted array, the result of a call to reshape(), transpose(), or spread(). The cuTENSOR library does not currently support applying an elemental function to the array operands, but the result and accumulator can be scaled. Here are some supported forms:

D = matmul(A, B)

D = matmul(permute(A), B)

D = matmul(A, permute(B))

D = matmul(permute(A), permute(B))

D = C + matmul(A, B)

D = C - matmul(A, B)

D = alpha * matmul(A, B) + beta * C

The arrays A, B, C, and D can be of type real(2), real(4), real(8), complex(4), or complex(8). The rank (number of dimensions) of A, B, C, and D must be 2, after any permutations. Arrays C and D must currently have the same shape, strides, and type. The alpha value is expected to be the same type as A and B. The beta value should have the same type as C. The Fortran wrapper does no type conversion, though cuTENSOR may. Compile-time checking of array conformance is limited. Other runtime checks for unsupported combinations may come from either the Fortran wrapper or from cuTENSOR. Fortran support for matmul, besides initialization and setting up cuTENSOR descriptors, maps to cutensorCreateContraction() and cutensorContract().

! Example to multiply two matrices together
D = matmul(A, B)
! Same example, accumulate into C
C = C +  matmul(A, B)
! Same example, transpose the first argument
C = C + matmul(transpose(A), B)

On GPUs which support the TF32 type, we provide a way to direct a contraction to use the compute type CUTENSOR_R_MIN_TF32 rather than CUTENSOR_R_MIN_32F for real(4) data (or the complex equivalents for complex(4)). Like the default stream, this is an internal parameter that our runtime maintains. The default opt level is 0; setting the opt level greater than 0 selects CUTENSOR_R_MIN_TF32.

integer function cutensorExSetOptLevel(level)
integer(4) :: level
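
For example, a sketch of enabling TF32 compute for subsequent real(4) contractions and then restoring the default:

istat = cutensorExSetOptLevel(1)   ! allow CUTENSOR_R_MIN_TF32 for real(4)/complex(4)
D = matmul(A, B)                   ! this contraction may now use TF32 compute
istat = cutensorExSetOptLevel(0)   ! restore the default compute type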

Fortran Dot_Product Operations

A dot product is another instance of tensor contraction. In this implementation, we have added a new dim argument to make dot_product more generally applicable to higher-order arrays. It follows the matmul features in the previous section very closely. Either of the two operands to dot_product can be a permuted array, the result of a call to reshape(), transpose(), or spread(). Note that only reshape() can generate a 1-D array; the other calls can be used along with the dim argument. The cuTENSOR library does not currently support applying an elemental function to the array operands, but the result and accumulator can be scaled. Here are some supported forms:

X = dot_product(A, B)

X = dot_product(reshape(A,shape=[n]), B)

X = dot_product(A, reshape(B,shape=[n]))

D = dot_product(permute(A), permute(B),dim=i)

D = C + dot_product(A, B, dim=i)

D = C - dot_product(A, B, dim=i)

D = C + alpha * dot_product(A, B, dim=i)

The arrays A, B, C, and D and the scalar X can be of type real(2), real(4), real(8), complex(4), or complex(8). Arrays C and D must currently have the same shape, strides, and type. For the case using the dim argument, the rank of D and C is one less than that of A and B. The alpha value is expected to be the same type as A and B. The Fortran wrapper does no type conversion, though cuTENSOR may. Compile-time checking of array conformance is limited. Other runtime checks for unsupported combinations may come from either the Fortran wrapper or from cuTENSOR. Fortran support for dot_product, besides initialization and setting up cuTENSOR descriptors, maps to cutensorCreateContraction() and cutensorContract().
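
Following the pattern of the previous sections, a brief illustration; the array shapes are assumed to be conformable:

! Simple dot product of two 1-D arrays
X = dot_product(A, B)
! Reduce over the first dimension of two 2-D arrays, accumulating into C
D = C + dot_product(A, B, dim=1)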

Supported Element-wise Functions

Of the element-wise operators defined in the cuTENSOR definitions listed above, several are supported from the high-level interface. The cuTENSOR library does not currently support many of the listed functions for complex data; consult the cuTENSOR documentation for the latest information. Here are the functions with at least some level of Fortran interface support:

SQRT RELU CONJG RCP SIGMOID

TANH EXP LOG ABS NEG

SIN COS TAN SINH COSH

ASIN ACOS ATAN ASINH ACOSH

ATANH CEIL FLOOR

Note that the C complex conjugate function conj is spelled conjg in Fortran. Also, the ceil and floor functions differ between C and Fortran in their return type; we have kept the C spelling and behavior (returning a real, not an integer).

Only ABS, CEIL, CONJG, COS, SIN can be used on bare arrays in the current implementation. All functions can be applied to the result of a permutation. Use reshape(A, shape=shape(A)) as a NOP to put the bare array into a form that can currently be recognized.
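
For example, a sketch of applying sqrt, which is not accepted on a bare array, by routing the array through a no-op reshape:

! sqrt is not accepted on a bare array; route it through a no-op reshape
D = sqrt(reshape(A, shape=shape(A)))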

Users will find that binary or trinary operations such as:

D = sin(A) + cos(B) + C

will not perform as well as kernels written and compiled for that specific operation (using CUDA, CUDA Fortran, or OpenACC) due to overhead in the cuTENSOR kernels needed for generally applying functions to the operands. If the operation permutes the operands, such as:

D = sin(permute(A)) + cos(permute(B)) + C

users may see good performance compared to other naive implementations depending on how complex the permutations on the input arrays are.