cuTENSORMg API

General

The general functions initialize the library and create tensor descriptors.


cutensorMgHostDevice_t

enum cutensorMgHostDevice_t

Brief

Enumerated device codes for host-side tensors

Values:

enumerator CUTENSOR_MG_DEVICE_HOST

The memory is located on the host in regular memory.

enumerator CUTENSOR_MG_DEVICE_HOST_PINNED

The memory is located on the host in pinned memory.


cutensorMgHandle_t

typedef struct cutensorMgHandle_s *cutensorMgHandle_t

Brief

Encodes the devices that participate in operations

Details

The handle contains information about each device that participates in operations as well as host threads to orchestrate host-to-device operations.


cutensorMgTensorDescriptor_t

typedef struct cutensorMgTensorDescriptor_s *cutensorMgTensorDescriptor_t

Brief

Represents a tensor that may be distributed

Details

The tensor is laid out in a block-cyclic fashion across devices. It may either be fully located on the host, or distributed across multiple devices.


cutensorMgCreate()

cutensorStatus_t cutensorMgCreate(cutensorMgHandle_t *handle, uint32_t numDevices, const int32_t devices[])

Remark

blocking, non-reentrant, and thread-safe

Brief

Create a library handle

Details

The handle contains information about the devices that should be participating in calculations. All devices that hold any tensor data or participate in any of cuTENSORMg’s operations should also be included in the handle. Each device may only occur once in the list. It is advisable that all devices are identical (i.e., have the same peak performance) to avoid load-balancing issues, and are connected via NVLink to avoid costly device-host-device transfers. This call will enable peering between all devices that have been passed to it, if possible.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[out] The resulting library handle.

  • numDevices[in] The number of devices participating in all subsequent computations.

  • devices[in] The devices that participate in all computations.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgDestroy()

cutensorStatus_t cutensorMgDestroy(cutensorMgHandle_t handle)

Remark

blocking, non-reentrant, and thread-safe

Brief

Destroy a library handle

Details

All outstanding operations must be completed before calling this function. The function frees all associated resources. Any descriptors or plans created with the handle become invalid and may only be destroyed.

Returns

A status code indicating the success or failure of the operation

Parameters:

handle[in] The handle to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCreateTensorDescriptor()

cutensorStatus_t cutensorMgCreateTensorDescriptor(cutensorMgHandle_t handle, cutensorMgTensorDescriptor_t *desc, uint32_t numModes, const int64_t extent[], const int64_t elementStride[], const int64_t blockSize[], const int64_t blockStride[], const int32_t deviceCount[], uint32_t numDevices, const int32_t devices[], cudaDataType_t type)

Brief

Create a tensor descriptor

Details

A tensor descriptor fully specifies the data layout of a (potentially) distributed tensor. It does so mainly through five pieces of data: the extent, the element stride, the block size, the block stride, and the device count.

The extent describes the total size of each tensor mode. For example, a 9 by 9 matrix would have an extent of 9 and 9.

The block size describes how the data is blocked. For example, with a block size of 4 by 2, there would be three blocks in the first and five blocks in the second mode.

  4 x 2    4 x 2    1 x 2
  4 x 2    4 x 2    1 x 2
  4 x 2    4 x 2    1 x 2
  4 x 2    4 x 2    1 x 2
  4 x 1    4 x 1    1 x 1
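The block counts and edge-block sizes in the 9 by 9 example follow from a ceiling division per mode. A minimal sketch of that arithmetic is shown here; `num_blocks` and `block_extent` are hypothetical helper names for illustration and are not part of cuTENSORMg.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of blocks needed to cover `extent` elements
 * in one mode with blocks of `blockSize` elements (ceiling division). */
int64_t num_blocks(int64_t extent, int64_t blockSize)
{
    assert(extent > 0 && blockSize > 0);
    return (extent + blockSize - 1) / blockSize;
}

/* Hypothetical helper: size of block `b` (0-based) in one mode.
 * Only the last block of a mode can be smaller than `blockSize`. */
int64_t block_extent(int64_t extent, int64_t blockSize, int64_t b)
{
    assert(b >= 0 && b < num_blocks(extent, blockSize));
    int64_t remaining = extent - b * blockSize;
    return remaining < blockSize ? remaining : blockSize;
}
```

For an extent of 9, a block size of 4 gives three blocks of sizes 4, 4, and 1, and a block size of 2 gives five blocks of sizes 2, 2, 2, 2, and 1.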

The device count then describes how many devices the blocks are distributed across in that mode. A device count of 2 by 2, for example, would mean that the blocks are distributed across two devices in each mode, i.e., four devices total. The devices are arranged first along the first and then the second mode as follows:

  Dev. 0    Dev. 2
  Dev. 1    Dev. 3

In particular:

  • Device 0 owns the first and third blocks in the first dimension and the first, third, and fifth blocks in the second dimension (six blocks in total).

  • Device 1 owns the first and third blocks in the first dimension and the second and fourth blocks in the second dimension (four blocks in total).

  • Device 2 owns the second block in the first dimension and the first, third, and fifth blocks in the second dimension (three blocks in total).

  • Device 3 owns the second block in the first dimension and the second and fourth blocks in the second dimension (two blocks in total).

  Dev. 0    Dev. 2    Dev. 0
  Dev. 1    Dev. 3    Dev. 1
  Dev. 0    Dev. 2    Dev. 0
  Dev. 1    Dev. 3    Dev. 1
  Dev. 0    Dev. 2    Dev. 0
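The per-device block totals in this example can be reproduced by counting, mode by mode, how many blocks a cyclic distribution assigns to each position in the device grid. The sketch below works with device grid coordinates rather than device IDs; `blocks_on_coord` and `blocks_owned` are hypothetical helpers, not library functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of blocks of one mode that land on device
 * coordinate `coord` when `numBlocks` blocks are dealt out cyclically
 * to `deviceCount` devices (blocks coord, coord + deviceCount, ...). */
int64_t blocks_on_coord(int64_t numBlocks, int32_t deviceCount, int32_t coord)
{
    assert(numBlocks >= 0 && deviceCount > 0);
    assert(coord >= 0 && coord < deviceCount);
    if (coord >= numBlocks)
        return 0;
    return (numBlocks - coord + deviceCount - 1) / deviceCount;
}

/* Hypothetical helper: total number of blocks owned by the device at grid
 * position coord[0..numModes-1], i.e. the product of the per-mode counts. */
int64_t blocks_owned(uint32_t numModes, const int64_t numBlocks[],
                     const int32_t deviceCount[], const int32_t coord[])
{
    int64_t total = 1;
    for (uint32_t m = 0; m < numModes; ++m)
        total *= blocks_on_coord(numBlocks[m], deviceCount[m], coord[m]);
    return total;
}
```

With three blocks in the first mode, five in the second, and a device count of 2 by 2, the four grid positions own 6, 4, 3, and 2 blocks, which matches the totals stated for the example.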

The element stride and block stride then describe how the blocks are laid out on the individual devices, i.e., the distance between elements and between blocks in that mode. Finally, the devices array describes which device the blocks are mapped to. Here, it is permissible to specify CUTENSOR_MG_DEVICE_HOST to express that those blocks are located on the host. A tensor must either be located fully on-device or fully on-host.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[out] The resulting tensor descriptor.

  • numModes[in] The number of modes.

  • extent[in] The extent of the tensor in each mode (array of size numModes).

  • elementStride[in] The offset (in linear memory) between two adjacent elements in each mode (array of size numModes), may be NULL for a dense tensor.

  • blockSize[in] The size of a block in each mode (array of size numModes), may be NULL for an unblocked tensor (i.e., each mode only has a single block that is equal to its extent).

  • blockStride[in] The offset (in linear memory) between two adjacent blocks in each mode (array of size numModes), may be NULL for a dense block-interleaved layout.

  • deviceCount[in] The number of devices that each mode is distributed across in a block-cyclic fashion (array of size numModes), may be NULL for a non-distributed tensor.

  • numDevices[in] The total number of devices that the tensor is distributed across (i.e., the product of all elements in deviceCount).

  • devices[in] The devices that the blocks are distributed across, in column-major order, i.e., stride 1 first (array of size numDevices).

  • type[in] The data type of the tensor.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This layout or data type is not supported.
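Because numDevices must equal the product of the entries of deviceCount, it can be useful to derive it rather than hard-code it. The following is a minimal sketch; `implied_num_devices` is a hypothetical helper, and treating a NULL deviceCount as a single device is an assumption based on the "non-distributed tensor" wording above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of cuTENSORMg): the numDevices value implied
 * by deviceCount, i.e. the product of its entries. A NULL deviceCount is
 * treated as a non-distributed tensor, assumed here to mean one device. */
uint32_t implied_num_devices(uint32_t numModes, const int32_t deviceCount[])
{
    uint32_t product = 1;
    if (deviceCount == NULL)
        return product;
    for (uint32_t m = 0; m < numModes; ++m) {
        assert(deviceCount[m] > 0);
        product *= (uint32_t)deviceCount[m];
    }
    return product;
}
```

For the 2 by 2 device count of the example above this yields 4, which is also the required length of the devices array.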


cutensorMgDestroyTensorDescriptor()

cutensorStatus_t cutensorMgDestroyTensorDescriptor(cutensorMgTensorDescriptor_t desc)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Destroy a tensor descriptor

Returns

A status code indicating the success or failure of the operation

Parameters:

desc[in] The descriptor to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


Copy Operations

The following types and functions perform copies between tensors.


cutensorMgCopyDescriptor_t

typedef struct cutensorMgCopyDescriptor_s *cutensorMgCopyDescriptor_t

Brief

Describes the copy of a tensor from one data layout to another

Details

It may describe the full Cartesian product of copies from and to host, a single device, and multiple devices, as well as permutations and layout changes.


cutensorMgCopyPlan_t

typedef struct cutensorMgCopyPlan_s *cutensorMgCopyPlan_t

Brief

Describes a specific way to implement the copy operation

Details

It encodes blockings and other implementation details, and may be reused to reduce planning overhead.


cutensorMgCreateCopyDescriptor()

cutensorStatus_t cutensorMgCreateCopyDescriptor(const cutensorMgHandle_t handle, cutensorMgCopyDescriptor_t *desc, const cutensorMgTensorDescriptor_t descDst, const int32_t modesDst[], const cutensorMgTensorDescriptor_t descSrc, const int32_t modesSrc[])

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Create a copy descriptor

Details

A copy descriptor encodes the source and the destination for a copy operation. The copy operation supports tensors on host, single, or multiple devices. It also supports layout changes and mode permutations. The only restriction is that the extents of the corresponding modes (in the input and output tensors) must match.
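The extent restriction can be checked on the application side before creating a copy descriptor. The sketch below matches modes by label, so a permuted destination passes as long as each label keeps its extent; `copy_extents_match` is a hypothetical helper, and it assumes both tensors have the same number of modes with unique labels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of cuTENSORMg): returns true when every mode
 * label in modesDst also appears in modesSrc with the same extent.
 * Assumes numModes entries in each array and unique labels per tensor. */
bool copy_extents_match(uint32_t numModes,
                        const int32_t modesDst[], const int64_t extentDst[],
                        const int32_t modesSrc[], const int64_t extentSrc[])
{
    assert(modesDst != NULL && extentDst != NULL);
    assert(modesSrc != NULL && extentSrc != NULL);
    for (uint32_t i = 0; i < numModes; ++i) {
        bool found = false;
        for (uint32_t j = 0; j < numModes; ++j) {
            if (modesSrc[j] == modesDst[i]) {
                if (extentSrc[j] != extentDst[i])
                    return false;   /* same label, different extent */
                found = true;
                break;
            }
        }
        if (!found)
            return false;           /* label missing from the source */
    }
    return true;
}
```

For example, a source with modes a, b, c and extents 4, 8, 2 is compatible with a destination with modes c, a, b and extents 2, 4, 8, but not with extents 2, 8, 4.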

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[out] The resulting copy descriptor.

  • descDst[in] The destination tensor descriptor.

  • modesDst[in] The destination tensor modes.

  • descSrc[in] The source tensor descriptor.

  • modesSrc[in] The source tensor modes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgDestroyCopyDescriptor()

cutensorStatus_t cutensorMgDestroyCopyDescriptor(cutensorMgCopyDescriptor_t desc)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Destroy a copy descriptor and free all its previously-allocated resources.

Returns

A status code indicating the success or failure of the operation

Parameters:

desc[in] The descriptor to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCopyGetWorkspace()

cutensorStatus_t cutensorMgCopyGetWorkspace(const cutensorMgHandle_t handle, const cutensorMgCopyDescriptor_t desc, int64_t deviceWorkspaceSize[], int64_t *hostWorkspaceSize)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Computes the workspace that is needed for the copy

Details

The function calculates the minimum workspace required for the copy operation to succeed. It returns the device workspace size in the same order as the devices are passed to the library handle.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[in] The copy descriptor.

  • deviceWorkspaceSize[out] The workspace size in bytes, for each device in the handle.

  • hostWorkspaceSize[out] The workspace size in bytes for pinned host memory.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCreateCopyPlan()

cutensorStatus_t cutensorMgCreateCopyPlan(const cutensorMgHandle_t handle, cutensorMgCopyPlan_t *plan, const cutensorMgCopyDescriptor_t desc, const int64_t deviceWorkspaceSize[], int64_t hostWorkspaceSize)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Create a copy plan

Details

A copy plan implements the copy operation expressed through the copy descriptor. It contains all the information needed to execute a copy operation. Planning may fail if insufficient workspace is provided.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[out] The resulting copy plan.

  • desc[in] The copy descriptor.

  • deviceWorkspaceSize[in] The amount of workspace that will be provided, for each device in the handle.

  • hostWorkspaceSize[in] The amount of pinned host workspace that will be provided.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgDestroyCopyPlan()

cutensorStatus_t cutensorMgDestroyCopyPlan(cutensorMgCopyPlan_t plan)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Destroy a copy plan

Details

All outstanding operations must be completed before this function is called. It frees all associated resources.

Returns

A status code indicating the success or failure of the operation

Parameters:

plan[in] The plan to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCopy()

cutensorStatus_t cutensorMgCopy(const cutensorMgHandle_t handle, const cutensorMgCopyPlan_t plan, void *ptrDst[], const void *ptrSrc[], void *deviceWorkspace[], void *hostWorkspace, cudaStream_t streams[])

Remark

calls asynchronous functions, conditionally blocking, non-reentrant, and conditionally thread-safe

Brief

Execute a copy operation

Details

Executes a copy operation according to the given plan. It receives the source and destination pointers in the order prescribed by the devices parameter of the respective tensor descriptor, and the device workspace and streams in the order prescribed by the devices parameter of the handle. If host transfers are involved in the execution, the function blocks until those host transfers have completed. The function is thread-safe as long as concurrent threads use different library handles.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[in] The copy plan.

  • ptrDst[out] The destination tensor pointers.

  • ptrSrc[in] The source tensor pointers.

  • deviceWorkspace[out] The device workspace.

  • hostWorkspace[out] The host pinned memory workspace.

  • streams[in] The execution streams.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_CUDA_ERROR – An issue interacting with the CUDA runtime occurred.


cutensorContractionMaxAlgos()

cutensorStatus_t cutensorContractionMaxAlgos(int32_t *maxNumAlgos)

Brief

This routine returns the maximum number of algorithms available to compute tensor contractions

[NOTE] Not all algorithms might be applicable to your specific problem. cutensorContraction() will return CUTENSOR_STATUS_NOT_SUPPORTED if an algorithm is not applicable.

Parameters:

maxNumAlgos[out] This value will hold the maximum number of algorithms available for cutensorContraction(). You can use the returned integer for auto-tuning purposes (i.e., iterate over all algorithms up to the returned value).

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input data is invalid (this typically indicates a user error).

Contraction Operations

The following types and functions implement contraction operations.


cutensorMgContractionDescriptor_t

typedef struct cutensorMgContractionDescriptor_s *cutensorMgContractionDescriptor_t

Brief

Describes the contraction of two tensors into a third tensor with an optional source

Details

Only supports device-side tensors.


cutensorMgContractionFind_t

typedef struct cutensorMgContractionFind_s *cutensorMgContractionFind_t

Brief

Describes the algorithmic details of implementing a tensor contraction


cutensorMgContractionPlan_t

typedef struct cutensorMgContractionPlan_s *cutensorMgContractionPlan_t

Brief

Describes a specific way to implement a contraction operation

Details

It encodes blockings, permutations and other implementation details, and may be reused to reduce planning overhead.


cutensorMgAlgo_t

enum cutensorMgAlgo_t

Brief

Represents the selected algorithm when planning for a contraction operation

Values:

enumerator CUTENSORMG_ALGO_DEFAULT

Lets the internal heuristic choose.


cutensorMgCreateContractionDescriptor()

cutensorStatus_t cutensorMgCreateContractionDescriptor(const cutensorMgHandle_t handle, cutensorMgContractionDescriptor_t *desc, const cutensorMgTensorDescriptor_t descA, const int32_t modesA[], const cutensorMgTensorDescriptor_t descB, const int32_t modesB[], const cutensorMgTensorDescriptor_t descC, const int32_t modesC[], const cutensorMgTensorDescriptor_t descD, const int32_t modesD[], cutensorComputeType_t compute)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Create a contraction descriptor

Details

A contraction descriptor encodes the operands for a contraction operation of the form

\[ \mathcal{D} = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C}. \]

The contraction operation presently supports tensors that are located on one or multiple devices, but does not (for now) support tensors stored on the host. It uses the Einstein notation, i.e., modes that are shared only between modesA and modesB (and hence do not appear in the output) are contracted. Currently, descC and descD as well as modesC and modesD must be identical. The compute type represents the lowest precision that may be used in the course of the calculation.
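To make the Einstein-notation rule concrete, the sketch below lists the modes that would be contracted for given mode arrays: those present in both modesA and modesB but absent from modesD. The helpers `has_mode` and `contracted_modes` are hypothetical and only illustrate the rule; they are not part of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if `mode` occurs in modes[0..numModes-1]. */
bool has_mode(const int32_t modes[], uint32_t numModes, int32_t mode)
{
    for (uint32_t i = 0; i < numModes; ++i)
        if (modes[i] == mode)
            return true;
    return false;
}

/* Hypothetical helper: writes the contracted modes (in A and B, not in D)
 * to `contracted`, which must hold at least numModesA entries, and returns
 * how many were written. */
uint32_t contracted_modes(const int32_t modesA[], uint32_t numModesA,
                          const int32_t modesB[], uint32_t numModesB,
                          const int32_t modesD[], uint32_t numModesD,
                          int32_t contracted[])
{
    assert(contracted != NULL);
    uint32_t count = 0;
    for (uint32_t i = 0; i < numModesA; ++i) {
        int32_t mode = modesA[i];
        if (has_mode(modesB, numModesB, mode) &&
            !has_mode(modesD, numModesD, mode))
            contracted[count++] = mode;
    }
    return count;
}
```

For a matrix-multiplication-like contraction with modesA = {m, k}, modesB = {k, n}, and modesD = {m, n}, only k is contracted. A mode that appears in all three operands, such as a batch mode, is not contracted under this rule.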

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[out] The resulting tensor contraction descriptor.

  • descA[in] The tensor descriptor for operand A.

  • modesA[in] The modes for operand A.

  • descB[in] The tensor descriptor for operand B.

  • modesB[in] The modes for operand B.

  • descC[in] The tensor descriptor for operand C.

  • modesC[in] The modes for operand C.

  • descD[in] The tensor descriptor for operand D.

  • modesD[in] The modes for operand D.

  • compute[in] The compute type for the operation.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgDestroyContractionDescriptor()

cutensorStatus_t cutensorMgDestroyContractionDescriptor(cutensorMgContractionDescriptor_t desc)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Destroy a contraction descriptor

Returns

A status code indicating the success or failure of the operation

Parameters:

desc[in] The descriptor to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCreateContractionFind()

cutensorStatus_t cutensorMgCreateContractionFind(const cutensorMgHandle_t handle, cutensorMgContractionFind_t *find, const cutensorMgAlgo_t algo)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Create a contraction find

Details

The contraction find contains all the algorithmic options to execute a tensor contraction. For now, its only parameter is an algorithm, which currently only has one default value. It may gain additional options in the future.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • find[out] The resulting find.

  • algo[in] The desired algorithm.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgDestroyContractionFind()

cutensorStatus_t cutensorMgDestroyContractionFind(cutensorMgContractionFind_t find)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Destroy a contraction find

Returns

A status code indicating the success or failure of the operation

Parameters:

find[in] The find to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgContractionGetWorkspace()

cutensorStatus_t cutensorMgContractionGetWorkspace(const cutensorMgHandle_t handle, const cutensorMgContractionDescriptor_t desc, const cutensorMgContractionFind_t find, cutensorWorksizePreference_t preference, int64_t deviceWorkspaceSize[], int64_t *hostWorkspaceSize)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Computes the workspace that is needed for the contraction

Details

The function calculates the workspace required for the contraction operation to succeed. It takes a workspace preference, which can tune how much workspace is needed. It returns the device workspace size in the same order as the devices are passed to the library handle.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[in] The contraction descriptor.

  • find[in] The contraction find.

  • preference[in] The workspace preference.

  • deviceWorkspaceSize[out] The amount of workspace in bytes, for each device in the handle.

  • hostWorkspaceSize[out] The amount of pinned host memory in bytes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgCreateContractionPlan()

cutensorStatus_t cutensorMgCreateContractionPlan(const cutensorMgHandle_t handle, cutensorMgContractionPlan_t *plan, const cutensorMgContractionDescriptor_t desc, const cutensorMgContractionFind_t find, const int64_t deviceWorkspaceSize[], int64_t hostWorkspaceSize)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Create a contraction plan

Details

A contraction plan implements the contraction operation expressed through the contraction descriptor in accordance with the options specified in the contraction find. It contains all the information needed to execute a contraction operation. Planning may fail if insufficient workspace is provided.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[out] The resulting contraction plan.

  • desc[in] The contraction descriptor.

  • find[in] The contraction find.

  • deviceWorkspaceSize[in] The amount of workspace in bytes, for each device in the handle.

  • hostWorkspaceSize[in] The amount of pinned host memory in bytes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgDestroyContractionPlan()

cutensorStatus_t cutensorMgDestroyContractionPlan(cutensorMgContractionPlan_t plan)

Remark

non-blocking, non-reentrant, and thread-safe

Brief

Destroy a contraction plan

Details

All outstanding operations must be completed before this function is called. It frees all associated resources.

Returns

A status code indicating the success or failure of the operation

Parameters:

plan[in] The plan to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgContraction()

cutensorStatus_t cutensorMgContraction(const cutensorMgHandle_t handle, const cutensorMgContractionPlan_t plan, const void *alpha, const void *ptrA[], const void *ptrB[], const void *beta, const void *ptrC[], void *ptrD[], void *deviceWorkspace[], void *hostWorkspace, cudaStream_t streams[])

Remark

calls asynchronous functions, non-blocking, non-reentrant, and conditionally thread-safe

Brief

Execute a contraction operation

Details

Executes a contraction operation according to the provided plan. It receives all the operands as arrays of pointers that are ordered according to their tensor descriptors’ devices parameter. The device workspace and streams are ordered according to the library handle’s devices parameter. The function is thread-safe as long as concurrent threads use different library handles.

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[in] The contraction plan.

  • alpha[in] The alpha scaling factor (host pointer).

  • ptrA[in] The A operand tensor pointers.

  • ptrB[in] The B operand tensor pointers.

  • beta[in] The beta scaling factor (host pointer).

  • ptrC[in] The C operand tensor pointers.

  • ptrD[out] The D operand tensor pointers.

  • deviceWorkspace[out] The device workspace.

  • hostWorkspace[out] The host pinned memory workspace.

  • streams[in] The execution streams.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_CUDA_ERROR – An issue interacting with the CUDA runtime occurred.