API Reference - cuTENSORMg

General

The general functions initialize the library and create tensor descriptors.


cutensorMgHostDevice_t

enum cutensorMgHostDevice_t

Special device codes that designate memory located on the host.

Values:

enumerator CUTENSOR_MG_DEVICE_HOST

The memory is located on the host in regular memory.

enumerator CUTENSOR_MG_DEVICE_HOST_PINNED

The memory is located on the host in pinned memory.


cutensorMgHandle_t

typedef struct cutensorMgHandle_s *cutensorMgHandle_t

Encodes the devices that participate in operations.

The handle contains information about each device that participates in operations as well as host threads to orchestrate host-to-device operations.


cutensorMgTensorDescriptor_t

typedef struct cutensorMgTensorDescriptor_s *cutensorMgTensorDescriptor_t

Represents a tensor that may be distributed.

The tensor is laid out in a block-cyclic fashion across devices. It may either be fully located on the host, or distributed across multiple devices.


cutensorMgCreate()

cutensorStatus_t cutensorMgCreate(cutensorMgHandle_t *handle, uint32_t numDevices, const int32_t devices[])

Create a library handle.

The handle contains information about the devices that participate in calculations. All devices that hold any tensor data or take part in any of cuTENSORMg's operations must be included in the handle. Each device may only occur once in the list. It is advisable that all devices be identical (i.e., have the same peak performance) to avoid load-balancing issues, and that they be connected via NVLink to avoid costly device-host-device transfers. This call enables peer access between all devices passed to it, where possible.
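As a minimal sketch of handle creation and teardown (using the signatures documented in this reference; the device ordinals and the header name are assumptions for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <cutensorMg.h>  /* cuTENSORMg header; exact name may vary by install */

int main(void)
{
    /* Example: two GPUs with ordinals 0 and 1 (adjust to your system). */
    const int32_t devices[] = {0, 1};
    const uint32_t numDevices = 2;

    cutensorMgHandle_t handle;
    cutensorStatus_t status = cutensorMgCreate(&handle, numDevices, devices);
    if (status != CUTENSOR_STATUS_SUCCESS) {
        fprintf(stderr, "cutensorMgCreate failed: %d\n", (int)status);
        return 1;
    }

    /* ... create descriptors and plans, run operations ... */

    /* All outstanding operations must have completed before destroying. */
    cutensorMgDestroy(handle);
    return 0;
}
```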

Remark

blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[out] The resulting library handle.

  • numDevices[in] The number of devices participating in all subsequent computations.

  • devices[in] The devices that participate in all computations.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgDestroy()

cutensorStatus_t cutensorMgDestroy(cutensorMgHandle_t handle)

Destroy a library handle.

All outstanding operations must be completed before calling this function. It frees all associated resources. Any descriptors or plans created with the handle become invalid and may only be destroyed.

Remark

blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

handle[in] The handle to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCreateTensorDescriptor()

cutensorStatus_t cutensorMgCreateTensorDescriptor(cutensorMgHandle_t handle, cutensorMgTensorDescriptor_t *desc, uint32_t numModes, const int64_t extent[], const int64_t elementStride[], const int64_t blockSize[], const int64_t blockStride[], const int32_t deviceCount[], uint32_t numDevices, const int32_t devices[], cudaDataType_t type)

Create a tensor descriptor.

A tensor descriptor fully specifies the data layout of a (potentially) distributed tensor. It does so mainly through five pieces of data: the extent, the element stride, the block size, the block stride, and the device count.

The extent describes the total size of each tensor mode. For example, a 9 by 9 matrix would have an extent of 9 in each of its two modes.

The block size describes how the data is blocked. For example, with a block size of 4 by 2, there would be three blocks in the first mode and five blocks in the second mode. The resulting block sizes are (first mode across, second mode down):

4 x 2    4 x 2    1 x 2
4 x 2    4 x 2    1 x 2
4 x 2    4 x 2    1 x 2
4 x 2    4 x 2    1 x 2
4 x 1    4 x 1    1 x 1

The device count then describes how many devices the blocks are distributed across in each mode. A device count of 2 by 2, for example, means that the blocks are distributed across two devices in each mode, i.e., across four devices in total. The devices are arranged first along the first and then along the second mode as follows:

Dev. 0    Dev. 2
Dev. 1    Dev. 3

In particular:

  • Device 0 owns the first and third block in the first mode, and the first, third, and fifth block in the second mode (six blocks in total).

  • Device 1 owns the first and third block in the first mode, and the second and fourth block in the second mode (four blocks in total).

  • Device 2 owns the second block in the first mode, and the first, third, and fifth block in the second mode (three blocks in total).

  • Device 3 owns the second block in the first mode, and the second and fourth block in the second mode (two blocks in total).

The resulting block-to-device assignment is:

Dev. 0    Dev. 2    Dev. 0
Dev. 1    Dev. 3    Dev. 1
Dev. 0    Dev. 2    Dev. 0
Dev. 1    Dev. 3    Dev. 1
Dev. 0    Dev. 2    Dev. 0

The element stride and block stride then describe how the blocks are laid out on the individual devices, i.e., the distance (in linear memory) between adjacent elements and adjacent blocks in each mode. Finally, the devices array describes which device each block is mapped to. Here, it is permissible to specify CUTENSOR_MG_DEVICE_HOST to express that the blocks are located on the host. A tensor must be located either fully on-device or fully on-host.
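The block-cyclic distribution above can be reproduced with a small, library-independent Python sketch. The `owner` helper and its device-numbering formula are illustrative assumptions that match the figures in the text, not the library's internal implementation:

```python
import math

# Example from the text: a 9 x 9 matrix, blocked 4 x 2, distributed
# block-cyclically across a 2 x 2 device grid (devices 0..3).
extent = (9, 9)
block_size = (4, 2)
device_count = (2, 2)

# Number of blocks per mode: ceil(extent / blockSize).
num_blocks = tuple(math.ceil(e / b) for e, b in zip(extent, block_size))
assert num_blocks == (3, 5)  # three blocks in mode 1, five in mode 2

def block_extent(mode, idx):
    """Size of block `idx` along `mode` (the last block may be partial)."""
    return min(block_size[mode], extent[mode] - idx * block_size[mode])

def owner(i1, i2):
    """Device owning block (i1, i2), matching the assignment shown above.

    The numbering formula is an assumption for illustration."""
    return 2 * (i1 % device_count[0]) + (i2 % device_count[1])

# Count blocks per device; the text says 6, 4, 3, and 2 blocks.
counts = [0, 0, 0, 0]
for i1 in range(num_blocks[0]):
    for i2 in range(num_blocks[1]):
        counts[owner(i1, i2)] += 1
print(counts)  # -> [6, 4, 3, 2]
```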

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[out] The resulting tensor descriptor.

  • numModes[in] The number of modes.

  • extent[in] The extent of the tensor in each mode (array of size numModes).

  • elementStride[in] The offset (in linear memory) between two adjacent elements in each mode (array of size numModes), may be NULL for a dense tensor.

  • blockSize[in] The size of a block in each mode (array of size numModes), may be NULL for an unblocked tensor (i.e., each mode only has a single block that is equal to its extent).

  • blockStride[in] The offset (in linear memory) between two adjacent blocks in each mode (array of size numModes), may be NULL for a dense block-interleaved layout.

  • deviceCount[in] The number of devices that each mode is distributed across in a block-cyclic fashion (array of size numModes), may be NULL for a non-distributed tensor.

  • numDevices[in] The total number of devices that the tensor is distributed across (i.e., the product of all elements in deviceCount).

  • devices[in] The devices that the blocks are distributed across, in column-major order, i.e., stride 1 first (array of size numDevices).

  • type[in] The data type of the tensor.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This layout or data type is not supported.
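The 9 by 9 example above can be sketched as a descriptor-creation call. The device ordinals, the float data type, and the helper name are assumptions for illustration:

```c
#include <stdint.h>
#include <stddef.h>
#include <cuda_runtime.h>   /* for cudaDataType_t / CUDA_R_32F */
#include <cutensorMg.h>

/* Sketch: describe the 9 x 9 matrix from the example, blocked 4 x 2 and
 * distributed block-cyclically across a 2 x 2 device grid. */
cutensorStatus_t makeExampleDescriptor(cutensorMgHandle_t handle,
                                       cutensorMgTensorDescriptor_t *desc)
{
    const int64_t extent[]      = {9, 9};   /* total size per mode      */
    const int64_t blockSize[]   = {4, 2};   /* yields 3 x 5 blocks      */
    const int32_t deviceCount[] = {2, 2};   /* two devices per mode     */
    const int32_t devices[]     = {0, 1, 2, 3};  /* example ordinals    */

    /* NULL elementStride/blockStride request a dense, block-interleaved
     * layout, as documented above. */
    return cutensorMgCreateTensorDescriptor(
        handle, desc,
        /*numModes=*/2, extent,
        /*elementStride=*/NULL, blockSize, /*blockStride=*/NULL,
        deviceCount, /*numDevices=*/4, devices, CUDA_R_32F);
}
```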


cutensorMgDestroyTensorDescriptor()

cutensorStatus_t cutensorMgDestroyTensorDescriptor(cutensorMgTensorDescriptor_t desc)

Destroy a tensor descriptor.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

desc[in] The descriptor to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


Copy Operations

The following types and functions perform copies between tensors.


cutensorMgCopyDescriptor_t

typedef struct cutensorMgCopyDescriptor_s *cutensorMgCopyDescriptor_t

Describes the copy of a tensor from one data layout to another.

It may describe the full Cartesian product of copies from and to the host, a single device, and multiple devices, as well as permutations and layout changes.


cutensorMgCopyPlan_t

typedef struct cutensorMgCopyPlan_s *cutensorMgCopyPlan_t

Describes a specific way to implement the copy operation.

It encodes blockings and other implementation details, and may be reused to reduce planning overhead.


cutensorMgCreateCopyDescriptor()

cutensorStatus_t cutensorMgCreateCopyDescriptor(const cutensorMgHandle_t handle, cutensorMgCopyDescriptor_t *desc, const cutensorMgTensorDescriptor_t descDst, const int32_t modesDst[], const cutensorMgTensorDescriptor_t descSrc, const int32_t modesSrc[])

Create a copy descriptor.

A copy descriptor encodes the source and the destination for a copy operation. The copy operation supports tensors located on the host, on a single device, or distributed across multiple devices. It also supports layout changes and mode permutations. The only restriction is that the extents of corresponding modes (in the source and destination tensors) must match.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[out] The resulting copy descriptor.

  • descDst[in] The destination tensor descriptor.

  • modesDst[in] The destination tensor modes.

  • descSrc[in] The source tensor descriptor.

  • modesSrc[in] The source tensor modes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgDestroyCopyDescriptor()

cutensorStatus_t cutensorMgDestroyCopyDescriptor(cutensorMgCopyDescriptor_t desc)

Destroy a copy descriptor and free all its previously allocated resources.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

desc[in] The descriptor to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCopyGetWorkspace()

cutensorStatus_t cutensorMgCopyGetWorkspace(const cutensorMgHandle_t handle, const cutensorMgCopyDescriptor_t desc, int64_t deviceWorkspaceSize[], int64_t *hostWorkspaceSize)

Computes the workspace that is needed for the copy.

The function calculates the minimum workspace required for the copy operation to succeed. It returns the device workspace size in the same order as the devices are passed to the library handle.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[in] The copy descriptor.

  • deviceWorkspaceSize[out] The workspace size in bytes, for each device in the handle.

  • hostWorkspaceSize[out] The workspace size in bytes for pinned host memory.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCreateCopyPlan()

cutensorStatus_t cutensorMgCreateCopyPlan(const cutensorMgHandle_t handle, cutensorMgCopyPlan_t *plan, const cutensorMgCopyDescriptor_t desc, const int64_t deviceWorkspaceSize[], int64_t hostWorkspaceSize)

Create a copy plan.

A copy plan implements the copy operation expressed through the copy descriptor. It contains all the information needed to execute a copy operation. Planning may fail if insufficient workspace is provided.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[out] The resulting copy plan.

  • desc[in] The copy descriptor.

  • deviceWorkspaceSize[in] The amount of workspace that will be provided, for each device in the handle.

  • hostWorkspaceSize[in] The amount of pinned host workspace that will be provided.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgDestroyCopyPlan()

cutensorStatus_t cutensorMgDestroyCopyPlan(cutensorMgCopyPlan_t plan)

Destroy a copy plan.

All outstanding operations must be completed before calling this function. It frees all associated resources.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

plan[in] The plan to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCopy()

cutensorStatus_t cutensorMgCopy(const cutensorMgHandle_t handle, const cutensorMgCopyPlan_t plan, void *ptrDst[], const void *ptrSrc[], void *deviceWorkspace[], void *hostWorkspace, cudaStream_t streams[])

Execute a copy operation.

Executes a copy operation according to the given plan. The function receives the source and destination pointers in the order prescribed by the devices parameter of the respective tensor descriptor, and the device workspace and streams in the order prescribed by the devices parameter of the handle. If host transfers are involved in the execution, the function blocks until those transfers have completed. The function is thread-safe as long as concurrent threads use different library handles.

Remark

calls asynchronous functions, conditionally blocking, non-reentrant, and conditionally thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[in] The copy plan.

  • ptrDst[out] The destination tensor pointers.

  • ptrSrc[in] The source tensor pointers.

  • deviceWorkspace[out] The device workspace.

  • hostWorkspace[out] The host pinned memory workspace.

  • streams[in] The execution streams.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_CUDA_ERROR – An issue interacting with the CUDA runtime occurred.
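The copy functions above compose into a single workflow: query workspace, allocate it, plan, and execute. A compressed sketch (assuming a two-device handle whose devices are ordinals 0..N-1; descriptors, mode arrays, tensor pointers, and streams are assumed to be prepared by the caller):

```c
#include <stdint.h>
#include <stddef.h>
#include <cuda_runtime.h>
#include <cutensorMg.h>

#define NUM_DEVICES 2  /* must match the handle's device count */

cutensorStatus_t copyExample(cutensorMgHandle_t handle,
                             cutensorMgTensorDescriptor_t descDst,
                             const int32_t modesDst[],
                             cutensorMgTensorDescriptor_t descSrc,
                             const int32_t modesSrc[],
                             void *ptrDst[], const void *ptrSrc[],
                             cudaStream_t streams[])
{
    cutensorMgCopyDescriptor_t desc;
    cutensorStatus_t s = cutensorMgCreateCopyDescriptor(
        handle, &desc, descDst, modesDst, descSrc, modesSrc);
    if (s != CUTENSOR_STATUS_SUCCESS) return s;

    /* Query required workspace, ordered like the handle's devices. */
    int64_t deviceWorkspaceSize[NUM_DEVICES];
    int64_t hostWorkspaceSize;
    s = cutensorMgCopyGetWorkspace(handle, desc,
                                   deviceWorkspaceSize, &hostWorkspaceSize);
    if (s != CUTENSOR_STATUS_SUCCESS) {
        cutensorMgDestroyCopyDescriptor(desc);
        return s;
    }

    void *deviceWorkspace[NUM_DEVICES];
    for (int i = 0; i < NUM_DEVICES; i++) {
        cudaSetDevice(i);  /* assumes handle devices are ordinals 0..N-1 */
        cudaMalloc(&deviceWorkspace[i], (size_t)deviceWorkspaceSize[i]);
    }
    void *hostWorkspace;
    cudaMallocHost(&hostWorkspace, (size_t)hostWorkspaceSize);  /* pinned */

    /* Plan with the workspace we will actually provide, then execute. */
    cutensorMgCopyPlan_t plan;
    s = cutensorMgCreateCopyPlan(handle, &plan, desc,
                                 deviceWorkspaceSize, hostWorkspaceSize);
    if (s == CUTENSOR_STATUS_SUCCESS) {
        s = cutensorMgCopy(handle, plan, ptrDst, ptrSrc,
                           deviceWorkspace, hostWorkspace, streams);
        cutensorMgDestroyCopyPlan(plan);
    }
    cutensorMgDestroyCopyDescriptor(desc);

    for (int i = 0; i < NUM_DEVICES; i++) cudaFree(deviceWorkspace[i]);
    cudaFreeHost(hostWorkspace);
    return s;
}
```

Real code should additionally check the CUDA allocation calls and synchronize the streams before freeing the workspace.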


Contraction Operations

The following types and functions implement contraction operations.


cutensorMgContractionDescriptor_t

typedef struct cutensorMgContractionDescriptor_s *cutensorMgContractionDescriptor_t

Describes the contraction of two tensors into a third tensor with an optional source.

Only supports device-side tensors.


cutensorMgContractionFind_t

typedef struct cutensorMgContractionFind_s *cutensorMgContractionFind_t

Describes the algorithmic details of implementing a tensor contraction.


cutensorMgContractionPlan_t

typedef struct cutensorMgContractionPlan_s *cutensorMgContractionPlan_t

Describes a specific way to implement a contraction operation.

It encodes blockings, permutations and other implementation details, and may be reused to reduce planning overhead.


cutensorMgAlgo_t

enum cutensorMgAlgo_t

Represents the selected algorithm when planning for a contraction operation.

Values:

enumerator CUTENSORMG_ALGO_DEFAULT

Lets the internal heuristic choose.


cutensorMgCreateContractionDescriptor()

cutensorStatus_t cutensorMgCreateContractionDescriptor(const cutensorMgHandle_t handle, cutensorMgContractionDescriptor_t *desc, const cutensorMgTensorDescriptor_t descA, const int32_t modesA[], const cutensorMgTensorDescriptor_t descB, const int32_t modesB[], const cutensorMgTensorDescriptor_t descC, const int32_t modesC[], const cutensorMgTensorDescriptor_t descD, const int32_t modesD[], cutensorComputeType_t compute)

Create a contraction descriptor.

A contraction descriptor encodes the operands for a contraction operation of the form

\[ D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \]

The contraction operation presently supports tensors located on one or multiple devices, but does not support tensors stored on the host. It uses Einstein notation, i.e., modes shared only between modesA and modesB are contracted. Currently, descC and descD, as well as modesC and modesD, must be identical. The compute type represents the lowest precision that may be used in the course of the calculation.
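The Einstein-notation semantics can be illustrated with NumPy, independently of cuTENSORMg; the mode labels below are arbitrary:

```python
import numpy as np

# Illustration of D = alpha * A * B + beta * C in Einstein notation:
# A has modes (i, k), B has (k, j), and C/D have (i, j). Mode k appears
# only in A and B, so it is contracted (summed over).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))   # modes i, k
B = rng.standard_normal((5, 3))   # modes k, j
C = rng.standard_normal((4, 3))   # modes i, j
alpha, beta = 2.0, 0.5

D = alpha * np.einsum("ik,kj->ij", A, B) + beta * C

# For matrices this reduces to a plain matrix product:
assert np.allclose(D, alpha * (A @ B) + beta * C)
```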

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[out] The resulting tensor contraction descriptor.

  • descA[in] The tensor descriptor for operand A.

  • modesA[in] The modes for operand A.

  • descB[in] The tensor descriptor for operand B.

  • modesB[in] The modes for operand B.

  • descC[in] The tensor descriptor for operand C.

  • modesC[in] The modes for operand C.

  • descD[in] The tensor descriptor for operand D.

  • modesD[in] The modes for operand D.

  • compute[in] The compute type for the operation.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgDestroyContractionDescriptor()

cutensorStatus_t cutensorMgDestroyContractionDescriptor(cutensorMgContractionDescriptor_t desc)

Destroy a contraction descriptor.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

desc[in] The descriptor to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgCreateContractionFind()

cutensorStatus_t cutensorMgCreateContractionFind(const cutensorMgHandle_t handle, cutensorMgContractionFind_t *find, const cutensorMgAlgo_t algo)

Create a contraction find.

The contraction find contains all the algorithmic options for executing a tensor contraction. For now, its only parameter is the algorithm, which currently has a single default value; additional options may be added in the future.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • find[out] The resulting find.

  • algo[in] The desired algorithm.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgDestroyContractionFind()

cutensorStatus_t cutensorMgDestroyContractionFind(cutensorMgContractionFind_t find)

Destroy a contraction find.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

find[in] The find to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgContractionGetWorkspace()

cutensorStatus_t cutensorMgContractionGetWorkspace(const cutensorMgHandle_t handle, const cutensorMgContractionDescriptor_t desc, const cutensorMgContractionFind_t find, cutensorWorksizePreference_t preference, int64_t deviceWorkspaceSize[], int64_t *hostWorkspaceSize)

Computes the workspace that is needed for the contraction.

The function calculates the workspace required for the contraction operation to succeed. It takes a workspace preference, which can tune how much workspace is needed. It returns the device workspace size in the same order as the devices are passed to the library handle.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • desc[in] The contraction descriptor.

  • find[in] The contraction find.

  • preference[in] The workspace preference.

  • deviceWorkspaceSize[out] The amount of workspace in bytes, for each device in the handle.

  • hostWorkspaceSize[out] The amount of pinned host memory in bytes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgCreateContractionPlan()

cutensorStatus_t cutensorMgCreateContractionPlan(const cutensorMgHandle_t handle, cutensorMgContractionPlan_t *plan, const cutensorMgContractionDescriptor_t desc, const cutensorMgContractionFind_t find, const int64_t deviceWorkspaceSize[], int64_t hostWorkspaceSize)

Create a contraction plan.

A contraction plan implements the contraction operation expressed through the contraction descriptor in accordance to the options specified in the contraction find. It contains all the information needed to execute a contraction operation. Planning may fail if insufficient workspace is provided.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[out] The resulting contraction plan.

  • desc[in] The contraction descriptor.

  • find[in] The contraction find.

  • deviceWorkspaceSize[in] The amount of workspace in bytes, for each device in the handle.

  • hostWorkspaceSize[in] The amount of pinned host memory in bytes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_NOT_SUPPORTED – This tensor layout or precision combination is not supported.


cutensorMgDestroyContractionPlan()

cutensorStatus_t cutensorMgDestroyContractionPlan(cutensorMgContractionPlan_t plan)

Destroy a contraction plan.

All outstanding operations must be completed before calling this function. It frees all associated resources.

Remark

non-blocking, non-reentrant, and thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:

plan[in] The plan to be destroyed.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.


cutensorMgContraction()

cutensorStatus_t cutensorMgContraction(const cutensorMgHandle_t handle, const cutensorMgContractionPlan_t plan, const void *alpha, const void *ptrA[], const void *ptrB[], const void *beta, const void *ptrC[], void *ptrD[], void *deviceWorkspace[], void *hostWorkspace, cudaStream_t streams[])

Execute a contraction operation.

Executes a contraction operation according to the provided plan. It receives all operands as arrays of pointers ordered according to the devices parameter of their tensor descriptors. The device workspace and streams are ordered according to the devices parameter of the library handle. The function is thread-safe as long as concurrent threads use different library handles.

Remark

calls asynchronous functions, non-blocking, non-reentrant, and conditionally thread-safe

Returns

A status code indicating the success or failure of the operation

Parameters:
  • handle[in] The library handle.

  • plan[in] The contraction plan.

  • alpha[in] The alpha scaling factor (host pointer).

  • ptrA[in] The A operand tensor pointers.

  • ptrB[in] The B operand tensor pointers.

  • beta[in] The beta scaling factor (host pointer).

  • ptrC[in] The operand C tensor pointers.

  • ptrD[out] The operand D tensor pointers.

  • deviceWorkspace[out] The device workspace.

  • hostWorkspace[out] The host pinned memory workspace.

  • streams[in] The execution streams.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – Some input parameters were invalid.

  • CUTENSOR_STATUS_CUDA_ERROR – An issue interacting with the CUDA runtime occurred.
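The contraction functions compose into a workflow analogous to planning and executing a copy. A compressed sketch, assuming descriptors, mode arrays, operand pointers, sufficiently large workspace buffers, and streams are prepared by the caller; CUTENSOR_COMPUTE_32F and CUTENSOR_WORKSPACE_RECOMMENDED are taken from the cuTENSOR headers:

```c
#include <stdint.h>
#include <cuda_runtime.h>
#include <cutensorMg.h>

#define NUM_DEVICES 2  /* must match the handle's device count */

cutensorStatus_t contractionExample(
    cutensorMgHandle_t handle,
    cutensorMgTensorDescriptor_t descA, const int32_t modesA[],
    cutensorMgTensorDescriptor_t descB, const int32_t modesB[],
    cutensorMgTensorDescriptor_t descC, const int32_t modesC[],
    const void *alpha, const void *ptrA[], const void *ptrB[],
    const void *beta, const void *ptrC[], void *ptrD[],
    void *deviceWorkspace[], void *hostWorkspace,
    cudaStream_t streams[])
{
    /* descD/modesD must currently be identical to descC/modesC. */
    cutensorMgContractionDescriptor_t desc;
    cutensorStatus_t s = cutensorMgCreateContractionDescriptor(
        handle, &desc, descA, modesA, descB, modesB,
        descC, modesC, descC, modesC, CUTENSOR_COMPUTE_32F);
    if (s != CUTENSOR_STATUS_SUCCESS) return s;

    cutensorMgContractionFind_t find;
    s = cutensorMgCreateContractionFind(handle, &find,
                                        CUTENSORMG_ALGO_DEFAULT);
    if (s != CUTENSOR_STATUS_SUCCESS) goto cleanup_desc;

    /* Query how much workspace the recommended configuration needs; the
     * caller-provided buffers must be at least this large. */
    int64_t deviceWorkspaceSize[NUM_DEVICES];
    int64_t hostWorkspaceSize;
    s = cutensorMgContractionGetWorkspace(handle, desc, find,
            CUTENSOR_WORKSPACE_RECOMMENDED,
            deviceWorkspaceSize, &hostWorkspaceSize);
    if (s != CUTENSOR_STATUS_SUCCESS) goto cleanup_find;

    cutensorMgContractionPlan_t plan;
    s = cutensorMgCreateContractionPlan(handle, &plan, desc, find,
                                        deviceWorkspaceSize,
                                        hostWorkspaceSize);
    if (s != CUTENSOR_STATUS_SUCCESS) goto cleanup_find;

    s = cutensorMgContraction(handle, plan, alpha, ptrA, ptrB,
                              beta, ptrC, ptrD,
                              deviceWorkspace, hostWorkspace, streams);
    cutensorMgDestroyContractionPlan(plan);

cleanup_find:
    cutensorMgDestroyContractionFind(find);
cleanup_desc:
    cutensorMgDestroyContractionDescriptor(desc);
    return s;
}
```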