API Reference - cuTENSORMp#

General#


cutensorMpHandle_t#

typedef struct cuTensorMpHandle *cutensorMpHandle_t#

cutensorMpCreate()#

cutensorStatus_t cutensorMpCreate(
cutensorMpHandle_t *handle,
ncclComm_t comm,
int local_device_id,
cudaStream_t stream,
)#

Initializes the cutensorMp library and creates a handle for distributed tensor operations.

This function creates a cutensorMp handle that serves as the context for all distributed tensor operations. The handle is associated with a specific NCCL communicator, local CUDA device, and CUDA stream. This allows cutensorMp to coordinate tensor operations across multiple processes and GPUs.

The communicator defines the group of processes that will participate in distributed tensor operations. The local device ID specifies which CUDA device on the current process will be used for computations. The CUDA stream enables asynchronous execution and synchronization with other CUDA operations.

The user is responsible for calling cutensorMpDestroy to free the resources associated with the handle.

Remark

non-blocking, not reentrant, and thread-safe

Parameters:
  • handle[out] Pointer to cutensorMpHandle_t that will hold the created handle

  • comm[in] NCCL communicator that defines the group of processes for distributed operations

  • local_device_id[in] CUDA device ID to use on the current process (must be valid and accessible)

  • stream[in] CUDA stream for asynchronous operations

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
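
A minimal usage sketch (assumptions: the NCCL communicator has already been initialized across all participating processes, e.g. via ncclCommInitRank, and the public header is named cutensorMp.h):

#include <cuda_runtime.h>
#include <nccl.h>
#include <cutensorMp.h>                /* assumed header name */

ncclComm_t comm;                       /* assumed: already initialized across all ranks */
int local_device_id = 0;               /* CUDA device used by this process */
cudaStream_t stream;
cudaSetDevice(local_device_id);
cudaStreamCreate(&stream);

cutensorMpHandle_t handle;
cutensorStatus_t status = cutensorMpCreate(&handle, comm, local_device_id, stream);
if (status != CUTENSOR_STATUS_SUCCESS) { /* handle the error */ }

/* ... distributed tensor operations ... */

cutensorMpDestroy(handle);
cudaStreamDestroy(stream);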


cutensorMpDestroy()#

cutensorStatus_t cutensorMpDestroy(cutensorMpHandle_t handle)#

Frees all resources associated with the provided cutensorMp handle.

This function deallocates all memory and resources associated with a cutensorMp handle that was previously created by cutensorMpCreate. After calling this function, the handle becomes invalid and should not be used in subsequent cutensorMp operations.

Remark

blocking, not reentrant, and thread-safe

Parameters:

handle[inout] The cutensorMpHandle_t object that will be deallocated

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


Tensor Descriptors#


cutensorMpTensorDescriptor_t#

typedef struct cuTensorMpTensorDescriptor *cutensorMpTensorDescriptor_t#

cutensorMpCreateTensorDescriptor()#

cutensorStatus_t cutensorMpCreateTensorDescriptor(
const cutensorMpHandle_t handle,
cutensorMpTensorDescriptor_t *desc,
const uint32_t numModes,
const int64_t extent[],
const int64_t elementStride[],
const int64_t blockSize[],
const int64_t blockStride[],
const int64_t nranksPerMode[],
const uint32_t nranks,
const int32_t ranks[],
const cudaDataType_t dataType,
)#

Creates a distributed tensor descriptor for multi-process tensor operations.

This function creates a tensor descriptor that defines the structure and distribution of a multi-dimensional tensor across multiple processes. Unlike regular cuTENSOR tensor descriptors, this descriptor includes information about how the tensor is partitioned and distributed across different processes in the NCCL communicator.

The tensor is described by its modes (dimensions), extents (sizes along each mode), and strides for elements and blocks. The distribution is specified through block sizes, block strides, and nranks-per-mode, which determine how the tensor data is partitioned across the participating processes.

The user is responsible for calling cutensorMpDestroyTensorDescriptor to free the resources associated with the descriptor once it is no longer needed.

Remark

non-blocking, not reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cutensorMp’s library context

  • desc[out] Pointer to the address where the allocated tensor descriptor object will be stored

  • numModes[in] Number of modes (dimensions) in the tensor (must be greater than zero)

  • extent[in] Extent (size) of each mode (size: numModes, all values must be greater than zero)

  • elementStride[in] Stride between consecutive elements in each mode (size: numModes)

  • blockSize[in] Size of each block along each mode for distribution (size: numModes), passing NULL will use extent[i]/nranksPerMode[i]

  • blockStride[in] Stride between consecutive blocks in each mode (size: numModes)

  • nranksPerMode[in] Number of processes along each mode (size: numModes)

  • nranks[in] Total number of ranks (processes) participating in the tensor distribution

  • ranks[in] Array of rank IDs for each participating process (size: nranks), passing NULL will use the range [0, nranks)

  • dataType[in] Data type of the tensor elements

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)

  • CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor configuration is not supported
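
As an illustration, the sketch below describes a 1024x1024 single-precision matrix distributed over a 2x2 process grid (4 ranks), with each rank owning one dense 512x512 block. The stride values are assumptions for a dense, column-major local block and should be adapted to the actual data layout:

int64_t extent[2]        = {1024, 1024};  /* global size along each mode */
int64_t elementStride[2] = {1, 512};      /* assumed: dense column-major local block */
int64_t blockSize[2]     = {512, 512};    /* one block per rank along each mode
                                             (alternatively pass NULL to use extent[i]/nranksPerMode[i]) */
int64_t blockStride[2]   = {1, 512};      /* assumed: strides between blocks in local memory */
int64_t nranksPerMode[2] = {2, 2};        /* 2x2 process grid */
int32_t ranks[4]         = {0, 1, 2, 3};  /* alternatively pass NULL to use the range [0, nranks) */

cutensorMpTensorDescriptor_t descA;
cutensorStatus_t status = cutensorMpCreateTensorDescriptor(
    handle, &descA,
    2,                                    /* numModes */
    extent, elementStride, blockSize, blockStride, nranksPerMode,
    4,                                    /* nranks */
    ranks, CUDA_R_32F);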


cutensorMpDestroyTensorDescriptor()#

cutensorStatus_t cutensorMpDestroyTensorDescriptor(
cutensorMpTensorDescriptor_t desc,
)#

Frees all resources related to the provided distributed tensor descriptor.

This function deallocates all memory and resources associated with a cutensorMp tensor descriptor that was previously created by cutensorMpCreateTensorDescriptor. After calling this function, the descriptor becomes invalid and should not be used in subsequent cutensorMp operations.

Remark

blocking, not reentrant, and thread-safe

Parameters:

desc[inout] The cutensorMpTensorDescriptor_t object that will be deallocated

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise

Generic Operation Functions#

The following functions are generic and work with all the different operations.


cutensorMpDestroyOperationDescriptor()#

cutensorStatus_t cutensorMpDestroyOperationDescriptor(
cutensorMpOperationDescriptor_t desc,
)#

Frees all resources related to the provided distributed contraction descriptor.

This function deallocates all memory and resources associated with a cutensorMp operation descriptor that was previously created by cutensorMpCreateContraction. After calling this function, the descriptor becomes invalid and should not be used in subsequent cutensorMp operations.

Remark

blocking, not reentrant, and thread-safe

Parameters:

desc[inout] The cutensorMpOperationDescriptor_t object that will be deallocated

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


Plan Preferences#


cutensorMpAlgo_t#

enum cutensorMpAlgo_t#

Values:

enumerator CUTENSORMP_ALGO_DEFAULT#

Default algorithm.


cutensorMpPlanPreference_t#

typedef struct cuTensorMpPlanPreference *cutensorMpPlanPreference_t#

cutensorMpCreatePlanPreference()#

cutensorStatus_t cutensorMpCreatePlanPreference(
const cutensorMpHandle_t handle,
cutensorMpPlanPreference_t *pref,
const cutensorMpAlgo_t cutensormp_algo,
const uint64_t cutensormp_workspace_size_device,
const uint64_t cutensormp_workspace_size_host,
)#

Creates a plan preference object for controlling distributed tensor operation planning.

This function creates a preference object that allows users to control various aspects of the execution plan for distributed tensor operations. The preferences include algorithm selection, workspace size limits, and JIT compilation options that affect both the underlying cuTENSOR operations and the distributed communication patterns.

The plan preference provides fine-grained control over:

  • Local cuTENSOR algorithm selection and JIT mode

  • Distributed algorithm strategy (non-packing, packing with permutation, or packing with P2P)

  • Workspace size limits for both device and host memory

  • cuTENSOR workspace preferences

The user is responsible for calling cutensorMpDestroyPlanPreference to free the resources associated with the preference object.

Remark

non-blocking, not reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cutensorMp’s library context

  • pref[out] Pointer to the plan preference object that will be created

  • cutensormp_algo[in] Algorithm selection for distributed communication patterns

  • cutensormp_workspace_size_device[in] Maximum device workspace size for cutensorMp operations (bytes); a minimum of 2 GB is required

  • cutensormp_workspace_size_host[in] Maximum host workspace size for cutensorMp operations (bytes)

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)
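
For example, a preference object that selects the default algorithm and caps the workspace sizes could be created as follows (the limits below are arbitrary choices for illustration; at least 2 GB of device workspace is required):

cutensorMpPlanPreference_t pref;
uint64_t deviceWorkspaceLimit = 4ULL * 1024 * 1024 * 1024;  /* 4 GiB (minimum of 2 GB required) */
uint64_t hostWorkspaceLimit   = 256ULL * 1024 * 1024;       /* 256 MiB */

cutensorStatus_t status = cutensorMpCreatePlanPreference(
    handle, &pref, CUTENSORMP_ALGO_DEFAULT,
    deviceWorkspaceLimit, hostWorkspaceLimit);

/* ... pass pref to cutensorMpCreatePlan ... */
cutensorMpDestroyPlanPreference(pref);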


cutensorMpDestroyPlanPreference()#

cutensorStatus_t cutensorMpDestroyPlanPreference(
cutensorMpPlanPreference_t pref,
)#

Frees all resources related to the provided plan preference object.

This function deallocates all memory and resources associated with a cutensorMp plan preference object that was previously created by cutensorMpCreatePlanPreference. After calling this function, the preference object becomes invalid and should not be used in subsequent cutensorMp operations.

Remark

blocking, not reentrant, and thread-safe

Parameters:

pref[inout] The cutensorMpPlanPreference_t object that will be deallocated

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


Plans#


cutensorMpPlan_t#

typedef struct cuTensorMpPlan *cutensorMpPlan_t#

cutensorMpPlanAttribute_t#

enum cutensorMpPlanAttribute_t#

Values:

enumerator CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE#

uint64_t: exact required device workspace in bytes that is needed to execute the plan

enumerator CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST#

uint64_t: exact required host workspace in bytes that is needed to execute the plan


cutensorMpCreatePlan()#

cutensorStatus_t cutensorMpCreatePlan(
const cutensorMpHandle_t handle,
cutensorMpPlan_t *plan,
const cutensorMpOperationDescriptor_t desc,
const cutensorMpPlanPreference_t pref,
)#

Creates an execution plan for distributed tensor contractions.

This function creates an optimized execution plan for the distributed tensor contraction encoded by the operation descriptor. The plan selects the most appropriate algorithms and communication strategies based on the tensor distributions, available resources, and user preferences.

The planning process analyzes the distributed tensor layout, communication requirements, and computational resources to determine an efficient execution strategy. This may involve data redistribution, local contractions, and result aggregation phases that minimize communication overhead while maximizing computational efficiency.

The user is responsible for calling cutensorMpDestroyPlan to free the resources associated with the plan once it is no longer needed.

Remark

calls asynchronous functions, not reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cutensorMp’s library context

  • plan[out] Pointer to the execution plan object that will be created

  • desc[in] Operation descriptor encoding the distributed contraction (created by cutensorMpCreateContraction)

  • pref[in] Plan preference object specifying algorithm and workspace preferences (may be NULL for default preferences)

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)

  • CUTENSOR_STATUS_NOT_SUPPORTED – if no viable execution plan could be found
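
A minimal sketch (assuming desc was created with cutensorMpCreateContraction and pref with cutensorMpCreatePlanPreference; pref may also be NULL for default preferences):

cutensorMpPlan_t plan;
cutensorStatus_t status = cutensorMpCreatePlan(handle, &plan, desc, pref);
if (status != CUTENSOR_STATUS_SUCCESS) { /* handle the error */ }

/* ... query workspace sizes, execute the contraction ... */
cutensorMpDestroyPlan(plan);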


cutensorMpDestroyPlan()#

cutensorStatus_t cutensorMpDestroyPlan(cutensorMpPlan_t plan)#

Frees all resources related to the provided distributed contraction plan.

This function deallocates all memory and resources associated with a cutensorMp execution plan that was previously created by cutensorMpCreatePlan. After calling this function, the plan becomes invalid and should not be used in subsequent cutensorMp operations.

Remark

blocking, not reentrant, and thread-safe

Parameters:

plan[inout] The cutensorMpPlan_t object that will be deallocated

Return values:

CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise


cutensorMpPlanGetAttribute()#

cutensorStatus_t cutensorMpPlanGetAttribute(
const cutensorMpHandle_t handle,
const cutensorMpPlan_t plan,
const cutensorMpPlanAttribute_t attribute,
void *buf,
const size_t sizeInBytes,
)#

Retrieves information about an already-created plan (see cutensorMpPlanAttribute_t).

Parameters:
  • handle[in] Opaque handle holding cutensorMp’s library context

  • plan[in] Denotes an already-created plan (e.g., via cutensorMpCreatePlan)

  • attribute[in] Requested attribute.

  • buf[out] On successful exit: Holds the information of the requested attribute.

  • sizeInBytes[in] Size of buf in bytes.

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully.

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
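
For example, the exact workspace requirements of an existing plan can be queried as follows:

uint64_t deviceWorkspaceSize = 0;
uint64_t hostWorkspaceSize   = 0;

cutensorMpPlanGetAttribute(handle, plan, CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE,
                           &deviceWorkspaceSize, sizeof(deviceWorkspaceSize));
cutensorMpPlanGetAttribute(handle, plan, CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST,
                           &hostWorkspaceSize, sizeof(hostWorkspaceSize));
/* allocate at least this many bytes of device and host workspace before calling cutensorMpContract */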


Contraction Operations#


cutensorMpOperationDescriptor_t#

typedef struct cuTensorMpOperationDescriptor *cutensorMpOperationDescriptor_t#

cutensorMpCreateContraction()#

cutensorStatus_t cutensorMpCreateContraction(
const cutensorMpHandle_t handle,
cutensorMpOperationDescriptor_t *desc,
const cutensorMpTensorDescriptor_t descA,
const int32_t modesA[],
cutensorOperator_t opA,
const cutensorMpTensorDescriptor_t descB,
const int32_t modesB[],
cutensorOperator_t opB,
const cutensorMpTensorDescriptor_t descC,
const int32_t modesC[],
cutensorOperator_t opC,
const cutensorMpTensorDescriptor_t descD,
const int32_t modesD[],
const cutensorComputeDescriptor_t descCompute,
)#

Creates an operation descriptor that encodes a distributed tensor contraction.

This function creates an operation descriptor for distributed tensor contractions of the form \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \), where the tensors A, B, C, and D are distributed across multiple processes as specified by their respective tensor descriptors.

The distributed contraction leverages both intra-process cuTENSOR operations and inter-process communication to efficiently compute tensor contractions that exceed the memory capacity or computational resources of a single GPU. The operation automatically handles data redistribution, local contractions, and result aggregation across the participating processes.

The user is responsible for calling cutensorMpDestroyOperationDescriptor to free the resources associated with the descriptor once it is no longer needed.

Remark

non-blocking, not reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cutensorMp’s library context

  • desc[out] Pointer to the operation descriptor that will be created and filled with information encoding the distributed contraction operation

  • descA[in] Distributed tensor descriptor for input tensor A

  • modesA[in] Modes of the input tensor A

  • opA[in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.

  • descB[in] Distributed tensor descriptor for input tensor B

  • modesB[in] Modes of the input tensor B

  • opB[in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.

  • descC[in] Distributed tensor descriptor for input tensor C

  • modesC[in] Modes of the input tensor C

  • opC[in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.

  • descD[in] Distributed tensor descriptor for output tensor D (currently must be identical to descC)

  • modesD[in] Modes of the output tensor D

  • descCompute[in] Compute descriptor that determines the precision for the operation

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)

  • CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of tensor configurations is not supported
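
As an illustration, a distributed matrix-matrix multiplication D[i,j] = alpha * A[i,k] * B[k,j] + beta * C[i,j] could be encoded as in the sketch below (descA, descB, and descC are assumed to be previously created distributed tensor descriptors; CUTENSOR_OP_IDENTITY and CUTENSOR_COMPUTE_DESC_32F are the usual cuTENSOR identifiers for the identity operator and single-precision compute):

int32_t modesA[] = {'i', 'k'};
int32_t modesB[] = {'k', 'j'};
int32_t modesC[] = {'i', 'j'};

cutensorMpOperationDescriptor_t desc;
cutensorStatus_t status = cutensorMpCreateContraction(
    handle, &desc,
    descA, modesA, CUTENSOR_OP_IDENTITY,
    descB, modesB, CUTENSOR_OP_IDENTITY,
    descC, modesC, CUTENSOR_OP_IDENTITY,
    descC, modesC,                 /* descD/modesD: currently must match C */
    CUTENSOR_COMPUTE_DESC_32F);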


cutensorMpContract()#

cutensorStatus_t cutensorMpContract(
const cutensorMpHandle_t handle,
const cutensorMpPlan_t plan,
const void *alpha,
const void *A,
const void *B,
const void *beta,
const void *C,
void *D,
void *device_workspace,
void *host_workspace,
)#

Performs a distributed tensor contraction across multiple processes.

This function executes the distributed tensor contraction \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \) according to the execution plan created by cutensorMpCreatePlan. The operation coordinates computation and communication across multiple processes and GPUs to efficiently perform tensor contractions that exceed the capacity of a single device.

The execution involves several phases:

  1. Data redistribution to align tensor blocks for efficient computation

  2. Local tensor contractions using cuTENSOR on each participating device

  3. Communication and aggregation of partial results across processes

  4. Final result assembly in the distributed output tensor

All participating processes in the NCCL communicator must call this function with consistent parameters. The input and output tensors must be distributed according to their respective tensor descriptors, with each process providing its local portion of the data.

Remark

calls asynchronous functions, not reentrant, and thread-safe

Parameters:
  • handle[in] Opaque handle holding cutensorMp’s library context

  • plan[in] Execution plan for the distributed contraction (created by cutensorMpCreatePlan)

  • alpha[in] Scaling factor for the A*B product. Pointer to host memory with data type determined by the compute descriptor. The data type follows that of cuTENSOR (i.e., the data type of the scalar is determined by the data type of C: CUDA_R_16F and CUDA_R_16BF use CUDA_R_32F scalars; for all other data types, the scalar type is identical to the type of C)

  • A[in] Pointer to the local portion of distributed tensor A in GPU memory

  • B[in] Pointer to the local portion of distributed tensor B in GPU memory

  • beta[in] Scaling factor for tensor C. Pointer to host memory with data type determined by the compute descriptor. The data type follows that of cuTENSOR (i.e., the data type of the scalar is determined by the data type of C: CUDA_R_16F and CUDA_R_16BF use CUDA_R_32F scalars; for all other data types, the scalar type is identical to the type of C)

  • C[in] Pointer to the local portion of distributed tensor C in GPU memory

  • D[out] Pointer to the local portion of distributed tensor D in GPU memory (may be identical to C)

  • device_workspace[in] Pointer to device workspace memory (size determined by cutensorMpPlanGetAttribute with CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE)

  • host_workspace[in] Pointer to host workspace memory (size determined by cutensorMpPlanGetAttribute with CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST)

Return values:
  • CUTENSOR_STATUS_SUCCESS – The operation completed successfully

  • CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized

  • CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)

  • CUTENSOR_STATUS_NOT_SUPPORTED – if the operation is not supported with the given configuration

  • CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace is insufficient

  • CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device architecture

  • CUTENSOR_STATUS_CUDA_ERROR – if a CUDA error occurred during execution
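
Putting the pieces together, a sketch of the execution step is shown below (assumptions: the tensors use CUDA_R_32F, so the scalars are float; the workspace sizes were queried via cutensorMpPlanGetAttribute; A, B, C, and D point to each rank's local device buffers; every rank in the communicator makes this call with consistent parameters):

float alpha = 1.0f, beta = 0.0f;

void *deviceWorkspace = NULL, *hostWorkspace = NULL;
cudaMalloc(&deviceWorkspace, deviceWorkspaceSize);
cudaMallocHost(&hostWorkspace, hostWorkspaceSize);

cutensorStatus_t status = cutensorMpContract(
    handle, plan, &alpha, A, B, &beta, C, D,
    deviceWorkspace, hostWorkspace);

/* assumption: work is enqueued on the stream passed to cutensorMpCreate,
   so synchronize that stream before reading D or freeing the workspaces */
cudaStreamSynchronize(stream);

cudaFree(deviceWorkspace);
cudaFreeHost(hostWorkspace);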