API Reference - cuTENSORMp#
General#
cutensorMpHandle_t#
-
typedef struct cuTensorMpHandle *cutensorMpHandle_t#
cutensorMpCreate()#
-
cutensorStatus_t cutensorMpCreate(
    cutensorMpHandle_t *handle,
    ncclComm_t comm,
    int local_device_id,
    cudaStream_t stream)#
Initializes the cutensorMp library and creates a handle for distributed tensor operations.
This function creates a cutensorMp handle that serves as the context for all distributed tensor operations. The handle is associated with a specific NCCL communicator, local CUDA device, and CUDA stream. This allows cutensorMp to coordinate tensor operations across multiple processes and GPUs.
The communicator defines the group of processes that will participate in distributed tensor operations. The local device ID specifies which CUDA device on the current process will be used for computations. The CUDA stream enables asynchronous execution and synchronization with other CUDA operations.
The user is responsible for calling cutensorMpDestroy to free the resources associated with the handle.
Remark
non-blocking, non-reentrant, and thread-safe
- Parameters:
handle – [out] Pointer to cutensorMpHandle_t that will hold the created handle
comm – [in] NCCL communicator that defines the group of processes for distributed operations
local_device_id – [in] CUDA device ID to use on the current process (must be valid and accessible)
stream – [in] CUDA stream for asynchronous operations
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
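A minimal usage sketch follows, with error handling omitted. It assumes MPI is used to broadcast the NCCL unique ID, one GPU per rank (up to 8 per node), and a header named cutensorMp.h; none of these choices are mandated by the API.

    // Bootstrap NCCL over MPI, then create and destroy a cutensorMp handle.
    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <cutensorMp.h>   // assumed header name

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        int local_device_id = rank % 8;   // one GPU per rank (assumption)
        cudaSetDevice(local_device_id);

        ncclUniqueId id;                  // broadcast the NCCL unique ID over MPI
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
        ncclComm_t comm;
        ncclCommInitRank(&comm, nranks, id, rank);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cutensorMpHandle_t handle;
        cutensorMpCreate(&handle, comm, local_device_id, stream);

        // ... distributed tensor operations ...

        cutensorMpDestroy(handle);
        ncclCommDestroy(comm);
        cudaStreamDestroy(stream);
        MPI_Finalize();
        return 0;
    }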
cutensorMpDestroy()#
-
cutensorStatus_t cutensorMpDestroy(cutensorMpHandle_t handle)#
Frees all resources associated with the provided cutensorMp handle.
This function deallocates all memory and resources associated with a cutensorMp handle that was previously created by cutensorMpCreate. After calling this function, the handle becomes invalid and should not be used in subsequent cutensorMp operations.
Remark
blocking, no reentrant, and thread-safe
- Parameters:
handle – [inout] The cutensorMpHandle_t object that will be deallocated
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
Tensor Descriptors#
cutensorMpTensorDescriptor_t#
-
typedef struct cuTensorMpTensorDescriptor *cutensorMpTensorDescriptor_t#
cutensorMpCreateTensorDescriptor()#
-
cutensorStatus_t cutensorMpCreateTensorDescriptor(
    const cutensorMpHandle_t handle,
    cutensorMpTensorDescriptor_t *desc,
    const uint32_t numModes,
    const int64_t extent[],
    const int64_t elementStride[],
    const int64_t blockSize[],
    const int64_t blockStride[],
    const int64_t nranksPerMode[],
    const uint32_t nranks,
    const int32_t ranks[],
    const cudaDataType_t dataType)#
Creates a distributed tensor descriptor for multi-process tensor operations.
This function creates a tensor descriptor that defines the structure and distribution of a multi-dimensional tensor across multiple processes. Unlike regular cuTENSOR tensor descriptors, this descriptor includes information about how the tensor is partitioned and distributed across the different processes in the NCCL communicator.
The tensor is described by its modes (dimensions), extents (sizes along each mode), and strides for elements and blocks. The distribution is specified through block sizes, block strides, and nranks-per-mode, which determine how the tensor data is partitioned across the participating processes.
The user is responsible for calling cutensorMpDestroyTensorDescriptor to free the resources associated with the descriptor once it is no longer needed.
Remark
non-blocking, non-reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cutensorMp’s library context
desc – [out] Pointer to the address where the allocated tensor descriptor object will be stored
numModes – [in] Number of modes (dimensions) in the tensor (must be greater than zero)
extent – [in] Extent (size) of each mode (size: numModes, all values must be greater than zero)
elementStride – [in] Stride between consecutive elements in each mode (size: numModes)
blockSize – [in] Size of each block along each mode for distribution (size: numModes); passing NULL will use extent[i]/nranksPerMode[i]
blockStride – [in] Stride between consecutive blocks in each mode (size: numModes)
nranksPerMode – [in] Number of processes along each mode (size: numModes)
nranks – [in] Total number of ranks (processes) participating in the tensor distribution
ranks – [in] Array of rank IDs for each participating process (size: nranks); passing NULL will use the range [0, nranks)
dataType – [in] Data type of the tensor elements
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)
CUTENSOR_STATUS_NOT_SUPPORTED – if the requested descriptor configuration is not supported
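As an illustration, the following sketch describes a 1024x1024 tensor of floats distributed across a 2x2 process grid, one 512x512 block per rank. The stride values are assumptions describing a densely packed, column-major local block; the exact stride semantics should be checked against the parameter descriptions above. handle is an existing cutensorMpHandle_t.

    // Hypothetical 2D distribution: 4 ranks, one 512x512 block each.
    const uint32_t numModes       = 2;
    const int64_t extent[]        = {1024, 1024};     // global extents
    const int64_t elementStride[] = {1, 512};         // column-major within a block (assumption)
    const int64_t blockSize[]     = {512, 512};       // extent[i] / nranksPerMode[i]
    const int64_t blockStride[]   = {512, 512 * 512}; // densely packed blocks (assumption)
    const int64_t nranksPerMode[] = {2, 2};           // 2x2 process grid
    const uint32_t nranks         = 4;

    cutensorMpTensorDescriptor_t descA;
    cutensorMpCreateTensorDescriptor(handle, &descA, numModes,
                                     extent, elementStride,
                                     blockSize, blockStride,
                                     nranksPerMode, nranks,
                                     NULL,            // ranks: use the range [0, 4)
                                     CUDA_R_32F);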
cutensorMpDestroyTensorDescriptor()#
-
cutensorStatus_t cutensorMpDestroyTensorDescriptor(cutensorMpTensorDescriptor_t desc)#
Frees all resources related to the provided distributed tensor descriptor.
This function deallocates all memory and resources associated with a cutensorMp tensor descriptor that was previously created by cutensorMpCreateTensorDescriptor. After calling this function, the descriptor becomes invalid and should not be used in subsequent cutensorMp operations.
Remark
blocking, non-reentrant, and thread-safe
- Parameters:
desc – [inout] The cutensorMpTensorDescriptor_t object that will be deallocated
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
Generic Operation Functions#
The following functions are generic and work with all the different operations.
cutensorMpDestroyOperationDescriptor()#
-
cutensorStatus_t cutensorMpDestroyOperationDescriptor(cutensorMpOperationDescriptor_t desc)#
Frees all resources related to the provided distributed contraction descriptor.
This function deallocates all memory and resources associated with a cutensorMp operation descriptor that was previously created by cutensorMpCreateContraction. After calling this function, the descriptor becomes invalid and should not be used in subsequent cutensorMp operations.
Remark
blocking, non-reentrant, and thread-safe
- Parameters:
desc – [inout] The cutensorMpOperationDescriptor_t object that will be deallocated
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
Plan Preferences#
cutensorMpAlgo_t#
cutensorMpPlanPreference_t#
-
typedef struct cuTensorMpPlanPreference *cutensorMpPlanPreference_t#
cutensorMpCreatePlanPreference()#
-
cutensorStatus_t cutensorMpCreatePlanPreference(
    const cutensorMpHandle_t handle,
    cutensorMpPlanPreference_t *pref,
    const cutensorMpAlgo_t cutensormp_algo,
    const uint64_t cutensormp_workspace_size_device,
    const uint64_t cutensormp_workspace_size_host)#
Creates a plan preference object for controlling distributed tensor operation planning.
This function creates a preference object that allows users to control various aspects of the execution plan for distributed tensor operations. The preferences include algorithm selection, workspace size limits, and JIT compilation options that affect both the underlying cuTENSOR operations and the distributed communication patterns.
The plan preference provides fine-grained control over:
Local cuTENSOR algorithm selection and JIT mode
Distributed algorithm strategy (non-packing, packing with permutation, or packing with P2P)
Workspace size limits for both device and host memory
cuTENSOR workspace preferences
The user is responsible for calling cutensorMpDestroyPlanPreference to free the resources associated with the preference object.
Remark
non-blocking, non-reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cutensorMp’s library context
pref – [out] Pointer to the plan preference object that will be created
cutensormp_algo – [in] Algorithm selection for distributed communication patterns
cutensormp_workspace_size_device – [in] Maximum device workspace size for cutensorMp operations (bytes); a minimum of 2 GB is required
cutensormp_workspace_size_host – [in] Maximum host workspace size for cutensorMp operations (bytes)
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)
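A hedged sketch of creating a preference object is shown below; since the cutensorMpAlgo_t enumerators are not listed in this reference, CUTENSORMP_ALGO_DEFAULT is a hypothetical placeholder for whichever enumerator is appropriate.

    // Preference with the documented 2 GB device minimum and an
    // illustrative 256 MiB host limit.
    cutensorMpPlanPreference_t pref;
    cutensorMpCreatePlanPreference(handle, &pref,
                                   CUTENSORMP_ALGO_DEFAULT,     // hypothetical enumerator
                                   2ULL * 1024 * 1024 * 1024,   // device workspace limit (bytes)
                                   256ULL * 1024 * 1024);       // host workspace limit (bytes)
    // ... create plans using pref ...
    cutensorMpDestroyPlanPreference(pref);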
cutensorMpDestroyPlanPreference()#
-
cutensorStatus_t cutensorMpDestroyPlanPreference(cutensorMpPlanPreference_t pref)#
Frees all resources related to the provided plan preference object.
This function deallocates all memory and resources associated with a cutensorMp plan preference object that was previously created by cutensorMpCreatePlanPreference. After calling this function, the preference object becomes invalid and should not be used in subsequent cutensorMp operations.
Remark
blocking, non-reentrant, and thread-safe
- Parameters:
pref – [inout] The cutensorMpPlanPreference_t object that will be deallocated
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
Plans#
cutensorMpPlan_t#
-
typedef struct cuTensorMpPlan *cutensorMpPlan_t#
cutensorMpPlanAttribute_t#
-
enum cutensorMpPlanAttribute_t#
Values:
-
enumerator CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE#
uint64_t: exact required device workspace in bytes that is needed to execute the plan
-
enumerator CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST#
uint64_t: exact required host workspace in bytes that is needed to execute the plan
cutensorMpCreatePlan()#
-
cutensorStatus_t cutensorMpCreatePlan(
    const cutensorMpHandle_t handle,
    cutensorMpPlan_t *plan,
    const cutensorMpOperationDescriptor_t desc,
    const cutensorMpPlanPreference_t pref)#
Creates an execution plan for distributed tensor contractions.
This function creates an optimized execution plan for the distributed tensor contraction encoded by the operation descriptor. The plan selects the most appropriate algorithms and communication strategies based on the tensor distributions, available resources, and user preferences.
The planning process analyzes the distributed tensor layout, communication requirements, and computational resources to determine an efficient execution strategy. This may involve data redistribution, local contractions, and result aggregation phases that minimize communication overhead while maximizing computational efficiency.
The user is responsible for calling cutensorMpDestroyPlan to free the resources associated with the plan once it is no longer needed.
Remark
calls asynchronous functions, non-reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cutensorMp’s library context
plan – [out] Pointer to the execution plan object that will be created
desc – [in] Operation descriptor encoding the distributed contraction (created by cutensorMpCreateContraction)
pref – [in] Plan preference object specifying algorithm and workspace preferences (may be NULL for default preferences)
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)
CUTENSOR_STATUS_NOT_SUPPORTED – if no viable execution plan could be found
cutensorMpDestroyPlan()#
-
cutensorStatus_t cutensorMpDestroyPlan(cutensorMpPlan_t plan)#
Frees all resources related to the provided distributed contraction plan.
This function deallocates all memory and resources associated with a cutensorMp execution plan that was previously created by cutensorMpCreatePlan. After calling this function, the plan becomes invalid and should not be used in subsequent cutensorMp operations.
Remark
blocking, non-reentrant, and thread-safe
- Parameters:
plan – [inout] The cutensorMpPlan_t object that will be deallocated
- Return values:
CUTENSOR_STATUS_SUCCESS – on success and an error code otherwise
cutensorMpPlanGetAttribute()#
-
cutensorStatus_t cutensorMpPlanGetAttribute(
    const cutensorMpHandle_t handle,
    const cutensorMpPlan_t plan,
    const cutensorMpPlanAttribute_t attribute,
    void *buf,
    const size_t sizeInBytes)#
Retrieves information about an already-created plan (see cutensorMpPlanAttribute_t).
- Parameters:
handle – [in] Opaque handle holding cutensorMp’s library context
plan – [in] Denotes an already-created plan (e.g., via cutensorMpCreatePlan)
attribute – [in] Requested attribute.
buf – [out] On successful exit: Holds the information of the requested attribute.
sizeInBytes – [in] Size of buf in bytes.
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully.
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error).
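A typical query pattern is sketched below: read back both workspace attributes of an existing plan and allocate the workspaces before calling cutensorMpContract. handle and plan are assumed to exist.

    // Query exact workspace requirements, then allocate.
    uint64_t deviceWsBytes = 0, hostWsBytes = 0;
    cutensorMpPlanGetAttribute(handle, plan,
                               CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE,
                               &deviceWsBytes, sizeof(deviceWsBytes));
    cutensorMpPlanGetAttribute(handle, plan,
                               CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST,
                               &hostWsBytes, sizeof(hostWsBytes));

    void *deviceWs = NULL;
    cudaMalloc(&deviceWs, deviceWsBytes);   // device workspace
    void *hostWs = malloc(hostWsBytes);     // host workspace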
Contraction Operations#
cutensorMpOperationDescriptor_t#
-
typedef struct cuTensorMpOperationDescriptor *cutensorMpOperationDescriptor_t#
cutensorMpCreateContraction()#
-
cutensorStatus_t cutensorMpCreateContraction(
    const cutensorMpHandle_t handle,
    cutensorMpOperationDescriptor_t *desc,
    const cutensorMpTensorDescriptor_t descA,
    const int32_t modesA[],
    cutensorOperator_t opA,
    const cutensorMpTensorDescriptor_t descB,
    const int32_t modesB[],
    cutensorOperator_t opB,
    const cutensorMpTensorDescriptor_t descC,
    const int32_t modesC[],
    cutensorOperator_t opC,
    const cutensorMpTensorDescriptor_t descD,
    const int32_t modesD[],
    const cutensorComputeDescriptor_t descCompute)#
Creates an operation descriptor that encodes a distributed tensor contraction.
This function creates an operation descriptor for distributed tensor contractions of the form \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \), where the tensors A, B, C, and D are distributed across multiple processes as specified by their respective tensor descriptors.
The distributed contraction leverages both intra-process cuTENSOR operations and inter-process communication to efficiently compute tensor contractions that exceed the memory capacity or computational resources of a single GPU. The operation automatically handles data redistribution, local contractions, and result aggregation across the participating processes.
The user is responsible for calling cutensorMpDestroyOperationDescriptor to free the resources associated with the descriptor once it is no longer needed.
Remark
non-blocking, non-reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cutensorMp’s library context
desc – [out] Pointer to the operation descriptor that will be created and filled with information encoding the distributed contraction operation
descA – [in] Distributed tensor descriptor for input tensor A
modesA – [in] Modes of the input tensor A
opA – [in] Unary operator that will be applied to each element of A before it is further processed. The original data of this tensor remains unchanged.
descB – [in] Distributed tensor descriptor for input tensor B
modesB – [in] Modes of the input tensor B
opB – [in] Unary operator that will be applied to each element of B before it is further processed. The original data of this tensor remains unchanged.
descC – [in] Distributed tensor descriptor for input tensor C
modesC – [in] Modes of the input tensor C
opC – [in] Unary operator that will be applied to each element of C before it is further processed. The original data of this tensor remains unchanged.
descD – [in] Distributed tensor descriptor for output tensor D (currently must be identical to descC)
modesD – [in] Modes of the output tensor D
descCompute – [in] Compute descriptor that determines the precision for the operation
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)
CUTENSOR_STATUS_NOT_SUPPORTED – if the combination of tensor configurations is not supported
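For instance, a GEMM-like distributed contraction D[m,n] = alpha * A[m,k] * B[k,n] + beta * C[m,n] could be encoded as sketched below; descA, descB, and descC are assumed to be existing distributed tensor descriptors, and descD/modesD reuse descC/modesC since descD must currently be identical to descC.

    // Encode D[m,n] = alpha * A[m,k] * B[k,n] + beta * C[m,n].
    int32_t modesA[] = {'m', 'k'};
    int32_t modesB[] = {'k', 'n'};
    int32_t modesC[] = {'m', 'n'};

    cutensorMpOperationDescriptor_t desc;
    cutensorMpCreateContraction(handle, &desc,
                                descA, modesA, CUTENSOR_OP_IDENTITY,
                                descB, modesB, CUTENSOR_OP_IDENTITY,
                                descC, modesC, CUTENSOR_OP_IDENTITY,
                                descC, modesC,               // descD == descC, modesD == modesC
                                CUTENSOR_COMPUTE_DESC_32F);  // FP32 compute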
cutensorMpContract()#
-
cutensorStatus_t cutensorMpContract(
    const cutensorMpHandle_t handle,
    const cutensorMpPlan_t plan,
    const void *alpha,
    const void *A,
    const void *B,
    const void *beta,
    const void *C,
    void *D,
    void *device_workspace,
    void *host_workspace)#
Performs a distributed tensor contraction across multiple processes.
This function executes the distributed tensor contraction \( D = \alpha \mathcal{A} \mathcal{B} + \beta \mathcal{C} \) according to the execution plan created by cutensorMpCreatePlan. The operation coordinates computation and communication across multiple processes and GPUs to efficiently perform tensor contractions that exceed the capacity of a single device.
The execution involves several phases:
Data redistribution to align tensor blocks for efficient computation
Local tensor contractions using cuTENSOR on each participating device
Communication and aggregation of partial results across processes
Final result assembly in the distributed output tensor
All participating processes in the NCCL communicator must call this function with consistent parameters. The input and output tensors must be distributed according to their respective tensor descriptors, with each process providing its local portion of the data.
Remark
calls asynchronous functions, non-reentrant, and thread-safe
- Parameters:
handle – [in] Opaque handle holding cutensorMp’s library context
plan – [in] Execution plan for the distributed contraction (created by cutensorMpCreatePlan)
alpha – [in] Scaling factor for the A*B product. Pointer to host memory with data type determined by the compute descriptor. The data type follows that of cuTENSOR (i.e., the data type of the scalar is determined by the data type of C: CUDA_R_16F and CUDA_R_16BF use CUDA_R_32F scalars; for all other data types, the scalar type is identical to the type of C)
A – [in] Pointer to the local portion of distributed tensor A in GPU memory
B – [in] Pointer to the local portion of distributed tensor B in GPU memory
beta – [in] Scaling factor for tensor C. Pointer to host memory with data type determined by the compute descriptor. The data type follows that of cuTENSOR (i.e., the data type of the scalar is determined by the data type of C: CUDA_R_16F and CUDA_R_16BF use CUDA_R_32F scalars; for all other data types, the scalar type is identical to the type of C)
C – [in] Pointer to the local portion of distributed tensor C in GPU memory
D – [out] Pointer to the local portion of distributed tensor D in GPU memory (may be identical to C)
device_workspace – [in] Pointer to device workspace memory (size determined by cutensorMpPlanGetAttribute with CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE)
host_workspace – [in] Pointer to host workspace memory (size determined by cutensorMpPlanGetAttribute with CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST)
- Return values:
CUTENSOR_STATUS_SUCCESS – The operation completed successfully
CUTENSOR_STATUS_NOT_INITIALIZED – if the handle is not initialized
CUTENSOR_STATUS_INVALID_VALUE – if some input data is invalid (this typically indicates a user error)
CUTENSOR_STATUS_NOT_SUPPORTED – if the operation is not supported with the given configuration
CUTENSOR_STATUS_INSUFFICIENT_WORKSPACE – if the provided workspace is insufficient
CUTENSOR_STATUS_ARCH_MISMATCH – if the plan was created for a different device architecture
CUTENSOR_STATUS_CUDA_ERROR – if a CUDA error occurred during execution
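Putting the pieces together, a hedged end-to-end sketch of the execution phase follows. handle, desc, and pref are assumed to exist from the previous steps; A_local, B_local, C_local, and D_local are hypothetical device pointers to each rank's local tensor portions, and stream is the stream the handle was created with. Every rank in the communicator must execute the same sequence.

    // Plan, allocate workspaces, contract, then clean up.
    cutensorMpPlan_t plan;
    cutensorMpCreatePlan(handle, &plan, desc, pref);

    uint64_t devWsBytes = 0, hostWsBytes = 0;
    cutensorMpPlanGetAttribute(handle, plan, CUTENSORMP_PLAN_REQUIRED_WORKSPACE_DEVICE,
                               &devWsBytes, sizeof(devWsBytes));
    cutensorMpPlanGetAttribute(handle, plan, CUTENSORMP_PLAN_REQUIRED_WORKSPACE_HOST,
                               &hostWsBytes, sizeof(hostWsBytes));

    void *deviceWorkspace = NULL;
    cudaMalloc(&deviceWorkspace, devWsBytes);
    void *hostWorkspace = malloc(hostWsBytes);

    float alpha = 1.0f, beta = 0.0f;   // scalar type follows C's data type (CUDA_R_32F assumed)
    cutensorMpContract(handle, plan, &alpha, A_local, B_local,
                       &beta, C_local, D_local,
                       deviceWorkspace, hostWorkspace);

    cudaStreamSynchronize(stream);     // the call enqueues asynchronous work
    cudaFree(deviceWorkspace);
    free(hostWorkspace);
    cutensorMpDestroyPlan(plan);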