cuBLASMp Data Types

Data types

cublasMpHandle_t

The cublasMpHandle_t structure holds the cuBLASMp library context (device properties, system information, and so on).
The handle must be initialized with cublasMpCreate() and destroyed with cublasMpDestroy().

cublasMpGrid_t

The cublasMpGrid_t structure holds the grid dimensions and the communicator associated with the grid of processes.
It must be initialized with cublasMpGridCreate() and destroyed with cublasMpGridDestroy().

cublasMpMatrixDescriptor_t

The cublasMpMatrixDescriptor_t structure captures the shape and distribution characteristics of a distributed matrix.
It must be initialized with cublasMpMatrixDescriptorCreate() and destroyed with cublasMpMatrixDescriptorDestroy().

cublasMpMatmulDescriptor_t

The cublasMpMatmulDescriptor_t structure captures the properties of a distributed matrix-matrix multiplication performed with cublasMpMatmul().
It must be initialized with cublasMpMatmulDescriptorCreate() and destroyed with cublasMpMatmulDescriptorDestroy().

Enumerators

cublasMpStatus_t

The type is used for function status returns. All cuBLASMp library functions return their status, which can have the following values.

CUBLASMP_STATUS_SUCCESS
    The operation completed successfully.

CUBLASMP_STATUS_NOT_INITIALIZED
    The cuBLASMp library was not initialized.

CUBLASMP_STATUS_ALLOCATION_FAILED
    Resource allocation failed inside the cuBLASMp library.

CUBLASMP_STATUS_INVALID_VALUE
    An unsupported value or parameter was passed to the function.

CUBLASMP_STATUS_ARCHITECTURE_MISMATCH
    The function requires a feature absent from the device architecture.

CUBLASMP_STATUS_EXECUTION_FAILED
    The GPU program failed to execute.

CUBLASMP_STATUS_INTERNAL_ERROR
    An internal cuBLASMp operation failed.

CUBLASMP_STATUS_NOT_SUPPORTED
    The functionality requested is not supported.

cublasMpGridLayout_t

Describes the ordering of the grid of processes.

CUBLASMP_GRID_MAPPING_ROW_MAJOR
    The grid of processes will be accessed in row-major ordering.

CUBLASMP_GRID_MAPPING_COL_MAJOR
    The grid of processes will be accessed in column-major ordering.

cublasMpLoggerCallback_t

Function pointer type for cuBLASMp logging callbacks. The callback function is called by the library to report logging information.
The callback can be set using cublasMpLoggerSetCallback().

The callback function takes the following parameters:

logLevel (int)
    The severity level of the log message.

functionName (const char*)
    Name of the function generating the log message.

message (const char*)
    The actual log message text.

cublasMpMatmulDescriptorAttribute_t

Attributes of cublasMpMatmulDescriptor_t that can be set using cublasMpMatmulDescriptorAttributeSet() and queried using cublasMpMatmulDescriptorAttributeGet().

Each attribute is listed below with the type of its value.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSA (cublasOperation_t)
    Indicates which operation is performed with the dense matrix A.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSB (cublasOperation_t)
    Indicates which operation is performed with the dense matrix B.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_COMPUTE_TYPE (cublasComputeType_t)
    Indicates the compute type of the matrix multiplication.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_ALGO_TYPE (cublasMpMatmulAlgoType_t)
    Hints the algorithm type to use. If the requested algorithm is not supported, cuBLASMp falls back to the default algorithm.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_COMMUNICATION_SM_COUNT (int32_t)
    Indicates the number of SMs to use for communication.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE (cublasMpMatmulEpilogue_t)
    Specifies the epilogue operation to perform after the matrix multiplication (e.g., activation functions, bias addition).

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_BIAS_POINTER (void*)
    Bias or bias-gradient vector pointer in device memory. Its role depends on the epilogue:
    - Input vector, with length matching the number of rows of matrix D, when one of the following epilogues is used: CUBLASMP_MATMUL_EPILOGUE_BIAS, CUBLASMP_MATMUL_EPILOGUE_RELU_BIAS, CUBLASMP_MATMUL_EPILOGUE_RELU_AUX_BIAS, CUBLASMP_MATMUL_EPILOGUE_GELU_BIAS, CUBLASMP_MATMUL_EPILOGUE_GELU_AUX_BIAS.
    - Output vector, with length matching the number of rows of matrix D, when one of the following epilogues is used: CUBLASMP_MATMUL_EPILOGUE_DRELU_BGRAD, CUBLASMP_MATMUL_EPILOGUE_DGELU_BGRAD, CUBLASMP_MATMUL_EPILOGUE_BGRADA.
    - Output vector, with length matching the number of columns of matrix D, when the CUBLASMP_MATMUL_EPILOGUE_BGRADB epilogue is used.
    Bias vector elements have the same type as the data type of matrix D.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_BIAS_BATCH_STRIDE (int64_t)
    Stride in elements between bias vectors in batched operations. Batched matmul is currently not supported, so this attribute must be set to 0 (the default value).

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_BIAS_DATA_TYPE (cudaDataType_t)
    Data type of the bias vector (e.g., CUDA_R_16F, CUDA_R_32F).

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER (void*)
    Pointer to the epilogue auxiliary buffer:
    - With CUBLASMP_MATMUL_EPILOGUE_RELU_AUX or CUBLASMP_MATMUL_EPILOGUE_RELU_AUX_BIAS: output vector for the ReLU bit-mask in the forward pass.
    - With CUBLASMP_MATMUL_EPILOGUE_DRELU or CUBLASMP_MATMUL_EPILOGUE_DRELU_BGRAD: input vector for the ReLU bit-mask in the backward pass.
    - With CUBLASMP_MATMUL_EPILOGUE_GELU_AUX_BIAS: output of the GELU input matrix in the forward pass.
    - With CUBLASMP_MATMUL_EPILOGUE_DGELU or CUBLASMP_MATMUL_EPILOGUE_DGELU_BGRAD: input of the GELU input matrix in the backward pass.
    For the data type, see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_DATA_TYPE. Requires setting the CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_LD attribute.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_LD (int64_t)
    Leading dimension of the epilogue auxiliary buffer:
    - With CUBLASMP_MATMUL_EPILOGUE_RELU_AUX, CUBLASMP_MATMUL_EPILOGUE_RELU_AUX_BIAS, CUBLASMP_MATMUL_EPILOGUE_DRELU, or CUBLASMP_MATMUL_EPILOGUE_DRELU_BGRAD: the ReLU bit-mask matrix leading dimension in elements (i.e., bits). Must be divisible by 128 and be no less than the number of rows in the output matrix.
    - With CUBLASMP_MATMUL_EPILOGUE_GELU_AUX_BIAS, CUBLASMP_MATMUL_EPILOGUE_DGELU, or CUBLASMP_MATMUL_EPILOGUE_DGELU_BGRAD: the GELU input matrix leading dimension in elements. Must be divisible by 8 and be no less than the number of rows in the output matrix.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_BATCH_STRIDE (int64_t)
    Stride in bytes between auxiliary output buffers in batched epilogue operations. Batched matmul is currently not supported, so this attribute must be set to 0 (the default value).

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_DATA_TYPE (cudaDataType_t)
    The type of the data stored in the buffer set via CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER. If unset, the data type defaults to the output matrix element data type (DType), with some exceptions: ReLU uses a bit-mask. For FP8 kernels with an output type (DType) of CUDA_R_8F_E4M3, the data type can be set to a non-default value if AType and BType are CUDA_R_8F_E4M3, the bias type is CUDA_R_16F, CType is CUDA_R_16BF or CUDA_R_16F, and CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE is set to CUBLASMP_MATMUL_EPILOGUE_GELU_AUX. When CType is CUDA_R_16F, the data type may be set to CUDA_R_16F or CUDA_R_8F_E4M3; when CType is CUDA_R_16BF, the data type may be set to CUDA_R_16BF. Otherwise, the data type should be left unset.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_SCALE_POINTER (void*)
    Device pointer to the scaling factor used to convert results from the compute-type data range to the storage data range in the auxiliary matrix set via CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER. The scaling factor value must have the same type as the compute type. If not specified, the scaling factor is assumed to be 1.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_AMAX_POINTER (void*)
    Device pointer to a memory location that, on completion, is set to the maximum of the absolute values in the buffer set via CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER. The computed value has the same type as the compute type. If not specified, the maximum absolute value is not computed.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_SCALE_MODE (cublasMpMatmulMatrixScale_t)
    Scaling mode that defines how the scaling factor for the auxiliary matrix is interpreted. Default value: CUBLASMP_MATMUL_MATRIX_SCALE_SCALAR_FP32.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_A_SCALE_POINTER (const void*)
    Device pointer to the scale factor that converts data in matrix A to the compute data type range. The scaling factor must have the same type as the compute type. Matrix scaling is supported only when the compute type is CUBLAS_COMPUTE_32F and the output matrix type is not complex. If not specified, the scaling factor is assumed to be 1.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_B_SCALE_POINTER (const void*)
    Equivalent to CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_A_SCALE_POINTER for matrix B.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_C_SCALE_POINTER (const void*)
    Equivalent to CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_A_SCALE_POINTER for matrix C.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_D_SCALE_POINTER (const void*)
    Equivalent to CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_A_SCALE_POINTER for matrix D.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_A_SCALE_MODE (cublasMpMatmulMatrixScale_t)
    Scaling mode that defines how the scaling factor for matrix A is interpreted. Default value: CUBLASMP_MATMUL_MATRIX_SCALE_SCALAR_FP32.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_B_SCALE_MODE (cublasMpMatmulMatrixScale_t)
    Scaling mode that defines how the scaling factor for matrix B is interpreted. Default value: CUBLASMP_MATMUL_MATRIX_SCALE_SCALAR_FP32.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_C_SCALE_MODE (cublasMpMatmulMatrixScale_t)
    Scaling mode that defines how the scaling factor for matrix C is interpreted. Default value: CUBLASMP_MATMUL_MATRIX_SCALE_SCALAR_FP32.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_D_SCALE_MODE (cublasMpMatmulMatrixScale_t)
    Scaling mode that defines how the scaling factor for matrix D is interpreted. Default value: CUBLASMP_MATMUL_MATRIX_SCALE_SCALAR_FP32.

CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_AMAX_D_POINTER (void*)
    Device pointer to a memory location that, on completion, is set to the maximum of the absolute values in the output matrix. The computed value has the same type as the compute type. If not specified, the maximum absolute value is not computed.

cublasMpMatmulAlgoType_t

Matrix-matrix multiplication algorithm types. This is treated as a hint; cuBLASMp is not guaranteed to use the requested implementation.

CUBLASMP_MATMUL_ALGO_TYPE_DEFAULT
    Default algorithm.

CUBLASMP_MATMUL_ALGO_TYPE_SPLIT_P2P
    Use split matmul with P2P communication.

CUBLASMP_MATMUL_ALGO_TYPE_SPLIT_MULTICAST
    Use split matmul with multicast communication. Only valid for GEMM+ReduceScatter and GEMM+AllReduce on Compute Capability 9.0 (Hopper) and newer GPUs connected with an NVSwitch.

CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_MULTICAST [DEPRECATED]
    Use atomic matmul with multicast communication. Only valid for GEMM+ReduceScatter on Compute Capability 9.0 (Hopper) and newer GPUs connected with an NVSwitch.

cublasMpMatmulMatrixScale_t

An enumerated type used to specify the scaling mode that defines how scaling factor pointers are interpreted.

CUBLASMP_MATMUL_MATRIX_SCALE_SCALAR_FP32
    Scaling factors are single-precision scalars applied to the whole tensor.

cublasMpMatmulEpilogue_t

An enumerated type used to specify the epilogue operation to be performed after matrix multiplication. Epilogues enable fusion of additional operations such as activation functions, bias addition, and their derivatives for training, providing better performance by avoiding separate kernel launches.

CUBLASMP_MATMUL_EPILOGUE_DEFAULT
    No special postprocessing.
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_ALLREDUCE
    Performs an AllReduce communication operation after the matrix multiplication.
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_RELU
    Apply the ReLU point-wise transform to the results (x := max(x, 0)).
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_RELU_AUX
    Apply the ReLU point-wise transform to the results (x := max(x, 0)). This epilogue mode produces an extra output; see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_BIAS
    Apply (broadcast) bias from the bias vector. The bias vector length must match the number of rows of matrix D, and the vector must be packed (that is, the stride between vector elements is 1). The bias vector is broadcast to all columns and added before applying the final postprocessing.
    Supported matmul algorithms: AllGather+GEMM, GEMM+ReduceScatter, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_RELU_BIAS
    Apply bias and then the ReLU transform.
    Supported matmul algorithms: AllGather+GEMM, GEMM+ReduceScatter, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_RELU_AUX_BIAS
    Apply bias and then the ReLU transform. This epilogue mode produces an extra output; see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: AllGather+GEMM, GEMM+ReduceScatter, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_GELU
    Apply the GELU point-wise transform to the results (x := GELU(x)).
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_GELU_AUX
    Apply the GELU point-wise transform to the results (x := GELU(x)). This epilogue mode outputs the GELU input as a separate matrix (useful for training). See CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_GELU_BIAS
    Apply bias and then the GELU transform.
    Supported matmul algorithms: AllGather+GEMM, GEMM+ReduceScatter, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_GELU_AUX_BIAS
    Apply bias and then the GELU transform. This epilogue mode outputs the GELU input as a separate matrix (useful for training). See CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: AllGather+GEMM, GEMM+ReduceScatter, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_DGELU
    Apply the GELU gradient to the matmul output and store the GELU gradient in the output matrix. This epilogue mode requires an extra input; see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: AllGather+GEMM, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_DGELU_BGRAD
    Independently apply the GELU gradient and the bias gradient to the matmul output. The GELU gradient is stored in the output matrix, and the bias gradient in the bias buffer (see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_BIAS_POINTER). This epilogue mode requires an extra input; see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE_AUX_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: AllGather+GEMM, GEMM+AllReduce.

CUBLASMP_MATMUL_EPILOGUE_BGRADA
    Apply the bias gradient to the input matrix A. The bias size corresponds to the number of rows of matrix D, and the reduction happens over the GEMM's "k" dimension. The bias gradient is stored in the bias buffer; see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_BIAS_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_BGRADB
    Apply the bias gradient to the input matrix B. The bias size corresponds to the number of columns of matrix D, and the reduction happens over the GEMM's "k" dimension. The bias gradient is stored in the bias buffer; see CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_BIAS_POINTER of cublasMpMatmulDescriptorAttribute_t.
    Supported matmul algorithms: all.

CUBLASMP_MATMUL_EPILOGUE_DRELU
    The DRELU epilogue is currently not supported.
    Supported matmul algorithms: none.

CUBLASMP_MATMUL_EPILOGUE_DRELU_BGRAD
    The DRELU_BGRAD epilogue is currently not supported.
    Supported matmul algorithms: none.