cuBLASMp C API#

Library Management#

cublasMpCreate#

cublasMpStatus_t cublasMpCreate(
        cublasMpHandle_t *handle,
        cudaStream_t stream);
This function initializes the cuBLASMp library handle (cublasMpHandle_t) which holds the cuBLASMp library context. It allocates light hardware resources on the host, and must be called prior to making any other cuBLASMp library calls.
Calling any cuBLASMp function which uses cublasMpHandle_t without a previous call of cublasMpCreate() will return an error.
The cuBLASMp library context is tied to the current CUDA device and the given CUDA stream.
Sharing a device with multiple processes may result in undefined behavior.

Parameter

Memory

In/Out

Description

handle

Host

Out

cuBLASMp library handle.

stream

Host

In

Stream that will be assigned to the handle.

See cublasMpStatus_t for the description of the return status.

cublasMpDestroy#

cublasMpStatus_t cublasMpDestroy(
        cublasMpHandle_t handle);
This function destroys the cuBLASMp library handle (cublasMpHandle_t) which holds the cuBLASMp library context.
The cuBLASMp library context is tied to the CUDA device that was set when calling cublasMpCreate(). Only one handle per process and per GPU supported.

Parameter

Memory

In/Out

Description

handle

Host

In/Out

cuBLASMp library handle to destroy.

See cublasMpStatus_t for the description of the return status.

cublasMpStreamSet#

cublasMpStatus_t cublasMpStreamSet(
        cublasMpHandle_t handle,
        cudaStream_t stream);
This function sets the CUDA stream to be used in the computations.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

stream

Host

In

CUDA stream pointer to set.

See cublasMpStatus_t for the description of the return status.

cublasMpStreamGet#

cublasMpStatus_t cublasMpStreamGet(
        cublasMpHandle_t handle,
        cudaStream_t* stream);
This function returns the current CUDA stream that is being used in the computations.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

stream

Host

Out

CUDA stream pointer to set.

See cublasMpStatus_t for the description of the return status.

cublasMpGetVersion#

cublasMpStatus_t cublasMpGetVersion(
        int *version);
This function returns the version number of the cuBLASMp library.

Parameter

Memory

In/Out

Description

version

Host

Out

cuBLASMp library version. Value is CUBLASMP_VER_MAJOR * 1000 + CUBLASMP_VER_MINOR * 100 + CUBLASMP_VER_PATCH.

See cublasMpStatus_t for the description of the return status.

Grid Management#

cublasMpGridCreate#

cublasMpStatus_t cublasMpGridCreate(
        int64_t nprow,
        int64_t npcol,
        cublasMpGridLayout_t layout,
        cal_comm_t comm,
        cublasMpGrid_t* grid);
This function initializes the grid opaque data structure. It maps the given resources (communicator, grid dimensions and grid layout) to a grid object.
All the processes defined to be in this grid must enter this function.

If NVSHMEM is not initialized by the user, cuBLASMp will initialize it as the first grid is created. If NVSHMEM is initialized already, it will not be re-initialized. cuBLASMp will only call nvshmem_finalize if it initialized NVSHMEM.

Parameter

Memory

In/Out

Description

nprow

Host

In

How many row processes the grid contains.

npcol

Host

In

How many column processes the grid contains.

layout

Host

In

Grid’s layout (cublasMpGridLayout_t).

comm

Host

In

Communicator associated with the grid.

grid

Host

In/Out

Pointer to a grid object.

See cublasMpStatus_t for the description of the return status.

cublasMpGridDestroy#

cublasMpStatus_t cublasMpGridDestroy(
        cublasMpGrid_t grid);
This function destroys the given grid object.
All the processes defined to be in this grid must enter this function.

Parameter

Memory

In/Out

Description

grid

Host

In/Out

Grid object to destroy.

See cublasMpStatus_t for the description of the return status.

Matrix Management#

cublasMpMatrixDescriptorCreate#

cublasMpStatus_t cublasMpMatrixDescriptorCreate(
        int64_t m,
        int64_t n,
        int64_t mb,
        int64_t nb,
        int64_t rsrc,
        int64_t csrc,
        int64_t lld,
        cudaDataType_t type,
        cublasMpGrid_t grid,
        cublasMpMatrixDescriptor_t* desc);
This function initializes cublasMpMatrixDescriptor_t object.

Parameter

Memory

In/Out

Description

m

Host

In

Number of rows in the global matrix.

n

Host

In

Number of columns in the global matrix.

mb

Host

In

Blocking factor used to distribute the rows of the global matrix.

nb

Host

In

Blocking factor used to distribute the columns of the global matrix.

rsrc

Host

In

Row rank of the process who owns the first row block of the global matrix.

csrc

Host

In

Column rank of the process who owns the first column block of the global matrix.

lld

Host

In

Leading dimension of the local matrix.

type

Host

In

Data type of the matrix.

grid

Host

In

Grid object associated with the matrix descriptor.

desc

Host

Out

Matrix descriptor object initialized by this function.

Supported values for type argument are listed.

Data Type

Description

CUDA_R_8I

8-bit real signed integer.

CUDA_R_32I

32-bit real signed integer.

CUDA_R_8F_E4M3

8-bit real floating point in E4M3 format.

CUDA_R_8F_E5M2

8-bit real floating point in E5M2 format.

CUDA_R_16F

16-bit real half precision floating-point.

CUDA_R_16BF

16-bit real bfloat16 floating-point.

CUDA_R_32F

32-bit real single precision floating-point.

CUDA_R_64F

64-bit real double precision floating-point.

CUDA_C_32F

64-bit structure comprised of two single precision floating-points representing a complex number.

CUDA_C_64F

128-bit structure comprised of two double precision floating-points representing a complex number.

See cublasMpStatus_t for the description of the return status.

cublasMpMatrixDescriptorDestroy#

cublasMpStatus_t cublasMpMatrixDescriptorDestroy(
        cublasMpMatrixDescriptor_t desc);
This function destroys cublasMpMatrixDescriptor_t object.

Parameter

Memory

In/Out

Description

desc

Host

In/Out

Matrix descriptor object to destroy.

See cublasMpStatus_t for the description of the return status.

cublasMpMatrixDescriptorInit#

cublasMpStatus_t cublasMpMatrixDescriptorInit(
        int64_t m,
        int64_t n,
        int64_t mb,
        int64_t nb,
        int64_t rsrc,
        int64_t csrc,
        int64_t lld,
        cudaDataType_t type,
        cublasMpGrid_t grid,
        cublasMpMatrixDescriptor_t desc);
This function initializes the values of a cublasMpMatrixDescriptor_t object. This function does not allocate additional memory.

Parameter

Memory

In/Out

Description

m

Host

In

Number of rows in the global matrix.

n

Host

In

Number of columns in the global matrix.

mb

Host

In

Blocking factor used to distribute the rows of the global matrix.

nb

Host

In

Blocking factor used to distribute the columns of the global matrix.

rsrc

Host

In

Row rank of the process who owns the first row block of the global matrix.

csrc

Host

In

Column rank of the process who owns the first column block of the global matrix.

lld

Host

In

Leading dimension of the local matrix.

type

Host

In

Data type of the matrix.

grid

Host

In

Grid object associated with the matrix descriptor.

desc

Host

Out

Matrix descriptor object initialized by this function.

Supported values for type argument are listed below:

Data Type

Description

CUDA_R_8I

8-bit real signed integer.

CUDA_R_32I

32-bit real signed integer.

CUDA_R_8F_E4M3

8-bit real floating point in E4M3 format.

CUDA_R_8F_E5M2

8-bit real floating point in E5M2 format.

CUDA_R_16F

16-bit real half precision floating-point.

CUDA_R_16BF

16-bit real bfloat16 floating-point.

CUDA_R_32F

32-bit real single precision floating-point.

CUDA_R_64F

64-bit real double precision floating-point.

CUDA_C_32F

64-bit structure comprised of two single precision floating-points representing a complex number.

CUDA_C_64F

128-bit structure comprised of two double precision floating-points representing a complex number.

See cublasMpStatus_t for the description of the return status.

Matmul Properties#

cublasMpMatmulDescriptorCreate#

cublasMpStatus_t cublasMpMatmulDescriptorCreate(
        cublasMpMatmulDescriptor_t* matmulDesc,
        cublasComputeType_t computeType);
This function initializes the values of a cublasMpMatmulDescriptor_t object used in cublasMpMatmul().

Parameter

Memory

In/Out

Description

matmulDesc

Host

In/Out

Pointer to a cublasMpMatmulDescriptor object to initialize.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

Supported values for computeType argument are listed.

Compute Types

CUBLAS_COMPUTE_32I

CUBLAS_COMPUTE_32I_PEDANTIC

CUBLAS_COMPUTE_16F

CUBLAS_COMPUTE_16F_PEDANTIC

CUBLAS_COMPUTE_32F

CUBLAS_COMPUTE_32F_PEDANTIC

CUBLAS_COMPUTE_32F_FAST_16F

CUBLAS_COMPUTE_32F_FAST_16BF

CUBLAS_COMPUTE_32F_FAST_TF32

CUBLAS_COMPUTE_64F

CUBLAS_COMPUTE_64F_PEDANTIC

See cublasMpStatus_t for the description of the return status.

cublasMpMatmulDescriptorDestroy#

cublasMpStatus_t cublasMpMatmulDescriptorDestroy(
        cublasMpMatmulDescriptor_t matmulDesc);
This function destroys cublasMpMatmulDescriptor_t object used in cublasMpMatmul().

Parameter

Memory

In/Out

Description

matmulDesc

Host

In/Out

Matmul descriptor object to destroy.

See cublasMpStatus_t for the description of the return status.

cublasMpMatmulDescriptorAttributeSet#

cublasMpStatus_t cublasMpMatmulDescriptorAttributeSet(
        cublasMpMatmulDescriptor_t matmulDesc,
        cublasMpMatmulDescriptorAttribute_t attr,
        const void* buf,
        size_t sizeInBytes);
This function sets attributes of cublasMpMatmulDescriptor_t object used in cublasMpMatmul(). The attributes are of type cublasMpMatmulDescriptorAttribute_t.

Parameter

Memory

In/Out

Description

matmulDesc

Host

In

Matmul descriptor object to set its attribute.

attr

Host

In

Matmul descriptor attribute to set.

buf

Host

In

Attribute value to set.

sizeInBytes

Host

In

Attribute buffer size in bytes.

See cublasMpStatus_t for the description of the return status.

cublasMpMatmulDescriptorAttributeGet#

cublasMpStatus_t cublasMpMatmulDescriptorAttributeGet(
        cublasMpMatmulDescriptor_t matmulDesc,
        cublasMpMatmulDescriptorAttribute_t attr,
        const void* buf,
        size_t sizeInBytes,
        size_t* sizeWritten);
This function returns the attributes of cublasMpMatmulDescriptor_t object used in cublasMpMatmul().

Parameter

Memory

In/Out

Description

matmulDesc

Host

In

Matmul descriptor object to set its attribute.

attr

Host

In

Matmul descriptor attribute to set.

buf

Host

Out

Attribute value to set.

sizeInBytes

Host

In

Attribute buffer size in bytes.

sizeWritten

Host

Out

Size of the attribute written into buf in bytes.

See cublasMpStatus_t for the description of the return status.

Utility#

cublasMpNumroc#

int64_t cublasMpNumroc(
        int64_t n,
        int64_t nb,
        uint32_t iproc,
        uint32_t isrcproc,
        uint32_t nprocs);
Computes the number of rows or columns of a distributed matrix owned by the process indicated by iproc argument.

Parameter

Memory

In/Out

Description

n

Host

In

Number of rows or columns in the global distributed matrix.

nb

Host

In

Row or column blocking size of the global matrix.

iproc

Host

In

The coordinate of the process whose local array row or column is to be determined.

isrcproc

Host

In

The coordinate of the process that owns the first row or column of the distributed matrix.

nprocs

Host

In

The total number of row or column processes over which the matrix is distributed.


cublasMpGemr2D#

cublasMpStatus_t cublasMpGemr2D(
        cublasMpHandle_t handle,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function redistributes general rectangular matrix A according to the distribution properties of matrix B.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

m

Host

In

Number of rows of sub(A) and sub(B).

n

Host

In

Number of columns of sub(A) and sub(B).

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A).

ja

Host

In

Column index of the first column of the sub(A).

descA

Host

In

Matrix descriptor associated to the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A.

b

Device

Out

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B).

jb

Host

In

Column index of the first column of the sub(B).

descB

Host

In

Matrix descriptor associated to the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpGemr2D_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpGemr2D_bufferSize().

global_comm

Host

In

A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix.

See cublasMpStatus_t for the description of the return status.

cublasMpGemr2D_bufferSize#

cublasMpStatus_t cublasMpGemr2D_bufferSize(
        cublasMpHandle_t handle,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function returns the required buffer sizes to perform cublasMpGemr2D() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

m

Host

In

Number of rows of sub(A) and sub(B).

n

Host

In

Number of columns of sub(B) and sub(B).

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A).

ja

Host

In

Column index of the first column of the sub(A).

descA

Host

In

Matrix descriptor associated to the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B).

jb

Host

In

Column index of the first column of the sub(B).

descB

Host

In

Matrix descriptor associated to the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpGemr2D().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpGemr2D().

global_comm

Host

In

A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix.

See cublasMpStatus_t for the description of the return status.

cublasMpTrmr2D#

cublasMpStatus_t cublasMpTrmr2D(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function redistributes trapezoidal matrix A according to the distribution properties of trapezoidal matrix B.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements.

diag

Host

In

Indicates if the elements on the main diagonal of matrix A are unity and should not be accessed.

m

Host

In

Number of rows of sub(A) and sub(B).

n

Host

In

Number of columns of sub(A) and sub(B).

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A).

ja

Host

In

Column index of the first column of the sub(A).

descA

Host

In

Matrix descriptor associated to the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A.

b

Device

Out

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B).

jb

Host

In

Column index of the first column of the sub(B).

descB

Host

In

Matrix descriptor associated to the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpTrmr2D_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpTrmr2D_bufferSize().

global_comm

Host

In

A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix.

See cublasMpStatus_t for the description of the return status.

cublasMpTrmr2D_bufferSize#

cublasMpStatus_t cublasMpTrmr2D_bufferSize(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function returns the required buffer sizes to perform cublasMpTrmr2D() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements.

diag

Host

In

Indicates if the elements on the main diagonal of matrix A are unity and should not be accessed.

m

Host

In

Number of rows of sub(A) and sub(B).

n

Host

In

Number of columns of sub(B) and sub(B).

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A).

ja

Host

In

Column index of the first column of the sub(A).

descA

Host

In

Matrix descriptor associated to the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B).

jb

Host

In

Column index of the first column of the sub(B).

descB

Host

In

Matrix descriptor associated to the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpTrmr2D().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpTrmr2D().

global_comm

Host

In

A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix.

See cublasMpStatus_t for the description of the return status.

Logging#

cublasMpLoggerSetCallback#

cublasMpStatus_t cublasMpLoggerSetCallback(
        cublasMpLoggerCallback_t callback);
This function sets the logging callback function.

Parameter

Memory

In/Out

Description

callback

Host

In

Pointer to a callback function. See cublasMpLoggerCallback_t.

See cublasMpStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerSetFile#

cublasMpStatus_t cublasMpLoggerSetFile(
        FILE *file);
This function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Parameter

Memory

In/Out

Description

file

Host

In

Pointer to an open file. File should have write permission.

See cublasMpStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerOpenFile#

cublasMpStatus_t cublasMpLoggerOpenFile(
        const char* logFile);
This function opens a logging output file in the given path.

Parameter

Memory

In/Out

Description

logFile

Host

In

Path of the logging output file.

See cublasMpStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerSetLevel#

cublasMpStatus_t cublasMpLoggerSetLevel(
        int level);
This function sets the logging level.

Parameter

Memory

In/Out

Description

level

Host

In

Value of the logging level. See cuBLASMp Logging.

See cublasMpStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerSetMask#

cublasMpStatus_t cublasMpLoggerSetMask(
        int mask);
This function sets the value of the logging mask.

Parameter

Memory

In/Out

Description

mask

Host

In

Value of the logging mask. See cuBLASMp Logging.

See cublasMpStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerForceDisable#

cublasMpStatus_t cublasMpLoggerForceDisable();
This function disables logging for the entire run.
See cublasMpStatus_t for the description of the return status.

Warning

This is an experimental feature.


Dense Linear Algebra APIs#

cublasMpTrsm#

cublasMpStatus_t cublasMpTrsm(
        cublasMpHandle_t handle,
        cublasSideMode_t side,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        cublasComputeType_t computeType,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function solves the triangular linear system with multiple right-hand-sides

\(\left\{ \begin{matrix} {\text{op}(A)X = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {X\text{op}(A) = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.\)

where \(A\) is a triangular matrix stored in lower or upper mode with or without the main diagonal, \(X\) and \(B\) are \(m \times n\) matrices, and \(\alpha\) is a scalar. Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

The solution \(X\) overwrites the right-hand-sides \(B\) on exit.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

side

Host

In

Indicates if matrix A is on the left or right of X.

uplo

Host

In

Indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

diag

Host

In

Indicates if the elements on the main diagonal of matrix A are unity and should not be accessed.

m

Host

In

Number of rows of matrix sub(B), with matrix sub(A) sized accordingly.

n

Host

In

Number of columns of matrix sub(B), with matrix sub(A) is sized accordingly.

alpha

Host

In

Scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In/Out

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpTrsm_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpTrsm_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpTrsm_bufferSize#

cublasMpStatus_t cublasMpTrsm_bufferSize(
        cublasMpHandle_t handle,
        cublasSideMode_t side,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        cublasComputeType_t computeType,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpTrsm() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

side

Host

In

Indicates if matrix A is on the left or right of X.

uplo

Host

In

Indicates if matrix A lower or upper part is stored, the other part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

diag

Host

In

Indicates if the elements on the main diagonal of matrix A are unity and should not be accessed.

m

Host

In

Number of rows of matrix sub(B), with matrix sub(A) sized accordingly.

n

Host

In

Number of columns of matrix sub(B), with matrix sub(A) is sized accordingly.

alpha

Host

In

Scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpTrsm().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpTrsm().

This function requires square block size.
This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpGemm#

cublasMpStatus_t cublasMpGemm(
        cublasMpHandle_t handle,
        cublasOperation_t transA,
        cublasOperation_t transB,
        int64_t m,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the matrix-matrix multiplication

\(C = \alpha\text{op}(A)\text{op}(B) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\) , \(B\) and \(C\) are matrices stored in column-major format with dimensions \(\text{op}(A)\) \(m \times k\) , \(\text{op}(B)\) \(k \times n\) and \(C\) \(m \times n\) , respectively. Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

and \(\text{op}(B)\) is defined similarly for matrix \(B\) .

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

transA

Host

In

Operation op(A) that is non- or (conj.) transpose.

transB

Host

In

Operation op(B) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(B) and sub(C).

k

Host

In

Number of columns of sub(A) and rows of sub(B).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpGemm_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpGemm_bufferSize().

This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_16F or

CUBLAS_COMPUTE_16F_PEDANTIC

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUBLAS_COMPUTE_32F or

CUBLAS_COMPUTE_32F_PEDANTIC

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16F

CUDA_R_16F

CUDA_R_8I

CUDA_R_32F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_16F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_8I

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_32F_FAST_16F or

CUBLAS_COMPUTE_32F_FAST_16BF or

CUBLAS_COMPUTE_32F_FAST_TF32

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F or

CUBLAS_COMPUTE_64F_PEDANTIC

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpGemm_bufferSize#

cublasMpStatus_t cublasMpGemm_bufferSize(
        cublasMpHandle_t handle,
        cublasOperation_t transA,
        cublasOperation_t transB,
        int64_t m,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpGemm() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

transA

Host

In

Operation op(A) that is non- or (conj.) transpose.

transB

Host

In

Operation op(B) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(B) and sub(C).

k

Host

In

Number of columns of sub(A) and rows of sub(B).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpGemm().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpGemm().

This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_16F or

CUBLAS_COMPUTE_16F_PEDANTIC

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUBLAS_COMPUTE_32F or

CUBLAS_COMPUTE_32F_PEDANTIC

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16F

CUDA_R_16F

CUDA_R_8I

CUDA_R_32F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_16F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_8I

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_32F_FAST_16F or

CUBLAS_COMPUTE_32F_FAST_16BF or

CUBLAS_COMPUTE_32F_FAST_TF32

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F or

CUBLAS_COMPUTE_64F_PEDANTIC

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpMatmul#

cublasMpStatus_t cublasMpMatmul(
        cublasMpHandle_t handle,
        cublasMpMatmulDescriptor_t matmulDesc,
        int64_t m,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        const void* beta,
        const void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        void* d,
        int64_t id,
        int64_t jd,
        cublasMpMatrixDescriptor_t descD,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the matrix-matrix multiplication

\(D = \alpha\text{op}(A)\text{op}(B) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\), \(B\), \(C\) and \(D\) are matrices stored in column-major format with dimensions \(\text{op}(A)\) \(m \times k\) , \(\text{op}(B)\) \(k \times n\), \(C\) \(m \times n\) and \(D\) \(m \times n\) , respectively. \(\text{op}(A)\) depends on the value of CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSA attribute of the matmulDesc descriptor:

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{TRANSA == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{TRANSA == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{TRANSA == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

and \(\text{op}(B)\) is defined similarly for matrix \(B\) based on the CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSB attribute.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

matmulDesc

Host

In

Descriptor of the operation to perform, created with cublasMpMatmulDescriptorCreate().

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(B) and sub(C).

k

Host

In

Number of columns of sub(A) and rows of sub(B).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

d

Device

Out

Pointer to the first entry of the local portion of the global matrix D.

id

Host

In

Row index of the first row of the sub(D). id must be a multiple of the row blocking dimension mbD.

jd

Host

In

Column index of the first column of the sub(D). jd must be a multiple of the column blocking dimension nbD.

descD

Host

In

Matrix descriptor associated to the global matrix D.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpMatmul_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpMatmul_bufferSize().

This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype

Btype

Ctype

Dtype

CUBLAS_COMPUTE_16F or

CUBLAS_COMPUTE_16F_PEDANTIC

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUBLAS_COMPUTE_32F or

CUBLAS_COMPUTE_32F_PEDANTIC

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_8I

CUDA_R_8I

CUDA_R_32F

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_32F

CUDA_R_32F

CUDA_R_16F

CUDA_R_16F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_8I

CUDA_C_8I

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_32F

CUDA_R_32F

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16BF

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16F

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16F

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_32F

CUDA_R_32F

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_8F_E5M2

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_16F

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_8F_E5M2

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_32F

CUDA_R_32F

CUBLAS_COMPUTE_32F_FAST_16F or

CUBLAS_COMPUTE_32F_FAST_16BF or

CUBLAS_COMPUTE_32F_FAST_TF32

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F or

CUBLAS_COMPUTE_64F_PEDANTIC

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpMatmul_bufferSize#

cublasMpStatus_t cublasMpMatmul_bufferSize(
        cublasMpHandle_t handle,
        cublasMpMatmulDescriptor_t matmulDesc,
        int64_t m,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        const void* beta,
        const void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        void* d,
        int64_t id,
        int64_t jd,
        cublasMpMatrixDescriptor_t descD,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpMatmul() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

matmulDesc

Host

In

Descriptor of the operation to perform, created with cublasMpMatmulDescriptorCreate().

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(B) and sub(C).

k

Host

In

Number of columns of sub(A) and rows of sub(B).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

d

Device

Out

Pointer to the first entry of the local portion of the global matrix D.

id

Host

In

Row index of the first row of the sub(D). id must be a multiple of the row blocking dimension mbD.

jd

Host

In

Column index of the first column of the sub(D). jd must be a multiple of the column blocking dimension nbD.

descD

Host

In

Matrix descriptor associated to the global matrix D.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpGemm().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpGemm().

This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype

Btype

Ctype

Dtype

CUBLAS_COMPUTE_16F or

CUBLAS_COMPUTE_16F_PEDANTIC

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUBLAS_COMPUTE_32F or

CUBLAS_COMPUTE_32F_PEDANTIC

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUDA_R_8I

CUDA_R_8I

CUDA_R_32F

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_32F

CUDA_R_32F

CUDA_R_16F

CUDA_R_16F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_8I

CUDA_C_8I

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_32F

CUDA_R_32F

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16BF

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16F

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_16F

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_32F

CUDA_R_32F

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16BF

CUDA_R_8F_E5M2

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_16F

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_8F_E4M3

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_16F

CUDA_R_8F_E5M2

CUDA_R_8F_E5M2

CUDA_R_8F_E4M3

CUDA_R_32F

CUDA_R_32F

CUBLAS_COMPUTE_32F_FAST_16F or

CUBLAS_COMPUTE_32F_FAST_16BF or

CUBLAS_COMPUTE_32F_FAST_TF32

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F or

CUBLAS_COMPUTE_64F_PEDANTIC

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpSyrk#

cublasMpStatus_t cublasMpSyrk(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the symmetric rank- \(k\) update

\(C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, \(C\) is a symmetric matrix stored in lower or upper mode, and \(A\) is a matrix with dimensions \(\text{op}(A)\) \(n \times k\) . Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.\)

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates if matrix C lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or transpose.

n

Host

In

Number of rows of sub(A) and sub(C).

k

Host

In

Number of columns of sub(A).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix A.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpSyrk_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpSyrk_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpSyrk_bufferSize#

cublasMpStatus_t cublasMpSyrk_bufferSize(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpSyrk() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates if matrix C lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or transpose.

n

Host

In

Number of rows of sub(A) and sub(C).

k

Host

In

Number of columns of sub(A).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpSyrk().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpSyrk().

This function requires square block size.
This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasMpStatus_t for the description of the return status.

cublasMpGeadd#

cublasMpStatus_t cublasMpGeadd(
        cublasMpHandle_t handle,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the matrix-matrix addition

\(C = \alpha\text{op}(A) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\) and \(C\) are matrices stored in column-major format with dimensions \(\text{op}(A)\) \(m \times n\) and \(C\) \(m \times n\) , respectively. Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpGeadd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpGeadd_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasMpStatus_t for the description of the return status.

cublasMpGeadd_bufferSize#

cublasMpStatus_t cublasMpGeadd_bufferSize(
        cublasMpHandle_t handle,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpGeadd() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpGeadd().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpGeadd().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasMpStatus_t for the description of the return status.

cublasMpTradd#

cublasMpStatus_t cublasMpTradd(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the trapezoidal matrix-matrix addition

\(C = \alpha\text{op}(A) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\) and \(C\) are matrices stored in column-major format with dimensions \(\text{op}(A)\) \(m \times n\) and \(C\) \(m \times n\) , respectively. Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates if matrix C lower or upper part is stored, the other part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpTradd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpTradd_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasMpStatus_t for the description of the return status.

cublasMpTradd_bufferSize#

cublasMpStatus_t cublasMpTradd_bufferSize(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpTradd() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates if matrix C lower or upper part is stored, the other part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpTradd().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpTradd().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasMpStatus_t for the description of the return status.