cuBLASMp C API

Library Management

cublasMpCreate

cublasStatus_t cublasMpCreate(
        cublasMpHandle_t *handle,
        cudaStream_t stream);
The function initializes the cuBLASMp library handle (cublasMpHandle_t) which holds the cuBLASMp library context. It allocates light hardware resources on the host, and must be called prior to making any other cuBLASMp library calls.
Calling any cuBLASMp function that takes a cublasMpHandle_t without a prior call to cublasMpCreate() returns an error.
The cuBLASMp library context is tied to the current CUDA device and the given CUDA stream.
Sharing a device with multiple processes may result in undefined behavior.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | Out | cuBLASMp library handle. |
| stream | Host | In | Stream that will be assigned to the handle. |

See cublasStatus_t for the description of the return status.

cublasMpDestroy

cublasStatus_t cublasMpDestroy(
        cublasMpHandle_t handle);
The function destroys the cuBLASMp library handle (cublasMpHandle_t) which holds the cuBLASMp library context.
The cuBLASMp library context is tied to the CUDA device that was set when calling cublasMpCreate(). Only one handle per process and per GPU is supported.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In/Out | cuBLASMp library handle to destroy. |

See cublasStatus_t for the description of the return status.

cublasMpGetVersion

cublasStatus_t cublasMpGetVersion(
        cublasMpHandle_t handle,
        int *version);
This function returns the version number of the cuBLASMp library.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| version | Host | Out | cuBLASMp library version. Value is CUBLASMP_VER_MAJOR * 1000 + CUBLASMP_VER_MINOR * 100 + CUBLASMP_VER_PATCH. |

See cublasStatus_t for the description of the return status.
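
As a rough illustration of how the packed version value can be split back into its components, the sketch below uses integer division and remainder. The VersionParts type and the decode_version helper are hypothetical names introduced here and are not part of cuBLASMp. The decoding assumes the minor version stays below 10 and the patch version below 100, since otherwise the fields of the packed value overlap.

```c
#include <assert.h>

/* Hypothetical helper type for illustration; not part of cuBLASMp. */
typedef struct {
    int major;
    int minor;
    int patch;
} VersionParts;

/* Split a value packed as major * 1000 + minor * 100 + patch.
 * Assumption: minor < 10 and patch < 100, so the fields do not overlap. */
VersionParts decode_version(int version)
{
    assert(version >= 0);

    VersionParts parts;
    parts.major = version / 1000;          /* thousands and above */
    parts.minor = (version % 1000) / 100;  /* hundreds digit */
    parts.patch = version % 100;           /* last two digits */
    return parts;
}
```

In an application, the argument would be the integer written by cublasMpGetVersion() into its version output parameter.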

cublasMpSetMathMode

cublasStatus_t cublasMpSetMathMode(
        cublasMpHandle_t handle,
        cublasMath_t mode);
This function selects the compute precision mode, as defined by cublasMath_t. Modes can be combined with a bitwise OR (except the deprecated CUBLAS_TENSOR_OP_MATH), for example cublasMpSetMathMode(handle, CUBLAS_DEFAULT_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION). The default math mode is CUBLAS_DEFAULT_MATH.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| mode | Host | In | cuBLAS math mode. |

See cublasStatus_t for the description of the return status.

cublasMpGetMathMode

cublasStatus_t cublasMpGetMathMode(
        cublasMpHandle_t handle,
        cublasMath_t* mode);
This function returns the math mode used by the library routines.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| mode | Host | Out | cuBLAS math mode. |

See cublasStatus_t for the description of the return status.

Grid Management

cublasMpGridCreate

cublasStatus_t cublasMpGridCreate(
        cublasMpHandle_t handle,
        int64_t nprow,
        int64_t npcol,
        int64_t myprow,
        int64_t mypcol,
        cal_comm_t comm,
        cublasMpGrid_t* grid);
The function initializes the grid opaque data structure. It maps the given resources (communicator, grid dimensions and grid layout) to a grid object.
All the processes defined to be in this grid must enter this function.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| nprow | Host | In | Number of process rows in the grid. |
| npcol | Host | In | Number of process columns in the grid. |
| myprow | Host | In | Row rank of the current process. |
| mypcol | Host | In | Column rank of the current process. |
| comm | Host | In | Communicator associated with the grid. |
| grid | Host | In/Out | Pointer to a grid object. |

See cublasStatus_t for the description of the return status.
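
The myprow and mypcol arguments have to be derived by the caller from each process's linear rank. The sketch below shows two common conventions, column-major and row-major ordering of ranks over the grid. The GridCoord type and both helper functions are hypothetical and only illustrate the arithmetic; which ordering an application should use is an assumption to check against how its matrices are distributed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of cuBLASMp. */
typedef struct {
    int64_t row;
    int64_t col;
} GridCoord;

/* Column-major ordering: consecutive ranks fill a process column first. */
GridCoord grid_coord_col_major(int64_t rank, int64_t nprow, int64_t npcol)
{
    assert(nprow > 0 && npcol > 0);
    assert(rank >= 0 && rank < nprow * npcol);

    GridCoord c = { rank % nprow, rank / nprow };
    return c;
}

/* Row-major ordering: consecutive ranks fill a process row first. */
GridCoord grid_coord_row_major(int64_t rank, int64_t nprow, int64_t npcol)
{
    assert(nprow > 0 && npcol > 0);
    assert(rank >= 0 && rank < nprow * npcol);

    GridCoord c = { rank / npcol, rank % npcol };
    return c;
}
```

For a 2 x 3 grid, rank 3 maps to (row 1, column 1) under column-major ordering and to (row 1, column 0) under row-major ordering. The resulting row and col values are what a caller would pass as myprow and mypcol.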

cublasMpGridDestroy

cublasStatus_t cublasMpGridDestroy(
        cublasMpHandle_t handle,
        cublasMpGrid_t grid);
The function destroys the given grid object.
All the processes defined to be in this grid must enter this function.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| grid | Host | In/Out | Grid object to destroy. |

See cublasStatus_t for the description of the return status.

Matrix Management

cublasMpMatrixDescriptorCreate

cublasStatus_t cublasMpMatrixDescriptorCreate(
        cublasMpHandle_t handle,
        int64_t m,
        int64_t n,
        int64_t mb,
        int64_t nb,
        int64_t rsrc,
        int64_t csrc,
        int64_t lld,
        cudaDataType_t type,
        cublasMpGrid_t grid,
        cublasMpMatrixDescriptor_t* desc);
The function initializes a cublasMpMatrixDescriptor_t object.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| m | Host | In | Number of rows in the global matrix. |
| n | Host | In | Number of columns in the global matrix. |
| mb | Host | In | Blocking factor used to distribute the rows of the global matrix. |
| nb | Host | In | Blocking factor used to distribute the columns of the global matrix. |
| rsrc | Host | In | Row rank of the process that owns the first row block of the global matrix. |
| csrc | Host | In | Column rank of the process that owns the first column block of the global matrix. |
| lld | Host | In | Leading dimension of the local matrix. |
| type | Host | In | Data type of the matrix. |
| grid | Host | In | Grid object associated with the matrix descriptor. |
| desc | Host | Out | Matrix descriptor object initialized by this function. |

The supported values for the type argument are listed below.

| Data Type | Description |
| --- | --- |
| CUDA_R_32F | Single precision real values. |
| CUDA_R_64F | Double precision real values. |
| CUDA_C_32F | Single precision complex values. |
| CUDA_C_64F | Double precision complex values. |

See cublasStatus_t for the description of the return status.
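
The roles of the blocking factors (mb, nb) and the source ranks (rsrc, csrc) are easier to follow with a small worked mapping. The sketch below assumes a ScaLAPACK-style block-cyclic distribution and 0-based global indices; both are assumptions of this example rather than statements from this reference. The BlockCyclicIndex type and the block_cyclic_locate helper are hypothetical. The function handles one dimension at a time, so it would be applied once with (mb, rsrc, nprow) for rows and once with (nb, csrc, npcol) for columns.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of cuBLASMp. */
typedef struct {
    int64_t owner;  /* process coordinate that stores the index */
    int64_t local;  /* 0-based position inside that process's local array */
} BlockCyclicIndex;

/* Locate a 0-based global row (or column) index g in a 1D block-cyclic
 * distribution with blocking factor nb over nprocs processes, where
 * process src owns the first block. */
BlockCyclicIndex block_cyclic_locate(int64_t g, int64_t nb,
                                     int64_t src, int64_t nprocs)
{
    assert(g >= 0 && nb > 0 && nprocs > 0);
    assert(src >= 0 && src < nprocs);

    int64_t block = g / nb;  /* global block that contains g */

    BlockCyclicIndex r;
    r.owner = (src + block) % nprocs;           /* blocks are dealt out cyclically */
    r.local = (block / nprocs) * nb + g % nb;   /* full local blocks before it, plus offset */
    return r;
}
```

With nb = 2 and three processes starting at process 0, global index 6 falls in block 3, which wraps around to process 0 and becomes its local index 2.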

cublasMpMatrixDescriptorDestroy

cublasStatus_t cublasMpMatrixDescriptorDestroy(
        cublasMpHandle_t handle,
        cublasMpMatrixDescriptor_t desc);
The function destroys a cublasMpMatrixDescriptor_t object.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| desc | Host | In/Out | Matrix descriptor object to destroy. |

See cublasStatus_t for the description of the return status.

Utility

cublasMpNumroc

int64_t cublasMpNumroc(
        int64_t n,
        int64_t nb,
        uint32_t iproc,
        uint32_t isrcproc,
        uint32_t nprocs);
Computes the number of rows or columns of a distributed matrix owned by the process indicated by the iproc argument.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| n | Host | In | Number of rows or columns in the global distributed matrix. |
| nb | Host | In | Row or column blocking size of the global matrix. |
| iproc | Host | In | The coordinate of the process whose local row or column count is to be determined. |
| isrcproc | Host | In | The coordinate of the process that owns the first row or column of the distributed matrix. |
| nprocs | Host | In | The total number of row or column processes over which the matrix is distributed. |
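
To make the count concrete, the sketch below is a plain C reference computation that follows the ScaLAPACK NUMROC convention for a block-cyclic distribution. It is an illustration under the assumption that cublasMpNumroc follows the same convention; the name numroc_reference is hypothetical and this is not the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Reference count of rows (or columns) owned by process iproc when n
 * items are dealt out in blocks of nb over nprocs processes, starting
 * at process isrcproc. Follows the ScaLAPACK NUMROC convention. */
int64_t numroc_reference(int64_t n, int64_t nb, uint32_t iproc,
                         uint32_t isrcproc, uint32_t nprocs)
{
    assert(n >= 0 && nb > 0 && nprocs > 0);
    assert(iproc < nprocs && isrcproc < nprocs);

    int64_t np = (int64_t)nprocs;

    /* Distance of this process from the one holding the first block. */
    int64_t mydist = (np + (int64_t)iproc - (int64_t)isrcproc) % np;

    int64_t nblocks = n / nb;             /* number of full blocks */
    int64_t count = (nblocks / np) * nb;  /* full rounds every process receives */
    int64_t extra = nblocks % np;         /* full blocks left after those rounds */

    if (mydist < extra) {
        count += nb;      /* one more full block */
    } else if (mydist == extra) {
        count += n % nb;  /* the trailing partial block, if any */
    }
    return count;
}
```

For n = 7, nb = 2 and three processes starting at process 0, the blocks are {0,1}, {2,3}, {4,5} and {6}, so process 0 owns 3 items and processes 1 and 2 own 2 each.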


cublasMpGemr2D

cublasStatus_t cublasMpGemr2D(
        cublasMpHandle_t handle,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function redistributes general rectangular matrix A according to the distribution properties of matrix B.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | Out | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| d_work | Device | Out | Device workspace of size workspaceSizeInBytesOnDevice. |
| workspaceSizeInBytesOnDevice | Host | In | The size in bytes of the local device workspace needed by the routine, as provided by cublasMpGemr2D_bufferSize(). |
| h_work | Host | Out | Host workspace of size workspaceSizeInBytesOnHost. |
| workspaceSizeInBytesOnHost | Host | In | The size in bytes of the local host workspace needed by the routine, as provided by cublasMpGemr2D_bufferSize(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |

See cublasStatus_t for the description of the return status.

cublasMpGemr2D_bufferSize

cublasStatus_t cublasMpGemr2D_bufferSize(
        cublasMpHandle_t handle,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function returns the required buffer sizes to perform cublasMpGemr2D() on the given input.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | In | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| workspaceSizeInBytesOnDevice | Host | Out | On output, contains the size in bytes of the local device workspace needed by cublasMpGemr2D(). |
| workspaceSizeInBytesOnHost | Host | Out | On output, contains the size in bytes of the local host workspace needed by cublasMpGemr2D(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |

See cublasStatus_t for the description of the return status.

cublasMpTrmr2D

cublasStatus_t cublasMpTrmr2D(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function redistributes trapezoidal matrix A according to the distribution properties of trapezoidal matrix B.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| uplo | Host | In | Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| diag | Host | In | Indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | Out | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| d_work | Device | Out | Device workspace of size workspaceSizeInBytesOnDevice. |
| workspaceSizeInBytesOnDevice | Host | In | The size in bytes of the local device workspace needed by the routine, as provided by cublasMpTrmr2D_bufferSize(). |
| h_work | Host | Out | Host workspace of size workspaceSizeInBytesOnHost. |
| workspaceSizeInBytesOnHost | Host | In | The size in bytes of the local host workspace needed by the routine, as provided by cublasMpTrmr2D_bufferSize(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |

See cublasStatus_t for the description of the return status.

cublasMpTrmr2D_bufferSize

cublasStatus_t cublasMpTrmr2D_bufferSize(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost,
        cal_comm_t global_comm);
This function returns the required buffer sizes to perform cublasMpTrmr2D() on the given input.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| uplo | Host | In | Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| diag | Host | In | Indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | In | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| workspaceSizeInBytesOnDevice | Host | Out | On output, contains the size in bytes of the local device workspace needed by cublasMpTrmr2D(). |
| workspaceSizeInBytesOnHost | Host | Out | On output, contains the size in bytes of the local host workspace needed by cublasMpTrmr2D(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |

See cublasStatus_t for the description of the return status.

Logging

cublasMpLoggerSetCallback

cublasStatus_t cublasMpLoggerSetCallback(
        cublasMpLoggerCallback_t callback);
This function sets the logging callback function.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| callback | Host | In | Pointer to a callback function. See cublasMpLoggerCallback_t. |

See cublasStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerSetFile

cublasStatus_t cublasMpLoggerSetFile(
        FILE *file);
This function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| file | Host | In | Pointer to an open file. The file should have write permission. |

See cublasStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerOpenFile

cublasStatus_t cublasMpLoggerOpenFile(
        const char* logFile);
This function opens a logging output file at the given path.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| logFile | Host | In | Path of the logging output file. |

See cublasStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerSetLevel

cublasStatus_t cublasMpLoggerSetLevel(
        int level);
This function sets the value of the logging level.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| level | Host | In | Value of the logging level. See cuBLASMp Logging. |

See cublasStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerSetMask

cublasStatus_t cublasMpLoggerSetMask(
        int mask);
This function sets the value of the logging mask.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| mask | Host | In | Value of the logging mask. See cuBLASMp Logging. |

See cublasStatus_t for the description of the return status.

Warning

This is an experimental feature.


cublasMpLoggerForceDisable

cublasStatus_t cublasMpLoggerForceDisable(
        int level);
This function disables logging for the entire run.
See cublasStatus_t for the description of the return status.

Warning

This is an experimental feature.


Dense Linear Algebra APIs

cublasMpTrsm

cublasStatus_t cublasMpTrsm(
        cublasMpHandle_t handle,
        cublasSideMode_t side,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        cublasComputeType_t computeType,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function solves the triangular linear system with multiple right-hand sides:

\(\left\{ \begin{matrix} {\text{op}(A)X = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {X\text{op}(A) = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.\)

where \(A\) is a triangular matrix stored in lower or upper mode with or without the main diagonal, \(X\) and \(B\) are \(m \times n\) matrices, and \(\alpha\) is a scalar. Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

The solution \(X\) overwrites the right-hand-sides \(B\) on exit.
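
For intuition about what is being solved, the sketch below is a serial, single-process reference for one case only: side == CUBLAS_SIDE_LEFT, a lower triangular A, no transposition and a non-unit diagonal, with real double-precision data in column-major storage. It uses plain forward substitution and overwrites B with X, matching the statement above. The function name trsm_left_lower_ref is hypothetical, and the code says nothing about how cuBLASMp performs the distributed solve.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A * X = alpha * B for X, where A is m x m lower triangular with
 * a non-unit diagonal and B is m x n. Both are column-major. X overwrites B. */
void trsm_left_lower_ref(size_t m, size_t n, double alpha,
                         const double *a, size_t lda,
                         double *b, size_t ldb)
{
    assert(a != NULL && b != NULL);
    assert(lda >= m && ldb >= m);

    for (size_t j = 0; j < n; ++j) {         /* each right-hand side */
        double *col = b + j * ldb;
        for (size_t i = 0; i < m; ++i) {     /* forward substitution */
            double sum = alpha * col[i];
            for (size_t p = 0; p < i; ++p) {
                /* col[p] already holds the solved entry x_p */
                sum -= a[i + p * lda] * col[p];
            }
            assert(a[i + i * lda] != 0.0);   /* A must be nonsingular */
            col[i] = sum / a[i + i * lda];
        }
    }
}
```

With A = [[2, 0], [1, 4]], B = [1, 4.5] and alpha = 2, the scaled right-hand side is [2, 9], so the solution is x = [1, 2].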

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| side | Host | In | Indicates whether matrix A is on the left or right of X. |
| uplo | Host | In | Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| trans | Host | In | Operation op(A): non-transpose, transpose, or conjugate transpose. |
| diag | Host | In | Indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| m | Host | In | Number of rows of matrix sub(B), with matrix sub(A) sized accordingly. |
| n | Host | In | Number of columns of matrix sub(B), with matrix sub(A) sized accordingly. |
| alpha | Host | In | Scalar used for multiplication. |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). ia must be a multiple of the row blocking dimension mbA. |
| ja | Host | In | Column index of the first column of sub(A). ja must be a multiple of the column blocking dimension nbA. |
| descA | Host | In | Matrix descriptor associated with the global matrix A. |
| b | Device | In/Out | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). ib must be a multiple of the row blocking dimension mbB. |
| jb | Host | In | Column index of the first column of sub(B). jb must be a multiple of the column blocking dimension nbB. |
| descB | Host | In | Matrix descriptor associated with the global matrix B. |
| computeType | Host | In | cuBLAS compute type used for computations. See table below for supported combinations. |
| d_work | Device | Out | Device workspace of size workspaceSizeInBytesOnDevice. |
| workspaceSizeInBytesOnDevice | Host | In | The size in bytes of the local device workspace needed by the routine, as provided by cublasMpTrsm_bufferSize(). |
| h_work | Host | Out | Host workspace of size workspaceSizeInBytesOnHost. |
| workspaceSizeInBytesOnHost | Host | In | The size in bytes of the local host workspace needed by the routine, as provided by cublasMpTrsm_bufferSize(). |

This function requires a square block size.
This routine supports the following combinations of data types:

| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
| --- | --- | --- | --- |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasStatus_t for the description of the return status.

cublasMpTrsm_bufferSize

cublasStatus_t cublasMpTrsm_bufferSize(
        cublasMpHandle_t handle,
        cublasSideMode_t side,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        cublasDiagType_t diag,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        cublasComputeType_t computeType,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpTrsm() on the given input.

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| side | Host | In | Indicates whether matrix A is on the left or right of X. |
| uplo | Host | In | Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| trans | Host | In | Operation op(A): non-transpose, transpose, or conjugate transpose. |
| diag | Host | In | Indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| m | Host | In | Number of rows of matrix sub(B), with matrix sub(A) sized accordingly. |
| n | Host | In | Number of columns of matrix sub(B), with matrix sub(A) sized accordingly. |
| alpha | Host | In | Scalar used for multiplication. |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). ia must be a multiple of the row blocking dimension mbA. |
| ja | Host | In | Column index of the first column of sub(A). ja must be a multiple of the column blocking dimension nbA. |
| descA | Host | In | Matrix descriptor associated with the global matrix A. |
| b | Device | In | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). ib must be a multiple of the row blocking dimension mbB. |
| jb | Host | In | Column index of the first column of sub(B). jb must be a multiple of the column blocking dimension nbB. |
| descB | Host | In | Matrix descriptor associated with the global matrix B. |
| computeType | Host | In | cuBLAS compute type used for computations. See table below for supported combinations. |
| workspaceSizeInBytesOnDevice | Host | Out | On output, contains the size in bytes of the local device workspace needed by cublasMpTrsm(). |
| workspaceSizeInBytesOnHost | Host | Out | On output, contains the size in bytes of the local host workspace needed by cublasMpTrsm(). |

This function requires a square block size.
This routine supports the following combinations of data types:

| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
| --- | --- | --- | --- |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasStatus_t for the description of the return status.

cublasMpGemm

cublasStatus_t cublasMpGemm(
        cublasMpHandle_t handle,
        cublasOperation_t transA,
        cublasOperation_t transB,
        int64_t m,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the matrix-matrix multiplication

\(C = \alpha\text{op}(A)\text{op}(B) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\), \(B\) and \(C\) are matrices stored in column-major format with dimensions \(\text{op}(A)\): \(m \times k\), \(\text{op}(B)\): \(k \times n\) and \(C\): \(m \times n\), respectively. Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)

and \(\text{op}(B)\) is defined similarly for matrix \(B\), using transB.
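
The update formula can be checked against a serial reference. The sketch below is a naive single-process implementation for the non-transposed case only (both operations equal to CUBLAS_OP_N), with real double-precision data in column-major storage. The name gemm_ref is hypothetical, and the code is meant only to pin down the index conventions of the formula, not to describe how cuBLASMp distributes the work.

```c
#include <assert.h>
#include <stddef.h>

/* C = alpha * A * B + beta * C, with A (m x k), B (k x n) and C (m x n)
 * all stored column-major with leading dimensions lda, ldb and ldc. */
void gemm_ref(size_t m, size_t n, size_t k,
              double alpha, const double *a, size_t lda,
              const double *b, size_t ldb,
              double beta, double *c, size_t ldc)
{
    assert(a != NULL && b != NULL && c != NULL);
    assert(lda >= m && ldb >= k && ldc >= m);

    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                /* element (i, p) of A times element (p, j) of B */
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

As a quick check, multiplying A = [[1, 2], [3, 4]] by the identity with alpha = 2, beta = 3 and C initially all ones gives C = 2A + 3, that is [[5, 7], [9, 11]].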

| trans | Form of the linear system |
| --- | --- |
| CUBLAS_OP_N | \(sub(A) \cdot X = sub(B)\) |
| CUBLAS_OP_T | \(sub(A)^T \cdot X = sub(B)\) |
| CUBLAS_OP_C | \(sub(A)^H \cdot X = sub(B)\) |

| Parameter | Memory | In/Out | Description |
| --- | --- | --- | --- |
| handle | Host | In | cuBLASMp library handle. |
| transA | Host | In | Operation op(A): non-transpose, transpose, or conjugate transpose. |
| transB | Host | In | Operation op(B): non-transpose, transpose, or conjugate transpose. |
| m | Host | In | Number of rows of sub(A) and sub(C). |
| n | Host | In | Number of columns of sub(B) and sub(C). |
| k | Host | In | Number of columns of sub(A) and rows of sub(B). |
| alpha | Host | In | <type> scalar used for multiplication. |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). ia must be a multiple of the row blocking dimension mbA. |
| ja | Host | In | Column index of the first column of sub(A). ja must be a multiple of the column blocking dimension nbA. |
| descA | Host | In | Matrix descriptor associated with the global matrix A. |
| b | Device | In | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). ib must be a multiple of the row blocking dimension mbB. |
| jb | Host | In | Column index of the first column of sub(B). jb must be a multiple of the column blocking dimension nbB. |
| descB | Host | In | Matrix descriptor associated with the global matrix B. |
| beta | Host | In | <type> scalar used for multiplication. |
| c | Device | In/Out | Pointer to the first entry of the local portion of the global matrix C. |
| ic | Host | In | Row index of the first row of sub(C). ic must be a multiple of the row blocking dimension mbC. |
| jc | Host | In | Column index of the first column of sub(C). jc must be a multiple of the column blocking dimension nbC. |
| descC | Host | In | Matrix descriptor associated with the global matrix C. |
| computeType | Host | In | cuBLAS compute type used for computations. See table below for supported combinations. |
| d_work | Device | Out | Device workspace of size workspaceSizeInBytesOnDevice. |
| workspaceSizeInBytesOnDevice | Host | In | The size in bytes of the local device workspace needed by the routine, as provided by cublasMpGemm_bufferSize(). |
| h_work | Host | Out | Host workspace of size workspaceSizeInBytesOnHost. |
| workspaceSizeInBytesOnHost | Host | In | The size in bytes of the local host workspace needed by the routine, as provided by cublasMpGemm_bufferSize(). |

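This routine supports the following combinations of data types:

| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
| --- | --- | --- | --- |
| CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_16F_PEDANTIC | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_8I | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_8I | CUDA_C_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.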
See cublasStatus_t for the description of the return status.

cublasMpGemm_bufferSize

cublasStatus_t cublasMpGemm_bufferSize(
        cublasMpHandle_t handle,
        cublasOperation_t transA,
        cublasOperation_t transB,
        int64_t m,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* b,
        int64_t ib,
        int64_t jb,
        cublasMpMatrixDescriptor_t descB,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpGemm() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

transA

Host

In

Operation op(A) that is non- or (conj.) transpose.

transB

Host

In

Operation op(B) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(B) and sub(C).

k

Host

In

Number of columns of sub(A) and rows of sub(B).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

b

Device

In

Pointer to the first entry of the local portion of the global matrix B.

ib

Host

In

Row index of the first row of the sub(B). ib must be a multiple of the row blocking dimension mbB.

jb

Host

In

Column index of the first column of the sub(B). jb must be a multiple of the column blocking dimension nbB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

workspaceSizeInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpGemm().

workspaceSizeInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpGemm().

This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype/Btype

Ctype

CUBLAS_COMPUTE_16F or

CUBLAS_COMPUTE_16F_PEDANTIC

CUDA_R_16F

CUDA_R_16F

CUDA_R_16F

CUBLAS_COMPUTE_32F or

CUBLAS_COMPUTE_32F_PEDANTIC

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_16F

CUDA_R_16F

CUDA_R_8I

CUDA_R_32F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_16F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_8I

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_32F_FAST_16F or

CUBLAS_COMPUTE_32F_FAST_16BF or

CUBLAS_COMPUTE_32F_FAST_TF32

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F or

CUBLAS_COMPUTE_64F_PEDANTIC

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasStatus_t for the description of the return status.
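
The m, n, and k descriptions above are easiest to read in terms of op(A) and op(B). The following is a minimal sketch, assuming cublasMpGemm() follows the usual GEMM convention in which op(A) is m x k and op(B) is k x n. The names op_t, shape_t, and stored_shape are hypothetical stand-ins introduced for illustration and are not part of cuBLASMp.

```c
#include <stdint.h>

/* Local stand-in for cublasOperation_t so the sketch compiles without cuBLAS. */
typedef enum { OP_N, OP_T, OP_C } op_t;

/* Row and column counts of a stored submatrix. */
typedef struct {
    int64_t rows;
    int64_t cols;
} shape_t;

/* Returns the shape of the stored submatrix X such that op(X) is rows x cols.
 * For OP_N the stored shape equals the shape of op(X); for OP_T and OP_C
 * the stored submatrix is the transpose, so rows and cols are swapped. */
shape_t stored_shape(op_t op, int64_t rows, int64_t cols)
{
    shape_t s;
    if (op == OP_N) {
        s.rows = rows;
        s.cols = cols;
    } else {
        s.rows = cols;
        s.cols = rows;
    }
    return s;
}
```

Under this assumption, with m = 4, n = 5, k = 3, transA = CUBLAS_OP_N, and transB = CUBLAS_OP_T, stored_shape(OP_N, m, k) gives a 4 x 3 sub(A) and stored_shape(OP_T, k, n) gives a 5 x 3 sub(B).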

cublasMpSyrk

cublasStatus_t cublasMpSyrk(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the symmetric rank-\(k\) update

\(C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, \(C\) is a symmetric matrix stored in lower or upper mode, and \(\text{op}(A)\) is a matrix with dimensions \(n \times k\). Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.\)
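
To make the update concrete, here is a single-process host reference for the formula above, written as a sketch rather than a description of how cuBLASMp computes it. It assumes real double-precision data in one column-major array with no block-cyclic distribution. The names fill_t, op_t, and syrk_ref are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Local stand-ins for cublasFillMode_t and cublasOperation_t. */
typedef enum { FILL_LOWER, FILL_UPPER } fill_t;
typedef enum { OP_N, OP_T } op_t;

/* Computes C = alpha * op(A) * op(A)^T + beta * C for an n x n column-major C
 * with leading dimension ldc, writing only the triangle selected by uplo.
 * A is stored n x k when trans == OP_N and k x n when trans == OP_T,
 * with leading dimension lda. */
void syrk_ref(fill_t uplo, op_t trans, int n, int k, double alpha,
              const double *a, int lda, double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        /* Lower: rows j..n-1 of column j. Upper: rows 0..j of column j. */
        int i_begin = (uplo == FILL_LOWER) ? j : 0;
        int i_end   = (uplo == FILL_LOWER) ? n : j + 1;
        for (int i = i_begin; i < i_end; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                /* Element (i, p) and (j, p) of op(A). */
                double aip = (trans == OP_N) ? a[i + (size_t)p * lda]
                                             : a[p + (size_t)i * lda];
                double ajp = (trans == OP_N) ? a[j + (size_t)p * lda]
                                             : a[p + (size_t)j * lda];
                sum += aip * ajp;
            }
            c[i + (size_t)j * ldc] = alpha * sum + beta * c[i + (size_t)j * ldc];
        }
    }
}
```

Entries outside the selected triangle are left untouched, which mirrors the uplo description: the other symmetric part is not referenced.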

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates whether the lower or upper part of matrix C is stored. The other symmetric part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or transpose.

n

Host

In

Number of rows of sub(A) and sub(C).

k

Host

In

Number of columns of sub(A).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceSizeInBytesOnDevice.

workspaceSizeInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpSyrk_bufferSize().

h_work

Host

Out

Host workspace of size workspaceSizeInBytesOnHost.

workspaceSizeInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpSyrk_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype

Ctype

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasStatus_t for the description of the return status.

cublasMpSyrk_bufferSize

cublasStatus_t cublasMpSyrk_bufferSize(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t n,
        int64_t k,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        cublasComputeType_t computeType,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpSyrk() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates whether the lower or upper part of matrix C is stored. The other symmetric part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or transpose.

n

Host

In

Number of rows of sub(A) and sub(C).

k

Host

In

Number of columns of sub(A).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

cuBLAS compute type used for computations. See table below for supported combinations.

workspaceSizeInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpSyrk().

workspaceSizeInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpSyrk().

This function requires square block size.
This routine supports the following combinations of data types:

Compute Type

Scale Type (alpha and beta)

Atype

Ctype

CUBLAS_COMPUTE_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUBLAS_COMPUTE_64F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
See cublasStatus_t for the description of the return status.

cublasMpGeadd

cublasStatus_t cublasMpGeadd(
        cublasMpHandle_t handle,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the matrix-matrix addition

\(C = \alpha\text{op}(A) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\) and \(C\) are matrices stored in column-major format, with \(\text{op}(A)\) and \(C\) both of dimensions \(m \times n\). Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
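
As an illustration of the formula above, the following single-process host reference applies the same update to small column-major arrays. It is a sketch under stated assumptions: real double-precision data (so the conjugate transpose coincides with the transpose and is omitted) and no block-cyclic distribution. The names op_t and geadd_ref are hypothetical and are not part of cuBLASMp.

```c
#include <stddef.h>

/* Local stand-in for cublasOperation_t (real data, so OP_C is omitted). */
typedef enum { OP_N, OP_T } op_t;

/* Computes C = alpha * op(A) + beta * C for an m x n column-major C
 * with leading dimension ldc.
 * A is stored m x n when trans == OP_N and n x m when trans == OP_T,
 * with leading dimension lda. */
void geadd_ref(op_t trans, int m, int n, double alpha,
               const double *a, int lda, double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Element (i, j) of op(A). */
            double aij = (trans == OP_N) ? a[i + (size_t)j * lda]
                                         : a[j + (size_t)i * lda];
            c[i + (size_t)j * ldc] = alpha * aij + beta * c[i + (size_t)j * ldc];
        }
    }
}
```

With trans set to OP_T, a 3 x 2 stored A contributes its transpose to a 2 x 3 C, which is the case exercised when checking the sketch.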

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

d_work

Device

Out

Device workspace of size workspaceSizeInBytesOnDevice.

workspaceSizeInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpGeadd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceSizeInBytesOnHost.

workspaceSizeInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpGeadd_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasStatus_t for the description of the return status.

cublasMpGeadd_bufferSize

cublasStatus_t cublasMpGeadd_bufferSize(
        cublasMpHandle_t handle,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpGeadd() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

workspaceSizeInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpGeadd().

workspaceSizeInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpGeadd().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasStatus_t for the description of the return status.

cublasMpTradd

cublasStatus_t cublasMpTradd(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        void* d_work,
        size_t workspaceSizeInBytesOnDevice,
        void* h_work,
        size_t workspaceSizeInBytesOnHost);
This function performs the trapezoidal matrix-matrix addition

\(C = \alpha\text{op}(A) + \beta C\)

where \(\alpha\) and \(\beta\) are scalars, and \(A\) and \(C\) are matrices stored in column-major format, with \(\text{op}(A)\) and \(C\) both of dimensions \(m \times n\). Also, for matrix \(A\)

\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
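
The sketch below shows one reading of the trapezoidal addition on the host. It assumes, following the PBLAS convention for trapezoidal routines, that only the entries on or above the diagonal (upper) or on or below it (lower) are updated. It also assumes real double-precision data, op(A) = A, and a single column-major array with no block-cyclic distribution. The names fill_t and tradd_ref are hypothetical and are not part of cuBLASMp.

```c
#include <stddef.h>

/* Local stand-in for cublasFillMode_t. */
typedef enum { FILL_LOWER, FILL_UPPER } fill_t;

/* Computes C = alpha * A + beta * C on the trapezoid of an m x n
 * column-major C selected by uplo:
 *   FILL_UPPER updates entries with row index i <= column index j,
 *   FILL_LOWER updates entries with row index i >= column index j.
 * Entries outside the trapezoid are neither read nor written. */
void tradd_ref(fill_t uplo, int m, int n, double alpha,
               const double *a, int lda, double beta, double *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            int inside = (uplo == FILL_UPPER) ? (i <= j) : (i >= j);
            if (!inside) {
                continue;
            }
            c[i + (size_t)j * ldc] =
                alpha * a[i + (size_t)j * lda] + beta * c[i + (size_t)j * ldc];
        }
    }
}
```

For a 2 x 3 matrix with uplo set to FILL_UPPER, every entry except the one at row 1, column 0 (zero-based) is updated.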

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates whether the lower or upper part of matrix C is stored. The other part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In/Out

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

d_work

Device

Out

Device workspace of size workspaceSizeInBytesOnDevice.

workspaceSizeInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cublasMpTradd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceSizeInBytesOnHost.

workspaceSizeInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cublasMpTradd_bufferSize().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasStatus_t for the description of the return status.

cublasMpTradd_bufferSize

cublasStatus_t cublasMpTradd_bufferSize(
        cublasMpHandle_t handle,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t m,
        int64_t n,
        const void* alpha,
        const void* a,
        int64_t ia,
        int64_t ja,
        cublasMpMatrixDescriptor_t descA,
        const void* beta,
        void* c,
        int64_t ic,
        int64_t jc,
        cublasMpMatrixDescriptor_t descC,
        size_t* workspaceSizeInBytesOnDevice,
        size_t* workspaceSizeInBytesOnHost);
This function returns the required buffer sizes to perform cublasMpTradd() on the given input.

Parameter

Memory

In/Out

Description

handle

Host

In

cuBLASMp library handle.

uplo

Host

In

Indicates whether the lower or upper part of matrix C is stored. The other part is not referenced and is inferred from the stored elements.

trans

Host

In

Operation op(A) that is non- or (conj.) transpose.

m

Host

In

Number of rows of sub(A) and sub(C).

n

Host

In

Number of columns of sub(A) and sub(C).

alpha

Host

In

<type> scalar used for multiplication.

a

Device

In

Pointer to the first entry of the local portion of the global matrix A.

ia

Host

In

Row index of the first row of the sub(A). ia must be a multiple of the row blocking dimension mbA.

ja

Host

In

Column index of the first column of the sub(A). ja must be a multiple of the column blocking dimension nbA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

beta

Host

In

<type> scalar used for multiplication.

c

Device

In

Pointer to the first entry of the local portion of the global matrix C.

ic

Host

In

Row index of the first row of the sub(C). ic must be a multiple of the row blocking dimension mbC.

jc

Host

In

Column index of the first column of the sub(C). jc must be a multiple of the column blocking dimension nbC.

descC

Host

In

Matrix descriptor associated to the global matrix C.

workspaceSizeInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cublasMpTradd().

workspaceSizeInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cublasMpTradd().

This function requires square block size.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cublasStatus_t for the description of the return status.