2.7.8. cublas<t>syr2k()

cublasStatus_t cublasSsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const float           *alpha,
                            const float           *A, int lda,
                            const float           *B, int ldb,
                            const float           *beta,
                            float           *C, int ldc)
cublasStatus_t cublasDsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const double          *alpha,
                            const double          *A, int lda,
                            const double          *B, int ldb,
                            const double          *beta,
                            double          *C, int ldc)
cublasStatus_t cublasCsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZsyr2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the symmetric rank-$2k$ update

$C = \alpha(\text{op}(A)\text{op}(B)^{T} + \text{op}(B)\text{op}(A)^{T}) + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ and $B$ are matrices such that $\text{op}(A)$ and $\text{op}(B)$ both have dimensions $n \times k$. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{T}\text{ and }B^{T}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$
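To make the update formula concrete, here is a minimal CPU sketch in plain C. It is not part of cuBLAS and does not call it: the function name `ref_ssyr2k_lower_n` and the `IDX` macro are hypothetical, column-major storage is assumed, and only the case `trans == CUBLAS_OP_N` with `uplo == CUBLAS_FILL_MODE_LOWER` is covered.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference, not a cuBLAS routine:
 *   C = alpha * (A * B^T + B * A^T) + beta * C
 * for trans == CUBLAS_OP_N and uplo == CUBLAS_FILL_MODE_LOWER,
 * with A and B of size n x k and C of size n x n. */
void ref_ssyr2k_lower_n(int n, int k, float alpha,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    int min_ld = (n > 1) ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {      /* lower triangle only: i >= j */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[IDX(i, p, lda)] * B[IDX(j, p, ldb)]
                     + B[IDX(i, p, ldb)] * A[IDX(j, p, lda)];
            }
            /* When beta == 0, C is not read, so it need not hold valid data. */
            float scaled = (beta == 0.0f) ? 0.0f : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * acc + scaled;
        }
    }
}
```

In this sketch the strictly upper part of `C` is never read or written, which mirrors the statement that the other symmetric part is not referenced.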

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) and op(`B`), either non-transpose or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `ldb` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

2.7.9. cublas<t>syrkx()

cublasStatus_t cublasSsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const float           *alpha,
                            const float           *A, int lda,
                            const float           *B, int ldb,
                            const float           *beta,
                            float           *C, int ldc)
cublasStatus_t cublasDsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const double          *alpha,
                            const double          *A, int lda,
                            const double          *B, int ldb,
                            const double          *beta,
                            double          *C, int ldc)
cublasStatus_t cublasCsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZsyrkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs a variation of the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ and $B$ are matrices such that $\text{op}(A)$ and $\text{op}(B)$ both have dimensions $n \times k$. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{T}\text{ and }B^{T}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$

This routine can be used when B is such that the result is guaranteed to be symmetric. A common example is when the matrix B is a scaled form of the matrix A, which is equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine cublas<t>dgmm.
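The scaled-matrix case can be checked numerically with a small CPU sketch in plain C. Nothing here calls cuBLAS: the names `ref_scale_columns`, `ref_abt_entry` and `ref_dsyrkx_lower_n` are hypothetical, column-major storage is assumed, and only `trans == CUBLAS_OP_N` with lower fill mode is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical helper: B = A * diag(d), i.e. column p of A scaled by d[p]. */
void ref_scale_columns(int n, int k, const double *A, int lda,
                       const double *d, double *B, int ldb)
{
    for (int p = 0; p < k; ++p)
        for (int i = 0; i < n; ++i)
            B[IDX(i, p, ldb)] = A[IDX(i, p, lda)] * d[p];
}

/* Entry (i, j) of A * B^T for n x k matrices A and B. */
double ref_abt_entry(int i, int j, int k,
                     const double *A, int lda,
                     const double *B, int ldb)
{
    double acc = 0.0;
    for (int p = 0; p < k; ++p)
        acc += A[IDX(i, p, lda)] * B[IDX(j, p, ldb)];
    return acc;
}

/* Hypothetical CPU reference, not a cuBLAS routine:
 *   C = alpha * A * B^T + beta * C, lower triangle of C only. */
void ref_dsyrkx_lower_n(int n, int k, double alpha,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double beta, double *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            /* When beta == 0, C is not read. */
            double scaled = (beta == 0.0) ? 0.0 : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] =
                alpha * ref_abt_entry(i, j, k, A, lda, B, ldb) + scaled;
        }
    }
}
```

With `B = A * diag(d)`, entry (i, j) of the product is the sum over p of `A(i,p) * d[p] * A(j,p)`, which is unchanged when i and j are swapped. That is why storing only one triangle loses no information in this case.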

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) and op(`B`), either non-transpose or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `ldb` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyrk, dsyrk, csyrk, zsyrk

2.7.10. cublas<t>trmm()

cublasStatus_t cublasStrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *B, int ldb,
                           float                 *C, int ldc)
cublasStatus_t cublasDtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *B, int ldb,
                           double                *C, int ldc)
cublasStatus_t cublasCtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           cuComplex             *C, int ldc)
cublasStatus_t cublasZtrmm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           cuDoubleComplex       *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the triangular matrix-matrix multiplication

$C = \left\{ \begin{matrix} {\alpha\text{op}(A)B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {\alpha B\text{op}(A)} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

Note that, to achieve better parallelism, cuBLAS differs from the BLAS API for this routine only. The BLAS API assumes an in-place implementation (with results written back to B), while the cuBLAS API assumes an out-of-place implementation (with results written into C). The application can obtain the in-place functionality of BLAS in the cuBLAS API by passing the address of the matrix B in place of the matrix C. No other overlapping in the input parameters is supported.
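The difference between the out-of-place and in-place call patterns can be sketched on the CPU in plain C. This is an illustration only and says nothing about how cuBLAS is implemented: the function name `ref_dtrmm_left_lower_n` and the `IDX` macro are hypothetical, column-major storage is assumed, and only `side == CUBLAS_SIDE_LEFT`, lower fill mode, `trans == CUBLAS_OP_N` and a non-unit diagonal are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference, not a cuBLAS routine:
 *   C = alpha * A * B
 * with A an m x m lower triangular matrix and B, C of size m x n.
 * C may be the same pointer as B (with ldc == ldb) for in-place use. */
void ref_dtrmm_left_lower_n(int m, int n, double alpha,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc)
{
    int min_ld = (m > 1) ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);

    for (int j = 0; j < n; ++j) {
        /* Row i of the result needs B(p, j) only for p <= i, so walking i
         * downward keeps every still-needed input intact when C aliases B. */
        for (int i = m - 1; i >= 0; --i) {
            double acc = 0.0;
            for (int p = 0; p <= i; ++p)   /* entries above the diagonal are skipped */
                acc += A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            C[IDX(i, j, ldc)] = alpha * acc;
        }
    }
}
```

Calling the sketch with a separate `C` leaves `B` untouched, while passing `B` for both arguments reproduces the BLAS-style overwrite.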

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
m		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication, if `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
C	device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m`, `n` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or `C` == NULL if `C` needs to be scaled
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strmm, dtrmm, ctrmm, ztrmm

2.7.11. cublas<t>trsm()

cublasStatus_t cublasStrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const float           *alpha,
                           const float           *A, int lda,
                           float           *B, int ldb)
cublasStatus_t cublasDtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const double          *alpha,
                           const double          *A, int lda,
                           double          *B, int ldb)
cublasStatus_t cublasCtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           cuComplex       *B, int ldb)
cublasStatus_t cublasZtrsm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           cublasOperation_t trans, cublasDiagType_t diag,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           cuDoubleComplex *B, int ldb)

This function supports the 64-bit Integer Interface.

This function solves the triangular linear system with multiple right-hand-sides

$\left\{ \begin{matrix} {\text{op}(A)X = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {X\text{op}(A) = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X$ and $B$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

The solution $X$ overwrites the right-hand-sides $B$ on exit.

No test for singularity or near-singularity is included in this function.
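For the left-side, lower-triangular, non-transposed case, the solve amounts to forward substitution on each column of B. The following plain C sketch illustrates this on the CPU; it is not a cuBLAS call, the function name `ref_dtrsm_left_lower_n` and the `IDX` macro are hypothetical, and column-major storage with a non-unit diagonal is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference, not a cuBLAS routine:
 * solves A * X = alpha * B for X, with A an m x m lower triangular
 * matrix, and overwrites B (m x n) with the solution X. */
void ref_dtrsm_left_lower_n(int m, int n, double alpha,
                            const double *A, int lda,
                            double *B, int ldb)
{
    int min_ld = (m > 1) ? m : 1;
    assert(m >= 0 && n >= 0);
    assert(lda >= min_ld && ldb >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double rhs = alpha * B[IDX(i, j, ldb)];
            /* Rows p < i of column j already hold solution entries. */
            for (int p = 0; p < i; ++p)
                rhs -= A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            /* As in the routine described here, the diagonal is not
             * tested for zero or near-zero values. */
            B[IDX(i, j, ldb)] = rhs / A[IDX(i, i, lda)];
        }
    }
}
```

Because no singularity test is performed, a zero on the diagonal of `A` would lead to a division by zero in this sketch; callers are expected to guarantee a well-conditioned triangular factor.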

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `X`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
m		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication, if `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	in/out	<type> array. It has dimensions `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `diag` != `CUBLAS_DIAG_NON_UNIT`, `CUBLAS_DIAG_UNIT` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or `alpha` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strsm, dtrsm, ctrsm, ztrsm

2.7.12. cublas<t>trsmBatched()

cublasStatus_t cublasStrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const float *alpha,
                                   const float *const A[],
                                   int lda,
                                   float *const B[],
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasDtrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const double *alpha,
                                   const double *const A[],
                                   int lda,
                                   double *const B[],
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasCtrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const cuComplex *alpha,
                                   const cuComplex *const A[],
                                   int lda,
                                   cuComplex *const B[],
                                   int ldb,
                                   int batchCount);
cublasStatus_t cublasZtrsmBatched( cublasHandle_t    handle,
                                   cublasSideMode_t  side,
                                   cublasFillMode_t  uplo,
                                   cublasOperation_t trans,
                                   cublasDiagType_t  diag,
                                   int m,
                                   int n,
                                   const cuDoubleComplex *alpha,
                                   const cuDoubleComplex *const A[],
                                   int lda,
                                   cuDoubleComplex *const B[],
                                   int ldb,
                                   int batchCount);

This function supports the 64-bit Integer Interface.

This function solves an array of triangular linear systems with multiple right-hand-sides

$\left\{ \begin{matrix} {\text{op}(A\lbrack i\rbrack)X\lbrack i\rbrack = \alpha B\lbrack i\rbrack} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {X\lbrack i\rbrack\text{op}(A\lbrack i\rbrack) = \alpha B\lbrack i\rbrack} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.$

where $A\lbrack i\rbrack$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X\lbrack i\rbrack$ and $B\lbrack i\rbrack$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A\lbrack i\rbrack$

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix} {A\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A\lbrack i\rbrack^{T}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ {A\lbrack i\rbrack^{H}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

The solution $X\lbrack i\rbrack$ overwrites the right-hand-sides $B\lbrack i\rbrack$ on exit.

No test for singularity or near-singularity is included in this function.

This function works for any sizes but is intended to be used for matrices of small sizes where the launch overhead is a significant factor. For bigger sizes, it might be advantageous to call the regular cublas<t>trsm batchCount times within a set of CUDA streams.

The current implementation is limited to devices with compute capability 2.0 or higher.
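The array-of-pointers layout used by the batched interface can be illustrated with a CPU sketch in plain C. This is not cuBLAS code: the function names and the `IDX` macro are hypothetical, the pointer arrays live in host memory here (whereas the parameter table lists them as device memory), and only the left-side, upper-triangular, non-transposed, non-unit case is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical single-system solve: A * X = alpha * B with A an m x m
 * upper triangular matrix, using back substitution. X overwrites B. */
void ref_dtrsm_left_upper_n(int m, int n, double alpha,
                            const double *A, int lda,
                            double *B, int ldb)
{
    for (int j = 0; j < n; ++j) {
        for (int i = m - 1; i >= 0; --i) {
            double rhs = alpha * B[IDX(i, j, ldb)];
            /* Rows p > i of column j already hold solution entries. */
            for (int p = i + 1; p < m; ++p)
                rhs -= A[IDX(i, p, lda)] * B[IDX(p, j, ldb)];
            B[IDX(i, j, ldb)] = rhs / A[IDX(i, i, lda)];
        }
    }
}

/* Hypothetical batched wrapper: A[b] and B[b] point to independent
 * systems that all share m, n, lda, ldb and alpha. */
void ref_dtrsm_batched_left_upper_n(int m, int n, double alpha,
                                    const double *const A[], int lda,
                                    double *const B[], int ldb,
                                    int batchCount)
{
    assert(m >= 0 && n >= 0 && batchCount >= 0);
    for (int b = 0; b < batchCount; ++b)
        ref_dtrsm_left_upper_n(m, n, alpha, A[b], lda, B[b], ldb);
}
```

The systems in this sketch are solved one after another; the point is only to show that every batch entry is an independent solve with shared dimensions, which is why the `B[i]` matrices must not overlap.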

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A[i]` is on the left or right of `X[i]`.
uplo		input	indicates whether the lower or upper part of matrix `A[i]` is stored; the other part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A[i]`) that is non- or (conj.) transpose.
diag		input	indicates if the elements on the main diagonal of matrix `A[i]` are unity and should not be accessed.
m		input	number of rows of matrix `B[i]`, with matrix `A[i]` sized accordingly.
n		input	number of columns of matrix `B[i]`, with matrix `A[i]` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication, if `alpha==0` then `A[i]` is not referenced and `B[i]` does not have to be a valid input.
A	device	input	array of pointers to <type> array, with each array of dim. `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A[i]`.
B	device	in/out	array of pointers to <type> array, with each array of dim. `ldb x n` with `ldb>=max(1,m)`. Matrices `B[i]` should not overlap; otherwise, undefined behavior is expected.
ldb		input	leading dimension of two-dimensional array used to store matrix `B[i]`.
batchCount		input	number of pointers contained in A and B.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `batchCount` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `diag` != `CUBLAS_DIAG_NON_UNIT`, `CUBLAS_DIAG_UNIT` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or `ldb` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strsm, dtrsm, ctrsm, ztrsm

2.7.13. cublas<t>hemm()

cublasStatus_t cublasChemm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *B, int ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZhemm(cublasHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           int m, int n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *B, int ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian matrix-matrix multiplication

$C = \left\{ \begin{matrix} {\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.$

where $A$ is a Hermitian matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.
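The way a Hermitian matrix is reconstructed from a single stored triangle can be sketched in plain C with C99 complex numbers. This is a CPU illustration only and does not call cuBLAS: the names `herm_lower_at` and `ref_zhemm_left_lower` and the `IDX` macro are hypothetical, column-major storage is assumed, and only `side == CUBLAS_SIDE_LEFT` with lower fill mode is covered.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Element (i, p) of a Hermitian matrix whose lower triangle is stored. */
double complex herm_lower_at(const double complex *A, int lda, int i, int p)
{
    if (i == p)
        return creal(A[IDX(i, i, lda)]);   /* diagonal imaginary part taken as zero */
    if (i > p)
        return A[IDX(i, p, lda)];          /* stored entry */
    return conj(A[IDX(p, i, lda)]);        /* mirrored entry, conjugated */
}

/* Hypothetical CPU reference, not a cuBLAS routine:
 *   C = alpha * A * B + beta * C
 * with A an m x m Hermitian matrix and B, C of size m x n. */
void ref_zhemm_left_lower(int m, int n, double complex alpha,
                          const double complex *A, int lda,
                          const double complex *B, int ldb,
                          double complex beta,
                          double complex *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double complex acc = 0.0;
            for (int p = 0; p < m; ++p)
                acc += herm_lower_at(A, lda, i, p) * B[IDX(p, j, ldb)];
            /* When beta == 0, C is not read. */
            double complex scaled =
                (beta == 0.0) ? 0.0 : beta * C[IDX(i, j, ldc)];
            C[IDX(i, j, ldc)] = alpha * acc + scaled;
        }
    }
}
```

Entries above the diagonal of `A` are never read in this sketch; they are obtained by conjugating the stored lower entries.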

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or if `side` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `m`) if `side` == `CUBLAS_SIDE_LEFT` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) or if `ldc` < max(1, `m`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

chemm, zhemm

2.7.14. cublas<t>herk()

cublasStatus_t cublasCherk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const float  *alpha,
                           const cuComplex       *A, int lda,
                           const float  *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasZherk(cublasHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const double *alpha,
                           const cuDoubleComplex *A, int lda,
                           const double *beta,
                           cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix such that $\text{op}(A)$ has dimensions $n \times k$. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$
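A CPU sketch of the rank-$k$ update in plain C (C99 complex) may help in reading the formula. It is not a cuBLAS call: the function name `ref_zherk_lower_n` and the `IDX` macro are hypothetical, column-major storage is assumed, and only `trans == CUBLAS_OP_N` with lower fill mode is covered. Note that `alpha` and `beta` are real, as in the prototypes above.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major offset of element (i, j) for leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical CPU reference, not a cuBLAS routine:
 *   C = alpha * A * A^H + beta * C
 * with A of size n x k, C Hermitian of size n x n, lower triangle only. */
void ref_zherk_lower_n(int n, int k, double alpha,
                       const double complex *A, int lda,
                       double beta, double complex *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            double complex acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[IDX(i, p, lda)] * conj(A[IDX(j, p, lda)]);

            double complex scaled = 0.0;
            if (beta != 0.0) {             /* C is read only when beta != 0 */
                double complex c_old = C[IDX(i, j, ldc)];
                if (i == j)
                    c_old = creal(c_old);  /* diagonal assumed real on input */
                scaled = beta * c_old;
            }

            double complex v = alpha * acc + scaled;
            /* Diagonal entries are written back with zero imaginary part. */
            C[IDX(i, j, ldc)] = (i == j) ? creal(v) : v;
        }
    }
}
```

In this sketch any imaginary part found on the diagonal of `C` on entry is ignored, and the diagonal is real on exit.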

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `transa == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk

2.7.15. cublas<t>her2k()

cublasStatus_t cublasCher2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const float  *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZher2k(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const double *beta,
                            cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian rank- $2k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \bar{\alpha}\text{op}(B)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ and $B$ are matrices with dimensions $\text{op}(A)$ $n \times k$ and $\text{op}(B)$ $n \times k$ , respectively. Also, for matrix $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{H}\text{ and }B^{H}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$
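
A brief check, added here as an illustration, of why this update preserves the Hermitian structure of $C$ : taking the conjugate transpose of the first two terms swaps them,

$\left( \alpha\text{op}(A)\text{op}(B)^{H} + \bar{\alpha}\text{op}(B)\text{op}(A)^{H} \right)^{H} = \bar{\alpha}\text{op}(B)\text{op}(A)^{H} + \alpha\text{op}(A)\text{op}(B)^{H}$

so their sum is unchanged. Since $\beta$ is real and $C$ is Hermitian on input, $\beta C$ is Hermitian as well.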

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other Hermitian part is not referenced.
trans		input	operation op(`A`) and op(`B`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `ldb` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cher2k, zher2k

2.7.16. cublas<t>herkx()

cublasStatus_t cublasCherkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, int lda,
                            const cuComplex       *B, int ldb,
                            const float  *beta,
                            cuComplex       *C, int ldc)
cublasStatus_t cublasZherkx(cublasHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            int n, int k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, int lda,
                            const cuDoubleComplex *B, int ldb,
                            const double *beta,
                            cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs a variation of the Hermitian rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ and $B$ are matrices with dimensions $\text{op}(A)$ $n \times k$ and $\text{op}(B)$ $n \times k$ , respectively. Also, for matrix $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{H}\text{ and }B^{H}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

This routine can be used when the matrix B is such that the result is guaranteed to be Hermitian. A common example is when the matrix B is a scaled form of the matrix A, which is equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine cublas<t>dgmm.
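
As a short worked check (an illustration added here, assuming trans == CUBLAS_OP_N and a diagonal matrix $D$ with real entries, which the text above does not state explicitly): if $B = AD$ , then

$A B^{H} = A (AD)^{H} = A D^{H} A^{H} = A D A^{H}$

and $(A D A^{H})^{H} = A D^{H} A^{H} = A D A^{H}$ , so the product is Hermitian. With a complex diagonal $D$ this argument would not go through, since $D^{H} \neq D$ in general.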

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other Hermitian part is not referenced.
trans		input	operation op(`A`) and op(`B`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `uplo` != `CUBLAS_FILL_MODE_LOWER`, `CUBLAS_FILL_MODE_UPPER` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `ldb` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or if `alpha` == NULL or `beta` == NULL or `C` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk

2.8. BLAS-like Extension

This section describes the BLAS-extension functions that perform matrix-matrix operations.

2.8.1. cublas<t>geam()

cublasStatus_t cublasSgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const float           *alpha,
                          const float           *A, int lda,
                          const float           *beta,
                          const float           *B, int ldb,
                          float           *C, int ldc)
cublasStatus_t cublasDgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const double          *alpha,
                          const double          *A, int lda,
                          const double          *beta,
                          const double          *B, int ldb,
                          double          *C, int ldc)
cublasStatus_t cublasCgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const cuComplex       *alpha,
                          const cuComplex       *A, int lda,
                          const cuComplex       *beta ,
                          const cuComplex       *B, int ldb,
                          cuComplex       *C, int ldc)
cublasStatus_t cublasZgeam(cublasHandle_t handle,
                          cublasOperation_t transa, cublasOperation_t transb,
                          int m, int n,
                          const cuDoubleComplex *alpha,
                          const cuDoubleComplex *A, int lda,
                          const cuDoubleComplex *beta,
                          const cuDoubleComplex *B, int ldb,
                          cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix addition/transposition

$C = \alpha\text{op}(A) + \beta\text{op}(B)$

where $\alpha$ and $\beta$ are scalars, and $A$ , $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times n$ , $\text{op}(B)$ $m \times n$ and $C$ $m \times n$ , respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$ .

The operation is out-of-place if C does not overlap A or B.

The in-place mode supports the following two operations,

$C = \alpha\text{*}C + \beta\text{op}(B)$

$C = \alpha\text{op}(A) + \beta\text{*}C$

In in-place mode, if C = A then ldc = lda and transa = CUBLAS_OP_N are required; if C = B then ldc = ldb and transb = CUBLAS_OP_N are required. If these requirements are not met, CUBLAS_STATUS_INVALID_VALUE is returned.

The operation includes the following special cases:

the user can reset matrix C to zero by setting *alpha=*beta=0.

the user can transpose matrix A by setting *alpha=1, *beta=0 and transa == CUBLAS_OP_T.
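
The two special cases can be illustrated with a small host-side sketch of the real-valued operation $C = \alpha A^{T} + \beta B$ . This is only a reference loop under stated assumptions, not cuBLAS code: the name `geam_tn_ref` is hypothetical, and it covers only transa == CUBLAS_OP_T with transb == CUBLAS_OP_N in single precision.

```c
#include <assert.h>

/* Hypothetical host reference (not cuBLAS code) for C = alpha*A^T + beta*B.
 * C and B are m x n, A is n x m, all stored in column-major format. */
void geam_tn_ref(int m, int n, float alpha, const float *A, int lda,
                 float beta, const float *B, int ldb, float *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* A zero scalar means the matching input is never read. */
            float a = (alpha == 0.0f) ? 0.0f : alpha * A[j + i * lda];
            float b = (beta == 0.0f) ? 0.0f : beta * B[i + j * ldb];
            C[i + j * ldc] = a + b;
        }
    }
}
```

Calling it with alpha = 1 and beta = 0 writes the transpose of `A` into `C`; calling it with alpha = beta = 0 zeroes `C` without reading `A` or `B`.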

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
alpha	host or device	input	<type> scalar used for multiplication. If `*alpha == 0`, `A` does not have to be a valid input.
A	device	input	<type> array of dimensions `lda x n` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)` if `transb == CUBLAS_OP_N` and `ldb x m` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `*beta == 0`, `B` does not have to be a valid input.
C	device	output	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or if `transa` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `n`) otherwise or if `ldb` < max(1, `m`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`) or if `A` == `C` and ((`CUBLAS_OP_N` != `transa`) || (`lda` != `ldc`)) or if `B` == `C` and ((`CUBLAS_OP_N` != `transb`) || (`ldb` != `ldc`)) or `alpha` == NULL or `beta` == NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.2. cublas<t>dgmm()

cublasStatus_t cublasSdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const float           *A, int lda,
                          const float           *x, int incx,
                          float           *C, int ldc)
cublasStatus_t cublasDdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const double          *A, int lda,
                          const double          *x, int incx,
                          double          *C, int ldc)
cublasStatus_t cublasCdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const cuComplex       *A, int lda,
                          const cuComplex       *x, int incx,
                          cuComplex       *C, int ldc)
cublasStatus_t cublasZdgmm(cublasHandle_t handle, cublasSideMode_t mode,
                          int m, int n,
                          const cuDoubleComplex *A, int lda,
                          const cuDoubleComplex *x, int incx,
                          cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix multiplication

$C = \left\{ \begin{matrix} {A \times diag(X)} & {\text{if }\textsf{mode == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ {diag(X) \times A} & {\text{if }\textsf{mode == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ \end{matrix} \right.$

where $A$ and $C$ are matrices stored in column-major format with dimensions $m \times n$ . $X$ is a vector of size $n$ if mode == CUBLAS_SIDE_RIGHT and of size $m$ if mode == CUBLAS_SIDE_LEFT. $X$ is gathered from one-dimensional array x with stride incx. The absolute value of incx is the stride and the sign of incx gives the direction of traversal. If incx is non-negative, x is read forward starting from its first element; otherwise, x is read backward starting from its last element. The formula for X is

$X\lbrack j\rbrack = \left\{ \begin{matrix} {x\lbrack j \times incx\rbrack} & {\text{if }incx \geq 0} \\ {x\lbrack(\chi - 1) \times |incx| - j \times |incx|\rbrack} & {\text{if }incx < 0} \\ \end{matrix} \right.$

where $\chi = m$ if mode == CUBLAS_SIDE_LEFT and $\chi = n$ if mode == CUBLAS_SIDE_RIGHT.
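
The gather formula above can be sketched on the host as follows. The helper names `dgmm_x_offset` and `dgmm_left_ref` are hypothetical, and the loop is a plain reference for the CUBLAS_SIDE_LEFT case in single precision, not the library's implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Offset into x of the element used as X[j] (0-based j).
 * chi is m for CUBLAS_SIDE_LEFT and n for CUBLAS_SIDE_RIGHT. */
long dgmm_x_offset(int j, int incx, int chi)
{
    assert(0 <= j && j < chi);
    long stride = labs((long)incx);
    return (incx >= 0) ? (long)j * stride
                       : (long)(chi - 1 - j) * stride;
}

/* Hypothetical host reference (not cuBLAS code) for C = diag(X) * A.
 * A and C are m x n, stored in column-major format. */
void dgmm_left_ref(int m, int n, const float *A, int lda,
                   const float *x, int incx, float *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[i + j * ldc] = x[dgmm_x_offset(i, incx, m)] * A[i + j * lda];
}
```

With incx = 0 every row is scaled by `x[0]`, and with a negative incx the entries of `x` are consumed in reverse order.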

Example 1: if the user wants to perform $diag(diag(B)) \times A$ , then $incx = ldb + 1$ where $ldb$ is leading dimension of matrix B, either row-major or column-major.

Example 2: if the user wants to perform $\alpha \times A$ , then there are two choices, either cublas<t>geam() with *beta=0 and transa == CUBLAS_OP_N or cublas<t>dgmm() with incx=0 and x[0]=alpha.

The operation is out-of-place. In-place operation (C = A) works only if lda = ldc.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
mode		input	left multiply if `mode == CUBLAS_SIDE_LEFT` or right multiply if `mode == CUBLAS_SIDE_RIGHT`
m		input	number of rows of matrix `A` and `C`.
n		input	number of columns of matrix `A` and `C`.
A	device	input	<type> array of dimensions `lda x n` with `lda>=max(1,m)`
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
x	device	input	one-dimensional <type> array of size $|incx| \times m$ if `mode == CUBLAS_SIDE_LEFT` and $|incx| \times n$ if `mode == CUBLAS_SIDE_RIGHT`
incx		input	stride of one-dimensional array `x`.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or if `mode` != `CUBLAS_SIDE_LEFT`, `CUBLAS_SIDE_RIGHT` or if `lda` < max(1, `m`) or `ldc` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.3. cublas<t>getrfBatched()

cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle,
                                   int n,
                                   float *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasDgetrfBatched(cublasHandle_t handle,
                                   int n,
                                   double *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasCgetrfBatched(cublasHandle_t handle,
                                   int n,
                                   cuComplex *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasZgetrfBatched(cublasHandle_t handle,
                                   int n,
                                   cuDoubleComplex *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   int *infoArray,
                                   int batchSize);

Aarray is an array of pointers to matrices stored in column-major format with dimensions nxn and leading dimension lda.

This function performs the LU factorization of each Aarray[i] for i = 0, …, batchSize-1 by the following equation

$\text{P}\text{*}{Aarray}\lbrack i\rbrack = L\text{*}U$

where P is a permutation matrix which represents partial pivoting with row interchanges. L is a lower triangular matrix with unit diagonal and U is an upper triangular matrix.

Formally, P is written as a product of permutation matrices Pj, for j = 1,2,...,n, that is, P = P1 * P2 * P3 * ... * Pn. Pj is a permutation matrix which interchanges two rows of vector x when performing Pj*x. Pj can be constructed from the j-th element of PivotArray[i] by the following Matlab code

% In Matlab PivotArray[i] is an array of base-1.
% In C, PivotArray[i] is base-0.
Pj = eye(n);
% swap Pj(j,:) and Pj(PivotArray[i][j],:); piv stands for PivotArray[i]
Pj([j piv(j)],:) = Pj([piv(j) j],:);

L and U are written back to original matrix A, and diagonal elements of L are discarded. The L and U can be constructed by the following Matlab code

% A is a matrix of nxn after getrf.
L = eye(n);
for j = 1:n
    L(j+1:n,j) = A(j+1:n,j);
end
U = zeros(n);
for i = 1:n
    U(i,i:n) = A(i,i:n);
end
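
For readers working in C rather than Matlab, the same reconstruction can be sketched with 0-based indexing on a host copy of one factorized matrix. The function name `unpack_lu` is hypothetical, double precision is chosen only for illustration, and the outputs are assumed to be dense n x n arrays with leading dimension n.

```c
#include <assert.h>

/* Split packed n x n LU factors (column-major, leading dimension lda) into a
 * unit-lower-triangular L and an upper-triangular U, both column-major with
 * leading dimension n. Host-side illustration only. */
void unpack_lu(int n, const double *A, int lda, double *L, double *U)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double a = A[i + j * lda];
            /* Below the diagonal belongs to L; its unit diagonal is implied. */
            L[i + j * n] = (i > j) ? a : (i == j ? 1.0 : 0.0);
            /* The diagonal and everything above it belongs to U. */
            U[i + j * n] = (i <= j) ? a : 0.0;
        }
    }
}
```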

If matrix A(=Aarray[i]) is singular, getrf still works and the value of info(=infoArray[i]) reports the first row index at which the LU factorization cannot proceed. If info is k, U(k,k) is zero. The equation P*A=L*U still holds; however, reconstructing L and U requires different Matlab code, as follows:

% A is a matrix of nxn after getrf.
% info is k, which means U(k,k) is zero.
L = eye(n);
for j = 1:k-1
    L(j+1:n,j) = A(j+1:n,j);
end
U = zeros(n);
for i = 1:k-1
    U(i,i:n) = A(i,i:n);
end
for i = k:n
    U(i,k:n) = A(i,k:n);
end

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>getrfBatched supports non-pivot LU factorization if PivotArray is NULL.

cublas<t>getrfBatched supports arbitrary dimension.

cublas<t>getrfBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of rows and columns of `Aarray[i]`.
Aarray	device	input/output	array of pointers to <type> array, with each array of dim. `n x n` with `lda>=max(1,n)`. Matrices `Aarray[i]` should not overlap; otherwise, undefined behavior is expected.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
PivotArray	device	output	array of size `n x batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `PivotArray` is NULL, pivoting is disabled.
infoArray	device	output	array of size `batchSize` in which info(=infoArray[i]) contains the information about the factorization of `Aarray[i]`. If info=0, the execution is successful. If info = -j, the j-th parameter had an illegal value. If info = k, U(k,k) is 0. The factorization has been completed, but U is exactly singular.
batchSize		input	number of pointers contained in `Aarray`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `batchSize` < 0 or `lda` < 0
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgetrf, dgetrf, cgetrf, zgetrf

2.8.4. cublas<t>getrsBatched()

cublasStatus_t cublasSgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int n,
                                   int nrhs,
                                   const float *const Aarray[],
                                   int lda,
                                   const int *devIpiv,
                                   float *const Barray[],
                                   int ldb,
                                   int *info,
                                   int batchSize);

cublasStatus_t cublasDgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int n,
                                   int nrhs,
                                   const double *const Aarray[],
                                   int lda,
                                   const int *devIpiv,
                                   double *const Barray[],
                                   int ldb,
                                   int *info,
                                   int batchSize);

cublasStatus_t cublasCgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int n,
                                   int nrhs,
                                   const cuComplex *const Aarray[],
                                   int lda,
                                   const int *devIpiv,
                                   cuComplex *const Barray[],
                                   int ldb,
                                   int *info,
                                   int batchSize);

cublasStatus_t cublasZgetrsBatched(cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int n,
                                   int nrhs,
                                   const cuDoubleComplex *const Aarray[],
                                   int lda,
                                   const int *devIpiv,
                                   cuDoubleComplex *const Barray[],
                                   int ldb,
                                   int *info,
                                   int batchSize);

This function solves an array of systems of linear equations of the form:

$\text{op}(A\lbrack i \rbrack) X\lbrack i\rbrack = B\lbrack i\rbrack$

where $A\lbrack i\rbrack$ is a matrix which has been LU factorized with pivoting, $X\lbrack i\rbrack$ and $B\lbrack i\rbrack$ are $n \times {nrhs}$ matrices. Also, for matrix $A$

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix} {A\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{T}\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ {A^{H}\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$
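
To make the role of the packed LU factors and the pivoting sequence concrete, here is a host-side sketch of solving one system with a single right-hand side for trans == CUBLAS_OP_N. It is an illustration, not the library's algorithm. The name `lu_solve_ref` is hypothetical, and the pivot entries are assumed to be 1-based and applied in order, as in LAPACK getrs; this section does not state that convention explicitly.

```c
#include <assert.h>

/* Hypothetical host reference (not cuBLAS code): solve A*x = b in place in b,
 * given the packed LU factors of A (column-major, leading dimension lda).
 * piv holds 1-based row interchanges (assumed LAPACK convention: row j was
 * swapped with row piv[j]-1, for j = 0..n-1 in order); NULL means no pivoting. */
void lu_solve_ref(int n, const double *LU, int lda, const int *piv, double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    if (piv) {
        /* Apply the row interchanges to the right-hand side: b <- P*b. */
        for (int j = 0; j < n; ++j) {
            int p = piv[j] - 1;
            double t = b[j]; b[j] = b[p]; b[p] = t;
        }
    }
    /* Forward substitution with the unit-lower-triangular factor L. */
    for (int j = 0; j < n; ++j)
        for (int i = j + 1; i < n; ++i)
            b[i] -= LU[i + j * lda] * b[j];
    /* Back substitution with the upper-triangular factor U. */
    for (int j = n - 1; j >= 0; --j) {
        b[j] /= LU[j + j * lda];
        for (int i = 0; i < j; ++i)
            b[i] -= LU[i + j * lda] * b[j];
    }
}
```

Passing `piv == NULL` corresponds to the non-pivoting case, in which the factors are used exactly as stored.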

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>getrsBatched supports non-pivot LU factorization if devIpiv is NULL.

cublas<t>getrsBatched supports arbitrary dimension.

cublas<t>getrsBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows and columns of `Aarray[i]`.
nrhs		input	number of columns of `Barray[i]`.
Aarray	device	input	array of pointers to <type> array, with each array of dim. `n x n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
devIpiv	device	input	array of size `n x batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `devIpiv` is NULL, pivoting for all `Aarray[i]` is ignored.
Barray	device	input/output	array of pointers to <type> array, with each array of dim. `n x nrhs` with `ldb>=max(1,n)`. Matrices `Barray[i]` should not overlap; otherwise, undefined behavior is expected.
ldb		input	leading dimension of two-dimensional array used to store each solution matrix `Barray[i]`.
info	host	output	If info=0, the execution is successful. If info = -j, the j-th parameter had an illegal value.
batchSize		input	number of pointers contained in `Aarray`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `nrhs` < 0 or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `n`) or `ldb` < max(1, `n`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgetrs, dgetrs, cgetrs, zgetrs

2.8.5. cublas<t>getriBatched()

cublasStatus_t cublasSgetriBatched(cublasHandle_t handle,
                                   int n,
                                   const float *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   float *const Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasDgetriBatched(cublasHandle_t handle,
                                   int n,
                                   const double *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   double *const Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasCgetriBatched(cublasHandle_t handle,
                                   int n,
                                   const cuComplex *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   cuComplex *const Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

cublasStatus_t cublasZgetriBatched(cublasHandle_t handle,
                                   int n,
                                   const cuDoubleComplex *const Aarray[],
                                   int lda,
                                   int *PivotArray,
                                   cuDoubleComplex *const Carray[],
                                   int ldc,
                                   int *infoArray,
                                   int batchSize);

Aarray and Carray are arrays of pointers to matrices stored in column-major format with dimensions n*n and leading dimension lda and ldc respectively.

This function performs the inversion of matrices A[i] for i = 0, …, batchSize-1.

Prior to calling cublas<t>getriBatched, the matrix A[i] must be factorized first using the routine cublas<t>getrfBatched. After the call of cublas<t>getrfBatched, the matrix pointed to by Aarray[i] will contain the LU factors of the matrix A[i] and the vector pointed to by (PivotArray+i) will contain the pivoting sequence.

Following the LU factorization, cublas<t>getriBatched uses forward and backward triangular solvers to complete inversion of matrices A[i] for i = 0, …, batchSize-1. The inversion is out-of-place, so the memory space of Carray[i] cannot overlap the memory space of Aarray[i].

Typically all parameters in cublas<t>getrfBatched would be passed into cublas<t>getriBatched. For example,

// step 1: perform in-place LU decomposition, P*A = L*U.
//      Aarray[i] is n*n matrix A[i]
    cublasDgetrfBatched(handle, n, Aarray, lda, PivotArray, infoArray, batchSize);
//      check infoArray[i] to see if factorization of A[i] is successful or not.
//      Aarray[i] contains LU factorization of A[i]

// step 2: perform out-of-place inversion, Carray[i] = inv(A[i])
    cublasDgetriBatched(handle, n, Aarray, lda, PivotArray, Carray, ldc, infoArray, batchSize);
//      check infoArray[i] to see if inversion of A[i] is successful or not.

The user can check singularity from either cublas<t>getrfBatched or cublas<t>getriBatched.

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

If cublas<t>getrfBatched is performed without pivoting, PivotArray of cublas<t>getriBatched should be NULL.

cublas<t>getriBatched supports arbitrary dimension.

cublas<t>getriBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of rows and columns of `Aarray[i]`.
Aarray	device	input	array of pointers to <type> array, with each array of dimension `n*n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
PivotArray	device	input	array of size `n*batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `PivotArray` is NULL, pivoting is disabled.
Carray	device	output	array of pointers to <type> array, with each array of dimension `n*n` with `ldc>=max(1,n)`. Matrices `Carray[i]` should not overlap; otherwise, undefined behavior is expected.
ldc		input	leading dimension of two-dimensional array used to store each matrix `Carray[i]`.
infoArray	device	output	array of size `batchSize`, where info (= `infoArray[i]`) contains information about the inversion of `A[i]`. If info=0, the execution is successful. If info = k, U(k,k) is 0: U is exactly singular and the inversion failed.
batchSize		input	number of pointers contained in Aarray.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `lda` < 0 or `ldc` < 0 or `batchSize` < 0 or `lda` < `n` or `ldc` < `n`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.6. cublas<t>matinvBatched()

cublasStatus_t cublasSmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const float *const A[],
                                    int lda,
                                    float *const Ainv[],
                                    int lda_inv,
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasDmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const double *const A[],
                                    int lda,
                                    double *const Ainv[],
                                    int lda_inv,
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasCmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const cuComplex *const A[],
                                    int lda,
                                    cuComplex *const Ainv[],
                                    int lda_inv,
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasZmatinvBatched(cublasHandle_t handle,
                                    int n,
                                    const cuDoubleComplex *const A[],
                                    int lda,
                                    cuDoubleComplex *const Ainv[],
                                    int lda_inv,
                                    int *info,
                                    int batchSize);

A and Ainv are arrays of pointers to matrices stored in column-major format with dimensions n*n and leading dimension lda and lda_inv respectively.

This function performs the inversion of matrices A[i] for i = 0, …, batchSize-1.

This function is a shortcut for cublas<t>getrfBatched followed by cublas<t>getriBatched. However, it does not work if n is greater than 32; in that case, the user has to go through cublas<t>getrfBatched and cublas<t>getriBatched.
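The size limit can be expressed as a simple dispatch rule. The sketch below is host-only C; the enum and the function `choose_invert_path` are hypothetical names introduced for illustration, and the actual cuBLAS calls are only indicated in comments.

```c
#include <assert.h>

/* Hypothetical names for illustration; not part of cuBLAS. */
typedef enum {
    INVERT_WITH_MATINV_BATCHED,      /* one call to cublas<t>matinvBatched */
    INVERT_WITH_GETRF_GETRI_BATCHED  /* cublas<t>getrfBatched, then cublas<t>getriBatched */
} invert_path_t;

/* Picks the inversion route for n x n matrices, following the n <= 32
 * limit documented for cublas<t>matinvBatched. */
invert_path_t choose_invert_path(int n)
{
    assert(n >= 0);
    return (n <= 32) ? INVERT_WITH_MATINV_BATCHED
                     : INVERT_WITH_GETRF_GETRI_BATCHED;
}
```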

If the matrix A[i] is singular, then info[i] reports singularity, the same as cublas<t>getrfBatched.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of rows and columns of `A[i]`.
A	device	input	array of pointers to <type> array, with each array of dimension `n*n` with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `A[i]`.
Ainv	device	output	array of pointers to <type> array, with each array of dimension `n*n` with `lda_inv>=max(1,n)`. Matrices `Ainv[i]` should not overlap; otherwise, undefined behavior is expected.
lda_inv		input	leading dimension of two-dimensional array used to store each matrix `Ainv[i]`.
info	device	output	array of size `batchSize`, where `info[i]` contains information about the inversion of `A[i]`. If info[i]=0, the execution is successful. If info[i]=k, U(k,k) is 0: U is exactly singular and the inversion failed.
batchSize		input	number of pointers contained in A.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `lda` < 0 or `lda_inv` < 0 or `batchSize` < 0 or if `lda` < `n` or `lda_inv` < `n` or `n` > 32
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

2.8.7. cublas<t>geqrfBatched()

cublasStatus_t cublasSgeqrfBatched( cublasHandle_t handle,
                                    int m,
                                    int n,
                                    float *const Aarray[],
                                    int lda,
                                    float *const TauArray[],
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasDgeqrfBatched( cublasHandle_t handle,
                                    int m,
                                    int n,
                                    double *const Aarray[],
                                    int lda,
                                    double *const TauArray[],
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasCgeqrfBatched( cublasHandle_t handle,
                                    int m,
                                    int n,
                                    cuComplex *const Aarray[],
                                    int lda,
                                    cuComplex *const TauArray[],
                                    int *info,
                                    int batchSize);

cublasStatus_t cublasZgeqrfBatched( cublasHandle_t handle,
                                    int m,
                                    int n,
                                    cuDoubleComplex *const Aarray[],
                                    int lda,
                                    cuDoubleComplex *const TauArray[],
                                    int *info,
                                    int batchSize);

Aarray is an array of pointers to matrices stored in column-major format with dimensions m x n and leading dimension lda. TauArray is an array of pointers to vectors of dimension at least max(1, min(m, n)).

This function performs the QR factorization of each Aarray[j] for j = 0, ..., batchSize-1 using Householder reflections. Each matrix Q[j] is represented as a product of elementary reflectors and is stored in the lower part of each Aarray[j] as follows:

Q[j] = H[j][1] H[j][2] . . . H[j][k], where k = min(m,n).

Each H[j][i] has the form

H[j][i] = I - tau[j] * v * v'

where tau[j] is a real scalar, and v is a real vector with v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in Aarray[j][i+1:m,i], and tau in TauArray[j][i].
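To make the storage convention concrete, the following host-only sketch applies a single reflector H = I - tau * v * v' to a vector, reading v from one column of a factored matrix. It is a reference for the real double-precision case only, it uses 0-based indices (the text above is 1-based), and `apply_reflector` is a hypothetical helper name, not a cuBLAS function.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference (hypothetical helper, real double precision only).
 * Applies H = I - tau * v * v' to the vector x of length m, in place.
 * v is implicit: v[0..i-1] = 0, v[i] = 1, v[r] = col[r] for r > i,
 * where col points to column i (0-based) of the factored matrix. */
void apply_reflector(int m, int i, const double *col, double tau, double *x)
{
    assert(m > 0 && i >= 0 && i < m);
    assert(col != NULL && x != NULL);

    /* dot = v' * x; entries above row i do not contribute because v is 0 there */
    double dot = x[i];
    for (int r = i + 1; r < m; ++r) {
        dot += col[r] * x[r];
    }

    /* x = x - tau * v * dot */
    x[i] -= tau * dot;
    for (int r = i + 1; r < m; ++r) {
        x[r] -= tau * col[r] * dot;
    }
}
```

The diagonal entry col[i] is never read, which reflects that v(i) = 1 is implied rather than stored.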

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>geqrfBatched supports arbitrary dimension.

cublas<t>geqrfBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
m		input	number of rows of `Aarray[i]`.
n		input	number of columns of `Aarray[i]`.
Aarray	device	input	array of pointers to <type> array, with each array of dim. `m x n` with `lda>=max(1,m)`.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
TauArray	device	output	array of pointers to <type> vector, with each vector of dim. `max(1,min(m,n))`.
info	host	output	If info=0, the parameters passed to the function are valid. If info<0, the parameter in position -info is invalid.
batchSize		input	number of pointers contained in Aarray.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `batchSize` < 0 or `lda` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgeqrf, dgeqrf, cgeqrf, zgeqrf

2.8.8. cublas<t>gelsBatched()

cublasStatus_t cublasSgelsBatched( cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int m,
                                   int n,
                                   int nrhs,
                                   float *const Aarray[],
                                   int lda,
                                   float *const Carray[],
                                   int ldc,
                                   int *info,
                                   int *devInfoArray,
                                   int batchSize );

cublasStatus_t cublasDgelsBatched( cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int m,
                                   int n,
                                   int nrhs,
                                   double *const Aarray[],
                                   int lda,
                                   double *const Carray[],
                                   int ldc,
                                   int *info,
                                   int *devInfoArray,
                                   int batchSize );

cublasStatus_t cublasCgelsBatched( cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int m,
                                   int n,
                                   int nrhs,
                                   cuComplex *const Aarray[],
                                   int lda,
                                   cuComplex *const Carray[],
                                   int ldc,
                                   int *info,
                                   int *devInfoArray,
                                   int batchSize );

cublasStatus_t cublasZgelsBatched( cublasHandle_t handle,
                                   cublasOperation_t trans,
                                   int m,
                                   int n,
                                   int nrhs,
                                   cuDoubleComplex *const Aarray[],
                                   int lda,
                                   cuDoubleComplex *const Carray[],
                                   int ldc,
                                   int *info,
                                   int *devInfoArray,
                                   int batchSize );

Aarray is an array of pointers to matrices stored in column-major format. Carray is an array of pointers to matrices stored in column-major format.

This function finds the least squares solution of a batch of overdetermined systems. It solves the least squares problem described as follows:

minimize  || Carray[i] - Aarray[i]*Xarray[i] || , with i = 0, ...,batchSize-1

On exit, each Aarray[i] is overwritten with its QR factorization and each Carray[i] is overwritten with the least squares solution.
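The quantity being minimized can be evaluated on the host to sanity-check a candidate solution. The sketch below computes the squared Euclidean residual for a single right-hand side in double precision; it must be given copies of the original matrix and right-hand side, since both are overwritten by the routine. `ls_residual_sq` is a hypothetical helper name, and the choice of the Euclidean norm is an assumption based on the least squares formulation.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference (hypothetical helper): returns || b - A*x ||^2,
 * where A is m x n in column-major storage with leading dimension lda,
 * x has length n and b has length m. */
double ls_residual_sq(int m, int n, const double *A, int lda,
                      const double *x, const double *b)
{
    assert(m >= 0 && n >= 0);
    assert(lda >= (m > 1 ? m : 1));

    double sumsq = 0.0;
    for (int r = 0; r < m; ++r) {
        double resid = b[r];
        for (int c = 0; c < n; ++c) {
            resid -= A[r + (size_t)c * lda] * x[c];   /* element A(r,c) */
        }
        sumsq += resid * resid;
    }
    return sumsq;
}
```

Minimizing the squared norm is equivalent to minimizing the norm itself, so the value can be compared across candidate solutions directly.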

cublas<t>gelsBatched supports only the non-transpose operation and only solves over-determined systems (m >= n).

cublas<t>gelsBatched only supports compute capability 2.0 or above.

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
trans		input	operation op(`Aarray[i]`) that is non- or (conj.) transpose. Only non-transpose operation is currently supported.
m		input	number of rows of each `Aarray[i]` and `Carray[i]` if `trans == CUBLAS_OP_N`, number of columns of each `Aarray[i]` otherwise (not supported currently).
n		input	number of columns of each `Aarray[i]` if `trans == CUBLAS_OP_N`, and number of rows of each `Aarray[i]` and `Carray[i]` otherwise (not supported currently).
nrhs		input	number of columns of each `Carray[i]`.
Aarray	device	input/output	array of pointers to <type> array, with each array of dim. `m x n` with `lda>=max(1,m)` if `trans == CUBLAS_OP_N`, and `n x m` with `lda>=max(1,n)` otherwise (not supported currently). Matrices `Aarray[i]` should not overlap; otherwise, undefined behavior is expected.
lda		input	leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
Carray	device	input/output	array of pointers to <type> array, with each array of dim. `m x nrhs` with `ldc>=max(1,m)` if `trans == CUBLAS_OP_N`, and `n x nrhs` with `ldc>=max(1,n)` otherwise (not supported currently). Matrices `Carray[i]` should not overlap; otherwise, undefined behavior is expected.
ldc		input	leading dimension of two-dimensional array used to store each matrix `Carray[i]`.
info	host	output	If info=0, the parameters passed to the function are valid. If info<0, the parameter in position -info is invalid.
devInfoArray	device	output	optional array of integers of dimension batchSize. If non-null, every element devInfoArray[i] contains a value V with the following meaning: V = 0: the i-th problem was successfully solved. V > 0: the V-th diagonal element of Aarray[i] is zero; Aarray[i] does not have full rank.
batchSize		input	number of pointers contained in Aarray and Carray.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `nrhs` < 0 or `batchSize` < 0 or `lda` < max(1, `m`) or `ldc` < max(1, `m`)
`CUBLAS_STATUS_NOT_SUPPORTED`	`m` < `n` or `trans` is different from non-transpose.
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgels, dgels, cgels, zgels

2.8.9. cublas<t>tpttr()

cublasStatus_t cublasStpttr ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const float *AP,
                              float *A,
                              int lda );

cublasStatus_t cublasDtpttr ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const double *AP,
                              double *A,
                              int lda );

cublasStatus_t cublasCtpttr ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const cuComplex *AP,
                              cuComplex *A,
                              int lda );

cublasStatus_t cublasZtpttr ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const cuDoubleComplex *AP,
                              cuDoubleComplex *A,
                              int lda );

This function performs the conversion from the triangular packed format to the triangular format.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements of AP are copied into the lower triangular part of the triangular matrix A and the upper part of A is left untouched. If uplo == CUBLAS_FILL_MODE_UPPER then the elements of AP are copied into the upper triangular part of the triangular matrix A and the lower part of A is left untouched.
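The copy can be pictured with a host-only reference. The sketch below assumes the usual packed layout, in which the stored triangle is laid out column by column without gaps; `tpttr_ref` and its `lower` flag are hypothetical names standing in for the cuBLAS routine and the `uplo` argument.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference (hypothetical helper, float only).
 * Unpacks AP into the selected triangle of the n x n column-major matrix A.
 * lower != 0 selects the lower triangle, lower == 0 the upper triangle.
 * Assumes AP stores the triangle column by column without gaps. */
void tpttr_ref(int lower, int n, const float *AP, float *A, int lda)
{
    assert(n >= 0);
    assert(lda >= (n > 1 ? n : 1));

    size_t p = 0;                         /* running index into AP */
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;        /* first stored row of column j */
        int last  = lower ? n - 1 : j;    /* last stored row of column j */
        for (int i = first; i <= last; ++i) {
            A[i + (size_t)j * lda] = AP[p++];
        }
    }
}
```

Entries of A outside the selected triangle are never written, matching the statement that the opposite part is left untouched.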

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates if matrix `AP` contains lower or upper part of matrix `A`.
n		input	number of rows and columns of matrix `A`.
AP	device	input	<type> array with $A$ stored in packed format.
A	device	output	<type> array of dimensions `lda x n` , with `lda>=max(1,n)`. The opposite side of A is left untouched.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or if `uplo` != `CUBLAS_FILL_MODE_UPPER`, `CUBLAS_FILL_MODE_LOWER` or `lda` < max(1, `n`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

stpttr, dtpttr, ctpttr, ztpttr

2.8.10. cublas<t>trttp()

cublasStatus_t cublasStrttp ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const float *A,
                              int lda,
                              float *AP );

cublasStatus_t cublasDtrttp ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const double *A,
                              int lda,
                              double *AP );

cublasStatus_t cublasCtrttp ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const cuComplex *A,
                              int lda,
                              cuComplex *AP );

cublasStatus_t cublasZtrttp ( cublasHandle_t handle,
                              cublasFillMode_t uplo,
                              int n,
                              const cuDoubleComplex *A,
                              int lda,
                              cuDoubleComplex *AP );

This function performs the conversion from the triangular format to the triangular packed format.

If uplo == CUBLAS_FILL_MODE_LOWER then the lower triangular part of the triangular matrix A is copied into the array AP. If uplo == CUBLAS_FILL_MODE_UPPER then the upper triangular part of the triangular matrix A is copied into the array AP.
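A host-only reference for the packing direction is sketched below, including the number of elements AP needs to hold. It assumes the packed layout stores the triangle column by column without gaps; `packed_len` and `trttp_ref` are hypothetical helper names, and the `lower` flag stands in for the `uplo` argument.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in the packed storage of an n x n triangle. */
size_t packed_len(int n)
{
    assert(n >= 0);
    return (size_t)n * ((size_t)n + 1) / 2;
}

/* Host-side reference (hypothetical helper, float only).
 * Packs the selected triangle of the n x n column-major matrix A
 * (leading dimension lda) into AP, column by column without gaps.
 * lower != 0 selects the lower triangle, lower == 0 the upper triangle. */
void trttp_ref(int lower, int n, const float *A, int lda, float *AP)
{
    assert(n >= 0);
    assert(lda >= (n > 1 ? n : 1));

    size_t p = 0;                         /* running index into AP */
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;        /* first packed row of column j */
        int last  = lower ? n - 1 : j;    /* last packed row of column j */
        for (int i = first; i <= last; ++i) {
            AP[p++] = A[i + (size_t)j * lda];
        }
    }
}
```

Because rows beyond n in each column are skipped, the padding introduced by lda > n never reaches AP.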

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
uplo		input	indicates whether the lower or upper part of matrix `A` is referenced.
n		input	number of rows and columns of matrix `A`.
A	device	input	<type> array of dimensions `lda x n` , with `lda>=max(1,n)`.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
AP	device	output	<type> array with $A$ stored in packed format.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or if `uplo` != `CUBLAS_FILL_MODE_UPPER`, `CUBLAS_FILL_MODE_LOWER` or `lda` < max(1, `n`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

strttp, dtrttp, ctrttp, ztrttp

2.8.11. cublas<t>gemmEx()

cublasStatus_t cublasSgemmEx(cublasHandle_t handle,
                           cublasOperation_t transa,
                           cublasOperation_t transb,
                           int m,
                           int n,
                           int k,
                           const float    *alpha,
                           const void     *A,
                           cudaDataType_t Atype,
                           int lda,
                           const void     *B,
                           cudaDataType_t Btype,
                           int ldb,
                           const float    *beta,
                           void           *C,
                           cudaDataType_t Ctype,
                           int ldc)
cublasStatus_t cublasCgemmEx(cublasHandle_t handle,
                           cublasOperation_t transa,
                           cublasOperation_t transb,
                           int m,
                           int n,
                           int k,
                           const cuComplex *alpha,
                           const void      *A,
                           cudaDataType_t  Atype,
                           int lda,
                           const void      *B,
                           cudaDataType_t  Btype,
                           int ldb,
                           const cuComplex *beta,
                           void            *C,
                           cudaDataType_t  Ctype,
                           int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemm. In this function the input matrices and output matrices can have a lower precision but the computation is still done in the type <t>. For example, in the type float for cublasSgemmEx() and in the type cuComplex for cublasCgemmEx().

$C = \alpha\text{op}(A)\text{op}(B) + \beta C$

where $\alpha$ and $\beta$ are scalars, and $A$ , $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$ , $\text{op}(B)$ $k \times n$ and $C$ $m \times n$ , respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$ .
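The formula and the role of the leading dimensions can be checked against a plain host reference. The sketch below covers only real float data with non-transposed or transposed operands, so it corresponds to the case where all three matrices are `CUDA_R_32F`; `gemm_ref` and its integer flags `ta` and `tb` are hypothetical names, not part of cuBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference (hypothetical helper) for
 * C = alpha * op(A) * op(B) + beta * C with real float data, column-major.
 * ta / tb: 0 means op(X) = X, nonzero means op(X) = X^T. */
void gemm_ref(int ta, int tb, int m, int n, int k,
              float alpha, const float *A, int lda,
              const float *B, int ldb,
              float beta, float *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;   /* accumulate in the compute type (float) */
            for (int p = 0; p < k; ++p) {
                /* op(A)(i,p) and op(B)(p,j) in column-major storage */
                float a = ta ? A[p + (size_t)i * lda] : A[i + (size_t)p * lda];
                float b = tb ? B[j + (size_t)p * ldb] : B[p + (size_t)j * ldb];
                acc += a * b;
            }
            float *c = &C[i + (size_t)j * ldc];
            /* when beta is 0, C is not read, so it need not be a valid input */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```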

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
k		input	number of columns of op(`A`) and rows of op(`B`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
Atype		input	enumerant specifying the datatype of matrix `A`.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
Btype		input	enumerant specifying the datatype of matrix `B`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
Ctype		input	enumerant specifying the datatype of matrix `C`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.

The matrix types combinations supported for cublasSgemmEx() are listed below:

C	A/B
`CUDA_R_16BF`	`CUDA_R_16BF`
`CUDA_R_16F`	`CUDA_R_16F`
`CUDA_R_32F`	`CUDA_R_8I`
	`CUDA_R_16BF`
	`CUDA_R_16F`
	`CUDA_R_32F`

The matrix types combinations supported for cublasCgemmEx() are listed below:

C	A/B
`CUDA_C_32F`	`CUDA_C_8I`
	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasCgemmEx() is only supported for GPU with architecture capabilities equal to or greater than 5.0
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `Atype`, `Btype` and `Ctype` is not supported
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `k` < 0 or if `transa` or `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or `ldc` < max(1, `m`)
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgemm

For more information about the numerical behavior of some GEMM algorithms, refer to the GEMM Algorithms Numerical Behavior section.

2.8.12. cublasGemmEx()

cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                           cublasOperation_t transa,
                           cublasOperation_t transb,
                           int m,
                           int n,
                           int k,
                           const void    *alpha,
                           const void     *A,
                           cudaDataType_t Atype,
                           int lda,
                           const void     *B,
                           cudaDataType_t Btype,
                           int ldb,
                           const void    *beta,
                           void           *C,
                           cudaDataType_t Ctype,
                           int ldc,
                           cublasComputeType_t computeType,
                           cublasGemmAlgo_t algo)

#if defined(__cplusplus)
cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                           cublasOperation_t transa,
                           cublasOperation_t transb,
                           int m,
                           int n,
                           int k,
                           const void     *alpha,
                           const void     *A,
                           cudaDataType   Atype,
                           int lda,
                           const void     *B,
                           cudaDataType   Btype,
                           int ldb,
                           const void     *beta,
                           void           *C,
                           cudaDataType   Ctype,
                           int ldc,
                           cudaDataType   computeType,
                           cublasGemmAlgo_t algo)
#endif

This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemm that allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run. Supported combinations of arguments are listed further down in this section.

Note

The second variant of cublasGemmEx() function is provided for backward compatibility with C++ applications code, where the computeType parameter is of cudaDataType instead of cublasComputeType_t. C applications would still compile with the updated function signature.

This function is only supported on devices with compute capability 5.0 or later.

$C = \alpha\text{op}(A)\text{op}(B) + \beta C$

where $\alpha$ and $\beta$ are scalars, and $A$ , $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$ , $\text{op}(B)$ $k \times n$ and $C$ $m \times n$ , respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$ .

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
transa		input	Operation op(`A`) that is non- or (conj.) transpose.
transb		input	Operation op(`B`) that is non- or (conj.) transpose.
m		input	Number of rows of matrix op(`A`) and `C`.
n		input	Number of columns of matrix op(`B`) and `C`.
k		input	Number of columns of op(`A`) and rows of op(`B`).
alpha	host or device	input	Scaling factor for A*B of the type that corresponds to the computeType and Ctype, see the table below for details.
A	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
Atype		input	Enumerant specifying the datatype of matrix `A`.
lda		input	Leading dimension of two-dimensional array used to store the matrix `A`.
B	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
Btype		input	Enumerant specifying the datatype of matrix `B`.
ldb		input	Leading dimension of two-dimensional array used to store matrix `B`.
beta	host or device	input	Scaling factor for C of the type that corresponds to the computeType and Ctype, see the table below for details. If `beta==0`, `C` does not have to be a valid input.
C	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
Ctype		input	Enumerant specifying the datatype of matrix `C`.
ldc		input	Leading dimension of a two-dimensional array used to store the matrix `C`.
computeType		input	Enumerant specifying the computation type.
algo		input	Enumerant specifying the algorithm. See cublasGemmAlgo_t.

cublasGemmEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_8I`	`CUDA_R_32F`
		`CUDA_R_16BF`	`CUDA_R_32F`
		`CUDA_R_16F`	`CUDA_R_32F`
		`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_8I`	`CUDA_C_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

Note

CUBLAS_COMPUTE_32I and CUBLAS_COMPUTE_32I_PEDANTIC compute types are only supported with A, B being 4-byte aligned and lda, ldb being multiples of 4. For better performance, it is also recommended that the IMMA kernel requirements for regular data ordering are met.
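These two conditions can be verified on the host before the call, because a device pointer is just an address as far as alignment is concerned. The following is a minimal sketch; `compute_32i_args_ok` is a hypothetical helper name and not part of cuBLAS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-check for the integer compute types.
 * Returns 1 if A and B are 4-byte aligned and lda, ldb are multiples of 4,
 * and 0 otherwise. */
int compute_32i_args_ok(const void *A, int lda, const void *B, int ldb)
{
    assert(A != NULL && B != NULL);
    int aligned = ((uintptr_t)A % 4 == 0) && ((uintptr_t)B % 4 == 0);
    int strides = (lda % 4 == 0) && (ldb % 4 == 0);
    return aligned && strides;
}
```

A failed check suggests either padding the leading dimensions up to the next multiple of 4 or choosing a different compute type.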

The possible error values returned by this function and their meanings are listed in the following table.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasGemmEx() is only supported for GPU with architecture capabilities equal to or greater than 5.0.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype`, `Btype` and `Ctype` or the algorithm `algo` is not supported.
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `k` < 0 or if `transa` or `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`) or `Atype` or `Btype` or `Ctype` or `algo` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

Starting with release 11.2, using the typed functions instead of the extension functions (cublas**Ex()) helps in reducing the binary size when linking to static cuBLAS Library.

Also refer to: sgemm.

For more information about the numerical behavior of some GEMM algorithms, refer to the GEMM Algorithms Numerical Behavior section.

2.8.13. cublasGemmBatchedEx()

cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle,
                            cublasOperation_t transa,
                            cublasOperation_t transb,
                            int m,
                            int n,
                            int k,
                            const void    *alpha,
                            const void     *const Aarray[],
                            cudaDataType_t Atype,
                            int lda,
                            const void     *const Barray[],
                            cudaDataType_t Btype,
                            int ldb,
                            const void    *beta,
                            void           *const Carray[],
                            cudaDataType_t Ctype,
                            int ldc,
                            int batchCount,
                            cublasComputeType_t computeType,
                            cublasGemmAlgo_t algo)

#if defined(__cplusplus)
cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle,
                            cublasOperation_t transa,
                            cublasOperation_t transb,
                            int m,
                            int n,
                            int k,
                            const void     *alpha,
                            const void     *const Aarray[],
                            cudaDataType   Atype,
                            int lda,
                            const void     *const Barray[],
                            cudaDataType   Btype,
                            int ldb,
                            const void     *beta,
                            void           *const Carray[],
                            cudaDataType   Ctype,
                            int ldc,
                            int batchCount,
                            cudaDataType   computeType,
                            cublasGemmAlgo_t algo)
#endif

This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemmBatched that performs the matrix-matrix multiplication of a batch of matrices and allows the user to individually specify the data types for each of the A, B and C matrix arrays, the precision of computation and the GEMM algorithm to be run. Like cublas<t>gemmBatched, the batch is considered to be “uniform”, i.e. all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. The address of the input matrices and the output matrix of each instance of the batch are read from arrays of pointers passed to the function by the caller. Supported combinations of arguments are listed further down in this section.

Note

The second variant of cublasGemmBatchedEx() function is provided for backward compatibility with C++ applications code, where the computeType parameter is of cudaDataType instead of cublasComputeType_t. C applications would still compile with the updated function signature.

$C\lbrack i\rbrack = \alpha\text{op}(A\lbrack i\rbrack)\text{op}(B\lbrack i\rbrack) + \beta C\lbrack i\rbrack,\text{ for i } \in \lbrack 0,batchCount - 1\rbrack$

where $\alpha$ and $\beta$ are scalars, and $A$ , $B$ and $C$ are arrays of pointers to matrices stored in column-major format with dimensions $\text{op}(A\lbrack i\rbrack)$ $m \times k$ , $\text{op}(B\lbrack i\rbrack)$ $k \times n$ and $C\lbrack i\rbrack$ $m \times n$ , respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

and $\text{op}(B\lbrack i\rbrack)$ is defined similarly for matrix $B\lbrack i\rbrack$ .
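
To make the batched update concrete, the following is a minimal host-side reference sketch of the formula above for real single-precision data. It is not part of cuBLAS: the names `Op`, `op_at` and `ref_gemm_batched` are hypothetical, the scalars are passed by value, and only the non-transposed and transposed cases are covered. It can serve as a small oracle when validating results obtained from the GPU.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side reference, not a cuBLAS API.
enum class Op { N, T };

// Element (r, c) of op(M) for a column-major matrix M with leading dimension ld.
static float op_at(const float* M, int ld, Op op, int r, int c) {
    return op == Op::N ? M[r + static_cast<long long>(c) * ld]
                       : M[c + static_cast<long long>(r) * ld];
}

// C[i] = alpha * op(A[i]) * op(B[i]) + beta * C[i] for i in [0, batchCount).
// All instances share m, n, k, the leading dimensions and the transpositions.
void ref_gemm_batched(Op transa, Op transb, int m, int n, int k, float alpha,
                      const float* const Aarray[], int lda,
                      const float* const Barray[], int ldb, float beta,
                      float* const Carray[], int ldc, int batchCount) {
    for (int i = 0; i < batchCount; ++i) {
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p) {
                    acc += op_at(Aarray[i], lda, transa, row, p) *
                           op_at(Barray[i], ldb, transb, p, col);
                }
                float* c = &Carray[i][row + static_cast<long long>(col) * ldc];
                // When beta == 0 the old value of C[i] is never read.
                *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
            }
        }
    }
}
```

Because each instance writes only through its own `Carray[i]`, the loop over `i` is order-independent as long as the `C[i]` matrices do not overlap.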

Note

$C\lbrack i\rbrack$ matrices must not overlap, i.e. the individual gemm operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemm in different CUDA streams, rather than use this API.

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
transa		input	Operation op(`A[i]`) that is non- or (conj.) transpose.
transb		input	Operation op(`B[i]`) that is non- or (conj.) transpose.
m		input	Number of rows of matrix op(`A[i]`) and `C[i]`.
n		input	Number of columns of matrix op(`B[i]`) and `C[i]`.
k		input	Number of columns of op(`A[i]`) and rows of op(`B[i]`).
alpha	host or device	input	Scaling factor for A*B of the type that corresponds to the computeType and Ctype, see the table below for details.
Aarray	device	input	Array of pointers to <Atype> array, with each array of dim. `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
Atype		input	Enumerant specifying the datatype of `Aarray`.
lda		input	Leading dimension of two-dimensional array used to store the matrix `A[i]`.
Barray	device	input	Array of pointers to <Btype> array, with each array of dim. `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
Btype		input	Enumerant specifying the datatype of `Barray`.
ldb		input	Leading dimension of two-dimensional array used to store matrix `B[i]`.
beta	host or device	input	Scaling factor for C of the type that corresponds to the computeType and Ctype, see the table below for details. If `beta==0`, `C[i]` does not have to be a valid input.
Carray	device	in/out	Array of pointers to <Ctype> array. It has dimensions `ldc x n` with `ldc>=max(1,m)`. Matrices `C[i]` should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details.
Ctype		input	Enumerant specifying the datatype of `Carray`.
ldc		input	Leading dimension of a two-dimensional array used to store each matrix `C[i]`.
batchCount		input	Number of pointers contained in Aarray, Barray and Carray.
computeType		input	Enumerant specifying the computation type.
algo		input	Enumerant specifying the algorithm. See cublasGemmAlgo_t.

cublasGemmBatchedEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_8I`	`CUDA_R_32F`
		`CUDA_R_16BF`	`CUDA_R_32F`
		`CUDA_R_16F`	`CUDA_R_32F`
		`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_8I`	`CUDA_C_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

If Atype is CUDA_R_16F or CUDA_R_16BF, or computeType is any of the FAST options, or when math mode or algo enable fast math modes, pointers (not the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally all pointers are aligned to at least 16 Bytes. Otherwise it is recommended that they meet the following rule:

if k%8==0 then ensure intptr_t(ptr) % 16 == 0,
if k%2==0 then ensure intptr_t(ptr) % 4 == 0.
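
The rule above can be checked on the host before the pointer array is uploaded, since only the numeric value of each device pointer is inspected. The sketch below is one possible reading of the rule, with hypothetical helper names: a 16-byte aligned pointer always passes, otherwise the first matching condition on `k` decides, and an odd `k` is treated as having no stated requirement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a cuBLAS API. Returns true when ptr satisfies the
// recommended alignment for the given k (assumed reading: first match wins).
bool meets_recommended_alignment(const void* ptr, int k) {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    if (addr % 16 == 0) return true;       // ideal case: 16-byte aligned
    if (k % 8 == 0) return false;          // rule asks for 16 bytes here
    if (k % 2 == 0) return addr % 4 == 0;  // rule asks for 4 bytes here
    return true;                           // no requirement stated for odd k
}

// Applies the check to every matrix pointer of a batch. ptrs is a host copy
// of the pointer array; the pointers themselves are not dereferenced.
bool all_meet_recommended_alignment(const void* const ptrs[], int count, int k) {
    for (int i = 0; i < count; ++i) {
        if (!meets_recommended_alignment(ptrs[i], k)) return false;
    }
    return true;
}
```

Running the check on the host copies of `Aarray`, `Barray` and `Carray` gives an early diagnostic instead of a misaligned memory access inside the kernel.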

Note

Compute types CUBLAS_COMPUTE_32I and CUBLAS_COMPUTE_32I_PEDANTIC are only supported with all pointers A[i], B[i] being 4-byte aligned and lda, ldb being multiples of 4. For a better performance, it is also recommended that IMMA kernels requirements for the regular data ordering listed here are met.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasGemmBatchedEx() is only supported for GPU with architecture capabilities equal to or greater than 5.0.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype`, `Btype` and `Ctype` or the algorithm, `algo`, is not supported.
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `k` < 0 or if `transa` or `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`) or `Atype` or `Btype` or `Ctype` or `algo` or `computeType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

Also refer to: sgemm.

2.8.14. cublasGemmStridedBatchedEx()

cublasStatus_t cublasGemmStridedBatchedEx(cublasHandle_t handle,
                            cublasOperation_t transa,
                            cublasOperation_t transb,
                            int m,
                            int n,
                            int k,
                            const void    *alpha,
                            const void     *A,
                            cudaDataType_t Atype,
                            int lda,
                            long long int strideA,
                            const void     *B,
                            cudaDataType_t Btype,
                            int ldb,
                            long long int strideB,
                            const void    *beta,
                            void           *C,
                            cudaDataType_t Ctype,
                            int ldc,
                            long long int strideC,
                            int batchCount,
                            cublasComputeType_t computeType,
                            cublasGemmAlgo_t algo)

#if defined(__cplusplus)
cublasStatus_t cublasGemmStridedBatchedEx(cublasHandle_t handle,
                            cublasOperation_t transa,
                            cublasOperation_t transb,
                            int m,
                            int n,
                            int k,
                            const void    *alpha,
                            const void     *A,
                            cudaDataType Atype,
                            int lda,
                            long long int strideA,
                            const void     *B,
                            cudaDataType Btype,
                            int ldb,
                            long long int strideB,
                            const void    *beta,
                            void           *C,
                            cudaDataType Ctype,
                            int ldc,
                            long long int strideC,
                            int batchCount,
                            cudaDataType computeType,
                            cublasGemmAlgo_t algo)
#endif

This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemmStridedBatched that performs the matrix-matrix multiplication of a batch of matrices and allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run. Like cublas<t>gemmStridedBatched, the batch is considered to be “uniform”, i.e. all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. Input matrices A, B and output matrix C for each instance of the batch are located at fixed offsets in number of elements from their locations in the previous instance. Pointers to A, B and C matrices for the first instance are passed to the function by the user along with the offsets in number of elements - strideA, strideB and strideC that determine the locations of input and output matrices in future instances.

Note

The second variant of cublasGemmStridedBatchedEx() function is provided for backward compatibility with C++ applications code, where the computeType parameter is of cudaDataType_t instead of cublasComputeType_t. C applications would still compile with the updated function signature.

$C + i*{strideC} = \alpha\text{op}(A + i*{strideA})\text{op}(B + i*{strideB}) + \beta(C + i*{strideC}),\text{ for i } \in \lbrack 0,batchCount - 1\rbrack$

where $\alpha$ and $\beta$ are scalars, and $A$ , $B$ and $C$ are pointers to the matrices of the first batch instance, stored in column-major format, with dimensions $\text{op}(A\lbrack i\rbrack)$ $m \times k$ , $\text{op}(B\lbrack i\rbrack)$ $k \times n$ and $C\lbrack i\rbrack$ $m \times n$ , respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

and $\text{op}(B\lbrack i\rbrack)$ is defined similarly for matrix $B\lbrack i\rbrack$ .

Note

$C\lbrack i\rbrack$ matrices must not overlap, i.e. the individual gemm operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemm in different CUDA streams, rather than use this API.

Note

In the table below, we use A[i], B[i], C[i] as notation for A, B and C matrices in the ith instance of the batch, implicitly assuming they are respectively offsets in number of elements strideA, strideB, strideC away from A[i-1], B[i-1], C[i-1]. The unit for the offset is number of elements and must not be zero.
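
The addressing convention can be illustrated with a short host-side sketch. The helper names are hypothetical and none of this is a cuBLAS API. `to_pointer_array` expands a strided layout into the per-instance pointers `A[i] = A + i * strideA`, which is the form a pointer-array batched call would consume (for such a call the array itself would additionally have to be copied to device memory). `c_instances_disjoint` is a conservative sufficient test, assumed here for illustration, that the `C[i]` footprints cannot overlap.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: per-instance base pointers for a stride counted in
// elements (not bytes). The pointers are computed, never dereferenced.
template <typename T>
std::vector<T*> to_pointer_array(T* base, long long stride, int batchCount) {
    std::vector<T*> ptrs;
    ptrs.reserve(batchCount > 0 ? batchCount : 0);
    for (int i = 0; i < batchCount; ++i) {
        ptrs.push_back(base + static_cast<long long>(i) * stride);
    }
    return ptrs;
}

// Conservative check: an m x n column-major matrix with leading dimension ldc
// spans ldc * (n - 1) + m elements, so a stride at least that large keeps
// consecutive C[i] apart. Interleaved layouts may be valid but fail this test.
bool c_instances_disjoint(int m, int n, int ldc, long long strideC) {
    if (m == 0 || n == 0) return true;
    const long long footprint = static_cast<long long>(ldc) * (n - 1) + m;
    const long long s = strideC < 0 ? -strideC : strideC;
    return s >= footprint;
}
```

For densely packed matrices the usual choice is `strideC = ldc * n`, which always passes the check.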

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
transa		input	Operation op(`A[i]`) that is non- or (conj.) transpose.
transb		input	Operation op(`B[i]`) that is non- or (conj.) transpose.
m		input	Number of rows of matrix op(`A[i]`) and `C[i]`.
n		input	Number of columns of matrix op(`B[i]`) and `C[i]`.
k		input	Number of columns of op(`A[i]`) and rows of op(`B[i]`).
alpha	host or device	input	Scaling factor for A*B of the type that corresponds to the computeType and Ctype, see the table below for details.
A	device	input	Pointer to <Atype> matrix, A, corresponds to the first instance of the batch, with dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
Atype		input	Enumerant specifying the datatype of `A`.
lda		input	Leading dimension of two-dimensional array used to store the matrix `A[i]`.
strideA		input	Value of type long long int that gives the offset in number of elements between `A[i]` and `A[i+1]`.
B	device	input	Pointer to <Btype> matrix, B, corresponds to the first instance of the batch, with dimensions `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
Btype		input	Enumerant specifying the datatype of `B`.
ldb		input	Leading dimension of two-dimensional array used to store matrix `B[i]`.
strideB		input	Value of type long long int that gives the offset in number of elements between `B[i]` and `B[i+1]`.
beta	host or device	input	Scaling factor for C of the type that corresponds to the computeType and Ctype, see the table below for details. If `beta==0`, `C[i]` does not have to be a valid input.
C	device	in/out	Pointer to <Ctype> matrix, C, corresponds to the first instance of the batch, with dimensions `ldc x n` with `ldc>=max(1,m)`. Matrices `C[i]` should not overlap; otherwise, undefined behavior is expected.
Ctype		input	Enumerant specifying the datatype of `C`.
ldc		input	Leading dimension of a two-dimensional array used to store each matrix `C[i]`.
strideC		input	Value of type long long int that gives the offset in number of elements between `C[i]` and `C[i+1]`.
batchCount		input	Number of GEMMs to perform in the batch.
computeType		input	Enumerant specifying the computation type.
algo		input	Enumerant specifying the algorithm. See cublasGemmAlgo_t.

cublasGemmStridedBatchedEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_8I`	`CUDA_R_32F`
		`CUDA_R_16BF`	`CUDA_R_32F`
		`CUDA_R_16F`	`CUDA_R_32F`
		`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_8I`	`CUDA_C_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

Note

Compute types CUBLAS_COMPUTE_32I and CUBLAS_COMPUTE_32I_PEDANTIC are only supported with all pointers A[i], B[i] being 4-byte aligned and lda, ldb being multiples of 4. For a better performance, it is also recommended that IMMA kernels requirements for the regular data ordering listed here are met.
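
For the strided layout, the alignment requirement on every `A[i]` and `B[i]` also constrains the strides. A minimal pre-flight sketch follows, assuming `CUDA_R_8I` operands (one byte per element, so element offsets equal byte offsets); the function name is hypothetical and not part of cuBLAS.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check for one int8 operand (A or B) of a strided batch used
// with CUBLAS_COMPUTE_32I or CUBLAS_COMPUTE_32I_PEDANTIC.
bool int8_operand_ok(const void* base, int ld, long long stride, int batchCount) {
    const bool base_aligned = reinterpret_cast<std::uintptr_t>(base) % 4 == 0;
    const bool ld_ok = ld % 4 == 0;
    // Instance i starts i * stride bytes past base, so all instances stay
    // 4-byte aligned only if the stride is itself a multiple of 4.
    const bool stride_ok = batchCount <= 1 || stride % 4 == 0;
    return base_aligned && ld_ok && stride_ok;
}
```

Calling it once for `A` with `lda`, `strideA` and once for `B` with `ldb`, `strideB` covers both inputs.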

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasGemmStridedBatchedEx() is only supported for GPU with architecture capabilities equal to or greater than 5.0.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype`, `Btype` and `Ctype` or the algorithm, `algo`, is not supported.
`CUBLAS_STATUS_INVALID_VALUE`	If `m` < 0 or `n` < 0 or `k` < 0 or if `transa` or `transb` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `m`) if `transa` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldb` < max(1, `k`) if `transb` == `CUBLAS_OP_N` and `ldb` < max(1, `n`) otherwise or if `ldc` < max(1, `m`) or `Atype` or `Btype` or `Ctype` or `algo` or `computeType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

Also refer to: sgemm.

2.8.15. cublasCsyrkEx()

cublasStatus_t cublasCsyrkEx(cublasHandle_t handle,
                             cublasFillMode_t uplo,
                             cublasOperation_t trans,
                             int n,
                             int k,
                             const float     *alpha,
                             const void      *A,
                             cudaDataType    Atype,
                             int lda,
                             const float    *beta,
                             cuComplex      *C,
                             cudaDataType   Ctype,
                             int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCsyrk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex.

This function performs the symmetric rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$ . Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$
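
A minimal host-side reference sketch of this update is given here for illustration. It is not a cuBLAS API: the name `ref_csyrk`, the `bool` flags standing in for `uplo` and `trans`, and passing the scalars by value as complex numbers are all assumptions of the sketch. Note that the product uses a plain transpose with no conjugation, and that only the selected triangle of `C` is written.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cf = std::complex<float>;

// Hypothetical reference for C = alpha * op(A) * op(A)^T + beta * C.
// upper == true updates the upper triangle only; the other one is untouched.
void ref_csyrk(bool upper, bool transpose, int n, int k, cf alpha,
               const cf* A, int lda, cf beta, cf* C, int ldc) {
    // Element (r, p) of op(A), which is n x k.
    auto opA = [&](int r, int p) -> cf {
        return transpose ? A[p + static_cast<long long>(r) * lda]
                         : A[r + static_cast<long long>(p) * lda];
    };
    for (int col = 0; col < n; ++col) {
        const int row_begin = upper ? 0 : col;
        const int row_end = upper ? col + 1 : n;
        for (int row = row_begin; row < row_end; ++row) {
            cf acc(0.0f, 0.0f);
            for (int p = 0; p < k; ++p) acc += opA(row, p) * opA(col, p);
            cf& c = C[row + static_cast<long long>(col) * ldc];
            // When beta == 0 the old value of C is never read.
            c = (beta == cf(0.0f)) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}
```

With the 2 x 1 input `A = (i, 1)^T`, the (0,0) entry of the result is `i * i = -1`, which is the visible difference from a Hermitian update, where it would be `+1`.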

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
uplo		input	Indicates if matrix `C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	Operation op(`A`) that is non- or transpose.
n		input	Number of rows of matrix op(`A`) and `C`.
k		input	Number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	Enumerant specifying the datatype of matrix `A`.
lda		input	Leading dimension of two-dimensional array used to store matrix A.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
Ctype		input	Enumerant specifying the datatype of matrix `C`.
ldc		input	Leading dimension of two-dimensional array used to store matrix `C`.

The matrix types combinations supported for cublasCsyrkEx() are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `uplo` != `CUBLAS_FILL_MODE_UPPER`, `CUBLAS_FILL_MODE_LOWER` or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or `Atype` or `Ctype` is not supported
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype` and `Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to:

csyrk

2.8.16. cublasCsyrk3mEx()

cublasStatus_t cublasCsyrk3mEx(cublasHandle_t handle,
                               cublasFillMode_t uplo,
                               cublasOperation_t trans,
                               int n,
                               int k,
                               const float     *alpha,
                               const void      *A,
                               cudaDataType    Atype,
                               int lda,
                               const float    *beta,
                               cuComplex      *C,
                               cudaDataType   Ctype,
                               int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCsyrk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex. This routine is implemented using the Gauss complexity reduction algorithm, which can lead to a performance increase of up to 25%.
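
The idea behind the Gauss complexity reduction can be shown on a single complex product: three real multiplications replace the usual four, at the cost of a few extra additions. The scalar sketch below only illustrates the identity; the function name is hypothetical, and how the library applies the technique internally is not described here.

```cpp
#include <cassert>
#include <complex>

// (a + bi)(c + di) using three real multiplications instead of four.
std::complex<float> gauss_mul(std::complex<float> x, std::complex<float> y) {
    const float a = x.real(), b = x.imag();
    const float c = y.real(), d = y.imag();
    const float t1 = c * (a + b);  // ac + bc
    const float t2 = a * (d - c);  // ad - ac
    const float t3 = b * (c + d);  // bc + bd
    return {t1 - t3, t1 + t2};     // (ac - bd) + (ad + bc)i
}
```

Going from four multiplications to three removes a quarter of them, which is consistent with the "up to 25%" figure quoted above. In floating point, the rearranged sums round differently from the four-multiplication formula, so results may differ in the last bits.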

This function performs the symmetric rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$ . Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
uplo		input	Indicates if matrix `C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	Operation op(`A`) that is non- or transpose.
n		input	Number of rows of matrix op(`A`) and `C`.
k		input	Number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	Enumerant specifying the datatype of matrix `A`.
lda		input	Leading dimension of two-dimensional array used to store matrix A.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
Ctype		input	Enumerant specifying the datatype of matrix `C`.
ldc		input	Leading dimension of two-dimensional array used to store matrix `C`.

The matrix types combinations supported for cublasCsyrk3mEx() are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `uplo` != `CUBLAS_FILL_MODE_UPPER`, `CUBLAS_FILL_MODE_LOWER` or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or `Atype` or `Ctype` is not supported
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype` and `Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to:

csyrk

2.8.17. cublasCherkEx()

cublasStatus_t cublasCherkEx(cublasHandle_t handle,
                           cublasFillMode_t uplo,
                           cublasOperation_t trans,
                           int n,
                           int k,
                           const float     *alpha,
                           const void      *A,
                           cudaDataType    Atype,
                           int lda,
                           const float    *beta,
                           cuComplex      *C,
                           cudaDataType   Ctype,
                           int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCherk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex.

This function performs the Hermitian rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$ . Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$
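
The Hermitian update differs from the symmetric one in two ways: the second factor is conjugated, and the diagonal of `C` is kept real. A minimal host-side sketch follows, with the hypothetical name `ref_cherk` and `bool` flags standing in for `uplo` and `trans`; the scalars are real and passed by value.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cf = std::complex<float>;

// Hypothetical reference for C = alpha * op(A) * op(A)^H + beta * C.
// upper == true updates the upper triangle only; the other one is untouched.
void ref_cherk(bool upper, bool conj_transpose, int n, int k, float alpha,
               const cf* A, int lda, float beta, cf* C, int ldc) {
    // Element (r, p) of op(A), which is n x k; op(A) = A^H conjugates.
    auto opA = [&](int r, int p) -> cf {
        return conj_transpose ? std::conj(A[p + static_cast<long long>(r) * lda])
                              : A[r + static_cast<long long>(p) * lda];
    };
    for (int col = 0; col < n; ++col) {
        const int row_begin = upper ? 0 : col;
        const int row_end = upper ? col + 1 : n;
        for (int row = row_begin; row < row_end; ++row) {
            cf acc(0.0f, 0.0f);
            for (int p = 0; p < k; ++p) acc += opA(row, p) * std::conj(opA(col, p));
            cf& c = C[row + static_cast<long long>(col) * ldc];
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
            // Diagonal entries of a Hermitian matrix are real.
            if (row == col) c = cf(c.real(), 0.0f);
        }
    }
}
```

For the 2 x 1 input `A = (i, 1)^T`, the (0,0) entry is `i * conj(i) = 1`, a real value, as expected for a Hermitian result.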

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
uplo		input	Indicates if matrix `C` lower or upper part is stored, the other Hermitian part is not referenced.
trans		input	Operation op(`A`) that is non- or (conj.) transpose.
n		input	Number of rows of matrix op(`A`) and `C`.
k		input	Number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	Enumerant specifying the datatype of matrix `A`.
lda		input	Leading dimension of two-dimensional array used to store matrix `A`.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
Ctype		input	Enumerant specifying the datatype of matrix `C`.
ldc		input	Leading dimension of two-dimensional array used to store matrix `C`.

The matrix types combinations supported for cublasCherkEx() are listed in the following table:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0 or if `uplo` != `CUBLAS_FILL_MODE_UPPER`, `CUBLAS_FILL_MODE_LOWER` or if `trans` != `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T` or if `lda` < max(1, `n`) if `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise or if `ldc` < max(1, `n`) or `Atype` or `Ctype` is not supported
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype` and `Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to:

cherk

2.8.18. cublasCherk3mEx()

cublasStatus_t cublasCherk3mEx(cublasHandle_t handle,
                           cublasFillMode_t uplo,
                           cublasOperation_t trans,
                           int n,
                           int k,
                           const float     *alpha,
                           const void      *A,
                           cudaDataType    Atype,
                           int lda,
                           const float    *beta,
                           cuComplex      *C,
                           cudaDataType   Ctype,
                           int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCherk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex. This routine is implemented using the Gauss complexity reduction algorithm, which can lead to a performance increase of up to 25%.

This function performs the Hermitian rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$ . Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0.

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
uplo		input	Indicates if matrix `C` lower or upper part is stored, the other Hermitian part is not referenced.
trans		input	Operation op(`A`) that is non- or (conj.) transpose.
n		input	Number of rows of matrix op(`A`) and `C`.
k		input	Number of columns of matrix op(`A`).
alpha	host or device	input	<type> scalar used for multiplication.
A	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
Atype		input	Enumerant specifying the datatype of matrix `A`.
lda		input	Leading dimension of two-dimensional array used to store matrix `A`.
beta	host or device	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
Ctype		input	Enumerant specifying the datatype of matrix `C`.
ldc		input	Leading dimension of two-dimensional array used to store matrix `C`.

The matrix types combinations supported for cublasCherk3mEx() are listed in the following table:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n` < 0 or `k` < 0, or if `uplo` is not `CUBLAS_FILL_MODE_UPPER` or `CUBLAS_FILL_MODE_LOWER`, or if `trans` is not `CUBLAS_OP_N`, `CUBLAS_OP_C`, or `CUBLAS_OP_T`, or if `lda` < max(1, `n`) when `trans` == `CUBLAS_OP_N` and `lda` < max(1, `k`) otherwise, or if `ldc` < max(1, `n`), or if `Atype` or `Ctype` is not supported.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype` and `Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to:

cherk

2.8.19. cublasNrm2Ex()

cublasStatus_t  cublasNrm2Ex( cublasHandle_t handle,
                              int n,
                              const void *x,
                              cudaDataType xType,
                              int incx,
                              void *result,
                              cudaDataType resultType,
                              cudaDataType executionType)

This function supports the 64-bit Integer Interface.

This function is an API generalization of the routine cublas<t>nrm2 where input data, output data and compute type can be specified independently.

This function computes the Euclidean norm of the vector x. The code uses a multiphase model of accumulation to avoid intermediate underflow and overflow, with the result being equivalent to $\sqrt{\sum_{i = 1}^{n}\left( {\mathbf{x}\lbrack j\rbrack \times \mathbf{x}\lbrack j\rbrack} \right)}$ where $j = 1 + \left( {i - 1} \right)*\text{incx}$ in exact arithmetic. Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.
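
To illustrate why a naive sum of squares is avoided, the sketch below uses the classic scaled sum-of-squares technique on the CPU with 0-based indexing. This is an assumption made for illustration and is not the multiphase algorithm that cuBLAS actually uses; the name `ref_scaled_ssq` is hypothetical. The norm is recovered as `scale * sqrt(ssq)`.

```c
#include <stddef.h>

/* Accumulate a scaled sum of squares over a strided vector (0-based).
 * On return, the Euclidean norm equals (*scale) * sqrt(*ssq). */
void ref_scaled_ssq(int n, const double *x, int incx,
                    double *scale, double *ssq)
{
    *scale = 0.0;   /* largest |x[j]| seen so far */
    *ssq = 1.0;     /* sum of (x[j] / scale)^2 */
    if (n <= 0 || incx <= 0) return;   /* norm is 0.0 in this case */

    for (int i = 0; i < n; i++) {
        double v = x[(size_t)i * (size_t)incx];
        double a = v < 0.0 ? -v : v;
        if (a == 0.0) continue;
        if (*scale < a) {
            /* new maximum: rescale the running sum to the new scale */
            double r = *scale / a;
            *ssq = 1.0 + *ssq * r * r;
            *scale = a;
        } else {
            double r = a / *scale;
            *ssq += r * r;
        }
    }
}
```

Every squared quantity is a ratio no larger than 1, so values such as `1e200` do not overflow when squared, while the final product still gives the same norm in exact arithmetic.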

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of elements in the vector `x`.
x	device	input	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
result	host or device	output	the resulting norm, which is `0.0` if `n<=0` or `incx<=0`.
resultType		input	enumerant specifying the datatype of the `result`.
executionType		input	enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasNrm2Ex() are listed below:

x	result	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_C_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_64F`	`CUDA_R_64F`	`CUDA_R_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	the reduction buffer could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `xType`, `resultType` and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	If `xType` or `resultType` or `executionType` is not supported or `result` == NULL

For references please refer to:

snrm2, dnrm2, scnrm2, dznrm2

2.8.20. cublasAxpyEx()

cublasStatus_t cublasAxpyEx (cublasHandle_t handle,
                             int n,
                             const void *alpha,
                             cudaDataType alphaType,
                             const void *x,
                             cudaDataType xType,
                             int incx,
                             void *y,
                             cudaDataType yType,
                             int incy,
                             cudaDataType executiontype);

This function supports the 64-bit Integer Interface.

This function is an API generalization of the routine cublas<t>axpy where input data, output data and compute type can be specified independently.

This function multiplies the vector x by the scalar $\alpha$ and adds it to the vector y, overwriting the latter vector with the result. Hence, the performed operation is $\mathbf{y}\lbrack j\rbrack = \alpha \times \mathbf{x}\lbrack k\rbrack + \mathbf{y}\lbrack j\rbrack$ for $i = 1,\ldots,n$ , $k = 1 + \left( {i - 1} \right)*\text{incx}$ and $j = 1 + \left( {i - 1} \right)*\text{incy}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
n		input	Number of elements in the vector `x` and `y`.
alpha	host or device	input	<type> scalar used for multiplication.
alphaType		input	Enumerant specifying the datatype of scalar `alpha`.
x	device	input	<type> vector with `n` elements.
xType		input	Enumerant specifying the datatype of vector `x`.
incx		input	Stride between consecutive elements of `x`.
y	device	in/out	<type> vector with `n` elements.
yType		input	Enumerant specifying the datatype of vector `y`.
incy		input	Stride between consecutive elements of `y`.
executionType		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasAxpyEx() are listed in the following table:

alpha	x	y	execution
`CUDA_R_32F`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `xType`, `yType`, and `executionType` is not supported.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.
`CUBLAS_STATUS_INVALID_VALUE`	`alphaType` or `xType` or `yType` or `executionType` is not supported.

For references please refer to:

saxpy, daxpy, caxpy, zaxpy

2.8.21. cublasDotEx()

cublasStatus_t cublasDotEx (cublasHandle_t handle,
                            int n,
                            const void *x,
                            cudaDataType xType,
                            int incx,
                            const void *y,
                            cudaDataType yType,
                            int incy,
                            void *result,
                            cudaDataType resultType,
                            cudaDataType executionType);

cublasStatus_t cublasDotcEx (cublasHandle_t handle,
                             int n,
                             const void *x,
                             cudaDataType xType,
                             int incx,
                             const void *y,
                             cudaDataType yType,
                             int incy,
                             void *result,
                             cudaDataType resultType,
                             cudaDataType executionType);

These functions support the 64-bit Integer Interface.

These functions are an API generalization of the routines cublas<t>dot and cublas<t>dotc where input data, output data and compute type can be specified independently. Note: cublas<t>dotc is dot product conjugated, cublas<t>dotu is dot product unconjugated.

This function computes the dot product of vectors x and y. Hence, the result is $\sum_{i = 1}^{n}\left( {\mathbf{x}\lbrack k\rbrack \times \mathbf{y}\lbrack j\rbrack} \right)$ where $k = 1 + \left( {i - 1} \right)*\text{incx}$ and $j = 1 + \left( {i - 1} \right)*\text{incy}$ . Notice that in the first equation the conjugate of the element of vector x should be used if the function name ends in character ‘c’ and that the last two equations reflect 1-based indexing used for compatibility with Fortran.
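
The difference between the conjugated and unconjugated variants can be seen in a small CPU reference with 0-based indexing. This is only a sketch of the formula above: the name `ref_cdot` and the `conj_x` flag are hypothetical, and positive strides are assumed.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* CPU reference for a strided complex dot product (0-based indexing).
 * conj_x != 0 conjugates the elements of x (the "c" variant);
 * conj_x == 0 leaves them as-is (the "u" variant). */
float complex ref_cdot(int n, const float complex *x, int incx,
                       const float complex *y, int incy, int conj_x)
{
    assert(incx > 0 && incy > 0);   /* assumption of this sketch */
    float complex acc = 0.0f;
    for (int i = 0; i < n; i++) {   /* n <= 0 leaves the result at 0 */
        float complex xi = x[(size_t)i * (size_t)incx];
        float complex yi = y[(size_t)i * (size_t)incy];
        acc += (conj_x ? conjf(xi) : xi) * yi;
    }
    return acc;
}
```

For example, with `x = y = {i}` the unconjugated product is `i * i = -1`, while the conjugated product is `conj(i) * i = 1`.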

Param.	Memory	In/out	Meaning
handle		input	Handle to the cuBLAS library context.
n		input	Number of elements in the vectors `x` and `y`.
x	device	input	<type> vector with `n` elements.
xType		input	Enumerant specifying the datatype of vector `x`.
incx		input	Stride between consecutive elements of `x`.
y	device	input	<type> vector with `n` elements.
yType		input	Enumerant specifying the datatype of vector `y`.
incy		input	Stride between consecutive elements of `y`.
result	host or device	output	The resulting dot product, which is `0.0` if `n<=0`.
resultType		input	Enumerant specifying the datatype of the `result`.
executionType		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasDotEx() and cublasDotcEx() are listed below:

x	y	result	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed in the following table:

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `xType`, `yType`, `resultType` and `executionType` is not supported.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.
`CUBLAS_STATUS_INVALID_VALUE`	`xType` or `yType` or `resultType` or `executionType` is not supported.

For references please refer to:

sdot, ddot, cdotu, cdotc, zdotu, zdotc

2.8.22. cublasRotEx()

cublasStatus_t cublasRotEx(cublasHandle_t handle,
                           int n,
                           void *x,
                           cudaDataType xType,
                           int incx,
                           void *y,
                           cudaDataType yType,
                           int incy,
                           const void *c,  /* host or device pointer */
                           const void *s,
                           cudaDataType csType,
                           cudaDataType executiontype);

This function supports the 64-bit Integer Interface.

This function is an extension to the routine cublas<t>rot where input data, output data, cosine/sine type, and compute type can be specified independently.

This function applies the Givens rotation matrix (i.e., a rotation in the x,y plane counter-clockwise by the angle defined by cos(alpha)=c, sin(alpha)=s):

$G = \begin{pmatrix} c & s \\ {- s} & c \\ \end{pmatrix}$

to vectors x and y.

Hence, the result is $\mathbf{x}\lbrack k\rbrack = c \times \mathbf{x}\lbrack k\rbrack + s \times \mathbf{y}\lbrack j\rbrack$ and $\mathbf{y}\lbrack j\rbrack = - s \times \mathbf{x}\lbrack k\rbrack + c \times \mathbf{y}\lbrack j\rbrack$ where $k = 1 + \left( {i - 1} \right)*\text{incx}$ and $j = 1 + \left( {i - 1} \right)*\text{incy}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.
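
Both update formulas use the values of x and y from before the rotation, which is easy to get wrong when updating in place. A minimal CPU reference for real-valued vectors, with 0-based indexing and positive strides assumed, might look as follows (the name `ref_rot` is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference: apply G = [c s; -s c] to strided real vectors (0-based). */
void ref_rot(int n, float *x, int incx, float *y, int incy, float c, float s)
{
    assert(incx > 0 && incy > 0);   /* assumption of this sketch */
    for (int i = 0; i < n; i++) {
        float *xk = &x[(size_t)i * (size_t)incx];
        float *yj = &y[(size_t)i * (size_t)incy];
        float xv = *xk;             /* read both inputs before overwriting */
        float yv = *yj;
        *xk = c * xv + s * yv;
        *yj = -s * xv + c * yv;
    }
}
```

With `c = 0` and `s = 1`, the pair `(x, y) = (1, 0)` is mapped to `(0, -1)`.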

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of elements in the vectors `x` and `y`.
x	device	in/out	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
y	device	in/out	<type> vector with `n` elements.
yType		input	enumerant specifying the datatype of vector `y`.
incy		input	stride between consecutive elements of `y`.
c	host or device	input	cosine element of the rotation matrix.
s	host or device	input	sine element of the rotation matrix.
csType		input	enumerant specifying the datatype of `c` and `s`.
executionType		input	enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasRotEx() are listed below:

executionType	xType / yType	csType
`CUDA_R_32F`	`CUDA_R_16BF` `CUDA_R_16F` `CUDA_R_32F`	`CUDA_R_16BF` `CUDA_R_16F` `CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F` `CUDA_C_32F`	`CUDA_R_32F` `CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F` `CUDA_C_64F`	`CUDA_R_64F` `CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

srot, drot, crot, csrot, zrot, zdrot

2.8.23. cublasScalEx()

cublasStatus_t  cublasScalEx(cublasHandle_t handle,
                             int n,
                             const void *alpha,
                             cudaDataType alphaType,
                             void *x,
                             cudaDataType xType,
                             int incx,
                             cudaDataType executionType);

This function supports the 64-bit Integer Interface.

This function scales the vector x by the scalar $\alpha$ and overwrites it with the result. Hence, the performed operation is $\mathbf{x}\lbrack j\rbrack = \alpha \times \mathbf{x}\lbrack j\rbrack$ for $i = 1,\ldots,n$ and $j = 1 + \left( {i - 1} \right)*\text{incx}$ . Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLAS library context.
n		input	number of elements in the vector `x`.
alpha	host or device	input	<type> scalar used for multiplication.
alphaType		input	enumerant specifying the datatype of scalar `alpha`.
x	device	in/out	<type> vector with `n` elements.
xType		input	enumerant specifying the datatype of vector `x`.
incx		input	stride between consecutive elements of `x`.
executionType		input	enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasScalEx() are listed below:

alpha	x	execution
`CUDA_R_32F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_NOT_SUPPORTED`	the combination of the parameters `xType` and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	`alphaType` or `xType` or `executionType` is not supported

For references please refer to:

sscal, dscal, csscal, cscal, zdscal, zscal

3. Using the cuBLASLt API

3.1. General Description

The cuBLASLt library is a new lightweight library dedicated to GEneral Matrix-to-matrix Multiply (GEMM) operations with a new flexible API. This new library adds flexibility in matrix data layouts, input types, compute types, and also in choosing the algorithmic implementations and heuristics through parameter programmability.

Once a set of options for the intended GEMM operation is identified by the user, these options can be used repeatedly for different inputs. This is analogous to how cuFFT and FFTW first create a plan and then reuse it for FFTs of the same size and type with different input data.

Note

The cuBLASLt library does not guarantee support for all possible sizes and configurations; however, since CUDA 12.2 Update 2, the problem size limitations on m, n, and batch size have been largely resolved. The main focus of the library is to provide the most performant kernels, which might have some implied limitations. Some non-standard configurations may require a user to handle them manually, typically by decomposing the problem into smaller parts (see Problem Size Limitations).

3.1.1. Problem Size Limitations

There are inherent problem size limitations that are a result of limitations in CUDA grid dimensions. For example, many kernels do not support batch sizes greater than 65535 due to a limitation on the z dimension of a grid. There are similar restrictions on the m and n values for a given problem.

In cases where a problem cannot be run by a single kernel, cuBLASLt will attempt to decompose the problem into multiple sub-problems and solve it by running the kernel on each sub-problem.

There are some restrictions on cuBLASLt internal problem decomposition which are summarized below:

- Amax computations are not supported. This means that CUBLASLT_MATMUL_DESC_AMAX_D_POINTER and CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_AMAX_POINTER must be left unset (see cublasLtMatmulDescAttributes_t).
- All matrix layouts must have CUBLASLT_MATRIX_LAYOUT_ORDER set to CUBLASLT_ORDER_COL (see cublasLtOrder_t).
- cuBLASLt will not partition along the n dimension when CUBLASLT_MATMUL_DESC_EPILOGUE is set to CUBLASLT_EPILOGUE_DRELU_BGRAD or CUBLASLT_EPILOGUE_DGELU_BGRAD (see cublasLtEpilogue_t).

To overcome these limitations, users may want to partition the problem themselves, launch kernels for each sub-problem, and compute any necessary reductions to combine the results.
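
As a rough illustration of manual partitioning, the sketch below splits a batch count into sub-batches that each fit under a per-launch limit. The limit of 65535 is taken from the grid z-dimension example above; the helper names are hypothetical, and a real application would still need to offset the matrix pointers for each sub-batch and issue one matmul call per sub-batch.

```c
#include <assert.h>

#define MAX_BATCH_PER_LAUNCH 65535   /* example limit quoted above */

/* Hypothetical helper: number of launches needed to cover batch_count. */
int num_sub_batches(long long batch_count)
{
    assert(batch_count >= 0);
    return (int)((batch_count + MAX_BATCH_PER_LAUNCH - 1) / MAX_BATCH_PER_LAUNCH);
}

/* Hypothetical helper: first batch index and size of sub-batch i. */
void sub_batch_range(long long batch_count, int i,
                     long long *first, long long *count)
{
    assert(i >= 0 && i < num_sub_batches(batch_count));
    *first = (long long)i * MAX_BATCH_PER_LAUNCH;
    long long remaining = batch_count - *first;
    *count = remaining < MAX_BATCH_PER_LAUNCH ? remaining : MAX_BATCH_PER_LAUNCH;
}
```

Splitting along the batch dimension needs no reduction step, because each sub-batch writes its own output matrices; splitting along k would require summing partial results.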

3.1.2. Heuristics Cache

cuBLASLt uses heuristics to pick the most suitable matmul kernel for execution based on the problem sizes, GPU configuration, and other parameters. This requires performing some computations on the host CPU, which could take tens of microseconds. To overcome this overhead, it is recommended to query the heuristics once using cublasLtMatmulAlgoGetHeuristic() and then reuse the result for subsequent computations using cublasLtMatmul().

For the cases where querying heuristics once and then reusing them is not feasible, cuBLASLt implements a heuristics cache that maps matmul problems to kernels previously selected by heuristics. The heuristics cache uses an LRU-like eviction policy and is thread-safe.

The user can control the heuristics cache capacity with the CUBLASLT_HEURISTICS_CACHE_CAPACITY environment variable or with the cublasLtHeuristicsCacheSetCapacity() function, which has higher precedence. The capacity is measured in number of entries and might be rounded up to the nearest multiple of some factor for performance reasons. Each entry takes about 360 bytes, although this size is subject to change. The default capacity is 8192 entries.

Note

Setting capacity to zero disables the cache completely. This can be useful for workloads that do not have a steady state and for which cache operations may have higher overhead than regular heuristics computations.

Note

For performance reasons, the cache is not ideal, so it is sometimes necessary to set its capacity to 1.5x-2x the anticipated number of unique matmul problems to achieve a nearly perfect hit rate.
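
The sizing guidance above can be turned into a small calculation. The sketch below is only an estimate under the figures quoted in this section (about 360 bytes per entry, 1.5x-2x headroom); the helper names and the percentage-based headroom parameter are hypothetical.

```c
#include <stddef.h>

#define APPROX_ENTRY_BYTES 360   /* approximate figure from the text; may change */

/* Hypothetical sizing rule: headroom_percent between 150 and 200 corresponds
 * to the 1.5x-2x guidance. Returns a suggested number of cache entries. */
size_t suggested_cache_capacity(size_t unique_problems, unsigned headroom_percent)
{
    return (unique_problems * headroom_percent + 99) / 100;   /* round up */
}

/* Approximate host memory used by a cache with the given capacity. */
size_t approx_cache_bytes(size_t capacity)
{
    return capacity * APPROX_ENTRY_BYTES;
}
```

For instance, the default capacity of 8192 entries corresponds to roughly 2.9 MB of host memory. The resulting capacity could then be supplied through CUBLASLT_HEURISTICS_CACHE_CAPACITY or cublasLtHeuristicsCacheSetCapacity().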

3.1.3. cuBLASLt Logging

The cuBLASLt logging mechanism can be enabled by setting the following environment variables before launching the target application:

CUBLASLT_LOG_LEVEL=<level>, where <level> is one of the following levels:
- “0” - Off - logging is disabled (default)
- “1” - Error - only errors will be logged
- “2” - Trace - API calls that launch CUDA kernels will log their parameters and important information
- “3” - Hints - hints that can potentially improve the application’s performance
- “4” - Info - provides general information about the library execution, may contain details about heuristic status
- “5” - API Trace - API calls will log their parameters and important information
CUBLASLT_LOG_MASK=<mask>, where <mask> is a combination of the following flags:
- “0” - Off
- “1” - Error
- “2” - Trace
- “4” - Hints
- “8” - Info
- “16” - API Trace
For example, use CUBLASLT_LOG_MASK=5 to enable Error and Hints messages.
CUBLASLT_LOG_FILE=<file_name>, where <file_name> is a path to a logging file. The file name may contain %i, which will be replaced with the process ID. For example, <file_name>_%i.log.

If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
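
The mask values listed above are bit flags, so they combine with a bitwise OR. The following sketch shows the arithmetic behind the `CUBLASLT_LOG_MASK=5` example; the enumerator and function names are hypothetical and are not part of the cuBLASLt API.

```c
/* Hypothetical names for the CUBLASLT_LOG_MASK flag values listed above. */
enum {
    LOG_MASK_ERROR     = 1,
    LOG_MASK_TRACE     = 2,
    LOG_MASK_HINTS     = 4,
    LOG_MASK_INFO      = 8,
    LOG_MASK_API_TRACE = 16
};

/* Returns nonzero if `flag` is enabled in `mask`. */
int log_mask_has(int mask, int flag)
{
    return (mask & flag) != 0;
}
```

Here `LOG_MASK_ERROR | LOG_MASK_HINTS` evaluates to 5, which matches the example above, and enabling every category gives 31.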

Another option is to use the experimental cuBLASLt logging API. See:

3.1.4. 8-bit Floating Point Data Types (FP8) Usage

FP8 was first introduced with Ada and Hopper GPUs (compute capability 8.9 and above) and is designed to further accelerate matrix multiplications. There are two types of FP8 available:

- CUDA_R_8F_E4M3 is designed to be accurate at a smaller dynamic range than half precision. The E4 and M3 represent a 4-bit exponent and a 3-bit mantissa respectively. For more details, see __nv_fp8_e4m3.
- CUDA_R_8F_E5M2 is designed to be accurate at a similar dynamic range as half precision. The E5 and M2 represent a 5-bit exponent and a 2-bit mantissa respectively. For more details, see __nv_fp8_e5m2.

Note

Unless otherwise stated, FP8 refers to both CUDA_R_8F_E4M3 and CUDA_R_8F_E5M2.

In order to maintain accurate FP8 matrix multiplications, we define native compute FP8 matrix multiplication as follows:

\[D = scale_D \cdot (\alpha \cdot scale_A \cdot scale_B \cdot \text{op}(A) \text{op}(B) + \beta \cdot scale_C \cdot C)\]

where A, B, and C are input matrices, and scaleA, scaleB, scaleC, scaleD, alpha, and beta are input scalars. This differs from the other matrix multiplication routines because of the addition of a scaling factor for each matrix. The scaleA, scaleB, and scaleC factors are used for de-quantization, and scaleD is used for quantization. Note that all the scaling factors are applied multiplicatively. This means that sometimes it is necessary to use a scaling factor or its reciprocal depending on the context in which it is applied. For more information on FP8, see cublasLtMatmul() and cublasLtMatmulDescAttributes_t.

For FP8 matrix multiplications, epilogues and amaxD may be computed as follows:

\[\begin{split}D_{temp}, Aux_{temp} & = \mathop{Epilogue}(\alpha \cdot scale_A \cdot scale_B \cdot \text{op}(A) \text{op}(B) + \beta \cdot scale_C \cdot C) \\ amax_{D} & = \mathop{absmax}(D_{temp}) \\ amax_{Aux} & = \mathop{absmax}(Aux_{temp}) \\ D & = scale_D * D_{temp} \\ Aux & = scale_{Aux} * Aux_{temp} \\\end{split}\]

Here Aux is an auxiliary output of an epilogue function like GELU, scaleAux is an optional scaling factor that can be applied to Aux, and amaxAux is the maximum absolute value in Aux before scaling. For more information, see attributes CUBLASLT_MATMUL_DESC_AMAX_D_POINTER and CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_AMAX_POINTER in cublasLtMatmulDescAttributes_t.
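
The order of operations in the formulas above (de-quantize, combine, take the absolute maximum, then apply scaleD) can be sketched element-wise on the CPU. This is an illustration only: it assumes an identity epilogue, uses plain `float` values in place of FP8 storage, and the function name `scaled_outputs` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Scalar sketch of the scaled-matmul formulas with an identity epilogue.
 * acc[i] stands for one element of op(A)op(B). Returns amax_D, the largest
 * absolute value of D_temp, and writes the scaled outputs to d. */
float scaled_outputs(size_t count, const float *acc, const float *c,
                     float alpha, float beta,
                     float scale_a, float scale_b, float scale_c, float scale_d,
                     float *d)
{
    float amax_d = 0.0f;
    for (size_t i = 0; i < count; i++) {
        /* de-quantize the inputs, then combine them */
        float d_temp = alpha * scale_a * scale_b * acc[i]
                     + beta * scale_c * c[i];
        float mag = d_temp < 0.0f ? -d_temp : d_temp;
        if (mag > amax_d) amax_d = mag;   /* amax is taken before scale_D */
        d[i] = scale_d * d_temp;          /* quantization scaling */
    }
    return amax_d;
}
```

Because amax is computed before scaleD is applied, it can be used to choose a suitable scaleD for a later multiplication.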

3.1.5. Disabling CPU Instructions

As mentioned in the Heuristics Cache section, cuBLASLt heuristics perform some compute-intensive operations on the host CPU. To speed up these operations, the implementation detects CPU capabilities and may use special instructions, such as Advanced Vector Extensions (AVX) on x86-64 CPUs. However, in some rare cases this might not be desirable. For instance, using advanced instructions may result in the CPU running at a lower frequency, which would affect the performance of other host code.

The user can optionally instruct the cuBLASLt library to not use some CPU instructions with the CUBLASLT_DISABLE_CPU_INSTRUCTIONS_MASK environment variable or with the cublasLtDisableCpuInstructionsSetMask() function, which has higher precedence. The default mask is 0, meaning that there are no restrictions.

Please check cublasLtDisableCpuInstructionsSetMask() for more information.

3.1.6. Atomics Synchronization

Atomics synchronization allows optimizing matmul workloads by enabling cublasLtMatmul() to have a producer or consumer relationship with another concurrently running kernel. This allows overlapping computation and communication with a finer granularity. Conceptually, matmul is provided with an array containing 32-bit integer counters, and then:

- In the consumer mode, either matrix A is partitioned into chunks by rows, or matrix B is partitioned into chunks by columns [1]. A chunk can be read from memory and used in computations only when the corresponding atomic counter reaches a value of 0. The producer should execute a memory fence to ensure that the written value is visible to the concurrently running matmul kernel [2].
- In the producer mode, the output matrix C (or D in the out-of-place mode) is partitioned by rows or columns, and after a chunk is computed, the corresponding atomic counter is set to 0. Each counter must be initialized to 1 before the matmul kernel runs.

[1] The current implementation allows partitioning either the rows or the columns of the matrices, but not both. Batched cases are not supported.
[2] One possible implementation of a memory fence is cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope::thread_scope_device) (see cuda::atomic_thread_fence() for more details).

The array of counters is passed to matmuls via the CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_IN_COUNTERS_POINTER and CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_OUT_COUNTERS_POINTER compute descriptor attributes for the consumer and producer modes respectively [3]. The arrays must have a sufficient number of elements for all the chunks.

[3] The current implementation allows enabling either the producer or the consumer mode, but not both. Matmul will return an error if both the input and output counter pointers are set to non-NULL values.

The number of chunks is controlled by CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_NUM_CHUNKS_D_ROWS and CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_NUM_CHUNKS_D_COLS compute descriptor attributes. Both of these attributes must be set to a value greater than zero for the feature to be enabled. For the column-major layout, the number of chunks must satisfy:

\[\begin{split}0 \leq \text{$\mathrm{NUM\_CHUNKS\_ROWS}$} \leq & \mathop{\text{floor}}\left( \frac{\text{M}}{\text{$\mathrm{TILE\_SIZE\_M}$} * \text{$\mathrm{CLUSTER\_SHAPE\_M}$}} \right) \\ 0 \leq \text{$\mathrm{NUM\_CHUNKS\_COLS}$} \leq & \mathop{\text{floor}}\left( \frac{\text{N}}{\text{$\mathrm{TILE\_SIZE\_N}$} * \text{$\mathrm{CLUSTER\_SHAPE\_N}$}} \right)\end{split}\]

For the row-major layout, M and N in tile size and cluster shape must be swapped. These restrictions mean that it is required to first query the heuristics via cublasLtMatmulAlgoGetHeuristic() and inspect the result for tile and cluster shapes, and only then set the number of chunks.
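
The upper bound in the inequality above is a plain integer division. The sketch below computes it for one dimension; the function name is hypothetical, and in practice the tile size and cluster shape values would come from the heuristic result rather than being chosen by hand.

```c
#include <assert.h>

/* Upper bound on the number of chunks along one dimension:
 * floor(dim / (tile_size * cluster_shape)), using integer division. */
int max_num_chunks(int dim, int tile_size, int cluster_shape)
{
    assert(dim >= 0 && tile_size > 0 && cluster_shape > 0);
    return dim / (tile_size * cluster_shape);
}
```

For example, with M = 4096, a tile size of 128 along M, and a cluster shape of 2 along M (illustrative values only), at most 16 row chunks can be requested.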

The pseudocode below shows the principles of operation:

// The code below shows operation when partitioning over
// rows assuming column-major layout and TN case.
//
// The case when partitioning is done over columns or
// row-major case are handled in a similar fashion,
// with the main difference being the offsets
// computations.
//
// Note that the actual implementation does not
// guarantee in which order the chunks are computed,
// and may employ various optimizations to improve
// overall performance.
//
// Here:
//   - A, B, C -- input matrices in the column-major layout
//   - lda -- leading dimension of matrix A
//   - M, N, K -- the original problem dimensions
//   - counters_in[] and counters_out[] -- the arrays of
//     input and output atomic counters
//
for (int i = 0; i < NUM_CHUNKS_ROWS; i++) {
  // Consumer: wait for the input counter to become 0
  if (consumer) {
    while (counters_in[i] != 0); // spin
  }

  // compute chunk dimensions
  chunk_m_begin = floor((double)M / NUM_CHUNKS_ROWS * i);
  chunk_m_end = floor((double)M / NUM_CHUNKS_ROWS * (i + 1));
  chunk_m = chunk_m_end - chunk_m_begin;

  // Compute the current chunk
  matmul(chunk_m, N, K,
         A[chunk_m_begin * lda], // A is col-major transposed
         B, // B is not partitioned
         C[chunk_m_begin] // C is col-major non-transposed
         );

  // Producer: set the counter to 0 when done
  if (producer) {
    counters_out[i] = 0;
    // make the written value visible to the consumer kernel
    memory_fence();
  }
}
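
The chunk boundaries computed in the pseudocode above can also be expressed in integer arithmetic, which avoids any floating-point rounding. The following is a minimal sketch with a hypothetical function name; it reproduces only the boundary computation, not the matmul or the counter handling.

```c
#include <assert.h>

/* Row range [begin, end) of chunk i when m rows are split into num_chunks
 * chunks: floor(m * i / num_chunks), computed with integer division. */
void chunk_rows(int m, int num_chunks, int i, int *begin, int *end)
{
    assert(m >= 0 && num_chunks > 0 && i >= 0 && i < num_chunks);
    *begin = (int)((long long)m * i / num_chunks);
    *end   = (int)((long long)m * (i + 1) / num_chunks);
}
```

With M = 10 and three chunks, the ranges are [0, 3), [3, 6), and [6, 10): consecutive chunks share a boundary, and the last chunk always ends at M, so every row is covered exactly once even when M is not divisible by the number of chunks.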

It should be noted that, in general, the CUDA programming model provides few kernel co-scheduling guarantees. Thus, use of this feature requires careful orchestration of producer and consumer kernel launch order and resource availability, as it is easy to create a deadlock situation. A deadlock may occur in the following cases (this is not an exhaustive list):

- If a producer kernel cannot start because the consumer kernel was launched first and is occupying some of the SMs that are needed by the producer kernel to launch. It is strongly recommended to set CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET to carve out some SMs for non-matmul (typically communication) kernels to execute on.
- If cudaDeviceSynchronize() is called after the consumer kernel starts but before the producer kernel does.
- When lazy module loading is enabled, and the producer kernel cannot be loaded while the consumer kernel is running due to locking in the CUDA runtime library. Both kernels must also be loaded before they are run together to avoid this situation. Using CUDA Graphs is another way to avoid deadlocks due to lazy loading.

Note

This feature is aimed at advanced users and is only available on Hopper architecture for FP8 non-batched cases with fast accumulation mode enabled, and is considered to have beta quality due to the large number of restrictions on its use.

3.2. cuBLASLt Code Examples

Please visit https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuBLASLt for updated code examples.

3.3. cuBLASLt Datatypes Reference

3.3.1. cublasLtClusterShape_t

cublasLtClusterShape_t is an enumerated type used to configure thread block cluster dimensions. Thread block clusters add an optional hierarchical level and are made up of thread blocks. Similar to thread blocks, these can be one, two, or three-dimensional. See also Thread Block Clusters.

Value	Description
`CUBLASLT_CLUSTER_SHAPE_AUTO`	Cluster shape is automatically selected.
`CUBLASLT_CLUSTER_SHAPE_1x1x1`	Cluster shape is 1 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x2x1`	Cluster shape is 1 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x4x1`	Cluster shape is 1 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x1x1`	Cluster shape is 2 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x2x1`	Cluster shape is 2 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x4x1`	Cluster shape is 2 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x1x1`	Cluster shape is 4 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x2x1`	Cluster shape is 4 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x4x1`	Cluster shape is 4 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x8x1`	Cluster shape is 1 x 8 x 1.
`CUBLASLT_CLUSTER_SHAPE_8x1x1`	Cluster shape is 8 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x8x1`	Cluster shape is 2 x 8 x 1.
`CUBLASLT_CLUSTER_SHAPE_8x2x1`	Cluster shape is 8 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x16x1`	Cluster shape is 1 x 16 x 1.
`CUBLASLT_CLUSTER_SHAPE_16x1x1`	Cluster shape is 16 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x3x1`	Cluster shape is 1 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x5x1`	Cluster shape is 1 x 5 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x6x1`	Cluster shape is 1 x 6 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x7x1`	Cluster shape is 1 x 7 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x9x1`	Cluster shape is 1 x 9 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x10x1`	Cluster shape is 1 x 10 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x11x1`	Cluster shape is 1 x 11 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x12x1`	Cluster shape is 1 x 12 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x13x1`	Cluster shape is 1 x 13 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x14x1`	Cluster shape is 1 x 14 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x15x1`	Cluster shape is 1 x 15 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x3x1`	Cluster shape is 2 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x5x1`	Cluster shape is 2 x 5 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x6x1`	Cluster shape is 2 x 6 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x7x1`	Cluster shape is 2 x 7 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x1x1`	Cluster shape is 3 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x2x1`	Cluster shape is 3 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x3x1`	Cluster shape is 3 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x4x1`	Cluster shape is 3 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x5x1`	Cluster shape is 3 x 5 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x3x1`	Cluster shape is 4 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_5x1x1`	Cluster shape is 5 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_5x2x1`	Cluster shape is 5 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_5x3x1`	Cluster shape is 5 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_6x1x1`	Cluster shape is 6 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_6x2x1`	Cluster shape is 6 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_7x1x1`	Cluster shape is 7 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_7x2x1`	Cluster shape is 7 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_9x1x1`	Cluster shape is 9 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_10x1x1`	Cluster shape is 10 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_11x1x1`	Cluster shape is 11 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_12x1x1`	Cluster shape is 12 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_13x1x1`	Cluster shape is 13 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_14x1x1`	Cluster shape is 14 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_15x1x1`	Cluster shape is 15 x 1 x 1.

3.3.2. cublasLtEpilogue_t

cublasLtEpilogue_t is an enumerated type used to set the postprocessing options for the epilogue.

Value	Description
CUBLASLT_EPILOGUE_DEFAULT = 1	No special postprocessing, just scale and quantize the results if necessary.
CUBLASLT_EPILOGUE_RELU = 2	Apply ReLU point-wise transform to the results (`x := max(x, 0)`).
CUBLASLT_EPILOGUE_RELU_AUX = CUBLASLT_EPILOGUE_RELU \| 128	Apply ReLU point-wise transform to the results (`x := max(x, 0)`). This epilogue mode produces an extra output, see CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_BIAS = 4	Apply (broadcast) bias from the bias vector. The bias vector length must match the number of rows of matrix D, and the vector must be packed (that is, the stride between vector elements is 1). The bias vector is broadcast to all columns and added before applying the final postprocessing.
CUBLASLT_EPILOGUE_RELU_BIAS = CUBLASLT_EPILOGUE_RELU \| CUBLASLT_EPILOGUE_BIAS	Apply bias and then ReLU transform.
CUBLASLT_EPILOGUE_RELU_AUX_BIAS = CUBLASLT_EPILOGUE_RELU_AUX \| CUBLASLT_EPILOGUE_BIAS	Apply bias and then ReLU transform. This epilogue mode produces an extra output, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_DRELU = 8 \| 128	Apply the ReLU gradient to the matmul output. Store the ReLU gradient in the output matrix. This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_DRELU_BGRAD = CUBLASLT_EPILOGUE_DRELU \| 16	Independently apply the ReLU and bias gradients to the matmul output. Store the ReLU gradient in the output matrix, and the bias gradient in the bias buffer (see CUBLASLT_MATMUL_DESC_BIAS_POINTER). This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_GELU = 32	Apply GELU point-wise transform to the results (`x := GELU(x)`).
CUBLASLT_EPILOGUE_GELU_AUX = CUBLASLT_EPILOGUE_GELU \| 128	Apply GELU point-wise transform to the results (`x := GELU(x)`). This epilogue mode outputs GELU input as a separate matrix (useful for training). See `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_GELU_BIAS = CUBLASLT_EPILOGUE_GELU \| CUBLASLT_EPILOGUE_BIAS	Apply bias and then GELU transform (see note 4).
CUBLASLT_EPILOGUE_GELU_AUX_BIAS = CUBLASLT_EPILOGUE_GELU_AUX \| CUBLASLT_EPILOGUE_BIAS	Apply bias and then GELU transform (see note 4). This epilogue mode outputs GELU input as a separate matrix (useful for training). See `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_DGELU = 64 \| 128	Apply GELU gradient to matmul output. Store GELU gradient in the output matrix. This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_DGELU_BGRAD = CUBLASLT_EPILOGUE_DGELU \| 16	Independently apply the GELU and bias gradients to the matmul output. Store the GELU gradient in the output matrix, and the bias gradient in the bias buffer (see CUBLASLT_MATMUL_DESC_BIAS_POINTER). This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_BGRADA = 256	Apply Bias gradient to the input matrix A. The bias size corresponds to the number of rows of the matrix D. The reduction happens over the GEMM’s “k” dimension. Store Bias gradient in the bias buffer, see `CUBLASLT_MATMUL_DESC_BIAS_POINTER` of cublasLtMatmulDescAttributes_t.
CUBLASLT_EPILOGUE_BGRADB = 512	Apply Bias gradient to the input matrix B. The bias size corresponds to the number of columns of the matrix D. The reduction happens over the GEMM’s “k” dimension. Store Bias gradient in the bias buffer, see `CUBLASLT_MATMUL_DESC_BIAS_POINTER` of cublasLtMatmulDescAttributes_t.
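
The composite epilogue values in the table are bitwise ORs of the basic ones, so individual features can be tested with a mask. The following is a minimal sketch of that arithmetic, not library code: the `EPI_*` names and both helper functions are hypothetical stand-ins that copy the numeric values listed above, and real code would include `<cublasLt.h>` and use `cublasLtEpilogue_t` directly.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of a few values from the table above, for illustration only.
   Real code should include <cublasLt.h> and use cublasLtEpilogue_t. */
enum {
    EPI_RELU = 2,
    EPI_BIAS = 4,
    EPI_GELU = 32,
    EPI_AUX  = 128, /* hypothetical name for the "| 128" bit */
    EPI_RELU_AUX_BIAS = EPI_RELU | EPI_AUX | EPI_BIAS,
    EPI_GELU_AUX_BIAS = EPI_GELU | EPI_AUX | EPI_BIAS,
    EPI_DRELU = 8 | 128
};

/* The composed values follow directly from the bitwise ORs in the table. */
static_assert(EPI_RELU_AUX_BIAS == 134, "2 | 128 | 4");
static_assert(EPI_GELU_AUX_BIAS == 164, "32 | 128 | 4");

/* Nonzero if the mode reads or writes the auxiliary buffer (bit 128). */
static int epilogue_has_aux_bit(uint32_t epilogue)
{
    return (epilogue & EPI_AUX) != 0;
}

/* Nonzero if the mode adds a forward-pass bias vector (bit 4). */
static int epilogue_has_bias_bit(uint32_t epilogue)
{
    return (epilogue & EPI_BIAS) != 0;
}
```

Note that the gradient modes (`DRELU_BGRAD`, `DGELU_BGRAD`) use bit 16 rather than bit 4, so the bias-bit helper above deliberately reports only the forward-pass bias.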

NOTES:

Note 4: GELU (Gaussian Error Linear Unit) is approximated by: $0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right)\right)$

3.3.3. cublasLtHandle_t

The cublasLtHandle_t type is a pointer type to an opaque structure holding the cuBLASLt library context. Use cublasLtCreate() to initialize the cuBLASLt library context and obtain a handle to it, and use cublasLtDestroy() to destroy a previously created cuBLASLt library context and release its resources.

Note

cuBLAS handle (cublasHandle_t) encapsulates a cuBLASLt handle. Any valid cublasHandle_t can be used in place of cublasLtHandle_t with a simple cast. However, unlike a cuBLAS handle, a cuBLASLt handle is not tied to any particular CUDA context.

3.3.4. cublasLtLoggerCallback_t

cublasLtLoggerCallback_t is a callback function pointer type. A callback function can be set using cublasLtLoggerSetCallback().

Parameters:

Parameter	Input / Output	Description
logLevel	Output	See cuBLASLt Logging.
functionName	Output	The name of the API that logged this message.
message	Output	The log message.
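
A callback with these three parameters might look like the sketch below. The typedef `logger_callback_t` is a local stand-in that assumes the parameters are an `int` and two C strings, and `my_logger` and `g_last_log` are hypothetical names; with the real library the function would be registered through cublasLtLoggerSetCallback().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Local stand-in for the callback type, assuming the three parameters
   from the table above are an int and two C strings. */
typedef void (*logger_callback_t)(int logLevel,
                                  const char *functionName,
                                  const char *message);

/* Most recent formatted log line (hypothetical storage for this sketch). */
static char g_last_log[256];

/* Formats each message as "[level] function: message". */
static void my_logger(int logLevel, const char *functionName, const char *message)
{
    assert(functionName != NULL && message != NULL);
    snprintf(g_last_log, sizeof g_last_log, "[%d] %s: %s",
             logLevel, functionName, message);
}
```

Keeping the callback short and free of cuBLASLt calls is a conservative choice here, since it runs inside the library's logging path.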

3.3.5. cublasLtMatmulAlgo_t

cublasLtMatmulAlgo_t is an opaque structure holding the description of the matrix multiplication algorithm. This structure can be trivially serialized and later restored for use with the same version of the cuBLAS library, which avoids having to select the right configuration again.

3.3.6. cublasLtMatmulAlgoCapAttributes_t

cublasLtMatmulAlgoCapAttributes_t enumerates matrix multiplication algorithm capability attributes that can be retrieved from an initialized cublasLtMatmulAlgo_t descriptor using cublasLtMatmulAlgoCapGetAttribute().

Value	Description	Data Type
CUBLASLT_ALGO_CAP_SPLITK_SUPPORT	Support for split-K. Boolean (0 or 1) to express if split-K implementation is supported. 0 means no support, and supported otherwise. See CUBLASLT_ALGO_CONFIG_SPLITK_NUM of cublasLtMatmulAlgoConfigAttributes_t.	int32_t
CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK	Mask to express the types of reduction schemes supported, see cublasLtReductionScheme_t. If the reduction scheme is not masked out then it is supported. For example: `int isReductionSchemeComputeTypeSupported = (reductionSchemeMask & CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE) == CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE ? 1 : 0;`	uint32_t
CUBLASLT_ALGO_CAP_CTA_SWIZZLING_SUPPORT	Support for CTA-swizzling. Boolean (0 or 1) to express if CTA-swizzling implementation is supported. 0 means no support, and 1 means supported; other values are reserved. See also CUBLASLT_ALGO_CONFIG_CTA_SWIZZLING of cublasLtMatmulAlgoConfigAttributes_t.	uint32_t
CUBLASLT_ALGO_CAP_STRIDED_BATCH_SUPPORT	Support strided batch. 0 means no support, supported otherwise.	int32_t
CUBLASLT_ALGO_CAP_OUT_OF_PLACE_RESULT_SUPPORT	Support results out of place (D != C in D = alpha.A.B + beta.C). 0 means no support, supported otherwise.	int32_t
CUBLASLT_ALGO_CAP_UPLO_SUPPORT	Syrk (symmetric rank k update)/herk (Hermitian rank k update) support (on top of regular gemm). 0 means no support, supported otherwise.	int32_t
CUBLASLT_ALGO_CAP_TILE_IDS	The tile ids possible to use. See cublasLtMatmulTile_t. If no tile ids are supported then use CUBLASLT_MATMUL_TILE_UNDEFINED. Use cublasLtMatmulAlgoCapGetAttribute() with `sizeInBytes = 0` to query the actual count.	Array of uint32_t
CUBLASLT_ALGO_CAP_STAGES_IDS	The stages ids possible to use. See cublasLtMatmulStages_t. If no stages ids are supported then use CUBLASLT_MATMUL_STAGES_UNDEFINED. Use cublasLtMatmulAlgoCapGetAttribute() with `sizeInBytes = 0` to query the actual count.	Array of uint32_t
CUBLASLT_ALGO_CAP_CUSTOM_OPTION_MAX	Custom option range is from 0 to CUBLASLT_ALGO_CAP_CUSTOM_OPTION_MAX (inclusive). See CUBLASLT_ALGO_CONFIG_CUSTOM_OPTION of cublasLtMatmulAlgoConfigAttributes_t.	int32_t
CUBLASLT_ALGO_CAP_MATHMODE_IMPL	Indicates whether the algorithm is using regular compute or tensor operations. 0 means regular compute, 1 means tensor operations. DEPRECATED	int32_t
CUBLASLT_ALGO_CAP_GAUSSIAN_IMPL	Indicates whether the algorithm implements the Gaussian optimization of complex matrix multiplication. 0 means regular compute; 1 means Gaussian. See cublasMath_t. DEPRECATED	int32_t
CUBLASLT_ALGO_CAP_CUSTOM_MEMORY_ORDER	Indicates whether the algorithm supports custom (not COL or ROW) memory order. 0 means only COL and ROW memory order is allowed, non-zero means that the algorithm might have different requirements. See cublasLtOrder_t.	int32_t
CUBLASLT_ALGO_CAP_POINTER_MODE_MASK	Bitmask enumerating the pointer modes the algorithm supports. See cublasLtPointerModeMask_t.	uint32_t
CUBLASLT_ALGO_CAP_EPILOGUE_MASK	Bitmask enumerating the kinds of postprocessing the algorithm supports in the epilogue. See cublasLtEpilogue_t.	uint32_t
CUBLASLT_ALGO_CAP_LD_NEGATIVE	Support for negative ld for all of the matrices. 0 means no support, supported otherwise.	uint32_t
CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS	Details about the algorithm’s implementation that affect its numerical behavior. See cublasLtNumericalImplFlags_t.	uint64_t
CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES	Minimum alignment required for A matrix in bytes.	uint32_t
CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_B_BYTES	Minimum alignment required for B matrix in bytes.	uint32_t
CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_C_BYTES	Minimum alignment required for C matrix in bytes.	uint32_t
CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES	Minimum alignment required for D matrix in bytes.	uint32_t
CUBLASLT_ALGO_CAP_ATOMIC_SYNC	Support for synchronization via atomic counters. See Atomics Synchronization.	int32_t
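
The mask test described for `CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK` can be written as a small helper. This sketch runs on plain integers: the `RS_*` constants are local stand-ins whose values are assumed to match cublasLtReductionScheme_t (0, 1, 2, 4, and a mask of 0x7), and `reduction_scheme_supported` is a hypothetical name. In real code the mask would be read with cublasLtMatmulAlgoCapGetAttribute().

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins; values assumed to match cublasLtReductionScheme_t. */
enum {
    RS_NONE         = 0,
    RS_INPLACE      = 1,
    RS_COMPUTE_TYPE = 2,
    RS_OUTPUT_TYPE  = 4,
    RS_MASK         = 0x7
};

/* Returns 1 if every bit of `scheme` is present in the capability mask.
   RS_NONE (no reduction) has no bits set, so it is always reported as supported. */
static int reduction_scheme_supported(uint32_t capMask, uint32_t scheme)
{
    assert((scheme & ~(uint32_t)RS_MASK) == 0); /* reject unknown bits */
    return (capMask & scheme) == scheme;
}
```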

3.3.7. cublasLtMatmulAlgoConfigAttributes_t

cublasLtMatmulAlgoConfigAttributes_t is an enumerated type that contains the configuration attributes for cuBLASLt matrix multiply algorithms. The configuration attributes are algorithm-specific, and can be set. The attributes configuration of a given algorithm should agree with its capability attributes. Use cublasLtMatmulAlgoConfigGetAttribute() and cublasLtMatmulAlgoConfigSetAttribute() to get and set the attribute value of a matmul algorithm descriptor.

Value	Description	Data Type
`CUBLASLT_ALGO_CONFIG_ID`	Read-only attribute. Algorithm index. See cublasLtMatmulAlgoGetIds(). Set by cublasLtMatmulAlgoInit().	int32_t
`CUBLASLT_ALGO_CONFIG_TILE_ID`	Tile id. See cublasLtMatmulTile_t. Default: `CUBLASLT_MATMUL_TILE_UNDEFINED`.	uint32_t
`CUBLASLT_ALGO_CONFIG_STAGES_ID`	Stages id. See cublasLtMatmulStages_t. Default: `CUBLASLT_MATMUL_STAGES_UNDEFINED`.	uint32_t
`CUBLASLT_ALGO_CONFIG_SPLITK_NUM`	Number of K splits. If the number of K splits is greater than one, SPLITK_NUM parts of matrix multiplication will be computed in parallel. The results will be accumulated according to `CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME`.	uint32_t
`CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME`	Reduction scheme to use when splitK value > 1. Default: `CUBLASLT_REDUCTION_SCHEME_NONE`. See cublasLtReductionScheme_t.	uint32_t
`CUBLASLT_ALGO_CONFIG_CTA_SWIZZLING`	Enable/Disable CTA swizzling. Change mapping from CUDA grid coordinates to parts of the matrices. Possible values: 0 and 1; other values reserved.	uint32_t
`CUBLASLT_ALGO_CONFIG_CUSTOM_OPTION`	Custom option value. Each algorithm can support some custom options that don’t fit the description of the other configuration attributes. See the `CUBLASLT_ALGO_CAP_CUSTOM_OPTION_MAX` of cublasLtMatmulAlgoCapAttributes_t for the accepted range for a specific case.	uint32_t
`CUBLASLT_ALGO_CONFIG_INNER_SHAPE_ID`	Inner shape ID. Refer to `cublasLtMatmulInnerShape_t`. Default: `CUBLASLT_MATMUL_INNER_SHAPE_UNDEFINED`.	uint16_t
`CUBLASLT_ALGO_CONFIG_CLUSTER_SHAPE_ID`	Cluster shape ID. Refer to `cublasLtClusterShape_t`. Default: `CUBLASLT_CLUSTER_SHAPE_AUTO`.	uint16_t
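
To illustrate the idea behind `CUBLASLT_ALGO_CONFIG_SPLITK_NUM`, the sketch below splits the K dimension of a single dot product into chunks, computes each partial sum separately, and accumulates the results. It is a sequential host-side illustration of the arithmetic only: the function names are hypothetical, and how cuBLASLt actually partitions K or schedules the parts is not described in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Partial dot product over the half-open range [kBegin, kEnd). */
static double partial_dot(const double *a, const double *b,
                          size_t kBegin, size_t kEnd)
{
    double sum = 0.0;
    for (size_t i = kBegin; i < kEnd; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Split the K range into `splits` nearly equal chunks and accumulate the
   partial results, as a reduction step would. */
static double splitk_dot(const double *a, const double *b,
                         size_t k, size_t splits)
{
    assert(splits >= 1);
    double acc = 0.0;
    for (size_t s = 0; s < splits; ++s) {
        size_t begin = (k * s) / splits;
        size_t end   = (k * (s + 1)) / splits;
        acc += partial_dot(a, b, begin, end); /* each part is independent */
    }
    return acc;
}
```

In floating point, the order of accumulation can change the rounded result, which is one reason the reduction scheme is configurable.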

3.3.8. cublasLtMatmulDesc_t

The cublasLtMatmulDesc_t is a pointer to an opaque structure holding the description of the matrix multiplication operation cublasLtMatmul(). A descriptor can be created by calling cublasLtMatmulDescCreate() and destroyed by calling cublasLtMatmulDescDestroy().

3.3.9. cublasLtMatmulDescAttributes_t

cublasLtMatmulDescAttributes_t is an enumerated type whose values identify the attributes that define the specifics of the matrix multiply operation. Use cublasLtMatmulDescGetAttribute() and cublasLtMatmulDescSetAttribute() to get and set the attribute value of a matmul descriptor.

Attribute Name	Description	Data Type
`CUBLASLT_MATMUL_DESC_COMPUTE_TYPE`	Compute type. Defines the data type used for multiply and accumulate operations, and the accumulator during the matrix multiplication. See cublasComputeType_t.	int32_t
`CUBLASLT_MATMUL_DESC_SCALE_TYPE`	Scale type. Defines the data type of the scaling factors `alpha` and `beta`. The accumulator value and the value from matrix C are typically converted to scale type before final scaling. The value is then converted from scale type to the type of matrix D before storing in memory. Default value is aligned with CUBLASLT_MATMUL_DESC_COMPUTE_TYPE. See cudaDataType_t.	int32_t
`CUBLASLT_MATMUL_DESC_POINTER_MODE`	Specifies how `alpha` and `beta` are passed by reference: as scalars on the host, scalars on the device, or device vectors. Default value is: `CUBLASLT_POINTER_MODE_HOST` (i.e., on the host). See cublasLtPointerMode_t.	int32_t
`CUBLASLT_MATMUL_DESC_TRANSA`	Specifies the type of transformation operation that should be performed on matrix A. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	int32_t
`CUBLASLT_MATMUL_DESC_TRANSB`	Specifies the type of transformation operation that should be performed on matrix B. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	int32_t
`CUBLASLT_MATMUL_DESC_TRANSC`	Specifies the type of transformation operation that should be performed on matrix C. Currently only `CUBLAS_OP_N` is supported. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	int32_t
`CUBLASLT_MATMUL_DESC_FILL_MODE`	Indicates whether the lower or upper part of the dense matrix was filled, and consequently should be used by the function. Default value is: `CUBLAS_FILL_MODE_FULL`. See cublasFillMode_t.	int32_t
`CUBLASLT_MATMUL_DESC_EPILOGUE`	Epilogue function. See cublasLtEpilogue_t. Default value is: `CUBLASLT_EPILOGUE_DEFAULT`.	uint32_t
`CUBLASLT_MATMUL_DESC_BIAS_POINTER`	Bias or Bias gradient vector pointer in the device memory. Input vector with length that matches the number of rows of matrix D when one of the following epilogues is used: `CUBLASLT_EPILOGUE_BIAS`, `CUBLASLT_EPILOGUE_RELU_BIAS`, `CUBLASLT_EPILOGUE_RELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_GELU_BIAS`, `CUBLASLT_EPILOGUE_GELU_AUX_BIAS`. Output vector with length that matches the number of rows of matrix D when one of the following epilogues is used: `CUBLASLT_EPILOGUE_DRELU_BGRAD`, `CUBLASLT_EPILOGUE_DGELU_BGRAD`, `CUBLASLT_EPILOGUE_BGRADA`. Output vector with length that matches the number of columns of matrix D when one of the following epilogues is used: `CUBLASLT_EPILOGUE_BGRADB`. Bias vector elements are the same type as `alpha` and `beta` (see `CUBLASLT_MATMUL_DESC_SCALE_TYPE` in this table) when matrix D datatype is `CUDA_R_8I` and same as matrix D datatype otherwise. See the datatypes table under cublasLtMatmul() for detailed mapping. Default value is: NULL.	void * / const void *
`CUBLASLT_MATMUL_DESC_BIAS_BATCH_STRIDE`	Stride (in elements) to the next bias or bias gradient vector for strided batch operations. The default value is 0.	int64_t
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`	Pointer for epilogue auxiliary buffer. Output vector for ReLU bit-mask in forward pass when `CUBLASLT_EPILOGUE_RELU_AUX` or `CUBLASLT_EPILOGUE_RELU_AUX_BIAS` epilogue is used. Input vector for ReLU bit-mask in backward pass when `CUBLASLT_EPILOGUE_DRELU` or `CUBLASLT_EPILOGUE_DRELU_BGRAD` epilogue is used. Output of GELU input matrix in forward pass when `CUBLASLT_EPILOGUE_GELU_AUX_BIAS` epilogue is used. Input of GELU input matrix for backward pass when `CUBLASLT_EPILOGUE_DGELU` or `CUBLASLT_EPILOGUE_DGELU_BGRAD` epilogue is used. For aux data type, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_DATA_TYPE`. Routines that don’t dereference this pointer, like cublasLtMatmulAlgoGetHeuristic(), depend on its value to determine expected pointer alignment. Requires setting the `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD` attribute.	void * / const void *
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD`	Leading dimension for epilogue auxiliary buffer. ReLU bit-mask matrix leading dimension in elements (i.e. bits) when `CUBLASLT_EPILOGUE_RELU_AUX`, `CUBLASLT_EPILOGUE_RELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_DRELU`, or `CUBLASLT_EPILOGUE_DRELU_BGRAD` epilogue is used. Must be divisible by 128 and be no less than the number of rows in the output matrix. GELU input matrix leading dimension in elements when `CUBLASLT_EPILOGUE_GELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_DGELU`, or `CUBLASLT_EPILOGUE_DGELU_BGRAD` epilogue is used. Must be divisible by 8 and be no less than the number of rows in the output matrix.	int64_t
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_BATCH_STRIDE`	Batch stride for epilogue auxiliary buffer. ReLU bit-mask matrix batch stride in elements (i.e. bits) when `CUBLASLT_EPILOGUE_RELU_AUX`, `CUBLASLT_EPILOGUE_RELU_AUX_BIAS` or `CUBLASLT_EPILOGUE_DRELU_BGRAD` epilogue is used. Must be divisible by 128. GELU input matrix batch stride in elements when `CUBLASLT_EPILOGUE_GELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_DGELU`, or `CUBLASLT_EPILOGUE_DGELU_BGRAD` epilogue is used. Must be divisible by 8. Default value: 0.	int64_t
`CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE`	Batch stride for alpha vector. Used together with `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST` when matrix D’s `CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT` is greater than 1. If `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO` is set then `CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE` must be set to 0 as this mode doesn’t support batched alpha vector. Default value: 0.	int64_t
`CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET`	Number of SMs to target for parallel execution. Optimizes heuristics for execution on a different number of SMs when user expects a concurrent stream to be using some of the device resources. Default value: 0.	int32_t
`CUBLASLT_MATMUL_DESC_A_SCALE_POINTER`	Device pointer to the scale factor value that converts data in matrix A to the compute data type range. The scaling factor must have the same type as the compute type. If not specified, or set to NULL, the scaling factor is assumed to be 1. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	const void*
`CUBLASLT_MATMUL_DESC_B_SCALE_POINTER`	Equivalent to `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` for matrix B. Default value: NULL	const void*
`CUBLASLT_MATMUL_DESC_C_SCALE_POINTER`	Equivalent to `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` for matrix C. Default value: NULL	const void*
`CUBLASLT_MATMUL_DESC_D_SCALE_POINTER`	Equivalent to `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` for matrix D. Default value: NULL	const void*
`CUBLASLT_MATMUL_DESC_AMAX_D_POINTER`	Device pointer to the memory location that on completion will be set to the maximum of absolute values in the output matrix. The computed value has the same type as the compute type. If not specified, or set to NULL, the maximum absolute value is not computed. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	void *
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_DATA_TYPE`	The type of the data that will be stored in `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`. If unset (or set to the default value of -1), the data type is set to be the output matrix element data type (DType), with some exceptions: ReLU uses a bit-mask. For FP8 kernels with an output type (DType) of `CUDA_R_8F_E4M3`, the data type can be set to a non-default value if: AType and BType are `CUDA_R_8F_E4M3`; Bias Type is `CUDA_R_16F`; CType is `CUDA_R_16BF` or `CUDA_R_16F`; `CUBLASLT_MATMUL_DESC_EPILOGUE` is set to `CUBLASLT_EPILOGUE_GELU_AUX`. When CType is `CUDA_R_16BF`, the data type may be set to `CUDA_R_16BF` or `CUDA_R_8F_E4M3`. When CType is `CUDA_R_16F`, the data type may be set to `CUDA_R_16F`. Otherwise, the data type should be left unset or set to the default value of -1. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: -1	int32_t based on cudaDataType
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_SCALE_POINTER`	Device pointer to the scaling factor value to convert results from compute type data range to storage data range in the auxiliary matrix that is set via `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`. The scaling factor value must have the same type as the compute type. If not specified, or set to NULL, the scaling factor is assumed to be 1. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	void *
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_AMAX_POINTER`	Device pointer to the memory location that on completion will be set to the maximum of absolute values in the buffer that is set via `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`. The computed value has the same type as the compute type. If not specified, or set to NULL, the maximum absolute value is not computed. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	void *
`CUBLASLT_MATMUL_DESC_FAST_ACCUM`	Flag for managing FP8 fast accumulation mode. When enabled, problem execution might be faster but at the cost of lower accuracy because intermediate results will not periodically be promoted to a higher precision. Default value: 0 - fast accumulation mode is disabled	int8_t
`CUBLASLT_MATMUL_DESC_BIAS_DATA_TYPE`	Type of the bias or bias gradient vector in the device memory. Bias case: see `CUBLASLT_EPILOGUE_BIAS`. If unset (or set to the default value of -1), the bias vector elements are the same type as the elements of the output matrix (Dtype) with the following exceptions: IMMA kernels with computeType=`CUDA_R_32I` and `Ctype=CUDA_R_8I`, where the bias vector elements are the same type as alpha, beta (`CUBLASLT_MATMUL_DESC_SCALE_TYPE=CUDA_R_32F`); FP8 kernels with an output type of `CUDA_R_32F`, `CUDA_R_8F_E4M3` or `CUDA_R_8F_E5M2` (see cublasLtMatmul() for more details). Default value: -1	int32_t based on cudaDataType
`CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_IN_COUNTERS_POINTER`	Pointer to a device array of input atomic counters consumed by a matmul. When a counter reaches zero, computation of the corresponding chunk of the output tensor is allowed to start. Default: NULL. See Atomics Synchronization.	int32_t *
`CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_OUT_COUNTERS_POINTER`	Pointer to a device array of output atomic counters produced by a matmul. A matmul kernel sets a counter to zero when the computations of the corresponding chunk of the output tensor have completed. All the counters must be initialized to 1 before a matmul kernel is run. Default: NULL. See Atomics Synchronization.	int32_t *
`CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_NUM_CHUNKS_D_ROWS`	Number of atomic synchronization chunks in the row dimension of the output matrix D. Each chunk corresponds to a single atomic counter. Default: 0 (atomics synchronization disabled). See Atomics Synchronization.	int32_t
`CUBLASLT_MATMUL_DESC_ATOMIC_SYNC_NUM_CHUNKS_D_COLS`	Number of atomic synchronization chunks in the column dimension of the output matrix D. Each chunk corresponds to a single atomic counter. Default: 0 (atomics synchronization disabled). See Atomics Synchronization.	int32_t
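
The divisibility rules for `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD` amount to rounding the number of output rows up to a multiple of 128 (ReLU bit-mask) or 8 (GELU input). The sketch below computes the smallest leading dimension that satisfies those rules; the helper names are hypothetical, and a larger valid value may be preferable in practice.

```c
#include <assert.h>
#include <stdint.h>

/* Round `value` up to the next multiple of `multiple`. */
static int64_t round_up(int64_t value, int64_t multiple)
{
    assert(value >= 0 && multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

/* Smallest ReLU bit-mask leading dimension (in bits) for `rows` output rows:
   divisible by 128 and no less than `rows`. */
static int64_t relu_aux_ld(int64_t rows)
{
    return round_up(rows, 128);
}

/* Smallest GELU input leading dimension (in elements) for `rows` output rows:
   divisible by 8 and no less than `rows`. */
static int64_t gelu_aux_ld(int64_t rows)
{
    return round_up(rows, 8);
}
```

The resulting value would then be passed to cublasLtMatmulDescSetAttribute() as an `int64_t`.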

3.3.10. cublasLtMatmulHeuristicResult_t

cublasLtMatmulHeuristicResult_t is a descriptor that holds the configured matrix multiplication algorithm descriptor and its runtime properties.

Member	Description
cublasLtMatmulAlgo_t algo	Must be initialized with cublasLtMatmulAlgoInit() if the preference CUBLASLT_MATMUL_PREF_SEARCH_MODE is set to CUBLASLT_SEARCH_LIMITED_BY_ALGO_ID. See cublasLtMatmulSearch_t.
size_t workspaceSize	Actual size of workspace memory required.
cublasStatus_t state	Result status. Other fields are valid only if, after call to cublasLtMatmulAlgoGetHeuristic(), this member is set to CUBLAS_STATUS_SUCCESS.
float wavesCount	Waves count is a device utilization metric. A `wavesCount` value of 1.0f suggests that when the kernel is launched it will fully occupy the GPU.
int reserved[4]	Reserved.
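
Because the other members are meaningful only when `state` reports success, code that scans an array of heuristic results typically checks `state` first. The sketch below shows that pattern on a simplified stand-in structure: `heuristic_result_t` and `pick_result` are hypothetical, the `algo` and `reserved` members are omitted, and a plain `int` of 0 stands in for CUBLAS_STATUS_SUCCESS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cublasLtMatmulHeuristicResult_t. */
typedef struct {
    size_t workspaceSize; /* workspace memory required, in bytes */
    int    state;         /* 0 stands in for CUBLAS_STATUS_SUCCESS */
    float  wavesCount;    /* device utilization metric */
} heuristic_result_t;

/* Returns the index of the first result that succeeded and fits in
   `maxWorkspace` bytes, or -1 if none qualifies. */
static int pick_result(const heuristic_result_t *results, int count,
                       size_t maxWorkspace)
{
    assert(results != NULL || count == 0);
    for (int i = 0; i < count; ++i) {
        if (results[i].state == 0 && results[i].workspaceSize <= maxWorkspace) {
            return i;
        }
    }
    return -1;
}
```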

3.3.11. cublasLtMatmulInnerShape_t

cublasLtMatmulInnerShape_t is an enumerated type used to configure various aspects of the internal kernel design. This does not impact the CUDA grid size.

Value	Description
`CUBLASLT_MATMUL_INNER_SHAPE_UNDEFINED`	Inner shape is undefined.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA884`	Inner shape is MMA884.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA1684`	Inner shape is MMA1684.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA1688`	Inner shape is MMA1688.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA16816`	Inner shape is MMA16816.

3.3.12. cublasLtMatmulPreference_t

The cublasLtMatmulPreference_t is a pointer to an opaque structure holding the description of the preferences for cublasLtMatmulAlgoGetHeuristic() configuration. Use cublasLtMatmulPreferenceCreate() to create one instance of the descriptor and cublasLtMatmulPreferenceDestroy() to destroy a previously created descriptor and release the resources.

3.3.13. cublasLtMatmulPreferenceAttributes_t

cublasLtMatmulPreferenceAttributes_t is an enumerated type used to apply algorithm search preferences while fine-tuning the heuristic function. Use cublasLtMatmulPreferenceGetAttribute() and cublasLtMatmulPreferenceSetAttribute() to get and set the attribute value of a matmul preference descriptor.

Value	Description	Data Type
CUBLASLT_MATMUL_PREF_SEARCH_MODE	Search mode. See cublasLtMatmulSearch_t. Default is CUBLASLT_SEARCH_BEST_FIT.	uint32_t
CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES	Maximum allowed workspace memory. Default is 0 (no workspace memory allowed).	uint64_t
CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK	Reduction scheme mask. See cublasLtReductionScheme_t. Only algorithm configurations specifying CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME that is not masked out by this attribute are allowed. For example, a mask value of 0x03 will allow only INPLACE and COMPUTE_TYPE reduction schemes. Default is CUBLASLT_REDUCTION_SCHEME_MASK (i.e., allows all reduction schemes).	uint32_t
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES	Minimum buffer alignment for matrix A (in bytes). Selecting a smaller value excludes algorithms that cannot work with a matrix A that is not as strictly aligned as they need. Default is 256 bytes.	uint32_t
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES	Minimum buffer alignment for matrix B (in bytes). Selecting a smaller value excludes algorithms that cannot work with a matrix B that is not as strictly aligned as they need. Default is 256 bytes.	uint32_t
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES	Minimum buffer alignment for matrix C (in bytes). Selecting a smaller value excludes algorithms that cannot work with a matrix C that is not as strictly aligned as they need. Default is 256 bytes.	uint32_t
CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES	Minimum buffer alignment for matrix D (in bytes). Selecting a smaller value excludes algorithms that cannot work with a matrix D that is not as strictly aligned as they need. Default is 256 bytes.	uint32_t
CUBLASLT_MATMUL_PREF_MAX_WAVES_COUNT	Maximum wave count. See `cublasLtMatmulHeuristicResult_t::wavesCount`. Selecting a non-zero value will exclude algorithms that report device utilization higher than specified. Default is `0.0f`.	float
CUBLASLT_MATMUL_PREF_IMPL_MASK	Numerical implementation details mask. See cublasLtNumericalImplFlags_t. Filters heuristic result to only include algorithms that use the allowed implementations. Default: uint64_t(-1) (allow everything).	uint64_t
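
When a buffer is less strictly aligned than the 256-byte default, the value for the `MIN_ALIGNMENT_*_BYTES` attributes can be derived from the address itself. The sketch below returns the largest power-of-two alignment of a pointer; the function name is hypothetical, and capping the result at 256 is an assumption based on the documented default rather than a library requirement.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power-of-two alignment of `ptr`, capped at 256 bytes. */
static uint32_t buffer_alignment_bytes(const void *ptr)
{
    uintptr_t address = (uintptr_t)ptr;
    uint32_t alignment = 256; /* assumed cap, matching the documented default */
    while (alignment > 1 && (address % alignment) != 0) {
        alignment /= 2; /* try the next smaller power of two */
    }
    assert((alignment & (alignment - 1)) == 0); /* always a power of two */
    return alignment;
}
```

The result is a `uint32_t`, matching the data type listed for these attributes, so it can be passed to cublasLtMatmulPreferenceSetAttribute() directly.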

3.3.14. cublasLtMatmulSearch_t

cublasLtMatmulSearch_t is an enumerated type that contains the attributes for heuristics search type.

Value	Description
CUBLASLT_SEARCH_BEST_FIT	Request heuristics for the best algorithm for the given use case.
CUBLASLT_SEARCH_LIMITED_BY_ALGO_ID	Request heuristics only for the pre-configured algo id.

3.3.15. cublasLtMatmulTile_t

cublasLtMatmulTile_t is an enumerated type used to set the tile size in rows x columns. See also CUTLASS: Fast Linear Algebra in CUDA C++.

Value	Description
CUBLASLT_MATMUL_TILE_UNDEFINED	Tile size is undefined.
CUBLASLT_MATMUL_TILE_8x8	Tile size is 8 rows x 8 columns.
CUBLASLT_MATMUL_TILE_8x16	Tile size is 8 rows x 16 columns.
CUBLASLT_MATMUL_TILE_16x8	Tile size is 16 rows x 8 columns.
CUBLASLT_MATMUL_TILE_8x32	Tile size is 8 rows x 32 columns.
CUBLASLT_MATMUL_TILE_16x16	Tile size is 16 rows x 16 columns.
CUBLASLT_MATMUL_TILE_32x8	Tile size is 32 rows x 8 columns.
CUBLASLT_MATMUL_TILE_8x64	Tile size is 8 rows x 64 columns.
CUBLASLT_MATMUL_TILE_16x32	Tile size is 16 rows x 32 columns.
CUBLASLT_MATMUL_TILE_32x16	Tile size is 32 rows x 16 columns.
CUBLASLT_MATMUL_TILE_64x8	Tile size is 64 rows x 8 columns.
CUBLASLT_MATMUL_TILE_32x32	Tile size is 32 rows x 32 columns.
CUBLASLT_MATMUL_TILE_32x64	Tile size is 32 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x32	Tile size is 64 rows x 32 columns.
CUBLASLT_MATMUL_TILE_32x128	Tile size is 32 rows x 128 columns.
CUBLASLT_MATMUL_TILE_64x64	Tile size is 64 rows x 64 columns.
CUBLASLT_MATMUL_TILE_128x32	Tile size is 128 rows x 32 columns.
CUBLASLT_MATMUL_TILE_64x128	Tile size is 64 rows x 128 columns.
CUBLASLT_MATMUL_TILE_128x64	Tile size is 128 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x256	Tile size is 64 rows x 256 columns.
CUBLASLT_MATMUL_TILE_128x128	Tile size is 128 rows x 128 columns.
CUBLASLT_MATMUL_TILE_256x64	Tile size is 256 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x512	Tile size is 64 rows x 512 columns.
CUBLASLT_MATMUL_TILE_128x256	Tile size is 128 rows x 256 columns.
CUBLASLT_MATMUL_TILE_256x128	Tile size is 256 rows x 128 columns.
CUBLASLT_MATMUL_TILE_512x64	Tile size is 512 rows x 64 columns.
CUBLASLT_MATMUL_TILE_64x96	Tile size is 64 rows x 96 columns.
CUBLASLT_MATMUL_TILE_96x64	Tile size is 96 rows x 64 columns.
CUBLASLT_MATMUL_TILE_96x128	Tile size is 96 rows x 128 columns.
CUBLASLT_MATMUL_TILE_128x160	Tile size is 128 rows x 160 columns.
CUBLASLT_MATMUL_TILE_160x128	Tile size is 160 rows x 128 columns.
CUBLASLT_MATMUL_TILE_192x128	Tile size is 192 rows x 128 columns.
CUBLASLT_MATMUL_TILE_128x192	Tile size is 128 rows x 192 columns.
CUBLASLT_MATMUL_TILE_128x96	Tile size is 128 rows x 96 columns.

3.3.16. cublasLtMatmulStages_t

cublasLtMatmulStages_t is an enumerated type used to configure the size and number of shared memory buffers where input elements are staged. The number of staging buffers defines the kernel’s pipeline depth.

Value	Description
CUBLASLT_MATMUL_STAGES_UNDEFINED	Stage size is undefined.
CUBLASLT_MATMUL_STAGES_16x1	Stage size is 16, number of stages is 1.
CUBLASLT_MATMUL_STAGES_16x2	Stage size is 16, number of stages is 2.
CUBLASLT_MATMUL_STAGES_16x3	Stage size is 16, number of stages is 3.
CUBLASLT_MATMUL_STAGES_16x4	Stage size is 16, number of stages is 4.
CUBLASLT_MATMUL_STAGES_16x5	Stage size is 16, number of stages is 5.
CUBLASLT_MATMUL_STAGES_16x6	Stage size is 16, number of stages is 6.
CUBLASLT_MATMUL_STAGES_32x1	Stage size is 32, number of stages is 1.
CUBLASLT_MATMUL_STAGES_32x2	Stage size is 32, number of stages is 2.
CUBLASLT_MATMUL_STAGES_32x3	Stage size is 32, number of stages is 3.
CUBLASLT_MATMUL_STAGES_32x4	Stage size is 32, number of stages is 4.
CUBLASLT_MATMUL_STAGES_32x5	Stage size is 32, number of stages is 5.
CUBLASLT_MATMUL_STAGES_32x6	Stage size is 32, number of stages is 6.
CUBLASLT_MATMUL_STAGES_64x1	Stage size is 64, number of stages is 1.
CUBLASLT_MATMUL_STAGES_64x2	Stage size is 64, number of stages is 2.
CUBLASLT_MATMUL_STAGES_64x3	Stage size is 64, number of stages is 3.
CUBLASLT_MATMUL_STAGES_64x4	Stage size is 64, number of stages is 4.
CUBLASLT_MATMUL_STAGES_64x5	Stage size is 64, number of stages is 5.
CUBLASLT_MATMUL_STAGES_64x6	Stage size is 64, number of stages is 6.
CUBLASLT_MATMUL_STAGES_128x1	Stage size is 128, number of stages is 1.
CUBLASLT_MATMUL_STAGES_128x2	Stage size is 128, number of stages is 2.
CUBLASLT_MATMUL_STAGES_128x3	Stage size is 128, number of stages is 3.
CUBLASLT_MATMUL_STAGES_128x4	Stage size is 128, number of stages is 4.
CUBLASLT_MATMUL_STAGES_128x5	Stage size is 128, number of stages is 5.
CUBLASLT_MATMUL_STAGES_128x6	Stage size is 128, number of stages is 6.
CUBLASLT_MATMUL_STAGES_32x10	Stage size is 32, number of stages is 10.
CUBLASLT_MATMUL_STAGES_8x4	Stage size is 8, number of stages is 4.
CUBLASLT_MATMUL_STAGES_16x10	Stage size is 16, number of stages is 10.
CUBLASLT_MATMUL_STAGES_8x5	Stage size is 8, number of stages is 5.
CUBLASLT_MATMUL_STAGES_8x3	Stage size is 8, number of stages is 3.
CUBLASLT_MATMUL_STAGES_8xAUTO	Stage size is 8, number of stages is selected automatically.
CUBLASLT_MATMUL_STAGES_16xAUTO	Stage size is 16, number of stages is selected automatically.
CUBLASLT_MATMUL_STAGES_32xAUTO	Stage size is 32, number of stages is selected automatically.
CUBLASLT_MATMUL_STAGES_64xAUTO	Stage size is 64, number of stages is selected automatically.
CUBLASLT_MATMUL_STAGES_128xAUTO	Stage size is 128, number of stages is selected automatically.

3.3.17. cublasLtNumericalImplFlags_t

cublasLtNumericalImplFlags_t: a set of bit-flags that can be specified to select implementation details that may affect numerical behavior of algorithms.

Flags below can be combined using the bit OR operator “|”.

Value	Description
CUBLASLT_NUMERICAL_IMPL_FLAGS_FMA	Specify that the implementation is based on [H,F,D]FMA (fused multiply-add) family instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_HMMA	Specify that the implementation is based on HMMA (tensor operation) family instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_IMMA	Specify that the implementation is based on IMMA (integer tensor operation) family instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_DMMA	Specify that the implementation is based on DMMA (double precision tensor operation) family instructions.
CUBLASLT_NUMERICAL_IMPL_FLAGS_TENSOR_OP_MASK	Mask to filter implementations using any of the above kinds of tensor operations.
CUBLASLT_NUMERICAL_IMPL_FLAGS_OP_TYPE_MASK	Mask to filter implementation details about multiply-accumulate instructions used.

CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_16F	Specify that the implementation’s inner dot product is using half precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32F	Specify that the implementation’s inner dot product is using single precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_64F	Specify that the implementation’s inner dot product is using double precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32I	Specify that the implementation’s inner dot product is using 32 bit signed integer precision accumulator.
CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_TYPE_MASK	Mask to filter implementation details about accumulator used.

CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_16F	Specify that the implementation’s inner dot product multiply-accumulate instruction is using half-precision inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_16BF	Specify that the implementation’s inner dot product multiply-accumulate instruction is using bfloat16 inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_TF32	Specify that the implementation’s inner dot product multiply-accumulate instruction is using TF32 inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_32F	Specify that the implementation’s inner dot product multiply-accumulate instruction is using single-precision inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_64F	Specify that the implementation’s inner dot product multiply-accumulate instruction is using double-precision inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_8I	Specify that the implementation’s inner dot product multiply-accumulate instruction is using 8-bit integer inputs.
CUBLASLT_NUMERICAL_IMPL_FLAGS_OP_INPUT_TYPE_MASK	Mask to filter implementation details about accumulator input used.

CUBLASLT_NUMERICAL_IMPL_FLAGS_GAUSSIAN	Specify that the implementation applies Gauss complexity reduction algorithm to reduce arithmetic complexity of the complex matrix multiplication problem

3.3.18. cublasLtMatrixLayout_t

The cublasLtMatrixLayout_t is a pointer to an opaque structure holding the description of a matrix layout. Use cublasLtMatrixLayoutCreate() to create one instance of the descriptor and cublasLtMatrixLayoutDestroy() to destroy a previously created descriptor and release the resources.

3.3.19. cublasLtMatrixLayoutAttribute_t

cublasLtMatrixLayoutAttribute_t is a descriptor structure containing the attributes that define the details of the matrix operation. Use cublasLtMatrixLayoutGetAttribute() and cublasLtMatrixLayoutSetAttribute() to get and set the attribute value of a matrix layout descriptor.

Attribute Name	Description	Data Type
CUBLASLT_MATRIX_LAYOUT_TYPE	Specifies the data precision type. See cudaDataType_t.	uint32_t
CUBLASLT_MATRIX_LAYOUT_ORDER	Specifies the memory order of the data of the matrix. Default value is CUBLASLT_ORDER_COL. See cublasLtOrder_t .	int32_t
CUBLASLT_MATRIX_LAYOUT_ROWS	Describes the number of rows in the matrix. Normally only values that can be expressed as `int32_t` are supported.	uint64_t
CUBLASLT_MATRIX_LAYOUT_COLS	Describes the number of columns in the matrix. Normally only values that can be expressed as `int32_t` are supported.	uint64_t
CUBLASLT_MATRIX_LAYOUT_LD	The leading dimension of the matrix. For CUBLASLT_ORDER_COL this is the stride (in elements) of a matrix column. See also cublasLtOrder_t. Currently only non-negative values are supported. Must be large enough so that matrix memory locations do not overlap (e.g., greater than or equal to CUBLASLT_MATRIX_LAYOUT_ROWS in case of CUBLASLT_ORDER_COL).	int64_t
CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT	Number of matmul operations to perform in the batch. Default value is 1. See also CUBLASLT_ALGO_CAP_STRIDED_BATCH_SUPPORT in cublasLtMatmulAlgoCapAttributes_t.	int32_t
CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET	Stride (in elements) to the next matrix for the strided batch operation. Default value is 0. When matrix type is planar-complex (CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET != 0), batch stride is interpreted by cublasLtMatmul() in number of real valued sub-elements. E.g. for data of type CUDA_C_16F, offset of 1024B is encoded as a stride of value 512 (since each element of the real and imaginary matrices is a 2B (16bit) floating point type). NOTE: A bug in cublasLtMatrixTransform() causes it to interpret the batch stride for a planar-complex matrix as if it was specified in number of complex elements. Therefore an offset of 1024B must be encoded as stride value 256 when calling cublasLtMatrixTransform() (each complex element is 4B with real and imaginary values 2B each). This behavior is expected to be corrected in the next major cuBLAS version.	int64_t
CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET	Stride (in bytes) to the imaginary plane for planar-complex layout. Default value is 0, indicating that the layout is regular (real and imaginary parts of complex numbers are interleaved in memory for each element).	int64_t
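
The leading dimension and batch stride descriptions above come down to simple index arithmetic. The following is a minimal sketch in plain C that makes no cuBLASLt calls; the helper names `demo_col_major_offset` and `demo_planar_batch_stride` are hypothetical and exist only for illustration. The second helper reproduces the 1024B example from the CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET row.

```c
#include <assert.h>
#include <stdint.h>

/* Element offset of entry (row, col) of matrix number `batch` in a
   column-major layout: consecutive columns are `ld` elements apart and
   consecutive matrices are `batch_stride` elements apart. */
int64_t demo_col_major_offset(int64_t row, int64_t col, int64_t ld,
                              int64_t batch, int64_t batch_stride) {
    assert(row >= 0 && row < ld); /* ld must cover a whole column */
    return batch * batch_stride + col * ld + row;
}

/* Batch stride value for a planar-complex layout, given the distance in
   bytes between consecutive matrices and the size in bytes of one real
   sub-element (2 for CUDA_C_16F). Pass count_complex_elements = 0 to count
   in real sub-elements (the cublasLtMatmul() interpretation described
   above) or 1 to count in complex elements (the cublasLtMatrixTransform()
   interpretation described above). */
int64_t demo_planar_batch_stride(int64_t offset_bytes, int64_t real_size_bytes,
                                 int count_complex_elements) {
    int64_t unit = count_complex_elements ? 2 * real_size_bytes
                                          : real_size_bytes;
    assert(unit > 0 && offset_bytes % unit == 0);
    return offset_bytes / unit;
}
```

For CUDA_C_16F with matrices 1024 bytes apart, `demo_planar_batch_stride(1024, 2, 0)` gives 512 and `demo_planar_batch_stride(1024, 2, 1)` gives 256, matching the two values quoted in the table.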

3.3.20. cublasLtMatrixTransformDesc_t

The cublasLtMatrixTransformDesc_t is a pointer to an opaque structure holding the description of a matrix transformation operation. Use cublasLtMatrixTransformDescCreate() to create one instance of the descriptor and cublasLtMatrixTransformDescDestroy() to destroy a previously created descriptor and release the resources.

3.3.21. cublasLtMatrixTransformDescAttributes_t

cublasLtMatrixTransformDescAttributes_t is a descriptor structure containing the attributes that define the specifics of the matrix transform operation. Use cublasLtMatrixTransformDescGetAttribute() and cublasLtMatrixTransformDescSetAttribute() to get and set the attribute value of a matrix transform descriptor.

Transform Attribute Name	Description	Data Type
CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE	Scale type. Inputs are converted to the scale type for scaling and summation, and results are then converted to the output type to store in the memory. For the supported data types see cudaDataType_t.	int32_t
CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE	Specifies whether the scalars alpha and beta are passed by reference on the host or on the device. Default value is: CUBLASLT_POINTER_MODE_HOST (i.e., on the host). See cublasLtPointerMode_t.	int32_t
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA	Specifies the type of operation that should be performed on the matrix A. Default value is: CUBLAS_OP_N (i.e., non-transpose operation). See cublasOperation_t.	int32_t
CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB	Specifies the type of operation that should be performed on the matrix B. Default value is: CUBLAS_OP_N (i.e., non-transpose operation). See cublasOperation_t.	int32_t

3.3.22. cublasLtOrder_t

cublasLtOrder_t is an enumerated type used to indicate the data ordering of the matrix.

Value	Data Order Description
CUBLASLT_ORDER_COL	Data is ordered in column-major format. The leading dimension is the stride (in elements) to the beginning of next column in memory.
CUBLASLT_ORDER_ROW	Data is ordered in row-major format. The leading dimension is the stride (in elements) to the beginning of next row in memory.
CUBLASLT_ORDER_COL32	Data is ordered in column-major ordered tiles of 32 columns. The leading dimension is the stride (in elements) to the beginning of next group of 32-columns. For example, if the matrix has 33 columns and 2 rows, then the leading dimension must be at least (32) * 2 = 64.
CUBLASLT_ORDER_COL4_4R2_8C	Data is ordered in column-major ordered tiles of composite tiles with total 32 columns and 8 rows. A tile is composed of interleaved inner tiles of 4 columns within 4 even or odd rows in an alternating pattern. The leading dimension is the stride (in elements) to the beginning of the first 32 column x 8 row tile for the next 32-wide group of columns. For example, if the matrix has 33 columns and 1 row, the leading dimension must be at least (32 * 8) * 1 = 256.
CUBLASLT_ORDER_COL32_2R_4R4	Data is ordered in column-major ordered tiles of composite tiles with total 32 columns and 32 rows. Element offset within the tile is calculated as (((row%8)/2*4+row/8)*2+row%2)*32+col. The leading dimension is the stride (in elements) to the beginning of the first 32 column x 32 row tile for the next 32-wide group of columns. For example, if the matrix has 33 columns and 1 row, the leading dimension must be at least (32 * 32) * 1 = 1024.
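
The worked examples in the table can be generalized into a small calculator. The sketch below is plain C with hypothetical helper names; the general minimum leading dimension formulas are an assumption extrapolated from the single examples given for each order (rows padded up to the tile height, times 32 columns). The last function writes out the CUBLASLT_ORDER_COL32_2R_4R4 in-tile offset with explicit multiplications.

```c
#include <assert.h>
#include <stdint.h>

/* Round `x` up to the next multiple of `m`. */
int64_t demo_round_up(int64_t x, int64_t m) {
    assert(m > 0 && x >= 0);
    return (x + m - 1) / m * m;
}

/* Assumed smallest leading dimension (in elements) for each tiled order.
   The column count does not enter: it only decides how many 32-wide
   groups of columns exist, each `ld` elements apart. */
int64_t demo_min_ld_col32(int64_t rows) {
    return 32 * rows;
}
int64_t demo_min_ld_col4_4r2_8c(int64_t rows) {
    return 32 * demo_round_up(rows, 8);
}
int64_t demo_min_ld_col32_2r_4r4(int64_t rows) {
    return 32 * demo_round_up(rows, 32);
}

/* Element offset inside one 32 column x 32 row tile of
   CUBLASLT_ORDER_COL32_2R_4R4. */
int demo_col32_2r_4r4_tile_offset(int row, int col) {
    assert(row >= 0 && row < 32 && col >= 0 && col < 32);
    return (((row % 8) / 2 * 4 + row / 8) * 2 + row % 2) * 32 + col;
}
```

With these helpers, the three examples from the table evaluate to 64, 256 and 1024 respectively, and the tile offset maps the 32 x 32 (row, col) pairs onto the range 0 to 1023.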

3.3.23. cublasLtPointerMode_t

cublasLtPointerMode_t is an enumerated type used to set the pointer mode for the scaling factors alpha and beta.

Value	Description
CUBLASLT_POINTER_MODE_HOST = CUBLAS_POINTER_MODE_HOST	Matches CUBLAS_POINTER_MODE_HOST, and the pointer targets a single value in host memory.
CUBLASLT_POINTER_MODE_DEVICE = CUBLAS_POINTER_MODE_DEVICE	Matches CUBLAS_POINTER_MODE_DEVICE, and the pointer targets a single value in device memory.
CUBLASLT_POINTER_MODE_DEVICE_VECTOR = 2	Pointers target device memory vectors of length equal to the number of rows of matrix D.
CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO = 3	`alpha` pointer targets a device memory vector of length equal to the number of rows of matrix D, and `beta` is zero.
CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST = 4	`alpha` pointer targets a device memory vector of length equal to the number of rows of matrix D, and `beta` is a single value in host memory.

3.3.24. cublasLtPointerModeMask_t

cublasLtPointerModeMask_t is an enumerated type used to define and query the pointer mode capability.

Value	Description
CUBLASLT_POINTER_MODE_MASK_HOST = 1	See CUBLASLT_POINTER_MODE_HOST in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_DEVICE = 2	See CUBLASLT_POINTER_MODE_DEVICE in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_DEVICE_VECTOR = 4	See CUBLASLT_POINTER_MODE_DEVICE_VECTOR in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_ALPHA_DEVICE_VECTOR_BETA_ZERO = 8	See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO in cublasLtPointerMode_t.
CUBLASLT_POINTER_MODE_MASK_ALPHA_DEVICE_VECTOR_BETA_HOST = 16	See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST in cublasLtPointerMode_t.
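
The mask values listed above (1, 2, 4, 8, 16) look like one bit per pointer mode. The sketch below assumes that relationship, mask = 1 << mode, and also assumes that CUBLAS_POINTER_MODE_HOST and CUBLAS_POINTER_MODE_DEVICE have the values 0 and 1. It uses local stand-in enumerators (the `DEMO_` names are hypothetical) so that it compiles without the cuBLASLt headers; real code would use the library's own enumerators.

```c
#include <assert.h>

/* Local stand-ins mirroring the values listed in the two tables. */
enum demo_pointer_mode {
    DEMO_MODE_HOST = 0,          /* assumed value of CUBLAS_POINTER_MODE_HOST */
    DEMO_MODE_DEVICE = 1,        /* assumed value of CUBLAS_POINTER_MODE_DEVICE */
    DEMO_MODE_DEVICE_VECTOR = 2,
    DEMO_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO = 3,
    DEMO_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST = 4
};

/* Mask bit for a mode: 1, 2, 4, 8, 16 for modes 0 through 4. */
unsigned demo_mode_to_mask(enum demo_pointer_mode mode) {
    assert((int)mode >= 0 && (int)mode <= 4);
    return 1u << (unsigned)mode;
}

/* Returns 1 if the capability mask `caps` includes `mode`, 0 otherwise. */
int demo_mode_supported(unsigned caps, enum demo_pointer_mode mode) {
    return (caps & demo_mode_to_mask(mode)) != 0;
}
```

A capability mask of 3, for example, would report host and device scalar modes as supported and the vector modes as unsupported.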

3.3.25. cublasLtReductionScheme_t

cublasLtReductionScheme_t is an enumerated type used to specify a reduction scheme for the portions of the dot-product calculated in parallel (i.e., “split-K”).

Value	Description
CUBLASLT_REDUCTION_SCHEME_NONE	Do not apply reduction. The dot-product will be performed in one sequence.
CUBLASLT_REDUCTION_SCHEME_INPLACE	Reduction is performed “in place” using the output buffer, parts are added up in the output data type. Workspace is only used for counters that guarantee sequentiality.
CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE	Reduction done out of place in a user-provided workspace. The intermediate results are stored in the compute type in the workspace and reduced in a separate step.
CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE	Reduction done out of place in a user-provided workspace. The intermediate results are stored in the output type in the workspace and reduced in a separate step.
CUBLASLT_REDUCTION_SCHEME_MASK	Allows all reduction schemes.
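
To make the split-K idea concrete, here is a CPU-only sketch in plain C. It is not how the library implements the reduction; it only illustrates the out-of-place pattern in which each part of the dot product is accumulated separately, stored in a scratch array (standing in for the workspace, with `double` standing in for the compute type), and summed in a separate step. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SPLITS 16

/* Dot product of a[0..k) and b[0..k) computed as `splits` independent
   partial sums, followed by a separate reduction over the partials. */
double demo_split_k_dot(const float *a, const float *b, size_t k,
                        size_t splits) {
    assert(splits > 0 && splits <= DEMO_MAX_SPLITS);
    double partial[DEMO_MAX_SPLITS] = {0.0}; /* stands in for the workspace */
    size_t chunk = (k + splits - 1) / splits;
    for (size_t s = 0; s < splits; ++s) {
        size_t begin = s * chunk;
        size_t end = begin + chunk < k ? begin + chunk : k;
        for (size_t i = begin; i < end; ++i)
            partial[s] += (double)a[i] * (double)b[i];
    }
    double result = 0.0;
    for (size_t s = 0; s < splits; ++s) /* separate reduction step */
        result += partial[s];
    return result;
}

/* Worked case: a = b = {1, 2, ..., 8}, so the dot product is 204. */
double demo_split_k_example(size_t splits) {
    float a[8], b[8];
    for (size_t i = 0; i < 8; ++i) {
        a[i] = (float)(i + 1);
        b[i] = (float)(i + 1);
    }
    return demo_split_k_dot(a, b, 8, splits);
}
```

With small integer inputs the result is the same for any number of splits. With general floating-point data the summation order changes with the split count, which is one reason the choice of reduction scheme and storage type can affect numerical results.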

3.4. cuBLASLt API Reference

3.4.1. cublasLtCreate()

cublasStatus_t
      cublasLtCreate(cublasLtHandle_t *lightHandle)

This function initializes the cuBLASLt library and creates a handle to an opaque structure holding the cuBLASLt library context. It allocates light hardware resources on the host and device, and must be called prior to making any other cuBLASLt library calls.

The cuBLASLt library context is tied to the current CUDA device. To use the library on multiple devices, one cuBLASLt handle should be created for each device.

Parameters:

Parameter	Memory	Input / Output	Description
lightHandle		Output	Pointer to the allocated cuBLASLt handle for the created cuBLASLt context.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	The allocation completed successfully.
CUBLAS_STATUS_NOT_INITIALIZED	The cuBLASLt library was not initialized. This usually happens when cublasLtCreate() is not called first, when an error occurs in the CUDA Runtime API called by the cuBLASLt routine, or when there is an error in the hardware setup.
CUBLAS_STATUS_ALLOC_FAILED	Resource allocation failed inside the cuBLASLt library. This is usually caused by a cudaMalloc() failure. To correct: prior to the function call, deallocate the previously allocated memory as much as possible.
`CUBLAS_STATUS_INVALID_VALUE`	`lightHandle` == NULL

See cublasStatus_t for a complete list of valid return codes.

3.4.2. cublasLtDestroy()

cublasStatus_t
      cublasLtDestroy(cublasLtHandle_t lightHandle)

This function releases hardware resources used by the cuBLASLt library. This function is usually the last call with a particular handle to the cuBLASLt library. Because cublasLtCreate() allocates some internal resources and the release of those resources by calling cublasLtDestroy() will implicitly call cudaDeviceSynchronize(), it is recommended to minimize the number of times these functions are called.

Parameters:

Parameter	Memory	Input / Output	Description
lightHandle		Input	Pointer to the cuBLASLt handle to be destroyed.

Returns:

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The cuBLASLt context was successfully destroyed.
`CUBLAS_STATUS_NOT_INITIALIZED`	The cuBLASLt library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	`lightHandle` == NULL

See cublasStatus_t for a complete list of valid return codes.

3.4.3. cublasLtDisableCpuInstructionsSetMask()

unsigned cublasLtDisableCpuInstructionsSetMask(unsigned mask);

Instructs cuBLASLt library to not use CPU instructions specified by the flags in the mask. The function takes precedence over the CUBLASLT_DISABLE_CPU_INSTRUCTIONS_MASK environment variable.

Parameters: mask – the flags combined with bitwise OR(|) operator that specify which CPU instructions should not be used.

Supported flags:

Value	Description
`0x1`	x86-64 AVX512 ISA.

Returns: the previous value of the mask.

3.4.4. cublasLtGetCudartVersion()

size_t cublasLtGetCudartVersion(void);

This function returns the version number of the CUDA Runtime library.

Parameters: None.

Returns: size_t - The version number of the CUDA Runtime library.

3.4.5. cublasLtGetProperty()

cublasStatus_t cublasLtGetProperty(libraryPropertyType type, int *value);

This function returns the value of the requested property by writing it to the memory location pointed to by the value parameter.

Parameters:

Parameter	Memory	Input / Output	Description
type		Input	Of the type `libraryPropertyType`, whose value is requested from the property. See libraryPropertyType_t.
value		Output	Pointer to the host memory location where the requested information should be written.

Returns:

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The requested `libraryPropertyType` information is successfully written at the provided address.
`CUBLAS_STATUS_INVALID_VALUE`	The `type` input argument has an invalid value, or `value` == NULL.

See cublasStatus_t for a complete list of valid return codes.

3.4.6. cublasLtGetStatusName()

const char* cublasLtGetStatusName(cublasStatus_t status);

Returns the string representation of a given status.

Parameters: cublasStatus_t - the status.

Returns: const char* - the NULL-terminated string.

3.4.7. cublasLtGetStatusString()

const char* cublasLtGetStatusString(cublasStatus_t status);

Returns the description string for a given status.

Parameters: cublasStatus_t - the status.

Returns: const char* - the NULL-terminated string.

3.4.8. cublasLtHeuristicsCacheGetCapacity()

cublasStatus_t cublasLtHeuristicsCacheGetCapacity(size_t* capacity);

Returns the Heuristics Cache capacity.

Parameters:

Parameter	Description
`capacity`	The pointer to the returned capacity value.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	The capacity was successfully written.
`CUBLAS_STATUS_INVALID_VALUE`	An unsupported value or parameter was passed to the function (for example, `capacity` == NULL).

3.4.9. cublasLtHeuristicsCacheSetCapacity()

cublasStatus_t cublasLtHeuristicsCacheSetCapacity(size_t capacity);

Sets the Heuristics Cache capacity. Set the capacity to 0 to disable the heuristics cache.

This function takes precedence over CUBLASLT_HEURISTICS_CACHE_CAPACITY environment variable.

Parameters:

Parameter	Description
`capacity`	The desirable heuristics cache capacity.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	The capacity was successfully set.

3.4.10. cublasLtGetVersion()

size_t cublasLtGetVersion(void);

This function returns the version number of cuBLASLt library.

Parameters: None.

Returns: size_t - The version number of cuBLASLt library.

3.4.11. cublasLtLoggerSetCallback()

cublasStatus_t cublasLtLoggerSetCallback(cublasLtLoggerCallback_t callback);

Experimental: This function sets the logging callback function.

Parameters:

Parameter	Memory	Input / Output	Description
callback		Input	Pointer to a callback function. See cublasLtLoggerCallback_t.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the callback function was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.12. cublasLtLoggerSetFile()

cublasStatus_t cublasLtLoggerSetFile(FILE* file);

Experimental: This function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Parameters:

Parameter	Memory	Input / Output	Description
file		Input	Pointer to an open file. File should have write permission.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If logging file was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.13. cublasLtLoggerOpenFile()

cublasStatus_t cublasLtLoggerOpenFile(const char* logFile);

Experimental: This function opens a logging output file in the given path.

Parameters:

Parameter	Memory	Input / Output	Description
logFile		Input	Path of the logging output file.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the logging file was successfully opened.

See cublasStatus_t for a complete list of valid return codes.

3.4.14. cublasLtLoggerSetLevel()

cublasStatus_t cublasLtLoggerSetLevel(int level);

Experimental: This function sets the value of the logging level.

Parameters:

Parameter	Memory	Input / Output	Description
level		Input	Value of the logging level. See cuBLASLt Logging.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If the value was not a valid logging level. See cuBLASLt Logging.
CUBLAS_STATUS_SUCCESS	If the logging level was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.15. cublasLtLoggerSetMask()

cublasStatus_t cublasLtLoggerSetMask(int mask);

Experimental: This function sets the value of the logging mask.

Parameters:

Parameter	Memory	Input / Output	Description
mask		Input	Value of the logging mask. See cuBLASLt Logging.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the logging mask was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.16. cublasLtLoggerForceDisable()

cublasStatus_t cublasLtLoggerForceDisable();

Experimental: This function disables logging for the entire run.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If logging was successfully disabled.

See cublasStatus_t for a complete list of valid return codes.

3.4.17. cublasLtMatmul()

cublasStatus_t cublasLtMatmul(
      cublasLtHandle_t               lightHandle,
      cublasLtMatmulDesc_t           computeDesc,
      const void                    *alpha,
      const void                    *A,
      cublasLtMatrixLayout_t         Adesc,
      const void                    *B,
      cublasLtMatrixLayout_t         Bdesc,
      const void                    *beta,
      const void                    *C,
      cublasLtMatrixLayout_t         Cdesc,
      void                          *D,
      cublasLtMatrixLayout_t         Ddesc,
      const cublasLtMatmulAlgo_t    *algo,
      void                          *workspace,
      size_t                         workspaceSizeInBytes,
      cudaStream_t                   stream);

This function computes the matrix multiplication of matrices A and B to produce the output matrix D, according to the following operation:

D = alpha*(A*B) + beta*(C),

where A, B, and C are input matrices, and alpha and beta are input scalars.

Note

This function supports both in-place matrix multiplication (C == D and Cdesc == Ddesc) and out-of-place matrix multiplication (C != D, both matrices must have the same data type, number of rows, number of columns, batch size, and memory order). In the out-of-place case, the leading dimension of C can be different from the leading dimension of D. Specifically the leading dimension of C can be 0 to achieve row or column broadcast. If Cdesc is omitted, this function assumes it to be equal to Ddesc.
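
As a point of reference for the operation and the note above, the following plain C sketch computes D = alpha*(A*B) + beta*C on the CPU for column-major float matrices. It is not the library's implementation, and the function names are hypothetical. In this sketch a leading dimension of 0 for C makes every column read the same m values, which is one way to picture the broadcast case mentioned in the note.

```c
#include <assert.h>
#include <stddef.h>

/* Reference D = alpha * (A * B) + beta * C for column-major matrices.
   A is m x k, B is k x n, C and D are m x n; lda, ldb, ldc and ldd are
   leading dimensions in elements. With ldc == 0 all columns of C alias
   the same m values, so one column is broadcast across D. */
void demo_matmul_ref(size_t m, size_t n, size_t k, float alpha,
                     const float *A, size_t lda, const float *B, size_t ldb,
                     float beta, const float *C, size_t ldc,
                     float *D, size_t ldd) {
    assert(lda >= m && ldb >= k && ldd >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            /* Read C before writing D so that C == D (in place) works. */
            float c = C[i + j * ldc];
            D[i + j * ldd] = alpha * acc + beta * c;
        }
    }
}

/* Worked case: A is the 2 x 2 identity, B has columns {1, 3} and {2, 4},
   C is the single column {10, 20} broadcast with ldc = 0, and
   alpha = beta = 1. Returns element (i, j) of D. */
float demo_matmul_broadcast_example(size_t i, size_t j) {
    const float A[4] = {1, 0, 0, 1};
    const float B[4] = {1, 3, 2, 4};
    const float C[2] = {10, 20};
    float D[4] = {0, 0, 0, 0};
    assert(i < 2 && j < 2);
    demo_matmul_ref(2, 2, 2, 1.0f, A, 2, B, 2, 1.0f, C, 0, D, 2);
    return D[i + j * 2];
}
```

In the worked case D equals B with 10 added to the first row and 20 added to the second row.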

The workspace pointer has to be aligned to at least 256 bytes. The recommendations on workspaceSizeInBytes are the same as mentioned in the cublasSetWorkspace() section.
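
Memory returned directly by cudaMalloc() is generally aligned to at least 256 bytes, so the alignment requirement matters mostly when a workspace is carved out of a larger pool. The sketch below shows the address arithmetic in plain C; the helper names are hypothetical and nothing here calls into CUDA or cuBLASLt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WORKSPACE_ALIGNMENT ((uintptr_t)256)

/* Returns 1 if `addr` is a multiple of 256, 0 otherwise. */
int demo_is_workspace_aligned(uintptr_t addr) {
    return (addr & (DEMO_WORKSPACE_ALIGNMENT - 1)) == 0;
}

/* Rounds `addr` up to the next 256-byte boundary. */
uintptr_t demo_align_up(uintptr_t addr) {
    /* Guard against wrap-around at the top of the address range. */
    assert(addr <= UINTPTR_MAX - (DEMO_WORKSPACE_ALIGNMENT - 1));
    return (addr + DEMO_WORKSPACE_ALIGNMENT - 1) &
           ~(DEMO_WORKSPACE_ALIGNMENT - 1);
}

/* Bytes still usable as workspace once the start of a `size`-byte region
   at `addr` has been moved up to an aligned address. */
size_t demo_usable_workspace(uintptr_t addr, size_t size) {
    size_t skipped = (size_t)(demo_align_up(addr) - addr);
    return skipped < size ? size - skipped : 0;
}
```

When sub-allocating, the value passed as workspaceSizeInBytes would be the usable size after alignment, not the size of the original region.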

Datatypes Supported:

cublasLtMatmul() supports the following computeType, scaleType, Atype/Btype, and Ctype. Footnotes can be found at the end of this section.

Table 1. When A, B, C, and D are Regular Column- or Row-major Matrices
computeType	scaleType	Atype/Btype	Ctype	Bias Type 5
CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_16F_PEDANTIC	CUDA_R_16F	CUDA_R_16F	CUDA_R_16F	CUDA_R_16F 5
CUBLAS_COMPUTE_32I or CUBLAS_COMPUTE_32I_PEDANTIC	CUDA_R_32I	CUDA_R_8I	CUDA_R_32I	Non-default epilogue not supported.
CUBLAS_COMPUTE_32I or CUBLAS_COMPUTE_32I_PEDANTIC	CUDA_R_32F	CUDA_R_8I	CUDA_R_8I	Non-default epilogue not supported.
CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC	CUDA_R_32F	CUDA_R_16BF	CUDA_R_16BF	CUDA_R_16BF 5
		CUDA_R_16F	CUDA_R_16F	CUDA_R_16F 5
		CUDA_R_8I	CUDA_R_32F	Non-default epilogue not supported.
		CUDA_R_16BF	CUDA_R_32F	CUDA_R_32F 5
		CUDA_R_16F	CUDA_R_32F	CUDA_R_32F 5
		CUDA_R_32F	CUDA_R_32F	CUDA_R_32F 5
	CUDA_C_32F 6	CUDA_C_8I 6	CUDA_C_32F 6	Non-default epilogue not supported.
	CUDA_C_32F 6	CUDA_C_32F 6	CUDA_C_32F 6
CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32	CUDA_R_32F	CUDA_R_32F	CUDA_R_32F	CUDA_R_32F 5
	CUDA_C_32F 6	CUDA_C_32F 6	CUDA_C_32F 6	Non-default epilogue not supported.
CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC	CUDA_R_64F	CUDA_R_64F	CUDA_R_64F	CUDA_R_64F 5
CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC	CUDA_C_64F 6	CUDA_C_64F 6	CUDA_C_64F 6	Non-default epilogue not supported.

To use IMMA kernels, one of the following sets of requirements, with the first being the preferred one, must be met:

Using a regular data ordering:
- All matrix pointers must be 4-byte aligned. For even better performance, this condition should hold with 16 instead of 4.
- Leading dimensions of matrices A, B, C must be multiples of 4.
- Only the “TN” format is supported - A must be transposed and B non-transposed.
- Pointer mode can be CUBLASLT_POINTER_MODE_HOST, CUBLASLT_POINTER_MODE_DEVICE or CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST. With the latter mode, the kernels support the CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE attribute.
- Dimensions m and k must be multiples of 4.
Using the IMMA-specific data ordering on Ampere or Turing (but not Hopper) architecture - CUBLASLT_ORDER_COL32 for matrices A, C, D, and CUBLASLT_ORDER_COL4_4R2_8C (on Turing or Ampere architecture) or CUBLASLT_ORDER_COL32_2R_4R4 (on Ampere architecture) for matrix B:
- Leading dimensions of matrices A, B, C must fulfill conditions specific to the memory ordering (see cublasLtOrder_t).
- Matmul descriptor must specify CUBLAS_OP_T on matrix B and CUBLAS_OP_N (default) on matrix A and C.
- If scaleType CUDA_R_32I is used, the only supported values for alpha and beta are 0 or 1.
- Pointer mode can be CUBLASLT_POINTER_MODE_HOST, CUBLASLT_POINTER_MODE_DEVICE, CUBLASLT_POINTER_MODE_DEVICE_VECTOR or CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO. These kernels do not support CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE.
- Only the “NT” format is supported - A must be non-transposed and B transposed.
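
The first set of requirements (regular data ordering) can be expressed as a pre-flight check before attempting an IMMA matmul. The following plain C sketch is an illustration only: the `demo_imma_problem` structure and the function names are hypothetical, pointer mode is not modeled, and passing the check does not guarantee that an IMMA kernel will be selected.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bundle of the properties named in the checklist. */
struct demo_imma_problem {
    const void *A, *B, *C, *D; /* matrix pointers */
    int64_t lda, ldb, ldc;     /* leading dimensions in elements */
    int64_t m, k;              /* problem dimensions */
    int a_transposed;          /* non-zero if A is used transposed */
    int b_transposed;          /* non-zero if B is used transposed */
};

static int demo_aligned4(const void *p) {
    return ((uintptr_t)p % 4) == 0;
}

/* Returns 1 if the problem meets the regular-ordering conditions listed
   above (4-byte aligned pointers, leading dimensions and m, k multiples
   of 4, "TN" format), 0 otherwise. */
int demo_imma_regular_ok(const struct demo_imma_problem *p) {
    assert(p != 0);
    return demo_aligned4(p->A) && demo_aligned4(p->B) &&
           demo_aligned4(p->C) && demo_aligned4(p->D) &&
           p->lda % 4 == 0 && p->ldb % 4 == 0 && p->ldc % 4 == 0 &&
           p->m % 4 == 0 && p->k % 4 == 0 &&
           p->a_transposed && !p->b_transposed;
}
```

A caller that fails this check could pad the leading dimensions to a multiple of 4 or fall back to a non-IMMA configuration.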

Table 2. When A, B, C, and D Use Layouts for IMMA
computeType	scaleType	Atype/Btype	Ctype	Bias Type
CUBLAS_COMPUTE_32I or CUBLAS_COMPUTE_32I_PEDANTIC	CUDA_R_32I	CUDA_R_8I	CUDA_R_32I	Non-default epilogue not supported.
CUBLAS_COMPUTE_32I or CUBLAS_COMPUTE_32I_PEDANTIC	CUDA_R_32F	CUDA_R_8I	CUDA_R_8I	CUDA_R_32F

To use FP8 kernels, the following set of requirements must be satisfied:

All matrix pointers must be 16-byte aligned.
A must be transposed and B non-transposed (The “TN” format).
The compute type must be CUBLAS_COMPUTE_32F.
The scale type must be CUDA_R_32F.

See the table below when using FP8 kernels:

Table 3. When A, B, C, and D Use Layouts for FP8
AType	BType	CType	DType	Bias Type
CUDA_R_8F_E4M3	CUDA_R_8F_E4M3	CUDA_R_16BF	CUDA_R_16BF	CUDA_R_16BF 5
		CUDA_R_16BF	CUDA_R_8F_E4M3	CUDA_R_16BF 5
		CUDA_R_16F	CUDA_R_8F_E4M3	CUDA_R_16F 5
		CUDA_R_16F	CUDA_R_16F	CUDA_R_16F 5
		CUDA_R_32F	CUDA_R_32F	CUDA_R_16BF 5
	CUDA_R_8F_E5M2	CUDA_R_16BF	CUDA_R_16BF	CUDA_R_16BF 5
			CUDA_R_8F_E4M3	CUDA_R_16BF 5
			CUDA_R_8F_E5M2	CUDA_R_16BF 5
		CUDA_R_16F	CUDA_R_8F_E4M3	CUDA_R_16F 5
			CUDA_R_8F_E5M2	CUDA_R_16F 5
			CUDA_R_16F	CUDA_R_16F 5
		CUDA_R_32F	CUDA_R_32F	CUDA_R_16BF 5
CUDA_R_8F_E5M2	CUDA_R_8F_E4M3	CUDA_R_16BF	CUDA_R_16BF	CUDA_R_16BF 5
			CUDA_R_8F_E4M3	CUDA_R_16BF 5
			CUDA_R_8F_E5M2	CUDA_R_16BF 5
		CUDA_R_16F	CUDA_R_8F_E4M3	CUDA_R_16F 5
			CUDA_R_8F_E5M2	CUDA_R_16F 5
			CUDA_R_16F	CUDA_R_16F 5
		CUDA_R_32F	CUDA_R_32F	CUDA_R_16BF 5

Finally, see the table below when A, B, C, and D are planar-complex matrices (CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET != 0, see cublasLtMatrixLayoutAttribute_t) to make use of mixed precision tensor core acceleration.

Table 4. When A, B, C, and D are Planar-Complex Matrices
computeType	scaleType	Atype/Btype	Ctype
CUBLAS_COMPUTE_32F	CUDA_C_32F	CUDA_C_16F 6	CUDA_C_16F 6
		CUDA_C_16F 6	CUDA_C_32F 6
		CUDA_C_16BF 6	CUDA_C_16BF 6
		CUDA_C_16BF 6	CUDA_C_32F 6

NOTES:

5: ReLU, dReLU, GELU, dGELU and Bias epilogue modes (see CUBLASLT_MATMUL_DESC_EPILOGUE in cublasLtMatmulDescAttributes_t) are not supported when D matrix memory order is defined as CUBLASLT_ORDER_ROW. For best performance when using the bias vector, specify zero beta and set pointer mode to CUBLASLT_POINTER_MODE_HOST.
6: Use of CUBLASLT_ORDER_ROW together with CUBLAS_OP_C (Hermitian operator) is not supported unless all of A, B, C, and D matrices use the CUBLASLT_ORDER_ROW ordering.

Parameters:

Parameter	Memory	Input / Output	Description
lightHandle		Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
computeDesc		Input	Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
alpha, beta	Device or host	Input	Pointers to the scalars used in the multiplication.
A, B, and C	Device	Input	Pointers to the GPU memory associated with the corresponding descriptors Adesc, Bdesc and Cdesc.
Adesc, Bdesc and Cdesc		Input	Handles to the previously created descriptors of the type cublasLtMatrixLayout_t.
D	Device	Output	Pointer to the GPU memory associated with the descriptor Ddesc.
Ddesc		Input	Handle to the previously created descriptor of the type cublasLtMatrixLayout_t.
algo		Input	Handle for matrix multiplication algorithm to be used. See cublasLtMatmulAlgo_t. When NULL, an implicit heuristics query with default search preferences will be performed to determine the actual algorithm to use.
workspace	Device		Pointer to the workspace buffer allocated in the GPU memory. The pointer must be aligned to at least 256 bytes (i.e., the lowest 8 bits of the address must be 0).
workspaceSizeInBytes		Input	Size of the workspace.
stream	Host	Input	The CUDA stream where all the GPU work will be submitted.

Returns:

Return Value	Description
CUBLAS_STATUS_NOT_INITIALIZED	If cuBLASLt handle has not been initialized.
CUBLAS_STATUS_INVALID_VALUE	If the parameters are unexpectedly NULL, in conflict or in an impossible configuration. For example, when `workspaceSizeInBytes` is less than workspace required by the configured algo.
CUBLAS_STATUS_NOT_SUPPORTED	If the current implementation on the selected device doesn’t support the configured operation.
CUBLAS_STATUS_ARCH_MISMATCH	If the configured operation cannot be run using the selected device.
CUBLAS_STATUS_EXECUTION_FAILED	If CUDA reported an execution error from the device.
CUBLAS_STATUS_SUCCESS	If the operation completed successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.18. cublasLtMatmulAlgoCapGetAttribute()

cublasStatus_t cublasLtMatmulAlgoCapGetAttribute(
      const cublasLtMatmulAlgo_t *algo,
      cublasLtMatmulAlgoCapAttributes_t attr,
      void *buf,
      size_t sizeInBytes,
      size_t *sizeWritten);

This function returns the value of the queried capability attribute for an initialized cublasLtMatmulAlgo_t descriptor structure. The capability attribute value is retrieved from the enumerated type cublasLtMatmulAlgoCapAttributes_t.

For example, to get the list of supported tile IDs:

cublasLtMatmulTile_t tiles[CUBLASLT_MATMUL_TILE_END];
size_t num_tiles = 0, size_written = 0;
if (cublasLtMatmulAlgoCapGetAttribute(algo, CUBLASLT_ALGO_CAP_TILE_IDS, tiles,
                                      sizeof(tiles), &size_written) == CUBLAS_STATUS_SUCCESS) {
  /* size_written is in bytes; convert it to a number of tile IDs */
  num_tiles = size_written / sizeof(tiles[0]);
}

Parameters:

Parameter	Input / Output	Description
algo	Input	Pointer to the previously created opaque structure holding the matrix multiply algorithm descriptor. See cublasLtMatmulAlgo_t.
attr	Input	The capability attribute whose value will be retrieved by this function. See cublasLtMatmulAlgoCapAttributes_t.
buf	Output	The attribute value returned by this function.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.
sizeWritten	Output	Valid only when the return value is CUBLAS_STATUS_SUCCESS. If `sizeInBytes` is non-zero: then `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0: then `sizeWritten` is the number of bytes needed to write full contents.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.
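The `sizeInBytes` / `sizeWritten` rules allow a two-step query: call once with `sizeInBytes` set to 0 to learn the required size, allocate, then call again to fetch the value. The sketch below illustrates that protocol against a stand-in function. `mock_get_attribute`, `query_ids` and the `MOCK_*` codes are hypothetical and are not cuBLASLt symbols; the mock also simplifies the size check by accepting any buffer that is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { MOCK_SUCCESS = 0, MOCK_INVALID_VALUE = 1 };

/* Hypothetical stand-in for an attribute getter. It serves a fixed array of
   three IDs and follows the sizeInBytes/sizeWritten rules from the table. */
static int mock_get_attribute(void *buf, size_t sizeInBytes, size_t *sizeWritten)
{
    static const int stored[3] = {11, 22, 33};
    if (sizeInBytes == 0) {
        if (sizeWritten == NULL) return MOCK_INVALID_VALUE;
        *sizeWritten = sizeof(stored);              /* bytes needed */
        return MOCK_SUCCESS;
    }
    if (buf == NULL || sizeInBytes < sizeof(stored)) return MOCK_INVALID_VALUE;
    memcpy(buf, stored, sizeof(stored));
    if (sizeWritten != NULL) *sizeWritten = sizeof(stored); /* bytes written */
    return MOCK_SUCCESS;
}

/* Two-step query: ask for the size, allocate, then fetch. Caller frees. */
static int *query_ids(size_t *count)
{
    size_t needed = 0, written = 0;
    int *ids;
    assert(count != NULL);
    *count = 0;
    if (mock_get_attribute(NULL, 0, &needed) != MOCK_SUCCESS) return NULL;
    ids = (int *)malloc(needed);
    if (ids == NULL) return NULL;
    if (mock_get_attribute(ids, needed, &written) != MOCK_SUCCESS) {
        free(ids);
        return NULL;
    }
    *count = written / sizeof(ids[0]);
    return ids;
}
```

With a real getter, the same two calls would be made with the function's own leading arguments (for instance `algo` and `attr`) in front of `buf`.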

3.4.19. cublasLtMatmulAlgoCheck()

cublasStatus_t cublasLtMatmulAlgoCheck(
      cublasLtHandle_t lightHandle,
      cublasLtMatmulDesc_t operationDesc,
      cublasLtMatrixLayout_t Adesc,
      cublasLtMatrixLayout_t Bdesc,
      cublasLtMatrixLayout_t Cdesc,
      cublasLtMatrixLayout_t Ddesc,
      const cublasLtMatmulAlgo_t *algo,
      cublasLtMatmulHeuristicResult_t *result);

This function checks the correctness of a matrix multiply algorithm descriptor for the operation performed by cublasLtMatmul() with the given input matrices A, B and C, and the output matrix D. It checks whether the descriptor is supported on the current device, and returns a result containing the required workspace size and the calculated wave count.

Note

CUBLAS_STATUS_SUCCESS doesn’t fully guarantee that the algo will run. The algo will fail if, for example, the buffers are not correctly aligned. However, if cublasLtMatmulAlgoCheck() fails, the algo will not run.

Parameters:

Parameter	Input / Output	Description
lightHandle	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
operationDesc	Input	Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
Adesc, Bdesc, Cdesc, and Ddesc	Input	Handles to the previously created matrix layout descriptors of the type cublasLtMatrixLayout_t.
algo	Input	Descriptor which specifies which matrix multiplication algorithm should be used. See cublasLtMatmulAlgo_t. May point to `result->algo`.
result	Output	Pointer to the structure holding the results returned by this function. The results comprise the required workspace size and the calculated wave count. The algo field is never updated. See cublasLtMatmulHeuristicResult_t.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If matrix layout descriptors or the operation descriptor do not match the algo descriptor.
CUBLAS_STATUS_NOT_SUPPORTED	If the algo configuration or data type combination is not currently supported on the given device.
CUBLAS_STATUS_ARCH_MISMATCH	If the algo configuration cannot be run using the selected device.
CUBLAS_STATUS_SUCCESS	If the check was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.20. cublasLtMatmulAlgoConfigGetAttribute()

cublasStatus_t cublasLtMatmulAlgoConfigGetAttribute(
      const cublasLtMatmulAlgo_t *algo,
      cublasLtMatmulAlgoConfigAttributes_t attr,
      void *buf,
      size_t sizeInBytes,
      size_t *sizeWritten);

This function returns the value of the queried configuration attribute for an initialized cublasLtMatmulAlgo_t descriptor. The configuration attribute value is retrieved from the enumerated type cublasLtMatmulAlgoConfigAttributes_t.

Parameters:

Parameter	Input / Output	Description
algo	Input	Pointer to the previously created opaque structure holding the matrix multiply algorithm descriptor. See cublasLtMatmulAlgo_t.
attr	Input	The configuration attribute whose value will be retrieved by this function. See cublasLtMatmulAlgoConfigAttributes_t.
buf	Output	The attribute value returned by this function.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.
sizeWritten	Output	Valid only when the return value is CUBLAS_STATUS_SUCCESS. If `sizeInBytes` is non-zero: then `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0: then `sizeWritten` is the number of bytes needed to write full contents.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.21. cublasLtMatmulAlgoConfigSetAttribute()

cublasStatus_t cublasLtMatmulAlgoConfigSetAttribute(
      cublasLtMatmulAlgo_t *algo,
      cublasLtMatmulAlgoConfigAttributes_t attr,
      const void *buf,
      size_t sizeInBytes);

This function sets the value of the specified configuration attribute for an initialized cublasLtMatmulAlgo_t descriptor. The configuration attribute is an enumerant of the type cublasLtMatmulAlgoConfigAttributes_t.

Parameters:

Parameter	Input / Output	Description
algo	Input	Pointer to the previously created opaque structure holding the matrix multiply algorithm descriptor. See cublasLtMatmulAlgo_t.
attr	Input	The configuration attribute whose value will be set by this function. See cublasLtMatmulAlgoConfigAttributes_t.
buf	Input	The value to which the configuration attribute should be set.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.22. cublasLtMatmulAlgoGetHeuristic()

cublasStatus_t cublasLtMatmulAlgoGetHeuristic(
      cublasLtHandle_t lightHandle,
      cublasLtMatmulDesc_t operationDesc,
      cublasLtMatrixLayout_t Adesc,
      cublasLtMatrixLayout_t Bdesc,
      cublasLtMatrixLayout_t Cdesc,
      cublasLtMatrixLayout_t Ddesc,
      cublasLtMatmulPreference_t preference,
      int requestedAlgoCount,
      cublasLtMatmulHeuristicResult_t heuristicResultsArray[],
      int *returnAlgoCount);

This function retrieves the possible algorithms for the matrix multiply operation performed by cublasLtMatmul() with the given input matrices A, B and C, and the output matrix D. The output is placed in heuristicResultsArray[] in order of increasing estimated compute time.

Parameters:

Parameter	Input / Output	Description
lightHandle	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
operationDesc	Input	Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
Adesc, Bdesc, Cdesc, and Ddesc	Input	Handles to the previously created matrix layout descriptors of the type cublasLtMatrixLayout_t.
preference	Input	Pointer to the structure holding the heuristic search preferences descriptor. See cublasLtMatmulPreference_t.
requestedAlgoCount	Input	Size of the `heuristicResultsArray` (in elements). This is the requested maximum number of algorithms to return.
heuristicResultsArray[]	Output	Array containing the algorithm heuristics and associated runtime characteristics, returned by this function, in the order of increasing estimated compute time.
returnAlgoCount	Output	Number of algorithms returned by this function. This is the number of `heuristicResultsArray` elements written.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `requestedAlgoCount` is less than or equal to zero.
CUBLAS_STATUS_NOT_SUPPORTED	If no heuristic function is available for the current configuration.
CUBLAS_STATUS_SUCCESS	If the query was successful. Inspect `heuristicResultsArray[0 to (returnAlgoCount - 1)].state` for the status of the results.

See cublasStatus_t for a complete list of valid return codes.

Note

This function may load some kernels using the CUDA Driver API, which may fail when there is no available GPU memory. Do not allocate the entire VRAM before running cublasLtMatmulAlgoGetHeuristic().
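Because results arrive in order of increasing estimated compute time and each entry carries its own `state`, a caller would typically take the first entry that succeeded and fits its workspace budget. The sketch below shows only that selection step. `HeuristicEntry` is a simplified, hypothetical mirror of a result entry (the `workspaceSize` field name and the use of 0 for success are assumptions of this sketch), and `pick_first_usable` is not a cuBLASLt function.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical mirror of one heuristic result entry. Only the
   fields needed by the selection logic are modelled. */
typedef struct {
    int    state;          /* 0 stands in for CUBLAS_STATUS_SUCCESS */
    size_t workspaceSize;  /* bytes of workspace the algorithm requires */
} HeuristicEntry;

/* Returns the index of the first usable entry, or -1 if none qualifies.
   Entries are assumed to be sorted by increasing estimated compute time. */
static int pick_first_usable(const HeuristicEntry *results,
                             int returnAlgoCount,
                             size_t workspaceBudget)
{
    int i;
    assert(results != NULL || returnAlgoCount <= 0);
    for (i = 0; i < returnAlgoCount; ++i) {
        if (results[i].state == 0 && results[i].workspaceSize <= workspaceBudget) {
            return i;
        }
    }
    return -1;
}
```

Returning an index rather than a copy keeps the sketch independent of the size and layout of the real result structure.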

3.4.23. cublasLtMatmulAlgoGetIds()

cublasStatus_t cublasLtMatmulAlgoGetIds(
      cublasLtHandle_t lightHandle,
      cublasComputeType_t computeType,
      cudaDataType_t scaleType,
      cudaDataType_t Atype,
      cudaDataType_t Btype,
      cudaDataType_t Ctype,
      cudaDataType_t Dtype,
      int requestedAlgoCount,
      int algoIdsArray[],
      int *returnAlgoCount);

This function retrieves the IDs of all the matrix multiply algorithms that are valid, and can potentially be run by the cublasLtMatmul() function, for given types of the input matrices A, B and C, and of the output matrix D.

Note

The IDs are returned in no particular order. To make sure the best possible algo is contained in the list, make requestedAlgoCount large enough to receive the full list. The list is guaranteed to be full if returnAlgoCount < requestedAlgoCount.

Parameters:

Parameter	Input / Output	Description
lightHandle	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
computeType, scaleType, Atype, Btype, Ctype, and Dtype	Input	Compute type, data type of the scaling factors, and data types of the operand matrices. See cublasComputeType_t and `cudaDataType_t`.
requestedAlgoCount	Input	Number of algorithms requested. Must be > 0.
algoIdsArray[]	Output	Array containing the algorithm IDs returned by this function.
returnAlgoCount	Output	Number of algorithms actually returned by this function.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `requestedAlgoCount` is less than or equal to zero.
CUBLAS_STATUS_SUCCESS	If the query was successful. Inspect `returnAlgoCount` to get the actual number of IDs available.

See cublasStatus_t for a complete list of valid return codes.
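The note in this section says the list is guaranteed to be complete when `returnAlgoCount < requestedAlgoCount`. One way to use that guarantee is to grow the request until the condition holds. The sketch below shows the loop against a callback with the same count semantics. `GetIdsFn`, `get_all_ids` and `mock_get_ids` are hypothetical names used for illustration, and the mock simply pretends that 20 IDs exist.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical callback: writes at most `requested` IDs into `ids` and
   reports how many were written. Returns 0 on success. */
typedef int (*GetIdsFn)(int requested, int ids[], int *returned);

/* Doubles the request until returned < requested, so the list is complete.
   The caller frees the returned array. Returns NULL on failure. */
static int *get_all_ids(GetIdsFn get_ids, int *count)
{
    int capacity = 8;
    assert(get_ids != NULL && count != NULL);
    *count = 0;
    for (;;) {
        int returned = 0;
        int *ids = (int *)malloc((size_t)capacity * sizeof(int));
        if (ids == NULL) return NULL;
        if (get_ids(capacity, ids, &returned) != 0) {
            free(ids);
            return NULL;
        }
        if (returned < capacity) {       /* list is known to be complete */
            *count = returned;
            return ids;
        }
        free(ids);
        if (capacity > INT_MAX / 2) return NULL;
        capacity *= 2;
    }
}

/* Mock provider with 20 IDs, for demonstration only. */
static int mock_get_ids(int requested, int ids[], int *returned)
{
    const int total = 20;
    int i, n;
    if (requested <= 0) return 1;
    n = requested < total ? requested : total;
    for (i = 0; i < n; ++i) ids[i] = 100 + i;
    *returned = n;
    return 0;
}
```

With the mock, the loop requests 8, then 16, then 32 entries and stops at the third call, where 20 IDs come back.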

3.4.24. cublasLtMatmulAlgoInit()

cublasStatus_t cublasLtMatmulAlgoInit(
      cublasLtHandle_t lightHandle,
      cublasComputeType_t computeType,
      cudaDataType_t scaleType,
      cudaDataType_t Atype,
      cudaDataType_t Btype,
      cudaDataType_t Ctype,
      cudaDataType_t Dtype,
      int algoId,
      cublasLtMatmulAlgo_t *algo);

This function initializes the matrix multiply algorithm structure for cublasLtMatmul(), for a specified matrix multiply algorithm and the data types of the input matrices A, B and C, and of the output matrix D.

Parameters:

Parameter	Input / Output	Description
lightHandle	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
computeType	Input	Compute type. See `CUBLASLT_MATMUL_DESC_COMPUTE_TYPE` of cublasLtMatmulDescAttributes_t.
scaleType	Input	Scale type. See `CUBLASLT_MATMUL_DESC_SCALE_TYPE` of cublasLtMatmulDescAttributes_t. Usually same as computeType.
Atype, Btype, Ctype, and Dtype	Input	Datatype precision for the input and output matrices. See cudaDataType_t.
algoId	Input	Specifies the algorithm being initialized. Should be a valid `algoId` returned by the cublasLtMatmulAlgoGetIds() function.
algo	Output	Pointer to the opaque structure to be initialized. See cublasLtMatmulAlgo_t.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `algo` is NULL or `algoId` is outside the recognized range.
CUBLAS_STATUS_NOT_SUPPORTED	If `algoId` is not supported for given combination of data types.
CUBLAS_STATUS_SUCCESS	If the structure was successfully initialized.

See cublasStatus_t for a complete list of valid return codes.

3.4.25. cublasLtMatmulDescCreate()

cublasStatus_t cublasLtMatmulDescCreate( cublasLtMatmulDesc_t *matmulDesc,
                                         cublasComputeType_t computeType,
                                         cudaDataType_t scaleType);

This function creates a matrix multiply descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Input / Output	Description
matmulDesc	Output	Pointer to the structure holding the matrix multiply descriptor created by this function. See cublasLtMatmulDesc_t.
computeType	Input	Enumerant that specifies the data precision for the matrix multiply descriptor this function creates. See cublasComputeType_t.
scaleType	Input	Enumerant that specifies the data precision of the scaling factors for the matrix multiply descriptor this function creates. See `cudaDataType_t`.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.26. cublasLtMatmulDescInit()

cublasStatus_t cublasLtMatmulDescInit( cublasLtMatmulDesc_t matmulDesc,
                                       cublasComputeType_t computeType,
                                       cudaDataType_t scaleType);

This function initializes a matrix multiply descriptor in a previously allocated structure.

Parameters:

Parameter	Input / Output	Description
matmulDesc	Output	Pointer to the structure holding the matrix multiply descriptor initialized by this function. See cublasLtMatmulDesc_t.
computeType	Input	Enumerant that specifies the data precision for the matrix multiply descriptor this function initializes. See cublasComputeType_t.
scaleType	Input	Enumerant that specifies the data precision of the scaling factors for the matrix multiply descriptor this function initializes. See `cudaDataType_t`.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.27. cublasLtMatmulDescDestroy()

cublasStatus_t cublasLtMatmulDescDestroy(
      cublasLtMatmulDesc_t matmulDesc);

This function destroys a previously created matrix multiply descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
matmulDesc		Input	Pointer to the structure holding the matrix multiply descriptor that should be destroyed by this function. See cublasLtMatmulDesc_t.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.28. cublasLtMatmulDescGetAttribute()

cublasStatus_t cublasLtMatmulDescGetAttribute(
      cublasLtMatmulDesc_t matmulDesc,
      cublasLtMatmulDescAttributes_t attr,
      void *buf,
      size_t sizeInBytes,
      size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created matrix multiply descriptor.

Parameters:

Parameter	Input / Output	Description
matmulDesc	Input	Pointer to the previously created structure holding the matrix multiply descriptor queried by this function. See cublasLtMatmulDesc_t.
attr	Input	The attribute that will be retrieved by this function. See cublasLtMatmulDescAttributes_t.
buf	Output	Memory address containing the attribute value retrieved by this function.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.
sizeWritten	Output	Valid only when the return value is CUBLAS_STATUS_SUCCESS. If `sizeInBytes` is non-zero: then `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0: then `sizeWritten` is the number of bytes needed to write full contents.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.29. cublasLtMatmulDescSetAttribute()

cublasStatus_t cublasLtMatmulDescSetAttribute(
      cublasLtMatmulDesc_t matmulDesc,
      cublasLtMatmulDescAttributes_t attr,
      const void *buf,
      size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix multiply descriptor.

Parameters:

Parameter	Input / Output	Description
matmulDesc	Input	Pointer to the previously created structure holding the matrix multiply descriptor modified by this function. See cublasLtMatmulDesc_t.
attr	Input	The attribute that will be set by this function. See cublasLtMatmulDescAttributes_t.
buf	Input	The value to which the specified attribute should be set.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.30. cublasLtMatmulPreferenceCreate()

cublasStatus_t cublasLtMatmulPreferenceCreate(
      cublasLtMatmulPreference_t *pref);

This function creates a matrix multiply heuristic search preferences descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Memory	Input / Output	Description
pref		Output	Pointer to the structure holding the matrix multiply preferences descriptor created by this function. See cublasLtMatmulPreference_t.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.31. cublasLtMatmulPreferenceInit()

cublasStatus_t cublasLtMatmulPreferenceInit(
      cublasLtMatmulPreference_t pref);

This function initializes a matrix multiply heuristic search preferences descriptor in a previously allocated structure.

Parameters:

Parameter	Memory	Input / Output	Description
pref		Output	Pointer to the structure holding the matrix multiply preferences descriptor initialized by this function. See cublasLtMatmulPreference_t.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.32. cublasLtMatmulPreferenceDestroy()

cublasStatus_t cublasLtMatmulPreferenceDestroy(
      cublasLtMatmulPreference_t pref);

This function destroys a previously created matrix multiply preferences descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
pref		Input	Pointer to the structure holding the matrix multiply preferences descriptor that should be destroyed by this function. See cublasLtMatmulPreference_t.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.33. cublasLtMatmulPreferenceGetAttribute()

cublasStatus_t cublasLtMatmulPreferenceGetAttribute(
      cublasLtMatmulPreference_t pref,
      cublasLtMatmulPreferenceAttributes_t attr,
      void *buf,
      size_t sizeInBytes,
      size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created matrix multiply heuristic search preferences descriptor.

Parameters:

Parameter	Input / Output	Description
pref	Input	Pointer to the previously created structure holding the matrix multiply heuristic search preferences descriptor queried by this function. See cublasLtMatmulPreference_t.
attr	Input	The attribute that will be queried by this function. See cublasLtMatmulPreferenceAttributes_t.
buf	Output	Memory address containing the attribute value retrieved by this function.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.
sizeWritten	Output	Valid only when the return value is CUBLAS_STATUS_SUCCESS. If `sizeInBytes` is non-zero: then `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0: then `sizeWritten` is the number of bytes needed to write full contents.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.34. cublasLtMatmulPreferenceSetAttribute()

cublasStatus_t cublasLtMatmulPreferenceSetAttribute(
      cublasLtMatmulPreference_t pref,
      cublasLtMatmulPreferenceAttributes_t attr,
      const void *buf,
      size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix multiply preferences descriptor.

Parameters:

Parameter	Input / Output	Description
pref	Input	Pointer to the previously created structure holding the matrix multiply preferences descriptor modified by this function. See cublasLtMatmulPreference_t.
attr	Input	The attribute that will be set by this function. See cublasLtMatmulPreferenceAttributes_t.
buf	Input	The value to which the specified attribute should be set.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.35. cublasLtMatrixLayoutCreate()

cublasStatus_t cublasLtMatrixLayoutCreate( cublasLtMatrixLayout_t *matLayout,
                                           cudaDataType type,
                                           uint64_t rows,
                                           uint64_t cols,
                                           int64_t ld);

This function creates a matrix layout descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Input / Output	Description
matLayout	Output	Pointer to the structure holding the matrix layout descriptor created by this function. See cublasLtMatrixLayout_t.
type	Input	Enumerant that specifies the data precision for the matrix layout descriptor this function creates. See `cudaDataType`.
rows, cols	Input	Number of rows and columns of the matrix.
ld	Input	The leading dimension of the matrix. In column major layout, this is the number of elements to jump to reach the next column. Thus `ld` >= `rows`.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If the memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.
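The role of `ld` in a column-major layout can be made concrete with the address arithmetic it implies: element (row, col) lives at offset `row + col * ld`. The following host-side sketch is illustrative only; `col_major_offset` and `col_major_min_elems` are hypothetical helpers, not cuBLASLt functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset, in elements, of element (row, col) in a column-major matrix
   whose consecutive columns are ld elements apart. */
static size_t col_major_offset(uint64_t row, uint64_t col, int64_t ld)
{
    assert(ld > 0 && row < (uint64_t)ld);
    return (size_t)(row + col * (uint64_t)ld);
}

/* Minimum number of elements a buffer must hold for a rows x cols
   column-major matrix with leading dimension ld (requires ld >= rows). */
static size_t col_major_min_elems(uint64_t rows, uint64_t cols, int64_t ld)
{
    assert(ld > 0 && (uint64_t)ld >= rows);
    if (rows == 0 || cols == 0) return 0;
    return (size_t)((cols - 1) * (uint64_t)ld + rows);
}
```

A leading dimension larger than `rows` leaves padding at the end of each column, which is how a sub-matrix of a larger allocation can be described without copying.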

3.4.36. cublasLtMatrixLayoutInit()

cublasStatus_t cublasLtMatrixLayoutInit( cublasLtMatrixLayout_t matLayout,
                                         cudaDataType type,
                                         uint64_t rows,
                                         uint64_t cols,
                                         int64_t ld);

This function initializes a matrix layout descriptor in a previously allocated structure.

Parameters:

Parameter	Input / Output	Description
matLayout	Output	Pointer to the structure holding the matrix layout descriptor initialized by this function. See cublasLtMatrixLayout_t.
type	Input	Enumerant that specifies the data precision for the matrix layout descriptor this function initializes. See `cudaDataType`.
rows, cols	Input	Number of rows and columns of the matrix.
ld	Input	The leading dimension of the matrix. In column major layout, this is the number of elements to jump to reach the next column. Thus `ld` >= `rows`.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If the memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.37. cublasLtMatrixLayoutDestroy()

cublasStatus_t cublasLtMatrixLayoutDestroy(
      cublasLtMatrixLayout_t matLayout);

This function destroys a previously created matrix layout descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
matLayout		Input	Pointer to the structure holding the matrix layout descriptor that should be destroyed by this function. See cublasLtMatrixLayout_t.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.38. cublasLtMatrixLayoutGetAttribute()

cublasStatus_t cublasLtMatrixLayoutGetAttribute(
      cublasLtMatrixLayout_t matLayout,
      cublasLtMatrixLayoutAttribute_t attr,
      void *buf,
      size_t sizeInBytes,
      size_t *sizeWritten);

This function returns the value of the queried attribute belonging to the specified matrix layout descriptor.

Parameters:

Parameter	Input / Output	Description
matLayout	Input	Pointer to the previously created structure holding the matrix layout descriptor queried by this function. See cublasLtMatrixLayout_t.
attr	Input	The attribute being queried for. See cublasLtMatrixLayoutAttribute_t.
buf	Output	The attribute value returned by this function.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.
sizeWritten	Output	Valid only when the return value is CUBLAS_STATUS_SUCCESS. If `sizeInBytes` is non-zero: then `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0: then `sizeWritten` is the number of bytes needed to write full contents.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.39. cublasLtMatrixLayoutSetAttribute()

cublasStatus_t cublasLtMatrixLayoutSetAttribute(
      cublasLtMatrixLayout_t matLayout,
      cublasLtMatrixLayoutAttribute_t attr,
      const void *buf,
      size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix layout descriptor.

Parameters:

Parameter	Input / Output	Description
matLayout	Input	Pointer to the previously created structure holding the matrix layout descriptor modified by this function. See cublasLtMatrixLayout_t.
attr	Input	The attribute that will be set by this function. See cublasLtMatrixLayoutAttribute_t.
buf	Input	The value to which the specified attribute should be set.
sizeInBytes	Input	Size of `buf`, the attribute buffer.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.40. cublasLtMatrixTransform()

cublasStatus_t cublasLtMatrixTransform(
      cublasLtHandle_t lightHandle,
      cublasLtMatrixTransformDesc_t transformDesc,
      const void *alpha,
      const void *A,
      cublasLtMatrixLayout_t Adesc,
      const void *beta,
      const void *B,
      cublasLtMatrixLayout_t Bdesc,
      void *C,
      cublasLtMatrixLayout_t Cdesc,
      cudaStream_t stream);

This function computes the matrix transformation operation on the input matrices A and B, to produce the output matrix C, according to the following operation:

C = alpha*transformation(A) + beta*transformation(B),

where A, B are input matrices, and alpha and beta are input scalars. The transformation operation is defined by the transformDesc pointer. This function can be used to change the memory order of data or to scale and shift the values.
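As a mental model for the formula, the sketch below is a plain host-side reference for one special case: the transformation is the identity, A and B are stored column-major, and C is written row-major, so the call both scales the data and changes its memory order. `transform_ref` is a hypothetical helper written for this illustration and says nothing about how the library implements the operation.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for C = alpha*A + beta*B on rows x cols float matrices.
   A and B are column-major (leading dimensions lda, ldb); C is row-major
   (leading dimension ldc). B may be NULL, in which case it contributes 0. */
static void transform_ref(size_t rows, size_t cols,
                          float alpha, const float *A, size_t lda,
                          float beta,  const float *B, size_t ldb,
                          float *C, size_t ldc)
{
    size_t i, j;
    assert(A != NULL && C != NULL);
    assert(lda >= rows && ldc >= cols);
    assert(B == NULL || ldb >= rows);
    for (j = 0; j < cols; ++j) {
        for (i = 0; i < rows; ++i) {
            float a = alpha * A[i + j * lda];                      /* column-major read */
            float b = (B != NULL) ? beta * B[i + j * ldb] : 0.0f;
            C[i * ldc + j] = a + b;                                /* row-major write */
        }
    }
}
```

Passing a matrix of ones as B turns `beta` into a constant shift, which corresponds to the "scale and shift" use mentioned in the description.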

Parameters:

Parameter	Memory	Input / Output	Description
lightHandle		Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
transformDesc		Input	Pointer to the opaque descriptor holding the matrix transformation operation. See cublasLtMatrixTransformDesc_t.
alpha, beta	Device or host	Input	Pointers to the scalars used in the multiplication.
A, B, and C	Device	Input	Pointers to the GPU memory associated with the corresponding descriptors `Adesc`, `Bdesc` and `Cdesc`.
Adesc, Bdesc and Cdesc		Input	Handles to the previously created descriptors of the type cublasLtMatrixLayout_t. `Adesc` or `Bdesc` can be NULL if the corresponding pointer is NULL and the corresponding scalar is zero.
stream	Host	Input	The CUDA stream where all the GPU work will be submitted.

Returns:

Return Value	Description
CUBLAS_STATUS_NOT_INITIALIZED	If cuBLASLt handle has not been initialized.
CUBLAS_STATUS_INVALID_VALUE	If the parameters are in conflict or in an impossible configuration. For example, when `A` is not NULL, but `Adesc` is NULL.
CUBLAS_STATUS_NOT_SUPPORTED	If the current implementation on the selected device does not support the configured operation.
CUBLAS_STATUS_ARCH_MISMATCH	If the configured operation cannot be run using the selected device.
CUBLAS_STATUS_EXECUTION_FAILED	If CUDA reported an execution error from the device.
CUBLAS_STATUS_SUCCESS	If the operation completed successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.41. cublasLtMatrixTransformDescCreate()

cublasStatus_t cublasLtMatrixTransformDescCreate(
      cublasLtMatrixTransformDesc_t *transformDesc,
      cudaDataType scaleType);

This function creates a matrix transform descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Memory	Input / Output	Description
transformDesc		Output	Pointer to the structure holding the matrix transform descriptor created by this function. See cublasLtMatrixTransformDesc_t.
scaleType		Input	Enumerant that specifies the data precision for the matrix transform descriptor this function creates. See `cudaDataType`.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.42. cublasLtMatrixTransformDescInit()

cublasStatus_t cublasLtMatrixTransformDescInit(
      cublasLtMatrixTransformDesc_t transformDesc,
      cudaDataType scaleType);

This function initializes a matrix transform descriptor in a previously allocated structure.

Parameters:

Parameter	Memory	Input / Output	Description
transformDesc		Output	Pointer to the structure holding the matrix transform descriptor initialized by this function. See cublasLtMatrixTransformDesc_t.
scaleType		Input	Enumerant that specifies the data precision for the matrix transform descriptor this function initializes. See `cudaDataType`.

Returns:

Return Value	Description
CUBLAS_STATUS_ALLOC_FAILED	If memory could not be allocated.
CUBLAS_STATUS_SUCCESS	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.43. cublasLtMatrixTransformDescDestroy()

cublasStatus_t cublasLtMatrixTransformDescDestroy(
      cublasLtMatrixTransformDesc_t transformDesc);

This function destroys a previously created matrix transform descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
transformDesc		Input	Pointer to the structure holding the matrix transform descriptor that should be destroyed by this function. See cublasLtMatrixTransformDesc_t.

Returns:

Return Value	Description
CUBLAS_STATUS_SUCCESS	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.44. cublasLtMatrixTransformDescGetAttribute()

cublasStatus_t cublasLtMatrixTransformDescGetAttribute(
      cublasLtMatrixTransformDesc_t transformDesc,
      cublasLtMatrixTransformDescAttributes_t attr,
      void *buf,
      size_t sizeInBytes,
      size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created matrix transform descriptor.

Parameters:

Parameter	Input / Output	Description
transformDesc	Input	Pointer to the previously created structure holding the matrix transform descriptor queried by this function. See cublasLtMatrixTransformDesc_t.
attr	Input	The attribute that will be retrieved by this function. See cublasLtMatrixTransformDescAttributes_t.
buf	Output	Memory address containing the attribute value retrieved by this function.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.
sizeWritten	Output	Valid only when the return value is CUBLAS_STATUS_SUCCESS. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` does not match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.
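The buffer contract described in the tables above can be sketched with a small host-only mock. This is an illustration under stated assumptions, not the cuBLASLt implementation: `mock_get_attribute`, `MOCK_SUCCESS`, and `MOCK_INVALID_VALUE` are hypothetical names, and the `value`/`value_size` arguments stand in for the descriptor's internal storage.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes used only by this mock. */
enum { MOCK_SUCCESS = 0, MOCK_INVALID_VALUE = 1 };

/* Hypothetical stand-in that follows the documented buffer contract.
 * value/value_size play the role of the attribute's internal storage. */
int mock_get_attribute(const void *value, size_t value_size,
                       void *buf, size_t sizeInBytes, size_t *sizeWritten)
{
    if (sizeInBytes == 0) {
        /* Size query: the caller must provide somewhere to store the size. */
        if (sizeWritten == NULL)
            return MOCK_INVALID_VALUE;
        *sizeWritten = value_size;
        return MOCK_SUCCESS;
    }
    /* Value read: buf must exist and match the internal storage size. */
    if (buf == NULL || sizeInBytes != value_size)
        return MOCK_INVALID_VALUE;
    memcpy(buf, value, value_size);
    if (sizeWritten != NULL)
        *sizeWritten = value_size;
    return MOCK_SUCCESS;
}
```

One way to use this contract is to call first with `sizeInBytes` set to 0 to learn the required size, then call again with a buffer of exactly that size.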

3.4.45. cublasLtMatrixTransformDescSetAttribute()

cublasStatus_t cublasLtMatrixTransformDescSetAttribute(
      cublasLtMatrixTransformDesc_t transformDesc,
      cublasLtMatrixTransformDescAttributes_t attr,
      const void *buf,
      size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix transform descriptor.

Parameters:

Parameter	Input / Output	Description
transformDesc	Input	Pointer to the previously created structure holding the matrix transform descriptor modified by this function. See cublasLtMatrixTransformDesc_t.
attr	Input	The attribute that will be set by this function. See cublasLtMatrixTransformDescAttributes_t.
buf	Input	The value to which the specified attribute should be set.
sizeInBytes	Input	Size of `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
CUBLAS_STATUS_INVALID_VALUE	If `buf` is NULL or `sizeInBytes` does not match the size of the internal storage for the selected attribute.
CUBLAS_STATUS_SUCCESS	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

4. Using the cuBLASXt API

4.1. General description

The cuBLASXt API of cuBLAS exposes a multi-GPU capable host interface: when using this API, the application only needs to allocate the required matrices in the host memory space. Additionally, the current implementation supports managed memory on Linux with GPU devices that have compute capability 6.x or greater, but treats it as host memory. Managed memory is not supported on Windows. There is no restriction on the sizes of the matrices as long as they fit into host memory. The cuBLASXt API takes care of allocating memory across the designated GPUs, dispatches the workload between them, and finally retrieves the results back to the host. The cuBLASXt API supports only the compute-intensive BLAS3 routines (e.g., matrix-matrix operations), where the PCI transfers back and forth from the GPU can be amortized. The cuBLASXt API has its own header file, cublasXt.h.

Starting with release 8.0, the cuBLASXt API allows any of the matrices to be located on a GPU device.

Note: The cuBLASXt API is only supported on 64-bit platforms.

4.1.1. Tiling design approach

To share the workload between multiple GPUs, the cuBLASXt API uses a tiling strategy: every matrix is divided into square tiles of user-controllable dimension BlockDim x BlockDim. The resulting matrix tiling defines the static scheduling policy: each resulting tile is assigned to a GPU in a round-robin fashion. One CPU thread is created per GPU and is responsible for performing the memory transfers and cuBLAS operations needed to compute all the tiles assigned to it. From a performance point of view, because of this static scheduling strategy, it is better that the compute capabilities and PCI bandwidth are the same for every GPU. The figure below illustrates the tile distribution between 3 GPUs. To compute the first tile G0 of C, CPU thread 0, which is responsible for GPU0, has to load 3 tiles from the first row of A and tiles from the first column of B in a pipelined fashion, so that memory transfers overlap with computation, and sum the results into the first tile G0 of C before moving on to the next tile G0.

Figure: Example of cublasXt<t>gemm() tiling for 3 GPUs
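The static round-robin policy can be sketched in a few lines of host code. This is only an illustration: `tile_count` and `tile_owner` are hypothetical helpers, and the column-by-column numbering of tiles before the round-robin step is an assumption, since the text does not state the order in which tiles are enumerated.

```c
#include <assert.h>
#include <stddef.h>

/* Number of tiles needed to cover `extent` rows or columns with square
 * tiles of size block_dim (ceiling division). */
size_t tile_count(size_t extent, size_t block_dim)
{
    assert(block_dim > 0);
    return (extent + block_dim - 1) / block_dim;
}

/* Index of the GPU that owns tile (tile_row, tile_col) of C.
 * Assumption: tiles are numbered column by column, then dealt out to the
 * nb_devices GPUs in round-robin order. */
int tile_owner(size_t tile_row, size_t tile_col,
               size_t tile_rows, int nb_devices)
{
    assert(nb_devices > 0);
    size_t linear = tile_col * tile_rows + tile_row;
    return (int)(linear % (size_t)nb_devices);
}
```

Because the mapping depends only on the tile index and the number of devices, every GPU knows its full work list before any computation starts, which is why equal compute capability and bandwidth across GPUs matter.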

When the dimensions of C are not an exact multiple of the tile dimension, some tiles are partially filled on the right border and/or the bottom border. The current implementation does not pad the incomplete tiles but simply keeps track of them and performs correspondingly reduced cuBLAS operations: this way, no extra computation is done. However, this can still lead to some load imbalance when the GPUs do not all have the same number of incomplete tiles to work on.
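The size of a border tile follows directly from the matrix extent and the block dimension. The helper below is a hypothetical sketch (the name `tile_extent` is not part of cuBLASXt) of how the reduced dimension of an incomplete tile could be computed without padding.

```c
#include <assert.h>
#include <stddef.h>

/* Number of rows (or columns) covered by tile number `index` along one
 * dimension of length `extent`, tiled with `block_dim`. Interior tiles are
 * full; the last tile holds only the remainder, so nothing is padded. */
size_t tile_extent(size_t extent, size_t block_dim, size_t index)
{
    assert(block_dim > 0);
    size_t start = index * block_dim;
    assert(start < extent);
    size_t remaining = extent - start;
    return remaining < block_dim ? remaining : block_dim;
}
```

For example, with an extent of 10 and a block dimension of 4, tiles 0 and 1 cover 4 elements each and tile 2 covers the remaining 2.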

When one or more matrices are located on some GPU devices, the same tiling approach and workload sharing are applied. The memory transfers are in this case done between devices. However, when the computation of a tile and some data are located on the same GPU device, the memory transfer to/from the local data into tiles is bypassed and the GPU operates directly on the local data. This can lead to a significant performance increase, especially when only one GPU is used for the computation.

The matrices can be located on any GPU device, and do not have to be located on the same GPU device. Furthermore, the matrices can even be located on a GPU device that does not participate in the computation.

Unlike the cuBLAS API, even if all matrices are located on the same device, the cuBLASXt API is still a blocking API from the host point of view: the results, wherever they are located, will be valid when the call returns, and no device synchronization is required.

4.1.2. Hybrid CPU-GPU computation

In the case of very large problems, the cuBLASXt API offers the possibility of offloading some of the computation to the host CPU. This feature can be set up with the routines cublasXtSetCpuRoutine() and cublasXtSetCpuRatio(). The workload assigned to the CPU is set aside: it is simply a percentage of the resulting matrix taken from the bottom or the right side, whichever dimension is bigger. The GPU tiling is then done on the reduced resulting matrix.

If any of the matrices is located on a GPU device, the feature is ignored and all computation is done on the GPUs only.

This feature should be used with caution because it could interfere with the CPU threads responsible for feeding the GPUs.

Currently, only the routine cublasXt<t>gemm supports this feature.
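The way the CPU share is carved out of the result matrix can be sketched as follows. This is a guess at the arithmetic, not the library's code: `hybrid_split` and `split_for_cpu` are hypothetical names, the ratio is assumed to be a fraction between 0 and 1, the CPU strip is assumed to be rounded down, and a square matrix is assumed to be split along its rows.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t gpu_rows, gpu_cols;  /* reduced result matrix tiled across GPUs */
    size_t cpu_rows, cpu_cols;  /* strip set aside for the CPU routine */
} hybrid_split;

/* Split an m x n result matrix between the GPUs and the CPU. */
hybrid_split split_for_cpu(size_t m, size_t n, float ratio)
{
    assert(ratio >= 0.0f && ratio <= 1.0f);
    hybrid_split s;
    if (m >= n) {
        /* Rows are the larger dimension: take a strip from the bottom. */
        size_t cpu = (size_t)((double)m * ratio);
        s.cpu_rows = cpu;
        s.cpu_cols = n;
        s.gpu_rows = m - cpu;
        s.gpu_cols = n;
    } else {
        /* Columns are the larger dimension: take a strip from the right. */
        size_t cpu = (size_t)((double)n * ratio);
        s.cpu_rows = m;
        s.cpu_cols = cpu;
        s.gpu_rows = m;
        s.gpu_cols = n - cpu;
    }
    return s;
}
```

Under these assumptions, a 1000 x 400 result with a ratio of 0.25 leaves a 750 x 400 matrix for the GPU tiling and a 250 x 400 strip for the CPU.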

4.1.3. Results reproducibility

Currently, all cuBLASXt API routines from a given toolkit version generate the same bit-wise results when the following conditions are met:

all GPUs participating in the computation have the same compute capabilities and the same number of SMs.
the tile size is kept the same between runs.
either the CPU hybrid computation is not used or the CPU BLAS provided is also guaranteed to produce reproducible results.

4.2. cuBLASXt API Datatypes Reference

4.2.1. cublasXtHandle_t

The cublasXtHandle_t type is a pointer type to an opaque structure holding the cuBLASXt API context. The cuBLASXt API context must be initialized using cublasXtCreate() and the returned handle must be passed to all subsequent cuBLASXt API function calls. The context should be destroyed at the end using cublasXtDestroy().

4.2.2. cublasXtOpType_t

The cublasXtOpType_t type enumerates the four possible types supported by BLAS routines. This enum is used as a parameter of the routines cublasXtSetCpuRoutine and cublasXtSetCpuRatio to set up the hybrid configuration.

Value	Meaning
`CUBLASXT_FLOAT`	float or single precision type
`CUBLASXT_DOUBLE`	double precision type
`CUBLASXT_COMPLEX`	single precision complex
`CUBLASXT_DOUBLECOMPLEX`	double precision complex

4.2.3. cublasXtBlasOp_t

The cublasXtBlasOp_t type enumerates the BLAS3 or BLAS-like routines supported by the cuBLASXt API. This enum is used as a parameter of the routines cublasXtSetCpuRoutine and cublasXtSetCpuRatio to set up the hybrid configuration.

Value	Meaning
`CUBLASXT_GEMM`	GEMM routine
`CUBLASXT_SYRK`	SYRK routine
`CUBLASXT_HERK`	HERK routine
`CUBLASXT_SYMM`	SYMM routine
`CUBLASXT_HEMM`	HEMM routine
`CUBLASXT_TRSM`	TRSM routine
`CUBLASXT_SYR2K`	SYR2K routine
`CUBLASXT_HER2K`	HER2K routine
`CUBLASXT_SPMM`	SPMM routine
`CUBLASXT_SYRKX`	SYRKX routine
`CUBLASXT_HERKX`	HERKX routine

4.2.4. cublasXtPinningMemMode_t

This type is used to enable or disable the Pinning Memory mode through the routine cublasXtSetPinningMemMode.

Value	Meaning
`CUBLASXT_PINNING_DISABLED`	the Pinning Memory mode is disabled
`CUBLASXT_PINNING_ENABLED`	the Pinning Memory mode is enabled

4.3. cuBLASXt API Helper Function Reference

4.3.1. cublasXtCreate()

cublasStatus_t
cublasXtCreate(cublasXtHandle_t *handle)

This function initializes the cuBLASXt API and creates a handle to an opaque structure holding the cuBLASXt API context. It allocates hardware resources on the host and device and must be called prior to making any other cuBLASXt API calls.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the initialization succeeded
`CUBLAS_STATUS_ALLOC_FAILED`	the resources could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	cuBLASXt API is only supported on 64-bit platforms

4.3.2. cublasXtDestroy()

cublasStatus_t
cublasXtDestroy(cublasXtHandle_t handle)

This function releases hardware resources used by the cuBLASXt API context. The release of GPU resources may be deferred until the application exits. This function is usually the last call with a particular handle to the cuBLASXt API.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the shutdown succeeded
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized

4.3.3. cublasXtDeviceSelect()

cublasStatus_t
cublasXtDeviceSelect(cublasXtHandle_t handle, int nbDevices, int deviceId[])

This function allows the user to provide the number of GPU devices and their respective IDs that will participate in the subsequent cuBLASXt API Math function calls. This function will create a cuBLAS context for every GPU provided in that list. Currently the device configuration is static and cannot be changed between Math function calls. In that regard, this function should be called only once after cublasXtCreate(). To be able to run multiple configurations, multiple cuBLASXt API contexts should be created.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	User call was successful
`CUBLAS_STATUS_INVALID_VALUE`	At least one of the devices could not be accessed, or a cuBLAS context could not be created on at least one of the devices
`CUBLAS_STATUS_ALLOC_FAILED`	Some resources could not be allocated.

4.3.4. cublasXtSetBlockDim()

cublasStatus_t
cublasXtSetBlockDim(cublasXtHandle_t handle, int blockDim)

This function allows the user to set the block dimension used for the tiling of the matrices for the subsequent Math function calls. Matrices are split into square tiles of blockDim x blockDim dimension. This function can be called at any time and will take effect for the following Math function calls. The block dimension should be chosen so as to optimize the math operation and to make sure that the PCI transfers are well overlapped with the computation.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blockDim <= 0

4.3.5. cublasXtGetBlockDim()

cublasStatus_t
cublasXtGetBlockDim(cublasXtHandle_t handle, int *blockDim)

This function allows the user to query the block dimension used for the tiling of the matrices.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful

4.3.6. cublasXtSetCpuRoutine()

cublasStatus_t
cublasXtSetCpuRoutine(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp, cublasXtOpType_t type, void *blasFunctor)

This function allows the user to provide a CPU implementation of the corresponding BLAS routine. This function can be used with the function cublasXtSetCpuRatio() to define a hybrid computation between the CPU and the GPUs. Currently the hybrid feature is only supported for the xGEMM routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blasOp or type define an invalid combination
`CUBLAS_STATUS_NOT_SUPPORTED`	CPU-GPU Hybridization for that routine is not supported

4.3.7. cublasXtSetCpuRatio()

cublasStatus_t
cublasXtSetCpuRatio(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp, cublasXtOpType_t type, float ratio)

This function allows the user to define the percentage of the workload that should be done on a CPU in the context of a hybrid computation. This function can be used with the function cublasXtSetCpuRoutine() to define a hybrid computation between the CPU and the GPUs. Currently the hybrid feature is only supported for the xGEMM routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blasOp or type define an invalid combination
`CUBLAS_STATUS_NOT_SUPPORTED`	CPU-GPU Hybridization for that routine is not supported

4.3.8. cublasXtSetPinningMemMode()

cublasStatus_t
cublasXtSetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t mode)

This function allows the user to enable or disable the Pinning Memory mode. When enabled, the matrices passed in subsequent cuBLASXt API calls will be pinned and unpinned using the CUDART routines cudaHostRegister() and cudaHostUnregister(), respectively, if the matrices are not already pinned. If a matrix happens to be partially pinned, it will also not be pinned. Pinning the memory improves PCI transfer performance and allows PCI memory transfers to overlap with computation. However, pinning and unpinning the memory takes some time, which might not be amortized. It is advised that the user pin the memory directly using cudaMallocHost() or cudaHostRegister() and unpin it when the computation sequence is completed. By default, the Pinning Memory mode is disabled.

Note

The Pinning Memory mode should not be enabled when matrices used for different calls to the cuBLASXt API overlap. cuBLASXt determines whether a matrix is pinned by checking, using cudaHostGetFlags(), whether the first address of that matrix is pinned; it therefore cannot know whether the matrix is already partially pinned. This is especially true in multi-threaded applications, where memory could be partially or totally pinned or unpinned while another thread is accessing that memory.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	the mode value is different from `CUBLASXT_PINNING_DISABLED` and `CUBLASXT_PINNING_ENABLED`

4.3.9. cublasXtGetPinningMemMode()

cublasStatus_t
cublasXtGetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t *mode)

This function allows the user to query the Pinning Memory mode. By default, the Pinning Memory mode is disabled.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful

4.4. cuBLASXt API Math Functions Reference

In this chapter we describe the actual linear algebra routines that the cuBLASXt API supports. We will use the abbreviations <type> for type and <t> for the corresponding short type to make the presentation of the implemented functions more concise and clear. Unless otherwise specified, <type> and <t> have the following meanings:

<type>	<t>	Meaning
`float`	‘s’ or ‘S’	real single-precision
`double`	‘d’ or ‘D’	real double-precision
`cuComplex`	‘c’ or ‘C’	complex single-precision
`cuDoubleComplex`	‘z’ or ‘Z’	complex double-precision

The abbreviations $\mathbf{Re}(\cdot)$ and $\mathbf{Im}(\cdot)$ will stand for the real and imaginary part of a number, respectively. Since the imaginary part of a real number does not exist, we will consider it to be zero and can usually simply discard it from the equation where it is being used. Also, $\bar{\alpha}$ will denote the complex conjugate of $\alpha$.

In general throughout the documentation, the lower case Greek symbols $\alpha$ and $\beta$ will denote scalars, lower case English letters in bold type $\mathbf{x}$ and $\mathbf{y}$ will denote vectors, and capital English letters $A$, $B$ and $C$ will denote matrices.

4.4.1. cublasXt<t>gemm()

cublasStatus_t cublasXtSgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const float           *alpha,
                           const float           *A, size_t lda,
                           const float           *B, size_t ldb,
                           const float           *beta,
                           float           *C, size_t ldc)
cublasStatus_t cublasXtDgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const double          *alpha,
                           const double          *A, size_t lda,
                           const double          *B, size_t ldb,
                           const double          *beta,
                           double          *C, size_t ldc)
cublasStatus_t cublasXtCgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *B, size_t ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZgemm(cublasXtHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           size_t m, size_t n, size_t k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *B, size_t ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the matrix-matrix multiplication

$C = \alpha\text{op}(A)\text{op}(B) + \beta C$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$, $\text{op}(B)$ $k \times n$ and $C$ $m \times n$, respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
transa		input	operation op(`A`) that is non- or (conj.) transpose.
transb		input	operation op(`B`) that is non- or (conj.) transpose.
m		input	number of rows of matrix op(`A`) and `C`.
n		input	number of columns of matrix op(`B`) and `C`.
k		input	number of columns of op(`A`) and rows of op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa == CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store the matrix `A`.
B	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb == CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of a two-dimensional array used to store the matrix `C`.
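The column-major layout, the leading dimensions, and the role of op() can be made concrete with a plain CPU reference. This is a minimal single-precision sketch for checking results on small inputs, not the cuBLASXt implementation: `op_elem` and `gemm_ref` are hypothetical names, plain `int` flags replace `cublasOperation_t`, and only the non-transposed and transposed cases are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Element (row, col) of op(M) for a column-major matrix M with leading
 * dimension ld. `transposed` is nonzero when op(M) = M^T. */
static float op_elem(const float *M, size_t ld, int transposed,
                     size_t row, size_t col)
{
    return transposed ? M[row * ld + col] : M[col * ld + row];
}

/* Reference C = alpha * op(A) * op(B) + beta * C, with op(A) m x k,
 * op(B) k x n and C m x n, all stored column-major. */
void gemm_ref(int transa, int transb, size_t m, size_t n, size_t k,
              float alpha, const float *A, size_t lda,
              const float *B, size_t ldb,
              float beta, float *C, size_t ldc)
{
    assert(ldc >= (m > 0 ? m : 1));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += op_elem(A, lda, transa, i, p) *
                       op_elem(B, ldb, transb, p, j);
            /* When beta is 0, C is not read, so it need not be valid input. */
            C[j * ldc + i] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * C[j * ldc + i];
        }
    }
}
```

A reference of this kind is mainly useful for validating a multi-GPU result on a small problem, since its cost grows as m x n x k.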

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

sgemm, dgemm, cgemm, zgemm

4.4.2. cublasXt<t>hemm()

cublasStatus_t cublasXtChemm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *B, size_t ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZhemm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *B, size_t ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the Hermitian matrix-matrix multiplication

$C = \left\{ \begin{matrix} {\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.$

where $A$ is a Hermitian matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates if matrix `A` lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
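The statement that only one triangle of `A` is stored, with the rest inferred, can be illustrated by a small accessor. This is a host-only sketch under stated assumptions: `cplx` is a hypothetical stand-in for `cuComplex`, `herm_lower_elem` is a hypothetical helper, and only the lower-storage case is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuComplex: real and imaginary parts. */
typedef struct { float re, im; } cplx;

/* Logical element (i, j) of a Hermitian matrix whose lower triangle is
 * stored column-major in A with leading dimension lda. */
cplx herm_lower_elem(const cplx *A, size_t lda, size_t i, size_t j)
{
    assert(lda >= 1);
    cplx v;
    if (i >= j) {
        v = A[j * lda + i];      /* stored element */
        if (i == j)
            v.im = 0.0f;         /* diagonal imaginary part assumed zero */
    } else {
        v = A[i * lda + j];      /* mirror element A(j, i) ... */
        v.im = -v.im;            /* ... conjugated */
    }
    return v;
}
```

The strictly upper part of the array is never read, so it may hold arbitrary values.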

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

chemm, zhemm

4.4.3. cublasXt<t>symm()

cublasStatus_t cublasXtSsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const float           *alpha,
                           const float           *A, size_t lda,
                           const float           *B, size_t ldb,
                           const float           *beta,
                           float           *C, size_t ldc)
cublasStatus_t cublasXtDsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const double          *alpha,
                           const double          *A, size_t lda,
                           const double          *B, size_t ldb,
                           const double          *beta,
                           double          *C, size_t ldc)
cublasStatus_t cublasXtCsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuComplex       *alpha,
                           const cuComplex       *A, size_t lda,
                           const cuComplex       *B, size_t ldb,
                           const cuComplex       *beta,
                           cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsymm(cublasXtHandle_t handle,
                           cublasSideMode_t side, cublasFillMode_t uplo,
                           size_t m, size_t n,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, size_t lda,
                           const cuDoubleComplex *B, size_t ldb,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, size_t ldc)

This function performs the symmetric matrix-matrix multiplication

$C = \left\{ \begin{matrix} {\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.$

where $A$ is a symmetric matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
side		input	indicates if matrix `A` is on the left or right of `B`.
uplo		input	indicates if matrix `A` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
m		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
n		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side == CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta == 0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
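For the left-side case, the product can be written as a short CPU reference that reads only the stored triangle of `A`. This is a sketch for small test problems, not the library's algorithm: `sym_elem` and `symm_left_ref` are hypothetical names, and an `int` flag replaces `cublasFillMode_t`.

```c
#include <assert.h>
#include <stddef.h>

/* Logical A(i, j) of a symmetric matrix when only one triangle is stored
 * (lower != 0 selects the lower triangle, otherwise the upper one). */
static float sym_elem(const float *A, size_t lda, int lower,
                      size_t i, size_t j)
{
    int stored = lower ? (i >= j) : (i <= j);
    return stored ? A[j * lda + i] : A[i * lda + j];
}

/* Reference C = alpha * A * B + beta * C for the left-side case, where A
 * is m x m and B and C are m x n, all stored column-major. */
void symm_left_ref(int lower, size_t m, size_t n, float alpha,
                   const float *A, size_t lda,
                   const float *B, size_t ldb,
                   float beta, float *C, size_t ldc)
{
    assert(lda >= (m > 0 ? m : 1));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < m; ++p)
                acc += sym_elem(A, lda, lower, i, p) * B[j * ldb + p];
            /* When beta is 0, C is not read. */
            C[j * ldc + i] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * C[j * ldc + i];
        }
    }
}
```

The right-side case would swap the roles of the operands, with `A` sized n x n.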

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssymm, dsymm, csymm, zsymm

4.4.4. cublasXt<t>syrk()

cublasStatus_t cublasXtSsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const float           *alpha,
                           const float           *A, int lda,
                           const float           *beta,
                           float           *C, int ldc)
cublasStatus_t cublasXtDsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const double          *alpha,
                           const double          *A, int lda,
                           const double          *beta,
                           double          *C, int ldc)
cublasStatus_t cublasXtCsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const cuComplex       *alpha,
                           const cuComplex       *A, int lda,
                           const cuComplex       *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasXtZsyrk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const cuDoubleComplex *alpha,
                           const cuDoubleComplex *A, int lda,
                           const cuDoubleComplex *beta,
                           cuDoubleComplex *C, int ldc)

This function performs the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
uplo		input	indicates if matrix `C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) that is non- or transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix A.
beta	host	input	<type> scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
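The rank-k update touches only the triangle of `C` selected by `uplo`. The CPU reference below sketches this for single precision; `syrk_ref` is a hypothetical name, and `int` flags replace `cublasFillMode_t` and `cublasOperation_t`.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C = alpha * op(A) * op(A)^T + beta * C, updating only the lower
 * (lower != 0) or upper triangle of the n x n column-major matrix C.
 * op(A) is n x k: A itself if trans == 0, A^T otherwise. */
void syrk_ref(int lower, int trans, size_t n, size_t k, float alpha,
              const float *A, size_t lda, float beta, float *C, size_t ldc)
{
    assert(ldc >= (n > 0 ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        size_t i_begin = lower ? j : 0;
        size_t i_end = lower ? n : j + 1;
        for (size_t i = i_begin; i < i_end; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                /* op(A)(i, p) and op(A)(j, p) in column-major storage */
                float a_ip = trans ? A[i * lda + p] : A[p * lda + i];
                float a_jp = trans ? A[j * lda + p] : A[p * lda + j];
                acc += a_ip * a_jp;
            }
            /* When beta is 0, C is not read. */
            C[j * ldc + i] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * C[j * ldc + i];
        }
    }
}
```

Entries in the other triangle are left untouched, which matches the parameter description of `uplo`.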

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyrk, dsyrk, csyrk, zsyrk

4.4.5. cublasXt<t>syr2k()

cublasStatus_t cublasXtSsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const float           *alpha,
                            const float           *A, size_t lda,
                            const float           *B, size_t ldb,
                            const float           *beta,
                            float           *C, size_t ldc)
cublasStatus_t cublasXtDsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const double          *alpha,
                            const double          *A, size_t lda,
                            const double          *B, size_t ldb,
                            const double          *beta,
                            double          *C, size_t ldc)
cublasStatus_t cublasXtCsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsyr2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs the symmetric rank- $2k$ update

$C = \alpha(\text{op}(A)\text{op}(B)^{T} + \text{op}(B)\text{op}(A)^{T}) + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $\text{op}(A)$ and $\text{op}(B)$ are both $n \times k$ matrices. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{T}\text{ and }B^{T}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$
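As a minimal sketch of the rank-$2k$ formula above, the following CPU routine evaluates it element by element for one triangle of `C`. It is a hypothetical reference for small test cases, not the library's algorithm. The name `ref_dsyr2k` and the `int` flags used in place of the cuBLAS enums are assumptions for illustration.

```c
#include <stddef.h>

/* Element (i, p) of op(M) for a column-major array M with leading dimension ld. */
static double op_at(const double *M, size_t ld, int transposed, size_t i, size_t p)
{
    return transposed ? M[p + i * ld] : M[i + p * ld];
}

/* Hypothetical CPU reference:
 * C = alpha * (op(A) op(B)^T + op(B) op(A)^T) + beta * C on one triangle of C. */
void ref_dsyr2k(int upper, int transposed, size_t n, size_t k, double alpha,
                const double *A, size_t lda, const double *B, size_t ldb,
                double beta, double *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        size_t first = upper ? 0 : j;
        size_t last  = upper ? j + 1 : n;
        for (size_t i = first; i < last; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += op_at(A, lda, transposed, i, p) * op_at(B, ldb, transposed, j, p)
                     + op_at(B, ldb, transposed, i, p) * op_at(A, lda, transposed, j, p);
            }
            /* With beta == 0 the old contents of C are never read. */
            double old = (beta == 0.0) ? 0.0 : beta * C[i + j * ldc];
            C[i + j * ldc] = alpha * acc + old;
        }
    }
}
```

Swapping the roles of `i` and `j` leaves `acc` unchanged, which is why storing a single triangle is sufficient.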

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) and op(`B`) that is non- or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta==0`, then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.
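The dimension rules for `A` and `B` in the table can be turned into a small pre-flight check. The helper below is a hypothetical sketch (the name `operand_layout` and the `int` flag are not part of cuBLASXt). It derives the stored shape from the transpose setting, checks the leading dimension against `max(1, rows)`, and reports how many elements the array must hold.

```c
#include <stddef.h>

/* Hypothetical helper: validate the leading dimension of an input operand and
 * compute its required element count.
 * transposed == 0: the array is stored as ld x k and needs ld >= max(1, n).
 * transposed != 0: the array is stored as ld x n and needs ld >= max(1, k).
 * Returns 1 and writes *elems on success, returns 0 if ld is too small. */
int operand_layout(int transposed, size_t n, size_t k, size_t ld, size_t *elems)
{
    size_t rows = transposed ? k : n;   /* rows actually stored */
    size_t cols = transposed ? n : k;   /* columns actually stored */
    size_t min_ld = rows > 1 ? rows : 1;

    if (ld < min_ld)
        return 0;
    *elems = ld * cols;
    return 1;
}
```

The same check applies to `C` with `rows = cols = n`, since `C` is always stored as `ldc x n`.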

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyr2k, dsyr2k, csyr2k, zsyr2k

4.4.6. cublasXt<t>syrkx()

cublasStatus_t cublasXtSsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const float           *alpha,
                            const float           *A, size_t lda,
                            const float           *B, size_t ldb,
                            const float           *beta,
                            float           *C, size_t ldc)
cublasStatus_t cublasXtDsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const double          *alpha,
                            const double          *A, size_t lda,
                            const double          *B, size_t ldb,
                            const double          *beta,
                            double          *C, size_t ldc)
cublasStatus_t cublasXtCsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const cuComplex       *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZsyrkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const cuDoubleComplex *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs a variation of the symmetric rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $\text{op}(A)$ and $\text{op}(B)$ are both $n \times k$ matrices. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{T}\text{ and }B^{T}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.$

This routine can be used when B is such that the result is guaranteed to be symmetric. A common example is when the matrix B is a scaled form of the matrix A, which is equivalent to B being the product of the matrix A and a diagonal matrix.
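The symmetry condition can be checked numerically on a small case. The sketch below uses hypothetical helper names, runs on the CPU only, and assumes the non-transposed case. It builds B as A times a diagonal matrix, forms the full product $\alpha AB^{T}$, and tests whether it is symmetric. An unrelated B generally fails the same test.

```c
#include <stddef.h>

/* B = A * diag(d): column p of the n x k matrix A is scaled by d[p]. */
void scale_columns(size_t n, size_t k, const double *A, size_t lda,
                   const double *d, double *B, size_t ldb)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < n; ++i)
            B[i + p * ldb] = A[i + p * lda] * d[p];
}

/* Full n x n product P = alpha * A * B^T (both triangles), column-major. */
void full_product_abt(size_t n, size_t k, double alpha,
                      const double *A, size_t lda,
                      const double *B, size_t ldb,
                      double *P, size_t ldp)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[j + p * ldb];
            P[i + j * ldp] = alpha * acc;
        }
}

/* Returns 1 if |P(i,j) - P(j,i)| <= tol for all i, j; otherwise 0. */
int is_symmetric(size_t n, const double *P, size_t ldp, double tol)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = j + 1; i < n; ++i) {
            double diff = P[i + j * ldp] - P[j + i * ldp];
            if (diff < 0.0)
                diff = -diff;
            if (diff > tol)
                return 0;
        }
    return 1;
}
```

When the product is symmetric, computing only one triangle, as this routine does, loses no information.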

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
trans		input	operation op(`A`) and op(`B`) that is non- or transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	<type> scalar used for multiplication, if `beta==0`, then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

ssyrk, dsyrk, csyrk, zsyrk

4.4.7. cublasXt<t>herk()

cublasStatus_t cublasXtCherk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const float  *alpha,
                           const cuComplex       *A, int lda,
                           const float  *beta,
                           cuComplex       *C, int ldc)
cublasStatus_t cublasXtZherk(cublasXtHandle_t handle,
                           cublasFillMode_t uplo, cublasOperation_t trans,
                           int n, int k,
                           const double *alpha,
                           const cuDoubleComplex *A, int lda,
                           const double *beta,
                           cuDoubleComplex *C, int ldc)

This function performs the Hermitian rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are real scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $\text{op}(A)$ is an $n \times k$ matrix. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$
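A CPU sketch of the Hermitian rank-$k$ update follows, intended only as a reference for small inputs. The struct `zc` is a hypothetical stand-in for `cuDoubleComplex` (real part in `x`, imaginary part in `y`). The routine name and the `int` flags are likewise assumptions. Note that `alpha` and `beta` are real, matching the `const double *` parameters in the `cublasXtZherk` signature, and that the imaginary part of each diagonal element is written as zero.

```c
#include <stddef.h>

typedef struct { double x, y; } zc;   /* stand-in for cuDoubleComplex */

/* Element (i, p) of op(A): A(i,p) if conj_trans == 0, else conj(A(p,i)). */
static zc herk_op(const zc *A, size_t lda, int conj_trans, size_t i, size_t p)
{
    if (!conj_trans)
        return A[i + p * lda];
    zc v = A[p + i * lda];
    v.y = -v.y;
    return v;
}

/* Hypothetical CPU reference: C = alpha * op(A) * op(A)^H + beta * C,
 * updating one triangle of C (upper != 0 selects the upper triangle). */
void ref_zherk(int upper, int conj_trans, size_t n, size_t k,
               double alpha, const zc *A, size_t lda,
               double beta, zc *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        size_t first = upper ? 0 : j;
        size_t last  = upper ? j + 1 : n;
        for (size_t i = first; i < last; ++i) {
            double re = 0.0, im = 0.0;
            for (size_t p = 0; p < k; ++p) {
                zc a = herk_op(A, lda, conj_trans, i, p);
                zc b = herk_op(A, lda, conj_trans, j, p);
                re += a.x * b.x + a.y * b.y;   /* real part of a * conj(b) */
                im += a.y * b.x - a.x * b.y;   /* imaginary part of a * conj(b) */
            }
            zc *c = &C[i + j * ldc];
            /* With beta == 0 the old contents of C are never read. */
            double old_re = (beta == 0.0) ? 0.0 : beta * c->x;
            double old_im = (beta == 0.0) ? 0.0 : beta * c->y;
            c->x = alpha * re + old_re;
            c->y = (i == j) ? 0.0 : alpha * im + old_im;
        }
    }
}
```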

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
trans		input	operation op(`A`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`) and `C`.
k		input	number of columns of matrix op(`A`).
alpha	host	input	real scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
beta	host	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk

4.4.8. cublasXt<t>her2k()

cublasStatus_t cublasXtCher2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const float  *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZher2k(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const double *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs the Hermitian rank- $2k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \bar{\alpha}\text{op}(B)\text{op}(A)^{H} + \beta C$

where $\alpha$ is a complex scalar, $\beta$ is a real scalar, $C$ is a Hermitian matrix stored in lower or upper mode, and $\text{op}(A)$ and $\text{op}(B)$ are both $n \times k$ matrices. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{H}\text{ and }B^{H}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$
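That this update keeps $C$ Hermitian can be verified directly. Writing $X = \text{op}(A)$ and $Y = \text{op}(B)$ (shorthand introduced here for illustration, not part of the API), the conjugate transpose of the update term is

$\left( \alpha XY^{H} + \bar{\alpha}YX^{H} \right)^{H} = \bar{\alpha}YX^{H} + \alpha XY^{H}$

which is the same term again, so it is Hermitian. Because $\beta$ is real, $\beta C$ is Hermitian whenever $C$ is, and so is the sum. On the diagonal the two products are complex conjugates of each other:

$\left( \alpha XY^{H} + \bar{\alpha}YX^{H} \right)_{ii} = 2\,\text{Re}\left( \alpha\sum_{p = 1}^{k}X_{ip}\overline{Y_{ip}} \right)$

so the diagonal of the update is real, in line with the imaginary parts of the diagonal elements of $C$ being set to zero.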

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
trans		input	operation op(`A`) and op(`B`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cher2k, zher2k

4.4.9. cublasXt<t>herkx()

cublasStatus_t cublasXtCherkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuComplex       *alpha,
                            const cuComplex       *A, size_t lda,
                            const cuComplex       *B, size_t ldb,
                            const float  *beta,
                            cuComplex       *C, size_t ldc)
cublasStatus_t cublasXtZherkx(cublasXtHandle_t handle,
                            cublasFillMode_t uplo, cublasOperation_t trans,
                            size_t n, size_t k,
                            const cuDoubleComplex *alpha,
                            const cuDoubleComplex *A, size_t lda,
                            const cuDoubleComplex *B, size_t ldb,
                            const double *beta,
                            cuDoubleComplex *C, size_t ldc)

This function performs a variation of the Hermitian rank- $k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \beta C$

where $\alpha$ is a complex scalar, $\beta$ is a real scalar, $C$ is a Hermitian matrix stored in lower or upper mode, and $\text{op}(A)$ and $\text{op}(B)$ are both $n \times k$ matrices. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix} {A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ {A^{H}\text{ and }B^{H}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.$

This routine can be used when the matrix B is such that the result is guaranteed to be Hermitian. A common example is when the matrix B is a scaled form of the matrix A, which is equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine cublasXt<t>dgmm.
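The following CPU sketch evaluates the herkx formula for one triangle, so that the scaled-matrix case can be tried on small inputs. It is a hypothetical reference, not the library's implementation. The struct `zc` stands in for `cuDoubleComplex`, and the names and `int` flags are assumptions. Choosing B as A scaled by a real factor with a real-valued `alpha`, an assumption that goes beyond what the text above states, makes the two triangles conjugates of each other.

```c
#include <stddef.h>

typedef struct { double x, y; } zc;   /* stand-in for cuDoubleComplex */

/* Element (i, p) of op(M): M(i,p) if conj_trans == 0, else conj(M(p,i)). */
static zc herkx_op(const zc *M, size_t ld, int conj_trans, size_t i, size_t p)
{
    if (!conj_trans)
        return M[i + p * ld];
    zc v = M[p + i * ld];
    v.y = -v.y;
    return v;
}

/* Hypothetical CPU reference: C = alpha * op(A) * op(B)^H + beta * C,
 * with complex alpha and real beta, updating one triangle of C. */
void ref_zherkx(int upper, int conj_trans, size_t n, size_t k, zc alpha,
                const zc *A, size_t lda, const zc *B, size_t ldb,
                double beta, zc *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        size_t first = upper ? 0 : j;
        size_t last  = upper ? j + 1 : n;
        for (size_t i = first; i < last; ++i) {
            double re = 0.0, im = 0.0;
            for (size_t p = 0; p < k; ++p) {
                zc a = herkx_op(A, lda, conj_trans, i, p);
                zc b = herkx_op(B, ldb, conj_trans, j, p);
                re += a.x * b.x + a.y * b.y;   /* real part of a * conj(b) */
                im += a.y * b.x - a.x * b.y;   /* imaginary part of a * conj(b) */
            }
            /* Multiply the accumulated sum by the complex alpha. */
            double t_re = alpha.x * re - alpha.y * im;
            double t_im = alpha.x * im + alpha.y * re;
            zc *c = &C[i + j * ldc];
            /* With beta == 0 the old contents of C are never read. */
            double old_re = (beta == 0.0) ? 0.0 : beta * c->x;
            double old_im = (beta == 0.0) ? 0.0 : beta * c->y;
            c->x = t_re + old_re;
            c->y = (i == j) ? 0.0 : t_im + old_im;
        }
    }
}
```

Calling the routine once per triangle with B equal to three times A and `alpha = 1` fills both halves of `C`, and the off-diagonal entries come out as complex conjugates of each other.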

Param.	Memory	In/out	Meaning
handle		input	handle to the cuBLASXt API context.
uplo		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
trans		input	operation op(`A`) and op(`B`) that is non- or (conj.) transpose.
n		input	number of rows of matrix op(`A`), op(`B`) and `C`.
k		input	number of columns of matrix op(`A`) and op(`B`).
alpha	host	input	<type> scalar used for multiplication.
A	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans == CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
lda		input	leading dimension of two-dimensional array used to store matrix `A`.
B	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans == CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
ldb		input	leading dimension of two-dimensional array used to store matrix `B`.
beta	host	input	real scalar used for multiplication, if `beta==0` then `C` does not have to be a valid input.
C	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
ldc		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to:

cherk, zherk